Bruce Han的博客

不积跬步,无以至千里;不积小流,无以成江海。

0%

Pandas入门

Pandas是基于NumPy的一种工具,该工具是为了解决数据分析任务而创建的。Pandas纳入 了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。pandas提供 了大量能使我们快速便捷地处理数据的函数和方法。它是使Python成为强大而高效的数据 分析环境的重要因素之一。

1
2
import numpy as np
import pandas as pd

Pandas的数据结构介绍

pandas的主要数据结构为Series和DataFrame。

Series

类似于一维数组的对象,由一组数据(各种NumPy数据类型)以及一组与之相关的数据标 签(索引)组成。

1
2
obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
  • 上述例子中,左边一列为索引,右边一列为数据。由于没有指定索引,自动创建0到\(N-1\)(\(N\)为数据长度)的整数索引。
  • 也可以通过valuesindex属性获取其表示形式和索引对象。
1
obj.values
array([ 4,  7, -5,  3], dtype=int64)
1
obj.index
RangeIndex(start=0, stop=4, step=1)

可以创建时通过index参数指定索引。

1
2
obj2 = pd.Series([4, 7 , -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
1
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')

与NumPy数组类似,可以通过索引的方式选取Series中的单个值或一组值:

1
obj2['a']
-5
1
2
obj2['d'] = 6
obj2[1]
7
1
obj2[['c','a','d']]
c    3
a   -5
d    6
dtype: int64

NumPy运算(如根据布尔型数组过滤、标量乘法、应用数学函数等)都会保留索引和值之间的链接。

1
obj2
d    6
b    7
a   -5
c    3
dtype: int64
1
obj2[obj2 > 0]
d    6
b    7
c    3
dtype: int64
1
obj2 * 2
d    12
b    14
a   -10
c     6
dtype: int64
1
np.exp(obj2)
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

我们可以将Series看成是一个定长的有序字典,因为它是索引值到数据值的一个映射。因此,它可以用在许多原本需要字典参数的函数中。

1
'b' in obj2
True
1
'e' in obj2
False
  • 可以通过Python字典直接创建Series,
  • 如果只传入一个字典,则结果Series中的索引就是原字典的键(有序排列)。
    1
    2
    3
    sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':1600, 'Utah':5000}
    obj3 = pd.Series(sdata)
    obj3
Ohio      35000
Texas     71000
Oregon     1600
Utah       5000
dtype: int64
  • 如果传入字典,并额外指定索引,则数据中与索引相匹配的值会被找出来并放在相应的 位置上,而索引中找不到对应值的其结果会被置为NaN(非数字,not a number)
1
2
3
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California        NaN
Ohio          35000.0
Oregon         1600.0
Texas         71000.0
dtype: float64
  • pandas的isnullnotnull函数可以用于检测缺失数据;同时,该函数也是Series对象的函数。
1
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
1
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
1
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
1
obj4.notnull()
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series对象在算术运算时会自动对齐不同索引的数据。

1
obj3
Ohio      35000
Texas     71000
Oregon     1600
Utah       5000
dtype: int64
1
obj4
California        NaN
Ohio          35000.0
Oregon         1600.0
Texas         71000.0
dtype: float64
1
obj3 + obj4
California         NaN
Ohio           70000.0
Oregon          3200.0
Texas         142000.0
Utah               NaN
dtype: float64

Series对象及其索引都有一个name属性,输出时index索引的name将被作为列标题。

1
2
3
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California        NaN
Ohio          35000.0
Oregon         1600.0
Texas         71000.0
Name: population, dtype: float64

Series的索引可以通过赋值的方式就地修改:

1
obj
0    4
1    7
2   -5
3    3
dtype: int64
1
2
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

表格型数据结构。既有行索引,又有列索引;还可以看作是由Series组成的字典(共用一个索引)。

  1. 传入一个等长列表或NumPy数组组成的词典。DataFrame会自动添加索引,且全部列会被有序排列。
  2. 如果指定了序列,则DataFrame的列会按照指定顺序进行排列。
1
2
3
4
5
data = {'state':['Ohio','Ohio', 'Ohio','Nevada' ,'Nevada'],
'year':[2000, 2001, 2002, 2001, 2002],
'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
df1 = pd.DataFrame(data)
df1
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
1
2
# 指定列排列顺序
pd.DataFrame(data, columns=['year', 'state', 'pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9

如果传入的列在数据中找不到,则会产生NA值。

1
2
3
df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
df2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
1
df2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
  • 列获取。通过类似字典标记属性的方式,可以将DataFrame的列获取为一个Series。返回Series拥有原DataFrame的相同索引,且其name属性已经被相应地设置好了。
1
df2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
1
df2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
  • 行获取。基于标签的索引用.loc,基于位置的索引用.iloc
1
df2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
1
df2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
1
df2.iloc[2]
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
  • 赋值。可以将标量、列表或数组赋予某个列,从而对列通过赋值方式进行修改。
1
2
df2['debt'] = 16.5
df2
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
1
2
df2['debt'] = np.arange(5)
df2
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4

将列表或数组赋值给某列时,其长度必须与DataFrame的长度匹配。如果赋值的是一个 Series,就会精确匹配DataFrame的索引,所有的空位会被填上缺失值。

1
2
3
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
df2['debt'] = val
df2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

为不存的列赋值将会创建一个新列。关键字del用于删除列:

1
2
df2['estern'] = df2.state == 'Ohio'
df2
year state pop debt estern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
1
2
del df2['estern']
df2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

通过索引方式返回的列只是相应数据的视图,而非副本。

  • 嵌套字典。外层字典的键作为列,内层字典的键作为行,当然也可以对结果进行转置。内层字典的键会被合并、排序,以形成最终的索引。如果显式指定了索引,则不会。
1
2
3
4
pop = {'Nevada':{2001:2.4, 2002:2.9},
'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
df3 = pd.DataFrame(pop)
df3
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
1
df3.T
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
1
pd.DataFrame(pop, index=[2001, 2002, 2003])
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
  • 通过Series构造DataFrame。
1
df3
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
1
2
pdata = {'Ohio':df3['Ohio'][:-1], 'Neveda':df3['Nevada'][:2]}
pdata
{'Ohio': 2001    1.7
 2002    3.6
 Name: Ohio, dtype: float64, 'Neveda': 2001    2.4
 2002    2.9
 Name: Nevada, dtype: float64}
1
pd.DataFrame(pdata)
Ohio Neveda
2001 1.7 2.4
2002 3.6 2.9
1
list('abcd')
['a', 'b', 'c', 'd']
1
2
3
4
df=pd.DataFrame(np.arange(12).reshape((4,3)),
index=list('abcd'),
columns=['one', 'two', 'three'])
df
one two three
a 0 1 2
b 3 4 5
c 6 7 8
d 9 10 11
1
2
3
df.index.name = 'alpha'
df.columns.name = 'number'
df
number one two three
alpha
a 0 1 2
b 3 4 5
c 6 7 8
d 9 10 11

可以输入给DataFrame构造器的数据

类型 说明
二维ndaray 数据矩阵,还可以传入行标题和列标题。
由数组、列表或元组组成的字典 每个序列会变成DataFrame的一列,所有列的长度必须相等。
NumPy的结构化/记录数组 类似于“由数组组成的字典”
由Series组成的字典 每个Series会成一列。如果没有显示指定索引,则个Serie的索引会被合并成结果的行索引。

Tips

  • 如果设置了DataFrame的indexcolumnsname属性,则这些信息也会被显示出来。
  • values属性会以二维ndarray的形式返回DataFrame的数据
  • 如果DataFrame各列的数据类型不同,则值数组的数据类型会选用能兼容所有列的数据类型。
1
df3
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
1
2
3
df3.index.name = 'year'
df3.columns.name = 'state'
df3
state Nevada Ohio
year
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
1
df3.values
array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])
1
df2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

索引对象

pandas的索引对象负责管理轴标签和其他元数据(轴名称等)。构建Series或DataFrame时,所用的任何数组或其他序列的标签都会被转化为一个Index

1
2
3
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
1
index[1:]
Index(['b', 'c'], dtype='object')

Index对象时不可修改的(imutable)

1
2
# 下面命令运行将出错
index[2] = 'd'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-79-82f3412925c0> in <module>
      1 # 下属命令运行将出错
----> 2 index[2] = 'd'


D:\program\Scoop\apps\python\current\Lib\site-packages\pandas\core\indexes\base.py

in setitem(self, key, value) 4258 4259 def setitem(self, key, value): -> 4260 raise TypeError("Index does not support mutable operations") 4261 4262 def getitem(self, key):

TypeError: Index does not support mutable operations
1
2
3
index = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=index)
obj2.index is index
True
  • 除了长的像数组,Index的功能类似于一个固定大小的集合:
1
df3
state Nevada Ohio
year
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
1
'Ohio' in df3.columns
True
1
2003 in df3.index
False

Index的方法和属性:

方法 说明
append 连接另一个Index对象,产生一个新的Index
diff 计算差集,并得到一个Index
intersection 计算交集
union 计算并集
isin 是否包含在参数集合中的布尔型数组
delete 删除索引i处的元素,并得到新的Index
drop 删除传入的值,并得到一个新的Index
insert 将元素插入到索引i处,并得到新的Index
is_monotonic 当各元素大于等于前一个元素时,返回True
is_unique Index没有重复时返回True
unique 计算Index中唯一的数组

Series和DataFrame的基本功能

重新索引

  1. pandas对象的一个重要方法是reindex,其作用是创建一个适应新索引的新对象
  2. 如果某个索引值当前并不存在,则会引入缺失值。可以通过fill_value参数指定默认缺失值。
1
2
obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj1
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
1
2
3
# 重新指定Index,长度需跟数据一致
obj1.index = pd.Index(list("abcd"))
obj1
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
1
2
3
# 新索引长度可以与旧索引不一致
obj2 = obj1.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a    4.5
b    7.2
c   -5.3
d    3.6
e    NaN
dtype: float64
1
obj1.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a    4.5
b    7.2
c   -5.3
d    3.6
e    0.0
dtype: float64
1
obj1
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
  • 对于时间序列等有序数据,重新索引可能需要做一些插值处理,method选项可以完成,譬如ffill可以实现前向填充。
1
2
obj3 = pd.Series(['blue', 'purple', 'yelow'], index=[0, 2, 4])
obj3
0      blue
2    purple
4     yelow
dtype: object
1
obj3.reindex(range(6))
0      blue
1       NaN
2    purple
3       NaN
4     yelow
5       NaN
dtype: object
1
obj3.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4     yelow
5     yelow
dtype: object

reindex的(插值)method选项

参数 说明
ffill或pad 前向填充值
bfill或backfill 后向填充值
1
2
3
4
date_index = pd.date_range('/1/8/2018', periods=6, freq='D')
df4 = pd.DataFrame({"Price":[100, 101, np.nan, 100, 98, 88]},
index=date_index)
df4
Price
2018-01-08 100.0
2018-01-09 101.0
2018-01-10 NaN
2018-01-11 100.0
2018-01-12 98.0
2018-01-13 88.0
1
2
date_index2 = pd.date_range('1/3/2018', periods=12, freq='D')
df4.reindex(date_index2)
Price
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 100.0
2018-01-09 101.0
2018-01-10 NaN
2018-01-11 100.0
2018-01-12 98.0
2018-01-13 88.0
2018-01-14 NaN
1
df4.reindex(date_index2, method='bfill')
Price
2018-01-03 100.0
2018-01-04 100.0
2018-01-05 100.0
2018-01-06 100.0
2018-01-07 100.0
2018-01-08 100.0
2018-01-09 101.0
2018-01-10 NaN
2018-01-11 100.0
2018-01-12 98.0
2018-01-13 88.0
2018-01-14 NaN

对于DataFrame,reindex可以修改行索引、列索引,或两个都修改。如果仅传入一个序列,则会重新索引行:

1
2
3
4
df5= pd.DataFrame(np.arange(9).reshape((3,3)) ,
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'] )
df5
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
1
2
df6 = df5.reindex(['a', 'b', 'c', 'd'])
df6
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
1
2
states = ['Texas', 'Utah', 'California']
states
['Texas', 'Utah', 'California']
1
df5
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
1
2
# 重新索引列
df5.reindex(columns=states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
1
2
# 行重新索引
df5.reindex(index=['a', 'b', 'c', 'd'])
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
1
df5
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
1
2
states.sort()
states
['California', 'Texas', 'Utah']
1
2
3
# 同时进行列和行索引
df7=df5.reindex(index=['a', 'b', 'c', 'd'],columns=states)
df7
California Texas Utah
a 2.0 1.0 NaN
b NaN NaN NaN
c 5.0 4.0 NaN
d 8.0 7.0 NaN
1
2
3
df7=df5.reindex(index=['a', 'b', 'c', 'd'],
columns=['Texas', 'Utah', 'California'])
df7
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
1
df7.fillna(method='ffill')
Texas Utah California
a 1.0 NaN 2.0
b 1.0 NaN 2.0
c 4.0 NaN 5.0
d 7.0 NaN 8.0

丢弃指定轴上的项

丢弃某条轴上的一个或多个项,只需一个索引数组或列表即可。drop方法返回一个在指定轴上删除指定值后的新对象。

1
2
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
1
2
obj11=obj.drop('c')    # 返回新的对象,原对象不变
obj11
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
1
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
1
obj.drop(['a', 'd'])
b    1.0
c    2.0
e    4.0
dtype: float64
  • 对于DataFrame,可以删除任何轴上的索引值。
1
2
3
4
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
1
data.drop(['Colorado', 'Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15
1
data.drop('two', axis=1)   # 必须指定轴,否则出错。
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
1
data.drop(['one', 'four'], axis=1)
two three
Ohio 1 2
Colorado 5 6
Utah 9 10
New York 13 14

索引、选取和过滤

Serries索引

Series的索引的工作方式类似于NumPy数组的索引,只不过Series的索引值不只是整数,还可以是标签。

1
2
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
1
obj['b']
1.0
1
obj[2]
2.0
1
obj[2:4]
c    2.0
d    3.0
dtype: float64
1
obj[['a', 'd', 'c']]
a    0.0
d    3.0
c    2.0
dtype: float64
1
obj[[1, 3]]
  • b 1.0 d 3.0 dtype: float64
1
obj[obj < 2]
a    0.0
b    1.0
dtype: float64

利用标签的切片运算与普通的Python切片运算不同,其末端是包含的(inclusive)

1
obj['b':'c']
b    1.0
c    2.0
dtype: float64
1
2
obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

DataFrame索引

对DataFrame进行索引其实就是获取一个或多个列。

1
2
3
4
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
1
data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
1
data[['two']]
two
Ohio 1
Colorado 5
Utah 9
New York 13
1
data[['three', 'two']]
three two
Ohio 2 1
Colorado 6 5
Utah 10 9
New York 14 13
1
data[:2]   # 下标切片
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
1
data[data['three'] > 5]   # 布尔索引
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
1
data < 5
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
1
2
data[data < 5] = 0
data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
  • 为了在DataFrame的行上进行标签索引,引入了专门的loc字段,从而可以通过NumPy式的的标记法以及轴标签从DaataFrame中选取行和列的子集。
  • 如果是索引值索引,则可以使用iloc字段
1
2
slice = data.loc['Colorado', ['two', 'three']]  # 返回Series
slice
two      5
three    6
Name: Colorado, dtype: int32
1
data.loc[['Colorado', 'Utah'], ['four','one', 'two']]  # 不能数字和标签混用
four one two
Colorado 7 0 5
Utah 11 8 9
1
data.iloc[[1,2],[3,0,1]]  # 索引值索引
four one two
Colorado 7 0 5
Utah 11 8 9
1
data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
1
data.loc[:'Utah','two']  # 标签切片索引,结束标签inclusive
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32
1
data[data['three'] > 5].iloc[:,:3] # 索引值切片结束not inclusive
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
1
data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

DataFrame的索引选项

类型 说明
df[val] 选取DataFrame的单一列,或一组列
df.loc[val,val] 通过标签索引,选取DataFrame的行、列,或行列的交集
df.iloc[val,val] 通过下标值索引,选取DataFrame的行、列,或行列的交集
df.reindex方法 将一个或多个轴匹配到新的索引
df.at[val, val] 通过行标签和列标签选取单个值。
1
2
3
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
1
data.at['New York', 'three']
14.0
1
2
3
data.at['Utah','four'] = 15
data.at['two','Utah'] = 0 # 增加新的行列
data
one two three four Utah
Ohio 0.0 1.0 2.0 3.0 NaN
Colorado 4.0 5.0 6.0 7.0 NaN
Utah 8.0 9.0 10.0 15.0 NaN
New York 12.0 13.0 14.0 15.0 NaN
two NaN NaN NaN NaN 0.0

算术运算和数据对齐

  • pandas可以对不同索引的对象进行算术运算。将对象相加时,如果存在不同的索引对,则结果的索引就是该索引对的并集。
1
2
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
1
2
3
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
index=['a', 'c', 'e', 'f', 'g'])
s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
1
s1  + s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
  • 自动的数据对齐操作在不重叠的索引处引入NA值,缺失值会在算术运算中传播。
  • 对于DataFrame,对齐操作同样发生在行和列上。
1
2
3
4
df1 = pd.DataFrame(np.arange(9.0).reshape((3, 3)),
columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df1
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
1
2
3
4
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
1
df1 + df2
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

在算术方法中填充值

1
2
3
df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)),
columns=list('abcd'))
df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
1
2
3
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)),
columns=list('abcde'))
df2
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
1
df1 + df2
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
1
df1.add(df2,fill_value=0)
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
1
df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0

算术算法 1. add: 加法 2. sub: 减法 3. div: 除法/ 4. mul: 乘法

DataFrame和Series之间的运算

1
2
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
1
arr[0]
array([0, 1, 2, 3])
1
arr - arr[0]
array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])
  • 上述模式成为广播(broadcasting),DataFrame和Series之间的运算也差不多。
  • 默认情况下,DataFrame和Series之间的算术运算会将Series的索引匹配到DataFrame的列,然后沿着行一直向下广播。
1
2
3
4
df3 = pd.DataFrame(np.arange(12.0).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df3
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
1
2
series = df3.iloc[0]
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
1
df3 - series
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
1
df3.sub(series, axis=1)
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
  • 如果某个索引值在DataFrame或Series的索引中找不到,则参与运算的两个对象就会被重新索引以形成并集
1
2
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
b    0
e    1
f    2
dtype: int64
1
df3
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
1
df3 + series2
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN

如果希望匹配行且在列上广播,则必须使用算术运算方法

1
df3
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
1
2
series3 = df3['d']
series3
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
1
df3.sub(series3, axis=0) # 传入的轴号就是希望匹配的轴
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
1
2
3
df5 = pd.DataFrame(np.arange(12).reshape(4,3),
columns=np.arange(3))
df5
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
1
2
series5 = pd.Series(np.arange(3))
series5
0    0
1    1
2    2
dtype: int32
1
df5 - series5
0 1 2
0 0 0 0
1 3 3 3
2 6 6 6
3 9 9 9
1
df5.sub(series5)
0 1 2
0 0 0 0
1 3 3 3
2 6 6 6
3 9 9 9
1
df5.sub(series5, axis=0)
0 1 2
0 0.0 1.0 2.0
1 2.0 3.0 4.0
2 4.0 5.0 6.0
3 NaN NaN NaN

函数应用和映射

NumPy的ufuncs(元素级数组方法)也可以用于操作pandas对象

1
2
3
4
df2 = pd.DataFrame(np.random.randn(4, 3), 
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2
b d e
Utah -0.792212 1.594514 -0.946233
Ohio 1.131797 0.371889 -0.015236
Texas -0.810233 1.431196 -0.819002
Oregon 0.080920 0.177602 0.128606
1
np.abs(df2)
b d e
Utah 0.792212 1.594514 0.946233
Ohio 1.131797 0.371889 0.015236
Texas 0.810233 1.431196 0.819002
Oregon 0.080920 0.177602 0.128606
  • 还可以用DataFrame的apply方法将函数应用到由各列或行所形成的一维数组上。参数axis指名Axis along which the function is applied
1
2
3
f = lambda x:x.max() - x.min()

df2.apply(f, axis=0) # 默认用在行上,即每一列的最大值 - 最小值
b    1.942029
d    1.416912
e    1.074839
dtype: float64
1
df2.apply(f, axis=1)
Utah      2.540747
Ohio      1.147033
Texas     2.250198
Oregon    0.096682
dtype: float64
1
2
3
def f(x):
return pd.Series([x.min(), x.max(), x.max()- x.min()],
index=['min', 'max', 'gap'])
1
df2.apply(f, axis=0)
b d e
min -0.810233 0.177602 -0.946233
max 1.131797 1.594514 0.128606
gap 1.942029 1.416912 1.074839
1
df2.apply(f, axis=1)
min max gap
Utah -0.946233 1.594514 2.540747
Ohio -0.015236 1.131797 1.147033
Texas -0.819002 1.431196 2.250198
Oregon 0.080920 0.177602 0.096682
  • 许多最为常见的数组统计功能都被是现成DataFrame的方法,如summean,因而无需使用apply方法。
1
df2.sum(axis=0)
b   -0.389728
d    3.575201
e   -1.651866
dtype: float64
1
df2.mean(axis=1)
Utah     -0.047977
Ohio      0.496150
Texas    -0.066013
Oregon    0.129043
dtype: float64
  • 此外,还可以用元素级的Python函数,采用applymap即可实现
1
2
3
format = lambda x:'%.2f' % x

df2.applymap(format)
b d e
Utah -0.79 1.59 -0.95
Ohio 1.13 0.37 -0.02
Texas -0.81 1.43 -0.82
Oregon 0.08 0.18 0.13
1
df2['e'].map(format)
Utah      -0.95
Ohio      -0.02
Texas     -0.82
Oregon     0.13
Name: e, dtype: object

排序和排名

排序Sorting

根据条件对数据集进行排序(sorting)。要对行或列索引进行排序(按字典顺序),可使用sort_index方法,其将返回一个已排序的新对象。

1
2
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj
d    0
a    1
b    2
c    3
dtype: int64
1
obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64
  • DataFrame,可以根据任意轴上的索引进行排序。
1
2
3
4
df6 = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
df6
d a b c
three 0 1 2 3
one 4 5 6 7
1
df6.sort_index(axis=0)
d a b c
one 4 5 6 7
three 0 1 2 3
1
df6.sort_index(axis=1)
a b c d
three 1 2 3 0
one 5 6 7 4
  • 默认是按升序排列,也可以通过ascending=False参数进行降序排列。
  • 如果按值对Series排序,可使用sort_values()方法。排序时,任何缺失数据都会被放到Series的末尾。
1
df6.sort_index(axis=1, ascending=False)
d c b a
three 0 3 2 1
one 4 7 6 5
1
2
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj
0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64
1
obj.sort_values()
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
1
obj.sort_values(ascending=False) # 降序排列
2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64
  • 在DataFrame上,可以根据一个或多个列中的值进行排序,通过将一个或多个列名传递给by选项即可。
1
2
df7 = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
df7
b a
0 4 0
1 7 1
2 -3 0
3 2 1
1
df7.sort_values(by='b')
b a
2 -3 0
3 2 1
0 4 0
1 7 1
1
df7.sort_values(by=['a', 'b'])
b a
2 -3 0
0 4 0
3 2 1
1 7 1
1
df7.sort_values(by=['a', 'b'], ascending=False)
b a
1 7 1
3 2 1
0 4 0
2 -3 0

排名ranking

  • 排名(ranking),跟排序密切相关,会增设一个排名值(从1开始,一直到数组中有效数据的数量)
  • 默认情况下,rank是通过“为各组分配一个平均排名”的方式破坏平级关系的。
1
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
1
obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
1
obj.rank(method='first')   # 原始数据中先出现的排名高
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
1
obj.rank(ascending=False, method='max')  # 使用整个组的最大排名
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

排名时用于破坏平级关系的method选项

method 说明
average 默认,在相等的分组中,为各个值分配平均排名
min 使用整个分组的最小排名
max 使用整个分组的最大排名
first 按值在原始数据中出现的顺序分配排名
1
2
3
4
df4 = pd.DataFrame({'b':[4.3, 7, -3, 2], 
'a':[0, 1, 0, 1],
'c':[-2, 5, 8, -2.5]})
df4
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
1
df4.rank(axis=1)
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
1
df4.rank(axis=0)
b a c
0 3.0 1.5 2.0
1 4.0 3.5 3.0
2 1.0 1.5 4.0
3 2.0 3.5 1.0
1
df4.rank(method='min', ascending=False)
b a c
0 2.0 3.0 3.0
1 1.0 1.0 2.0
2 4.0 3.0 1.0
3 3.0 1.0 4.0

带有重复值的轴索引

  • 许多pandas函数,如reindex都要求标签唯一,但并不是强制性的。
  • 索引的is_unique属性可以检查其值是否唯一
1
2
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
1
obj.index.is_unique
False
  • 如果某个索引对应多个值,则返回一个Series;而对应单个值的,则返回一个标量值。
  • DataFrame的行进行索引时也是如此。
1
obj['a']
a    0
a    1
dtype: int64
1
obj['c']
4
1
2
3
4
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3),
index=list('aabb'),columns=[1,2,3])
df
1 2 3
a -0.052453 1.164686 -0.046114
a -0.273958 -0.749853 -1.216034
b 0.139408 0.776466 -1.407135
b -1.034602 -0.421413 -1.416837
1
df.iloc[:2,2:]
3
a -0.046114
a -1.216034
1
df.loc['a']
1 2 3
a -0.052453 1.164686 -0.046114
a -0.273958 -0.749853 -1.216034
1
df.loc['a',2]
a    1.164686
a   -0.749853
Name: 2, dtype: float64

汇总和计算描述统计

pandas对象拥有一组常用的数学和统计方法,大多属于约简和汇总统计。

1
2
3
4
5
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
[np.nan, np.nan], [0.75, -1.4]],
index=list('abcd'),
columns=['one', 'two'])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.4
1
df.sum(axis=0)  # 返回含有列小计的Series
one    9.25
two   -5.90
dtype: float64
1
df.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.65
dtype: float64
  • NA值会被自动排除,除非整个切片(行或列)都是NA。通过skipna选项可以禁用该功能。
1
df.mean(axis=1, skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.325
dtype: float64
1
df.idxmax() # 最大值的索引标签
one    b
two    d
dtype: object
1
df.idxmin(axis=1) # 按列,最小值的索引
a    one
b    two
c    NaN
d    two
dtype: object
1
df.cumsum()
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.9
1
2
df
df.describe()
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.4
one two
count 3.000000 2.000000
mean 3.083333 -2.950000
std 3.493685 2.192031
min 0.750000 -4.500000
25% 1.075000 -3.725000
50% 1.400000 -2.950000
75% 4.250000 -2.175000
max 7.100000 -1.400000
1
2
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
1
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object

相关系数和协方差

1
2
3
4
stock = pd.read_csv('./examples/stock.dat', sep='\t', index_col=0)

stock
stock.columns
AAPL MSFT GSPC
Date
2000-01-03 3.625643 39.334630 1455.219971
2000-01-04 3.319964 38.005900 1399.420044
2000-01-05 3.368548 38.406628 1402.109985
2000-01-06 3.077039 37.120080 1403.449951
2000-01-07 3.222794 37.605172 1441.469971
2000-01-10 3.166112 37.879354 1457.599976
2000-01-11 3.004162 36.909170 1438.560059
Index(['AAPL', 'MSFT', 'GSPC'], dtype='object')
1
stock["AAPL"]
Date
2000-01-03    3.625643
2000-01-04    3.319964
2000-01-05    3.368548
2000-01-06    3.077039
2000-01-07    3.222794
2000-01-10    3.166112
2000-01-11    3.004162
Name: AAPL, dtype: float64
1
stock.AAPL.corr(stock.MSFT)
0.9788313126479431
1
stock.AAPL.cov(stock.MSFT)
0.16580021717290455
1
stock.corr()
AAPL MSFT GSPC
AAPL 1.000000 0.978831 0.122897
MSFT 0.978831 1.000000 0.213535
GSPC 0.122897 0.213535 1.000000
1
stock.cov()
AAPL MSFT GSPC
AAPL 0.043003 0.165800 0.657975
MSFT 0.165800 0.667200 4.503164
GSPC 0.657975 4.503164 666.562274
  • Series的corr方法用于计算两个Series中重叠的、非NA的、按索引对齐的值的相关系数。类似的,cov用于计算协方差。
  • DataFrame的corrcov方法将以DataFrame的形式返回完整的相关系数或协方差矩阵。
1
stock.corrwith(stock.AAPL)
AAPL    1.000000
MSFT    0.978831
GSPC    0.122897
dtype: float64

唯一值、值计数及成员资格

1
2
3
4
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
  • unique函数返回唯一值数组,未排序。如果需要可以对结果再次进行排序(uniques.sort())。
  • value_counts用于计算一个Series中各值出现的频率。
  • isin用于判断矢量化集合的成员资格,可用于选取Series或DataFrame列中的数据的子集。
1
obj.value_counts()  # 默认按降序排列
c    3
a    3
b    2
d    1
dtype: int64
1
2
# 还是顶级pandas方法,可用于任何数组或序列
pd.value_counts(obj.values, sort=False)
a    3
b    2
c    3
d    1
dtype: int64
1
2
mask = obj.isin(['b', 'c'])
mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
1
obj[mask] # 布尔索引
0    c
5    b
6    b
7    c
8    c
dtype: object
  • value_counts应用到DataFrame中
1
2
3
data = pd.DataFrame({'Qu1':[1, 3, 4, 3, 4],
'Qu2':[2, 2, 1, 2, 3],
'Qu3':[1, 5, 2, 4, 4]})
1
data
Qu1 Qu2 Qu3
0 1 2 1
1 3 2 5
2 4 1 2
3 3 2 4
4 4 3 4
1
2
result = data.apply(pd.value_counts, axis=1)
result
1 2 3 4 5
0 2.0 1.0 NaN NaN NaN
1 NaN 1.0 1.0 NaN 1.0
2 1.0 1.0 NaN 1.0 NaN
3 NaN 1.0 1.0 1.0 NaN
4 NaN NaN 1.0 2.0 NaN
1
result.fillna(0)
1 2 3 4 5
0 2.0 1.0 0.0 0.0 0.0
1 0.0 1.0 1.0 0.0 1.0
2 1.0 1.0 0.0 1.0 0.0
3 0.0 1.0 1.0 1.0 0.0
4 0.0 0.0 1.0 2.0 0.0

缺失数据处理

  • pandas使用NaN表示浮点和非浮点数组中的缺失值,Python中内置的None也会被当作Na处理。
  • Na的处理方法包括:
    1. dropna:根据各标签值中是否存在缺失值进行过滤
    2. fillna:用指定值或插值方法(如ffillbfill)填充缺失数据
    3. isnull:返回一个含有布尔值的对象
    4. notnullisnull的否定形式
1
2
string_data = pd.Series(['aard', 'arti', np.nan, 'avocdo'])
string_data
0      aard
1      arti
2       NaN
3    avocdo
dtype: object
1
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
1
2
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool

滤除缺失数据

1
2
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
1
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
  • 对于DataFrame,dropna默认丢失任何含有NA的行。
1
2
3
4
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
[np.nan, np.nan, np.nan],
[np.nan, 6.5, 3.0]])
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
1
2
cleaned = data.dropna(axis=0)
cleaned
0 1 2
0 1.0 6.5 3.0
1
data.dropna(how='all') # 只丢弃全部为NA的行
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
1
data.dropna(axis=1, how="all")
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
1
data.dropna(axis=0, how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
  • 只留一部分观测数据,可以用thresh参数实现。
1
2
df = pd.DataFrame(np.random.randn(7, 3))
df
0 1 2
0 1.426088 1.516618 1.900981
1 0.219221 0.326165 0.399649
2 -1.612351 0.429544 1.726309
3 0.067660 -2.268722 0.592288
4 -1.508975 2.512999 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
2
df.loc[:4, 1] = np.nan
df
0 1 2
0 1.426088 NaN 1.900981
1 0.219221 NaN 0.399649
2 -1.612351 NaN 1.726309
3 0.067660 NaN 0.592288
4 -1.508975 NaN 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
2
df.loc[:2, 2] = np.nan
df
0 1 2
0 1.426088 NaN NaN
1 0.219221 NaN NaN
2 -1.612351 NaN NaN
3 0.067660 NaN 0.592288
4 -1.508975 NaN 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
df.dropna(thresh=3)   # 每行至少三个非NA值
0 1 2
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
df.dropna(thresh=2)
0 1 2
3 0.067660 NaN 0.592288
4 -1.508975 NaN 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
df.dropna(thresh=4, axis=1)
0 2
0 1.426088 NaN
1 0.219221 NaN
2 -1.612351 NaN
3 0.067660 0.592288
4 -1.508975 1.473433
5 -1.218525 -2.078230
6 -0.234676 -0.634214

填充缺失数据

1
df.fillna(0)
0 1 2
0 1.426088 0.000000 0.000000
1 0.219221 0.000000 0.000000
2 -1.612351 0.000000 0.000000
3 0.067660 0.000000 0.592288
4 -1.508975 0.000000 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
df.fillna({1:0.5, 2:-1}) # 标签1列中的NA用0.5代替,3中的用-1代替
0 1 2
0 1.426088 0.500000 -1.000000
1 0.219221 0.500000 -1.000000
2 -1.612351 0.500000 -1.000000
3 0.067660 0.500000 0.592288
4 -1.508975 0.500000 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
df.fillna({1:5,2:10})
0 1 2
0 1.426088 5.000000 10.000000
1 0.219221 5.000000 10.000000
2 -1.612351 5.000000 10.000000
3 0.067660 5.000000 0.592288
4 -1.508975 5.000000 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
  • 字典填充只能用于列(axis=0,沿行填充列),不能用于行(axis=1)。
1
df.fillna({1:5,2:10},axis=1)


NotImplementedError                       Traceback (most recent call last)

<ipython-input-283-b1b4b331430f> in <module>
----> 1 df.fillna({1:5,2:10},axis=1)


D:\program\Scoop\apps\python\current\Lib\site-packages\pandas\core\frame.py

in fillna(self, value, method, axis, inplace, limit, downcast, kwargs) 4257 limit=limit, 4258 downcast=downcast, -> 4259 kwargs 4260 ) 4261

D:\program\Scoop\apps\python\current\Lib\site-packages\pandas\core\generic.py

in fillna(self, value, method, axis, inplace, limit, downcast) 6262 if axis == 1: 6263 raise NotImplementedError( -> 6264 "Currently only can fill " 6265 "with dict/Series column " 6266 "by column"

NotImplementedError: Currently only can fill with dict/Series column by column

fillna默认返回新对象,但也可以对现有对象进行就地修改

1
2
_ = df.fillna(0, inplace=True)
df
0 1 2
0 1.426088 0.000000 0.000000
1 0.219221 0.000000 0.000000
2 -1.612351 0.000000 0.000000
3 0.067660 0.000000 0.592288
4 -1.508975 0.000000 1.473433
5 -1.218525 -1.855301 -2.078230
6 -0.234676 0.254110 -0.634214
1
2
3
4
df8 = pd.DataFrame(np.random.randn(6, 3))
df8.loc[2:4, 1] = np.nan
df8.loc[3:4,2] = np.nan
df8
0 1 2
0 -1.023072 -0.563391 -0.787469
1 -0.848494 1.109691 0.421535
2 1.671679 NaN 0.089971
3 -0.556844 NaN NaN
4 -1.665812 NaN NaN
5 1.077345 0.754463 -0.039842
1
df8.fillna(method='bfill') # 向后填充
0 1 2
0 -1.023072 -0.563391 -0.787469
1 -0.848494 1.109691 0.421535
2 1.671679 0.754463 0.089971
3 -0.556844 0.754463 -0.039842
4 -1.665812 0.754463 -0.039842
5 1.077345 0.754463 -0.039842
1
df8.fillna(method='ffill', limit=2) # 向前填充,最多填充两个
0 1 2
0 -1.023072 -0.563391 -0.787469
1 -0.848494 1.109691 0.421535
2 1.671679 1.109691 0.089971
3 -0.556844 1.109691 0.089971
4 -1.665812 NaN 0.089971
5 1.077345 0.754463 -0.039842
1
2
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data.fillna(data.mean()) # 以均值填充
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

层次化索引

层次化索引(hierarchical indexing),能在一个轴上拥有多个(两个以上)的索引级别,从而可以以低维度的形式处理高维度上数据。

1
2
3
4
data = pd.Series(np.random.randn(10), 
index=[['a','a','a','b','b','b','c','c','d','d'],
[1,2,3,1,2,3,1,2,2,3]])
data
a  1   -1.071029
   2   -0.850225
   3    1.122493
b  1    1.265563
   2    0.406997
   3    0.739587
c  1    1.666785
   2   -1.706725
d  2   -3.175604
   3   -0.359642
dtype: float64
1
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
1
data['b']
1    1.265563
2    0.406997
3    0.739587
dtype: float64
1
data['b':'c']
b  1    1.265563
   2    0.406997
   3    0.739587
c  1    1.666785
   2   -1.706725
dtype: float64
1
data.loc[['b', 'd']]
b  1    1.265563
   2    0.406997
   3    0.739587
d  2   -3.175604
   3   -0.359642
dtype: float64
1
2
data
data[:, 2] # 内层标签索引
a  1   -1.071029
   2   -0.850225
   3    1.122493
b  1    1.265563
   2    0.406997
   3    0.739587
c  1    1.666785
   2   -1.706725
d  2   -3.175604
   3   -0.359642
dtype: float64
a   -0.850225
b    0.406997
c   -1.706725
d   -3.175604
dtype: float64
1
data.unstack()
1 2 3
a -1.071029 -0.850225 1.122493
b 1.265563 0.406997 0.739587
c 1.666785 -1.706725 NaN
d NaN -3.175604 -0.359642
1
data.unstack().stack()
a  1   -1.071029
   2   -0.850225
   3    1.122493
b  1    1.265563
   2    0.406997
   3    0.739587
c  1    1.666785
   2   -1.706725
d  2   -3.175604
   3   -0.359642
dtype: float64

对于DataFrame每条轴都可以有分层索引。

1
2
3
4
5
6
7
df = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
index=[['a', 'a', 'b', 'b'],
[1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])

df
Ohio Colorado
Green Red Green
a 1 0.0 1.0 2.0
2 3.0 4.0 5.0
b 1 6.0 7.0 8.0
2 9.0 10.0 11.0
1
2
3
4
5
# 可以为每层索引指定名字
df.index.names = ['key1', 'key2']
df.columns.names=['state', 'color']

df
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0.0 1.0 2.0
2 3.0 4.0 5.0
b 1 6.0 7.0 8.0
2 9.0 10.0 11.0
1
df['Ohio']
color Green Red
key1 key2
a 1 0.0 1.0
2 3.0 4.0
b 1 6.0 7.0
2 9.0 10.0
1
df.loc['a']
state Ohio Colorado
color Green Red Green
key2
1 0.0 1.0 2.0
2 3.0 4.0 5.0
1
df.loc['a', 'Ohio']
color Green Red
key2
1 0.0 1.0
2 3.0 4.0

重排分级顺序

  • swaplevel,接受两个级别编号或名称,并返回一个互换了级别的新对象,但数据不会发生变化。

  • sort_index(level=...),根据单个级别的值对数据进行排序

1
df.swaplevel('key1', 'key2')
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0.0 1.0 2.0
2 a 3.0 4.0 5.0
1 b 6.0 7.0 8.0
2 b 9.0 10.0 11.0
1
df.sort_index(level=0)
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0.0 1.0 2.0
2 3.0 4.0 5.0
b 1 6.0 7.0 8.0
2 9.0 10.0 11.0
1
df.swaplevel(0, 1).sort_index(0)
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0.0 1.0 2.0
b 6.0 7.0 8.0
2 a 3.0 4.0 5.0
b 9.0 10.0 11.0
1
df.sort_index(axis=1, level=1)
state Colorado Ohio
color Green Green Red
key1 key2
a 1 2.0 0.0 1.0
2 5.0 3.0 4.0
b 1 8.0 6.0 7.0
2 11.0 9.0 10.0

根据级别汇总统计

1
df
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0.0 1.0 2.0
2 3.0 4.0 5.0
b 1 6.0 7.0 8.0
2 9.0 10.0 11.0
1
df.sum(level='key2')
state Ohio Colorado
color Green Red Green
key2
1 6.0 8.0 10.0
2 12.0 14.0 16.0
1
df.sum(axis=1, level='color')
color Green Red
key1 key2
a 1 2.0 1.0
2 8.0 4.0
b 1 14.0 7.0
2 20.0 10.0

使用DataFrame的列 为索引

1
2
3
4
df11 = pd.DataFrame({'a':range(7), 'b':range(7, 0, -1), 
'c':['one','one','one','two','two','two','two'],
'd':[0, 1, 2, 0, 1, 2, 3]})
df11
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
  • set_index函数可以将一个或多个列转换为行索引,并创建一个新的DataFrame
  • reset_index的功能刚好相反,层次化索引的级别会被移到列里面。
1
df11.set_index('a')
b c d
a
0 7 one 0
1 6 one 1
2 5 one 2
3 4 two 0
4 3 two 1
5 2 two 2
6 1 two 3
1
2
3
df12 = df11.set_index(['c','d'])

df12
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
1
df12.reset_index()
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

面板数据

pandas中还有一种数据结构,Panel(面板),三维版的DataFrame,请自行搜索相关资料。