Pandas是基于NumPy的一种工具,该工具是为了解决数据分析任务而创建的。Pandas纳入
了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。pandas提供
了大量能使我们快速便捷地处理数据的函数和方法。它是使Python成为强大而高效的数据
分析环境的重要因素之一。
1 2
| import numpy as np import pandas as pd
|
Pandas的数据结构介绍
pandas的主要数据结构为Series和DataFrame。
Series
类似于一维数组的对象,由一组数据(各种NumPy数据类型)以及一组与之相关的数据标
签(索引)组成。
1 2
| obj = pd.Series([4, 7, -5, 3]) obj
|
0 4
1 7
2 -5
3 3
dtype: int64
- 上述例子中,左边一列为索引,右边一列为数据。由于没有指定索引,自动创建0到\(N-1\)(\(N\)为数据长度)的整数索引。
- 也可以通过values和index属性获取其表示形式和索引对象。
array([ 4, 7, -5, 3], dtype=int64)
RangeIndex(start=0, stop=4, step=1)
可以创建时通过index参数指定索引。
1 2 3
| obj2 = pd.Series([4, 7 , -5, 3], index=['d', 'b', 'a', 'c']) obj2
|
d 4
b 7
a -5
c 3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')
与NumPy数组类似,可以通过索引的方式选取Series中的单个值或一组值:
-5
7
c 3
a -5
d 6
dtype: int64
NumPy运算(如根据布尔型数组过滤、标量乘法、应用数学函数等)都会保留索引和值之间的链接。
d 6
b 7
a -5
c 3
dtype: int64
d 6
b 7
c 3
dtype: int64
d 12
b 14
a -10
c 6
dtype: int64
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
我们可以将Series看成是一个定长的有序字典,因为它是索引值到数据值的一个映射。因此,它可以用在许多原本需要字典参数的函数中。
True
False
- 可以通过Python字典直接创建Series,
- 如果只传入一个字典,则结果Series中的索引就是原字典的键(有序排列)。
1 2 3
| sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':1600, 'Utah':5000} obj3 = pd.Series(sdata) obj3
|
Ohio 35000
Texas 71000
Oregon 1600
Utah 5000
dtype: int64
- 如果传入字典,并额外指定索引,则数据中与索引相匹配的值会被找出来并放在相应的
位置上,而索引中找不到对应值的其结果会被置为NaN(非数字,not a
number)
1 2 3 4
| states = ['California', 'Ohio', 'Oregon', 'Texas'] obj4 = pd.Series(sdata, index=states) obj4
|
California NaN
Ohio 35000.0
Oregon 1600.0
Texas 71000.0
dtype: float64
- pandas的
isnull
和notnull
函数可以用于检测缺失数据;同时,该函数也是Series对象的函数。
California True
Ohio False
Oregon False
Texas False
dtype: bool
California False
Ohio True
Oregon True
Texas True
dtype: bool
California True
Ohio False
Oregon False
Texas False
dtype: bool
California False
Ohio True
Oregon True
Texas True
dtype: bool
Series对象在算术运算时会自动对齐不同索引的数据。
Ohio 35000
Texas 71000
Oregon 1600
Utah 5000
dtype: int64
California NaN
Ohio 35000.0
Oregon 1600.0
Texas 71000.0
dtype: float64
California NaN
Ohio 70000.0
Oregon 3200.0
Texas 142000.0
Utah NaN
dtype: float64
Series对象及其索引都有一个name属性,输出时index索引的name将被作为列标题。
1 2 3
| obj4.name = 'population' obj4.index.name = 'state' obj4
|
state
California NaN
Ohio 35000.0
Oregon 1600.0
Texas 71000.0
Name: population, dtype: float64
Series的索引可以通过赋值的方式就地修改:
0 4
1 7
2 -5
3 3
dtype: int64
1 2
| obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan'] obj
|
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
DataFrame
表格型数据结构。既有行索引,又有列索引;还可以看作是由Series组成的字典(共用一个索引)。
- 传入一个等长列表或NumPy数组组成的词典。DataFrame会自动添加索引,且全部列会被有序排列。
- 如果指定了序列,则DataFrame的列会按照指定顺序进行排列。
1 2 3 4 5
| data = {'state':['Ohio','Ohio', 'Ohio','Nevada' ,'Nevada'], 'year':[2000, 2001, 2002, 2001, 2002], 'pop':[1.5, 1.7, 3.6, 2.4, 2.9]} df1 = pd.DataFrame(data) df1
|
|
state
|
year
|
pop
|
0
|
Ohio
|
2000
|
1.5
|
1
|
Ohio
|
2001
|
1.7
|
2
|
Ohio
|
2002
|
3.6
|
3
|
Nevada
|
2001
|
2.4
|
4
|
Nevada
|
2002
|
2.9
|
1 2
| pd.DataFrame(data, columns=['year', 'state', 'pop'])
|
|
year
|
state
|
pop
|
0
|
2000
|
Ohio
|
1.5
|
1
|
2001
|
Ohio
|
1.7
|
2
|
2002
|
Ohio
|
3.6
|
3
|
2001
|
Nevada
|
2.4
|
4
|
2002
|
Nevada
|
2.9
|
如果传入的列在数据中找不到,则会产生NA值。
1 2 3
| df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five']) df2
|
|
year
|
state
|
pop
|
debt
|
one
|
2000
|
Ohio
|
1.5
|
NaN
|
two
|
2001
|
Ohio
|
1.7
|
NaN
|
three
|
2002
|
Ohio
|
3.6
|
NaN
|
four
|
2001
|
Nevada
|
2.4
|
NaN
|
five
|
2002
|
Nevada
|
2.9
|
NaN
|
Index(['year', 'state', 'pop', 'debt'], dtype='object')
- 列获取。通过类似字典标记或属性的方式,可以将DataFrame的列获取为一个Series。返回Series拥有原DataFrame的相同索引,且其name属性已经被相应地设置好了。
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
- 行获取。基于标签的索引用
.loc
,基于位置的索引用.iloc
:
|
year
|
state
|
pop
|
debt
|
one
|
2000
|
Ohio
|
1.5
|
NaN
|
two
|
2001
|
Ohio
|
1.7
|
NaN
|
three
|
2002
|
Ohio
|
3.6
|
NaN
|
four
|
2001
|
Nevada
|
2.4
|
NaN
|
five
|
2002
|
Nevada
|
2.9
|
NaN
|
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
- 赋值。可以将标量、列表或数组赋予某个列,从而对列通过赋值方式进行修改。
|
year
|
state
|
pop
|
debt
|
one
|
2000
|
Ohio
|
1.5
|
16.5
|
two
|
2001
|
Ohio
|
1.7
|
16.5
|
three
|
2002
|
Ohio
|
3.6
|
16.5
|
four
|
2001
|
Nevada
|
2.4
|
16.5
|
five
|
2002
|
Nevada
|
2.9
|
16.5
|
1 2
| df2['debt'] = np.arange(5) df2
|
|
year
|
state
|
pop
|
debt
|
one
|
2000
|
Ohio
|
1.5
|
0
|
two
|
2001
|
Ohio
|
1.7
|
1
|
three
|
2002
|
Ohio
|
3.6
|
2
|
four
|
2001
|
Nevada
|
2.4
|
3
|
five
|
2002
|
Nevada
|
2.9
|
4
|
将列表或数组赋值给某列时,其长度必须与DataFrame的长度匹配。如果赋值的是一个
Series,就会精确匹配DataFrame的索引,所有的空位会被填上缺失值。
1 2 3
| val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five']) df2['debt'] = val df2
|
|
year
|
state
|
pop
|
debt
|
one
|
2000
|
Ohio
|
1.5
|
NaN
|
two
|
2001
|
Ohio
|
1.7
|
-1.2
|
three
|
2002
|
Ohio
|
3.6
|
NaN
|
four
|
2001
|
Nevada
|
2.4
|
-1.5
|
five
|
2002
|
Nevada
|
2.9
|
-1.7
|
为不存的列赋值将会创建一个新列。关键字del
用于删除列:
1 2
| df2['estern'] = df2.state == 'Ohio' df2
|
|
year
|
state
|
pop
|
debt
|
estern
|
one
|
2000
|
Ohio
|
1.5
|
NaN
|
True
|
two
|
2001
|
Ohio
|
1.7
|
-1.2
|
True
|
three
|
2002
|
Ohio
|
3.6
|
NaN
|
True
|
four
|
2001
|
Nevada
|
2.4
|
-1.5
|
False
|
five
|
2002
|
Nevada
|
2.9
|
-1.7
|
False
|
1 2
| del df2['estern'] df2.columns
|
Index(['year', 'state', 'pop', 'debt'], dtype='object')
通过索引方式返回的列只是相应数据的视图,而非副本。
- 嵌套字典。外层字典的键作为列,内层字典的键作为行,当然也可以对结果进行转置。内层字典的键会被合并、排序,以形成最终的索引。如果显式指定了索引,则不会。
1 2 3 4
| pop = {'Nevada':{2001:2.4, 2002:2.9}, 'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}} df3 = pd.DataFrame(pop) df3
|
|
Nevada
|
Ohio
|
2001
|
2.4
|
1.7
|
2002
|
2.9
|
3.6
|
2000
|
NaN
|
1.5
|
|
2001
|
2002
|
2000
|
Nevada
|
2.4
|
2.9
|
NaN
|
Ohio
|
1.7
|
3.6
|
1.5
|
1
| pd.DataFrame(pop, index=[2001, 2002, 2003])
|
|
Nevada
|
Ohio
|
2001
|
2.4
|
1.7
|
2002
|
2.9
|
3.6
|
2003
|
NaN
|
NaN
|
|
Nevada
|
Ohio
|
2001
|
2.4
|
1.7
|
2002
|
2.9
|
3.6
|
2000
|
NaN
|
1.5
|
1 2
| pdata = {'Ohio':df3['Ohio'][:-1], 'Neveda':df3['Nevada'][:2]} pdata
|
{'Ohio': 2001 1.7
2002 3.6
Name: Ohio, dtype: float64, 'Neveda': 2001 2.4
2002 2.9
Name: Nevada, dtype: float64}
|
Ohio
|
Neveda
|
2001
|
1.7
|
2.4
|
2002
|
3.6
|
2.9
|
['a', 'b', 'c', 'd']
1 2 3 4
| df=pd.DataFrame(np.arange(12).reshape((4,3)), index=list('abcd'), columns=['one', 'two', 'three']) df
|
|
one
|
two
|
three
|
a
|
0
|
1
|
2
|
b
|
3
|
4
|
5
|
c
|
6
|
7
|
8
|
d
|
9
|
10
|
11
|
1 2 3
| df.index.name = 'alpha' df.columns.name = 'number' df
|
number
|
one
|
two
|
three
|
alpha
|
|
|
|
a
|
0
|
1
|
2
|
b
|
3
|
4
|
5
|
c
|
6
|
7
|
8
|
d
|
9
|
10
|
11
|
可以输入给DataFrame构造器的数据
二维ndaray |
数据矩阵,还可以传入行标题和列标题。 |
由数组、列表或元组组成的字典 |
每个序列会变成DataFrame的一列,所有列的长度必须相等。 |
NumPy的结构化/记录数组 |
类似于“由数组组成的字典” |
由Series组成的字典 |
每个Series会成一列。如果没有显示指定索引,则个Serie的索引会被合并成结果的行索引。 |
Tips
- 如果设置了DataFrame的
index
和columns
的name
属性,则这些信息也会被显示出来。
values
属性会以二维ndarray的形式返回DataFrame的数据
- 如果DataFrame各列的数据类型不同,则值数组的数据类型会选用能兼容所有列的数据类型。
|
Nevada
|
Ohio
|
2001
|
2.4
|
1.7
|
2002
|
2.9
|
3.6
|
2000
|
NaN
|
1.5
|
1 2 3
| df3.index.name = 'year' df3.columns.name = 'state' df3
|
state
|
Nevada
|
Ohio
|
year
|
|
|
2001
|
2.4
|
1.7
|
2002
|
2.9
|
3.6
|
2000
|
NaN
|
1.5
|
array([[2.4, 1.7],
[2.9, 3.6],
[nan, 1.5]])
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7]], dtype=object)
索引对象
pandas的索引对象负责管理轴标签和其他元数据(轴名称等)。构建Series或DataFrame时,所用的任何数组或其他序列的标签都会被转化为一个Index
:
1 2 3
| obj = pd.Series(range(3), index=['a', 'b', 'c']) index = obj.index index
|
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')
Index对象时不可修改的(imutable)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-79-82f3412925c0> in <module>
1 # 下属命令运行将出错
----> 2 index[2] = 'd'
D:\program\Scoop\apps\python\current\Lib\site-packages\pandas\core\indexes\base.py
in setitem(self, key, value) 4258 4259 def
setitem(self, key, value): -> 4260 raise
TypeError("Index does not support mutable operations") 4261 4262 def
getitem(self, key):
TypeError: Index does not support mutable operations
1 2 3
| index = pd.Index(np.arange(3)) obj2 = pd.Series([1.5, -2.5, 0], index=index) obj2.index is index
|
True
- 除了长的像数组,
Index
的功能类似于一个固定大小的集合:
state
|
Nevada
|
Ohio
|
year
|
|
|
2001
|
2.4
|
1.7
|
2002
|
2.9
|
3.6
|
2000
|
NaN
|
1.5
|
True
False
Index
的方法和属性:
append |
连接另一个Index 对象,产生一个新的Index |
diff |
计算差集,并得到一个Index |
intersection |
计算交集 |
union |
计算并集 |
isin |
是否包含在参数集合中的布尔型数组 |
delete |
删除索引i处的元素,并得到新的Index |
drop |
删除传入的值,并得到一个新的Index |
insert |
将元素插入到索引i处,并得到新的Index |
is_monotonic |
当各元素大于等于前一个元素时,返回True |
is_unique |
当Index 没有重复时返回True |
unique |
计算Index 中唯一的数组 |
Series和DataFrame的基本功能
重新索引
- pandas对象的一个重要方法是
reindex
,其作用是创建一个适应新索引的新对象。
- 如果某个索引值当前并不存在,则会引入缺失值。可以通过
fill_value
参数指定默认缺失值。
1 2
| obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c']) obj1
|
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
1 2 3
| obj1.index = pd.Index(list("abcd")) obj1
|
a 4.5
b 7.2
c -5.3
d 3.6
dtype: float64
1 2 3
| obj2 = obj1.reindex(['a', 'b', 'c', 'd', 'e']) obj2
|
a 4.5
b 7.2
c -5.3
d 3.6
e NaN
dtype: float64
1
| obj1.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
|
a 4.5
b 7.2
c -5.3
d 3.6
e 0.0
dtype: float64
a 4.5
b 7.2
c -5.3
d 3.6
dtype: float64
- 对于时间序列等有序数据,重新索引可能需要做一些插值处理,method选项可以完成,譬如
ffill
可以实现前向填充。
1 2
| obj3 = pd.Series(['blue', 'purple', 'yelow'], index=[0, 2, 4]) obj3
|
0 blue
2 purple
4 yelow
dtype: object
0 blue
1 NaN
2 purple
3 NaN
4 yelow
5 NaN
dtype: object
1
| obj3.reindex(range(6), method='ffill')
|
0 blue
1 blue
2 purple
3 purple
4 yelow
5 yelow
dtype: object
reindex的(插值)method选项
ffill或pad |
前向填充值 |
bfill或backfill |
后向填充值 |
1 2 3 4
| date_index = pd.date_range('/1/8/2018', periods=6, freq='D') df4 = pd.DataFrame({"Price":[100, 101, np.nan, 100, 98, 88]}, index=date_index) df4
|
|
Price
|
2018-01-08
|
100.0
|
2018-01-09
|
101.0
|
2018-01-10
|
NaN
|
2018-01-11
|
100.0
|
2018-01-12
|
98.0
|
2018-01-13
|
88.0
|
1 2
| date_index2 = pd.date_range('1/3/2018', periods=12, freq='D') df4.reindex(date_index2)
|
|
Price
|
2018-01-03
|
NaN
|
2018-01-04
|
NaN
|
2018-01-05
|
NaN
|
2018-01-06
|
NaN
|
2018-01-07
|
NaN
|
2018-01-08
|
100.0
|
2018-01-09
|
101.0
|
2018-01-10
|
NaN
|
2018-01-11
|
100.0
|
2018-01-12
|
98.0
|
2018-01-13
|
88.0
|
2018-01-14
|
NaN
|
1
| df4.reindex(date_index2, method='bfill')
|
|
Price
|
2018-01-03
|
100.0
|
2018-01-04
|
100.0
|
2018-01-05
|
100.0
|
2018-01-06
|
100.0
|
2018-01-07
|
100.0
|
2018-01-08
|
100.0
|
2018-01-09
|
101.0
|
2018-01-10
|
NaN
|
2018-01-11
|
100.0
|
2018-01-12
|
98.0
|
2018-01-13
|
88.0
|
2018-01-14
|
NaN
|
对于DataFrame,reindex
可以修改行索引、列索引,或两个都修改。如果仅传入一个序列,则会重新索引行:
1 2 3 4
| df5= pd.DataFrame(np.arange(9).reshape((3,3)) , index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'] ) df5
|
|
Ohio
|
Texas
|
California
|
a
|
0
|
1
|
2
|
c
|
3
|
4
|
5
|
d
|
6
|
7
|
8
|
1 2
| df6 = df5.reindex(['a', 'b', 'c', 'd']) df6
|
|
Ohio
|
Texas
|
California
|
a
|
0.0
|
1.0
|
2.0
|
b
|
NaN
|
NaN
|
NaN
|
c
|
3.0
|
4.0
|
5.0
|
d
|
6.0
|
7.0
|
8.0
|
1 2
| states = ['Texas', 'Utah', 'California'] states
|
['Texas', 'Utah', 'California']
|
Ohio
|
Texas
|
California
|
a
|
0
|
1
|
2
|
c
|
3
|
4
|
5
|
d
|
6
|
7
|
8
|
1 2
| df5.reindex(columns=states)
|
|
Texas
|
Utah
|
California
|
a
|
1
|
NaN
|
2
|
c
|
4
|
NaN
|
5
|
d
|
7
|
NaN
|
8
|
1 2
| df5.reindex(index=['a', 'b', 'c', 'd'])
|
|
Ohio
|
Texas
|
California
|
a
|
0.0
|
1.0
|
2.0
|
b
|
NaN
|
NaN
|
NaN
|
c
|
3.0
|
4.0
|
5.0
|
d
|
6.0
|
7.0
|
8.0
|
|
Ohio
|
Texas
|
California
|
a
|
0
|
1
|
2
|
c
|
3
|
4
|
5
|
d
|
6
|
7
|
8
|
['California', 'Texas', 'Utah']
1 2 3
| df7=df5.reindex(index=['a', 'b', 'c', 'd'],columns=states) df7
|
|
California
|
Texas
|
Utah
|
a
|
2.0
|
1.0
|
NaN
|
b
|
NaN
|
NaN
|
NaN
|
c
|
5.0
|
4.0
|
NaN
|
d
|
8.0
|
7.0
|
NaN
|
1 2 3
| df7=df5.reindex(index=['a', 'b', 'c', 'd'], columns=['Texas', 'Utah', 'California']) df7
|
|
Texas
|
Utah
|
California
|
a
|
1.0
|
NaN
|
2.0
|
b
|
NaN
|
NaN
|
NaN
|
c
|
4.0
|
NaN
|
5.0
|
d
|
7.0
|
NaN
|
8.0
|
1
| df7.fillna(method='ffill')
|
|
Texas
|
Utah
|
California
|
a
|
1.0
|
NaN
|
2.0
|
b
|
1.0
|
NaN
|
2.0
|
c
|
4.0
|
NaN
|
5.0
|
d
|
7.0
|
NaN
|
8.0
|
丢弃指定轴上的项
丢弃某条轴上的一个或多个项,只需一个索引数组或列表即可。drop
方法返回一个在指定轴上删除指定值后的新对象。
1 2
| obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e']) obj
|
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
1 2
| obj11=obj.drop('c') obj11
|
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
b 1.0
c 2.0
e 4.0
dtype: float64
- 对于DataFrame,可以删除任何轴上的索引值。
1 2 3 4
| data = pd.DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four']) data
|
|
one
|
two
|
three
|
four
|
Ohio
|
0
|
1
|
2
|
3
|
Colorado
|
4
|
5
|
6
|
7
|
Utah
|
8
|
9
|
10
|
11
|
New York
|
12
|
13
|
14
|
15
|
1
| data.drop(['Colorado', 'Ohio'])
|
|
one
|
two
|
three
|
four
|
Utah
|
8
|
9
|
10
|
11
|
New York
|
12
|
13
|
14
|
15
|
1
| data.drop('two', axis=1)
|
|
one
|
three
|
four
|
Ohio
|
0
|
2
|
3
|
Colorado
|
4
|
6
|
7
|
Utah
|
8
|
10
|
11
|
New York
|
12
|
14
|
15
|
1
| data.drop(['one', 'four'], axis=1)
|
|
two
|
three
|
Ohio
|
1
|
2
|
Colorado
|
5
|
6
|
Utah
|
9
|
10
|
New York
|
13
|
14
|
索引、选取和过滤
Serries索引
Series的索引的工作方式类似于NumPy数组的索引,只不过Series的索引值不只是整数,还可以是标签。
1 2
| obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd']) obj
|
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
1.0
2.0
c 2.0
d 3.0
dtype: float64
a 0.0
d 3.0
c 2.0
dtype: float64
a 0.0
b 1.0
dtype: float64
利用标签的切片运算与普通的Python切片运算不同,其末端是包含的(inclusive)
b 1.0
c 2.0
dtype: float64
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
DataFrame索引
对DataFrame进行索引其实就是获取一个或多个列。
1 2 3 4
| data = pd.DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four']) data
|
|
one
|
two
|
three
|
four
|
Ohio
|
0
|
1
|
2
|
3
|
Colorado
|
4
|
5
|
6
|
7
|
Utah
|
8
|
9
|
10
|
11
|
New York
|
12
|
13
|
14
|
15
|
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
|
two
|
Ohio
|
1
|
Colorado
|
5
|
Utah
|
9
|
New York
|
13
|
|
three
|
two
|
Ohio
|
2
|
1
|
Colorado
|
6
|
5
|
Utah
|
10
|
9
|
New York
|
14
|
13
|
|
one
|
two
|
three
|
four
|
Ohio
|
0
|
1
|
2
|
3
|
Colorado
|
4
|
5
|
6
|
7
|
|
one
|
two
|
three
|
four
|
Colorado
|
4
|
5
|
6
|
7
|
Utah
|
8
|
9
|
10
|
11
|
New York
|
12
|
13
|
14
|
15
|
|
one
|
two
|
three
|
four
|
Ohio
|
True
|
True
|
True
|
True
|
Colorado
|
True
|
False
|
False
|
False
|
Utah
|
False
|
False
|
False
|
False
|
New York
|
False
|
False
|
False
|
False
|
|
one
|
two
|
three
|
four
|
Ohio
|
0
|
0
|
0
|
0
|
Colorado
|
0
|
5
|
6
|
7
|
Utah
|
8
|
9
|
10
|
11
|
New York
|
12
|
13
|
14
|
15
|
- 为了在DataFrame的行上进行标签索引,引入了专门的
loc
字段,从而可以通过NumPy式的的标记法以及轴标签从DaataFrame中选取行和列的子集。
- 如果是索引值索引,则可以使用
iloc
字段
1 2
| slice = data.loc['Colorado', ['two', 'three']] slice
|
two 5
three 6
Name: Colorado, dtype: int32
1
| data.loc[['Colorado', 'Utah'], ['four','one', 'two']]
|
|
four
|
one
|
two
|
Colorado
|
7
|
0
|
5
|
Utah
|
11
|
8
|
9
|
1
| data.iloc[[1,2],[3,0,1]]
|
|
four
|
one
|
two
|
Colorado
|
7
|
0
|
5
|
Utah
|
11
|
8
|
9
|
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
1
| data[data['three'] > 5].iloc[:,:3]
|
|
one
|
two
|
three
|
Colorado
|
0
|
5
|
6
|
Utah
|
8
|
9
|
10
|
New York
|
12
|
13
|
14
|
|
one
|
two
|
three
|
four
|
Ohio
|
0
|
0
|
0
|
0
|
Colorado
|
0
|
5
|
6
|
7
|
Utah
|
8
|
9
|
10
|
11
|
New York
|
12
|
13
|
14
|
15
|
DataFrame的索引选项
df[val] |
选取DataFrame的单一列,或一组列 |
df.loc[val,val] |
通过标签索引,选取DataFrame的行、列,或行列的交集 |
df.iloc[val,val] |
通过下标值索引,选取DataFrame的行、列,或行列的交集 |
df.reindex 方法 |
将一个或多个轴匹配到新的索引 |
df.at[val, val] |
通过行标签和列标签选取单个值。 |
1 2 3
| data = pd.DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
|
1
| data.at['New York', 'three']
|
14.0
1 2 3
| data.at['Utah','four'] = 15 data.at['two','Utah'] = 0 data
|
|
one
|
two
|
three
|
four
|
Utah
|
Ohio
|
0.0
|
1.0
|
2.0
|
3.0
|
NaN
|
Colorado
|
4.0
|
5.0
|
6.0
|
7.0
|
NaN
|
Utah
|
8.0
|
9.0
|
10.0
|
15.0
|
NaN
|
New York
|
12.0
|
13.0
|
14.0
|
15.0
|
NaN
|
two
|
NaN
|
NaN
|
NaN
|
NaN
|
0.0
|
算术运算和数据对齐
- pandas可以对不同索引的对象进行算术运算。将对象相加时,如果存在不同的索引对,则结果的索引就是该索引对的并集。
1 2
| s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e']) s1
|
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
1 2 3
| s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g']) s2
|
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
- 自动的数据对齐操作在不重叠的索引处引入NA值,缺失值会在算术运算中传播。
- 对于DataFrame,对齐操作同样发生在行和列上。
1 2 3 4
| df1 = pd.DataFrame(np.arange(9.0).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado']) df1
|
|
b
|
c
|
d
|
Ohio
|
0.0
|
1.0
|
2.0
|
Texas
|
3.0
|
4.0
|
5.0
|
Colorado
|
6.0
|
7.0
|
8.0
|
1 2 3 4
| df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) df2
|
|
b
|
d
|
e
|
Utah
|
0
|
1
|
2
|
Ohio
|
3
|
4
|
5
|
Texas
|
6
|
7
|
8
|
Oregon
|
9
|
10
|
11
|
|
b
|
c
|
d
|
e
|
Colorado
|
NaN
|
NaN
|
NaN
|
NaN
|
Ohio
|
3.0
|
NaN
|
6.0
|
NaN
|
Oregon
|
NaN
|
NaN
|
NaN
|
NaN
|
Texas
|
9.0
|
NaN
|
12.0
|
NaN
|
Utah
|
NaN
|
NaN
|
NaN
|
NaN
|
在算术方法中填充值
1 2 3
| df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list('abcd')) df1
|
|
a
|
b
|
c
|
d
|
0
|
0.0
|
1.0
|
2.0
|
3.0
|
1
|
4.0
|
5.0
|
6.0
|
7.0
|
2
|
8.0
|
9.0
|
10.0
|
11.0
|
1 2 3
| df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list('abcde')) df2
|
|
a
|
b
|
c
|
d
|
e
|
0
|
0.0
|
1.0
|
2.0
|
3.0
|
4.0
|
1
|
5.0
|
6.0
|
7.0
|
8.0
|
9.0
|
2
|
10.0
|
11.0
|
12.0
|
13.0
|
14.0
|
3
|
15.0
|
16.0
|
17.0
|
18.0
|
19.0
|
|
a
|
b
|
c
|
d
|
e
|
0
|
0.0
|
2.0
|
4.0
|
6.0
|
NaN
|
1
|
9.0
|
11.0
|
13.0
|
15.0
|
NaN
|
2
|
18.0
|
20.0
|
22.0
|
24.0
|
NaN
|
3
|
NaN
|
NaN
|
NaN
|
NaN
|
NaN
|
1
| df1.add(df2,fill_value=0)
|
|
a
|
b
|
c
|
d
|
e
|
0
|
0.0
|
2.0
|
4.0
|
6.0
|
4.0
|
1
|
9.0
|
11.0
|
13.0
|
15.0
|
9.0
|
2
|
18.0
|
20.0
|
22.0
|
24.0
|
14.0
|
3
|
15.0
|
16.0
|
17.0
|
18.0
|
19.0
|
|
a
|
b
|
c
|
d
|
0
|
0.0
|
1.0
|
2.0
|
3.0
|
1
|
4.0
|
5.0
|
6.0
|
7.0
|
2
|
8.0
|
9.0
|
10.0
|
11.0
|
算术算法 1. add: 加法 2. sub: 减法 3. div: 除法/ 4. mul: 乘法
DataFrame和Series之间的运算
1 2
| arr = np.arange(12).reshape((3, 4)) arr
|
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([0, 1, 2, 3])
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])
- 上述模式成为广播(broadcasting),DataFrame和Series之间的运算也差不多。
- 默认情况下,DataFrame和Series之间的算术运算会将Series的索引匹配到DataFrame的列,然后沿着行一直向下广播。
1 2 3 4
| df3 = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) df3
|
|
b
|
d
|
e
|
Utah
|
0.0
|
1.0
|
2.0
|
Ohio
|
3.0
|
4.0
|
5.0
|
Texas
|
6.0
|
7.0
|
8.0
|
Oregon
|
9.0
|
10.0
|
11.0
|
1 2
| series = df3.iloc[0] series
|
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
|
b
|
d
|
e
|
Utah
|
0.0
|
0.0
|
0.0
|
Ohio
|
3.0
|
3.0
|
3.0
|
Texas
|
6.0
|
6.0
|
6.0
|
Oregon
|
9.0
|
9.0
|
9.0
|
|
b
|
d
|
e
|
Utah
|
0.0
|
0.0
|
0.0
|
Ohio
|
3.0
|
3.0
|
3.0
|
Texas
|
6.0
|
6.0
|
6.0
|
Oregon
|
9.0
|
9.0
|
9.0
|
- 如果某个索引值在DataFrame或Series的索引中找不到,则参与运算的两个对象就会被重新索引以形成并集。
1 2
| series2 = pd.Series(range(3), index=['b', 'e', 'f']) series2
|
b 0
e 1
f 2
dtype: int64
|
b
|
d
|
e
|
Utah
|
0.0
|
1.0
|
2.0
|
Ohio
|
3.0
|
4.0
|
5.0
|
Texas
|
6.0
|
7.0
|
8.0
|
Oregon
|
9.0
|
10.0
|
11.0
|
|
b
|
d
|
e
|
f
|
Utah
|
0.0
|
NaN
|
3.0
|
NaN
|
Ohio
|
3.0
|
NaN
|
6.0
|
NaN
|
Texas
|
6.0
|
NaN
|
9.0
|
NaN
|
Oregon
|
9.0
|
NaN
|
12.0
|
NaN
|
如果希望匹配行且在列上广播,则必须使用算术运算方法
|
b
|
d
|
e
|
Utah
|
0.0
|
1.0
|
2.0
|
Ohio
|
3.0
|
4.0
|
5.0
|
Texas
|
6.0
|
7.0
|
8.0
|
Oregon
|
9.0
|
10.0
|
11.0
|
1 2
| series3 = df3['d'] series3
|
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
1
| df3.sub(series3, axis=0)
|
|
b
|
d
|
e
|
Utah
|
-1.0
|
0.0
|
1.0
|
Ohio
|
-1.0
|
0.0
|
1.0
|
Texas
|
-1.0
|
0.0
|
1.0
|
Oregon
|
-1.0
|
0.0
|
1.0
|
1 2 3
| df5 = pd.DataFrame(np.arange(12).reshape(4,3), columns=np.arange(3)) df5
|
|
0
|
1
|
2
|
0
|
0
|
1
|
2
|
1
|
3
|
4
|
5
|
2
|
6
|
7
|
8
|
3
|
9
|
10
|
11
|
1 2
| series5 = pd.Series(np.arange(3)) series5
|
0 0
1 1
2 2
dtype: int32
|
0
|
1
|
2
|
0
|
0
|
0
|
0
|
1
|
3
|
3
|
3
|
2
|
6
|
6
|
6
|
3
|
9
|
9
|
9
|
|
0
|
1
|
2
|
0
|
0
|
0
|
0
|
1
|
3
|
3
|
3
|
2
|
6
|
6
|
6
|
3
|
9
|
9
|
9
|
1
| df5.sub(series5, axis=0)
|
|
0
|
1
|
2
|
0
|
0.0
|
1.0
|
2.0
|
1
|
2.0
|
3.0
|
4.0
|
2
|
4.0
|
5.0
|
6.0
|
3
|
NaN
|
NaN
|
NaN
|
函数应用和映射
NumPy的ufuncs(元素级数组方法)也可以用于操作pandas对象
1 2 3 4
| df2 = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) df2
|
|
b
|
d
|
e
|
Utah
|
-0.792212
|
1.594514
|
-0.946233
|
Ohio
|
1.131797
|
0.371889
|
-0.015236
|
Texas
|
-0.810233
|
1.431196
|
-0.819002
|
Oregon
|
0.080920
|
0.177602
|
0.128606
|
|
b
|
d
|
e
|
Utah
|
0.792212
|
1.594514
|
0.946233
|
Ohio
|
1.131797
|
0.371889
|
0.015236
|
Texas
|
0.810233
|
1.431196
|
0.819002
|
Oregon
|
0.080920
|
0.177602
|
0.128606
|
- 还可以用DataFrame的
apply
方法将函数应用到由各列或行所形成的一维数组上。参数axis指名Axis
along which the function is applied
1 2 3
| f = lambda x:x.max() - x.min()
df2.apply(f, axis=0)
|
b 1.942029
d 1.416912
e 1.074839
dtype: float64
Utah 2.540747
Ohio 1.147033
Texas 2.250198
Oregon 0.096682
dtype: float64
1 2 3
| def f(x): return pd.Series([x.min(), x.max(), x.max()- x.min()], index=['min', 'max', 'gap'])
|
|
b
|
d
|
e
|
min
|
-0.810233
|
0.177602
|
-0.946233
|
max
|
1.131797
|
1.594514
|
0.128606
|
gap
|
1.942029
|
1.416912
|
1.074839
|
|
min
|
max
|
gap
|
Utah
|
-0.946233
|
1.594514
|
2.540747
|
Ohio
|
-0.015236
|
1.131797
|
1.147033
|
Texas
|
-0.819002
|
1.431196
|
2.250198
|
Oregon
|
0.080920
|
0.177602
|
0.096682
|
- 许多最为常见的数组统计功能都被是现成DataFrame的方法,如
sum
和mean
,因而无需使用apply方法。
b -0.389728
d 3.575201
e -1.651866
dtype: float64
Utah -0.047977
Ohio 0.496150
Texas -0.066013
Oregon 0.129043
dtype: float64
- 此外,还可以用元素级的Python函数,采用
applymap
即可实现
1 2 3
| format = lambda x:'%.2f' % x
df2.applymap(format)
|
|
b
|
d
|
e
|
Utah
|
-0.79
|
1.59
|
-0.95
|
Ohio
|
1.13
|
0.37
|
-0.02
|
Texas
|
-0.81
|
1.43
|
-0.82
|
Oregon
|
0.08
|
0.18
|
0.13
|
Utah -0.95
Ohio -0.02
Texas -0.82
Oregon 0.13
Name: e, dtype: object
排序和排名
排序Sorting
根据条件对数据集进行排序(sorting)。要对行或列索引进行排序(按字典顺序),可使用sort_index
方法,其将返回一个已排序的新对象。
1 2
| obj = pd.Series(range(4), index=['d', 'a', 'b', 'c']) obj
|
d 0
a 1
b 2
c 3
dtype: int64
a 1
b 2
c 3
d 0
dtype: int64
- DataFrame,可以根据任意轴上的索引进行排序。
1 2 3 4
| df6 = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c']) df6
|
|
d
|
a
|
b
|
c
|
three
|
0
|
1
|
2
|
3
|
one
|
4
|
5
|
6
|
7
|
|
d
|
a
|
b
|
c
|
one
|
4
|
5
|
6
|
7
|
three
|
0
|
1
|
2
|
3
|
|
a
|
b
|
c
|
d
|
three
|
1
|
2
|
3
|
0
|
one
|
5
|
6
|
7
|
4
|
- 默认是按升序排列,也可以通过
ascending=False
参数进行降序排列。
- 如果按值对Series排序,可使用
sort_values()
方法。排序时,任何缺失数据都会被放到Series的末尾。
1
| df6.sort_index(axis=1, ascending=False)
|
|
d
|
c
|
b
|
a
|
three
|
0
|
3
|
2
|
1
|
one
|
4
|
7
|
6
|
5
|
1 2
| obj = pd.Series([4, np.nan, 7, np.nan, -3, 2]) obj
|
0 4.0
1 NaN
2 7.0
3 NaN
4 -3.0
5 2.0
dtype: float64
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
1
| obj.sort_values(ascending=False)
|
2 7.0
0 4.0
5 2.0
4 -3.0
1 NaN
3 NaN
dtype: float64
- 在DataFrame上,可以根据一个或多个列中的值进行排序,通过将一个或多个列名传递给by选项即可。
1 2
| df7 = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]}) df7
|
|
b
|
a
|
0
|
4
|
0
|
1
|
7
|
1
|
2
|
-3
|
0
|
3
|
2
|
1
|
|
b
|
a
|
2
|
-3
|
0
|
3
|
2
|
1
|
0
|
4
|
0
|
1
|
7
|
1
|
1
| df7.sort_values(by=['a', 'b'])
|
|
b
|
a
|
2
|
-3
|
0
|
0
|
4
|
0
|
3
|
2
|
1
|
1
|
7
|
1
|
1
| df7.sort_values(by=['a', 'b'], ascending=False)
|
|
b
|
a
|
1
|
7
|
1
|
3
|
2
|
1
|
0
|
4
|
0
|
2
|
-3
|
0
|
排名ranking
- 排名(ranking),跟排序密切相关,会增设一个排名值(从1开始,一直到数组中有效数据的数量)
- 默认情况下,
rank
是通过“为各组分配一个平均排名”的方式破坏平级关系的。
1
| obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
|
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
1
| obj.rank(method='first')
|
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
1
| obj.rank(ascending=False, method='max')
|
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
排名时用于破坏平级关系的method选项
average |
默认,在相等的分组中,为各个值分配平均排名 |
min |
使用整个分组的最小排名 |
max |
使用整个分组的最大排名 |
first |
按值在原始数据中出现的顺序分配排名 |
1 2 3 4
| df4 = pd.DataFrame({'b':[4.3, 7, -3, 2], 'a':[0, 1, 0, 1], 'c':[-2, 5, 8, -2.5]}) df4
|
|
b
|
a
|
c
|
0
|
4.3
|
0
|
-2.0
|
1
|
7.0
|
1
|
5.0
|
2
|
-3.0
|
0
|
8.0
|
3
|
2.0
|
1
|
-2.5
|
|
b
|
a
|
c
|
0
|
3.0
|
2.0
|
1.0
|
1
|
3.0
|
1.0
|
2.0
|
2
|
1.0
|
2.0
|
3.0
|
3
|
3.0
|
2.0
|
1.0
|
|
b
|
a
|
c
|
0
|
3.0
|
1.5
|
2.0
|
1
|
4.0
|
3.5
|
3.0
|
2
|
1.0
|
1.5
|
4.0
|
3
|
2.0
|
3.5
|
1.0
|
1
| df4.rank(method='min', ascending=False)
|
|
b
|
a
|
c
|
0
|
2.0
|
3.0
|
3.0
|
1
|
1.0
|
1.0
|
2.0
|
2
|
4.0
|
3.0
|
1.0
|
3
|
3.0
|
1.0
|
4.0
|
带有重复值的轴索引
- 许多pandas函数,如
reindex
都要求标签唯一,但并不是强制性的。
- 索引的
is_unique
属性可以检查其值是否唯一
1 2
| obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c']) obj
|
a 0
a 1
b 2
b 3
c 4
dtype: int64
False
- 如果某个索引对应多个值,则返回一个Series;而对应单个值的,则返回一个标量值。
- DataFrame的行进行索引时也是如此。
a 0
a 1
dtype: int64
4
1 2 3 4
| import numpy as np df = pd.DataFrame(np.random.randn(4, 3), index=list('aabb'),columns=[1,2,3]) df
|
|
1
|
2
|
3
|
a
|
-0.052453
|
1.164686
|
-0.046114
|
a
|
-0.273958
|
-0.749853
|
-1.216034
|
b
|
0.139408
|
0.776466
|
-1.407135
|
b
|
-1.034602
|
-0.421413
|
-1.416837
|
|
3
|
a
|
-0.046114
|
a
|
-1.216034
|
|
1
|
2
|
3
|
a
|
-0.052453
|
1.164686
|
-0.046114
|
a
|
-0.273958
|
-0.749853
|
-1.216034
|
a 1.164686
a -0.749853
Name: 2, dtype: float64
汇总和计算描述统计
pandas对象拥有一组常用的数学和统计方法,大多属于约简和汇总统计。
1 2 3 4 5
| df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.4]], index=list('abcd'), columns=['one', 'two']) df
|
|
one
|
two
|
a
|
1.40
|
NaN
|
b
|
7.10
|
-4.5
|
c
|
NaN
|
NaN
|
d
|
0.75
|
-1.4
|
one 9.25
two -5.90
dtype: float64
a 1.40
b 2.60
c 0.00
d -0.65
dtype: float64
- NA值会被自动排除,除非整个切片(行或列)都是NA。通过
skipna
选项可以禁用该功能。
1
| df.mean(axis=1, skipna=False)
|
a NaN
b 1.300
c NaN
d -0.325
dtype: float64
one b
two d
dtype: object
a one
b two
c NaN
d two
dtype: object
|
one
|
two
|
a
|
1.40
|
NaN
|
b
|
8.50
|
-4.5
|
c
|
NaN
|
NaN
|
d
|
9.25
|
-5.9
|
|
one
|
two
|
a
|
1.40
|
NaN
|
b
|
7.10
|
-4.5
|
c
|
NaN
|
NaN
|
d
|
0.75
|
-1.4
|
|
one
|
two
|
count
|
3.000000
|
2.000000
|
mean
|
3.083333
|
-2.950000
|
std
|
3.493685
|
2.192031
|
min
|
0.750000
|
-4.500000
|
25%
|
1.075000
|
-3.725000
|
50%
|
1.400000
|
-2.950000
|
75%
|
4.250000
|
-2.175000
|
max
|
7.100000
|
-1.400000
|
1 2
| obj = pd.Series(['a', 'a', 'b', 'c'] * 4) obj
|
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
count 16
unique 3
top a
freq 8
dtype: object
相关系数和协方差
1 2 3 4
| stock = pd.read_csv('./examples/stock.dat', sep='\t', index_col=0)
stock stock.columns
|
|
AAPL
|
MSFT
|
GSPC
|
Date
|
|
|
|
2000-01-03
|
3.625643
|
39.334630
|
1455.219971
|
2000-01-04
|
3.319964
|
38.005900
|
1399.420044
|
2000-01-05
|
3.368548
|
38.406628
|
1402.109985
|
2000-01-06
|
3.077039
|
37.120080
|
1403.449951
|
2000-01-07
|
3.222794
|
37.605172
|
1441.469971
|
2000-01-10
|
3.166112
|
37.879354
|
1457.599976
|
2000-01-11
|
3.004162
|
36.909170
|
1438.560059
|
Index(['AAPL', 'MSFT', 'GSPC'], dtype='object')
Date
2000-01-03 3.625643
2000-01-04 3.319964
2000-01-05 3.368548
2000-01-06 3.077039
2000-01-07 3.222794
2000-01-10 3.166112
2000-01-11 3.004162
Name: AAPL, dtype: float64
1
| stock.AAPL.corr(stock.MSFT)
|
0.9788313126479431
1
| stock.AAPL.cov(stock.MSFT)
|
0.16580021717290455
|
AAPL
|
MSFT
|
GSPC
|
AAPL
|
1.000000
|
0.978831
|
0.122897
|
MSFT
|
0.978831
|
1.000000
|
0.213535
|
GSPC
|
0.122897
|
0.213535
|
1.000000
|
|
AAPL
|
MSFT
|
GSPC
|
AAPL
|
0.043003
|
0.165800
|
0.657975
|
MSFT
|
0.165800
|
0.667200
|
4.503164
|
GSPC
|
0.657975
|
4.503164
|
666.562274
|
- Series的
corr
方法用于计算两个Series中重叠的、非NA的、按索引对齐的值的相关系数。类似的,cov
用于计算协方差。
- DataFrame的
corr
和cov
方法将以DataFrame的形式返回完整的相关系数或协方差矩阵。
1
| stock.corrwith(stock.AAPL)
|
AAPL 1.000000
MSFT 0.978831
GSPC 0.122897
dtype: float64
唯一值、值计数及成员资格
1 2 3 4
| obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique() uniques
|
array(['c', 'a', 'd', 'b'], dtype=object)
unique
函数返回唯一值数组,未排序。如果需要可以对结果再次进行排序(uniques.sort()
)。
value_counts
用于计算一个Series中各值出现的频率。
isin
用于判断矢量化集合的成员资格,可用于选取Series或DataFrame列中的数据的子集。
c 3
a 3
b 2
d 1
dtype: int64
1 2
| pd.value_counts(obj.values, sort=False)
|
a 3
b 2
c 3
d 1
dtype: int64
1 2
| mask = obj.isin(['b', 'c']) mask
|
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
0 c
5 b
6 b
7 c
8 c
dtype: object
- 将
value_counts
应用到DataFrame中
1 2 3
| data = pd.DataFrame({'Qu1':[1, 3, 4, 3, 4], 'Qu2':[2, 2, 1, 2, 3], 'Qu3':[1, 5, 2, 4, 4]})
|
|
Qu1
|
Qu2
|
Qu3
|
0
|
1
|
2
|
1
|
1
|
3
|
2
|
5
|
2
|
4
|
1
|
2
|
3
|
3
|
2
|
4
|
4
|
4
|
3
|
4
|
1 2
| result = data.apply(pd.value_counts, axis=1) result
|
|
1
|
2
|
3
|
4
|
5
|
0
|
2.0
|
1.0
|
NaN
|
NaN
|
NaN
|
1
|
NaN
|
1.0
|
1.0
|
NaN
|
1.0
|
2
|
1.0
|
1.0
|
NaN
|
1.0
|
NaN
|
3
|
NaN
|
1.0
|
1.0
|
1.0
|
NaN
|
4
|
NaN
|
NaN
|
1.0
|
2.0
|
NaN
|
|
1
|
2
|
3
|
4
|
5
|
0
|
2.0
|
1.0
|
0.0
|
0.0
|
0.0
|
1
|
0.0
|
1.0
|
1.0
|
0.0
|
1.0
|
2
|
1.0
|
1.0
|
0.0
|
1.0
|
0.0
|
3
|
0.0
|
1.0
|
1.0
|
1.0
|
0.0
|
4
|
0.0
|
0.0
|
1.0
|
2.0
|
0.0
|
缺失数据处理
- pandas使用NaN表示浮点和非浮点数组中的缺失值,Python中内置的None也会被当作Na处理。
- Na的处理方法包括:
dropna
:根据各标签值中是否存在缺失值进行过滤
fillna
:用指定值或插值方法(如ffill
或bfill
)填充缺失数据
isnull
:返回一个含有布尔值的对象
notnull
:isnull
的否定形式
1 2
| string_data = pd.Series(['aard', 'arti', np.nan, 'avocdo']) string_data
|
0 aard
1 arti
2 NaN
3 avocdo
dtype: object
0 False
1 False
2 True
3 False
dtype: bool
1 2
| string_data[0] = None string_data.isnull()
|
0 True
1 False
2 True
3 False
dtype: bool
滤除缺失数据
1 2
| data = pd.Series([1, np.nan, 3.5, np.nan, 7]) data.dropna()
|
0 1.0
2 3.5
4 7.0
dtype: float64
0 1.0
2 3.5
4 7.0
dtype: float64
- 对于DataFrame,
dropna
默认丢失任何含有NA的行。
1 2 3 4
| data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.0]]) data
|
|
0
|
1
|
2
|
0
|
1.0
|
6.5
|
3.0
|
1
|
1.0
|
NaN
|
NaN
|
2
|
NaN
|
NaN
|
NaN
|
3
|
NaN
|
6.5
|
3.0
|
1 2
| cleaned = data.dropna(axis=0) cleaned
|
|
0
|
1
|
2
|
0
|
1.0
|
6.5
|
3.0
|
1
|
1.0
|
NaN
|
NaN
|
3
|
NaN
|
6.5
|
3.0
|
1
| data.dropna(axis=1, how="all")
|
|
0
|
1
|
2
|
0
|
1.0
|
6.5
|
3.0
|
1
|
1.0
|
NaN
|
NaN
|
2
|
NaN
|
NaN
|
NaN
|
3
|
NaN
|
6.5
|
3.0
|
1
| data.dropna(axis=0, how='all')
|
|
0
|
1
|
2
|
0
|
1.0
|
6.5
|
3.0
|
1
|
1.0
|
NaN
|
NaN
|
3
|
NaN
|
6.5
|
3.0
|
1 2
| df = pd.DataFrame(np.random.randn(7, 3)) df
|
|
0
|
1
|
2
|
0
|
1.426088
|
1.516618
|
1.900981
|
1
|
0.219221
|
0.326165
|
0.399649
|
2
|
-1.612351
|
0.429544
|
1.726309
|
3
|
0.067660
|
-2.268722
|
0.592288
|
4
|
-1.508975
|
2.512999
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
1 2
| df.loc[:4, 1] = np.nan df
|
|
0
|
1
|
2
|
0
|
1.426088
|
NaN
|
1.900981
|
1
|
0.219221
|
NaN
|
0.399649
|
2
|
-1.612351
|
NaN
|
1.726309
|
3
|
0.067660
|
NaN
|
0.592288
|
4
|
-1.508975
|
NaN
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
1 2
| df.loc[:2, 2] = np.nan df
|
|
0
|
1
|
2
|
0
|
1.426088
|
NaN
|
NaN
|
1
|
0.219221
|
NaN
|
NaN
|
2
|
-1.612351
|
NaN
|
NaN
|
3
|
0.067660
|
NaN
|
0.592288
|
4
|
-1.508975
|
NaN
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
|
0
|
1
|
2
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
|
0
|
1
|
2
|
3
|
0.067660
|
NaN
|
0.592288
|
4
|
-1.508975
|
NaN
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
1
| df.dropna(thresh=4, axis=1)
|
|
0
|
2
|
0
|
1.426088
|
NaN
|
1
|
0.219221
|
NaN
|
2
|
-1.612351
|
NaN
|
3
|
0.067660
|
0.592288
|
4
|
-1.508975
|
1.473433
|
5
|
-1.218525
|
-2.078230
|
6
|
-0.234676
|
-0.634214
|
填充缺失数据
|
0
|
1
|
2
|
0
|
1.426088
|
0.000000
|
0.000000
|
1
|
0.219221
|
0.000000
|
0.000000
|
2
|
-1.612351
|
0.000000
|
0.000000
|
3
|
0.067660
|
0.000000
|
0.592288
|
4
|
-1.508975
|
0.000000
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
1
| df.fillna({1:0.5, 2:-1})
|
|
0
|
1
|
2
|
0
|
1.426088
|
0.500000
|
-1.000000
|
1
|
0.219221
|
0.500000
|
-1.000000
|
2
|
-1.612351
|
0.500000
|
-1.000000
|
3
|
0.067660
|
0.500000
|
0.592288
|
4
|
-1.508975
|
0.500000
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
|
0
|
1
|
2
|
0
|
1.426088
|
5.000000
|
10.000000
|
1
|
0.219221
|
5.000000
|
10.000000
|
2
|
-1.612351
|
5.000000
|
10.000000
|
3
|
0.067660
|
5.000000
|
0.592288
|
4
|
-1.508975
|
5.000000
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
- 字典填充只能用于列(axis=0,沿行填充列),不能用于行(axis=1)。
1
| df.fillna({1:5,2:10},axis=1)
|
NotImplementedError Traceback (most recent call last)
<ipython-input-283-b1b4b331430f> in <module>
----> 1 df.fillna({1:5,2:10},axis=1)
D:\program\Scoop\apps\python\current\Lib\site-packages\pandas\core\frame.py
in fillna(self, value, method, axis, inplace, limit, downcast,
kwargs) 4257 limit=limit, 4258 downcast=downcast, -> 4259
kwargs 4260 ) 4261
D:\program\Scoop\apps\python\current\Lib\site-packages\pandas\core\generic.py
in fillna(self, value, method, axis, inplace, limit, downcast) 6262
if axis == 1: 6263 raise NotImplementedError( -> 6264 "Currently only
can fill " 6265 "with dict/Series column " 6266 "by column"
NotImplementedError: Currently only can fill with dict/Series column by column
fillna默认返回新对象,但也可以对现有对象进行就地修改
1 2
| _ = df.fillna(0, inplace=True) df
|
|
0
|
1
|
2
|
0
|
1.426088
|
0.000000
|
0.000000
|
1
|
0.219221
|
0.000000
|
0.000000
|
2
|
-1.612351
|
0.000000
|
0.000000
|
3
|
0.067660
|
0.000000
|
0.592288
|
4
|
-1.508975
|
0.000000
|
1.473433
|
5
|
-1.218525
|
-1.855301
|
-2.078230
|
6
|
-0.234676
|
0.254110
|
-0.634214
|
1 2 3 4
| df8 = pd.DataFrame(np.random.randn(6, 3)) df8.loc[2:4, 1] = np.nan df8.loc[3:4,2] = np.nan df8
|
|
0
|
1
|
2
|
0
|
-1.023072
|
-0.563391
|
-0.787469
|
1
|
-0.848494
|
1.109691
|
0.421535
|
2
|
1.671679
|
NaN
|
0.089971
|
3
|
-0.556844
|
NaN
|
NaN
|
4
|
-1.665812
|
NaN
|
NaN
|
5
|
1.077345
|
0.754463
|
-0.039842
|
1
| df8.fillna(method='bfill')
|
|
0
|
1
|
2
|
0
|
-1.023072
|
-0.563391
|
-0.787469
|
1
|
-0.848494
|
1.109691
|
0.421535
|
2
|
1.671679
|
0.754463
|
0.089971
|
3
|
-0.556844
|
0.754463
|
-0.039842
|
4
|
-1.665812
|
0.754463
|
-0.039842
|
5
|
1.077345
|
0.754463
|
-0.039842
|
1
| df8.fillna(method='ffill', limit=2)
|
|
0
|
1
|
2
|
0
|
-1.023072
|
-0.563391
|
-0.787469
|
1
|
-0.848494
|
1.109691
|
0.421535
|
2
|
1.671679
|
1.109691
|
0.089971
|
3
|
-0.556844
|
1.109691
|
0.089971
|
4
|
-1.665812
|
NaN
|
0.089971
|
5
|
1.077345
|
0.754463
|
-0.039842
|
1 2
| data = pd.Series([1., np.nan, 3.5, np.nan, 7]) data.fillna(data.mean())
|
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
层次化索引
层次化索引(hierarchical
indexing),能在一个轴上拥有多个(两个以上)的索引级别,从而可以以低维度的形式处理高维度上数据。
1 2 3 4
| data = pd.Series(np.random.randn(10), index=[['a','a','a','b','b','b','c','c','d','d'], [1,2,3,1,2,3,1,2,2,3]]) data
|
a 1 -1.071029
2 -0.850225
3 1.122493
b 1 1.265563
2 0.406997
3 0.739587
c 1 1.666785
2 -1.706725
d 2 -3.175604
3 -0.359642
dtype: float64
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 2),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
1 1.265563
2 0.406997
3 0.739587
dtype: float64
b 1 1.265563
2 0.406997
3 0.739587
c 1 1.666785
2 -1.706725
dtype: float64
b 1 1.265563
2 0.406997
3 0.739587
d 2 -3.175604
3 -0.359642
dtype: float64
a 1 -1.071029
2 -0.850225
3 1.122493
b 1 1.265563
2 0.406997
3 0.739587
c 1 1.666785
2 -1.706725
d 2 -3.175604
3 -0.359642
dtype: float64
a -0.850225
b 0.406997
c -1.706725
d -3.175604
dtype: float64
|
1
|
2
|
3
|
a
|
-1.071029
|
-0.850225
|
1.122493
|
b
|
1.265563
|
0.406997
|
0.739587
|
c
|
1.666785
|
-1.706725
|
NaN
|
d
|
NaN
|
-3.175604
|
-0.359642
|
a 1 -1.071029
2 -0.850225
3 1.122493
b 1 1.265563
2 0.406997
3 0.739587
c 1 1.666785
2 -1.706725
d 2 -3.175604
3 -0.359642
dtype: float64
对于DataFrame每条轴都可以有分层索引。
1 2 3 4 5 6 7
| df = pd.DataFrame(np.arange(12.).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
df
|
|
|
Ohio
|
Colorado
|
|
|
Green
|
Red
|
Green
|
a
|
1
|
0.0
|
1.0
|
2.0
|
2
|
3.0
|
4.0
|
5.0
|
b
|
1
|
6.0
|
7.0
|
8.0
|
2
|
9.0
|
10.0
|
11.0
|
1 2 3 4 5
| df.index.names = ['key1', 'key2'] df.columns.names=['state', 'color']
df
|
|
state
|
Ohio
|
Colorado
|
|
color
|
Green
|
Red
|
Green
|
key1
|
key2
|
|
|
|
a
|
1
|
0.0
|
1.0
|
2.0
|
2
|
3.0
|
4.0
|
5.0
|
b
|
1
|
6.0
|
7.0
|
8.0
|
2
|
9.0
|
10.0
|
11.0
|
|
color
|
Green
|
Red
|
key1
|
key2
|
|
|
a
|
1
|
0.0
|
1.0
|
2
|
3.0
|
4.0
|
b
|
1
|
6.0
|
7.0
|
2
|
9.0
|
10.0
|
state
|
Ohio
|
Colorado
|
color
|
Green
|
Red
|
Green
|
key2
|
|
|
|
1
|
0.0
|
1.0
|
2.0
|
2
|
3.0
|
4.0
|
5.0
|
color
|
Green
|
Red
|
key2
|
|
|
1
|
0.0
|
1.0
|
2
|
3.0
|
4.0
|
重排分级顺序
1
| df.swaplevel('key1', 'key2')
|
|
state
|
Ohio
|
Colorado
|
|
color
|
Green
|
Red
|
Green
|
key2
|
key1
|
|
|
|
1
|
a
|
0.0
|
1.0
|
2.0
|
2
|
a
|
3.0
|
4.0
|
5.0
|
1
|
b
|
6.0
|
7.0
|
8.0
|
2
|
b
|
9.0
|
10.0
|
11.0
|
|
state
|
Ohio
|
Colorado
|
|
color
|
Green
|
Red
|
Green
|
key1
|
key2
|
|
|
|
a
|
1
|
0.0
|
1.0
|
2.0
|
2
|
3.0
|
4.0
|
5.0
|
b
|
1
|
6.0
|
7.0
|
8.0
|
2
|
9.0
|
10.0
|
11.0
|
1
| df.swaplevel(0, 1).sort_index(0)
|
|
state
|
Ohio
|
Colorado
|
|
color
|
Green
|
Red
|
Green
|
key2
|
key1
|
|
|
|
1
|
a
|
0.0
|
1.0
|
2.0
|
b
|
6.0
|
7.0
|
8.0
|
2
|
a
|
3.0
|
4.0
|
5.0
|
b
|
9.0
|
10.0
|
11.0
|
1
| df.sort_index(axis=1, level=1)
|
|
state
|
Colorado
|
Ohio
|
|
color
|
Green
|
Green
|
Red
|
key1
|
key2
|
|
|
|
a
|
1
|
2.0
|
0.0
|
1.0
|
2
|
5.0
|
3.0
|
4.0
|
b
|
1
|
8.0
|
6.0
|
7.0
|
2
|
11.0
|
9.0
|
10.0
|
根据级别汇总统计
|
state
|
Ohio
|
Colorado
|
|
color
|
Green
|
Red
|
Green
|
key1
|
key2
|
|
|
|
a
|
1
|
0.0
|
1.0
|
2.0
|
2
|
3.0
|
4.0
|
5.0
|
b
|
1
|
6.0
|
7.0
|
8.0
|
2
|
9.0
|
10.0
|
11.0
|
state
|
Ohio
|
Colorado
|
color
|
Green
|
Red
|
Green
|
key2
|
|
|
|
1
|
6.0
|
8.0
|
10.0
|
2
|
12.0
|
14.0
|
16.0
|
1
| df.sum(axis=1, level='color')
|
|
color
|
Green
|
Red
|
key1
|
key2
|
|
|
a
|
1
|
2.0
|
1.0
|
2
|
8.0
|
4.0
|
b
|
1
|
14.0
|
7.0
|
2
|
20.0
|
10.0
|
使用DataFrame的列 为索引
1 2 3 4
| df11 = pd.DataFrame({'a':range(7), 'b':range(7, 0, -1), 'c':['one','one','one','two','two','two','two'], 'd':[0, 1, 2, 0, 1, 2, 3]}) df11
|
|
a
|
b
|
c
|
d
|
0
|
0
|
7
|
one
|
0
|
1
|
1
|
6
|
one
|
1
|
2
|
2
|
5
|
one
|
2
|
3
|
3
|
4
|
two
|
0
|
4
|
4
|
3
|
two
|
1
|
5
|
5
|
2
|
two
|
2
|
6
|
6
|
1
|
two
|
3
|
set_index
函数可以将一个或多个列转换为行索引,并创建一个新的DataFrame
reset_index
的功能刚好相反,层次化索引的级别会被移到列里面。
|
b
|
c
|
d
|
a
|
|
|
|
0
|
7
|
one
|
0
|
1
|
6
|
one
|
1
|
2
|
5
|
one
|
2
|
3
|
4
|
two
|
0
|
4
|
3
|
two
|
1
|
5
|
2
|
two
|
2
|
6
|
1
|
two
|
3
|
1 2 3
| df12 = df11.set_index(['c','d'])
df12
|
|
|
a
|
b
|
c
|
d
|
|
|
one
|
0
|
0
|
7
|
1
|
1
|
6
|
2
|
2
|
5
|
two
|
0
|
3
|
4
|
1
|
4
|
3
|
2
|
5
|
2
|
3
|
6
|
1
|
|
c
|
d
|
a
|
b
|
0
|
one
|
0
|
0
|
7
|
1
|
one
|
1
|
1
|
6
|
2
|
one
|
2
|
2
|
5
|
3
|
two
|
0
|
3
|
4
|
4
|
two
|
1
|
4
|
3
|
5
|
two
|
2
|
5
|
2
|
6
|
two
|
3
|
6
|
1
|
面板数据
pandas中还有一种数据结构,Panel(面板),三维版的DataFrame,请自行搜索相关资料。