Splitting data into groups and applying a function to each group is a key step in data analysis. pandas provides the groupby facility for slicing, dicing, and summarizing datasets. This part covers how to:
Split a pandas object into pieces using one or more keys (which can be functions, arrays, or DataFrame column names)
Compute group summary statistics, such as counts, means, and standard deviations, or user-defined functions
Apply a wide variety of functions to DataFrame columns
Compute pivot tables and cross-tabulations
Perform quantile analysis
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
GroupBy Technique

The split-apply-combine process:

The data in a pandas object (whether a Series, a DataFrame, or something else) is split into groups based on one or more keys that you provide. The split is performed on a particular axis of the object; for example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
A function is applied to each group, producing a new value.
Finally, the results of all those function applications are combined into a result object.
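To make the three steps concrete, here is a minimal, self-contained sketch that performs split-apply-combine by hand (the toy DataFrame and the choice of mean are illustrative, not part of the example data below):

import pandas as pd

df0 = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 4.0]})
# split: collect the row subset belonging to each key
pieces = {k: g for k, g in df0.groupby('key')}
# apply: reduce each piece to a new value (here, the mean of 'val')
applied = {k: g['val'].mean() for k, g in pieces.items()}
# combine: assemble the per-group results into one object
result = pd.Series(applied)  # a -> 1.5, b -> 4.0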
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
  key1 key2     data1     data2
0    a  one  0.678661  0.227290
1    a  two -0.125921  0.922264
2    b  one  0.150581 -2.153545
3    b  two -0.884475 -0.365757
4    a  one -0.620521 -0.375842
df.sum()  # summing concatenates the string columns and totals the numeric ones

key1               aabba
key2     onetwoonetwoone
data1            1.16485
data2            3.78352
dtype: object
grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f0788501c70>
The variable grouped is a GroupBy object. It has not yet computed anything, but it holds all the information needed for the group computations. The GroupBy object's mean method computes the mean of each group:
grouped.mean()

key1
a   -0.022594
b   -0.366947
Name: data1, dtype: float64

grouped.sum()  # the corresponding group sums

key1
a   -0.067781
b   -0.733894
Name: data1, dtype: float64
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
key1 key2
a one 0.880536
two 0.478943
b one -0.519439
two -0.555730
Name: data1, dtype: float64
The result can be rearranged with unstack:

means.unstack()

key2       one       two
key1                    
a     0.880536  0.478943
b    -0.519439 -0.555730
The group keys can take several forms (a combined sketch follows this list):
A list or array of values, the same length as the axis being grouped
A value indicating a column name in the DataFrame
A dict or Series giving a correspondence between the values on the axis being grouped and the group names
A function to be invoked on the axis index or the individual labels in the index
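A compact sketch exercising all four kinds of keys on a small hypothetical DataFrame (the labels and mapping are made up for illustration):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1.0, 2.0, 4.0]},
                    index=['x1', 'x2', 'y1'])
demo['v'].groupby(np.array(['p', 'q', 'p'])).mean()             # array key
demo.groupby('k')['v'].mean()                                   # column-name key
demo['v'].groupby({'x1': 'g1', 'x2': 'g2', 'y1': 'g1'}).mean()  # dict key
demo['v'].groupby(lambda label: label[0]).mean()                # function key (first letter of label)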
Grouping with lists or arrays

In the examples above, the group keys were Series. A key can in fact be any array of the right length:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
California 2005 0.478943
2006 -0.519439
Ohio 2005 -0.380219
2006 1.965781
Name: data1, dtype: float64
You can also use column names (which may be strings, numbers, or other Python objects) as the group keys:
df
df.groupby('key1').mean()
  key1 key2     data1     data2
0    a  one  0.678661  0.227290
1    a  two -0.125921  0.922264
2    b  one  0.150581 -2.153545
3    b  two -0.884475 -0.365757
4    a  one -0.620521 -0.375842

         data1     data2
key1                    
a    -0.022594  0.257904
b    -0.366947 -1.259651
df.groupby('key1').idxmax()

      data1  data2
key1              
a         4      0
b         2      3
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2                    
a    one   0.029070 -0.074276
     two  -0.125921  0.922264
b    one   0.150581 -2.153545
     two  -0.884475 -0.365757
df.groupby(['key1', 'key2']).sum()

              data1     data2
key1 key2                    
a    one   1.761073  2.639841
     two   0.478943  0.092908
b    one  -0.519439  0.281746
     two  -0.555730  0.769023
The GroupBy size() method returns a Series containing group sizes:

df.groupby('key1').size()
key1
a 3
b 2
dtype: int64
Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name and the chunk of data:

for (name, group) in df.groupby('key1'):
    print("Group %s:" % name)
    print(group)
Group a:
key1 key2 data1 data2
0 a one -0.204708 1.393406
1 a two 0.478943 0.092908
4 a one 1.965781 1.246435
Group b:
key1 key2 data1 data2
2 b one -0.519439 0.281746
3 b two -0.555730 0.769023
In the case of multiple keys, the first element of each tuple is a tuple of the key values.
You can do whatever you want with these pieces of data, including turning them into a dict:

for ((k1, k2), group) in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
('a', 'one')
key1 key2 data1 data2
0 a one -0.204708 1.393406
4 a one 1.965781 1.246435
('a', 'two')
key1 key2 data1 data2
1 a two 0.478943 0.092908
('b', 'one')
key1 key2 data1 data2
2 b one -0.519439 0.281746
('b', 'two')
key1 key2 data1 data2
3 b two -0.55573 0.769023
ll = list(df.groupby('key1'))
ll
[('a',
key1 key2 data1 data2
0 a one -0.204708 1.393406
1 a two 0.478943 0.092908
4 a one 1.965781 1.246435),
('b',
key1 key2 data1 data2
2 b one -0.519439 0.281746
3 b two -0.555730 0.769023)]
ll[0]  # the first (name, group) pair

('a',
   key1 key2     data1     data2
 0    a  one -0.204708  1.393406
 1    a  two  0.478943  0.092908
 4    a  one  1.965781  1.246435)
pieces = dict(list(df.groupby('key1')))
pieces
{'a': key1 key2 data1 data2
0 a one -0.204708 1.393406
1 a two 0.478943 0.092908
4 a one 1.965781 1.246435,
'b': key1 key2 data1 data2
2 b one -0.519439 0.281746
3 b two -0.555730 0.769023}
pieces['b']

  key1 key2     data1     data2
2    b  one -0.519439  0.281746
3    b  two -0.555730  0.769023
By default groupby groups on axis=0, but you can group on any of the other axes:
df

  key1 key2     data1     data2
0    a  one -0.204708  1.393406
1    a  two  0.478943  0.092908
2    b  one -0.519439  0.281746
3    b  two -0.555730  0.769023
4    a  one  1.965781  1.246435

df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object
grouped = df.groupby(df.dtypes, axis=1)
list(grouped)
[(dtype('float64'),
data1 data2
0 -0.204708 1.393406
1 0.478943 0.092908
2 -0.519439 0.281746
3 -0.555730 0.769023
4 1.965781 1.246435),
(dtype('O'),
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one)]
for dtype, group in grouped:
    print(dtype)
    print(group)
float64
data1 data2
0 -0.204708 1.393406
1 0.478943 0.092908
2 -0.519439 0.281746
3 -0.555730 0.769023
4 1.965781 1.246435
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names selects those columns for aggregation:

df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

These are syntactic sugar for:

df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])

df.groupby(['key1', 'key2'])['data2']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f076d843a00>
df.groupby(['key1', 'key2'])[['data2']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f076d8434f0>
ss = df.groupby(['key1', 'key2'])[['data2']]
ss
ss.mean()
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f076d8808b0>
              data2
key1 key2          
a    one   1.319920
     two   0.092908
b    one   0.281746
     two   0.769023
df.groupby(['key1', 'key2'])['data2'].mean()
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
Name: data2, dtype: float64
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped
s_grouped.mean()
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f076d854fa0>
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
Name: data2, dtype: float64
Grouping with dicts and Series

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan
people
               a         b         c         d         e
Joe     1.007189 -1.296221  0.274992  0.228913  1.352917
Steve   0.886429 -2.001637 -0.371843  1.669025 -0.438570
Wes    -0.539741       NaN       NaN -1.021228 -0.577087
Jim     0.124121  0.302614  0.523772  0.000940  1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
mapping
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
by_column = people.groupby(mapping, axis=1)
d = dict(list(by_column))
d['blue']
d['red']

               c         d
Joe     0.274992  0.228913
Steve  -0.371843  1.669025
Wes          NaN -1.021228
Jim     0.523772  0.000940
Travis -2.370232 -1.860761

               a         b         e
Joe     1.007189 -1.296221  1.352917
Steve   0.886429 -2.001637 -0.438570
Wes    -0.539741       NaN -0.577087
Jim     0.124121  0.302614  1.343810
Travis -0.713544 -0.831154 -0.860757

by_column.sum()

            blue       red
Joe     0.503905  1.063885
Steve   1.297183 -1.553778
Wes    -1.021228 -1.116829
Jim     0.524712  1.770545
Travis -4.230992 -2.405455
arr = np.array(['red', 'red', 'blue', 'blue', 'red'])
people.groupby(arr, axis=1).sum()
            blue       red
Joe     0.503905  1.063885
Steve   1.297183 -1.553778
Wes    -1.021228 -1.116829
Jim     0.524712  1.770545
Travis -4.230992 -2.405455
map_series = pd.Series(mapping)
map_series
people.groupby(map_series, axis=1).count()
a red
b red
c blue
d blue
e red
f orange
dtype: object
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
Grouping with functions

Dicts and Series define a mapping relationship for grouping. Using Python functions is a more creative, abstract way to define such a mapping: any function passed as a group key will be called once per index value, and its return values are used as the group names.
Functions can also be mixed with dicts (a sketch of mixing a function with a dict follows the min() example below).
list(people.groupby(len))
[(3,
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810),
(5,
a b c d e
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.43857),
(6,
a b c d e
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757)]
people.groupby(len).sum()

          a         b         c         d         e
3  0.591569 -0.993608  0.798764 -0.791374  2.119639
5  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.539741 -1.296221  0.274992 -1.021228 -0.577087
  two  0.124121  0.302614  0.523772  0.000940  1.343810
5 one  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
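As noted above, functions can be mixed with dicts (or arrays and Series) in the same key list; the dict maps index labels to group names. A sketch, with a made-up name_class mapping:

name_class = {'Joe': 'short', 'Steve': 'short', 'Wes': 'short',
              'Jim': 'short', 'Travis': 'long'}
people.groupby([len, name_class]).min()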
Grouping by index levels

Hierarchically indexed objects can be grouped using one of the levels of an axis index:

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
cty          US                            JP          
tenor         1         3         5         1         3
0      0.560145 -1.265934  0.119827 -1.063512  0.332883
1     -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2      0.286350  0.377984 -0.753887  0.331286  1.349742
3      0.069877  0.246674 -0.011862  1.004812  1.327195
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
Data Aggregation

Aggregation refers to any data transformation that produces scalar values from arrays.
Many common aggregations are available as optimized groupby methods:

count        Number of non-NA values in the group
sum          Sum of non-NA values
mean         Mean of non-NA values
median       Arithmetic median of non-NA values
std, var     Unbiased (denominator \(n-1\), i.e. sample) standard deviation and variance
min, max     Minimum and maximum of non-NA values
prod         Product of non-NA values
first, last  First and last non-NA values
df.groupby('key1').mean()

         data1     data2
key1                    
a     0.746672  0.910916
b    -0.537585  0.525384
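A brief sketch calling a few more of these optimized methods on the same grouping (exact numbers depend on the random data above):

g = df.groupby('key1')
g.count()           # non-NA values per group and column
g.median()          # per-group medians of the numeric columns
g['data1'].first()  # first non-NA data1 value in each group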
You are not limited to these methods for aggregation: you can use aggregations of your own devising and additionally call any method that is also defined on the object being grouped.
For example, quantile computes sample quantiles of a Series or of a DataFrame's columns:

df
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
  key1 key2     data1     data2
0    a  one -0.204708  1.393406
1    a  two  0.478943  0.092908
2    b  one -0.519439  0.281746
3    b  two -0.555730  0.769023
4    a  one  1.965781  1.246435
key1
a 1.668413
b -0.523068
Name: data1, dtype: float64
To use your own aggregation function, pass it to the aggregate or agg method:

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
         data1     data2
key1                    
a     2.170488  1.300498
b     0.036292  0.487276
Column-wise and multiple function application

For more advanced aggregation, consider the tipping dataset:

tips = pd.read_csv('examples/tips.csv')
tips
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.tail()
tips.head(6)

     total_bill   tip     sex smoker   day    time  size   tip_pct
239       29.03  5.92    Male     No   Sat  Dinner     3  0.203927
240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Male    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75    Male     No   Sat  Dinner     2  0.098204
243       18.78  3.00  Female     No  Thur  Dinner     2  0.159744

   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
5       25.29  4.71    Male     No  Sun  Dinner     4  0.186240
grouped = tips.groupby(['day', 'smoker'])

grouped_pct = grouped['tip_pct']
grouped_pct.mean()
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
grouped_pct.agg('mean')
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
grouped_pct.agg(['mean', 'std', peak_to_peak])

                 mean       std  peak_to_peak
day  smoker                                  
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
Passing a list of functions yields a DataFrame whose columns are named after the functions; if you instead pass a list of (name, function) tuples, the first element of each tuple is used as the DataFrame column name:

grouped_pct.agg(['mean', np.std])
                 mean       std
day  smoker                    
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

                  foo       bar
day  smoker                    
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

            tip_pct                     total_bill                  
              count      mean       max      count       mean    max
day  smoker                                                         
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
result['tip_pct']

             count      mean       max
day  smoker                           
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

                  tip_pct              total_bill            
             Durchschnitt Abweichung Durchschnitt  Abweichung
day  smoker                                                  
Fri  No          0.151650   0.000791    18.420000   25.596333
     Yes         0.174783   0.002631    16.813333   82.562438
Sat  No          0.158048   0.001581    19.661778   79.908965
     Yes         0.147906   0.003767    21.276667  101.387535
Sun  No          0.160113   0.001793    20.506667   66.099980
     Yes         0.187250   0.023757    24.120000  109.046044
Thur No          0.160298   0.001503    17.113111   59.625081
     Yes         0.163863   0.001551    19.190588   69.808518
To apply different functions to different columns, pass a dict to agg that maps column names to functions:

grouped.agg({'tip': np.max, 'size': 'sum'})

               tip  size
day  smoker             
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40
grouped.agg({'tip_pct': [('小', 'min'), ('大', 'max'), 'mean', 'std'],
             'size': 'sum'})

              tip_pct                               size
                    小         大      mean       std  sum
day  smoker                                             
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40
Returning aggregated data without row indexes

By default the aggregated data comes back indexed by the unique group key combinations; pass as_index=False to disable this:

tips.groupby(['day', 'smoker'], as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
tips.groupby(['day', 'smoker'], as_index=True).mean()

             total_bill       tip      size   tip_pct
day  smoker                                          
Fri  No       18.420000  2.812500  2.250000  0.151650
     Yes      16.813333  2.714000  2.066667  0.174783
Sat  No       19.661778  3.102889  2.555556  0.158048
     Yes      21.276667  2.875476  2.476190  0.147906
Sun  No       20.506667  3.167895  2.929825  0.160113
     Yes      24.120000  3.516842  2.578947  0.187250
Thur No       17.113111  2.673778  2.488889  0.160298
     Yes      19.190588  3.030000  2.352941  0.163863
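The same flat layout can also be obtained after the fact by calling reset_index on the indexed result; a sketch of the equivalence:

flat = tips.groupby(['day', 'smoker']).mean().reset_index()
# `flat` matches the as_index=False result shown above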
Group-wise operations and transformations

Aggregation is only one kind of group operation; it is a special case of data transformation that accepts functions reducing a one-dimensional array to a scalar value. The transform and apply methods let you perform many other kinds of group operations.
df

  key1 key2     data1     data2
0    a  one  0.678661  0.227290
1    a  two -0.125921  0.922264
2    b  one  0.150581 -2.153545
3    b  two -0.884475 -0.365757
4    a  one -0.620521 -0.375842
How do you add a column to df holding the group means for each index? One way is to aggregate and then merge:

df.groupby('key1').mean()

         data1     data2
key1                    
a    -0.022594  0.257904
b    -0.366947 -1.259651
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means

      mean_data1  mean_data2
key1                        
a      -0.022594    0.257904
b      -0.366947   -1.259651
pd.merge(df, k1_means, left_on='key1', right_index=True)

  key1 key2     data1     data2  mean_data1  mean_data2
0    a  one  0.678661  0.227290   -0.022594    0.257904
1    a  two -0.125921  0.922264   -0.022594    0.257904
4    a  one -0.620521 -0.375842   -0.022594    0.257904
2    b  one  0.150581 -2.153545   -0.366947   -1.259651
3    b  two -0.884475 -0.365757   -0.366947   -1.259651
Alternatively, we can use the transform method on the GroupBy object.
transform applies a function to each group and then places the results in the appropriate locations. If each group produces a scalar value, that value will be broadcast across the group:

df.groupby('key1').transform(np.mean)
      data1     data2
0 -0.022594  0.257904
1 -0.022594  0.257904
2 -0.366947 -1.259651
3 -0.366947 -1.259651
4 -0.022594  0.257904
people

               a         b         c         d         e
Joe     1.007189 -1.296221  0.274992  0.228913  1.352917
Steve   0.886429 -2.001637 -0.371843  1.669025 -0.438570
Wes    -0.539741       NaN       NaN -1.021228 -0.577087
Jim     0.124121  0.302614  0.523772  0.000940  1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()

            a         b         c         d         e
one -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
two  0.505275 -0.849512  0.075965  0.834983  0.452620
people.groupby(key).transform(np.mean)

               a         b         c         d         e
Joe    -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Steve   0.505275 -0.849512  0.075965  0.834983  0.452620
Wes    -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Jim     0.505275 -0.849512  0.075965  0.834983  0.452620
Travis -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Suppose instead that you wish to subtract the mean value from each group. To do this, create a demeaning function and pass it to transform:

def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned
               a         b         c         d         e
Joe     1.089221 -0.232534  1.322612  1.113271  1.381226
Steve   0.381154 -1.152125 -0.447807  0.834043 -0.891190
Wes    -0.457709       NaN       NaN -0.136869 -0.548778
Jim    -0.381154  1.152125  0.447807 -0.834043  0.891190
Travis -0.631512  0.232534 -1.322612 -0.976402 -0.832448
demeaned.groupby(key).sum()

                a             b    c             d    e
one  0.000000e+00 -2.220446e-16  0.0  2.220446e-16  0.0
two -5.551115e-17  0.000000e+00  0.0  0.000000e+00  0.0
Like aggregate, transform is a function with strict requirements: the passed function must produce one of two kinds of result, either a scalar value that can be broadcast (like np.mean) or a transformed array of the same size as the group.
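A short sketch of the second case, returning an array the same size as each group: standardizing values within their group (assuming the people and key objects defined above):

normalized = people.groupby(key).transform(lambda g: (g - g.mean()) / g.std())
# within each of the 'one' and 'two' groups, every column of `normalized`
# now has mean 0 and standard deviation 1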
apply is the most general GroupBy method.

Apply: general split-apply-combine

The GroupBy object's apply method splits the object being manipulated into pieces, invokes the passed function on each piece, and then concatenates the pieces together: split--apply--combine.

def top(df, n=5, column='tip_pct'):
    # note: with ascending=False, slicing [-n:] keeps the n *smallest*
    # values of `column`, listed in descending order
    return df.sort_values(by=column, ascending=False)[-n:]

top(tips, n=6)
     total_bill   tip     sex smoker  day    time  size   tip_pct
210       30.06  2.00    Male    Yes  Sat  Dinner     3  0.066534
187       30.46  2.00    Male    Yes  Sun  Dinner     5  0.065660
0         16.99  1.01  Female     No  Sun  Dinner     2  0.059447
57        26.41  1.50  Female     No  Sat  Dinner     2  0.056797
102       44.30  2.50  Female    Yes  Sat  Dinner     3  0.056433
237       32.83  1.17    Male    Yes  Sat  Dinner     2  0.035638
top(tips, n=10, column='total_bill')

     total_bill   tip     sex smoker   day    time  size   tip_pct
126        8.52  1.48    Male     No  Thur   Lunch     2  0.173709
135        8.51  1.25  Female     No  Thur   Lunch     2  0.146886
145        8.35  1.50  Female     No  Thur   Lunch     2  0.179641
218        7.74  1.44    Male    Yes   Sat  Dinner     2  0.186047
195        7.56  1.44    Male     No  Thur   Lunch     2  0.190476
149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
111        7.25  1.00  Female     No   Sat  Dinner     1  0.137931
172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
92         5.75  1.00  Female    Yes   Fri  Dinner     2  0.173913
67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
tips.groupby('smoker').apply(top)

            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker                                                                   
No     130       19.08  1.50    Male     No  Thur   Lunch     2  0.078616
       146       18.64  1.36  Female     No  Thur   Lunch     3  0.072961
       48        28.55  2.05    Male     No   Sun  Dinner     3  0.071804
       0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
       57        26.41  1.50  Female     No   Sat  Dinner     2  0.056797
Yes    240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
       210       30.06  2.00    Male    Yes   Sat  Dinner     3  0.066534
       187       30.46  2.00    Male    Yes   Sun  Dinner     5  0.065660
       102       44.30  2.50  Female    Yes   Sat  Dinner     3  0.056433
       237       32.83  1.17    Male    Yes   Sat  Dinner     2  0.035638
If the function passed to apply takes other arguments or keywords, you can pass these after the function name:

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
                 total_bill   tip     sex smoker   day    time  size   tip_pct
smoker day                                                                    
No     Fri  99        12.46  1.50    Male     No   Fri  Dinner     2  0.120385
       Sat  111        7.25  1.00  Female     No   Sat  Dinner     1  0.137931
       Sun  6          8.77  2.00    Male     No   Sun  Dinner     2  0.228050
       Thur 149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
Yes    Fri  92         5.75  1.00  Female    Yes   Fri  Dinner     2  0.173913
       Sat  67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       Sun  172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
       Thur 196       10.34  2.00    Male    Yes  Thur   Lunch     2  0.193424
result = tips.groupby('smoker')['tip_pct'].describe()
result

        count      mean       std       min       25%       50%       75%       max
smoker                                                                             
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345
Calling a method like describe inside GroupBy is equivalent to running the following two lines:

f = lambda x: x.describe()
grouped.apply(f)

For example:

f = lambda x: x['tip_pct'].describe()
tips.groupby(['smoker']).apply(f)
       tip_pct                                                                      
         count      mean       std       min       25%       50%       75%      max
smoker                                                                              
No       151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
Yes       93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345
smoker
No count 151.000000
mean 0.159328
std 0.039910
min 0.056797
25% 0.136906
50% 0.155625
75% 0.185014
max 0.291990
Yes count 93.000000
mean 0.163196
std 0.085119
min 0.035638
25% 0.106771
50% 0.153846
75% 0.195059
max 0.710345
dtype: float64
smoker          No        Yes
count   151.000000  93.000000
mean      0.159328   0.163196
std       0.039910   0.085119
min       0.056797   0.035638
25%       0.136906   0.106771
50%       0.155625   0.153846
75%       0.185014   0.195059
max       0.291990   0.710345
Suppressing the group keys

By default, the group keys together with the index of each original piece form a hierarchical index in the result object. You can disable this by passing group_keys=False to groupby:

tips.groupby('smoker').apply(top, n=5)
            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker                                                                   
No     130       19.08  1.50    Male     No  Thur   Lunch     2  0.078616
       146       18.64  1.36  Female     No  Thur   Lunch     3  0.072961
       48        28.55  2.05    Male     No   Sun  Dinner     3  0.071804
       0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
       57        26.41  1.50  Female     No   Sat  Dinner     2  0.056797
Yes    240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
       210       30.06  2.00    Male    Yes   Sat  Dinner     3  0.066534
       187       30.46  2.00    Male    Yes   Sun  Dinner     5  0.065660
       102       44.30  2.50  Female    Yes   Sat  Dinner     3  0.056433
       237       32.83  1.17    Male    Yes   Sat  Dinner     2  0.035638
tips.groupby('smoker', group_keys=False).apply(top)

     total_bill   tip     sex smoker   day    time  size   tip_pct
130       19.08  1.50    Male     No  Thur   Lunch     2  0.078616
146       18.64  1.36  Female     No  Thur   Lunch     3  0.072961
48        28.55  2.05    Male     No   Sun  Dinner     3  0.071804
0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
57        26.41  1.50  Female     No   Sat  Dinner     2  0.056797
240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
210       30.06  2.00    Male    Yes   Sat  Dinner     3  0.066534
187       30.46  2.00    Male    Yes   Sun  Dinner     5  0.065660
102       44.30  2.50  Female    Yes   Sat  Dinner     3  0.056433
237       32.83  1.17    Male    Yes   Sat  Dinner     2  0.035638
Quantile and Bucket Analysis

pandas has tools for slicing data into buckets with bins of your choosing or by sample quantiles, notably cut and qcut:
pandas.cut: Return indices of half-open bins to which each value of x belongs.
pandas.qcut: Quantile-based discretization function. Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
Combining these functions with groupby makes it easy to perform quantile or bucket analysis on a dataset.
Continuous data is often discretized or otherwise separated into "bins" for analysis.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Divide these into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older:

bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
Basic signature:
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
x: the one-dimensional array to be binned
bins: an integer (divide x into that many equal-width intervals) or a sequence (use it as the explicit bin edges)
labels: the labels to use in place of the returned bins (see the sketch below)
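A sketch of labels together with retbins (the bin labels here are made up):

codes, edges = pd.cut([1, 7, 5, 4], bins=2, labels=['low', 'high'], retbins=True)
# codes -> ['low', 'high', 'high', 'low'] as a Categorical
# edges -> array of the three bin boundaries that were computed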
pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3)
[(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < (6.533, 9.7]]
pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
       3, labels=["good", "medium", "bad"])
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
There is a similar function, pd.qcut.
pd.cut determines the intervals from the values themselves, so every interval has the same width; pd.qcut determines them from the frequencies, so every interval contains (roughly) the same number of observations.

factors = np.random.randn(9)
factors

array([ 0.7587, -0.6823, -1.0385,  0.6351, -0.6   , -0.1591, -0.4176,
        1.1415, -2.043 ])

pd.cut(factors, 3)

[(0.08, 1.141], (-0.982, 0.08], (-2.046, -0.982], (0.08, 1.141], (-0.982, 0.08], (-0.982, 0.08], (-0.982, 0.08], (0.08, 1.141], (-2.046, -0.982]]
Categories (3, interval[float64]): [(-2.046, -0.982] < (-0.982, 0.08] < (0.08, 1.141]]
pd.cut(factors, 3, labels=['A', 'B', 'C'])
[C, B, A, C, B, B, B, C, A]
Categories (3, object): [A < B < C]
pd.qcut(factors, 3)

[(0.106, 1.141], (-2.044, -0.627], (-2.044, -0.627], (0.106, 1.141], (-0.627, 0.106], (-0.627, 0.106], (-0.627, 0.106], (0.106, 1.141], (-2.044, -0.627]]
Categories (3, interval[float64]): [(-2.044, -0.627] < (-0.627, 0.106] < (0.106, 1.141]]
pd.qcut(factors, 3, labels=['A', 'B', 'C'])
[C, A, A, C, B, B, B, C, A]
Categories (3, object): [A < B < C]
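The equal-frequency property is easy to check with value_counts; a sketch (the 9 values split 3/3/3 under qcut, while cut's equal-width bins usually hold unequal counts):

pd.value_counts(pd.qcut(factors, 3))  # 3 observations per bucket
pd.value_counts(pd.cut(factors, 3))   # counts vary by bin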
Combining these functions with GroupBy enables bucket or quantile analysis of a dataset:

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
frame
        data1     data2
0   -0.919262  1.165148
1   -1.549106 -0.621249
2    0.022185 -0.799318
3    0.758363  0.777233
4   -0.660524 -0.612905
..        ...       ...
995 -0.459849 -0.574654
996  0.333392  0.786210
997 -0.254742 -1.393822
998 -0.448301  0.359262
999 -1.261344  1.170900

[1000 rows x 2 columns]
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

0     (-1.23, 0.489]
1    (-2.956, -1.23]
2     (-1.23, 0.489]
3     (0.489, 2.208]
4     (-1.23, 0.489]
5     (0.489, 2.208]
6     (-1.23, 0.489]
7     (-1.23, 0.489]
8     (0.489, 2.208]
9     (0.489, 2.208]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats)
data1
(-2.956, -1.23] min -3.399312
max 1.670835
count 95.000000
mean -0.039521
(-1.23, 0.489] min -2.989741
max 3.260383
count 598.000000
mean -0.002051
(0.489, 2.208] min -3.745356
max 2.954439
count 297.000000
mean 0.081822
(2.208, 3.928] min -1.929776
max 1.765640
count 10.000000
mean 0.024750
Name: data2, dtype: float64
grouped.apply(get_stats).unstack()

                      min       max  count      mean
data1                                               
(-2.956, -1.23] -3.399312  1.670835   95.0 -0.039521
(-1.23, 0.489]  -2.989741  3.260383  598.0 -0.002051
(0.489, 2.208]  -3.745356  2.954439  297.0  0.081822
(2.208, 3.928]  -1.929776  1.765640   10.0  0.024750
grouping = pd.qcut(frame.data1, 10, labels=False)
grouping
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
0 1
1 0
2 5
3 7
4 2
..
995 3
996 6
997 4
998 3
999 0
Name: data1, Length: 1000, dtype: int64
            min       max  count      mean
data1                                     
0     -3.399312  1.670835  100.0 -0.049902
1     -1.950098  2.628441  100.0  0.030989
2     -2.925113  2.527939  100.0 -0.067179
3     -2.315555  3.260383  100.0  0.065713
4     -2.047939  2.074345  100.0 -0.111653
5     -2.989741  2.184810  100.0  0.052130
6     -2.223506  2.458842  100.0 -0.021489
7     -3.056990  2.954439  100.0 -0.026459
8     -3.745356  2.735527  100.0  0.103406
9     -2.064111  2.377020  100.0  0.220122
Filling missing values with group-specific values

s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s
0 NaN
1 0.326045
2 NaN
3 -3.428254
4 NaN
5 -0.439938
dtype: float64
One option is to fill the NA values with the overall mean:

s.fillna(s.mean())

0   -1.180716
1    0.326045
2   -1.180716
3   -3.428254
4   -1.180716
5   -0.439938
dtype: float64
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data
Ohio -0.867165
New York -0.100440
Vermont 1.018334
Florida -1.266473
Oregon 0.166839
Nevada 0.133483
California 0.672024
Idaho 0.546705
dtype: float64
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Ohio -0.867165
New York -0.100440
Vermont NaN
Florida -1.266473
Oregon 0.166839
Nevada NaN
California 0.672024
Idaho NaN
dtype: float64
To fill each group's NA values with that group's mean:

fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio -0.867165
New York -0.100440
Vermont -0.744693
Florida -1.266473
Oregon 0.166839
Nevada 0.419432
California 0.672024
Idaho 0.419432
dtype: float64
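Because the filled result has the same shape as the input, the same fill can also be written with transform instead of apply; a sketch:

filled = data.groupby(group_key).transform(lambda g: g.fillna(g.mean()))
# identical to the apply(fill_mean) result above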
Alternatively, you might want to fill with a predefined value that varies by group. Since the groups have a name attribute set internally, you can use that:

fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Ohio -0.867165
New York -0.100440
Vermont 0.500000
Florida -1.266473
Oregon 0.166839
Nevada -1.000000
California 0.672024
Idaho -1.000000
dtype: float64
Group weighted average and correlation

Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, are possible:

df = pd.DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
df
  category      data   weights
0        a -0.204708  0.748907
1        a  0.478943  0.653570
2        a -0.519439  0.747715
3        a -0.555730  0.961307
4        b  1.965781  0.008388
5        b  1.393406  0.106444
6        b  0.092908  0.298704
7        b  0.281746  0.656411
grouped11 = df.groupby("category")
wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped11.apply(wavg)

category
a   -0.245188
b    0.352824
dtype: float64
wavg2 = lambda g: pd.Series(np.average(g['data'], weights=g['weights']), index=['WAgv'])
grouped11.apply(wavg2)

              WAgv
category          
a        -0.245188
b         0.352824
close_px = pd.read_csv('./examples/stock_px.csv', parse_dates=True,
                       index_col=0)
close_px
               AA    AAPL     GE     IBM    JNJ   MSFT    PEP      SPX    XOM
1990-02-01   4.98    7.86   2.87   16.79   4.27   0.51   6.04   328.79   6.12
1990-02-02   5.04    8.00   2.87   16.89   4.37   0.51   6.09   330.92   6.24
1990-02-05   5.07    8.18   2.87   17.32   4.34   0.51   6.05   331.85   6.25
1990-02-06   5.01    8.12   2.88   17.56   4.32   0.51   6.15   329.66   6.23
1990-02-07   5.04    7.77   2.91   17.93   4.38   0.51   6.17   333.75   6.33
...           ...     ...    ...     ...    ...    ...    ...      ...    ...
2011-10-10  10.09  388.81  16.14  186.62  64.43  26.94  61.87  1194.89  76.28
2011-10-11  10.30  400.29  16.14  185.00  63.96  27.00  60.95  1195.54  76.27
2011-10-12  10.05  402.19  16.40  186.12  64.33  26.96  62.70  1207.25  77.16
2011-10-13  10.10  408.43  16.22  186.82  64.23  27.18  62.36  1203.66  76.37
2011-10-14  10.26  422.00  16.60  190.53  64.72  27.27  62.24  1224.58  78.11

[5472 rows x 9 columns]
pct_change computes the percent change from each element's prior element:

close_px.pct_change()

                  AA      AAPL        GE       IBM       JNJ      MSFT       PEP       SPX       XOM
1990-02-01       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
1990-02-02  0.012048  0.017812  0.000000  0.005956  0.023419  0.000000  0.008278  0.006478  0.019608
1990-02-05  0.005952  0.022500  0.000000  0.025459 -0.006865  0.000000 -0.006568  0.002810  0.001603
1990-02-06 -0.011834 -0.007335  0.003484  0.013857 -0.004608  0.000000  0.016529 -0.006599 -0.003200
1990-02-07  0.005988 -0.043103  0.010417  0.021071  0.013889  0.000000  0.003252  0.012407  0.016051
...              ...       ...       ...       ...       ...       ...       ...       ...       ...
2011-10-10  0.039135  0.051406  0.041290  0.023192  0.020592  0.026286  0.013930  0.034125  0.036977
2011-10-11  0.020813  0.029526  0.000000 -0.008681 -0.007295  0.002227 -0.014870  0.000544 -0.000131
2011-10-12 -0.024272  0.004747  0.016109  0.006054  0.005785 -0.001481  0.028712  0.009795  0.011669
2011-10-13  0.004975  0.015515 -0.010976  0.003761 -0.001554  0.008160 -0.005423 -0.002974 -0.010238
2011-10-14  0.015842  0.033225  0.023428  0.019859  0.007629  0.003311 -0.001924  0.017380  0.022784

[5472 rows x 9 columns]
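The value at position t is (x_t - x_{t-1}) / x_{t-1}; a tiny sketch to confirm:

pd.Series([100.0, 102.0, 96.9]).pct_change()
# 0      NaN
# 1     0.02   (102/100 - 1)
# 2    -0.05   (96.9/102 - 1)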
rets = close_px.pct_change().dropna()
rets.head()
                  AA      AAPL        GE       IBM       JNJ  MSFT       PEP       SPX       XOM
1990-02-02  0.012048  0.017812  0.000000  0.005956  0.023419   0.0  0.008278  0.006478  0.019608
1990-02-05  0.005952  0.022500  0.000000  0.025459 -0.006865   0.0 -0.006568  0.002810  0.001603
1990-02-06 -0.011834 -0.007335  0.003484  0.013857 -0.004608   0.0  0.016529 -0.006599 -0.003200
1990-02-07  0.005988 -0.043103  0.010417  0.021071  0.013889   0.0  0.003252  0.012407  0.016051
1990-02-08  0.000000 -0.007722  0.003436 -0.003904  0.018265   0.0  0.008104 -0.002367  0.003160
spx_corr = lambda x: x.corrwith(x['SPX'])

The rows are labeled by a DatetimeIndex, so a function key can pull the year out of each date label:

close_px.index

DatetimeIndex(['1990-02-01', '1990-02-02', '1990-02-05', '1990-02-06',
               '1990-02-07', '1990-02-08', '1990-02-09', '1990-02-12',
               '1990-02-13', '1990-02-14',
               ...
               '2011-10-03', '2011-10-04', '2011-10-05', '2011-10-06',
               '2011-10-07', '2011-10-10', '2011-10-11', '2011-10-12',
               '2011-10-13', '2011-10-14'],
              dtype='datetime64[ns]', length=5472, freq=None)

close_px.index.year

Int64Index([1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,
            ...
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011],
           dtype='int64', length=5472)

by_year = rets.groupby(lambda x: x.year)
by_year.apply(spx_corr)
            AA      AAPL        GE       IBM       JNJ      MSFT       PEP  SPX       XOM
1990  0.595024  0.545067  0.752187  0.738361  0.801145  0.586691  0.783168  1.0  0.517586
1991  0.453574  0.365315  0.759607  0.557046  0.646401  0.524225  0.641775  1.0  0.569335
1992  0.398180  0.498732  0.632685  0.262232  0.515740  0.492345  0.473871  1.0  0.318408
1993  0.259069  0.238578  0.447257  0.211269  0.451503  0.425377  0.385089  1.0  0.318952
1994  0.428549  0.268420  0.572996  0.385162  0.372962  0.436585  0.450516  1.0  0.395078
...        ...       ...       ...       ...       ...       ...       ...  ...       ...
2007  0.642427  0.508118  0.796945  0.603906  0.568423  0.658770  0.651911  1.0  0.786264
2008  0.781057  0.681434  0.777337  0.833074  0.801005  0.804626  0.709264  1.0  0.828303
2009  0.735642  0.707103  0.713086  0.684513  0.603146  0.654902  0.541474  1.0  0.797921
2010  0.745700  0.710105  0.822285  0.783638  0.689896  0.730118  0.626655  1.0  0.839057
2011  0.882045  0.691931  0.864595  0.802730  0.752379  0.800996  0.592029  1.0  0.859975

[22 rows x 9 columns]
The table above contains, for each year, the correlation of each stock's daily returns (percent changes) with the index (SPX).

fig, ax1 = plt.subplots(figsize=(8, 6))
ax1.set_title("Correlation with SPX")
by_year.apply(spx_corr).plot(ax=ax1)
Text(0.5, 1.0, 'Correlation with SPX')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9c59fff8b0>
Group-wise linear regression

import statsmodels.api as sm

def reg(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1  # add a constant column for the intercept term
    result = sm.OLS(Y, X).fit()
    return result.params

by_year.apply(reg, 'AAPL', ['SPX'])
           SPX  intercept
1990  1.512772   0.001395
1991  1.187351   0.000396
1992  1.832427   0.000164
1993  1.390470  -0.002657
1994  1.190277   0.001617
...        ...        ...
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514

[22 rows x 2 columns]
Pivot Tables and Cross-Tabulation

Pivot tables

A pivot table is a common data summarization tool: it aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
pandas builds pivot tables from the groupby facility together with reshape operations on hierarchical indexes.
DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function.
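A sketch of the equivalence between the method and the top-level function (both default to mean aggregation):

pd.pivot_table(tips, index='smoker', values='tip_pct')
tips.pivot_table(index='smoker', values='tip_pct')  # same result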
tips

     total_bill   tip     sex smoker   day    time  size   tip_pct
0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
1         10.34  1.66    Male     No   Sun  Dinner     3  0.160542
2         21.01  3.50    Male     No   Sun  Dinner     3  0.166587
3         23.68  3.31    Male     No   Sun  Dinner     2  0.139780
4         24.59  3.61  Female     No   Sun  Dinner     4  0.146808
..          ...   ...     ...    ...   ...     ...   ...       ...
239       29.03  5.92    Male     No   Sat  Dinner     3  0.203927
240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Male    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75    Male     No   Sat  Dinner     2  0.098204
243       18.78  3.00  Female     No  Thur  Dinner     2  0.159744

[244 rows x 8 columns]
tips.pivot_table(index='smoker', columns='sex', values=['total_bill', 'tip'])

             tip            total_bill           
sex       Female      Male      Female       Male
smoker                                           
No      2.773519  3.113402   18.105185  19.791237
Yes     2.931515  3.051167   17.977879  22.284500
tips.pivot_table(index=['sex', 'day'], columns=['smoker'], values=['tip_pct', 'size'])

                 size             tip_pct          
smoker             No       Yes        No       Yes
sex    day                                         
Female Fri   2.500000  2.000000  0.165296  0.209129
       Sat   2.307692  2.200000  0.147993  0.163817
       Sun   3.071429  2.500000  0.165710  0.237075
       Thur  2.480000  2.428571  0.155971  0.163073
Male   Fri   2.000000  2.125000  0.138005  0.144730
       Sat   2.656250  2.629630  0.162132  0.139067
       Sun   2.883721  2.600000  0.158291  0.173964
       Thur  2.500000  2.300000  0.165706  0.164417
tips.pivot_table(index=['sex', 'day'], columns=['smoker'],
                 values=['tip_pct', 'size'], margins=True)

                 size                      tip_pct                    
smoker             No       Yes       All       No       Yes       All
sex    day                                                            
Female Fri   2.500000  2.000000  2.111111  0.165296  0.209129  0.199388
       Sat   2.307692  2.200000  2.250000  0.147993  0.163817  0.156470
       Sun   3.071429  2.500000  2.944444  0.165710  0.237075  0.181569
       Thur  2.480000  2.428571  2.468750  0.155971  0.163073  0.157525
Male   Fri   2.000000  2.125000  2.100000  0.138005  0.144730  0.143385
       Sat   2.656250  2.629630  2.644068  0.162132  0.139067  0.151577
       Sun   2.883721  2.600000  2.810345  0.158291  0.173964  0.162344
       Thur  2.500000  2.300000  2.433333  0.165706  0.164417  0.165276
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803
The values in the table above are group means, which is the default aggregation type. To use a different aggregation function, pass it via the aggfunc parameter; for example, aggfunc=np.sum produces group sums:

df = pd.DataFrame(
    {
        "fruit": ["apple", "orange", "apple", "avocado", "orange"],
        "customer": ["ben", "alice", "ben", "josh", "steve"],
        "quantity": [1, 2, 3, 1, 2],
    }
)
df
     fruit customer  quantity
0    apple      ben         1
1   orange    alice         2
2    apple      ben         3
3  avocado     josh         1
4   orange    steve         2
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum)

customer  alice  ben  josh  steve
fruit                            
apple       NaN  4.0   NaN    NaN
avocado     NaN  NaN   1.0    NaN
orange      2.0  NaN   NaN    2.0
df.pivot_table(index="fruit", columns="customer", values="quantity",
               aggfunc=np.sum, fill_value=0)

customer  alice  ben  josh  steve
fruit                            
apple         0    4     0      0
avocado       0    0     1      0
orange        2    0     0      2
df.pivot_table(index="fruit", columns="customer", values="quantity",
               aggfunc=[np.sum, np.mean], fill_value=0)

           sum                 mean               
customer alice ben josh steve alice ben josh steve
fruit                                             
apple        0   4    0     0     0   2    0     0
avocado      0   0    1     0     0   0    1     0
orange       2   0    0     2     2   0    0     2
Cross-tabulations

A cross-tabulation (cross-tab for short) is a special case of a pivot table that computes group frequencies:

data = pd.DataFrame({'Sample': np.arange(1, 11),
                     'Gender': ['F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F'],
                     'Handness': ['R', 'L', 'R', 'R', 'L', 'R', 'R', 'L', 'R', 'R']})
data
   Sample Gender Handness
0       1      F        R
1       2      M        L
2       3      F        R
3       4      M        R
4       5      M        L
5       6      M        R
6       7      F        R
7       8      F        L
8       9      M        R
9      10      F        R
data.pivot_table(index='Gender', columns='Handness', values='Sample',
                 aggfunc='count', margins=True)

Handness  L  R  All
Gender             
F         1  4    5
M         2  3    5
All       3  7   10
pd.crosstab(data.Gender, data.Handness, margins=True)

Handness  L  R  All
Gender             
F         1  4    5
M         2  3    5
All       3  7   10
pd.crosstab(index=[tips.time, tips.day], columns=[tips.smoker], margins=True)

smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
A Comprehensive Example

corona = pd.read_csv('./examples/countries-aggregated.csv', parse_dates=True)
corona
             Date             Country  Confirmed  Recovered  Deaths
0      2020-01-22         Afghanistan          0          0       0
1      2020-01-22             Albania          0          0       0
2      2020-01-22             Algeria          0          0       0
3      2020-01-22             Andorra          0          0       0
4      2020-01-22              Angola          0          0       0
...           ...                 ...        ...        ...     ...
15905  2020-04-16  West Bank and Gaza        374         63       2
15906  2020-04-16      Western Sahara          6          0       0
15907  2020-04-16               Yemen          1          0       0
15908  2020-04-16              Zambia         48         30       2
15909  2020-04-16            Zimbabwe         23          1       3

[15910 rows x 5 columns]
corona['Infected'] = corona['Confirmed'] > 0  # boolean flag: at least one confirmed case

infected = corona.groupby('Date')['Infected'].sum()  # summing booleans counts infected countries per day
infected
Date
2020-01-22 6.0
2020-01-23 8.0
2020-01-24 9.0
2020-01-25 11.0
2020-01-26 13.0
...
2020-04-12 185.0
2020-04-13 185.0
2020-04-14 185.0
2020-04-15 185.0
2020-04-16 185.0
Name: Infected, Length: 86, dtype: float64
fig, ax2 = plt.subplots(1, 1)
infected.plot(ax=ax2)
ax2.set_title("感染国家数")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9c512c6d30>
Text(0.5, 1.0, '感染国家数')
To count infected countries by continent, each country must first be matched to its continent:

countries = pd.read_csv("./examples/countries.csv")
countries
         Continent       Country
0           Africa       Algeria
1           Africa        Angola
2           Africa         Benin
3           Africa      Botswana
4           Africa  Burkina Faso
..             ...           ...
189  South America      Paraguay
190  South America          Peru
191  South America      Suriname
192  South America       Uruguay
193  South America     Venezuela

[194 rows x 2 columns]
df = pd.merge(corona, countries, left_on='Country', right_on='Country', how='inner')
df
             Date      Country  Confirmed  Recovered  Deaths  Infected Continent
0      2020-01-22  Afghanistan          0          0       0     False      Asia
1      2020-01-23  Afghanistan          0          0       0     False      Asia
2      2020-01-24  Afghanistan          0          0       0     False      Asia
3      2020-01-25  Afghanistan          0          0       0     False      Asia
4      2020-01-26  Afghanistan          0          0       0     False      Asia
...           ...          ...        ...        ...     ...       ...       ...
14873  2020-04-12     Zimbabwe         14          0       3      True    Africa
14874  2020-04-13     Zimbabwe         17          0       3      True    Africa
14875  2020-04-14     Zimbabwe         17          0       3      True    Africa
14876  2020-04-15     Zimbabwe         23          1       3      True    Africa
14877  2020-04-16     Zimbabwe         23          1       3      True    Africa

[14878 rows x 7 columns]
countries = pd.pivot_table(index='Date', columns='Continent',
                           values='Infected', data=df, aggfunc=sum)
countries
Continent   Africa  Asia  Europe  North America  Oceania  South America
Date                                                                   
2020-01-22     0.0   4.0     0.0            1.0      0.0            0.0
2020-01-23     0.0   6.0     0.0            1.0      0.0            0.0
2020-01-24     0.0   6.0     1.0            1.0      0.0            0.0
2020-01-25     0.0   8.0     1.0            1.0      0.0            0.0
2020-01-26     0.0   8.0     1.0            2.0      1.0            0.0
...            ...   ...     ...            ...      ...            ...
2020-04-12    50.0  40.0    44.0           23.0      4.0           12.0
2020-04-13    50.0  40.0    44.0           23.0      4.0           12.0
2020-04-14    50.0  40.0    44.0           23.0      4.0           12.0
2020-04-15    50.0  40.0    44.0           23.0      4.0           12.0
2020-04-16    50.0  40.0    44.0           23.0      4.0           12.0

[86 rows x 6 columns]
fig3, ax3 = plt.subplots(1, 1)
countries.plot(ax=ax3)
ax3.set_title("各洲感染国家数")

[Figure: 各洲感染国家数 -- number of infected countries per continent over time]
patients = pd.pivot_table(index='Date', columns='Continent',
                          values=['Infected', 'Deaths'], data=df, aggfunc=sum)
patients
           Deaths                                                    Infected                                                  
Continent  Africa   Asia Europe North America Oceania South America    Africa  Asia Europe North America Oceania South America
Date                                                                                                                          
2020-01-22      0     17      0             0       0             0       0.0   4.0    0.0           1.0     0.0           0.0
2020-01-23      0     18      0             0       0             0       0.0   6.0    0.0           1.0     0.0           0.0
2020-01-24      0     26      0             0       0             0       0.0   6.0    1.0           1.0     0.0           0.0
2020-01-25      0     42      0             0       0             0       0.0   8.0    1.0           1.0     0.0           0.0
2020-01-26      0     56      0             0       0             0       0.0   8.0    1.0           2.0     1.0           0.0
...           ...    ...    ...           ...     ...           ...       ...   ...    ...           ...     ...           ...
2020-04-12    783  11071  76539         23348      64          2081      50.0  40.0   44.0          23.0     4.0          12.0
2020-04-13    828  11430  79736         24960      66          2247      50.0  40.0   44.0          23.0     4.0          12.0
2020-04-14    867  11839  83022         27433      71          2511      50.0  40.0   44.0          23.0     4.0          12.0
2020-04-15    904  12164  87890         30126      72          2774      50.0  40.0   44.0          23.0     4.0          12.0
2020-04-16    958  12590  91865         35034      72          3027      50.0  40.0   44.0          23.0     4.0          12.0

[86 rows x 12 columns]
fig4, ax4 = plt.subplots(1, 1)
patients['Deaths'].plot(ax=ax4)
ax4.set_title("各洲死亡人数")

[Figure: 各洲死亡人数 -- cumulative deaths per continent over time]