Bruce Han的博客

不积跬步,无以至千里;不积小流,无以成江海。

0%

Pandas之数据聚合与分组计算

对数据进行分组,并对各组应用一个函数,这是数据分析工作中的重要环节。Pandas提供 了groupby功能,对数据进行切片、切块和摘要操作。本部分内容主要包括

  1. 根据一个键或多个键(可以是函数、数组或DataFrame列名)拆分Pandas对象
  2. 计算分组摘要,例如计数、均值、标准差,或者用户自定义函数
  3. 对DataFrame列应用各种各样的函数
  4. 计算透视表或交叉表
  5. 执行分位数分析
1
%matplotlib notebook
1
2
3
4
5
6
7
8
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

GroupBy技术

split-apply-combine(拆分——应用——合并)过程

  1. pandas对象(无论是Series、DataFrame,还是其它)中的数据会根据提供的一个或多 个键被拆分(Split)为多组。拆分是在对象特定的轴上执行的,例如DataFrame可以 在其行(axes=0)或列(axe=1)上进行分组。
  2. 将一个函数应用(apply)到各个分组并产生一个新值。
  3. 所有函数执行的结果合并(combine)到最终结果对象中。
1
2
3
4
5
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),
'data2' : np.random.randn(5)})
df
key1 key2 data1 data2
0 a one 0.678661 0.227290
1 a two -0.125921 0.922264
2 b one 0.150581 -2.153545
3 b two -0.884475 -0.365757
4 a one -0.620521 -0.375842
1
df['data1'].mean()
0.23296944633426425
1
df['data2'].sum()
3.783517166545032
1
df.sum()
key1               aabba
key2     onetwoonetwoone
data1            1.16485
data2            3.78352
dtype: object
1
2
grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f0788501c70>
  • 变量grouped是一个GroubBy对象,还没有进行任何计算,但已有分组计算所需要的信息。
  • GroupBy对象的mean方法可以计算分组平均值。
1
grouped.mean() # 返回Series
key1
a   -0.022594
b   -0.366947
Name: data1, dtype: float64
1
grouped.sum()
key1
a   -0.067781
b   -0.733894
Name: data1, dtype: float64
1
2
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means # 具有层次化索引的Series
key1  key2
a     one     0.880536
      two     0.478943
b     one    -0.519439
      two    -0.555730
Name: data1, dtype: float64
1
means.unstack()
key2 one two
key1
a 0.880536 0.478943
b -0.519439 -0.555730
  • 分组键可以由多种形式,且类型不必相同:
  1. 列表或数组,其长度与待分组的轴一样
  2. 表示DataFrame某个列名的值
  3. 字典或Series,给出待分组轴上的值与分组名之间的关系
  4. 函数,用于处理轴索引或索引中的各个标签

列表或数组进行分组

  • 上述例子,分组键为Series。实际上可以为任意长度合适的数组
1
2
3
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
California  2005    0.478943
            2006   -0.519439
Ohio        2005   -0.380219
            2006    1.965781
Name: data1, dtype: float64
  • 还可以将列名(可以是字符串、数字或其他Python对象)用作分组键
1
2
df
df.groupby('key1').mean()
key1 key2 data1 data2
0 a one 0.678661 0.227290
1 a two -0.125921 0.922264
2 b one 0.150581 -2.153545
3 b two -0.884475 -0.365757
4 a one -0.620521 -0.375842
data1 data2
key1
a -0.022594 0.257904
b -0.366947 -1.259651
1
df.groupby('key1').idxmax()
data1 data2
key1
a 4 0
b 2 3
1
df.groupby(['key1', 'key2']).mean()
data1 data2
key1 key2
a one 0.029070 -0.074276
two -0.125921 0.922264
b one 0.150581 -2.153545
two -0.884475 -0.365757
1
df.groupby(['key1','key2']).sum()
data1 data2
key1 key2
a one 1.761073 2.639841
two 0.478943 0.092908
b one -0.519439 0.281746
two -0.555730 0.769023
  • GroupBysize()方法可以返回一个含有分组大小的Series
1
df.groupby('key1').size()
key1
a    3
b    2
dtype: int64

对分组进行迭代

  • GroupBy对象支持迭代,可以产生一组二元元组,由分组名和数据块组成。
1
2
3
for (name, group) in df.groupby('key1'):
print("Group %s:" % name)
print(group)
Group a:
  key1 key2     data1     data2
0    a  one -0.204708  1.393406
1    a  two  0.478943  0.092908
4    a  one  1.965781  1.246435
Group b:
  key1 key2     data1     data2
2    b  one -0.519439  0.281746
3    b  two -0.555730  0.769023
  • 对于多重键的情况,元组的第一个元素是由键值组成的元组
  • 可以对上述数据片段做任何操作,也可以转化为字典。
1
2
3
for ((k1, k2), group) in df.groupby(['key1', 'key2']):
print((k1, k2))
print(group)
('a', 'one')
  key1 key2     data1     data2
0    a  one -0.204708  1.393406
4    a  one  1.965781  1.246435
('a', 'two')
  key1 key2     data1     data2
1    a  two  0.478943  0.092908
('b', 'one')
  key1 key2     data1     data2
2    b  one -0.519439  0.281746
('b', 'two')
  key1 key2    data1     data2
3    b  two -0.55573  0.769023
1
2
ll = list(df.groupby('key1'))
ll
[('a',
    key1 key2     data1     data2
  0    a  one -0.204708  1.393406
  1    a  two  0.478943  0.092908
  4    a  one  1.965781  1.246435),
 ('b',
    key1 key2     data1     data2
  2    b  one -0.519439  0.281746
  3    b  two -0.555730  0.769023)]
1
2
ll[0][0]
ll[0][1]
'a'
key1 key2 data1 data2
0 a one -0.204708 1.393406
1 a two 0.478943 0.092908
4 a one 1.965781 1.246435
1
2
pieces = dict(list(df.groupby('key1')))
pieces
{'a':   key1 key2     data1     data2
 0    a  one -0.204708  1.393406
 1    a  two  0.478943  0.092908
 4    a  one  1.965781  1.246435,
 'b':   key1 key2     data1     data2
 2    b  one -0.519439  0.281746
 3    b  two -0.555730  0.769023}
1
pieces['b']
key1 key2 data1 data2
2 b one -0.519439 0.281746
3 b two -0.555730 0.769023
  • groupby默认是在axis=0上进行分组,通过设置也可以在其他任何轴上进行的分组。
1
2
df
df.dtypes
key1 key2 data1 data2
0 a one -0.204708 1.393406
1 a two 0.478943 0.092908
2 b one -0.519439 0.281746
3 b two -0.555730 0.769023
4 a one 1.965781 1.246435
key1      object
key2      object
data1    float64
data2    float64
dtype: object
1
2
grouped = df.groupby(df.dtypes,axis=1)
list(grouped)
[(dtype('float64'),
        data1     data2
  0 -0.204708  1.393406
  1  0.478943  0.092908
  2 -0.519439  0.281746
  3 -0.555730  0.769023
  4  1.965781  1.246435),
 (dtype('O'),
    key1 key2
  0    a  one
  1    a  two
  2    b  one
  3    b  two
  4    a  one)]
1
2
3
for dtype, group in grouped:
print(dtype)
print(group)
float64
      data1     data2
0 -0.204708  1.393406
1  0.478943  0.092908
2 -0.519439  0.281746
3 -0.555730  0.769023
4  1.965781  1.246435
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one

选取一个或一组列

由DataFrame产生的GroupBy对象,如果用一个或一组字符串列名对其进行索引,就能实现选取部分列进行聚合。

1
2
3
df.groupby('key1')['data1']

df.groupby('key1')[['data2']]
1
2
3
df['data1'].groupby(df['key1'])

df[['data2']].groupby(df['key1'])
1
df.groupby(['key1', 'key2'])['data2']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f076d843a00>
1
df.groupby(['key1', 'key2'])[['data2']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f076d8434f0>
1
2
3
4
# 只计算data2列的平均值,并以DataFrame形式返回结果
ss = df.groupby(['key1', 'key2'])[['data2']]
ss
ss.mean()
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f076d8808b0>
data2
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
1
2
# 只计算data2列的平均值,并以Series形式返回结果
df.groupby(['key1', 'key2'])['data2'].mean()
key1  key2
a     one     1.319920
      two     0.092908
b     one     0.281746
      two     0.769023
Name: data2, dtype: float64
1
2
3
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped
s_grouped.mean()
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f076d854fa0>
key1  key2
a     one     1.319920
      two     0.092908
b     one     0.281746
      two     0.769023
Name: data2, dtype: float64

通过字典或Series进行分组

1
2
3
4
5
people = pd.DataFrame(np.random.randn(5, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
1
2
3
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
'd': 'blue', 'e': 'red', 'f' : 'orange'}
mapping
{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
1
2
3
4
by_column = people.groupby(mapping, axis=1)
d=dict(list(by_column))
d['blue']
d['red']
c d
Joe 0.274992 0.228913
Steve -0.371843 1.669025
Wes NaN -1.021228
Jim 0.523772 0.000940
Travis -2.370232 -1.860761
a b e
Joe 1.007189 -1.296221 1.352917
Steve 0.886429 -2.001637 -0.438570
Wes -0.539741 NaN -0.577087
Jim 0.124121 0.302614 1.343810
Travis -0.713544 -0.831154 -0.860757
1
d.get('orange',None)
1
by_column.sum()
blue red
Joe 0.503905 1.063885
Steve 1.297183 -1.553778
Wes -1.021228 -1.116829
Jim 0.524712 1.770545
Travis -4.230992 -2.405455
1
arr = np.array(['red', 'red', 'blue', 'blue', 'red'])
1
people.groupby(arr, axis=1).sum()
blue red
Joe 0.503905 1.063885
Steve 1.297183 -1.553778
Wes -1.021228 -1.116829
Jim 0.524712 1.770545
Travis -4.230992 -2.405455
1
2
3
map_series = pd.Series(mapping)
map_series
people.groupby(map_series, axis=1).count()
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3

通过函数进行分组

  • 字典或Series,定义了一种分组映射关系。
  • Python函数定义映射关系更有创意、抽象,任何被当作分组键的函数都会在各个索引值上被调用一次,其返回的值被用作分组的名称
  • 函数和字典可以混用
1
2
# 根据人名长度进行分组
list(people.groupby(len))
[(3,
              a         b         c         d         e
  Joe  1.007189 -1.296221  0.274992  0.228913  1.352917
  Wes -0.539741       NaN       NaN -1.021228 -0.577087
  Jim  0.124121  0.302614  0.523772  0.000940  1.343810),
 (5,
                a         b         c         d        e
  Steve  0.886429 -2.001637 -0.371843  1.669025 -0.43857),
 (6,
                 a         b         c         d         e
  Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757)]
1
people.groupby(len).sum()
a b c d e
3 0.591569 -0.993608 0.798764 -0.791374 2.119639
5 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
1
2
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
a b c d e
3 one -0.539741 -1.296221 0.274992 -1.021228 -0.577087
two 0.124121 0.302614 0.523772 0.000940 1.343810
5 one 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

根据索引级别进行分组

1
2
3
4
5
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
[1, 3, 5, 1, 3]],
names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
cty US JP
tenor 1 3 5 1 3
0 0.560145 -1.265934 0.119827 -1.063512 0.332883
1 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2 0.286350 0.377984 -0.753887 0.331286 1.349742
3 0.069877 0.246674 -0.011862 1.004812 1.327195
1
hier_df.groupby(level='cty', axis=1).count()
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3

数据聚合

  • 所谓聚合,指的是任何能够从数组产生标量值的转换过程
  • 许多常见的聚合运算,经过优化的groupby的方法
函数名 说明
count 分组中非Na值的数量
sum 分组中非Na值的和
mean 分组中非Na值的平均值
median 分组中非Na值的算术中位数
stdvar 无偏(分母为\(n-1\))标准差和方差,即样本标准差和方差
minmax 分组中非Na值的最小值、最大值
prod 分组中非Na值的积
firstlast 分组中第一个、最后一个非Na值
1
grouped.mean()
data1 data2
key1
a 0.746672 0.910916
b -0.537585 0.525384
  • 聚合运算并不是只能使用这些方法,实际上可以使用自己定义的聚合运算,还可以调用分组对象上已经定义好的任何方法。
  • quantile可以计算Series或DataFrame列的样本分位数
1
2
3
4
df
grouped = df.groupby('key1')
# 样本分位数
grouped['data1'].quantile(0.9)
key1 key2 data1 data2
0 a one -0.204708 1.393406
1 a two 0.478943 0.092908
2 b one -0.519439 0.281746
3 b two -0.555730 0.769023
4 a one 1.965781 1.246435
key1
a    1.668413
b   -0.523068
Name: data1, dtype: float64
  • 还可以使用自己定义的聚合运算,以及分组对象上已经定义好的任何方法
  • 如果要使用自己的聚合函数,只需将其传入aggregateagg方法即可。
1
2
3
def peak_to_peak(arr):
return arr.max() - arr.min()
grouped.agg(peak_to_peak)
data1 data2
key1
a 2.170488 1.300498
b 0.036292 0.487276

面像列的多函数应用

更高级的聚合功能。

1
2
tips = pd.read_csv('examples/tips.csv')
tips
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

1
2
3
4
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.tail()
tips.head(6)
total_bill tip sex smoker day time size tip_pct
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
5 25.29 4.71 Male No Sun Dinner 4 0.186240
1
grouped = tips.groupby(['day', 'smoker'])
1
2
grouped_pct = grouped['tip_pct']
grouped_pct.mean()
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
1
2
# 将函数名以字符串形式传入
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
1
grouped_pct.agg(['mean', 'std', peak_to_peak])
mean std peak_to_peak
day smoker
Fri No 0.151650 0.028123 0.067349
Yes 0.174783 0.051293 0.159925
Sat No 0.158048 0.039767 0.235193
Yes 0.147906 0.061375 0.290095
Sun No 0.160113 0.042347 0.193226
Yes 0.187250 0.154134 0.644685
Thur No 0.160298 0.038774 0.193350
Yes 0.163863 0.039389 0.151240
  • 如果传入一个由(name, function)元组组成的列表,则各元组的第一个元素会被作为DataFrame的列名。
1
grouped_pct.agg(['mean', np.std])
mean std
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389
1
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
foo bar
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389
1
2
3
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
tip_pct total_bill
count mean max count mean max
day smoker
Fri No 4 0.151650 0.187735 4 18.420000 22.75
Yes 15 0.174783 0.263480 15 16.813333 40.17
Sat No 45 0.158048 0.291990 45 19.661778 48.33
Yes 42 0.147906 0.325733 42 21.276667 50.81
Sun No 57 0.160113 0.252672 57 20.506667 48.17
Yes 19 0.187250 0.710345 19 24.120000 45.35
Thur No 45 0.160298 0.266312 45 17.113111 41.19
Yes 17 0.163863 0.241255 17 19.190588 43.11
1
result['tip_pct']
count mean max
day smoker
Fri No 4 0.151650 0.187735
Yes 15 0.174783 0.263480
Sat No 45 0.158048 0.291990
Yes 42 0.147906 0.325733
Sun No 57 0.160113 0.252672
Yes 19 0.187250 0.710345
Thur No 45 0.160298 0.266312
Yes 17 0.163863 0.241255
1
2
3
# 自定义名称的元组列表
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
tip_pct total_bill
Durchschnitt Abweichung Durchschnitt Abweichung
day smoker
Fri No 0.151650 0.000791 18.420000 25.596333
Yes 0.174783 0.002631 16.813333 82.562438
Sat No 0.158048 0.001581 19.661778 79.908965
Yes 0.147906 0.003767 21.276667 101.387535
Sun No 0.160113 0.001793 20.506667 66.099980
Yes 0.187250 0.023757 24.120000 109.046044
Thur No 0.160298 0.001503 17.113111 59.625081
Yes 0.163863 0.001551 19.190588 69.808518
  • 不同的列应用不同的函数,向agg传入从列名映射到函数的字典
1
2
# 不同的列应用不同的函数,向agg传入从列名映射到函数的字典
grouped.agg({'tip' : np.max, 'size' : 'sum'})
tip size
day smoker
Fri No 3.50 9
Yes 4.73 31
Sat No 9.00 115
Yes 10.00 104
Sun No 6.00 167
Yes 6.50 49
Thur No 6.70 112
Yes 5.00 40
1
2
grouped.agg({'tip_pct' : [('小','min'), ('大','max'), 'mean', 'std'],
'size' : 'sum'})
tip_pct size
mean std sum
day smoker
Fri No 0.120385 0.187735 0.151650 0.028123 9
Yes 0.103555 0.263480 0.174783 0.051293 31
Sat No 0.056797 0.291990 0.158048 0.039767 115
Yes 0.035638 0.325733 0.147906 0.061375 104
Sun No 0.059447 0.252672 0.160113 0.042347 167
Yes 0.065660 0.710345 0.187250 0.154134 49
Thur No 0.072961 0.266312 0.160298 0.038774 112
Yes 0.090014 0.241255 0.163863 0.039389 40

以无索引形式返回聚合数据

  • as_index=False以禁用索引
1
tips.groupby(['day', 'smoker'], as_index=False).mean()
day smoker total_bill tip size tip_pct
0 Fri No 18.420000 2.812500 2.250000 0.151650
1 Fri Yes 16.813333 2.714000 2.066667 0.174783
2 Sat No 19.661778 3.102889 2.555556 0.158048
3 Sat Yes 21.276667 2.875476 2.476190 0.147906
4 Sun No 20.506667 3.167895 2.929825 0.160113
5 Sun Yes 24.120000 3.516842 2.578947 0.187250
6 Thur No 17.113111 2.673778 2.488889 0.160298
7 Thur Yes 19.190588 3.030000 2.352941 0.163863
1
tips.groupby(['day', 'smoker'], as_index=True).mean()
total_bill tip size tip_pct
day smoker
Fri No 18.420000 2.812500 2.250000 0.151650
Yes 16.813333 2.714000 2.066667 0.174783
Sat No 19.661778 3.102889 2.555556 0.158048
Yes 21.276667 2.875476 2.476190 0.147906
Sun No 20.506667 3.167895 2.929825 0.160113
Yes 24.120000 3.516842 2.578947 0.187250
Thur No 17.113111 2.673778 2.488889 0.160298
Yes 19.190588 3.030000 2.352941 0.163863

分组级运算和转换

  • 聚合运算不过是分组运算的一种,是数据转换的一个特例:接受将一维数组转化为标量值的函数
  • transformapply方法可以执行更多的分组运算
1
df
key1 key2 data1 data2
0 a one 0.678661 0.227290
1 a two -0.125921 0.922264
2 b one 0.150581 -2.153545
3 b two -0.884475 -0.365757
4 a one -0.620521 -0.375842
  • 如何在df中添加一个用于存放各索引分组平均值的列?可以先聚合再合并。
1
df.groupby('key1').mean()
data1 data2
key1
a -0.022594 0.257904
b -0.366947 -1.259651
1
2
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means
mean_data1 mean_data2
key1
a -0.022594 0.257904
b -0.366947 -1.259651
1
pd.merge(df, k1_means, left_on='key1', right_index=True)
key1 key2 data1 data2 mean_data1 mean_data2
0 a one 0.678661 0.227290 -0.022594 0.257904
1 a two -0.125921 0.922264 -0.022594 0.257904
4 a one -0.620521 -0.375842 -0.022594 0.257904
2 b one 0.150581 -2.153545 -0.366947 -1.259651
3 b two -0.884475 -0.365757 -0.366947 -1.259651

除此之外,我们还可以在groupby上使用transform方法。

  • transform会将一个函数应用到各个分组,然后将结果放置到适当的位置。
  • 如果分组产生的是一个标量值,则改值会被广播出去
1
df.groupby('key1').transform(np.mean)
data1 data2
0 -0.022594 0.257904
1 -0.022594 0.257904
2 -0.366947 -1.259651
3 -0.366947 -1.259651
4 -0.022594 0.257904
1
people
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
1
2
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()
a b c d e
one -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
two 0.505275 -0.849512 0.075965 0.834983 0.452620
1
people.groupby(key).transform(np.mean)
a b c d e
Joe -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Steve 0.505275 -0.849512 0.075965 0.834983 0.452620
Wes -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Jim 0.505275 -0.849512 0.075965 0.834983 0.452620
Travis -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
  • 如果希望希望从各组中减去平均值,可以先创建一个距平化函数(demeaning function),然后将其传给transform
1
2
def demean(arr):
return arr - arr.mean()
1
2
demeaned = people.groupby(key).transform(demean)
demeaned
a b c d e
Joe 1.089221 -0.232534 1.322612 1.113271 1.381226
Steve 0.381154 -1.152125 -0.447807 0.834043 -0.891190
Wes -0.457709 NaN NaN -0.136869 -0.548778
Jim -0.381154 1.152125 0.447807 -0.834043 0.891190
Travis -0.631512 0.232534 -1.322612 -0.976402 -0.832448
1
demeaned.groupby(key).sum()
a b c d e
one 0.000000e+00 -2.220446e-16 0.0 2.220446e-16 0.0
two -5.551115e-17 0.000000e+00 0.0 0.000000e+00 0.0

aggregate一样,transform也是一个存在着严格条件的函数:

  • 传入的函数只能产生两种结果,要么产生一个可以广播的标量值(np.mean),要么产生一个相同大小的结果数组。
  • apply是一般化的GroupBy方法

Apply:一般性的拆分--应用--合并

  • GroupBy对象的apply方法,会将待处理的对象拆分为多个片段,然后对各片段调用传入的函数,最后将各片段组合到一起,即:split--apply--combine
1
2
3
4
# 最高的n个tip_pct最高的值
def top(df, n=5, column='tip_pct'):
return df.sort_values(by=column, ascending=False)[-n:]
top(tips, n=6)
total_bill tip sex smoker day time size tip_pct
210 30.06 2.00 Male Yes Sat Dinner 3 0.066534
187 30.46 2.00 Male Yes Sun Dinner 5 0.065660
0 16.99 1.01 Female No Sun Dinner 2 0.059447
57 26.41 1.50 Female No Sat Dinner 2 0.056797
102 44.30 2.50 Female Yes Sat Dinner 3 0.056433
237 32.83 1.17 Male Yes Sat Dinner 2 0.035638
1
top(tips,n=10,column='total_bill')
total_bill tip sex smoker day time size tip_pct
126 8.52 1.48 Male No Thur Lunch 2 0.173709
135 8.51 1.25 Female No Thur Lunch 2 0.146886
145 8.35 1.50 Female No Thur Lunch 2 0.179641
218 7.74 1.44 Male Yes Sat Dinner 2 0.186047
195 7.56 1.44 Male No Thur Lunch 2 0.190476
149 7.51 2.00 Male No Thur Lunch 2 0.266312
111 7.25 1.00 Female No Sat Dinner 1 0.137931
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
92 5.75 1.00 Female Yes Fri Dinner 2 0.173913
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
1
tips.groupby('smoker').apply(top)
total_bill tip sex smoker day time size tip_pct
smoker
No 130 19.08 1.50 Male No Thur Lunch 2 0.078616
146 18.64 1.36 Female No Thur Lunch 3 0.072961
48 28.55 2.05 Male No Sun Dinner 3 0.071804
0 16.99 1.01 Female No Sun Dinner 2 0.059447
57 26.41 1.50 Female No Sat Dinner 2 0.056797
Yes 240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
210 30.06 2.00 Male Yes Sat Dinner 3 0.066534
187 30.46 2.00 Male Yes Sun Dinner 5 0.065660
102 44.30 2.50 Female Yes Sat Dinner 3 0.056433
237 32.83 1.17 Male Yes Sat Dinner 2 0.035638
  • 如果传给apply的函数能够接受其他参数或关键字,则可以将这些内容放在函数名后面一并传入。
1
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
total_bill tip sex smoker day time size tip_pct
smoker day
No Fri 99 12.46 1.50 Male No Fri Dinner 2 0.120385
Sat 111 7.25 1.00 Female No Sat Dinner 1 0.137931
Sun 6 8.77 2.00 Male No Sun Dinner 2 0.228050
Thur 149 7.51 2.00 Male No Thur Lunch 2 0.266312
Yes Fri 92 5.75 1.00 Female Yes Fri Dinner 2 0.173913
Sat 67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
Sun 172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
Thur 196 10.34 2.00 Male Yes Thur Lunch 2 0.193424
1
2
result = tips.groupby('smoker')['tip_pct'].describe()
result
count mean std min 25% 50% 75% max
smoker
No 151.0 0.159328 0.039910 0.056797 0.136906 0.155625 0.185014 0.291990
Yes 93.0 0.163196 0.085119 0.035638 0.106771 0.153846 0.195059 0.710345
  • 在GroupBy中调用describe之类方法时,相当于运行了下面两条代码:
1
2
f = lambda x:x.describe()
grouped.apply(f)
1
2
f = lambda x: x['tip_pct'].describe()
tips.groupby(['smoker']).apply(f)
tip_pct count mean std min 25% 50% 75% max
smoker
No 151.0 0.159328 0.039910 0.056797 0.136906 0.155625 0.185014 0.291990
Yes 93.0 0.163196 0.085119 0.035638 0.106771 0.153846 0.195059 0.710345
1
2
rr = result.stack()
rr
smoker       
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
dtype: float64
1
rr.unstack(level=0)
smoker No Yes
count 151.000000 93.000000
mean 0.159328 0.163196
std 0.039910 0.085119
min 0.056797 0.035638
25% 0.136906 0.106771
50% 0.155625 0.153846
75% 0.185014 0.195059
max 0.291990 0.710345

禁止分组键

默认情况下,分组键会跟原始对象的索引键共同构成结果对象的层次化索引。将group_keys=False传入groupby可禁止该效果。

1
tips.groupby('smoker').apply(top,n=5)
total_bill tip sex smoker day time size tip_pct
smoker
No 130 19.08 1.50 Male No Thur Lunch 2 0.078616
146 18.64 1.36 Female No Thur Lunch 3 0.072961
48 28.55 2.05 Male No Sun Dinner 3 0.071804
0 16.99 1.01 Female No Sun Dinner 2 0.059447
57 26.41 1.50 Female No Sat Dinner 2 0.056797
Yes 240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
210 30.06 2.00 Male Yes Sat Dinner 3 0.066534
187 30.46 2.00 Male Yes Sun Dinner 5 0.065660
102 44.30 2.50 Female Yes Sat Dinner 3 0.056433
237 32.83 1.17 Male Yes Sat Dinner 2 0.035638
1
tips.groupby('smoker', group_keys=False).apply(top)
total_bill tip sex smoker day time size tip_pct
130 19.08 1.50 Male No Thur Lunch 2 0.078616
146 18.64 1.36 Female No Thur Lunch 3 0.072961
48 28.55 2.05 Male No Sun Dinner 3 0.071804
0 16.99 1.01 Female No Sun Dinner 2 0.059447
57 26.41 1.50 Female No Sat Dinner 2 0.056797
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
210 30.06 2.00 Male Yes Sat Dinner 3 0.066534
187 30.46 2.00 Male Yes Sun Dinner 5 0.065660
102 44.30 2.50 Female Yes Sat Dinner 3 0.056433
237 32.83 1.17 Male Yes Sat Dinner 2 0.035638

Quantile and Bucket Analysis(分位数和桶分析)

  • 根据指定面元或样本分位数将数据拆分为多块的工具,譬如cutqcut
  • pandas.cut: Return indices of half-open bins to which each value of x belongs.
  • pandas.qcut:Quantile-based discretization function. Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
  • 将上述函数与groupby结合起来进行分位数分析或桶分析。

为了便于分析,连续数据常常被离散化或拆分为“面元”。

1
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

将这些数据划分为“18到25”,“26到35”,“35-60”以及“60以上”几个面元,

1
2
3
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

基本语法: pandas.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False)

  • x:进行划分的一维数组
  • bins: 整数---将x划分为多少个等间距的区间;序列—将x划分在指定的序列中
  • 是否用标记来代替返回的bins
1
pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3)
[(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < (6.533, 9.7]]
1
2
pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
... 3, labels=["good", "medium", "bad"])
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
  • 类似的还有pd.qcut方法。
  • pd.cut是根据值本身确定间隔确定,每个间隔是相同的;而pd.qcut是根据频率来确定间隔,即每个间隔包含的频率数是相同的
1
2
factors = np.random.randn(9)
factors
array([ 0.7587, -0.6823, -1.0385,  0.6351, -0.6   , -0.1591, -0.4176,
        1.1415, -2.043 ])
1
pd.cut(factors, 3)
[(0.08, 1.141], (-0.982, 0.08], (-2.046, -0.982], (0.08, 1.141], (-0.982, 0.08], (-0.982, 0.08], (-0.982, 0.08], (0.08, 1.141],

(-2.046, -0.982]] Categories (3, interval[float64]): [(-2.046, -0.982] < (-0.982, 0.08] < (0.08, 1.141]]

1
pd.cut(factors, 3, labels=['A','B','C'])
[C, B, A, C, B, B, B, C, A]
Categories (3, object): [A < B < C]
1
pd.qcut(factors, 3)
[(0.106, 1.141], (-2.044, -0.627], (-2.044, -0.627], (0.106, 1.141], (-0.627, 0.106], (-0.627, 0.106], (-0.627, 0.106], (0.106, 1.141], (-2.044, -0.627]]
Categories (3, interval[float64]): [(-2.044, -0.627] < (-0.627, 0.106] < (0.106, 1.141]]
1
pd.qcut(factors, 3, labels=['A','B','C'])
[C, A, A, C, B, B, B, C, A]
Categories (3, object): [A < B < C]
  • 可以将这些函数与GroupBy结合起来,实现对数据集的桶(bucket)和分位数(quantile)分析
1
2
3
frame = pd.DataFrame({'data1': np.random.randn(1000),
'data2': np.random.randn(1000)})
frame
data1 data2
0 -0.919262 1.165148
1 -1.549106 -0.621249
2 0.022185 -0.799318
3 0.758363 0.777233
4 -0.660524 -0.612905
... ... ...
995 -0.459849 -0.574654
996 0.333392 0.786210
997 -0.254742 -1.393822
998 -0.448301 0.359262
999 -1.261344 1.170900

1000 rows × 2 columns

1
2
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

0 (-1.23, 0.489] 1 (-2.956, -1.23] 2 (-1.23, 0.489] 3 (0.489, 2.208] 4 (-1.23, 0.489] 5 (0.489, 2.208] 6 (-1.23, 0.489] 7 (-1.23, 0.489] 8 (0.489, 2.208] 9 (0.489, 2.208] Name: data1, dtype: category Categories (4, interval[float64]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]

1
2
3
4
5
6
# 对分组求最小、最大、计数以及均值
def get_stats(group):
return {'min': group.min(), 'max': group.max(),
'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats)
data1                 
(-2.956, -1.23]  min       -3.399312
                 max        1.670835
                 count     95.000000
                 mean      -0.039521
(-1.23, 0.489]   min       -2.989741
                 max        3.260383
                 count    598.000000
                 mean      -0.002051
(0.489, 2.208]   min       -3.745356
                 max        2.954439
                 count    297.000000
                 mean       0.081822
(2.208, 3.928]   min       -1.929776
                 max        1.765640
                 count     10.000000
                 mean       0.024750
Name: data2, dtype: float64
1
grouped.apply(get_stats).unstack()
min max count mean
data1
(-2.956, -1.23] -3.399312 1.670835 95.0 -0.039521
(-1.23, 0.489] -2.989741 3.260383 598.0 -0.002051
(0.489, 2.208] -3.745356 2.954439 297.0 0.081822
(2.208, 3.928] -1.929776 1.765640 10.0 0.024750
1
2
3
4
5
# Return quantile numbers
grouping = pd.qcut(frame.data1, 10, labels=False) # labels=False, return only integer indicators of the bins
grouping
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
0      1
1      0
2      5
3      7
4      2
      ..
995    3
996    6
997    4
998    3
999    0
Name: data1, Length: 1000, dtype: int64
min max count mean
data1
0 -3.399312 1.670835 100.0 -0.049902
1 -1.950098 2.628441 100.0 0.030989
2 -2.925113 2.527939 100.0 -0.067179
3 -2.315555 3.260383 100.0 0.065713
4 -2.047939 2.074345 100.0 -0.111653
5 -2.989741 2.184810 100.0 0.052130
6 -2.223506 2.458842 100.0 -0.021489
7 -3.056990 2.954439 100.0 -0.026459
8 -3.745356 2.735527 100.0 0.103406
9 -2.064111 2.377020 100.0 0.220122

用于特定分组的值填充缺失值

1
2
3
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s
0         NaN
1    0.326045
2         NaN
3   -3.428254
4         NaN
5   -0.439938
dtype: float64
1
s.fillna(s.mean())
0   -1.180716
1    0.326045
2   -1.180716
3   -3.428254
4   -1.180716
5   -0.439938
dtype: float64
1
states = ['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']
1
group_key = ['East'] * 4 + ['West'] * 4
1
2
data = pd.Series(np.random.randn(8), index=states)
data
Ohio         -0.867165
New York     -0.100440
Vermont       1.018334
Florida      -1.266473
Oregon        0.166839
Nevada        0.133483
California    0.672024
Idaho         0.546705
dtype: float64
1
2
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Ohio         -0.867165
New York     -0.100440
Vermont            NaN
Florida      -1.266473
Oregon        0.166839
Nevada             NaN
California    0.672024
Idaho              NaN
dtype: float64
1
fill_mean = lambda g:g.fillna(g.mean())
1
data.groupby(group_key).apply(fill_mean)
Ohio         -0.867165
New York     -0.100440
Vermont      -0.744693
Florida      -1.266473
Oregon        0.166839
Nevada        0.419432
California    0.672024
Idaho         0.419432
dtype: float64
  • 也可以用预定义的填充值进行填充
1
fill_values = {'East':0.5, 'West':-1}
1
fill_func = lambda g: g.fillna(fill_values[g.name])
1
data.groupby(group_key).apply(fill_func)
Ohio         -0.867165
New York     -0.100440
Vermont       0.500000
Florida      -1.266473
Oregon        0.166839
Nevada       -1.000000
California    0.672024
Idaho        -1.000000
dtype: float64

分组加权平均数和相关系数

根据groupby的“拆分——应用——合并”范式,DataFrame的列与列之间,或者两个Serise之间可以进行分组运算(加权平均)

1
2
3
4
df = pd.DataFrame({'category':['a','a','a','a','b','b','b','b'], 
'data':np.random.randn(8),
'weights':np.random.rand(8)})
df
category data weights
0 a -0.204708 0.748907
1 a 0.478943 0.653570
2 a -0.519439 0.747715
3 a -0.555730 0.961307
4 b 1.965781 0.008388
5 b 1.393406 0.106444
6 b 0.092908 0.298704
7 b 0.281746 0.656411
1
grouped11 = df.groupby("category")
1
wavg = lambda g: np.average(g['data'], weights=g['weights'])
1
grouped11.apply(wavg)
category
a   -0.245188
b    0.352824
dtype: float64
1
wavg2 = lambda g: pd.Series(np.average(g['data'], weights=g['weights']), index=['WAgv'])
1
grouped11.apply(wavg2)
WAgv
category
a -0.245188
b 0.352824
1
2
3
# 标准普尔500指数
close_px = pd.read_csv('./examples/stock_px.csv', parse_dates=True, index_col=0)
close_px
AA AAPL GE IBM JNJ MSFT PEP SPX XOM
1990-02-01 4.98 7.86 2.87 16.79 4.27 0.51 6.04 328.79 6.12
1990-02-02 5.04 8.00 2.87 16.89 4.37 0.51 6.09 330.92 6.24
1990-02-05 5.07 8.18 2.87 17.32 4.34 0.51 6.05 331.85 6.25
1990-02-06 5.01 8.12 2.88 17.56 4.32 0.51 6.15 329.66 6.23
1990-02-07 5.04 7.77 2.91 17.93 4.38 0.51 6.17 333.75 6.33
... ... ... ... ... ... ... ... ... ...
2011-10-10 10.09 388.81 16.14 186.62 64.43 26.94 61.87 1194.89 76.28
2011-10-11 10.30 400.29 16.14 185.00 63.96 27.00 60.95 1195.54 76.27
2011-10-12 10.05 402.19 16.40 186.12 64.33 26.96 62.70 1207.25 77.16
2011-10-13 10.10 408.43 16.22 186.82 64.23 27.18 62.36 1203.66 76.37
2011-10-14 10.26 422.00 16.60 190.53 64.72 27.27 62.24 1224.58 78.11

5472 rows × 9 columns

pct_change用于当前元素与先前元素的相差百分比。

1
close_px.pct_change()
AA AAPL GE IBM JNJ MSFT PEP SPX XOM
1990-02-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990-02-02 0.012048 0.017812 0.000000 0.005956 0.023419 0.000000 0.008278 0.006478 0.019608
1990-02-05 0.005952 0.022500 0.000000 0.025459 -0.006865 0.000000 -0.006568 0.002810 0.001603
1990-02-06 -0.011834 -0.007335 0.003484 0.013857 -0.004608 0.000000 0.016529 -0.006599 -0.003200
1990-02-07 0.005988 -0.043103 0.010417 0.021071 0.013889 0.000000 0.003252 0.012407 0.016051
... ... ... ... ... ... ... ... ... ...
2011-10-10 0.039135 0.051406 0.041290 0.023192 0.020592 0.026286 0.013930 0.034125 0.036977
2011-10-11 0.020813 0.029526 0.000000 -0.008681 -0.007295 0.002227 -0.014870 0.000544 -0.000131
2011-10-12 -0.024272 0.004747 0.016109 0.006054 0.005785 -0.001481 0.028712 0.009795 0.011669
2011-10-13 0.004975 0.015515 -0.010976 0.003761 -0.001554 0.008160 -0.005423 -0.002974 -0.010238
2011-10-14 0.015842 0.033225 0.023428 0.019859 0.007629 0.003311 -0.001924 0.017380 0.022784

5472 rows × 9 columns

1
2
rets = close_px.pct_change().dropna()
rets.head()
AA AAPL GE IBM JNJ MSFT PEP SPX XOM
1990-02-02 0.012048 0.017812 0.000000 0.005956 0.023419 0.0 0.008278 0.006478 0.019608
1990-02-05 0.005952 0.022500 0.000000 0.025459 -0.006865 0.0 -0.006568 0.002810 0.001603
1990-02-06 -0.011834 -0.007335 0.003484 0.013857 -0.004608 0.0 0.016529 -0.006599 -0.003200
1990-02-07 0.005988 -0.043103 0.010417 0.021071 0.013889 0.0 0.003252 0.012407 0.016051
1990-02-08 0.000000 -0.007722 0.003436 -0.003904 0.018265 0.0 0.008104 -0.002367 0.003160
1
spx_corr = lambda x: x.corrwith(x['SPX'])
1
close_px.index
DatetimeIndex(['1990-02-01', '1990-02-02', '1990-02-05', '1990-02-06',
               '1990-02-07', '1990-02-08', '1990-02-09', '1990-02-12',
               '1990-02-13', '1990-02-14',
               ...
               '2011-10-03', '2011-10-04', '2011-10-05', '2011-10-06',
               '2011-10-07', '2011-10-10', '2011-10-11', '2011-10-12',
               '2011-10-13', '2011-10-14'],
              dtype='datetime64[ns]', length=5472, freq=None)
1
2
# 时间类型数据具有year,month,day属性
close_px.index.year
Int64Index([1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,
            ...
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011],
           dtype='int64', length=5472)
1
2
# 通过函数进行分组:任何被当作分组键的函数会在各个索引值上被调用一次
by_year = rets.groupby(lambda x: x.year)
1
by_year.apply(spx_corr)
AA AAPL GE IBM JNJ MSFT PEP SPX XOM
1990 0.595024 0.545067 0.752187 0.738361 0.801145 0.586691 0.783168 1.0 0.517586
1991 0.453574 0.365315 0.759607 0.557046 0.646401 0.524225 0.641775 1.0 0.569335
1992 0.398180 0.498732 0.632685 0.262232 0.515740 0.492345 0.473871 1.0 0.318408
1993 0.259069 0.238578 0.447257 0.211269 0.451503 0.425377 0.385089 1.0 0.318952
1994 0.428549 0.268420 0.572996 0.385162 0.372962 0.436585 0.450516 1.0 0.395078
... ... ... ... ... ... ... ... ... ...
2007 0.642427 0.508118 0.796945 0.603906 0.568423 0.658770 0.651911 1.0 0.786264
2008 0.781057 0.681434 0.777337 0.833074 0.801005 0.804626 0.709264 1.0 0.828303
2009 0.735642 0.707103 0.713086 0.684513 0.603146 0.654902 0.541474 1.0 0.797921
2010 0.745700 0.710105 0.822285 0.783638 0.689896 0.730118 0.626655 1.0 0.839057
2011 0.882045 0.691931 0.864595 0.802730 0.752379 0.800996 0.592029 1.0 0.859975

22 rows × 9 columns

上述表格计算的结果是每年各个股股票日收率(百分数变化)跟指数的相关系数。

1
2
3
fig, ax1 = plt.subplots(figsize=(8,6))
ax1.set_title("Correlation with SPX")
by_year.apply(spx_corr).plot(ax=ax1)
Text(0.5, 1.0, 'Correlation with SPX')






<matplotlib.axes._subplots.AxesSubplot at 0x7f9c59fff8b0>

面向分组的线性回归

1
import statsmodels.api as sm
1
2
3
4
5
6
7
8
def reg(data, yvar, xvars):
Y = data[yvar]
X = data[xvars]

X['intercept'] = 1
result = sm.OLS(Y, X).fit()

return result.params
1
by_year.apply(reg, 'AAPL', ['SPX'])
SPX intercept
1990 1.512772 0.001395
1991 1.187351 0.000396
1992 1.832427 0.000164
1993 1.390470 -0.002657
1994 1.190277 0.001617
... ... ...
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514

22 rows × 2 columns

透视表和交叉表

透视表

透视表(pivot table)是一种常见的数据汇总工具,根据一个或多个键对数据进行聚合,并根据行和列上的分组键将数据分配到各个矩形区域中。

  • Pandas利用groupby功能和层次化索引的重塑运算制造透视表
  • DataFrame的pivot_table方法,也是Pandas顶级函数。
1
tips
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
... ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744

244 rows × 8 columns

1
tips.pivot_table(index='smoker', columns='sex', values=['total_bill', 'tip'])
tip total_bill
sex Female Male Female Male
smoker
No 2.773519 3.113402 18.105185 19.791237
Yes 2.931515 3.051167 17.977879 22.284500
1
tips.pivot_table(index=['sex', 'day'], columns=['smoker'], values=['tip_pct', 'size'])
size tip_pct
smoker No Yes No Yes
sex day
Female Fri 2.500000 2.000000 0.165296 0.209129
Sat 2.307692 2.200000 0.147993 0.163817
Sun 3.071429 2.500000 0.165710 0.237075
Thur 2.480000 2.428571 0.155971 0.163073
Male Fri 2.000000 2.125000 0.138005 0.144730
Sat 2.656250 2.629630 0.162132 0.139067
Sun 2.883721 2.600000 0.158291 0.173964
Thur 2.500000 2.300000 0.165706 0.164417
1
2
# 传入margins=True可以添加分项小计:All列为平均数
tips.pivot_table(index=['sex', 'day'], columns=['smoker'], values=['tip_pct', 'size'],margins=True)
size tip_pct
smoker No Yes All No Yes All
sex day
Female Fri 2.500000 2.000000 2.111111 0.165296 0.209129 0.199388
Sat 2.307692 2.200000 2.250000 0.147993 0.163817 0.156470
Sun 3.071429 2.500000 2.944444 0.165710 0.237075 0.181569
Thur 2.480000 2.428571 2.468750 0.155971 0.163073 0.157525
Male Fri 2.000000 2.125000 2.100000 0.138005 0.144730 0.143385
Sat 2.656250 2.629630 2.644068 0.162132 0.139067 0.151577
Sun 2.883721 2.600000 2.810345 0.158291 0.173964 0.162344
Thur 2.500000 2.300000 2.433333 0.165706 0.164417 0.165276
All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803

上述表中给出的是分组的平均值,这是分组的默认聚合类型。可以通过aggfunc参数进行修改,譬如aggfunc=np.sum进行汇总聚合。

1
2
3
4
5
6
7
8
df = pd.DataFrame(
{
"fruit": ["apple", "orange", "apple", "avocado", "orange"],
"customer": ["ben", "alice", "ben", "josh", "steve"],
"quantity": [1, 2, 3, 1, 2],
}
)
df
fruit customer quantity
0 apple ben 1
1 orange alice 2
2 apple ben 3
3 avocado josh 1
4 orange steve 2
1
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum)
customer alice ben josh steve
fruit
apple NaN 4.0 NaN NaN
avocado NaN NaN 1.0 NaN
orange 2.0 NaN NaN 2.0
1
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum, fill_value=0)
customer alice ben josh steve
fruit
apple 0 4 0 0
avocado 0 0 1 0
orange 2 0 0 2
1
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=[np.sum, np.mean], fill_value=0)
sum mean
customer alice ben josh steve alice ben josh steve
fruit
apple 0 4 0 0 0 2 0 0
avocado 0 0 1 0 0 0 1 0
orange 2 0 0 2 2 0 0 2

交叉表

交叉表(Cross-tabulation,简称cross-tab)是一种用于计算分组频率的特殊透视表。

1
2
3
4
data = pd.DataFrame({'Sample':np.arange(1,11), 
'Gender':['F','M','F','M','M','M','F','F','M','F'],
'Handness':['R','L','R','R','L','R','R','L','R','R']})
data
Sample Gender Handness
0 1 F R
1 2 M L
2 3 F R
3 4 M R
4 5 M L
5 6 M R
6 7 F R
7 8 F L
8 9 M R
9 10 F R
1
data.pivot_table(index='Gender', columns='Handness', values='Sample', aggfunc='count',margins=True)
Handness L R All
Gender
F 1 4 5
M 2 3 5
All 3 7 10
1
pd.crosstab(data.Gender, data.Handness, margins=True)
Handness L R All
Gender
F 1 4 5
M 2 3 5
All 3 7 10
1
pd.crosstab(index=[tips.time, tips.day], columns=[tips.smoker], margins=True)
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244

综合性示例

1
2
corona = pd.read_csv('./examples/countries-aggregated.csv',parse_dates=True)
corona
Date Country Confirmed Recovered Deaths
0 2020-01-22 Afghanistan 0 0 0
1 2020-01-22 Albania 0 0 0
2 2020-01-22 Algeria 0 0 0
3 2020-01-22 Andorra 0 0 0
4 2020-01-22 Angola 0 0 0
... ... ... ... ... ...
15905 2020-04-16 West Bank and Gaza 374 63 2
15906 2020-04-16 Western Sahara 6 0 0
15907 2020-04-16 Yemen 1 0 0
15908 2020-04-16 Zambia 48 30 2
15909 2020-04-16 Zimbabwe 23 1 3

15910 rows × 5 columns

1
2
# 是否感染
corona['Infected'] = corona['Confirmed'] > 0
1
2
infected = corona.groupby('Date')['Infected'].sum()
infected
Date
2020-01-22      6.0
2020-01-23      8.0
2020-01-24      9.0
2020-01-25     11.0
2020-01-26     13.0
              ...  
2020-04-12    185.0
2020-04-13    185.0
2020-04-14    185.0
2020-04-15    185.0
2020-04-16    185.0
Name: Infected, Length: 86, dtype: float64
1
2
3
fig,ax2 = plt.subplots(1,1)
infected.plot(ax=ax2)
ax2.set_title("感染国家数")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9c512c6d30>






Text(0.5, 1.0, '感染国家数')

感染国家按所在洲进行统计

需要对国家进行(国家-洲)匹配

1
2
countries = pd.read_csv("./examples/countries.csv")
countries
Continent Country
0 Africa Algeria
1 Africa Angola
2 Africa Benin
3 Africa Botswana
4 Africa Burkina Faso
... ... ...
189 South America Paraguay
190 South America Peru
191 South America Suriname
192 South America Uruguay
193 South America Venezuela

194 rows × 2 columns

1
2
df = pd.merge(corona, countries, left_on='Country', right_on='Country', how='inner') # 因国家名称信息不统一,忽略掉少量数据
df
Date Country Confirmed Recovered Deaths Infected Continent
0 2020-01-22 Afghanistan 0 0 0 False Asia
1 2020-01-23 Afghanistan 0 0 0 False Asia
2 2020-01-24 Afghanistan 0 0 0 False Asia
3 2020-01-25 Afghanistan 0 0 0 False Asia
4 2020-01-26 Afghanistan 0 0 0 False Asia
... ... ... ... ... ... ... ...
14873 2020-04-12 Zimbabwe 14 0 3 True Africa
14874 2020-04-13 Zimbabwe 17 0 3 True Africa
14875 2020-04-14 Zimbabwe 17 0 3 True Africa
14876 2020-04-15 Zimbabwe 23 1 3 True Africa
14877 2020-04-16 Zimbabwe 23 1 3 True Africa

14878 rows × 7 columns

1
2
countries = pd.pivot_table(index='Date', columns='Continent', values='Infected',data=df, aggfunc=sum)
countries
Continent Africa Asia Europe North America Oceania South America
Date
2020-01-22 0.0 4.0 0.0 1.0 0.0 0.0
2020-01-23 0.0 6.0 0.0 1.0 0.0 0.0
2020-01-24 0.0 6.0 1.0 1.0 0.0 0.0
2020-01-25 0.0 8.0 1.0 1.0 0.0 0.0
2020-01-26 0.0 8.0 1.0 2.0 1.0 0.0
... ... ... ... ... ... ...
2020-04-12 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-13 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-14 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-15 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-16 50.0 40.0 44.0 23.0 4.0 12.0

86 rows × 6 columns

1
2
3
fig3,ax3 = plt.subplots(1,1)
countries.plot(ax=ax3)
ax3.set_title("各洲感染国家数")
20200417161126.png
1
2
patients = pd.pivot_table(index='Date', columns='Continent', values=['Infected','Deaths'],data=df, aggfunc=sum)
patients
Deaths Infected
Continent Africa Asia Europe North America Oceania South America Africa Asia Europe North America Oceania South America
Date
2020-01-22 0 17 0 0 0 0 0.0 4.0 0.0 1.0 0.0 0.0
2020-01-23 0 18 0 0 0 0 0.0 6.0 0.0 1.0 0.0 0.0
2020-01-24 0 26 0 0 0 0 0.0 6.0 1.0 1.0 0.0 0.0
2020-01-25 0 42 0 0 0 0 0.0 8.0 1.0 1.0 0.0 0.0
2020-01-26 0 56 0 0 0 0 0.0 8.0 1.0 2.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2020-04-12 783 11071 76539 23348 64 2081 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-13 828 11430 79736 24960 66 2247 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-14 867 11839 83022 27433 71 2511 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-15 904 12164 87890 30126 72 2774 50.0 40.0 44.0 23.0 4.0 12.0
2020-04-16 958 12590 91865 35034 72 3027 50.0 40.0 44.0 23.0 4.0 12.0

86 rows × 12 columns

1
2
3
fig4,ax4 = plt.subplots(1,1)
patients['Deaths'].plot(ax=ax4)
ax4.set_title("各州死亡人数")
20200417160738.png