Bruce Han's Blog

Without accumulating single steps, one cannot travel a thousand miles; without accumulating small streams, there can be no rivers or seas.


Data Wrangling with pandas: Cleaning, Transforming, Merging, and Reshaping

A large part of the work in data analysis and modeling goes into data preparation: loading, cleaning, transforming, and reshaping. This article covers merging, reshaping, and transforming datasets.

import numpy as np
import pandas as pd
pd.options.display.max_rows = 20  # show at most 20 rows
np.random.seed(12345)
np.set_printoptions(precision=4, suppress=True)  # print floats to 4 decimal places

Combining Datasets

  • pandas.merge joins the rows of different DataFrames on one or more keys.
  • pandas.concat stacks multiple objects together along an axis.
  • combine_first splices together overlapping data, filling missing values in one object with values from another.

Database-Style DataFrame Merges

Merge (join) operations on DataFrames link rows through one or more keys.

s = pd.read_csv('./examples/student.csv')
s
student_id student_name teacher_id
0 1 xiaoming 1
1 2 xiaohua 1
2 3 xiaoli 1
3 4 xiaozhang 2
4 5 xiaosun 3
5 6 xiaohong 5
t = pd.read_csv('./examples/teacher.csv')
t
teacher_id teacher_name
0 1 zhangsan
1 2 lisi
2 3 wangwu
3 4 zhaoliu
pd.merge(s, t)  # an inner join by default
student_id student_name teacher_id teacher_name
0 1 xiaoming 1 zhangsan
1 2 xiaohua 1 zhangsan
2 3 xiaoli 1 zhangsan
3 4 xiaozhang 2 lisi
4 5 xiaosun 3 wangwu
  • If the join columns are not specified, merge uses the overlapping column names as the keys, so it is best to specify the join keys explicitly.
  • If the key columns have different names in the two objects, they can be specified separately.
  • By default, merge performs an inner join, returning the intersection of the keys. The other options are left, right, and outer.
  • left keeps all rows from the left table and right keeps all rows from the right table, while outer returns the union of the keys, combining the effects of the left and right joins.
pd.merge(s, t, on='teacher_id')  # specify the join key explicitly
student_id student_name teacher_id teacher_name
0 1 xiaoming 1 zhangsan
1 2 xiaohua 1 zhangsan
2 3 xiaoli 1 zhangsan
3 4 xiaozhang 2 lisi
4 5 xiaosun 3 wangwu
s.merge(t, on='teacher_id')
student_id student_name teacher_id teacher_name
0 1 xiaoming 1 zhangsan
1 2 xiaohua 1 zhangsan
2 3 xiaoli 1 zhangsan
3 4 xiaozhang 2 lisi
4 5 xiaosun 3 wangwu
# the key columns have different names, so specify them separately
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
df3
lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df4
rkey data2
0 a 0
1 b 1
2 d 2
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
pd.merge(s, t, how='inner')
student_id student_name teacher_id teacher_name
0 1 xiaoming 1 zhangsan
1 2 xiaohua 1 zhangsan
2 3 xiaoli 1 zhangsan
3 4 xiaozhang 2 lisi
4 5 xiaosun 3 wangwu
pd.merge(s, t, how='left')
student_id student_name teacher_id teacher_name
0 1 xiaoming 1 zhangsan
1 2 xiaohua 1 zhangsan
2 3 xiaoli 1 zhangsan
3 4 xiaozhang 2 lisi
4 5 xiaosun 3 wangwu
5 6 xiaohong 5 NaN
pd.merge(s, t, how='right')
student_id student_name teacher_id teacher_name
0 1.0 xiaoming 1 zhangsan
1 2.0 xiaohua 1 zhangsan
2 3.0 xiaoli 1 zhangsan
3 4.0 xiaozhang 2 lisi
4 5.0 xiaosun 3 wangwu
5 NaN NaN 4 zhaoliu
pd.merge(s, t, how='outer')
student_id student_name teacher_id teacher_name
0 1.0 xiaoming 1 zhangsan
1 2.0 xiaohua 1 zhangsan
2 3.0 xiaoli 1 zhangsan
3 4.0 xiaozhang 2 lisi
4 5.0 xiaosun 3 wangwu
5 6.0 xiaohong 5 NaN
6 NaN NaN 4 zhaoliu
  • A many-to-many merge produces the Cartesian product of the matching rows.
  • The join method affects only which keys appear in the result.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})

df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
df2
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
pd.merge(df1, df2, on='key', how='left')
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
pd.merge(df1, df2, how='inner')
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
df1.merge(df2, on=['key'], how='outer')
key data1 data2
0 b 0.0 1.0
1 b 0.0 3.0
2 b 1.0 1.0
3 b 1.0 3.0
4 b 5.0 1.0
5 b 5.0 3.0
6 a 2.0 0.0
7 a 2.0 2.0
8 a 4.0 0.0
9 a 4.0 2.0
10 c 3.0 NaN
11 d NaN 4.0
  • To merge on multiple keys, pass a list of column names.
  • Think of the multiple keys as forming a series of tuples that act as a single composite join key.
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
left
right
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
pd.merge(right, left, on=['key1', 'key2'], how='right')
key1 key2 rval lval
0 foo one 4.0 1
1 foo one 5.0 1
2 bar one 6.0 3
3 foo two NaN 2
pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
  • For overlapping non-key column names, the suffixes option specifies strings to append to the duplicated names.
pd.merge(left, right, on='key1')
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

Arguments of the merge function

Argument Description
left Left DataFrame to merge
right Right DataFrame to merge
how Join method, one of inner, outer, left, or right; defaults to inner
on Column names to join on; must be present in both DataFrames. If not specified and no other join keys are given, the overlapping column names are used
left_on Columns in the left DataFrame to use as join keys
right_on Columns in the right DataFrame to use as join keys
left_index True or False; use the left DataFrame's row index as its join key
right_index True or False; use the right DataFrame's row index as its join key
sort True or False; sort the merged data by the join keys. Defaults to True; disabling it can give much better performance on large datasets
suffixes Tuple of strings to append to overlapping column names; defaults to ('_x', '_y')
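The sort flag in the table is easy to overlook; here is a minimal sketch (with made-up frames) of what it changes:

```python
import pandas as pd

left = pd.DataFrame({'key': ['c', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'a', 'c'], 'rval': [4, 5, 6]})

# sort=True orders the result by the join key; with sort=False the
# order is derived from the inputs, which is cheaper on large data
sorted_keys = pd.merge(left, right, on='key', sort=True)['key'].tolist()
```

With these inputs, `sorted_keys` comes back in key order rather than input order.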

Merging on Index

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
right1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
group_val
a 3.5
b 7.0
pd.merge(left1, right1, left_on='key', right_index=True)
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
  • Hierarchically indexed data
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                              'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
lefth
righth
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
key1 key2 data event1 event2
0 Ohio 2000 0.0 4 5
0 Ohio 2000 0.0 6 7
1 Ohio 2001 1.0 8 9
2 Ohio 2002 2.0 10 11
3 Nevada 2001 3.0 0 1
righth.swaplevel(0, 1)
event1 event2
2001 Nevada 0 1
2000 Nevada 2 3
Ohio 4 5
Ohio 6 7
2001 Ohio 8 9
2002 Ohio 10 11
pd.merge(lefth, righth, left_on=['key1', 'key2'],
         right_index=True, how='outer')
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
left2
right2
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
pd.merge(left2, right2, how='outer', left_index=True, right_index=True,
         sort=False)
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
left2.merge(right2, how='outer', left_index=True, right_index=True)
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
  • DataFrame also has a join instance method, which makes merging by index more convenient. It can be used to combine several DataFrame objects that have the same or similar indexes, whether or not they have overlapping columns.
left2.join(right2, how='outer')
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
left1.join(right1, on='key')
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])
another
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
left2.join([right2, another])
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
left2.join([right2, another], how='outer')
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
b NaN NaN 7.0 8.0 NaN NaN
d NaN NaN 11.0 12.0 NaN NaN
f NaN NaN NaN NaN 16.0 17.0

Concatenating Along an Axis

Concatenating and Splitting Arrays

  • NumPy has a concatenate function for joining raw NumPy arrays; it connects a sequence of arrays (a tuple, list, etc.) along a specified axis.
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr, arr], axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
np.concatenate([arr, arr], axis=0)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
a
b
array([[1, 2],
       [3, 4]])

array([[5, 6]])

a.shape
b.shape
(2, 2)

(1, 2)

np.concatenate((a, b), axis=0)
array([[1, 2],
       [3, 4],
       [5, 6]])
np.concatenate((a, b.T), axis=1)
array([[1, 2, 5],
       [3, 4, 6]])

For the most common kinds of concatenation, NumPy provides convenient shortcuts such as vstack and hstack, so the operations above can also be written as:

np.vstack((arr, arr))
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.hstack((arr, arr))
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
np.vstack((a, b))
array([[1, 2],
       [3, 4],
       [5, 6]])
np.hstack((a, b.T))
array([[1, 2, 5],
       [3, 4, 6]])

Conversely, split divides an array into multiple arrays along a specified axis.

arr2 = np.random.randn(5, 2)
arr2
array([[ 0.0009,  1.3438],
       [-0.7135, -0.8312],
       [-2.3702, -1.8608],
       [-0.8608,  0.5601],
       [-1.2659,  0.1198]])
first, second, third = np.split(arr2, [1, 3])
first
array([[0.0009, 1.3438]])
second
third
array([[-0.7135, -0.8312],
       [-2.3702, -1.8608]])

array([[-0.8608,  0.5601],
       [-1.2659,  0.1198]])
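NumPy also has vsplit and hsplit as the splitting counterparts of vstack and hstack; a small sketch:

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

top, bottom = np.vsplit(arr, [2])          # rows 0-1, then row 2
left_half, right_half = np.hsplit(arr, 2)  # two blocks of 2 columns each
```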

Concatenation with pandas

  • For pandas objects, labeled axes let you further generalize array concatenation. pandas provides the concat function for this.
s1 = pd.Series([0, 1], index=['a', 'b'])
s1
a    0
b    1
dtype: int64
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s2
c    2
d    3
e    4
dtype: int64
s3 = pd.Series([5, 6], index=['f', 'g'])
s3
f    5
g    6
dtype: int64
s1.add(s2, fill_value=0)
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
pd.concat([s1, s2, s3])  # concatenates along axis=0 by default, producing a new Series
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
# concatenating along axis=1 produces a DataFrame; the default is an
# outer join on the union of the indexes
pd.concat([s1, s2, s3], axis=1, sort=False)
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
s4 = pd.concat([s1, s3])
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1, s4], axis=1)
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
# inner join: the intersection of the indexes
pd.concat([s1, s4], axis=1, join='inner')
0 1
a 0 0
b 1 1
  • The keys option creates a hierarchical index on the concatenation axis.
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
result.unstack()  # pivot the rows into columns
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
  • If you combine Series along axis=1, the keys become the DataFrame's column headers.
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
  • The same logic applies to DataFrames.
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
df1
one two
a 0 1
b 2 3
c 4 5
df2
three four
a 5 6
c 7 8
pd.concat([df1, df2], axis=1)
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
  • If you pass a dict instead of a list, the dict's keys are used as the keys option.
  • ignore_index discards the indexes on the concatenation axis and produces a fresh range of indexes.
pd.concat({'level1': df1, 'level2': df2}, axis=1)
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
# name the hierarchical levels
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
          names=['upper', 'lower'])
upper level1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
a b c d
0 -0.204708 0.478943 -0.519439 -0.555730
1 1.965781 1.393406 0.092908 0.281746
2 0.769023 1.246435 1.007189 -1.296221
df2
b d a
0 0.274992 0.228913 1.352917
1 0.886429 -2.001637 -0.371843
pd.concat([df1, df2])
a b c d
0 -0.204708 0.478943 -0.519439 -0.555730
1 1.965781 1.393406 0.092908 0.281746
2 0.769023 1.246435 1.007189 -1.296221
0 1.352917 0.274992 NaN 0.228913
1 -0.371843 0.886429 NaN -2.001637
pd.concat([df1, df2], ignore_index=True)
a b c d
0 -0.204708 0.478943 -0.519439 -0.555730
1 1.965781 1.393406 0.092908 0.281746
2 0.769023 1.246435 1.007189 -1.296221
3 1.352917 0.274992 NaN 0.228913
4 -0.371843 0.886429 NaN -2.001637

Arguments of the concat function

Argument Description
objs List or dict of pandas objects to concatenate; the only required argument
axis Axis to concatenate along; defaults to axis=0
join Either inner (intersection) or outer (union); defaults to outer
join_axes Specific indexes to use for the other n-1 axes
keys Values to associate with the objects being concatenated, forming a hierarchical index on the concatenation axis; can be an arbitrary list or array, or a list of arrays (if multilevel arrays are passed in levels)
levels Specific indexes to use as the hierarchical-index level(s)
names Names for the created hierarchical levels
verify_integrity Check the new axis of the result for duplicates and raise an exception if any are found; defaults to False
ignore_index Discard the indexes on the concatenation axis and produce a fresh range of indexes
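As a quick illustration of verify_integrity from the table (the Series here are made up): concatenating objects whose indexes overlap raises once the check is enabled:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['b', 'c'])  # label 'b' also appears in s1

try:
    pd.concat([s1, s2], verify_integrity=True)
    has_duplicates = False
except ValueError:  # duplicate index label 'b' on the new axis
    has_duplicates = True
```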

Combining Overlapping Data

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan
a
b
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

np.where(pd.isnull(a), b, a)
array([0. , 2.5, 2. , 3.5, 4.5, nan])
pd.Series(np.where(pd.isnull(a), b, a))
0    0.0
1    2.5
2    2.0
3    3.5
4    4.5
5    NaN
dtype: float64
pd.concat([a, b])
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
  • The combine_first method replaces NA values with the values at the same positions (same index labels) in another object, aligning the data automatically.
  • Combine Series values, choosing the calling Series's values first. The result index will be the union of the two indexes.
a.combine_first(b)
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64
b.combine_first(a)
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
b
b[:-2]
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64

a
a[2:]
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

b[:-2].combine_first(a[2:])
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
  • Combine two DataFrame objects, defaulting to the non-null values in the frame calling the method. The result index and columns will be the union of the respective indexes and columns.
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN

Reshaping and Pivoting

Reshaping with Hierarchical Indexing

  • stack pivots the columns of the data into rows; by default dropna=True.
  • unstack pivots the rows of the data into columns. By default it operates on the innermost level; pass a level number or name to unstack a different level.
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))
data
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result = data.stack()  # pivot the columns into rows, giving a Series with a hierarchical index
result
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
result.unstack()  # pivot the rows into columns, turning the hierarchical Series into a DataFrame
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result.unstack(0)
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
result.unstack('state')
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s1
s2
a    0
b    1
c    2
d    3
dtype: int64

c    4
d    5
e    6
dtype: int64

data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data2.unstack()
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
data2.unstack().stack()
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
data2.unstack().stack(dropna=False)
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
df.unstack()
side left right
number one two three one two three
state
Ohio 0 1 2 5 6 7
Colorado 3 4 5 8 9 10
df.unstack('state')
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
df
df.unstack('state').stack('side')
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7

Pivoting with pivot

pivot returns a reshaped DataFrame organized by the given index / column values.

DataFrame.pivot(self, index=None, columns=None, values=None) → 'DataFrame'

  • index : str or object, optional. Column to use to make the new frame's index. If None, uses the existing index.
  • columns : str or object. Column to use to make the new frame's columns.
  • values : str, object or a list of the previous, optional. Column(s) to use for populating the new frame's values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
                           'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
foo bar baz zoo
0 one A 1 x
1 one B 2 y
2 one C 3 z
3 two A 4 q
4 two B 5 w
5 two C 6 t
df.pivot(index='foo', columns='bar', values='baz')
bar A B C
foo
one 1 2 3
two 4 5 6
df.pivot(index='foo', columns='bar')['baz']
bar A B C
foo
one 1 2 3
two 4 5 6
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
baz zoo
bar A B C A B C
foo
one 1 2 3 x y z
two 4 5 6 q w t
  • pivot raises an error when any index/columns combination maps to multiple values; use DataFrame.pivot_table when you need to aggregate.

Pivot Tables with pivot_table

pivot_table creates a spreadsheet-style pivot table as a DataFrame.

df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 2, 3, 4]})
df
foo bar baz
0 one A 1
1 one A 2
2 two B 3
3 two C 4
df.pivot_table(index='foo', columns='bar', values='baz')  # pivot would raise here; pivot_table aggregates the duplicates (mean by default)
bar A B C
foo
one 1.5 NaN NaN
two NaN 3.0 4.0
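By default pivot_table collapses duplicate (index, columns) combinations with the mean; a different aggregation can be requested via aggfunc. A minimal sketch reusing the df above:

```python
import pandas as pd

df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 2, 3, 4]})

# the two ('one', 'A') rows collapse to 1 + 2 = 3 under aggfunc='sum'
table = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='sum')
```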

Data Transformation

Removing Duplicates

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
# duplicated returns a boolean Series indicating whether each row duplicates an earlier one
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
  • The drop_duplicates method returns a DataFrame with the duplicate rows removed.
  • By default all columns are considered; you can also restrict the check to a subset of columns.
  • duplicated and drop_duplicates keep the first observed value combination by default; pass keep='last' to keep the last one.
data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
data['v1'] = range(7)
data.drop_duplicates(['k1'])
k1 k2 v1
0 one 1 0
1 two 1 1
data.drop_duplicates(['k1', 'k2'], keep='last')
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6

Transforming Data Using a Function or Mapping

Transformations are often based on the values in an array, Series, or DataFrame column.

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
  • The map method on a Series accepts a function or a dict containing a mapping.
data['animal'] = lowercased.map(meat_to_animal)
data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Some Common String Methods

Method Description
isalnum() Returns True if the string has at least one character and all characters are letters or digits, else False
isalpha() Returns True if the string has at least one character and all characters are letters, else False
isdigit() Returns True if the string contains only digits, else False
isnumeric() Returns True if the string contains only numeric characters, else False
isspace() Returns True if the string contains only whitespace, else False
lower() Converts all uppercase characters in the string to lowercase
islower() Returns True if the string contains at least one cased character and all cased characters are lowercase, else False
capitalize() Converts the first character of the string to uppercase
isupper() Returns True if the string contains at least one cased character and all cased characters are uppercase, else False
title() Returns a title-cased version of the string: every word starts with an uppercase letter and the remaining letters are lowercase
istitle() Returns True if the string is title-cased (see title()), else False
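A few of the predicates above in action, both on plain strings and element-wise through the Series .str accessor:

```python
import pandas as pd

assert 'abc123'.isalnum()
assert not 'abc 123'.isalnum()   # the space is neither a letter nor a digit
assert 'Hello World'.istitle()

s = pd.Series(['bacon', 'Pastrami', 'nova lox'])
lowered = s.str.lower().tolist()  # lower() applied to every element
```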

Replacing Values

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
# replace several values at once
data.replace([-999, -1000], np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
  • To apply a different replacement to each value, pass a list of replacements or a dict.
data.replace([-999, -1000], [np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes

  • Like the values in a Series, axis labels can be transformed by a function or mapping to produce new, differently labeled objects. Like a Series, the axis index has a map method.
  • The axes can also be modified in place without creating a new data structure.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

rename creates a transformed copy of the dataset instead of modifying the original.

data.rename(index=str.title, columns=str.upper)
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
# update only some of the axis labels
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
# modify in place
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11