数据分析和建模中,大量的工作花在数据准备上:加载、清理、转换以及重塑。本文主要介绍数据集的合并、重塑和转换。
import numpy as np
import pandas as pd

pd.options.display.max_rows = 20
np.random.seed(12345)
np.set_printoptions(precision=4, suppress=True)
合并数据集
pandas.merge
可根据一个或多个键将不同DataFrame中的行连接起来。
pandas.concat
可以沿着一条轴将多个对象堆叠到一起
combine_first
可以将重叠的数据拼接在一起,用一个对象中的值填充另一个对象中的缺失值。
数据库风格的DataFrame合并
DataFrame的合并(merge)或连接(join)运算是通过一个或多个键将行链接起来。
s = pd.read_csv('./examples/student.csv')
s

   student_id student_name  teacher_id
0           1     xiaoming           1
1           2      xiaohua           1
2           3       xiaoli           1
3           4    xiaozhang           2
4           5      xiaosun           3
5           6     xiaohong           5
t = pd.read_csv('./examples/teacher.csv')
t

   teacher_id teacher_name
0           1     zhangsan
1           2         lisi
2           3       wangwu
3           4      zhaoliu
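下面的结果对应不指定连接键的合并;原代码单元未保留,推测为如下调用(示意):

pd.merge(s, t)   # 未指定on时,默认用两表共有的列teacher_id作为连接键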
   student_id student_name  teacher_id teacher_name
0           1     xiaoming           1     zhangsan
1           2      xiaohua           1     zhangsan
2           3       xiaoli           1     zhangsan
3           4    xiaozhang           2         lisi
4           5      xiaosun           3       wangwu
如果没有指明用哪个列进行连接,merge会将重叠列的列名当作键。因此,最好显式指定连接键。
如果两个对象中用作连接键的列名不同,也可以分别进行指定。
默认情况下,merge做的是inner连接,返回的是键的交集。其他方式还有left、right以及outer。
left返回左表中的所有记录,right返回右表中的所有记录,而outer返回键的并集,组合了左连接和右连接的效果。
pd.merge(s, t, on='teacher_id')

   student_id student_name  teacher_id teacher_name
0           1     xiaoming           1     zhangsan
1           2      xiaohua           1     zhangsan
2           3       xiaoli           1     zhangsan
3           4    xiaozhang           2         lisi
4           5      xiaosun           3       wangwu
s.merge(t, on='teacher_id')

   student_id student_name  teacher_id teacher_name
0           1     xiaoming           1     zhangsan
1           2      xiaohua           1     zhangsan
2           3       xiaoli           1     zhangsan
3           4    xiaozhang           2         lisi
4           5      xiaosun           3       wangwu
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
df3
df4

  lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6

  rkey  data2
0    a      0
1    b      1
2    d      2
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
pd.merge(s, t, how='inner')

   student_id student_name  teacher_id teacher_name
0           1     xiaoming           1     zhangsan
1           2      xiaohua           1     zhangsan
2           3       xiaoli           1     zhangsan
3           4    xiaozhang           2         lisi
4           5      xiaosun           3       wangwu
pd.merge(s, t, how='left')

   student_id student_name  teacher_id teacher_name
0           1     xiaoming           1     zhangsan
1           2      xiaohua           1     zhangsan
2           3       xiaoli           1     zhangsan
3           4    xiaozhang           2         lisi
4           5      xiaosun           3       wangwu
5           6     xiaohong           5          NaN
pd.merge(s, t, how='right')

   student_id student_name  teacher_id teacher_name
0         1.0     xiaoming           1     zhangsan
1         2.0      xiaohua           1     zhangsan
2         3.0       xiaoli           1     zhangsan
3         4.0    xiaozhang           2         lisi
4         5.0      xiaosun           3       wangwu
5         NaN          NaN           4      zhaoliu
pd.merge(s, t, how='outer')

   student_id student_name  teacher_id teacher_name
0         1.0     xiaoming           1     zhangsan
1         2.0      xiaohua           1     zhangsan
2         3.0       xiaoli           1     zhangsan
3         4.0    xiaozhang           2         lisi
4         5.0      xiaosun           3       wangwu
5         6.0     xiaohong           5          NaN
6         NaN          NaN           4      zhaoliu
多对多合并操作,返回的是行的笛卡尔积。
连接方式只影响出现在结果中的键。
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df1

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
df2

  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4
pd.merge(df1, df2, on='key', how='left')

   key  data1  data2
0    b      0    1.0
1    b      0    3.0
2    b      1    1.0
3    b      1    3.0
4    a      2    0.0
5    a      2    2.0
6    c      3    NaN
7    a      4    0.0
8    a      4    2.0
9    b      5    1.0
10   b      5    3.0
pd.merge(df1, df2, how='inner')

  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2
df1.merge(df2, on=['key'], how='outer')

   key  data1  data2
0    b    0.0    1.0
1    b    0.0    3.0
2    b    1.0    1.0
3    b    1.0    3.0
4    b    5.0    1.0
5    b    5.0    3.0
6    a    2.0    0.0
7    a    2.0    2.0
8    a    4.0    0.0
9    a    4.0    2.0
10   c    3.0    NaN
11   d    NaN    4.0
根据多个键进行合并,传入一个由列名组成的列表即可。
可以理解为多个键形成一系列元组,并将其当作单个连接键
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
left
right

  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3

  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
pd.merge(right, left, on=['key1', 'key2'], how='right')

  key1 key2  rval  lval
0  foo  one   4.0     1
1  foo  one   5.0     1
2  bar  one   6.0     3
3  foo  two   NaN     2
pd.merge(left, right, on=['key1', 'key2'], how='outer')

  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
对于重复的列名,可以指定一个suffixes选项,用于附加到重叠列名的末尾。
pd.merge(left, right, on='key1')

  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7
merge函数的参数

left:左侧DataFrame
right:右侧DataFrame
how:连接方式,可选inner、outer、left、right,默认为inner
on:用于连接的列名,必须同时存在于左右两个DataFrame中;如果不指定且其他连接键参数也未指定,则以两个DataFrame的共同列名作为连接键
left_on:左侧DataFrame中用作连接键的列
right_on:右侧DataFrame中用作连接键的列
left_index:True或False,是否将左侧DataFrame的行索引作为连接键
right_index:True或False,是否将右侧DataFrame的行索引作为连接键
sort:True或False,合并后是否按连接键对数据排序,默认为True;处理大规模数据时禁用可获得更好的性能
suffixes:字符串元组,用于追加到重叠列名的末尾,默认为('_x', '_y')
索引上合并
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
right1

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

   group_val
a        3.5
b        7.0
pd.merge(left1, right1, left_on='key', right_index=True)

  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
lefth
righth

     key1  key2  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0

             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

     key1  key2  data  event1  event2
0    Ohio  2000   0.0       4       5
0    Ohio  2000   0.0       6       7
1    Ohio  2001   1.0       8       9
2    Ohio  2002   2.0      10      11
3  Nevada  2001   3.0       0       1
pd.merge(lefth, righth, left_on=['key1', 'key2'],
         right_index=True, how='outer')

     key1  key2  data  event1  event2
0    Ohio  2000   0.0     4.0     5.0
0    Ohio  2000   0.0     6.0     7.0
1    Ohio  2001   1.0     8.0     9.0
2    Ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     0.0     1.0
4  Nevada  2002   4.0     NaN     NaN
4  Nevada  2000   NaN     2.0     3.0
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
left2
right2

   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0

   Missouri  Alabama
b       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0
pd.merge(left2, right2, how='outer',
         left_index=True, right_index=True, sort=False)

   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
left2.merge(right2, how='outer', left_index=True, right_index=True)

   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
DataFrame还有一个join实例方法,可以更方便地实现按索引合并。它还可用于合并多个带有相同或相似索引、但列没有重叠的DataFrame对象。
left2.join(right2, how='outer')

   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
left1.join(right1, on='key')

  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])
another

   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0
left2.join([right2, another])

   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
left2.join([right2, another], how='outer')

   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
b   NaN     NaN       7.0      8.0       NaN     NaN
d   NaN     NaN      11.0     12.0       NaN     NaN
f   NaN     NaN       NaN      NaN      16.0    17.0
轴向连接
数组的合并与拆分
NumPy有一个用于合并原始NumPy数组的concatenate函数,可以按指定轴将一个由数组组成的序列(如元组、列表等)连接到一起。
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
np.concatenate([arr, arr], axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
a
b
a.shape
b.shape

array([[1, 2],
       [3, 4]])

array([[5, 6]])

(2, 2)
(1, 2)
np.concatenate((a, b), axis=0)

array([[1, 2],
       [3, 4],
       [5, 6]])
np.concatenate((a, b.T), axis=1)

array([[1, 2, 5],
       [3, 4, 6]])
对于常见的连接操作,NumPy还提供了更方便的方法(如vstack和hstack)。因此,上面的操作也可以改写为下面的形式。
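以下写法为示意,与上面concatenate的各个结果一一对应:

np.vstack([arr, arr])    # 等价于np.concatenate([arr, arr], axis=0)
np.hstack([arr, arr])    # 等价于np.concatenate([arr, arr], axis=1)
np.vstack((a, b))        # 等价于np.concatenate((a, b), axis=0)
np.hstack((a, b.T))      # 等价于np.concatenate((a, b.T), axis=1)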
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

array([[1, 2],
       [3, 4],
       [5, 6]])

array([[1, 2, 5],
       [3, 4, 6]])
相反,split用于将一个数组沿指定的轴拆分为多个数组。
arr2 = np.random.randn(5, 2)
arr2

array([[ 0.0009,  1.3438],
       [-0.7135, -0.8312],
       [-2.3702, -1.8608],
       [-0.8608,  0.5601],
       [-1.2659,  0.1198]])

first, second, third = np.split(arr2, [1, 3])
first
second
third

array([[0.0009, 1.3438]])

array([[-0.7135, -0.8312],
       [-2.3702, -1.8608]])

array([[-0.8608,  0.5601],
       [-1.2659,  0.1198]])
Pandas的轴向连接
对于pandas对象,带有标签的轴能够进一步推广数组的连接运算。pandas提供了concat函数。
s1 = pd.Series([0, 1], index=['a', 'b'])
s1

a    0
b    1
dtype: int64
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s2

c    2
d    3
e    4
dtype: int64
s3 = pd.Series([5, 6], index=['f', 'g'])
s3

f    5
g    6
dtype: int64
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
pd.concat([s1, s2, s3], axis=1, sort=False)

     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
s4 = pd.concat([s1, s3])
s4

a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1, s4], axis=1)

     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6
pd.concat([s1, s4], axis=1, join='inner')
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
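下面的输出推测来自如下调用(示意):

result.unstack()   # 把最内层索引旋转为列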
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
如果沿着axis=1对Series进行合并,则keys会成为DataFrame的列标题。
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'],
                   columns=['three', 'four'])
df1

   one  two
a    0    1
b    2    3
c    4    5
pd.concat([df1, df2], axis=1)

   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
如果传入concat的不是列表而是字典,则字典的键会被当作keys选项的值。
ignore_index选项表示不保留连接轴上的索引,产生一组新索引。
pd.concat({'level1': df1, 'level2': df2}, axis=1)

  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
          names=['upper', 'lower'])

upper level1     level2     
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
df2

          a         b         c         d
0 -0.204708  0.478943 -0.519439 -0.555730
1  1.965781  1.393406  0.092908  0.281746
2  0.769023  1.246435  1.007189 -1.296221

          b         d         a
0  0.274992  0.228913  1.352917
1  0.886429 -2.001637 -0.371843
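直接拼接时会保留原来的行索引;下面的输出推测来自如下调用(示意):

pd.concat([df1, df2])   # 列取并集,df2缺少的c列填充NaN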
          a         b         c         d
0 -0.204708  0.478943 -0.519439 -0.555730
1  1.965781  1.393406  0.092908  0.281746
2  0.769023  1.246435  1.007189 -1.296221
0  1.352917  0.274992       NaN  0.228913
1 -0.371843  0.886429       NaN -2.001637
pd.concat([df1, df2], ignore_index=True)

          a         b         c         d
0 -0.204708  0.478943 -0.519439 -0.555730
1  1.965781  1.393406  0.092908  0.281746
2  0.769023  1.246435  1.007189 -1.296221
3  1.352917  0.274992       NaN  0.228913
4 -0.371843  0.886429       NaN -2.001637
concat函数的参数

objs:参与连接的pandas对象的列表或字典,必选参数
axis:连接的轴向,默认axis=0
join:连接方式,可选inner(交集)、outer(并集),默认为outer
join_axes:指明用于其他n-1条轴的索引
keys:与连接对象有关的值,用于形成连接轴上的层次化索引;可以是任意值的列表或数组、元组数组,或数组列表(如果将levels设为多级数组的话)
levels:指定用作层次化索引各级别上的索引
names:用于创建分层级别的名称
verify_integrity:检查结果对象新轴上的重复情况,如果发现则引发异常,默认为False
ignore_index:不保留连接轴上的索引,产生一组新索引
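例如,verify_integrity可以在连接轴出现重复索引时及时报错(示意):

try:
    pd.concat([s1, s1], verify_integrity=True)   # s1的索引在结果中重复
except ValueError as err:
    print(err)   # 提示索引存在重叠值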
合并重叠数据
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan
a
b

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
np.where(pd.isnull(a), b, a)

array([0. , 2.5, 2. , 3.5, 4.5, nan])
pd.Series(np.where(pd.isnull(a), b, a))

0    0.0
1    2.5
2    2.0
3    3.5
4    4.5
5    NaN
dtype: float64
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
combine_first方法可以将调用者的NA值用另一个对象中相同索引位置的值代替,并且会自动对齐数据。
Combine Series values, choosing the calling Series’s values first.
Result index will be the union of the two indexes
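下面几个输出推测来自类似这样的调用(示意,原代码未保留):

a.combine_first(b)   # a中的NaN用b里相同索引位置的值填充
b.combine_first(a)   # 以b的值优先;b只在'a'处缺失,而a在'a'处同样缺失,结果与b相同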
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
b[:-2].combine_first(a[2:])
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
Combine two DataFrame objects and default to non-null values in
frame calling the method. Result index columns will be the union
of the respective indexes and columns
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1
df2

     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
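下面的结果推测来自如下调用(示意):

df1.combine_first(df2)   # df1中的缺失值用df2对应位置的值填充,索引和列取并集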
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
重塑(reshape)和透视(pivot)
层次化索引重塑
stack
将数据的列旋转为行。默认dropna=True
unstack
将数据的行旋转为列。
默认操作的是最内层,可以传入分层级别的编号或名称,对其他级别进行unstack操作。
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
data

number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
result = data.stack()
result
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
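下面三个输出推测分别对应如下调用(示意):

result.unstack()          # 默认把最内层的number旋转回列
result.unstack(0)         # 按层级编号旋转state层
result.unstack('state')   # 按层级名称旋转,效果同上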
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s1
s2

a    0
b    1
c    2
d    3
dtype: int64

c    4
d    5
e    6
dtype: int64
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
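对data2做unstack时,不重叠的索引位置会引入NaN;下面的输出与如下调用对应(示意):

data2.unstack()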
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
data2.unstack().stack()
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data2.unstack().stack(dropna=False)
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df

side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
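下面两个输出推测分别对应如下调用(示意):

df.unstack()          # 旋转最内层的number
df.unstack('state')   # 按名称旋转state层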
side     left           right          
number    one two three   one two three
state                                  
Ohio        0   1     2     5   6     7
Colorado    3   4     5     8   9    10

side     left          right         
state    Ohio Colorado  Ohio Colorado
number                               
one         0        3     5        8
two         1        4     6        9
three       2        5     7       10
df
df.unstack('state').stack('side')

side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
透视表 pivot
Return reshaped DataFrame organized by given index / column values.

DataFrame.pivot(self, index=None, columns=None, values=None) -> 'DataFrame'

index : str or object, optional. Column to use to make new frame's index. If None, uses existing index.
columns : str or object. Column to use to make new frame's columns.
values : str, object or a list of the previous, optional. Column(s) to use for populating new frame's values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df

   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
df.pivot(index='foo', columns='bar', values='baz')

bar  A  B  C
foo         
one  1  2  3
two  4  5  6
df.pivot(index='foo', columns='bar')['baz']

bar  A  B  C
foo         
one  1  2  3
two  4  5  6
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

    baz       zoo      
bar   A  B  C   A  B  C
foo                    
one   1  2  3   x  y  z
two   4  5  6   q  w  t
Raises ValueError when there are any index, columns combinations with multiple values. Use DataFrame.pivot_table when you need to aggregate.
透视表pivot_table
pivot_table
Create a spreadsheet-style pivot table as a
DataFrame.
1 2 3 4 df = pd.DataFrame({"foo" : ['one' , 'one' , 'two' , 'two' ], "bar" : ['A' , 'A' , 'B' , 'C' ], "baz" : [1 , 2 , 3 , 4 ]}) df
foo
bar
baz
0
one
A
1
1
one
A
2
2
two
B
3
3
two
C
4
df.pivot_table(index='foo', columns='bar', values='baz')

bar    A    B    C
foo               
one  1.5  NaN  NaN
two  NaN  3.0  4.0
数据的转换
移除重复数据
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
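duplicated方法返回一个布尔型Series,表示各行是否为重复行;下面的输出与如下调用对应(示意):

data.duplicated()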
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
drop_duplicates方法返回一个移除了重复行的DataFrame。
默认会判断全部列,也可以指定只根据部分列来判断。
duplicated和drop_duplicates默认保留的是第一个出现的值组合,传入keep='last'则保留最后一个。
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1'] = range(7)
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
1  two   1   1
data.drop_duplicates(['k1', 'k2'], keep='last')

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
利用函数或映射进行数据转换
根据数组、Series或DataFrame列中的值来实现转换工作。
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
Series的map方法可以接受一个函数或含有映射关系的字典。
data['animal'] = lowercased.map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
部分常用的字符串方法

isalnum():如果字符串至少有一个字符并且所有字符都是字母或数字则返回True,否则返回False
isalpha():如果字符串至少有一个字符并且所有字符都是字母则返回True,否则返回False
isdigit():如果字符串只包含数字则返回True,否则返回False
isnumeric():如果字符串中只包含数字字符则返回True,否则返回False
isspace():如果字符串中只包含空白则返回True,否则返回False
lower():将字符串中所有大写字符转换为小写
islower():如果字符串中包含至少一个区分大小写的字符,并且所有这些(区分大小写的)字符都是小写,则返回True,否则返回False
capitalize():将字符串的第一个字符转换为大写
isupper():如果字符串中包含至少一个区分大小写的字符,并且所有这些(区分大小写的)字符都是大写,则返回True,否则返回False
title():返回"标题化"的字符串,即所有单词都以大写开始,其余字母均为小写
istitle():如果字符串是标题化的(见title())则返回True,否则返回False
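这些方法也可以通过Series的str访问器按元素使用(示意):

data['food'].str.title()     # 每个单词首字母大写
data['food'].str.isalpha()   # 是否全部由字母组成('pulled pork'含空格,结果为False)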
替换值
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
要对不同的值进行不同的替换,可以传入由替换关系组成的列表或字典。
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
重命名轴索引
跟Series中的值一样,轴标签也可以通过函数或映射进行转换,从而得到一个新的对象;轴索引同样有一个map方法。
轴还可以被就地修改,而无需新建一个数据结构。
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
rename方法则创建数据集的转换版本,而不修改原始数据。
data.rename(index=str.title, columns=str.upper)

      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11