数据分析和建模方面大量的工作是用在数据准备上:加载、清理、转换以及重塑。本文主要介绍数据集的合并、重塑和转换。
1 2 3 4 5 import  numpy as  npimport  pandas as  pdpd.options.display.max_rows = 20    np.random.seed(12345 ) np.set_printoptions(precision=4 , suppress=True )   
合并数据集 
pandas.merge可根据一个或多个键将不同DataFrame中的行连接起来。pandas.concat可以沿着一条轴将多个对象堆叠到一起combine_first可以将重复数据编接在一起,用一个对象中的值填充另一个对象中的缺失值。 
数据库风格的DataFrame合并 
DataFrame的合并(merge)或连接(join)运算是通过一个或多个键将行链接起来。
1 2 s = pd.read_csv('./examples/student.csv' ) s 
 
student_id
 
student_name
 
teacher_id
 
 
 
0
 
1
 
xiaoming
 
1
 
 
1
 
2
 
xiaohua
 
1
 
 
2
 
3
 
xiaoli
 
1
 
 
3
 
4
 
xiaozhang
 
2
 
 
4
 
5
 
xiaosun
 
3
 
 
5
 
6
 
xiaohong
 
5
 
 
 
 
1 2 t = pd.read_csv('./examples/teacher.csv' ) t 
 
teacher_id
 
teacher_name
 
 
 
0
 
1
 
zhangsan
 
 
1
 
2
 
lisi
 
 
2
 
3
 
wangwu
 
 
3
 
4
 
zhaoliu
 
 
 
 
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4
 
xiaozhang
 
2
 
lisi
 
 
4
 
5
 
xiaosun
 
3
 
wangwu
 
 
 
 
如果没有指明用哪个列进行连接,merge会将重叠列的列名当作键。因此,最好显示指定连接键  
如果两个对象的列名,也可以分别进行指定。 
默认情况下,merge做的是inner连接连接,返回的是交集。其他方式,还有left,right以及outer。 
left是显示左表中的所有记录,right返回右表中的所有记录,而outer返回的是键的并集,其组合了左连接和右连接的效果。 
 
1 pd.merge(s,t,on='teacher_id' )  
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4
 
xiaozhang
 
2
 
lisi
 
 
4
 
5
 
xiaosun
 
3
 
wangwu
 
 
 
 
1 s.merge(t, on='teacher_id' ) 
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4
 
xiaozhang
 
2
 
lisi
 
 
4
 
5
 
xiaosun
 
3
 
wangwu
 
 
 
 
1 2 3 4 5 df3 = pd.DataFrame({'lkey' : ['b' , 'b' , 'a' , 'c' , 'a' , 'a' , 'b' ],                     'data1' : range (7 )}) df4 = pd.DataFrame({'rkey' : ['a' , 'b' , 'd' ],                     'data2' : range (3 )}) 
 
lkey
 
data1
 
 
 
0
 
b
 
0
 
 
1
 
b
 
1
 
 
2
 
a
 
2
 
 
3
 
c
 
3
 
 
4
 
a
 
4
 
 
5
 
a
 
5
 
 
6
 
b
 
6
 
 
 
 
 
rkey
 
data2
 
 
 
0
 
a
 
0
 
 
1
 
b
 
1
 
 
2
 
d
 
2
 
 
 
 
1 pd.merge(df3, df4, left_on='lkey' , right_on='rkey' ) 
 
lkey
 
data1
 
rkey
 
data2
 
 
 
0
 
b
 
0
 
b
 
1
 
 
1
 
b
 
1
 
b
 
1
 
 
2
 
b
 
6
 
b
 
1
 
 
3
 
a
 
2
 
a
 
0
 
 
4
 
a
 
4
 
a
 
0
 
 
5
 
a
 
5
 
a
 
0
 
 
 
 
1 pd.merge(s, t, how='inner' ) 
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4
 
xiaozhang
 
2
 
lisi
 
 
4
 
5
 
xiaosun
 
3
 
wangwu
 
 
 
 
1 pd.merge(s, t, how='left' ) 
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4
 
xiaozhang
 
2
 
lisi
 
 
4
 
5
 
xiaosun
 
3
 
wangwu
 
 
5
 
6
 
xiaohong
 
5
 
NaN
 
 
 
 
1 pd.merge(s, t, how='right' ) 
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1.0
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2.0
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3.0
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4.0
 
xiaozhang
 
2
 
lisi
 
 
4
 
5.0
 
xiaosun
 
3
 
wangwu
 
 
5
 
NaN
 
NaN
 
4
 
zhaoliu
 
 
 
 
1 pd.merge(s, t, how='outer' ) 
 
student_id
 
student_name
 
teacher_id
 
teacher_name
 
 
 
0
 
1.0
 
xiaoming
 
1
 
zhangsan
 
 
1
 
2.0
 
xiaohua
 
1
 
zhangsan
 
 
2
 
3.0
 
xiaoli
 
1
 
zhangsan
 
 
3
 
4.0
 
xiaozhang
 
2
 
lisi
 
 
4
 
5.0
 
xiaosun
 
3
 
wangwu
 
 
5
 
6.0
 
xiaohong
 
5
 
NaN
 
 
6
 
NaN
 
NaN
 
4
 
zhaoliu
 
 
 
 
多对多合并操作,返回的是行的笛卡尔积。 
连接方式只影响出现在结果中的键。 
 
1 2 3 4 df1 = pd.DataFrame({'key' : ['b' , 'b' , 'a' , 'c' , 'a' , 'b' ],                     'data1' : range (6 )}) df1 
 
key
 
data1
 
 
 
0
 
b
 
0
 
 
1
 
b
 
1
 
 
2
 
a
 
2
 
 
3
 
c
 
3
 
 
4
 
a
 
4
 
 
5
 
b
 
5
 
 
 
 
1 2 3 df2 = pd.DataFrame({'key' : ['a' , 'b' , 'a' , 'b' , 'd' ],                     'data2' : range (5 )}) df2 
 
key
 
data2
 
 
 
0
 
a
 
0
 
 
1
 
b
 
1
 
 
2
 
a
 
2
 
 
3
 
b
 
3
 
 
4
 
d
 
4
 
 
 
 
1 pd.merge(df1, df2, on='key' , how='left' ) 
 
key
 
data1
 
data2
 
 
 
0
 
b
 
0
 
1.0
 
 
1
 
b
 
0
 
3.0
 
 
2
 
b
 
1
 
1.0
 
 
3
 
b
 
1
 
3.0
 
 
4
 
a
 
2
 
0.0
 
 
5
 
a
 
2
 
2.0
 
 
6
 
c
 
3
 
NaN
 
 
7
 
a
 
4
 
0.0
 
 
8
 
a
 
4
 
2.0
 
 
9
 
b
 
5
 
1.0
 
 
10
 
b
 
5
 
3.0
 
 
 
 
1 pd.merge(df1, df2, how='inner' ) 
 
key
 
data1
 
data2
 
 
 
0
 
b
 
0
 
1
 
 
1
 
b
 
0
 
3
 
 
2
 
b
 
1
 
1
 
 
3
 
b
 
1
 
3
 
 
4
 
b
 
5
 
1
 
 
5
 
b
 
5
 
3
 
 
6
 
a
 
2
 
0
 
 
7
 
a
 
2
 
2
 
 
8
 
a
 
4
 
0
 
 
9
 
a
 
4
 
2
 
 
 
 
1 df1.merge(df2,on=['key' ],how='outer' ) 
 
key
 
data1
 
data2
 
 
 
0
 
b
 
0.0
 
1.0
 
 
1
 
b
 
0.0
 
3.0
 
 
2
 
b
 
1.0
 
1.0
 
 
3
 
b
 
1.0
 
3.0
 
 
4
 
b
 
5.0
 
1.0
 
 
5
 
b
 
5.0
 
3.0
 
 
6
 
a
 
2.0
 
0.0
 
 
7
 
a
 
2.0
 
2.0
 
 
8
 
a
 
4.0
 
0.0
 
 
9
 
a
 
4.0
 
2.0
 
 
10
 
c
 
3.0
 
NaN
 
 
11
 
d
 
NaN
 
4.0
 
 
 
 
根据多个键进行合并,传入一个由列名组成的列表即可。 
可以理解为多个键形成一系列元组,并将其当作单个连接键  
 
1 2 3 4 5 6 7 8 left = pd.DataFrame({'key1' : ['foo' , 'foo' , 'bar' ],                      'key2' : ['one' , 'two' , 'one' ],                      'lval' : [1 , 2 , 3 ]}) right = pd.DataFrame({'key1' : ['foo' , 'foo' , 'bar' , 'bar' ],                       'key2' : ['one' , 'one' , 'one' , 'two' ],                       'rval' : [4 , 5 , 6 , 7 ]}) left right 
 
key1
 
key2
 
lval
 
 
 
0
 
foo
 
one
 
1
 
 
1
 
foo
 
two
 
2
 
 
2
 
bar
 
one
 
3
 
 
 
 
 
key1
 
key2
 
rval
 
 
 
0
 
foo
 
one
 
4
 
 
1
 
foo
 
one
 
5
 
 
2
 
bar
 
one
 
6
 
 
3
 
bar
 
two
 
7
 
 
 
 
1 pd.merge(right,left, on=['key1' , 'key2' ], how='right' ) 
 
key1
 
key2
 
rval
 
lval
 
 
 
0
 
foo
 
one
 
4.0
 
1
 
 
1
 
foo
 
one
 
5.0
 
1
 
 
2
 
bar
 
one
 
6.0
 
3
 
 
3
 
foo
 
two
 
NaN
 
2
 
 
 
 
1 pd.merge(left, right, on=['key1' , 'key2' ], how='outer' ) 
 
key1
 
key2
 
lval
 
rval
 
 
 
0
 
foo
 
one
 
1.0
 
4.0
 
 
1
 
foo
 
one
 
1.0
 
5.0
 
 
2
 
foo
 
two
 
2.0
 
NaN
 
 
3
 
bar
 
one
 
3.0
 
6.0
 
 
4
 
bar
 
two
 
NaN
 
7.0
 
 
 
 
对于重复列名,可以指定一个suffixes选项,用于附加到重复列名上。 
 
1 pd.merge(left, right, on='key1' ) 
 
key1
 
key2_x
 
lval
 
key2_y
 
rval
 
 
 
0
 
foo
 
one
 
1
 
one
 
4
 
 
1
 
foo
 
one
 
1
 
one
 
5
 
 
2
 
foo
 
two
 
2
 
one
 
4
 
 
3
 
foo
 
two
 
2
 
one
 
5
 
 
4
 
bar
 
one
 
3
 
one
 
6
 
 
5
 
bar
 
one
 
3
 
two
 
7
 
 
 
 
1 pd.merge(left, right, on='key1' , suffixes=('_left' , '_right' )) 
 
key1
 
key2_left
 
lval
 
key2_right
 
rval
 
 
 
0
 
foo
 
one
 
1
 
one
 
4
 
 
1
 
foo
 
one
 
1
 
one
 
5
 
 
2
 
foo
 
two
 
2
 
one
 
4
 
 
3
 
foo
 
two
 
2
 
one
 
5
 
 
4
 
bar
 
one
 
3
 
one
 
6
 
 
5
 
bar
 
one
 
3
 
two
 
7
 
 
 
 
merge函数的参数
 
 
left左侧DataFrame 
 
right右侧DataFrame 
 
how连接方式,可选为inner、outer、left、right,默认为inner 
 
on用于连接的列名,必须同时存在于左右两个DataFrame中。如果不指定且其他参数也未指定,则以共同列名合并 
 
left_on左侧DataFrame中的连接键列 
 
right_on右侧DataFrame中的连接键列 
 
left_indexTrueorFalse,是否将左侧DataFrame的行索引作为连接键 
right_indexTrueorFalse,是否将右侧DataFrame的行索引作为连接键 
sortTrueorFalse,合并后是否对数据进行排序。默认为True,大规模数据时禁用可获得更好的性能 
suffixes字符串元组,用于追加到重叠列名的末尾,默认为(_x, _y) 
 
 
索引上合并 
1 2 3 4 5 left1 = pd.DataFrame({'key' : ['a' , 'b' , 'a' , 'a' , 'b' , 'c' ],                       'value' : range (6 )}) right1 = pd.DataFrame({'group_val' : [3.5 , 7 ]}, index=['a' , 'b' ]) left1 right1 
 
key
 
value
 
 
 
0
 
a
 
0
 
 
1
 
b
 
1
 
 
2
 
a
 
2
 
 
3
 
a
 
3
 
 
4
 
b
 
4
 
 
5
 
c
 
5
 
 
 
 
1 pd.merge(left1, right1, left_on='key' , right_index=True ) 
 
key
 
value
 
group_val
 
 
 
0
 
a
 
0
 
3.5
 
 
2
 
a
 
2
 
3.5
 
 
3
 
a
 
3
 
3.5
 
 
1
 
b
 
1
 
7.0
 
 
4
 
b
 
4
 
7.0
 
 
 
 
1 pd.merge(left1, right1, left_on='key' , right_index=True , how='outer' ) 
 
key
 
value
 
group_val
 
 
 
0
 
a
 
0
 
3.5
 
 
2
 
a
 
2
 
3.5
 
 
3
 
a
 
3
 
3.5
 
 
1
 
b
 
1
 
7.0
 
 
4
 
b
 
4
 
7.0
 
 
5
 
c
 
5
 
NaN
 
 
 
 
1 2 3 4 5 6 7 8 9 10 11 lefth = pd.DataFrame({'key1' : ['Ohio' , 'Ohio' , 'Ohio' ,                                'Nevada' , 'Nevada' ],                       'key2' : [2000 , 2001 , 2002 , 2001 , 2002 ],                       'data' : np.arange(5. )}) righth = pd.DataFrame(np.arange(12 ).reshape((6 , 2 )),                       index=[['Nevada' , 'Nevada' , 'Ohio' , 'Ohio' ,                               'Ohio' , 'Ohio' ],                              [2001 , 2000 , 2000 , 2000 , 2001 , 2002 ]],                       columns=['event1' , 'event2' ]) lefth righth 
 
key1
 
key2
 
data
 
 
 
0
 
Ohio
 
2000
 
0.0
 
 
1
 
Ohio
 
2001
 
1.0
 
 
2
 
Ohio
 
2002
 
2.0
 
 
3
 
Nevada
 
2001
 
3.0
 
 
4
 
Nevada
 
2002
 
4.0
 
 
 
 
 
 
event1
 
event2
 
 
 
Nevada
 
2001
 
0
 
1
 
 
2000
 
2
 
3
 
 
Ohio
 
2000
 
4
 
5
 
 
2000
 
6
 
7
 
 
2001
 
8
 
9
 
 
2002
 
10
 
11
 
 
 
 
1 pd.merge(lefth, righth, left_on=['key1' , 'key2' ], right_index=True ) 
 
key1
 
key2
 
data
 
event1
 
event2
 
 
 
0
 
Ohio
 
2000
 
0.0
 
4
 
5
 
 
0
 
Ohio
 
2000
 
0.0
 
6
 
7
 
 
1
 
Ohio
 
2001
 
1.0
 
8
 
9
 
 
2
 
Ohio
 
2002
 
2.0
 
10
 
11
 
 
3
 
Nevada
 
2001
 
3.0
 
0
 
1
 
 
 
 
 
 
event1
 
event2
 
 
 
2001
 
Nevada
 
0
 
1
 
 
2000
 
Nevada
 
2
 
3
 
 
Ohio
 
4
 
5
 
 
Ohio
 
6
 
7
 
 
2001
 
Ohio
 
8
 
9
 
 
2002
 
Ohio
 
10
 
11
 
 
 
 
1 pd.merge(lefth, righth, left_on=['key1' , 'key2' ], right_index=True ) 
 
key1
 
key2
 
data
 
event1
 
event2
 
 
 
0
 
Ohio
 
2000
 
0.0
 
4
 
5
 
 
0
 
Ohio
 
2000
 
0.0
 
6
 
7
 
 
1
 
Ohio
 
2001
 
1.0
 
8
 
9
 
 
2
 
Ohio
 
2002
 
2.0
 
10
 
11
 
 
3
 
Nevada
 
2001
 
3.0
 
0
 
1
 
 
 
 
1 2 pd.merge(lefth, righth, left_on=['key1' , 'key2' ],          right_index=True , how='outer' ) 
 
key1
 
key2
 
data
 
event1
 
event2
 
 
 
0
 
Ohio
 
2000
 
0.0
 
4.0
 
5.0
 
 
0
 
Ohio
 
2000
 
0.0
 
6.0
 
7.0
 
 
1
 
Ohio
 
2001
 
1.0
 
8.0
 
9.0
 
 
2
 
Ohio
 
2002
 
2.0
 
10.0
 
11.0
 
 
3
 
Nevada
 
2001
 
3.0
 
0.0
 
1.0
 
 
4
 
Nevada
 
2002
 
4.0
 
NaN
 
NaN
 
 
4
 
Nevada
 
2000
 
NaN
 
2.0
 
3.0
 
 
 
 
1 2 3 4 5 6 7 8 left2 = pd.DataFrame([[1. , 2. ], [3. , 4. ], [5. , 6. ]],                      index=['a' , 'c' , 'e' ],                      columns=['Ohio' , 'Nevada' ]) right2 = pd.DataFrame([[7. , 8. ], [9. , 10. ], [11. , 12. ], [13 , 14 ]],                       index=['b' , 'c' , 'd' , 'e' ],                       columns=['Missouri' , 'Alabama' ]) left2 right2 
 
Ohio
 
Nevada
 
 
 
a
 
1.0
 
2.0
 
 
c
 
3.0
 
4.0
 
 
e
 
5.0
 
6.0
 
 
 
 
 
Missouri
 
Alabama
 
 
 
b
 
7.0
 
8.0
 
 
c
 
9.0
 
10.0
 
 
d
 
11.0
 
12.0
 
 
e
 
13.0
 
14.0
 
 
 
 
1 2 pd.merge(left2, right2, how='outer' , left_index=True , right_index=True ,           sort=False ) 
 
Ohio
 
Nevada
 
Missouri
 
Alabama
 
 
 
a
 
1.0
 
2.0
 
NaN
 
NaN
 
 
b
 
NaN
 
NaN
 
7.0
 
8.0
 
 
c
 
3.0
 
4.0
 
9.0
 
10.0
 
 
d
 
NaN
 
NaN
 
11.0
 
12.0
 
 
e
 
5.0
 
6.0
 
13.0
 
14.0
 
 
 
 
1 left2.merge(right2, how='outer' , left_index=True , right_index=True ) 
 
Ohio
 
Nevada
 
Missouri
 
Alabama
 
 
 
a
 
1.0
 
2.0
 
NaN
 
NaN
 
 
b
 
NaN
 
NaN
 
7.0
 
8.0
 
 
c
 
3.0
 
4.0
 
9.0
 
10.0
 
 
d
 
NaN
 
NaN
 
11.0
 
12.0
 
 
e
 
5.0
 
6.0
 
13.0
 
14.0
 
 
 
 
DataFrame还有一个join实例方法,可以更为方便地实现索引合并。可以用于合并多个带有相同或相似索引的DataFrame对象,而不管是否有重叠列。 
 
1 left2.join(right2, how='outer' ) 
 
Ohio
 
Nevada
 
Missouri
 
Alabama
 
 
 
a
 
1.0
 
2.0
 
NaN
 
NaN
 
 
b
 
NaN
 
NaN
 
7.0
 
8.0
 
 
c
 
3.0
 
4.0
 
9.0
 
10.0
 
 
d
 
NaN
 
NaN
 
11.0
 
12.0
 
 
e
 
5.0
 
6.0
 
13.0
 
14.0
 
 
 
 
1 left1.join(right1, on='key' ) 
 
key
 
value
 
group_val
 
 
 
0
 
a
 
0
 
3.5
 
 
1
 
b
 
1
 
7.0
 
 
2
 
a
 
2
 
3.5
 
 
3
 
a
 
3
 
3.5
 
 
4
 
b
 
4
 
7.0
 
 
5
 
c
 
5
 
NaN
 
 
 
 
1 2 3 4 another = pd.DataFrame([[7. , 8. ], [9. , 10. ], [11. , 12. ], [16. , 17. ]],                        index=['a' , 'c' , 'e' , 'f' ],                        columns=['New York' , 'Oregon' ]) another 
 
New York
 
Oregon
 
 
 
a
 
7.0
 
8.0
 
 
c
 
9.0
 
10.0
 
 
e
 
11.0
 
12.0
 
 
f
 
16.0
 
17.0
 
 
 
 
1 left2.join([right2, another]) 
 
Ohio
 
Nevada
 
Missouri
 
Alabama
 
New York
 
Oregon
 
 
 
a
 
1.0
 
2.0
 
NaN
 
NaN
 
7.0
 
8.0
 
 
c
 
3.0
 
4.0
 
9.0
 
10.0
 
9.0
 
10.0
 
 
e
 
5.0
 
6.0
 
13.0
 
14.0
 
11.0
 
12.0
 
 
 
 
1 left2.join([right2, another], how='outer' ) 
 
Ohio
 
Nevada
 
Missouri
 
Alabama
 
New York
 
Oregon
 
 
 
a
 
1.0
 
2.0
 
NaN
 
NaN
 
7.0
 
8.0
 
 
c
 
3.0
 
4.0
 
9.0
 
10.0
 
9.0
 
10.0
 
 
e
 
5.0
 
6.0
 
13.0
 
14.0
 
11.0
 
12.0
 
 
b
 
NaN
 
NaN
 
7.0
 
8.0
 
NaN
 
NaN
 
 
d
 
NaN
 
NaN
 
11.0
 
12.0
 
NaN
 
NaN
 
 
f
 
NaN
 
NaN
 
NaN
 
NaN
 
16.0
 
17.0
 
 
 
 
轴向连接 
数组的合并与拆分 
Numpy有一个用于合并原始Numpy数组的concatenate函数,可以按指定轴将一个由数组组成的序列(如元组、列表等)连接到一起。 
 
1 2 arr = np.arange(12 ).reshape((3 , 4 )) arr 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]]) 
1 np.concatenate([arr, arr], axis=1 ) 
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]]) 
1 np.concatenate([arr, arr], axis=0 ) 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]]) 
1 2 3 4 a = np.array([[1 , 2 ], [3 , 4 ]]) b = np.array([[5 ,6 ]]) a b 
array([[1, 2],
       [3, 4]]) 
array([[5, 6]])
 
(2, 2) 
(1, 2)
 
1 np.concatenate((a, b), axis=0 ) 
array([[1, 2],
       [3, 4],
       [5, 6]]) 
1 np.concatenate((a, b.T), axis=1 ) 
array([[1, 2, 5],
       [3, 4, 6]]) 
对于常见的连接操作,NumPy提供了比较方便的的方法(如vstack和hstack),因此,上面的操作也可以写为
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]]) 
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]]) 
array([[1, 2],
       [3, 4],
       [5, 6]]) 
array([[1, 2, 5],
       [3, 4, 6]]) 
相反的,split用于将一个数组沿指定的轴拆分为多个数组
1 2 arr2 = np.random.randn(5 , 2 ) arr2 
array([[ 0.0009,  1.3438],
       [-0.7135, -0.8312],
       [-2.3702, -1.8608],
       [-0.8608,  0.5601],
       [-1.2659,  0.1198]]) 
1 2 first,second,third = np.split(arr2, [1 , 3 ]) first 
array([[0.0009, 1.3438]]) 
array([[-0.7135, -0.8312],
       [-2.3702, -1.8608]]) 
array([[-0.8608, 0.5601], [-1.2659, 0.1198]])
 
Pandas的轴向连接 
对于pandas对象,带有标签的轴能进一步扩展数组连接运算。pandas有一个concat函数 
 
1 2 s1 = pd.Series([0 , 1 ], index=['a' , 'b' ]) s1 
a    0
b    1
dtype: int64 
1 2 s2 = pd.Series([2 , 3 , 4 ], index=['c' , 'd' , 'e' ]) s2 
c    2
d    3
e    4
dtype: int64 
1 2 s3 = pd.Series([5 , 6 ], index=['f' , 'g' ]) s3 
f    5
g    6
dtype: int64 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64 
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64 
1 2 pd.concat([s1, s2, s3], axis=1 ,sort=False )  
 
0
 
1
 
2
 
 
 
a
 
0.0
 
NaN
 
NaN
 
 
b
 
1.0
 
NaN
 
NaN
 
 
c
 
NaN
 
2.0
 
NaN
 
 
d
 
NaN
 
3.0
 
NaN
 
 
e
 
NaN
 
4.0
 
NaN
 
 
f
 
NaN
 
NaN
 
5.0
 
 
g
 
NaN
 
NaN
 
6.0
 
 
 
 
1 2 s4 = pd.concat([s1, s3]) s4 
a    0
b    1
f    5
g    6
dtype: int64 
1 pd.concat([s1, s4], axis=1 ) 
 
0
 
1
 
 
 
a
 
0.0
 
0
 
 
b
 
1.0
 
1
 
 
f
 
NaN
 
5
 
 
g
 
NaN
 
6
 
 
 
 
1 2 pd.concat([s1, s4], axis=1 , join='inner' ) 
1 2 result = pd.concat([s1, s2, s3], keys=['one' , 'two' , 'three' ]) result 
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64 
 
a
 
b
 
c
 
d
 
e
 
f
 
g
 
 
 
one
 
0.0
 
1.0
 
NaN
 
NaN
 
NaN
 
NaN
 
NaN
 
 
two
 
NaN
 
NaN
 
2.0
 
3.0
 
4.0
 
NaN
 
NaN
 
 
three
 
NaN
 
NaN
 
NaN
 
NaN
 
NaN
 
5.0
 
6.0
 
 
 
 
如果沿着axis=1对Series进行合并,则keys会成为DataFrame的列标题。 
 
1 pd.concat([s1, s2, s3], axis=1 , keys=['one' , 'two' , 'three' ]) 
 
one
 
two
 
three
 
 
 
a
 
0.0
 
NaN
 
NaN
 
 
b
 
1.0
 
NaN
 
NaN
 
 
c
 
NaN
 
2.0
 
NaN
 
 
d
 
NaN
 
3.0
 
NaN
 
 
e
 
NaN
 
4.0
 
NaN
 
 
f
 
NaN
 
NaN
 
5.0
 
 
g
 
NaN
 
NaN
 
6.0
 
 
 
 
1 2 3 4 df1 = pd.DataFrame(np.arange(6 ).reshape(3 , 2 ), index=['a' , 'b' , 'c' ],                    columns=['one' , 'two' ]) df2 = pd.DataFrame(5  + np.arange(4 ).reshape(2 , 2 ), index=['a' , 'c' ],                    columns=['three' , 'four' ]) 
 
one
 
two
 
 
 
a
 
0
 
1
 
 
b
 
2
 
3
 
 
c
 
4
 
5
 
 
 
 
1 pd.concat([df1, df2], axis=1 ) 
 
one
 
two
 
three
 
four
 
 
 
a
 
0
 
1
 
5.0
 
6.0
 
 
b
 
2
 
3
 
NaN
 
NaN
 
 
c
 
4
 
5
 
7.0
 
8.0
 
 
 
 
1 pd.concat([df1, df2], axis=1 , keys=['level1' , 'level2' ]) 
 
level1
 
level2
 
 
 
one
 
two
 
three
 
four
 
 
 
a
 
0
 
1
 
5.0
 
6.0
 
 
b
 
2
 
3
 
NaN
 
NaN
 
 
c
 
4
 
5
 
7.0
 
8.0
 
 
 
 
如果keys选项传入的不是列表而是字典,则字典的键会被当作keys选项的值。 
ignore_index不保留连接轴上的索引,产生一组新索引。 
1 pd.concat({'level1' : df1, 'level2' : df2}, axis=1 ) 
 
level1
 
level2
 
 
 
one
 
two
 
three
 
four
 
 
 
a
 
0
 
1
 
5.0
 
6.0
 
 
b
 
2
 
3
 
NaN
 
NaN
 
 
c
 
4
 
5
 
7.0
 
8.0
 
 
 
 
1 2 3 pd.concat([df1, df2], axis=1 , keys=['level1' , 'level2' ],           names=['upper' , 'lower' ]) 
upper
 
level1
 
level2
 
 
lower
 
one
 
two
 
three
 
four
 
 
 
a
 
0
 
1
 
5.0
 
6.0
 
 
b
 
2
 
3
 
NaN
 
NaN
 
 
c
 
4
 
5
 
7.0
 
8.0
 
 
 
 
1 2 3 df1 = pd.DataFrame(np.random.randn(3 , 4 ), columns=['a' , 'b' , 'c' , 'd' ]) df2 = pd.DataFrame(np.random.randn(2 , 3 ), columns=['b' , 'd' , 'a' ]) df1 
 
a
 
b
 
c
 
d
 
 
 
0
 
-0.204708
 
0.478943
 
-0.519439
 
-0.555730
 
 
1
 
1.965781
 
1.393406
 
0.092908
 
0.281746
 
 
2
 
0.769023
 
1.246435
 
1.007189
 
-1.296221
 
 
 
 
 
b
 
d
 
a
 
 
 
0
 
0.274992
 
0.228913
 
1.352917
 
 
1
 
0.886429
 
-2.001637
 
-0.371843
 
 
 
 
 
a
 
b
 
c
 
d
 
 
 
0
 
-0.204708
 
0.478943
 
-0.519439
 
-0.555730
 
 
1
 
1.965781
 
1.393406
 
0.092908
 
0.281746
 
 
2
 
0.769023
 
1.246435
 
1.007189
 
-1.296221
 
 
0
 
1.352917
 
0.274992
 
NaN
 
0.228913
 
 
1
 
-0.371843
 
0.886429
 
NaN
 
-2.001637
 
 
 
 
1 pd.concat([df1, df2], ignore_index=True ) 
 
a
 
b
 
c
 
d
 
 
 
0
 
-0.204708
 
0.478943
 
-0.519439
 
-0.555730
 
 
1
 
1.965781
 
1.393406
 
0.092908
 
0.281746
 
 
2
 
0.769023
 
1.246435
 
1.007189
 
-1.296221
 
 
3
 
1.352917
 
0.274992
 
NaN
 
0.228913
 
 
4
 
-0.371843
 
0.886429
 
NaN
 
-2.001637
 
 
 
 
concat函数的参数
 
 
objs参与连接的pandas对象的列表或字典,必须参数 
 
axis连接的轴向,默认axis=0 
 
join连接方式,可选为inner(交集)、outer(并集),默认为join=outer 
 
join_axes指明用于其他n-1条轴的索引 
 
keys与连接对象有关的值,用于形成连接轴上的层次化索引。可以是任意的列表或数组、元组数组(如果将level设为多级数组的话) 
 
levels指定用于层次化索引各级别上的索引 
 
names创建分层级别的名称 
 
verify_integrity检查结果对象新轴上的重复情况,如果发现则引发异常。默认为False 
 
ignore_index不保留连接轴上的索引,产生一组新索引 
 
 
合并重叠数据 
1 2 3 4 5 6 7 a = pd.Series([np.nan, 2.5 , np.nan, 3.5 , 4.5 , np.nan],               index=['f' , 'e' , 'd' , 'c' , 'b' , 'a' ]) b = pd.Series(np.arange(len (a), dtype=np.float64),               index=['f' , 'e' , 'd' , 'c' , 'b' , 'a' ]) b[-1 ] = np.nan a b 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64 
f 0.0 e 1.0 d 2.0 c 3.0 b 4.0 a NaN dtype: float64
 
1 np.where(pd.isnull(a), b, a) 
array([0. , 2.5, 2. , 3.5, 4.5, nan]) 
1 pd.Series(np.where(pd.isnull(a), b, a)) 
0    0.0
1    2.5
2    2.0
3    3.5
4    4.5
5    NaN
dtype: float64 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64 
combie_first方法,将其NA值用另一个同位置(相同索引)的值代替,会数据自动对齐。Combine Series values, choosing the calling Series’s values first.
Result index will be the union of the two indexes  
 
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64 
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64 
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64 
f 0.0 e 1.0 d 2.0 c 3.0 dtype: float64
 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64 
d NaN c 3.5 b 4.5 a NaN dtype: float64
 
1 b[:-2 ].combine_first(a[2 :]) 
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64 
Combine two DataFrame objects and default to non-null values in
frame calling the method. Result index columns will be the union
of the respective indexes and columns  
 
1 2 3 4 5 6 df1 = pd.DataFrame({'a' : [1. , np.nan, 5. , np.nan],                     'b' : [np.nan, 2. , np.nan, 6. ],                     'c' : range (2 , 18 , 4 )}) df2 = pd.DataFrame({'a' : [5. , 4. , np.nan, 3. , 7. ],                     'b' : [np.nan, 3. , 4. , 6. , 8. ]}) df1 
 
a
 
b
 
c
 
 
 
0
 
1.0
 
NaN
 
2
 
 
1
 
NaN
 
2.0
 
6
 
 
2
 
5.0
 
NaN
 
10
 
 
3
 
NaN
 
6.0
 
14
 
 
 
 
 
a
 
b
 
 
 
0
 
5.0
 
NaN
 
 
1
 
4.0
 
3.0
 
 
2
 
NaN
 
4.0
 
 
3
 
3.0
 
6.0
 
 
4
 
7.0
 
8.0
 
 
 
 
 
a
 
b
 
c
 
 
 
0
 
1.0
 
NaN
 
2.0
 
 
1
 
4.0
 
2.0
 
6.0
 
 
2
 
5.0
 
4.0
 
10.0
 
 
3
 
3.0
 
6.0
 
14.0
 
 
4
 
7.0
 
8.0
 
NaN
 
 
 
 
重塑(reshape)和透视(pivot) 
层次化索引重塑 
stack将数据的列旋转为行。默认dropna=Trueunstack将数的行旋转为列。
默认操作的是最内层,可以传入分层级别的编号或名称,对其他级别进行unstack操作。 
1 2 3 4 5 data = pd.DataFrame(np.arange(6 ).reshape((2 , 3 )),                     index=pd.Index(['Ohio' , 'Colorado' ], name='state' ),                     columns=pd.Index(['one' , 'two' , 'three' ],                     name='number' )) data 
number
 
one
 
two
 
three
 
 
state
 
 
 
 
 
 
Ohio
 
0
 
1
 
2
 
 
Colorado
 
3
 
4
 
5
 
 
 
 
1 2 result = data.stack()  result 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32 
number
 
one
 
two
 
three
 
 
state
 
 
 
 
 
 
Ohio
 
0
 
1
 
2
 
 
Colorado
 
3
 
4
 
5
 
 
 
 
state
 
Ohio
 
Colorado
 
 
number
 
 
 
 
 
one
 
0
 
3
 
 
two
 
1
 
4
 
 
three
 
2
 
5
 
 
 
 
state
 
Ohio
 
Colorado
 
 
number
 
 
 
 
 
one
 
0
 
3
 
 
two
 
1
 
4
 
 
three
 
2
 
5
 
 
 
 
1 2 3 4 s1 = pd.Series([0 , 1 , 2 , 3 ], index=['a' , 'b' , 'c' , 'd' ]) s2 = pd.Series([4 , 5 , 6 ], index=['c' , 'd' , 'e' ]) s1 s2 
a    0
b    1
c    2
d    3
dtype: int64 
c 4 d 5 e 6 dtype: int64
 
1 2 data2 = pd.concat([s1, s2], keys=['one' , 'two' ]) data2 
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64 
 
a
 
b
 
c
 
d
 
e
 
 
 
one
 
0.0
 
1.0
 
2.0
 
3.0
 
NaN
 
 
two
 
NaN
 
NaN
 
4.0
 
5.0
 
6.0
 
 
 
 
1 2 data2.unstack().stack() 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64 
1 data2.unstack().stack(dropna=False ) 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64 
1 2 3 df = pd.DataFrame({'left' : result, 'right' : result + 5 },                   columns=pd.Index(['left' , 'right' ], name='side' )) df 
 
side
 
left
 
right
 
 
state
 
number
 
 
 
 
 
Ohio
 
one
 
0
 
5
 
 
two
 
1
 
6
 
 
three
 
2
 
7
 
 
Colorado
 
one
 
3
 
8
 
 
two
 
4
 
9
 
 
three
 
5
 
10
 
 
 
 
side
 
left
 
right
 
 
number
 
one
 
two
 
three
 
one
 
two
 
three
 
 
state
 
 
 
 
 
 
 
 
 
Ohio
 
0
 
1
 
2
 
5
 
6
 
7
 
 
Colorado
 
3
 
4
 
5
 
8
 
9
 
10
 
 
 
 
side
 
left
 
right
 
 
state
 
Ohio
 
Colorado
 
Ohio
 
Colorado
 
 
number
 
 
 
 
 
 
 
one
 
0
 
3
 
5
 
8
 
 
two
 
1
 
4
 
6
 
9
 
 
three
 
2
 
5
 
7
 
10
 
 
 
 
1 2 df df.unstack('state' ).stack('side' ) 
 
side
 
left
 
right
 
 
state
 
number
 
 
 
 
 
Ohio
 
one
 
0
 
5
 
 
two
 
1
 
6
 
 
three
 
2
 
7
 
 
Colorado
 
one
 
3
 
8
 
 
two
 
4
 
9
 
 
three
 
5
 
10
 
 
 
 
 
state
 
Colorado
 
Ohio
 
 
number
 
side
 
 
 
 
 
one
 
left
 
3
 
0
 
 
right
 
8
 
5
 
 
two
 
left
 
4
 
1
 
 
right
 
9
 
6
 
 
three
 
left
 
5
 
2
 
 
right
 
10
 
7
 
 
 
 
透视表 pivot 
Return reshaped DataFrame organized by given index / column
values.
DataFrame.pivot(self, index=None, columns=None, values=None)
→ 'DataFrame'
index str or object, optional Column to use to make new frame’s
index. If None, uses existing index.
columns str or object Column to use to make new frame’s
columns.
values str, object or a list of the previous, optional Column(s)
to use for populating new frame’s values. If not specified, all
remaining columns will be used and the result will have hierarchically
indexed columns.
 
1 2 3 4 5 6 df = pd.DataFrame({'foo' : ['one' , 'one' , 'one' , 'two' , 'two' ,                            'two' ],                    'bar' : ['A' , 'B' , 'C' , 'A' , 'B' , 'C' ],                    'baz' : [1 , 2 , 3 , 4 , 5 , 6 ],                    'zoo' : ['x' , 'y' , 'z' , 'q' , 'w' , 't' ]}) df 
 
foo
 
bar
 
baz
 
zoo
 
 
 
0
 
one
 
A
 
1
 
x
 
 
1
 
one
 
B
 
2
 
y
 
 
2
 
one
 
C
 
3
 
z
 
 
3
 
two
 
A
 
4
 
q
 
 
4
 
two
 
B
 
5
 
w
 
 
5
 
two
 
C
 
6
 
t
 
 
 
 
1 df.pivot(index='foo' , columns='bar' , values='baz' ) 
bar
 
A
 
B
 
C
 
 
foo
 
 
 
 
 
 
one
 
1
 
2
 
3
 
 
two
 
4
 
5
 
6
 
 
 
 
1 df.pivot(index='foo' , columns='bar' )['baz' ] 
bar
 
A
 
B
 
C
 
 
foo
 
 
 
 
 
 
one
 
1
 
2
 
3
 
 
two
 
4
 
5
 
6
 
 
 
 
1 df.pivot(index='foo' , columns='bar' , values=['baz' , 'zoo' ]) 
 
baz
 
zoo
 
 
bar
 
A
 
B
 
C
 
A
 
B
 
C
 
 
foo
 
 
 
 
 
 
 
 
 
one
 
1
 
2
 
3
 
x
 
y
 
z
 
 
two
 
4
 
5
 
6
 
q
 
w
 
t
 
 
 
 
When there are any index, columns combinations with multiple values.
DataFrame.pivot_table when you need to aggregate. 
 
透视表pivot_table 
pivot_tableCreate a spreadsheet-style pivot table as a
DataFrame.
1 2 3 4 df = pd.DataFrame({"foo" : ['one' , 'one' , 'two' , 'two' ],                    "bar" : ['A' , 'A' , 'B' , 'C' ],                    "baz" : [1 , 2 , 3 , 4 ]}) df 
 
foo
 
bar
 
baz
 
 
 
0
 
one
 
A
 
1
 
 
1
 
one
 
A
 
2
 
 
2
 
two
 
B
 
3
 
 
3
 
two
 
C
 
4
 
 
 
 
1 df.pivot(index='foo' , columns='bar' , values='bar' ) 
bar
 
A
 
B
 
C
 
 
foo
 
 
 
 
 
 
one
 
1.5
 
NaN
 
NaN
 
 
two
 
NaN
 
3.0
 
4.0
 
 
 
 
数据的转换 
移除重复数据 
1 2 3 data = pd.DataFrame({'k1' : ['one' , 'two' ] * 3  + ['two' ],                      'k2' : [1 , 1 , 2 , 3 , 3 , 4 , 4 ]}) data 
 
k1
 
k2
 
 
 
0
 
one
 
1
 
 
1
 
two
 
1
 
 
2
 
one
 
2
 
 
3
 
two
 
3
 
 
4
 
one
 
3
 
 
5
 
two
 
4
 
 
6
 
two
 
4
 
 
 
 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
drop_duplicates方法返回一个移除了重复行的DataFrame默认会判断全部列,也可以指定部分列进行判断 
duplicated和drop_duplicates默认保留的是第一个出现的值组合,传入参数take_last=True 则保留最后一个。 
 
k1
 
k2
 
 
 
0
 
one
 
1
 
 
1
 
two
 
1
 
 
2
 
one
 
2
 
 
3
 
two
 
3
 
 
4
 
one
 
3
 
 
5
 
two
 
4
 
 
 
 
1 2 data['v1' ] = range (7 ) data.drop_duplicates(['k1' ]) 
 
k1
 
k2
 
v1
 
 
 
0
 
one
 
1
 
0
 
 
1
 
two
 
1
 
1
 
 
 
 
1 data.drop_duplicates(['k1' , 'k2' ], keep='last' ) 
 
k1
 
k2
 
v1
 
 
 
0
 
one
 
1
 
0
 
 
1
 
two
 
1
 
1
 
 
2
 
one
 
2
 
2
 
 
3
 
two
 
3
 
3
 
 
4
 
one
 
3
 
4
 
 
6
 
two
 
4
 
6
 
 
 
 
利用函数或映射进行数据转换 
根据数组、Series或DataFrame列中的值来实现转换工作。
1 2 3 4 5 data = pd.DataFrame({'food' : ['bacon' , 'pulled pork' , 'bacon' ,                               'Pastrami' , 'corned beef' , 'Bacon' ,                               'pastrami' , 'honey ham' , 'nova lox' ],                      'ounces' : [4 , 3 , 12 , 6 , 7.5 , 8 , 3 , 5 , 6 ]}) data 
 
food
 
ounces
 
 
 
0
 
bacon
 
4.0
 
 
1
 
pulled pork
 
3.0
 
 
2
 
bacon
 
12.0
 
 
3
 
Pastrami
 
6.0
 
 
4
 
corned beef
 
7.5
 
 
5
 
Bacon
 
8.0
 
 
6
 
pastrami
 
3.0
 
 
7
 
honey ham
 
5.0
 
 
8
 
nova lox
 
6.0
 
 
 
 
1 2 3 4 5 6 7 8 meat_to_animal = {   'bacon' : 'pig' ,   'pulled pork' : 'pig' ,   'pastrami' : 'cow' ,   'corned beef' : 'cow' ,   'honey ham' : 'pig' ,   'nova lox' : 'salmon'  } 
1 2 lowercased = data['food' ].str .lower() lowercased 
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
Series的map方法可以接受一个函数或含有映射关系的字典。 
 
1 2 data['animal' ] = lowercased.map (meat_to_animal) data 
 
food
 
ounces
 
animal
 
 
 
0
 
bacon
 
4.0
 
pig
 
 
1
 
pulled pork
 
3.0
 
pig
 
 
2
 
bacon
 
12.0
 
pig
 
 
3
 
Pastrami
 
6.0
 
cow
 
 
4
 
corned beef
 
7.5
 
cow
 
 
5
 
Bacon
 
8.0
 
pig
 
 
6
 
pastrami
 
3.0
 
cow
 
 
7
 
honey ham
 
5.0
 
pig
 
 
8
 
nova lox
 
6.0
 
salmon
 
 
 
 
1 data['food' ].map (lambda  x: meat_to_animal[x.lower()]) 
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object部分常用string的函数
 
 
isalnum()如果字符串至少有一个字符并且所有字符都是字母或数字则返 回
True,否则返回 False 
 
isalpha()如果字符串至少有一个字符并且所有字符都是字母则返回 True, 否则返回
False 
 
isdigit()如果字符串只包含数字则返回 True 否则返回 False 
 
isnumeric()如果字符串中只包含数字字符,则返回 True,否则返回 False 
 
isspace()如果字符串中只包含空白,则返回 True,否则返回 False. 
 
lower()转换字符串中所有大写字符为小写 
 
islower()如果字符串中包含至少一个区分大小写的字符,并且所有这些(区分大小写的)字符都是小写,则返回
True,否则返回 False 
 
capitalize()将字符串的第一个字符转换为大写 
 
isupper()如果字符串中包含至少一个区分大小写的字符,并且所有这些(区分大小写的)字符都是大写,则返回
Tru 
 
title()返回"标题化"的字符串,就是说所有单词都是以大写开始,其余字母均为小写 
 
istitle如果字符串是标题化的(见 title())则返回 True,否则返回 False 
 
 
替换值 
1 2 data = pd.Series([1. , -999. , 2. , -999. , -1000. , 3. ]) data 
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float641 data.replace(-999 , np.nan) 
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float641 2 data.replace([-999 , -1000 ], np.nan) 
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
对不同的值进行不同的替换,传入替换关系组成的列表或字典。 
 
1 data.replace([-999 , -1000 ], [np.nan, 0 ]) 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float641 data.replace({-999 : np.nan, -1000 : 0 }) 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64重命名轴索引 
跟Series的值一样,轴标签也可以函数或映射进行转换,从而得到一个新的对象。跟Series一样,轴标签也有一个map函数 
轴还可以就地修改,而无需新建一个数据结构 
 
1 2 3 data = pd.DataFrame(np.arange(12 ).reshape((3 , 4 )),                     index=['Ohio' , 'Colorado' , 'New York' ],                     columns=['one' , 'two' , 'three' , 'four' ]) 
1 2 transform = lambda  x: x[:4 ].upper() data.index.map (transform) 
Index(['OHIO', 'COLO', 'NEW '], dtype='object')1 2 data.index = data.index.map (transform) data 
 
one
 
two
 
three
 
four
 
 
 
OHIO
 
0
 
1
 
2
 
3
 
 
COLO
 
4
 
5
 
6
 
7
 
 
NEW
 
8
 
9
 
10
 
11
 
 
 
 
rename创建数据集的转换版,而不是修改原始数据。
1 data.rename(index=str .title, columns=str .upper) 
 
ONE
 
TWO
 
THREE
 
FOUR
 
 
 
Ohio
 
0
 
1
 
2
 
3
 
 
Colo
 
4
 
5
 
6
 
7
 
 
New
 
8
 
9
 
10
 
11
 
 
 
 
1 2 3 data.rename(index={'OHIO' : 'INDIANA' },             columns={'three' : 'peekaboo' }) 
 
one
 
two
 
peekaboo
 
four
 
 
 
INDIANA
 
0
 
1
 
2
 
3
 
 
COLO
 
4
 
5
 
6
 
7
 
 
NEW
 
8
 
9
 
10
 
11
 
 
 
 
1 2 3 data.rename(index={'OHIO' : 'INDIANA' }, inplace=True ) data 
 
one
 
two
 
three
 
four
 
 
 
INDIANA
 
0
 
1
 
2
 
3
 
 
COLO
 
4
 
5
 
6
 
7
 
 
NEW
 
8
 
9
 
10
 
11