Description: common pandas operations, based on the book *Pandas Cookbook*. Uploaded 2019-08-31.

movie.get_dtype_counts()  # output the number of columns with each specific data type
movie.select_dtypes(include=['int']).head()  # select only integer columns
movie.filter(like='facebook').head()  # the like parameter matches columns containing this string
movie.filter(regex='\d').head()  # filter supports regular expressions
movie.filter(items=['actor_1_name', 'asdf'])  # items takes a list of exact column names
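As a minimal sketch of these column selectors (the toy frame and its column names below are made up to mirror the movie dataset, not taken from it):

```python
import pandas as pd

# toy stand-in for the movie dataset (hypothetical columns)
df = pd.DataFrame({
    'actor_1_name': ['A', 'B'],
    'actor_1_facebook_likes': [1000, 2000],
    'duration': [120.0, 95.0],
})

ints = df.select_dtypes(include=['int'])           # only integer columns
fb = df.filter(like='facebook')                    # names containing 'facebook'
exact = df.filter(items=['actor_1_name', 'asdf'])  # exact names; missing ones are silently dropped
```

Note that `filter(items=...)` does not raise on the nonexistent `'asdf'` column; it simply keeps the intersection.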
Operations on the entire DataFrame

movie.shape
movie.count()
movie.min()  # the minimum of each column
movie.isnull().any().any()  # check whether the whole DataFrame has any missing value by chaining two any() calls
(college_ugds_.head() + .00501) // .01  # arithmetic on a DataFrame is applied element-wise
# the main way to count missing values is the isnull method
college_ugds_.isnull().sum()
college_ugds_cumsum.sort_values('UGDS_HISP', ascending=False)  # sort by a column
college_ugds_.dropna(how='all')  # drop rows in which every value is missing
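A small runnable sketch of the missing-value idioms above (the frame and values are invented stand-ins for college_ugds_):

```python
import numpy as np
import pandas as pd

# toy stand-in for college_ugds_ with some missing values
df = pd.DataFrame({'UGDS_WHITE': [0.5, np.nan, np.nan],
                   'UGDS_HISP':  [0.2, 0.3, np.nan]})

has_missing = df.isnull().any().any()  # True if any cell anywhere is missing
counts = df.isnull().sum()             # missing count per column
cleaned = df.dropna(how='all')         # drop only rows where every value is missing
```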
Data analysis
1. Inspect the data
college = pd.read_csv('data/college.csv')
college.head()
college.shape
display(college.describe(include=[np.number]).T)  # summarize the numeric columns and transpose; highly recommended!
Selecting subsets of data
Indexers are placed directly after a Series or DataFrame:
The .iloc indexer selects only by integer location and works similarly to Python lists.
The .loc indexer selects only by index label, which is similar to how Python dictionaries work.
Both work on rows as well as columns.
college.iloc[:, [4, 6]].head()  # all rows of two columns
college.loc[:, ['WOMENONLY', 'SATVRMID']]
college.iloc[[60, 99, 3]].index.tolist()  # .index.tolist() extracts the index labels as a list
college.iloc[5, -4]  # integer-position indexing
college.loc['The University of Alabama', 'PCTFLOAN']  # label indexing
college[10:20:2]  # slice rows with a step
city = college['CITY']
city[10:20:2]  # a Series can be sliced the same way
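The .loc/.iloc contrast can be sketched on a toy frame (index labels and columns below are invented, loosely echoing the college dataset):

```python
import pandas as pd

df = pd.DataFrame({'CITY': ['Tuscaloosa', 'Fairbanks', 'Tempe'],
                   'STABBR': ['AL', 'AK', 'AZ']},
                  index=['Alabama A', 'Alaska B', 'Arizona C'])

by_pos = df.iloc[[0, 2], [0]]           # select by integer position, like a list
by_label = df.loc['Alaska B', 'CITY']   # select by label, like a dictionary
sliced = df[0:3:2]                      # row slicing with a step works on the frame itself
```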
Boolean indexing
Boolean indexing, also called boolean selection, selects rows by supplying boolean values. These boolean values are usually stored in a Series. Different conditions can be combined with and/or/not (&, |, ~), but note that in Python the bitwise operators have higher precedence than the comparison operators, so parentheses are required.
criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)  # the parentheses are required
final = criteria1 & criteria2 & criteria3
movie[final]  # used as an index, this directly selects the rows where the value is True
employee.BASE_SALARY.between(80000, 120000)  # select with between
# exclude the 5 most common departments
criteria = ~employee.DEPARTMENT.isin(top_5_depts)
employee[criteria].head()
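A self-contained sketch of the precedence point (data invented; only two of the three criteria are reproduced):

```python
import pandas as pd

df = pd.DataFrame({'imdb_score': [8.5, 6.0, 9.1],
                   'content_rating': ['PG-13', 'R', 'PG-13']})

# parentheses are required: & binds tighter than > and ==
crit = (df.imdb_score > 8) & (df.content_rating == 'PG-13')
selected = df[crit]  # keeps only rows where crit is True
```

Without the parentheses, `df.imdb_score > 8 & df.content_rating == 'PG-13'` would be parsed as a chained comparison around `8 & df.content_rating` and fail.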
When the conditions get complex, use the DataFrame query method
df.query('A > B')  # equivalent to df[df.A > df.B]
# read the employee data and choose the departments and columns to select
employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()
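The same query pattern on invented data (department names abbreviated here), showing how @ pulls in a Python variable:

```python
import pandas as pd

emp = pd.DataFrame({'DEPARTMENT': ['HPD', 'HFD', 'HPD'],
                    'GENDER': ['Female', 'Female', 'Male'],
                    'BASE_SALARY': [90000, 70000, 100000]})

depts = ['HPD']
# @depts refers to the Python list above from inside the query string
result = emp.query("DEPARTMENT in @depts and GENDER == 'Female' "
                   "and 80000 <= BASE_SALARY <= 120000")
```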
To mask rows of a DataFrame so that every row satisfying the condition disappears:
criteria = c1 | c2
movie.mask(criteria).head()
To replace the values that do not satisfy a condition, use the pandas where method
s = pd.Series(range(5))
s.where(s > 0)
s.where(s > 1, 10)
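where and mask are mirror images, which a short sketch makes concrete:

```python
import pandas as pd

s = pd.Series(range(5))
kept = s.where(s > 1, 10)  # values failing the condition are replaced by 10
masked = s.mask(s > 2)     # values meeting the condition become NaN
```

So `kept` is [10, 10, 2, 3, 4], and `masked` blanks out the two values above 2.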
split-apply-combine
A common data analysis pattern of breaking up data into independent, manageable chunks, independently applying functions to these chunks, and then combining the results back together.
By default, agg operates on all of the remaining columns.
[Figure: split-apply-combine illustrated on an account/order/ext price table — the rows are split by account, sum is applied to ext price, and the results are combined; transform instead returns an Order Total value for every input row.]
In pandas, this pattern maps onto groupby
# group by AIRLINE and use the agg method, passing in the column to aggregate and the aggregating function
flights.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'}).head()
Or index the column to select and pass the aggregating function to agg as a string
flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
# there can be multiple grouping columns, multiple selected columns, and multiple aggregating functions
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)
# different functions can be applied to the same column
group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()
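Both agg styles can be run on a tiny invented flights-like frame:

```python
import pandas as pd

flights = pd.DataFrame({'AIRLINE': ['AA', 'AA', 'UA', 'UA'],
                        'CANCELLED': [0, 1, 0, 0],
                        'ARR_DELAY': [10.0, 30.0, -5.0, 15.0]})

# one aggregating function passed as a string ...
by_airline = flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean')

# ... or a dict mapping columns to lists of functions,
# which yields MultiIndex columns like ('CANCELLED', 'sum')
table = flights.groupby('AIRLINE').agg({'CANCELLED': ['sum', 'size'],
                                        'ARR_DELAY': ['mean']})
```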
# in the example below, max_deviation is a user-defined function
def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()
college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()
grouped = college.groupby(['STABBR', 'RELAFFIL'])
grouped.ngroups  # the ngroups attribute gives the number of groups
list(grouped.groups.keys())
filter() selects subsets of the data; transform() produces new data
if we want to get a single value for each group -> use aggregate
if we want to get a subset of the input rows -> use filter
if we want to get a new value for each input row -> use transform
For complex operations on a column, use the apply() function
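The aggregate/filter/transform distinction in one runnable sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'team': ['a', 'a', 'b', 'b', 'b'],
                   'score': [1, 3, 2, 4, 6]})

g = df.groupby('team')['score']
per_group = g.agg('sum')                               # one value per group
totals = g.transform('sum')                            # one value per input row
big = df.groupby('team').filter(lambda x: len(x) > 2)  # keep or drop whole groups
```

transform broadcasts each group's sum back onto every row of that group, while filter keeps only the rows belonging to groups with more than two members.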
Data cleaning
The stack method converts the column values of each row into row values; the unstack method reverses it
state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
state_fruit.stack()  # then use reset_index to turn the result into a DataFrame; assigning to .columns renames the columns
# rename_axis can also name the different index levels
state_fruit.stack().rename_axis(['state', 'fruit']).reset_index(name='weight')
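The same chain end-to-end on a small invented state_fruit table:

```python
import pandas as pd

state_fruit = pd.DataFrame({'Apple': [12, 10], 'Orange': [10, 2]},
                           index=['Texas', 'Arizona'])

tidy = (state_fruit.stack()                 # wide -> long Series with a MultiIndex
        .rename_axis(['state', 'fruit'])    # name the two index levels
        .reset_index(name='weight'))        # back to a DataFrame, naming the value column
```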
To read only certain columns with read_csv, specify the usecols parameter
usecol_func = lambda x: 'UGDS' in x or x == 'INSTNM'
college = pd.read_csv('data/college.csv', usecols=usecol_func)
Pivot tables: pivot_table operates on distinct column names
1. The index parameter takes a column (or columns) that will not be pivoted and whose unique values will be placed in the index
2. The columns parameter takes a column (or columns) that will be pivoted and whose unique values will be made into column names
3. The values parameter takes a column (or columns) that will be aggregated
4. The aggfunc parameter determines how the columns in the values parameter get aggregated
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
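The four parameters in one minimal example (region/product/amount data invented):

```python
import pandas as pd

sales = pd.DataFrame({'region': ['N', 'N', 'S', 'S'],
                      'product': ['x', 'y', 'x', 'x'],
                      'amount': [10, 20, 30, 40]})

pt = sales.pivot_table(index='region',    # unique values become the row index
                       columns='product', # unique values become column names
                       values='amount',   # the column being aggregated
                       aggfunc='sum')     # how the values are aggregated
```

Cells with no matching rows (here region S, product y) come out as NaN.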
Concatenating DataFrames
concat gives the flexibility to join based on the axis (all rows or all columns)
append is the specific case (axis=0, join='outer') of concat
join is based on the indexes (set by set_index); the how variable is one of ['left', 'right', 'inner', 'outer']
merge is based on particular columns of each of the two DataFrames; these columns are passed as left_on, right_on, or on
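The concat-vs-merge distinction on two tiny invented frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'r': [3, 4]})

# concat stacks along an axis; append was the axis=0, join='outer' special case
stacked = pd.concat([left, right], axis=0, join='outer')

# merge aligns on column values instead of position or index
merged = left.merge(right, on='key', how='inner')
```

stacked has four rows with NaN where a frame lacked a column; merged keeps only the key 'b' shared by both.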
sex_age = wl_melt['sex_age'].str.split(expand=True)  # multiple string methods can be applied at this point
movie.insert(0, 'id', np.arange(len(movie)))  # insert a new column
New rows can be added directly by assigning with loc
new_data_list = ['Zach', 3]
names.loc[4] = new_data_list
which is equivalent to
names.loc[4] = ['Zach', 3]
names.append({'Name': 'Aria', 'Age': 1}, ignore_index=True)  # append can add several rows at once, in which case they must be placed in a list
data_dict = bball_16.iloc[0].to_dict()
# the keys parameter names the two DataFrames; the names parameter names each index level
pd.concat(s_list, keys=[2016, 2017], names=['Year', 'Symbol'])
pres_41_45['President'].value_counts()
Time
pd.to_datetime is capable of converting entire lists or Series of strings or integers to Timestamps
# import the datetime module and create date, time, and datetime objects
date = datetime.date(year=2013, month=6, day=7)
# date is 2013-06-07
# time is 12:30:19.463198
# datetime is 2013-06-07 12:30:19.463198
s = pd.Series(['12-5-2015', '14-1-2013', '20/12/2017', '40/23/2017'])
pd.to_datetime(s, dayfirst=True, errors='coerce')
pd.Timestamp(year=2012, month=12, day=21, hour=5, minute=10, second=8, microsecond=99)
pd.Timestamp('2016/1/10')
pd.Timestamp('2016-01-05T05:34:43.123456789')
pd.Timestamp(500)  # an integer is interpreted as the number of nanoseconds after 1970-01-01 00:00:00
pd.to_datetime('2015-5-13')  # the related pd.to_timedelta works similarly
The to_timedelta function produces a Timedelta object
pd.Timedelta('12 days 5 hours 3 minutes 123456789 nanoseconds')
time_strings = ['2 days 24 minutes 89.67 seconds', '00:45:23.6']
pd.to_timedelta(time_strings)
# Timedelta objects can be added to and subtracted from Timestamps, and can even be divided by one another to return a float
pd.Timedelta('12 days 5 hours 3 minutes') * 2
ts = pd.Timestamp('2016-10-1 4:23:23.9')
ts.ceil('h')  # Timestamp('2016-10-01 05:00:00')
td.total_seconds()
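A short runnable sketch of Timestamp/Timedelta arithmetic (values chosen here for illustration):

```python
import pandas as pd

ts = pd.Timestamp('2016-10-01 04:23:23.9')
td = pd.to_timedelta('2 days 4 hours')

later = ts + td         # Timestamp +/- Timedelta gives a Timestamp
ratio = (td * 2) / td   # dividing one Timedelta by another returns a float
ceiled = ts.ceil('h')   # round up to the next whole hour
```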
Setting the time column as the index at import time speeds up later selection, and a time index supports partial matching
# REPORTED_DATE is set as the row index, so intelligent Timestamp-based slicing becomes possible
crime = crime.set_index('REPORTED_DATE')  # then .sort_index()
crime.loc['2016-05-12 16:45:00']
# select the data up to 2012-06
crime_sort.loc[:'2012-06']
crime.loc['2016-05-12']
# a whole month, a whole year, or a given hour of a given day can also be selected
crime.loc['2016-05'].shape
crime.loc['2016'].shape
crime.loc['2016-05-12 03'].shape
crime.loc['Dec 2015'].sort_index()
# use the at_time method to select a specific time of day
crime.at_time('5:47').head()
crime.plot(figsize=(16, 4), title='All Denver crimes')
crime_sort.resample('QS-MAR')[['IS_CRIME', 'IS_TRAFFIC']].sum().head()
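The same partial-string selection, at_time, and resampling run end-to-end on a tiny invented crime-like frame (using the 'MS' month-start alias rather than the quarterly alias above):

```python
import pandas as pd

idx = pd.to_datetime(['2016-05-12 03:15', '2016-05-12 16:45', '2016-06-01 05:47'])
crime = pd.DataFrame({'IS_CRIME': [1, 0, 1]}, index=idx).sort_index()

may = crime.loc['2016-05']                     # partial string match: the whole month
at_547 = crime.at_time('5:47')                 # rows at that exact time of day
monthly = crime.resample('MS')['IS_CRIME'].sum()  # monthly totals
```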
Concept       Scalar Class  Array Class     pandas Data Type                       Primary Creation Method
Date times    Timestamp     DatetimeIndex   datetime64[ns] or datetime64[ns, tz]   to_datetime or date_range
Time deltas   Timedelta     TimedeltaIndex  timedelta64[ns]                        to_timedelta or timedelta_range
Time spans    Period        PeriodIndex     period[freq]                           Period or period_range
Date offsets  DateOffset    None            None                                   DateOffset
title = 'Denver Crimes and Traffic Accidents per Year'
crime['REPORTED_DATE'].dt.year.value_counts() \
    .sort_index() \
    .plot(kind='barh', title=title)
import seaborn as sns
sns.heatmap(crime_table, cmap='Greys')
Matplotlib offers two interfaces for plotting: the stateful (procedural) one and the object-oriented one; the object-oriented interface is recommended
x = [-3, 5, 7]
y = [10, 2, 5]
fig, ax = plt.subplots(figsize=(15, 3))
ax.plot(x, y)
ax.set_xlim(0, 10)
ax.set_ylim(3, 8)
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_title('line plot')
fig.suptitle('Figure Title', size=20, y=1.03)
med_budget_roll.index.values  # convert the data to numpy, then plot with plt
Plotting with pandas
df = pd.DataFrame(index=['Atiya', 'Abbas', 'Cornelia', 'Stephanie', 'Monte'],
                  data={'Apples': [20, 10, 40, 20, 50],
                        'Oranges': [35, 40, 25, 19, 33]})
color = ['.2', '.7']
df.plot(kind='bar', color=color, figsize=(16, 4))
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 4))
fig.suptitle('Two Variable Plots', size=20, y=1.02)
df.plot(kind='line', color=color, ax=ax1, title='Line plot')
df.plot(x='Apples', y='Oranges', kind='scatter', color=color, ax=ax2, title='Scatterplot')
df.plot(kind='bar', color=color, ax=ax3, title='Bar plot')