1049人加入学习
(15人评价)
Python数据分析 - Pandas玩转Excel

Python数据分析轻松学

价格 免费

笔记,不错。

不能直接贴屏?!

图片要用文件上传模式,一般。

 

[展开全文]

一、涉及知识点

1、pandas读取已经存在的Excel文件

2、涉及的BIF操作:

df=pd.read_excel('F:/sdf/sdf/output.xlsx',header=None)

pd.to_excel('F:/sdf/sdf/output1.xlsx')

df.shape

df.columns

df.header(5)

df.tail(5)

二、问题点

1、读取xls文件异常(xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'Test Tim'

原因:源文件损坏,或者格式有问题(虽然是以.xls结尾,但实际上内容格式有问题的)

[展开全文]

一、涉及知识点

1、Excel和pandas创建excel格式的数据源

2、数据格式有很多,常见的如下:CSV文件、数据库表格、文件格式(Excel、PDF、Word等)、HTML文件、JSON文件、文本文件、XML文件

 

二、遇到的问题及原因

1、pandas库找不到(AttributeError: module 'pandas' has no attribute 'DataFrame'

原因:1)未安装pandas库

          2)pycharm中解释器配置错误(项目多时需要配置虚拟环境,每个虚拟环境中挂载的库都不一样,pycharm的setting中会有默认的配置,需检查是否错误)

 

[展开全文]

import pandas as pd

people=pd.dread_excle('C:/Temp/people.xlsx')

print(people.shape)

print(people.columns)

print('==================')

print(people.tail(3))

[展开全文]

import pandas as pd

df=pd.DataFrame[{'ID':[1,2,3],'Name':['Tim','Victor','N}]

df.to_excel('C:/Temp/output.xlsx')

print('Done!')

[展开全文]

饿,TIM老师的这个方程其实是可以化简的,也不是很复杂

import pandas as pd
import numpy as np

# def get_circumcircle_area(l,h):
#     r = np.sqrt(l**2+h**2)/2
#     return r**2*np.pi

# def get_circumcircle_area(l,h):
#     return np.pi*(l**2+h**2)/4
#
# def wrapper(row):
#      return get_circumcircle_area(row['Length'],row['Height'])

rects = pd.read_excel('D:/Temp/Rectangles.xlsx',index_col='ID')
rects['CA'] = rects.apply(lambda row:np.pi*(row['Length']**2+row['Height']**2)/4,axis=1)

print(rects)

 

这样的话一行就可以了

[展开全文]

https://www.jianshu.com/p/2d49cb87626b

对于group详细的解析

[展开全文]

字符串前加f是用于格式化字符串的,

print(f'#{row.ID}student{row.Name}has an invalid score{row.Score}.')

就相当于

print('# %d student{row.Name}has an invalid score %s.' % (row.ID,row.Score))

 

具体上网搜一下python 字符串格式化的三种方法就好了

[展开全文]
students['2017'].sort_values(ascending=True).plot.pie(fontsize=6,startangle=-270)

可以用在乱序的情况

[展开全文]
老师资料里matplotlib里面的代码应该用的不是同一个文件了,要改一下

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_excel('D:/Temp/Students10.xlsx')
students.sort_values(by='2017', inplace=True, ascending=False)
students.index = range(0, len(students))
print(students)

bar_width = 0.7
x_pos = np.arange(len(students) * 2, step=2)
plt.bar(x_pos, students['2016'], color='orange', width=bar_width)
plt.bar(x_pos + bar_width, students['2017'], color='red', width=bar_width)

plt.xticks(x_pos + bar_width / 2, students['Field'], rotation='90')
plt.title('International Student by Field', fontsize=16)
plt.xlabel('Field')
plt.ylabel('Number')
plt.tight_layout()
plt.show()
[展开全文]

matplotlib感觉和matlab很像

[展开全文]

def age_18_to_30(a):

      return  18<=a<=30

def  level_a(s):
     return  85<=s<=100

students=pd.read_excel('C:/temp/students.xlsx',index_col='ID')

students.loc[students['Age'].apply(age_18_to_30)].loc[students['Score'].apply(level_a)]

[展开全文]

任务13:绘制折线趋势图、叠加区域图

!!!!本课下载下来的excel附件名字是Order.xlsx. 编辑代码时注意修改!!!

本课代码:

import pandas as pd
import matplotlib.pyplot as plt

weeks = pd.read_excel('C:/Temp/Orders.xlsx', index_col='Week')
print(weeks)
print(weeks.columns)

#weeks.plot.area(y=['Accessories', 'Bikes', 'Clothing', 'Components', 'Grand Total'])
weeks.plot.bar(y=['Accessories', 'Bikes', 'Clothing', 'Components', 'Grand Total'], stacked=True)
plt.title('Sales Weekly Trend', fontsize=16, fontweight='bold')
plt.ylabel('Total', fontsize=12, fontweight='bold')
plt.xticks(weeks.index, fontsize=8)
plt.show()

打印结果:

       Accessories         Bikes      ...         Components   Grand Total
Week                                  ...                                 
1      9939.465500  2.258337e+06      ...       7.872110e+04  2.356639e+06
2     12626.660000  6.005350e+05      ...       0.000000e+00  6.204234e+05
3     14414.950000  5.547708e+05      ...       0.000000e+00  5.759616e+05
4     12924.580000  5.892557e+05      ...       0.000000e+00  6.083717e+05
5     40443.498516  5.749222e+06      ...       4.709014e+05  6.360041e+06
6     13735.460000  5.539423e+05      ...       0.000000e+00  5.753385e+05
7     13588.800000  6.053847e+05      ...       0.000000e+00  6.247596e+05
8     13997.810000  5.320056e+05      ...       0.000000e+00  5.526938e+05
9     52392.263204  4.701389e+06      ...       6.852023e+05  5.567191e+06
10    14276.640000  5.815496e+05      ...       0.000000e+00  6.037474e+05
11    13584.320000  6.169319e+05      ...       0.000000e+00  6.372021e+05
12    14128.770000  5.985606e+05      ...       0.000000e+00  6.204505e+05
13    34372.148628  5.154563e+06      ...       6.173474e+05  5.899177e+06
14    58097.712659  3.361458e+06      ...       5.296900e+05  4.041540e+06
15    16287.020000  6.655744e+05      ...       0.000000e+00  6.894546e+05
16    15990.960000  6.749113e+05      ...       0.000000e+00  6.991723e+05
17    15120.710000  6.581411e+05      ...       0.000000e+00  6.795171e+05
18    67753.596201  5.739679e+06      ...       1.091392e+06  7.059584e+06
19    15416.040000  7.537814e+05      ...       0.000000e+00  7.770648e+05
20    16113.010000  7.323520e+05      ...       0.000000e+00  7.571492e+05
21    15903.150000  7.380932e+05      ...       0.000000e+00  7.620668e+05
22    57090.139277  4.388471e+06      ...       1.137457e+06  5.746683e+06
23    11900.146000  8.310402e+05      ...       3.152596e+04  8.824418e+05
24    11336.820000  4.095025e+05      ...       0.000000e+00  4.251703e+05
25    10573.210000  4.065446e+05      ...       0.000000e+00  4.232483e+05
26    29376.532664  3.101888e+06      ...       7.994845e+05  4.039316e+06
27    72003.211677  4.932875e+06      ...       1.043760e+06  6.191153e+06
28    11621.900000  4.086479e+05      ...       0.000000e+00  4.258483e+05
29    11640.460000  4.056193e+05      ...       0.000000e+00  4.225788e+05
30    12359.920000  4.156482e+05      ...       0.000000e+00  4.325679e+05
31    81276.364307  5.475372e+06      ...       1.593068e+06  7.363333e+06
32    15208.996500  1.494933e+06      ...       1.025128e+05  1.623856e+06
33    13187.100000  3.938290e+05      ...       0.000000e+00  4.135150e+05
34    13046.980000  4.383911e+05      ...       0.000000e+00  4.571066e+05
35    50187.449516  3.732927e+06      ...       5.842441e+05  4.495606e+06
36    15063.164000  1.173016e+06      ...       6.180437e+04  1.260751e+06
37    11505.920000  4.814773e+05      ...       0.000000e+00  4.985188e+05
38    13170.670000  4.883039e+05      ...       0.000000e+00  5.066807e+05
39    11278.600000  3.675468e+05      ...       0.000000e+00  3.845985e+05
40    70170.520320  7.878476e+06      ...       1.177747e+06  9.303972e+06
41    12441.460000  4.658615e+05      ...       0.000000e+00  4.834701e+05
42    12924.950000  4.715758e+05      ...       0.000000e+00  4.907546e+05
43    13314.310000  4.931541e+05      ...       0.000000e+00  5.127108e+05
44    62745.172581  5.156642e+06      ...       9.100149e+05  6.286165e+06
45    19630.003108  2.199453e+06      ...       1.478740e+05  2.381990e+06
46    14822.180000  7.092621e+05      ...       0.000000e+00  7.304781e+05
47    13728.300000  6.623922e+05      ...       0.000000e+00  6.830832e+05
48    36075.469500  3.240381e+06      ...       2.592590e+05  3.616344e+06
49    13642.418500  1.227307e+06      ...       2.369806e+04  1.273325e+06
50    12388.980000  5.217178e+05      ...       0.000000e+00  5.419306e+05
51    12852.400000  5.264518e+05      ...       0.000000e+00  5.455141e+05
52    13085.680000  5.412245e+05      ...       0.000000e+00  5.604452e+05
53    31315.891268  4.790804e+06      ...       4.568886e+05  5.375680e+06

[53 rows x 5 columns]
Index(['Accessories', 'Bikes', 'Clothing', 'Components', 'Grand Total'], dtype='object')

[展开全文]

 Series序列类似于dictionary

生成一个序列

d={'x':100,'y':200,'z':300}#字典/键值对

print(d.keys())#所有键

print(d.values())#所有值

print(d['x'])#具体一个值

s1=pd.Series(d)#字典转为序列

-----------------------------------

第二种创建序列方法

s1=pd.Series([100,200,300],index=['x','y','z'])

s1=pd.Series(data,index,name)#序列既不是行也不是列

s2=pd.Series(data,index,name)#序列只有加入DataFrame才能有行和列的区别,而这种区别是建立在加入时的方法不同上:两种加入法:dictonary 或者list

 

pd.DataFrame({s1.name:s1,s2.name:s2})#以dictonary加入当作列         

pd.DataFrame([s1,s2])   

pd.DataFrame(list)#以list加入当作行

index的作用:

对齐原则

有共同值则对齐,没有共同值的位置给一个NaN空

--------------------------------------

下节学习:使用Series进行快速填充

[展开全文]

任务12:绘制饼图

绘制饼图是针对一个seriers

students['2017'].plot.pie(fontsize=8, counterclock=False)

注:counterclock=False:顺时针排列

本节代码:

import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_excel('C:/Temp/Students.xlsx', index_col='From')
print(students)

students['2017'].plot.pie(fontsize=8, counterclock=False)
plt.title('Source of International Students', fontsize=16, fontweight='bold')
plt.ylabel('2017', fontsize=12, fontweight='bold')
plt.show()

打印结果:

                Rank    2016    2017
From                                
China              1  328547  350755
India              2  165918  186267
South Korea        3   61007   58663
Saudi Arabia       4   61287   52611
Canada             5   26973   27065
Vietnam            6   21403   22438
Taiwan             7   21127   21516
Japan              8   19060   18780
Mexico             9   16733   16835
Brazil            10   19370   13089
Iran              11   12269   12643
Nigeria           12   10674   11710
Nepal             13    9662   11607
United Kingdom    14   11599   11489
Turkey            15   10691   10586
Germany           16   10145   10169
Kuwait            17    9772    9825
France            18    8764    8814
Indonesia         19    8727    8776
Venezuela         20    8267    8540
Malaysia          21    7834    8247
Colombia          22    7815    7982
Hong Kong         23    7923    7547
Spain             24    6640    7164
Bangladesh        25    6513    7143

[展开全文]

任务11:叠加柱状图,水平柱状图

在DataFrame里面新增加一列“Total”:

users['Total'] = users['Oct'] + users['Nov'] + users['Dec']

生成叠加柱状图:

users.plot.bar(x='Name', y=['Oct', 'Nov', 'Dec'], stacked=True, title='User Behavior')

注:Stached=True

生成水平柱状图:

users.plot.barh(x='Name', y=['Oct', 'Nov', 'Dec'], stacked=True, title='User Behavior')

注:users.plot.barh

本节代码:

import pandas as pd
import matplotlib.pyplot as plt

users = pd.read_excel('C:/Temp/Users.xlsx')
users['Total'] = users['Oct'] + users['Nov'] + users['Dec']
users.sort_values(by='Total', inplace=True, ascending=True)
print(users)

users.plot.barh(x='Name', y=['Oct', 'Nov', 'Dec'], stacked=True, title='User Behavior')
plt.tight_layout()
plt.show()

打印结果

    ID      Name  Oct  Nov  Dec  Total
11  12  User_012    7   11   11     29
13  14  User_014    9   11   11     31
0    1  User_001    7   15   10     32
1    2  User_002   10   13   10     33
5    6  User_006   11    9   14     34
9   10  User_010    9   15   10     34
3    4  User_004    8   14   13     35
17  18  User_018    8   13   14     35
4    5  User_005    8   11   16     35
19  20  User_020    9   13   13     35
7    8  User_008    9   12   16     37
8    9  User_009   11   11   15     37
10  11  User_011    9   14   15     38
18  19  User_019    8   16   15     39
2    3  User_003   12   10   17     39
16  17  User_017   10   13   16     39
12  13  User_013    7   15   18     40
6    7  User_007   11   14   17     42
14  15  User_015   10   15   18     43
15  16  User_016   12   14   17     43

 

end.

 

[展开全文]

read_excel()方法可以使用xlrd Python模块读取Excel 2003(.xls)和Excel 2007+(.xlsx)文件。 to_excel()实例方法用于将DataFrame保存到Excel

people.shape 显示(行数,列数)

people.columns显示Index[Titile]

people.head(3)看前三行数据

people.coumns=[]手动设置header

header默认值为0,当一行为错误数据时可以将header改为1实现正确

 

 

 

[展开全文]

授课教师

Tim老师

课程特色

视频(30)
下载资料(25)

学员动态