475人加入学习
(9人评价)
Python数据分析 - Pandas玩转Excel

Python数据分析轻松学

价格 免费

饿,TIM老师的这个方程其实是可以化简的,也不是很复杂

import pandas as pd
import numpy as np

# def get_circumcircle_area(l,h):
#     r = np.sqrt(l**2+h**2)/2
#     return r**2*np.pi

# def get_circumcircle_area(l,h):
#     return np.pi*(l**2+h**2)/4
#
# def wrapper(row):
#      return get_circumcircle_area(row['Length'],row['Height'])

rects = pd.read_excel('D:/Temp/Rectangles.xlsx',index_col='ID')
rects['CA'] = rects.apply(lambda row:np.pi*(row['Length']**2+row['Height']**2)/4,axis=1)

print(rects)

 

这样的话一行就可以了

[展开全文]

https://www.jianshu.com/p/2d49cb87626b

对于group详细的解析

[展开全文]

字符串前加f是用于格式化字符串的,

print(f'#{row.ID}student{row.Name}has an invalid score{row.Score}.')

就相当于

print('# %d student{row.Name}has an invalid score %s.' % (row.ID,row.Score))

 

具体上网搜一下python 字符串格式化的三种方法就好了

[展开全文]
students['2017'].sort_values(ascending=True).plot.pie(fontsize=6,startangle=-270)

可以用在乱序的情况

[展开全文]
老师资料里matplotlib里面的代码应该用的不是同一个文件了,要改一下

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_excel('D:/Temp/Students10.xlsx')
students.sort_values(by='2017', inplace=True, ascending=False)
students.index = range(0, len(students))
print(students)

bar_width = 0.7
x_pos = np.arange(len(students) * 2, step=2)
plt.bar(x_pos, students['2016'], color='orange', width=bar_width)
plt.bar(x_pos + bar_width, students['2017'], color='red', width=bar_width)

plt.xticks(x_pos + bar_width / 2, students['Field'], rotation='90')
plt.title('International Student by Field', fontsize=16)
plt.xlabel('Field')
plt.ylabel('Number')
plt.tight_layout()
plt.show()
[展开全文]

matplotlib感觉和matlab很像

[展开全文]

def age_18_to_30(a):

      return  18<=a<=30

def  level_a(s):
     return  85<=s<=100

students=pd.read_excel('C:/temp/students.xlsx',index_col='ID')

students.loc[students['Age'].apply(age_18_to_30)].loc[students['Score'].apply(level_a)]

[展开全文]

任务13:绘制折线趋势图、叠加区域图

!!!!本课下载下来的excel附件名字是Order.xlsx. 编辑代码时注意修改!!!

本课代码:

import pandas as pd
import matplotlib.pyplot as plt

weeks = pd.read_excel('C:/Temp/Orders.xlsx', index_col='Week')
print(weeks)
print(weeks.columns)

#weeks.plot.area(y=['Accessories', 'Bikes', 'Clothing', 'Components', 'Grand Total'])
weeks.plot.bar(y=['Accessories', 'Bikes', 'Clothing', 'Components', 'Grand Total'], stacked=True)
plt.title('Sales Weekly Trend', fontsize=16, fontweight='bold')
plt.ylabel('Total', fontsize=12, fontweight='bold')
plt.xticks(weeks.index, fontsize=8)
plt.show()

打印结果:

       Accessories         Bikes      ...         Components   Grand Total
Week                                  ...                                 
1      9939.465500  2.258337e+06      ...       7.872110e+04  2.356639e+06
2     12626.660000  6.005350e+05      ...       0.000000e+00  6.204234e+05
3     14414.950000  5.547708e+05      ...       0.000000e+00  5.759616e+05
4     12924.580000  5.892557e+05      ...       0.000000e+00  6.083717e+05
5     40443.498516  5.749222e+06      ...       4.709014e+05  6.360041e+06
6     13735.460000  5.539423e+05      ...       0.000000e+00  5.753385e+05
7     13588.800000  6.053847e+05      ...       0.000000e+00  6.247596e+05
8     13997.810000  5.320056e+05      ...       0.000000e+00  5.526938e+05
9     52392.263204  4.701389e+06      ...       6.852023e+05  5.567191e+06
10    14276.640000  5.815496e+05      ...       0.000000e+00  6.037474e+05
11    13584.320000  6.169319e+05      ...       0.000000e+00  6.372021e+05
12    14128.770000  5.985606e+05      ...       0.000000e+00  6.204505e+05
13    34372.148628  5.154563e+06      ...       6.173474e+05  5.899177e+06
14    58097.712659  3.361458e+06      ...       5.296900e+05  4.041540e+06
15    16287.020000  6.655744e+05      ...       0.000000e+00  6.894546e+05
16    15990.960000  6.749113e+05      ...       0.000000e+00  6.991723e+05
17    15120.710000  6.581411e+05      ...       0.000000e+00  6.795171e+05
18    67753.596201  5.739679e+06      ...       1.091392e+06  7.059584e+06
19    15416.040000  7.537814e+05      ...       0.000000e+00  7.770648e+05
20    16113.010000  7.323520e+05      ...       0.000000e+00  7.571492e+05
21    15903.150000  7.380932e+05      ...       0.000000e+00  7.620668e+05
22    57090.139277  4.388471e+06      ...       1.137457e+06  5.746683e+06
23    11900.146000  8.310402e+05      ...       3.152596e+04  8.824418e+05
24    11336.820000  4.095025e+05      ...       0.000000e+00  4.251703e+05
25    10573.210000  4.065446e+05      ...       0.000000e+00  4.232483e+05
26    29376.532664  3.101888e+06      ...       7.994845e+05  4.039316e+06
27    72003.211677  4.932875e+06      ...       1.043760e+06  6.191153e+06
28    11621.900000  4.086479e+05      ...       0.000000e+00  4.258483e+05
29    11640.460000  4.056193e+05      ...       0.000000e+00  4.225788e+05
30    12359.920000  4.156482e+05      ...       0.000000e+00  4.325679e+05
31    81276.364307  5.475372e+06      ...       1.593068e+06  7.363333e+06
32    15208.996500  1.494933e+06      ...       1.025128e+05  1.623856e+06
33    13187.100000  3.938290e+05      ...       0.000000e+00  4.135150e+05
34    13046.980000  4.383911e+05      ...       0.000000e+00  4.571066e+05
35    50187.449516  3.732927e+06      ...       5.842441e+05  4.495606e+06
36    15063.164000  1.173016e+06      ...       6.180437e+04  1.260751e+06
37    11505.920000  4.814773e+05      ...       0.000000e+00  4.985188e+05
38    13170.670000  4.883039e+05      ...       0.000000e+00  5.066807e+05
39    11278.600000  3.675468e+05      ...       0.000000e+00  3.845985e+05
40    70170.520320  7.878476e+06      ...       1.177747e+06  9.303972e+06
41    12441.460000  4.658615e+05      ...       0.000000e+00  4.834701e+05
42    12924.950000  4.715758e+05      ...       0.000000e+00  4.907546e+05
43    13314.310000  4.931541e+05      ...       0.000000e+00  5.127108e+05
44    62745.172581  5.156642e+06      ...       9.100149e+05  6.286165e+06
45    19630.003108  2.199453e+06      ...       1.478740e+05  2.381990e+06
46    14822.180000  7.092621e+05      ...       0.000000e+00  7.304781e+05
47    13728.300000  6.623922e+05      ...       0.000000e+00  6.830832e+05
48    36075.469500  3.240381e+06      ...       2.592590e+05  3.616344e+06
49    13642.418500  1.227307e+06      ...       2.369806e+04  1.273325e+06
50    12388.980000  5.217178e+05      ...       0.000000e+00  5.419306e+05
51    12852.400000  5.264518e+05      ...       0.000000e+00  5.455141e+05
52    13085.680000  5.412245e+05      ...       0.000000e+00  5.604452e+05
53    31315.891268  4.790804e+06      ...       4.568886e+05  5.375680e+06

[53 rows x 5 columns]
Index(['Accessories', 'Bikes', 'Clothing', 'Components', 'Grand Total'], dtype='object')

[展开全文]

 Series序列类似于dictionary

生成一个序列

d={'x':100,'y':200,'z':300}#字典/键值对

print(d.keys())#所有键

print(d.values())#所有值

print(d['x'])#具体一个值

s1=pd.Series(d)#字典转为序列

-----------------------------------

第二种创建序列方法

s1=pd.Series([100,200,300],index=['x','y','z'])

s1=pd.Series(data,index,name)#序列既不是行也不是列

s2=pd.Series(data,index,name)#序列只有加入DataFrame才能有行和列的区别,而这种区别是建立在加入时的方法不同上:两种加入法:dictonary 或者list

 

pd.DataFrame({s1.name:s1,s2.name:s2})#以dictonary加入当作列         

pd.DataFrame([s1,s2])   

pd.DataFrame(list)#以list加入当作行

index的作用:

对齐原则

有共同值则对齐,没有共同值的位置给一个NaN空

--------------------------------------

下节学习:使用Series进行快速填充

[展开全文]

任务12:绘制饼图

绘制饼图是针对一个seriers

students['2017'].plot.pie(fontsize=8, counterclock=False)

注:counterclock=False:顺时针排列

本节代码:

import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_excel('C:/Temp/Students.xlsx', index_col='From')
print(students)

students['2017'].plot.pie(fontsize=8, counterclock=False)
plt.title('Source of International Students', fontsize=16, fontweight='bold')
plt.ylabel('2017', fontsize=12, fontweight='bold')
plt.show()

打印结果:

                Rank    2016    2017
From                                
China              1  328547  350755
India              2  165918  186267
South Korea        3   61007   58663
Saudi Arabia       4   61287   52611
Canada             5   26973   27065
Vietnam            6   21403   22438
Taiwan             7   21127   21516
Japan              8   19060   18780
Mexico             9   16733   16835
Brazil            10   19370   13089
Iran              11   12269   12643
Nigeria           12   10674   11710
Nepal             13    9662   11607
United Kingdom    14   11599   11489
Turkey            15   10691   10586
Germany           16   10145   10169
Kuwait            17    9772    9825
France            18    8764    8814
Indonesia         19    8727    8776
Venezuela         20    8267    8540
Malaysia          21    7834    8247
Colombia          22    7815    7982
Hong Kong         23    7923    7547
Spain             24    6640    7164
Bangladesh        25    6513    7143

[展开全文]

任务11:叠加柱状图,水平柱状图

在DataFrame里面新增加一列“Total”:

users['Total'] = users['Oct'] + users['Nov'] + users['Dec']

生成叠加柱状图:

users.plot.bar(x='Name', y=['Oct', 'Nov', 'Dec'], stacked=True, title='User Behavior')

注:Stached=True

生成水平柱状图:

users.plot.barh(x='Name', y=['Oct', 'Nov', 'Dec'], stacked=True, title='User Behavior')

注:users.plot.barh

本节代码:

import pandas as pd
import matplotlib.pyplot as plt

users = pd.read_excel('C:/Temp/Users.xlsx')
users['Total'] = users['Oct'] + users['Nov'] + users['Dec']
users.sort_values(by='Total', inplace=True, ascending=True)
print(users)

users.plot.barh(x='Name', y=['Oct', 'Nov', 'Dec'], stacked=True, title='User Behavior')
plt.tight_layout()
plt.show()

打印结果

    ID      Name  Oct  Nov  Dec  Total
11  12  User_012    7   11   11     29
13  14  User_014    9   11   11     31
0    1  User_001    7   15   10     32
1    2  User_002   10   13   10     33
5    6  User_006   11    9   14     34
9   10  User_010    9   15   10     34
3    4  User_004    8   14   13     35
17  18  User_018    8   13   14     35
4    5  User_005    8   11   16     35
19  20  User_020    9   13   13     35
7    8  User_008    9   12   16     37
8    9  User_009   11   11   15     37
10  11  User_011    9   14   15     38
18  19  User_019    8   16   15     39
2    3  User_003   12   10   17     39
16  17  User_017   10   13   16     39
12  13  User_013    7   15   18     40
6    7  User_007   11   14   17     42
14  15  User_015   10   15   18     43
15  16  User_016   12   14   17     43

 

end.

 

[展开全文]

read_excel()方法可以使用xlrd Python模块读取Excel 2003(.xls)和Excel 2007+(.xlsx)文件。 to_excel()实例方法用于将DataFrame保存到Excel

people.shape 显示(行数,列数)

people.columns显示Index[Titile]

people.head(3)看前三行数据

people.coumns=[]手动设置header

header默认值为0,当一行为错误数据时可以将header改为1实现正确

 

 

 

[展开全文]

任务12:绘制饼图时,需要加上:plt.axis('equal')确保饼图是一个正圆,否则为椭圆

[展开全文]

 任务10:绘制分组柱图,深度优化图标

本节大部分内容各位上节课的复习内容。

import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_excel('C:/Temp/Students.xlsx')
students.sort_values(by='2017', inplace=True, ascending=False)
print(students)
students.plot.bar(x='Field', y=['2016', '2017'], color=['orange', 'red'])
plt.title('International Students by Field', fontsize=16, fontweight='bold')
plt.xlabel('Field', fontweight='bold')
plt.ylabel('Number', fontweight='bold')
ax = plt.gca()
ax.set_xticklabels(students['Field'], rotation=45, ha='right')
f = plt.gcf()
f.subplots_adjust(left=0.2, bottom=0.42)
#plt.tight_layout()
plt.show()

 

几个注意的点:

1. 在定义x轴(x=)和y轴(y=)的时候,对于需要分组显示的部分要注意使用list,比如示例中的 y=['2016', '2017'], color=['orange', 'red'] 

2. pandas制图是基于matplotlib的。

3. 在优化title,xlabel, ylabel的时候修改字号,字体使用“fontsize=”,fontweight=' '。

4. 修改X/Y轴的一个函数 plt.gca()

用法:

ax = plt.gca()
ax.set_xticklabels(students['Field'], rotation=45, ha='right')

注:ha=’right‘意思是按照文字右上角为基点进行旋转。

5. 修改左右空白空间的函数plt.gcf()

用法:

f = plt.gcf()
f.subplots_adjust(left=0.2, bottom=0.42)

 

 

 

 

[展开全文]

任务9: 柱状图

1.运用pandas绘制

首先引入matplotlib这个包下面的pyplot模块

import pandas as pd
import matplotlib.pyplot as plt

在DataFrane下面绘制棒图

students.plot.bar(x='Field', y='Number',color='orange',title='international students by field')
plt.tight_layout()
plt.show()

输出结果

2. 运用matplotlib绘制

plt.bar(students.Field,students.Number,color='orange')
plt.xticks(students.Field, rotation='90')
plt.xlabel('Field')
plt.ylabel('Number')
plt.title('international students by field', fontsize= 16)
plt.tight_layout()
plt.show()

 

可以看出matplotlib比pandas绘图更加方便一些

 

[展开全文]

任务8:数据筛选,过滤

方法一:通过函数定义筛选条件

筛选数据的条件通常是通过函数的形式来表达的。因此对于目标excel的筛选条件我们要先定义相应的函数

def age_18_to_30(a):
#   return a >=18 and a <30  #常规写法
    return 18 <= a < 30  # Python 特有的表达式方式
def level_A(s):
    return 85 <= s <= 100

学习一下DataFrame里面的.loc[]属性。注意attribut后面跟的是【】,"loc"是“location”的缩写,意思是定义到某个位置,并将其保留下来。

students = students.loc[students['Age'].apply(age_18_to_30)].loc[students['Score'].apply(level_A)]

注:

- students.loc[].loc[]中两个.loc[]连着写目的是实现多条件筛选同时其书写的先后顺序也会影响实际筛选的先后顺序

- 复习上节课Series.apply()的用法,注意()里的函数名不要带()。

 

方法二:采用lambda表达式(了解内容)

采用lambda表达式可以省略函数定义,式整个代码变得更加简洁

students = students.loc[students['Age'].apply(lambda a: 18 <= a < 30)].loc[
    students['Score'].apply(lambda s: 85 <= s <= 100)]

 

学习DataFrame中列的另外一种表达方式:

表达方式1: students['Age']

表达方式2:students.Age

students = students.loc[students.Age.apply(age_18_to_30)].loc[students.Score.apply(level_A)]

 

本节代码:


# def age_18_to_30(a):
#   return a >=18 and a <30  #常规写法
#    return 18 <= a < 30  # Python 特有的表达式方式

# def level_A(s):
#    return 85 <= s <= 100

students = pd.read_excel('C:/Temp/Students.xlsx', index_col='ID')
# students = students.loc[students['Age'].apply(age_18_to_30)].loc[students['Score'].apply(level_A)]
students = students.loc[students['Age'].apply(lambda a: 18 <= a < 30)].loc[
    students['Score'].apply(lambda s: 85 <= s <= 100)]
# students = students.loc[students.Age.apply(age_18_to_30)].loc[students.Score.apply(level_A)]
print(students)

打印结果

           Name  Age  Score
ID                         
2   Student_002   26     92
6   Student_006   20     93
9   Student_009   18     85
19  Student_019   19     86
20  Student_020   20     94

 

 

[展开全文]

任务7: 排序,多重任务

关于对DATa Frame里面不同列(Series)的排序主要用到一个DataFrame的一个函数 ".sort_values()"

.sort_values()函数里的几个成员的解释

by=【'column1','column2'】 :指定需要排序的目标列,以及排序的先后顺序。

inplace= True:与.set_index里的功能一样,避免再次生成一个DataFrame。

ascending=[True,False]: 【】里面的成员数量要与by[]里面的成员对应。表示从大到小或从小到大排列。

本节代码:

 

import pandas as  pd

products = pd.read_excel('C:/Temp/List.xlsx', index_col='ID')

#products.sort_values(by='Price', inplace=True, ascending=False)
products.sort_values(by=['Worthy', 'Price'], inplace=True, ascending=[True, False])
print(products)

打印结果

           Name  Price Worthy
ID                           
18  Product_018  11.22     No
15  Product_015  10.31     No
3   Product_003   9.62     No
9   Product_009   8.98     No
6   Product_006   7.34     No
12  Product_012   7.29     No
2   Product_002  11.99    Yes
17  Product_017  11.95    Yes
8   Product_008  11.14    Yes
4   Product_004  11.08    Yes
7   Product_007  10.97    Yes
19  Product_019  10.95    Yes
16  Product_016  10.26    Yes
1   Product_001   9.82    Yes
10  Product_010   9.18    Yes
14  Product_014   9.16    Yes
20  Product_020   8.82    Yes
13  Product_013   8.36    Yes
11  Product_011   8.31    Yes
5   Product_005   7.75    Yes

 

[展开全文]

授课教师

Tim老师

课程特色

视频(30)
下载资料(25)