Pandas in Data Analysis: Basic Usage

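Pandas is an essential Python library for data mining and data analysis. With Pandas you can quickly handle reading and writing data, slicing, grouped statistics, data cleaning, and similar tasks.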

Data Structures

All of pandas' functionality is built on its two data structures, DataFrame and Series.

DataFrame

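A tabular data structure containing an ordered collection of columns, where each column can hold a different type of value (numbers, strings, and so on). It has both a row index and a column index, and can be thought of as a dictionary of Series.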
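To make the "dictionary of Series" view concrete, here is a minimal sketch. The two small Series are made-up sample data rather than the article's CSV file, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up sample data; both Series share the same index labels 0 and 1
companies = pd.Series(["Walmart", "Apple"], index=[0, 1])
ranks = pd.Series([1, 3], index=[0, 1])

# Each dictionary key becomes a column label, each Series becomes a column
df = pd.DataFrame({"Company": companies, "Rank": ranks})

# Selecting a single column gives back a Series
rank_column = df["Rank"]
print(type(rank_column).__name__)  # Series
print(df.shape)                    # (2, 2)
```

The rows of the resulting DataFrame are matched up by the index labels of the input Series, not by their position.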


import pandas as pd

# Display at most 7 rows and 6 columns
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 6)

# Read a csv file ("*.csv" stands in for the actual file path)
retail_data = pd.read_csv("*.csv")
retail_data.head()

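Run this in Jupyter Notebook.

read_csv reads the contents of the csv file into the variable retail_data, which is a DataFrame object.

Pandas output

Index labels give fast access to the data in a specified row or column of a DataFrame, and when several Series or DataFrames are combined, the index is used to align the data. A DataFrame treats both the index and the columns as axes: a vertical axis (the index) and a horizontal axis (the columns). In the output, the leftmost column of 0, 1, 2, 3, 4 is the index, and the top row holds the column labels. Missing data is represented in pandas as NaN.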


print(type(retail_data.index), ' ', retail_data.index)
print(type(retail_data.columns), ' ', retail_data.columns)
print(type(retail_data.values), ' ', retail_data.values)

The output shows the type and the contents of the index, the columns, and the values:

<class 'pandas.core.indexes.range.RangeIndex'> RangeIndex(start=0, stop=1000, step=1) 

<class 'pandas.core.indexes.base.Index'> Index(['Rank', 'Company', 'Sector', 'Industry', 'Location', 'Revenue', 'Profits', 'Employees'], dtype='object') 

<class 'numpy.ndarray'> [[1 'Walmart' 'Retailing' ... 482130.0 14694 2300000] [2 'Exxon Mobil' 'Energy' ... 246204.0 16150 75600] [3 'Apple' 'Technology' ... 233715.0 53394 110000] ... [997 'Portland General Electric' 'Energy' ... 1898.0 172 2646] [999 'Wendy__' 'Hotels, Resturants & Leisure' ... 1896.0 161 21200] [1000 'Briggs & Stratton' 'Industrials' ... 1895.0 46 5480]]
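The index alignment mentioned above can be sketched with two small Series. The labels and numbers are invented for illustration; the point is only that values are matched by label and that labels present on one side only yield NaN.

```python
import pandas as pd

# Two made-up Series whose index labels only partly overlap
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition aligns on index labels, not on position
total = a + b
print(total)
# w     NaN
# x     NaN
# y    12.0
# z    23.0
# dtype: float64
```

Because NaN is a floating-point value, the result is float64 even though both inputs hold integers.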

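Data Types

In data analysis, data is usually divided simply into continuous and discrete variables, whereas Pandas classifies data in more detail.

Pandas Type	Python Type	NumPy Type	Use case
object	str	str	Text
int64	int	int, int8, int16, int32, int64, uint8, uint16, uint32, uint64	Integers
float64	float	float	Floating-point numbers
bool	bool	bool	Boolean values
datetime64[ns]	NA	datetime64[ns]	Dates and times
timedelta64[ns]	NA	timedelta64[ns]	Time differences (durations)
category	NA	NA	Categorical variables

Each column of a DataFrame holds a single data type, while different columns may hold different types.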


retail_data.dtypes

# Output
# Rank           int64
# Company       object
# Sector        object
#               ...   
# Revenue      float64
# Profits        int64
# Employees      int64
# Length: 8, dtype: object
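As a rough illustration of working with dtypes, the sketch below converts columns with astype. The small DataFrame is invented for the example (the values merely resemble the article's data); astype and the dtype names "int64" and "category" are standard pandas.

```python
import pandas as pd

# Made-up sample: the employee counts were stored as text
df = pd.DataFrame({
    "Sector": ["Retailing", "Energy", "Technology", "Energy"],
    "Employees": ["2300000", "75600", "110000", "2646"],
})

# Convert the text column to integers so arithmetic works
df["Employees"] = df["Employees"].astype("int64")

# A column with few distinct values can be stored as a categorical
df["Sector"] = df["Sector"].astype("category")

print(df.dtypes)
print(df["Employees"].sum())  # 2488246
```

Each column ends up with exactly one dtype, while the two columns differ from each other.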

Series

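A one-dimensional, array-like object made up of a set of data (of a NumPy data type) and an associated index. Series are the columns of a DataFrame: each column is a Series, and every element in it has a label, which can be a number or a string.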


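# Option 1: bracket access
retail_data["Company"]

# Option 2: attribute access. Not recommended, because column names may contain
# special characters or clash with existing DataFrame attributes and methods
retail_data.Company

# Output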
# 0                        Walmart
# 1                    Exxon Mobil
# 2                          Apple
#                  ...            
# 997    Portland General Electric
# 998                      Wendy__
# 999            Briggs & Stratton
# Name: Company, Length: 1000, dtype: object
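A Series can also be built directly, which shows the data-plus-labels structure on its own. This is a small sketch with string labels; the three revenue figures are the ones visible in the values output earlier, and the variable name is only illustrative.

```python
import pandas as pd

# A Series with string labels as its index and a name
revenue = pd.Series(
    [482130.0, 246204.0, 233715.0],
    index=["Walmart", "Exxon Mobil", "Apple"],
    name="Revenue",
)

# Elements are looked up by label
print(revenue["Apple"])        # 233715.0
print(revenue.index.tolist())  # ['Walmart', 'Exxon Mobil', 'Apple']
print(revenue.name)            # Revenue
```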

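Operators

Operators can be applied to a selected column; the operation is carried out on every element and returns a new Series.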


retail_data["Rank"]

# 0         1
# 1         2
# 2         3
#        ... 
# 997     997
# 998     999
# 999    1000
# Name: Rank, Length: 1000, dtype: int64


retail_data["Rank"] + 2

# 0         3
# 1         4
# 2         5
#        ... 
# 997     999
# 998    1001
# 999    1002
# Name: Rank, Length: 1000, dtype: int64

retail_data["Rank"] * 2

# 0         2
# 1         4
# 2         6
#        ... 
# 997    1994
# 998    1998
# 999    2000
# Name: Rank, Length: 1000, dtype: int64
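Operators also work between two columns and with comparisons, not only with a scalar. The sketch below assumes a tiny hand-built DataFrame (three rows with revenue and profit figures like those in the earlier output) instead of the full CSV file.

```python
import pandas as pd

df = pd.DataFrame({
    "Revenue": [482130.0, 246204.0, 233715.0],
    "Profits": [14694, 16150, 53394],
})

# Column-by-column arithmetic is element-wise and aligned on the index
margin = df["Profits"] / df["Revenue"]

# Comparison operators return a boolean Series
is_large = df["Revenue"] > 240000
print(is_large.tolist())  # [True, True, False]

# A boolean Series can be used to select matching rows
large = df[is_large]
print(len(large))  # 2
```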

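Method Chaining

In an object-oriented high-level language like Python, everything is an object with its own attributes and methods, and accessing an attribute or calling a method may return a new object that has attributes and methods of its own.

Calling methods one after another with the . operator is known as method chaining.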


# Count how often each value occurs and take the first five rows
companies = retail_data['Company']
companies.value_counts().head()

# Output
# Company
# Regions Financial            2
# Portland General Electric    2
# Host Hotels & Resorts        2
# Noble Energy                 2
# Hubbell                      1
# Name: count, dtype: int64

# Proportion of missing values
retail_data["Revenue"].isnull().mean()

# Output
# 0.01

# Fill missing values with 0, then check the proportion again
retail_data["Revenue"].fillna(0).isnull().mean()

# Output
# 0.0
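Longer chains are often easier to read when wrapped in parentheses with one method per line. The following is a minimal sketch on a made-up Series with one missing value; the methods used (isnull, fillna, sort_values, head) are standard pandas.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value
revenue = pd.Series([482130.0, np.nan, 233715.0, 1898.0], name="Revenue")

# One of four values is missing
missing_ratio = revenue.isnull().mean()
print(missing_ratio)  # 0.25

# Each step returns a new Series that the next method is called on
top_two = (
    revenue
    .fillna(0)
    .sort_values(ascending=False)
    .head(2)
)
print(top_two.tolist())  # [482130.0, 233715.0]
```

None of the steps changes revenue itself, so the original Series still contains the NaN afterwards.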

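Index and Columns

Modifying the index and columns


# Option 1: set the index while reading the file
retail_data = pd.read_csv("*.csv", index_col="Company")

# Option 2: set the index after reading
retail_data = pd.read_csv("*.csv")
# Set the Company column as the index; inplace=True modifies the original data directly
retail_data.set_index("Company", inplace=True)

# Restore the default integer index; Company becomes an ordinary column again
retail_data.reset_index(inplace=True)

Output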


# Mapping for renaming index labels
idx_rename = {1: "first", 2: "second", 3: "third"}
# Mapping for renaming columns
col_rename = {"Company": "Name", "Profits ($M)": "Profits"}
# Rename index labels and columns on the original data
retail_data.rename(index=idx_rename, columns=col_rename, inplace=True)

Output
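The same operations can be written without inplace=True, in which case each call returns a new DataFrame. This sketch uses a two-row made-up DataFrame; the "Missing" key is deliberately absent from the data to show that rename skips unknown labels by default.

```python
import pandas as pd

df = pd.DataFrame({"Company": ["Walmart", "Apple"], "Rank": [1, 3]})

# set_index returns a new DataFrame; df itself is unchanged
indexed = df.set_index("Company")

# Keys that do not match any label ("Missing") are ignored by default
renamed = indexed.rename(
    index={"Apple": "Apple Inc."},
    columns={"Rank": "Position", "Missing": "Ignored"},
)

# reset_index turns the index back into an ordinary column
restored = renamed.reset_index()
print(restored.columns.tolist())     # ['Company', 'Position']
print(restored["Company"].tolist())  # ['Walmart', 'Apple Inc.']
print(df.columns.tolist())           # ['Company', 'Rank']
```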

Adding, modifying, or deleting columns


# Add a column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['A'] + df['B']
print(df)

#    A  B  C
# 0  1  4  5
# 1  2  5  7
# 2  3  6  9

# Insert a new column D at position 3 (insert adds a column, not a row)
df.insert(3, 'D', [7, 8, 9])
print(df)

#    A  B  C  D
# 0  1  4  5  7
# 1  2  5  7  8
# 2  3  6  9  9

# Delete a column; drop returns a new object by default, inplace=True modifies the original
df.drop('C', axis=1, inplace=True)
print(df)

#    A  B  D
# 0  1  4  7
# 1  2  5  8
# 2  3  6  9
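For completeness, here is a hedged sketch of modifying an existing column, together with non-inplace alternatives for adding and deleting. It reuses the small A/B DataFrame from above; assign and drop(columns=...) are standard pandas, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Modify an existing column by assigning to it
df["B"] = df["B"] * 10

# assign returns a new DataFrame with the extra column; df keeps two columns
with_c = df.assign(C=lambda d: d["A"] + d["B"])

# drop(columns=...) is equivalent to drop(..., axis=1) and also returns a new object
without_a = with_c.drop(columns=["A"])
print(without_a)
#     B   C
# 0  40  41
# 1  50  52
# 2  60  63
```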

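Selecting multiple columns


# Select several columns; the result is a DataFrame
print(retail_data[["Company", "Sector", "Industry"]])

# Select columns by data type
print(retail_data.select_dtypes(include=["number"]))

# Select columns whose name contains a substring
print(retail_data.filter(like="Market"))

# Select columns whose name matches a regular expression
print(retail_data.filter(regex="^M"))

# Select columns by name; names that do not exist are skipped without raising an error
print(retail_data.filter(items=["Company", "Sector", "Industry"]))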
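The selection methods above can be tried on a small self-contained DataFrame. The column "Market Value" and its numbers are hypothetical and exist only so that the like="Market" filter has something to match.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Walmart", "Apple"],
    "Market Value": [100, 200],        # hypothetical column and values
    "Revenue": [482130.0, 233715.0],
})

numeric = df.select_dtypes(include=["number"])
market = df.filter(like="Market")
starts_with_r = df.filter(regex="^R")
# "NoSuchColumn" does not exist and is silently skipped
existing_only = df.filter(items=["Company", "NoSuchColumn"])

print(numeric.columns.tolist())        # ['Market Value', 'Revenue']
print(market.columns.tolist())         # ['Market Value']
print(starts_with_r.columns.tolist())  # ['Revenue']
print(existing_only.columns.tolist())  # ['Company']
```

By contrast, selecting a missing name with bracket syntax, such as df[["Company", "NoSuchColumn"]], raises a KeyError.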