Python: A Data Processing Example with the pandas Library

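The following is a worked example of data processing with the Python `pandas` library. It covers common operations: reading data, cleaning, filtering, sorting, aggregating, merging, saving, and plotting.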

---

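### Example Scenario
Suppose we have a sales dataset, `sales_data.csv`, with the following fields:
- `OrderID` (order ID)
- `Product` (product name)
- `Quantity` (quantity purchased)
- `Price` (unit price)
- `OrderDate` (order date)
- `CustomerID` (customer ID)

Goal: analyze the sales data to find the best-selling products and to understand customer spending behavior.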
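To make the field list concrete, here is a minimal sketch of what such a file might look like. Every value below is invented for illustration; the real `sales_data.csv` is not shown in this post, so the IDs, products, and prices are assumptions.

```python
import io

import pandas as pd

# Hypothetical file contents; every value here is made up for illustration.
csv_text = """OrderID,Product,Quantity,Price,OrderDate,CustomerID
1001,Keyboard,2,120.0,2023-01-15,C001
1002,Mouse,1,35.5,2023-02-03,C002
1003,Keyboard,1,120.0,2022-12-30,C001
1004,Monitor,3,899.0,2023-02-20,C003
"""

# read_csv accepts a file-like object, so no file on disk is needed here.
sample = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred column types before doing any analysis.
print(sample.dtypes)
```

Saving the same text as `sales_data.csv` should let the steps that follow run against it.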

---

### 1. Import the Library and Read the Data
```python
import pandas as pd

# Read the CSV file
df = pd.read_csv("sales_data.csv")

# Show the first 5 rows
print(df.head())
```

---

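### 2. Data Cleaning
#### Handling Missing Values
```python
# Count missing values in each column
print(df.isnull().sum())

# Fill missing values (here, missing "Price" entries get the column mean).
# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment, which recent pandas versions warn about.
df["Price"] = df["Price"].fillna(df["Price"].mean())

# Optionally drop any rows that still contain missing values
df.dropna(inplace=True)
```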
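Dropping every row with any gap can discard more data than intended. As a sketch of a narrower approach, the example below fills `Price` and then drops only rows that lack a `CustomerID`. The small `df_demo` frame and its values are hypothetical, and which columns count as required is an assumption that depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing Price and one missing CustomerID.
df_demo = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Price": [100.0, np.nan, 300.0, 200.0],
    "CustomerID": ["C001", "C002", None, "C003"],
})

# Fill Price with the mean of the known prices: (100 + 300 + 200) / 3 = 200.
df_demo["Price"] = df_demo["Price"].fillna(df_demo["Price"].mean())

# Drop only rows that lack a CustomerID, rather than any row with any gap.
df_demo = df_demo.dropna(subset=["CustomerID"])

print(df_demo)
```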

#### Handling Duplicates
```python
# Drop rows that are exact duplicates of an earlier row
df.drop_duplicates(inplace=True)
```
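`drop_duplicates()` with no arguments removes only rows that match in every column. If the same order can appear twice with different values, one option is to deduplicate on a key column instead. The sketch below assumes `OrderID` should be unique and that the later record is the one to keep; both are assumptions about the data, and the `orders` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data: OrderID 1 appears twice with different quantities.
orders = pd.DataFrame({
    "OrderID": [1, 1, 2, 3],
    "Product": ["Mouse", "Mouse", "Keyboard", "Mouse"],
    "Quantity": [1, 2, 1, 1],
})

# Count rows whose OrderID already appeared in an earlier row.
repeat_count = orders.duplicated(subset=["OrderID"]).sum()
print(repeat_count)

# Keep only the last record seen for each OrderID.
deduped = orders.drop_duplicates(subset=["OrderID"], keep="last")
print(deduped)
```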

---

### 3. Filtering and Sorting
#### Filtering Rows by Condition
```python
# Select orders with a unit price above 100
high_price_orders = df[df["Price"] > 100]

# Select orders from 2023 (assuming OrderDate is formatted as "YYYY-MM-DD")
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
orders_2023 = df[df["OrderDate"].dt.year == 2023]
```
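The two filters above can also be combined. A minimal sketch, using a hypothetical `df_demo` frame: boolean masks are joined with `&` (each condition needs its own parentheses), and `Series.between` selects an inclusive date range.

```python
import pandas as pd

# Hypothetical rows; OrderDate is already a datetime column here.
df_demo = pd.DataFrame({
    "Price": [50.0, 150.0, 250.0, 120.0],
    "OrderDate": pd.to_datetime(
        ["2023-03-01", "2022-11-11", "2023-07-19", "2023-12-31"]
    ),
})

# Both conditions must hold: unit price above 100 AND ordered in 2023.
mask = (df_demo["Price"] > 100) & (df_demo["OrderDate"].dt.year == 2023)
expensive_2023 = df_demo.loc[mask]

# between() includes both endpoints: orders from the first half of 2023.
h1_2023 = df_demo[
    df_demo["OrderDate"].between(pd.Timestamp("2023-01-01"), pd.Timestamp("2023-06-30"))
]

print(expensive_2023)
print(h1_2023)
```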

#### Sorting by a Column
```python
# Sort by unit price, highest first
df_sorted = df.sort_values("Price", ascending=False)
```

---

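### 4. Aggregation and Grouped Statistics
#### Computing Total Sales (New Column)
```python
# Sales amount per order row = quantity * unit price
df["TotalSales"] = df["Quantity"] * df["Price"]
```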

#### Total Quantity and Total Sales per Product
```python
product_stats = df.groupby("Product").agg({
    "Quantity": "sum",
    "TotalSales": "sum"
}).reset_index()

print(product_stats)
```

#### Purchase Count and Total Spending per Customer
```python
customer_stats = df.groupby("CustomerID").agg({
    "OrderID": "count",
    "TotalSales": "sum"
}).rename(columns={"OrderID": "PurchaseCount"})

print(customer_stats)
```
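The dictionary form followed by `rename` can also be written with pandas named aggregation, which sets the output column names in the same call. This is a sketch of that alternative, not a replacement for the version above; the `df_demo` values are hypothetical, and `TotalSpent` and `AvgOrderValue` are illustrative column names that do not appear elsewhere in this post.

```python
import pandas as pd

# Hypothetical order rows with TotalSales already computed.
df_demo = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "CustomerID": ["C001", "C002", "C001", "C001"],
    "TotalSales": [240.0, 35.5, 120.0, 60.0],
})

# Named aggregation: output_name=(source_column, aggregation_function).
customer_summary = (
    df_demo.groupby("CustomerID")
    .agg(
        PurchaseCount=("OrderID", "count"),
        TotalSpent=("TotalSales", "sum"),
        AvgOrderValue=("TotalSales", "mean"),
    )
    .reset_index()
)

print(customer_summary)
```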

---

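### 5. Merging Data
Suppose there is a second table of customer information, `customer_info.csv`, containing `CustomerID` and `CustomerName`:
```python
# Read the customer information table
df_customers = pd.read_csv("customer_info.csv")

# Merge sales data with customer info (similar to a SQL LEFT JOIN)
merged_df = pd.merge(df, df_customers, on="CustomerID", how="left")
```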
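A left join keeps every sales row, so orders whose `CustomerID` is missing from the customer table end up with an empty `CustomerName`. The sketch below shows one way to surface those rows using the `indicator` and `validate` parameters of `pd.merge`. The two small frames are hypothetical, and it assumes each customer should appear at most once in the customer table.

```python
import pandas as pd

# Hypothetical tables: customer C009 has no entry in the customer table.
sales = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "CustomerID": ["C001", "C002", "C009"],
})
customers = pd.DataFrame({
    "CustomerID": ["C001", "C002"],
    "CustomerName": ["Alice", "Bob"],
})

# validate raises MergeError if CustomerID is not unique on the right side;
# indicator adds a "_merge" column saying where each row was found.
checked = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows present only in the sales table have no matching customer record.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```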

---

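### 6. Saving the Data
Save the processed data to a new file:
```python
# Save as CSV
merged_df.to_csv("processed_sales_data.csv", index=False)

# Save as Excel (requires an Excel engine such as openpyxl to be installed)
merged_df.to_excel("processed_sales_data.xlsx", index=False)
```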

---

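### 7. Further Analysis Examples
#### Time Series Analysis (Monthly Sales)
```python
# OrderDate must be a datetime column (converted in step 3).
# "M" means month-end frequency; pandas 2.2+ prefers the alias "ME".
monthly_sales = merged_df.resample("M", on="OrderDate")["TotalSales"].sum()
print(monthly_sales)
```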
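An alternative sketch for monthly totals groups by a monthly period derived from `OrderDate`. One difference worth knowing: `resample` emits a row (with sum 0) for months that have no orders, while this `groupby` version lists only months that actually appear in the data. The `df_demo` values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: two orders in January, none in February, one in March.
df_demo = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-03-02"]),
    "TotalSales": [100.0, 50.0, 75.0],
})

# Convert each date to its calendar month, then sum sales within each month.
month = df_demo["OrderDate"].dt.to_period("M")
monthly = df_demo.groupby(month)["TotalSales"].sum()

print(monthly)
```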

#### Top 5 Best-Selling Products
```python
# Rank products by total sales amount and keep the first five
top_products = product_stats.sort_values("TotalSales", ascending=False).head(5)
print(top_products)
```
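The sort-then-`head` pattern can also be expressed with `DataFrame.nlargest`, which returns the top rows by a column directly. A minimal sketch with a hypothetical `product_stats_demo` table, taking the top 2 instead of 5 to keep the data small:

```python
import pandas as pd

# Hypothetical per-product totals.
product_stats_demo = pd.DataFrame({
    "Product": ["A", "B", "C", "D"],
    "TotalSales": [300.0, 1200.0, 50.0, 800.0],
})

# The two rows with the highest TotalSales, in descending order.
top2 = product_stats_demo.nlargest(2, "TotalSales")

print(top2)
```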

---

### 8. Data Visualization (with `matplotlib`)
```python
import matplotlib.pyplot as plt

# Plot the monthly sales trend as a line chart
monthly_sales.plot(kind="line", title="Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Total Sales")
plt.show()

# Plot total sales of the top products as a bar chart
top_products.plot(kind="bar", x="Product", y="TotalSales", title="Top 5 Products by Sales")
plt.ylabel("Total Sales")
plt.show()
```

---

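### Summary
The steps above cover:
1. **Data cleaning**: handling missing values and duplicates.
2. **Data filtering**: selecting rows by condition and sorting.
3. **Data aggregation**: computing key metrics per group.
4. **Data merging**: joining multiple tables.
5. **Visualization**: presenting the results as charts.

Depending on your needs, the same workflow can be extended with more complex logic, such as feature engineering or integration with machine learning.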
