Python data preprocessing

user_profile.csv: this file is too large, so only the first few rows are copied here

,userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,occupation,new_user_class_level 
0,234,0,5,2,5,,0,3.0
1,523,5,2,2,2,1.0,1,2.0
2,612,0,8,1,2,2.0,0,
3,1670,0,4,2,4,,0,
4,2545,0,10,1,4,,0,
5,3644,49,6,2,6,2.0,0,2.0
6,5777,44,5,2,5,2.0,0,2.0
7,6211,0,9,1,3,,0,2.0
8,6355,2,1,2,1,1.0,0,4.0
9,6823,43,5,2,5,2.0,0,1.0
10,6972,5,2,2,2,2.0,1,2.0
11,9293,0,5,2,5,,0,4.0
12,9510,55,8,1,2,2.0,0,2.0
13,10122,33,4,2,4,2.0,0,2.0
14,10549,0,4,2,4,2.0,0,
15,10812,0,4,2,4,,0,
16,10912,0,4,2,4,2.0,0,
17,10996,0,5,2,5,,0,4.0
18,11256,8,2,2,2,1.0,0,3.0
19,11310,31,4,2,4,1.0,0,4.0
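
Because the full file is large, a small preview is often enough to check the layout before processing everything. The following is a minimal sketch, assuming pandas is installed; it uses an in-memory copy of a few rows from the sample above instead of the real file, and the name `preview` is only illustrative. The header as shown above appears to end with a trailing space, so the sketch strips whitespace from the column names.

```python
import io
import pandas as pd

# A few rows copied from the sample above. In practice, pass the path
# 'user_profile.csv' to read_csv instead of this in-memory buffer.
sample = io.StringIO(
    ",userid,cms_segid,cms_group_id,final_gender_code,age_level,"
    "pvalue_level,occupation,new_user_class_level \n"
    "0,234,0,5,2,5,,0,3.0\n"
    "1,523,5,2,2,2,1.0,1,2.0\n"
    "2,612,0,8,1,2,2.0,0,\n"
    "3,1670,0,4,2,4,,0,\n"
)

# index_col=0 uses the unnamed first column as the row index;
# nrows limits how many data rows are read.
preview = pd.read_csv(sample, index_col=0, nrows=3)

# Remove stray whitespace around column names (the last header has a trailing space).
preview.columns = preview.columns.str.strip()

print(preview.shape)
# Count missing values per column to see how much dropna() would discard.
print(preview.isna().sum())
```

Counting missing values first gives a sense of how many rows a later `dropna()` call would remove.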

ads_sample.csv:

,user_id,adgroup_id,pid,nonclk,clk,date
0,581738,1,430548_1007,1,0,2021-03-07 14:14:04
1,449818,3,430548_1007,1,0,2021-03-13 09:26:18
2,914836,4,430548_1007,1,0,2021-03-13 12:47:59
3,914836,5,430548_1007,1,0,2021-03-13 12:50:29
4,399907,8,430548_1007,1,0,2021-03-09 12:09:18
5,628137,9,430548_1007,1,0,2021-03-12 01:48:55
6,298139,9,430539_1007,1,0,2021-03-11 08:29:53
7,775475,9,430548_1007,1,0,2021-03-12 11:50:36
8,555266,11,430539_1007,1,0,2021-03-09 13:18:56
9,117840,11,430548_1007,1,0,2021-03-06 10:12:23
10,739815,11,430539_1007,1,0,2021-03-07 08:03:07
11,623911,11,430548_1007,1,0,2021-03-13 05:41:41
12,623911,11,430548_1007,1,0,2021-03-11 05:26:48
13,421590,11,430548_1007,1,0,2021-03-06 09:29:04
14,976358,13,430548_1007,1,0,2021-03-07 19:35:49
15,286630,13,430539_1007,1,0,2021-03-08 12:42:59
16,286630,13,430539_1007,1,0,2021-03-09 08:20:47
17,771431,13,430548_1007,1,0,2021-03-07 18:44:27
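
The `date` column in this sample holds timestamps as text. If later steps need to filter or group by time, it may help to parse it while loading. This is only a sketch under that assumption, using three rows from the sample above in an in-memory buffer; the variable name `ads` is illustrative.

```python
import io
import pandas as pd

# Three rows copied from the sample above. In practice, pass the path
# 'ads_sample.csv' to read_csv instead of this in-memory buffer.
sample = io.StringIO(
    ",user_id,adgroup_id,pid,nonclk,clk,date\n"
    "0,581738,1,430548_1007,1,0,2021-03-07 14:14:04\n"
    "1,449818,3,430548_1007,1,0,2021-03-13 09:26:18\n"
    "2,914836,4,430548_1007,1,0,2021-03-13 12:47:59\n"
)

# parse_dates converts the listed columns to datetime values on load.
ads = pd.read_csv(sample, index_col=0, parse_dates=["date"])

print(ads.dtypes)
# The .dt accessor exposes date parts such as the day of the month.
print(ads["date"].dt.day.tolist())
```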

A series of operations used when preprocessing data for analysis:

import pandas as pd
from numpy import nan  # only needed for the commented-out example below


def proProcess(filename1, filename2):
    # Load the first file
    df1 = pd.read_csv(filename1)
    # Alternative: build a DataFrame from in-memory data
    # title = ['data', 'feature_names']
    # a = [['2015年', 137462.0],
    #      ['2015年', 137462.0],
    #      [nan, nan],
    #      [nan, nan]]
    # df = pd.DataFrame(a, columns=title)

    # Remove duplicate rows
    df1 = df1.drop_duplicates()
    # Drop rows that contain missing values
    df1 = df1.dropna()

    # Alternative to dropping: fill missing values with the column mean
    # df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
    print(df1)

    # Load the second file
    df2 = pd.read_csv(filename2)

    # Remove duplicate rows
    df2 = df2.drop_duplicates()
    # Drop rows that contain missing values
    df2 = df2.dropna()

    # Alternative to dropping: fill missing values with the column mean
    # df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
    print(df2)


if __name__ == '__main__':
    # Example in-memory data for the DataFrame alternative above
    # data = [['Google', 10], ['Runoob', 12], ['Wiki', 13], [nan, nan], ['Wiki', 13]]
    # title = ['Site', 'Age']
    filename1 = 'user_profile.csv'
    filename2 = 'ads_sample.csv'
    proProcess(filename1, filename2)
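
In the user_profile.csv sample, many rows are missing `pvalue_level` or `new_user_class_level`, so `dropna()` discards a large share of the data. The commented-out hint in the script suggests filling values instead. The sketch below compares the two approaches on four rows from the sample. It fills with the most frequent value rather than the mean, on the assumption that these columns are coded levels rather than continuous quantities; that choice is not from the original script.

```python
import io
import pandas as pd

# Four rows copied from the user_profile.csv sample (header whitespace removed).
sample = io.StringIO(
    ",userid,cms_segid,cms_group_id,final_gender_code,age_level,"
    "pvalue_level,occupation,new_user_class_level\n"
    "0,234,0,5,2,5,,0,3.0\n"
    "1,523,5,2,2,2,1.0,1,2.0\n"
    "2,612,0,8,1,2,2.0,0,\n"
    "5,3644,49,6,2,6,2.0,0,2.0\n"
)
df = pd.read_csv(sample, index_col=0)

# Approach used in the script: keep only rows with no missing value.
dropped = df.dropna()

# Alternative: keep every row and fill the gaps per column.
filled = df.copy()
for col in ["pvalue_level", "new_user_class_level"]:
    # mode() returns the most frequent value(s); take the first one.
    most_common = filled[col].mode().iloc[0]
    filled[col] = filled[col].fillna(most_common)

print(len(df), len(dropped), len(filled))
print(filled)
```

Which approach is appropriate depends on the analysis: dropping keeps only observed values, while filling keeps more rows at the cost of introducing estimated ones.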