Data Visualization Basics (18): Pandas 120 Exercises (Part 3), Questions 21-50

Reposted article. 2021-05-03 21:42. Data visualization: pandas / numpy / matplotlib / pycharts basics

Part 2: Data Processing with Pandas

21. Read data from a local Excel file

import pandas as pd
import numpy as np  # used in later exercises (median, random numbers)
df = pd.read_excel('pandas120.xlsx')

22. View the first 5 rows of df

df.head()

23. Convert the salary column to the average of its maximum and minimum values

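# Note: .ix was removed in pandas 1.0; use .iloc or .loc instead. See https://mp.weixin.qq.com/s/5xJ-VLaHCV9qX2AMNOLRtw
# max() and min() cannot be applied directly because the values are strings such as "20k-35k", so the numbers have to be extracted first.
# Methods 1 and 2 are alternatives: run only one of them, since each expects salary to still be a string.
import re

# Method 1: apply + a custom function
def func(row):
    lst = row['salary'].split('-')
    smin = int(lst[0].strip('k'))
    smax = int(lst[1].strip('k'))
    row['salary'] = int((smin + smax) / 2 * 1000)
    return row

df = df.apply(func, axis=1)

# Method 2: iterrows + regular expression
for index, row in df.iterrows():
    nums = re.findall(r'\d+', row['salary'])
    df.at[index, 'salary'] = int((int(nums[0]) + int(nums[1])) / 2 * 1000)
df['salary'] = df['salary'].astype(int)  # the column is still object dtype after the loop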
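
The same midpoint calculation can also be done without a Python-level loop. The following is a minimal sketch, assuming every value follows the "20k-35k" pattern described above; the `sample` frame and its values are hypothetical and stand in for the real spreadsheet.

```python
import pandas as pd

# Hypothetical sample data in the "20k-35k" format described above.
sample = pd.DataFrame({'salary': ['20k-35k', '10k-20k', '8k-12k']})

# Capture the two numbers of each range into columns 0 and 1, then convert to int.
bounds = sample['salary'].str.extract(r'(\d+)\D+(\d+)').astype(int)

# Midpoint of each range, scaled from thousands to units.
sample['salary'] = ((bounds[0] + bounds[1]) / 2 * 1000).astype(int)
print(sample['salary'].tolist())
```

Because the result is built as a whole column, salary ends up with an integer dtype directly, which later numeric steps such as binning or sorting rely on.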

24. Group the data by education and compute the mean salary

# Select salary explicitly so that non-numeric columns are not averaged
print(df.groupby('education')['salary'].mean())
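
To make the shape of the grouped result concrete, here is a small sketch on made-up data. The education labels and salary figures are hypothetical, not taken from the spreadsheet.

```python
import pandas as pd

# Hypothetical data: two education levels with known salaries.
sample = pd.DataFrame({
    'education': ['Bachelor', 'Bachelor', 'Master'],
    'salary': [10000, 20000, 30000],
})

# One mean per education level; the result is a Series indexed by education.
mean_by_edu = sample.groupby('education')['salary'].mean()
print(mean_by_edu)
```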

25. Convert the createTime column to month-day format

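# Note: .ix was removed in pandas 1.0, so the row-by-row .ix loop is replaced here by the vectorized .dt accessor. See https://mp.weixin.qq.com/s/5xJ-VLaHCV9qX2AMNOLRtw
# Assumes createTime was read as a datetime column.
df['createTime'] = df['createTime'].dt.strftime('%m-%d')
df.head()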
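
If createTime arrives as plain strings rather than timestamps, it has to be parsed before it can be reformatted. A minimal sketch, assuming hypothetical timestamp strings in an ISO-like format:

```python
import pandas as pd

# Hypothetical timestamps stored as strings.
sample = pd.DataFrame({'createTime': ['2020-03-16 11:30:18', '2020-03-17 09:05:00']})

# Parse the strings into datetimes, then format each value as month-day.
sample['createTime'] = pd.to_datetime(sample['createTime']).dt.strftime('%m-%d')
print(sample['createTime'].tolist())
```

After this step the column holds strings again, so date arithmetic is no longer possible on it.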

26. View the index, data types, and memory usage

df.info()

27. View summary statistics for the numeric columns

df.describe()

28. Add a new column that splits the data into three groups by salary

bins = [0, 5000, 20000, 50000]
group_names = ['low', 'medium', 'high']
# Salaries above 50000 fall outside the bins and become NaN
df['categories'] = pd.cut(df['salary'], bins, labels=group_names)
df
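
The bin edges behave in a way that is easy to misread: by default each interval is open on the left and closed on the right, and anything outside the outer edges gets no label. A small sketch with hypothetical salary values chosen to hit those cases:

```python
import pandas as pd

# Hypothetical salaries, including one on a bin edge and one above the last edge.
sample = pd.Series([3000, 5000, 12000, 45000, 60000])

bins = [0, 5000, 20000, 50000]
labels = ['low', 'medium', 'high']

# Intervals are (0, 5000], (5000, 20000], (20000, 50000].
groups = pd.cut(sample, bins, labels=labels)
print(groups.tolist())
```

Here 5000 lands in the lowest group because the right edge is inclusive, and 60000 comes back as NaN because it lies beyond the last edge.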

29. Sort the data by salary in descending order

df.sort_values('salary', ascending=False)

30. Retrieve row 33

df.loc[32]
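
Note that .loc selects by index label, which matches the row position only while the index is the default 0, 1, 2, ... in its original order. The sketch below, on hypothetical data, shows how the two diverge once the rows have been reordered.

```python
import pandas as pd

# Hypothetical salaries with the default integer index.
sample = pd.DataFrame({'salary': [100, 300, 200]})

# Sorting reorders the rows but keeps their original index labels.
ranked = sample.sort_values('salary', ascending=False)

by_label = ranked.loc[0, 'salary']   # the row whose index label is 0
by_position = ranked.iloc[0, 0]      # the first row in the current order
print(by_label, by_position)
```

When "row 33" is meant positionally, .iloc[32] is the safer choice.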

31. Compute the median of the salary column

np.median(df['salary'])

32. Plot a frequency histogram of salary levels

# Run this twice (in some notebook setups the plot only appears on the second run)
df.salary.plot(kind='hist')

33. Plot a density curve of salary levels

df.salary.plot(kind='kde',xlim=(0,80000))

34. Delete the last column, categories

del df['categories']
# Equivalent alternative (run only one of the two, since the column can be dropped just once):
# df.drop(columns=['categories'], inplace=True)

35. Combine the first and second columns of df into a new column

df['test'] = df['education']+df['createTime']
df

36. Combine the education and salary columns into a new column

# Note: salary is an int column, so it must be converted to str first, unlike exercise 35
df["test1"] = df["salary"].map(str) + df['education']
df

37. Compute the difference between the maximum and minimum of salary

df[['salary']].apply(lambda x: x.max() - x.min())

38. Concatenate the first row and the last row

# df[-1:] is the last row (df[-2:-1] would select the second-to-last row)
pd.concat([df[:1], df[-1:]])
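
A quick way to check which rows a slice really picks is to try it on a tiny frame. The sketch below uses hypothetical data and also shows positional selection with .iloc as an alternative to concatenating two slices.

```python
import pandas as pd

# Hypothetical four-row frame.
sample = pd.DataFrame({
    'education': ['Bachelor', 'Master', 'Bachelor', 'Doctorate'],
    'salary': [1, 2, 3, 4],
})

# First row plus last row, built from two one-row slices.
ends = pd.concat([sample[:1], sample[-1:]])

# The same two rows selected by position in a single step.
same = sample.iloc[[0, -1]]
print(ends)
```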

39. Append row 8 to the end

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, df.iloc[[7]]])
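
Appending a row this way keeps the row's original index label, so the result contains a duplicate label unless the index is rebuilt. A minimal sketch on hypothetical data, using ignore_index=True to renumber the rows:

```python
import pandas as pd

# Hypothetical ten-row frame.
sample = pd.DataFrame({'salary': list(range(10))})

# sample.iloc[[7]] keeps row 8 as a one-row DataFrame, so it can be concatenated.
extended = pd.concat([sample, sample.iloc[[7]]], ignore_index=True)
print(extended.tail(2))
```

The result is a new frame; `sample` itself is left unchanged.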

40. View the data type of each column

df.dtypes

41. Set the createTime column as the index

df.set_index("createTime")

42. Generate a DataFrame of random numbers with the same length as df

# len(df) replaces the hard-coded 135 so the lengths always match
df1 = pd.DataFrame(pd.Series(np.random.randint(1, 10, len(df))))
df1

43. Merge the DataFrame from the previous exercise with df

df= pd.concat([df,df1],axis=1)
df

44. Create a new column, new, equal to the salary column minus the random-number column generated earlier

df["new"] = df["salary"] - df[0]
df

45. Check whether the data contains any missing values

df.isnull().values.any()
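
The single True/False answer says nothing about where the gaps are. As a sketch on hypothetical data, summing the null mask per column gives a count for each one:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
sample = pd.DataFrame({
    'education': ['Bachelor', None, 'Master'],
    'salary': [10000, 20000, np.nan],
})

# True if any cell anywhere in the frame is missing.
has_missing = sample.isnull().values.any()

# Number of missing cells in each column.
per_column = sample.isnull().sum()
print(has_missing)
print(per_column)
```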

46. Convert the salary column to floating point

df['salary'].astype(np.float64)

47. Count how many salary values are greater than 10000

len(df[df['salary']>10000])
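
The count can also be taken straight from the boolean mask, without building the filtered frame first, because True counts as 1 when summed. A small sketch with hypothetical salaries:

```python
import pandas as pd

# Hypothetical salaries; 10000 itself is not strictly greater than 10000.
sample = pd.DataFrame({'salary': [5000, 10000, 15000, 30000]})

# Summing the boolean mask counts the True entries.
count = (sample['salary'] > 10000).sum()
print(count)
```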

48. Count how many times each education level appears

df.education.value_counts()

49. Find how many distinct education levels the education column contains

df['education'].nunique()

50. Extract the last 3 rows where the sum of the salary and new columns is greater than 60000

df1 = df[['salary','new']]
rowsums = df1.apply(np.sum, axis=1)
res = df.iloc[np.where(rowsums > 60000)[0][-3:], :]
res
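
The same selection can be written with a boolean mask and tail, which avoids the row-wise apply. This is a sketch on hypothetical data, not the values from the spreadsheet:

```python
import pandas as pd

# Hypothetical salary and new columns.
sample = pd.DataFrame({
    'salary': [30000, 40000, 50000, 20000, 45000, 35000],
    'new': [29995, 39990, 49992, 19999, 44991, 34993],
})

# Column-wise addition gives the row sums in one vectorized step.
rowsums = sample['salary'] + sample['new']

# Keep the rows above the threshold, then take the last three of them.
res = sample[rowsums > 60000].tail(3)
print(res)
```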
