Background:
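While learning data visualization with Python, I downloaded a CSV file from the World Bank (http://data.worldbank.org/indicator/). When I read the file, the first field of the header row came out garbled. After some searching, the cause turned out to be that the file starts with codecs.BOM_UTF8 (\xef\xbb\xbf), referred to below as the BOM. These bytes are not visible when you simply open the CSV file.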
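One way to confirm that a file carries the BOM is to read its first three bytes in binary mode and compare them with codecs.BOM_UTF8. The sketch below is only an illustration: the helper name has_utf8_bom and the temporary sample file are assumptions made for the demo, not part of the original workflow.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (hypothetical helper)."""
    with open(path, 'rb') as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Build a small sample file that mimics the start of the World Bank header row.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b'"Data Source","World Development Indicators",\n')
    sample_path = tmp.name

bom_found = has_utf8_bom(sample_path)
print(bom_found)

os.remove(sample_path)
```

To check the real download, call has_utf8_bom with the path of the CSV file instead of the temporary sample.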
What I tried:
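1. Reading the file without specifying an encoding (on Windows with a Chinese locale, open() then uses the system default, most likely gbk):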
import csv

file_name = 'API_MS.MIL.TOTL.P1_DS2_en_csv_v2.csv'
with open(file_name) as f:
    reader = csv.reader(f)
    head_row = next(reader)
    print(head_row)
The first field starts with garbled characters. The result:
['鍩緿ata Source', 'World Development Indicators', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
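The garbled prefix comes from the BOM bytes being interpreted as gbk double-byte characters. Here is a minimal in-memory sketch, assuming the default encoding really was gbk as guessed above. The variable names are illustrative.

```python
import codecs

# The first bytes of the file: the UTF-8 BOM followed by the header text.
raw = codecs.BOM_UTF8 + b'Data Source'

# gbk pairs up the bytes: (0xEF, 0xBB) becomes one character and
# (0xBF, 'D') becomes another, so the letter 'D' is swallowed.
as_gbk = raw.decode('gbk')

print(repr(as_gbk))
print(len(raw), len(as_gbk))  # 14 bytes decode to 12 characters
```

This is why the output shows two odd characters followed by "ata Source" rather than "Data Source".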
2. Reading the file with the encoding set to utf-8:
import csv

file_name = 'API_MS.MIL.TOTL.P1_DS2_en_csv_v2.csv'
with open(file_name, encoding='utf-8') as f:
    reader = csv.reader(f)
    head_row = next(reader)
    print(head_row)
The leading "garbled" characters turn into \ufeff, as shown:
['\ufeffData Source', 'World Development Indicators', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
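The \ufeff here is the BOM itself: the plain utf-8 codec decodes the three bytes \xef\xbb\xbf into the single character U+FEFF and leaves it in the text. A small check on in-memory bytes (variable names are illustrative):

```python
import codecs

raw = codecs.BOM_UTF8 + b'Data Source'

# Plain utf-8 keeps the BOM as an ordinary leading character.
as_utf8 = raw.decode('utf-8')
print(repr(as_utf8))  # '\ufeffData Source'

# Encoding U+FEFF as utf-8 gives back exactly the BOM bytes.
bom_round_trip = '\ufeff'.encode('utf-8') == codecs.BOM_UTF8
print(bom_round_trip)  # True
```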
Solution:
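I came across a post (https://www.cnblogs.com/chongzi1990/p/8694883.html): all it takes is changing the encoding from utf-8 to utf-8-sig. The post explains the reason in detail. In short, the utf-8-sig codec skips a leading BOM when decoding, whereas utf-8 decodes it as the character \ufeff.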
import csv

file_name = 'API_MS.MIL.TOTL.P1_DS2_en_csv_v2.csv'
with open(file_name, encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    head_row = next(reader)
    print(head_row)
The output is now correct and gives the content I wanted:
['Data Source', 'World Development Indicators', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
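When decoding, utf-8-sig only skips a BOM if one is present, so the same open() call also works for files saved without a BOM. The sketch below checks this with two throwaway files. The helper names read_header and write_sample and the sample data are assumptions for illustration.

```python
import codecs
import csv
import os
import tempfile


def read_header(path):
    """Return the first CSV row, decoding with utf-8-sig (hypothetical helper)."""
    with open(path, encoding='utf-8-sig', newline='') as f:
        return next(csv.reader(f))


def write_sample(data):
    """Write raw bytes to a temporary .csv file and return its path."""
    with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
        tmp.write(data)
        return tmp.name


line = b'Data Source,World Development Indicators\n'
with_bom = write_sample(codecs.BOM_UTF8 + line)
without_bom = write_sample(line)

header_with_bom = read_header(with_bom)
header_without_bom = read_header(without_bom)
print(header_with_bom)
print(header_without_bom)

os.remove(with_bom)
os.remove(without_bom)
```

Both calls return the same clean header row, so utf-8-sig is a reasonable default when it is unclear whether a UTF-8 file has a BOM.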
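* Along the way I tried other approaches, for example reading the CSV first, saving the header row to a new file, then reading that file in 'rb' binary mode and stripping the BOM if it was there. That was a lot of trouble; simply specifying utf-8-sig is the easiest way.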
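For comparison, the manual route could look roughly like the sketch below: take the raw bytes, drop the BOM if present, then decode. This is a reconstruction of the general idea rather than the exact code that was tried. strip_utf8_bom is a hypothetical helper name, and the bytes are an in-memory stand-in for the real file.

```python
import codecs
import csv
import io


def strip_utf8_bom(data):
    """Return `data` without a leading UTF-8 BOM, if it has one (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# In-memory stand-in for the bytes read with open(file_name, 'rb').
raw = codecs.BOM_UTF8 + b'Data Source,World Development Indicators\r\n'

text = strip_utf8_bom(raw).decode('utf-8')
head_row = next(csv.reader(io.StringIO(text, newline='')))
print(head_row)
```

It works, but it repeats by hand what encoding='utf-8-sig' already does while decoding.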