Scraping nationwide province, city, district, and street data from the National Bureau of Statistics with Python 2

Reposted article. 2021-04-16 18:46

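Work called for Python again, and once more I found that using Python well can save a great deal of manual effort. First, the requirement: I needed a region picker like the one Alipay shows when you add a shipping address, detailed down to the street level, that is, a four-level cascading selector, as shown in the figure on the right. The first thing such a selector needs is the cascade data. Perhaps my Baidu skills are poor, but I could not find what I wanted, or what I wanted required download credits that I did not have, so all I could do was stare at it helplessly. Then I remembered that the National Bureau of Statistics publishes this data. I had downloaded it from there before as a spreadsheet, but I no longer remember where I put it. I searched the official site for a long time without success, and I have also forgotten how I eventually found this link: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html. With the link in hand, the first thing I thought of was Python, so I decided to do it myself. The final code is below. It is only a little over 70 lines and does not look hard at first glance, but it still took me two days, because my head is not always clear enough.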

# -*- coding: utf-8 -*-
import os
import sys
import time

from selenium import webdriver

# Python 2 only: make utf-8 the default encoding so Chinese text can be written as-is
reload(sys)
sys.setdefaultencoding('utf-8')


def writeData(tasklist):
    # Append the text to ck.txt next to this script
    conf = 'ck.txt'
    file = open("%s/%s" % (os.path.abspath(os.path.dirname(__file__)), conf), "a+")
    file.write(tasklist)
    file.close()


chrome = webdriver.Chrome()
chrome.get("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html")
time.sleep(10)
href = []
text = []

# Provinces and their next-level links.
# The [30:] slice skips the first 30 links; adjust it to change the starting point.
href1 = chrome.find_elements_by_css_selector('.provincetr td a')[30:]
for q in href1:
    href.append(q.get_attribute('href'))
    text.append(q.get_attribute('innerHTML'))
for h, t in zip(href, text):
    # if t.find("上海市")<0:
    #     continue
    if not h:
        continue
    chrome.get(h)
    time.sleep(3)
    writeData(t)
    # Cities and their next-level links
    href2 = chrome.find_elements_by_css_selector(".citytr :nth-child(2) a")
    timer = 0
    while timer < len(href2):
        # Locate the elements again: the old ones go stale after navigation
        q1 = chrome.find_elements_by_css_selector(".citytr :nth-child(2) a")[timer]
        timer += 1
        href3 = q1.get_attribute('href')
        text3 = q1.get_attribute('innerHTML')
        if not href3:
            continue
        chrome.get(href3)
        time.sleep(3)
        # Districts/counties and their next-level links
        href4 = chrome.find_elements_by_css_selector(".countytr :nth-child(2) a")
        timer7 = 0
        while timer7 < len(href4):
            print timer7
            print len(href4)
            q2 = chrome.find_elements_by_css_selector(".countytr :nth-child(2) a")[timer7]
            timer7 += 1
            href5 = q2.get_attribute('href')
            text5 = q2.get_attribute('innerHTML')
            if not href5:
                continue
            chrome.get(href5)
            time.sleep(3)
            # Street information
            href6 = chrome.find_elements_by_css_selector(".towntr :nth-child(2) a")
            timer6 = 0
            while timer6 < len(href6):
                q3 = chrome.find_elements_by_css_selector(".towntr :nth-child(2) a")[timer6]
                timer6 += 1
                writeData(t + "   " + text3 + "   " + text5 + "   " + q3.get_attribute('innerHTML') + "\n")
            chrome.back()  # street page -> district page
        chrome.back()  # district page -> city page
    chrome.back()  # city page -> province index
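
The `[30:]` slice in the script is a hand-edited starting point. One possible way to avoid editing it after every interruption is to record each finished district URL in a small progress file and skip those URLs on the next run. The sketch below is only an illustration: `load_done`, `mark_done`, and the file name `done.txt` are hypothetical and do not appear in the original script, and the code is written for Python 3 rather than the Python 2 used above.

```python
import io
import os


def load_done(path):
    """Return the set of URLs recorded in the progress file (empty if none)."""
    if not os.path.exists(path):
        return set()
    with io.open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def mark_done(path, url):
    """Append one finished URL to the progress file."""
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(url + "\n")
```

In the district loop, the script could then run `if href5 in done: continue` before `chrome.get(href5)`, and call `mark_done("done.txt", href5)` once all street rows for that district have been written.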

Here are the main problems I ran into, written down so that I do not fall into the same pits next time.

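1. Iterating with for-in loops raised the error "element is not attached to the page document". It turned out that after chrome.get opens another page, coming back and touching the old elements triggers this error: once the page has reloaded, the elements must be located again. I first used three nested for-in loops, which did not work, so I switched to while loops that locate the elements again on each pass and pick the next one with an index that increases by one each time.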

2. The scraped data had gaps. The cause was that the counter variables timer, timer7, and timer6 must be reset to 0 before each while loop.

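3. About one third of the way through, the site required typing in the digits shown in an image before it would open any more pages. The site has anti-scraping measures, as many sites do. From then on the scraping often needed manual intervention: changing the starting point so that the program could carry on with the remaining data.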

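4. I saved the scraped data to a txt file. I originally planned to use Excel, but installing xlwt failed with an encoding error. I often hit this problem when installing packages and do not know how to solve it.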
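
For problem 1, an alternative to re-locating elements by index is to copy each link's `href` and text into plain strings before leaving the page, which is what the province loop in the script already does. Plain strings do not go stale when the browser navigates. A minimal sketch, where `collect_links` is a hypothetical helper and `driver` is assumed to expose the same `find_elements_by_css_selector` call the script uses:

```python
def collect_links(driver, css_selector):
    """Return (href, text) string pairs for every matching element that has a link."""
    links = []
    for element in driver.find_elements_by_css_selector(css_selector):
        href = element.get_attribute('href')
        text = element.get_attribute('innerHTML')
        if href:  # skip entries without a next-level page
            links.append((href, text))
    return links
```

With this helper a plain for-in loop such as `for href3, text3 in collect_links(chrome, ".citytr :nth-child(2) a"):` can call `chrome.get(href3)` directly, and the `chrome.back()` calls should no longer be needed because the parent page is never queried again.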

When the scrape finished, the resulting file was a little over 2 MB, which is indeed fairly large: more than 50,000 lines.
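
To feed a four-level cascading selector, the flat rows can be folded into nested dictionaries. The sketch below assumes every row holds exactly four fields separated by three spaces, as in the `writeData` call in the script. `build_cascade` and `SEPARATOR` are hypothetical names, and the field values in the usage note are placeholders, not real data.

```python
SEPARATOR = "   "  # three spaces, matching the separator used by the script


def build_cascade(lines):
    """Fold 'province   city   district   street' rows into nested dicts."""
    tree = {}
    for line in lines:
        parts = [part.strip() for part in line.rstrip("\n").split(SEPARATOR)]
        if len(parts) != 4:
            continue  # skip rows that do not have exactly four fields
        province, city, district, street = parts
        streets = tree.setdefault(province, {}).setdefault(city, {}).setdefault(district, [])
        streets.append(street)
    return tree
```

For example, `build_cascade(open("ck.txt"))` would return a structure shaped like `{"P": {"C": {"D": ["S1", "S2"]}}}`. Because the script also writes each province name on its own without a line break, the real file may need some cleanup before every row parses this way.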
