Python 3 in practice: fetching data from a website (Carbon Market Data-GD) (bs4/BeautifulSoup)

Reposted article. 2017-01-17 00:13. Category: Python3 / Python3 in practice

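For a personal project I needed to pull some data from a website and found that the page's link is hidden: the real link has to be read out of the page source in the browser.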

The example below scrapes the data directly from that real link.
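
As a rough illustration of digging the real link out of the page source, the sketch below assumes the data table is embedded through an iframe or frame tag. The original post does not say how the link was hidden, so this is only one possible case. The class name FrameSrcCollector, the function find_embedded_urls, and the example.com URL and markup are all hypothetical, and the sketch uses the standard-library html.parser instead of bs4 so that it runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> and <frame> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in ('iframe', 'frame'):
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def find_embedded_urls(page_url, html):
    """Return the absolute URLs of the frames embedded in the given HTML."""
    collector = FrameSrcCollector()
    collector.feed(html)
    # Relative src values are resolved against the URL of the outer page.
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical outer page: the URL and markup are made up for illustration.
outer_url = 'http://example.com/market/index.html'
outer_html = '<html><body><iframe src="/data/history?Top=1"></iframe></body></html>'

embedded = find_embedded_urls(outer_url, outer_html)
print(embedded)
```

If the link is instead built by JavaScript, the browser's network panel is usually the quicker way to find the request that returns the table.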

In addition, I found that pandas' read_html could not directly parse this table with the "lxml" parser; this still needs further investigation.

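Also, the scraped data turned out to contain a lot of whitespace characters, mainly "\r", "\n", and "\t". The method for removing "\r", "\n", and "\t" from a string is included in this example as well.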
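
To make the whitespace removal concrete, here is a small sketch comparing a few ways of cleaning one cell. The sample string raw is a hypothetical cell value, not data from the site. It shows that chaining replace calls removes every "\r", "\n", and "\t", whereas replacing the whole sequence "\r\n\t" in one call only removes that exact three-character run.

```python
# Hypothetical cell text as it might come out of a scraped <td>.
raw = '\r\n\t\t15.20\r\n\t'

# Remove each character separately: every occurrence is deleted.
chained = raw.replace('\r', '').replace('\n', '').replace('\t', '')

# Remove only the exact three-character sequence "\r\n\t".
single = raw.replace('\r\n\t', '')

# Equivalent to the chained version, in a single pass over the string.
translated = raw.translate(str.maketrans('', '', '\r\n\t'))

# Trim whitespace at both ends only, keeping any inner whitespace.
stripped = raw.strip()

print(repr(chained), repr(single), repr(translated), repr(stripped))
```

Here single still contains a leading tab, because the second "\t" is not part of a "\r\n\t" run. When a cell may contain meaningful inner spaces, strip() is often the gentler choice.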

The code is as follows:

# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"


from bs4 import BeautifulSoup
import requests
import csv

url2 = 'http://ets.cnemission.com/carbon/portalIndex/markethistory?Top=1'

req = requests.get(url2)
# soup = BeautifulSoup(req.content, 'html5lib')
soup = BeautifulSoup(req.content, 'lxml')
# Parsing with "lxml" retrieves the data, but the CSV file has blank lines between rows

table = soup.table
trs = table.find_all('tr')

list1 = []
for tr in trs:
    td = tr.find_all('td')

    # Strip the "\r", "\n", and "\t" characters from each cell.
    # Both methods below produce a CSV file, but method 1 yields a smaller one:
    # it removes each character individually, while method 2 only removes
    # the exact sequence "\r\n\t" and leaves other whitespace behind.
    # method 1
    row = [i.text.replace('\r', '').replace('\n', '').replace('\t', '') for i in td]
    # method 2
    # row = [i.text.replace('\r\n\t', '') for i in td]

    list1.append(row)

# Mode 'a' appends, so running the script again adds the rows a second time.
with open('MktDataGuangdong.csv', 'a', errors='ignore', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerows(list1)
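
One possible source of blank lines in the CSV output, offered here as an assumption and not a confirmed diagnosis of this site, is a header row built from th cells: find_all('td') then returns an empty list, and csv.writer writes an empty list as a bare line ending. The sketch below reproduces that effect in memory with hypothetical rows; the helper write_rows and the sample values are made up for illustration.

```python
import csv
import io


def write_rows(rows):
    """Write rows to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO(newline='')
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped rows: the first list is empty, as happens when a
# header <tr> holds only <th> cells and find_all('td') finds nothing.
rows = [[], ['2017-01-16', '15.20'], ['2017-01-17', '15.35']]

# Written as-is, the empty list becomes a blank line in the output.
with_blank = write_rows(rows)

# Filtering out empty lists before writing avoids the blank line.
without_blank = write_rows([row for row in rows if row])

print(repr(with_blank))
print(repr(without_blank))
```

Applied to the script above, the same filter would be a check such as `if row:` before appending to list1.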
