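1. Reference

Table tags

Tag	Description
<table>	Defines a table.
<caption>	Defines the table caption.
<th>	Defines a header cell.
<tr>	Defines a table row.
<td>	Defines a data cell.
<thead>	Groups the header rows of a table.
<tbody>	Groups the body rows of a table.
<tfoot>	Groups the footer rows of a table.
<col>	Specifies attributes for a table column.
<colgroup>	Defines a group of table columns.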

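Locating table elements

The raw page source contains neither thead nor tbody, even though they may appear when the page is inspected in a browser, so rows have to be selected directly as table/tr.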

<table class="wikitable sortable" style="text-align: center; font-size: 85%; width: auto; table-layout: fixed;">
　　<caption>List of text editors</caption>
　　<tr>
　　　　<th style="width: 12em">Name</th>
　　　　<th>Creator</th>
　　　　<th>First public release</th>
　　　　<th data-sort-type="number">Latest stable version</th>
　　　　<th>Latest Release Date</th>
　　　　<th><a href="/wiki/Programming_language" title="Programming language">Programming language</a></th>
　　　　<th data-sort-type="currency">Cost (<a href="/wiki/United_States_dollar" title="United States dollar">US$</a>)</th>
　　　　<th><a href="/wiki/Software_license" title="Software license">Software license</a></th>
　　　　<th><a href="/wiki/Free_and_open-source_software" title="Free and open-source software">Open source</a></th>
　　　　<th><a href="/wiki/Command-line_interface" title="Command-line interface">Cli available</a></th>
　　　　<th>Minimum installed size</th>
　　</tr>
　　<tr>
　　　　<th
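The effect of the missing tbody on row selection can be checked with a small sketch. It uses the standard library's xml.etree.ElementTree on a hand-written, well-formed fragment instead of a live Scrapy response, so the fragment, its cell values, and the variable names are illustrative assumptions rather than the page's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the raw page source: no <thead>/<tbody>.
RAW_SOURCE = """
<table>
  <caption>List of text editors</caption>
  <tr><th>Name</th><th>Creator</th></tr>
  <tr><th>Editor A</th><td>Alice</td></tr>
</table>
"""

table = ET.fromstring(RAW_SOURCE)

# Rows are direct children of <table> in the raw source...
direct_rows = table.findall("./tr")
# ...so a path that goes through <tbody> matches nothing.
tbody_rows = table.findall("./tbody/tr")

print(len(direct_rows), len(tbody_rows))
```

The same reasoning applies to XPath expressions run against the downloaded response: a path copied from browser developer tools that includes tbody may select no rows.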

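2. Extracting table data

A table caption may contain a hyperlink, which splits the caption text across several nodes, and some tables have no caption at all.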

<caption>Text editor support for remote file editing over 
　　<a href="/wiki/Lists_of_network_protocols" title="Lists of network protocols">network protocols</a>
</caption>
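Both caption cases can be handled in one place. The sketch below is a standard-library approximation, not the Scrapy code used later: itertext() plays the role of XPath string(), and the helper name caption_text is hypothetical. Collapsing whitespace is an extra step added here because the caption above spans several lines.

```python
import xml.etree.ElementTree as ET

def caption_text(table):
    """Return the full caption text of a <table> element, or '' if it has no <caption>."""
    caption = table.find("caption")
    if caption is None:
        return ""
    # itertext() yields the caption's own text plus the text of nested tags such as <a>;
    # split()/join() then collapses line breaks and indentation between the pieces.
    return " ".join("".join(caption.itertext()).split())

with_link = ET.fromstring(
    "<table><caption>Text editor support for remote file editing over \n"
    "  <a href='/wiki/Lists_of_network_protocols'>network protocols</a>\n"
    "</caption></table>"
)
without_caption = ET.fromstring("<table><tr><td>x</td></tr></table>")

print(repr(caption_text(with_link)))
print(repr(caption_text(without_caption)))
```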

Line breaks inside cell content

<td>
　　<a href="/wiki/Plan_9_from_Bell_Labs" title="Plan 9 from Bell Labs">Plan 9</a>
　　 and 
　　<a href="/wiki/Inferno_(operating_system)" title="Inferno (operating system)">Inferno</a>
</td>
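A cell like this keeps its internal line breaks when only the ends are stripped. The following sketch, again using the standard library on a hand-copied fragment (an assumption, since the real cell comes from the Scrapy response), compares strip() with full whitespace normalization.

```python
import xml.etree.ElementTree as ET

cell = ET.fromstring(
    "<td>\n"
    "  <a href='/wiki/Plan_9_from_Bell_Labs'>Plan 9</a>\n"
    "   and \n"
    "  <a href='/wiki/Inferno_(operating_system)'>Inferno</a>\n"
    "</td>"
)

raw = "".join(cell.itertext())        # keeps the line breaks from the markup
stripped = raw.strip()                # trims the ends only; inner line breaks remain
normalized = " ".join(raw.split())    # single spaces, no line breaks

print(repr(stripped))
print(repr(normalized))
```

If embedded line breaks in a CSV field are unwanted, the normalized form is the safer one to write out.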

Tag pattern

table
thead tr1	th	th	th	th
tbody tr2	td/th	td
tbody tr3	td/th
tbody tr3	td/th
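Because body rows can start with a th cell followed by td cells, a row reader has to accept both tags. A minimal sketch under that assumption, using a made-up fragment and a hypothetical helper named table_rows:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the pattern above: a header row of <th>,
# then body rows that start with a <th> name cell followed by <td> cells.
table = ET.fromstring(
    "<table>"
    "<tr><th>Name</th><th>Creator</th></tr>"
    "<tr><th>Editor A</th><td>Alice</td></tr>"
    "<tr><th>Editor B</th><td>Bob</td></tr>"
    "</table>"
)

def table_rows(table):
    """Yield each row as a list of cell texts, accepting both <th> and <td> cells."""
    # iter('tr') also finds rows nested inside <thead>/<tbody> when those are present.
    for tr in table.iter("tr"):
        yield [
            " ".join("".join(cell.itertext()).split())
            for cell in tr
            if cell.tag in ("th", "td")
        ]

rows = list(table_rows(table))
print(rows)
```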

2.1 Extracting the list of all table captions

import re

filenames = []

for index, table in enumerate(response.xpath('//table')):
    # All text inside <caption>, including text nested in child nodes.
    # Equivalent: ''.join(table.xpath('./caption//text()').extract())
    caption = table.xpath('string(./caption)').extract_first()
    # The XPath in 2.2 addresses tables by position, which starts at [1].
    filename = str(index+1)+'_'+caption if caption else str(index+1)
    filenames.append(re.sub(r'[^\w\s()]','',filename))    # remove special characters


In [233]: filenames
Out[233]:
[u'1_List of text editors',
 u'2_Text editor support for various operating systems',
 u'3_Available languages for the UI',
 u'4_Text editor support for common document interfaces',
 u'5_Text editor support for basic editing features',
 u'6_Text editor support for programming features (see source code editor)',
 u'7_Text editor support for other programming features',
 '8',
 u'9_Text editor support for key bindings',
 u'10_Text editor support for remote file editing over network protocols',
 u'11_Text editor support for some of the most common character encodings',
 u'12_Right to left (RTL)  bidirectional (bidi) support',
 u'13_Support for newline characters in line endings']
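The naming rule can be tried in isolation. This is a Python 3 sketch with a hypothetical helper named safe_filename. The caption containing "&" is an assumed input, chosen because removing one character between two spaces would produce a doubled space like the one in entry 12.

```python
import re

def safe_filename(index, caption):
    """Build '<index>_<caption>' (or just '<index>'), keeping only word
    characters, whitespace, and parentheses."""
    name = "%d_%s" % (index, caption) if caption else str(index)
    return re.sub(r"[^\w\s()]", "", name)

print(safe_filename(8, ""))
print(safe_filename(12, "Right to left (RTL) & bidirectional (bidi) support"))
print(safe_filename(1, "a/b: c?"))
```

Removing path separators and punctuation keeps the names usable as file names, at the cost of occasional doubled spaces where a symbol stood between two words.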

2.2 Writing each table to its own CSV file

import csv

for index, filename in enumerate(filenames):
    print filename
    with open('%s.csv'%filename,'wb') as fp:
        writer = csv.writer(fp)
        # (//table)[n] is the n-th table in document order, matching enumerate() in 2.1.
        for tr in response.xpath('(//table)[%s]/tr'%(index+1)):
            # './*' takes every child of the row; './th | ./td' would restrict it to cells.
            cells = [i.xpath('string(.)').extract_first() for i in tr.xpath('./*')]
            writer.writerow([c.replace(u'\xa0', u' ').strip().encode('utf-8','replace') for c in cells])

Why the code calls .replace(u'\xa0', u' ')

The HTML entity &nbsp; is a non-breaking space, Unicode U+00A0 (u'\xa0'), which the GBK codec cannot encode.
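The encoding behaviour can be checked directly. A short Python 3 sketch, with a made-up cell value:

```python
text = "Plan\xa09"   # hypothetical cell text containing a non-breaking space (&nbsp;)

try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Replacing U+00A0 with an ordinary space makes the text encodable in GBK.
cleaned = text.replace("\xa0", " ")
print(gbk_ok, cleaned.encode("gbk"))
```

Writing the file as UTF-8 avoids the error as well, but the replacement also makes the non-breaking spaces behave like ordinary spaces when the CSV is opened elsewhere.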

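In Python 2, opening the CSV file with 'w' produces extra blank lines between rows on Windows; opening it with 'wb' fixes this. See:

[Solved] Extra blank lines in the output written by csv writerow in Python – 在路上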
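In Python 3 the csv module expects a text-mode file, so 'wb' no longer applies; the documented counterpart is to pass newline=''. A minimal sketch, with made-up rows and a temporary output path as assumptions:

```python
import csv
import os
import tempfile

rows = [["Name", "Creator"], ["Editor A", "Alice"]]   # hypothetical sample rows

path = os.path.join(tempfile.mkdtemp(), "1_example.csv")

# Python 3: text mode with newline='' stops the platform from translating the
# '\r\n' that csv.writer emits, which is what produces blank lines on Windows.
with open(path, "w", newline="", encoding="utf-8") as fp:
    csv.writer(fp).writerows(rows)

with open(path, "rb") as fp:
    data = fp.read()
print(data)
```

With encoding="utf-8" on the file object, the per-cell .encode('utf-8', 'replace') from the Python 2 version is no longer needed.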

To write all tables into separate worksheets of a single Excel file, use xlwt. See:

Python: create an Excel workbook and dump CSV files as worksheets
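One way this could look, as a Python 3 sketch and not the linked article's code: it assumes xlwt is installed and that the CSV files from 2.2 exist as UTF-8 text. The helper names sheet_name and csvs_to_workbook are hypothetical. Worksheet names are truncated because Excel limits them to 31 characters.

```python
import csv

def sheet_name(filename):
    """Excel limits worksheet names to 31 characters, so truncate the CSV base name."""
    return filename[:31]

def csvs_to_workbook(filenames, out_path):
    """Copy each '<filename>.csv' into its own worksheet of one .xls workbook."""
    import xlwt  # third-party; imported here so the helper above works without it

    book = xlwt.Workbook(encoding="utf-8")
    for filename in filenames:
        sheet = book.add_sheet(sheet_name(filename))
        with open("%s.csv" % filename, newline="", encoding="utf-8") as fp:
            for r, row in enumerate(csv.reader(fp)):
                for c, value in enumerate(row):
                    sheet.write(r, c, value)
    book.save(out_path)

print(sheet_name("6_Text editor support for programming features (see source code editor)"))
```

A call such as csvs_to_workbook(filenames, 'tables.xls') would then produce one workbook with a sheet per table. The numeric prefix from 2.1 keeps the truncated sheet names distinct.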
