Scraping the Jinjiang Literature City (晉江文學網) Overall Score Ranking

Reposted article; original dated 2020-05-03 23:18.

1. Goal:

Scrape the overall score ranking of Jinjiang Literature City (jjwxc.net).

2. Scraping the data with Python

URL: http://www.jjwxc.net/topten.php?orderstr=7&t=0

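3. Scraping

On the first attempt, an extra entry numbered 38 appeared at the start of the output, and the order and content of the results were inaccurate.

Code: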

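import requests
from bs4 import BeautifulSoup

url = "http://www.jjwxc.net/topten.php?orderstr=7&t=0"

def getHtml(url):
    # Fetch the page, fail on HTTP errors, and guess the encoding from the content.
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    # Keep only a slice of the page; these character offsets are specific to this page.
    return r.text[26000:100000]

def fillList(html):
    # l1 collects titles, l2 collects the text of centered table cells.
    l1, l2 = [], []
    soup = BeautifulSoup(html, "html.parser")
    for i in soup.find_all('a', "tooltip"):
        l1.append(str(i.string))
    for tag in soup.find_all('td', {"align": "center"}):
        s = str(tag.string)
        # str.replace returns a new string, so the result has to be assigned back.
        # Assumption: the original call was meant to turn non-breaking spaces into spaces.
        s = s.replace("\xa0", " ")
        l2.append(s)
    return l1, l2

def printList(l1, l2):
    n1, n2 = len(l1), len(l2)
    # Use the shorter length so that lists of different sizes cannot raise IndexError.
    n = min(n1, n2)
    for i in range(n):
        print("Rank {}: {}".format(i + 1, l1[i]))
        print("Score: {}".format(l2[i]))
        print("")

def main():
    html = getHtml(url)
    l1, l2 = fillList(html)
    printList(l1, l2)

main()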

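I could not separate these kinds of data from each other, which was discouraging.

A quick search on Baidu turned up the following:

Reference URL: https://www.cnblogs.com/wangyongfengxiaokeai/p/11869595.html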

It also seemed that the relative order of height='23' and align affected the result.
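Whether attribute order matters depends on how the match is done. As a minimal sketch, assuming the cells are matched with BeautifulSoup's attribute filter (as in fillList) rather than with a string or regular-expression search on the raw HTML, the order of height and align should not change which cells match. The two-cell HTML snippet below is made up for illustration and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the same two attributes written in different orders.
SAMPLE_HTML = """
<table><tr>
  <td height="23" align="center">first</td>
  <td align="center" height="23">second</td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each tag's attributes are checked by name, so neither the order in the
# markup nor the order of keys in the filter changes which tags match.
cells_a = soup.find_all("td", {"height": "23", "align": "center"})
cells_b = soup.find_all("td", {"align": "center", "height": "23"})

matched_a = [td.get_text(strip=True) for td in cells_a]
matched_b = [td.get_text(strip=True) for td in cells_b]
print(matched_a, matched_b)
```

If the order did appear to matter, the difference more likely came from another change made at the same time, or from matching on the raw text instead of on parsed tags.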

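After more experiments, I found that the problem lay in the part marked by the red box. Even after changing the code for that part several times, I still could not separate the columns completely. In the end, only the word count could be captured cleanly into l2, and the word count has little practical value here.

So the script can only produce the ranking.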
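One way to keep each title paired with its own score is to walk the table row by row instead of filling two independent lists. The following is only a sketch under stated assumptions: the sample HTML, the column order (rank, author, title, word count, score), and the helper name parse_rows are hypothetical and are not taken from the real page, whose layout would need to be checked in the browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>Author A</td><td><a class="tooltip" href="#">Book A</a></td>
      <td align="center">123456</td><td align="center">9,876,543</td></tr>
  <tr><td>2</td><td>Author B</td><td><a class="tooltip" href="#">Book B</a></td>
      <td align="center">654321</td><td align="center">1,234,567</td></tr>
  <tr><td colspan="5">row without a title link</td></tr>
</table>
"""

def parse_rows(html, score_index=-1):
    """Return (title, score) pairs, one per table row that has a title link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        link = row.find("a", class_="tooltip")
        cells = row.find_all("td")
        if link is None or not cells:
            continue  # skip header or layout rows
        title = link.get_text(strip=True)
        # Assumption: the score sits in the last cell of the row.
        score = cells[score_index].get_text(strip=True).replace("\xa0", " ")
        results.append((title, score))
    return results

pairs = parse_rows(SAMPLE_HTML)
for rank, (title, score) in enumerate(pairs, start=1):
    print("Rank {}: {} (score: {})".format(rank, title, score))
```

Because both values come from the same row, a missing or extra cell elsewhere on the page cannot shift the scores relative to the titles. Changing score_index would select a different column, such as the word count.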

🆗
