Scraping Player Career Data from Hupu with Python

Reposted article, 2020-04-20 20:50

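I. Web Crawler Design Plan

1. Crawler name: Hupu player career data scraper

2. Content: a player's career statistics scraped from Hupu

3. Overview: first analyze the page structure, then fetch the page source with the requests module, and finally parse it with BeautifulSoup to extract the required data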

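II. Structural Analysis of the Target Page

1. Structure and features of the target page

On the player's career data page, press F12 to open the element inspector.

With the browser's DevTools you can inspect the page and find the tags and attributes that hold the data.

Tag: <table class="players_table bott bgs_table">

That is, the selector table.players_table.bott.bgs_table matches the whole table.

The first row is the table header, which is not needed when collecting the data.

Tag: <tr class="color_font1 borders_btm">

Because the header row carries the same attributes as the data rows, the first tr has to be filtered out; the remaining rows are the career data we want.

The resulting selector is table.players_table.bott.bgs_table > tbody > tr.color_font1.borders_btm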
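The selector can be tried offline before touching the live site. The sketch below runs it against a small hand-written table. The sample markup, the cell values, and the helper name select_data_rows are assumptions made for illustration and are not taken from the real page. It also assumes, as described above, that the header row has the same classes as the data rows and therefore has to be dropped by position.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the real table has more columns.
SAMPLE_HTML = """
<table class="players_table bott bgs_table">
  <tbody>
    <tr class="color_font1 borders_btm"><td>Season</td><td>Team</td><td>Points</td></tr>
    <tr class="color_font1 borders_btm"><td>season-A</td><td>team-A</td><td>20.0</td></tr>
    <tr class="color_font1 borders_btm"><td>season-B</td><td>team-B</td><td>25.0</td></tr>
  </tbody>
</table>
"""

SELECTOR = "table.players_table.bott.bgs_table > tbody > tr.color_font1.borders_btm"


def select_data_rows(html):
    """Return the text of every matched row except the first (assumed header)."""
    soup = BeautifulSoup(html, "html.parser")
    matched = soup.select(SELECTOR)
    # Skip the first matched row by position, then read each cell's text
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in matched[1:]
    ]


rows = select_data_rows(SAMPLE_HTML)
print(rows)
```

One thing to keep in mind: html.parser does not add a <tbody> that is missing from the raw HTML, while browsers add one when they build the DOM shown in DevTools. If the selector matches nothing on the downloaded source, it may be worth trying it without the "> tbody" step.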

III. Crawler Implementation

1. Fetching and collecting the data

Reading the page

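# -*- coding: utf-8 -*-

import requests

def duqu():
    # Fetch the page source; return an empty string if the request fails
    try:
        r = requests.get("https://nba.hupu.com/players/lebronjames-650.html", timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException as e:
        print("Failed to open the page:", e)
        return ""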

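Parsing and collecting

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

records = []  # collected rows

def getdata(html):
    # Initialize the parser
    soup = BeautifulSoup(html, "html.parser")
    # Select the career-data rows
    rows = soup.select("table.players_table.bott.bgs_table > tbody > tr.color_font1.borders_btm")
    # Extract the fields of each row
    for row in rows:
        # Split the row text into fields and drop the empty ones;
        # building a new list avoids removing items while iterating
        datas = [s.strip() for s in row.get_text().split('\n') if s.strip()]
        # Add the row to the collected data
        records.append(datas)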

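One issue to note: splitting the row text on newlines leaves empty elements in the result, and these have to be removed before the row is stored.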
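Removing the empty elements is easy to get subtly wrong. Calling remove() on a list while looping over that same list makes the loop skip elements, so some empty strings can survive. The sketch below contrasts that pattern with a list comprehension; the function names and the sample cells are made up for illustration.

```python
def remove_in_loop(items):
    """Pitfall: mutating the list being iterated skips elements."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if len(item) == 0:
            items.remove(item)
    return items


def filter_empty(items):
    """Build a new list that keeps only the non-empty strings."""
    return [item for item in items if item]


cells = ["", "", "a", "", "", "b", ""]
leftover = remove_in_loop(cells)
cleaned = filter_empty(cells)
print(leftover)
print(cleaned)
```

With this input the first version still contains empty strings, while the comprehension returns only "a" and "b".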

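Saving the data

# Save the data
def save_data(rows, filename="data.txt"):
    # The with block closes the file even if writing fails
    with open(filename, "w", encoding="utf-8") as fo:
        # One output line per row, fields separated by a space
        for datas in rows:
            fo.write(" ".join(datas) + "\n")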
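Space-separated text becomes ambiguous if a field ever contains a space itself. As an alternative, the rows could be written as CSV with the standard csv module, which quotes fields when needed. This is only a sketch of that option: the helper name write_rows_csv and the sample rows are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_rows_csv(rows, fileobj):
    """Write each row (a list of strings) as one CSV record."""
    writer = csv.writer(fileobj)
    writer.writerows(rows)


# Hypothetical rows; the second one has a field containing a comma and a space
sample_rows = [
    ["season-A", "Team One", "20.0"],
    ["season-B", "Team, Two", "25.0"],
]

buffer = io.StringIO()
write_rows_csv(sample_rows, buffer)
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields unchanged
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(round_trip)
```

When writing to a real file, open it with open(path, "w", newline="", encoding="utf-8") so the csv module controls the line endings.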

Complete code:

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

records = []  # collected rows

def duqu():
    # Fetch the page source; return an empty string if the request fails
    try:
        r = requests.get("https://nba.hupu.com/players/lebronjames-650.html", timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException as e:
        print("Failed to open the page:", e)
        return ""


def getdata(html):
    # Initialize the parser
    soup = BeautifulSoup(html, "html.parser")
    # Select the career-data rows
    rows = soup.select("table.players_table.bott.bgs_table > tbody > tr.color_font1.borders_btm")
    # Extract the fields of each row
    for row in rows:
        # Split the row text into fields and drop the empty ones
        datas = [s.strip() for s in row.get_text().split('\n') if s.strip()]
        # Add the row to the collected data
        records.append(datas)
    # Save the data; the with block closes the file automatically
    with open("lebronjames.txt", "w", encoding="utf-8") as fo:
        # One output line per row, fields separated by a space
        for datas in records:
            fo.write(" ".join(datas) + "\n")


html = duqu()
getdata(html)
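The complete script hardcodes both the player URL and the output file name. If it were pointed at other player pages, the file name could be derived from the URL. The sketch below assumes the URL always ends in a name-id.html segment, as the one used above does; the helper name output_name_for is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def output_name_for(url):
    """Derive an output file name from a player page URL."""
    # "/players/lebronjames-650.html" -> "lebronjames-650"
    stem = posixpath.splitext(posixpath.basename(urlparse(url).path))[0]
    # Drop the trailing "-<id>" part when there is one
    slug = stem.rsplit("-", 1)[0] if "-" in stem else stem
    return slug + ".txt"


name = output_name_for("https://nba.hupu.com/players/lebronjames-650.html")
print(name)
```

For the URL used in this article the helper returns lebronjames.txt, which matches the file name in the complete code.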
