Web Crawler: Scraping Links from Baidu News


1. Install beautifulsoup4: in cmd, run pip install beautifulsoup4
Python ships with a built-in module, urllib, for handling network requests; BeautifulSoup is used to parse the HTML.

Verify that the installation succeeded.
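One quick way to verify the installation (not shown in the original post, which used a screenshot here) is to import BeautifulSoup and parse a tiny HTML snippet:

```python
from bs4 import BeautifulSoup

# If beautifulsoup4 installed correctly, this import succeeds and the
# parser extracts the link from the snippet.
soup = BeautifulSoup("<html><a href='x.html'>link</a></html>", "html.parser")
print(soup.a.get("href"))  # x.html
```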

2. PyCharm configuration


3. The code is as follows

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        # Download the page and parse it with the built-in HTML parser.
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        # Walk every <a> tag and print the hrefs that look like article links.
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)


news = "http://news.baidu.com/"
Scraper(news).scrape()


4. Running it prints the links scraped from Baidu News.
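Note that some of the printed hrefs may be relative paths. If absolute URLs are needed, the standard library's urllib.parse.urljoin can resolve them against the site root (a small addition, not part of the original code):

```python
from urllib.parse import urljoin

base = "http://news.baidu.com/"
# Relative links resolve against the base; absolute links pass through unchanged.
print(urljoin(base, "/guonei/a.html"))        # http://news.baidu.com/guonei/a.html
print(urljoin(base, "http://example.com/x.html"))  # http://example.com/x.html
```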


5. How can we save the scraped links to a file?

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        response = urllib.request.urlopen(self.site)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')
        # Open the output file once and append each matching link to it.
        with open("output.txt", "w") as f:
            for tag in soup.find_all('a'):
                url = tag.get('href')
                if url and 'html' in url:
                    print("\n" + url)
                    f.write(url + "\n")


Scraper('http://news.baidu.com/').scrape()
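A news page often contains the same link several times, so output.txt can end up with duplicates. One possible refinement (an assumption on my part, not in the original post) is to collect the hrefs into a set before writing; the toy HTML below stands in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for a downloaded page (assumed for illustration).
html = "<a href='a.html'></a><a href='a.html'></a><a href='b.html'></a>"
soup = BeautifulSoup(html, "html.parser")

# A set drops duplicate hrefs; sorting gives the file a stable order.
urls = sorted({tag.get("href") for tag in soup.find_all("a") if tag.get("href")})
print(urls)  # ['a.html', 'b.html']
```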




