Web Crawler: Scraping Links from Baidu News


1. Install beautifulsoup4: in cmd, run pip install beautifulsoup4
Python ships with a built-in module, urllib, for handling network requests; BeautifulSoup is used to parse the HTML.

Verify that the installation succeeded.
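One quick way to verify the installation (not shown in the original post, which used a screenshot here) is to import BeautifulSoup and parse a tiny HTML snippet:

```python
from bs4 import BeautifulSoup

# If beautifulsoup4 installed correctly, this import succeeds and the
# parser extracts the link from the snippet.
soup = BeautifulSoup("<html><a href='x.html'>link</a></html>", "html.parser")
print(soup.a.get("href"))  # x.html
```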

2. PyCharm configuration


3. The code is as follows

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        # Download the page and parse it with the built-in HTML parser.
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        # Walk every <a> tag and print the hrefs that look like article links.
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)


news = "http://news.baidu.com/"
Scraper(news).scrape()


4. Running it prints the links scraped from Baidu News.
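Note that some of the printed hrefs may be relative paths. If absolute URLs are needed, the standard library's urllib.parse.urljoin can resolve them against the site root (a small addition, not part of the original code):

```python
from urllib.parse import urljoin

base = "http://news.baidu.com/"
# Relative links resolve against the base; absolute links pass through unchanged.
print(urljoin(base, "/guonei/a.html"))        # http://news.baidu.com/guonei/a.html
print(urljoin(base, "http://example.com/x.html"))  # http://example.com/x.html
```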


5. How can we save the scraped links to a file?

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        response = urllib.request.urlopen(self.site)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')
        # Open the output file once and append each matching link to it.
        with open("output.txt", "w") as f:
            for tag in soup.find_all('a'):
                url = tag.get('href')
                if url and 'html' in url:
                    print("\n" + url)
                    f.write(url + "\n")


Scraper('http://news.baidu.com/').scrape()
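A news page often contains the same link several times, so output.txt can end up with duplicates. One possible refinement (an assumption on my part, not in the original post) is to collect the hrefs into a set before writing; the toy HTML below stands in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for a downloaded page (assumed for illustration).
html = "<a href='a.html'></a><a href='a.html'></a><a href='b.html'></a>"
soup = BeautifulSoup(html, "html.parser")

# A set drops duplicate hrefs; sorting gives the file a stable order.
urls = sorted({tag.get("href") for tag in soup.find_all("a") if tag.get("href")})
print(urls)  # ['a.html', 'b.html']
```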




