A simple web crawler sends a request to one URL and waits for its response. That works fine when the amount of data is small, but when you have a lot of data, a distributed crawler clearly has the advantage. For the distributed setup, one host (the master) usually provides a shared Redis queue for several crawlers, and the other hosts (the slaves) connect to the master remotely. How to install Redis is not covered here.
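Before the full scripts below, the core idea is just a shared Redis set: the master pushes article URLs into it and every slave pops from the same set, so each URL is handled by exactly one crawler. A minimal sketch of that idea (the address, key name, and sample URL are placeholders, not the final code):

from redis import Redis

# Placeholder master address; replace x.x.x.x with your master's IP
r = Redis.from_url("redis://x.x.x.x:6379", decode_responses=True)

# On the master: push an article URL into the shared set
r.sadd("first_urls", "http://python.jobbole.com/some-post/")

# On a slave: pop a URL; a set guarantees each URL is taken only once
url = r.spop("first_urls")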
Taking the Python articles on Jobbole (伯樂在線) as an example, my distributed crawler consists of three Python files: main01, main02, and main03. main01 runs on the master; its job is to crawl the article URLs and store them in Redis so that main02 and main03 can read them and parse the data. The main code of main01 is as follows:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
from redis import Redis
from lxml import etree

# Connect to the shared Redis running on the master (replace x.x.x.x with the master's IP)
r = Redis.from_url("redis://x.x.x.x:6379", decode_responses=True)

def get_urls(url="http://python.jobbole.com/all-posts/"):
    # Request one article-list page and collect every article URL on it
    result = requests.get(url=url)
    selector = etree.HTML(result.text)
    links = selector.xpath(r'//*[@id="archive"]/div/div[2]/p[1]/a[1]/@href')
    for link in links:
        # Store the URL in the shared set so main02/main03 can consume it
        r.sadd("first_urls", link)
    # Follow the "next page" link recursively until the last page
    next_url = extract_next_url(result.text)
    if next_url:
        get_urls(next_url)

def extract_next_url(html):
    # Return the href of the "next page" link, or None on the last page
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one('a.next.page-numbers')
    return next_link["href"] if next_link else None

if __name__ == '__main__':
    get_urls()
Connecting to the master's Redis from my local machine shows that the data has been written to Redis successfully.
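One quick way to check this from a slave is a few lines like the following (the master IP is a placeholder):

from redis import Redis

r = Redis.from_url("redis://x.x.x.x:6379", decode_responses=True)
print(r.scard("first_urls"))           # how many article URLs have been collected so far
print(r.srandmember("first_urls", 5))  # a few sample URLs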

Below is the code for main02:
# -*- coding: utf-8 -*-
import json
import codecs
import time
import requests
from redis import Redis
from lxml import etree
from settings import *
import MySQLdb

# Connect to the master's Redis; REDIS_URL comes from settings.py
r = Redis.from_url(url=REDIS_URL, decode_responses=True)

def parse_urls():
    # Wait until main01 has pushed at least one URL into the master's Redis
    while "first_urls" not in r.keys():
        time.sleep(1)
    while True:
        try:
            # Pop one article URL from the shared set and parse the page
            url = r.spop("first_urls")
            result = requests.get(url=url, timeout=10)
            selector = etree.HTML(result.text)
            title = selector.xpath(r'//*[@class="entry-header"]/h1/text()')
            title = title[0] if title else None
            author = selector.xpath(r'//*[@class="copyright-area"]/a/text()')
            author = author[0] if author else None
            items = dict(title=title, author=author, url=url)
            insert_mysql(items)
        except Exception:
            if "first_urls" not in r.keys():
                # The shared set is empty and gone, so all URLs have been handled
                print("Crawling finished, shutting down the crawler!")
                break
            else:
                print("Request to {} failed!".format(url))
                continue

def insert_json(value):
    # Alternative sink: append each item as one JSON line to a local file
    file = codecs.open("save.json", "a", encoding="utf-8")
    line = json.dumps(value, ensure_ascii=False) + "," + "\n"
    file.write(line)
    file.close()

def insert_mysql(value):
    # Write one parsed article into the local MySQL database
    conn = MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DBNAME, charset="utf8", use_unicode=True)
    cursor = conn.cursor()
    insert_sql = '''
        insert into article(title, author, url) VALUES (%s, %s, %s)
    '''
    cursor.execute(insert_sql, (value["title"], value["author"], value["url"]))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    parse_urls()
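main02 reads its configuration from a settings.py via from settings import *. That file is not shown in this post, but it presumably contains something like the following (every value here is a placeholder to replace with your own):

# settings.py -- assumed contents, all values are placeholders
REDIS_URL = "redis://x.x.x.x:6379"   # the master's Redis
MYSQL_HOST = "127.0.0.1"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"
MYSQL_DBNAME = "article_spider"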
main02 and main03 (main03 can use the same code as main02) can run on your local machines. In main02 we first check whether URLs have already been written into the master's Redis; if not, the main02 crawler waits until URLs appear in the master's Redis before moving on to the parsing below.
Run main02 and you can see that the data has been successfully written into the local MySQL database.
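Note that the article table has to exist before main02 runs. Its definition is not given in the post; a minimal schema consistent with the insert statement, plus a quick spot-check of the rows, might look like this (the schema details are assumptions):

import MySQLdb
from settings import *

conn = MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DBNAME, charset="utf8", use_unicode=True)
cursor = conn.cursor()
# Assumed minimal schema matching the insert in main02
cursor.execute("""
    CREATE TABLE IF NOT EXISTS article (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        author VARCHAR(100),
        url VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
# Spot-check what the crawler has written so far
cursor.execute("SELECT COUNT(*) FROM article")
print(cursor.fetchone())
conn.close()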

That is all there is to a simple distributed crawler. In real use the crawlers would of course run at the same time; main01 was run first here only for debugging.

