Python Crawler Journey (3): A Basic Crawler Framework and Crawling Site-Wide Market Data from Stockstar (證券之星)


In Part (1) of this series we covered how to crawl all A-share data from the Stockstar (證券之星) website, which mainly involved fetching and parsing web pages. In Part (2) we covered how to obtain and validate proxy IPs, which touched on multithreaded programming and data storage. This time we build on those two posts and crawl the market data of the entire Stockstar site. The approach from Part (1) is fine for a single column, but crawling more than a hundred columns that way would be far too much work, so let's first introduce a basic crawler architecture.

The crawler framework in this post consists of six basic modules: the crawler scheduler, the URL downloader, the URL manager, the HTML downloader, the HTML parser, and the data store. Their responsibilities are as follows:

Crawler scheduler: coordinates the work of the other four modules (the URL downloader runs separately, up front, to produce the link lists the scheduler consumes).

URL downloader: collects the URL links of the pages whose data needs to be crawled.

URL manager: manages URL links, maintaining the set of already-crawled URLs and the set of not-yet-crawled URLs, and exposes an interface for obtaining new URLs.

HTML downloader: fetches not-yet-crawled URLs from the URL manager and downloads the corresponding HTML pages.

HTML parser: takes the downloaded pages from the HTML downloader, extracts the useful data, and hands it to the data store.

Data store: persists the data extracted by the HTML parser to a file or a database.

To make this easier to follow, here is a diagram of how the basic crawler framework runs:

[Figure: workflow of the basic crawler framework]
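Since the original flow diagram may not survive reposting, here is a minimal sketch of the control flow the modules follow for one column. It is illustrative only: it borrows the method names defined in the modules below, and run_column itself is a hypothetical helper rather than part of the project code.

def run_column(urls, manager, downloader, engine, table_name):
    # the URL downloader has already collected `urls`, the page links of one column
    manager.add_new_urls(urls)               # the URL manager de-duplicates them
    while manager.has_new_url():             # loop until the uncrawled set is empty
        url = manager.get_new_url()
        html = downloader.download(url)      # the HTML downloader fetches the page
        parser = HtmlParser(html)            # the HTML parser extracts the quote table
        if len(parser.get_header()) > 0:
            DataOutput(engine, parser.get_dataframe(), table_name).output()   # the data store writes it to MySQL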

Now let's walk through each of the six modules in detail.

1. URL Downloader

The URL downloader works in two steps: first it downloads the URLs of the site's left-hand navigation bar, then it uses those navigation URLs to collect the list of links contained in each sub-column.


Here is the code that collects every link in the left navigation bar and generates a navigation file:

# get_catalog.py
# -*- coding: utf-8 -*-
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
import re
import os


class get_catalog(object):
    '''Build and work with the navigation file'''

    def save_catalog(self):
        '''Grab the captions and URLs of Stockstar's left-hand navigation and save them'''
        # fetch the page
        url = 'http://quote.stockstar.com'
        request = urllib.request.Request(url=url)
        response = urllib.request.urlopen(request)
        content = response.read().decode('gbk')
        # cut out the left-hand navigation block
        soup = BeautifulSoup(content, "lxml")
        soup = BeautifulSoup(str(soup.find_all('div', class_="subMenuBox")), "lxml")
        # initialise the DataFrames for the first- and second-level sub-menus
        catalog1 = pd.DataFrame(columns=["cata1", "cata2", "url2"])
        catalog2 = pd.DataFrame(columns=["url2", "cata3", "url3"])
        # collect each menu caption and its link
        index1 = 0
        index2 = 0
        for content1 in soup.find_all('div', class_=re.compile("list submenu?")):
            cata1 = re.findall('>(.*?)<', str(content1.h3.a))
            for content2 in content1.find_all('dl'):
                cata2 = re.findall('>(.*?)<', str(content2.dt.a).replace('\r\n', ''))
                url2 = url + content2.dt.a['href']
                catalog1.loc[index1] = {'cata1': cata1[0], 'cata2': cata2[0].split()[0], 'url2': url2}
                index1 += 1
                for content3 in content2.find_all('li'):
                    cata3 = re.findall('·(.*?)<', str(content3.a))
                    url3 = url + content3.a['href']
                    catalog2.loc[index2] = {'url2': url2, 'cata3': cata3[0], 'url3': url3}
                    index2 += 1
        # join the first- and second-level tables and save the result
        catalog = pd.merge(catalog1, catalog2, on='url2', how='left')
        catalog.to_csv('catalog.csv')  # note: load_catalog reads it back with encoding='gbk'; adjust one side if you hit a UnicodeDecodeError

    def load_catalog(self):
        '''Create the navigation file if it does not exist yet, then load it'''
        if 'catalog.csv' not in os.listdir():
            self.save_catalog()
            print('navigation file created')
        else:
            print('navigation file already exists')
        catalog = pd.read_csv('catalog.csv', encoding='gbk', usecols=range(1, 6))
        print('navigation file loaded')
        return catalog

    def index_info(self, catalog, index):
        '''Build a name for each row (used as the database table name) and return the row's final URL'''
        if str(catalog.loc[index]['cata3']) == 'nan':
            table_name = catalog.loc[index]['cata1'] + '_' + catalog.loc[index]['cata2']
            url = catalog.loc[index]['url2']
        else:
            # characters such as + and () are not allowed in table names, so replace or drop them
            if '+' in catalog.loc[index]['cata3']:
                cata3 = catalog.loc[index]['cata3'].replace('+', '')
                table_name = catalog.loc[index]['cata1'] + '_' + catalog.loc[index]['cata2'] + '_' + cata3
            elif '(' in catalog.loc[index]['cata3']:
                cata3 = catalog.loc[index]['cata3'].replace('(', '').replace(')', '')
                table_name = catalog.loc[index]['cata1'] + '_' + catalog.loc[index]['cata2'] + '_' + cata3
            else:
                table_name = catalog.loc[index]['cata1'] + '_' + catalog.loc[index]['cata2'] + '_' + catalog.loc[index]['cata3']
            url = catalog.loc[index]['url3']
        return table_name, url
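A quick usage sketch of the class above (run it from the project folder so that catalog.csv lands next to the script):

getcata = get_catalog()
catalog = getcata.load_catalog()                  # builds catalog.csv on the first run, loads it afterwards
table_name, url = getcata.index_info(catalog, 0)  # table name and final URL for the first navigation entry
print(table_name, url)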

Here is the code that collects the full link list of each sub-column:

# get_urls.py
import pandas as pd
from selenium import webdriver
import time
import re
import math
from get_catalog import get_catalog


class get_urls(object):
    '''Collect the link list of one column'''

    def __init__(self, browser, url):
        self.browser = browser   # selenium browser object
        self.url = url           # column URL to crawl

    def get_browser(self):
        '''Open the URL, retrying up to five times'''
        state = 0
        test = 0
        while state == 0 and test < 5:
            try:
                self.browser.get(self.url)
                state = 1
                print('connected to %s' % self.url)
            except:
                test += 1

    def get_element(self):
        '''Collect the links behind the paging buttons'''
        self.get_browser()
        element_list = []
        for i in range(1, 8):
            try:
                element = self.browser.find_element_by_xpath('//*[@id="divPageControl1"]/a[%d]' % i).get_attribute('href')
                element_list.append(element)
            except:
                time.sleep(0.2)
        return element_list

    def get_urllist(self):
        '''Turn the paging buttons into the full list of page links'''
        element_list = self.get_element()
        if len(element_list) <= 1:
            urls = [self.url]
        else:
            try:
                # the third-from-last paging link carries the largest page number, e.g. ..._12.html
                max_number = re.search(r'_(\d*)\.', element_list[len(element_list) - 3])
                begin = max_number.start() + 1
                end = max_number.end() - 1
                int_max_number = int(element_list[len(element_list) - 3][begin:end])
                urls = []
                for i in range(1, int_max_number + 1):
                    url = element_list[len(element_list) - 3][:begin] + str(i) + element_list[len(element_list) - 3][end:]
                    urls.append(url)
            except:
                urls = [self.url]
        return urls
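A usage sketch. Note that find_element_by_xpath belongs to the Selenium 3 API; on Selenium 4 you would need find_element(By.XPATH, ...) instead. The url below is simply whichever column link index_info returned:

from selenium import webdriver

browser = webdriver.Chrome()        # assumes chromedriver is available on PATH
pager = get_urls(browser, url)      # url: a column link taken from the navigation file
urls = pager.get_urllist()          # one link per page of that column
print(len(urls), urls[:3])
browser.quit()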

2. URL Manager

The URL manager keeps two collections: the set of URLs that have already been crawled and the set of URLs that have not been crawled yet. Python's set type is used mainly for its automatic de-duplication.

Besides the two URL sets, the URL manager also needs to expose the following interface for the other modules:

Check whether there are URLs left to crawl; the method is defined as has_new_url().

Add new URLs to the not-yet-crawled set; the methods are defined as add_new_url(url) and add_new_urls(urls).

Fetch one not-yet-crawled URL; the method is defined as get_new_url().

Here is the code for the URL manager module:

# UrlManager.py
# coding: utf-8

class UrlManager(object):
    '''URL manager'''

    def __init__(self):
        self.new_urls = set()   # URLs not yet crawled
        self.old_urls = set()   # URLs already crawled

    def has_new_url(self):
        '''Return True if there are URLs left to crawl'''
        return self.new_url_size() != 0

    def get_new_url(self):
        '''Pop one uncrawled URL and mark it as crawled'''
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        '''Add a single new URL to the uncrawled set'''
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        '''Add a list of new URLs to the uncrawled set'''
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def new_url_size(self):
        '''Return the number of uncrawled URLs'''
        return len(self.new_urls)
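A minimal sketch of the interface in action (the two links are only sample values):

manager = UrlManager()
manager.add_new_urls(['http://quote.stockstar.com/stock', 'http://quote.stockstar.com/fund'])
while manager.has_new_url():
    url = manager.get_new_url()      # a real run would hand this to the HTML downloader
    print('crawling', url)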

3. HTML Downloader

The HTML downloader fetches web pages. Pay attention to the page encoding here so that the downloaded pages are not garbled.
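Stockstar serves its quote pages in GBK, which is why the downloader below calls decode('gbk'). As a small, hypothetical safeguard (not part of the original code), you can fall back to UTF-8 when a page turns out not to be GBK:

import urllib.request

def fetch_text(url):
    raw = urllib.request.urlopen(url).read()
    try:
        return raw.decode('gbk')                        # Stockstar quote pages are GBK-encoded
    except UnicodeDecodeError:
        return raw.decode('utf-8', errors='replace')    # fallback for non-GBK pages (assumption)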

Fetching pages may also get your IP banned, so we first crawl a pool of proxy IPs for the HTML downloader to fall back on.

Here is the code that builds the proxy IP pool:

# get_proxy_ip.py
import urllib.request
import re
import time
import random
import socket
import threading


class proxy_ip(object):
    '''Collect working proxy IPs and save them'''

    def __init__(self, url, total_page):
        self.url = url                  # URL we plan to crawl (used to validate the proxies)
        self.total_page = total_page    # number of proxy-list pages to walk through

    def get_proxys(self):
        '''Scrape candidate proxy IPs'''
        user_agent = ["Mozilla/5.0 (Windows NT 10.0; WOW64)",
                      'Mozilla/5.0 (Windows NT 6.3; WOW64)',
                      'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                      'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
                      'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
                      'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
                      'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
                      'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
                      'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
                      'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
                      'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1 ',
                      'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 ',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
                      'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']
        ip_totle = []
        for page in range(1, self.total_page + 1):
            #url = 'http://www.httpsdaili.com/?page=' + str(page)
            #url = 'http://www.kuaidaili.com/free/inha/' + str(page) + '/'
            url = 'http://www.xicidaili.com/nn/' + str(page)   # xici free proxy list
            headers = {"User-Agent": random.choice(user_agent)}
            try:
                request = urllib.request.Request(url=url, headers=headers)
                response = urllib.request.urlopen(request)
                content = response.read().decode('utf-8')
                print('get page', page)
                pattern = re.compile(r'<td>(\d.*?)</td>')   # grab <td> cells whose content starts with a digit
                ip_page = re.findall(pattern, str(content))
                ip_totle.extend(ip_page)
            except Exception as e:
                print(e)
            time.sleep(random.choice(range(1, 5)))
        # print what was scraped
        print('proxy IP', '\t', 'port', '\t', 'speed', '\t', 'verified at')
        for i in range(0, len(ip_totle), 4):
            print(ip_totle[i], ' ', '\t', ip_totle[i + 1], '\t', ip_totle[i + 2], '\t', ip_totle[i + 3])
        # normalise the proxies into the format urllib expects
        proxys = []
        for i in range(0, len(ip_totle), 4):
            proxy_host = ip_totle[i] + ':' + ip_totle[i + 1]
            proxy_temp = {"http": proxy_host}
            proxys.append(proxy_temp)
        return proxys

    def test(self, lock, proxys, i, f):
        '''Check that a proxy IP actually works'''
        socket.setdefaulttimeout(15)   # global socket timeout
        url = self.url
        try:
            proxy_support = urllib.request.ProxyHandler(proxys[i])
            opener = urllib.request.build_opener(proxy_support)
            opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)")]
            urllib.request.install_opener(opener)
            res = urllib.request.urlopen(url).read().decode('gbk')   # the validation target (Stockstar) is GBK-encoded
            #res = urllib.request.urlopen(url).read().decode('utf-8')  # use this for UTF-8 sites instead
            print(res)
            lock.acquire()                        # acquire the lock before touching the shared file
            print(proxys[i], 'is OK')
            f.write('%s\n' % str(proxys[i]))      # record the working proxy
            lock.release()                        # release the lock
        except Exception as e:
            lock.acquire()
            print(proxys[i], e)
            lock.release()

    def get_ip(self):
        '''Validate the proxies with multiple threads'''
        f = open('proxy_ip.txt', 'a+')   # file that stores the working proxies
        lock = threading.Lock()          # one lock shared by all threads
        proxys = self.get_proxys()
        threads = []
        for i in range(len(proxys)):
            thread = threading.Thread(target=self.test, args=[lock, proxys, i, f])
            threads.append(thread)
            thread.start()
        # block the main thread until every worker finishes
        for thread in threads:
            thread.join()
        f.close()
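A usage sketch, matching how the main program calls it. Keep in mind that free proxy-list sites come and go; if www.xicidaili.com is unreachable you will need to swap another source into get_proxys:

pool = proxy_ip('http://quote.stockstar.com', 5)   # validate proxies against the target site, walk 5 list pages
pool.get_ip()                                      # working proxies are appended to proxy_ip.txt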

Here is the code for the HTML downloader module:

# HtmlDownloader.py
# _*_ coding: utf-8 _*_
from firstSpider.get_proxy_ip import proxy_ip
import urllib.request
import random
import os
import socket
import time
import re


class HtmlDownloader(object):
    '''Download page content'''

    def download(self, url):
        user_agent = ["Mozilla/5.0 (Windows NT 10.0; WOW64)",
                      'Mozilla/5.0 (Windows NT 6.3; WOW64)',
                      'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                      'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
                      'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
                      'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
                      'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
                      'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
                      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
                      'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
                      'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
                      'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1 ',
                      'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 ',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
                      'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']
        state = 0
        test = 0
        socket.setdefaulttimeout(20)   # global socket timeout
        while state == 0 and test < 5:
            try:
                # pick a random User-Agent for each request
                request = urllib.request.Request(url=url, headers={"User-Agent": random.choice(user_agent)})
                response = urllib.request.urlopen(request)
                readhtml = response.read()
                content = readhtml.decode('gbk')   # read the page content
                time.sleep(random.randrange(1, 6))
                if re.search('Auth Result', content) == None:
                    state = 1
            except Exception as e:
                print('failed to fetch the page with the local IP:', e)
                if 'proxy_ip.txt' not in os.listdir() or os.path.getsize('proxy_ip.txt') == 0:
                    print('no proxy IP pool found, building a new one')
                    pool = proxy_ip(url, 5)
                    pool.get_ip()
                    print('proxy IP pool created')
                else:
                    f = open('proxy_ip.txt', 'r')
                    proxys_ip = f.readlines()
                    f.close()
                    random.shuffle(proxys_ip)
                    for i in range(len(proxys_ip)):
                        try:
                            proxy_support = urllib.request.ProxyHandler(eval(proxys_ip[i][:-1]))
                            opener = urllib.request.build_opener(proxy_support)
                            opener.addheaders = [("User-Agent", random.choice(user_agent))]
                            urllib.request.install_opener(opener)
                            response = urllib.request.urlopen(url)
                            readhtml = response.read()
                            content = readhtml.decode('gbk')
                            time.sleep(random.randrange(1, 6))
                            if re.search('Auth Result', content) == None:   # skip responses that flag us as an invalid user
                                state = 1
                                print('connected through proxy IP', proxys_ip[i])
                                break
                        # specific handlers first, generic Exception last (otherwise they are unreachable)
                        except urllib.error.HTTPError as e:
                            print(proxys_ip[i], 'request failed', e.code)
                        except urllib.error.URLError as e:
                            print(proxys_ip[i], 'request failed', e.reason)
                        except Exception as e:
                            print(proxys_ip[i], 'request failed', e)
                    try:
                        if i == len(proxys_ip) - 1:
                            os.remove('proxy_ip.txt')
                            print('proxy IP pool exhausted, file deleted')
                    except:   # i is undefined when the pool file was empty
                        os.remove('proxy_ip.txt')
                        print('proxy IP pool was empty, file deleted')
                    time.sleep(60)
                test += 1
        if test == 5:
            print('failed to fetch the page at %s' % url)
            content = None
        return content
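A usage sketch:

downloader = HtmlDownloader()
html = downloader.download('http://quote.stockstar.com')   # falls back to the proxy pool if the local IP fails
if html is not None:
    print(html[:200])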

4. HTML Parser

The HTML parser processes the pages delivered by the HTML downloader and extracts the content we want.

The parsing in this post relies mainly on regular expressions and BeautifulSoup. Here is the HTML parser code:

# HtmlParser.py
# coding: utf-8
import re
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request
import numpy as np
import time
import datetime


class HtmlParser(object):
    '''Parse page content'''

    def __init__(self, content):
        self.soup = BeautifulSoup(content, "lxml")   # content to parse

    def get_header(self):
        '''Extract the table header'''
        try:
            header = []
            for tag in self.soup.thead.find_all('td'):
                title = str(tag)
                title = title.replace(' ', '')
                title = title.replace('\n', '')
                header.extend(re.findall('>(.*?)<', title))
            header_name = []
            for data in header:
                if data != '':
                    header_name.append(data.strip())
            header_name.append('數據時間')   # extra column holding the data date
        except:   # no header: return an empty list, which also marks the page as not useful
            header_name = []
            return header_name
        h2_len = len(self.soup.thead.find_all('td', class_="h2"))
        datalist_len = len(self.soup.find_all('tbody', id="datalist")
                           + self.soup.find_all('tbody', id="datalist1")
                           + self.soup.find_all('tbody', id="datalist2"))
        if h2_len >= 6 or datalist_len == 0:   # skip pages with irregular headers or no data
            header_name = []
        return header_name

    def get_header2(self):
        '''Extract the table header when it spans two rows'''
        stati_date = []
        for date in self.soup.thead.find_all('td', class_="double align_center"):
            stati_date.extend(re.findall('>(.*?)<', str(date)))
        header_total = self.get_header()
        header_name = header_total[:-5]
        header_name = header_name[:2] + header_total[-5:-1] + header_name[2:]
        if stati_date[0] in header_name:
            header_name.remove(stati_date[0])
        if stati_date[1] in header_name:
            header_name.remove(stati_date[1])
        header_name.append('三四列統計時間')   # statistics period of columns 3-4
        header_name.append('五六列統計時間')   # statistics period of columns 5-6
        header_name.append('數據時間')
        return header_name, stati_date

    def get_datatime(self):
        '''Extract the data date'''
        try:
            date = re.findall('數據時間:(.*?)<', str(self.soup.find_all('span', class_="fl")))[0][0:10]
        except:   # if the page does not show it, infer it from the system clock
            now_time = time.localtime()
            if time.strftime("%w", now_time) in ['1', '2', '3', '4', '5']:
                date = time.strftime("%Y-%m-%d", now_time)
            elif time.strftime("%w", now_time) == '6':
                dt = datetime.datetime.now() - datetime.timedelta(days=1)
                date = dt.strftime("%Y-%m-%d")
            else:
                dt = datetime.datetime.now() - datetime.timedelta(days=2)
                date = dt.strftime("%Y-%m-%d")
        return date

    def get_datalist(self):
        '''Extract the table body'''
        if len(self.soup.find_all('tbody', id="datalist")) >= 1:
            soup = BeautifulSoup(str(self.soup.find_all('tbody', id="datalist")[0]), "lxml")
        elif len(self.soup.find_all('tbody', id="datalist1")) >= 1:
            soup = BeautifulSoup(str(self.soup.find_all('tbody', id="datalist1")[0]), "lxml")
        else:
            soup = BeautifulSoup(str(self.soup.find_all('tbody', id="datalist2")[0]), "lxml")
        date = self.get_datatime()
        row = len(soup.tbody.find_all('tr'))
        # initialise the array for the single-header and double-header layouts
        if len(self.soup.thead.find_all('td', class_="double align_center")) == 0:
            header_name = self.get_header()
            col = len(header_name)
            datalist = np.array([''] * (row * col), dtype='U24').reshape(row, col)
            flag = 1
        else:
            header_name = self.get_header2()[0]
            col = len(header_name)
            datalist = np.array([''] * (row * col), dtype='U24').reshape(row, col)
            flag = 2
        for i in range(row):
            # extract each row and write it into the array
            detail = re.findall('>(.*?)<', str(soup.find_all('tr')[i]))
            for blank in range(detail.count('')):
                detail.remove("")
            try:
                if flag == 1:
                    detail.append(date)
                    datalist[i] = detail
                elif flag == 2:
                    stati_date = self.get_header2()[1]
                    detail.append(stati_date[0])
                    detail.append(stati_date[1])
                    detail.append(date)
                    datalist[i] = detail
            except:
                datalist[i][0] = detail[0]
                datalist[i][col - 1] = date
        return datalist, header_name

    def get_dataframe(self):
        '''Combine header and data into a DataFrame'''
        datalist, header_name = self.get_datalist()
        table = pd.DataFrame(datalist, columns=header_name)
        return table
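A usage sketch, feeding the parser a page fetched by the downloader (url stands for any column page collected by get_urls):

html = HtmlDownloader().download(url)
parser = HtmlParser(html)
if len(parser.get_header()) > 0:      # only pages with a recognisable quote table are parsed
    df = parser.get_dataframe()
    print(parser.get_datatime(), df.shape)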

5. Data Store

The data store persists what the HTML parser extracts. There are many storage options; this post uses a MySQL database.

The parser packs each page of stock data into a DataFrame, and the DataFrame is then written straight into the database through a database connection engine.

Here is the code for the data store module:

# DataOutput.py
import pymysql
from sqlalchemy import create_engine
import pandas as pd
from firstSpider.HtmlParser import HtmlParser


class DataOutput(object):
    '''Write the data into a MySQL database'''

    def __init__(self, engine, table, table_name):
        self.engine = engine          # database connection engine
        self.table = table            # DataFrame to store
        self.table_name = table_name  # target table name

    def output(self):
        self.table.to_sql(name=self.table_name, con=self.engine,
                          if_exists='append', index=False, index_label=False)
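A usage sketch; the connection string is a placeholder, so substitute your own MySQL user, password, and schema, and the table name is only illustrative:

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost:3306/stockdb?charset=utf8')  # placeholder credentials
DataOutput(engine, df, 'stock_ranklist').output()   # df: a DataFrame produced by HtmlParser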

6. Crawler Scheduler

The crawler scheduler wires the modules above together, dividing the work sensibly so the job gets done efficiently.

The scheduler uses a thread pool to speed up execution. Here is the code for the scheduler module:

# SpiderMan.py
from firstSpider.UrlManager import UrlManager
from firstSpider.HtmlDownloader import HtmlDownloader
from firstSpider.HtmlParser import HtmlParser
from firstSpider.DataOutput import DataOutput
from sqlalchemy import create_engine
import threadpool
import time


class SpiderMan(object):
    '''Crawler robot'''

    def __init__(self, engine, table_name):
        self.engine = engine                  # database connection engine
        self.table_name = table_name          # target table name
        self.manager = UrlManager()           # URL manager
        self.downloader = HtmlDownloader()    # HTML downloader

    def spider(self, url):
        '''Crawl a single page'''
        # HTML downloader fetches the page
        html = self.downloader.download(url)
        f = open('stock.txt', 'w')            # keep a copy of the latest page for debugging
        f.write(html)
        f.close()
        # HTML parser extracts the page data
        parser = HtmlParser(html)
        if len(parser.get_header()) > 0:
            data = parser.get_dataframe()
            # data store writes it out
            out = DataOutput(self.engine, data, self.table_name)
            out.output()
            print('data from %s stored in table %s' % (url, self.table_name))
        time.sleep(1)
        return parser.get_datatime()

    def crawl(self, urls):
        '''Crawl the whole link list of one column'''
        self.manager.add_new_urls(urls)
        pool = threadpool.ThreadPool(10)
        # keep pulling new URLs while the URL manager has any left
        while self.manager.has_new_url():
            new_url = self.manager.get_new_url()
            requests = threadpool.makeRequests(self.spider, (new_url,))
            pool.putRequest(requests[0])
        pool.wait()
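A usage sketch for one column (the engine and table name are placeholders; urls is the page list produced by get_urls):

spider_man = SpiderMan(engine, 'stock_ranklist')
spider_man.crawl(urls)                      # crawls every page of the column through the thread pool
print(spider_man.spider(urls[0]))           # re-crawls the first page and returns the data date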

Put each module above in its own .py file inside a firstSpider folder, then run the main program below to crawl the day's stock data for the whole Stockstar site:

# main.py
from firstSpider.get_proxy_ip import proxy_ip
from firstSpider.get_catalog import get_catalog
from firstSpider.get_urls import get_urls
from firstSpider.SpiderMan import SpiderMan
from selenium import webdriver
from sqlalchemy import create_engine
import time

'''Download all of today's Stockstar data by walking the left-hand sub-navigation'''
if __name__ == "__main__":
    print('fetching proxy IPs and validating them')
    ip_pool = proxy_ip('http://quote.stockstar.com', 8)
    ip_pool.get_ip()
    print('proxy IP pool ready')
    getcata = get_catalog()
    catalog = getcata.load_catalog()
    start = 0
    end = len(catalog)
    catalog = catalog[start:end]
    print('initialising the browser')
    browser = webdriver.Chrome()
    # replace the connection string with your own MySQL credentials and schema
    engine = create_engine('mysql+pymysql://root:Jwd116875@localhost:3306/scott?charset=utf8')
    for index in range(start, end):
        table_name, url = getcata.index_info(catalog, index)
        stop_url = ['http://quote.stockstar.com/gold/globalcurrency.shtml']   # pages we want to skip
        if url not in stop_url:
            geturls = get_urls(browser, url)
            urls = geturls.get_urllist()
            print('link list for %s collected' % table_name)
            Spider_man = SpiderMan(engine, table_name)
            Spider_man.crawl(urls)
            datatime = Spider_man.spider(urls[0])
            print('%s: incremental data for column %s (dated %s) crawled' % (index, table_name, datatime))

Small though it is, this crawler has all the essential parts: the simple framework above is enough to crawl the whole site. There is still plenty of room to improve execution speed and to disguise the crawler better, and I look forward to learning and improving together with everyone.

