Scrapy is a powerful crawling framework: we can plug in custom extensions to meet all sorts of needs.
To build a web console for it, we first need a way to run a web service with Twisted.
As it happens, Scrapy has already wrapped this up for us:
from scrapy.utils.reactor import listen_tcp
listen_tcp binds a Twisted factory to the first free port in a given range, which is all we need to open a web service.
So the web extension can be written like this:
```python
import logging
import webbrowser

from twisted.web import server
from twisted.cred.portal import Portal
from twisted.web.guard import HTTPAuthSessionWrapper, BasicCredentialFactory

from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.utils.reactor import listen_tcp

logger = logging.getLogger(__name__)


class WebService(server.Site):

    name = 'WebService'

    def __init__(self, crawler):
        self.crawler = crawler
        self.crawler.itemData = []
        # Root is our own resource tree (see below); PublicHTMLRealm and
        # StringCredentialsChecker are our twisted.cred helpers for basic auth
        portal = Portal(PublicHTMLRealm(Root(self.crawler)),
                        [StringCredentialsChecker('test', 'tset')])
        credential_factory = BasicCredentialFactory("Auth")
        resource = HTTPAuthSessionWrapper(portal, [credential_factory])
        server.Site.__init__(self, resource)
        self.crawler.signals.connect(self.start_listening, signals.engine_started)
        self.crawler.signals.connect(self.stopService, signals.engine_stopped)
        self.crawler.signals.connect(self.item_scraped, signals.item_scraped)
        self.crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_listening(self):
        # bind the first free port between 8000 and 8070
        self.port = listen_tcp([8000, 8070], '127.0.0.1', self)
        h = self.port.getHost()
        logger.info("scrapy web console available at http://%(host)s:%(port)d",
                    {'host': h.host, 'port': h.port},
                    extra={'crawler': self.crawler})
        webbrowser.open("http://%(host)s:%(port)d" % {'host': h.host, 'port': h.port})

    def stopService(self):
        self.port.stopListening()

    def item_scraped(self, item, response, spider):
        # collect every scraped item so the console can display them
        self.crawler.itemData.append(item)

    def spider_idle(self):
        # keep the spider alive so the console stays reachable
        raise DontCloseSpider
```
The interface itself can then be implemented in Root.
Here is the interface as implemented:

On it we can add operations to control the crawler, such as pausing the spider or adding new start URLs to crawl.

And of course we can also build debugging pages, or other interesting things.
