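In this section I will show how to run and debug YayCrawler. The framework is built with Spring Boot, so you can run it with java -jar xxxx.jar or deploy it to a servlet container such as Tomcat.
First, the runtime environment:
1. JDK 8
2. MySQL, used to store parsing rules and similar data. Create a database named "yayCrawler" and run the Quartz schema script quartz.sql (included in the release package and in the source).
3. Redis, used as the task queue
4. MongoDB, used to store the crawled results
5. An FTP server such as ftpserver (optional, used to store downloaded images)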
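Before starting anything, it can help to confirm that the services in the list above are actually listening. The following is a minimal sketch, not part of YayCrawler: the class name PrerequisiteCheck is made up for illustration, and the host and ports are assumptions (local installs on the default ports for MySQL, Redis and MongoDB, plus port 2121 for the FTP server), so adjust them to your setup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not shipped with YayCrawler.
public class PrerequisiteCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    public static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unreachable: treat all of these as "not running".
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed ports; change them if your services listen elsewhere.
        Map<String, Integer> services = new LinkedHashMap<>();
        services.put("MySQL", 3306);
        services.put("Redis", 6379);
        services.put("MongoDB", 27017);
        services.put("FTP (optional)", 2121);

        for (Map.Entry<String, Integer> entry : services.entrySet()) {
            boolean up = isReachable("127.0.0.1", entry.getValue(), 1000);
            System.out.printf("%-15s port %-6d %s%n",
                    entry.getKey(), entry.getValue(), up ? "reachable" : "NOT reachable");
        }
    }
}
```

A reachable port only shows that something is listening there. It does not verify credentials or that the database and the Quartz tables exist.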
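I. Running the release package
First, get the release package from https://github.com/liushuishang/YayCrawler.Release.git. Its directory layout is as follows:
Every folder has the same file structure; take the admin folder as an example: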
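The casperjs and phantomjs folders are there for certain special operations and can be ignored for now. xxx_local.properties is the service configuration file; it holds the port, database connection and other settings, which you can adjust to your environment. quartz.sql is the database table script that the Quartz framework needs, placed here for convenience. start.bat and start.vbs are both startup scripts; double-click either one to start the Admin. start.bat prints the log to a console window, while start.vbs runs in the background without opening a console. After startup, a folder named "catalina.base_IS_UNDEFINED" appears in this directory and holds the log output. Double-click stop.bat to stop the Admin. Let's double-click start.bat to start the Admin: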
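You can see that the Admin has started successfully. Open http://localhost:8069/admin/ in a browser to reach the admin UI:
Master and Worker are started the same way as Admin, except that they have no web UI, so I won't repeat the steps here.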
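II. Running and debugging from source
First pull the source from https://github.com/liushuishang/YayCrawler.git and open it in IntelliJ IDEA (Eclipse works too). You will see the following directory structure: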
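yaycrawler-admin: the web admin console, where users configure and test parsing rules, inspect the task queues, publish tasks and so on.
yaycrawler-master: manages the task queues and task scheduling, and communicates with both admin and worker.
yaycrawler-worker: the working end for crawl tasks. It sends periodic heartbeats to the master, receives and executes tasks, and is responsible for persisting the data.
yaycrawler-spider: integrates with WebMagic. It downloads and parses pages and defines the processing flow and interfaces for crawl tasks.
yaycrawler-common: shared entity models and utilities.
yaycrawler-monitor: a toolkit for getting around anti-crawler measures, such as captcha refresh and automatic login.
yaycrawler-proxy: a toolkit for finding usable IP proxies on the web.
yaycrawler-cache: the component that provides data caching for the framework.
yaycrawler-quartz: a general-purpose scheduled-task component. Different tasks can be scheduled through configuration, which makes it suitable for scheduled crawl tasks.
yaycrawler-dao: provides access to the MySQL database.
yaycrawler-ftpserver: the ftpserver client toolkit.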
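We configure three Tomcat servers, one each for admin, master and worker. The HTTP port each one uses is shown in the figure below and matches the server.port value in the corresponding configuration that follows.
Then edit the src/main/resources/application.properties file in each project, for example (pay attention to the settings marked in red):
Admin: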
signature.token=2c91d29854a2f3fc0154a30959f40003
# address of the Master service
master.server.address=http://127.0.0.1:8068/master/
# EMBEDDED SERVER CONFIGURATION (ServerProperties)
server.port=8069
# bind to a specific NIC
server.address=127.0.0.1
# the context path, defaults to '/'
server.context-path=/admin
# the servlet path, defaults to '/'
server.servlet-path=/
# base dir (usually not needed, defaults to tmp)
server.tomcat.basedir=/tmp
# in seconds
server.tomcat.background-processor-delay=30
# number of threads in protocol handler
server.tomcat.max-threads = 0
# character encoding to use for URL decoding
server.tomcat.uri-encoding = UTF-8
# maximum size of a single uploaded file
multipart.max-file-size=50Mb
# maximum size of a whole multipart request
multipart.max-request-size=50Mb
# SPRING MVC (HttpMapperProperties)
# pretty print JSON
http.mappers.json-pretty-print=false
# sort keys
http.mappers.json-sort-keys=false
# set fixed locale, e.g. en_UK
spring.mvc.locale=zh_CN
# set fixed date format, e.g. dd/MM/yyyy
spring.mvc.date-format=yyyy-MM-dd
# cache timeouts in headers sent to browser
spring.resources.cache-period=60000
# PREFIX_ERROR_CODE / POSTFIX_ERROR_CODE
spring.mvc.message-codes-resolver-format=PREFIX_ERROR_CODE
# THYMELEAF (ThymeleafAutoConfiguration)
spring.thymeleaf.cache=false
spring.thymeleaf.check-template-location=true
spring.thymeleaf.content-type=text/html
spring.thymeleaf.enabled=true
spring.thymeleaf.encoding=UTF-8
#spring.thymeleaf.excluded-view-names= # Comma-separated list of view names that should be excluded from resolution.
spring.thymeleaf.mode=HTML5
spring.thymeleaf.prefix=classpath:/templates/
spring.thymeleaf.suffix=.html
#spring.thymeleaf.template-resolver-order= # Order of the template resolver in the chain.
#spring.thymeleaf.view-names= # Comma-separated list of view names that can be resolved.
# MySQL database configuration
spring.datasource.url = jdbc:mysql://localhost:3306/yaycrawler?autoReconnect=true&characterEncoding=utf8&useSSL=false
spring.datasource.username = root
spring.datasource.password = root
spring.datasource.driverClassName = com.mysql.jdbc.Driver
# Specify the DBMS
spring.jpa.database = MYSQL
# Show or not log for each sql query
spring.jpa.show-sql = true
# Hibernate ddl auto (create, create-drop, update)
# On the first run, set this to create so the tables are generated, then change it back to update.
spring.jpa.hibernate.ddl-auto = update
# Naming strategy
#spring.jpa.hibernate.naming-strategy = org.hibernate.cfg.DefaultNamingStrategy
spring.jpa.hibernate.naming-strategy = org.hibernate.cfg.ImprovedNamingStrategy
# Use spring.jpa.properties.* for Hibernate native properties (the prefix is
# stripped before adding them to the entity manager)
spring.jpa.properties.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
# MongoDB database configuration
# MONGODB (MongoProperties)
#spring.data.mongodb.authentication-database= # Authentication database name.
spring.data.mongodb.database=crawler
#spring.data.mongodb.field-naming-strategy= # Fully qualified name of the FieldNamingStrategy to use.
#spring.data.mongodb.grid-fs-database= # GridFS database name.
spring.data.mongodb.host=localhost
spring.data.mongodb.port=27017
# Enable Mongo repositories.
spring.data.mongodb.repositories.enabled=true
#spring.data.mongodb.uri=mongodb://localhost/test # Mongo database URI. When set, host and port are ignored.
#spring.data.mongodb.username=
#spring.data.mongodb.password=
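One detail of the .properties format is easy to trip over when annotating a file like the one above: a comment has to sit on its own line starting with # (or !). A note placed after a value on the same line becomes part of the value. The sketch below demonstrates this with java.util.Properties; I am assuming that Spring Boot's own loader treats this case the same way, and the class name PropertiesCommentDemo is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class, not part of YayCrawler.
public class PropertiesCommentDemo {

    /** Parses .properties text and returns the value stored under key. */
    public static String valueOf(String propertiesText, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        String key = "spring.jpa.hibernate.ddl-auto";

        // A trailing note on the same line is NOT treated as a comment.
        String inline = key + " = update // switch to create on first run\n";

        // A note on its own line, starting with '#', is a real comment.
        String separate = "# switch to create on first run\n" + key + " = update\n";

        System.out.println("[" + valueOf(inline, key) + "]");
        System.out.println("[" + valueOf(separate, key) + "]");
    }
}
```

The first line printed is [update // switch to create on first run] and the second is [update], so only the second form gives the value the setting expects.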
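Master:
signature.token=2c91d29854a2f3fc0154a30959f40003
# number of tasks handed to a worker in one batch
worker.task.batchSize=500
# worker refresh interval
worker.refreshInteval=20000
# timeout for the in-progress queue
task.queue.timeout=5400000
# number of tasks per batch when adding to the queue in bulk
task.queue.batchSize=1000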
# EMBEDDED SERVER CONFIGURATION (ServerProperties)
server.port=8068
# bind to a specific NIC
server.address=127.0.0.1
#server.address=127.0.0.1
# the context path, defaults to '/'
server.context-path=/master
# the servlet path, defaults to '/'
server.servlet-path=/
# base dir (usually not needed, defaults to tmp)
server.tomcat.basedir=/tmp
# in seconds
server.tomcat.background-processor-delay=30
# number of threads in protocol handler
server.tomcat.max-threads = 0
# character encoding to use for URL decoding
server.tomcat.uri-encoding = UTF-8
spring.redis.host=127.0.0.1
spring.redis.port=6379
spring.redis.database=1
#spring.redis.password=
Worker:
signature.token=2c91d29854a2f3fc0154a30959f40003
master.server.address=http://127.0.0.1:8068/master/
context.path=http://127.0.0.1:8086/worker/
worker.heartbeat.inteval=60000
worker.spider.threadCount=10
# ftpserver address
ftp.server.url=172.17.82.46
# ftpserver port
ftp.server.port=2121
# ftpserver user name
ftp.server.username=admin
# ftpserver password
ftp.server.password=admin
# EMBEDDED SERVER CONFIGURATION (ServerProperties)
server.port=8086
# bind to a specific NIC
server.address=127.0.0.1
# the context path, defaults to '/'
server.context-path=/worker
# the servlet path, defaults to '/'
server.servlet-path=/
# base dir (usually not needed, defaults to tmp)
server.tomcat.basedir=/tmp
# in seconds
server.tomcat.background-processor-delay=30
# number of threads in protocol handler
server.tomcat.max-threads = 0
# character encoding to use for URL decoding
server.tomcat.uri-encoding = UTF-8
#Spring JPA
spring.datasource.url = jdbc:mysql://localhost:3306/yaycrawler?autoReconnect=true&characterEncoding=utf8&useSSL=false
spring.datasource.username = root
spring.datasource.password = root
spring.datasource.driverClassName = com.mysql.jdbc.Driver
# Specify the DBMS
spring.jpa.database = MYSQL
# Show or not log for each sql query
spring.jpa.show-sql = false
# Hibernate ddl auto (create, create-drop, update)
spring.jpa.hibernate.ddl-auto = none
# Naming strategy
#spring.jpa.hibernate.naming-strategy = org.hibernate.cfg.DefaultNamingStrategy
spring.jpa.hibernate.naming-strategy = org.hibernate.cfg.ImprovedNamingStrategy
# Use spring.jpa.properties.* for Hibernate native properties (the prefix is
# stripped before adding them to the entity manager)
spring.jpa.properties.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
# MONGODB (MongoProperties)
#spring.data.mongodb.authentication-database= # Authentication database name.
spring.data.mongodb.database=crawler
#spring.data.mongodb.field-naming-strategy= # Fully qualified name of the FieldNamingStrategy to use.
#spring.data.mongodb.grid-fs-database= # GridFS database name.
spring.data.mongodb.host=localhost
spring.data.mongodb.port=27017
# Enable Mongo repositories.
spring.data.mongodb.repositories.enabled=true
#spring.data.mongodb.uri=mongodb://localhost/test # Mongo database URI. When set, host and port are ignored.
#spring.data.mongodb.username=
#spring.data.mongodb.password=
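In the three configurations above, signature.token has the same value everywhere, and the master.server.address used by Admin and Worker matches the Master's server.address, server.port and server.context-path. The source does not say so explicitly, but assuming these values need to agree for the components to talk to each other, a small check like the following can catch a mismatch after editing ports. The class ConfigConsistencyCheck and its methods are hypothetical, and the inline strings are trimmed copies of the settings above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of YayCrawler.
public class ConfigConsistencyCheck {

    /** Parses .properties text into a Properties object. */
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Builds http://address:port/context-path/ from the embedded-server settings. */
    public static String baseUrl(Properties p) {
        String path = p.getProperty("server.context-path", "/");
        if (!path.endsWith("/")) {
            path += "/";
        }
        return "http://" + p.getProperty("server.address", "127.0.0.1")
                + ":" + p.getProperty("server.port", "8080") + path;
    }

    /** True if the client's master.server.address equals the Master's own base URL. */
    public static boolean pointsAtMaster(Properties client, Properties master) {
        return baseUrl(master).equals(client.getProperty("master.server.address"));
    }

    /** True if both files define the same, non-missing signature.token. */
    public static boolean sameToken(Properties a, Properties b) {
        String token = a.getProperty("signature.token");
        return token != null && token.equals(b.getProperty("signature.token"));
    }

    public static void main(String[] args) {
        Properties master = parse(String.join("\n",
                "signature.token=2c91d29854a2f3fc0154a30959f40003",
                "server.port=8068",
                "server.address=127.0.0.1",
                "server.context-path=/master"));
        Properties worker = parse(String.join("\n",
                "signature.token=2c91d29854a2f3fc0154a30959f40003",
                "master.server.address=http://127.0.0.1:8068/master/"));

        System.out.println("worker points at master: " + pointsAtMaster(worker, master));
        System.out.println("tokens match:            " + sameToken(worker, master));
    }
}
```

In practice you would load the three application.properties files from disk instead of using inline strings; the comparison logic stays the same.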
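Start Master, Admin and Worker, then open http://localhost:8069/admin/ in a browser to reach the admin UI.
III. A worked example
Now that we know how to start the project, let's use crawling blog posts from cnblogs (博客園) as an example of how to use the framework. Suppose I want to crawl the title and summary of every post on http://www.cnblogs.com/yuananyun/. Let's go make some magic, haha.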