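1. The WebMagic Crawler Framework
WebMagic is a simple, flexible crawler framework for Java. With it you can quickly build a crawler that is efficient and easy to maintain.
1.1 Related documentation
Official site:
Chinese documentation:
English documentation:
1.2 WebMagic architecture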
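WebMagic is built from four components: Downloader, PageProcessor, Scheduler, and Pipeline. Spider wires them together. The four components correspond to the download, processing, URL management, and persistence stages of the crawler life cycle.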
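How the four roles cooperate can be sketched in plain Java without WebMagic on the classpath. Every type below (MiniDownloader, MiniProcessor, MiniScheduler, MiniPipeline, MiniSpider) is a hypothetical stand-in invented for illustration, not a WebMagic interface, and the "web" is just an in-memory map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-ins for the component roles; these are NOT WebMagic's own types.
interface MiniDownloader { String download(String url); }
interface MiniProcessor {
    // Adds extracted items to results and returns the links to follow next.
    List<String> process(String url, String body, List<String> results);
}
interface MiniPipeline { void persist(List<String> results); }

// Scheduler role: a FIFO queue that hands out each URL at most once.
class MiniScheduler {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    void push(String url) { if (seen.add(url)) queue.add(url); }
    String poll() { return queue.poll(); }
}

// Spider role: drives the download -> process -> schedule -> persist loop.
class MiniSpider {
    private final MiniDownloader downloader;
    private final MiniProcessor processor;
    private final MiniPipeline pipeline;
    private final MiniScheduler scheduler = new MiniScheduler();

    MiniSpider(MiniDownloader downloader, MiniProcessor processor, MiniPipeline pipeline) {
        this.downloader = downloader;
        this.processor = processor;
        this.pipeline = pipeline;
    }

    void run(String startUrl) {
        scheduler.push(startUrl);
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.download(url);
            List<String> results = new ArrayList<>();
            for (String link : processor.process(url, body, results)) {
                scheduler.push(link);
            }
            pipeline.persist(results);
        }
    }
}

class MiniSpiderDemo {
    // Crawls a fake site: one list page linking to two detail pages.
    static List<String> crawl() {
        Map<String, String> web = new HashMap<>();
        web.put("list", "detail-1,detail-2");
        List<String> stored = new ArrayList<>();
        MiniSpider spider = new MiniSpider(
                url -> web.getOrDefault(url, ""),
                (url, body, results) -> {
                    if (body.isEmpty()) {            // detail page: extract an item
                        results.add("item from " + url);
                        return Collections.<String>emptyList();
                    }
                    return Arrays.asList(body.split(","));  // list page: follow links
                },
                stored::addAll);
        spider.run("list");
        return stored;
    }
}
```

In the real framework these roles are filled by the Downloader, PageProcessor, Scheduler, and Pipeline implementations that Spider is configured with; the sketch only shows the direction in which data flows between them.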
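2. Integrating MybatisPlus and WebMagic with Spring Boot
2.1 Integrating WebMagic
Combining Spring Boot with WebMagic involves three modules: the crawl module (Processor); the storage module (Pipeline), which writes the crawled data to the database; and the scheduled-task module (Scheduled), which is responsible for crawling the site on a schedule.
2.1.1 Adding the Maven dependencies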
<!-- Crawler framework -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
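2.1.2 The crawl module: Processor
This Processor crawls pages from SMZDM (什么值得買). It parses the page data, extracts the corresponding links and titles, and puts them into the WebMagic Page, from which the storage module later takes them and adds them to the database. The code is as follows: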
package com.dxz.spider.HttpUtil;

import com.dxz.spider.model.SmzdmModel;
import com.dxz.spider.util.TimeUtil;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

@Slf4j
@Component
public class SmzdmPageProcessor implements PageProcessor {

    // Shared XPath prefixes for the detail page
    private static final String FEED_MAIN = "//section[@id='feed-wrap']/article/div[@id='feed-main']";
    private static final String INFO = FEED_MAIN + "/div[@class='info']";
    private static final String INFO_RIGHT = INFO + "/div[@class='info-right']";
    private static final String TITLE_BOX = INFO_RIGHT + "/div[@class='title-box']";
    private static final String ITEM_NAME = FEED_MAIN + "/div[@class='item-name']";
    private static final String SCORE_RATE = ITEM_NAME + "/div[@class='score_rateBox']/div[@class='score_rate']";
    private static final String OPERATE_ICON = ITEM_NAME + "/div[@class='operate_box']/div[@class='operate_icon']";

    // Part 1: site configuration, including user agent, timeout, and retry settings
    private Site site = Site.me()
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")
            .setTimeOut(10 * 1000)
            .setRetryTimes(3)
            .setRetrySleepTime(3000);

    // process is the core hook for custom crawler logic; the extraction rules go here
    @Override
    public void process(Page page) {
        Html html = page.getHtml();
        // Part 2: define how page information is extracted and stored
        if (page.getUrl().regex("https://search.smzdm.com/\\?c=faxian&s=GU&v=b&p=\\d+").match()) {
            // List page: queue the pagination links and the detail-page links
            page.addTargetRequests(html.xpath("//ul[@id='J_feed_pagenation']/li/a").links().all());
            page.addTargetRequests(html.xpath("//div[@class='feed-main-con']/ul[@id='feed-main-list']/li/div/div[@class='z-feed-content']/h5/a").links().all());
        } else {
            // Detail page: extract the fields of a single item
            String imgLocation = html.xpath(INFO + "/a/img[@class='main-img']/@src").get();
            // URL of the item
            String url = html.xpath(INFO + "/a/@href").get();
            String title = html.xpath(TITLE_BOX + "/h1[@class='title']/text()").get();
            String price = html.xpath(TITLE_BOX + "/div[@class='price']/span/text()").get();
            String introduce = html.xpath(ITEM_NAME + "/article/div[@class='baoliao-block']/p/text()").get();
            String baoliao = html.xpath(ITEM_NAME + "/article/p/text()").get();
            String time = html.xpath(INFO_RIGHT + "/div[@class='info-details']/div[@class='author-info']/span[@class='time']/text()").get();
            String zhi = html.xpath(SCORE_RATE + "/span[@id='rating_worthy_num']/text()").get();
            String buZhi = html.xpath(SCORE_RATE + "/span[@id='rating_unworthy_num']/text()").get();
            String start = html.xpath(OPERATE_ICON + "/a[@class='fav']/span/text()").get();
            String pl = html.xpath(OPERATE_ICON + "/a[@class='comment']/em[@class='commentNum']/text()").get();

            // Fall back to the plain paragraph when the baoliao-block introduction is empty
            if (StringUtils.isBlank(introduce)) {
                introduce = baoliao;
            }
            time = TimeUtil.handSmzdm(time);

            SmzdmModel smzdmModel = new SmzdmModel();
            smzdmModel.setUrl(url);
            smzdmModel.setTitle(title);
            smzdmModel.setPrice(price);
            smzdmModel.setIntroduce(introduce);
            smzdmModel.setFbtime(time);
            smzdmModel.setNoZhi(buZhi);
            smzdmModel.setZhi(zhi);
            smzdmModel.setStart(start);
            smzdmModel.setPl(pl);
            smzdmModel.setImgurl(imgLocation);
            // Store the result: the key is "smzdm", the value is the SmzdmModel object
            page.putField("smzdm", smzdmModel);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
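The processor decides between "list page" and "detail page" purely from the URL. A minimal sketch of that check using only java.util.regex is shown below; the class name SmzdmUrlKind is hypothetical, and the sketch requires the whole URL to match, which may be stricter than the selector WebMagic applies.

```java
import java.util.regex.Pattern;

// Hypothetical helper: classifies a URL with the same pattern string the processor uses.
class SmzdmUrlKind {
    // "\\?" matches the literal '?', "\\d+" matches the page number.
    private static final Pattern LIST_PAGE =
            Pattern.compile("https://search.smzdm.com/\\?c=faxian&s=GU&v=b&p=\\d+");

    // True when the entire URL is a search-result (list) page.
    static boolean isListPage(String url) {
        return LIST_PAGE.matcher(url).matches();
    }
}
```

Any URL that fails this check falls through to the detail-page branch, so the start URL handed to the spider has to match the pattern for pagination and detail links to be queued at all.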
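2.1.3 The storage module: Pipeline
The storage module works together with MyBatisPlus. It implements WebMagic's Pipeline interface; in its process method it reads the data produced by the crawl module and then calls the MybatisPlus save method. The code is as follows: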
package com.dxz.spider.HttpUtil;

import com.dxz.spider.model.HotWeeklyBlogs;
import com.dxz.spider.model.SmzdmModel;
import com.dxz.spider.service.SmzdmService;
import com.dxz.spider.service.WeeklyService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class MysqlPipeline implements Pipeline {

    @Autowired
    private WeeklyService weeklyService;

    @Autowired
    private SmzdmService smzdmService;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Read the results stored by the processor; like a Map, they are looked up by key ("blogs" or "smzdm")
        HotWeeklyBlogs blogs = resultItems.get("blogs");
        SmzdmModel smzdmModel = resultItems.get("smzdm");
        if (blogs != null) {
            weeklyService.save(blogs);
        } else if (smzdmModel != null) {
            smzdmService.save(smzdmModel);
            System.out.println(smzdmModel.toString());
        }
    }
}
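The pipeline above picks a service based on which key is present in the results. As the number of crawled sites grows, the if/else chain can be replaced by a lookup table. The following is a plain-Java sketch of that idea; ResultDispatcher is a hypothetical class, and a Map<String, Object> stands in for WebMagic's ResultItems.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical key-to-handler table; a plain Map stands in for ResultItems.
class ResultDispatcher {
    // Handlers are tried in registration order.
    private final Map<String, Consumer<Object>> handlers = new LinkedHashMap<>();

    ResultDispatcher on(String key, Consumer<Object> handler) {
        handlers.put(key, handler);
        return this;
    }

    // Passes the first non-null field to its handler; returns the handled key, or null if none matched.
    String dispatch(Map<String, Object> fields) {
        for (Map.Entry<String, Consumer<Object>> entry : handlers.entrySet()) {
            Object value = fields.get(entry.getKey());
            if (value != null) {
                entry.getValue().accept(value);
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Registering "blogs" before "smzdm" reproduces the precedence of the original if/else: when both keys are present, only the first registered handler runs.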
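2.1.4 The scheduled-task module: Scheduled
Use Spring Boot's built-in @Scheduled annotation to run the crawl task once a minute, and call WebMagic's crawl module (Processor) from inside the scheduled method. Note that Spring cron expressions have six fields and start with seconds, so "* * * * * ?" fires every second, while "0 * * * * ?" fires at second 0 of every minute. The code is as follows: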
package com.dxz.spider.HttpUtil;

import com.dxz.spider.WebMagicBugs.HttpClientDownloader;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

@Component
@Slf4j
public class AllSpiderStarter {

    @Autowired
    private MysqlPipeline mysqlPipeline;

    // Six-field Spring cron (second minute hour day month weekday): second 0 of every minute
    @Scheduled(cron = "0 * * * * ?")
    public void WeeklyScheduled() {
        log.info("Starting the crawl task");
        Spider.create(new SmzdmPageProcessor())
                .setDownloader(new HttpClientDownloader())
                // The start URL must match the list-page pattern in SmzdmPageProcessor
                .addUrl("https://search.smzdm.com/?c=faxian&s=GU&v=b&p=1")
                .thread(5)
                .addPipeline(mysqlPipeline)
                .run();
    }
}
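Because the leading field of a Spring cron expression is seconds rather than minutes, it helps to see the fields labelled. The helper below is only an illustration written for this article (CronFields is a hypothetical class, not Spring's parser): it splits a six-field expression and names each position, without validating the values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the six-field layout; it does not evaluate the expression.
class CronFields {
    private static final String[] NAMES =
            {"second", "minute", "hour", "day-of-month", "month", "day-of-week"};

    // Splits the expression on whitespace and labels each field by position.
    static Map<String, String> label(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length != NAMES.length) {
            throw new IllegalArgumentException("expected 6 fields, got " + parts.length);
        }
        Map<String, String> labelled = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length; i++) {
            labelled.put(NAMES[i], parts[i]);
        }
        return labelled;
    }

    // True when the seconds field is a wildcard, i.e. the trigger is not pinned to one second.
    static boolean secondsFieldIsWildcard(String expression) {
        return "*".equals(label(expression).get("second"));
    }
}
```

With every other field left as a wildcard, a wildcard in the seconds position means a trigger every second; pinning it to 0 is what turns the schedule into "once a minute".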
Add the @EnableScheduling annotation to the Spring Boot application class:
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling
@MapperScan("com.dxz.spider.mapper")
public class SpiderApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpiderApplication.class, args);
    }
}
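2.2 Integrating MybatisPlus
2.2.1 MyBatisPlus
Usage is essentially the same as MyBatis, but the basic CRUD interfaces are built in, so basic CRUD operations can be called directly.
Official site:
2.2.2 Adding the Maven dependency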
<!-- Mybatis-plus -->
<dependency>
    <groupId>com.baomidou</groupId>
    <artifactId>mybatis-plus-boot-starter</artifactId>
    <version>3.0.5</version>
</dependency>
2.2.3 Writing the Mapper, Service, and Model
The Model class for the data crawled from SMZDM:
package com.dxz.spider.model;

import com.baomidou.mybatisplus.annotation.TableField;
import com.baomidou.mybatisplus.annotation.TableName;
import lombok.Data;

/**
 * Database model for SMZDM items
 */
@Data
// TODO: the database table name; change as needed
@TableName("smzdm")
public class SmzdmModel {

    /** Title */
    private String title;

    /** Price */
    private String price;

    /** Introduction */
    private String introduce;

    /** Number of users who rated the item "worth it" */
    private String zhi;

    /** Number of users who rated the item "not worth it" */
    // TODO: the database column name; change as needed
    @TableField(value = "NoZhi")
    private String noZhi;

    /** Number of favorites */
    private String start;

    /** Number of comments */
    private String pl;

    /** Publication time */
    private String fbtime;

    /** Item URL */
    private String url;

    /** Image URL */
    private String imgurl;
}
Writing the Mapper interface:
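package com.dxz.spider.mapper;

import com.baomidou.mybatisplus.core.mapper.BaseMapper;
import com.dxz.spider.model.SmzdmModel;
import org.apache.ibatis.annotations.Select;

import java.util.List;

public interface SmzdmMapper extends BaseMapper<SmzdmModel> {

    // Custom query in addition to the inherited CRUD methods
    @Select("select * from smzdm")
    List<SmzdmModel> selectAll();
}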
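The Mapper extends the BaseMapper<T> interface to inherit the basic CRUD methods. The Service class extends ServiceImpl and exposes the custom query: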
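package com.dxz.spider.service;

import com.baomidou.mybatisplus.extension.service.impl.ServiceImpl;
import com.dxz.spider.mapper.SmzdmMapper;
import com.dxz.spider.model.SmzdmModel;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
@Slf4j
public class SmzdmService extends ServiceImpl<SmzdmMapper, SmzdmModel> {

    public List<SmzdmModel> selectAll() {
        // baseMapper is the SmzdmMapper instance injected by ServiceImpl
        return baseMapper.selectAll();
    }
}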
Writing application.properties:
spring.datasource.username=root
spring.datasource.password=123456
spring.datasource.url=jdbc:mysql://localhost:3306/dxzstudy?useUnicode=true&characterEncoding=utf-8&serverTimezone=Asia/Shanghai
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
# Location of the MyBatis mapper XML files
mybatis-plus.mapper-locations=classpath:mapperxml/*.xml
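Comments in a .properties file start with # (or !), not //. The sketch below uses the JDK's own java.util.Properties parser to show what happens to a //-style line; PropertiesComments is a hypothetical helper, and Spring Boot's loader is assumed here to follow the same broad rules rather than verified.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper that parses properties text with the JDK parser.
class PropertiesComments {
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Parsing "# note" followed by "key=value" yields one entry, whereas "// note" is read as an entry whose key is "//" and whose value is "note", which is why the comment in the configuration file should begin with #.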
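Integration complete!
3. Building the view with AMIS
3.1 What is AMIS?
amis is a front-end low-code framework. It generates pages from JSON configuration, which greatly reduces page development work and makes building front-end interfaces much faster. With AMIS, a programmer who does not know front-end development can still build basic interfaces: anyone who can write JSON configuration, or simply read the Chinese documentation, can get started quickly. It is an open-source gem from Baidu.
Reference documentation:
https://baidu.github.io/amis/docs/intro?page=1
3.2 Downloading the CSS and JS
Download sdk.css and sdk.js from the official site.
3.3 Writing the HTML page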
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8"/>
  <title>SMZDM</title>
  <meta name="referrer" content="no-referrer" />
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
  <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
  <link rel="stylesheet" href="/static/sdk.css"/>
  <style>
    html, body, .app-wrapper {
      position: relative;
      width: 100%;
      height: 100%;
      margin: 0;
      padding: 0;
    }
  </style>
</head>
<body>
<div id="root" class="app-wrapper"></div>
<script src="/static/sdk.js"></script>
<script type="text/javascript">
  (function () {
    var amis = amisRequire('amis/embed');
    amis.embed('#root', {
      "$schema": "https://houtai.baidu.com/v2/schemas/page.json#",
      "type": "page",
      "title": "SMZDM: Uniqlo Deals",
      "toolbar": [
        {
          "type": "button",
          "actionType": "dialog",
          "label": "Add",
          "icon": "fa fa-plus pull-left",
          "primary": true,
          "dialog": {
            "title": "Add",
            "body": {
              "type": "form",
              "name": "sample-edit-form",
              "api": "",
              "controls": [
                { "type": "alert", "level": "info", "body": "No api endpoint is configured, so the form cannot actually be submitted." },
                { "type": "text", "name": "text", "label": "Text", "required": true },
                { "type": "divider" },
                { "type": "image", "name": "image", "label": "Image", "required": true },
                { "type": "divider" },
                { "type": "date", "name": "date", "label": "Date", "required": true },
                { "type": "divider" },
                {
                  "type": "select",
                  "name": "type",
                  "label": "Option",
                  "options": [
                    { "label": "Pretty", "value": "1" },
                    { "label": "Happy", "value": "2" },
                    { "label": "Startled", "value": "3" },
                    { "label": "Nervous", "value": "4" }
                  ]
                }
              ]
            }
          }
        }
      ],
      "body": [
        {
          "type": "form",
          "title": "Filter",
          "className": "m-t",
          "wrapWithPanel": false,
          "target": "service1",
          "mode": "inline",
          "controls": [
            {
              "type": "text",
              "name": "keywords",
              "placeholder": "Keyword",
              "addOn": { "type": "button", "icon": "fa fa-search", "actionType": "submit", "level": "primary" }
            }
          ]
        },
        {
          "type": "crud",
          "api": "http://localhost:8080/getAll",
          "defaultParams": { "perPage": 5 },
          "columns": [
            { "name": "title", "label": "Title", "type": "text" },
            { "name": "price", "label": "Price", "type": "text" },
            { "name": "url", "label": "Product link", "type": "text" },
            {
              "type": "image",
              "label": "Product image",
              "multiple": false,
              "name": "imgurl",
              "popOver": {
                "title": "View full size",
                "body": "<div class=\"w-xxl\"><img class=\"w-full\" src=\"${imgurl}\"/></div>"
              }
            },
            { "name": "fbtime", "type": "date", "label": "Published" },
            {
              "type": "container",
              "label": "Actions",
              "body": [
                {
                  "type": "button",
                  "icon": "fa fa-eye",
                  "level": "link",
                  "actionType": "dialog",
                  "tooltip": "View",
                  "dialog": {
                    "title": "View",
                    "body": {
                      "type": "form",
                      "controls": [
                        { "type": "static", "name": "title", "label": "Title" },
                        { "type": "divider" },
                        { "type": "static", "name": "price", "label": "Price" },
                        { "type": "divider" },
                        {
                          "type": "static-image",
                          "label": "Image",
                          "name": "imgurl",
                          "popOver": {
                            "title": "View full size",
                            "body": "<div class=\"w-xxl\"><img class=\"w-full\" src=\"${imgurl}\"/></div>"
                          }
                        },
                        { "type": "divider" },
                        { "name": "fbtime", "type": "static", "label": "Published at" },
                        { "type": "divider" },
                        { "name": "url", "type": "static", "label": "Purchase link" }
                      ]
                    }
                  }
                },
                {
                  "type": "button",
                  "icon": "fa fa-pencil",
                  "tooltip": "Edit",
                  "level": "link",
                  "actionType": "drawer",
                  "drawer": {
                    "position": "left",
                    "size": "lg",
                    "title": "Edit",
                    "body": {
                      "type": "form",
                      "name": "sample-edit-form",
                      "controls": [
                        { "type": "alert", "level": "info", "body": "No api endpoint is configured, so the form cannot actually be submitted." },
                        { "type": "hidden", "name": "id" },
                        { "type": "text", "name": "text", "label": "Text", "required": true },
                        { "type": "divider" },
                        { "type": "image", "name": "image", "multiple": false, "label": "Image", "required": true },
                        { "type": "divider" },
                        { "type": "date", "name": "date", "label": "Date", "required": true },
                        { "type": "divider" },
                        {
                          "type": "select",
                          "name": "type",
                          "label": "Option",
                          "options": [
                            { "label": "Pretty", "value": "1" },
                            { "label": "Happy", "value": "2" },
                            { "label": "Startled", "value": "3" },
                            { "label": "Nervous", "value": "4" }
                          ]
                        }
                      ]
                    }
                  }
                },
                {
                  "type": "button",
                  "level": "link",
                  "icon": "fa fa-times text-danger",
                  "actionType": "ajax",
                  "tooltip": "Delete",
                  "confirmText": "Delete this item? No api is configured, so confirming does nothing.",
                  "api": ""
                }
              ]
            }
          ]
        }
      ]
    });
  })();
</script>
</body>
</html>
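3.4 Configuring Spring Boot
Write the back-end endpoint. Only the query endpoint is implemented here; the edit and delete endpoints are left for you to write.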
package com.dxz.spider.web;

import com.dxz.spider.model.SmzdmModel;
import com.dxz.spider.service.SmzdmService;
import com.dxz.spider.web.SmzdmVO.GoodsVO;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@Slf4j
public class SmzdmWeb {

    @Autowired
    private SmzdmService smzdmService;

    @RequestMapping(value = "/getAll", method = RequestMethod.GET)
    public GoodsVO selectByPage() {
        log.info("Request to the SMZDM getAll endpoint");
        List<SmzdmModel> smzdmModels = smzdmService.selectAll();
        // Return the status/msg/data wrapper even when the table is empty,
        // so the page never receives an empty response body
        GoodsVO goodsVO = new GoodsVO();
        goodsVO.setStatus(0);
        goodsVO.setMsg("Request succeeded");
        goodsVO.setData(smzdmModels);
        return goodsVO;
    }
}
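The GoodsVO class is not listed in this article. Judging only from the setters the controller calls (setStatus, setMsg, setData), it is a small wrapper around the response. The sketch below is a guess at that shape under those assumptions; the class name GoodsVoSketch and the ok factory method are hypothetical additions for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical shape of the response wrapper, inferred only from the setters used by the controller.
class GoodsVoSketch {
    private int status;     // the controller sets 0 on success
    private String msg;     // human-readable message
    private List<?> data;   // the rows returned to the page

    int getStatus() { return status; }
    void setStatus(int status) { this.status = status; }

    String getMsg() { return msg; }
    void setMsg(String msg) { this.msg = msg; }

    List<?> getData() { return data; }
    void setData(List<?> data) { this.data = data; }

    // Convenience factory (an addition, not in the article): success wrapper, null rows become an empty list.
    static GoodsVoSketch ok(List<?> rows) {
        GoodsVoSketch vo = new GoodsVoSketch();
        vo.setStatus(0);
        vo.setMsg("ok");
        vo.setData(rows == null ? Collections.emptyList() : rows);
        return vo;
    }
}
```

Serialized by Spring's JSON converter, a real class of this shape with public getters would produce an object with status, msg, and data fields, which is the kind of response the crud component on the page reads from.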
Writing the view controller configuration:
package com.dxz.spider.config;

import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.CorsRegistry;
import org.springframework.web.servlet.config.annotation.ResourceHandlerRegistry;
import org.springframework.web.servlet.config.annotation.ViewControllerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurationSupport;

@Configuration
public class WebConfig extends WebMvcConfigurationSupport {

    /**
     * Maps static files
     * @param registry
     */
    @Override
    protected void addResourceHandlers(ResourceHandlerRegistry registry) {
        registry.addResourceHandler("/static/**").addResourceLocations("classpath:/static/");
        super.addResourceHandlers(registry);
    }

    /**
     * Maps views
     * @param registry
     */
    @Override
    protected void addViewControllers(ViewControllerRegistry registry) {
        registry.addViewController("/smzdm").setViewName("smzdm");
        super.addViewControllers(registry);
    }

    /**
     * Cross-origin (CORS) configuration
     * @param registry
     */
    @Override
    protected void addCorsMappings(CorsRegistry registry) {
        registry.addMapping("/**")
                .allowedOrigins("http://localhost:8080")
                .allowedMethods("*")
                .allowedHeaders("*");
        super.addCorsMappings(registry);
    }
}
3.5 Running and viewing the result