Description:
This plugin is a pure JavaScript script, injected into the browser page via WebBrowser.RunJS. The script creates a crawler object and then supports chained extraction over objects, events, JSON, elements, nodes, regular expressions, and strings. The script can also be run directly in the browser console.
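For instance, a minimal run in the Chrome DevTools console could look like the sketch below (the ".media-body a" selector is a placeholder for whatever the target page actually uses):

// Root the crawler at document.body (the default), select links with a
// CSS selector, read their innerText, and convert the result to plain data.
new crawler().$(".media-body a", "innerText").get()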
2020.3.4 update:
1. Added the ForEach method, which supports custom handling for convenient extraction of multi-value structures.
2. Added the more method, which supports multi-node extraction across levels and among siblings.
API description:
/*
JS crawler wrapper. Inject the script via RunJS, then execute it to fetch results.
1. Currently only Chrome is supported.
2. If cross-origin access is enabled in Chrome, expressions that reference an iframe cross into it automatically.
3. Supports chained extraction with XPath, CSS selectors, and JSONPath.
4. Supports chained processing of string results.

# Create a crawler object; an element object to extract from may be passed in (defaults to document.body).
# ele: array of objects
new crawler(ele);

# XPath extraction. Extraction only happens under document: even a chained call placed later still extracts from document.
# query: XPath expression [required]
$x(query)

# CSS-selector extraction. Supports extracting HTML attributes, methods, and events, plus chained, repeated list-template extraction.
# query: CSS-selector expression [required]. param: attribute, method, or event
$(query, param)

# JSON data extraction.
# query: JSONPath expression [required]. param: PATH (the JSON path) or VALUE (the value); defaults to extracting values
$j(query, param)

# String filtering by regex. Excludes or keeps data according to the filter condition.
# query: regular expression or string [required]
filter(query)

# String extraction by regex.
# query: regular expression or string [required]. index: number or array of numbers; filters the results by index
regex(query, index)

# String replacement.
# substr: the text to replace, a regular expression or string [required]. replacement: the replacement string
replace(substr, replacement)

# String splitting.
# query: regular expression or string [required]
split(query)

# Chained extraction from a string, i.e. converts a chain expression written as a string into an executable expression.
# expression: extraction expression [required]
mix(expression)

# Exposes result processing. Modeled on Array.forEach; iterates over the crawler's ele results.
# func: a JavaScript function
forEach(func)

# Multi-result extraction over siblings. Returns an array of objects or a two-dimensional array.
# exps: array or JSON whose values are crawler expressions
# types: string (mix|xpath|css|json|regex|replace|split|filter)
more(exps, types)

# Get the extraction result, converting the ele objects into plain, viewable data.
get()

# Get the nearest unique XPath or CSS-selector expression for an element.
# elm: the element to locate [required]. xp: XPath (true) or CSS selector (false)
getSelector(elm, xp)

# Get the full-path XPath or CSS-selector expression for an element.
# elm: the element to locate [required]. xp: XPath (true) or CSS selector (false)
getFullSelector(elm, xp)
*/
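Putting several of these calls together, a hedged console sketch (the selectors and the regex are placeholders, and calling getSelector/getFullSelector on a crawler instance is an inference from the list above):

// CSS-selector step, then an XPath step, then string post-processing.
var hrefs = new crawler()
    .$(".media-body")                               // select the list blocks
    .$x("//a/@href")                                // pull link hrefs via XPath
    .filter(/thread/i)                              // keep strings matching the regex
    .replace(/^\//, "https://forum.uibot.com.cn/")  // absolutize relative links
    .get();                                         // materialize as plain data
console.log(hrefs);

// Reverse lookup: derive a locator expression for a given element.
var el = document.querySelector(".media-body a");
new crawler().getSelector(el, true);       // nearest unique XPath
new crawler().getFullSelector(el, false);  // full CSS-selector path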
Plugin usage:
Requires the Chrome browser. Import the attached task into your project (importing from the source-code view is enough), then reference it with:
Import spider
Demo1:
Import spider
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","https://forum.uibot.com.cn/",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// XPath extraction: post titles, then their links
data = spider.xpath(hWeb,"//div[@class='media-body']//a/text()")
TracePrint(data)
data = spider.xpath(hWeb,"//div[@class='media-body']//a/@href")
TracePrint(data)
// CSS-selector extraction of the same data
data = spider.css(hWeb,"div.media-body>div>a","innerText")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","href")
TracePrint(data)
// more() with an expression array
data = spider.mix(hWeb,'''$(".card-body li .media-body").more(["//a[1]/text()","//a[1]/@href","//div[1]/span[1]/text()","//div[1]/span[2]/text()"],"xpath")''')
TracePrint(data)
// more() with a JSON object of named fields
data = spider.mix(hWeb,'''$(".card-body li .media-body").more({"标题":"//a[1]/text()","地址":"//a[1]/@href","作者":"//div[1]/span[1]/text()","时间":"//div[1]/span[2]/text()"},"xpath")''')
TracePrint(data)
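Note the two mix calls at the end: per the API above, passing more() an array of expressions returns a two-dimensional array (one inner array per matched node), while passing a JSON object returns an array of objects keyed by the given field names.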
Demo2:
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})
// Extract name, link, and the third column from every row matching "今日\d"
data1 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')
data = ""
For Each v In data1
    data = data & Join(v,",") & "\n"
Next
// Character map (each key is replaced by its value) applied to the result text
dim dicts = {"a":"a","b":"b","c":"c","d":"d","e":"e","f":"f","g":"g","h":"h","i":"i","j":"j",\
    "k":"k","l":"l","m":"m","n":"n","o":"o","p":"p","q":"q","r":"r","s":"s","t":"t",\
    "u":"u","v":"v","w":"w","x":"x","y":"y","z":"z",\
    "A":"A","B":"B","C":"C","D":"D","E":"E","F":"F","G":"G","H":"H","I":"I","J":"J",\
    "K":"K","L":"L","M":"M","N":"N","O":"O","P":"P","Q":"Q","R":"R","S":"S","T":"T",\
    "U":"U","V":"V","W":"W","X":"X","Y":"Y","Z":"Z",\
    "1":"1","2":"2","3":"3","4":"4","5":"5","6":"6","7":"7","8":"8","9":"9","0":"0"}
For Each k,v In dicts
    data = Replace(data,k,v)
Next
File.Write("d:\\aaa1.csv",data,"gbk")
Demo4:
Import spider
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})
// Extract the three columns separately, then zip them together row by row
data1 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/text()")''')
data2 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/@href")''')
data3 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[3]/text()")''')
data = ""
For Each k,v In data1
    res = []
    push(res,v)
    push(res,data2[k])
    push(res,data3[k])
    data = data & Join(res,",") & "\n"
Next
// Character map (each key is replaced by its value) applied to the result text
dim dicts = {"a":"a","b":"b","c":"c","d":"d","e":"e","f":"f","g":"g","h":"h","i":"i","j":"j",\
    "k":"k","l":"l","m":"m","n":"n","o":"o","p":"p","q":"q","r":"r","s":"s","t":"t",\
    "u":"u","v":"v","w":"w","x":"x","y":"y","z":"z",\
    "A":"A","B":"B","C":"C","D":"D","E":"E","F":"F","G":"G","H":"H","I":"I","J":"J",\
    "K":"K","L":"L","M":"M","N":"N","O":"O","P":"P","Q":"Q","R":"R","S":"S","T":"T",\
    "U":"U","V":"V","W":"W","X":"X","Y":"Y","Z":"Z",\
    "1":"1","2":"2","3":"3","4":"4","5":"5","6":"6","7":"7","8":"8","9":"9","0":"0"}
For Each k,v In dicts
    data = Replace(data,k,v)
Next
File.Write("d:\\aaa.csv",data,"gbk")
More:
1. Chained extraction:
new crawler().$(".subject.break-all").$x("//a").filter(/857/ig).replace(/(t+)/,'hahaha$1').get()
2. Event triggering:
new crawler().$(".subject.break-all","click()")
3. Using browser element objects:
new crawler(document.querySelectorAll(".subject.break-all")).$("a").get()
4. Repeated blocks with sub-template extraction:
{"title":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get(), "content":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get() }
5. Multi-node extraction (advanced):
new crawler().$("li a").filter("tabindex").$x("img").forEach(function(a,b,c){c[b]=a.src}).get()
6. Quick multi-node extraction (advanced):
data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')
More easter eggs are left for you to discover.
PS: This JS is a personal labor of love; once you learn it, you will find it extremely practical.
To download the plugin, visit the original post: https://forum.uibot.com.cn/thread-869.htm