Description:
This plugin is a pure JavaScript script injected into the browser page via WebBrowser.RunJS. The script creates a crawler object and supports chained extraction over objects, events, JSON, elements, nodes, regular expressions, and strings. It can also be run directly in the browser console.
Updated 2020-03-04:
1. Added the forEach method, which makes custom multi-value structure extraction easy.
2. Added the more method, which supports extracting multiple sibling nodes, including across levels.
API description:
/* JS crawler wrapper. Inject the script via RunJS and execute it to collect results.
   1. Currently only Chrome is supported.
   2. If Chrome is started with cross-origin access enabled, expressions that touch an iframe cross into it automatically.
   3. Supports chained extraction with XPath, CSS selectors, and JSONPath.
   4. Supports chained processing of string results.

# Create a crawler object. Optionally pass in the element(s) to extract from; defaults to document.body.
# ele: array of objects
new crawler(ele);

# XPath extraction. Only extracts under document; even later chained calls still extract under document.
# query: XPath expression [required]
$x(query)

# CSS-selector extraction. Supports extracting HTML attributes, methods, and events, and repeated chained list/template extraction.
# query: CSS-selector expression [required]. param: attribute, method, or event
$(query, param)

# JSON data extraction.
# query: JSONPath expression [required]. param: PATH (JSON path) or VALUE (value); defaults to extracting values
$j(query, param)

# String filter. Exclude or keep items based on a condition.
# query: regular expression or string [required]
filter(query)

# String regex extraction.
# query: regular expression or string [required]. index: number or array of numbers; filters results by index
regex(query, index)

# String replacement.
# substr: the string to replace, as a regular expression or string [required]. replacement: the replacement string
replace(substr, replacement)

# String split.
# query: regular expression or string [required]
split(query)

# Chained extraction from a string, i.e. converts a chained string into an executable expression.
# expression: extraction expression [required]
mix(expression)

# Exposes result processing. Works like Array.forEach, iterating over the crawler's ele results.
# func: a JavaScript function
forEach(func)

# Extracts multiple sibling results at once; returns an array of objects or a two-dimensional array.
# exps: array or JSON whose values are crawler expressions
# types: string (mix|xpath|css|json|regex|replace|split|filter)
more(exps, types)

# Gets the extraction result, converting the ele objects into plain data.
get()

# Gets the nearest unique XPath or CSS-selector expression for an element.
# elm: the element to locate [required]. xp: XPath (true) or CSS selector (false)
getSelector(elm, xp)

# Gets the full-path XPath or CSS-selector expression for an element.
# elm: the element to locate [required]. xp: XPath (true) or CSS selector (false)
getFullSelector(elm, xp)
*/
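The core of the API above is that every extraction method returns the crawler itself, so calls chain. The `MiniCrawler` class below is a hypothetical sketch of that pattern, not the plugin's actual code; it works on plain strings instead of DOM nodes:

```javascript
// Hypothetical sketch of the chaining pattern. Names mirror the plugin's
// API (filter/replace/split/get) but this is NOT the plugin's implementation.
class MiniCrawler {
  constructor(ele) {
    // ele: the items currently being processed (strings here; DOM nodes in the plugin)
    this.ele = Array.isArray(ele) ? ele : [ele];
  }
  filter(query) {
    // keep only items matching a regex or substring
    const re = query instanceof RegExp ? query : new RegExp(query);
    this.ele = this.ele.filter(s => re.test(s));
    return this; // returning `this` is what enables chaining
  }
  replace(substr, replacement) {
    this.ele = this.ele.map(s => s.replace(substr, replacement));
    return this;
  }
  split(query) {
    this.ele = this.ele.flatMap(s => s.split(query));
    return this;
  }
  get() {
    return this.ele; // materialize the current result
  }
}

const rows = new MiniCrawler(["id:857 cat", "id:123 dog"])
  .filter(/857/)
  .replace(/id:(\d+)/, "no-$1") // $1 inserts the captured digits
  .get();
console.log(rows); // ["no-857 cat"]
```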
Using the plugin:
Requires the Chrome browser. Import the attached task into your project from the source-code view, then:
Import spider
Demo1:
Import spider
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","https://forum.uibot.com.cn/",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
data = spider.xpath(hWeb,"//div[@class='media-body']//a/text()")
TracePrint(data)
data = spider.xpath(hWeb,"//div[@class='media-body']//a/@href")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","innerText")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","href")
TracePrint(data)
data = spider.mix(hWeb,'''$(".card-body li .media-body").more(["//a[1]/text()","//a[1]/@href","//div[1]/span[1]/text()","//div[1]/span[2]/text()"],"xpath")''')
TracePrint(data)
data = spider.mix(hWeb,'''$(".card-body li .media-body").more({"title":"//a[1]/text()","url":"//a[1]/@href","author":"//div[1]/span[1]/text()","time":"//div[1]/span[2]/text()"},"xpath")''')
TracePrint(data)
Demo2:
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})
// "今日" is the Chinese word for "today", matched against the page text
data1 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')
data = ""
For Each v In data1
    data = data & Join(v,",") & "\n"
Next
// character map applied to the CSV text below (each key is replaced by its value)
dim dicts = {"a":"a","b":"b","c":"c","d":"d","e":"e","f":"f","g":"g","h":"h","i":"i","j":"j","k":"k","l":"l","m":"m",\
    "n":"n","o":"o","p":"p","q":"q","r":"r","s":"s","t":"t","u":"u","v":"v","w":"w","x":"x","y":"y","z":"z",\
    "A":"A","B":"B","C":"C","D":"D","E":"E","F":"F","G":"G","H":"H","I":"I","J":"J","K":"K","L":"L","M":"M",\
    "N":"N","O":"O","P":"P","Q":"Q","R":"R","S":"S","T":"T","U":"U","V":"V","W":"W","X":"X","Y":"Y","Z":"Z",\
    "1":"1","2":"2","3":"3","4":"4","5":"5","6":"6","7":"7","8":"8","9":"9","0":"0"}
For Each k,v In dicts
    data = Replace(data,k,v)
Next
File.Write("d:\\aaa1.csv",data,"gbk")
Demo4:
Import spider
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})
// "今日" is the Chinese word for "today", matched against the page text
data1 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/text()")''')
data2 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/@href")''')
data3 = spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[3]/text()")''')
data = ""
For Each k,v In data1
    res = []
    push(res,v)
    push(res,data2[k])
    push(res,data3[k])
    data = data & Join(res,",") & "\n"
Next
// character map applied to the CSV text below (each key is replaced by its value)
dim dicts = {"a":"a","b":"b","c":"c","d":"d","e":"e","f":"f","g":"g","h":"h","i":"i","j":"j","k":"k","l":"l","m":"m",\
    "n":"n","o":"o","p":"p","q":"q","r":"r","s":"s","t":"t","u":"u","v":"v","w":"w","x":"x","y":"y","z":"z",\
    "A":"A","B":"B","C":"C","D":"D","E":"E","F":"F","G":"G","H":"H","I":"I","J":"J","K":"K","L":"L","M":"M",\
    "N":"N","O":"O","P":"P","Q":"Q","R":"R","S":"S","T":"T","U":"U","V":"V","W":"W","X":"X","Y":"Y","Z":"Z",\
    "1":"1","2":"2","3":"3","4":"4","5":"5","6":"6","7":"7","8":"8","9":"9","0":"0"}
For Each k,v In dicts
    data = Replace(data,k,v)
Next
File.Write("d:\\aaa.csv",data,"gbk")
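The post-processing loop in Demo2 and Demo4 simply runs Replace over the joined CSV text once per entry in the character map. A sketch of the same idea in plain JavaScript; the map below is a made-up illustration (full-width digits to ASCII), not the demos' actual map:

```javascript
// Replace every key in the map with its value across the whole text.
// Sample map: normalize full-width digits to ASCII (illustrative only).
const map = { "１": "1", "２": "2", "３": "3" };
let text = "today １２ posts, ３ replies";
for (const [k, v] of Object.entries(map)) {
  text = text.split(k).join(v); // global literal (non-regex) replace
}
console.log(text); // "today 12 posts, 3 replies"
```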
More:
1. Chained extraction:
new crawler().$(".subject.break-all").$x("//a").filter(/857/ig).replace(/(t+)/,'hahaha$1').get()
2. Triggering events:
new crawler().$(".subject.break-all","click()")
3. Using browser objects:
new crawler(document.querySelectorAll(".subject.break-all")).$("a").get()
4. Repeated blocks, sub-template extraction:
{"title":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get(), "content":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get() }
5. Multi-node extraction (advanced):
new crawler().$("li a").filter("tabindex").$x("img").forEach(function(a,b,c){c[b]=a.src}).get()
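The forEach callback above receives the same (element, index, array) arguments as Array.prototype.forEach, and writing `c[b] = a.src` overwrites each node in the result list with its src attribute. A minimal sketch with plain objects standing in for img nodes (the data is made up, not the plugin's internals):

```javascript
// Stand-ins for <img> nodes; in the plugin these would be DOM elements.
const ele = [
  { src: "https://example.com/1.png" },
  { src: "https://example.com/2.png" },
];
// a = element, b = index, c = the array itself; each node is replaced
// in place by its src, so the final list is flat URLs.
ele.forEach(function (a, b, c) { c[b] = a.src; });
console.log(ele); // ["https://example.com/1.png", "https://example.com/2.png"]
```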
6. Quick multi-node extraction (advanced):
data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')
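Conceptually, more runs several expressions against each matched block and zips the results into one row per block. A rough sketch over plain data; the `more` function and sample blocks below are assumptions for illustration, not the plugin's code:

```javascript
// Each matched block yields one output row; exps maps output keys to
// per-block extractors. Plain functions stand in for xpath/css expressions.
function more(blocks, exps) {
  return blocks.map(block =>
    Object.fromEntries(
      Object.entries(exps).map(([key, fn]) => [key, fn(block)])
    )
  );
}

const blocks = [
  { title: "Post A", href: "/a", author: "tom" },
  { title: "Post B", href: "/b", author: "amy" },
];
const rows = more(blocks, {
  "title": b => b.title,
  "url": b => b.href,
});
console.log(rows); // [{title:"Post A",url:"/a"},{title:"Post B",url:"/b"}]
```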
More easter eggs are left for you to discover.
PS: This script is a personal labor of love; once you get the hang of it, you will find it extremely practical.
Download the plugin from the original post: https://forum.uibot.com.cn/thread-869.htm