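I have used crawlers before. For example, I used Nutch to crawl a given set of seeds and built search on top of the crawled data, and I skimmed some of its source code along the way. Nutch, of course, handles crawling very thoroughly and carefully. Whenever I watched the fetched pages and processing messages scroll past on the screen, it felt like black magic. Since I was already working through Spring MVC, I took the chance to build a small crawler of my own. Simple is fine, and a few small bugs are fine too; all I need is something that can crawl the information I want from a given seed site. Whenever an exception came up, I fixed it: sometimes it was a misused API, sometimes an abnormal HTTP response status, sometimes a database read/write problem. Through this cycle of hitting exceptions and resolving them, JewelCrawler (named after my son's nickname) can now crawl data on its own, and it has picked up one small extra skill: sentiment analysis based on the Word2Vec algorithm.
There may still be unknown exceptions waiting to be fixed, and some performance work remains, such as the interaction with the database and how data is read and written. I don't expect to have much time for this before the end of the year, though, so today I'm writing a short summary. The previous two posts focused on features and results; this one is about how JewelCrawler came to be. The code is on Github (the link is at the end of the post) for anyone interested. (It is for learning and discussion only. Please do not use it for anything else, and be considerate of douban: a little more sincerity, a little less harm.)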
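Environment
Development tool: IntelliJ IDEA 14
Database: MySQL 5.5, with Navicat as the database management tool (for connecting to and querying the database)
Language: Java
Dependency management: Maven
Version control: Git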
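Directory structure
In this layout:
com.ansj.vec is a Java implementation of the Word2Vec algorithm
com.jackie.crawler.doubanmovie is the crawler module, which in turn contains several packages
Some of these packages are empty because those modules are not used yet. Of the rest:
the constants package holds constant classes
the crawl package holds the crawler entry point
the entity package holds the entity classes mapped to database tables
the test package holds test classes
the utils package holds utility classes
The resource module holds configuration and resource files, for example:
beans.xml: the Spring context configuration file
seed.properties: the seed file
stopwords.dic: the stop-word dictionary
comment12031715.txt: crawled short-comment data
tokenizerResult.txt: the output of tokenizing with IKAnalyzer
vector.mod: model data trained with the Word2Vec algorithm
The test module is the testing module, used for writing unit tests.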
Database configuration
1. Add the dependencies
JewelCrawler is managed with Maven, so it is enough to add the corresponding dependencies to pom.xml:
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
    <groupId>commons-pool</groupId>
    <artifactId>commons-pool</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>commons-dbcp</groupId>
    <artifactId>commons-dbcp</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
2. Declare the data source bean
The data source bean needs to be declared in beans.xml:
<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
    <property name="driverClassName" value="${jdbc.driver}"/>
    <property name="url" value="${jdbc.url}"/>
    <property name="username" value="${jdbc.username}"/>
    <property name="password" value="${jdbc.password}"/>
</bean>
Note: this binds the external configuration file jdbc.properties; the actual data source parameters are read from that file.
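If you run into the error "SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value;", the INSERT is omitting a column that is NOT NULL and has no default. The fix I used was to make the relevant column an auto-increment column. For a non-key column such as name, giving the column a default value or supplying it in the INSERT resolves the error as well.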
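Problems encountered while parsing pages
The crawled pages have to be parsed into a DOM structure to extract the data I want. Along the way I hit the following errors.
org.htmlparser.Node cannot be resolved
Fix: add the jar dependency
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>
org.apache.http.HttpEntity cannot be resolved
Fix: add the jar dependency
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
These were only problems encountered along the way; in the end, page parsing is done with Jsoup.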
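Slow downloads from the Maven repository
I had been using the default Maven Central repository, and jar downloads were very slow; I'm not sure whether it was my network or something else. I later found the Aliyun Maven mirror online, and after switching to it, downloads were practically instant by comparison. Highly recommended.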
<mirrors>
    <mirror>
        <id>alimaven</id>
        <name>aliyun maven</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>
Locate Maven's settings.xml file and add this mirror to it.
One way to read files under the resource module
For example, to read the seed.properties file:
@Test
public void testFile() {
    File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
    System.out.print("===========" + seedFile.length() + "===========");
}
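One caveat with getResource(...).getPath(): the returned path is URL-encoded (a space becomes %20), and it does not point to a real file once the resources are packaged inside a jar. A possible alternative, shown here only as a sketch, is to read the resource as a stream. The class SeedLoader and its method names are hypothetical and not part of JewelCrawler; since the keys inside seed.properties are not listed in this post, the sketch simply loads whatever entries the file contains.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of JewelCrawler.
final class SeedLoader {

    // Parses key=value entries from any stream, reading it as UTF-8.
    static Properties load(InputStream in) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Looks the resource up on the classpath; a leading "/" means the classpath root.
    static Properties loadFromClasspath(String name) {
        InputStream in = SeedLoader.class.getResourceAsStream(name);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + name);
        }
        return load(in);
    }

    public static void main(String[] args) {
        Properties seeds = loadFromClasspath("/seed.properties");
        System.out.println("Loaded " + seeds.size() + " entries");
    }
}
```

Because the stream is obtained from the class loader, the same call works whether the file sits in a directory on disk or inside a packaged jar.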
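About regular expressions
When using a regex, even if the input matches the Pattern you defined, you must call the Matcher's find method first before you can use group to get the matched substring. Calling group directly will not give you the result you want.
I took a look at the source code of the Matcher class: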
package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

    /**
     * The Pattern object that created this Matcher.
     */
    Pattern parentPattern;

    /**
     * The storage used by groups. They may contain invalid values if
     * a group was skipped during the matching.
     */
    int[] groups;

    /**
     * The range within the sequence that is to be matched. Anchors
     * will match at these "hard" boundaries. Changing the region
     * changes these values.
     */
    int from, to;

    /**
     * Lookbehind uses this value to ensure that the subexpression
     * match ends at the point where the lookbehind was encountered.
     */
    int lookbehindTo;

    /**
     * The original string being matched.
     */
    CharSequence text;

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;
    static final int NOANCHOR = 0;
    int acceptMode = NOANCHOR;

    /**
     * The range of string that last matched the pattern. If the last
     * match failed then first is -1; last initially holds 0 then it
     * holds the index of the end of the last match (which is where the
     * next search starts).
     */
    int first = -1, last = 0;

    /**
     * The end index of what matched in the last match operation.
     */
    int oldLast = -1;

    /**
     * The index of the last position appended in a substitution.
     */
    int lastAppendPosition = 0;

    /**
     * Storage used by nodes to tell what repetition they are on in
     * a pattern, and where groups begin. The nodes themselves are stateless,
     * so they rely on this field to hold state during a match.
     */
    int[] locals;

    /**
     * Boolean indicating whether or not more input could change
     * the results of the last match.
     *
     * If hitEnd is true, and a match was found, then more input
     * might cause a different match to be found.
     * If hitEnd is true and a match was not found, then more
     * input could cause a match to be found.
     * If hitEnd is false and a match was found, then more input
     * will not change the match.
     * If hitEnd is false and a match was not found, then more
     * input will not cause a match to be found.
     */
    boolean hitEnd;

    /**
     * Boolean indicating whether or not more input could change
     * a positive match into a negative one.
     *
     * If requireEnd is true, and a match was found, then more
     * input could cause the match to be lost.
     * If requireEnd is false and a match was found, then more
     * input might change the match but the match won't be lost.
     * If a match was not found, then requireEnd has no meaning.
     */
    boolean requireEnd;

    /**
     * If transparentBounds is true then the boundaries of this
     * matcher's region are transparent to lookahead, lookbehind,
     * and boundary matching constructs that try to see beyond them.
     */
    boolean transparentBounds = false;

    /**
     * If anchoringBounds is true then the boundaries of this
     * matcher's region match anchors such as ^ and $.
     */
    boolean anchoringBounds = true;

    /**
     * No default constructor.
     */
    Matcher() {
    }

    /**
     * All matchers have the state used by Pattern during a match.
     */
    Matcher(Pattern parent, CharSequence text) {
        this.parentPattern = parent;
        this.text = text;

        // Allocate state storage
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        groups = new int[parentGroupCount * 2];
        locals = new int[parent.localCount];

        // Put fields into initial states
        reset();
    }

    ....

    /**
     * Returns the input subsequence matched by the previous match.
     *
     * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
     * the expressions <i>m.</i><tt>group()</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
     * are equivalent. </p>
     *
     * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
     * string. This method will return the empty string when the pattern
     * successfully matches the empty string in the input. </p>
     *
     * @return The (possibly empty) subsequence matched by the previous match,
     *         in string form
     *
     * @throws IllegalStateException
     *         If no match has yet been attempted,
     *         or if the previous match operation failed
     */
    public String group() {
        return group(0);
    }

    /**
     * Returns the input subsequence captured by the given group during the
     * previous match operation.
     *
     * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
     * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
     * are equivalent. </p>
     *
     * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
     * to right, starting at one. Group zero denotes the entire pattern, so
     * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
     * </p>
     *
     * <p> If the match was successful but the group specified failed to match
     * any part of the input sequence, then <tt>null</tt> is returned. Note
     * that some groups, for example <tt>(a*)</tt>, match the empty string.
     * This method will return the empty string when such a group successfully
     * matches the empty string in the input. </p>
     *
     * @param group
     *        The index of a capturing group in this matcher's pattern
     *
     * @return The (possibly empty) subsequence captured by the group
     *         during the previous match, or <tt>null</tt> if the group
     *         failed to match part of the input
     *
     * @throws IllegalStateException
     *         If no match has yet been attempted,
     *         or if the previous match operation failed
     *
     * @throws IndexOutOfBoundsException
     *         If there is no capturing group in the pattern
     *         with the given index
     */
    public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }

    /**
     * Attempts to find the next subsequence of the input sequence that matches
     * the pattern.
     *
     * <p> This method starts at the beginning of this matcher's region, or, if
     * a previous invocation of the method was successful and the matcher has
     * not since been reset, at the first character not matched by the previous
     * match.
     *
     * <p> If the match succeeds then more information can be obtained via the
     * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
     *
     * @return <tt>true</tt> if, and only if, a subsequence of the input
     *         sequence matches this matcher's pattern
     */
    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        return search(nextSearchIndex);
    }

    /**
     * Initiates a search to find a Pattern within the given bounds.
     * The groups are filled with default values and the match of the root
     * of the state machine is called. The state machine will hold the state
     * of the match as it proceeds in this matcher.
     *
     * Matcher.from is not set here, because it is the "hard" boundary
     * of the start of the search which anchors will set to. The from param
     * is the "soft" boundary of the start of the search, meaning that the
     * regex tries to match at that index but ^ won't match there. Subsequent
     * calls to the search methods start at a new "soft" boundary which is
     * the end of the previous match.
     */
    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from = from < 0 ? 0 : from;
        this.first = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }

    ...
}
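Here is the reason. If you call group directly without calling find first, you can see that group() calls group(int group), whose body checks if (first < 0). That condition clearly holds, because the initial value of first is -1, so an IllegalStateException is thrown. If you call find first, however, it ends up calling search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose value is 0. Now jump into the search method: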
boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}
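nextSearchIndex is passed in as from, and from is assigned to first in the method body. So after a successful call to find, first is no longer -1 and group no longer throws. If nothing matches, search resets first to -1, so group still throws after a failed find.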
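The behavior described above can be checked with a small sketch. The class name MatcherGroupDemo, the pattern, and the sample inputs are made up for illustration and are not taken from JewelCrawler; only the java.util.regex calls are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of JewelCrawler.
final class MatcherGroupDemo {

    // Illustrative pattern: captures the digits in a "/subject/<digits>/" path segment.
    private static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    // Calling group() before any match attempt throws IllegalStateException.
    static boolean groupBeforeFindThrows(String input) {
        Matcher m = SUBJECT_ID.matcher(input);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Only read group(1) when find() returned true; otherwise report no match.
    static String firstId(String input) {
        Matcher m = SUBJECT_ID.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Each successful find() advances to the next match, so a loop collects all of them.
    static List<String> allIds(String input) {
        List<String> ids = new ArrayList<>();
        Matcher m = SUBJECT_ID.matcher(input);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String url = "https://example.com/subject/12031715/comments";
        System.out.println(groupBeforeFindThrows(url));
        System.out.println(firstId(url));
        System.out.println(allIds("/subject/1/ and /subject/22/"));
    }
}
```

Guarding every group call with the boolean returned by find (or matches) also covers the failed-match case, where first is reset to -1.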
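The source code has been uploaded to Github: https://github.com/DMinerJackie/JewelCrawler
The issues above are somewhat scattered; they are notes taken while running into problems and solving them. You will hit other problems in practice, and questions or suggestions are welcome ^^.
Finally, a few snapshots of the data crawled so far
Record table
It stores 79032 records, of which 48471 are pages that have already been crawled
movie table
2964 films and TV works have been crawled so far
comments table
29711 records have been crawled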