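Honestly, leaving anti-crawling measures aside, ordinary web scraping does not take much technical skill. This post is just a brief memo of how I scraped some data this time. During the unusual COVID-19 period I cannot go to the office, so I am working remotely from home on research and preparation. Part of that is product market research: let the data speak!
I focused on scraping JD.com (京東商城). Earlier I also scraped Tmall and Taobao. The Alibaba sites have fairly strong anti-crawling protection: after a while they kept showing a slider CAPTCHA that I could not get past. Even when my program detected the slider and I completed it by hand, the check still failed. I have not had time to dig into it and have not figured out how to get around it. If anyone has been through this, please share a hint!
My crawler is built with selenium + httpclient + mysql.
- selenium is an automated testing tool; here it is a good fit for automating clicks on page buttons. Honestly, you could skip selenium and do everything with jsoup. Using selenium for everything, on the other hand, gets awkward in some scenarios: when content is loaded fully asynchronously, simulated clicks can fail with an element-not-found error whether you locate elements by cssSelector or by xpath.
- jsoup alone can solve the problem, but the crawler becomes more complicated and you have to write a lot more code yourself.
- So in the end I used selenium and httpclient. selenium handles paging, because JD.com product list pages follow a clear pattern. Whether you page by URL parameters (loading the next page's URL with driver.get(url)) or by simulating a click on "Next page" in the list, it is easy, and because a real browser window is open you can see roughly what the crawler is looking at. httpclient is mainly used to fetch product prices and comment data: for prices it is a fallback, while comment data relies on it entirely.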
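To make the paging pattern concrete, here is a minimal sketch of deriving a list-page URL from a page index. The class name JdSearchPaging and the method pageUrl are made up for illustration; the page and s parameters follow the formula used in the crawler code in this post, and that formula is an observation about the site at the time, not a documented API.

```java
/** Sketch only: builds JD search-result URLs for successive list pages. */
final class JdSearchPaging {

    // Assumption taken from the crawler code in this post: visible page i
    // maps to page = 2*i - 1 and s = 60*(i - 1) + 1.
    static String pageUrl(String baseSearchUrl, int i) {
        if (i < 1) {
            throw new IllegalArgumentException("page index starts at 1");
        }
        int page = 2 * i - 1;
        int start = 60 * (i - 1) + 1;
        return baseSearchUrl + "&page=" + page + "&s=" + start;
    }

    public static void main(String[] args) {
        // Hypothetical base URL; any search URL that already has a query string works.
        String base = "https://search.jd.com/Search?keyword=test&enc=utf-8";
        for (int i = 1; i <= 3; i++) {
            System.out.println(pageUrl(base, i));
        }
    }
}
```

Each generated URL can then be handed to driver.get(url), which avoids having to find and click the paging controls at all.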
First create a maven project for the crawler, mainly to make pulling in dependencies easy.
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>
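Because selenium here runs against a real browser, that is, it drives the browser the way a user would, I chose the local-client mode with the Chrome driver. So you also need to download the chromedriver executable that matches your Chrome version. In the Java program, the system property webdriver.chrome.driver must point at this executable; the Java client then starts chromedriver and sends it WebDriver commands over a local HTTP connection to open pages and operate the browser.
The whole Java project is a very basic main program in an ordinary maven project; readers can turn it into a web application if they need to. First, the part that configures selenium.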
JDSeleniumFullProxy
package com.shihuc.up.spider.jd.comment;

import com.google.common.collect.Lists;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class JDSeleniumFullProxy {

    public static ChromeDriver driver;

    static {
        try {
            // Start the browser
            getDriver();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        getProductsWithFullScenario();
        Thread.sleep(10000);
        System.out.println("!!!!!!!==========Well Done===========!!!!!!");
        // Close the browser
        driver.quit();
    }

    private static void getProductsWithFullScenario() {
        String urls[] = new String[] {
            /* Car phone mounts (車載手機支架) */
            "https://search.jd.com/Search?keyword=%E8%BD%A6%E8%BD%BD%E6%89%8B%E6%9C%BA%E6%94%AF%E6%9E%B6&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&click=0"
        };
        // For each URL: { product table suffix, comment table suffix }
        String products[][] = new String[][] {
            {"jd_info_czsjzj", "jd_comment_czsjzj"}
        };
        // How many list pages to crawl
        int hmp = 40;
        JDProductDao productDao = new JDProductDao();
        // Scrape the data we need
        for (int i = 0; i < urls.length; i++) {
            JDSeleniumFullCrawler.getAllProducts(driver, hmp, urls[i], productDao, products[i]);
        }
        // Post-process price and sales: price ranges, and sales figures
        // containing '萬' or '+', are converted to numeric values
        for (int i = 0; i < products.length; i++) {
            productDao.updateProductForPriceSells(products[i][0]);
        }
    }

    /**
     * Creates the ChromeDriver.
     * @throws InterruptedException
     */
    private static void getDriver() throws InterruptedException {
        String os = System.getProperty("os.name");
        if (os.toLowerCase().startsWith("win")) {
            System.setProperty("webdriver.chrome.driver",
                    System.getProperty("user.dir") + "\\chromedriver_win32\\chromedriver.exe");
        } else {
            System.setProperty("webdriver.chrome.driver", "/usr/bin/chromedriver");
        }
        ChromeOptions options = new ChromeOptions();
        // Hide the "Chrome is being controlled by automated test software" infobar
        options.addArguments("--disable-infobars");
        // Turn off the browser's same-origin security checks
        options.addArguments("--disable-web-security");
        // Start maximized
        options.addArguments("--start-maximized");
        options.addArguments("--no-sandbox");
        List<String> excludeSwitches = Lists.newArrayList("enable-automation");
        options.setExperimentalOption("excludeSwitches", excludeSwitches);
        driver = new ChromeDriver(options);
        driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
        //driver.get("https://passport.jd.com/new/login.aspx");
        /*
         * None of the ways of simulating the slider below worked. The only thing that
         * did was to open the Taobao login page, switch by hand to the Alipay login
         * page, and scan the QR code with the Alipay mobile app. Only that gets past
         * Taobao's anti-crawling slider.
         */
        // while (true) {
        //     if (currentIsLoginPage()) {
        //         System.out.println("============>>>>");
        //     } else {
        //         System.out.println(">>>>>>OOOOOOOOOOO");
        //         break;
        //     }
        //     Thread.sleep(2000);
        // }
    }

    private static boolean currentIsLoginPage() {
        String url = driver.getCurrentUrl();
        return url.contains("https://passport.jd.com/new/login.aspx");
    }
}
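In the code, the webdriver.chrome.driver property is set to the location of my chromedriver: chromedriver.exe sits in the chromedriver_win32 folder inside my project. Adjust this path to wherever you put the file when you downloaded it.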
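If the path is wrong, the failure only shows up when the driver is created, so it can help to resolve and check the path up front. A minimal sketch, assuming the same two locations as the code in this post; the class name ChromeDriverLocator and the method resolve are hypothetical names for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

/** Sketch only: picks the chromedriver location per operating system. */
final class ChromeDriverLocator {

    // Same convention as in this post: a project-local folder on Windows,
    // /usr/bin/chromedriver everywhere else. Both locations are assumptions
    // about the local setup, not requirements of selenium.
    static String resolve(String osName, String projectDir) {
        if (osName.toLowerCase(Locale.ROOT).startsWith("win")) {
            return projectDir + "\\chromedriver_win32\\chromedriver.exe";
        }
        return "/usr/bin/chromedriver";
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty("os.name"), System.getProperty("user.dir"));
        // Report a missing driver before any browser is started.
        if (!Files.isExecutable(Paths.get(path))) {
            System.err.println("chromedriver not found or not executable: " + path);
            return;
        }
        System.setProperty("webdriver.chrome.driver", path);
        System.out.println("using " + path);
    }
}
```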
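The program above could also simulate logging in, but JD.com never asks you to log in just to browse products, however you browse. That is unlike the Alibaba sites, which guard against crawlers even when you are only browsing and pop up a login prompt every so often... not impressed...
Next comes the actual work of scraping the data with selenium and httpclient.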
JDSeleniumFullCrawler
package com.shihuc.up.spider.jd.comment;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Set;

public class JDSeleniumFullCrawler {

    private static String COMMENT_TOTAL = "評論總數";
    private static String COMMENT_GOOD = "好評數量";
    private static String COMMENT_GENERAL = "中評數量";
    private static String COMMENT_POOR = "差評數量";
    private static String COMMENT_VIDEO = "視頻曬單";
    private static String COMMENT_AFTER = "追評數量";

    public static void getAllProducts(ChromeDriver driver, int howManyPages, String url, JDProductDao pdao, String[] pname) {
        for (int i = 1; i <= howManyPages; i++) {
            getFullPageProducts(driver, i, url, pdao, pname);
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void getFullPageProducts(ChromeDriver driver, int i, String rawUrl, JDProductDao pdao, String[] pname) {
        // Alternative: type the page number into the pager and click "go"
        // WebElement pageNumInput = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/input"));
        // pageNumInput.clear();
        // pageNumInput.sendKeys(i + "");
        // WebElement searchSubmit = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/a"));
        // searchSubmit.click();
        String url = rawUrl + "&page=" + (2 * i - 1) + "&s=" + (60 * (i - 1) + 1);
        driver.get(url);
        getProductsProcess(driver, pdao, pname);
    }

    private static void getProductsProcess(ChromeDriver driver, JDProductDao pdao, String[] pname) {
        List<WebElement> itemElements = driver.findElements(By.cssSelector("#J_goodsList .gl-item"));
        System.out.println(itemElements.size());
        String mainHandle = driver.getWindowHandle();
        String href = null;
        for (WebElement we : itemElements) {
            try {
                String weId = we.getAttribute("data-pid");
                //WebElement weHref = we.findElement(By.cssSelector(".p-name a"));
                WebElement weHref = we.findElement(By.cssSelector(".p-img a"));
                //href = weHref.getAttribute("href");
                href = "https://item.jd.com/" + weId + ".html";
                // Price and comments often cannot be read this way, because the
                // site renders them fully asynchronously
                String price = null;
                try {
                    WebElement wePrice = we.findElement(By.cssSelector(".p-price strong i"));
                    price = wePrice.getText();
                } catch (Exception ep) {
                    System.err.println("can not get the price information for pid " + weId + " ......");
                }
                // String sells = null;
                // try {
                //     WebElement weSells = we.findElement(By.cssSelector(".p-commit strong a"));
                //     sells = weSells.getText();
                // } catch (Exception ec) {
                //     System.err.println("can not get the comment information for pid " + weId + " ......");
                // }
                driver.executeScript("window.open(\"https://item.jd.com/" + weId + ".html\");");
                Set<String> handles = driver.getWindowHandles();
                String newHandle = "";
                for (String s : handles) {
                    if (s.equalsIgnoreCase(mainHandle)) {
                        continue;
                    }
                    newHandle = s;
                    break;
                }
                // Switch to the product detail window that was just opened
                driver.switchTo().window(newHandle);
                // Read the fields we care about from the product detail page
                JDProduct product = getJDProductInfos(driver);
                try {
                    if (price == null) {
                        price = JDPhoneHolder.getPrice(weId);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                product.setUrl(href);
                product.setPid(weId);
                product.setPrice(price);
                //JDComment comment = getCommentByCD(driver);
                JDComment comment = getCommentByPID(weId);
                product.setComment(comment);
                int rid = pdao.addProductInfoGenId(product, pname[0]);
                pdao.addProductComments2(product, rid, pname[1]);
                // Close the product detail window that was just processed
                closeAllOtherWindows(mainHandle, driver);
            } catch (Exception eal) {
                closeAllOtherWindows(mainHandle, driver);
                eal.printStackTrace();
                System.out.println(href);
            }
        }
    }

    public static JDProduct getJDProductInfoByUrl(WebDriver driver, String url, JDProduct product) {
        System.out.println("URL: " + url);
        driver.get(url);
        WebElement weComment = driver.findElement(By.cssSelector(".comment-count .count"));
        WebElement wePrice = driver.findElement(By.cssSelector(".summary-price .price"));
        String strComment = weComment.getText();
        if (strComment.equalsIgnoreCase("0")) {
            try {
                strComment = JDPhoneHolder.getCommitCountNum(product.getPid()) + "";
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        String strPrice = wePrice.getText();
        product.setPrice(strPrice);
        return product;
    }

    public static JDProduct getJDProductInfos(WebDriver driver) {
        WebElement weTitle = driver.findElement(By.cssSelector(".w div.sku-name"));
        String title = weTitle.getText();
        /*
         * Brand and model are located with xpath. My impression was that xpath
         * performed better than cssSelector here; treat that as anecdotal.
         */
        WebElement weBrand = driver.findElement(By.xpath(".//*[@id=\"parameter-brand\"]/li/a"));
        String brand = weBrand.getText();
        WebElement weName = driver.findElement(By.xpath(".//*[@id=\"detail\"]/div[2]/div[1]/div[1]/ul[2]/li[1]"));
        String name = weName.getText();
        // Strip the "product name:" label that precedes the value on the page
        name = name.replace("商品名稱:", "").trim();
        JDProduct product = new JDProduct();
        product.setBrand(brand);
        product.setPname(name);
        product.setTitle(title);
        return product;
    }

    public static JDComment getCommentByPID(String pid) {
        JDComment comments = new JDComment();
        HashMap<String, Integer> groups = new HashMap<>();
        try {
            JSONObject commentJson = JDPhoneHolder.getComments(pid);
            JSONObject productCommentSummary = commentJson.getJSONObject("productCommentSummary");
            // Positive-review rate
            int goodRateShow = productCommentSummary.getInteger("goodRateShow");
            comments.setGoodRate(goodRateShow);
            // Total number of comments
            int commentCount = productCommentSummary.getInteger("commentCount");
            comments.setTotalc(commentCount);
            // Number of positive reviews
            int goodCount = productCommentSummary.getInteger("goodCount");
            comments.setGoodc(goodCount);
            // Number of neutral reviews
            int generalCount = productCommentSummary.getInteger("generalCount");
            comments.setGeneralc(generalCount);
            // Number of negative reviews
            int poorCount = productCommentSummary.getInteger("poorCount");
            comments.setPoorc(poorCount);
            // Number of video reviews
            int videoCount = productCommentSummary.getInteger("videoCount");
            comments.setVideoc(videoCount);
            // Number of follow-up reviews
            int afterCount = productCommentSummary.getInteger("afterCount");
            comments.setAfterc(afterCount);
            JSONArray hotCommentTagStatistics = commentJson.getJSONArray("hotCommentTagStatistics");
            for (int i = 0; i < hotCommentTagStatistics.size(); i++) {
                JSONObject hotComment = hotCommentTagStatistics.getJSONObject(i);
                String name = hotComment.getString("name");
                int count = hotComment.getInteger("count");
                groups.put(name, count);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        comments.setCommentGroups(groups);
        return comments;
    }

    public static JDComment getCommentByCD(ChromeDriver driver) {
        JDComment comment = new JDComment();
        WebElement weCommentTab = driver.findElement(By.xpath("//*[@id=\"detail\"]/div[1]/ul/li[5]"));
        weCommentTab.click();
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        WebElement weGoodRate = driver.findElement(By.cssSelector(".comment-percent .percent-con"));
        String goodRate = weGoodRate.getText();
        int len = goodRate.length();
        if (len > 1) {
            // Drop the trailing '%'
            goodRate = goodRate.substring(0, len - 1);
        }
        int rate = Integer.valueOf(goodRate);
        List<WebElement> weGroupList = driver.findElements(By.cssSelector(".J-comment-info .percent-info .tag-list .tag-1"));
        HashMap<String, Integer> groups = new HashMap<>();
        for (WebElement we : weGroupList) {
            String rawGroup = we.getText();
            splitDescInfo(rawGroup, groups);
        }
        List<WebElement> weLevelList = driver.findElements(By.cssSelector(".J-comments-list .filter-list li"));
        HashMap<String, Integer> levels = new HashMap<>();
        for (WebElement we : weLevelList) {
            WebElement weLevel = we.findElement(By.cssSelector("a"));
            if (containsDatatab(weLevel)) {
                //TODO
                // String rawLevel = weLevel.getText();
                // splitDescInfo(rawLevel, levels);
            }
        }
        comment.setGoodRate(rate);
        comment.setCommentGroups(groups);
        return comment;
    }

    private static boolean containsDatatab(WebElement we) {
        // getAttribute returns null for a missing attribute instead of throwing
        try {
            return we.getAttribute("data-tab") != null;
        } catch (Exception e) {
            return false;
        }
    }

    // Splits text of the form "label(count)" and stores label -> count
    private static void splitDescInfo(String desc, HashMap<String, Integer> map) {
        String info = desc;
        int parenIdx = info.indexOf("(");
        if (parenIdx < 0) {
            // No count in parentheses; nothing to record
            return;
        }
        String context = info.substring(0, parenIdx);
        String strCount = info.substring(parenIdx + 1, info.length() - 1);
        float count = getRealCount(strCount);
        map.put(context, (int) count);
    }

    // Converts counts such as "1.2萬" or "500+" into a number
    private static float getRealCount(String rawCount) {
        float realCount;
        if (rawCount.contains("萬")) {
            int wanIdx = rawCount.indexOf("萬");
            String strRealCount = rawCount.substring(0, wanIdx);
            realCount = Float.valueOf(strRealCount) * 10000;
        } else if (rawCount.contains("+")) {
            int plusIdx = rawCount.indexOf("+");
            String strRealCount = rawCount.substring(0, plusIdx);
            realCount = Integer.valueOf(strRealCount);
        } else {
            realCount = Integer.valueOf(rawCount);
        }
        return realCount;
    }

    // Closes every window except the first one, then switches back to main
    private static void closeAllOtherWindows(String main, ChromeDriver driver) {
        Set<String> handles = driver.getWindowHandles();
        System.out.println("------->main: " + main);
        Object[] hs = handles.toArray();
        for (int i = hs.length - 1; i > 0; i--) {
            System.out.println("-------->child: " + hs[i]);
            driver.switchTo().window(hs[i].toString());
            driver.close();
        }
        driver.switchTo().window(main);
    }
}
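The count handling in splitDescInfo and getRealCount can also be looked at in isolation. Below is a minimal sketch under a few assumptions of mine: the class name CommentCount is hypothetical, both the simplified (U+4E07) and traditional (U+842C) forms of the "ten thousand" character are accepted because I am not sure which one the page serves, and BigDecimal replaces float only to sidestep rounding questions.

```java
import java.math.BigDecimal;

/** Sketch only: turns display counts like "500+" or "1.2\u4e07+" into an int. */
final class CommentCount {

    private static final char WAN_SIMPLIFIED = '\u4e07';   // simplified form of "ten thousand"
    private static final char WAN_TRADITIONAL = '\u842c';  // traditional form, as in this post

    static int parse(String raw) {
        String s = raw.trim();
        // A trailing '+' only means "at least", so drop it.
        if (s.endsWith("+")) {
            s = s.substring(0, s.length() - 1);
        }
        int wan = Math.max(s.indexOf(WAN_SIMPLIFIED), s.indexOf(WAN_TRADITIONAL));
        if (wan >= 0) {
            // Multiply the decimal prefix by 10000 without going through float.
            return new BigDecimal(s.substring(0, wan))
                    .multiply(BigDecimal.valueOf(10000))
                    .intValue();
        }
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("500+"));
        System.out.println(parse("1.2\u4e07+"));
    }
}
```

Anything that is not a plain number after these steps still throws NumberFormatException, which is deliberate: an unexpected format is better noticed than silently stored as 0.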
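In JDSeleniumFullCrawler the key point is the window-switching logic. If the page you want to read is not the one the driver's current handle points to, you get element-not-found errors. This is a common mistake, so manage the window handles carefully and close each page once it has been scraped. The selenium Java client returns the handles in a LinkedHashSet, and in my runs they came back in the order the windows were opened, so later windows sit later in the set; converting the Set to an array gives a simple way to close everything except the first window. Note that the WebDriver specification itself does not guarantee this order, so treat it as an implementation detail. It is enough here because the scenario is simple: switching between the list page and a detail page.
Next is writing the scraped data to the database. I use spring's jdbcTemplate, which is very simple. It is not as powerful as mybatis, but it is plenty for handling some scraped data.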
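A variant that does not rely on handle order at all is to close every handle that is not the main one. The sketch below shows only the selection step with plain collections, so it runs without a browser; the class name WindowHandles and the handle strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: decides which window handles to close, independent of their order. */
final class WindowHandles {

    /** Returns every handle except the main one, keeping the iteration order of the set. */
    static List<String> othersThan(String mainHandle, Set<String> allHandles) {
        List<String> others = new ArrayList<>();
        for (String handle : allHandles) {
            if (!handle.equals(mainHandle)) {
                others.add(handle);
            }
        }
        return others;
    }

    public static void main(String[] args) {
        // Hypothetical handles; real ones are opaque strings from getWindowHandles().
        Set<String> handles = new LinkedHashSet<>(Arrays.asList("list-page", "detail-1", "detail-2"));
        // In the crawler, each returned handle would be passed to
        // driver.switchTo().window(handle) followed by driver.close(),
        // and finally driver.switchTo().window(mainHandle).
        System.out.println(othersThan("list-page", handles)); // prints [detail-1, detail-2]
    }
}
```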
JDProductDao
package com.shihuc.up.spider.jd.comment;

import com.mchange.v2.c3p0.ComboPooledDataSource;
import org.openqa.selenium.chrome.ChromeDriver;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import java.beans.PropertyVetoException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class JDProductDao extends JdbcTemplate {

    public JDProductDao() {
        // Set up the c3p0 connection pool
        ComboPooledDataSource ds = new ComboPooledDataSource();
        try {
            ds.setDriverClass("com.mysql.jdbc.Driver");
            ds.setUser("root");
            ds.setPassword("shihuc");
            ds.setJdbcUrl("jdbc:mysql://localhost:3306/nav?characterEncoding=utf-8");
        } catch (PropertyVetoException e) {
            e.printStackTrace();
        }
        super.setDataSource(ds);
    }

    // Inserts one product row and returns its generated primary key
    public int addProductInfoGenId(JDProduct product, String shop) {
        KeyHolder keyHolder = new GeneratedKeyHolder();
        JDComment comment = product.getComment();
        super.update(new PreparedStatementCreator() {
            final String sql = "insert into good_holder_" + shop
                    + " (pid,title,brand,pname,price,url, goodrate,totalc,goodc,generalc,poorc,videoc,afterc)"
                    + " values (?,?,?,?,?,?,?,?,?,?,?,?,?)";

            @Override
            public PreparedStatement createPreparedStatement(java.sql.Connection conn) throws SQLException {
                PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                ps.setString(1, product.getPid());
                ps.setString(2, product.getTitle());
                ps.setString(3, product.getBrand());
                ps.setString(4, product.getPname());
                ps.setString(5, product.getPrice());
                ps.setString(6, product.getUrl());
                ps.setInt(7, comment.getGoodRate());
                ps.setInt(8, comment.getTotalc());
                ps.setInt(9, comment.getGoodc());
                ps.setInt(10, comment.getGeneralc());
                ps.setInt(11, comment.getPoorc());
                ps.setInt(12, comment.getVideoc());
                ps.setInt(13, comment.getAfterc());
                return ps;
            }
        }, keyHolder);
        return keyHolder.getKey().intValue();
    }

    public void addProductComments(JDProduct product, int rid, String shop) {
        final String sql = "insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object[]> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setInt(1, (Integer) comments.get(i)[0]);
                ps.setString(2, (String) comments.get(i)[1]);
                ps.setInt(3, (Integer) comments.get(i)[2]);
            }

            @Override
            public int getBatchSize() {
                return comments.size();
            }
        });
    }

    // Same insert as addProductComments, using the simpler batchUpdate overload
    public void addProductComments2(JDProduct product, int rid, String shop) {
        final String sql = "insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object[]> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, comments);
    }

    private List<Object[]> transformCommentsToObjects(int rid, JDComment comments) {
        List<Object[]> list = new ArrayList<>();
        Object[] object = null;
        HashMap<String, Integer> groups = comments.getCommentGroups();
        for (String group : groups.keySet()) {
            object = new Object[]{
                rid,
                group,
                groups.get(group),
            };
            list.add(object);
        }
        return list;
    }

    public List<JDProduct> updateProductForPriceSells(String tableIdx) {
        // Query the rows and handle the result set with a RowCallbackHandler
        String sql = "select id, pid, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();
        List<JDProduct> nokProducts = new ArrayList<>();
        // Copy each row into the product object, then normalize its price
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setPid(rs.getString("pid"));
                product.setPrice(rs.getString("price"));
                dataProcess(product, tableIdx);
            }
        });
        return nokProducts;
    }

    public void updateNokProductForPriceSells(String tableIdx, ChromeDriver driver) {
        // Query the rows and handle the result set with a RowCallbackHandler
        String sql = "select id, url, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();
        // Copy each row into the product object; re-fetch the price if it is missing
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setUrl(rs.getString("url"));
                product.setPrice(rs.getString("price"));
                if (isNokProduct(product, tableIdx)) {
                    JDProduct pd = JDSeleniumFullCrawler.getJDProductInfoByUrl(driver, product.getUrl(), product);
                    reSetPriceOrSells(product.getId(), tableIdx, pd.getPrice());
                }
            }
        });
    }

    // A row is "not ok" when its price is empty but it still has a URL to re-fetch from
    public boolean isNokProduct(JDProduct product, String tableIdx) {
        String price = product.getPrice();
        String url = product.getUrl();
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");
            if (url != null && !url.equalsIgnoreCase("")) {
                return true;
            }
        }
        return false;
    }

    // Splits a price such as "12.50-30.00" into low/high and stores both
    public void dataProcess(JDProduct product, String tableIdx) {
        String price = product.getPrice();
        double dlow = 0, dhigh = 0;
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");
            return;
        }
        String low = "0", high = "0";
        if (price.contains("-")) {
            int idx = price.indexOf("-");
            low = price.substring(0, idx);
            high = price.substring(idx + 1);
        } else {
            low = price;
            high = price;
        }
        dlow = Double.valueOf(low);
        dhigh = Double.valueOf(high);
        // String countReg = "^[1-9][0-9]*";
        // Pattern p = Pattern.compile(countReg);
        // Matcher m = p.matcher(sells);
        // if (m.find()) {
        //     String sc = m.group();
        //     sellCount = Integer.valueOf(sc);
        // }
        updateProductPriceSell(product.getId(), tableIdx, dlow, dhigh);
    }

    public void updateProductPriceSell(int id, String tableIdx, double priceLow, double priceHigh) {
        String sql = "update good_holder_" + tableIdx + " set priceLow=?,priceHigh=? where id=?";
        int rows = super.update(sql, priceLow, priceHigh, id);
        System.out.println(rows);
    }

    public void reSetPriceOrSells(int id, String tableIdx, String price) {
        String sql = "update good_holder_" + tableIdx + " set price=? where id=?";
        int rows = super.update(sql, price, id);
        System.out.println(rows);
    }
}
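The price normalization done by dataProcess can be tried out without a database. Here is a minimal sketch of the same idea, splitting a "low-high" string into two numbers; the class name PriceRange is hypothetical, and treating a leading minus as a sign rather than a range separator is my own assumption, not something the code above does.

```java
/** Sketch only: parses "19.9" or "12.5-30" into a low/high price pair. */
final class PriceRange {

    final double low;
    final double high;

    private PriceRange(double low, double high) {
        this.low = low;
        this.high = high;
    }

    /** Returns null for a missing price; throws NumberFormatException for malformed input. */
    static PriceRange parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        String s = raw.trim();
        // Search from index 1 so that a leading '-' counts as a sign, not a separator.
        int idx = s.indexOf('-', 1);
        if (idx < 0) {
            double value = Double.parseDouble(s);
            return new PriceRange(value, value);
        }
        double low = Double.parseDouble(s.substring(0, idx).trim());
        double high = Double.parseDouble(s.substring(idx + 1).trim());
        return new PriceRange(low, high);
    }

    public static void main(String[] args) {
        PriceRange range = parse("12.5-30");
        System.out.println(range.low + " .. " + range.high);
    }
}
```

The two values map directly onto the priceLow and priceHigh columns of the product table.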
Below are the model classes for product and comment information.
JDProduct
package com.shihuc.up.spider.jd.comment;

public class JDProduct {

    private int id;
    private String pid;
    private String title;
    private String brand;
    private String pname;
    private String price;
    private String url;
    private double priceHigh;
    private double priceLow;
    private JDComment comment;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getPid() { return pid; }
    public void setPid(String pid) { this.pid = pid; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getBrand() { return brand; }
    public void setBrand(String brand) { this.brand = brand; }

    public String getPname() { return pname; }
    public void setPname(String pname) { this.pname = pname; }

    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public double getPriceHigh() { return priceHigh; }
    public void setPriceHigh(double priceHigh) { this.priceHigh = priceHigh; }

    public double getPriceLow() { return priceLow; }
    public void setPriceLow(double priceLow) { this.priceLow = priceLow; }

    public JDComment getComment() { return comment; }
    public void setComment(JDComment comment) { this.comment = comment; }

    @Override
    public String toString() {
        return "Product{" +
                "pid=" + pid +
                ", title='" + title + '\'' +
                ", brand='" + brand + '\'' +
                ", pname='" + pname + '\'' +
                ", price='" + price + '\'' +
                '}';
    }
}
JDComment
package com.shihuc.up.spider.jd.comment;

import java.util.HashMap;

public class JDComment {

    private Integer goodRate;

    /**
     * Comment categories (tags) and the number of comments in each
     */
    private HashMap<String, Integer> commentGroups;

    // Tmall exposes sales figures; Taobao, like JD, exposes cumulative comment counts
    private int totalc;
    private int goodc;
    private int generalc;
    private int poorc;
    private int videoc;
    private int afterc;

    public Integer getGoodRate() { return goodRate; }
    public void setGoodRate(Integer goodRate) { this.goodRate = goodRate; }

    public HashMap<String, Integer> getCommentGroups() { return commentGroups; }
    public void setCommentGroups(HashMap<String, Integer> commentGroups) { this.commentGroups = commentGroups; }

    public int getTotalc() { return totalc; }
    public void setTotalc(int totalc) { this.totalc = totalc; }

    public int getGoodc() { return goodc; }
    public void setGoodc(int goodc) { this.goodc = goodc; }

    public int getGeneralc() { return generalc; }
    public void setGeneralc(int generalc) { this.generalc = generalc; }

    public int getPoorc() { return poorc; }
    public void setPoorc(int poorc) { this.poorc = poorc; }

    public int getVideoc() { return videoc; }
    public void setVideoc(int videoc) { this.videoc = videoc; }

    public int getAfterc() { return afterc; }
    public void setAfterc(int afterc) { this.afterc = afterc; }
}
One more thing to add: the httpclient utility class used to fetch the pages for prices and comments.
HttpClientUtils
package com.shihuc.up.spider;

import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HttpClientUtils {

    // Shared httpclient connection pool
    private static PoolingHttpClientConnectionManager connectionManager;

    static {
        connectionManager = new PoolingHttpClientConnectionManager();
        // Maximum number of connections in the pool
        connectionManager.setMaxTotal(200);
        // At most 20 connections per route (target host)
        connectionManager.setDefaultMaxPerRoute(20);
    }

    private static CloseableHttpClient getCloseableHttpClient() {
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
        return httpClient;
    }

    private static String execute(HttpRequestBase httpRequestBase) throws IOException {
        httpRequestBase.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        // Timeouts
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(10000)
                .setConnectTimeout(10000)
                .setSocketTimeout(15 * 1000)
                .build();
        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the connection goes back to the pool
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    private static String executeReferer(HttpRequestBase httpRequestBase, String referer) throws IOException {
        httpRequestBase.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        httpRequestBase.setHeader("Referer", referer);
        httpRequestBase.setHeader("Sec-Fetch-Mode", "no-cors");
        // Timeouts
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(60000)
                .setConnectTimeout(60000)
                .setSocketTimeout(10 * 10000)
                .build();
        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the connection goes back to the pool
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    public static String doGet(String url) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = execute(httpGet);
        return html;
    }

    public static String doGetReferer(String url, String referer) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = executeReferer(httpGet, referer);
        return html;
    }

    public static String doPost(String url, Map<String, String> params) throws IOException {
        HttpPost httpPost = new HttpPost(url);
        List<BasicNameValuePair> list = new ArrayList<>();
        for (String key : params.keySet()) {
            list.add(new BasicNameValuePair(key, params.get(key)));
        }
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(list);
        httpPost.setEntity(entity);
        return execute(httpPost);
    }

    public static void main(String args[]) {
        String pid = "4310407";
        // try {
        //     JDPhoneHolder.getCommitCount(pid);
        // } catch (IOException e) {
        //     e.printStackTrace();
        // }
        try {
            int commitCountNum = JDPhoneHolder.getCommitCountNum(pid);
            System.out.println("Product: " + pid + ", comment count: " + commitCountNum);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The table structures used are attached here as well:
Product table:
CREATE TABLE `good_holder_jd_info_czsjzj` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `pid` varchar(32) NOT NULL COMMENT 'product ID',
  `title` varchar(1024) NOT NULL COMMENT 'product title / description',
  `brand` varchar(1024) NOT NULL COMMENT 'product brand',
  `pname` varchar(1024) NOT NULL COMMENT 'product name',
  `price` varchar(32) NOT NULL COMMENT 'product price',
  `url` varchar(2048) NOT NULL COMMENT 'product link',
  `priceLow` double(16,2) DEFAULT NULL COMMENT 'low end of the price',
  `priceHigh` double(16,2) DEFAULT NULL COMMENT 'high end of the price',
  `goodrate` int(11) DEFAULT NULL COMMENT 'product review score',
  `totalc` int(64) DEFAULT NULL COMMENT 'total number of comments',
  `goodc` int(11) DEFAULT NULL COMMENT 'number of positive reviews',
  `generalc` int(11) DEFAULT NULL COMMENT 'number of neutral reviews',
  `poorc` int(11) DEFAULT NULL COMMENT 'number of negative reviews',
  `videoc` int(11) DEFAULT NULL COMMENT 'number of video reviews',
  `afterc` int(11) DEFAULT NULL COMMENT 'number of follow-up reviews',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1134 DEFAULT CHARSET=utf8mb4
Comment category table (I did not scrape the full comment text, only the comment categories and their counts):
CREATE TABLE `good_holder_jd_comment_czsjzj` (
  `rid` int(11) NOT NULL COMMENT 'primary key ID of the product row this comment data belongs to',
  `info` varchar(256) DEFAULT NULL COMMENT 'category description text',
  `count` int(11) DEFAULT NULL COMMENT 'number of comments in this category'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
The data in this comment category table looks like the content circled in red in the figure below:
To wrap up the post, here are the methods (from JDPhoneHolder) for fetching JD product prices and comment data:
// Fetch the price; only the product ID is needed
public static String getPrice(String pid) throws IOException {
    String priceUrl = "https://p.3.cn/prices/mgets?pduid=" + Math.random() + "&skuIds=J_" + pid;
    String priceJson = HttpClientUtils.doGet(priceUrl);
    System.out.println(priceJson);
    Gson gson = new GsonBuilder().create();
    List<Map<String, String>> list = gson.fromJson(priceJson, List.class);
    return list.get(0).get("p");
}
// Fetch the comment information; only the product ID is needed
public static JSONObject getComments(String pid) throws IOException {
    String baseUrl = "https://sclub.jd.com/comment/productPageComments.action?score=0&sortType=5&page=1&pageSize=1&isShadowSku=0&productId=" + pid;
    String commentJson = HttpClientUtils.doGet(baseUrl);
    System.out.println(commentJson);
    JSONObject jsonObject = JSON.parseObject(commentJson);
    return jsonObject;
}
In both functions the URL is the key part. Judging from these two URLs, JD's site is designed in a relatively simple way.
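Since everything hinges on these two URLs, it can be convenient to build them in one place and reject a malformed product ID before any request goes out. A minimal sketch: the class name JdEndpoints and the digits-only check are my own assumptions, and the endpoints are simply the ones used in this post, which may change or stop working at any time.

```java
/** Sketch only: builds the price and comment-summary URLs used in this post. */
final class JdEndpoints {

    // Assumption: product IDs are purely numeric, as in the examples in this post.
    private static void requireNumeric(String pid) {
        if (pid == null || !pid.matches("\\d+")) {
            throw new IllegalArgumentException("unexpected product id: " + pid);
        }
    }

    /** pduid is passed in (the post uses Math.random()) so the result is reproducible. */
    static String priceUrl(String pid, String pduid) {
        requireNumeric(pid);
        return "https://p.3.cn/prices/mgets?pduid=" + pduid + "&skuIds=J_" + pid;
    }

    static String commentSummaryUrl(String pid) {
        requireNumeric(pid);
        return "https://sclub.jd.com/comment/productPageComments.action"
                + "?score=0&sortType=5&page=1&pageSize=1&isShadowSku=0&productId=" + pid;
    }

    public static void main(String[] args) {
        String pid = "4310407"; // the sample product ID used elsewhere in this post
        System.out.println(priceUrl(pid, String.valueOf(Math.random())));
        System.out.println(commentSummaryUrl(pid));
    }
}
```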
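That is all for this post. The crawler above (which mainly scrapes car phone mount listings) can scrape similar information for other products with small changes. Comments are welcome, and so are solutions for getting around Alibaba's anti-crawling measures!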