General Steps and Methods for Fetching Web Page Data in Java

This article is a repost. Posted 2020-03-24 22:57. Tag: crawlers.

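Many industries call for market analysis, which means classifying and aggregating the industry's data and analyzing it promptly. This gives a company a useful reference point and a basis for side-by-side comparison when planning its future. At present, fetching that data over the network is an effective and fast way to get it.
Below is a brief introduction to the steps involved in scraping web page data with Java. Corrections are welcome if anything is missing. Without further ado: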

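The process generally breaks down into the following steps:
1: Use HttpClient to send a request to the page's URL (pay particular attention to the request method, e.g. GET or POST)
2: Get the page source
3: Check whether the source actually contains the data we want to extract
4: Break the source apart, typically with string splitting, regular expressions, or a third-party jar
5: Assign the extracted values to an object of your own class
6: Save the extracted data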
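
Step 4 mentions splitting and regular expressions as alternatives to a third-party library. Here is a minimal sketch of the regex route, assuming a simple, well-formed table row. The class name, method name, and sample HTML are made up for illustration and are not from the original article. Regexes tend to be fragile on real-world HTML, so a proper parser is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class RegexCellExtractor {
    // Captures the content between <td ...> and </td>; DOTALL lets a cell span lines.
    private static final Pattern TD =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of every <td> cell found in the given HTML fragment.
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            // Strip nested tags, then trim surrounding whitespace
            cells.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td class=\"k\">Area</td><td> <b>Gangbei District</b> </td></tr>";
        System.out.println(extractCells(html));
    }
}
```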

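Below is some of the source code used for extracting the data, along with what each part is for. Note that this request code uses the JDK's URLConnection rather than HttpClient: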

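    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.List;
    import java.util.Map;

    /**
     * Sends a GET request to the given URL.
     *
     * @param url
     *            the URL to send the request to
     * @param param
     *            the request parameters, in the form name1=value1&name2=value2
     *            (may be null or empty)
     * @return the response body of the remote resource, or an empty string on failure
     */
    public static String sendGet(String url, String param) {
        StringBuilder result = new StringBuilder();
        BufferedReader in = null;
        try {
            // Append the query string, if any (the original code never used param)
            String urlNameString = (param == null || param.isEmpty()) ? url : url + "?" + param;
            URL realUrl = new URL(urlNameString);
            // Open a connection to the URL
            URLConnection connection = realUrl.openConnection();
            // Set common request headers
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // Establish the actual connection
            connection.connect();
            // Fetch all response header fields (not used further in this example)
            Map<String, List<String>> map = connection.getHeaderFields();

            // Wrap the response in a BufferedReader. If the text comes out garbled,
            // use the InputStreamReader constructor that takes a charset and pass
            // the encoding the page actually uses.
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request: " + e);
            e.printStackTrace();
        }
        // Close the input stream in a finally block
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result.toString();
    }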
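
The `param` argument is documented as a `name1=value1&name2=value2` string. Values that contain spaces, `&`, or non-ASCII characters need URL-encoding before they go into that string. Here is a minimal sketch of building it from a map. `QueryStringBuilder` and `buildQuery` are illustrative names that are not part of the original code, and UTF-8 is an assumption about what the target server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
class QueryStringBuilder {
    // URL-encodes one key or value as UTF-8 (an assumption about the server).
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    // Joins the entries as name1=value1&name2=value2, in map iteration order.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keyword", "office supplies");
        params.put("page", "1");
        System.out.println(buildQuery(params));
    }
}
```

A `LinkedHashMap` keeps the parameters in insertion order, which makes the resulting URL predictable when debugging.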

Parsing and storing the data

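    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public Bid getData(String html) throws Exception {
        // The extracted data is stored in a Bid object; you can define your own class for this
        Bid bid = new Bid();
        // Parse the HTML with Jsoup
        Document doc = Jsoup.parse(html);
        // System.out.println("doc content: " + doc.text());
        // Select every <tr> element in the document
        Elements elements = doc.select("tr");
        System.out.println(elements.size() + " rows");
        // Iterate over the rows
        for (Element element : elements) {
            Elements tdes = element.select("td");
            // Skip rows that contain no <td> cells
            if (tdes.isEmpty()) {
                continue;
            }
            for (int i = 0; i < tdes.size(); i++) {
                // relation(...) is the author's own helper; it is not shown in this article
                this.relation(tdes, tdes.get(i).text(), bid, i + 1);
            }
        }
        return bid;
    }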
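
The `relation` helper called above is not shown in the article. Since it receives the whole cell list plus the index `i+1`, one plausible reading is that it treats the current cell as a label and copies the following cell into the matching `Bid` field. The sketch below illustrates that idea only. It is a guess, not the author's implementation. It works on plain strings instead of Jsoup `Elements` so that it runs without extra jars, and `SimpleBid`, `LabelMapper`, the label texts, and the two fields are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, cut-down stand-in for the article's Bid class.
class SimpleBid {
    String itemName;
    String areaName;
}

// Hypothetical label-to-field mapping; the real relation(...) may work differently.
class LabelMapper {
    // cells: text of every cell in one row; nextIndex: position right after the label cell
    static void relation(List<String> cells, String label, SimpleBid bid, int nextIndex) {
        if (nextIndex >= cells.size()) {
            return; // the label is the last cell, so there is no value to read
        }
        String value = cells.get(nextIndex).trim();
        switch (label.trim()) {
            case "Item name":
                bid.itemName = value;
                break;
            case "Area":
                bid.areaName = value;
                break;
            default:
                break; // not a label we recognise
        }
    }

    // Mirrors the inner loop of getData: offer every cell as a candidate label.
    static void fill(SimpleBid bid, List<String> cells) {
        for (int i = 0; i < cells.size(); i++) {
            relation(cells, cells.get(i), bid, i + 1);
        }
    }

    public static void main(String[] args) {
        SimpleBid bid = new SimpleBid();
        fill(bid, Arrays.asList("Item name", "Equipment procurement", "Area", "Gangbei District"));
        System.out.println(bid.itemName + " / " + bid.areaName);
    }
}
```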

Resulting data

Bid {
    h2 = 'See content for details',
    itemName = 'Litigation service center equipment procurement',
    item = 'Goods/Office consumables and similar items/Other office consumables and similar items',
    itemUnit = 'See content for details',
    areaName = 'Gangbei District',
    noticeTime = '2018-10-22 18:41',
    itemNoticeTime = 'null',
    itemTime = 'null',
    kaibiaoTime = '2018-10-26 09:00',
    winTime = 'null',
    kaibiaoDiDian = 'null',
    yusuanMoney = '¥67.00 yuan (RMB)',
    allMoney = 'null',
    money = 'null',
    text = ''
}
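
The last step in the list at the top, saving the extracted data, is not shown in the article. Here is a minimal sketch, assuming one CSV line per record. `CsvWriterSketch`, the chosen columns, and the temporary output file are illustrative assumptions; a database or a JSON file would work just as well.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
class CsvWriterSketch {
    // Quotes a field when it contains a comma, quote, or line break; doubles embedded quotes.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Joins the escaped fields with commas into one CSV line.
    static String toCsvLine(List<String> fields) {
        return fields.stream().map(CsvWriterSketch::escape).collect(Collectors.joining(","));
    }

    // Appends one record to the file, creating the file if it does not exist yet.
    static void append(Path file, List<String> fields) throws IOException {
        Files.write(file, Collections.singletonList(toCsvLine(fields)), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("bids", ".csv");
        append(out, Arrays.asList("Litigation service center equipment procurement",
                "Gangbei District", "2018-10-26 09:00"));
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```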
