How to elegantly fetch a gzip-encoded page and save it locally (Java implementation)

Reposted article · 2018-10-30 11:29 · java

1. Introduction

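While scraping car sales data, I needed to fetch HTML pages and save them locally for later analysis. Some of these pages are served with gzip encoding, so the response has to be decompressed first; otherwise the saved file contains nothing but garbled bytes. After searching online for a while, I finally found an elegant solution.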
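To make the "garbled" symptom concrete, here is a minimal sketch that uses only the JDK. It is not part of the original solution: the class name `GzipRoundTrip` and the sample HTML string are made up for illustration, and the compressed byte array merely stands in for the body of a gzip-encoded response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class GzipRoundTrip {

    /** Compresses text (as UTF-8) into gzip format, imitating a gzip-encoded response body. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses gzip bytes and decodes the result as UTF-8 text. */
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>sales data</body></html>"; // made-up sample page
        byte[] body = gzip(html);

        // Decoding the compressed bytes directly does not give the page back.
        String garbled = new String(body, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(html));      // false

        // Decompressing first recovers the original text.
        System.out.println(gunzip(body).equals(html)); // true
    }
}
```

The same idea carries over to a real download: the bytes on the wire are the compressed form, and they only become readable HTML after passing through a `GZIPInputStream`.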

2. Open-source libraries used

        <!-- provides org.apache.commons.io.FileUtils, which the code below imports -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>

3. Implementation

    package com.reycg;

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.List;
    import java.util.zip.GZIPInputStream;

    import org.apache.commons.io.FileUtils;

    import com.google.common.base.Charsets;
    import com.google.common.io.ByteSource;
    import com.google.common.io.Resources;

    public class GzippedByteSource extends ByteSource {

        private final ByteSource source;

        public GzippedByteSource(ByteSource gzippedSource) {
            source = gzippedSource;
        }

        // Decompress on the fly by wrapping the underlying stream in a GZIPInputStream.
        @Override
        public InputStream openStream() throws IOException {
            return new GZIPInputStream(source.openStream());
        }

        public static void main(String[] args) throws IOException {
            URL url = new URL("..."); // TODO: put the address of the HTML page here
            String filePath = "1.html";
            List<String> lines = new GzippedByteSource(Resources.asByteSource(url))
                    .asCharSource(Charsets.UTF_8)
                    .readLines();
            // For an HTML page that is not gzip-encoded, use this line instead: (1)
            // List<String> lines = Resources.asCharSource(url, Charsets.UTF_8).readLines();
            FileUtils.writeLines(new File(filePath), lines);
        }
    }
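The class above assumes the page is always gzip-compressed. As an alternative sketch that uses only the JDK (this is a swapped-in technique, not the approach of the original post), the `Content-Encoding` response header can decide whether to decompress. The class name `HeaderAwareFetch`, its method names, and the command-line arguments are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical class, for illustration only.
final class HeaderAwareFetch {

    /** Wraps the body in a GZIPInputStream only when the header announces gzip. */
    static InputStream decode(String contentEncoding, InputStream body) throws IOException {
        if (contentEncoding != null) {
            String encoding = contentEncoding.trim();
            // "x-gzip" is the legacy alias of "gzip" in HTTP.
            if (encoding.equalsIgnoreCase("gzip") || encoding.equalsIgnoreCase("x-gzip")) {
                return new GZIPInputStream(body);
            }
        }
        return body;
    }

    /** Downloads the page and stores the decompressed bytes unchanged in the target file. */
    static void download(String address, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip"); // tell the server gzip is acceptable
        try (InputStream in = decode(conn.getContentEncoding(), conn.getInputStream())) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java HeaderAwareFetch <page-url> <output-file>");
            return;
        }
        download(args[0], Paths.get(args[1]));
    }
}
```

Copying bytes straight to the file avoids guessing the page's character set, at the cost of not getting the page as a list of lines. This variant only helps when the server sets `Content-Encoding` correctly.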

4. Notes

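1. If the following error is reported at runtime, the returned HTML page is not in gzip format:

Exception in thread "main" java.util.zip.ZipException: Not in GZIP format

In that case, fetch the page with the commented-out line marked (1) in the code above.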
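Instead of switching lines by hand after hitting the exception, the stream can be inspected before deciding. The following is a sketch under stated assumptions, not the original author's method: it peeks at the first two bytes and decompresses only when they match the gzip magic number (0x1f 0x8b). The class name `MaybeGzip` and its methods are hypothetical, and the in-memory byte arrays stand in for real responses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class MaybeGzip {

    /** Returns a decompressing stream if the data starts with the gzip magic number, else the data as-is. */
    static InputStream open(InputStream raw) {
        try {
            PushbackInputStream in = new PushbackInputStream(raw, 2);
            byte[] head = new byte[2];
            int filled = 0;
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read == -1) {
                    break; // stream is shorter than two bytes
                }
                filled += read;
            }
            if (filled > 0) {
                in.unread(head, 0, filled); // put the peeked bytes back
            }
            boolean looksGzipped = filled == 2
                    && (head[0] & 0xff) == 0x1f
                    && (head[1] & 0xff) == 0x8b;
            return looksGzipped ? new GZIPInputStream(in) : in;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole stream as UTF-8 text, decompressing it first when needed. */
    static String readUtf8(InputStream raw) {
        try (InputStream in = open(raw)) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: compresses text so the example does not need a network connection. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = "<html>plain</html>".getBytes(StandardCharsets.UTF_8);
        byte[] zipped = gzip("<html>zipped</html>");
        System.out.println(readUtf8(new ByteArrayInputStream(plain)));  // <html>plain</html>
        System.out.println(readUtf8(new ByteArrayInputStream(zipped))); // <html>zipped</html>
    }
}
```

With Guava, the same `open` method could presumably be called from a `ByteSource.openStream()` override in place of the unconditional `GZIPInputStream` wrapper. The check is a heuristic: an uncompressed file that happens to begin with those two bytes would be misdetected, although that is unlikely for HTML.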

5. Postscript

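The car sales data is mainly used for display in a WeChat Mini Program I developed myself, 汽車銷量查詢小助手 (a car sales lookup assistant). If you are interested, search for 汽車銷量查詢小助手 in WeChat Mini Programs to see it in action. Suggestions and comments are welcome.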
