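【The Problem】
autohome is a car portal, and its forum sometimes has threads that are a pleasure to read, such as "A Family of Four Drives Around China": the main posts fill over 100 pages and the replies over 4,000.
However, the forum's JS is not well written. When a page contains a lot of images, many of them often fail to display, which is frustrating.
So I decided to download the thread for offline browsing. There are plenty of ready-made offline browsers, true, but none of them can download the images here, because the forum plays a small trick with its image URLs.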
【Analysis】
1. URL pattern
Page 1 is http://club.autohome.com.cn/bbs/thread-o-200042-19582947-1.html
Page 2 is http://club.autohome.com.cn/bbs/thread-o-200042-19582947-2.html
So page N is http://club.autohome.com.cn/bbs/thread-o-200042-19582947-N.html
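The pattern above can be turned into a small URL generator. This is only a sketch: the class and method names (ThreadPages, PageUrls) are made up for illustration, and it assumes the page-1 URL always ends in "-1.html".

```csharp
using System;
using System.Collections.Generic;

public static class ThreadPages
{
    // Hypothetical helper: yields the URL of every page from fromPage to
    // toPage (inclusive), given the URL of page 1.
    // Assumption: the page-1 URL ends in "-1.html".
    public static IEnumerable<string> PageUrls(string firstPageUrl, int fromPage, int toPage)
    {
        const string suffix = "-1.html";
        if (!firstPageUrl.EndsWith(suffix, StringComparison.Ordinal))
            throw new ArgumentException("Expected a URL ending in -1.html", "firstPageUrl");

        // Everything before the trailing "-1.html" is shared by all pages.
        string prefix = firstPageUrl.Substring(0, firstPageUrl.Length - suffix.Length);
        for (int page = fromPage; page <= toPage; page++)
            yield return prefix + "-" + page + ".html";
    }
}
```

Only the trailing page number is rewritten, so a "-1." that happens to appear earlier in the URL is left alone.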
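2. Image analysis
In the page source, the HTML for an image looks like this:
    <img id="img-0-8" name="lazypic" onload="tz.picLoaded(this)" onerror="tz.picNotFind(this)" style="width:700px;height:464px" src="http://x.autoimg.cn/club/lazyload.png" src9="http://club1.autoimg.cn/album/userphotos/2013/2/25/500_9bed_79b1f6c8_79b1f6c8.jpg" />
The default src is a placeholder "loading" image, and the real URL is in the src9 attribute. The onload handler swaps the real URL into src to display the image, and the onerror handler runs on timeout or error.
So to download the images, we only need to grab src9.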
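A minimal sketch of pulling the src9 values out of a page with a regular expression. The character class is the same one used in the download code later in this post; the names LazyImages and ExtractImageUrls are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyImages
{
    // Matches src9="http://..." and captures just the URL in group 1.
    private static readonly Regex Src9 =
        new Regex("src9=\"(http://[a-zA-Z0-9_./]+)\"");

    // Hypothetical helper: returns every src9 URL found in the HTML, in order.
    public static List<string> ExtractImageUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Src9.Matches(html))
            urls.Add(m.Groups[1].Value);
        return urls;
    }
}
```

The placeholder in the plain src attribute is not matched, because the pattern requires the literal text src9=.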
3. Trying to fetch an image
Take the image URL above, http://club1.autoimg.cn/album/userphotos/2013/2/25/500_9bed_79b1f6c8_79b1f6c8.jpg . If you download it directly, the server refuses, because it treats the request as hotlinking.
So it is best to issue a full HTTP/1.1 request and set the request.Referer property. Fiddler can be used to see what the browser actually sends:
    GET http://club1.autoimg.cn/album/userphotos/2013/4/3/500_43be_cde0b51b_cde0b51b.jpg HTTP/1.1
    Accept: */*
    Referer: http://club.autohome.com.cn/bbs/thread-o-200042-19582947-1.html
    Accept-Language: zh-CN
    User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; KB974487)
    Accept-Encoding: gzip, deflate
    Connection: Keep-Alive
    DNT: 1
    Host: club1.autoimg.cn
To make the images easy to display offline, I recommend saving each one under its original path. For example, the image http://club1.autoimg.cn/album/userphotos/2013/4/3/500_43be_cde0b51b_cde0b51b.jpg is saved as club1.autoimg.cn/album/userphotos/2013/4/3/500_43be_cde0b51b_cde0b51b.jpg under the save directory.
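The path mapping can be sketched as follows. LocalPaths.ForImage is a hypothetical name, and using System.Uri to split the URL is my own choice here; the download code later in this post simply strips the "http://" prefix instead.

```csharp
using System;
using System.IO;

public static class LocalPaths
{
    // Hypothetical helper: maps an image URL to a file path under saveDir
    // that mirrors the URL's host and path.
    public static string ForImage(string saveDir, string imageUrl)
    {
        var uri = new Uri(imageUrl);
        // Host plus path, e.g. club1.autoimg.cn/album/userphotos/.../x.jpg
        string relative = uri.Host + uri.AbsolutePath;
        // Use the platform's separator so the result is a normal local path.
        relative = relative.Replace('/', Path.DirectorySeparatorChar);
        return Path.Combine(saveDir, relative);
    }
}
```

Before writing the file, the containing folder still has to be created, for example with Directory.CreateDirectory(Path.GetDirectoryName(path)).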
4. Pagination links
For comfortable browsing, the pagination links must still work after downloading.
    <div class="pages fs">
    <a href="forum-o-200042-1.html">返回列表</a></div>
    <div class="pages" id="x-pages1" maxindex="4927"><span class="cur">1</span><a target="_self" href="thread-o-200042-19582947-2.html">2</a><a target="_self" href="thread-o-200042-19582947-3.html">3</a><a target="_self" href="thread-o-200042-19582947-4.html">4</a><a target="_self" href="thread-o-200042-19582947-5.html">5</a><span>...</span><a target="_self" href="thread-o-200042-19582947-4927.html">4927</a><span class="gopage"><input type="text" value="1" title="輸入頁碼,按回車快速跳轉" onkeydown="if(event.keyCode==13){tz.goPage(this)}" /><span class="fs" title="共 4927 頁"> / 4927 頁</span></span><a target="_self" class="afpage" href="thread-o-200042-19582947-2.html" title="支持鍵盤 ← → 鍵翻頁">下一頁</a></div>
    <div class="jfwen">
    到第<span><input type="text" value="" class="topinp txtcenter" id="txtGoFloor1" maxlength="7"
    title="輸入樓層數,按回車快速定位" onkeydown="if(event.keyCode==13){tz.goFloor(null,'txtGoFloor1')}" /></span>樓</div>
The links are already relative and consist of just the HTML file name, so it is enough to save each page under its original file name.
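5. Filtering out unneeded code
Disable the onload and onerror handlers, replace the opening and closing script tags with DIV tags, and rewrite the leading http:// of URLs to a local ./ path, so that local browsing is easy and does not waste resources.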
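The filtering steps can be sketched as one pure function, using plain string replacement. The names OfflineHtml and Clean are hypothetical, and the placeholder URL is taken from the image sample earlier; this is not meant as a general HTML sanitizer.

```csharp
public static class OfflineHtml
{
    // Hypothetical helper: rewrites a downloaded page for local viewing.
    public static string Clean(string html)
    {
        // Drop the placeholder and promote the real URL from src9 to src.
        // This must run before "http://" is rewritten below.
        html = html.Replace("src=\"http://x.autoimg.cn/club/lazyload.png\" src9=\"", "src=\"");
        // Rename the event attributes so the handlers never run.
        html = html.Replace("onload=", "x1=");
        html = html.Replace("onerror=", "x2=");
        // Turn script blocks into hidden DIVs so they neither run nor show.
        html = html.Replace("<script", "<div style=\"display:none\" ");
        html = html.Replace("</script", "</div");
        // Point absolute URLs at the local mirror folder.
        html = html.Replace("http://", "./");
        return html;
    }
}
```

Because every http:// is rewritten, links to pages that were not downloaded will no longer resolve; for offline reading of a single thread that trade-off seems acceptable.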
【Solution】
1. HTTP
    using System;
    using System.Net;

    public class HttpWebResponseUtility
    {
        private static readonly string DefaultUserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

        /// <summary>
        /// Creates an HTTP GET request and returns its response (borrowed code, lightly modified).
        /// </summary>
        /// <param name="url">URL to request</param>
        /// <param name="timeout">Request timeout in milliseconds; null keeps the default</param>
        /// <param name="userAgent">Browser User-Agent string; may be null or empty</param>
        /// <param name="referer">URL of the referring page</param>
        /// <param name="cookies">Cookies to send with the request; may be null if no authentication is needed</param>
        /// <returns>The response; the caller is responsible for closing it</returns>
        public static HttpWebResponse CreateGetHttpResponse(string url, int? timeout, string userAgent, string referer, CookieCollection cookies)
        {
            if (string.IsNullOrEmpty(url))
            {
                throw new ArgumentNullException("url");
            }
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.Method = "GET";
            request.UserAgent = DefaultUserAgent;
            if (!string.IsNullOrEmpty(userAgent))
            {
                request.UserAgent = userAgent;
            }
            if (timeout.HasValue)
            {
                request.Timeout = timeout.Value;
            }
            if (referer != null)
            {
                // Without a Referer the image server rejects the request as hotlinking.
                request.Referer = referer;
            }
            if (cookies != null)
            {
                request.CookieContainer = new CookieContainer();
                request.CookieContainer.Add(cookies);
            }
            return request.GetResponse() as HttpWebResponse;
        }
    }
2. The main action button
    // Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
    private void btnStart_Click(object sender, EventArgs e)
    {
        btnStart.Enabled = false;
        btnStop.Enabled = true;
        timer1.Enabled = true;

        var dt1 = DateTime.Now;
        var dir = tbSaveDir.Text;
        var baseUrl = tbURL.Text.Replace("-1.", "-#.");
        var totalPage = (int) tbTotalPage.Value;
        var fromPage = (int) tbFromPage.Value;
        var isDefault = radioButton1.Checked;

        // Progress bars: start at 0 so that one step corresponds to one page
        progressBar1.Maximum = totalPage - fromPage + 1;
        progressBar1.Step = 1;
        progressBar1.Value = 0;
        progressBar2.Value = 0;

        for (int i = fromPage; i <= totalPage; i++)
        {
            progressBar1.PerformStep();

            var url = baseUrl.Replace("#", i.ToString());
            try
            {
                string html;
                // The page URL doubles as its own Referer.
                using (var response = HttpWebResponseUtility.CreateGetHttpResponse(url, null, null, url, null))
                {
                    if (response == null)
                    {
                        continue;
                    }
                    using (var sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"), true))
                    {
                        html = sr.ReadToEnd();
                    }
                }

                // Collect the real image URLs before the HTML is rewritten.
                string pattern = "src9=\"http://[a-zA-Z0-9_./]+\"";
                var gc = Regex.Matches(html, pattern);

                // Rewrite the HTML for local viewing.
                html = html.Replace("src=\"http://x.autoimg.cn/club/lazyload.png\" src9=\"", "src=\"");
                html = html.Replace("http://", "");
                html = html.Replace("onload=", "x1=");
                html = html.Replace("onerror=", "x2=");
                html = html.Replace("<script", "<DIV style=\"display:none\" ");
                html = html.Replace("</script", "</DIV");

                // Overwrite rather than append, so re-running does not duplicate the page.
                var htmlFile = Path.Combine(dir, Path.GetFileName(url.Replace("http://", "")));
                File.WriteAllText(htmlFile, html, Encoding.GetEncoding("gb2312"));
                tbLog.AppendText(htmlFile + " ok");

                // Queue every image for the download threads started by the timer.
                var imgNum = 0;
                foreach (Match match in gc)
                {
                    var imgurl = match.Value.Replace("\"", "").Replace("src9=", "");
                    _myQue.Enqueue(new ParamEntity(dir, imgurl, url, isDefault));
                    imgNum++;
                }

                tbLog.AppendText(", " + imgNum + " image(s)" + Environment.NewLine);
                _totalNum += imgNum;
            }
            catch (Exception exception)
            {
                MessageBox.Show(exception.Message);
            }
        }

        btnStart.Enabled = true;
        var timeUse = DateTime.Now - dt1;
        MessageBox.Show(string.Format("Page download finished in {0} minutes. Please wait for the image downloads to finish, then open the folder {1} to view the result.", timeUse.TotalMinutes.ToString("F2"), dir));
    }
3. Downloading images
    private void DownloadImage(object obj)
    {
        var pe = (ParamEntity) obj;
        var tmp = pe.ImgUrl.Replace("http://", "");
        // Mirror the URL path under the save directory.
        var filename = Path.Combine(pe.SaveDir, tmp);
        try
        {
            // Does nothing if the directory already exists.
            Directory.CreateDirectory(Path.GetDirectoryName(filename));

            // Several threads run this method at once, so increment atomically
            // (assumes _runNum is an int field).
            Interlocked.Increment(ref _runNum);
            if (pe.IsType1)
            {
                // Plain download, no Referer header.
                using (var wc = new WebClient())
                {
                    wc.DownloadFile(pe.ImgUrl, filename);
                }
            }
            else
            {
                // Download with the page URL as Referer to get past the hotlink check.
                using (var imgres = HttpWebResponseUtility.CreateGetHttpResponse(pe.ImgUrl, null, null, pe.PageUrl, null))
                {
                    if (imgres != null)
                    {
                        using (var reader = imgres.GetResponseStream())
                        // FileMode.Create truncates an existing file, so no stale bytes remain.
                        using (var writer = new FileStream(filename, FileMode.Create, FileAccess.Write))
                        {
                            var buff = new byte[512];
                            int c; // number of bytes actually read
                            while ((c = reader.Read(buff, 0, buff.Length)) > 0)
                            {
                                writer.Write(buff, 0, c);
                            }
                        }
                    }
                }
            }
        }
        catch (Exception e)
        {
            // Queue is not thread-safe, so guard access from the worker threads.
            lock (_logQue)
            {
                _logQue.Enqueue(Path.GetFileName(tmp) + " fail. " + e.Message);
            }
        }
    }
4. The timer handler
    private void timer1_Tick(object sender, EventArgs e)
    {
        progressBar2.Maximum = _totalNum;

        // Start up to 5 download threads per tick.
        var started = 0;
        for (var i = 1; i <= 5; i++)
        {
            if (_myQue.Count > 0)
            {
                var pe = (ParamEntity) _myQue.Dequeue();
                var thread = new Thread(new ParameterizedThreadStart(DownloadImage));
                thread.Start(pe);
                started++;
            }
        }
        // Advance by the number of downloads actually started, not a fixed 5.
        progressBar2.Increment(started);

        // Show one pending failure message per tick.
        lock (_logQue)
        {
            if (_logQue.Count > 0)
            {
                tbLog.AppendText(_logQue.Dequeue().ToString() + Environment.NewLine);
            }
        }

        Application.DoEvents();
    }
I did not originally want to use a timer. I wanted a queue that would move on to the next item by itself as soon as one finished, but I did not manage to write it.
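For reference, here is one way the self-draining queue could look, as a sketch only. It uses BlockingCollection<T> from .NET 4 rather than the Queue in the code above, and the WorkQueue class and its members are hypothetical names. A fixed number of workers each take the next item as soon as they finish the previous one, so no timer is needed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original program.
public sealed class WorkQueue<T> : IDisposable
{
    private readonly BlockingCollection<T> _items = new BlockingCollection<T>();
    private readonly Task[] _workers;

    public WorkQueue(int workerCount, Action<T> handler, Action<T, Exception> onError)
    {
        _workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            _workers[i] = Task.Factory.StartNew(() =>
            {
                // Blocks while the queue is empty; the loop ends once
                // CompleteAdding() has been called and the queue is drained.
                foreach (T item in _items.GetConsumingEnumerable())
                {
                    try { handler(item); }
                    // A failed item is reported but must not stop the worker.
                    catch (Exception ex) { onError(item, ex); }
                }
            }, TaskCreationOptions.LongRunning);
        }
    }

    // Safe to call from any thread.
    public void Enqueue(T item)
    {
        _items.Add(item);
    }

    // Signals that no more items will come, then waits for the workers.
    public void CompleteAndWait()
    {
        _items.CompleteAdding();
        Task.WaitAll(_workers);
    }

    public void Dispose()
    {
        _items.Dispose();
    }
}
```

In this program the handler would be DownloadImage and the items ParamEntity objects. The handler runs on worker threads, so any update to the log box or progress bar from it would still have to be marshalled back to the UI thread, for example with Control.BeginInvoke.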
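【Technical points that may be useful】
1. HTTP requests with a Referer header
2. Multithreading so the UI does not block (BackgroundWorker; I have not switched to it yet)
3. ProgressBar
4. Queue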
【Finished product】

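【Takeaways】
I am not aiming for the best solution, only for peace of mind. Beginners, have a look; experts, please point out what could be better.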
