Google Image Download Tool

Reposted from the original article, 2012-07-11 22:06. Tags: C# / Google / RE / image download

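For my graduation project I needed to download tens of thousands of images from the web for experiments. I used to rely on a Flickr downloader written by a senior labmate, built on the Flickr API. But Flickr is mostly user-shared photos, and searching by a given keyword often does not return satisfactory images. Google and Bing give better search results, but Google retired its search SDK long ago, and the CodeProject samples are years old; none of the ones I tried still work. Bing still has an SDK for now, but according to the official announcement it will also be retired around August, and the download quota currently offered is one image per day. With no other option, I had to come up with something myself.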

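Looking at the source of a Google Images results page, I noticed that the href attribute of each <a> tag contains the URL of the original image. That suggested parsing the results page: cut out the original image URLs, then download the original images from those URLs. Cutting out the URLs can be done with a regular expression (RE).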

The other problem is how to obtain the results page.

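Looking at the URL of a results page, I saw that it contains the search keyword along with some other information. So I wondered: could I build that URL by string concatenation, send a web request to Google, and get the results page back in the response?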

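The URL is copied straight from the browser's address bar on a results page, with the values of the q and oq parameters replaced by the search keyword. For example, with the keyword answer, I sent the request and saved the response, which turned out to be a results page.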

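The page held 21 images in total. Evidently Google restricts what it returns for this kind of web request: unlike in a browser, it does not return many pages at once, only 21 images per request.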

Then I clicked "2" in the pagination links to look at the second page, and found that the URL had changed slightly.

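The key is the start parameter! start=21 suggests that pagination is driven by this parameter. I could not find a parameter that specifies the end, which is a little annoying. Still, it means that when building the URL I can set start to different values, send several requests, and get several pages of results. It is a bit clumsy, but it does the job.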

Nice!

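The URL I ended up using is shown here as the format string from the code, where {0} is the keyword and {1} is the start value:

http://www.google.com.hk/search?q={0}&hl=zh-CN&newwindow=1&safe=strict&biw=1280&bih=699&gbv=2&ie=UTF-8&tbm=isch&ei=2HblT4vrCISwiQeavLhZ&start={1}&sa=N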

q specifies the search keyword, and start gives the index at which each results page begins.
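As a rough illustration of the paging idea, the sketch below builds one URL per results page by stepping start in multiples of 21. The class name SearchUrlBuilder and the shortened URL template are hypothetical; only the q, start, tbm=isch and ie=UTF-8 parameters and the page size of 21 come from the post, and Google may well behave differently today.

```csharp
using System;
using System.Collections.Generic;

public static class SearchUrlBuilder
{
    // Page size observed in the post: 21 images per response.
    public const int PageSize = 21;

    // Hypothetical shortened template; the URL used in the post carries more parameters.
    private const string Template =
        "http://www.google.com.hk/search?q={0}&tbm=isch&ie=UTF-8&start={1}";

    // Builds one URL per results page: start = 0, 21, 42, ...
    public static List<string> BuildPageUrls(string keyword, int pages)
    {
        var urls = new List<string>();
        // Percent-encode the keyword so spaces and non-ASCII text are valid in a query string.
        string encoded = Uri.EscapeDataString(keyword);
        for (int i = 0; i < pages; i++)
        {
            urls.Add(string.Format(Template, encoded, PageSize * i));
        }
        return urls;
    }
}
```

Calling SearchUrlBuilder.BuildPageUrls("answer", 10) would give the ten URLs for start=0 through start=189.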

That is the basic idea for searching Google by keyword and downloading the images programmatically.

A few issues are worth recording:

1. The returned search pages.

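Sending requests to Google and then cutting URLs out of the results with the RE, all in memory, uses too much memory. In my experiments one returned page is about 200 KB. With 100 keywords downloading at once and 200 images per keyword (that is, 10 pages each), just holding the pages in memory takes about 200 MB (200 KB × 10 pages × 100 keywords), before the RE does any work. When I actually ran it, memory was exhausted, and my machine has 4 GB of RAM. On top of that, network problems sometimes interrupt the download. So I decided to save the search results to disk first and cut out the URLs afterwards.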
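One way to keep memory low once the pages are on disk is to scan the saved file lazily instead of loading it as one string. The sketch below is only an illustration under assumptions: SavedPageScanner is a hypothetical name, the URL pattern is a simplified stand-in rather than the RE used in the post, and the approach helps only if the saved HTML actually contains line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class SavedPageScanner
{
    // Simplified URL pattern for illustration only (not the RE used in the post).
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s""'<>]+", RegexOptions.Compiled);

    // Lazily yields URLs from a saved results file, one line at a time,
    // so the whole file never has to be held in memory as a single string.
    public static IEnumerable<string> ExtractUrls(string path)
    {
        foreach (string line in File.ReadLines(path))
        {
            foreach (Match m in UrlPattern.Matches(line))
            {
                yield return m.Value;
            }
        }
    }
}
```

Because the method is an iterator, a caller can start downloading the first URLs before the rest of the file has been read.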

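2. Cutting out image URLs with the RE.

My RE skills are not good enough, and the ones I wrote myself always had problems, so I found the following one online (shown here as it appears in the code, without the C# string escaping):

(http[s]{0,1}|ftp)://[a-zA-Z0-9\.\-]+\.([a-zA-Z]{2,4})(:\d+)?(/[a-zA-Z0-9\.\-~!@#$%^&*+?:_/=<>]*)?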

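This one only extracts URLs in general; some filtering by file extension is still needed to get the image URLs. I am still working out how to write a more advanced RE that cuts out the image URLs directly.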
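A pattern that matches image URLs directly might look like the sketch below. It is an assumption-laden illustration, not the author's RE: ImageUrlFilter is a hypothetical name, the pattern assumes an image URL's path ends in .jpg, .jpeg, .png or .gif, and it deliberately excludes & and ? from the path so that a match stops before any query string or HTML entity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageUrlFilter
{
    // Assumption: an image URL is http(s)://host[:port]/path ending in a known image extension.
    private static readonly Regex ImageUrl = new Regex(
        @"https?://[a-zA-Z0-9.\-]+(:\d+)?/[a-zA-Z0-9.\-~!@#$%^*+:_/=]*\.(jpe?g|png|gif)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every image URL found in the given HTML text, in order of appearance.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in ImageUrl.Matches(html))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

With this shape of pattern the separate extension check and the trimming of a trailing "&amp" are no longer needed, at the cost of missing images served from URLs without a file extension.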

3. Multithreaded downloading.

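Since I was downloading images for more than 100 keywords at once, I wanted to use multiple threads and, bandwidth aside, hoped this would speed things up. At first I started one thread per keyword. That was still slow: the quota of 200 images per keyword took 1-2 hours, admittedly with many keywords downloading at the same time. For 72 keywords I actually got a bit over 8,000 images, and it took a whole night, 6 or 7 hours. Unbearable.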

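Next, I tried starting one thread per download link within each keyword. With that many threads running at once, thread resources ran short and threads died frequently. Under the same network conditions it was indeed faster for certain stretches, but slower overall. I also tried ThreadPool, with similar results.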

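The ideal I have in mind is a pool of 200 threads: a job requests a thread when there is work and releases it when done, and if none is free the requester waits until one becomes available. I have not implemented this yet; I plan to work on it soon.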
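The "wait until a worker is free" behaviour described above can be approximated with a counting semaphore. The sketch below is one possible shape, not the author's implementation: BoundedRunner is a hypothetical name, it swaps dedicated threads for Task.Run (which assumes .NET 4.5 or later), and it caps concurrency with SemaphoreSlim.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedRunner
{
    // Runs every job, but never more than maxParallel of them at the same time.
    public static void RunAll(IEnumerable<Action> jobs, int maxParallel)
    {
        using (var slots = new SemaphoreSlim(maxParallel))
        {
            var running = new List<Task>();
            foreach (Action job in jobs)
            {
                slots.Wait();                 // the requester blocks here until a slot is free
                Action current = job;         // capture a stable copy for the lambda
                running.Add(Task.Run(() =>
                {
                    try { current(); }
                    finally { slots.Release(); } // always hand the slot back, even on failure
                }));
            }
            // Wait for the tail of the work; rethrows job failures as an AggregateException.
            Task.WaitAll(running.ToArray());
        }
    }
}
```

For the downloader, each job would wrap one image download, and maxParallel would be tuned well below the number of links; whether 200 is a good value depends on the network and the machine.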

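In addition, the downloader writes a log, renumbers the images, and keeps a dictionary mapping each image name to its URL.

The program code follows: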
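The renumbering and the name-to-URL mapping can be pictured with the small sketch below. ImageNameMap and its members are hypothetical names; the keyword_NNN.ext naming and the tab-separated output follow the format used in the listing, while taking the extension from the URL path via Uri is a swapped-in simplification.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class ImageNameMap
{
    private readonly string keyword;
    private readonly Dictionary<string, string> nameToUrl = new Dictionary<string, string>();
    private int count;

    public ImageNameMap(string keyword)
    {
        this.keyword = keyword;
    }

    // Assigns the next sequential name (keyword_000.jpg, keyword_001.png, ...) and records its URL.
    public string Add(string url)
    {
        // AbsolutePath drops any query string, so "b.jpg?x=1" still yields ".jpg".
        string extension = Path.GetExtension(new Uri(url).AbsolutePath);
        string name = string.Format("{0}_{1:000}{2}", keyword, count, extension);
        count++;
        nameToUrl[name] = url;
        return name;
    }

    public string UrlOf(string name)
    {
        return nameToUrl[name];
    }

    // Writes one "name<TAB>url" line per image.
    public void Save(string path)
    {
        using (var writer = new StreamWriter(path, false))
        {
            foreach (KeyValuePair<string, string> pair in nameToUrl)
            {
                writer.WriteLine("{0}\t{1}", pair.Key, pair.Value);
            }
        }
    }
}
```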

Program.cs

using System;
using System.IO;
using System.Threading;

namespace GoogleImageDownload
{
    static class Program
    {
        /// <summary>
        /// Main entry point of the application.
        /// </summary>
        [STAThread]
        static void Main()
        {
            // keywords.txt holds the search keywords, separated by commas.
            string keywordPool = File.ReadAllText("keywords.txt");
            string[] keywords = keywordPool.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string rawKeyword in keywords)
            {
                // Drop surrounding whitespace and newlines; skip blank entries.
                string keyword = rawKeyword.Trim();
                if (keyword.Length == 0)
                    continue;

                // One downloader instance and one thread per keyword.
                GoogleImages bi = new GoogleImages();
                Thread download = new Thread(new ParameterizedThreadStart(bi.DownLoadImages));
                download.Name = "Thread_" + keyword;
                download.Start(keyword);
                // WaitCallback wc = new WaitCallback(bi.DownLoadImages);
                // ThreadPool.QueueUserWorkItem(wc, keyword);
            }
        }
    }
}

GoogleImages

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace GoogleImageDownload
{
    public class GoogleImages
    {
        /// <summary>
        /// Request URL for Google, built by string formatting.
        /// Parameter 0: the search keyword.
        /// Parameter 1: index of the first result to return. Each page holds 21 results by default,
        /// so values of 0, 21, 42, 63, ... fetch successive pages.
        /// </summary>
        private const string IMG_URL = "http://www.google.com.hk/search?q={0}&hl=zh-CN&newwindow=1&safe=strict&biw=1280&" +
            "bih=699&gbv=2&ie=UTF-8&tbm=isch&ei=2HblT4vrCISwiQeavLhZ&start={1}&sa=N";
        /// <summary>
        /// Number of result pages to request (10 by default).
        /// </summary>
        private const int PAGES = 10;
        /// <summary>
        /// The four kinds of error that can be logged.
        /// </summary>
        public static readonly string[] ERRORS = { "GetDownloadInfo", "CreateImageDownloadLink", "SaveToLocal", "RenameImage" };
        private string logFile = "";
        private string downloadFolder = "";
        private string downloadObj = "";

        /// <summary>
        /// Download images from Google
        /// </summary>
        public void DownLoadImages(object Obj)
        {
            DateTime tStart = DateTime.Now;
            this.downloadObj = (string)Obj;

            // Images:        one folder per keyword, holding the downloaded images
            // DownloadInfos: the Google result pages fetched for the keyword, plus the image URLs found in them
            // Log:           exceptions raised while downloading
            #region Create the folders and files that hold download information

            // Image folder. CreateDirectory does nothing if the folder already exists.
            this.downloadFolder = String.Format(".\\Images\\{0}", downloadObj);
            Directory.CreateDirectory(this.downloadFolder);

            // Download-info folder
            Directory.CreateDirectory(".\\DownloadInfos");

            string resHtmlFile = ".\\DownloadInfos\\" + downloadObj + "_res.txt";
            if (!File.Exists(resHtmlFile))
            {
                // File.Create returns an open FileStream; close it so later writers can open the file.
                File.Create(resHtmlFile).Close();
            }

            string resImageList = ".\\DownloadInfos\\" + downloadObj + "_img.txt";
            if (!File.Exists(resImageList))
            {
                File.Create(resImageList).Close();
            }

            // Log folder
            Directory.CreateDirectory(".\\Log");

            this.logFile = ".\\Log\\" + downloadObj + ".log";
            if (!File.Exists(logFile))
            {
                File.Create(logFile).Close();
            }

            #endregion

            // Decide how many pages to fetch; this mimics paging through the results by hand.
            // The default is 10 pages of about 21 images each.

            #region Send 10 requests and append the 10 returned Google result pages to one text file

            for (int i = 0; i < PAGES; i++)
            {
                // URL-encode the keyword so spaces and non-ASCII characters are valid in the query string.
                string url = string.Format(IMG_URL, Uri.EscapeDataString(downloadObj), 21 * i);
                try
                {
                    HttpWebRequest r = (HttpWebRequest)WebRequest.Create(url);
                    r.AllowAutoRedirect = true;
                    r.CookieContainer = new CookieContainer();
                    using (HttpWebResponse res = (HttpWebResponse)r.GetResponse())
                    {
                        if (res.StatusCode == HttpStatusCode.OK)
                        {
                            Console.WriteLine("start");
                            // Read as UTF-8: the oe parameter (output encoding) defaults to utf-8.
                            using (StreamReader s = new StreamReader(res.GetResponseStream(), Encoding.UTF8))
                            using (StreamWriter sw = new StreamWriter(resHtmlFile, true))
                            {
                                sw.Write(s.ReadToEnd());
                            }
                            Console.WriteLine(downloadObj + " " + i);
                        }
                    }
                }
                catch (Exception ex)
                {
                    // 0 = GetDownloadInfo
                    PrintException(0, downloadObj, ex.ToString());
                }

                Console.WriteLine("end");
            }

            #endregion

            #region Cut the image URLs out of the text file with the RE and download the images

            string result = File.ReadAllText(resHtmlFile);

            // RE for general URLs
            string strRegex = "(http[s]{0,1}|ftp)://[a-zA-Z0-9\\.\\-]+\\.([a-zA-Z]{2,4})(:\\d+)?(/[a-zA-Z0-9\\.\\-~!@#$%^&*+?:_/=<>]*)?";
            Regex re = new Regex(strRegex);
            MatchCollection matches = re.Matches(result);

            int count = 0;
            foreach (Match img in matches)
            {
                string tmp = img.Value;
                // The match stops at the ';' of an HTML "&amp;" entity; trim the trailing "&amp" it leaves behind.
                if (tmp.EndsWith("&amp", StringComparison.Ordinal))
                    tmp = tmp.Substring(0, tmp.Length - 4);
                // Filter the URLs, keeping only image URLs
                if (tmp.Contains(".jpg") || tmp.Contains(".png") || tmp.Contains(".jpeg") || tmp.Contains(".gif"))
                {
                    string newFileName = "";
                    string[] split = tmp.Split(new char[] { '/' });
                    string originalName = split[split.Length - 1];
                    // Give the image a new name
                    try
                    {
                        FileInfo fi = new FileInfo(originalName);
                        newFileName = String.Format("{0}_{1}{2}", this.downloadObj, count.ToString("000"), fi.Extension);
                        count++;
                        // Append "newFileName<TAB>ImageUrl" to the list in DownloadInfos
                        using (StreamWriter sw2 = new StreamWriter(resImageList, true))
                        {
                            sw2.WriteLine(String.Format("{0}\t{1}", newFileName, tmp));
                        }
                    }
                    catch (Exception ex)
                    {
                        // 3 = RenameImage
                        PrintException(3, originalName, ex.ToString());
                        continue; // no valid file name, so skip this URL
                    }

                    Console.WriteLine(originalName + " is downloading");
                    /////////////////////
                    // Download Images //
                    /////////////////////
                    SavePhotoFromUrl(newFileName, tmp);
                    // ThreadPool.QueueUserWorkItem(new WaitCallback(this.SavePhotoFromUrl), new string[] { newFileName, tmp });
                    // Thread save = new Thread(new ParameterizedThreadStart(this.SavePhotoFromUrl));
                    // save.Name = "Thread_" + newFileName;
                    // save.Start(new string[] { newFileName, tmp });
                    Console.WriteLine(originalName + " has been downloaded");
                }
            }

            #endregion

            // Log the total download time
            DateTime tEnd = DateTime.Now;
            TimeSpan cost = tEnd - tStart;
            PrintTime(cost.ToString());
        }

        /// <summary>
        /// Saves the image at Url into the download folder under the name FileName.
        /// </summary>
        public bool SavePhotoFromUrl(string FileName, string Url)
        // public void SavePhotoFromUrl(Object paras)
        {
            // string[] Para = (string[])paras;
            // string FileName = Para[0];
            // string Url = Para[1];
            bool Value = false;

            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
                // Dispose the response so the connection is released even if saving fails.
                using (WebResponse response = request.GetResponse())
                {
                    Value = SaveBinaryFile(response, this.downloadFolder + "\\" + FileName);
                }
            }
            catch (Exception ex)
            {
                // 1 = CreateImageDownloadLink
                PrintException(1, Url, ex.ToString());
            }
            return Value;
        }
        /// <summary>
        /// Saves the image to a local file.
        /// </summary>
        /// <param name="response">The response whose body is the image data</param>
        private bool SaveBinaryFile(WebResponse response, string FileName)
        {
            bool Value = true;
            byte[] buffer = new byte[1024];

            try
            {
                // Skip files that were already downloaded.
                if (File.Exists(FileName))
                {
                    return true;
                }
                // using closes both streams even if the copy throws.
                using (Stream inStream = response.GetResponseStream())
                using (Stream outStream = File.Create(FileName))
                {
                    int l;
                    while ((l = inStream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        outStream.Write(buffer, 0, l);
                    }
                }
            }
            catch (Exception ex)
            {
                // 2 = SaveToLocal
                PrintException(2, FileName, ex.ToString());
                Value = false;
            }
            return Value;
        }

        /// <summary>
        /// Logs an error raised during downloading.
        /// Four kinds of error:
        /// 0: GetDownloadInfo           error while requesting download information from Google
        /// 1: CreateImageDownloadLink   error while opening a connection to an image URL
        /// 2: SaveToLocal               error while saving to disk after the connection was opened
        /// 3: RenameImage               error while renaming the file to the standard naming scheme
        /// </summary>
        /// <param name="type">Error type</param>
        /// <param name="obj">The object the error relates to</param>
        /// <param name="exceptionInfo">Error details</param>
        private void PrintException(int type, string obj, string exceptionInfo)
        {
            try
            {
                using (StreamWriter sPrint = new StreamWriter(this.logFile, true))
                {
                    sPrint.WriteLine(String.Format("TYPE:{0}\tOBJECT:[{1,-30}]\nERROR:{2}\n", GoogleImages.ERRORS[type], obj, exceptionInfo));
                }
            }
            catch (Exception)
            {
                // A logging failure must not interrupt the download, so it is ignored.
            }
        }

        /// <summary>
        /// Logs the time the download took.
        /// </summary>
        /// <param name="time">Elapsed download time</param>
        private void PrintTime(string time)
        {
            try
            {
                using (StreamWriter sPrint = new StreamWriter(this.logFile, true))
                {
                    sPrint.WriteLine(String.Format("Download Cost:{0}", time));
                }
            }
            catch (Exception)
            {
                // A logging failure must not interrupt the download, so it is ignored.
            }
        }
    }
}

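Appendix: meaning of the parameters in a Google URL (http://www.4ucode.com/Study/Topic/1060948):

hl (Interface Language): the interface language of Google search

q (Query): the search keyword

start: the position at which the displayed results begin. With start=1, display begins at the 2nd result; to jump straight to page 21 of the results, set start=200 (at the default of 10 results per page). Since Google only shows 1000 result records, start theoretically ranges from 0 to 999

lr (Language Restrict): restricts the search to pages in a given language. If lr is empty, all pages are searched

ie (Input Encoding): the encoding of the search keyword, utf-8 by default; that is, the value of q in a request to Google is utf-8 encoded text

oe (Output Encoding): the encoding of the results page, oe=utf-8 by default

num (Number): the number of results shown, between 10 and 100, num=10 by default

newwindow: whether results open in a new window; newwindow=1 by default, which opens the results page in a new window

aq (Ascending Query): indicates whether this is the user's first query. On a first query aq=f (First); after several queries aq=-1. Its main purpose is probably statistics and preventing cheating

as_q (Ascending Search Query): the keyword of the previous query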

Comments and discussion are welcome~
