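1. Some rambling before the main topic

My first hands-on post got 1000+ views and the second got 200+, a gap of nearly 5x. That difference really puzzles me. I have been trying to work out why the gap is so large, whether the posts had less solid content or something else, and I still don't have an answer. If you have a theory about what causes a drop like this, please leave it in the comments. Thank you.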

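2. Analyzing the Autohome brand logo page

2.1 Analyzing the page structure

First, open the Autohome brand logo selection page, https://car.m.autohome.com.cn/. Taking Huasong (華頌) as an example, all we really need from each element with class item is the src of the img (the image path) and the text of the strong (the brand name). As you can see, this is quite simple, much simpler than the page and API data we fetched last time. So why give it a post of its own? Because it also involves downloading files, which the earlier posts never covered.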

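2.2 A pitfall in the page

When I first started scraping, I found that src was empty in many places, and I couldn't figure out why. After stepping through with the debugger, I found that Autohome lazy-loads the logo images: until the page has been scrolled to an image, the img is not loaded and only holds its place. An image loads only once the scroll bar reaches it. So when reading the img path, we have to check for an empty src and fall back to the data-src attribute.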
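The check described above can be pulled out into a small helper. This is only a sketch: the class name LogoUrlResolver, the method Resolve, and the example.com URLs are made up for illustration and are not part of DotnetSpider or the Autohome page. It assumes the real path is in data-src whenever src is empty, and that protocol-relative paths should be completed with https.

```csharp
using System;

// Hypothetical helper, not part of DotnetSpider.
public static class LogoUrlResolver
{
    // Returns a usable image URL, or null when neither attribute has a value.
    public static string Resolve(string src, string dataSrc)
    {
        // A lazy-loaded img has an empty src until it is scrolled into view,
        // so fall back to data-src in that case.
        string path = string.IsNullOrWhiteSpace(src) ? dataSrc : src;
        if (string.IsNullOrWhiteSpace(path))
        {
            return null;
        }

        path = path.Trim();

        // Protocol-relative paths ("//host/logo.png") need a scheme.
        if (path.StartsWith("//", StringComparison.Ordinal))
        {
            return "https:" + path;
        }

        return path;
    }
}
```

Checking for the "//" prefix, rather than searching the whole string for "https", leaves paths that already carry a scheme untouched.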

3. Writing the code

3.1 Preparing the Processor

private class GetLogoInfoProcessor : BasePageProcessor // Extracts the logo info
{
    protected override void Handle(Page page)
    {
        List<LogoInfoModel> logoInfoList = new List<LogoInfoModel>();
        var logoInfoNodes = page.Selectable.XPath(".//div[@id='div_ListBrand']//div[@class='item']").Nodes();
        foreach (var logoInfo in logoInfoNodes)
        {
            LogoInfoModel model = new LogoInfoModel();
            model.BrandName = logoInfo.XPath("./strong").GetValue();
            model.ImgPath = logoInfo.XPath("./img/@src").GetValue();
            // Lazy-loaded images have no src yet; the real path is in data-src.
            if (string.IsNullOrEmpty(model.ImgPath))
            {
                model.ImgPath = logoInfo.XPath("./img/@data-src").GetValue();
            }
            // Skip entries that have no image path at all.
            if (string.IsNullOrEmpty(model.ImgPath))
            {
                continue;
            }
            // Protocol-relative paths ("//...") need a scheme in front.
            if (model.ImgPath.StartsWith("//"))
            {
                model.ImgPath = "https:" + model.ImgPath;
            }
            logoInfoList.Add(model);
            //page.AddTargetRequest(model.ImgPath); // With DownloadFiles set to true on the Site, the file is downloaded automatically
        }
        page.AddResultItem("LogoInfoList", logoInfoList);
    }
}

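3.2 Preparing the Pipeline

Here I did not use the framework's built-in download. I wrote a simple download method of my own instead, because the built-in one just pulls the file straight down, which does not fit my business logic very well.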

private class PrintLogInfoPipe : BasePipeline
{
    // Needs: using System.IO; using System.Net.Http;
    public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
    {
        foreach (var resultItem in resultItems)
        {
            var logoInfoList = resultItem.GetResultItem("LogoInfoList") as List<LogoInfoModel>;
            if (logoInfoList == null)
            {
                continue;
            }
            foreach (var logoInfo in logoInfoList)
            {
                Console.WriteLine($"brand:{logoInfo.BrandName} path:{logoInfo.ImgPath}");
                SaveFile(logoInfo.ImgPath, logoInfo.BrandName);
            }
        }
    }

    private void SaveFile(string url, string filename)
    {
        string filePath = Path.Combine(Environment.CurrentDirectory, "img", filename + ".jpg");
        // Skip the request entirely if the file has already been downloaded.
        if (File.Exists(filePath))
        {
            return;
        }
        try
        {
            // Creates the img folder if needed; does nothing when it already exists.
            Directory.CreateDirectory(Path.GetDirectoryName(filePath));
            using (HttpClient httpClient = new HttpClient())
            {
                // Blocking call, kept synchronous to match the Process method.
                byte[] bytes = httpClient.GetByteArrayAsync(url).Result;
                File.WriteAllBytes(filePath, bytes);
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of silently swallowing it.
            Console.WriteLine($"failed to download {url}: {ex.Message}");
        }
    }
}
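
One thing the download method above takes for granted is that the brand name is a valid file name and that every logo is a .jpg. Below is a rough sketch of how the file name could be built more defensively. The class LogoFileNamer and its method BuildFileName are hypothetical names of my own, and falling back to .jpg when the URL has no extension is an assumption, not something the Autohome page guarantees.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of DotnetSpider.
public static class LogoFileNamer
{
    // Builds a file name from the brand name and the extension found in the URL.
    public static string BuildFileName(string brandName, string url)
    {
        // Replace characters that the current OS does not allow in file names.
        char[] invalid = Path.GetInvalidFileNameChars();
        string safeName = new string(
            brandName.Select(c => invalid.Contains(c) ? '_' : c).ToArray());

        // Read the extension from the URL path only, ignoring any query string.
        string extension = Path.GetExtension(new Uri(url).AbsolutePath);
        if (string.IsNullOrEmpty(extension))
        {
            extension = ".jpg"; // assumed default when the URL has no extension
        }

        return safeName + extension;
    }
}
```

The result could be passed to Path.Combine together with the img folder in place of filename + ".jpg".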

Storage entity class

private class LogoInfoModel
{
     public string BrandName { get; set; }
     public string ImgPath { get; set; }
}

3.3 Building the spider

static void Main(string[] args)
        {
            var site = new Site
            {
                CycleRetryTimes = 1,
                SleepTime = 200,
                //DownloadFiles = true,     // DotnetSpider setting: whether to download files
                Headers = new Dictionary<string, string>()
                {
                    { "Accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" },
                    { "Cache-Control","no-cache" },
                    { "Connection","keep-alive" },
                    { "Content-Type","application/x-www-form-urlencoded; charset=UTF-8" },
                    { "User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"}
                }

            };
            List<Request> resList = new List<Request>();
            Request res = new Request();
            res.Url = "https://car.m.autohome.com.cn/";
            res.Method = System.Net.Http.HttpMethod.Get;
            resList.Add(res);
            var spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), new GetLogoInfoProcessor()) 
                .AddStartRequests(resList.ToArray())
                .AddPipeline(new PrintLogInfoPipe());
            spider.ThreadNum = 1;
            spider.Run();
            Console.Read();
        }

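3.4 A look at the source behind DownloadFiles in Site

In the DotnetSpider source, HttpClientDownloader checks whether DownloadFiles on the Site allows file downloads. The default is false, and unless you set DownloadFiles to true, any response that is not in string form is simply ignored. If you are curious, uncomment the two commented-out lines in my code (the AddTargetRequest call in the Processor and DownloadFiles in the Site) and you will see how DotnetSpider's own download works.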

4. Results

I have uploaded a recording of this run to bilibili. Take a look if you are interested.

https://www.bilibili.com/video/av24022630/

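5. Summary

This time we combined data scraping with file downloading in one small exercise, and covered both DotnetSpider's native download and a download method I wrote myself. If you run into a similar requirement, pick whichever approach fits your own business logic. I hope this post helps, and if you think any part is badly written, criticism is welcome.

I have uploaded the source code for all three posts to Github. If you are interested, you can download it directly: https://github.com/FunnyBoyDeng/SpiderAutoHome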

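6. No preview for next time

I haven't decided what to scrape next time. Feel free to leave a comment with what you would like to see scraped.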

2018-05-27
