NetCore console app: a simple scheduled crawler built with HostService and HttpClient


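The .NET Core hosting system

The .NET Core hosting system can host long-running services inside a managed process. An ASP.NET Core application is itself a long-running service: once started, it listens for network requests, that is, it opens a listener. The listener passes each request to the pipeline for processing, and the resulting HTTP response is returned.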

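Many scenarios call for service hosting, including the one in this post: periodically fetching the like counts of articles on the Huawei developer forum.

Crawling article like counts

Analysis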

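Take this link as an example: https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201308791792470245&fid=23 . Opening it shows that the page is built with Angular, which suggests the front end and back end are separated. Press F12 in the browser, inspect the network requests, and find the corresponding API request: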

POST https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/getTopicDetail? HTTP/1.1
Host: developer.huawei.com
Content-Type: application/json
Content-Length: 33

{"topicId":"0201302923811480141"}

In my testing, Content-Type and Content-Length had to have exactly the values shown above, and so did the body: a single extra space made the request fail.

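Requesting the data with HttpClient

Let's go straight to the code. It uses dependency injection to inject an IHttpClientFactory. A typed HttpClient would also work; for details, see the official documentation and this post on dudu's blog:
"A factory tour: how HttpClientFactory in .NET Core solves HttpClient's notorious problems"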

private readonly IHttpClientFactory _httpClientFactory;

public async Task<int> Crawl(string link)
{
    using (var httpClient = _httpClientFactory.CreateClient())
    {
        var uri = new Uri(link);
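        // TryReadQueryAsJson is an extension method from the Microsoft.AspNet.WebApi.Client package.
        uri.TryReadQueryAsJson(out var queryParams);
        // Null-safe: queryParams is null if parsing fails, and "tid" may be missing from the query.
        var topicId = queryParams?["tid"]?.ToString();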
        int likeCount = -1;
        if (!string.IsNullOrEmpty(topicId))
        {
            var body = JsonConvert.SerializeObject(
                        new { topicId },
                        Formatting.None);
            uri = new Uri(_baseUrl);
            var jsonContentType = "application/json";

            var requestMessage = new HttpRequestMessage
            {
                RequestUri = uri,
                Headers =
                {
                    { "Host", uri.Host }
                },
                Method = HttpMethod.Post,
                Content = new StringContent(body)
            };
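            // Send exactly "application/json", without a charset parameter.
            requestMessage.Content.Headers.ContentType = new MediaTypeHeaderValue(jsonContentType);
            // Content-Length counts bytes, not characters (the same here because the body is ASCII); needs using System.Text.
            requestMessage.Content.Headers.ContentLength = Encoding.UTF8.GetByteCount(body);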
            var response = await httpClient.SendAsync(requestMessage);
            if (response.StatusCode == HttpStatusCode.OK)
            {
                dynamic data = await response.Content.ReadAsAsync<dynamic>();
                likeCount = data.result.likes;
            }
        }

        return likeCount;
    }
}
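
The typed HttpClient mentioned above as an alternative could look roughly like the sketch below. The names TopicDetailClient, GetTopicDetailJsonAsync and AddTopicDetailClient are hypothetical and do not come from the original code. Only the endpoint URL is taken from this post, and the sketch assumes C# 8 or later together with the Microsoft.Extensions.Http package.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical typed client; the class name is an assumption for illustration.
public class TopicDetailClient
{
    private readonly HttpClient _httpClient;

    // IHttpClientFactory creates the HttpClient and injects it into this constructor.
    public TopicDetailClient(HttpClient httpClient) => _httpClient = httpClient;

    // Posts an already serialized JSON body and returns the raw response text.
    public async Task<string> GetTopicDetailJsonAsync(string jsonBody)
    {
        using var content = new StringContent(jsonBody);
        // Replace the default "text/plain; charset=utf-8" with a bare "application/json".
        content.Headers.ContentType = new MediaTypeHeaderValue("application/json");

        using var response = await _httpClient.PostAsync("getTopicDetail", content);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}

public static class TopicDetailClientRegistration
{
    // Hypothetical helper that registers the typed client with its base address.
    public static IServiceCollection AddTopicDetailClient(this IServiceCollection services)
    {
        services.AddHttpClient<TopicDetailClient>(client =>
        {
            // The trailing slash lets the relative "getTopicDetail" resolve under /open/.
            client.BaseAddress = new Uri(
                "https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/");
        });
        return services;
    }
}
```

With this registration, a hosted service could take TopicDetailClient as a constructor parameter instead of IHttpClientFactory, which keeps the endpoint details in one place.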

There is a more concise way to write this, using _httpClient.PostAsJsonAsync(), but since request headers such as Content-Type may need to be customized, I wrote it this way for now.
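
For comparison, a minimal sketch of the shorter variant follows. It assumes .NET 5 or later and the System.Net.Http.Json extensions rather than the package used in the code above. LikeCountRequests and GetLikeCountAsync are hypothetical names, and the result.likes shape of the response is taken from the code above, not verified independently. This form lets the library choose the headers (it typically adds a charset to Content-Type and may not send a Content-Length), so given the strictness reported above it may well be rejected by this particular API.

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper; the names are assumptions for illustration.
public static class LikeCountRequests
{
    // Returns the like count, or -1 when the request fails or the response
    // does not have the expected { "result": { "likes": <number> } } shape.
    public static async Task<int> GetLikeCountAsync(HttpClient httpClient, string url, string topicId)
    {
        // Serializes the anonymous object as {"topicId":"..."} and posts it as JSON.
        using var response = await httpClient.PostAsJsonAsync(url, new { topicId });
        if (!response.IsSuccessStatusCode)
        {
            return -1;
        }

        var data = await response.Content.ReadFromJsonAsync<JsonElement>();
        if (data.ValueKind == JsonValueKind.Object
            && data.TryGetProperty("result", out var result)
            && result.ValueKind == JsonValueKind.Object
            && result.TryGetProperty("likes", out var likes)
            && likes.ValueKind == JsonValueKind.Number
            && likes.TryGetInt32(out var count))
        {
            return count;
        }

        return -1;
    }
}
```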

Configuring the hosting system

class Program
{
    static void Main()
    {
        new HostBuilder()
            .ConfigureServices(services =>
            {
                services.AddHttpClient();
                services.AddHostedService<LikeCountCrawler>();
            })
            .Build()
            .Run();
    }
}

LikeCountCrawler implements the IHostedService interface.

The IHostedService interface

public interface IHostedService
{
    /// <summary>
    /// Triggered when the application host is ready to start the service.
    /// </summary>
    /// <param name="cancellationToken">Indicates that the start process has been aborted.</param>
    Task StartAsync(CancellationToken cancellationToken);

    /// <summary>
    /// Triggered when the application host is performing a graceful shutdown.
    /// </summary>
    /// <param name="cancellationToken">Indicates that the shutdown process should no longer be graceful.</param>
    Task StopAsync(CancellationToken cancellationToken);
}

In LikeCountCrawler's StartAsync method, a timer is configured and started. Each time the timer elapses, the crawl logic runs once.

private readonly Timer _timer = new Timer();
private readonly IEnumerable<string> _links = new string[]
{
    "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201308791792470245&fid=23",
    "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201303654965850166&fid=18",
    "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201294272503450453&fid=24",
    "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201294189025490019&fid=17"
};
private readonly string _baseUrl = "https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/getTopicDetail";
...

public Task StartAsync(CancellationToken cancellationToken)
{
    _timer.Interval = 5 * 60 * 1000;
    _timer.Elapsed += OnTimer;
    _timer.AutoReset = true;
    _timer.Enabled = true;
    _timer.Start();
    OnTimer(null, null);
    return Task.CompletedTask;
}

public async Task Crawl(IEnumerable<string> links)
{
    // Parallel.ForEach does not await async lambdas, so start one task per link
    // and await them all instead (requires using System.Linq).
    await Task.WhenAll(links.Select(async link =>
    {
        Console.WriteLine($"Crawling link:{link}, ThreadId:{Thread.CurrentThread.ManagedThreadId}");
        var likeCount = await Crawl(link);
        Console.WriteLine($"Succeed crawling likecount - {likeCount}, ThreadId:{Thread.CurrentThread.ManagedThreadId}");
    }));
}

private void OnTimer(object sender, ElapsedEventArgs args)
{
    _ = Crawl(_links);
}

...
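
As an alternative to the timer-based approach above, the same "run now, then every five minutes" loop can be sketched with BackgroundService and PeriodicTimer. This is a different technique from the one used in this post and assumes .NET 6 or later. PeriodicCrawler and CrawlOnceAsync are hypothetical names, and the method body is only a placeholder for the crawl logic.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical hosted service; the names are assumptions for illustration.
public class PeriodicCrawler : BackgroundService
{
    private static readonly TimeSpan Interval = TimeSpan.FromMinutes(5);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(Interval);
        try
        {
            // Crawl once immediately, then again after every tick.
            do
            {
                await CrawlOnceAsync(stoppingToken);
            }
            while (await timer.WaitForNextTickAsync(stoppingToken));
        }
        catch (OperationCanceledException)
        {
            // The host is shutting down; leave the loop quietly.
        }
    }

    // Placeholder for the crawl logic.
    private Task CrawlOnceAsync(CancellationToken cancellationToken)
    {
        Console.WriteLine($"Crawl started at {DateTimeOffset.Now:O}");
        return Task.CompletedTask;
    }
}
```

It would be registered the same way, with services.AddHostedService<PeriodicCrawler>(). Because each run is awaited before the next tick is awaited, runs never overlap, and the stopping token ends the loop on shutdown.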

Run result: [screenshot]

