C# Crawler: Parsing HTML with Jumony

Reposted article. 2017-09-07 10:45. Category: .NET/C#

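Preface

A few days ago I wrote a crawler and came away aware of my own shortcomings. 烽火情懷 recommended Jumony.Core, and I also came across Jumony.Core in an article recommended by 倚天照海- -.

After studying it for two days, I found it simple, direct, and very pleasant to use, because its selector syntax closely resembles jQuery. It is quick to pick up and easy to understand.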

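Adding the DLL

The IDE is Visual Studio 2013. I searched for the package in NuGet and added it to the project.

Using Jumony

1. Fetch the HTML from a website and parse the HTML string into a standard Document Object Model (DOM).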

IHtmlDocument source = new JumonyParser().LoadDocument("http://www.23us.so/files/article/html/13/13655/index.html", System.Text.Encoding.GetEncoding("utf-8"));

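Jumony's API can fetch and parse a document directly from the internet, and it detects the encoding automatically from the HTTP headers. For the site above, however, I could not get usable HTML no matter what I tried, while other sites (for example 博客園 and 起點) worked without problems. I eventually downloaded the source code and stepped through it, and found that the HTML was in fact being fetched but came back garbled, so the Jumony library parsed the HTML text incorrectly. The fix is to specify utf-8 explicitly.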
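The garbling described above can be reproduced without any network access. The following is a minimal sketch, not Jumony's own code: it assumes the server sends UTF-8 bytes and that the wrong guess is iso-8859-1 (the actual encoding that was guessed for this site is not known from this article). `EncodingDemo` and `Decode` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decode raw response bytes with an explicitly chosen encoding.
    public static string Decode(byte[] body, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(body);
    }

    public static void Main()
    {
        // Stand-in for the bytes a server would send for a UTF-8 page.
        byte[] body = Encoding.UTF8.GetBytes("<title>無疆</title>");

        string right = Decode(body, "utf-8");       // matches the real encoding
        string wrong = Decode(body, "iso-8859-1");  // assumed wrong guess

        Console.WriteLine(right.Contains("無疆")); // True
        Console.WriteLine(wrong.Contains("無疆")); // False: the text is garbled
    }
}
```

Once the text is garbled like this, any selector that depends on the page content has nothing sensible to match, which is consistent with the parse looking "wrong" even though the download succeeded.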

2. Get all meta tags

string name = null;
var metas = source.Find("meta");// get all meta tags
foreach (var meta in metas)
{
    if (meta.Attribute("name").Value() == "keywords")
    {
        name = meta.Attribute("content").Value();// 無疆,無疆最新章節,無疆全文閱讀
    }
}

3. Get the meta tag with name=keywords and read the value of its content attribute

string name = source.Find("meta[name=keywords]").FirstOrDefault().Attribute("content").Value();

4. Get all elements with class=L

var lLinks = source.Find(".L");// get all td tags with class=L
foreach (var lLink in lLinks)// loop over each td with class=L
{
    // lLink is, for example: <td class="L"><a href="http://www.23us.so/files/article/html/13/13655/5638724.html">楔子</a></td>
}

var aLinks = source.Find(".L a");// get all a tags under class=L
foreach (var aLink in aLinks)
{
    // aLink is, for example: <a href="http://www.23us.so/files/article/html/13/13655/5638724.html">楔子</a>
    string title = aLink.InnerText();// 楔子
    string url = aLink.Attribute("href").Value();// http://www.23us.so/files/article/html/13/13655/5638724.html
}

5. Get elements by ID

var chapterLink = source.Find("#at a");// find all a tags under id=at
foreach (var i in chapterLink)// each item in this loop is an a tag
{
    // i is, for example: <a href="http://www.23us.so/files/article/html/13/13655/5638724.html">楔子</a>
    string title = i.InnerText();// 楔子
    string url = i.Attribute("href").Value();// http://www.23us.so/files/article/html/13/13655/5638724.html
}

Complete C# code

using Ivony.Html;
using Ivony.Html.Parser;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.Mvc;

namespace Test.Controllers
{
    public class CrawlerController : BaseController
    {
        // GET: Crawler
        public void Index()
        {
            // The utf-8 encoding must be given explicitly, otherwise the html comes back garbled.
            IHtmlDocument source = new JumonyParser().LoadDocument("http://www.23us.so/files/article/html/13/13655/index.html", System.Text.Encoding.GetEncoding("utf-8"));

            // <meta name="keywords" content="無疆,無疆最新章節,無疆全文閱讀"/>
            string name = source.Find("meta[name=keywords]").FirstOrDefault().Attribute("content").Value().Split(',')[0];// get the novel's name
            var chapterLink = source.Find("#at a");// find all a tags under id=at
            foreach (var i in chapterLink)// each item in this loop is an a tag
            {
                // chapter title
                string title = i.InnerText();

                // chapter url
                string url = i.Attribute("href").Value();

                // fetch the html of the chapter page from the chapter url
                IHtmlDocument sourceChild = new JumonyParser().LoadDocument(url, System.Text.Encoding.GetEncoding("utf-8"));

                // find the novel's body text under id=contents
                string content = sourceChild.Find("#contents").FirstOrDefault().InnerHtml().Replace("&nbsp;", "").Replace("<br />", "\r\n");

                // write to a txt file (the Novel helper is not shown in this article)
                string path = AppDomain.CurrentDomain.BaseDirectory.Replace("\\", "/") + "Txt/";
                Novel(title + "\r\n" + content, name, path);
            }
            }
        }
    }
}
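The complete code calls a `Novel` helper that is not shown, and it cleans up the chapter HTML with two string replacements. As a rough sketch of what those post-processing steps might look like using only the .NET base class library: `ChapterText`, `FirstKeyword`, `HtmlToText` and `AppendChapter` below are hypothetical names, and `AppendChapter` is only a guess at what `Novel(text, name, path)` does (append the chapter to a txt file named after the novel), not the original implementation.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class ChapterText
{
    // Take the first comma-separated keyword as the novel name.
    public static string FirstKeyword(string keywords)
    {
        return keywords.Split(',')[0].Trim();
    }

    // Turn an HTML fragment into plain text: <br> variants become line breaks,
    // any remaining tags are dropped, and entities are decoded.
    public static string HtmlToText(string html)
    {
        string text = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", "");
        text = WebUtility.HtmlDecode(text);
        // &nbsp; decodes to U+00A0; remove it, as the article's Replace("&nbsp;", "") does.
        return text.Replace("\u00A0", "");
    }

    // Assumed behavior of Novel(): append one chapter to <dir>/<name>.txt.
    public static string AppendChapter(string chapter, string name, string dir)
    {
        Directory.CreateDirectory(dir);
        // Replace characters that are not allowed in file names.
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');
        string file = Path.Combine(dir, name + ".txt");
        File.AppendAllText(file, chapter + "\r\n", Encoding.UTF8);
        return file;
    }

    public static void Main()
    {
        string name = FirstKeyword("無疆,無疆最新章節,無疆全文閱讀");
        string body = HtmlToText("&nbsp;&nbsp;first line<br />second line");
        string dir = Path.Combine(Path.GetTempPath(), "Txt");
        Console.WriteLine(AppendChapter("楔子\r\n" + body, name, dir));
    }
}
```

Compared with the plain `Replace("<br />", "\r\n")`, the regular expression also catches `<br>` and `<BR/>`, which may matter if other pages on the site format their line breaks differently.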

Related article: C# 爬蟲抓取小說 (a C# crawler for scraping novels)

Jumony source code: https://github.com/Ivony/Jumony
