Qt Notes: Parsing HTML in C++


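When I used to write crawlers, I parsed pages the crude way: I simply looked for patterns in the raw text. Later I came across gumbo online, tried it, and found it works really well, so I am recommending it here.

The following is reposted from: http://blog.csdn.net/whyistao/article/details/37919581

1. C++ does not seem to have many HTML parsing libraries to choose from. I ended up integrating htmlcxx into a Qt project. At first I wrote INCLUDEPATH += <path> in the .pro file, but that alone did not work (INCLUDEPATH only tells the compiler where to find headers; the library's source files still have to be compiled into the project).
It turned out that adding htmlcxx's .cc and .h files to SOURCES and HEADERS with += is enough, like this: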
SOURCES += main.cpp\
        html/utils.cc \
        html/Uri.cc \
        html/ParserSax.cc \
        html/ParserDom.cc \
        html/Node.cc \
        html/Extensions.cc
HEADERS  += mainwindow.h \
        html/utils.h \
        html/Uri.h \
        html/tree.h \
        html/ParserSax.h \
        html/ParserDom.h \
        html/Node.h \
        html/Extensions.h \
        html/debug.h \
        html/ci_string.h \
        html/wincstring.h \
        html/tld.h

Reference: htmlcxx for qt (mingw)      http://blog.chinaunix.net/uid-21525518-id-1824657.html


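2. Parsing with gumbo
Add the .c and .h files to the project the same way as above. Some notes on the gumbo types used most often:

GumboOutput
gumbo_parse() parses the HTML source and returns a GumboOutput; output->root is the root node.
GumboOutput* output = gumbo_parse(htmlString.c_str());
GumboNode* node = output->root;
// when finished with the tree: gumbo_destroy_output(&kGumboDefaultOptions, output);

GumboNode    a node
GumboNode* node;
Accessing what a node holds:
node->v.text.text           // the node's text (text nodes only)
node->v.element.children    // the node's list of children (element nodes only)
node->type                  // the node's type

GumboVector    a container of nodes
For example, GumboVector* children = &node->v.element.children; gets the node's list of children.
(GumboNode*) (children->data[i])     // the i-th node in that list

GumboAttribute    a node attribute
GumboAttribute* href;
if (node->v.element.tag == GUMBO_TAG_A && (href = gumbo_get_attribute(&node->v.element.attributes, "href")))
{    std::cout << href->value << std::endl;  }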
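
To show how type, v.text.text, v.element.children and data[i] fit together, here is a minimal sketch of a recursive walk that collects the text under a node and skips <script> and <style>. So that it compiles without the library, it declares trimmed stand-ins for the gumbo structs (an assumption: they mirror only the fields used here, not the full gumbo.h definitions). collect_text is a hypothetical helper name, not part of gumbo.

```cpp
#include <string>

// --- Stand-in declarations (assumption) ---------------------------------
// A trimmed mirror of the gumbo.h shapes used below, so the sketch builds
// on its own. In a real project, delete this part and #include "gumbo.h".
enum GumboNodeType { GUMBO_NODE_DOCUMENT, GUMBO_NODE_ELEMENT, GUMBO_NODE_TEXT };
enum GumboTag { GUMBO_TAG_STYLE, GUMBO_TAG_SCRIPT, GUMBO_TAG_P, GUMBO_TAG_DIV };
struct GumboVector { void** data; unsigned int length; };
struct GumboElement { GumboVector children; GumboTag tag; };
struct GumboText { const char* text; };
struct GumboNode {
    GumboNodeType type;
    union { GumboElement element; GumboText text; } v;
};
// -------------------------------------------------------------------------

// Recursively collects the text under `node`, separated by single spaces.
std::string collect_text(const GumboNode* node) {
    if (node->type == GUMBO_NODE_TEXT) {
        return node->v.text.text;  // text nodes keep their content in v.text.text
    }
    // v.element is only meaningful for element nodes, so check the type first.
    if (node->type != GUMBO_NODE_ELEMENT ||
        node->v.element.tag == GUMBO_TAG_SCRIPT ||
        node->v.element.tag == GUMBO_TAG_STYLE) {
        return "";
    }
    std::string out;
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        // data[i] is a void*, so each child has to be cast back to GumboNode*.
        const std::string piece =
            collect_text(static_cast<const GumboNode*>(children->data[i]));
        if (!piece.empty()) {
            if (!out.empty()) out += ' ';
            out += piece;
        }
    }
    return out;
}
```

With the real library, the call would be collect_text(output->root) on the result of gumbo_parse(). The key point is the order of the checks: the type is tested before anything under v.element is touched.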


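Node types (DOM)
  ELEMENT_NODE: an ordinary element node, e.g. <html>, <p>, <div>, <span>, <img>
  ATTRIBUTE_NODE: an element attribute
  TEXT_NODE: a text node
  CDATA_SECTION_NODE: a CDATA section, i.e. <![CDATA[ ]]>
  ENTITY_REFERENCE_NODE: an entity reference, e.g. &amp;
  ENTITY_NODE: an entity, e.g. <!ENTITY copyright "Copyright 2010, impng. All rights reserved">
  PROCESSING_INSTRUCTION_NODE: a processing instruction (PI), e.g. <?xml version="1.0"?>
  COMMENT_NODE: a comment, <!-- -->
  DOCUMENT_NODE: the root node, i.e. document.nodeType
  DOCUMENT_TYPE_NODE: the DTD / document type, <!DOCTYPE >
  DOCUMENT_FRAGMENT_NODE: a document fragment
  NOTATION_NODE: a notation defined in the DTD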

In code, a node's type can be one of the following (usage: node->type == GUMBO_NODE_ELEMENT)
typedef enum {
  /** Document node.  v will be a GumboDocument. */
  GUMBO_NODE_DOCUMENT,
  /** Element node.  v will be a GumboElement. */
  GUMBO_NODE_ELEMENT,
  /** Text node.  v will be a GumboText. */
  GUMBO_NODE_TEXT,
  /** CDATA node. v will be a GumboText. */
  GUMBO_NODE_CDATA,
  /** Comment node.  v. will be a GumboText, excluding comment delimiters. */
  GUMBO_NODE_COMMENT,
  /** Text node, where all contents is whitespace.  v will be a GumboText. */
  GUMBO_NODE_WHITESPACE
} GumboNodeType;
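
The comments in this enum say which member of the v union is valid for each node type. A small sketch that writes the rule down as code may help; the enum is copied from the listing above so the block compiles without gumbo.h, and payload_member is a hypothetical helper name, not a gumbo function.

```cpp
#include <string>

// Copied from the listing above so this block builds on its own.
// In a real project, #include "gumbo.h" instead.
typedef enum {
  GUMBO_NODE_DOCUMENT,
  GUMBO_NODE_ELEMENT,
  GUMBO_NODE_TEXT,
  GUMBO_NODE_CDATA,
  GUMBO_NODE_COMMENT,
  GUMBO_NODE_WHITESPACE
} GumboNodeType;

// Returns the name of the `v` member that may be read for a node of this type.
std::string payload_member(GumboNodeType type) {
  switch (type) {
    case GUMBO_NODE_DOCUMENT:
      return "document";  // node->v.document
    case GUMBO_NODE_ELEMENT:
      return "element";   // node->v.element
    case GUMBO_NODE_TEXT:
    case GUMBO_NODE_CDATA:
    case GUMBO_NODE_COMMENT:
    case GUMBO_NODE_WHITESPACE:
      return "text";      // node->v.text
  }
  return "unknown";
}
```

In other words, node->v.element.tag and node->v.element.children should only be read after node->type == GUMBO_NODE_ELEMENT has been checked.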

Tag types (usage: node->v.element.tag != GUMBO_TAG_SCRIPT)
typedef enum {
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-root-element
  GUMBO_TAG_HTML,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#document-metadata
  GUMBO_TAG_HEAD,
  GUMBO_TAG_TITLE,
  GUMBO_TAG_BASE,
  GUMBO_TAG_LINK,
  GUMBO_TAG_META,
  GUMBO_TAG_STYLE,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/scripting-1.html#scripting-1
  GUMBO_TAG_SCRIPT,
  GUMBO_TAG_NOSCRIPT,
  GUMBO_TAG_TEMPLATE,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/sections.html#sections
  GUMBO_TAG_BODY,
  GUMBO_TAG_ARTICLE,
  GUMBO_TAG_SECTION,
  GUMBO_TAG_NAV,
  GUMBO_TAG_ASIDE,
  GUMBO_TAG_H1,
  GUMBO_TAG_H2,
  GUMBO_TAG_H3,
  GUMBO_TAG_H4,
  GUMBO_TAG_H5,
  GUMBO_TAG_H6,
  GUMBO_TAG_HGROUP,
  GUMBO_TAG_HEADER,
  GUMBO_TAG_FOOTER,
  GUMBO_TAG_ADDRESS,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/grouping-content.html#grouping-content
  GUMBO_TAG_P,
  GUMBO_TAG_HR,
  GUMBO_TAG_PRE,
  GUMBO_TAG_BLOCKQUOTE,
  GUMBO_TAG_OL,
  GUMBO_TAG_UL,
  GUMBO_TAG_LI,
  GUMBO_TAG_DL,
  GUMBO_TAG_DT,
  GUMBO_TAG_DD,
  GUMBO_TAG_FIGURE,
  GUMBO_TAG_FIGCAPTION,
  GUMBO_TAG_MAIN,
  GUMBO_TAG_DIV,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#text-level-semantics
  GUMBO_TAG_A,
  GUMBO_TAG_EM,
  GUMBO_TAG_STRONG,
  GUMBO_TAG_SMALL,
  GUMBO_TAG_S,
  GUMBO_TAG_CITE,
  GUMBO_TAG_Q,
  GUMBO_TAG_DFN,
  GUMBO_TAG_ABBR,
  GUMBO_TAG_DATA,
  GUMBO_TAG_TIME,
  GUMBO_TAG_CODE,
  GUMBO_TAG_VAR,
  GUMBO_TAG_SAMP,
  GUMBO_TAG_KBD,
  GUMBO_TAG_SUB,
  GUMBO_TAG_SUP,
  GUMBO_TAG_I,
  GUMBO_TAG_B,
  GUMBO_TAG_U,
  GUMBO_TAG_MARK,
  GUMBO_TAG_RUBY,
  GUMBO_TAG_RT,
  GUMBO_TAG_RP,
  GUMBO_TAG_BDI,
  GUMBO_TAG_BDO,
  GUMBO_TAG_SPAN,
  GUMBO_TAG_BR,
  GUMBO_TAG_WBR,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/edits.html#edits
  GUMBO_TAG_INS,
  GUMBO_TAG_DEL,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#embedded-content-1
  GUMBO_TAG_IMAGE,
  GUMBO_TAG_IMG,
  GUMBO_TAG_IFRAME,
  GUMBO_TAG_EMBED,
  GUMBO_TAG_OBJECT,
  GUMBO_TAG_PARAM,
  GUMBO_TAG_VIDEO,
  GUMBO_TAG_AUDIO,
  GUMBO_TAG_SOURCE,
  GUMBO_TAG_TRACK,
  GUMBO_TAG_CANVAS,
  GUMBO_TAG_MAP,
  GUMBO_TAG_AREA,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#mathml
  GUMBO_TAG_MATH,
  GUMBO_TAG_MI,
  GUMBO_TAG_MO,
  GUMBO_TAG_MN,
  GUMBO_TAG_MS,
  GUMBO_TAG_MTEXT,
  GUMBO_TAG_MGLYPH,
  GUMBO_TAG_MALIGNMARK,
  GUMBO_TAG_ANNOTATION_XML,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#svg-0
  GUMBO_TAG_SVG,
  GUMBO_TAG_FOREIGNOBJECT,
  GUMBO_TAG_DESC,
  // SVG title tags will have GUMBO_TAG_TITLE as with HTML.
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/tabular-data.html#tabular-data
  GUMBO_TAG_TABLE,
  GUMBO_TAG_CAPTION,
  GUMBO_TAG_COLGROUP,
  GUMBO_TAG_COL,
  GUMBO_TAG_TBODY,
  GUMBO_TAG_THEAD,
  GUMBO_TAG_TFOOT,
  GUMBO_TAG_TR,
  GUMBO_TAG_TD,
  GUMBO_TAG_TH,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#forms
  GUMBO_TAG_FORM,
  GUMBO_TAG_FIELDSET,
  GUMBO_TAG_LEGEND,
  GUMBO_TAG_LABEL,
  GUMBO_TAG_INPUT,
  GUMBO_TAG_BUTTON,
  GUMBO_TAG_SELECT,
  GUMBO_TAG_DATALIST,
  GUMBO_TAG_OPTGROUP,
  GUMBO_TAG_OPTION,
  GUMBO_TAG_TEXTAREA,
  GUMBO_TAG_KEYGEN,
  GUMBO_TAG_OUTPUT,
  GUMBO_TAG_PROGRESS,
  GUMBO_TAG_METER,
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/interactive-elements.html#interactive-elements
  GUMBO_TAG_DETAILS,
  GUMBO_TAG_SUMMARY,
  GUMBO_TAG_MENU,
  GUMBO_TAG_MENUITEM,
  // Non-conforming elements that nonetheless appear in the HTML5 spec.
  // http://www.whatwg.org/specs/web-apps/current-work/multipage/obsolete.html#non-conforming-features
  GUMBO_TAG_APPLET,
  GUMBO_TAG_ACRONYM,
  GUMBO_TAG_BGSOUND,
  GUMBO_TAG_DIR,
  GUMBO_TAG_FRAME,
  GUMBO_TAG_FRAMESET,
  GUMBO_TAG_NOFRAMES,
  GUMBO_TAG_ISINDEX,
  GUMBO_TAG_LISTING,
  GUMBO_TAG_XMP,
  GUMBO_TAG_NEXTID,
  GUMBO_TAG_NOEMBED,
  GUMBO_TAG_PLAINTEXT,
  GUMBO_TAG_RB,
  GUMBO_TAG_STRIKE,
  GUMBO_TAG_BASEFONT,
  GUMBO_TAG_BIG,
  GUMBO_TAG_BLINK,
  GUMBO_TAG_CENTER,
  GUMBO_TAG_FONT,
  GUMBO_TAG_MARQUEE,
  GUMBO_TAG_MULTICOL,
  GUMBO_TAG_NOBR,
  GUMBO_TAG_SPACER,
  GUMBO_TAG_TT,
  // Used for all tags that don't have special handling in HTML.
  GUMBO_TAG_UNKNOWN,
  // A marker value to indicate the end of the enum, for iterating over it.
  // Also used as the terminator for varargs functions that take tags.
  GUMBO_TAG_LAST,
} GumboTag;


3. While using gumbo I hit the error "RtlWerpReportException failed with status code :-1073741823".
At first I assumed it was a stack overflow, but it turned out that my own code logic was wrong. It is best to follow the usage in the official demo:
if (node->v.element.tag == GUMBO_TAG_A && (href = gumbo_get_attribute(&node->v.element.attributes, "href")))
{    std::cout << href->value << std::endl;  }
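
A side note on reading that status code: Windows prints it as a signed decimal, while the documented NTSTATUS values are listed as unsigned hex. A minimal sketch of the conversion (status_to_hex is a hypothetical helper name) makes the lookup easier:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a signed 32-bit status code as the 8-digit hex form used in
// NTSTATUS listings, by reinterpreting the value as unsigned.
std::string status_to_hex(std::int32_t status) {
    char buf[11];  // "0x" + 8 hex digits + terminating NUL
    std::snprintf(buf, sizeof buf, "0x%08X",
                  static_cast<unsigned int>(static_cast<std::uint32_t>(status)));
    return buf;
}
```

Here -1073741823 becomes 0xC0000001, which as far as I know is the generic STATUS_UNSUCCESSFUL, whereas a stack overflow would normally show up as 0xC00000FD (-1073741571). That fits the conclusion above that the problem was the code logic and not the stack.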


4. Compiling gumbo produced this error:
 error: 'for' loop initial declarations are only allowed in C99 mode
so these two lines need to be added to the project's .pro file:
QMAKE_CFLAGS_DEBUG +=  --std=c99
QMAKE_CFLAGS_RELEASE +=  --std=c99

 

When reposting, please credit: http://www.cnblogs.com/fnlingnzb-learner/p/5835428.html

