Notes on using regular expressions to parse HTML

Reposted article, 2017-06-23 10:37

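Motivation

a) At work we need to parse HTML. The existing HTML parsing code is too old and fails three basic programming requirements: readability, reusability, and maintainability. That means whenever the HTML parsing module breaks, the fix takes a long time and a lot of manpower. Yet the system's need for HTML parsing is long-term, so we need a reusable, efficient, modular way to do it. Parsing with regular expressions is one attempt in that direction.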

Regular expression parsing examples

a) Removing specific tags:

Removing tags of the form <xxx>:

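#include <string>
#include "boost/regex.hpp"

std::string str_replace = "";                 // replacement text: empty, so every match is deleted
std::string instr = "<tag>content</tag>";
boost::regex reg_del("<[^>]*>");              // '<', any run of characters other than '>', then '>'
std::string result = boost::regex_replace(instr, reg_del, str_replace, boost::match_default | boost::format_all);

The value of result is "content".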
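The same tag-stripping step can be wrapped in a small reusable function. This is a minimal sketch, not the original author's code: it assumes `std::regex` (C++11) as a stand-in for `boost::regex` so that it builds without Boost, and the helper name `strip_tags` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: remove every "<...>" run and return what is left.
// Assumption: std::regex is used here in place of boost::regex.
std::string strip_tags(const std::string& html) {
    // '<', then any characters except '>', then '>'.
    static const std::regex tag_re("<[^>]*>");
    return std::regex_replace(html, tag_re, "");
}
```

Calling `strip_tags("<tag>content</tag>")` returns `"content"`. One caveat of this pattern: a `>` inside a quoted attribute value ends the match early, so such tags are only partly removed.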

Removing <style>xxx</style>:

#include <string>
#include "boost/regex.hpp"

std::string str_replace = "";
std::string instr = "<a><style>content</style><b>";
// The body is matched with [^>]*, so a style body that itself contains '>' will not match.
boost::regex reg_del2("<style[^>]*>[^>]*</style>");
std::string result = boost::regex_replace(instr, reg_del2, str_replace, boost::match_default | boost::format_all);

The value of result is "<a><b>".
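Real style blocks often span several lines and may contain `>` (for example a CSS child selector), which a `[^>]*` body cannot match. The following is one possible variant under stated assumptions: it uses `std::regex` with its default ECMAScript grammar rather than `boost::regex`, and the function name `remove_style_blocks` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: delete each <style ...>...</style> block, body included.
// Assumption: std::regex (ECMAScript grammar) stands in for boost::regex.
std::string remove_style_blocks(const std::string& html) {
    // [\s\S]*? matches any characters, newlines included, as few as possible,
    // so each block ends at the nearest </style>.
    static const std::regex style_re(R"(<style\b[^>]*>[\s\S]*?</style>)");
    return std::regex_replace(html, style_re, "");
}
```

The lazy quantifier matters when a page has more than one style block: a greedy body would swallow everything between the first `<style>` and the last `</style>`.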

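b) Finding tags that contain a given attribute (here, color):

#include <iostream>
#include <string>
#include "boost/regex.hpp"

std::string text("<a color=green>content1</a><b color=blue>content2</b>");
boost::regex regex("<[^>]*color[^>]*>");      // any tag whose text contains "color"
// Submatch index 0 makes the iterator yield each whole match.
boost::sregex_token_iterator iter(text.begin(), text.end(), regex, 0);
boost::sregex_token_iterator end;
for (; iter != end; ++iter)
{std::cout << *iter << '\n';}

The program prints:

<a color=green>
<b color=blue>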
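Often the attribute value is wanted rather than the whole tag. A capture group can pull it out, as in the sketch below. It is an illustration only: it assumes `std::regex` instead of `boost::regex`, the function name `color_values` is hypothetical, and it handles only simple unquoted or quoted values.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the value of every color attribute.
// Assumption: std::regex stands in for boost::regex.
std::vector<std::string> color_values(const std::string& html) {
    // \bcolor keeps "bgcolor" from matching; group 1 captures the value,
    // and the optional opening quote sits outside the group.
    static const std::regex color_re(R"(<[^>]*\bcolor\s*=\s*["']?([^"'\s>]+)[^>]*>)");
    std::vector<std::string> values;
    for (std::sregex_iterator it(html.begin(), html.end(), color_re), end; it != end; ++it) {
        values.push_back((*it)[1].str());  // submatch 1 is the attribute value
    }
    return values;
}
```

For the text used above, `color_values` returns `green` and `blue`.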

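Pros and cons of parsing with regular expressions

a) Pros

i. The code is short and concise, which lowers the maintenance burden.

b) Cons

i. Regular expressions have a steep learning curve.

ii. People who do not know regular expressions cannot read the code and have to rely on explanatory comments.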
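One way to soften the readability problem is to build a pattern from small named, commented fragments. The sketch below is only a suggestion: the constant and function names are hypothetical, `std::regex` is assumed in place of `boost::regex`, and `word` is assumed to be plain letters or digits, since it is not escaped.

```cpp
#include <regex>
#include <string>

// Each fragment has a name and a comment, so the assembled pattern reads in order.
const std::string kTagOpen  = "<";      // start of a tag
const std::string kTagBody  = "[^>]*";  // anything up to the closing bracket
const std::string kTagClose = ">";      // end of the tag

// Hypothetical helper: true if some tag in html contains `word`.
// Assumption: `word` has no regex metacharacters (it is inserted unescaped).
bool has_tag_containing(const std::string& html, const std::string& word) {
    const std::regex re(kTagOpen + kTagBody + word + kTagBody + kTagClose);
    return std::regex_search(html, re);
}
```

With this, `has_tag_containing("<a color=green>x</a>", "color")` is true, while text between tags, as in `<a>color</a>`, does not count.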

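Regular expression resources

a) Getting started is actually not hard; a few introductory websites are basically enough for a general understanding.

b) Regular expressions cover a lot of ground, so there is no need to understand everything. It is enough to know that a construct probably exists; looking it up and testing its effect as you go is faster.

c) RegexBuddy is the recommended testing tool. It is simple to use, and there are blog tutorials for it online. With it you can quickly see whether a regular expression does what you need (the original post showed a screenshot here).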
