C++ Web Content Extraction Series, Part 4: Regular Expressions

Reposted article. 2014-12-19 14:23. Tags: web page extraction / web content analysis / C++ code practice / regex / network programming

Title: C++ Web Content Extraction Series, Part 4
Author: itdef
Link: http://www.cnblogs.com/itdef/p/4173833.html

Reposting is welcome; please keep the text intact and credit the source.

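Once the page content has been downloaded into a string or saved to a local file, we can start searching it to pull out the information we want.
Here we use regular expressions. The compiler is VS2008, whose bundled TR1 library (a preview of the upcoming standard library) includes a regex library.
Include the header:
/*******************************************************************************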
*  @file
*  @author    def< qq group: 324164944 >
*  @blog       http://www.cnblogs.com/itdef/
*  @brief
/*******************************************************************************/
#include <regex>
using namespace std::tr1;
using namespace std;
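
The std::tr1 namespace above belongs to pre-C++11 toolchains such as VS2008. As a minimal sketch, assuming a C++11 or later compiler, the same regex classes can be used directly from namespace std. The helper name contains_pattern is made up for illustration and is not part of the original code.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
// Assumes C++11 or later, where regex lives in namespace std, not std::tr1.
bool contains_pattern(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);        // default grammar is ECMAScript
    return std::regex_search(text, re);  // search anywhere, not a full match
}
```

Calling contains_pattern(pageText, "TrackEvent") would report whether the keyword occurs anywhere in the downloaded text.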

Some recommended regex tutorials:
Regular Expressions: a 30-Minute Introductory Tutorial
http://www.cnblogs.com/deerchao/ ... zhongjiaocheng.html

C++: Regex regular expressions
http://blog.sina.com.cn/s/blog_ac9fdc0b0101oow9.html

Let's start with a simple example.

#include <tchar.h>      // _tmain and _TCHAR (Windows-specific)
#include <string>
#include <iostream>
#include <regex>

using namespace std::tr1;
using namespace std;

string strContent = " onclick=\"VeryCD.TrackEvent('base', '首頁大推', '神雕俠侶');";

void Test1()
{
        string strText = strContent;
        string strRegex = "首頁大推";
        regex regExpress(strRegex);

        smatch ms;

        cout << "*****************************" << endl;
        cout << "Test 1" << endl << endl;

        while(regex_search(strText, ms, regExpress))
        {
                for(string::size_type i = 0;i < ms.size();++i)
                {
                        cout << ms.str(i).c_str() << endl;
                }
                strText = ms.suffix().str();
        }

        cout << "*****************************" << endl << endl;
}

void Test2()
{
        string strText = strContent;
        string strRegex = "首頁大推.*'(.*)'";
        regex regExpress(strRegex);

        smatch ms;

        cout << "*****************************" << endl;
        cout << "Test 2" << endl << endl;
        while(regex_search(strText, ms, regExpress))
        {
                for(string::size_type i = 0;i < ms.size();++i)
                {
                        cout << ms.str(i).c_str() << endl;
                }
                strText = ms.suffix().str();
        }
        cout << "*****************************" << endl << endl;
}


int _tmain(int argc, _TCHAR* argv[])
{
        Test1();
        Test2();
                
        return 0;
}

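In Test1 we are simply searching for a literal string and printing the text that was found. In Test2 we use the pattern 首頁大推.*'(.*)'
The . matches any single character except a line break (spaces included), and * repeats the preceding item any number of times (zero or more).
The parentheses mark a capture group, that is, the part we actually want to extract.
Note that the group sits between two ' characters. Because .* is greedy, the pattern finds 首頁大推, skips as many characters as it can, and captures the content between the last pair of ' marks that follows.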
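
To make the greedy behaviour concrete, here is a small sketch, assuming C++11 std::regex. The helper first_capture and the ASCII sample text are illustrative stand-ins, not taken from the original page.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of the first match of
// `pattern` in `text`, or an empty string when nothing matches.
std::string first_capture(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re) && m.size() > 1)
        return m.str(1);   // index 1 is the first ( ) group
    return std::string();
}
```

With the text TrackEvent('base', 'banner', 'title'); the pattern base.*'(.*)' captures title, because the greedy .* runs past 'banner'. A stricter pattern such as base'[^']*'([^']*)' captures banner instead, since [^']* cannot cross a quote.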

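The output also shows how ms is laid out: it first holds the whole string that matched the pattern, and after that the substrings that matched each ( ) group.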
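
The layout of the match results can be sketched as follows, assuming C++11 std::regex. The function name match_parts and the sample pattern in the note below are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: copies every entry of the match results into a vector.
// Entry 0 is the whole match; entries 1..n are the ( ) groups in order.
std::vector<std::string> match_parts(const std::string& text,
                                     const std::string& pattern)
{
    std::vector<std::string> parts;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (std::smatch::size_type i = 0; i < m.size(); ++i)
            parts.push_back(m.str(i));
    }
    return parts;   // empty when there is no match
}
```

For the text id=42;name=bob and the pattern id=([0-9]+);name=([a-z]+), the vector holds the whole match first, then 42, then bob.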

Now for something a little deeper. Let's analyze this string:
string strContent0 = "alt=\"火影忍者\" /><div class=\"play_ico_middle\"></div><div class=\"cv-title\" style=\"width:85px;\">更新至612集</div>";

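The regular expression we use is string strRegex = "alt=\"([^\"]*)\".*width:85px;\">(.*)</div>";
Note that it contains two pairs of parentheses: one captures the content between the two " marks after alt=, the other captures the content between width:85px;\"> and </div>.
Also note that inside a C++ string literal each " has to be written as \".
Now let's look at the two groups, ([^\"]*) and (.*).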

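(.*) needs little explanation: it is any run of characters other than line breaks, here the content between width:85px;\"> and </div>.
([^\"]*) means any character other than ", repeated any number of times; this group captures the content between the two " marks after alt=.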
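
If a C++11 compiler is available (an assumption; VS2008 does not support this), a raw string literal avoids the \" escapes entirely. The sketch below wraps the same pattern in a hypothetical helper named extract_alt_and_caption. A custom delimiter (re) is needed because the pattern itself contains the sequence )" which would end a plain R"( ... )" literal early.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns (alt text, caption) from an HTML snippet,
// or a pair of empty strings when the pattern does not match.
std::pair<std::string, std::string> extract_alt_and_caption(const std::string& html)
{
    // Same pattern as in the text, written as a raw string literal.
    static const std::regex re(R"re(alt="([^"]*)".*width:85px;">(.*)</div>)re");
    std::smatch m;
    if (std::regex_search(html, m, re))
        return std::make_pair(m.str(1), m.str(2));
    return std::make_pair(std::string(), std::string());
}
```

The first element of the pair corresponds to the ([^"]*) group and the second to the (.*) group.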

When run, Test3 prints only the substrings captured by the parentheses; the full match is left out to keep the output short. The complete code:

/*******************************************************************************
*  @file        
*  @author      def< qq group: 324164944 >
*  @blog        http://www.cnblogs.com/itdef/
*  @brief     
/*******************************************************************************/


#include <tchar.h>      // _tmain and _TCHAR (Windows-specific)
#include <string>
#include <iostream>
#include <regex>

using namespace std::tr1;
using namespace std;



string strContent = " onclick=\"VeryCD.TrackEvent('base', '首頁大推', '神雕俠侶');";

string strContent0 = "alt=\"火影忍者\" /><div class=\"play_ico_middle\"></div><div class=\"cv-title\" style=\"width:85px;\">更新至612集</div>";

void Test1()
{
        string strText = strContent;
        string strRegex = "首頁大推";
        regex regExpress(strRegex);

        smatch ms;

        cout << "*****************************" << endl;
        cout << "Test 1" << endl << endl;

        while(regex_search(strText, ms, regExpress))
        {
                for(string::size_type i = 0;i < ms.size();++i)
                {
                        cout << ms.str(i).c_str() << endl;
                }
                strText = ms.suffix().str();
        }

        cout << "*****************************" << endl << endl;
}

void Test2()
{
        string strText = strContent;
        string strRegex = "首頁大推.*'(.*)'";
        regex regExpress(strRegex);

        smatch ms;

        cout << "*****************************" << endl;
        cout << "Test 2" << endl << endl;
        while(regex_search(strText, ms, regExpress))
        {
                for(string::size_type i = 0;i < ms.size();++i)
                {
                        cout << ms.str(i).c_str() << endl;
                }
                strText = ms.suffix().str();
        }
        cout << "*****************************" << endl << endl;
}

void Test3()
{
        string strText = strContent0;
        string strRegex = "alt=\"([^\"]*)\".*width:85px;\">(.*)</div>";
        regex regExpress(strRegex);

        smatch ms;

        cout << "*****************************" << endl;
        cout << "Test 3" << endl << endl;

        while(regex_search(strText, ms, regExpress))
        {
                for(string::size_type i = 0;i < ms.size();++i)
                {
                        if(i > 0)
                                cout << ms.str(i).c_str() << endl;
                }
                strText = ms.suffix().str();
        }
        cout << "*****************************" << endl << endl;

}


int _tmain(int argc, _TCHAR* argv[])
{
        Test1();
        Test2();
        Test3();
                
        return 0;
}
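
The while loops above find successive matches by re-searching ms.suffix(), which copies the remaining text on every pass and would never terminate if the pattern could match an empty string. As an alternative sketch, assuming C++11, std::sregex_iterator walks all matches without those copies. The helper name collect_captures is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns capture group 1 of every match of
// `pattern` in `text`, in order of appearance.
std::vector<std::string> collect_captures(const std::string& text,
                                          const std::string& pattern)
{
    std::vector<std::string> out;
    const std::regex re(pattern);
    // A default-constructed sregex_iterator acts as the end marker.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m.size() > 1)
            out.push_back(m.str(1));
    }
    return out;
}
```

For a page with several alt="..." attributes, collect_captures(page, "alt=\"([^\"]*)\"") would gather each attribute value in turn.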
