Displaying User Profiles Crawled from Zhenai.com

Reposted from the original article, 2019-10-18 00:34

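A crawler written in Go collected more than 30,000 user profiles from Zhenai.com and stored them in Elasticsearch; a query against the index returns more than 30,000 profiles (shown as a screenshot in the original post).

First, the final result (also a screenshot in the original post):

The page is built with Go's HTML template package, html/template.

The template is executed like this: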

func (s SearchResultView) Render(w io.Writer, data model.SearchResult) error {
	return s.template.Execute(w, data)
}
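
For context, here is a minimal sketch of how such a view might be built and called. The constructor name newSearchResultView, the helper renderToString, the inline template text, and the trimmed-down SearchResult type are all assumptions for illustration; the project itself parses an HTML file.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"io"
)

// SearchResult is a trimmed-down stand-in for model.SearchResult.
type SearchResult struct {
	Hits  int64
	Query string
}

// SearchResultView wraps a parsed template, as in the article.
type SearchResultView struct {
	template *template.Template
}

// newSearchResultView is a hypothetical constructor that parses an inline
// template; template.Must panics if the text does not parse.
func newSearchResultView(text string) SearchResultView {
	return SearchResultView{template: template.Must(template.New("search").Parse(text))}
}

// Render executes the template with data and writes the result to w.
func (s SearchResultView) Render(w io.Writer, data SearchResult) error {
	return s.template.Execute(w, data)
}

// renderToString renders into memory so the output can be inspected.
func renderToString(v SearchResultView, data SearchResult) (string, error) {
	var buf bytes.Buffer
	err := v.Render(&buf, data)
	return buf.String(), err
}

func main() {
	view := newSearchResultView(`<p>{{.Hits}} results for {{.Query}}</p>`)
	out, err := renderToString(view, SearchResult{Hits: 3, Query: "a<b"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // <p>3 results for a&lt;b</p>
}
```

Because Render only needs an io.Writer, the same method can write to an http.ResponseWriter in a handler or to a file in a test.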

The model.SearchResult struct is defined as follows:

type SearchResult struct {
	Hits int64
	Start int
	Query string
	PrevFrom int
	NextFrom int
	CurrentPage int
	TotalPage int64
	Items []interface{}
	//Items []engine.Item
}
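
The article does not show how the paging fields get their values. The sketch below is one plausible way to fill them, assuming a fixed page size, a zero-based from offset that is a multiple of the page size, and -1 as a marker for "no such page". The helper name buildPage and these conventions are assumptions, not the project's code.

```go
package main

import "fmt"

// SearchResult mirrors the paging-related fields of model.SearchResult.
type SearchResult struct {
	Hits        int64
	Start       int
	PrevFrom    int
	NextFrom    int
	CurrentPage int
	TotalPage   int64
}

// buildPage is a hypothetical helper: from is the zero-based offset of the
// first item on the page, pageSize the number of items per page.
func buildPage(hits int64, from, pageSize int) SearchResult {
	r := SearchResult{Hits: hits, Start: from}
	r.CurrentPage = from/pageSize + 1
	// Round up so a partly filled last page still counts as a page.
	r.TotalPage = (hits + int64(pageSize) - 1) / int64(pageSize)
	// -1 marks "no such page" in this sketch.
	r.PrevFrom, r.NextFrom = -1, -1
	if from > 0 {
		r.PrevFrom = from - pageSize
	}
	if int64(from+pageSize) < hits {
		r.NextFrom = from + pageSize
	}
	return r
}

func main() {
	r := buildPage(123, 20, 10)
	fmt.Println(r.CurrentPage, r.TotalPage, r.PrevFrom, r.NextFrom) // 3 13 10 30
}
```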

```html
<!DOCTYPE html>
<html xmlns:javascript="http://www.w3.org/1999/xhtml">
<head>
    <title>Love Search</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link href="./css/style.css" rel="stylesheet">
    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.0/css/bootstrap.min.css" rel="stylesheet"
          id="bootstrap-css">
    <script src="https://code.jquery.com/jquery-1.11.1.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.0/js/bootstrap.min.js"></script>
    <script src="./js/page.js"></script>
</head>
<body>

<div id="demo">

    <div id="searchblank">
        <form method="get" class="form-inline">
            <div class="form-group">
                <input type="text" class="form-control" style="width: 500px" value="{{.Query}}" name="q" maxlength="100">
                <button class="btn btn-default" type="submit">Search</button>
            </div>
        </form>
    </div>
    <h4 style="text-align: center">Found {{.Hits}} matching results. Showing {{len .Items}} starting from {{.Start}}.</h4>

    <div id="customers" class="table-responsive-vertical shadow-z-1">
        <table id="table" class="table table-striped table-hover table-mc-indigo">
            <thead>
            <tr>
                <th>Nickname</th>
                <th>Gender</th>
                <th>Age</th>
                <th>Height</th>
                <th>Weight</th>
                <th>Income</th>
                <th>Education</th>
                <th>Occupation</th>
                <th>Location</th>
                <th>Zodiac sign</th>
                <th>Home ownership</th>
                <th>Car ownership</th>
            </tr>
            </thead>

            <tbody>
            {{range .Items}}
            <tr>
                <td><a href="{{.Url}}" target="_blank">{{.Payload.Name}}</a></td>
            {{with .Payload}}
                <td>{{.Gender}}</td>
                <td>{{.Age}}</td>
                <td>{{.Height}}CM</td>
                <td>{{.Weight}}KG</td>
                <td>{{.Income}}</td>
                <td>{{.Education}}</td>
                <td>{{.Occupation}}</td>
                <td>{{.Hukou}}</td>
                <td>{{.Xinzuo}}</td>
                <td>{{.House}}</td>
                <td>{{.Car}}</td>
            {{end}}
            </tr>
            {{else}}
            <tr>
                <td colspan="12">No matching users found</td>
            </tr>
            {{end}}
            </tbody>
        </table>
        <div align="middle">
        {{if gt .CurrentPage 1}}
            <a href="search?q={{.Query}}&current={{Sub .CurrentPage 1}}">Previous</a>
        {{end}}
        {{if lt .CurrentPage .TotalPage}}
            <a href="search?q={{.Query}}&current={{Add .CurrentPage 1}}">Next</a>
        {{end}}
            <span>Page {{.CurrentPage}} of {{.TotalPage}}</span>
        </div>
    </div>
</div>
</body>
</html>
```

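The template uses variables, functions, conditionals, and loops from the template syntax.

Defining template functions:
In the template above, the href attributes of the "Previous" and "Next" links call the custom template functions Sub and Add to compute the previous and next page numbers, which are then passed to the backend (no JavaScript is involved).

The html/template package offers only a limited set of built-in functions, so user-defined functions are often needed to help render a page. Here is how they work. When a new template is created, the Funcs method registers a set of custom functions on it, and every file rendered through that template can then call those functions directly.

Function declaration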

// Funcs adds the elements of the argument map to the template's function map.
// It panics if a value in the map is not a function with appropriate return
// type. However, it is legal to overwrite elements of the map. The return
// value is the template, so calls can be chained.
func (t *Template) Funcs(funcMap FuncMap) *Template {
	t.text.Funcs(template.FuncMap(funcMap))
	return t
}

The Funcs method is what registers our template functions. It takes an argument of type FuncMap:

// FuncMap is the type of the map defining the mapping from names to
// functions. Each function must have either a single return value, or two
// return values of which the second has type error. In that case, if the
// second (error) argument evaluates to non-nil during execution, execution
// terminates and Execute returns that error. FuncMap has the same base type
// as FuncMap in "text/template", copied here so clients need not import
// "text/template".
type FuncMap map[string]interface{}

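Usage:

Define two functions, Add and Sub, in Go code:

// Sub returns a - b; the template uses it to subtract 1 from the page number.
func Sub(a, b int) int {
	return a - b
}

// Add returns a + b; the template uses it to add 1 to the page number.
func Add(a, b int) int {
	return a + b
}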

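Binding the functions to the template:

Create a map of type FuncMap whose keys are the names used inside the template and whose values are the Go functions defined above, then register it on the template with Funcs.

filename := "../view/template_test.html"

tpl, err := template.New(path.Base(filename)).Funcs(template.FuncMap{"Add": Add, "Sub": Sub}).ParseFiles(filename)

if err != nil {
	t.Fatal(err)
}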

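Using the functions in the template:

For example, the "Previous" link in the HTML template above contains:

{{Sub .CurrentPage 1}}

At render time this subtracts 1 from the value of CurrentPage.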
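
The following is a small self-contained sketch of the same idea with an inline template instead of a file. The pager type, the template text, and the helper names renderLinks and parseWithoutFuncs are assumptions for illustration. The second helper shows that parsing fails when the function map has not been registered first.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// Sub and Add are the helper functions from the article.
func Sub(a, b int) int { return a - b }
func Add(a, b int) int { return a + b }

// pager holds only the field this sketch needs.
type pager struct {
	CurrentPage int
}

// renderLinks registers the functions before parsing, then renders the
// previous and next page numbers for the given current page.
func renderLinks(current int) (string, error) {
	const text = `prev={{Sub .CurrentPage 1}} next={{Add .CurrentPage 1}}`
	tpl, err := template.New("links").
		Funcs(template.FuncMap{"Add": Add, "Sub": Sub}).
		Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tpl.Execute(&buf, pager{CurrentPage: current}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// parseWithoutFuncs parses a template that calls Add without registering
// it, and returns the resulting parse error.
func parseWithoutFuncs() error {
	_, err := template.New("links").Parse(`{{Add 1 2}}`)
	return err
}

func main() {
	out, err := renderLinks(5)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prev=4 next=6
	fmt.Println("parse without Funcs:", parseWithoutFuncs())
}
```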

Notes:

1. Functions must be registered before calling ParseFiles, because the parser has to know the function names while it parses the template.

2. A Template object can have multiple templates in it and each one has a name. If you look at the implementation of ParseFiles, you see that it uses the filename as the template name inside of the template object. So, name your file the same as the template object (probably not generally practical), or else use ExecuteTemplate instead of just Execute.

3. The name of the template is the bare filename of the template, not the complete path. If the template name is wrong, execution fails with:

error: template: “…” is an incomplete or empty template

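The third point in particular is the one I ran into today: the template name must be the bare file name, not a name that includes the path. Consider the following code: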


func TestTemplate3(t *testing.T) {

	//filename := "crawler/frontend/view/template.html"
	filename := "../view/template_test.html"

	//file, _ := os.Open(filename)

	t.Logf("baseName:%s\n", path.Base(filename))

	tpl, err := template.New(filename).Funcs(template.FuncMap{"Add": Add, "Sub": Sub}).ParseFiles(filename)

	if err != nil {
		t.Fatal(err)
	}

	page := common.SearchResult{}

	page.Hits = 123
	page.Start = 0
	item := engine.Item {
		Url:  "http://album.zhenai.com/u/107194488",
		Type: "zhenai",
		Id:   "107194488",
		Payload: model.Profile{
			Name:       "霓裳",
			Age:        28,
			Height:     157,
			Marriage:   "未婚",
			Income:     "5001-8000元",
			Education:  "中專",
			Occupation: "程序媛",
			Gender:     "女",
			House:      "已購房",
			Car:        "已購車",
			Hukou:      "上海徐匯區",
			Xinzuo:    "水瓶座",
		},
	}

	page.CurrentPage = 1
	page.TotalPage = 10
	page.Items = append(page.Items, item)

	afterHtml, err := os.Create("template_test1.html")

	if err != nil {
		t.Fatal(err)
	}

	tpl.Execute(afterHtml, page)
}

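Here template.New(filename) receives the file name including its path, so after the code runs, template_test1.html is empty. The test still passes, because the error returned by Execute is discarded, but when the same template is rendered for the browser the error is reported: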

 template: “…” is an incomplete or empty template

So use the file's base name instead:

tpl, err := template.New(path.Base(filename)).Funcs(template.FuncMap{"Add": Add, "Sub": Sub}).ParseFiles(filename)

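With this change, template_test1.html contains the rendered output after the code runs.

Other syntax: variables, conditionals, and loops are straightforward and gave me no trouble. Features such as nested templates are not used here, so they are not covered.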
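
The behaviour can be reproduced in isolation. The sketch below writes a throwaway template file and compares three cases: a template object named with the full path, one named with the base name, and the ExecuteTemplate alternative mentioned in note 2. The helper names and the file name page.html are assumptions, and filepath.Base is used here instead of path.Base so that the sketch also handles operating-system path separators.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"os"
	"path/filepath"
)

// writeTemplate creates a temporary file named page.html and returns its
// full path together with a function that removes it again.
func writeTemplate() (string, func(), error) {
	dir, err := os.MkdirTemp("", "tpl")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	filename := filepath.Join(dir, "page.html")
	if err := os.WriteFile(filename, []byte(`page {{.}}`), 0o600); err != nil {
		cleanup()
		return "", nil, err
	}
	return filename, cleanup, nil
}

// renderNamed parses filename into a template object called name, executes
// that object, and returns the output with any Execute error.
func renderNamed(name, filename string) (string, error) {
	tpl, err := template.New(name).ParseFiles(filename)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = tpl.Execute(&buf, 7)
	return buf.String(), err
}

// renderByLookup keeps the mismatched name but selects the parsed file
// explicitly by its base name with ExecuteTemplate.
func renderByLookup(filename string) (string, error) {
	tpl, err := template.New(filename).ParseFiles(filename)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = tpl.ExecuteTemplate(&buf, filepath.Base(filename), 7)
	return buf.String(), err
}

func main() {
	filename, cleanup, err := writeTemplate()
	if err != nil {
		panic(err)
	}
	defer cleanup()

	_, err = renderNamed(filename, filename)
	fmt.Println("full path as name, error:", err)

	out, err := renderNamed("page.html", filename)
	fmt.Println("base name as name:", out, err)

	out, err = renderByLookup(filename)
	fmt.Println("ExecuteTemplate:", out, err)
}
```

Checking the error returned by Execute, as renderNamed does, is what makes the mismatch visible in a test instead of silently producing an empty file.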

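A problem encountered when querying:

Each page shows 10 records. Page 1000 works, but requesting page 1001 or later produces an error.

Calling Elasticsearch directly with a REST client makes the error clearer: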

{
  "error" : {
    "root_cause" : [
      {
        "type" : "query_phase_execution_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10010]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "dating_profile",
        "node" : "bJhldvT6QeaRTvHmBKHT4Q",
        "reason" : {
          "type" : "query_phase_execution_exception",
          "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10010]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
        }
      }
    ]
  },
  "status" : 500
}

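A quick search showed that the cause is Elasticsearch's default limit on deep paging. A weakness of ES's from/size paging is that, with 5,010 records for example, asking only for records 5,000 to 5,010 still makes ES load the first 5,000 into memory. To keep oversized paging requests from exhausting memory on the machines running ES, deep paging is capped at 10,000 records by default, which is exactly why requesting anything past the 10,000th record raises the "Result window is too large" error. (For page 1001 the backend computes (1001 - 1) × 10 = 10000 as the from value, and ES requires from + size to be less than or equal to index.max_result_window, the maximum window, so 10000 + 10 = 10010 is rejected.)

One way around this is to raise index.max_result_window, the maximum window that limits deep paging, for the index: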
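
The arithmetic can be checked with a few lines of Go. The function names fromForPage and withinWindow are hypothetical; the page size of 10 and the window of 10000 come from the article and the error message.

```go
package main

import "fmt"

// defaultMaxResultWindow is the default limit quoted in the error message.
const defaultMaxResultWindow = 10000

// fromForPage converts a one-based page number into the zero-based from
// offset sent to Elasticsearch.
func fromForPage(page, size int) int {
	return (page - 1) * size
}

// withinWindow reports whether from+size stays within the result window.
func withinWindow(page, size, window int) bool {
	return fromForPage(page, size)+size <= window
}

func main() {
	for _, page := range []int{1000, 1001} {
		from := fromForPage(page, 10)
		fmt.Printf("page=%d from=%d from+size=%d ok=%t\n",
			page, from, from+10, withinWindow(page, 10, defaultMaxResultWindow))
	}
	// page=1000 from=9990 from+size=10000 ok=true
	// page=1001 from=10000 from+size=10010 ok=false
}
```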

curl -XPUT http://127.0.0.1:9200/dating_profile/_settings -d '{ "index" : { "max_result_window" : 50000}}'

Here dating_profile is the name of the index to modify and 50000 is the new window size. Once the window has been raised, records beyond the first 10,000 can be retrieved.
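
The same settings call can be issued from Go with the standard library. This is only a sketch: the helper names newWindowRequest and describe are assumptions, the request is built but not sent, and the Content-Type header is included because newer Elasticsearch versions reject request bodies without one.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// newWindowRequest builds the PUT request that changes
// index.max_result_window for one index. Nothing is sent here.
func newWindowRequest(baseURL, index string, window int) (*http.Request, error) {
	body := fmt.Sprintf(`{"index":{"max_result_window":%d}}`, window)
	url := fmt.Sprintf("%s/%s/_settings", strings.TrimRight(baseURL, "/"), index)
	req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// describe returns method, URL, and body as one string. It reads a fresh
// copy of the body via GetBody, so the request can still be sent afterwards.
func describe(req *http.Request) (string, error) {
	rc, err := req.GetBody()
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	if err != nil {
		return "", err
	}
	return req.Method + " " + req.URL.String() + " " + string(b), nil
}

func main() {
	req, err := newWindowRequest("http://127.0.0.1:9200", "dating_profile", 50000)
	if err != nil {
		panic(err)
	}
	summary, err := describe(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(summary)
	// To apply the setting, send the request:
	//   resp, err := http.DefaultClient.Do(req)
}
```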

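Caveats

This solves the immediate problem but introduces something else to watch. With a larger window, deeper pages can be requested, but at the cost of more server memory and CPU. Consider whether oversized paging requests in your workload could cause OutOfMemory problems in the cluster. The official ES documentation also discusses deep paging: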

https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html


The key point is:

Depending on the size of your documents, the number of shards, and the hardware you are using, paging 10,000 to 50,000 results (1,000 to 5,000 pages) deep should be perfectly doable. But with big-enough from values, the sorting process can become very heavy indeed, using vast amounts of CPU, memory, and bandwidth. For this reason, we strongly advise against deep paging.

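In other words: depending on document size, shard count, and hardware, paging 10,000 to 50,000 results deep is workable, but with large enough from values the sorting step consumes large amounts of CPU, memory, and bandwidth, which is why deep paging is strongly discouraged.

As a search engine, ES is better suited to searching than to walking through large result sets. In most cases there is no need to expose more than 10,000 results; returning only the first 1,000, for example, is often enough. If you really do need to traverse and display large volumes of data, consider whether another store is a better fit, or whether the Elasticsearch scroll API (similar to an iterator, but with a time-window concept) can be used instead.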
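
One way to follow that advice without touching the index settings is to cap the number of pages the UI offers. The sketch below is an assumption about how that could look, not the project's code; lastReachablePage and cappedTotalPages are hypothetical helpers.

```go
package main

import "fmt"

// lastReachablePage returns the highest one-based page whose from+size
// still fits inside the result window: page*size <= window.
func lastReachablePage(window, size int) int {
	return window / size
}

// cappedTotalPages limits the page count shown to the user so that the
// "next page" link never points beyond the window.
func cappedTotalPages(hits int64, size, window int) int64 {
	total := (hits + int64(size) - 1) / int64(size)
	if limit := int64(lastReachablePage(window, size)); total > limit {
		return limit
	}
	return total
}

func main() {
	fmt.Println(cappedTotalPages(30000, 10, 10000)) // 1000
	fmt.Println(cappedTotalPages(123, 10, 10000))   // 13
}
```

Setting TotalPage to the capped value means the template's lt .CurrentPage .TotalPage check stops rendering the next-page link at page 1000 under the default window.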

With that, the display problem is solved:

Result for page 1001 and beyond (screenshot in the original post)

Project code: https://github.com/ll837448792/crawler
