[Original] Big Data Fundamentals: Elasticsearch (4) - The ES Data Import Process


1 Prepare the analyzer

Built-in analyzers

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

Chinese word segmentation

smartcn

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html
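The smartcn plugin ships with the Elasticsearch distribution, so (assuming a standard install layout) it can be installed with the bundled plugin tool:

$ bin/elasticsearch-plugin install analysis-smartcn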

ik

$ bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

Reference: https://github.com/medcl/elasticsearch-analysis-ik
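Once the plugin is installed and the node restarted, the analyzer can be sanity-checked with the _analyze API; the sample text below is arbitrary:

# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_analyze?pretty' -d '{"analyzer": "ik_smart", "text": "中華人民共和國國歌"}'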

Other plugins

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html

2 Create the index -- prepare the mapping and decide on shards and replication

# curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/testdoc -d '
{
  "settings": {
    "index.number_of_shards" : 10,
    "index.number_of_routing_shards" : 30,
    "index.number_of_replicas":1,
    "index.translog.durability": "async",
    "index.merge.scheduler.max_thread_count": 1,
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "_doc": { 
      "_all": {
        "enabled": false
      },
      "_source": {
        "enabled": false
      },
      "properties": { 
        "title":    { "type": "text", "analyzer": "ik_smart"}, 
        "name":     { "type": "keyword", "doc_values": false}, 
        "age":      { "type": "integer", "index": false},  
        "created":  {
          "type":   "date", 
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}'

Where:

_source controls whether the original JSON document is stored
_all controls whether all field values are combined into one catch-all field that is analyzed and indexed (searchable without naming a specific field)
analyzer specifies which analyzer (tokenizer) is applied to a text field
doc_values controls whether the field is stored column-wise (needed for sorting and aggregations)
index controls whether an inverted index is built for the field

The _source field stores the original JSON body of the document. If you don’t need access to it you can disable it.
By default Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box.
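
After the index is created, the resulting mapping and settings can be verified with:

# curl -XGET 'http://localhost:9200/testdoc/_mapping?pretty'
# curl -XGET 'http://localhost:9200/testdoc/_settings?pretty'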

 

Data types

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

There are two string types, text and keyword; the difference is that text is analyzed (split into tokens) while keyword is not.

text

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html

keyword

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html

3 Import data

3.1 Call the index API

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
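As a minimal example (the field values below are made up), a single document can be written into the index created above like this:

# curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_doc/1' -d '{"title": "hello elasticsearch", "name": "tom", "age": 30, "created": "2019-03-27"}'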

3.2 Prepare a Hive external table

See: https://www.cnblogs.com/barneywill/p/10300951.html
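The linked post has the details; a rough sketch of such a table backed by es-hadoop (the table name, columns and es.nodes value below are assumptions chosen to match the mapping above) looks like:

CREATE EXTERNAL TABLE testdoc_es (
  title string,
  name string,
  age int,
  created date
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'testdoc/_doc',
  'es.nodes' = '192.168.0.1:9200'
);

Data is then imported with an INSERT ... SELECT from an ordinary Hive table into this external table.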

4 Test

# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_xpack/sql?format=txt' -d '{"query":"select * from testdoc limit 10"}'

or

# curl -XGET 'http://localhost:9200/testdoc/_search?q=*'

5 Problems

Error: all nodes failed

2019-03-27 03:14:50,091 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.1:9200] failed (Read timed out); selected next node [192.168.0.1:9200]
2019-03-27 03:15:50,148 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.2:9200] failed (Read timed out); selected next node [192.168.0.2:9200]
2019-03-27 03:16:50,207 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.3:9200] failed (Read timed out); no other nodes left - aborting...
2019-03-27 03:16:50,208 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: Hit error while closing operators - failing tree
2019-03-27 03:16:50,210 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.1:9200, 192.168.0.2:9200, 192.168.0.3:9200]]
        at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:152)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:398)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:362)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:366)
        at org.elasticsearch.hadoop.rest.RestClient.refresh(RestClient.java:267)
        at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:550)
        at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
        at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.close(EsHiveOutputFormat.java:74)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1047)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
        ... 8 more

Solution: increase index.number_of_shards so the bulk write load is spread over more shards and nodes; it can only be set at index creation time and defaults to 5.
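The shard count and how the shards are distributed across nodes can be checked with the _cat API:

# curl 'http://localhost:9200/_cat/shards/testdoc?v'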

Error: es_rejected_execution_exception

Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [70/1000]. Error sample (first [5] error messages):
        org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [7622922][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_indix][18]] containing [38] requests, target allocation id: iLlIBScJTxahse559pTINQ, primary term: 1 on EsThreadPoolExecutor[name = 1hxgYU_/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ce11763[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 5686436]]

Cause:

thread_pool.write.queue_size

For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

The queue_size controls the number of pending requests that have no thread available to execute them (for the write pool it defaults to 200, as noted above); when a request arrives and the queue is full, the request is rejected.

Check the thread pool statistics:

# curl 'http://localhost:9200/_nodes/stats?pretty'|grep '"write"' -A 7

This usually happens because the write rate, concurrency, or overall load exceeds what ES can process; once the queue fills up, further requests are rejected.

Solutions:

1) Tune the configuration

index.refresh_interval: -1
index.number_of_replicas: 0
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024

See: https://www.cnblogs.com/barneywill/p/10615249.html
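Of these, index.refresh_interval and index.number_of_replicas are dynamic index settings and can be changed on a live index, for example:

# curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_settings' -d '
{
  "index.refresh_interval": "-1",
  "index.number_of_replicas": 0
}'

indices.memory.index_buffer_size and thread_pool.write.queue_size are node-level settings that go into elasticsearch.yml and take effect after a restart.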

2) Reduce the write pressure, for example by shrinking the es-hadoop bulk batch size as sketched below.
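
When writing from Hive via es-hadoop (as in 3.2), one knob is the bulk batch size and retry behaviour; the values below are illustrative, set via TBLPROPERTIES or job configuration:

es.batch.size.entries: 500
es.batch.size.bytes: 1mb
es.batch.write.retry.count: 10
es.batch.write.retry.wait: 60s

Smaller batches and longer retry waits give the cluster time to drain its write queue between bulk requests; lowering mapper/reducer parallelism on the Hadoop side has a similar effect.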

 

