Applications of Hadoop's Built-in Performance Testing Tools


http://www.talkwithtrend.com/Question/177983-1247453

 

The following is one of the most detailed descriptions of these tests; it is provided here for your reference:

Testing is very important for verifying a system's correctness and analyzing its performance, yet it is easily overlooked. To gain a fuller understanding of the system, find its bottlenecks and improve its performance, it makes sense to start with testing and learn Hadoop's main testing techniques. This article has two parts: the first records how to test with the tools that ship with Hadoop; the second records the installation and use of HiBench, the Hadoop benchmark suite open-sourced by Intel.

1. Hadoop Benchmarks

Hadoop ships with several benchmarks, packaged in a few jar files such as hadoop-test.jar and hadoop-examples.jar, so they are easy to run in a Hadoop environment. The Hadoop version used for the tests in this article is Cloudera's hadoop-0.20.2-cdh3u3.

Before testing, set up the environment variables:

$ export HADOOP_HOME=/home/hadoop/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin
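
As a quick sanity check that the variables point at a working installation (hadoop version is a standard subcommand that prints the build in use):

$ hadoop version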

A class inside a jar can then be invoked with:

$ hadoop jar $HADOOP_HOME/xxx.jar

 

(1). Hadoop Test

Invoking hadoop-test-0.20.2-cdh3u3.jar without arguments lists all the available test programs:

$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar
An example program must be given as the first argument.
Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  TestDFSIO: Distributed i/o benchmark.
  dfsthroughput: measure hdfs throughput
  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  minicluster: Single process HDFS and MR cluster.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode.
  testarrayfile: A test for flat files of binary key/value pairs.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testipc: A test for ipc.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testrpc: A test for rpc.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testsetfile: A test for flat files of binary key/value pairs.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

These programs exercise Hadoop from many angles; TestDFSIO, mrbench and nnbench are three widely used tests.

TestDFSIO

TestDFSIO tests the IO performance of HDFS. It uses a MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output collects statistics about the file just handled, and the reduce accumulates those statistics and produces a summary. TestDFSIO's usage is as follows:

TestDFSIO.0.0.6
Usage: TestDFSIO [genericOptions] -read | -write | -append | -clean [-nrFiles N] [-fileSize Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]

The following example writes 10 files of 1000 MB each into HDFS:

$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar TestDFSIO \
  -write -nrFiles 10 -fileSize 1000

The results are appended to a local file, TestDFSIO_results.log:

----- TestDFSIO ----- : write
            Date & time: Mon Dec 10 11:11:15 CST 2012
        Number of files: 10
 Total MBytes processed: 10000.0
      Throughput mb/sec: 3.5158047729862436
 Average IO rate mb/sec: 3.5290374755859375
 IO rate std deviation: 0.22884063705950305
     Test exec time sec: 316.615
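
A note on reading these figures (an interpretation of the standard TestDFSIO report, not something the log states itself): "Throughput mb/sec" is the total MB processed divided by the sum of the per-file IO times, while "Average IO rate mb/sec" is the mean of the individual per-file rates, which is why the two numbers are close but not identical. If all 10 writers ran fully concurrently, the aggregate write bandwidth here would be roughly 10 × 3.5 ≈ 35 MB/s.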

The following example reads 10 files of 1000 MB each from HDFS:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar TestDFSIO \
   -read -nrFiles 10 -fileSize 1000

The results are again appended to the local file TestDFSIO_results.log:

----- TestDFSIO ----- : read
            Date & time: Mon Dec 10 11:21:17 CST 2012
        Number of files: 10
 Total MBytes processed: 10000.0
      Throughput mb/sec: 255.8002711482874
 Average IO rate mb/sec: 257.1685791015625
 IO rate std deviation: 19.514058659935184
     Test exec time sec: 18.459

Use the following command to delete the test data:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar TestDFSIO -clean
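
By default TestDFSIO keeps its control and data files under /benchmarks/TestDFSIO on HDFS (the stock default; adjust the path if your build overrides it), so what a run generated can be inspected before cleaning:

$ hadoop fs -ls /benchmarks/TestDFSIO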


nnbench

nnbench load-tests the NameNode: it generates a large number of HDFS-related requests to put the NameNode under considerable pressure. The test can simulate creating, reading, renaming and deleting files on HDFS. nnbench's usage is as follows:

NameNode Benchmark 0.4
Usage: nnbench <options>
Options:
     -operation <Available operations are create_write open_read rename delete. This option is mandatory>
      * NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations.
     -maps <number of maps. default is 1. This is not mandatory>
     -reduces <number of reduces. default is 1. This is not mandatory>
     -startTime <time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time>. default is launch time + 2 mins. This is not mandatory
     -blockSize <Block size in bytes. default is 1. This is not mandatory>
     -bytesToWrite <Bytes to write. default is 0. This is not mandatory>
     -bytesPerChecksum <Bytes per checksum for the files. default is 1. This is not mandatory>
     -numberOfFiles <number of files to create. default is 1. This is not mandatory>
     -replicationFactorPerFile <Replication factor for the files. default is 1. This is not mandatory>
     -baseDir <base DFS path. default is /becnhmarks/NNBench. This is not mandatory>
     -readFileAfterOpen <true or false. if true, it reads the file and reports the average time to read. This is valid with the open_read operation. default is false. This is not mandatory>
     -help: Display the help statement


The following example uses 12 mappers and 6 reducers to create 1000 files:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar nnbench \
    -operation create_write -maps 12 -reduces 6 -blockSize 1 \
    -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 \
    -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
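
When the test is finished, the files nnbench created can be removed with the HDFS shell (using the same hostname-suffixed -baseDir as above; -rmr is the recursive remove in this generation of Hadoop):

$ hadoop fs -rmr /benchmarks/NNBench-`hostname -s`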


mrbench

mrbench runs a small job many times over. It checks whether small jobs on the cluster run repeatably and efficiently. mrbench's usage is as follows:

MRBenchmark.0.0.2
Usage: mrbench [-baseDir <base DFS path for output/input, default is /benchmarks/MRBench>] [-jar <local path to job jar file containing Mapper and Reducer implementations, default is current jar file>] [-numRuns <number of times to run the job, default is 1>] [-maps <number of maps for each run, default is 2>] [-reduces <number of reduces for each run, default is 1>] [-inputLines <number of input lines to generate, default is 1>] [-inputType <type of input to generate, one of ascending (default), descending, random>] [-verbose]

The following example runs a small job 50 times:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar mrbench -numRuns 50

The output looks like this:
DataLines    Maps    Reduces    AvgTime (milliseconds)
1            2       1          14237

This result shows an average job completion time of about 14 seconds.

(2). Hadoop Examples

Besides the tests above, Hadoop ships with a set of examples such as WordCount and TeraSort, packaged in hadoop-examples-0.20.2-cdh3u3.jar. Running the following command lists all the example programs:

$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.

WordCount was already covered in Running Hadoop On CentOS (Single-Node Cluster), so it is not repeated here.

TeraSort

A complete TeraSort test runs in three steps:

1. Generate random data with TeraGen
2. Run TeraSort on the input data
3. Verify the sorted output with TeraValidate

The input data does not have to be regenerated for every test; once it has been generated, step 1 can be skipped on later runs.

TeraGen's usage is as follows:

$ hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>

The following command runs TeraGen to generate 1 GB of input data in the directory /examples/terasort-input:

$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar teragen \
    10000000 /examples/terasort-input
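
TeraGen's first argument counts rows rather than bytes, so the data size follows from the 100-byte row format: 10,000,000 rows × 100 bytes/row = 1,000,000,000 bytes ≈ 1 GB.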

Each row produced by TeraGen has the following format:

<10 bytes key><10 bytes rowid><78 bytes filler>\r\n

where:

- key: random characters, each with an ASCII code in the range [32, 126]
- rowid: an integer, right-aligned
- filler: 78 bytes of characters in groups of 10 (7 full groups plus a final group of 8), with the character used for each group stepping through 'A' to 'Z'

The following command runs TeraSort to sort the data and writes the result to /examples/terasort-output:

$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar terasort \
   /examples/terasort-input /examples/terasort-output

The following command runs TeraValidate to verify that the TeraSort output is ordered; if problems are detected, the out-of-order keys are written to /examples/terasort-validate:

$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar teravalidate \
   /examples/terasort-output /examples/terasort-validate
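
If TeraSort's output is in order, the validation output should contain no error records; what TeraValidate wrote can be inspected with the HDFS shell (file names assumed to follow the usual part-* convention):

$ hadoop fs -cat /examples/terasort-validate/part-*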


(3). Hadoop Gridmix2

Gridmix is a benchmark that ships with Hadoop. It is a further wrapper around several of the other benchmarks, with modules for generating data, submitting jobs and recording completion times. Gridmix comes with several job types: streamSort, javaSort, combiner, monsterQuery, webdataScan and webdataSort. First, build gridmix.jar:

$ cd  $HADOOP_HOME/src/benchmarks/gridmix2
$ ant
$ cp build/gridmix.jar .

Set the environment variables

Edit the gridmix-env-2 file:

export HADOOP_INSTALL_HOME=/home/jeoygin
export HADOOP_VERSION=hadoop-0.20.2-cdh3u3
export HADOOP_HOME=${HADOOP_INSTALL_HOME}/${HADOOP_VERSION}
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
export USE_REAL_DATASET=

export APP_JAR=${HADOOP_HOME}/hadoop-test-0.20.2-cdh3u3.jar
export EXAMPLE_JAR=${HADOOP_HOME}/hadoop-examples-0.20.2-cdh3u3.jar
export STREAMING_JAR=${HADOOP_HOME}/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar

If USE_REAL_DATASET is set to TRUE, the run uses 500 GB of compressed data (equivalent to 2 TB uncompressed); if it is left empty, 500 MB of compressed data is used (equivalent to 2 GB uncompressed).

Edit the configuration

The configuration lives in the gridmix_config.xml file. In gridmix, each job type comes in three sizes: small jobs have only 3 input files (i.e. 3 maps); medium jobs take the input files matching the regular expression {part-0000,part-0001,part-000*2}; large jobs process all of the data.

Generate the data

$ chmod +x generateGridmix2data.sh
$ ./generateGridmix2data.sh

The generateGridmix2data.sh script runs a job that produces the input data under /gridmix/data on HDFS.
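
Once the job finishes, the generated input can be listed with the HDFS shell (the path comes from the description above):

$ hadoop fs -ls /gridmix/data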

Run

$ chmod +x rungridmix_2
$ ./rungridmix_2
After the run starts, a _start.out file is created to record the start time; when it finishes, an _end.out file records the completion time.


(4). Viewing job statistics

Hadoop provides a very convenient way to obtain a job's statistics with the following command:

$ hadoop job -history all <job output directory>

This command analyses the job's two history files (stored under the <job output directory>/_logs/history directory) and computes the job's statistics.
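
For example, applied to the TeraSort output directory used earlier (any job output directory containing a _logs/history subdirectory works the same way):

$ hadoop job -history all /examples/terasort-output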

 

2. HiBench

HiBench is a Hadoop benchmark suite open-sourced by Intel. It contains 9 typical Hadoop workloads across five categories (micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks and data analytics benchmarks). Its home page is https://github.com/intel-hadoop/hibench.

For most workloads HiBench offers an option to enable compression; the default compression codec is zlib.

Micro Benchmarks:

- Sort (sort): sorts data generated with Hadoop RandomTextWriter
- WordCount (wordcount): counts the occurrences of each word in the input data; the input is generated with Hadoop RandomTextWriter
- TeraSort (terasort): the standard benchmark created by Jim Gray, the database luminary at Microsoft (who went missing in 2007); the input data is produced by Hadoop TeraGen

HDFS Benchmarks:

- Enhanced DFSIO (dfsioe): tests the HDFS throughput of a Hadoop cluster by generating a large number of tasks that perform read and write requests simultaneously

Web Search Benchmarks:

- Nutch indexing (nutchindexing): large-scale search-engine indexing is an important MapReduce application; this workload tests the indexing subsystem of Nutch (an Apache open-source search engine) on automatically generated web data whose links and words follow Zipfian distributions
- PageRank (pagerank): an implementation of the PageRank algorithm on Hadoop, run on automatically generated web data whose links follow a Zipfian distribution

Machine Learning Benchmarks:

- Mahout Bayesian classification (bayes): large-scale machine learning is another important MapReduce application; this workload tests the Naive Bayes trainer in Mahout 0.7 (an Apache open-source machine learning library) on automatically generated documents whose words follow a Zipfian distribution
- Mahout K-means clustering (kmeans): tests the K-means clustering algorithm in Mahout 0.7; the input dataset is produced by GenKMeansDataset based on uniform and Gaussian distributions

Data Analytics Benchmarks:

- Hive Query Benchmarks (hivebench): developed from the SIGMOD '09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396; it contains Hive queries performing typical OLAP operations (aggregation and join) over automatically generated web data whose links follow a Zipfian distribution

Below, ${HIBENCH_HOME} denotes the directory where HiBench was unpacked.

(1). Installation and configuration

Set up the environment:

- HiBench-2.2: download from https://github.com/intel-hadoop/HiBench/zipball/HiBench-2.2
- Hadoop: make sure the Hadoop environment runs correctly before launching any workload; all workloads were tested on Cloudera Distribution of Hadoop 3 update 4 (cdh3u4) and Hadoop 1.0.3
- Hive: if hivebench is to be tested, make sure a working Hive environment has been set up

Configure all workloads:

A few global environment variables need to be set in ${HIBENCH_HOME}/bin/hibench-config.sh.

$ unzip HiBench-2.2.zip
$ cd HiBench-2.2
$ vim bin/hibench-config.sh

HADOOP_HOME      <The Hadoop installation location>
HADOOP_CONF_DIR  <The hadoop configuration DIR, default is $HADOOP_HOME/conf>
COMPRESS_GLOBAL  <Whether to enable the in/out compression for all workloads, 0 is disable, 1 is enable>
COMPRESS_CODEC_GLOBAL  <The default codec used for in/out data compression>
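
An illustrative set of values (a sketch only; the paths follow the installation used earlier in this article, and COMPRESS_CODEC_GLOBAL is shown with Hadoop's zlib-backed DefaultCodec, matching the zlib default mentioned above):

HADOOP_HOME=/home/hadoop/hadoop
HADOOP_CONF_DIR=$HADOOP_HOME/conf
COMPRESS_GLOBAL=0
COMPRESS_CODEC_GLOBAL=org.apache.hadoop.io.compress.DefaultCodec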

Configure an individual workload:

Under each workload directory, edit the conf/configure.sh file to set the parameters for that workload's run.

Also synchronize the clocks on all nodes.

(2). Running

Run several workloads together:

- Edit ${HIBENCH_HOME}/conf/benchmarks.lst, which defines the workloads to run: one workload per line, and any line can be skipped by prefixing it with # (an example is sketched below)
- Run the ${HIBENCH_HOME}/bin/run-all.sh script
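
A minimal benchmarks.lst might look like this (an illustrative sketch; the workload names are taken from the list above, and the exact set available depends on the HiBench checkout):

sort
wordcount
#terasort
dfsioe

Here terasort is prefixed with # and will be skipped.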

Run a single workload:

Each workload can also be run on its own. Typically, a workload directory contains three things:

conf/configure.sh   the configuration file holding all parameters, such as data size and test options
bin/prepare*.sh     generates or copies the job input data into HDFS
bin/run*.sh         runs the benchmark

- Configure the benchmark: if needed, edit configure.sh to set the parameters you want
- Prepare the data: run the bin/prepare.sh script to prepare the input data for the benchmark
- Run the benchmark: run the bin/run*.sh script to execute it (see the sketch below)
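
Using the sort workload as an example (a minimal sketch; the directory name is assumed, so substitute whichever workload you are running), a single run looks like:

$ cd ${HIBENCH_HOME}/sort
$ vim conf/configure.sh      # optional: adjust data size and test options for this workload
$ ./bin/prepare.sh
$ ./bin/run.sh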

(3). Summary

HiBench covers a number of widely used Hadoop benchmarks. A look at the project's source shows it is lean, with little code: a handful of scripts standardize the configuration, preparation and execution of each benchmark, which makes it very convenient to use.

 

