TPC
Official site: http://www.tpc.org/
I Introduction
The TPC is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry.
The term transaction is often applied to a wide variety of business and computer functions. Looked at as a computer function, a transaction could refer to a set of operations including disk read/writes, operating system calls, or some form of data transfer from one subsystem to another.
While TPC benchmarks certainly involve the measurement and evaluation of computer functions and operations, the TPC regards a transaction as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction, as defined by the TPC, would include the updating to a database system for such things as inventory control (goods), airline reservations (services), or banking (money).
In these environments, a number of customers or service representatives input and manage their transactions via a terminal or desktop computer connected to a database. Typically, the TPC produces benchmarks that measure transaction processing (TP) and database (DB) performance in terms of how many transactions a given system and database can perform per unit of time, e.g., transactions per second or transactions per minute.
TPC-DS is a Decision Support Benchmark
Official site: http://www.tpc.org/tpcds/default.asp
Specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.10.1.pdf
A simple schema for decision support systems or data warehouses is the star schema, where events are collected in large fact tables, while smaller supporting tables (dimensions) are used to describe the data.
The TPC-DS is an example of such a schema. It models a typical retail warehouse where the events are sales and typical dimensions are date of sale, time of sale, or demographic of the purchasing party.
TPC-DS is regarded as one of the most challenging benchmarks in the database field.
The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.
II Usage
1 Download
http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
2 Install
# unzip TPC-DS_Tools_v2.10.1.zip
# cd v2.10.1rc3/tools
# make
This produces two tools:
dsdgen
dsqgen
3 Schema initialization SQL scripts
tpcds.sql
tpcds_ri.sql
tpcds_source.sql
Depending on the target database, you may need to adjust details such as column types.
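For instance, if the target database lacks a char type, a quick text substitution over the DDL can remap it. The DDL fragment below is a stand-in, not actual kit output, and the char-to-varchar mapping is only a hypothetical example:

```shell
# Hypothetical adaptation pass: rewrite char(n) columns as varchar(n).
# The DDL fragment is fabricated for illustration; run the same sed
# over the real tpcds.sql for an actual port.
cat > /tmp/ddl_fragment.sql <<'EOF'
create table ship_mode
(
    sm_ship_mode_sk integer not null,
    sm_ship_mode_id char(16) not null,
    sm_type char(30)
);
EOF

sed 's/ char(/ varchar(/g' /tmp/ddl_fragment.sql > /tmp/ddl_out.sql
cat /tmp/ddl_out.sql
```

The same approach works for other type or syntax mismatches; keep the substitutions narrow so they don't touch identifiers.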
4 Generate test data
# mkdir /tmp/tpcdsdata
# ./dsdgen -SCALE 1GB -DIR /tmp/tpcdsdata -parallel 4 -child 4
The -SCALE option sets the volume of data to generate; change it as needed, e.g. 10GB or 1TB.
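With -parallel, dsdgen splits generation into N chunks, and -child selects which chunk the current process builds, so a fully parallel run launches one process per chunk. A dry-run sketch (it only prints the commands it would launch):

```shell
# Dry-run: emit one dsdgen command per child chunk into a file,
# then inspect it. Run the printed lines (each backgrounded, then
# wait) for a real parallel generation.
SCALE=1GB
CHUNKS=4
DIR=/tmp/tpcdsdata
: > /tmp/dsdgen_cmds.txt
for child in $(seq 1 "$CHUNKS"); do
  echo "./dsdgen -SCALE $SCALE -DIR $DIR -parallel $CHUNKS -child $child &" >> /tmp/dsdgen_cmds.txt
done
echo "wait" >> /tmp/dsdgen_cmds.txt
cat /tmp/dsdgen_cmds.txt
```

Each child writes its own portion of every table, so the chunks can be generated on separate machines and concatenated afterwards.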
5 Generate query scripts
# ./dsqgen -input ../query_templates/templates.lst -directory ../query_templates -dialect oracle -scale 1GB -OUTPUT_DIR /tmp/tpcdsdata/query_oracle
The dialects supported by default are:
db2.tpl
netezza.tpl
oracle.tpl
sqlserver.tpl
All of the built-in dialects target traditional relational databases; the next section shows how to apply TPC-DS to big data systems.
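Since -dialect appears to select one of these <name>.tpl files from the template directory, one plausible route to a new engine is to copy an existing dialect file and adapt its substitutions. The directory layout and file content below are stand-ins for illustration; check the kit's documentation for the real substitution syntax:

```shell
# Sketch: derive a new dialect definition from an existing one.
# The oracle.tpl content here is a placeholder, not the real file.
mkdir -p /tmp/query_templates
printf 'define __LIMITA = "";\n' > /tmp/query_templates/oracle.tpl
cp /tmp/query_templates/oracle.tpl /tmp/query_templates/mydb.tpl
# ...edit mydb.tpl for the target engine's SQL, then generate with:
echo './dsqgen -input ../query_templates/templates.lst -directory ../query_templates -dialect mydb -scale 1GB'
```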
III Testing Hive
Official repo: https://github.com/hortonworks/hive-testbench
1 Download and install
$ wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
$ unzip hive14.zip
$ cd hive-testbench-hive14/
$ ./tpcds-build.sh
2 Generate test data and query scripts
$ export FORMAT=parquet
$ ./tpcds-setup.sh 10
The scale argument is in GB (the log below was produced with scale 10); set FORMAT to orc, parquet, and so on.
Generation log:
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table date_dim (1/24).
Optimizing table time_dim (2/24).
Optimizing table item (3/24).
Optimizing table customer (4/24).
Optimizing table customer_demographics (5/24).
Optimizing table household_demographics (6/24).
Optimizing table customer_address (7/24).
Optimizing table store (8/24).
Optimizing table promotion (9/24).
Optimizing table warehouse (10/24).
Optimizing table ship_mode (11/24).
Optimizing table reason (12/24).
Optimizing table income_band (13/24).
Optimizing table call_center (14/24).
Optimizing table catalog_page (16/24).
Optimizing table web_page (15/24).
Optimizing table web_site (17/24).
Optimizing table store_sales (18/24).
Optimizing table store_returns (19/24).
Optimizing table web_sales (20/24).
Optimizing table web_returns (21/24).
Optimizing table catalog_sales (22/24).
Optimizing table inventory (24/24).
Optimizing table catalog_returns (23/24).
Data loaded into database tpcds_bin_partitioned_parquet_10.
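As the last log line shows, the setup script encodes both the storage format and the scale into the database name. Reconstructing the pattern (inferred from the log, not from the script's documentation):

```shell
# Database name pattern observed in the generation log:
# tpcds_bin_partitioned_<format>_<scale>
FORMAT=parquet
SCALE=10
DB="tpcds_bin_partitioned_${FORMAT}_${SCALE}"
echo "$DB"   # tpcds_bin_partitioned_parquet_10
```

Knowing the pattern matters later, because the batch-test driver reconstructs the same name from its scale and format arguments.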
Resulting tables:
hive> use tpcds_bin_partitioned_parquet_10;
OK
Time taken: 0.025 seconds
hive> show tables;
OK
call_center
catalog_page
catalog_returns
catalog_sales
customer
customer_address
customer_demographics
date_dim
household_demographics
income_band
inventory
item
promotion
reason
ship_mode
store
store_returns
store_sales
time_dim
warehouse
web_page
web_returns
web_sales
web_site
Time taken: 0.049 seconds, Fetched: 24 row(s)
3 Run the tests
The test SQL scripts are under sample-queries-tpcds:
$ cd sample-queries-tpcds
hive> use tpcds_bin_partitioned_parquet_10;
hive> source query12.sql;
4 Batch testing
Adjust the Hive settings as needed: sample-queries-tpcds/testbench.settings
Adjust the (Perl) test driver as needed: runSuite.pl
$ perl runSuiteCommon.pl
ERROR: one or more parameters not defined
Usage:
	perl runSuiteCommon.pl [tpcds|tpch] [scale]
Description:
	This script runs the sample queries and outputs a CSV file of the time it took each query to run. Also, all hive output is kept as a log file named 'queryXX.sql.log' for each query file of the form 'queryXX.sql'. Defaults to scale of 2.
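Since the driver prints one filename,status,time,rows line per query, the CSV lends itself to quick aggregation. A sketch with fabricated sample rows:

```shell
# Summarize a run CSV: success/failure counts plus total wall time.
# The three data rows are made up for illustration.
cat > /tmp/run.csv <<'EOF'
filename,status,time,rows
query12.sql,success,35,100
query20.sql,success,42,100
query67.sql,failed,301
EOF

awk -F',' 'NR > 1 {
  total += $3
  if ($2 == "success") ok++; else bad++
} END {
  printf "queries=%d success=%d failed=%d total_time=%ds\n", ok + bad, ok, bad, total
}' /tmp/run.csv
```

For the sample rows this prints queries=3 success=2 failed=1 total_time=378s.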
Core code:
my $suite = shift;
my $scale = shift || 2;
dieWithUsage("suite name required") unless $suite eq "tpcds" or $suite eq "tpch";

chdir $SCRIPT_PATH;
if( $suite eq 'tpcds' ) {
	chdir "sample-queries-tpcds";
} else {
	chdir 'sample-queries-tpch';
} # end if
my @queries = glob '*.sql';

my $db = {
	'tpcds' => "tpcds_bin_partitioned_orc_$scale",
	'tpch'  => "tpch_flat_orc_$scale"
};

print "filename,status,time,rows\n";
for my $query ( @queries ) {
	my $logname = "$query.log";
	my $cmd="echo 'use $db->{${suite}}; source $query;' | hive -i testbench.settings 2>&1 | tee $query.log";
The script takes two parameters, suite and scale, e.g. tpcds 10.
It can be made more generic: the database name hardcodes the orc format; the hive command is hardcoded; printing the command being executed helps debugging; and each hive CLI launch pays environment-initialization overhead, so connecting to a running server via beeline yields more realistic timings. After these changes the script can also drive other engines such as spark-sql, impala, and drill.
The modified version looks like this:
#!/usr/bin/perl

use strict;
use warnings;
use POSIX;
use File::Basename;

# PROTOTYPES
sub dieWithUsage(;$);

# GLOBALS
my $SCRIPT_NAME = basename( __FILE__ );
my $SCRIPT_PATH = dirname( __FILE__ );

# MAIN
dieWithUsage("one or more parameters not defined") unless @ARGV >= 4;
my $suite = shift;
my $scale = shift || 2;
my $format = shift || 3;
my $engineCmd = shift || 4;
dieWithUsage("suite name required") unless $suite eq "tpcds" or $suite eq "tpch";
print "params: $suite, $scale, $format, $engineCmd;";

chdir $SCRIPT_PATH;
if( $suite eq 'tpcds' ) {
	chdir "sample-queries-tpcds";
} else {
	chdir 'sample-queries-tpch';
} # end if
my @queries = glob '*.sql';

my $db = {
	'tpcds' => "tpcds_bin_partitioned_${format}_$scale",
	'tpch'  => "tpch_flat_${format}_$scale"
};

print "filename,status,time,rows\n";
for my $query ( @queries ) {
	my $logname = "$query.log";
	my $cmd="${engineCmd}/$db->{${suite}} -i conf.settings -f $query 2>&1 | tee $query.log";
	# my $cmd="cat $query.log";
	#print $cmd ; exit;
	my $currentTime = strftime("%Y-%m-%d %H:%M:%S", localtime(time));
	print "$currentTime : ";
	print "$cmd \n";

	my $hiveStart = time();
	my @hiveoutput=`$cmd`;
	die "${SCRIPT_NAME}:: ERROR: hive command unexpectedly exited \$? = '$?', \$! = '$!'" if $?;
	my $hiveEnd = time();
	my $hiveTime = $hiveEnd - $hiveStart;

	my $is_success = 0;
	foreach my $line ( @hiveoutput ) {
		if( $line =~ /[(\d+|No)]\s+row[s]? selected \(([\d\.]+) seconds\)/ ) {
			$is_success = 1;
			print "$query,success,$hiveTime,$1\n";
		} # end if
	} # end while
	if( $is_success == 0) {
		print "$query,failed,$hiveTime\n";
	}
} # end for

sub dieWithUsage(;$) {
	my $err = shift || '';
	if( $err ne '' ) {
		chomp $err;
		$err = "ERROR: $err\n\n";
	} # end if

	print STDERR <<USAGE;
${err}Usage:
	perl ${SCRIPT_NAME} [tpcds|tpch] [scale] [format] [engineCmd]

Description:
	This script runs the sample queries and outputs a CSV file of the time it took each query to run. Also, all hive output is kept as a log file named 'queryXX.sql.log' for each query file of the form 'queryXX.sql'. Defaults to scale of 4.
USAGE
	exit 1;
}
Run:
# beeline to hiveserver2
$ perl runSuite.pl tpcds 10 parquet "$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000"
# beeline to spark thrift server
$ perl runSuite.pl tpcds 10 parquet "$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:11111"
# beeline to impala
$ perl runSuite.pl tpcds 10 parquet "$HIVE_HOME/bin/beeline -d com.cloudera.impala.jdbc4.Driver -u jdbc:impala://localhost:21050"
Batch test script:
#!/bin/sh
current_dir=`pwd`
scale="$1"
format="$2"
if [ -z "$scale" ]; then
    scale=10
fi
if [ -z "$format" ]; then
    format="parquet"
fi
#echo "$current_dir $component $scale $test_dir"

echo "mkdir merge"
echo ""

component="hive"
test_dir="test_$component"
echo "# test $component"
echo "mkdir $test_dir"
echo "cp -R sample-queries-tpcds $test_dir"
echo "ln -s $current_dir/runSuiteCommon.pl $test_dir/runSuite.pl"
echo "cd $test_dir"
echo "perl runSuite.pl tpcds ${scale} $format \"$HIVE_HOME/bin/beeline -i conf.settings -n hadoop -u jdbc:hive2://localhost:10000\" 2>&1| tee ${component}_${scale}_${format}.log"
echo "cd .."
echo "grep -e '^query' $test_dir/${component}_${scale}_${format}.log|sort > merge/${component}_${scale}_${format}.log"
echo "wc -l merge/${component}_${scale}_${format}.log"
echo ""

component="spark"
test_dir="test_$component"
echo "# test $component"
echo "mkdir $test_dir"
echo "cp -R sample-queries-tpcds $test_dir"
echo "ln -s $current_dir/runSuiteCommon.pl $test_dir/runSuite.pl"
echo "cd $test_dir"
echo "perl runSuite.pl tpcds ${scale} $format \"$SPARK_HOME/bin/beeline -i conf.settings -u jdbc:hive2://localhost:11111\" 2>&1| tee ${component}_${scale}_${format}.log"
echo "cd .."
echo "grep -e '^query' $test_dir/${component}_${scale}_${format}.log|sort > merge/${component}_${scale}_${format}.log"
echo "wc -l merge/${component}_${scale}_${format}.log"
echo ""

component="impala"
test_dir="test_$component"
echo "# test $component"
echo "mkdir $test_dir"
echo "cp -R sample-queries-tpcds $test_dir"
echo "ln -s $current_dir/runSuiteCommon.pl $test_dir/runSuite.pl"
echo "cd $test_dir"
echo "perl runSuite.pl tpcds ${scale} $format \"$HIVE_HOME/bin/beeline -i conf.settings -d com.cloudera.impala.jdbc4.Driver -u jdbc:impala://localhost:21050\" 2>&1| tee ${component}_${scale}_${format}.log"
echo "cd .."
echo "grep -e '^query' $test_dir/${component}_${scale}_${format}.log|sort > merge/${component}_${scale}_${format}.log"
echo "wc -l merge/${component}_${scale}_${format}.log"
echo ""

component="presto"
test_dir="test_$component"
echo "# test $component"
echo "mkdir $test_dir"
echo "cp -R sample-queries-tpcds $test_dir"
echo "ln -s $current_dir/runSuiteCommon.pl $test_dir/runSuite.pl"
echo "cd $test_dir"
echo "perl runSuite.pl tpcds ${scale} $format \"$HIVE_HOME/bin/beeline -i conf.settings -d com.facebook.presto.jdbc.PrestoDriver -n hadoop -u jdbc:presto://localhost:8080/hive\" 2>&1| tee ${component}_${scale}_${format}.log"
echo "cd .."
echo "grep -e '^query' $test_dir/${component}_${scale}_${format}.log|sort > merge/${component}_${scale}_${format}.log"
echo "wc -l merge/${component}_${scale}_${format}.log"
echo ""

echo "awk -F ',' '{if(NF==4){print \$1\",\"\$4}else{print \$1\",0\"}}' merge/hive_${scale}_${format}.log > /tmp/hive_${scale}_${format}.log"
echo "awk -F ',' '{if(NF==4){print \$4}else{print \"0\"}}' merge/spark_${scale}_${format}.log > /tmp/spark_${scale}_${format}.log"
echo "awk -F ',' '{if(NF==4){print \$4}else{print \"0\"}}' merge/impala_${scale}_${format}.log > /tmp/impala_${scale}_${format}.log"
echo "awk -F ',' '{if(NF==4){print \$4}else{print \"0\"}}' merge/presto_${scale}_${format}.log > /tmp/presto_${scale}_${format}.log"
echo "paste -d\",\" /tmp/hive_${scale}_${format}.log /tmp/spark_${scale}_${format}.log /tmp/impala_${scale}_${format}.log /tmp/presto_${scale}_${format}.log > merge/result_${scale}_${format}.csv"
echo "sed -i \"1i sql_${scale}_${format},hive,spark,impala,presto\" merge/result_${scale}_${format}.csv"
echo ""
After merging, the results can be charted in Excel.
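Before charting, the merged CSV can also be summarized on the command line. A sketch with fabricated rows, assuming the sql_<scale>_<format>,hive,spark,impala,presto header the batch script writes:

```shell
# Average query time per engine from the merged result CSV.
# The two data rows are fabricated; a real file lives at
# merge/result_<scale>_<format>.csv.
cat > /tmp/result_10_parquet.csv <<'EOF'
sql_10_parquet,hive,spark,impala,presto
query12.sql,35,12,5,8
query20.sql,41,16,7,10
EOF

awk -F',' '
NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
{ for (i = 2; i <= NF; i++) sum[i] += $i; n++ }
END { for (i = 2; i <= NF; i++) printf "%s avg=%.1f\n", name[i], sum[i] / n }
' /tmp/result_10_parquet.csv
```

For the sample rows this prints hive avg=38.0, spark avg=14.0, impala avg=6.0, presto avg=9.0, one engine per line.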