TPCx-BB is a big-data benchmark. It models 30 application scenarios of a retailer and executes 30 queries to measure the performance, both hardware and software, of a Hadoop-based big-data system. Some of the scenarios also use machine learning algorithms (clustering, linear regression, and so on). To properly interpret the performance of a system under test, you need a thorough understanding of the whole TPCx-BB workflow. This article analyzes the source code of the TPCx-BB kit in detail, in the hope that it helps you understand how TPCx-BB works.
Code Structure
The home directory ($BENCH_MARK_HOME) contains the following subdirectories:
- bin
- conf
- data-generator
- engines
- tools
bin contains the modules, i.e. the scripts used at run time: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, and so on.
conf contains two configuration files: bigBench.properties and userSettings.conf.
bigBench.properties mainly sets the workload (the benchmark phases to execute) and power_test_0 (the SQL queries to run in the POWER_TEST phase).
Default workload:
workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1
Default power_test_0: 1-30
userSettings.conf holds the basic settings, including the JAVA environment, the default settings for the benchmark (database, engine, map_tasks, scale_factor, ...), the HADOOP environment, HDFS config and paths, and Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV, ...).
data-generator contains the scripts and configuration files related to data generation, described in detail below.
engines contains the four engines TPCx-BB supports: biginsights, hive, impala, and spark_sql. The default engine is hive. In fact, only the hive directory is non-empty; the other three are empty, presumably because they are not yet implemented.
tools contains two jar files: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is an essential part of the TPCx-BB test: most of the driver logic lives in that jar.
Data Generation
The programs and configuration for data generation live in the data-generator directory, which contains pdgf.jar and three subdirectories: config, dicts, and extlib.
pdgf.jar is the Java data generator, and it is a large codebase. config contains two configuration files: bigbench-generation.xml and bigbench-schema.xml.
bigbench-generation.xml mainly specifies the raw data to generate (not the database tables): which tables are produced, each table's name and size, the output directory, the file suffix, the delimiter, the character encoding, and so on.
<schema name="default">
<tables>
<!-- not refreshed tables -->
<!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
<table name="income_band"></table>
<table name="reason"></table>
<table name="ship_mode"></table>
<table name="web_site"></table>
<!-- /tables not used in benchmark -->
<!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
<table name="date_dim" static="true"></table>
<table name="time_dim" static="true"></table>
<table name="customer_demographics" static="true"></table>
<table name="household_demographics" static="true"></table>
<!-- /static tables -->
<!-- "normal" tables. split over all nodes. not generated during refresh -->
<table name="store"></table>
<table name="warehouse"></table>
<table name="promotion"></table>
<table name="web_page"></table>
<!-- /"normal" tables.-->
<!-- /not refreshed tables -->
<!--
refreshed tables. Generated on all nodes.
Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size]
.Has effect only if ${REFRESH_SYSTEM_ENABLED}==1.
-->
<table name="customer">
<scheduler name="DefaultScheduler">
<partitioner
name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
<prePartition><![CDATA[
if(${REFRESH_SYSTEM_ENABLED}>0){
int tableID = table.getTableID();
int timeID = 0;
long lastTableRow=table.getSize()-1;
long rowStart;
long rowStop;
boolean exclude=false;
long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
if(${REFRESH_PHASE}>0){
//Refresh part
rowStart = lastTableRow - refreshRows +1;
rowStop = lastTableRow;
if(refreshRows<=0){
exclude=true;
}
}else{
//"normal" part
rowStart = 0;
rowStop = lastTableRow - refreshRows;
}
return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
}else{
//DEFAULT
return getParentPartitioner().getDefaultPrePartition(project, table);
}
]]></prePartition>
</partitioner>
</scheduler>
</table>
<output name="SplitFileOutputWrapper">
<!-- DEFAULT output for all Tables, if no table specific output is specified-->
<output name="CSVRowOutput">
<fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.dat</fileEnding>
<delimiter>|</delimiter>
<charset>UTF-8</charset>
<sortByRowID>true</sortByRowID>
</output>
<output name="StatisticsOutput" active="1">
<size>${item_size}</size><!-- a counter per item .. initialize later-->
<fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.csv</fileEnding>
<delimiter>,</delimiter>
<header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->
</header>
<footer></footer>
bigbench-schema.xml defines a large number of parameters. Some control the table sizes (the number of rows per table); most concern the individual fields, such as start and end dates, gender ratio, marriage ratio, and the upper and lower bounds of various metrics. It also defines exactly how each field is generated and the constraints on it. Examples:
The size of the generated data is determined by SCALE_FACTOR (-f). With -f 1, the total generated data is about 1 GB; with -f 100, about 100 GB. So how does SCALE_FACTOR (-f) control the generated data size so precisely?
The reason is that SCALE_FACTOR (-f) determines the row count of every table. As shown below, the customer table has 100000.0d * ${SF_sqrt} rows: with -f 1 it has 100000 * sqrt(1) = 100,000 rows; with -f 100 it has 100000 * sqrt(100) = 1,000,000 rows. (A small sanity check of this arithmetic follows the XML excerpts below.)
<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>
<field name="d_date" size="13" type="CHAR" primary="false">
<gen_DateTime>
<disableRng>true</disableRng>
<useFixedStepSize>true</useFixedStepSize>
<startDate>${date_dim_begin_date}</startDate>
<endDate>${date_dim_end_date}</endDate>
<outputFormat>yyyy-MM-dd</outputFormat>
</gen_DateTime>
</field>
<field name="t_time_id" size="16" type="CHAR" primary="false">
<gen_ConvertNumberToString>
<gen_Id/>
<size>16.0</size>
<characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
</gen_ConvertNumberToString>
</field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
<gen_Null probability="${NULL_CHANCE}">
<gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
</gen_Null>
</field>
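As a quick sanity check on the row-count arithmetic above, here is a tiny stand-alone sketch (a hypothetical demo, not part of the kit) that reproduces the ${customer_size} formula:

```java
// Reproduces the customer_size rule from bigbench-schema.xml:
// rows = 100000.0d * sqrt(SCALE_FACTOR)
public class CustomerSizeDemo {
    static long customerRows(double scaleFactor) {
        return (long) (100000.0d * Math.sqrt(scaleFactor));
    }

    public static void main(String[] args) {
        System.out.println(customerRows(1));   // 100000   (-f 1)
        System.out.println(customerRows(100)); // 1000000  (-f 100)
    }
}
```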
dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict, and mail_provider.dict; the field values of each generated record are drawn from these dictionaries.
extlib contains the external jars the generator depends on, such as lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar, and log4j-1.2.15.jar.
Summary:
pdgf.jar takes the settings in bigbench-generation.xml and bigbench-schema.xml (table names, field names, row counts, and the generation rule for each field), draws the value of every field of every record from the corresponding .dict files under dicts, and produces the raw data.
A sample record from the customer table:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
If the TPCx-BB test is run with -f 1 (SCALE_FACTOR = 1), the generated raw data totals roughly 1 GB (977M + 8.6M):
[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M 38.0 M /user/root/benchmarks/bigbench/data/customer
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/customer_address
74.2 M 222.5 M /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M 44.0 M /user/root/benchmarks/bigbench/data/date_dim
151.5 K 454.4 K /user/root/benchmarks/bigbench/data/household_demographics
327 981 /user/root/benchmarks/bigbench/data/income_band
405.3 M 1.2 G /user/root/benchmarks/bigbench/data/inventory
6.5 M 19.5 M /user/root/benchmarks/bigbench/data/item
4.0 M 12.0 M /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M 161.2 M /user/root/benchmarks/bigbench/data/product_reviews
45.3 K 135.9 K /user/root/benchmarks/bigbench/data/promotion
3.0 K 9.1 K /user/root/benchmarks/bigbench/data/reason
1.2 K 3.6 K /user/root/benchmarks/bigbench/data/ship_mode
3.3 K 9.9 K /user/root/benchmarks/bigbench/data/store
4.1 M 12.4 M /user/root/benchmarks/bigbench/data/store_returns
88.5 M 265.4 M /user/root/benchmarks/bigbench/data/store_sales
4.9 M 14.6 M /user/root/benchmarks/bigbench/data/time_dim
584 1.7 K /user/root/benchmarks/bigbench/data/warehouse
170.4 M 511.3 M /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K 23.6 K /user/root/benchmarks/bigbench/data/web_page
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/web_returns
127.6 M 382.8 M /user/root/benchmarks/bigbench/data/web_sales
8.6 K 25.9 K /user/root/benchmarks/bigbench/data/web_site
Execution Flow
To run a TPCx-BB test, change into the TPCx-BB source directory, enter the bin directory, and execute:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
The -f, -m, -s, and -j flags are parameters that users can set according to the cluster's capacity and their own requirements. If omitted, the defaults apply; they are defined in userSettings.conf under the conf directory:
export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"
By default, MAP_TASKS is 80 (-m 80), SCALE_FACTOR is 1000 (-f 1000), and NUMBER_OF_PARALLEL_STREAMS is 2 (-s 2).
All available options and their meanings:
General options:
-d database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f scale factor of the dataset (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h show this help
-m number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)
Driver specific options:
-a run in pretend mode
-b print the standard output of the bash scripts invoked during execution
-i specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j specify the queries to run (default: all 30 queries, 1-30)
-U unlock expert mode
If -U is given, i.e. expert mode is unlocked, the script prints:
echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution"
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]"
-D specify the query part to debug. Most queries have only a single part
-p benchmark phase to run (default: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)
-q specify a single query to run
-t specify which stream to use when running that query
-v sql script for metastore population (default: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})
-w sql script for metastore refresh (default: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})
-y file with additional user-defined query parameters (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)
-z file with additional user-defined engine settings (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)
List of available modules:
$BIG_BENCH_ENGINE_BIN_DIR
Back to the command that launches the TPCx-BB test:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
bigBench
bigBench is the main script; runBenchmark is a module.
bigBench sets many environment variables (paths, engine, number of streams, and so on), because RunBigBench.jar later needs to read these variables from inside the Java program.
The first part of bigBench does groundwork: setting environment variables, parsing user options, granting file permissions, configuring paths, and so on. As the last step it calls runBenchmark's runModule() function:
- Set the basic paths:

```bash
export BIG_BENCH_VERSION="1.0"
export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"
```

- Point to core-site.xml and hdfs-site.xml. Data generation uses the Hadoop cluster and writes to HDFS:

```bash
export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"
```

- Make all executable files in the package (.sh/.jar/.py) executable:

```bash
find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +
```

- Locate userSettings.conf and source it:

```bash
USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
if [ ! -f "$USER_SETTINGS" ]
then
  echo "User settings file $USER_SETTINGS not found"
  exit 1
else
  source "$USER_SETTINGS"
fi
```

- Parse the input arguments and options and configure accordingly. The first argument must be the module name: if there are no arguments, or the first argument starts with "-", the user did not specify a module to run.

```bash
if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
then
  export MODULE_NAME=""
  SHOW_HELP="1"
else
  export MODULE_NAME="$1"
  shift
fi
export LIST_OF_USER_OPTIONS="$@"
```
Parse the user-supplied options and set environment variables according to their values:
```bash
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
case "$OPT" in
# script options
d)
#echo "-d was triggered, Parameter: $OPTARG" >&2
USER_DATABASE="$OPTARG"
;;
D)
#echo "-D was triggered, Parameter: $OPTARG" >&2
DEBUG_QUERY_PART="$OPTARG"
;;
e)
#echo "-e was triggered, Parameter: $OPTARG" >&2
USER_ENGINE="$OPTARG"
;;
f)
#echo "-f was triggered, Parameter: $OPTARG" >&2
USER_SCALE_FACTOR="$OPTARG"
;;
h)
#echo "-h was triggered, Parameter: $OPTARG" >&2
SHOW_HELP="1"
;;
m)
#echo "-m was triggered, Parameter: $OPTARG" >&2
USER_MAP_TASKS="$OPTARG"
;;
p)
#echo "-p was triggered, Parameter: $OPTARG" >&2
USER_BENCHMARK_PHASE="$OPTARG"
;;
q)
#echo "-q was triggered, Parameter: $OPTARG" >&2
QUERY_NUMBER="$OPTARG"
;;
s)
#echo "-t was triggered, Parameter: $OPTARG" >&2
USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
;;
t)
#echo "-s was triggered, Parameter: $OPTARG" >&2
USER_STREAM_NUMBER="$OPTARG"
;;
U)
#echo "-U was triggered, Parameter: $OPTARG" >&2
USER_EXPERT_MODE="1"
;;
v)
#echo "-v was triggered, Parameter: $OPTARG" >&2
USER_POPULATE_FILE="$OPTARG"
;;
w)
#echo "-w was triggered, Parameter: $OPTARG" >&2
USER_REFRESH_FILE="$OPTARG"
;;
y)
#echo "-y was triggered, Parameter: $OPTARG" >&2
USER_QUERY_PARAMS_FILE="$OPTARG"
;;
z)
#echo "-z was triggered, Parameter: $OPTARG" >&2
USER_ENGINE_SETTINGS_FILE="$OPTARG"
;;
# driver options
a)
#echo "-a was triggered, Parameter: $OPTARG" >&2
export USER_PRETEND_MODE="1"
;;
b)
#echo "-b was triggered, Parameter: $OPTARG" >&2
export USER_PRINT_STD_OUT="1"
;;
i)
#echo "-i was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_WORKLOAD="$OPTARG"
;;
j)
#echo "-j was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
;;
?)
echo "Invalid option: -$OPTARG" >&2
exit 1
;;
:)
echo "Option -$OPTARG requires an argument." >&2
exit 1
;;
esac
done
```
Set the global variables: if the user specified a value for an option, that value is used; otherwise the default applies.
```bash
export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"
export SHOW_HELP="${SHOW_HELP:-"0"}"
export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"
export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"
export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"
export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"
export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"
export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"
export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"
export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"
export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"
```
- Check that the values given for -s, -m, -f, and -j are numbers:

```bash
if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_MAP_TASKS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_SCALE_FACTOR is not a number"
fi
if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_STREAM_NUMBER is not a number"
fi
```

- Check that the engine exists:

```bash
if [ ! -d "$BIG_BENCH_ENGINE_DIR" ]
then
  echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..."
  exit 1
fi
if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ]
then
  echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..."
  exit 1
fi
```

- Locate engineSettings.conf and source it:

```bash
ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"
if [ ! -f "$ENGINE_SETTINGS" ]
then
  echo "Engine settings file $ENGINE_SETTINGS not found"
  exit 1
else
  source "$ENGINE_SETTINGS"
fi
```

- Check that the module exists. Given a module name, the script first looks for it under $BIG_BENCH_ENGINE_BIN_DIR/ and, if found, sources it. If it is not found there, the script looks under $BIG_BENCH_BIN_DIR/ and sources it if found; otherwise it prints "Module $MODULE not found, aborting script." and exits:

```bash
export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"
if [ -f "$MODULE" ]
then
  source "$MODULE"
else
  export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"
  if [ -f "$MODULE" ]
  then
    source "$MODULE"
  else
    echo "Module $MODULE not found, aborting script."
    exit 1
  fi
fi
```

- Check that the module defines the runModule(), helpModule(), and runEngineCmd() functions; e.g. for runModule():

```bash
MODULE_RUN_METHOD="runModule"
if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1
then
  echo "$MODULE_RUN_METHOD was not implemented, aborting script"
  exit 1
fi
```

- Run the module. If the module is runBenchmark, this executes

```bash
runCmdWithErrorCheck "$MODULE_RUN_METHOD"
```

which amounts to calling runBenchmark's runModule().
As shown above, the bigBench script mainly does groundwork (setting environment variables, granting permissions, checking and parsing the input arguments) and finally calls runBenchmark's runModule() function to continue execution.
runBenchmark
Next, let's look at the runBenchmark script.
runBenchmark defines two functions: helpModule() and runModule().
helpModule() just prints the help text.
runModule() is the function actually invoked when the runBenchmark module runs. It does four things:
- clean the logs of previous runs
- invoke RunBigBench.jar to run the benchmark
- logEnvInformation
- zip the log directory
Source:
runModule () {
#check input parameters
if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]
then
echo "The number of parallel streams -s must be greater than 0"
return 1
fi
"${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS
"$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"
"${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS
"${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS
return $?
}
So running the runBenchmark module in turn invokes three other modules (cleanLogs, logEnvInformation, zipLogs) plus RunBigBench.jar. RunBigBench.jar, written in Java, is the core of TPCx-BB test execution. Next we analyze the RunBigBench.jar source.
runModule()
The runModule() function executes a given module. As we have seen, running a module means changing into the bin directory under the home directory and executing:
./bigBench module_name arguments
Inside runModule(), cmdLine assembles exactly that command.
ArrayList cmdLine = new ArrayList();
cmdLine.add("bash");
cmdLine.add(this.runScript);
cmdLine.add(benchmarkPhase.getRunModule());
cmdLine.addAll(arguments);
where this.runScript is:
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
benchmarkPhase.getRunModule() returns the module to execute.
arguments holds the user-supplied options.
So cmdLine ends up as:
bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
So how is this bash command actually executed? By calling the runCmd() method:
boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));
Next, the runCmd() method.
runCmd()
runCmd() uses a ProcessBuilder to create an operating-system process and runs the bash command above in that process. ProcessBuilder can also set the working directory and the environment:
ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(workingDirectory));
Process p = null;
// ...
p = pb.start();
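Stripped of the driver's logging and error handling, the underlying pattern is roughly the following self-contained sketch (the command and working directory here are placeholders, not values used by the driver):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

public class RunCmdDemo {
    public static void main(String[] args) throws Exception {
        // Same shape as: bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
        ProcessBuilder pb = new ProcessBuilder("bash", "-c", "echo hello from bash");
        pb.directory(new File(System.getProperty("user.home"))); // set the working directory
        pb.redirectErrorStream(true);                            // merge stderr into stdout
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line); // forward the child process output
            }
        }
        System.out.println("exit code: " + p.waitFor()); // 0 means the command succeeded
    }
}
```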
getQueryList()
getQueryList() builds the list of queries to execute. It reads the list from the $BIG_BENCH_LOGS_DIR/bigBench.properties file, whose content matches $BIG_BENCH_HOME/conf/bigBench.properties.
In bigBench.properties, power_test_0=1-30 specifies which queries the power_test_0 phase executes, and in what order. Queries can be written as ranges such as 5-12 or as single numbers such as 21, separated by commas.
power_test_0=28-25,2-5,10,22,30
means that the power_test_0 phase executes the queries in the order 28, 27, 26, 25, 2, 3, 4, 5, 10, 22, 30.
To run all 30 queries in order:
power_test_0=1-30
The source that builds the query list:
private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {
String SHUFFLED_NAME_PATTERN = "shuffledQueryList";
BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;
String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);
boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();
if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {
return new ArrayList((Collection)this.queryListCache.get(propertyKey));
} else {
Object queryList;
String basicPhaseNamePattern;
if(!this.properties.containsKey(propertyKey)) {
if(benchmarkPhase.isQueryOrderRandom()) {
if(!this.queryListCache.containsKey("shuffledQueryList")) {
basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);
if(!this.properties.containsKey(basicPhaseNamePattern)) {
throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");
}
this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));
}
queryList = (List)this.queryListCache.get("shuffledQueryList");
this.shuffleList((List)queryList, this.rnd);
} else {
queryList = this.getQueryList(queryOrderBasicPhase, 0);
}
} else {
queryList = new ArrayList();
String[] var11;
int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;
label65:
for(int var9 = 0; var9 < var10; ++var9) {
basicPhaseNamePattern = var11[var9];
String[] queryRange = basicPhaseNamePattern.trim().split("-");
switch(queryRange.length) {
case 1:
((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));
break;
case 2:
int startQuery = Integer.parseInt(queryRange[0]);
int endQuery = Integer.parseInt(queryRange[1]);
int i;
if(startQuery > endQuery) {
i = startQuery;
while(true) {
if(i < endQuery) {
continue label65;
}
((List)queryList).add(Integer.valueOf(i));
--i;
}
} else {
i = startQuery;
while(true) {
if(i > endQuery) {
continue label65;
}
((List)queryList).add(Integer.valueOf(i));
++i;
}
}
default:
throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
}
}
}
if(queryOrderCached) {
this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));
}
return new ArrayList((Collection)queryList);
}
}
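The decompiled loop above is hard to follow. Under the same rules (comma-separated items, each either a single number X or a range X-Y, where a descending range such as 28-25 counts down), a cleaned-up sketch of just the range expansion might look like this (a simplified equivalent, ignoring the caching and shuffling logic):

```java
import java.util.ArrayList;
import java.util.List;

public class QueryListDemo {
    static List<Integer> expand(String spec) {
        List<Integer> queries = new ArrayList<>();
        for (String item : spec.split(",")) {
            String[] range = item.trim().split("-");
            if (range.length == 1) {
                queries.add(Integer.parseInt(range[0].trim()));
            } else if (range.length == 2) {
                int start = Integer.parseInt(range[0].trim());
                int end = Integer.parseInt(range[1].trim());
                int step = start <= end ? 1 : -1; // "28-25" counts down
                for (int i = start; i != end + step; i += step) {
                    queries.add(i);
                }
            } else {
                throw new IllegalArgumentException(
                    "Query numbers must be in the form X or X-Y, comma separated.");
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Prints [28, 27, 26, 25, 2, 3, 4, 5, 10, 22, 30]
        System.out.println(expand("28-25,2-5,10,22,30"));
    }
}
```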
parseEnvironment()
parseEnvironment() reads and parses the system environment variables:
Map env = System.getenv();
this.version = (String)env.get("BIG_BENCH_VERSION");
this.homeDir = (String)env.get("BIG_BENCH_HOME");
this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");
this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");
this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");
this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");
this.engine = (String)env.get("BIG_BENCH_ENGINE");
this.database = (String)env.get("BIG_BENCH_DATABASE");
this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");
this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));
this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));
this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");
It also automatically appends -U (unlock expert mode) to the user-specified arguments:
this.userArguments.add("-U");
If the user specified PRETEND_MODE, PRINT_STD_OUT, WORKLOAD, or QUERIES_TO_RUN, the user's values take precedence; otherwise the defaults are used.
if(env.containsKey("USER_PRETEND_MODE")) {
this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE"));
}
if(env.containsKey("USER_PRINT_STD_OUT")) {
this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT"));
}
if(env.containsKey("USER_DRIVER_WORKLOAD")) {
this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD"));
}
if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) {
this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN"));
}
It then reads the workload and populates benchmarkPhases. If the workload does not include BENCHMARK_START and BENCHMARK_STOP, they are automatically inserted at the beginning and appended at the end of benchmarkPhases, respectively.
this.benchmarkPhases = new ArrayList();
Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();
while(var7.hasNext()) {
String benchmarkPhase = (String)var7.next();
this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {
this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {
this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);
}
run()
run() is the core method of RunBigBench.jar. Everything is driven through run(): it calls runQueries(), runModule(), generateData(), and so on, and these in turn call runCmd() to create operating-system processes, execute bash commands, and invoke the bash scripts.
Inside run(), a while loop walks through every benchmarkPhase of the workload one by one. Different benchmark phases dispatch to different methods among runQueries(), runModule(), generateData(), ...
try {
long e = 0L;
this.log.finer("Benchmark phases: " + this.benchmarkPhases);
Iterator startCheckpoint = this.benchmarkPhases.iterator();
long throughputStart;
while(startCheckpoint.hasNext()) {
BigBench.BenchmarkPhase children = (BigBench.BenchmarkPhase)startCheckpoint.next();
if(children.isPhaseDone()) {
this.log.info("The phase " + children.name() + " was already performed earlier. Skipping this phase");
} else {
try {
switch($SWITCH_TABLE$io$bigdatabenchmark$v1$driver$BigBench$BenchmarkPhase()[children.ordinal()]) {
case 1:
case 20:
throw new IllegalArgumentException("The value " + children.name() + " is only used internally.");
case 2:
this.log.info(children.getConsoleMessage());
e = System.currentTimeMillis();
break;
case 3:
if(!BigBench.BenchmarkPhase.BENCHMARK_START.isPhaseDone()) {
throw new IllegalArgumentException("Error: Cannot stop the benchmark before starting it");
}
throughputStart = System.currentTimeMillis();
this.log.info(String.format("%-55s finished. Time: %25s", new Object[]{children.getConsoleMessage(), BigBench.Helper.formatTime(throughputStart - e)}));
this.logTreeRoot.setCheckpoint(new BigBench.Checkpoint(BigBench.BenchmarkPhase.BENCHMARK, -1L, -1L, e, throughputStart, this.logTreeRoot.isSuccessful()));
break;
case 4:
case 15:
case 18:
case 22:
case 27:
case 28:
case 29:
this.runModule(children, this.userArguments);
break;
case 5:
case 10:
case 11:
this.runQueries(children, 1, validationArguments);
break;
case 6:
case 9:
this.runModule(children, validationArguments);
break;
case 7:
this.generateData(children, false, validationArguments);
break;
case 8:
this.generateData(children, true, validationArguments);
break;
case 12:
case 19:
case 24:
this.runQueries(children, 1, this.userArguments);
break;
case 13:
case 14:
case 21:
case 23:
case 25:
case 26:
this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
break;
case 16:
this.generateData(children, false, this.userArguments);
break;
case 17:
this.generateData(children, true, this.userArguments);
}
children.setPhaseDone(true);
} catch (IOException var21) {
this.log.info("==============\nBenchmark run terminated\nReason: An error occured while running a command in phase " + children + "\n==============");
var21.printStackTrace();
if(this.stopAfterFailure || children.mustSucceed()) {
break;
}
}
}
}
The case labels 1-29 here are not queries 1-29; they are the 29 benchmark phases of the enum type, shown below:
private static enum BenchmarkPhase {
BENCHMARK((String)null, "benchmark", false, false, false, false, "BigBench benchmark"),
BENCHMARK_START((String)null, "benchmark_start", false, false, false, false, "BigBench benchmark: Start"),
BENCHMARK_STOP((String)null, "benchmark_stop", false, false, false, false, "BigBench benchmark: Stop"),
CLEAN_ALL("cleanAll", "clean_all", false, false, false, false, "BigBench clean all"),
ENGINE_VALIDATION_CLEAN_POWER_TEST("cleanQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Clean power test queries"),
ENGINE_VALIDATION_CLEAN_LOAD_TEST("cleanMetastore", "engine_validation_metastore", false, false, false, false, "BigBench engine validation: Clean metastore"),
ENGINE_VALIDATION_CLEAN_DATA("cleanData", "engine_validation_data", false, false, false, false, "BigBench engine validation: Clean data"),
ENGINE_VALIDATION_DATA_GENERATION("dataGen", "engine_validation_data", false, false, false, true, "BigBench engine validation: Data generation"),
ENGINE_VALIDATION_LOAD_TEST("populateMetastore", "engine_validation_metastore", false, false, false, true, "BigBench engine validation: Populate metastore"),
ENGINE_VALIDATION_POWER_TEST("runQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Power test"),
ENGINE_VALIDATION_RESULT_VALIDATION("validateQuery", "engine_validation_power_test", false, false, true, false, "BigBench engine validation: Check all query results"),
CLEAN_POWER_TEST("cleanQuery", "power_test", false, false, false, false, "BigBench clean: Clean power test queries"),
CLEAN_THROUGHPUT_TEST_1("cleanQuery", "throughput_test_1", false, false, false, false, "BigBench clean: Clean first throughput test queries"),
CLEAN_THROUGHPUT_TEST_2("cleanQuery", "throughput_test_2", false, false, false, false, "BigBench clean: Clean second throughput test queries"),
CLEAN_LOAD_TEST("cleanMetastore", "metastore", false, false, false, false, "BigBench clean: Load test"),
CLEAN_DATA("cleanData", "data", false, false, false, false, "BigBench clean: Data"),
DATA_GENERATION("dataGen", "data", false, false, false, true, "BigBench preparation: Data generation"),
LOAD_TEST("populateMetastore", "metastore", false, false, false, true, "BigBench phase 1: Load test"),
POWER_TEST("runQuery", "power_test", false, true, false, false, "BigBench phase 2: Power test"),
THROUGHPUT_TEST((String)null, "throughput_test", false, false, false, false, "BigBench phase 3: Throughput test"),
THROUGHPUT_TEST_1("runQuery", "throughput_test_1", true, true, false, false, "BigBench phase 3: First throughput test run"),
THROUGHPUT_TEST_REFRESH("refreshMetastore", "throughput_test_refresh", false, false, false, false, "BigBench phase 3: Throughput test data refresh"),
THROUGHPUT_TEST_2("runQuery", "throughput_test_2", true, true, false, false, "BigBench phase 3: Second throughput test run"),
VALIDATE_POWER_TEST("validateQuery", "power_test", false, false, true, false, "BigBench validation: Power test results"),
VALIDATE_THROUGHPUT_TEST_1("validateQuery", "throughput_test_1", false, false, true, false, "BigBench validation: First throughput test results"),
VALIDATE_THROUGHPUT_TEST_2("validateQuery", "throughput_test_2", false, false, true, false, "BigBench validation: Second throughput test results"),
SHOW_TIMES("showTimes", "show_times", false, false, true, false, "BigBench: show query times"),
SHOW_ERRORS("showErrors", "show_errors", false, false, true, false, "BigBench: show query errors"),
SHOW_VALIDATION("showValidation", "show_validation", false, false, true, false, "BigBench: show query validation results");
private String runModule;
private String namePattern;
private boolean queryOrderRandom;
private boolean queryOrderCached;
private boolean printStdOut;
private boolean mustSucceed;
private String consoleMessage;
private boolean phaseDone;
private BenchmarkPhase(String runModule, String namePattern, boolean queryOrderRandom, boolean queryOrderCached, boolean printStdOut, boolean mustSucceed, String consoleMessage) {
this.runModule = runModule;
this.namePattern = namePattern;
this.queryOrderRandom = queryOrderRandom;
this.queryOrderCached = queryOrderCached;
this.printStdOut = printStdOut;
this.mustSucceed = mustSucceed;
this.consoleMessage = consoleMessage;
this.phaseDone = false;
}
3 corresponds to BENCHMARK_STOP, 4 to CLEAN_ALL, 29 to SHOW_VALIDATION, and so on.
From this we can see:
CLEAN_ALL, CLEAN_LOAD_TEST, LOAD_TEST, THROUGHPUT_TEST_REFRESH, SHOW_TIMES, SHOW_ERRORS, and SHOW_VALIDATION call
this.runModule(children, this.userArguments);
i.e. the runModule method with the this.userArguments parameters.
ENGINE_VALIDATION_CLEAN_POWER_TEST, ENGINE_VALIDATION_POWER_TEST, and ENGINE_VALIDATION_RESULT_VALIDATION call
this.runQueries(children, 1, validationArguments);
i.e. the runQueries method with parameters 1 (the stream number) and validationArguments.
ENGINE_VALIDATION_CLEAN_LOAD_TEST and ENGINE_VALIDATION_LOAD_TEST call
this.runModule(children, validationArguments);
ENGINE_VALIDATION_CLEAN_DATA calls
this.generateData(children, false, validationArguments);
ENGINE_VALIDATION_DATA_GENERATION calls
this.generateData(children, true, validationArguments);
CLEAN_POWER_TEST, POWER_TEST, and VALIDATE_POWER_TEST call
this.runQueries(children, 1, this.userArguments);
CLEAN_THROUGHPUT_TEST_1, CLEAN_THROUGHPUT_TEST_2, THROUGHPUT_TEST_1, THROUGHPUT_TEST_2, VALIDATE_THROUGHPUT_TEST_1, and VALIDATE_THROUGHPUT_TEST_2 call
this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
CLEAN_DATA calls
this.generateData(children, false, this.userArguments);
DATA_GENERATION calls
this.generateData(children, true, this.userArguments);
Summarizing these calls:
- Benchmark phases related to ENGINE_VALIDATION all use validationArguments; the rest use userArguments (the only difference between them is that validationArguments pins SCALE_FACTOR to 1)
- Phases related to POWER_TEST all call runQueries(), because a power test simply executes the SQL queries
- CLEAN_DATA and DATA_GENERATION call generateData()
- LOAD_TEST and the SHOW_* phases call runModule()
Mapping between benchmarkPhase and module
The mapping from each benchmarkPhase to its module (the script it executes) is:
CLEAN_ALL -> "cleanAll"
ENGINE_VALIDATION_CLEAN_POWER_TEST -> "cleanQuery"
ENGINE_VALIDATION_CLEAN_LOAD_TEST -> "cleanMetastore",
ENGINE_VALIDATION_CLEAN_DATA -> "cleanData"
ENGINE_VALIDATION_DATA_GENERATION -> "dataGen"
ENGINE_VALIDATION_LOAD_TEST -> "populateMetastore"
ENGINE_VALIDATION_POWER_TEST -> "runQuery"
ENGINE_VALIDATION_RESULT_VALIDATION -> "validateQuery"
CLEAN_POWER_TEST -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_1 -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_2 -> "cleanQuery"
CLEAN_LOAD_TEST -> "cleanMetastore"
CLEAN_DATA -> "cleanData"
DATA_GENERATION -> "dataGen"
LOAD_TEST -> "populateMetastore"
POWER_TEST -> "runQuery"
THROUGHPUT_TEST -> (String)null
THROUGHPUT_TEST_1 -> "runQuery"
THROUGHPUT_TEST_REFRESH -> "refreshMetastore"
THROUGHPUT_TEST_2 -> "runQuery"
VALIDATE_POWER_TEST -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_1 -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_2 -> "validateQuery"
SHOW_TIMES -> "showTimes"
SHOW_ERRORS -> "showErrors"
SHOW_VALIDATION -> "showValidation"
When a benchmarkPhase executes, it invokes the corresponding module listed above (the scripts live in the $BENCH_MARK_HOME/engines/hive/bin directory):
cmdLine.add(benchmarkPhase.getRunModule());
Program Call Flow

The following sections describe what each module does.
module
cleanAll
1. DROP the database
2. Delete the raw data on HDFS
echo "dropping database (with all tables)"
runCmdWithErrorCheck runEngineCmd -e "DROP DATABASE IF EXISTS $BIG_BENCH_DATABASE CASCADE;"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
cleanQuery
1. Drop the temporary tables created by the query
2. Drop the result table created by the query
runCmdWithErrorCheck runEngineCmd -e "DROP TABLE IF EXISTS $TEMP_TABLE1; DROP TABLE IF EXISTS $TEMP_TABLE2; DROP TABLE IF EXISTS $RESULT_TABLE;"
return $?
cleanMetastore
1. Invoke `dropTables.sql` to DROP all 23 tables in turn
echo "cleaning metastore tables"
runCmdWithErrorCheck runEngineCmd -f "$BIG_BENCH_CLEAN_METASTORE_FILE"
export BIG_BENCH_CLEAN_METASTORE_FILE="$BIG_BENCH_CLEAN_DIR/dropTables.sql"
dropTables.sql drops the 23 tables one after another; source:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
DROP TABLE IF EXISTS ${hiveconf:customerAddressTableName};
DROP TABLE IF EXISTS ${hiveconf:customerDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:dateTableName};
DROP TABLE IF EXISTS ${hiveconf:householdDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:incomeTableName};
DROP TABLE IF EXISTS ${hiveconf:itemTableName};
DROP TABLE IF EXISTS ${hiveconf:promotionTableName};
DROP TABLE IF EXISTS ${hiveconf:reasonTableName};
DROP TABLE IF EXISTS ${hiveconf:shipModeTableName};
DROP TABLE IF EXISTS ${hiveconf:storeTableName};
DROP TABLE IF EXISTS ${hiveconf:timeTableName};
DROP TABLE IF EXISTS ${hiveconf:warehouseTableName};
DROP TABLE IF EXISTS ${hiveconf:webSiteTableName};
DROP TABLE IF EXISTS ${hiveconf:webPageTableName};
DROP TABLE IF EXISTS ${hiveconf:inventoryTableName};
DROP TABLE IF EXISTS ${hiveconf:storeSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:storeReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:webSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:webReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:marketPricesTableName};
DROP TABLE IF EXISTS ${hiveconf:clickstreamsTableName};
DROP TABLE IF EXISTS ${hiveconf:reviewsTableName};
cleanData
1. Delete the /user/root/benchmarks/bigbench/data directory on HDFS
2. Delete the /user/root/benchmarks/bigbench/data_refresh directory on HDFS
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
dataGen
1. Create the /user/root/benchmarks/bigbench/data directory and set its permissions
2. Create the /user/root/benchmarks/bigbench/data_refresh directory and set its permissions
3. Invoke HadoopClusterExec.jar and pdgf.jar to generate the base data into /user/root/benchmarks/bigbench/data
4. Invoke HadoopClusterExec.jar and pdgf.jar to generate the refresh data into /user/root/benchmarks/bigbench/data_refresh
Create the /user/root/benchmarks/bigbench/data directory and set its permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
Create the /user/root/benchmarks/bigbench/data_refresh directory and set its permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
Invoke HadoopClusterExec.jar and pdgf.jar to generate the base data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 0 -o "'${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
Invoke HadoopClusterExec.jar and pdgf.jar to generate the refresh data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 1 -o "'${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
populateMetastore
This phase actually creates the database tables. Table creation runs hiveCreateLoad.sql in the $BENCH_MARK_HOME/engines/hive/population/ directory, which proceeds as follows:
- Read the .dat raw data from /user/root/benchmarks/bigbench/data and create external temporary tables in TEXTFILE format
- Create the final ORC-format tables with a `select * from` each temporary table
- Drop the external temporary tables
Read the .dat raw data from /user/root/benchmarks/bigbench/data into a TEXTFILE-format external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
( c_customer_sk bigint --not null
, c_customer_id string --not null
, c_current_cdemo_sk bigint
, c_current_hdemo_sk bigint
, c_current_addr_sk bigint
, c_first_shipto_date_sk bigint
, c_first_sales_date_sk bigint
, c_salutation string
, c_first_name string
, c_last_name string
, c_preferred_cust_flag string
, c_birth_day int
, c_birth_month int
, c_birth_year int
, c_birth_country string
, c_login string
, c_email_address string
, c_last_review_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
Create the final ORC-format table via `select * from` the temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
CREATE TABLE ${hiveconf:customerTableName}
STORED AS ${hiveconf:tableFormat}
AS
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;
Drop the external temporary table:
DROP TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
runQuery
1. runQuery calls the `query_run_main_method()` function in each query's run.sh
2. `query_run_main_method()` calls `runEngineCmd` to execute the query script (qxx.sql)
runQuery calls the query_run_main_method() function in each query's run.sh:
QUERY_MAIN_METHOD="query_run_main_method"
-----------------------------------------
"$QUERY_MAIN_METHOD" 2>&1 | tee -a "$LOG_FILE_NAME" 2>&1
query_run_main_method() calls runEngineCmd to execute the query script (qxx.sql):
query_run_main_method () {
QUERY_SCRIPT="$QUERY_DIR/$QUERY_NAME.sql"
if [ ! -r "$QUERY_SCRIPT" ]
then
echo "SQL file $QUERY_SCRIPT can not be read."
exit 1
fi
runCmdWithErrorCheck runEngineCmd -f "$QUERY_SCRIPT"
return $?
}
Usually query_run_main_method() just executes the corresponding query script. But queries such as q05 and q20 use machine learning algorithms: after the query script runs, the resulting table is fed as input to the jar that runs the machine learning algorithm (e.g. clustering or logistic regression), which then produces the final result.
runEngineCmd () {
if addInitScriptsToParams
then
"$BINARY" "${BINARY_PARAMS[@]}" "${INIT_PARAMS[@]}" "$@"
else
return 1
fi
}
--------------------------
BINARY="/usr/bin/hive"
BINARY_PARAMS+=(--hiveconf BENCHMARK_PHASE=$BIG_BENCH_BENCHMARK_PHASE --hiveconf STREAM_NUMBER=$BIG_BENCH_STREAM_NUMBER --hiveconf QUERY_NAME=$QUERY_NAME --hiveconf QUERY_DIR=$QUERY_DIR --hiveconf RESULT_TABLE=$RESULT_TABLE --hiveconf RESULT_DIR=$RESULT_DIR --hiveconf TEMP_TABLE=$TEMP_TABLE --hiveconf TEMP_DIR=$TEMP_DIR --hiveconf TABLE_PREFIX=$TABLE_PREFIX)
INIT_PARAMS=(-i "$BIG_BENCH_QUERY_PARAMS_FILE" -i "$BIG_BENCH_ENGINE_SETTINGS_FILE")
INIT_PARAMS+=(-i "$LOCAL_QUERY_ENGINE_SETTINGS_FILE")
if [ -n "$USER_QUERY_PARAMS_FILE" ]
then
if [ -r "$USER_QUERY_PARAMS_FILE" ]
then
echo "User defined query parameter file found. Adding $USER_QUERY_PARAMS_FILE to hive init."
INIT_PARAMS+=(-i "$USER_QUERY_PARAMS_FILE")
else
echo "User query parameter file $USER_QUERY_PARAMS_FILE can not be read."
return 1
fi
fi
if [ -n "$USER_ENGINE_SETTINGS_FILE" ]
then
if [ -r "$USER_ENGINE_SETTINGS_FILE" ]
then
echo "User defined engine settings file found. Adding $USER_ENGINE_SETTINGS_FILE to hive init."
INIT_PARAMS+=(-i "$USER_ENGINE_SETTINGS_FILE")
else
echo "User hive settings file $USER_ENGINE_SETTINGS_FILE can not be read."
return 1
fi
fi
return 0
validateQuery
1. Calls the `query_run_validate_method()` function in each query's run.sh
2. `query_run_validate_method()` compares `$BENCH_MARK_HOME/engines/hive/queries/qxx/results/qxx-result` with `/user/root/benchmarks/bigbench/queryResults/qxx_hive_${BIG_BENCH_BENCHMARK_PHASE}_${BIG_BENCH_STREAM_NUMBER}_result` on HDFS; if they are identical, validation passes, otherwise it fails.
if diff -q "$VALIDATION_RESULTS_FILENAME" <(hadoop fs -cat "$RESULT_DIR/*")
then
echo "Validation of $VALIDATION_RESULTS_FILENAME passed: Query returned correct results"
else
echo "Validation of $VALIDATION_RESULTS_FILENAME failed: Query returned incorrect results"
VALIDATION_PASSED="0"
fi
The comparison above is used when SF is 1 (-f 1). When SF is greater than 1, validation passes as long as the result table on HDFS contains at least one row:
if [ `hadoop fs -cat "$RESULT_DIR/*" | head -n 10 | wc -l` -ge 1 ]
then
echo "Validation passed: Query returned results"
else
echo "Validation failed: Query did not return results"
return 1
fi
refreshMetastore
1. Runs the `hiveRefreshCreateLoad.sql` script under `$BENCH_MARK_HOME/engines/hive/refresh/`
2. `hiveRefreshCreateLoad.sql` loads each table's data from `/user/root/benchmarks/bigbench/data_refresh/` on HDFS into external temporary tables
3. The external temporary tables then insert each table's data into the corresponding tables of the Hive database
hiveRefreshCreateLoad.sql loads each table's data from /user/root/benchmarks/bigbench/data_refresh/ on HDFS into an external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
( c_customer_sk bigint --not null
, c_customer_id string --not null
, c_current_cdemo_sk bigint
, c_current_hdemo_sk bigint
, c_current_addr_sk bigint
, c_first_shipto_date_sk bigint
, c_first_sales_date_sk bigint
, c_salutation string
, c_first_name string
, c_last_name string
, c_preferred_cust_flag string
, c_birth_day int
, c_birth_month int
, c_birth_year int
, c_birth_country string
, c_login string
, c_email_address string
, c_last_review_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
-----------------
set hdfsDataPath=${env:BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR};
The external temporary table then inserts each table's data into the corresponding table of the Hive database:
INSERT INTO TABLE ${hiveconf:customerTableName}
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;
Appendix
Before the test, PDGF (the Parallel Data Generation Framework) generates the data. With a Scale Factor of 1 (1 GB), the following 23 tables are produced. Below are each table's create-table statement and one sample record.
The 23 tables
hive> show tables;
OK
customer
customer_address
customer_demographics
date_dim
household_demographics
income_band
inventory
item
item_marketprices
product_reviews
promotion
reason
ship_mode
store
store_returns
store_sales
time_dim
warehouse
web_clickstreams
web_page
web_returns
web_sales
web_site
Time taken: 0.017 seconds, Fetched: 23 row(s)
customer (99000 rows, 4.03 MB)
Create table statement:
hive> show create table customer;
OK
CREATE TABLE `customer`(
`c_customer_sk` bigint,
`c_customer_id` string,
`c_current_cdemo_sk` bigint,
`c_current_hdemo_sk` bigint,
`c_current_addr_sk` bigint,
`c_first_shipto_date_sk` bigint,
`c_first_sales_date_sk` bigint,
`c_salutation` string,
`c_first_name` string,
`c_last_name` string,
`c_preferred_cust_flag` string,
`c_birth_day` int,
`c_birth_month` int,
`c_birth_year` int,
`c_birth_country` string,
`c_login` string,
`c_email_address` string,
`c_last_review_date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='99000',
'rawDataSize'='88084062',
'totalSize'='4221267',
'transient_lastDdlTime'='1473167154')
Time taken: 0.116 seconds, Fetched: 34 row(s)
Sample row:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
customer_address (49500 rows, 0.92 MB)
Create table statement:
hive> show create table customer_address;
OK
CREATE TABLE `customer_address`(
`ca_address_sk` bigint,
`ca_address_id` string,
`ca_street_number` string,
`ca_street_name` string,
`ca_street_type` string,
`ca_suite_number` string,
`ca_city` string,
`ca_county` string,
`ca_state` string,
`ca_zip` string,
`ca_country` string,
`ca_gmt_offset` decimal(5,2),
`ca_location_type` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer_address'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='49500',
'rawDataSize'='55836000',
'totalSize'='966548',
'transient_lastDdlTime'='1473167160')
Time taken: 0.043 seconds, Fetched: 29 row(s)
Sample row:
6187 AAAAAAAAAAAAAJDZ 536 6th Lake Drive UKL8bE5C Lowell Brule County SD 18464 United States -6 apartment
customer_demographics (1920800 rows, 6.84 MB)
Create table statement:
hive> show create table customer_demographics;
OK
CREATE TABLE `customer_demographics`(
`cd_demo_sk` bigint,
`cd_gender` string,
`cd_marital_status` string,
`cd_education_status` string,
`cd_purchase_estimate` int,
`cd_credit_rating` string,
`cd_dep_count` int,
`cd_dep_employed_count` int,
`cd_dep_college_count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer_demographics'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='1920800',
'rawDataSize'='718371044',
'totalSize'='7167503',
'transient_lastDdlTime'='1473167169')
Sample row:
0 F U Primary 6000 Good 0 5 0
date_dim (109573 rows, 1.67 MB)
Create table statement:
hive> show create table date_dim;
OK
CREATE TABLE `date_dim`(
`d_date_sk` bigint,
`d_date_id` string,
`d_date` string,
`d_month_seq` int,
`d_week_seq` int,
`d_quarter_seq` int,
`d_year` int,
`d_dow` int,
`d_moy` int,
`d_dom` int,
`d_qoy` int,
`d_fy_year` int,
`d_fy_quarter_seq` int,
`d_fy_week_seq` int,
`d_day_name` string,
`d_quarter_name` string,
`d_holiday` string,
`d_weekend` string,
`d_following_holiday` string,
`d_first_dom` int,
`d_last_dom` int,
`d_same_day_ly` int,
`d_same_day_lq` int,
`d_current_day` string,
`d_current_week` string,
`d_current_month` string,
`d_current_quarter` string,
`d_current_year` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/date_dim'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='109573',
'rawDataSize'='123050479',
'totalSize'='1748707',
'transient_lastDdlTime'='1473167172')
Sample row:
0 AAAAAAAAAAAAAAAA 1900-01-01 0 0 0 1900 1 1 1 1 1900 0 0 Monday 1900Q1 Y N N 2448812 2458802 2472542 2420941 N N NN N
household_demographics (7200 rows, 14.31 KB)
Create table statement:
hive> show create table household_demographics;
OK
CREATE TABLE `household_demographics`(
`hd_demo_sk` bigint,
`hd_income_band_sk` bigint,
`hd_buy_potential` string,
`hd_dep_count` int,
`hd_vehicle_count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/household_demographics'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='7200',
'rawDataSize'='835168',
'totalSize'='14655',
'transient_lastDdlTime'='1473167173')
Sample row:
0 3 1001-5000 0 0
income_band (20 rows, 754 B)
Create table statement:
hive> show create table income_band;
OK
CREATE TABLE `income_band`(
`ib_income_band_sk` bigint,
`ib_lower_bound` int,
`ib_upper_bound` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/income_band'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='20',
'rawDataSize'='320',
'totalSize'='754',
'transient_lastDdlTime'='1473167179')
Time taken: 0.032 seconds, Fetched: 19 row(s)
Sample row:
0 1 10000
inventory (23255100 rows, 34.55 MB)
Create table statement:
hive> show create table inventory;
OK
CREATE TABLE `inventory`(
`inv_date_sk` bigint,
`inv_item_sk` bigint,
`inv_warehouse_sk` bigint,
`inv_quantity_on_hand` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/inventory'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='3',
'numRows'='23255100',
'rawDataSize'='651118804',
'totalSize'='36234106',
'transient_lastDdlTime'='1473167235')
Time taken: 0.031 seconds, Fetched: 20 row(s)
Sample row:
36890 0 0 503
item (17820 rows, 2.36 MB)
Create table statement:
hive> show create table item;
OK
CREATE TABLE `item`(
`i_item_sk` bigint,
`i_item_id` string,
`i_rec_start_date` string,
`i_rec_end_date` string,
`i_item_desc` string,
`i_current_price` decimal(7,2),
`i_wholesale_cost` decimal(7,2),
`i_brand_id` int,
`i_brand` string,
`i_class_id` int,
`i_class` string,
`i_category_id` int,
`i_category` string,
`i_manufact_id` int,
`i_manufact` string,
`i_size` string,
`i_formulation` string,
`i_color` string,
`i_units` string,
`i_container` string,
`i_manager_id` int,
`i_product_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/item'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='17820',
'rawDataSize'='31238428',
'totalSize'='2472973',
'transient_lastDdlTime'='1473167181')
Sample row:
0 AAAAAAAAAAAAAAAA 2000-01-14 quickly even dinos beneath the frays must have to boost boldly careful bold escapades: stealthily even forges over the dependencies integrate always past the quiet sly decoys-- notornis sol 72.29 64.96 3898712 71TS7NSbvH1YbdiQMG6ttBHKAljiNoIRB 1 Fan Shop 9 Sports & Outdoors 995 2VOxvrIWwlJQTSk6 small 99Ee1r6uFbZSSClAX3 dodger Oz Unknown 18 8m9n5Q7T33DNWidoA6nWlg6ydmpA1SKOoOJLXiLVb
item_marketprices (89100 rows, 0.63 MB)
Create table statement:
hive> show create table item_marketprices;
OK
CREATE TABLE `item_marketprices`(
`imp_sk` bigint,
`imp_item_sk` bigint,
`imp_competitor` string,
`imp_competitor_price` decimal(7,2),
`imp_start_date` bigint,
`imp_end_date` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/item_marketprices'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='89100',
'rawDataSize'='21736912',
'totalSize'='657024',
'transient_lastDdlTime'='1473167275')
Sample row:
5 4737 AAAAAAAAAAAAAAIN 66.4 36890 36958
product_reviews (89991 rows, 24.36 MB)
Create table statement:
hive> show create table product_reviews;
OK
CREATE TABLE `product_reviews`(
`pr_review_sk` bigint,
`pr_review_date` string,
`pr_review_time` string,
`pr_review_rating` int,
`pr_item_sk` bigint,
`pr_user_sk` bigint,
`pr_order_sk` bigint,
`pr_review_content` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/product_reviews'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='89991',
'rawDataSize'='79597043',
'totalSize'='25546821',
'transient_lastDdlTime'='1473167305')
Sample row:
4 2004-02-25 21:43:07 1 3683 88783 41241 Had to offset the praise handed down by so many years is probably the result of some sin committed (obviously so grievous don't "get" it. Why all the original artwork by Stephen Gammel.
promotion (300 rows, 15.83 KB)
Create table statement:
hive> show create table promotion;
OK
CREATE TABLE `promotion`(
`p_promo_sk` bigint,
`p_promo_id` string,
`p_start_date_sk` bigint,
`p_end_date_sk` bigint,
`p_item_sk` bigint,
`p_cost` decimal(15,2),
`p_response_target` int,
`p_promo_name` string,
`p_channel_dmail` string,
`p_channel_email` string,
`p_channel_catalog` string,
`p_channel_tv` string,
`p_channel_radio` string,
`p_channel_press` string,
`p_channel_event` string,
`p_channel_demo` string,
`p_channel_details` string,
`p_purpose` string,
`p_discount_active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/promotion'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='300',
'rawDataSize'='404926',
'totalSize'='16204',
'transient_lastDdlTime'='1473167186')
Sample row:
7 AAAAAAAAAAAAAAAH 94455 108529 13511 427.76 1 bar YN Y N N N N N blithe grouches past the blithe quick epitaphs print rut Unknown N
reason (35 rows, 3.17 KB)
Create table statement:
hive> show create table reason;
OK
CREATE TABLE `reason`(
`r_reason_sk` bigint,
`r_reason_id` string,
`r_reason_desc` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/reason'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='35',
'rawDataSize'='9027',
'totalSize'='3240',
'transient_lastDdlTime'='1473167190')
Sample row:
5 uy busily sly excuses hang: slow braids to the daring somas was toward the epitaphs-- gifts betw
ship_mode (20 rows, 2.93 KB)
Create table statement:
hive> show create table ship_mode;
OK
CREATE TABLE `ship_mode`(
`sm_ship_mode_sk` bigint,
`sm_ship_mode_id` string,
`sm_type` string,
`sm_code` string,
`sm_carrier` string,
`sm_contract` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/ship_mode'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='20',
'rawDataSize'='9576',
'totalSize'='3001',
'transient_lastDdlTime'='1473167196')
Sample row:
12 wGyozLRZ3uL iCTZsMlNzsQ cBRc FlWM9v tm0ehuQ2
store (12 rows, 8.03 KB)
Create table statement:
hive> show create table store;
OK
CREATE TABLE `store`(
`s_store_sk` bigint,
`s_store_id` string,
`s_rec_start_date` string,
`s_rec_end_date` string,
`s_closed_date_sk` bigint,
`s_store_name` string,
`s_number_employees` int,
`s_floor_space` int,
`s_hours` string,
`s_manager` string,
`s_market_id` int,
`s_geography_class` string,
`s_market_desc` string,
`s_market_manager` string,
`s_division_id` int,
`s_division_name` string,
`s_company_id` int,
`s_company_name` string,
`s_street_number` string,
`s_street_name` string,
`s_street_type` string,
`s_suite_number` string,
`s_city` string,
`s_county` string,
`s_state` string,
`s_zip` string,
`s_country` string,
`s_gmt_offset` decimal(5,2),
`s_tax_precentage` decimal(5,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='12',
'rawDataSize'='25962',
'totalSize'='8214',
'transient_lastDdlTime'='1473167201')
Sample row:
10 AAAAAAAAAAAAAAAK 2000-11-01 6235 Avalon 254 6468537 8AM-12AM Michael Barlow 6 Unknown final sly gifts by the even final dependencies x-ray under the fluffy Barry Shaw 1 Unknown 1 Unknown 993 Mill Pkwy aQV Cold Springs Cumberland County TN 95692 United States -5 0.03
store_returns (37902 rows) 1.19MB
Create-table statement:
hive> show create table store_returns;
OK
CREATE TABLE `store_returns`(
`sr_returned_date_sk` bigint,
`sr_return_time_sk` bigint,
`sr_item_sk` bigint,
`sr_customer_sk` bigint,
`sr_cdemo_sk` bigint,
`sr_hdemo_sk` bigint,
`sr_addr_sk` bigint,
`sr_store_sk` bigint,
`sr_reason_sk` bigint,
`sr_ticket_number` bigint,
`sr_return_quantity` int,
`sr_return_amt` decimal(7,2),
`sr_return_tax` decimal(7,2),
`sr_return_amt_inc_tax` decimal(7,2),
`sr_fee` decimal(7,2),
`sr_return_ship_cost` decimal(7,2),
`sr_refunded_cash` decimal(7,2),
`sr_reversed_charge` decimal(7,2),
`sr_store_credit` decimal(7,2),
`sr_net_loss` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store_returns'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='37902',
'rawDataSize'='41388272',
'totalSize'='1250563',
'transient_lastDdlTime'='1473167251')
Sample row:
37375 38182 11520 7640 242073 6754 25731 2 22 20293 33 1990.89 119.45 2110.34 34.14 2433.41 477.81 559.84 953.24 2672.31
store_sales (667579 rows) 23.49MB
Create-table statement:
hive> show create table store_sales;
OK
CREATE TABLE `store_sales`(
`ss_sold_date_sk` bigint,
`ss_sold_time_sk` bigint,
`ss_item_sk` bigint,
`ss_customer_sk` bigint,
`ss_cdemo_sk` bigint,
`ss_hdemo_sk` bigint,
`ss_addr_sk` bigint,
`ss_store_sk` bigint,
`ss_promo_sk` bigint,
`ss_ticket_number` bigint,
`ss_quantity` int,
`ss_wholesale_cost` decimal(7,2),
`ss_list_price` decimal(7,2),
`ss_sales_price` decimal(7,2),
`ss_ext_discount_amt` decimal(7,2),
`ss_ext_sales_price` decimal(7,2),
`ss_ext_wholesale_cost` decimal(7,2),
`ss_ext_list_price` decimal(7,2),
`ss_ext_tax` decimal(7,2),
`ss_coupon_amt` decimal(7,2),
`ss_net_paid` decimal(7,2),
`ss_net_paid_inc_tax` decimal(7,2),
`ss_net_profit` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store_sales'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='667579',
'rawDataSize'='953293700',
'totalSize'='24629162',
'transient_lastDdlTime'='1473167245')
Sample row:
37115 20244 16481 98676 1211207 5239 37107 9 38 11138 93 56.47 82.45 36.28 4293.81 3374.04 5251.71 7667.85 101.22 0 3374.04 3475.26 -1877.67
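The statistics above also show what the ORC format buys: the ratio rawDataSize/totalSize is a rough measure of ORC's columnar compression, and for store_sales it is 953293700 / 24629162 ≈ 39, i.e. the ORC files on disk are about 1/39 the size of the raw row data.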
time_dim (86400 rows) 219.49KB
Create-table statement:
hive> show create table time_dim;
OK
CREATE TABLE `time_dim`(
`t_time_sk` bigint,
`t_time_id` string,
`t_time` int,
`t_hour` int,
`t_minute` int,
`t_second` int,
`t_am_pm` string,
`t_shift` string,
`t_sub_shift` string,
`t_meal_time` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/time_dim'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='86400',
'rawDataSize'='41040000',
'totalSize'='224757',
'transient_lastDdlTime'='1473167202')
Sample row:
2 AAAAAAAAAAAAAAAC 2 0 0 2 AM third night
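The row count here is no accident: 86400 = 24 × 60 × 60, one row per second of the day. The sample row is the day's second #2, i.e. 00:00:02 AM (t_hour=0, t_minute=0, t_second=2).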
warehouse (5 rows) 1.94KB
Create-table statement:
hive> show create table warehouse;
OK
CREATE TABLE `warehouse`(
`w_warehouse_sk` bigint,
`w_warehouse_id` string,
`w_warehouse_name` string,
`w_warehouse_sq_ft` int,
`w_street_number` string,
`w_street_name` string,
`w_street_type` string,
`w_suite_number` string,
`w_city` string,
`w_county` string,
`w_state` string,
`w_zip` string,
`w_country` string,
`w_gmt_offset` decimal(5,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/warehouse'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='5',
'rawDataSize'='5695',
'totalSize'='1980',
'transient_lastDdlTime'='1473167204')
Sample row:
1 AAAAAAAAAAAAAAAB frets would sleep 845707 181 9th 13th Court kQ Sherwood Forest Teton County WY 87891 United States -7
web_clickstreams (6770550 rows) 34.11MB
Create-table statement:
hive> show create table web_clickstreams;
OK
CREATE TABLE `web_clickstreams`(
`wcs_click_date_sk` bigint,
`wcs_click_time_sk` bigint,
`wcs_sales_sk` bigint,
`wcs_item_sk` bigint,
`wcs_web_page_sk` bigint,
`wcs_user_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_clickstreams'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='6770550',
'rawDataSize'='253337416',
'totalSize'='35768476',
'transient_lastDdlTime'='1473167298')
Sample row:
36890 27089 NULL 15465 47 NULL
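Note the two NULLs in this sample row: wcs_sales_sk and wcs_user_sk are empty, which presumably marks a click that led to no sale and came from a visitor who was not logged in. To gauge how common such rows are, an illustrative one-liner (not one of the 30 benchmark queries):
hive> SELECT COUNT(*) FROM bigbench.web_clickstreams WHERE wcs_sales_sk IS NULL;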
web_page (60 rows) 6.81KB
Create-table statement:
hive> show create table web_page;
OK
CREATE TABLE `web_page`(
`wp_web_page_sk` bigint,
`wp_web_page_id` string,
`wp_rec_start_date` string,
`wp_rec_end_date` string,
`wp_creation_date_sk` bigint,
`wp_access_date_sk` bigint,
`wp_autogen_flag` string,
`wp_customer_sk` bigint,
`wp_url` string,
`wp_type` string,
`wp_char_count` int,
`wp_link_count` int,
`wp_image_count` int,
`wp_max_ad_count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_page'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='60',
'rawDataSize'='38763',
'totalSize'='6969',
'transient_lastDdlTime'='1473167215')
Sample row:
24 AAAAAAAAAAAAAAAY 2000-12-19 38819 69008 0 44446 http://www.C8jdri37RmtbDNeFnXjmYbyBPzeO4WWK9pVYP6xtBJbaQ5yIj4s.com feedback 4340 23 5 2
web_returns (38487 rows) 1.40MB
Create-table statement:
hive> show create table web_returns;
OK
CREATE TABLE `web_returns`(
`wr_returned_date_sk` bigint,
`wr_returned_time_sk` bigint,
`wr_item_sk` bigint,
`wr_refunded_customer_sk` bigint,
`wr_refunded_cdemo_sk` bigint,
`wr_refunded_hdemo_sk` bigint,
`wr_refunded_addr_sk` bigint,
`wr_returning_customer_sk` bigint,
`wr_returning_cdemo_sk` bigint,
`wr_returning_hdemo_sk` bigint,
`wr_returning_addr_sk` bigint,
`wr_web_page_sk` bigint,
`wr_reason_sk` bigint,
`wr_order_number` bigint,
`wr_return_quantity` int,
`wr_return_amt` decimal(7,2),
`wr_return_tax` decimal(7,2),
`wr_return_amt_inc_tax` decimal(7,2),
`wr_fee` decimal(7,2),
`wr_return_ship_cost` decimal(7,2),
`wr_refunded_cash` decimal(7,2),
`wr_reversed_charge` decimal(7,2),
`wr_account_credit` decimal(7,2),
`wr_net_loss` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_returns'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='38487',
'rawDataSize'='43257220',
'totalSize'='1471571',
'transient_lastDdlTime'='1473167269')
Sample row:
38952 68126 9590 52948 893223 2020 6942 52948 893223 2020 6942 32 13 63444 33 626.01 25.04 651.05 18.74 958.4 244.14 286.4 95.47 1008.48
web_sales (668052 rows) 32.99MB
Create-table statement:
hive> show create table web_sales;
OK
CREATE TABLE `web_sales`(
`ws_sold_date_sk` bigint,
`ws_sold_time_sk` bigint,
`ws_ship_date_sk` bigint,
`ws_item_sk` bigint,
`ws_bill_customer_sk` bigint,
`ws_bill_cdemo_sk` bigint,
`ws_bill_hdemo_sk` bigint,
`ws_bill_addr_sk` bigint,
`ws_ship_customer_sk` bigint,
`ws_ship_cdemo_sk` bigint,
`ws_ship_hdemo_sk` bigint,
`ws_ship_addr_sk` bigint,
`ws_web_page_sk` bigint,
`ws_web_site_sk` bigint,
`ws_ship_mode_sk` bigint,
`ws_warehouse_sk` bigint,
`ws_promo_sk` bigint,
`ws_order_number` bigint,
`ws_quantity` int,
`ws_wholesale_cost` decimal(7,2),
`ws_list_price` decimal(7,2),
`ws_sales_price` decimal(7,2),
`ws_ext_discount_amt` decimal(7,2),
`ws_ext_sales_price` decimal(7,2),
`ws_ext_wholesale_cost` decimal(7,2),
`ws_ext_list_price` decimal(7,2),
`ws_ext_tax` decimal(7,2),
`ws_coupon_amt` decimal(7,2),
`ws_ext_ship_cost` decimal(7,2),
`ws_net_paid` decimal(7,2),
`ws_net_paid_inc_tax` decimal(7,2),
`ws_net_paid_inc_ship` decimal(7,2),
`ws_net_paid_inc_ship_tax` decimal(7,2),
`ws_net_profit` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_sales'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='668052',
'rawDataSize'='1221174888',
'totalSize'='34585166',
'transient_lastDdlTime'='1473167263')
Sample row:
36890 26789 36993 794 85457 380790 2436 42649 29934 1583844 5251 525 2 14 1 3 120 0 93 32.62 81.22 59.29 2039.49 5513.97 3033.66 7553.46 330.84 0 1661.76 5513.97 5844.81 7175.73 7506.57 2480.31
web_site (30 rows) 10.31KB
Create-table statement:
hive> show create table web_site;
OK
CREATE TABLE `web_site`(
`web_site_sk` bigint,
`web_site_id` string,
`web_rec_start_date` string,
`web_rec_end_date` string,
`web_name` string,
`web_open_date_sk` bigint,
`web_close_date_sk` bigint,
`web_class` string,
`web_manager` string,
`web_mkt_id` int,
`web_mkt_class` string,
`web_mkt_desc` string,
`web_market_manager` string,
`web_company_id` int,
`web_company_name` string,
`web_street_number` string,
`web_street_name` string,
`web_street_type` string,
`web_suite_number` string,
`web_city` string,
`web_county` string,
`web_state` string,
`web_zip` string,
`web_country` string,
`web_gmt_offset` decimal(5,2),
`web_tax_percentage` decimal(5,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_site'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='30',
'rawDataSize'='62832',
'totalSize'='10554',
'transient_lastDdlTime'='1473167210')
Sample row:
2 AAAAAAAAAAAAAAAC 2002-07-17 site_0 16450 91500 Unknown Gregory George 1 sheaves despite the quietly sly asymp thin enticing frets except the sometimes final courts might promise blithe dino Frank Hernandez 1 ese 17 5th Ave EbDxJVL Georgetown Guadalupe County TX 75435 United States -6 0.01
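Taken together, the *_sk surrogate keys make bigbench.db a classic star schema: the fact tables (store_sales, web_sales, web_clickstreams and the two *_returns tables) reference the dimension tables listed above. A minimal join sketch over one such key, for illustration only (none of the 30 TPCx-BB queries is this simple):
hive> SELECT w.wp_type, COUNT(*) AS clicks
    > FROM bigbench.web_clickstreams c
    > JOIN bigbench.web_page w ON (c.wcs_web_page_sk = w.wp_web_page_sk)
    > GROUP BY w.wp_type;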
