Introduction
Evaluating a system's performance usually involves different metrics, each with its own test methods and tools. In general, mature commercial benchmarking software is chosen to ensure that results are fair and authoritative. But in specific situations, such as when you simply want to compare different systems or the performance of a few libraries, you can also pick an excellent tool from the open-source world for the job. This article gives a brief introduction to whole-system performance testing with lmbench.
The benchmark
lmbench is a suite of simple, portable micro-benchmarks written in ANSI C for UNIX/POSIX systems. Broadly speaking, it measures two key characteristics: latency and bandwidth. lmbench aims to give system developers insight into the basic costs of key operations.
About the software:
lmbench is a multi-platform open-source benchmark for evaluating overall system performance. It can measure file read/write, memory operations, process creation and destruction overhead, networking, and more, and it is simple to run.
Because lmbench is multi-platform, it can run comparative tests on systems of the same class and expose the strengths and weaknesses of each; by selecting different library functions we can also compare library performance. More importantly, as an open-source project, lmbench provides a test framework: if a tester has more demanding requirements, they can be met with small modifications to the source code (for example, the stock benchmarks only measure process creation/termination cost and process-switching overhead, but with some code changes the same measurements can be made at the thread level).
Download:
www.bitmover.com/lmbench, latest version 3.0-a9
Main capabilities of lmbench:
Bandwidth benchmarks
—cached file read
—memory copy
—memory read
—memory write
—pipe
—TCP
Latency benchmarks
—context switching
—networking: connection establishment, pipe, TCP, UDP, and RPC hot potato
—file system creates and deletes
—process creation
—signal handling
—system call overhead
—memory read latency
Miscellaneous
—processor clock rate calculation
Main characteristics of lmbench:
—Portability across operating systems
The benchmarks are written in C and are quite portable (although they are easier to build with GCC). This is useful for producing detailed head-to-head comparisons between systems.
—Adaptive tuning
lmbench adapts to the system under test. When it runs into a case where some BloatOS is four times slower than all of its competitors, the tool adjusts its resource allocation so the comparison remains meaningful.
— Database of results
The results database includes runs from most of the major workstation vendors.
—Memory latency results
The memory latency test shows the latency of every level of the cache hierarchy (L1, L2, and L3 where present), as well as main-memory latency and TLB-miss latency. In addition, cache sizes can be correctly derived from the result sets. This benchmark has even uncovered bugs in operating system paging policies.
—Context-switch results
Many people seem to like context-switch numbers. This benchmark deliberately does not report only the "in cache" figures. It varies both the number and the size of the processes, and presents the results in a way that lets the user see the cost when the working set is no longer in the cache. You can also obtain the real cost of a cold-cache context switch.
— Regression testing
Sun and SGI have used these benchmarks to find and fix performance problems.
Intel used them during the development of the P6.
Linux developers used them while tuning Linux performance.
— New benchmarks
The source code is small, readable, and easy to extend. It can be combined in the usual ways to measure other things, for example networking measurements that include the library functions for connection establishment, server shutdown, and so on.
Directory layout
[root@jiangyi01.sqa.zmf /tmp/lmbench3]
#ls
lmbench3 lmbench3.tar.gz
[root@jiangyi01.sqa.zmf /tmp/lmbench3]
#cd lmbench3/
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#ls
ACKNOWLEDGEMENTS CHANGES COPYING-2 hbench-REBUTTAL README SCCS src
bin COPYING doc Makefile results scripts
Configuration file
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#ll bin/x86_64-linux-gnu/*`hostname`
-rw-r--r-- 1 root root 719 Mar 8 17:18 bin/x86_64-linux-gnu/CONFIG.jiangyi01.sqa.zmf
-rwxr-xr-x 1 root root 1232 Mar 7 20:52 bin/x86_64-linux-gnu/INFO.jiangyi01.sqa.zmf
Script that generates the configuration file
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#ll scripts/config-run
-r-xr-xr-x 1 14557 501 21018 Mar 8 17:18 scripts/config-run
Running the benchmark
The make results target actually invokes scripts/config-run:
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#make results
cd src && make results
make[1]: Entering directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Entering directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Nothing to be done for `all'.
gmake[2]: Leaving directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Entering directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Nothing to be done for `opt'.
gmake[2]: Leaving directory `/tmp/lmbench3/lmbench3/src'
=====================================================================
L M B E N C H   C O N F I G U R A T I O N
----------------------------------------
You need to configure some parameters to lmbench. Once you have configured
these parameters, you may do multiple runs by saying
"make rerun"
in the src subdirectory.
NOTICE: please do not have any other activity on the system if you can
help it. Things like the second hand on your xclock or X perfmeters
are not so good when benchmarking. In fact, X is not so good when
benchmarking.
=====================================================================
Hang on, we are calculating your timing granularity.
OK, it looks like you can time stuff down to 5000 usec resolution.
Hang on, we are calculating your timing overhead.
OK, it looks like your gettimeofday() costs 0 usecs.
Hang on, we are calculating your loop overhead.
OK, it looks like your benchmark loop costs 0.00000197 usecs.
=====================================================================
If you are running on an MP machine and you want to try running
multiple copies of lmbench in parallel, you can specify how many here.
Using this option will make the benchmark run 100x slower (sorry).
NOTE: WARNING! This feature is experimental and many results are
known to be incorrect or random!
MULTIPLE COPIES [default 1] 1
Options to control job placement
1) Allow scheduler to place jobs
2) Assign each benchmark process with any attendent child processes
to its own processor
3) Assign each benchmark process with any attendent child processes
to its own processor, except that it will be as far as possible
from other processes
4) Assign each benchmark and attendent processes to their own
processors
5) Assign each benchmark and attendent processes to their own
processors, except that they will be as far as possible from
each other and other processes
6) Custom placement: you assign each benchmark process with attendent
child processes to processors
7) Custom placement: you assign each benchmark and attendent
processes to processors
Note: some benchmarks, such as bw_pipe, create attendent child
processes for each benchmark process. For example, bw_pipe
needs a second process to send data down the pipe to be read
by the benchmark process. If you have three copies of the
benchmark process running, then you actually have six processes;
three attendent child processes sending data down the pipes and
three benchmark processes reading data and doing the measurements.
Job placement selection: 1
=====================================================================
Several benchmarks operate on a range of memory. This memory should be
sized such that it is at least 4 times as big as the external cache[s]
on your system. It should be no more than 80% of your physical memory.
The bigger the range, the more accurate the results, but larger sizes
take somewhat longer to run the benchmark.
MB [default 67535] 100
Checking to see if you have 100 MB; please wait for a moment...
100MB OK
100MB OK
100MB OK
Hang on, we are calculating your cache line size.
OK, it looks like your cache line is 128 bytes.
=====================================================================
lmbench measures a wide variety of system performance, and the full suite
of benchmarks can take a long time on some platforms. Consequently, we
offer the capability to run only predefined subsets of benchmarks, one
for operating system specific benchmarks and one for hardware specific
benchmarks. We also offer the option of running only selected benchmarks
which is useful during operating system development.
Please remember that if you intend to publish the results you either need
to do a full run or one of the predefined OS or hardware subsets.
SUBSET (ALL|HARWARE|OS|DEVELOPMENT) [default all] h
=====================================================================
This benchmark measures, by default, memory latency for a number of
different strides. That can take a long time and is most useful if you
are trying to figure out your cache line size or if your cache line size
is greater than 128 bytes.
If you are planning on sending in these results, please don't do a fast
run.
Answering yes means that we measure memory latency with a 128 byte stride.
FASTMEM [default no]
=====================================================================
This benchmark measures, by default, file system latency. That can
take a long time on systems with old style file systems (i.e., UFS,
FFS, etc.). Linux' ext2fs and Sun's tmpfs are fast enough that this
test is not painful.
If you are planning on sending in these results, please don't do a fast
run.
If you want to skip the file system latency tests, answer "yes" below.
SLOWFS [default no]
=====================================================================
This benchmark can measure disk zone bandwidths and seek times. These can
be turned into whizzy graphs that pretty much tell you everything you might
need to know about the performance of your disk.
This takes a while and requires read access to a disk drive.
Write is not measured, see disk.c to see how if you want to do so.
If you want to skip the disk tests, hit return below.
If you want to include disk tests, then specify the path to the disk
device, such as /dev/sda. For each disk that is readable, you'll be
prompted for a one line description of the drive, i.e.,
Iomega IDE ZIP
or
HP C3725S 2GB on 10MB/sec NCR SCSI bus
DISKS [default none]
=====================================================================
If you are running on an idle network and there are other, identically
configured systems, on the same wire (no gateway between you and them),
and you have rsh access to them, then you should run the network part
of the benchmarks to them. Please specify any such systems as a space
separated list such as: ether-host fddi-host hippi-host.
REMOTE [default none]
=====================================================================
Calculating mhz, please wait for a moment...
I think your CPU mhz is
2194 MHz, 0.4558 nanosec clock
but I am frequently wrong. If that is the wrong Mhz, type in your
best guess as to your processor speed. It doesn't have to be exact,
but if you know it is around 800, say 800.
Please note that some processors, such as the P4, have a core which
is double-clocked, so on those processors the reported clock speed
will be roughly double the advertised clock rate. For example, a
1.8GHz P4 may be reported as a 3592MHz processor.
Processor mhz [default 2194 MHz, 0.4558 nanosec clock]
=====================================================================
We need a place to store a 100 Mbyte file as well as create and delete a
large number of small files. We default to /usr/tmp. If /usr/tmp is a
memory resident file system (i.e., tmpfs), pick a different place.
Please specify a directory that has enough space and is a local file
system.
FSDIR [default /usr/tmp]
=====================================================================
lmbench outputs status information as it runs various benchmarks.
By default this output is sent to /dev/tty, but you may redirect
it to any file you wish (such as /dev/null...).
Status output file [default /dev/tty]
=====================================================================
There is a database of benchmark results that is shipped with new
releases of lmbench. Your results can be included in the database
if you wish. The more results the better, especially if they include
remote networking. If your results are interesting, i.e., for a new
fast box, they may be made available on the lmbench web page, which is
http://www.bitmover.com/lmbench
Mail results [default yes] n
OK, no results mailed.
=====================================================================
Confguration done, thanks.
There is a mailing list for discussing lmbench hosted at BitMover.
Send mail to majordomo@bitmover.com to join the list.
Using config in CONFIG.jiangyi01.sqa.zmf
Wed Mar 8 16:30:53 CST 2017
Latency measurements
Wed Mar 8 16:31:10 CST 2017
Local networking
Wed Mar 8 16:31:14 CST 2017
Bandwidth measurements
Wed Mar 8 16:31:27 CST 2017
Calculating effective TLB size
Wed Mar 8 16:31:29 CST 2017
Calculating memory load parallelism
Wed Mar 8 16:32:12 CST 2017
McCalpin's STREAM benchmark
Wed Mar 8 16:32:14 CST 2017
Calculating memory load latency
Wed Mar 8 16:52:16 CST 2017
make[1]: Leaving directory `/tmp/lmbench3/lmbench3/src'
Reading the results
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#make see
cd results && make summary percent 2>/dev/null | more
make[1]: Entering directory `/tmp/lmbench3/lmbench3/results'
L M B E N C H 3 . 0 S U M M A R Y
------------------------------------
(Alpha software, do not distribute)
Basic system parameters
------------------------------------------------------------------------------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
jiangyi01 Linux 3.10.0- x86_64-linux-gnu 2194 32 128 6.4300 1
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
jiangyi01 Linux 3.10.0- 2195 0.07 0.15 0.99 1.96 4.12 0.16 1.05
Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host OS intgr intgr intgr intgr intgr
bit add mul div mod
--------- ------------- ------ ------ ------ ------ ------
jiangyi01 Linux 3.10.0- 0.4600 0.0700 1.4100 10.3 11.5
Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host OS float float float float
add mul div bogo
--------- ------------- ------ ------ ------ ------
jiangyi01 Linux 3.10.0- 1.3700 2.2800 6.5400 6.3900
Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host OS double double double double
add mul div bogo
--------- ------------- ------ ------ ------ ------
jiangyi01 Linux 3.10.0- 1.3700 2.2800 10.2 10.0
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
jiangyi01 Linux 3.10.0- 0.309 1.549
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
jiangyi01 Linux 3.10.0- 2405.2 4445.5 6114 5489.
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Rand mem Guesses
--------- ------------- --- ---- ---- -------- -------- -------
jiangyi01 Linux 3.10.0- 2194 1.8250 5.4860 49.3 117.6
make[1]: Leaving directory `/tmp/lmbench3/lmbench3/results'


Technical parameters
Parameter descriptions
My descriptions of the result parameters here are not exhaustive; see the REF link for a more complete reference.
(1) Basic system parameters
Tlb pages: number of TLB (Translation Lookaside Buffer) pages
Cache line bytes: cache line size in bytes
Mem par: memory hierarchy parallelism
Scal load: number of lmbench instances run in parallel
(2) Processor, Processes (processor and process operation times)
Null call: a trivial system call (getting the process ID)
Null I/O: trivial I/O (average of a null read and a null write)
Stat: fetching file status
Open clos: opening and then immediately closing a file
Slct TCP: select() on TCP file descriptors
Sig inst: installing a signal handler
Sig hndl: catching and handling a signal
Fork proc: fork() followed by immediate exit
Exec proc: fork() followed by execve() and exit
Sh proc: fork() followed by running a shell and exiting
(3) Basic integer/float/double operations
Omitted.
(4) Context switching (context-switch time)
2p/16K: 2 processes running in parallel, each with a 16 KB working set
(5) Local Communication latencies (local communication latency: send through each mechanism and immediately read the data back)
Pipe: pipe communication
AF UNIX: UNIX-domain socket
UDP: UDP
RPC/UDP: RPC over UDP
TCP: TCP
RPC/TCP: RPC over TCP
TCP conn: establishing a TCP connection and closing the descriptor
(6) File & VM system latencies (file and memory latency)
File Create & Delete: creating and deleting a file
MMap Latency: memory mapping
Prot Fault: protection fault
Page Fault: page fault
100fd selct: time for select() over 100 file descriptors
(7) Local Communication bandwidths (local communication bandwidth)
Pipe: pipe transfer
AF UNIX: UNIX-domain socket
TCP: TCP transfer
File reread: rereading a file
MMap reread: rereading a memory mapping
Bcopy (libc): memory copy via the C library
Bcopy (hand): hand-rolled memory copy
Mem read: memory read
Mem write: memory write
(8) Memory latencies (memory access latency)
L1: level-1 cache
L2: level-2 cache
Main Mem: sequential main-memory access
Rand Mem: random memory access latency
Guesses:
If L1 and L2 look alike, "No L1 cache?" is shown
If L2 and Main Mem look alike, "No L2 cache?" is shown
