Reposted from http://www.cnblogs.com/lenolix/archive/2010/12/13/1904868.html
Summary: This article surveys Google's profiling tools alongside other commonly used profilers such as GNU gprof and OProfile (all open-source projects), and gives a brief analysis and comparison of how they are implemented. We hope it will be useful for future adoption or follow-up development.
I. GNU Gprof
Gprof is the GNU profiler. It can display a program's "flat profile", including how many times each function was called and how much processor time each function consumed. It can also display the "call graph", showing which functions called which and how much time each call took, and "annotated source", a copy of the program's source code marked with the number of times each line was executed. Gprof's usage and internals have already been covered in many articles online, so this article does not repeat them in detail; it only organizes and summarizes them for easier reading. (Official references, and the authoritative sources: http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html and http://sourceware.org/binutils/docs/gprof/index.html)
1.1 Installation
Ships with glibc; no separate installation is needed.
1.2 Usage
See http://hi.baidu.com/juventus/blog/item/312dd42a0faf169b033bf6ff.html/cmtid/3c34349bb5a8ceb8c9eaf4c5
For graphical output, see this blog post: http://www.51testing.com/?uid-13997-action-viewspace-itemid-79952
1.3 Implementation
Quoting the official documentation:
| Profiling works by changing how every function in your program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. This change is made by the compiler when your program is compiled with the `-pg' option. Profiling also involves watching your program as it runs, and keeping a histogram of where the program counter happens to be every now and then. Typically the program counter is looked at around 100 times per second of run time, but the exact frequency may vary from system to system. A special startup routine allocates memory for the histogram and sets up a clock signal handler to make entries in it. Use of this special startup routine is one of the effects of using `gcc ... -pg' to link. The startup file also includes an `exit' function which is responsible for writing the file `gmon.out'. Number-of-calls information for library routines is collected by using a special version of the C library. The programs in it are the same as in the usual C library, but they were compiled with `-pg'. If you link your program with `gcc ... -pg', it automatically uses the profiling version of the library. The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time. Therefore, the time measurements in gprof output say nothing about time that your program was not running. For example, a part of the program that creates so much data that it cannot all fit in physical memory at once may run very slowly due to thrashing, but gprof will say it uses little time. On the other hand, sampling by run time has the advantage that the amount of load due to other users won't directly affect the output you get. |
When a program is compiled with the "-pg" option, gcc does three things:
1. At the program's entry point (before main), it inserts a call to monstartup, which performs the profiling initialization: allocating memory to hold the collected data and installing a clock signal handler;
2. At the entry of every function, it inserts a call to _mcount, which records the function's call information: timing, call counts, and call-stack information;
3. At program exit (via atexit()), it arranges for _mcleanup() to run, which writes the profiling data out to gmon.out.
These steps can be seen by disassembling with objdump:
objdump -S a.out
0000000000400aba <main>:
  400aba:  55                      push  %rbp
  400abb:  48 89 e5                mov   %rsp,%rbp
  400abe:  48 83 ec 20             sub   $0x20,%rsp
  400ac2:  e8 69 fd ff ff          callq 400830 <mcount@plt>
  ......
As you can see, one line of assembly was inserted at the entry of main: callq 400830 <mcount@plt>, so the first thing main executes is a call to _mcount.
Let's now look at what these three glibc functions actually do:
a) __monstartup, defined in glibc's gmon/gmon.c
| A special startup routine allocates memory for the histogram and either calls profil() or sets up a clock signal handler. This routine (monstartup) can be invoked in several ways. On Linux systems, a special profiling startup file gcrt0.o, which invokes monstartup before main, is used instead of the default crt0.o. Use of this special startup file is one of the effects of using `gcc ... -pg' to link. On SPARC systems, no special startup files are used. Rather, the mcount routine, when it is invoked for the first time (typically when main is called), calls monstartup. |
On Linux, __monstartup is called from __gmon_start__. During linking, gcc substitutes gcrt0.o for the default crt0.o, thereby changing the initialization performed before main runs:
| crt0.o is the startup file an application needs when it is compiled and linked; it is pulled in at link time. Its main jobs are to initialize the application stack, set up the program's runtime environment, and clean up and release resources when the program exits. |
__gmon_start__ is defined in csu/gmon-start.c:
void
__gmon_start__ (void)
{
#ifdef HAVE_INITFINI
  /* Protect from being called more than once.  Since crti.o is linked
     into every shared library, each of their init functions will call us.  */
  static int called;

  if (called)
    return;

  called = 1;
#endif

  /* Start keeping profiling records.  */
  __monstartup ((u_long) TEXT_START, (u_long) &etext);

  /* Call _mcleanup before exiting; it will write out gmon.out from the
     collected data.  */
  atexit (&_mcleanup);
}
__gmon_start__ not only calls __monstartup, it also registers a cleanup function, _mcleanup, which runs when the program exits. _mcleanup's job is described later; first let's see what __monstartup does.
| void __monstartup (lowpc, highpc) u_long lowpc; u_long highpc; { register int o; char *cp; struct gmonparam *p = &_gmonparam;
/* * round lowpc and highpc to multiples of the density we're using * so the rest of the scaling (here and in gprof) stays in ints. */ p->lowpc = ROUNDDOWN(lowpc, HISTFRACTION * sizeof(HISTCOUNTER)); p->highpc = ROUNDUP(highpc, HISTFRACTION * sizeof(HISTCOUNTER)); p->textsize = p->highpc - p->lowpc; p->kcountsize = ROUNDUP(p->textsize / HISTFRACTION, sizeof(*p->froms)); p->hashfraction = HASHFRACTION; p->log_hashfraction = -1; /* The following test must be kept in sync with the corresponding test in mcount.c. */ if ((HASHFRACTION & (HASHFRACTION - 1)) == 0) { /* if HASHFRACTION is a power of two, mcount can use shifting instead of integer division. Precompute shift amount. */ p->log_hashfraction = ffs(p->hashfraction * sizeof(*p->froms)) - 1; } p->fromssize = p->textsize / HASHFRACTION; p->tolimit = p->textsize * ARCDENSITY / 100; if (p->tolimit < MINARCS) p->tolimit = MINARCS; else if (p->tolimit > MAXARCS) p->tolimit = MAXARCS; p->tossize = p->tolimit * sizeof(struct tostruct);
cp = calloc (p->kcountsize + p->fromssize + p->tossize, 1); if (! cp) { ERR("monstartup: out of memory\n"); p->tos = NULL; p->state = GMON_PROF_ERROR; return; } p->tos = (struct tostruct *)cp; cp += p->tossize; p->kcount = (HISTCOUNTER *)cp; cp += p->kcountsize; p->froms = (ARCINDEX *)cp;
p->tos[0].link = 0;
o = p->highpc - p->lowpc; if (p->kcountsize < (u_long) o) { #ifndef hp300 s_scale = ((float)p->kcountsize / o ) * SCALE_1_TO_1; #else /* avoid floating point operations */ int quot = o / p->kcountsize;
if (quot >= 0x10000) s_scale = 1; else if (quot >= 0x100) s_scale = 0x10000 / quot; else if (o >= 0x800000) s_scale = 0x1000000 / (o / (p->kcountsize >> 8)); else s_scale = 0x1000000 / ((o << 8) / p->kcountsize); #endif } else s_scale = SCALE_1_TO_1;
__moncontrol(1); } |
As you can see, most of the code in this function does initialization work, allocating storage for the profiling data. Its two arguments, lowpc and highpc (debugging shows that lowpc is in fact the start address of the program's text segment, and highpc is the end address of the text segment, &etext), define the address range for which profiling data is recorded; gprof records nothing for addresses outside this range. This also explains why gprof cannot resolve shared libraries: shared libraries are loaded outside the program's text segment. A small experiment demonstrates the point.
Take a simple test program:
#include <stdio.h>
#include <unistd.h>

int or_f(int a, int b)
{
    return a ^ b;
}

int main(int argc, char **argv)
{
    printf("%d\n", or_f(1, 2));
    sleep(30);
    return 1;
}
Compile it into the executable ./test, then use readelf to inspect test's section headers:
readelf -S test
Section Headers:
  [Nr] Name     Type       Address           Offset
       Size              EntSize           Flags  Link  Info  Align
  ......
  [12] .text    PROGBITS   0000000000400540  00000540
       0000000000000278  0000000000000000  AX        0     0     16
  ......
The output shows that test's text section occupies 0x400540 ~ 0x400540 + 0x278.
Next we run ./test. With a small modification to the glibc source, we print the two actual arguments passed to __monstartup:
lowpc: 400540, highpc: 4007c6, matching the test program's text segment.
We also dump the test process's memory mappings:
cat /proc/self/maps:
00400000-00401000 r-xp 00000000 08:03 70746688        /tmp/test
00600000-00601000 rw-p 00000000 08:03 70746688        /tmp/test
10ca4000-10cc5000 rw-p 10ca4000 00:00 0               [heap]
3536600000-353661c000 r-xp 00000000 08:03 93028660    /lib64/ld-2.5.so
353681b000-353681c000 r--p 0001b000 08:03 93028660    /lib64/ld-2.5.so
353681c000-353681d000 rw-p 0001c000 08:03 93028660    /lib64/ld-2.5.so
2b4f1af23000-2b4f1af25000 rw-p 2b4f1af23000 00:00 0
2b4f1af25000-2b4f1b063000 r-xp 00000000 08:03 32931849  /root/glibc-2.5-42-build/lib/libc-2.5.so
2b4f1b063000-2b4f1b263000 ---p 0013e000 08:03 32931849  /root/glibc-2.5-42-build/lib/libc-2.5.so
2b4f1b263000-2b4f1b267000 r--p 0013e000 08:03 32931849  /root/glibc-2.5-42-build/lib/libc-2.5.so
2b4f1b267000-2b4f1b268000 rw-p 00142000 08:03 32931849  /root/glibc-2.5-42-build/lib/libc-2.5.so
2b4f1b268000-2b4f1b26f000 rw-p 2b4f1b268000 00:00 0
7fffa306b000-7fffa3080000 rw-p 7ffffffea000 00:00 0   [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0  [vdso]
test is mapped into memory at 00400000-00401000, while libc.so is mapped at 2b4f1af25000-2b4f1b063000, which lies outside the [lowpc, highpc] range, so the functions in libc will never be resolved by gprof.
__monstartup函數的最后會調用__moncontrol函數來設置一個clock信號處理函數用於設置提取sample。
__moncontrol is defined in glibc's gmon/gmon.c:
void
__moncontrol (mode)
     int mode;
{
  struct gmonparam *p = &_gmonparam;

  /* Don't change the state if we ran into an error.  */
  if (p->state == GMON_PROF_ERROR)
    return;

  if (mode)
    {
      /* start */
      __profil((void *) p->kcount, p->kcountsize, p->lowpc, s_scale);
      p->state = GMON_PROF_ON;
    }
  else
    {
      /* stop */
      __profil(NULL, 0, 0, 0);
      p->state = GMON_PROF_OFF;
    }
}
__profil is defined in sysdeps/posix/profil.c:
int
__profil (u_short *sample_buffer, size_t size, size_t offset, u_int scale)
{
  struct sigaction act;
  struct itimerval timer;
#ifndef IS_IN_rtld
  static struct sigaction oact;
  static struct itimerval otimer;
# define oact_ptr &oact
# define otimer_ptr &otimer

  if (sample_buffer == NULL)
    {
      /* Disable profiling.  */
      if (samples == NULL)
        /* Wasn't turned on.  */
        return 0;

      if (__setitimer (ITIMER_PROF, &otimer, NULL) < 0)
        return -1;
      samples = NULL;
      return __sigaction (SIGPROF, &oact, NULL);
    }

  if (samples)
    {
      /* Was already turned on.  Restore old timer and signal handler
         first.  */
      if (__setitimer (ITIMER_PROF, &otimer, NULL) < 0
          || __sigaction (SIGPROF, &oact, NULL) < 0)
        return -1;
    }
#else
  /* In ld.so profiling should never be disabled once it runs.  */
  //assert (sample_buffer != NULL);
# define oact_ptr NULL
# define otimer_ptr NULL
#endif

  samples = sample_buffer;
  nsamples = size / sizeof *samples;
  pc_offset = offset;
  pc_scale = scale;

  act.sa_handler = (sighandler_t) &profil_counter;
  act.sa_flags = SA_RESTART;
  __sigfillset (&act.sa_mask);
  if (__sigaction (SIGPROF, &act, oact_ptr) < 0)
    return -1;

  timer.it_value.tv_sec = 0;
  timer.it_value.tv_usec = 1000000 / __profile_frequency ();
  timer.it_interval = timer.it_value;
  return __setitimer (ITIMER_PROF, &timer, otimer_ptr);
}
The main job of this function is to install a SIGPROF signal handler and to set, via __setitimer, the frequency at which SIGPROF is delivered. This signal handler's role is crucial and is described further below.
b) _mcount, defined in sysdeps/generic/machine-gmon.h
#define MCOUNT \
void _mcount (void) \
{ \
  mcount_internal ((u_long) RETURN_ADDRESS (1), (u_long) RETURN_ADDRESS (0)); \
}
(u_long) RETURN_ADDRESS (nr) expands to __builtin_return_address(nr), which returns the pc of frame nr in the current call stack. So (u_long) RETURN_ADDRESS (0) yields the address within the function currently being profiled, selfpc, while (u_long) RETURN_ADDRESS (1) yields that function's return address, i.e. an address inside its caller, frompc.
| __builtin_return_address(LEVEL): This function returns the return address of the current function, or of one of its callers. The LEVEL argument is number of frames to scan up the call stack. A value of '0' yields the return address of the current function, a value of '1' yields the return address of the caller of the current function, and so forth. |
mcount_internal is defined in gmon/mcount.c:
| _MCOUNT_DECL(frompc, selfpc) /* _mcount; may be static, inline, etc */ { register ARCINDEX *frompcindex; register struct tostruct *top, *prevtop; register struct gmonparam *p; register ARCINDEX toindex; int i;
p = &_gmonparam; /* * check that we are profiling * and that we aren't recursively invoked. */ if (catomic_compare_and_exchange_bool_acq (&p->state, GMON_PROF_BUSY, GMON_PROF_ON)) return;
/* * check that frompcindex is a reasonable pc value. * for example: signal catchers get called from the stack, * not from text space. too bad. */ frompc -= p->lowpc; if (frompc > p->textsize) goto done;
/* The following test used to be if (p->log_hashfraction >= 0) But we can simplify this if we assume the profiling data is always initialized by the functions in gmon.c. But then it is possible to avoid a runtime check and use the smae `if' as in gmon.c. So keep these tests in sync. */ if ((HASHFRACTION & (HASHFRACTION - 1)) == 0) { /* avoid integer divide if possible: */ i = frompc >> p->log_hashfraction; } else { i = frompc / (p->hashfraction * sizeof(*p->froms)); } frompcindex = &p->froms[i]; toindex = *frompcindex; if (toindex == 0) { /* * first time traversing this arc */ toindex = ++p->tos[0].link; if (toindex >= p->tolimit) /* halt further profiling */ goto overflow;
*frompcindex = toindex; top = &p->tos[toindex]; top->selfpc = selfpc; top->count = 1; top->link = 0; goto done; } top = &p->tos[toindex]; if (top->selfpc == selfpc) { /* * arc at front of chain; usual case. */ top->count++; goto done; } /* * have to go looking down chain for it. * top points to what we are looking at, * prevtop points to previous top. * we know it is not at the head of the chain. */ for (; /* goto done */; ) { if (top->link == 0) { /* * top is end of the chain and none of the chain * had top->selfpc == selfpc. * so we allocate a new tostruct * and link it to the head of the chain. */ toindex = ++p->tos[0].link; if (toindex >= p->tolimit) goto overflow;
top = &p->tos[toindex]; top->selfpc = selfpc; top->count = 1; top->link = *frompcindex; *frompcindex = toindex; goto done; } /* * otherwise, check the next arc on the chain. */ prevtop = top; top = &p->tos[top->link]; if (top->selfpc == selfpc) { /* * there it is. * increment its count * move it to the head of the chain. */ top->count++; toindex = prevtop->link; prevtop->link = top->link; top->link = *frompcindex; *frompcindex = toindex; goto done; }
} done: p->state = GMON_PROF_ON; return; overflow: p->state = GMON_PROF_ERROR; return; } |
This function's main job is to record how many times each function is called and to maintain the caller/callee (arc) table, storing the data in the global variable _gmonparam. Because it is invoked through instrumentation (code inserted at every function entry), the information it gathers is exact. This point matters as a contrast with the next key function, profil_counter. Recall the earlier initialization step: in its final stage, monstartup registered a SIGPROF signal handler via sigaction, and that handler is profil_counter. It is invoked at the rate returned by __profile_frequency() and does the core profiling work: collecting the samples from which each function's time consumption is computed.
profil_counter's definition is platform specific; on x86_64 it lives in sysdeps/unix/sysv/linux/x86_64/profil-counter.h:
static void
profil_counter (int signo, SIGCONTEXT scp)
{
  profil_count ((void *) GET_PC (scp));

  /* This is a hack to prevent the compiler from implementing the
     above function call as a sibcall.  The sibcall would overwrite
     the signal context.  */
  asm volatile ("");
}
It ultimately calls profil_count, defined in sysdeps/posix/profil.c:
static inline void
profil_count (void *pc)
{
  size_t i = (pc - pc_offset - (void *) 0) / 2;

  if (sizeof (unsigned long long int) > sizeof (size_t))
    i = (unsigned long long int) i * pc_scale / 65536;
  else
    i = i / 65536 * pc_scale + i % 65536 * pc_scale / 65536;

  if (i < nsamples)
    ++samples[i];
}
The logic here is a little opaque and needs the earlier context to understand. The globals pc_offset, pc_scale and samples were assigned in __profil: tracing back through __profil's code, samples = _gmonparam->kcount and holds the sample counters, pc_offset = p->lowpc is the start address of the text segment, and pc_scale is a scale factor controlling the sampling granularity, expressed as a fixed-point fraction with denominator 65536. Putting it together, gprof maps the code in [lowpc, highpc] onto the histogram array, with pc_scale determining the mapping granularity. For any pc in range, its array index is (pc - lowpc) / (65536 / pc_scale) = (pc - lowpc) * pc_scale / 65536. Each array entry (one sample counter) thus corresponds to a stretch of program addresses, and every time the timer fires while the pc is inside that stretch, the corresponding counter is incremented by 1.
c ) 最后當程序結束時,會調用_mcleanup,其定義在gmon/gmon.c中。
void
_mcleanup (void)
{
  __moncontrol (0);

  if (_gmonparam.state != GMON_PROF_ERROR)
    write_gmon ();

  /* free the memory.  */
  free (_gmonparam.tos);
}
It first stops profiling via __moncontrol(0), then writes the profiling data out to gmon.out via write_gmon().
write_gmon is defined in gmon/gmon.c:
| static void write_gmon (void) { struct gmon_hdr ghdr __attribute__ ((aligned (__alignof__ (int)))); int fd = -1; char *env;
#ifndef O_NOFOLLOW # define O_NOFOLLOW 0 #endif
env = getenv ("GMON_OUT_PREFIX"); if (env != NULL && !__libc_enable_secure) { size_t len = strlen (env); char buf[len + 20]; __snprintf (buf, sizeof (buf), "%s.%u", env, __getpid ()); fd = open_not_cancel (buf, O_CREAT|O_TRUNC|O_WRONLY|O_NOFOLLOW, 0666); }
if (fd == -1) { fd = open_not_cancel ("gmon.out", O_CREAT|O_TRUNC|O_WRONLY|O_NOFOLLOW, 0666); if (fd < 0) { char buf[300]; int errnum = errno; __fxprintf (NULL, "_mcleanup: gmon.out: %s\n", __strerror_r (errnum, buf, sizeof buf)); return; } }
/* write gmon.out header: */ memset (&ghdr, '\0', sizeof (struct gmon_hdr)); memcpy (&ghdr.cookie[0], GMON_MAGIC, sizeof (ghdr.cookie)); *(int32_t *) ghdr.version = GMON_VERSION; write_not_cancel (fd, &ghdr, sizeof (struct gmon_hdr));
/* write PC histogram: */ write_hist (fd);
/* write call-graph: */ write_call_graph (fd);
/* write basic-block execution counts: */ write_bb_counts (fd);
close_not_cancel_no_status (fd); } |
Through the helpers write_hist, write_call_graph and write_bb_counts, it writes the pc histogram, the call graph, and the basic-block execution counts, respectively, into gmon.out.
1.4 Analyzing gprof's output
Once gmon.out has been produced, the gprof tool from GNU binutils converts the data into an easy-to-read, easy-to-understand format (text, graphs, etc.).
gprof's main code lives in gprof/gprof.c.
In gmon_out_read, the pc histogram, the call graph, and the basic-block execution counts are read from gmon.out via hist_read_rec, cg_read_rec, and bb_read_rec respectively. To map the pc histogram onto per-function time, gprof uses an approximation:
Here bin_low_pc denotes the PC address corresponding to an arbitrary entry i of the sample array, and bin_high_pc the PC address of the next entry:
bin_low_pc  = lowpc + (bfd_vma) (hist_scale * i);
bin_high_pc = lowpc + (bfd_vma) (hist_scale * (i + 1));
sym_low_pc denotes the PC address of some symbol (function name, section name, etc.) in the executable, and sym_high_pc the PC address of the next symbol entry:
sym_low_pc  = symtab.base[j].hist.scaled_addr;
sym_high_pc = symtab.base[j + 1].hist.scaled_addr;
gprof credits to the symbol at sym_low_pc only the samples falling in the overlap of [bin_low_pc, bin_high_pc] and [sym_low_pc, sym_high_pc]:
overlap = MIN (bin_high_pc, sym_high_pc) - MAX (bin_low_pc, sym_low_pc);
credit  = overlap * time / hist_scale;  // time = sample[i], hist_scale = pc_scale
1.5 Summary
Gprof is the profiler bundled with the GNU toolchain: there is no installation cost, and its integration with gcc makes it convenient and quick to pick up. But gprof also has some shortcomings:
1. Its results are not guaranteed to be fully accurate: it cannot account for time the program spends on I/O or swapping:
| The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of the program's run time. Therefore, the time measurements in gprof output say nothing about time that your program was not running. For example, a part of the program that creates so much data that it cannot all fit in physical memory at once may run very slowly due to thrashing, but gprof will say it uses little time. On the other hand, sampling by run time has the advantage that the amount of load due to other users won't directly affect the output you get. |
Moreover, because the profile is computed from collected samples, some statistical distortion is inherent:
| The run-time figures that gprof gives you are based on a sampling process, so they are subject to statistical inaccuracy. If a function runs only a small amount of time, so that on the average the sampling process ought to catch that function in the act only once, there is a pretty good chance it will actually find that function zero times, or twice. By contrast, the number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic. The sampling period that is printed at the beginning of the flat profile says how often samples are taken. The rule of thumb is that a run-time figure is accurate if it is considerably bigger than the sampling period. The actual amount of error is usually more than one sampling period. In fact, if a value is n times the sampling period, the expected error in it is the square-root of n sampling periods. If the sampling period is 0.01 seconds and foo's run-time is 1 second, the expected error in foo's run-time is 0.1 seconds. It is likely to vary this much on the average from one profiling run to the next. (Sometimes it will vary more.) This does not mean that a small run-time figure is devoid of information. If the program's total run-time is large, a small run-time for one function does tell you that that function used an insignificant fraction of the whole program's time. Usually this means it is not worth optimizing. |
2. gprof cannot resolve shared libraries, for the reasons analyzed above.
3. gprof is hard to maintain and extend, because its code is embedded in glibc and binutils. Patching libc is risky, and version maintenance is awkward (different systems ship different libc versions, and upgrading glibc independently can crash programs).
II. Google Performance Tools
Google Performance Tools is a C++ profiling toolkit developed by Google. It includes:
An optimized memory allocator, tcmalloc, which outperforms malloc.
A CPU profiler for finding a program's performance hotspots, similar in function to gprof.
A heap checker for detecting memory leaks, similar in function to valgrind.
A heap profiler for monitoring the program's memory usage as it runs.
Official documentation:
http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools
Its usage is fairly simple: link against the appropriate library at build time, then activate the tool at run time by setting the corresponding environment variable.
1. Use its memory allocator, tcmalloc:
gcc [...] -ltcmalloc
2. Use the heap checker:
gcc [...] -o myprogram -ltcmalloc
HEAPCHECK=normal ./myprogram
3. Use the heap profiler:
gcc [...] -o myprogram -ltcmalloc
HEAPPROFILE=/tmp/netheap ./myprogram
4. Use the CPU profiler:
gcc [...] -o myprogram -lprofiler
CPUPROFILE=/tmp/profile ./myprogram
Its output is also very clear. In the CPU profiler's graphical result, each box represents a function and the arrows between boxes describe the calling relationships. Each box contains two numbers, X of Y: Y is the total time the function consumed during execution and X is the time consumed by the function itself, so Y - X is the time consumed by its callees. If a function has no callees, only the total time is shown. (X and Y are measured in samples; the time represented by one sample is configurable and defaults to 10 ms.)
2.1 Installation
a) Install libunwind
libunwind is a library for unwinding program call stacks. Because glibc's built-in stack unwinding has bugs on 64-bit systems, Google Performance Tools recommends using libunwind instead.
cd $HOME
tar xzvf libunwind-0.99-beta.tar.gz
mkdir libunwind-0.99-beta-build
cd libunwind-0.99-beta
./configure --prefix=$HOME/libunwind-0.99-beta-build
make && make install
b) Install Google Performance Tools
Note: if libunwind cannot be found in the system directories, Google Performance Tools falls back to glibc's built-in unwinder by default, so we point it at libunwind's install directory explicitly.
cd $HOME
tar xzvf google-perftools-1.6.tar.gz
mkdir google-perftools-1.6-build
cd google-perftools-1.6
./configure --prefix=$HOME/google-perftools-1.6-build \
    CPPFLAGS=-I$HOME/libunwind-0.99-beta-build/include \
    LDFLAGS=-L$HOME/libunwind-0.99-beta-build/lib
make && make install
2.2 Usage
這里有兩點想突出介紹下,一個是對動態庫的支持,一個對動態profiler功能的支持。
2.2.1 Shared-library support
在第一章節里面我們已經證明和分析GUNProfiler不提供對動態庫的支持,雖然可以通過修改glibc的代碼來擴展此功能,但是 維護成本較大。而Goolgle performancetools本身就已經提供了對動態庫的支持功能。當然動態庫的使用也分兩種情況:一種是在運行時動態鏈接庫,一種是在運行時動態加 載庫。
Load-time dynamic linking links the program against a shared library and lets Linux load the library at execution time (skipping the load if it is already in memory). A concrete example:
// libtestprofiler.h
extern "C" {
    int loopop();
}
libtestprofiler.cpp defines a single time-consuming function, to keep the analysis simple:
// libtestprofiler.cpp
#include "libtestprofiler.h"

extern "C" {
int loopop()
{
    int n = 0;
    for (int i = 0; i < 1000000; i++)
        for (int j = 0; j < 10000; j++) {
            n |= i % 100 + j / 100;
        }
    return n;
}
}
Compile libtestprofiler.cpp into a shared library:
g++ --shared -fPIC -g -O0 -o libtestprofiler.so libtestprofiler.cpp
Call the shared library from a main program:
#include <iostream>
#include "libtestprofiler.h"
using namespace std;

int main(int argc, char **argv)
{
    cout << "loopop: " << loopop() << endl;
    return 1;
}
Compile the main program and link libtestprofiler.so dynamically:
a) First, build the main program the GNU gprof way:
g++ -g -O0 -o main main.cpp -ltestprofiler -L. -pg
./main
gprof -b ./main gives:
Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00        1     0.00     0.00  global constructors keyed to main
 0.00      0.00      0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
 0.00      0.00      0.00        1     0.00     0.00  data_start
As expected, GNU gprof cannot resolve the hotspot inside the shared library.
b) Now build the main program the Google CPU profiler way:
g++ -g -O0 -o main main.cpp -ltestprofiler -L. -lprofiler -L/home/wul/google-perftools-1.6-build/lib
CPUPROFILE=perf.out ./main
pprof --text ./main ./perf.out gives:
Using local file ./main.
Using local file ./perf.out.
Removing killpg from all stack traces.
Total: 5923 samples
    5923 100.0% 100.0%     5923 100.0% loopop
       0   0.0% 100.0%     5923 100.0% __libc_start_main
       0   0.0% 100.0%     5923 100.0% _start
       0   0.0% 100.0%     5923 100.0% main
This demonstrates that the Google CPU profiler supports profiling dynamically linked libraries.
Run-time loading lets a program call library functions selectively: it loads a specific library on demand (unless already loaded) and then calls a particular function in it. This is a common way to build plugin-based applications.
Using the same library, modify the main program:
#include <stdio.h>
#include <dlfcn.h>

char LIBPATH[] = "./libtestprofiler.so";
typedef int (*op_t) ();

int main(int argc, char **argv)
{
    void *dl_handle;
    op_t loopop;
    char *error;

    /* Open the shared object */
    dl_handle = dlopen( LIBPATH, RTLD_LAZY );
    if (!dl_handle) {
        printf( "dlopen failed! %s\n", dlerror() );
        return 1;
    }

    /* Resolve the symbol (loopop) from the object */
    loopop = (op_t)dlsym( dl_handle, "loopop" );
    error = dlerror();
    if (error != NULL) {
        printf( "dlsym failed! %s\n", error );
        return 1;
    }

    /* Call the resolved loopop and print the result */
    printf( "result: %d\n", (loopop)() );

    /* Close the object */
    dlclose( dl_handle );

    return 0;
}
Compile:
g++ -g -O0 -o main_dl main_dl.cpp -lprofiler -L/home/wul/google-perftools-1.6-build/lib -ldl
CPUPROFILE=perf_dl.out ./main_dl
pprof --text ./main_dl ./perf_dl.out gives:
Using local file ./main_dl.
Using local file ./perf_dl.out.
Removing killpg from all stack traces.
Total: 5949 samples
     843  14.2%  14.2%      843  14.2% 0x00002b2f203d25d6
......
       0   0.0% 100.0%        1   0.0% 0x00002b2f203d25ed
       0   0.0% 100.0%     5949 100.0% __libc_start_main
       0   0.0% 100.0%     5949 100.0% _start
       0   0.0% 100.0%     5949 100.0% main
Strangely, this result shows the symbols in libtestprofiler.so were not resolved, and perf_dl.out contains no memory-map entry for libtestprofiler.so, even though the main program clearly loaded the library with dlopen and ran it successfully. Why is the library missing from the program's memory map? After some analysis and investigation the cause emerged: perf_dl.out is written after the main program finishes, while resources are being reclaimed (see the implementation section below), and by that time the program has already called dlclose() and unloaded libtestprofiler.so, so the memory map dumped afterwards naturally no longer contains it. Here is the result after commenting out dlclose(dl_handle):
Using local file ./main_dl.
Using local file ./perf_dl.out.
Removing killpg from all stack traces.
Total: 5923 samples
    5923 100.0% 100.0%     5923 100.0% loopop
       0   0.0% 100.0%     5923 100.0% __libc_start_main
       0   0.0% 100.0%     5923 100.0% _start
       0   0.0% 100.0%     5923 100.0% main
And the library's symbols resolve correctly once again.
2.2.2 Dynamic profiling
這里首先需要解釋下何謂動態profiler功能:傳統的profiler工具,以GUNProfiler為例,只能編譯階段控制profiler的開關 (-fprofile-arcs-ftest-coverage),但是我們有時候需要在程序的運行階段,或者說運行的中間階段控制profiler的開 關。Googleperformance tools可以通過CPUPROFILE環境變量在程序運行初階段控制cpuprofiler的開關,而且根據文檔/usr/doc/google- perftools-1.5/pprof_remote_servers.html的提示,可以通過功能擴展可以實現在運行中間階段或通過http協議遠 程控制profiler信息的功能。gperftools-httpd項目就已經初步完成了這個功能,我們可以體驗一下。
1. Download and install gperftools-httpd from http://code.google.com/p/gperftools-httpd/.
2. Modify the test program main.cpp so it runs long enough to make testing convenient:
#include <iostream>
#include "gperftools-httpd.h"
#include "libtestprofiler.h"
using namespace std;

int main(int argc, char **argv)
{
    ghttpd();
    while (1)
        cout << "loopop: " << loopop() << endl;
    return 1;
}
Two changes were made: calling ghttpd() starts a lightweight web service that serves pprof's remote requests, and the while loop extends the program's run time so the dynamic profiling feature can be verified.
3. Compile, linking against libghttpd.so and libprofiler.so:
g++ -g -O0 -o main main.cpp -I/home/wul/gperftools-httpd-0.2 -ltestprofiler -L. -L/home/wul/gperftools-httpd-0.2/ -lghttpd -lprofiler -L/home/wul/google-perftools-1.6-build/lib -ldl -lpthread
4. Start the test program:
./main (note that this time CPUPROFILE is not set, meaning CPU profiling starts out disabled)
5. Use pprof to remotely switch on the test program's CPU profiling:
pprof ./main http://localhost:9999/pprof/profile gives:
Using local file ./main.
Gathering CPU profile from http://localhost:9999/pprof/profile?seconds=30 for 30 seconds to
  /home/wul/pprof/main.1292168091.localhost
Be patient...
Wrote profile to /home/wul/pprof/main.1292168091.localhost
Removing _L_mutex_unlock_15 from all stack traces.
Welcome to pprof!  For help, type 'help'.
(pprof) text
Total: 2728 samples
    2728 100.0% 100.0%     2728 100.0% loopop
       0   0.0% 100.0%     2728 100.0% __libc_start_main
       0   0.0% 100.0%     2728 100.0% _start
       0   0.0% 100.0%     2728 100.0% main
As the output shows, when pprof sends a GET /pprof/profile request to the local web service at http://localhost:9999/, the test program automatically enables profiling. The default monitoring window is now ~ now+30s (the duration is adjustable via the seconds parameter). After 30 seconds the test program stops profiling and returns the result to pprof, which saves it to /home/wul/pprof/main.1292168091.localhost; the text command then shows the parsed output. pprof supports further query parameters, such as the sampling frequency and the triggering sample event; see the gperftools-httpd and Google Performance Tools official documentation.
2.3 Implementation
Google Performance Tools has four major components, but this chapter focuses on the CPU profiler, to allow a side-by-side comparison with GNU gprof.
2.3.1 CPU profiler
The Google CPU profiler is implemented differently from gprof, although the two share similar underlying principles. Like gprof, the CPU profiler collects samples from a SIGPROF signal handler; but instead of inserting code at function entries, it records call-stack snapshots, from which it recovers the call graph and call counts. The main implementation is in src/profiler.cc, which defines a CpuProfiler class and declares a static instance of it, so the instance is initialized before main runs.
// Initialize profiling: activated if getenv("CPUPROFILE") exists.
CpuProfiler::CpuProfiler()
    : prof_handler_token_(NULL) {
  // TODO(cgd) Move this code *out* of the CpuProfile constructor into a
  // separate object responsible for initialization. With ProfileHandler there
  // is no need to limit the number of profilers.
  char fname[PATH_MAX];
  if (!GetUniquePathFromEnv("CPUPROFILE", fname)) {
    return;
  }
  // We don't enable profiling if setuid -- it's a security risk
#ifdef HAVE_GETEUID
  if (getuid() != geteuid())
    return;
#endif
  if (!Start(fname, NULL)) {
    RAW_LOG(FATAL, "Can't turn on cpu profiling for '%s': %s\n",
            fname, strerror(errno));
  }
}
The constructor first checks whether the CPUPROFILE environment variable is set; if so, it starts the CPU profiler, otherwise it returns immediately. Let's look at what Start does:
bool CpuProfiler::Start(const char* fname, const ProfilerOptions* options) {
  SpinLockHolder cl(&lock_);

  if (collector_.enabled()) {
    return false;
  }

  ProfileHandlerState prof_handler_state;
  ProfileHandlerGetState(&prof_handler_state);

  ProfileData::Options collector_options;
  collector_options.set_frequency(prof_handler_state.frequency);
  if (!collector_.Start(fname, collector_options)) {
    return false;
  }

  filter_ = NULL;
  if (options != NULL && options->filter_in_thread != NULL) {
    filter_ = options->filter_in_thread;
    filter_arg_ = options->filter_in_thread_arg;
  }

  // Setup handler for SIGPROF interrupts
  EnableHandler();

  return true;
}
Start first calls ProfileHandlerGetState to pick up additional control parameters, CPUPROFILE_REALTIME and CPUPROFILE_FREQUENCY:
CPUPROFILE_FREQUENCY=x    (default: 100)
    How many interrupts/second the cpu-profiler samples.

CPUPROFILE_REALTIME=1     (default: not set)
    If set to any value (including 0 or the empty string), use ITIMER_REAL
    instead of ITIMER_PROF to gather profiles. In general, ITIMER_REAL is not
    as accurate as ITIMER_PROF, and also interacts badly with use of alarm(),
    so prefer ITIMER_PROF unless you have a reason to prefer ITIMER_REAL.
Next, the function calls ProfileData::Start, defined in profiledata.cc, to allocate and initialize memory for the profiling records:
bool ProfileData::Start(const char* fname,
                        const ProfileData::Options& options) {
  if (enabled()) {
    return false;
  }

  // Open output file and initialize various data structures
  int fd = open(fname, O_CREAT | O_WRONLY | O_TRUNC, 0666);
  if (fd < 0) {
    // Can't open outfile for write
    return false;
  }

  start_time_ = time(NULL);
  fname_ = strdup(fname);

  // Reset counters
  num_evicted_ = 0;
  count_ = 0;
  evictions_ = 0;
  total_bytes_ = 0;

  hash_ = new Bucket[kBuckets];
  evict_ = new Slot[kBufferLength];
  memset(hash_, 0, sizeof(hash_[0]) * kBuckets);

  // Record special entries
  evict_[num_evicted_++] = 0;  // count for header
  evict_[num_evicted_++] = 3;  // depth for header
  evict_[num_evicted_++] = 0;  // Version number
  CHECK_NE(0, options.frequency());
  int period = 1000000 / options.frequency();
  evict_[num_evicted_++] = period;  // Period (microseconds)
  evict_[num_evicted_++] = 0;       // Padding

  out_ = fd;

  return true;
}
The Slot array evict_ holds the content eventually written to the profiler's output file (see the documentation of the output-file format). The Bucket array hash_ is a hash table that temporarily stores the program's call-stack records, and num_evicted_ tracks the used length of evict_. These variables will come up repeatedly below. Back in CpuProfiler::Start in profiler.cc, its final step is a call to EnableHandler(), which installs the SIGPROF signal handler.
void CpuProfiler::EnableHandler() {
  RAW_CHECK(prof_handler_token_ == NULL, "SIGPROF handler already registered");
  prof_handler_token_ = ProfileHandlerRegisterCallback(prof_handler, this);
  RAW_CHECK(prof_handler_token_ != NULL, "Failed to set up SIGPROF handler");
}
The function registers a callback, prof_handler, via ProfileHandlerRegisterCallback:
ProfileHandlerToken* ProfileHandler::RegisterCallback(
    ProfileHandlerCallback callback, void* callback_arg) {

  ProfileHandlerToken* token = new ProfileHandlerToken(callback, callback_arg);

  SpinLockHolder cl(&control_lock_);
  DisableHandler();
  {
    SpinLockHolder sl(&signal_lock_);
    callbacks_.push_back(token);
  }
  // Start the timer if timer is shared and this is a first callback.
  if ((callback_count_ == 0) && (timer_sharing_ == TIMERS_SHARED)) {
    StartTimer();
  }
  ++callback_count_;
  EnableHandler();
  return token;
}
It then registers the actual SIGPROF signal handler, SignalHandler, through ProfileHandler::EnableHandler:
void ProfileHandler::EnableHandler() {
  struct sigaction sa;
  sa.sa_sigaction = SignalHandler;
  sa.sa_flags = SA_RESTART | SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  const int signal_number = (timer_type_ == ITIMER_PROF ? SIGPROF : SIGALRM);
  RAW_CHECK(sigaction(signal_number, &sa, NULL) == 0, "sigprof (enable)");
}
At this point the CPU profiler's initialization is essentially complete. In summary, it accomplished two things: allocating and initializing memory, and registering a SIGPROF signal handler with which to collect samples. So the next focus is how the CPU profiler collects those samples. First, the definition of SignalHandler:
void ProfileHandler::SignalHandler(int sig, siginfo_t* sinfo, void* ucontext) {
  int saved_errno = errno;
  RAW_CHECK(instance_ != NULL, "ProfileHandler is not initialized");
  {
    SpinLockHolder sl(&instance_->signal_lock_);
    ++instance_->interrupts_;
    for (CallbackIterator it = instance_->callbacks_.begin();
         it != instance_->callbacks_.end();
         ++it) {
      (*it)->callback(sig, sinfo, ucontext, (*it)->callback_arg);
    }
  }
  errno = saved_errno;
}
As the code shows, besides counting interrupts, SignalHandler iterates over every callback registered in callbacks_. Tracing back through the initialization described above, this is where prof_handler gets invoked:
    // Signal handler that records the pc in the profile-data structure. We do no
    // synchronization here. profile-handler.cc guarantees that at most one
    // instance of prof_handler() will run at a time. All other routines that
    // access the data touched by prof_handler() disable this signal handler before
    // accessing the data and therefore cannot execute concurrently with
    // prof_handler().
    void CpuProfiler::prof_handler(int sig, siginfo_t*, void* signal_ucontext,
                                   void* cpu_profiler) {
      CpuProfiler* instance = static_cast<CpuProfiler*>(cpu_profiler);

      if (instance->filter_ == NULL ||
          (*instance->filter_)(instance->filter_arg_)) {
        void* stack[ProfileData::kMaxStackDepth];

        // The top-most active routine doesn't show up as a normal
        // frame, but as the "pc" value in the signal handler context.
        stack[0] = GetPC(*reinterpret_cast<ucontext_t*>(signal_ucontext));

        // We skip the top two stack trace entries (this function and one
        // signal handler frame) since they are artifacts of profiling and
        // should not be measured. Other profiling related frames may be
        // removed by "pprof" at analysis time. Instead of skipping the top
        // frames, we could skip nothing, but that would increase the
        // profile size unnecessarily.
        int depth = GetStackTraceWithContext(stack + 1, arraysize(stack) - 1,
                                             2, signal_ucontext);
        depth++;  // To account for pc value in stack[0];

        instance->collector_.Add(depth, stack);
      }
    }
As the comments explain, this function's main job is to record the current call stack. As its name suggests, GetPC obtains the current program counter; it does so by exploiting the Linux signal-handling mechanism (see *Advanced Programming in the UNIX Environment* for background). Its implementation lives in getpc.h:
    inline void* GetPC(const ucontext_t& signal_ucontext) {
      return (void*)signal_ucontext.PC_FROM_UCONTEXT;  // defined in config.h
    }
GetStackTraceWithContext performs the most important step of CPU profiling: it ultimately calls into the libunwind library to dump the current function call stack. Its implementation is in stacktrace_libunwind-inl.h:
    int GET_STACK_TRACE_OR_FRAMES {
      void *ip;
      int n = 0;
      unw_cursor_t cursor;
      unw_context_t uc;
    #if IS_STACK_FRAMES
      unw_word_t sp = 0, next_sp = 0;
    #endif

      if (recursive) {
        return 0;
      }
      ++recursive;

      unw_getcontext(&uc);
      int ret = unw_init_local(&cursor, &uc);
      assert(ret >= 0);
      skip_count++;  // Do not include current frame

      while (skip_count--) {
        if (unw_step(&cursor) <= 0) {
          goto out;
        }
    #if IS_STACK_FRAMES
        if (unw_get_reg(&cursor, UNW_REG_SP, &next_sp)) {
          goto out;
        }
    #endif
      }

      while (n < max_depth) {
        if (unw_get_reg(&cursor, UNW_REG_IP, (unw_word_t *) &ip) < 0) {
          break;
        }
    #if IS_STACK_FRAMES
        sizes[n] = 0;
    #endif
        result[n++] = ip;
        if (unw_step(&cursor) <= 0) {
          break;
        }
    #if IS_STACK_FRAMES
        sp = next_sp;
        if (unw_get_reg(&cursor, UNW_REG_SP, &next_sp) < 0) {
          break;
        }
        sizes[n - 1] = next_sp - sp;
    #endif
      }
    out:
      --recursive;
      return n;
    }
This function is somewhat involved. Its job is to unwind the current call stack and save the frame addresses into the stack array; from this information the profiler can count how often each instruction executes and reconstruct the call relationships between functions (see the libunwind documentation for the underlying mechanism). Returning to prof_handler, the final step is to store the captured call stack into the pre-allocated memory, implemented in profiledata.cc:
    void ProfileData::Add(int depth, const void* const* stack) {
      if (!enabled()) {
        return;
      }

      if (depth > kMaxStackDepth) depth = kMaxStackDepth;
      RAW_CHECK(depth > 0, "ProfileData::Add depth <= 0");

      // Make hash-value
      Slot h = 0;
      for (int i = 0; i < depth; i++) {
        Slot slot = reinterpret_cast<Slot>(stack[i]);
        h = (h << 8) | (h >> (8*(sizeof(h)-1)));
        h += (slot * 31) + (slot * 7) + (slot * 3);
      }

      count_++;

      // See if table already has an entry for this trace
      bool done = false;
      Bucket* bucket = &hash_[h % kBuckets];
      for (int a = 0; a < kAssociativity; a++) {
        Entry* e = &bucket->entry[a];
        if (e->depth == depth) {
          bool match = true;
          for (int i = 0; i < depth; i++) {
            if (e->stack[i] != reinterpret_cast<Slot>(stack[i])) {
              match = false;
              break;
            }
          }
          if (match) {
            e->count++;
            done = true;
            break;
          }
        }
      }

      if (!done) {
        // Evict entry with smallest count
        Entry* e = &bucket->entry[0];
        for (int a = 1; a < kAssociativity; a++) {
          if (bucket->entry[a].count < e->count) {
            e = &bucket->entry[a];
          }
        }
        if (e->count > 0) {
          evictions_++;
          Evict(*e);
        }

        // Use the newly evicted entry
        e->depth = depth;
        e->count = 1;
        for (int i = 0; i < depth; i++) {
          e->stack[i] = reinterpret_cast<Slot>(stack[i]);
        }
      }
    }
The function proceeds as follows:
1. Hash all entries of the stack array to obtain a hash value.
2. Look the call stack up in hash_ by that value; if a matching entry is found, increment its execution count.
3. Otherwise, evict the least-executed call stack from the corresponding hash bucket, append all of its frame addresses in order to the evict_ array, and install the new call stack in the bucket.
That completes the main CPU-profiling flow; in essence it loops on a single action: periodically recording the program's current call stack. When the profiled program finishes, the profiler's last task is to write the data accumulated in evict_ to the file named by the CPUPROFILE environment variable (profiledata.cc):
    void ProfileData::Stop() {
      if (!enabled()) {
        return;
      }

      // Move data from hash table to eviction buffer
      for (int b = 0; b < kBuckets; b++) {
        Bucket* bucket = &hash_[b];
        for (int a = 0; a < kAssociativity; a++) {
          if (bucket->entry[a].count > 0) {
            Evict(bucket->entry[a]);
          }
        }
      }

      if (num_evicted_ + 3 > kBufferLength) {
        // Ensure there is enough room for end of data marker
        FlushEvicted();
      }

      // Write end of data marker
      evict_[num_evicted_++] = 0;  // count
      evict_[num_evicted_++] = 1;  // depth
      evict_[num_evicted_++] = 0;  // end of data marker
      FlushEvicted();

      // Dump "/proc/self/maps" so we get list of mapped shared libraries
      DumpProcSelfMaps(out_);

      Reset();
      fprintf(stderr, "PROFILE: interrupts/evictions/bytes = %d/%d/%" PRIuS "\n",
              count_, evictions_, total_bytes_);
    }
After dumping the evict_ data, the function appends the contents of /proc/self/maps to the output file via DumpProcSelfMaps. This records the process's memory mappings and is the key input pprof needs to resolve instruction addresses to symbols. (For background on /proc/self/maps, see the book 《程序員的自我修養》.)
Although the profiled program has stopped, the CPU profiler's work is not quite done: the data saved in the $CPUPROFILE file is binary and unreadable, and its real content is only revealed with the help of the pprof parsing tool.
pprof is a parsing tool written in Perl. Its main job is to convert the CPU profiler's output into easy-to-read visual formats such as text, pdf, or gif. The next paragraphs outline how pprof works; consult the pprof source for the details.
The profile file holds two kinds of information: the first part is the periodically dumped call-stack records, each containing an execution count, a stack depth, and the stack's instruction addresses; the second part is the program's memory map. pprof therefore works in three steps. First, using the memory map and the program's symbol table, it translates the recorded instruction addresses into readable code locations. Second, it reconstructs the program's function call graph from the stack records. Finally, it derives each code region's execution count from the stack counts, estimates its running time from the timer frequency, and thereby pinpoints the program's performance hot spots.
2.4 Summary
Google performance tools achieves profiling through a principle similar to GNU gprof's, by different means. Because it infers execution counts by recording call stacks, omissions and miscounts are unavoidable; and like gprof, it estimates running time from the sampling frequency, so the final figures are not fully precise and carry some error. Still, compared with other profilers it has distinct strengths: it is a pure user-space tool that needs no kernel support (unlike oprofile); it is less intrusive to the profiled program (unlike gprof), requiring no source changes and tracking execution by attaching to the program; and as one of Google's open-source projects with a modest code base, it lends itself to later extension and secondary development.
3. Feature Comparison of C++ Profiler Tools
Summarizing the findings of the previous two chapters, here is a brief comparison of the commonly used C++ profilers, focused on the issues people run into or care about most in daily use. Due to time constraints, the tools and criteria compared are quite limited; I hope to flesh them out in later work.
GNU gprof:
  Accuracy: fairly high. Function call counts are 100% exact, but function run times are estimated from the sampling frequency, so some deviation exists.
  Shared-library support: no.
  Runtime control: decided at compile time; not flexible.
  Secondary development and maintenance: the code is integrated into glibc, so modifications have a wide impact and are hard to release.

Google performance tools:
  Accuracy: moderate. Both call counts and run times are estimated by sampling, so some deviation and omission exist.
  Shared-library support: yes.
  Runtime control: controlled at run time; more convenient to operate.
  Secondary development and maintenance: a standalone third-party open-source library; low cost to extend and maintain.

Oprofile:
  Accuracy: to be investigated.
  Shared-library support: to be investigated.
  Runtime control: to be investigated.
  Secondary development and maintenance: to be investigated.
To be continued...