Troubleshooting high CPU usage in the Erlang VM


- Problem background

  Recently, in one group of production services, individual nodes showed very low server CPU usage, only about a quarter of the others. After ruling out uneven business load, we first suspected top was reporting wrong statistics. Following the approach in "Investigating Erlang scheduler utilization", we checked the schedulers with erlang:statistics(scheduler_wall_time) and found that on the low-CPU machines the schedulers' actual utilization was close to 100%, while on the other machines it was under 30%.

  Comparing the different services, the problem appeared only on nodes running a large number of processes.
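The per-scheduler figures below came from erlang:statistics(scheduler_wall_time). As a minimal sketch (standard OTP API; the one-second sampling window is an arbitrary choice), this is how such a sample can be taken in the node's shell:

```erlang
%% scheduler_wall_time accounting is off by default; enable it first.
erlang:system_flag(scheduler_wall_time, true),
Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)),
timer:sleep(1000),
Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)),
%% Each entry is {SchedulerId, ActiveTime, TotalTime}; utilization is the
%% ratio of the active-time delta to the total-time delta between samples.
Diffs = [{I, A1 - A0, T1 - T0} || {{I, A0, T0}, {I, A1, T1}} <- lists:zip(Ts0, Ts1)],
{A, T} = lists:foldl(fun({_, Da, Dt}, {Sa, St}) -> {Sa + Da, St + Dt} end,
                     {0, 0}, Diffs),
[{total, A / T} | [{I, Da / Dt} || {I, Da, Dt} <- Diffs]].
```

The result has the same shape as the listings below: an overall ratio under {total, ...} followed by one entry per scheduler.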

  - High utilization

Cpu0  : 68.2%us, 11.4%sy,  0.0%ni,  3.7%id,  0.0%wa,  0.0%hi, 16.7%si,  0.0%st
Cpu1  : 81.6%us, 13.4%sy,  0.0%ni,  4.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu2  :  1.3%us,  0.3%sy,  0.0%ni, 98.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.3%us,  0.3%sy,  0.0%ni, 98.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.7%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
[{total,0.06939920430400189},
 {1,0.8268947112724254},
 {2,0.8384896040437823},
 {3,8.867169683843468e-6},
 {4,1.0365168328954172e-5},
 {5,1.0024957622820418e-5},
 {6,8.853059601737486e-6},
 {7,8.402152852410522e-6},
 {8,7.63072324998243e-6},
 {9,8.474728373485058e-6},
 {10,1.1576532481056016e-5},
 {11,1.6194115883237974e-5},
 {12,1.44167774196793e-5},
 {13,9.819386220386243e-6},
 {14,7.892097518034394e-6},
 {15,7.163693583884608e-6},
 {16,7.1916733850567694e-6},
 {17,7.148667780983167e-6},
 {18,6.134044075369504e-6},
 {19,8.508809953551505e-6},
 {20,8.418451926460262e-6},
 {21,7.99327959145592e-6},
 {22,1.0466001303723225e-5},
 {23,1.165690052491439e-5},
 {24,1.110477551389582e-5}]

  - Low utilization

Cpu0  : 50.9%us, 13.7%sy,  0.0%ni, 21.4%id,  0.0%wa,  0.0%hi, 14.0%si,  0.0%st
Cpu1  : 70.5%us, 14.1%sy,  0.0%ni, 15.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 68.2%us, 16.1%sy,  0.0%ni, 15.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 68.4%us, 15.1%sy,  0.0%ni, 16.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 69.9%us, 15.4%sy,  0.0%ni, 14.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 67.9%us, 14.1%sy,  0.0%ni, 17.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  : 50.6%us, 13.4%sy,  0.0%ni, 35.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 41.5%us, 10.8%sy,  0.0%ni, 47.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 32.1%us,  9.6%sy,  0.0%ni, 58.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

[{total,0.09717691269949602},
 {1,0.18233794842702225},
 {2,0.29948956042597197},
 {3,0.29957276129564725},
 {4,0.3039782882961829},
 {5,0.29122320708472654},
 {6,0.31739635715470543},
 {7,0.2454373354022171},
 {8,0.26177474519403443},
 {9,0.13033757084128628},
 {10,8.274133360405898e-5},
 {11,4.181167100346997e-5},
 {12,4.0870150064878635e-5},
 {13,4.012856385623552e-5},
 {14,4.402024019534071e-5},
 {15,4.464950668964882e-5},
 {16,4.662729312473385e-5},
 {17,4.765041344339578e-5},
 {18,4.442241285611087e-5},
 {19,4.494246472994659e-5},
 {20,4.1057127449095396e-5},
 {21,4.487741704964992e-5},
 {22,3.939601150277982e-5},
 {23,4.02231871509171e-5},
 {24,3.866736564497342e-5}]

- The WhatsApp case

  Not many Erlang cases like this can be found, but luckily WhatsApp published a detailed analysis of a similar one:

First bottleneck showed up at 425K. System ran into a lot of contention. Work stopped. 
Instrumented the scheduler to measure how much useful work is being done, or sleeping, or spinning. 
Under load it started to hit sleeping locks so 35-45% CPU was being used across the system but the schedulers are at 95% utilization.
      In short: at 425K connections per machine, WhatsApp hit a bottleneck where the system showed only 35~45% CPU usage while the schedulers were at 95% utilization.
 
The post does not go into specifics; on the scheduler side, three tweaks are mentioned:
1. +swt low: "Set the scheduler wake up threshold to low because schedulers would go to sleep and would never wake up."
2. Run BEAM at real-time priority "so that other things like cron jobs don't interrupt schedule. Prevents glitches that would cause backlogs of important user traffic."
3. Disable spinning (patched beam): "Patch to dial down spin counts so the scheduler wouldn't spin", +ssct 1 (via patch; scheduler spin count).
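As a hedged sketch, the first two tweaks map onto a plain command line (the node and application names are placeholders, and this is an assumption based on the quoted notes, not WhatsApp's actual invocation; +ssct only existed in their patched beam):

```shell
# Run BEAM under SCHED_FIFO real-time priority via chrt, and set the
# scheduler wakeup threshold to low with +swt. Node name "chat@host" and
# app "my_app" are hypothetical.
chrt --fifo 10 erl +swt low -name chat@host -s my_app start
```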

 

- Tool-based analysis

  Over Weibo private messages I consulted Zheng Siyao, who recommended profiling with VTune and suspected that with a large number of processes the scheduler itself was consuming too much CPU.

  Register on Intel's official website and a download link arrives by email immediately, with a 30-day trial.

  The download is slow (a VPN helps); VTune's Linux command-line mode is simple to use:

tar -zxf vtune_amplifier_xe_2015.tar.gz
cd vtune_amplifier_xe_2015
./install.sh
cd /opt/intel/vtune_amplifier_xe_2015.1.0.367959/
source amplxe-vars.sh
amplxe-cl -collect lightweight-hotspots -run-pass-thru=--no-altstack -target-pid=1575
amplxe-cl -report hotspots

  This can be run directly against a live node without disturbing the service, and produced the following:

Summary
-------
Elapsed Time:       19.345
CPU Time:           182.023
Average CPU Usage:  9.155
CPI Rate:           1.501

Function                                     Module              CPU Time:Self
-------------------------------------------  ------------------  -------------
sched_spin_wait                              beam.smp                   72.754
raw_local_irq_enable                         vmlinux                    19.282
process_main                                 beam.smp                   10.476
ethr_native_atomic32_read                    beam.smp                    8.337
func@0xffffffff8100af60                      vmlinux                     3.007
__pthread_mutex_lock                         libpthread-2.12.so          2.342
raw_local_irq_restore                        vmlinux                     1.973
__sched_yield                                libc-2.12.so                1.913
pthread_mutex_unlock                         libpthread-2.12.so          1.553
__audit_syscall_exit                         vmlinux                     1.192
system_call                                  vmlinux                     1.156
erts_thr_yield                               beam.smp                    1.114
handle_delayed_dealloc                       beam.smp                    0.977
update                                       beam.smp                    0.828
raw_local_irq_enable                         vmlinux                     0.780

 

As you can see, sched_spin_wait alone accounts for about 40% of the CPU time (72.754 s of 182.023 s). Its source in the BEAM:

#define ERTS_SCHED_SPIN_UNTIL_YIELD 100

static erts_aint32_t
sched_spin_wait(ErtsSchedulerSleepInfo *ssi, int spincount)
{
    int until_yield = ERTS_SCHED_SPIN_UNTIL_YIELD;
    int sc = spincount;
    erts_aint32_t flgs;

    do {
        flgs = erts_smp_atomic32_read_acqb(&ssi->flags);
        if ((flgs & (ERTS_SSI_FLG_SLEEPING|ERTS_SSI_FLG_WAITING))
            != (ERTS_SSI_FLG_SLEEPING|ERTS_SSI_FLG_WAITING)) {
            break;
        }
        ERTS_SPIN_BODY;
        if (--until_yield == 0) {
            until_yield = ERTS_SCHED_SPIN_UNTIL_YIELD;
            erts_thr_yield();
        }
    } while (--sc > 0);
    return flgs;
}

  The default spincount is 10000, and every iteration performs an atomic read. An atomic operation typically costs tens to hundreds of CPU cycles, so a full pass through this busy-wait loop is on the order of a million cycles, a substantial amount of wasted CPU each time a scheduler runs out of work.

  There is also a matching configuration option:

  +sbwt none|very_short|short|medium|long|very_long
  Set scheduler busy wait threshold. Default is medium. The threshold determines how long schedulers should busy wait when running out of work before going to sleep.

     Starting the node with +sbwt none switches spinning off completely, with no need to patch the beam the way WhatsApp did.
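For example (node and application names here are placeholders):

```shell
# +sbwt none: schedulers go to sleep as soon as they run out of work,
# instead of burning cycles in sched_spin_wait.
erl +sbwt none -name chat@host -s my_app start
```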

 

     A strace comparison of system calls on the two machines:
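The summary tables below can be collected with strace's counting mode against the beam.smp process, along these lines (the 10-second window is an arbitrary choice):

```shell
# -c: count time and calls per syscall; -f: follow the scheduler threads;
# the summary table is printed when strace exits after the timeout.
timeout 10 strace -c -f -p "$(pgrep -o beam.smp)"
```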

High-utilization machine:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.07    0.022954           0     49747           writev
 33.15    0.022336           0     58925     11806 recvfrom
 21.04    0.014176           1     14558      2394 futex
  8.37    0.005636           0     24722         4 epoll_ctl
  2.71    0.001828           0      6947           epoll_wait
  0.30    0.000199          10        19           munmap
  0.24    0.000164           0       612           sched_yield
  0.12    0.000078          16         5           mmap
  0.00    0.000000           0         4           close

Low-utilization machine:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 86.38   13.138511         307     42838      7218 futex
  6.91    1.050431          12     90659           writev
  5.70    0.866909          13     69221     25351 recvfrom
  0.54    0.082772           9      9307           sched_yield
  0.43    0.065219           1     50157           epoll_ctl
  0.01    0.002220          34        66           munmap
  0.01    0.001092           1       815           epoll_wait
  0.00    0.000675          20        34           mmap
  0.00    0.000612          19        32        32 connect
  0.00    0.000564           9        64           fcntl
  0.00    0.000529          17        32           getsockname
  0.00    0.000457          14        32           getsockopt
  0.00    0.000341          11        32           socket
  0.00    0.000127          16         8           read
  0.00    0.000109           3        32           bind

  The two strace runs cover different time spans, so normalize by the writev counts: the low-utilization machine makes roughly 1.6x as many futex calls per writev (42838/90659 vs 14558/49747), and futex dominates its time at 86%, indicating much heavier scheduler thread blocking and switching.

 

 

