Reposted from: http://blog.csdn.net/Androidlushangderen/article/details/50282593
Preface
While operating our department's Hadoop cluster recently, I noticed many jobs hitting OOM, and the commands we ran on the machines showed that full GC was quite severe. As we all know, the consequences of full GC are serious: it "stops the world", and once the full GC elapsed time stretches to several minutes, every other activity is paused for that long. So once full GC appears and looks abnormal, you have to find the root cause and fix it. This article describes how we solved this kind of problem and what further optimization we did on top of it.
Where the Full GC Comes From
When OOM happens and leads to frequent full GC, the first question is what is actually triggering it. The first suspect is of course the jobs running on the cluster. A job's runtime behavior comes from its tasks, each task is embodied by its TaskAttempts, and each TaskAttempt runs inside the container it was allocated. Every container is an independent process; if you run jps on a DataNode you will see many processes named "YarnChild", and those are exactly the processes the containers start. Having found the source, the natural guess is that the JVM started for the container was given too little memory, the memory ran out, and simply configuring more memory would fix the problem. In reality it is not that simple; the water here runs a bit deeper.
Why Full GC Happens
1. The first reason full GC happens is the one everyone usually mentions: the memory is configured too small, i.e. the following two settings:
```java
public static final String MAP_MEMORY_MB = "mapreduce.map.memory.mb";
public static final String REDUCE_MEMORY_MB = "mapreduce.reduce.memory.mb";
```
Both default to 1024 MB, i.e. 1 GB.
2. The other reason probably occurs to far fewer people, unless you have actually run into it. In one sentence: a wrong combination of settings. The two settings above do not actually configure the JVM that the container launches; they are the upper limit of memory each task is allowed to use. There is a precondition, though: your JVM must actually be able to use that much memory. If the JVM's maximum heap is only 512 MB, it does not matter how large you set the task memory; the direct consequence is that as soon as memory usage exceeds the heap, full GC kicks in. The two values above are more of a theoretical ceiling. The settings that really control the JVM are these two:
```java
public static final String MAP_JAVA_OPTS = "mapreduce.map.java.opts";
public static final String REDUCE_JAVA_OPTS = "mapreduce.reduce.java.opts";
```
So the ideal configuration sizes the heap in java.opts together with the memory.mb value: if the heap is far smaller than memory.mb, the task can never actually use the memory it was granted, and this kind of misconfiguration also triggers frequent full GC.
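As a concrete illustration (not from the original post), here is a minimal sketch of setting the two pairs of properties together on a job, using the real MapReduce keys shown above; the 2048 MB and -Xmx1638m numbers are made up for the example and follow the common practice of leaving some headroom below memory.mb for off-heap usage:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Container-level memory limit enforced by the NodeManager (MB).
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 2048);

    // JVM heap actually available to the task; keep it sized together with
    // memory.mb so the task can really use the memory it was granted.
    // (The -Xmx values are illustrative, not a recommendation.)
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx1638m");

    Job job = Job.getInstance(conf, "memory-config-example");
    // ... set mapper/reducer/input/output as usual before submitting.
  }
}
```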
Container Memory Monitoring
Fortunately, for the second problem listed above, Hadoop already ships container-level monitoring. For every container it launches, the NodeManager runs an extra thread called the container monitor, dedicated to watching each container's pmem (physical memory) and vmem (virtual memory). The related configuration properties are:
```java
String org.apache.hadoop.yarn.conf.YarnConfiguration.NM_PMEM_CHECK_ENABLED = "yarn.nodemanager.pmem-check-enabled"
String org.apache.hadoop.yarn.conf.YarnConfiguration.NM_VMEM_CHECK_ENABLED = "yarn.nodemanager.vmem-check-enabled"
```
Both are enabled by default. Memory monitoring means that once a container uses more memory than the limit it was allocated, the container is killed. Below is a quick walk through the source; it is really not hard. Start with the class ContainersMonitorImpl.java.
```java
@Override
protected void serviceInit(Configuration conf) throws Exception {
  this.monitoringInterval =
      conf.getLong(YarnConfiguration.NM_CONTAINER_MON_INTERVAL_MS,
          YarnConfiguration.DEFAULT_NM_CONTAINER_MON_INTERVAL_MS);
  ....
  pmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED,
      YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
  vmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED,
      YarnConfiguration.DEFAULT_NM_VMEM_CHECK_ENABLED);
  LOG.info("Physical memory check enabled: " + pmemCheckEnabled);
  LOG.info("Virtual memory check enabled: " + vmemCheckEnabled);
  ....
```
serviceInit reads from the configuration whether the memory checks are enabled and logs the result. Next, go straight into the class's inner MonitoringThread class.
```java
....
private class MonitoringThread extends Thread {
  public MonitoringThread() {
    super("Container Monitor");
  }

  @Override
  public void run() {
    while (true) {
      // Print the processTrees for debugging.
      if (LOG.isDebugEnabled()) {
        StringBuilder tmp = new StringBuilder("[ ");
        for (ProcessTreeInfo p : trackingContainers.values()) {
          tmp.append(p.getPID());
          tmp.append(" ");
        }
        ....
```
In the monitoring thread's run method, it iterates over every tracked container and checks it:
```java
@Override
public void run() {
  while (true) {
    ....
    // Now do the monitoring for the trackingContainers
    // Check memory usage and kill any overflowing containers
    long vmemStillInUsage = 0;
    long pmemStillInUsage = 0;
    for (Iterator<Map.Entry<ContainerId, ProcessTreeInfo>> it =
        trackingContainers.entrySet().iterator(); it.hasNext();) {
      Map.Entry<ContainerId, ProcessTreeInfo> entry = it.next();
      ContainerId containerId = entry.getKey();
      ProcessTreeInfo ptInfo = entry.getValue();
      try {
        String pId = ptInfo.getPID();
        ....
```
Let's take the physical memory check as the example. First, through pTree it obtains the process tree's runtime information, such as memory and CPU usage:
```java
LOG.debug("Constructing ProcessTree for : PID = " + pId
    + " ContainerId = " + containerId);
ResourceCalculatorProcessTree pTree = ptInfo.getProcessTree();
pTree.updateProcessTree();    // update process-tree
long currentVmemUsage = pTree.getVirtualMemorySize();
long currentPmemUsage = pTree.getRssMemorySize();
// if machine has 6 cores and 3 are used,
// cpuUsagePercentPerCore should be 300% and
// cpuUsageTotalCoresPercentage should be 50%
float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore /
    resourceCalculatorPlugin.getNumProcessors();
```
Then it obtains the memory limits:
```java
// Multiply by 1000 to avoid losing data when converting to int
int milliVcoresUsed = (int) (cpuUsageTotalCoresPercentage * 1000
    * maxVCoresAllottedForContainers / nodeCpuPercentageForYARN);
// as processes begin with an age 1, we want to see if there
// are processes more than 1 iteration old.
long curMemUsageOfAgedProcesses = pTree.getVirtualMemorySize(1);
long curRssMemUsageOfAgedProcesses = pTree.getRssMemorySize(1);
long vmemLimit = ptInfo.getVmemLimit();
long pmemLimit = ptInfo.getPmemLimit();
```
This pmemLimit is not information from pTree; it is passed in from outside when monitoring of the container starts, and it corresponds to the memory the container was allocated (i.e. the memory.mb value translated into bytes):
```java
ContainerId containerId = monitoringEvent.getContainerId();
switch (monitoringEvent.getType()) {
case START_MONITORING_CONTAINER:
  ContainerStartMonitoringEvent startEvent =
      (ContainerStartMonitoringEvent) monitoringEvent;
  synchronized (this.containersToBeAdded) {
    ProcessTreeInfo processTreeInfo =
        new ProcessTreeInfo(containerId, null, null,
            startEvent.getVmemLimit(), startEvent.getPmemLimit(),
            startEvent.getCpuVcores());
    this.containersToBeAdded.put(containerId, processTreeInfo);
  }
  break;
```
Then comes the core logic of the memory check:
```java
....
} else if (isPmemCheckEnabled()
    && isProcessTreeOverLimit(containerId.toString(),
        currentPmemUsage, curRssMemUsageOfAgedProcesses,
        pmemLimit)) {
  // Container (the root process) is still alive and overflowing
  // memory.
  // Dump the process-tree and then clean it up.
  msg = formatErrorMessage("physical",
      currentVmemUsage, vmemLimit,
      currentPmemUsage, pmemLimit,
      pId, containerId, pTree);
  isMemoryOverLimit = true;
  containerExitStatus = ContainerExitStatus.KILLED_EXCEEDED_PMEM;
  ....
```
The current usage and the limit are passed to isProcessTreeOverLimit for comparison, which boils down to the following method:
```java
/**
 * Check whether a container's process tree's current memory usage is over
 * limit.
 *
 * When a java process exec's a program, it could momentarily account for
 * double the size of it's memory, because the JVM does a fork()+exec()
 * which at fork time creates a copy of the parent's memory. If the
 * monitoring thread detects the memory used by the container tree at the
 * same instance, it could assume it is over limit and kill the tree, for no
 * fault of the process itself.
 *
 * We counter this problem by employing a heuristic check: - if a process
 * tree exceeds the memory limit by more than twice, it is killed
 * immediately - if a process tree has processes older than the monitoring
 * interval exceeding the memory limit by even 1 time, it is killed. Else it
 * is given the benefit of doubt to lie around for one more iteration.
 *
 * @param containerId
 *          Container Id for the container tree
 * @param currentMemUsage
 *          Memory usage of a container tree
 * @param curMemUsageOfAgedProcesses
 *          Memory usage of processes older than an iteration in a container
 *          tree
 * @param vmemLimit
 *          The limit specified for the container
 * @return true if the memory usage is more than twice the specified limit,
 *         or if processes in the tree, older than this thread's monitoring
 *         interval, exceed the memory limit. False, otherwise.
 */
boolean isProcessTreeOverLimit(String containerId,
                               long currentMemUsage,
                               long curMemUsageOfAgedProcesses,
                               long vmemLimit) {
  boolean isOverLimit = false;
  if (currentMemUsage > (2 * vmemLimit)) {
    LOG.warn("Process tree for container: " + containerId
        + " running over twice " + "the configured limit. Limit=" + vmemLimit
        + ", current usage = " + currentMemUsage);
    isOverLimit = true;
  } else if (curMemUsageOfAgedProcesses > vmemLimit) {
    LOG.warn("Process tree for container: " + containerId
        + " has processes older than 1 "
        + "iteration running over the configured limit. Limit=" + vmemLimit
        + ", current usage = " + curMemUsageOfAgedProcesses);
    isOverLimit = true;
  }
  return isOverLimit;
}
```
Two conditions count as over limit. The first is current usage exceeding twice the limit: a new JVM is launched via fork()+exec(), and at fork time the child momentarily duplicates the parent process's memory accounting, so anything under twice the limit gets the benefit of the doubt for one iteration. The second is the "age" check: if processes that have already survived more than one monitoring interval are over the limit, the tree is flagged. The Javadoc above spells out the rest.
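To make the heuristic concrete, here is a tiny standalone sketch (mine, not the Hadoop method itself) that restates the same two checks with made-up numbers:

```java
public class OverLimitHeuristicDemo {
  // Same two checks as isProcessTreeOverLimit, restated for illustration.
  static boolean overLimit(long currentUsage, long agedUsage, long limit) {
    if (currentUsage > 2 * limit) {
      return true;            // far over the limit: fork()+exec() cannot explain this, kill now
    }
    return agedUsage > limit; // processes older than one iteration are over the limit: kill
  }

  public static void main(String[] args) {
    long limit = 1024L * 1024 * 1024;  // 1 GB limit (made-up)
    System.out.println(overLimit(1_500_000_000L, 0L, limit));             // false: could be a fork() spike
    System.out.println(overLimit(2_500_000_000L, 0L, limit));             // true: more than twice the limit
    System.out.println(overLimit(1_500_000_000L, 1_500_000_000L, limit)); // true: aged processes over limit
  }
}
```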
Finally, if the container's memory usage really is over its limit, a container-kill event is dispatched, which triggers the subsequent killing of the container:
```java
....
} else if (isVcoresCheckEnabled()
    && cpuUsageTotalCoresPercentage > vcoresLimitedRatio) {
  msg =
      String.format(
          "Container [pid=%s,containerID=%s] is running beyond %s vcores limits."
              + " Current usage: %s. Killing container.\n", pId,
          containerId, vcoresLimitedRatio);
  isCpuVcoresOverLimit = true;
  containerExitStatus = ContainerExitStatus.KILLED_EXCEEDED_VCORES;
}

if (isMemoryOverLimit) {
  // Virtual or physical memory over limit. Fail the container and
  // remove
  // the corresponding process tree
  LOG.warn(msg);
  // warn if not a leader
  if (!pTree.checkPidPgrpidForMatch()) {
    LOG.error("Killed container process with PID " + pId
        + " but it is not a process group leader.");
  }
  // kill the container
  eventDispatcher.getEventHandler().handle(
      new ContainerKillEvent(containerId,
          containerExitStatus, msg));
  it.remove();
  LOG.info("Removed ProcessTree with root " + pId);
} else {
  ...
```
That is the whole container memory monitoring flow. As it happened, we had turned this feature off at the time, which is exactly why we ended up with so much full GC.
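For reference, a minimal sketch of what switching the checks back on looks like; normally these keys live in yarn-site.xml on every NodeManager, and the snippet below simply sets the same keys on a YarnConfiguration object (the class and values are illustrative, not from the original post):

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class EnableMemoryChecks {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Both checks default to true; we had turned them off, which let tasks
    // run far past their limits and thrash in full GC instead of being killed.
    conf.setBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED, true);
    conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, true);

    // The vmem check compares against the container memory multiplied by
    // yarn.nodemanager.vmem-pmem-ratio (stock default 2.1).
    conf.setFloat(YarnConfiguration.NM_VMEM_PMEM_RATIO, 2.1f);

    System.out.println("pmem check: "
        + conf.getBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED, true));
  }
}
```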
Why Only Container Memory Is Monitored
Has anyone ever wondered about the question in this heading? After I solved the problem above, I started asking myself: CPU usage is an equally important metric, so why isn't it monitored in ContainersMonitorImpl together with memory? Below are the reasons I came up with.
1. Memory problems cause more damage than CPU problems, because once OOM happens it triggers GC.
2. Memory problems are more common than CPU problems. When writing programs day to day you often run into "oh no, I'm out of memory", and far more rarely into "I'm out of CPU".
3. Memory problems are closely tied to a job's scale and the amount of data it processes. A big job that processes more data naturally uses more memory; its CPU usage also rises, but far less dramatically.
For these three reasons, CPU monitoring was not added to the monitoring code (my personal analysis only).
But the fact that Hadoop itself does not monitor CPU does not mean we cannot add such monitoring ourselves. Some programs use little memory yet burn a great deal of CPU, for example by spawning a large number of threads that each do only trivial work, pushing the machine's CPU usage way up. Starting from that observation, I added monitoring of the CPU usage percentage.
Monitoring Container CPU Usage
First, define a configuration switch for whether the feature is enabled:
```java
/** Specifies whether cpu vcores check is enabled. */
public static final String NM_VCORES_CHECK_ENABLED = NM_PREFIX
    + "vcores-check-enabled";
public static final boolean DEFAULT_NM_VCORES_CHECK_ENABLED = false;
```
Since this is a new feature, it is disabled by default. You also need to define a usage threshold between 0 and 1: once a container's CPU usage ratio exceeds this value, the container is killed.
```java
/** Limit ratio of Virtual CPU Cores which can be allocated for containers. */
public static final String NM_VCORES_LIMITED_RATIO = NM_PREFIX
    + "resource.cpu-vcores.limited.ratio";
public static final float DEFAULT_NM_VCORES_LIMITED_RATIO = 0.8f;
```
The default is 0.8, and you can set it to whatever suits your cluster; a usage sketch follows below. The monitoring logic itself closely mirrors the memory check, so I will go through it quickly.
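Here is that usage sketch, assuming the patch is applied. The full key strings are the constants defined above expanded with the stock NM_PREFIX of "yarn.nodemanager.", and the 0.9 value is just an example:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class EnableVcoresCheck {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Keys added by this patch; they do not exist in stock Hadoop.
    conf.setBoolean("yarn.nodemanager.vcores-check-enabled", true);
    conf.setFloat("yarn.nodemanager.resource.cpu-vcores.limited.ratio", 0.9f);

    System.out.println("vcores check: "
        + conf.getBoolean("yarn.nodemanager.vcores-check-enabled", false));
  }
}
```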
Now for the monitoring changes themselves. Define two more member variables:
```java
private boolean pmemCheckEnabled;
...
private boolean vcoresCheckEnabled;
private float vcoresLimitedRatio;
```
Then initialize them from the configuration in serviceInit:
```java
...
pmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED,
    YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
vmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED,
    YarnConfiguration.DEFAULT_NM_VMEM_CHECK_ENABLED);
vcoresCheckEnabled =
    conf.getBoolean(YarnConfiguration.NM_VCORES_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_VCORES_CHECK_ENABLED);
LOG.info("Physical memory check enabled: " + pmemCheckEnabled);
LOG.info("Virtual memory check enabled: " + vmemCheckEnabled);
LOG.info("Cpu vcores check enabled: " + vcoresCheckEnabled);
if (vcoresCheckEnabled) {
  vcoresLimitedRatio =
      conf.getFloat(YarnConfiguration.NM_VCORES_LIMITED_RATIO,
          YarnConfiguration.DEFAULT_NM_VCORES_LIMITED_RATIO);
  LOG.info("Vcores limited ratio: " + vcoresLimitedRatio);
}
```
Then reuse the CPU percentage variables that the monitoring code already computes:
```java
LOG.debug("Constructing ProcessTree for : PID = " + pId
    + " ContainerId = " + containerId);
ResourceCalculatorProcessTree pTree = ptInfo.getProcessTree();
pTree.updateProcessTree();    // update process-tree
long currentVmemUsage = pTree.getVirtualMemorySize();
long currentPmemUsage = pTree.getRssMemorySize();
// if machine has 6 cores and 3 are used,
// cpuUsagePercentPerCore should be 300% and
// cpuUsageTotalCoresPercentage should be 50%
float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore /
    resourceCalculatorPlugin.getNumProcessors();
```
Finally, compare the value against the threshold:
```java
....
} else if (isVcoresCheckEnabled()
    && cpuUsageTotalCoresPercentage > vcoresLimitedRatio) {
  msg =
      String.format(
          "Container [pid=%s,containerID=%s] is running beyond %s vcores limits."
              + " Current usage: %s. Killing container.\n", pId,
          containerId, vcoresLimitedRatio);
  isCpuVcoresOverLimit = true;
  containerExitStatus = ContainerExitStatus.KILLED_EXCEEDED_VCORES;
}

if (isMemoryOverLimit || isCpuVcoresOverLimit) {
  // Virtual or physical memory over limit. Fail the container and
  // remove
  // the corresponding process tree
  LOG.warn(msg);
  // warn if not a leader
  if (!pTree.checkPidPgrpidForMatch()) {
    LOG.error("Killed container process with PID " + pId
        + " but it is not a process group leader.");
  }
  // kill the container
  eventDispatcher.getEventHandler().handle(
      new ContainerKillEvent(containerId,
          containerExitStatus, msg));
  it.remove();
  LOG.info("Removed ProcessTree with root " + pId);
} else {
```
One more thing: a new exit code also has to be added to ContainerExitStatus:
```java
/**
 * Container terminated because of exceeding allocated cpu vcores.
 */
public static final int KILLED_EXCEEDED_VCORES = -108;
```
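As a hedged sketch (not part of the patch), this is how an ApplicationMaster might surface the new exit code when containers complete; the helper class is hypothetical, but ContainerStatus and its getters are standard YARN API:

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class VcoresKillLogger {
  // Value of the new constant added above; with the patch applied you would
  // reference ContainerExitStatus.KILLED_EXCEEDED_VCORES directly.
  private static final int KILLED_EXCEEDED_VCORES = -108;

  // Typically called from AMRMClientAsync.CallbackHandler#onContainersCompleted.
  public static void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() == KILLED_EXCEEDED_VCORES) {
        System.err.println("Container " + status.getContainerId()
            + " was killed for exceeding its CPU vcores limit: "
            + status.getDiagnostics());
      }
    }
  }
}
```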
That is the full extent of the CPU monitoring changes. The complete code for this feature is at the link at the end of the article. One caveat I want to call out: my local machine does not support ProcfsBasedProcessTree, so the unit test could not run and I have not fully tested this. It should be fine in theory; feel free to try it out and send me feedback. I hope you find it useful.
Related Links
GitHub patch: https://github.com/linyiqun/open-source-patch/tree/master/yarn/others/YARN-VcoresMonitor