[erlang]一次erlcron崩潰引起的事故分析

本文轉載自查看原文 2016-03-24 13:46 1626 erlcron/ erlang

事故背景

由於誤操作在erlcron設置了一個超過3個月后的定時任務。然后第二天之后發現每天的daily reset沒有被執行，一些定時任務也沒有被執行。瞬間感覺整個人都不好了，怎么無端端就不執行了呢。

通過排查日志，發現了以下報錯：

2016-03-22 16:54:32.014 [error] gen_server ecrn_control terminated with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111
2016-03-22 16:54:32.015 [error] CRASH REPORT Process ecrn_control with 0 neighbours exited with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111 in gen_server:terminate/6 line 744

我擦，ecrn_control都崩了，怎么回事。

找到具體出錯的代碼：

internal_cancel(AlarmRef) ->
    case ecrn_reg:get(AlarmRef) of
        undefined ->
            undefined;
        {ok, [Pid]} ->
            ecrn_agent:cancel(Pid)
    end.

發現調用ecrn_reg:get(AlarmRef)被返回了{ok, List}，而且這個List的數據遠不止一個。明顯在設置那個超過3個月的定時任務的時候，ecrn_reg被注冊進了臟數據。

事故重現

先設置幾個正常的定時任務

erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).

查看observer:start() 可以看到進程樹如下：

再設置一個4294968秒之后的定時任務

erlcron:cron({{once, 4294968}, {io, fwrite, ["Hello, world!~n"]}}).

結果就gg了，好多崩潰信息是不是：

22:49:16.818 [error] CRASH REPORT Process <0.5822.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.818 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5822.64> exit with reason timeout_value in context child_terminated
22:49:16.819 [error] CRASH REPORT Process <0.5701.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5701.64> exit with reason timeout_value in context child_terminated
22:49:16.821 [error] CRASH REPORT Process <0.6237.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.6237.64> exit with reason timeout_value in context child_terminated
22:49:16.821 [error] CRASH REPORT Process <0.5862.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5862.64> exit with reason timeout_value in context child_terminated

...(總共有25條)

再看一下進程數：

我擦，為毛原來的 scrn_agent 進程也沒有了。

可以發現，erlcron 在嘗試了25次設置這個定時任務之后，也就是 scrn_agent 崩潰了25次之后，原來設置的三個正常的定時任務的scrn_agent 進程也沒有掉了。也就是說，不但我新設置的定時任務沒有成功，而且我原來正常的定時任務也沒有掉了。

再看一下崩潰日志里面的崩掉的進程號，每一個都是不一樣的。可以推算其實原來的報錯ecrn_reg:get(AlarmRef)獲取到了多個Pid，其實就是這里插入失敗的定時任務產生的25個Pid。也就是說，雖然ecrn_agent進程崩潰了，但是ecrn_reg還是保存了這些Pid。所以在取消這些定時任務的時候，ecrn_reg:get(AlarmRef)返回的內容在internal_cancel(AlarmRef)沒有被匹配到。

為什么是4294968，其實是2^32

為什么設置了4294968秒后的定時任務就崩潰了。這個數估計很多人很熟悉，2^32=4294967296，而4294968000也就是剛好大於2^32。即，如果設置的定時任務超過了2^32毫秒，在erlcron里面就不支持了。

查看gen_server:loop的源碼，找到引起崩潰的代碼：

loop(Parent, Name, State, Mod, hibernate, Debug) ->
    proc_lib:hibernate(?MODULE,wake_hib,[Parent, Name, State, Mod, Debug]);
loop(Parent, Name, State, Mod, Time, Debug) ->
    Msg = receive
          Input ->
            Input
      after Time ->
          timeout
      end,
    decode_msg(Msg, Parent, Name, State, Mod, Time, Debug, false).

可以發現引起崩潰的，358行是一段receive代碼。也就是說receive是不支持超過2^32大小的。

自測了一下，的確如果receive的after后面如果是大於等於2^32的數值就會出現bad receive timeout value的報錯。查看官方解釋，已經明確說明不能大於32位大小。

ExprT is to evaluate to an integer. The highest allowed value is 16#FFFFFFFF, that is, the value must fit in 32 bits. receive..after works exactly as receive, except that if no matching message has arrived within ExprT milliseconds, then BodyT is evaluated instead. The return value of BodyT then becomes the return value of the receive..after expression.

引用自：http://erlang.org/doc/reference_manual/expressions.html

再回到erlcron，在 ecrn_agent:start_link的時候，ecrn_agent:init執行完ecrn_reg:register(JobRef, self())返回{ok, NewState, Millis}到gen_server之后，Millis如果超過2^32在gen_server:loop就會引起gen_server的timeout_value異常退出。

%% @private
init([JobRef, Job]) ->
    State = #state{job=Job,
                   alarm_ref=JobRef},
    {DateTime, Actual} = ecrn_control:datetime(),
    NewState = set_internal_time(State, DateTime, Actual),
    case until_next_milliseconds(NewState, Job) of
        {ok, Millis} when is_integer(Millis) ->
            ecrn_reg:register(JobRef, self()),
            {ok, NewState, Millis};
        {error, _}  ->
            {stop, normal}
    end.

最后

這坑踩的，有點郁悶。其實這跟erlcron也沒關系，也不是gen_server的問題。而是erlang自身receive不支持2^32引起的。繼續往下查其實可以發現，再往下是其它語言寫的了。

-module(prim_eval).

%% This module is simply a stub which abstract code gets included in the result
%% of compilation of prim_eval.S, to keep Dialyzer happy.

-export(['receive'/2]).

-spec 'receive'(fun((term()) -> nomatch | T), timeout()) -> T.
'receive'(_, _) ->
    erlang:nif_error(stub).

與君共勉

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 ThreadLocal引起的一次線上事故記一次go中map並發引起的事故記一次dirty_ratio引起的線上事故由java程序引起的一次系統崩潰記一次蘇寧小店事故分析和總結一次Mysql 死鎖事故 Spring+SpringMVC+MyBatis+easyUI整合進階篇（七）一次線上Mysql數據庫崩潰事故的記錄由一次程序崩潰引起的對new表達式的再次學習一次對象過大引起的gc性能問題的分析與定位原創-K8S INGRESS配置分析並總結一次配置事故