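AOF Persistence Strategy

To avoid data inconsistency during a background AOF rewrite, Redis maintains an AOF rewrite buffer. Even with no-appendfsync-on-rewrite yes, the main process can still block briefly. The reason is that when the child process finishes the rewrite it signals the main (parent) process, which then runs its handler. The handler's main job is to replace the old AOF file with the new one, and the main process is blocked while it runs.

Introduction

The main difference between AOF persistence and RDB persistence is that the former records changes to the data, while the latter saves the data itself. This article focuses on AOF persistence: how an AOF file is organized and how the mechanism works. Redis implements most AOF operations in aof.c.

AOF persistence also reads and writes files, so it uses the rio data structure as well. rio was covered in the previous chapter and is not discussed further here.

AOF Data Layout

Suppose Redis holds the key-value pair "name:Jhon" in memory. After AOF persistence, the AOF file contains the following: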

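*2     # 2 arguments
$6     # length of the first argument is 6
SELECT # first argument
$1     # length of the second argument is 1
8      # second argument
*3     # 3 arguments
$3     # length of the first argument is 3
SET    # first argument
$4     # length of the second argument is 4
name   # second argument
$4     # length of the third argument is 4
Jhon   # third argument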

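Replaying this content yields two familiar Redis commands: SELECT 8 and SET name Jhon. As you might imagine, Redis walks every key-value pair in the in-memory dataset and writes each one to disk in turn; at startup, Redis reads the AOF file and restores the data from it.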
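The layout above can be sketched as a small encoder. This is only an illustration under stated assumptions: aof_encode_command() is a hypothetical helper, not a Redis function, and it uses strlen(), so unlike the real code it cannot handle arguments containing NUL bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argc arguments in the "*argc / $len / payload" layout and write
 * the result into out (capacity cap). Returns the number of bytes written,
 * or -1 if the buffer is too small. Hypothetical helper, not part of Redis. */
int aof_encode_command(char *out, size_t cap, int argc, const char **argv) {
    size_t used = 0;
    int n;

    assert(out != NULL && argv != NULL && argc > 0);

    /* Header line: number of arguments. */
    n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used += (size_t)n;

    /* One "$len" line plus one payload line per argument. */
    for (int i = 0; i < argc; i++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it with {"SET", "name", "Jhon"} produces the second half of the file shown earlier, without the explanatory # comments, which are not part of a real AOF file.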

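How AOF Persistence Works

Unlike RDB persistence, Redis AOF works in two ways: a background rewrite, and appending while serving.

1) The background rewrite resembles RDB: Redis forks a child process, the main process keeps serving clients, and the child performs the AOF rewrite and dumps the data to disk. Unlike RDB, while the child is working the main process (which is still serving) records every data change made in the meantime and stores it in server.aof_rewrite_buf_blocks. When the child finishes, Redis appends this update buffer to the AOF file. RDB persistence has no equivalent step.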
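The reason the parent has to buffer changes is that the forked child only sees memory as it was at fork() time. The following sketch illustrates that behaviour; snapshot_seen_by_child() is a hypothetical name, and passing the value back through the exit status is just a trick for the demonstration, not something Redis does.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that reports the value of *counter as it was at fork time,
 * while the parent modifies it afterwards. Returns the value seen by the
 * child, or -1 on error. The value must fit in an exit status (0..255). */
int snapshot_seen_by_child(int *counter) {
    assert(counter != NULL && *counter >= 0 && *counter <= 255);

    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        /* Child: works on its own copy-on-write view of memory. */
        _exit(*counter);
    }

    /* Parent: keeps "serving"; this change is invisible to the child. */
    (*counter)++;

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

If the counter starts at 7, the child reports 7 even though the parent ends up with 8. The parent-side increment plays the role of the changes that Redis must remember in server.aof_rewrite_buf_blocks.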

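A few words about the update buffer. When the Redis server changes data, for example with set name Jhon, it not only modifies the in-memory dataset but also records the update, using the data layout described above.

The update buffer can live in server.aof_buf. Think of it as a small temporary staging area: every accumulated update goes here first and is written to the file at specific moments (detailed below). Data is added to server.aof_buf through propagate(), which is called wherever data is modified so that the changes accumulate. The update buffer can also live in server.aof_rewrite_buf_blocks, a linked list whose elements are of type struct aofrwblock. Think of it as a warehouse: while a background AOF child exists, each update is appended to this list in addition to server.aof_buf, and when the AOF child finishes the whole list is written to the file. The two buffers are therefore related.

The intent is to avoid triggering a write every time data changes: writes are buffered in memory and flushed to disk at a suitable moment, which avoids frequent write operations. Of course, it is entirely possible to push every change to disk immediately. Choosing between the two is a trade-off.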
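The buffering idea can be sketched in a few lines. This is a simplified illustration, not the Redis implementation: update_buf is a hypothetical stand-in for server.aof_buf (which is really an sds string), and the flush treats a short write as an error, just as the real code does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical in-memory update buffer (stands in for server.aof_buf). */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} update_buf;

/* Append one record to the buffer, growing it as needed.
 * Returns 0 on success, -1 on allocation failure. */
int update_buf_append(update_buf *b, const char *rec, size_t n) {
    assert(b != NULL && rec != NULL);
    if (b->len + n > b->cap) {
        size_t ncap = b->cap ? b->cap * 2 : 64;
        while (ncap < b->len + n) ncap *= 2;
        char *p = realloc(b->data, ncap);
        if (p == NULL) return -1;
        b->data = p;
        b->cap = ncap;
    }
    memcpy(b->data + b->len, rec, n);
    b->len += n;
    return 0;
}

/* Write everything accumulated so far with a single write(2) call and
 * empty the buffer. Returns bytes written, or -1 on error or short write. */
ssize_t update_buf_flush(update_buf *b, int fd) {
    assert(b != NULL);
    if (b->len == 0) return 0;
    ssize_t nw = write(fd, b->data, b->len);
    if (nw != (ssize_t)b->len) return -1;
    b->len = 0;
    return nw;
}
```

Several calls to update_buf_append() followed by one update_buf_flush() turn many small changes into a single system call, which is the trade-off described above: fewer writes, at the cost of losing the buffered records if the process dies before the flush.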

Here is the main code for the background rewrite:

// Starts a background child process that performs the AOF rewrite. Called from
// bgrewriteaofCommand(), startAppendOnly() and serverCron().
/* This is how rewriting of the append only file in background works:
 *
 * 1) The user calls BGREWRITEAOF
 * 2) Redis calls this function, that forks():
 *    2a) the child rewrite the append only file in a temp file.
 *    2b) the parent accumulates differences in server.aof_rewrite_buf.
 * 3) When the child finished '2a' exits.
 * 4) The parent will trap the exit code, if it's OK, will append the
 *    data accumulated into server.aof_rewrite_buf into the temp file, and
 *    finally will rename(2) the temp file in the actual file name.
 *    Then the new file is reopened as the new append only file. Profit!
 */
int rewriteAppendOnlyFileBackground(void) {
    pid_t childpid;
    long long start;

    // A rewrite child is already running.
    if (server.aof_child_pid != -1) return REDIS_ERR;
    start = ustime();
    if ((childpid = fork()) == 0) {
        char tmpfile[256];

        /* Child */
        // Stop listening on the server sockets.
        closeListeningSockets(0);
        // Set the process title.
        redisSetProcTitle("redis-aof-rewrite");
        // Temporary file name.
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
        // Perform the AOF rewrite.
        if (rewriteAppendOnlyFile(tmpfile) == REDIS_OK) {
            // Private dirty pages: the memory the child consumed
            // through copy-on-write.
            size_t private_dirty = zmalloc_get_private_dirty();

            // Log the amount of dirty memory.
            if (private_dirty) {
                redisLog(REDIS_NOTICE,
                    "AOF rewrite: %zu MB of memory used by copy-on-write",
                    private_dirty/(1024*1024));
            }
            exitFromChild(0);
        } else {
            exitFromChild(1);
        }
    } else {
        /* Parent */
        server.stat_fork_time = ustime()-start;
        if (childpid == -1) {
            redisLog(REDIS_WARNING,
                "Can't rewrite append only file in background: fork: %s",
                strerror(errno));
            return REDIS_ERR;
        }
        redisLog(REDIS_NOTICE,
            "Background append only file rewriting started by pid %d",childpid);
        // The rewrite has started, so clear the scheduled-rewrite flag.
        server.aof_rewrite_scheduled = 0;
        // Start time of the most recent rewrite.
        server.aof_rewrite_time_start = time(NULL);
        // Child process id.
        server.aof_child_pid = childpid;
        updateDictResizePolicy();
        /* We set appendseldb to -1 in order to force the next call to the
         * feedAppendOnlyFile() to issue a SELECT command, so the differences
         * accumulated by the parent into server.aof_rewrite_buf will start
         * with a SELECT statement and it will be safe to merge. */
        server.aof_selected_db = -1;
        replicationScriptCacheFlush();
        return REDIS_OK;
    }
    return REDIS_OK; /* unreached */
}

As shown above, the child performs the AOF rewrite while the parent records some bookkeeping about it. Next, let's look at how the rewrite itself is done.

// Main AOF rewrite routine. Only called from rewriteAppendOnlyFileBackground().
/* Write a sequence of commands able to fully rebuild the dataset into
 * "filename". Used both by REWRITEAOF and BGREWRITEAOF.
 *
 * In order to minimize the number of commands needed in the rewritten
 * log Redis uses variadic commands when possible, such as RPUSH, SADD
 * and ZADD. However at max REDIS_AOF_REWRITE_ITEMS_PER_CMD items per time
 * are inserted using a single command. */
int rewriteAppendOnlyFile(char *filename) {
    dictIterator *di = NULL;
    dictEntry *de;
    rio aof;
    FILE *fp;
    char tmpfile[256];
    int j;
    long long now = mstime();

    /* Note that we have to use a different temp name here compared to the
     * one used by rewriteAppendOnlyFileBackground() function. */
    snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
    // Open the temp file.
    fp = fopen(tmpfile,"w");
    if (!fp) {
        redisLog(REDIS_WARNING, "Opening the temp file for AOF rewrite in "
            "rewriteAppendOnlyFile(): %s", strerror(errno));
        return REDIS_ERR;
    }

    // Initialize the rio structure.
    rioInitWithFile(&aof,fp);
    // If incremental fsync is configured, enable auto-sync on the rio stream.
    if (server.aof_rewrite_incremental_fsync)
        rioSetAutoSync(&aof,REDIS_AOF_AUTOSYNC_BYTES);
    // Dump every database.
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;
        // Get an iterator over the database.
        di = dictGetSafeIterator(d);
        if (!di) {
            fclose(fp);
            return REDIS_ERR;
        }

        /* SELECT the new DB */
        if (rioWrite(&aof,selectcmd,sizeof(selectcmd)-1) == 0) goto werr;
        // Write the database index.
        if (rioWriteBulkLongLong(&aof,j) == 0) goto werr;

        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {
            sds keystr;
            robj key, *o;
            long long expiretime;

            keystr = dictGetKey(de);
            o = dictGetVal(de);
            // Wrap keystr in a robj.
            initStaticStringObject(key,keystr);
            // Get the expire time.
            expiretime = getExpire(db,&key);

            /* If this key is already expired skip it */
            if (expiretime != -1 && expiretime < now) continue;

            /* Save the key and associated value */
            if (o->type == REDIS_STRING) {
                /* Emit a SET command */
                char cmd[]="*3\r\n$3\r\nSET\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                /* Key and value */
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkObject(&aof,o) == 0) goto werr;
            } else if (o->type == REDIS_LIST) {
                if (rewriteListObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_SET) {
                if (rewriteSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_ZSET) {
                if (rewriteSortedSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_HASH) {
                if (rewriteHashObject(&aof,&key,o) == 0) goto werr;
            } else {
                redisPanic("Unknown object type");
            }
            /* Save the expire time */
            if (expiretime != -1) {
                char cmd[]="*3\r\n$9\r\nPEXPIREAT\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkLongLong(&aof,expiretime) == 0) goto werr;
            }
        }
        // Release the iterator.
        dictReleaseIterator(di);
    }

    /* Make sure data will not remain on the OS's output buffers */
    fflush(fp);
    aof_fsync(fileno(fp));
    fclose(fp);

    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        redisLog(REDIS_WARNING,"Error moving temp append only file on the "
            "final destination: %s", strerror(errno));
        unlink(tmpfile);
        return REDIS_ERR;
    }
    redisLog(REDIS_NOTICE,"SYNC append only file rewrite performed");
    return REDIS_OK;

werr:
    // Clean up.
    fclose(fp);
    unlink(tmpfile);
    redisLog(REDIS_WARNING,"Write error writing append only file on disk: "
        "%s", strerror(errno));
    if (di) dictReleaseIterator(di);
    return REDIS_ERR;
}

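As mentioned, once the AOF rewrite finishes, the data changes made during the rewrite are also appended to the AOF file. If you look at the periodic handler serverCron(), you will see that after the child exits the parent appends the changes produced during the rewrite to the AOF file. This is the job of backgroundRewriteDoneHandler(): appending server.aof_rewrite_buf_blocks to the AOF file.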

// Runs after the background child exits. Its main job is to append the
// update buffer server.aof_rewrite_buf_blocks to the new AOF file.
/* A background append only file rewriting (BGREWRITEAOF) terminated its work.
 * Handle this. */
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    ......
    // Write the rewrite buffer server.aof_rewrite_buf_blocks to disk.
    if (aofRewriteBufferWrite(newfd) == -1) {
        redisLog(REDIS_WARNING,
            "Error trying to flush the parent diff to the rewritten AOF: %s",
            strerror(errno));
        close(newfd);
        goto cleanup;
    }
    ......
}

// Writes the accumulated update buffer server.aof_rewrite_buf_blocks to fd.
/* Write the buffer (possibly composed of multiple blocks) into the specified
 * fd. If a short write or any other error happens -1 is returned,
 * otherwise the number of bytes written is returned. */
ssize_t aofRewriteBufferWrite(int fd) {
    listNode *ln;
    listIter li;
    ssize_t count = 0;

    listRewind(server.aof_rewrite_buf_blocks,&li);
    while((ln = listNext(&li))) {
        aofrwblock *block = listNodeValue(ln);
        ssize_t nwritten;

        if (block->used) {
            nwritten = write(fd,block->buf,block->used);
            if (nwritten != block->used) {
                if (nwritten == 0) errno = EIO;
                return -1;
            }
            count += nwritten;
        }
    }
    return count;
}

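2) In the append-while-serving mode, the Redis server stores every data change in server.aof_buf and writes this update buffer to the configured file (server.aof_filename) at specific moments. There are three such moments:

Before entering the event loop
In the server's periodic routine serverCron()
In stopAppendOnly(), which turns AOF off

Redis simply does not want a sudden crash to lose too much data. By default Redis performs this append at a fixed interval, that is, the accumulated changes are written to the file at fixed intervals.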
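How often the appended data is actually forced to disk depends on the fsync policy. The decision can be sketched as a small function; the enum and should_fsync() are hypothetical names modelled on the appendfsync settings (always, everysec, no), and the sketch leaves out details of the real code such as the background fsync thread.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical policy names, modelled on the appendfsync settings. */
typedef enum { FSYNC_NO, FSYNC_EVERYSEC, FSYNC_ALWAYS } fsync_policy;

/* Decide whether a flush at time `now` should be followed by an fsync,
 * given when the last fsync happened. Returns 1 for yes, 0 for no. */
int should_fsync(fsync_policy policy, time_t now, time_t last_fsync) {
    assert(policy == FSYNC_NO || policy == FSYNC_EVERYSEC ||
           policy == FSYNC_ALWAYS);
    switch (policy) {
    case FSYNC_ALWAYS:
        return 1;                    /* sync after every flush */
    case FSYNC_EVERYSEC:
        return now > last_fsync;     /* at most once per second */
    default:
        return 0;                    /* leave it to the operating system */
    }
}
```

With FSYNC_EVERYSEC, two flushes within the same second trigger only one fsync, so at most about a second of writes is at risk if the machine loses power.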

Here is the main code for appending while serving:

// Flush to disk: write all updates accumulated in server.aof_buf to the AOF file.
/* Write the append only file buffer on disk.
 *
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is entering the
 * event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 *
 * About the 'force' argument:
 *
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;

    // Nothing buffered, nothing to flush.
    if (sdslen(server.aof_buf) == 0) return;

    // Check whether a background fsync job is still pending.
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;

    // Without the force flag the write may be postponed.
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                // server.unixtime is the Unix time sampled every cron cycle.
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            redisLog(REDIS_NOTICE,"Asynchronous AOF fsync is taking too long (disk"
                " is busy?). Writing the AOF buffer without waiting for fsync to "
                "complete, this may slow down Redis.");
        }
    }
    /* If you are following this code path, then we are going to write so
     * set reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike */
    // The AOF file is already open; write everything in server.aof_buf to it.
    nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    if (nwritten != (signed)sdslen(server.aof_buf)) {
        /* Ooops, we are in troubles. The best thing to do for now is
         * aborting instead of giving the illusion that everything is
         * working as expected. */
        if (nwritten == -1) {
            redisLog(REDIS_WARNING,"Exiting on error writing to the append-only"
                " file: %s",strerror(errno));
        } else {
            redisLog(REDIS_WARNING,"Exiting on short write while writing to "
                "the append-only file: %s (nwritten=%ld, "
                "expected=%ld)",
                strerror(errno),
                (long)nwritten,
                (long)sdslen(server.aof_buf));
            if (ftruncate(server.aof_fd, server.aof_current_size) == -1) {
                redisLog(REDIS_WARNING, "Could not remove short write "
                    "from the append-only file. Redis may refuse "
                    "to load the AOF the next time it starts. "
                    "ftruncate: %s", strerror(errno));
            }
        }
        exit(1);
    }
    // Update the recorded size of the AOF file.
    server.aof_current_size += nwritten;

    // When server.aof_buf is small enough, reuse its space to avoid frequent
    // allocations. When it has grown large, free it instead: Redis is
    // careful about memory.
    /* Re-use AOF buffer when it is small enough. The maximum comes from the
     * arena size of 4k minus some overhead (but is otherwise arbitrary). */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        aof_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) aof_background_fsync(server.aof_fd);
        server.aof_last_fsync = server.unixtime;
    }
}

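The Update Buffer in Detail

The "update buffer" mentioned twice above is the set of data changes Redis has accumulated.

The update buffer can be stored in server.aof_buf and in the server.aof_rewrite_buf_blocks list. They relate as follows: every change record is written to server.aof_buf, and if a background child is performing a rewrite at the same time, the record is also written to server.aof_rewrite_buf_blocks. server.aof_buf is written to the target file at specific moments, while server.aof_rewrite_buf_blocks is appended to the file once the background rewrite finishes.

In the Redis source this is implemented as: propagate()->feedAppendOnlyFile()->aofRewriteBufferAppend()

Note that feedAppendOnlyFile() appends the update to server.aof_buf. It then checks whether an AOF child exists and, if so, calls aofRewriteBufferAppend() to append the same record to the server.aof_rewrite_buf_blocks list. This explains why the parent appends server.aof_rewrite_buf_blocks to the AOF file after the rewrite child exits.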
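The relationship between the two buffers can be sketched with two fixed-size arrays. This is a deliberately reduced illustration: aof_state and feed_change() are hypothetical, and the real code uses an sds string and a list of blocks rather than arrays.

```c
#include <assert.h>
#include <string.h>

#define BUF_CAP 256

/* Hypothetical fixed-size stand-ins for server.aof_buf and
 * server.aof_rewrite_buf_blocks. */
typedef struct {
    char aof_buf[BUF_CAP];
    size_t aof_len;
    char rewrite_buf[BUF_CAP];
    size_t rewrite_len;
    int rewrite_child_running;  /* stands in for server.aof_child_pid != -1 */
} aof_state;

/* Record one change: always into aof_buf, and additionally into the
 * rewrite buffer while a rewrite child is running. Returns 0 on success,
 * -1 if either buffer would overflow. */
int feed_change(aof_state *s, const char *rec) {
    assert(s != NULL && rec != NULL);
    size_t n = strlen(rec);

    if (s->aof_len + n > BUF_CAP) return -1;
    if (s->rewrite_child_running && s->rewrite_len + n > BUF_CAP) return -1;

    memcpy(s->aof_buf + s->aof_len, rec, n);
    s->aof_len += n;
    if (s->rewrite_child_running) {
        /* The child cannot see this change, so keep a second copy. */
        memcpy(s->rewrite_buf + s->rewrite_len, rec, n);
        s->rewrite_len += n;
    }
    return 0;
}
```

A change fed before the rewrite starts lands only in aof_buf; a change fed while rewrite_child_running is set lands in both, which is exactly the difference the parent later appends to the rewritten file.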

// Publishes a data update to the AOF and to the slaves.
/* Propagate the specified command (in the context of the specified database id)
 * to AOF and Slaves.
 *
 * flags are an xor between:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    // AOF must be enabled and the AOF propagation flag set: record the
    // update for the local file.
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    // The replication flag is set: send the update to the slaves.
    if (flags & REDIS_PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

// Records a data update in the AOF buffers.
void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv,
                        int argc) {
    sds buf = sdsempty();
    robj *tmpargv[3];

    /* The DB this command was targeting is not the same as the last command
     * we appended. To issue a SELECT command is needed. */
    if (dictid != server.aof_selected_db) {
        char seldb[64];

        snprintf(seldb,sizeof(seldb),"%d",dictid);
        buf = sdscatprintf(buf,"*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long)strlen(seldb),seldb);
        server.aof_selected_db = dictid;
    }

    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand ||
        cmd->proc == expireatCommand) {
        /* Translate EXPIRE/PEXPIRE/EXPIREAT into PEXPIREAT */
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else if (cmd->proc == setexCommand || cmd->proc == psetexCommand) {
        /* Translate SETEX/PSETEX to SET and PEXPIREAT */
        tmpargv[0] = createStringObject("SET",3);
        tmpargv[1] = argv[1];
        tmpargv[2] = argv[3];
        buf = catAppendOnlyGenericCommand(buf,3,tmpargv);
        decrRefCount(tmpargv[0]);
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else {
        /* All the other commands don't need translation or need the
         * same translation already operated in the command vector
         * for the replication itself. */
        buf = catAppendOnlyGenericCommand(buf,argc,argv);
    }

    // Append the generated record to server.aof_buf. Its contents are
    // written to disk before the next entry into the event loop.
    /* Append to the AOF buffer. This will be flushed on disk just before
     * of re-entering the event loop, so before the client will get a
     * positive reply about the operation performed. */
    if (server.aof_state == REDIS_AOF_ON)
        server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf));

    // aofRewriteBufferAppend() appends buf to the
    // server.aof_rewrite_buf_blocks list.
    /* If a background append only file rewriting is in progress we want to
     * accumulate the differences between the child DB and the current one
     * in a buffer, so that when the child process will do its work we
     * can append the differences to the new append only file. */
    if (server.aof_child_pid != -1)
        aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf));

    sdsfree(buf);
}

// Writes an update record into server.aof_rewrite_buf_blocks. Only called
// from feedAppendOnlyFile().
/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Start from the tail block of the list.
    listNode *ln = listLast(server.aof_rewrite_buf_blocks);
    aofrwblock *block = ln ? ln->value : NULL;

    while(len) {
        /* If we already got at least an allocated block, try appending
         * at least some piece into it. */
        if (block) {
            unsigned long thislen = (block->free < len) ? block->free : len;
            if (thislen) { /* The current block is not already full. */
                memcpy(block->buf+block->used, s, thislen);
                block->used += thislen;
                block->free -= thislen;
                s += thislen;
                len -= thislen;
            }
        }

        if (len) { /* First block to allocate, or need another block. */
            int numblocks;

            // Create a new block and add it at the tail.
            block = zmalloc(sizeof(*block));
            block->free = AOF_RW_BUF_BLOCK_SIZE;
            block->used = 0;
            listAddNodeTail(server.aof_rewrite_buf_blocks,block);

            /* Log every time we cross more 10 or 100 blocks, respectively
             * as a notice or warning. */
            numblocks = listLength(server.aof_rewrite_buf_blocks);
            if (((numblocks+1) % 10) == 0) {
                int level = ((numblocks+1) % 100) == 0 ? REDIS_WARNING :
                                                         REDIS_NOTICE;
                redisLog(level,"Background AOF buffer size: %lu MB",
                    aofRewriteBufferSize()/(1024*1024));
            }
        }
    }
}

[Figure: how AOF persistence works]

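These two ways of getting data to disk, the background rewrite and appending while serving, are the two main threads of Redis AOF persistence. Grasp them and you understand Redis AOF.

One question arises: both paths write files. The background rewrite writes an AOF file, and appending while serving writes one too. Which one is authoritative?

The background rewrite first writes its data to "temp-rewriteaof-bg-%d.aof", where "%d" is the process id of the AOF child. After the child exits, "temp-rewriteaof-bg-%d.aof" is opened in append mode and the update buffer in server.aof_rewrite_buf_blocks is written to it. Finally the file is renamed to server.aof_filename, so the previous file of that name, that is, the file written by the append-while-serving path, is replaced. From then on the append-while-serving data keeps being written to the server.aof_filename file.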
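The final step described above (append the buffered changes to the temp file, then rename it over the live file) can be sketched as follows. finish_rewrite() is a hypothetical helper using stdio for brevity; the real handler works on file descriptors and also takes care of fsync and of reopening the new file, which this sketch leaves out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append `diff` (the changes buffered during the rewrite) to the temp file
 * produced by the child, then rename it over `final_path`.
 * Returns 0 on success, -1 on error. Hypothetical helper, not Redis code. */
int finish_rewrite(const char *tmp_path, const char *final_path,
                   const char *diff) {
    assert(tmp_path != NULL && final_path != NULL && diff != NULL);

    FILE *fp = fopen(tmp_path, "a");           /* open in append mode */
    if (fp == NULL) return -1;

    size_t n = strlen(diff);
    if (fwrite(diff, 1, n, fp) != n) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0) return -1;

    /* rename(2) swaps the new file in atomically; the old file of that
     * name is replaced. */
    if (rename(tmp_path, final_path) == -1) return -1;
    return 0;
}
```

Because rename() is atomic on POSIX file systems, a reader sees either the complete old file or the complete new one, never a half-written mix.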

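So two files are indeed produced, but in the end both become the server.aof_filename file. Another question may follow: with the background rewrite available, why append while serving at all? Appending while serving keeps recent changes on disk between rewrites, but over time its file accumulates redundant and even stale records, which the background rewrite eliminates. Together they give Redis double insurance.

AOF Recovery

The AOF recovery process is cleverly designed: it simulates Redis serving a client. Redis first creates a fake client and reads the AOF file to recover each command and its arguments. It then executes the command's function exactly as it would for a real client, thereby restoring the data. The point is simply to maximize code reuse. This is implemented mainly in loadAppendOnlyFile().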
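The reading half of that process can be sketched as a small parser for the file layout shown earlier. read_command() is a hypothetical, simplified reader: it uses fixed-size argument slots (MAX_ARGS and MAX_ARG_LEN are arbitrary limits chosen for the example) and only returns the arguments, whereas the real loader builds robj values and executes the command through the fake client.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 8
#define MAX_ARG_LEN 63

/* Read one command in the "*argc / $len / payload" layout from fp into
 * argv (each slot receives a NUL-terminated copy). Returns argc, 0 at a
 * clean end of file, or -1 on a read or format error. */
int read_command(FILE *fp, char argv[MAX_ARGS][MAX_ARG_LEN + 1]) {
    char line[128];
    char crlf[2];

    assert(fp != NULL);
    if (fgets(line, sizeof(line), fp) == NULL) return feof(fp) ? 0 : -1;
    if (line[0] != '*') return -1;              /* must start with '*' */
    int argc = atoi(line + 1);
    if (argc < 1 || argc > MAX_ARGS) return -1;

    for (int j = 0; j < argc; j++) {
        if (fgets(line, sizeof(line), fp) == NULL) return -1;
        if (line[0] != '$') return -1;          /* length line expected */
        long len = strtol(line + 1, NULL, 10);
        if (len < 0 || len > MAX_ARG_LEN) return -1;
        if (len && fread(argv[j], (size_t)len, 1, fp) != 1) return -1;
        argv[j][len] = '\0';
        if (fread(crlf, 2, 1, fp) != 1) return -1;   /* discard CRLF */
    }
    return argc;
}
```

Calling read_command() in a loop until it returns 0 walks the file command by command; a loader would look up argv[0] and run the matching command function at each step.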

// Loads the AOF file and restores the data.
/* Replay the append log file. On success REDIS_OK is returned. On non fatal
 * error (the append only file is zero-length) REDIS_ERR is returned. On
 * fatal error an error message is logged and the program exits. */
int loadAppendOnlyFile(char *filename) {
    struct redisClient *fakeClient;
    FILE *fp = fopen(filename,"r");
    struct redis_stat sb;
    int old_aof_state = server.aof_state;
    long loops = 0;

    // A zero-length file has nothing to load.
    if (fp && redis_fstat(fileno(fp),&sb) != -1 && sb.st_size == 0) {
        server.aof_current_size = 0;
        fclose(fp);
        return REDIS_ERR;
    }

    if (fp == NULL) {
        redisLog(REDIS_WARNING,"Fatal error: can't open the append log file "
            "for reading: %s",strerror(errno));
        exit(1);
    }

    // While loading, temporarily disable all AOF activity to avoid confusion.
    /* Temporarily disable AOF, to prevent EXEC from feeding a MULTI
     * to the same file we're about to read. */
    server.aof_state = REDIS_AOF_OFF;

    // Create a fake client (a redisClient).
    fakeClient = createFakeClient();
    startLoading(fp);

    while(1) {
        int argc, j;
        unsigned long len;
        robj **argv;
        char buf[128];
        sds argsds;
        struct redisCommand *cmd;

        // Every 1000 iterations, process pending file events so the server
        // keeps serving clients while the data is being restored.
        /* Serve the clients from time to time */
        if (!(loops++ % 1000)) {
            loadingProgress(ftello(fp));
            aeProcessEvents(server.el, AE_FILE_EVENTS|AE_DONT_WAIT);
        }

        // We may have reached the end of the AOF file.
        if (fgets(buf,sizeof(buf),fp) == NULL) {
            if (feof(fp))
                break;
            else
                goto readerr;
        }
        // A command must start with '*'; otherwise the format is wrong.
        if (buf[0] != '*') goto fmterr;
        // Number of arguments.
        argc = atoi(buf+1);
        // Invalid argument count.
        if (argc < 1) goto fmterr;

        // Allocate space for the arguments.
        argv = zmalloc(sizeof(robj*)*argc);
        // Read the arguments one by one.
        for (j = 0; j < argc; j++) {
            if (fgets(buf,sizeof(buf),fp) == NULL) goto readerr;
            if (buf[0] != '$') goto fmterr;
            len = strtol(buf+1,NULL,10);
            argsds = sdsnewlen(NULL,len);
            if (len && fread(argsds,len,1,fp) == 0) goto fmterr;
            argv[j] = createObject(REDIS_STRING,argsds);
            if (fread(buf,2,1,fp) == 0) goto fmterr; /* discard CRLF */
        }

        /* Command lookup */
        cmd = lookupCommand(argv[0]->ptr);
        if (!cmd) {
            redisLog(REDIS_WARNING,"Unknown command '%s' reading the "
                "append only file", (char*)argv[0]->ptr);
            exit(1);
        }

        // Execute the command as if serving a client request, which
        // writes the data back into memory.
        /* Run the command in the context of a fake client */
        fakeClient->argc = argc;
        fakeClient->argv = argv;
        cmd->proc(fakeClient);

        /* The fake client should not have a reply */
        redisAssert(fakeClient->bufpos == 0 && listLength(fakeClient->reply)
            == 0);
        /* The fake client should never get blocked */
        redisAssert((fakeClient->flags & REDIS_BLOCKED) == 0);

        // Free the fake client's argument vector.
        /* Clean up. Command code may have changed argv/argc so we use the
         * argv/argc of the client instead of the local variables. */
        for (j = 0; j < fakeClient->argc; j++)
            decrRefCount(fakeClient->argv[j]);
        zfree(fakeClient->argv);
    }

    /* This point can only be reached when EOF is reached without errors.
     * If the client is in the middle of a MULTI/EXEC, log error and quit. */
    if (fakeClient->flags & REDIS_MULTI) goto readerr;

    // Clean up.
    fclose(fp);
    freeFakeClient(fakeClient);
    // Restore the previous AOF state.
    server.aof_state = old_aof_state;
    stopLoading();
    // Record the size of the AOF file just loaded.
    aofUpdateCurrentSize();
    server.aof_rewrite_base_size = server.aof_current_size;
    return REDIS_OK;

readerr:
    // Read error: log it and exit.
    if (feof(fp)) {
        redisLog(REDIS_WARNING,"Unexpected end of file reading the append "
            "only file");
    } else {
        redisLog(REDIS_WARNING,"Unrecoverable error reading the append only "
            "file: %s", strerror(errno));
    }
    exit(1);
fmterr:
    redisLog(REDIS_WARNING,"Bad file format reading the append only file: "
        "make a backup of your AOF file, then use ./redis-check-aof --fix "
        "<filename>");
    exit(1);
}

When to Use AOF

If you care a great deal about your data and every second counts, use AOF persistence. AOF files are also easy to analyze.
