『Source Reading』 Warp CTC Source Code Walkthrough


Simplified derivation (Zhihu article): https://zhuanlan.zhihu.com/p/43534801

Full derivation: https://blog.csdn.net/JackyTintin/article/details/79425866

Source code: https://github.com/baidu-research/warp-ctc

A handy label-sparsification API that the official docs do not explain, covered in a separate blog post: tf.keras.backend.ctc_label_dense_to_sparse

Warp CTC is Baidu Research's efficient CTC implementation. It was benchmarked against the ancient TensorFlow 1.4, so whether it still beats the current TensorFlow implementation is unclear; I plan to walk through the newer TensorFlow CTC implementation in a later post.

tensorflow_binding\src\warpctc_op.cc:

WarpCTCOpBase::Compute
  get_workspace_size: query the required workspace size (the full call sequence is sketched below)
  compute_ctc_loss(activations_t.data(),           // [max_time, batch_size, num_classes_raw]
                   grads_t.data(),                 // [max_time, batch_size, num_classes_raw]
                   flat_labels_t.data(),           // [batch_size, max_time]
                   label_lengths_t.data(),         // [batch_size]
                   input_lengths_t.data(),         // [batch_size]
                   alphabet_size, batch_size,      // int, int
                   costs_t.data(),                 // [batch_size]
                   workspace_t.data(), options);   // workspace of the size requested above (host or device memory)
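
The op therefore follows the usual warp-ctc calling convention: ask get_workspace_size how much scratch memory this batch needs, allocate it, then call compute_ctc_loss. Below is a minimal stand-alone sketch of that sequence on the CPU path; the ctcOptions fields and status codes follow the public ctc.h header, while the wrapper function name and the buffer handling are only illustrative.

#include <cassert>
#include <cstddef>
#include <vector>
#include <ctc.h>   // warp-ctc public header

// Illustrative wrapper (not part of the library): compute CTC costs and, if
// `gradients` is non-null, gradients for one minibatch on the CPU.
void ctc_loss_cpu(const float* activations,     // [max_time, batch, alphabet], layout as above
                  const int* flat_labels,       // all labels of the batch, concatenated
                  const int* label_lengths,     // [batch]
                  const int* input_lengths,     // [batch]
                  int alphabet_size, int minibatch,
                  float* costs,                 // [batch], output
                  float* gradients) {           // same shape as activations, or nullptr
    ctcOptions options{};
    options.loc = CTC_CPU;
    options.num_threads = 1;      // OpenMP threads used by the CPU kernels
    options.blank_label = 0;      // index reserved for the blank symbol

    // 1. Ask the library how much scratch memory this batch needs.
    size_t workspace_bytes = 0;
    ctcStatus_t status = get_workspace_size(label_lengths, input_lengths,
                                            alphabet_size, minibatch,
                                            options, &workspace_bytes);
    assert(status == CTC_STATUS_SUCCESS);

    // 2. Allocate the workspace and run the actual loss/gradient computation.
    std::vector<char> workspace(workspace_bytes);
    status = compute_ctc_loss(activations, gradients,
                              flat_labels, label_lengths, input_lengths,
                              alphabet_size, minibatch,
                              costs, workspace.data(), options);
    assert(status == CTC_STATUS_SUCCESS);
}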

/** Compute the connectionist temporal classification loss between a sequence
 *  of probabilities and a ground truth labeling.  Optionally compute the
 *  gradient with respect to the inputs.
 * \param [in] activations pointer to the activations in either CPU or GPU
 *             addressable memory, depending on info.  We assume a fixed
 *             memory layout for this 3 dimensional tensor, which has dimension
 *             (t, n, p), where t is the time index, n is the minibatch index,
 *             and p indexes over probabilities of each symbol in the alphabet.
 *             The memory layout is (t, n, p) in C order (slowest to fastest changing
 *             index, aka row-major), or (p, n, t) in Fortran order (fastest to slowest
 *             changing index, aka column-major). We also assume strides are equal to
 *             dimensions - there is no padding between dimensions.
 *             More precisely, element (t, n, p), for a problem with mini_batch examples
 *             in the mini batch, and alphabet_size symbols in the alphabet, is located at:
 *             activations[(t * mini_batch + n) * alphabet_size + p]
 * \param [out] gradients if not NULL, then gradients are computed.  Should be
 *              allocated in the same memory space as probs and memory
 *              ordering is identical.
 * \param [in]  flat_labels Always in CPU memory.  A concatenation
 *              of all the labels for the minibatch.
 * \param [in]  label_lengths Always in CPU memory. The length of each label
 *              for each example in the minibatch.
 * \param [in]  input_lengths Always in CPU memory.  The number of time steps
 *              for each sequence in the minibatch.
 * \param [in]  alphabet_size The number of possible output symbols.  There
 *              should be this many probabilities for each time step.
 * \param [in]  mini_batch How many examples in a minibatch.
 * \param [out] costs Always in CPU memory.  The cost of each example in the
 *              minibatch.
 * \param [in,out] workspace In same memory space as probs. Should be of
 *                 size requested by get_workspace_size.
 * \param [in]  options see struct ctcOptions
 *
 *  \return Status information
 *
 * */
API_REFERENCE ctcStatus_t compute_ctc_loss(const float* const activations,
                             float* gradients,
                             const int* const flat_labels,
                             const int* const label_lengths,
                             const int* const input_lengths,
                             int alphabet_size,
                             int minibatch,
                             float *costs,
                             void *workspace,
                             ctcOptions options);

Note:

element (t, n, p), for a problem with mini_batch examples  in the mini batch, and alphabet_size symbols in the alphabet, is located at:  activations[(t * mini_batch + n) * alphabet_size + p]
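
A tiny helper makes that layout explicit (illustrative only; the helper name is not part of the library):

#include <cstddef>

// Index into the contiguous (t, n, p) activations tensor in C order, exactly as
// described above: t = time step, n = example in the minibatch, p = symbol index.
inline std::size_t act_index(int t, int n, int p, int mini_batch, int alphabet_size) {
    return (static_cast<std::size_t>(t) * mini_batch + n) * alphabet_size + p;
}

// usage: activations[act_index(t, n, p, mini_batch, alphabet_size)]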

src\ctc_entrypoint.cpp:

get_workspace_size
  compute_ctc_loss: decides whether the CPU or the GPU implementation is invoked (a simplified sketch of the dispatch follows below)
  cost_and_grad: compute costs and gradients
    arguments: activations, gradients, costs, flat_labels, label_lengths, input_lengths
  score_forward: forward pass only, no gradients
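
A simplified sketch of that dispatch, assuming the internal headers detail/cpu_ctc.h and detail/gpu_ctc.h and the ctcOptions layout from ctc.h; the real function validates every argument first, and this is not the verbatim library code:

ctcStatus_t compute_ctc_loss_sketch(const float* const activations,
                                    float* gradients,
                                    const int* const flat_labels,
                                    const int* const label_lengths,
                                    const int* const input_lengths,
                                    int alphabet_size, int minibatch,
                                    float* costs, void* workspace,
                                    ctcOptions options) {
    if (activations == nullptr || flat_labels == nullptr || costs == nullptr ||
        workspace == nullptr || alphabet_size <= 0 || minibatch <= 0)
        return CTC_STATUS_INVALID_VALUE;

    if (options.loc == CTC_CPU) {
        // CPU path: OpenMP-parallel implementation in include/detail/cpu_ctc.h
        CpuCTC<float> ctc(alphabet_size, minibatch, workspace,
                          options.num_threads, options.blank_label);
        return (gradients != nullptr)
            ? ctc.cost_and_grad(activations, gradients, costs,
                                flat_labels, label_lengths, input_lengths)
            : ctc.score_forward(activations, costs,
                                flat_labels, label_lengths, input_lengths);
    } else if (options.loc == CTC_GPU) {
        // GPU path: CUDA implementation in include/detail/gpu_ctc.h
        GpuCTC<float> ctc(alphabet_size, minibatch, workspace,
                          options.stream, options.blank_label);
        return (gradients != nullptr)
            ? ctc.cost_and_grad(activations, gradients, costs,
                                flat_labels, label_lengths, input_lengths)
            : ctc.score_forward(activations, costs,
                                flat_labels, label_lengths, input_lengths);
    }
    return CTC_STATUS_INVALID_VALUE;
}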

include\detail\cpu_ctc.h:

https://zhuanlan.zhihu.com/p/23293860

CpuCTC<ProbT>::cost_and_grad
  loops over the batch in parallel with OpenMP
    // const int T = input_lengths[mb];
    // const int L = label_lengths[mb];
    // const int S = 2*L + 1; // Number of labels with blanks
    CpuCTC<ProbT>::cost_and_grad_kernel
      CpuCTC<ProbT>::CpuCTC_metadata
        CpuCTC<ProbT>::CpuCTC_metadata::setup_labels
      if (L + ctcm.repeats > T)  return 0;  // the label length plus the number of repeats must not exceed the valid input length, because a blank has to be inserted between repeated labels (see the small example after this outline)
      CpuCTC<ProbT>::compute_alphas(probs, 
                        ctcm.repeats, 
                        S, 
                        T, 
                        ctcm.e_inc,
                        ctcm.s_inc, 
                        ctcm.labels_w_blanks,
                        ctcm.alphas);
      CpuCTC<ProbT>::compute_betas_and_grad
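
The L + ctcm.repeats > T early-out above encodes the minimum number of frames a label can be squeezed into: one frame per symbol plus one forced blank frame between every pair of adjacent identical symbols. A small illustrative check (not library code):

#include <cstddef>
#include <cstdio>
#include <vector>

// Shortest possible alignment of a label: one frame per symbol plus one blank
// frame for every adjacent repeat, since CTC cannot emit the same symbol twice
// in a row without a blank in between.
int min_frames_needed(const std::vector<int>& label) {
    int repeats = 0;
    for (std::size_t i = 1; i < label.size(); ++i)
        if (label[i] == label[i - 1]) ++repeats;        // adjacent duplicates force a blank
    return static_cast<int>(label.size()) + repeats;    // L + repeats
}

int main() {
    const std::vector<int> satte = {'s', 'a', 't', 't', 'e'};   // one repeated 't'
    std::printf("min frames = %d\n", min_frames_needed(satte)); // prints 6
    return 0;
}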

1. Forward pass on the CPU

probs holds the probabilities output by the network; labels maps each position in alphas to the probability of the corresponding symbol at the matching time step in probs. The routine simply fills in the alphas table and then computes the log-likelihood from the alphas of the last time step. The logical shapes of the three main arrays are:

S = 2 * label_length + 1

T is the sequence length, i.e. the number of time steps

alphabet is the vocabulary size (in fact vocabulary + 1, since it includes the blank)

// Computes forward probabilities
template<typename ProbT>
ProbT CpuCTC<ProbT>::compute_alphas(const ProbT* probs, int repeats, int S, int T,
                                    const int* const e_inc,
                                    const int* const s_inc,
                                    const int* const labels,
                                    ProbT* alphas) {
    int start =  (((S /2) + repeats - T) < 0) ? 0 : 1,
            end = S > 1 ? 2 : 1;

    for (int i = start; i < end; ++i) {
        alphas[i] = std::log(probs[labels[i]]);
    }

    for(int t = 1; t < T; ++t) {
        int remain = (S / 2) + repeats - (T - t);
        if(remain >= 0)
            start += s_inc[remain];
        if(t <= (S / 2) + repeats)
            end += e_inc[t - 1];
        int startloop = start;
        int idx1 = t * S, idx2 = (t - 1) * S, idx3 = t * (alphabet_size_ * minibatch_);

        if (start == 0) {
            alphas[idx1] = alphas[idx2] + std::log(probs[blank_label_ + idx3]);
            startloop += 1;
        }

        for(int i = startloop; i < end; ++i) {
            ProbT prev_sum = ctc_helper::log_plus<ProbT>()(alphas[i + idx2], alphas[(i-1) + idx2]);

            // Skip two if not on blank and not on repeat.
            if (labels[i] != blank_label_ && i != 1 && labels[i] != labels[i-2])
                prev_sum = ctc_helper::log_plus<ProbT>()(prev_sum, alphas[(i-2) + idx2]);

            alphas[i + idx1] = prev_sum + std::log(probs[labels[i] + idx3]);
        }
    }

    ProbT loglike = ctc_helper::neg_inf<ProbT>();
    for(int i = start; i < end; ++i) {
        loglike = ctc_helper::log_plus<ProbT>()(loglike, alphas[i + (T - 1) * S]);
    }

    return loglike;
}
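
For reference, the inner loop is the standard CTC forward recursion evaluated in log space. Writing l' for the blank-extended label of length S = 2L + 1 and y_k^t for probs at time t, the update implemented above is:

\alpha_t(s) =
\begin{cases}
\bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, y^{t}_{l'_s}
  & \text{if } l'_s = \mathrm{blank} \text{ or } l'_s = l'_{s-2},\\
\bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, y^{t}_{l'_s}
  & \text{otherwise.}
\end{cases}

Each log_plus call is therefore an addition of probabilities, and the trailing + std::log(probs[...]) is the multiplication by y_{l'_s}^t.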

The function returns the log of the sum (via log_plus, i.e. a sum in probability space, not a product) of the alphas at the final positions of the last time step; this is the log-likelihood (loglike) of the whole labeling.

Among the other quantities: remain is the position index of the required symbol that must have been reached by time t at the latest; start is the position of the slowest feasible path at the current time step and end is the position of the fastest one, and together they bound the window of all positions reachable at this time step.

// remain = label length + required blank separators - remaining time steps
int remain = (S / 2) + repeats - (T - t);

if(remain >= 0) start += s_inc[remain]; // indexing s_inc by remain means that remain required symbols must already be finished at this step

if(t <= (S / 2) + repeats) end += e_inc[t - 1]; // e_inc: at t = 0 one symbol is already reached, so at step t-1 the fastest path has reached t symbols and this step advances to symbol t+1

s_inc and e_inc store, for each position, the step size needed to reach the next required symbol (one that must be emitted, not an optional one; this includes the real symbols and the blanks that must sit between repeated symbols). start begins at the first blank: s_inc[0] = 1 is the distance from that blank to the first symbol, and s_inc[1] = 2 is the required distance from the first symbol to the second. end begins at the first symbol: e_inc[0] = 2 is the required distance from the first symbol to the second. When a symbol is repeated (say, once), both sequences take +2 to reach the first occurrence, then +1 to the mandatory blank, then +1 to the second occurrence. With these rules the two step sequences can be built directly from the unexpanded label. start terminates at the last real symbol; end terminates at the final blank. The table below uses the label "satte" as an example.

extended label:  -   s   -   a   -   t   -   t   -   e   -
s_inc:               1       2       2   1   1       2
e_inc:                       2       2   1   1       2       1

Each s_inc / e_inc value is written under the position that the corresponding increment lands on: s_inc[0] lands on 's' and the last s_inc entry lands on 'e'; e_inc[0] lands on 'a' and the last e_inc entry lands on the final blank.

Because start tracks the slowest path, it only has to have reached the required symbol computed from remain at the current time step; end tracks the fastest path, which at every time step is the next required symbol after the previous step's fastest position, so it advances to the next required symbol at each step.
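
To make these rules concrete, here is a small stand-alone reconstruction for the "satte" example (illustrative only; in the library this logic lives in CpuCTC_metadata::setup_labels, which also builds labels_w_blanks):

#include <cstdio>
#include <vector>

// Build s_inc / e_inc for an unexpanded label, following the rules above:
// a transition to a distinct symbol costs 2 (skip over the optional blank),
// a repeated symbol costs 1 + 1 (step to the mandatory blank, then to the
// second occurrence). Illustrative reconstruction, not the library's code.
void build_increments(const std::vector<int>& label,
                      std::vector<int>& s_inc, std::vector<int>& e_inc,
                      int& repeats) {
    repeats = 0;
    s_inc.push_back(1);                        // first blank -> first symbol
    for (std::size_t i = 1; i < label.size(); ++i) {
        if (label[i] == label[i - 1]) {        // repeated symbol: go through the blank
            s_inc.push_back(1); s_inc.push_back(1);
            e_inc.push_back(1); e_inc.push_back(1);
            ++repeats;
        } else {                               // distinct symbol: skip the optional blank
            s_inc.push_back(2);
            e_inc.push_back(2);
        }
    }
    e_inc.push_back(1);                        // last symbol -> final blank
}

int main() {
    const std::vector<int> satte = {'s', 'a', 't', 't', 'e'};
    std::vector<int> s_inc, e_inc;
    int repeats = 0;
    build_increments(satte, s_inc, e_inc, repeats);

    std::printf("repeats = %d\ns_inc =", repeats);        // repeats = 1
    for (int v : s_inc) std::printf(" %d", v);            // 1 2 2 1 1 2
    std::printf("\ne_inc =");
    for (int v : e_inc) std::printf(" %d", v);            // 2 2 1 1 2 1
    std::printf("\n");
    return 0;
}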

2. Backward pass on the CPU

It consists of the following steps (the formulas they implement are written out after this list):

compute beta

accumulate alpha * beta into the alphas matrix

for every time step, compute output (sum the probabilities of positions that carry the same symbol at that time step; e.g. in "satte" every time step has two positions for 't')

update the grad matrix
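
In probability space these steps implement the standard backward recursion and the CTC gradient with respect to the unnormalized (pre-softmax) activations u_k^t; the code evaluates both entirely in log space, with p(l|x) = exp(log_partition):

\beta_t(s) =
\begin{cases}
\bigl(\beta_{t+1}(s) + \beta_{t+1}(s+1)\bigr)\, y^{t}_{l'_s}
  & \text{if } l'_s = \mathrm{blank} \text{ or } l'_s = l'_{s+2},\\
\bigl(\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)\bigr)\, y^{t}_{l'_s}
  & \text{otherwise,}
\end{cases}
\qquad
\frac{\partial\,(-\ln p(l\mid x))}{\partial u^{t}_{k}}
  = y^{t}_{k} - \frac{1}{y^{t}_{k}\, p(l\mid x)} \sum_{s:\; l'_s = k} \alpha_t(s)\,\beta_t(s).

output[k] accumulates the log of the sum over positions carrying symbol k, and the grad update probs - exp(output - log(probs) - log_partition) in the code below is exactly the second formula.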

 

// Starting from T, we sweep backward over the alpha array computing one column
// of betas as we go.  At each position we can update product alpha * beta and then
// sum into the gradient associated with each label.
// NOTE computes gradient w.r.t UNNORMALIZED final layer activations.
// Assumed passed in grads are already zeroed!
template<typename ProbT>
ProbT CpuCTC<ProbT>::compute_betas_and_grad(ProbT* grad, const ProbT* const probs,
                                            ProbT log_partition, int repeats,
                                            int S, int T, const int* const e_inc,
                                            const int* const s_inc,
                                            const int* const labels,
                                            ProbT* alphas,
                                            ProbT* betas,
                                            ProbT* output) {
    int start = S > 1 ? (S - 2) : 0,
            end = (T > (S / 2) + repeats) ? S : S-1;

    std::fill(output, output + alphabet_size_, ctc_helper::neg_inf<ProbT>());

    //set the starting values in the beta column at the very right edge
    for (int i = start; i < end; ++i) {
        betas[i] = std::log(probs[labels[i] + (T - 1) * (alphabet_size_ * minibatch_)]);

        //compute alpha * beta in log space at this position in (S, T) space
        alphas[i + (T - 1) * S] += betas[i];

        //update the gradient associated with this label
        //essentially performing a reduce-by-key in a sequential manner
        output[labels[i]] =
                ctc_helper::log_plus<ProbT>()(alphas[i + (T - 1) * S], output[labels[i]]);
    }

    //update the gradient wrt to each unique label
    for (int i = 0; i < alphabet_size_; ++i) {
        int idx3 = (T - 1) * alphabet_size_ * minibatch_ + i;

        if (output[i] == 0.0 || output[i] == ctc_helper::neg_inf<ProbT>() ||
            probs[idx3] == 0.0) {
            grad[idx3] = probs[idx3];
        } else {
            grad[idx3] = probs[idx3] - std::exp(output[i] -
                                                std::log(probs[idx3]) - log_partition);
        }
    }

    //loop from the second to last column all the way to the left
    for(int t = T - 2; t >= 0; --t) {
        int remain = (S / 2) + repeats - (T - t);
        if(remain >= -1)
            start -= s_inc[remain + 1];
        if(t < (S / 2) + repeats)
            end -= e_inc[t];

        int endloop = end == S ? end - 1 : end;
        int idx1 = t * S, idx3 = t * (alphabet_size_ * minibatch_);

        std::fill(output, output + alphabet_size_, ctc_helper::neg_inf<ProbT>());

        for(int i = start; i < endloop; ++i) {
            ProbT next_sum = ctc_helper::log_plus<ProbT>()(betas[i], betas[(i+1)]);
            // Skip two if not on blank and not on repeat.
            if (labels[i] != blank_label_ && i != (S-2) && labels[i] != labels[i+2]){
                next_sum = ctc_helper::log_plus<ProbT>()(next_sum, betas[(i+2)]);
            }
            betas[i] = next_sum + std::log(probs[labels[i] + idx3]);

            //compute alpha * beta in log space
            alphas[i + idx1] += betas[i];

            //update the gradient associated with this label
            output[labels[i]] =
                    ctc_helper::log_plus<ProbT>()(alphas[i + idx1], output[labels[i]]);
        }

        if (end == S) {
            betas[(S-1)] = betas[(S-1)] + std::log(probs[blank_label_ + idx3]);
            alphas[(S-1) + idx1] += betas[(S-1)];

            output[labels[S-1]] =
                    ctc_helper::log_plus<ProbT>()(alphas[S-1 + idx1], output[labels[S-1]]);
        }

        //go over the unique labels and compute the final grad
        // wrt to each one at this time step
        for (int i = 0; i < alphabet_size_; ++i) {

            if (output[i] == 0.0 || output[i] == ctc_helper::neg_inf<ProbT>() ||
                probs[idx3] == 0.0) {
                grad[idx3] = probs[idx3];
            } else {
                grad[idx3] = probs[idx3] - std::exp(output[i] -
                                                    std::log(probs[idx3]) - log_partition);
            }
            ++idx3;
        }
    }

    ProbT loglike = ctc_helper::neg_inf<ProbT>();
    for(int i = start; i < end; ++i) {
        loglike = ctc_helper::log_plus<ProbT>()(loglike, betas[i]);
    }

    return loglike;
}

include\detail\gpu_ctc.h:

GpuCTC<ProbT>::cost_and_grad(const ProbT* const activations,
                             ProbT* grads,
                             ProbT* costs,
                             const int* const flat_labels,
                             const int* const label_lengths,
                             const int* const input_lengths)
    GpuCTC<ProbT>::compute_cost_and_score(const ProbT* const activations,
                                      ProbT* grads,
                                      ProbT* costs,
                                      const int* const flat_labels,
                                      const int* const label_lengths,
                                      const int* const input_lengths,
                                      bool compute_alpha,
                                      bool compute_betas_and_grad)
        GpuCTC<ProbT>::create_metadata_and_choose_config(const int* const flat_labels,
                                                 const int* const label_lengths,
                                                 const int* const input_lengths,
                                                 size_t& best_config)
            GpuCTC<ProbT>::setup_gpu_metadata(const int* const flat_labels,
                                  const int* const label_lengths,
                                  const int* const input_lengths)
                // compute the auxiliary metadata and upload it to the GPU
                // the main contents are the batch's input_lengths, label_lengths and flat_labels,
                // plus each label's offset into flat_labels and each label's repeat count
                // activation_cols_ = minibatch_ * Tmax;
            // based on S_ (the longest effective label length in the batch), pick the smallest valid config among several predefined ones
        GpuCTC<ProbT>::compute_probs(const ProbT* const activations) 
            // copy probs (the network output) to the GPU; each "row" has vocabulary length (physically it is a single flat vector, "row" is just a figure of speech)
            reduce_max: compute the maximum of each row
            prepare_stable_SM_kernel: replace each entry of probs with its difference from the row maximum
            reduce_exp: compute the sum of exponentials of each row (the softmax denominator; the earlier subtraction is for numerical stability)
            compute_probs_kernel: divide each entry by the denominator
            truncate_probs_kernel: clamp to a minimum value to avoid numerical problems (a CPU sketch of the whole pipeline follows below)
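
Together these kernels compute a numerically stable softmax over the alphabet dimension followed by a floor. A CPU analogue of the pipeline for a single row, as an illustration only (the floor value here is arbitrary, and the real kernels run column-parallel on the GPU):

#include <algorithm>
#include <cmath>
#include <vector>

// CPU analogue of the GPU pipeline above for one "row" of length alphabet_size:
// subtract the row maximum, exponentiate, normalize, then clamp with a floor.
void stable_softmax_row(std::vector<float>& row, float min_prob = 1e-12f) {
    const float row_max = *std::max_element(row.begin(), row.end());  // reduce_max
    float denom = 0.0f;
    for (float& v : row) {
        v = std::exp(v - row_max);   // prepare_stable_SM_kernel (+ exp)
        denom += v;                  // reduce_exp: softmax denominator
    }
    for (float& v : row) {
        v /= denom;                  // compute_probs_kernel
        v = std::max(v, min_prob);   // truncate_probs_kernel: avoid log(0) later
    }
}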

1. Forward pass on the GPU: compute_alphas

Description of the relevant parameters:

activation_cols_ = minibatch_ * Tmax;  // batch size * longest sequence length
out_dim_;  // alphabet_size, the vocabulary size
probs_;  // out_dim_ * activation_cols_, i.e. vocabulary * (batch * longest sequence)
label_sizes_;  // size minibatch_, stores label_lengths
utt_length_;  // size minibatch_, stores input_lengths
repeats_;  // device array of size minibatch_, stores the repeat count of each example's label
labels_without_blanks_;  // size total_label_length, stores flat_labels
label_offsets_;  // device array of size minibatch_, stores each example's offset into the flat labels (essentially a cumsum of the label lengths)
labels_with_blanks_;  // size Smax * minibatch_, where Smax = 2 * Lmax + 1 and Lmax is the maximum label length
alphas_;  // size (S_ * T_) * minibatch_, where S_ is twice the maximum effective label length plus one and T_ is the maximum effective sequence length
nll_forward_;  // size minibatch_
stride;  // minibatch_
blank_label_;  // int, the index of the blank symbol
[Note] GPU memory for repeats_ and label_offsets_ is filled in chunks of 64 * sizeof(int) (the last chunk may hold fewer than 64 entries); 64 is the constant cpu_buffer_size.
 
void compute_alpha_kernel (const ProbT* probs, const int *label_sizes,
                           const int *utt_length, const int *repeats_in_labels,
                           const int *labels_without_blanks, const int *label_offsets,
                           int *labels_with_blanks, ProbT *alphas,
                           ProbT* nll_forward, int stride, int out_dim,
                           int S_memoffset, int T_memoffset, int blank_label)
