Simplified derivation (Zhihu article): https://zhuanlan.zhihu.com/p/43534801
Full derivation: https://blog.csdn.net/JackyTintin/article/details/79425866
Source code: https://github.com/baidu-research/warp-ctc
Handy label-sparsification API, not covered by the official docs: tf.keras.backend.ctc_label_dense_to_sparse
warp-ctc is Baidu Research's efficient CTC implementation. Its published comparison targets the ancient TensorFlow 1.4, so whether it still leads the current TF implementation is unclear; I will walk through the new TF CTC implementation in a later post.
tensorflow_binding\src\warpctc_op.cc:
WarpCTCOpBase::Compute method:
    get_workspace_size — queries the required workspace size (how much host/device memory to allocate)
    compute_ctc_loss(activations_t.data(),   // [max_time, batch_size, num_classes_raw]
                     grads_t.data(),         // [max_time, batch_size, num_classes_raw]
                     flat_labels_t.data(),   // [batch_size, max_time]
                     label_lengths_t.data(), // [batch_size]
                     input_lengths_t.data(), // [batch_size]
                     alphabet_size,          // int
                     batch_size,             // int
                     costs_t.data(),         // [batch_size]
                     workspace_t.data(),     // buffer of the size returned by get_workspace_size
                     options);
/** Compute the connectionist temporal classification loss between a sequence
 *  of probabilities and a ground truth labeling. Optionally compute the
 *  gradient with respect to the inputs.
 * \param [in]  activations pointer to the activations in either CPU or GPU
 *              addressable memory, depending on info. We assume a fixed
 *              memory layout for this 3 dimensional tensor, which has dimension
 *              (t, n, p), where t is the time index, n is the minibatch index,
 *              and p indexes over probabilities of each symbol in the alphabet.
 *              The memory layout is (t, n, p) in C order (slowest to fastest changing
 *              index, aka row-major), or (p, n, t) in Fortran order (fastest to slowest
 *              changing index, aka column-major). We also assume strides are equal to
 *              dimensions - there is no padding between dimensions.
 *              More precisely, element (t, n, p), for a problem with mini_batch examples
 *              in the mini batch, and alphabet_size symbols in the alphabet, is located at:
 *              activations[(t * mini_batch + n) * alphabet_size + p]
 * \param [out] gradients if not NULL, then gradients are computed. Should be
 *              allocated in the same memory space as probs and memory
 *              ordering is identical.
 * \param [in]  flat_labels Always in CPU memory. A concatenation
 *              of all the labels for the minibatch.
 * \param [in]  label_lengths Always in CPU memory. The length of each label
 *              for each example in the minibatch.
 * \param [in]  input_lengths Always in CPU memory. The number of time steps
 *              for each sequence in the minibatch.
 * \param [in]  alphabet_size The number of possible output symbols. There
 *              should be this many probabilities for each time step.
 * \param [in]  mini_batch How many examples in a minibatch.
 * \param [out] costs Always in CPU memory. The cost of each example in the
 *              minibatch.
 * \param [in,out] workspace In same memory space as probs. Should be of
 *              size requested by get_workspace_size.
 * \param [in]  options see struct ctcOptions
 *
 * \return Status information
 */
API_REFERENCE ctcStatus_t compute_ctc_loss(const float* const activations,
                                           float* gradients,
                                           const int* const flat_labels,
                                           const int* const label_lengths,
                                           const int* const input_lengths,
                                           int alphabet_size,
                                           int minibatch,
                                           float *costs,
                                           void *workspace,
                                           ctcOptions options);
Note:
element (t, n, p), for a problem with mini_batch examples in the mini batch, and alphabet_size symbols in the alphabet, is located at: activations[(t * mini_batch + n) * alphabet_size + p]
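To make that layout concrete, here is a minimal indexing helper (my own illustration, not warp-ctc code):

#include <cstddef>

// Flat offset of element (t, n, p) in the activations buffer described above.
inline std::size_t act_index(int t, int n, int p, int mini_batch, int alphabet_size) {
    return (static_cast<std::size_t>(t) * mini_batch + n) * alphabet_size + p;
}
// usage: float a_tnp = activations[act_index(t, n, p, mini_batch, alphabet_size)];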
src\ctc_entrypoint.cpp:
get_workspace_size function
compute_ctc_loss function — decides whether the CPU or the GPU implementation is called (see the sketch below)
cost_and_grad method — computes the gradients; parameters: activations, gradients, costs, flat_labels, label_lengths, input_lengths
score_forward method — forward pass only
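The dispatch is roughly the following (a condensed paraphrase with validation checks omitted; the constructor signatures are as I recall them from the source, so treat this as a sketch rather than authoritative code):

ctcStatus_t compute_ctc_loss(const float* const activations, float* gradients,
                             const int* const flat_labels, const int* const label_lengths,
                             const int* const input_lengths, int alphabet_size,
                             int minibatch, float* costs, void* workspace,
                             ctcOptions options) {
    if (options.loc == CTC_CPU) {
        CpuCTC<float> ctc(alphabet_size, minibatch, workspace,
                          options.num_threads, options.blank_label);
        return (gradients != NULL)
            ? ctc.cost_and_grad(activations, gradients, costs,
                                flat_labels, label_lengths, input_lengths)
            : ctc.score_forward(activations, costs,
                                flat_labels, label_lengths, input_lengths);
    } else {  // CTC_GPU
        GpuCTC<float> ctc(alphabet_size, minibatch, workspace,
                          options.stream, options.blank_label);
        return (gradients != NULL)
            ? ctc.cost_and_grad(activations, gradients, costs,
                                flat_labels, label_lengths, input_lengths)
            : ctc.score_forward(activations, costs,
                                flat_labels, label_lengths, input_lengths);
    }
}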
include\detail\cpu_ctc.h:
Background reading: https://zhuanlan.zhihu.com/p/23293860
CpuCTC<ProbT>::cost_and_grad — loops over the batch in parallel with OpenMP (see the sketch after this list):
    // const int T = input_lengths[mb];
    // const int L = label_lengths[mb];
    // const int S = 2*L + 1;           // Number of labels with blanks
CpuCTC<ProbT>::cost_and_grad_kernel
CpuCTC<ProbT>::CpuCTC_metadata
CpuCTC<ProbT>::CpuCTC_metadata::setup_labels
    if (L + ctcm.repeats > T) return 0; // the label length plus the repeat count must fit in the valid input length, since a blank must be emitted between repeated labels
CpuCTC<ProbT>::compute_alphas(probs, ctcm.repeats, S, T, ctcm.e_inc, ctcm.s_inc, ctcm.labels_w_blanks, ctcm.alphas);
CpuCTC<ProbT>::compute_betas_and_grad
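A minimal sketch of that OpenMP batch loop (the per-example work is condensed into a hypothetical callback; the real code calls cost_and_grad_kernel with many more arguments):

#include <functional>
#include <vector>

// Each minibatch element is independent, so the loop parallelizes cleanly.
void batch_loop_sketch(const std::vector<int>& input_lengths,
                       const std::vector<int>& label_lengths,
                       std::vector<float>& costs,
                       const std::function<float(int mb, int T, int L, int S)>& kernel) {
    const int minibatch = static_cast<int>(costs.size());
    #pragma omp parallel for
    for (int mb = 0; mb < minibatch; ++mb) {
        const int T = input_lengths[mb]; // valid timesteps of this example
        const int L = label_lengths[mb]; // raw label length
        const int S = 2 * L + 1;         // label length with blanks interleaved
        costs[mb] = kernel(mb, T, L, S);
    }
}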
1. Forward pass on the CPU
The array probs holds the probabilities output by the network; labels maps each position in alphas to the probability of the corresponding symbol at the matching timestep in probs. This routine does nothing more than fill in alphas and then derive the log-likelihood from the alphas at the final timestep. The three main arrays have the following logical shapes:
S = 2 * label_length + 1
T = sentence length, i.e. the number of timesteps
alphabet = vocabulary size (in fact vocabulary + 1, since it includes the blank)
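The blank-extended label that indexes alphas (labels_w_blanks in the source) interleaves a blank before, between, and after every character. A minimal sketch of that layout (my own helper, not warp-ctc code):

#include <vector>

// Build the blank-extended label of length S = 2*L + 1.
std::vector<int> with_blanks(const std::vector<int>& label, int blank) {
    std::vector<int> out;
    out.reserve(2 * label.size() + 1);
    out.push_back(blank);
    for (int c : label) {
        out.push_back(c);
        out.push_back(blank);
    }
    return out; // e.g. {s,a,t,t,e} -> {-,s,-,a,-,t,-,t,-,e,-}, S = 11
}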
// Computes forward probabilities
template<typename ProbT>
ProbT CpuCTC<ProbT>::compute_alphas(const ProbT* probs, int repeats, int S, int T,
                                    const int* const e_inc,
                                    const int* const s_inc,
                                    const int* const labels,
                                    ProbT* alphas) {
    int start = (((S / 2) + repeats - T) < 0) ? 0 : 1,
        end = S > 1 ? 2 : 1;

    for (int i = start; i < end; ++i) {
        alphas[i] = std::log(probs[labels[i]]);
    }

    for (int t = 1; t < T; ++t) {
        int remain = (S / 2) + repeats - (T - t);
        if (remain >= 0) start += s_inc[remain];
        if (t <= (S / 2) + repeats) end += e_inc[t - 1];
        int startloop = start;
        int idx1 = t * S, idx2 = (t - 1) * S, idx3 = t * (alphabet_size_ * minibatch_);

        if (start == 0) {
            alphas[idx1] = alphas[idx2] + std::log(probs[blank_label_ + idx3]);
            startloop += 1;
        }

        for (int i = startloop; i < end; ++i) {
            ProbT prev_sum = ctc_helper::log_plus<ProbT>()(alphas[i + idx2], alphas[(i-1) + idx2]);

            // Skip two if not on blank and not on repeat.
            if (labels[i] != blank_label_ && i != 1 && labels[i] != labels[i-2])
                prev_sum = ctc_helper::log_plus<ProbT>()(prev_sum, alphas[(i-2) + idx2]);

            alphas[i + idx1] = prev_sum + std::log(probs[labels[i] + idx3]);
        }
    }

    ProbT loglike = ctc_helper::neg_inf<ProbT>();
    for (int i = start; i < end; ++i) {
        loglike = ctc_helper::log_plus<ProbT>()(loglike, alphas[i + (T - 1) * S]);
    }

    return loglike;
}
The function returns, as the log-likelihood (loglike) of the whole task, the sum of the alphas at the final timestep. Since the alphas are stored as logarithms, this summation is done with log_plus (log-space addition, i.e. logsumexp); naively adding the log values would instead multiply the underlying probabilities.
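For reference, a minimal sketch of what ctc_helper::log_plus computes (inferred from its usage here; the real functor may differ in details): a numerically stable log(e^a + e^b).

#include <algorithm>
#include <cmath>
#include <limits>

// Stable addition of two probabilities stored in log space.
inline float log_plus(float a, float b) {
    const float ninf = -std::numeric_limits<float>::infinity();
    if (a == ninf) return b; // adding probability 0
    if (b == ninf) return a;
    const float hi = std::max(a, b), lo = std::min(a, b);
    return hi + std::log1p(std::exp(lo - hi)); // log(e^a + e^b)
}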
Among the remaining variables: remain is, at the current time t, the index of the required symbol that the slowest path must already have reached (a position index); start is the position of the slowest path at the current timestep and end marks the fastest path, so together they form the window of all positions reachable at this timestep. For example, with the label "satte" (S = 11, repeats = 1) and T = 8, at t = 3 we get remain = 5 + 1 - (8 - 3) = 1, so start advances by s_inc[1].
// remain = label length + mandatory blank separators - remaining timesteps
int remain = (S / 2) + repeats - (T - t);
if(remain >= 0) start += s_inc[remain]; // indexing s_inc by remain: by this step the slowest path has already completed `remain` required symbols
if(t <= (S / 2) + repeats) end += e_inc[t - 1]; // the fastest path already sits on one symbol at t = 0, so by time t-1 it has reached t required symbols and advances to the (t+1)-th here
s_inc and e_inc record, for each step, the stride needed to advance from the current position to the next required symbol (required meaning it must be emitted — the valid characters plus the blank that has to sit between repeated characters — as opposed to the optional blanks). start begins at the leading blank: s_inc[0] = 1 is the distance from that blank to the first character, and s_inc[1] = 2 is the mandatory distance from the first character to the second. end begins at the first character: e_inc[0] = 2 is the mandatory distance from the first character to the second. When a character repeats (say, once), the path advances +2 to the first occurrence, then +1 to the blank, then +1 to the second occurrence. Following these rules, s_inc and e_inc can be built as functions of the raw, unextended label. start terminates at the last valid character; end terminates at the trailing blank. The table below illustrates this with the label "satte":
index | s[0] | e[0] |   |   |   |   |   |   |   | s[-1] | e[-1] |
label |  -   |  s   | - | a | - | t | - | t | - |   e   |   -   |
s_inc |  1   |  2   | 2 | 1 | 1 | 2 |
e_inc |  2   |  2   | 1 | 1 | 2 | 1 |
(the label row is the blank-extended "satte"; the s_inc/e_inc rows list their six strides in order, not aligned to label columns)
Because start is the slowest path, it only has to reach the required symbol dictated by remain at the current timestep; end is the fastest path, and the fastest position at each timestep is the next required symbol after the previous timestep's fastest position, so end advances to the next required symbol at every step.
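As a cross-check of the table, here is a self-contained sketch that derives s_inc, e_inc, and the repeat count from a raw label (adapted from the logic of CpuCTC_metadata::setup_labels; names and bounds are mine, and a non-empty label is assumed):

#include <vector>

struct Strides { std::vector<int> s_inc, e_inc; int repeats = 0; };

// Strides between consecutive *required* symbols: +2 between distinct
// characters, +1/+1 through the mandatory blank between repeated ones.
Strides make_strides(const std::vector<int>& label) {
    const int L = static_cast<int>(label.size());
    Strides st;
    st.s_inc.assign(2 * L, 0); // generous upper bound, trimmed below
    st.e_inc.assign(2 * L, 0);
    st.s_inc[0] = 1;           // leading blank -> first character
    int idx = 1;
    for (int i = 1; i < L; ++i) {
        if (label[i - 1] == label[i]) {        // repeat: pass through a blank
            st.s_inc[idx] = 1; st.s_inc[idx + 1] = 1;
            st.e_inc[idx - 1] = 1; st.e_inc[idx] = 1;
            idx += 2; ++st.repeats;
        } else {                               // distinct: single +2 hop
            st.s_inc[idx] = 2;
            st.e_inc[idx - 1] = 2;
            ++idx;
        }
    }
    st.e_inc[idx - 1] = 1;     // last character -> trailing blank
    st.s_inc.resize(idx); st.e_inc.resize(idx);
    return st;
}
// For "satte": s_inc = {1,2,2,1,1,2}, e_inc = {2,2,1,1,2,1}, repeats = 1.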
2. Backward pass on the CPU
The steps are:
compute beta
accumulate alpha * beta into the alphas matrix
at each timestep, reduce to a per-symbol output (the probabilities of identical symbols at the same timestep are summed; e.g. in "satte" every timestep has two positions holding 't')
update the grad matrix
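The grad update in the code below is the textbook CTC gradient with respect to the unnormalized final-layer activations u_k^t; in the notation of Graves et al. (2006):

\frac{\partial L}{\partial u_k^t}
    = y_k^t - \frac{1}{p(\mathbf{l} \mid \mathbf{x})\, y_k^t}
      \sum_{s \in \mathrm{lab}(\mathbf{l},\, k)} \alpha_t(s)\, \beta_t(s)

Here y_k^t is probs, log p(l|x) is log_partition, and the log-space sum over lab(l, k) is accumulated in output; the code therefore computes probs - exp(output - log(probs) - log_partition).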
// Starting from T, we sweep backward over the alpha array computing one column
// of betas as we go. At each position we can update product alpha * beta and then
// sum into the gradient associated with each label.
// NOTE computes gradient w.r.t UNNORMALIZED final layer activations.
// Assumed passed in grads are already zeroed!
template<typename ProbT>
ProbT CpuCTC<ProbT>::compute_betas_and_grad(ProbT* grad, const ProbT* const probs,
                                            ProbT log_partition, int repeats,
                                            int S, int T, const int* const e_inc,
                                            const int* const s_inc,
                                            const int* const labels,
                                            ProbT* alphas,
                                            ProbT* betas,
                                            ProbT* output) {
    int start = S > 1 ? (S - 2) : 0,
        end = (T > (S / 2) + repeats) ? S : S-1;

    std::fill(output, output + alphabet_size_, ctc_helper::neg_inf<ProbT>());

    //set the starting values in the beta column at the very right edge
    for (int i = start; i < end; ++i) {
        betas[i] = std::log(probs[labels[i] + (T - 1) * (alphabet_size_ * minibatch_)]);

        //compute alpha * beta in log space at this position in (S, T) space
        alphas[i + (T - 1) * S] += betas[i];

        //update the gradient associated with this label
        //essentially performing a reduce-by-key in a sequential manner
        output[labels[i]] =
            ctc_helper::log_plus<ProbT>()(alphas[i + (T - 1) * S], output[labels[i]]);
    }

    //update the gradient wrt to each unique label
    for (int i = 0; i < alphabet_size_; ++i) {
        int idx3 = (T - 1) * alphabet_size_ * minibatch_ + i;

        if (output[i] == 0.0 || output[i] == ctc_helper::neg_inf<ProbT>() ||
            probs[idx3] == 0.0) {
            grad[idx3] = probs[idx3];
        } else {
            grad[idx3] = probs[idx3] - std::exp(output[i] -
                                                std::log(probs[idx3]) - log_partition);
        }
    }

    //loop from the second to last column all the way to the left
    for(int t = T - 2; t >= 0; --t) {
        int remain = (S / 2) + repeats - (T - t);
        if(remain >= -1) start -= s_inc[remain + 1];
        if(t < (S / 2) + repeats) end -= e_inc[t];

        int endloop = end == S ? end - 1 : end;
        int idx1 = t * S, idx3 = t * (alphabet_size_ * minibatch_);

        std::fill(output, output + alphabet_size_, ctc_helper::neg_inf<ProbT>());

        for(int i = start; i < endloop; ++i) {
            ProbT next_sum = ctc_helper::log_plus<ProbT>()(betas[i], betas[(i+1)]);
            // Skip two if not on blank and not on repeat.
            if (labels[i] != blank_label_ && i != (S-2) && labels[i] != labels[i+2]){
                next_sum = ctc_helper::log_plus<ProbT>()(next_sum, betas[(i+2)]);
            }
            betas[i] = next_sum + std::log(probs[labels[i] + idx3]);

            //compute alpha * beta in log space
            alphas[i + idx1] += betas[i];

            //update the gradient associated with this label
            output[labels[i]] =
                ctc_helper::log_plus<ProbT>()(alphas[i + idx1], output[labels[i]]);
        }

        if (end == S) {
            betas[(S-1)] = betas[(S-1)] + std::log(probs[blank_label_ + idx3]);
            alphas[(S-1) + idx1] += betas[(S-1)];

            output[labels[S-1]] =
                ctc_helper::log_plus<ProbT>()(alphas[S-1 + idx1], output[labels[S-1]]);
        }

        //go over the unique labels and compute the final grad
        // wrt to each one at this time step
        for (int i = 0; i < alphabet_size_; ++i) {

            if (output[i] == 0.0 || output[i] == ctc_helper::neg_inf<ProbT>() ||
                probs[idx3] == 0.0) {
                grad[idx3] = probs[idx3];
            } else {
                grad[idx3] = probs[idx3] - std::exp(output[i] -
                                                    std::log(probs[idx3]) - log_partition);
            }
            ++idx3;
        }
    }

    ProbT loglike = ctc_helper::neg_inf<ProbT>();
    for(int i = start; i < end; ++i) {
        loglike = ctc_helper::log_plus<ProbT>()(loglike, betas[i]);
    }

    return loglike;
}
include\detail\gpu_ctc.h:
GpuCTC<ProbT>::cost_and_grad(const ProbT* const activations, ProbT* grads, ProbT* costs,
                             const int* const flat_labels, const int* const label_lengths,
                             const int* const input_lengths)
GpuCTC<ProbT>::compute_cost_and_score(const ProbT* const activations, ProbT* grads, ProbT* costs,
                                      const int* const flat_labels, const int* const label_lengths,
                                      const int* const input_lengths,
                                      bool compute_alpha, bool compute_betas_and_grad)
GpuCTC<ProbT>::create_metadata_and_choose_config(const int* const flat_labels,
                                                 const int* const label_lengths,
                                                 const int* const input_lengths,
                                                 size_t& best_config)
GpuCTC<ProbT>::setup_gpu_metadata(const int* const flat_labels, const int* const label_lengths,
                                  const int* const input_lengths)
    // Computes the auxiliary metadata and registers it on the GPU.
    // Mainly the batch's input_lengths, label_lengths and flat_labels, plus the
    // per-label offsets into flat_labels and each label's repeat information.
    // activation_cols_ = minibatch_ * Tmax;
    // Based on S_ (the longest effective label length in the batch), picks the
    // smallest adequate config among several predefined ones.
GpuCTC<ProbT>::compute_probs(const ProbT* const activations)
    // Copies probs (the network output) to the GPU; each "row" is one
    // vocabulary-sized slice (the buffer is really 1-D, "row" is just a mental model).
    reduce_max               — computes the maximum of each row
    prepare_stable_SM_kernel — replaces each entry of probs with its difference from the row maximum
    reduce_exp               — sum of exponentials per row (the softmax denominator; the
                               subtraction above is for numerical stability)
    compute_probs_kernel     — divides each entry by the denominator
    truncate_probs_kernel    — clamps tiny values from below to avoid numerical problems
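Taken together, the five kernels implement a numerically stable softmax with a final floor. A single-threaded sketch of the same arithmetic (my illustration, not the CUDA code; the actual truncation threshold in warp-ctc may differ):

#include <algorithm>
#include <cmath>
#include <vector>

// One vocabulary-sized row: subtract the max, exponentiate and sum,
// normalize, then clamp from below so later std::log calls stay finite.
void stable_softmax_row(std::vector<float>& row, float min_prob = 1e-12f) {
    const float row_max = *std::max_element(row.begin(), row.end()); // reduce_max
    float denom = 0.f;
    for (float& v : row) {
        v = std::exp(v - row_max); // prepare_stable_SM_kernel (+ exp)
        denom += v;                // reduce_exp
    }
    for (float& v : row) {
        v /= denom;                // compute_probs_kernel
        v = std::max(v, min_prob); // truncate_probs_kernel
    }
}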
1. Forward pass on the GPU: compute_alphas
Relevant parameters: