"Fast Tracking via Spatio-Temporal Context Learning": Key Points and Code Implementation, Part 1

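Reposted article; original published 2014-04-23 11:57.

I recently read the paper "Fast Tracking via Spatio-Temporal Context Learning", which presents an object tracking algorithm based on spatio-temporal context. The CSDN blogger "zouxy09" has already written a walkthrough of the paper: http://blog.csdn.net/zouxy09/article/details/16889905. In this post I do not follow the order of the paper. Instead, assuming the content of the paper is already familiar (I hope readers will also read the paper before this post), I reorganize the material in my own way, focusing on key points that I found easy to overlook and on logical relationships that the paper does not emphasize.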

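Key concepts:

Bayesian framework: models the statistical correlation between the target and its surrounding region in terms of low-level features (pixel intensities).

Spatio-temporal context model: the spatial relationship (quantified as the "spatial context model") is inherent between the target and its neighborhood; the temporal relationship is obtained by a weighted accumulation of the spatial relationships of every past frame (all frames before the current one).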

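Main idea:

Within the Bayesian framework, build the spatio-temporal relationship between the target object and its local context. Use this spatio-temporal model to compute a confidence map, and take the position with the highest likelihood in the confidence map as the predicted new target location.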

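1. Computing the spatio-temporal context model

$$H^{stc}_{t+1} = (1-\rho)\, H^{stc}_{t} + \rho\, h^{sc}_{t}$$

$H^{stc}_{t}$ is the spatio-temporal context model that is already available (it is the weighted accumulation of all spatial context models from frame t-1 back to frame 1); it is used to compute the confidence map of the current frame and hence the target location in this frame. $h^{sc}_{t}$ is the spatial context model newly computed in the current frame (frame t). $H^{stc}_{t+1}$ is the new spatio-temporal context model obtained as the weighted sum of the two; it is used in the next frame to compute that frame's confidence map and new target location. $\rho$ is the weight, a constant.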

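(1) Initializing the spatio-temporal context model

The update formula above clearly cannot be applied to frame 1, because there is no spatial context model or spatio-temporal context model for a frame 0. At initialization, the spatio-temporal context model of frame 1 is therefore simply set equal to its spatial context model.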
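
To make the "weighted accumulation" concrete, here is a minimal sketch of the recursion with its initialization. Scalars stand in for the model matrices, and the values in `h_sc` are made up for illustration; only `rho` is taken from the code later in this post. Unrolling the recursion should give an exponentially weighted sum of the per-frame spatial context models, which the sketch checks numerically.

```matlab
% Sketch: the recursive update equals an exponentially weighted sum.
% Scalars stand in for the (frequency-domain) model matrices.
rho  = 0.075;                   % weight, same value as in the post's code
h_sc = [1.0, 2.0, 0.5, 3.0];    % hypothetical spatial context models h_1..h_4
T    = numel(h_sc);

H = h_sc(1);                    % initialization: H_1 = h_1
for t = 1:T
    H = (1 - rho) * H + rho * h_sc(t);   % H_{t+1} from H_t and h_t
end

% Closed form obtained by unrolling the recursion:
% H_{T+1} = (1-rho)^T * h_1 + sum_{k=1..T} rho * (1-rho)^(T-k) * h_k
k = 1:T;
w = rho * (1 - rho) .^ (T - k);
H_closed = (1 - rho)^T * h_sc(1) + sum(w .* h_sc);

fprintf('recursive: %.6f, closed form: %.6f\n', H, H_closed);
```

Note that with H_1 = h_1 the first update returns h_1 again, which matches the `if frame == 1` branch of the listing below: the model learned in frame 1 is used unchanged in frame 2.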

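2. Computing the confidence map

$$c_{t+1}(\mathbf{x}) = \mathcal{F}^{-1}\Big( \mathcal{F}\big(H^{stc}_{t+1}(\mathbf{x})\big) \odot \mathcal{F}\big(I_{t+1}(\mathbf{x})\, w_{\sigma_t}(\mathbf{x}-\mathbf{x}^{\star}_{t})\big) \Big)$$

The confidence map is computed with the formula above. A few points deserve attention:

The current frame is frame t+1; the purpose of this formula is to find the target location in this frame.
The information available at this moment is: the spatio-temporal context model updated in the previous frame, the pixel intensities of the current frame (frame t+1), and the target location and target scale of the previous frame.
The convolution is accelerated with the FFT.
Because this confidence map is recomputed in every frame, keeps changing, depends on the spatio-temporal information, and is viewed from a global perspective, I call it the "absolute confidence map". Its counterpart is the "relative confidence map", discussed later. For now it is enough to remember that the absolute confidence map is recomputed in every iteration, whereas the relative confidence map never changes and is a constant.
The absolute confidence map is not needed in frame 1. It is always computed from the target location and scale of the previous frame, and there is no frame 0 to take them from. What we initialize is directly the target location and scale of frame 1, so there is nothing left for the absolute confidence map to estimate in that frame.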
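
The FFT step can be sketched on its own. The snippet below uses two small made-up matrices in place of the model and the context prior (so `A` and `B` are assumptions, not data from the tracker) and checks that multiplying the spectra and transforming back gives the same result as a direct circular convolution. It then locates the peak the same way the listing later in this post does.

```matlab
% Sketch: ifft2(fft2(A) .* fft2(B)) is the circular convolution of A and B.
A = magic(4);                   % stand-in for the spatio-temporal context model
B = reshape(1:16, 4, 4)';       % stand-in for the context prior
[M, N] = size(A);

% Direct circular convolution, O((M*N)^2), for comparison only
C_direct = zeros(M, N);
for m = 0:M-1
    for n = 0:N-1
        for p = 0:M-1
            for q = 0:N-1
                C_direct(m+1, n+1) = C_direct(m+1, n+1) + ...
                    A(p+1, q+1) * B(mod(m-p, M)+1, mod(n-q, N)+1);
            end
        end
    end
end

% FFT version, O(M*N*log(M*N))
C_fft = real(ifft2(fft2(A) .* fft2(B)));

max_err = max(abs(C_fft(:) - C_direct(:)));
[row, col] = find(C_fft == max(C_fft(:)), 1);   % peak of the "confidence map"
fprintf('max difference: %g, peak at (%d, %d)\n', max_err, row, col);
```

One consequence worth keeping in mind is that the convolution is circular: content near one border of the context region wraps around to the opposite border.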

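3. Computing the spatial context model

If the target location in the current frame is known, then within the neighborhood centered on that location (the local context) the confidence map can be modeled as:

$$c(\mathbf{x}) = b\, e^{-\left|\frac{\mathbf{x}-\mathbf{x}^{\star}}{\alpha}\right|^{\beta}}$$

Here $\alpha$, $\beta$ and $b$ are constants, so the value depends only on the distance of each point in the neighborhood from the target location. Once the size of the neighborhood is fixed, this confidence map is a constant matrix that does not change with the target location from frame to frame. I therefore call it the "relative confidence map", meaning that its values depend only on the relative distance between the surrounding points and the target.

$$c(\mathbf{x}) = h^{sc}(\mathbf{x}) \otimes \big(I(\mathbf{x})\, w_{\sigma}(\mathbf{x}-\mathbf{x}^{\star})\big)$$

$$h^{sc}(\mathbf{x}) = \mathcal{F}^{-1}\left( \frac{\mathcal{F}\Big(b\, e^{-\left|\frac{\mathbf{x}-\mathbf{x}^{\star}}{\alpha}\right|^{\beta}}\Big)}{\mathcal{F}\big(I(\mathbf{x})\, w_{\sigma}(\mathbf{x}-\mathbf{x}^{\star})\big)} \right)$$

The two formulas above are used to compute the spatial context model. Note that neither carries a frame subscript, that is, there is no notion of temporal context in them. All quantities come from the current frame itself (its own target location, its own target scale, its own pixel intensities). When computing the spatial context model we must therefore first have the target location and scale of the current frame. As a result, for every frame except frame 1 the context prior model has to be computed twice: once with the target location and scale of the previous frame (to find the target location in the current frame), and once with the target location and scale of the current frame (to compute the spatial context model of the current frame).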
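
The learning step amounts to a division in the Fourier domain, which can be checked on a toy example. In the sketch below the 8x8 "image" `I` and the very narrow window (`sigma = 0.5`) are assumptions chosen for illustration: the narrow window makes the central pixel dominate, which keeps every Fourier coefficient of this toy prior away from zero so the division is well conditioned. `alpha` and `beta` use the values from the listing in this post.

```matlab
% Sketch: learn h^{sc} by Fourier-domain division, then verify that
% convolving it with the same context prior reproduces the confidence map.
n = 8;
[rs, cs] = ndgrid((1:n) - n/2, (1:n) - n/2);
dist = rs.^2 + cs.^2;

% Relative confidence map (constant for a fixed neighborhood size)
alpha = 2.25;
beta  = 1;
conf = exp(-(sqrt(dist) ./ alpha) .^ beta);
conf = conf / sum(conf(:));

% Toy context prior: synthetic intensities times a Gaussian weight window
I      = 0.8 + 0.2 * mod(3 * rs + 5 * cs, 7) / 7;   % made-up pixel intensities
sigma  = 0.5;                                       % deliberately narrow (toy value)
window = exp(-0.5 * dist / sigma^2);
window = window / sum(window(:));
prior  = I .* window;

% Learn the spatial context model in the frequency domain
hscf = fft2(conf) ./ fft2(prior);

% Check: applying the model to the same prior gives back the confidence map
conf_rec = real(ifft2(hscf .* fft2(prior)));
err = max(abs(conf_rec(:) - conf(:)));
fprintf('reconstruction error: %g\n', err);
```

With a real image patch some Fourier coefficients of the prior can be very small, in which case the division amplifies noise; the sketch sidesteps that case rather than handling it.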

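Algorithm flow:

Code implementation (Matlab):

After consulting the author's Matlab code, I rewrote it following the flowchart I drew myself; the parameter initialization is copied verbatim from the author's code.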

% My implementation for STC model based tracking
%%
clear, clc, close all;
%%
fftw('planner','patient');
%%
addpath('./data');
img_dir = dir('./data/*.jpg');
%% Initialization
% constant parameter
target_sz = [95, 75];   % (rows, cols)
context_sz = target_sz * 2;    % (rows, cols)
rho = 0.075;
alpha = 2.25;
beta = 1;
lambda = 0.25;
num = 5;
hamming_window = hamming(context_sz(1)) * hann(context_sz(2))';
% variable initialization
target_center = [65, 161] + target_sz ./ 2;  % (row, col)
sigma = mean(target_sz);
scale = 1;
[rs, cs] = ndgrid((1:context_sz(1)) - context_sz(1)/2, (1:context_sz(2)) - context_sz(2)/2);
dist = rs.^2 + cs.^2; % a 190x150 matrix
conf = exp( -1 .* (sqrt(dist) ./ alpha).^beta );
conf = conf/sum(sum(conf)); % normalization
conff = fft2(conf);
scale_arr = zeros(num+1, 1);
%% 
for frame = 1:numel(img_dir)
    % read one frame
    img = imread(img_dir(frame).name);
    im = rgb2gray(img);
    % compute the context prior model with the target center and sigma of the previous frame
    window = hamming_window .* exp(-0.5 * dist / (sigma^2));  % It should be a 190x150 matrix
    window = window/sum(sum(window));%normalization
    context_prior = compute_prior(im, target_center, context_sz, window);   % using the target_center and sigma of the previous frame
    if frame > 1
        % compute the confidence map
        conf_map = real(ifft2( Hstcf .* fft2(context_prior) ));
        % find the new target center of the current frame
        [row, col] = find(conf_map == max(conf_map(:)), 1);
        target_center = target_center - context_sz / 2 + [row, col];    % this looks odd: why subtract 'context_sz/2' from 'target_center'?
        % update sigma to current frame
        sigma = sigma * scale;
        % compute the new context prior model with the target center and sigma
        % of the current frame
        window = hamming_window .* exp(-0.5 * dist / (sigma^2));  % It should be a 190x150 matrix
        window = window/sum(sum(window));%normalization
        context_prior = compute_prior(im, target_center, context_sz, window);   % using the new target_center and sigma of current frame
        % update the confidence map of the current frame
        conf_map = real(ifft2( Hstcf .* fft2(context_prior) ));
        % update scale (you can also keep 'scale=1')
%         scale_arr(1:num) = scale_arr(2:end);
%         scale_arr(end) = max(conf_map(:));   % newest peak goes in the last of the num+1 slots
%         scale_avr = 0;
%         if frame > (num + 1)
%             for i = 1:num
%                 scale_avr = scale_avr + sqrt(scale_arr(num + 2 - i) / scale_arr(num + 1 - i));
%             end
%             scale = (1 - lambda) * scale + lambda * scale_avr / num;
%         end
    end
    % learn the spatial context model
    hscf = conff ./ fft2(context_prior);
    % update spatio_temporal context model
    if frame == 1
        Hstcf = hscf;
    else
        Hstcf = (1-rho) * Hstcf + rho * hscf;
    end
    %visualization
    target_sz([2,1]) = target_sz([2,1])*scale;% update object size
    rect_position = [target_center([2,1]) - (target_sz([2,1])/2), (target_sz([2,1]))];  
    imagesc(uint8(img))
    colormap(gray)
    rectangle('Position',rect_position,'LineWidth',4,'EdgeColor','r');
    hold on;
    text(5, 18, strcat('#',num2str(frame)), 'Color','y', 'FontWeight','bold', 'FontSize',20);
    set(gca,'position',[0 0 1 1]); 
    pause(0.001); 
    hold off;
    drawnow;
end
%% The end
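
The listing calls `compute_prior`, which is not shown in this post. The sketch below is only my guess at what such a helper does, written as a plain script on a synthetic image: crop the context region around the target center, clamp indices at the image border, remove the mean intensity, and multiply by the weight window. The clamping and the mean removal are assumptions; the image, sizes and `sigma` value are made up.

```matlab
% Hypothetical stand-in for compute_prior (not the author's code).
im            = uint8(mod((1:40)' * (1:50), 256));   % synthetic 40x50 gray image
target_center = [20, 25];                            % (row, col)
context_sz    = [12, 10];                            % (rows, cols)

% Weight window, built the same way as in the listing above
[rs, cs] = ndgrid((1:context_sz(1)) - context_sz(1)/2, ...
                  (1:context_sz(2)) - context_sz(2)/2);
sigma  = 4;                                          % toy value
window = exp(-0.5 * (rs.^2 + cs.^2) / sigma^2);
window = window / sum(window(:));

% Rows and columns of the context region, centered on the target and
% clamped to the image (assumed border handling)
rows = floor(target_center(1)) + (1:context_sz(1)) - floor(context_sz(1)/2);
cols = floor(target_center(2)) + (1:context_sz(2)) - floor(context_sz(2)/2);
rows = min(max(rows, 1), size(im, 1));
cols = min(max(cols, 1), size(im, 2));

region = double(im(rows, cols));
region = region - mean(region(:));    % intensity normalization (assumed: mean removal)
context_prior = window .* region;     % same size as the context region

fprintf('context prior size: %d x %d\n', size(context_prior, 1), size(context_prior, 2));
```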

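The code still has some open issues worth discussing:

The scale-update part I wrote does not work, so I commented it out; for now the tracker only runs at a single scale (scale=1).
In conf_map = real(ifft2( Hstcf .* fft2(context_prior) )), the result of ifft2 is still complex, so the author takes its real part for the subsequent computation. Why is the result of ifft2 complex? And why does the author take the real part rather than the modulus?
How much does the "hamming window" affect the result? The author also normalizes the pixel intensities of the original image. If all of this is removed, how much does the result change? This still needs to be tested.
Target location initialization: target_center = [65, 161] + target_sz ./ 2; target location update: target_center = target_center - context_sz / 2 + [row, col]. Adding target_sz/2 and subtracting context_sz/2 are clearly deliberate choices by the author. What is their purpose?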
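
Two of these questions can be probed with small experiments. The sketch below is my own reading, not something confirmed by the paper's author. First, the spectra involved are all FFTs of real arrays (or quotients, products and weighted sums of them), so in exact arithmetic the inverse transform is real; the imaginary part that shows up is rounding noise. Taking the modulus instead would flip the sign of negative values. Second, the relative confidence map built with ndgrid peaks at index context_sz/2, so subtracting context_sz/2 turns a peak index into a displacement, and a target that has not moved keeps its center. By the same logic, [65, 161] is presumably the top-left corner of the initial bounding box, with target_sz/2 added to reach its center.

```matlab
% Part 1: real part versus modulus
A = magic(4) - 8.5;                 % zero-mean stand-in with negative entries
B = zeros(4);  B(2, 3) = 1;         % shifted unit impulse
C = ifft2(fft2(A) .* fft2(B));      % circularly shifted copy of A
imag_max     = max(abs(imag(C(:))));               % rounding noise only
has_negative = any(real(C(:)) < 0);                % sign information is present
abs_differs  = max(abs(abs(C(:)) - real(C(:))));   % what taking abs() would change
fprintf('max |imag|: %g, abs vs real differ by up to %g\n', imag_max, abs_differs);

% Part 2: where the relative confidence map peaks
context_sz = [190, 150];
[rs, cs] = ndgrid((1:context_sz(1)) - context_sz(1)/2, ...
                  (1:context_sz(2)) - context_sz(2)/2);
conf = exp(-(sqrt(rs.^2 + cs.^2) ./ 2.25) .^ 1);
[row, col] = find(conf == max(conf(:)), 1);

% If the confidence map of a new frame peaks at the same index,
% the update leaves the target center where it was.
target_center = [65, 161] + [95, 75] ./ 2;
new_center    = target_center - context_sz / 2 + [row, col];
fprintf('peak index: (%d, %d), center unchanged: %d\n', ...
        row, col, isequal(new_center, target_center));
```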
