[Repost] The Gauss-Newton Algorithm

This article is a repost. 2017-12-09 20:07 · Basic Mathematics

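The Gauss-Newton algorithm is one of the standard methods for nonlinear optimization problems. I ran into it again recently while reading the source of an open-source project, so I decided to look at it in depth. This post covers:

Basic mathematical terminology
Newton's method: derivation, algorithm steps, and a worked example
The Gauss-Newton method: derivation (how it follows from Newton's method), algorithm steps, and programming examples
A summary of the strengths and weaknesses of the Gauss-Newton method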

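I. Basic concepts and definitions

1. Nonlinear equations and an overview of optimization methods

A nonlinear equation is one in which the dependent variable is not a linear function of the independent variables; the relationship may be quadratic, logarithmic, exponential, trigonometric, and so on. For such problems, the exact optimum of a real-valued function f of n variables over the whole n-dimensional space R^n is usually hard to obtain, so in practice we look for an approximate solution.

Most methods for this optimization problem are iterative algorithms built on successive one-dimensional searches. The basic idea is to pick a favorable search direction at the current approximate point, perform a one-dimensional search along that direction to obtain a new approximate point, and repeat until a preset accuracy is met. Depending on how the search direction is chosen, these iterative algorithms fall into two classes:

Analytical methods: these require derivatives of the objective function. They include the following.

Gradient method: also called steepest descent; an early analytical method with slow convergence.

Newton's method: converges quickly, but it is not robust and each step is expensive to compute. The Gauss-Newton method is derived from it, but it targets a different kind of problem.

Conjugate gradient method: converges fairly quickly and works well in practice.

Variable metric methods: quite efficient; the DFP (Davidon-Fletcher-Powell) method is a common choice.

Direct methods: these use only function values and no derivatives. Examples are the alternating direction method (also called the cyclic coordinate method), pattern search, the rotating direction method, Powell's conjugate direction method, and the accelerated simplex method.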

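2. The nonlinear least-squares problem

Nonlinear least-squares problems come from nonlinear regression: given observations of the independent and dependent variables, find the coefficients of a nonlinear model so that the model matches the observations as closely as possible.

The Gauss-Newton method is the most basic method for nonlinear least-squares problems. It applies only to objectives that are sums of squared function values, so the objective must be written in that form before the method is used.

Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values.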

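3. Basic mathematical notation

a. Gradient: the vector formed by the partial derivatives of a multivariate function.

For a function of two variables, the gradient is: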
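In standard notation, writing the two variables as x and y (symbol names chosen here for illustration), the gradient of f(x, y) can be written as:

```latex
\nabla f(x, y) =
\begin{pmatrix}
\dfrac{\partial f}{\partial x} \\[6pt]
\dfrac{\partial f}{\partial y}
\end{pmatrix}
```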

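b. Hessian matrix: the square matrix of second-order partial derivatives of a multivariate function; it describes the local curvature of the function. For a function of two variables,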
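For the same illustrative variables x and y, the standard form of the Hessian of f(x, y) is:

```latex
H(f) =
\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\[6pt]
\dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2}
\end{pmatrix}
```

When the second partial derivatives are continuous, the two mixed derivatives are equal and H is symmetric.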

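c. Jacobian matrix: the matrix of first-order partial derivatives of a vector-valued multivariate function, arranged in a fixed order. It gives the best linear approximation of a differentiable function near a given point. For a function of two variables,

More generally, for F: R^n -> R^m the Jacobian is a matrix with m rows and n columns: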
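Assuming F has component functions F_1, ..., F_m of the variables x_1, ..., x_n (names introduced here for illustration), the standard layout is one row per component and one column per variable:

```latex
J_F(x_1, \dots, x_n) =
\begin{pmatrix}
\dfrac{\partial F_1}{\partial x_1} & \cdots & \dfrac{\partial F_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_m}{\partial x_1} & \cdots & \dfrac{\partial F_m}{\partial x_n}
\end{pmatrix}
```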

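What the Jacobian is for: if P is a point in R^n and F is differentiable at P, then the derivative of F at P is given by J_F(P). In that case the linear operator described by J_F(P) is the best linear approximation of F near P: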
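In symbols, for a point X close to P (X is a name chosen here), this linear approximation is usually written as:

```latex
F(X) \approx F(P) + J_F(P)\,(X - P)
```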

d. Residual: the difference between an observed value and the estimated (fitted) value.

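II. Newton's method

The basic idea of Newton's method is to approximate the given function locally by a polynomial (its second-order Taylor expansion), take the minimizer of that polynomial as the new estimate of the minimum, and repeat until the required accuracy is reached.

1. Consider the following one-dimensional unconstrained minimization problem: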
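A standard way to set this up, assuming f is twice differentiable and f''(x_k) > 0 near the minimum, is to replace f by its second-order Taylor model around the current iterate x_k and minimize that model instead:

```latex
\begin{aligned}
&\min_{x \in \mathbb{R}} f(x) \\
&f(x) \approx \varphi(x) = f(x_k) + f'(x_k)(x - x_k) + \tfrac{1}{2} f''(x_k)(x - x_k)^2 \\
&\varphi'(x) = f'(x_k) + f''(x_k)(x - x_k) = 0
\quad\Longrightarrow\quad
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}
\end{aligned}
```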

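The one-dimensional Newton method therefore proceeds as follows:

Note that when Newton's method is used to find an extremum, a poorly chosen starting point may keep it from converging to the minimum.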
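As a minimal sketch of the one-dimensional iteration x_{k+1} = x_k - f'(x_k)/f''(x_k), the C++ below uses only the standard library. The function names and the test objective f(x) = x^4/4 - x are hypothetical choices for illustration and do not come from the original post.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// One-dimensional Newton iteration for minimization:
//   x_{k+1} = x_k - f'(x_k) / f''(x_k)
// df and d2f are the first and second derivatives of the objective.
// Stops when the step is smaller than tol or after max_iter steps.
double newton_minimize_1d(const std::function<double(double)>& df,
                          const std::function<double(double)>& d2f,
                          double x0, double tol = 1e-12, int max_iter = 50)
{
    double x = x0;
    for (int k = 0; k < max_iter; ++k) {
        double curvature = d2f(x);
        assert(curvature != 0.0);  // the Newton step is undefined when f''(x) = 0
        double step = df(x) / curvature;
        x -= step;
        if (std::fabs(step) < tol) break;
    }
    return x;
}

// Hypothetical test objective f(x) = x^4/4 - x, whose minimum is at x = 1.
double example_df(double x)  { return x * x * x - 1.0; }  // f'(x)
double example_d2f(double x) { return 3.0 * x * x; }      // f''(x)
```

Calling `newton_minimize_1d(example_df, example_d2f, 2.0)` should settle at x = 1. A start close to x = 0, where f'' is nearly zero, produces a very large first step, which illustrates the sensitivity to the starting point.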

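2. The multidimensional unconstrained case:

Suppose the nonlinear objective f(x) has continuous second-order partial derivatives and x(k) is an approximation of its minimizer. Take the second-order Taylor expansion of f(x) at this point, that is: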
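Written in standard notation, with the gradient and the Hessian H as defined earlier, the expansion and the point where its gradient vanishes take the form below. The labels (1) and (2) are assumed here to match the numbering the text refers to.

```latex
\begin{align}
f(x) &\approx f(x^{(k)}) + \nabla f(x^{(k)})^{T} (x - x^{(k)})
      + \tfrac{1}{2} (x - x^{(k)})^{T} H(x^{(k)}) (x - x^{(k)}) \tag{1} \\
x &= x^{(k)} - H(x^{(k)})^{-1} \nabla f(x^{(k)}) \tag{2}
\end{align}
```

Equation (2) follows from setting the gradient of the right-hand side of (1) to zero, which gives \nabla f(x^{(k)}) + H(x^{(k)}) (x - x^{(k)}) = 0.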

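If f(x) is a quadratic function, its Hessian H is constant and equation (1) is exact (an equality). In that case, starting from any point, equation (2) reaches the minimizer of f(x) in a single step (assuming the Hessian is positive definite, i.e. all its eigenvalues are greater than 0).

If f(x) is not quadratic, equation (1) is only an approximation, and the point obtained from equation (2) is only an approximate minimizer of f(x). In this case the search direction is usually chosen as follows: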
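A common choice, and the one assumed here, is to keep the Newton direction and pick the step length λ by a one-dimensional search along it (the symbol p for the direction is introduced for illustration):

```latex
\begin{aligned}
p^{(k)} &= -H(x^{(k)})^{-1} \nabla f(x^{(k)}) \\
\lambda_k &= \arg\min_{\lambda} f\bigl(x^{(k)} + \lambda\, p^{(k)}\bigr) \\
x^{(k+1)} &= x^{(k)} + \lambda_k\, p^{(k)}
\end{aligned}
```

With λ_k fixed at 1 this reduces to the pure Newton step.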

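Newton's method converges very quickly, and it is very effective when the second derivatives of f(x) and the inverse of its Hessian are easy to compute. [In practice, though, the Hessian is usually hard to obtain.]

3. A worked example.

Example: find the minimum of the following function using Newton's method:

Solution:

[f(x) is a quadratic function, so H is a constant matrix and a single step from any starting point reaches the minimizer.]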
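To make the one-step property concrete, here is a small C++ sketch on a hypothetical quadratic, f(x, y) = x^2 + xy + y^2 - 3x - 3y, whose minimizer is (1, 1). This is not the function from the example above, and the names `Vec2`, `gradient`, and `newton_step` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

// Hypothetical quadratic: f(x, y) = x^2 + x*y + y^2 - 3x - 3y.
// Its gradient is (2x + y - 3, x + 2y - 3) and its Hessian is the
// constant matrix [[2, 1], [1, 2]], which is positive definite.
Vec2 gradient(Vec2 p)
{
    return { 2.0 * p.x + p.y - 3.0, p.x + 2.0 * p.y - 3.0 };
}

// One Newton step: p_new = p - H^{-1} * grad f(p), with the 2x2 inverse
// written out explicitly.
Vec2 newton_step(Vec2 p)
{
    const double h11 = 2.0, h12 = 1.0, h21 = 1.0, h22 = 2.0;
    const double det = h11 * h22 - h12 * h21;
    assert(det > 0.0);  // H must be invertible (here it is positive definite)

    Vec2 g = gradient(p);
    // H^{-1} g for a 2x2 matrix: (1/det) * [[h22, -h12], [-h21, h11]] * g
    double dx = ( h22 * g.x - h12 * g.y) / det;
    double dy = (-h21 * g.x + h11 * g.y) / det;
    return { p.x - dx, p.y - dy };
}
```

Whatever starting point is passed to `newton_step`, the result should be (1, 1) up to rounding, because the second-order Taylor model of a quadratic is the quadratic itself.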

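III. The Gauss-Newton method

1. How Gauss-Newton is derived from Newton's method

Fitting data, for example estimating a camera pose from the reprojection error (with R and T as the unknown coefficients), is often cast as a nonlinear least-squares problem. The Gauss-Newton method is designed for exactly this kind of problem, with the goal of data fitting, parameter estimation, and function estimation.

Suppose we study a nonlinear least-squares problem of the following form:

The residual (reprojection error) between the two positions is:

With a large number of (multidimensional) observations, we can find the pose between the two cameras by choosing T so that the sum of squared residuals is minimized. I will not go further into machine vision here and will continue with how to solve the nonlinear least-squares problem.

Applying Newton's method to equation (3) gives the Newton iteration:

At this point the difference between Gauss-Newton and Newton's method should be clear: it lies entirely in this iteration term. In the classical Gauss-Newton algorithm the step length λ is 1.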
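To spell the difference out, assume equation (3) is the sum of squared residuals, write r for the vector of residuals r_1, ..., r_m and J for the Jacobian of r (the scaling and these symbols are assumptions made here). The gradient and Hessian of f, and the two iterations, are then:

```latex
\begin{aligned}
f(x) &= \sum_{i=1}^{m} r_i(x)^2 \\
\nabla f(x) &= 2\, J^{T} r \\
H(x) &= 2\,\bigl(J^{T} J + S\bigr), \qquad S = \sum_{i=1}^{m} r_i(x)\, \nabla^2 r_i(x) \\
\text{Newton:}\quad x^{(k+1)} &= x^{(k)} - \bigl(J^{T} J + S\bigr)^{-1} J^{T} r \\
\text{Gauss-Newton:}\quad x^{(k+1)} &= x^{(k)} - \bigl(J^{T} J\bigr)^{-1} J^{T} r
\end{aligned}
```

Gauss-Newton is Newton's method with the second-order term S dropped. In the code of this post the residual is computed as r = y - Func(x, params), so the Jacobian of r is the negative of the model Jacobian Jf, and the update appears with a plus sign: params += (Jf^T Jf)^{-1} Jf^T r.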

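Why, then, does Gauss-Newton discard the second-order partial derivatives in the Hessian? The main reason is that the second-order terms of the Hessian in Newton's method are usually hard or expensive to compute. Approximating the whole of H by a secant update is not attractive either, because J(x) has already been computed for the gradient, so the first-order part of H, J^T J, is almost free. To simplify the computation and obtain an efficient algorithm, we therefore approximate the second-order information using first-derivative information alone. Note the precondition: the second-order term can be neglected only when the residual r is close to zero, or close to linear so that its second derivatives are close to zero. This is usually called the "small-residual problem"; when it does not hold, Gauss-Newton may fail to converge.

3. Examples

The code below has no mechanism to guarantee convergence; the drawback shows up in the self-made experiment of Example 2. As for the dimension of the independent variable, the code supports several inputs, but both examples are one-dimensional. In Example 1 the only input is the year t; other factors could be added, but that is not important here.

Example 1: using United States population data from 1815 to 1885, estimate the parameters A and B of the population model. The years and population totals are known (they are the values loaded in the code below), as is the model equation F = A*exp(B*t); the task is to find the parameters in the equation.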

// A simple demo of Gauss-Newton algorithm on a user defined function

#include <cmath>   // exp, fabs
#include <cstdio>
#include <vector>
#include <opencv2/core/core.hpp>

using namespace std;
using namespace cv;

const double DERIV_STEP = 1e-5;
const int MAX_ITER = 100;


void GaussNewton(double(*Func)(const Mat &input, const Mat &params), // function pointer
                 const Mat &inputs, const Mat &outputs, Mat &params);

double Deriv(double(*Func)(const Mat &input, const Mat &params), // function pointer
             const Mat &input, const Mat &params, int n);

// The user defines their function here
double Func(const Mat &input, const Mat &params);

int main()
{
	// For this demo we're going to try and fit to the function
	// F = A*exp(t*B)
	// There are 2 parameters: A B
	int num_params = 2;

    // Number of observations
    int total_data = 8;

    Mat inputs(total_data, 1, CV_64F);
    Mat outputs(total_data, 1, CV_64F);

    // Load the observations: the input is the year index t = 1..8
    for(int i=0; i < total_data; i++) {
        inputs.at<double>(i,0) = i+1;
    }
    // Observed US population for each year index
	outputs.at<double>(0,0)= 8.3;
	outputs.at<double>(1,0)= 11.0;
	outputs.at<double>(2,0)= 14.7;
	outputs.at<double>(3,0)= 19.7;
	outputs.at<double>(4,0)= 26.7;
	outputs.at<double>(5,0)= 35.2;
	outputs.at<double>(6,0)= 44.4;
	outputs.at<double>(7,0)= 55.9;

    // Guess the parameters, it should be close to the true value, else it can fail for very sensitive functions!
    Mat params(num_params, 1, CV_64F);

	//init guess
    params.at<double>(0,0) = 6;
	params.at<double>(1,0) = 0.3;

    GaussNewton(Func, inputs, outputs, params);

    printf("Parameters from GaussNewton: %f %f\n", params.at<double>(0,0), params.at<double>(1,0));

    return 0;
}

double Func(const Mat &input, const Mat &params)
{
	// Assumes input is a single row matrix
	// Assumes params is a column matrix

	double A = params.at<double>(0,0);
	double B = params.at<double>(1,0);

	double x = input.at<double>(0,0);

    return A*exp(x*B);
}

// Partial derivative of Func with respect to the n-th parameter (the parameters are what we solve for)
double Deriv(double(*Func)(const Mat &input, const Mat &params), const Mat &input, const Mat &params, int n)
{
	// Assumes input is a single row matrix

	// Returns the derivative of the nth parameter
	Mat params1 = params.clone();
	Mat params2 = params.clone();

	// Use central difference  to get derivative
	params1.at<double>(n,0) -= DERIV_STEP;
	params2.at<double>(n,0) += DERIV_STEP;

	double p1 = Func(input, params1);
	double p2 = Func(input, params2);

	double d = (p2 - p1) / (2*DERIV_STEP);

	return d;
}

void GaussNewton(double(*Func)(const Mat &input, const Mat &params),
                 const Mat &inputs, const Mat &outputs, Mat &params)
{
    int m = inputs.rows;
    int n = inputs.cols;
    int num_params = params.rows;

    Mat r(m, 1, CV_64F); // residual matrix
    Mat Jf(m, num_params, CV_64F); // Jacobian of Func()
    Mat input(1, n, CV_64F); // single row input

    double last_mse = 0;

    for(int i=0; i < MAX_ITER; i++) {
        double mse = 0;

        for(int j=0; j < m; j++) {
        	for(int k=0; k < n; k++) {//copy Independent variable vector, the year
        		input.at<double>(0,k) = inputs.at<double>(j,k);
        	}

            r.at<double>(j,0) = outputs.at<double>(j,0) - Func(input, params);//diff between estimate and observation population

            mse += r.at<double>(j,0)*r.at<double>(j,0);

            for(int k=0; k < num_params; k++) {
            	Jf.at<double>(j,k) = Deriv(Func, input, params, k);
            }
        }

        mse /= m;

        // The difference in mse is very small, so quit
        if(fabs(mse - last_mse) < 1e-8) {
        	break;
        }

        Mat delta = ((Jf.t()*Jf)).inv() * Jf.t()*r;
        params += delta;

        //printf("%d: mse=%f\n", i, mse);
        printf("%d %f\n", i, mse);

        last_mse = mse;
    }
}

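Output:

A = 7.0, B = 0.26 (initial guess A = 6, B = 0.3). Out of a maximum of 100 iterations, the run converged at the 4th.

With the initial guess A = 1, B = 1, convergence takes 14 iterations.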
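Since the listing has no safeguard against divergence, one common remedy is step halving, a simple backtracking line search. The sketch below is not the author's code: it is a dependency-free variant for the two-parameter model of Example 1, with hypothetical names (`Params`, `model`, `sum_sq`, `gauss_newton_damped`), an analytic Jacobian in place of the finite differences, and a hand-written 2x2 solve in place of OpenCV's `inv()`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Params { double A, B; };

// Model from Example 1: F(t) = A * exp(B * t).
double model(double t, Params p) { return p.A * std::exp(p.B * t); }

// Sum of squared residuals r_i = y_i - F(t_i).
double sum_sq(const std::vector<double>& t, const std::vector<double>& y, Params p)
{
    double s = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        double r = y[i] - model(t[i], p);
        s += r * r;
    }
    return s;
}

// Gauss-Newton with step halving: the full step solves (J^T J) delta = J^T r
// and is halved until the sum of squares decreases.
Params gauss_newton_damped(const std::vector<double>& t, const std::vector<double>& y,
                           Params p, int max_iter = 100, double tol = 1e-10)
{
    assert(t.size() == y.size());
    for (int it = 0; it < max_iter; ++it) {
        // Accumulate the 2x2 normal equations using the analytic Jacobian.
        double jaa = 0.0, jab = 0.0, jbb = 0.0, ga = 0.0, gb = 0.0;
        for (std::size_t i = 0; i < t.size(); ++i) {
            double e  = std::exp(p.B * t[i]);
            double dA = e;               // dF/dA
            double dB = p.A * t[i] * e;  // dF/dB
            double r  = y[i] - p.A * e;  // residual
            jaa += dA * dA; jab += dA * dB; jbb += dB * dB;
            ga  += dA * r;  gb  += dB * r;
        }
        double det = jaa * jbb - jab * jab;
        if (std::fabs(det) < 1e-300) break;  // J^T J is singular: give up
        double stepA = ( jbb * ga - jab * gb) / det;
        double stepB = (-jab * ga + jaa * gb) / det;

        // Backtracking: shrink the step until the objective improves.
        double current = sum_sq(t, y, p);
        double scale = 1.0;
        Params trial = p;
        bool improved = false;
        for (int h = 0; h < 30; ++h) {
            trial = { p.A + scale * stepA, p.B + scale * stepB };
            if (sum_sq(t, y, trial) < current) { improved = true; break; }
            scale *= 0.5;
        }
        if (!improved) break;  // no decrease found: treat as converged
        double moved = std::fabs(trial.A - p.A) + std::fabs(trial.B - p.B);
        p = trial;
        if (moved < tol) break;
    }
    return p;
}
```

With the eight observations of Example 1 and the starting guess A = 6, B = 0.3, this should land near the A = 7.0, B = 0.26 reported above. The halving loop only matters when a full step would increase the error; otherwise each iteration is the plain Gauss-Newton step with λ = 1.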

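The figure below shows the population model curve fitted in MATLAB using the A and B obtained above.

Example 2: I want to fit the following model, y = A*sin(B*x) + C*cos(D*x).

For lack of real observations I staged the experiment myself: assume the four parameters are known, A = 5, B = 1, C = 10, D = 2, generate 100 random numbers as the observed x values, and compute the corresponding function values. Then use these observations to recover the four parameters.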

// A simple demo of Gauss-Newton algorithm on a user defined function

#include <cmath>    // sin, cos, fabs
#include <cstdio>
#include <cstdlib>  // rand, RAND_MAX
#include <vector>
#include <opencv2/core/core.hpp>

using namespace std;
using namespace cv;

const double DERIV_STEP = 1e-5;
const int MAX_ITER = 100;


void GaussNewton(double(*Func)(const Mat &input, const Mat &params), // function pointer
                 const Mat &inputs, const Mat &outputs, Mat &params);

double Deriv(double(*Func)(const Mat &input, const Mat &params), // function pointer
             const Mat &input, const Mat &params, int n);

// The user defines their function here
double Func(const Mat &input, const Mat &params);

int main()
{
	// For this demo we're going to try and fit to the function
	// F = A*sin(Bx) + C*cos(Dx)
	// There are 4 parameters: A, B, C, D
	int num_params = 4;

    // Generate random data using these parameters
    int total_data = 100;

    double A = 5;
    double B = 1;
    double C = 10;
    double D = 2;

    Mat inputs(total_data, 1, CV_64F);
    Mat outputs(total_data, 1, CV_64F);

    for(int i=0; i < total_data; i++) {
        double x = -10.0 + 20.0* rand() / (1.0 + RAND_MAX); // uniform random value in [-10, 10)
        double y = A*sin(B*x) + C*cos(D*x);

        // Add some noise
       // y += -1.0 + 2.0*rand() / (1.0 + RAND_MAX);

        inputs.at<double>(i,0) = x;
        outputs.at<double>(i,0) = y;
    }

    // Guess the parameters, it should be close to the true value, else it can fail for very sensitive functions!
    Mat params(num_params, 1, CV_64F);

    params.at<double>(0,0) = 1;
    params.at<double>(1,0) = 1;
    params.at<double>(2,0) = 8; // changing to 1 will cause it not to find the solution, too far away
    params.at<double>(3,0) = 1;

    GaussNewton(Func, inputs, outputs, params);

    printf("True parameters: %f %f %f %f\n", A, B, C, D);
    printf("Parameters from GaussNewton: %f %f %f %f\n", params.at<double>(0,0), params.at<double>(1,0),
    													params.at<double>(2,0), params.at<double>(3,0));

    return 0;
}

double Func(const Mat &input, const Mat &params)
{
	// Assumes input is a single row matrix
	// Assumes params is a column matrix

	double A = params.at<double>(0,0);
	double B = params.at<double>(1,0);
	double C = params.at<double>(2,0);
	double D = params.at<double>(3,0);

	double x = input.at<double>(0,0);

    return A*sin(B*x) + C*cos(D*x);
}

double Deriv(double(*Func)(const Mat &input, const Mat &params), const Mat &input, const Mat &params, int n)
{
	// Assumes input is a single row matrix

	// Returns the derivative of the nth parameter
	Mat params1 = params.clone();
	Mat params2 = params.clone();

	// Use central difference  to get derivative
	params1.at<double>(n,0) -= DERIV_STEP;
	params2.at<double>(n,0) += DERIV_STEP;

	double p1 = Func(input, params1);
	double p2 = Func(input, params2);

	double d = (p2 - p1) / (2*DERIV_STEP);

	return d;
}

void GaussNewton(double(*Func)(const Mat &input, const Mat &params),
                 const Mat &inputs, const Mat &outputs, Mat &params)
{
    int m = inputs.rows;
    int n = inputs.cols;
    int num_params = params.rows;

    Mat r(m, 1, CV_64F); // residual matrix
    Mat Jf(m, num_params, CV_64F); // Jacobian of Func()
    Mat input(1, n, CV_64F); // single row input

    double last_mse = 0;

    for(int i=0; i < MAX_ITER; i++) {
        double mse = 0;

        for(int j=0; j < m; j++) {
        	for(int k=0; k < n; k++) {
        		input.at<double>(0,k) = inputs.at<double>(j,k);
        	}

            r.at<double>(j,0) = outputs.at<double>(j,0) - Func(input, params);

            mse += r.at<double>(j,0)*r.at<double>(j,0);

            for(int k=0; k < num_params; k++) {
            	Jf.at<double>(j,k) = Deriv(Func, input, params, k);
            }
        }

        mse /= m;

        // The difference in mse is very small, so quit
        if(fabs(mse - last_mse) < 1e-8) {
        	break;
        }

        Mat delta = ((Jf.t()*Jf)).inv() * Jf.t()*r;
        params += delta;

        //printf("%d: mse=%f\n", i, mse);
        printf("%f\n",mse);

        last_mse = mse;
    }
}

Output: the recovered parameters are not very good; the run converged after 50 iterations.