A Worked Example of GBDT Principles, Part 2


   

To begin, we set F(x), i.e. every sample's predicted score, to 0 (some randomization would also be possible):

Scores = { 0, 0, 0, 0, 0, 0, 0, 0}

   

First, we compute the gradients under the current scores:

   

GetGradientInOneQuery = [this](int query, const Fvec& scores)

{

// a simplified version, slightly different from the actual code

_gradient[query] = ((2.0 * label) * sigmoidParam) / (1.0 + std::exp(((2.0 * label) * sigmoidParam) * scores[query]));

};
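For reference, this pseudo-response is the negative gradient of the logistic loss used by MART-style GBDT for binary classification, with sigmoidParam playing the role of \sigma (a reconstruction from the code above, not a formula quoted from the source):

\ell(y, F) = \log\left(1 + e^{-2\sigma y F}\right), \qquad g = -\frac{\partial \ell}{\partial F} = \frac{2\sigma y}{1 + e^{2\sigma y F}}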

   

Consider sample 0, whose label is 1. learningRate, i.e. sigmoidParam, is set to 0.1, and scores[query] = 0 since all Scores are currently 0:

2 * 1 * 0.1 / (1 + exp(2 * 1 * 0.1 * 0)) = 0.1

Consider sample 7, whose label is -1:

2 * -1 * 0.1 / (1 + exp(2 * -1 * 0.1 * 0)) = -0.1
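A minimal standalone C++ sketch computes all eight gradients at once. The label vector is inferred from the walkthrough (samples 0-3 labeled 1, samples 4-7 labeled -1); that assignment is an assumption, not taken from the dataset file:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // assumed labels: first four positive, last four negative
    std::vector<double> labels = {1, 1, 1, 1, -1, -1, -1, -1};
    std::vector<double> scores(labels.size(), 0.0);  // F(x) starts at 0
    const double sigmoidParam = 0.1;                 // the learning rate above
    for (size_t i = 0; i < labels.size(); ++i) {
        double g = (2.0 * labels[i] * sigmoidParam) /
                   (1.0 + std::exp(2.0 * labels[i] * sigmoidParam * scores[i]));
        std::printf("%g ", g);  // prints: 0.1 0.1 0.1 0.1 -0.1 -0.1 -0.1 -0.1
    }
    std::printf("\n");
    return 0;
}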

   

The gradients computed at this point are therefore

Gradient = { 0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1 }

   

The targets that the current tree's output must fit are therefore exactly this gradient:

Targets = { 0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1 }

RegressionTree tree = TreeLearner->FitTargets(activeFeatures, AdjustTargetsAndSetWeights());

   

virtual RegressionTree FitTargets(BitArray& activeFeatures, Fvec& targets) override

   

Now let us watch the tree fit these targets, running under gdb:

gdb ./test_fastrank_train

(gdb) r -in dating.txt -cl gbdt -ntree 2 -nl 3 -lr 0.1 -mil 1 -c train -vl 1 -mjson=1

p Partitioning

$3 = {_documents = std::vector of length 8, capacity 8 = {0, 1, 2, 3, 4, 5, 6, 7}, _initialDocuments = std::vector of length 0, capacity 0, _leafBegin = std::vector of length 3, capacity 3 = {0, 0,

0}, _leafCount = std::vector of length 3, capacity 3 = {8, 0, 0}, _tempDocuments = std::vector of length 0, capacity 0}

   

GBDT discretizes each feature into bins (for example, 255 of them). Here the sample count is small; for the hight feature the raw values are

20, 60, 3, 66, 30, 20, 15, 10

and after binning the bin medians become

BinMedians = std::vector of length 7, capacity 7 = {3, 10, 15, 20, 30, 60, 66}

p *Feature

$11 = {_vptr.Feature = 0xce8650 <vtable for gezi::Feature+16>, Name = "hight",

BinUpperBounds = std::vector of length 7, capacity 7 = {6.5, 12.5, 17.5, 25, 45, 63, 1.7976931348623157e+308},

BinMedians = std::vector of length 7, capacity 7 = {3, 10, 15, 20, 30, 60, 66},

Bins = {_vptr.TVector = 0xce8670 <vtable for gezi::TVector<int>+16>, indices = std::vector of length 0, capacity 0,

values = std::vector of length 8, capacity 8 = {3, 5, 0, 6, 4, 3, 2, 1}, sparsityRatio = 0.29999999999999999, keepDense = false, keepSparse = false, normalized = false, numNonZeros = 7,

length = 8, _zeroValue = 0}, Trust = 1}

 

Bins holds each sample's bin assignment; e.g. sample _0 has hight 20, which lands in bin number 3 (0-indexed).
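The sketch below reproduces this binning. Deriving BinUpperBounds as the midpoints of adjacent sorted unique values is an assumption, but it matches the dump above exactly:

#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
    std::vector<double> values = {20, 60, 3, 66, 30, 20, 15, 10};  // hight
    std::vector<double> uniq(values);
    std::sort(uniq.begin(), uniq.end());
    uniq.erase(std::unique(uniq.begin(), uniq.end()), uniq.end()); // = BinMedians
    std::vector<double> upper;  // assumed: midpoints between adjacent medians
    for (size_t k = 0; k + 1 < uniq.size(); ++k)
        upper.push_back((uniq[k] + uniq[k + 1]) / 2.0);
    upper.push_back(std::numeric_limits<double>::max());  // last bin is unbounded
    for (double v : values) {   // a value goes into the first bin that can hold it
        int bin = 0;
        while (v > upper[bin]) ++bin;
        std::printf("%d ", bin);  // prints: 3 5 0 6 4 3 2 1
    }
    std::printf("\n");
    return 0;
}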

   

Now consider splitting the root node. Before the split all 8 samples sit in one node; we must choose the best feature and, for that feature, the best split point.

For the hight feature we scan every possible split point, which here means 6 candidates:

for (int t = 0; t < (histogram.NumFeatureValues - 1); t += 1)

   

Take the split point 6.5, for example. The left child then holds 1 sample (_2) and the right child holds the other 7. Using the gain formula below, the gain is 0.1^2/1 + (-0.1)^2/7 - CONSTANT = 0.01142857142857143 - CONSTANT.
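The gain formula referenced here was an image in the original post and is missing; a reconstruction consistent with every number below, assuming the standard variance gain over the gradient targets g_i, is:

\text{Gain}(L, R) = \frac{\left(\sum_{i \in L} g_i\right)^2}{|L|} + \frac{\left(\sum_{i \in R} g_i\right)^2}{|R|} - \underbrace{\frac{\left(\sum_{i} g_i\right)^2}{n}}_{\text{CONSTANT}}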

   

Similarly, evaluate the split points 12.5, 17.5, ... and keep the best one.

   

Then do the same for the money and face features and pick the best (feature, split point) combination overall; here that combination is (hight, 45).

   

The left side gets

   

_0, _2, _4, _5, _6, _7 -> 0.1 + 0.1 - 0.1 - 0.1 - 0.1 - 0.1

   

and the right side gets

_1, _3 -> 0.1 + 0.1

The gain is

(-0.2)^2 /6 + (0.2)^2 / 2 - CONSTANT = 0.026666666666666665 - CONSTANT

(gdb) p bestShiftedGain

$22 = 0.026666666666666675
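To make the scan concrete, here is a hypothetical re-implementation over the hight bins (a sketch, not the library code). It reproduces the 6.5-split gain at t = 0 and finds the best gain at threshold bin t = 4, whose upper bound is 45:

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> bins = {3, 5, 0, 6, 4, 3, 2, 1};  // per-sample hight bin
    std::vector<double> g = {0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1};
    const int numFeatureValues = 7;  // number of bins for this feature
    double bestGain = 0.0;
    int bestT = -1;
    for (int t = 0; t < numFeatureValues - 1; ++t) {  // split: bin <= t vs bin > t
        double sumL = 0, sumR = 0;
        int nL = 0, nR = 0;
        for (size_t i = 0; i < bins.size(); ++i)
            if (bins[i] <= t) { sumL += g[i]; ++nL; } else { sumR += g[i]; ++nR; }
        if (nL == 0 || nR == 0) continue;
        double gain = sumL * sumL / nL + sumR * sumR / nR;  // CONSTANT omitted
        std::printf("t=%d gain=%.17g\n", t, gain);
        if (gain > bestGain) { bestGain = gain; bestT = t; }
    }
    std::printf("best t=%d gain=%.17g\n", bestT, bestGain);  // t=4, 0.026666...
    return 0;
}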

   

The output of the > (right-hand) child is accordingly 0.2 / 2 = 0.1. (The original figure, not reproduced here, showed this as output1, with each leaf's final output highlighted in yellow.) An AdjustOutput step still follows, because at minimum we need F_m(x) = F_{m-1}(x) + learning_rate * (the current tree's prediction, i.e. the fitted gradient).

   

   

Next, between the two resulting groups, again pick the best (feature, split) combination -> (face, 57.5).

   

(gdb) p tree

$26 = {<gezi::OnlineRegressionTree> = {NumLeaves = 3, _gainPValue = std::vector of length 2, capacity 2 = {0.15304198078836101, 0.27523360741160119},

_lteChild = std::vector of length 2, capacity 2 = {1, -1}, _gtChild = std::vector of length 2, capacity 2 = {-2, -3}, _leafValue = std::vector of length 3, capacity 3 = {-0.10000000000000002,

0.10000000000000002, 0.033333333333333347}, _threshold = std::vector of length 2, capacity 2 = {4, 2}, _splitFeature = std::vector of length 2, capacity 2 = {0, 2},

_splitGain = std::vector of length 2, capacity 2 = {0.026666666666666675, 0.026666666666666679}, _maxOutput = 0.10000000000000002, _previousLeafValue = std::vector of length 2, capacity 2 = {0,

-0.033333333333333333}, _weight = 1, _featureNames = 0x6e6a5a <gezi::FastRank::GetActiveFeatures(std::vector<bool, std::allocator<bool> >&)+34>},

_parent = std::vector of length 3, capacity 3 = {1, -1, -2}}
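In this dump, negative child indices evidently encode leaves. Assuming leaf index = ~child (i.e. -child - 1), which is consistent with the structure above, scoring a sample is a short walk down these arrays; the sketch below traces sample _1, whose hight bin is 5:

#include <cstdio>
#include <vector>

int main() {
    // arrays copied from the gdb dump of the first (pre-adjustment) tree
    std::vector<int> lteChild = {1, -1}, gtChild = {-2, -3};
    std::vector<int> splitFeature = {0, 2};       // 0 = hight, 2 = face
    std::vector<double> threshold = {4, 2};       // thresholds in bin-index space
    std::vector<double> leafValue = {-0.1, 0.1, 0.033333333333333347};
    // sample _1: hight bin 5; money/face bins are placeholders, unused on this path
    std::vector<double> sampleBins = {5, 0, 0};
    int node = 0;
    while (node >= 0)  // assumed encoding: node < 0 means leaf ~node
        node = sampleBins[splitFeature[node]] <= threshold[node]
                   ? lteChild[node] : gtChild[node];
    std::printf("leaf %d, value %g\n", ~node, leafValue[~node]);  // leaf 1, 0.1
    return 0;
}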

   

Adjusting the outputs:

//GradientDecent.h

virtual RegressionTree& TrainingIteration(BitArray& activeFeatures) override

{

RegressionTree tree = TreeLearner->FitTargets(activeFeatures, AdjustTargetsAndSetWeights());

if (AdjustTreeOutputsOverride == nullptr)

{ // if the base class ObjectiveFunction declares no virtual functions, dynamic_pointer_cast cannot be used... @TODO

(dynamic_pointer_cast<IStepSearch>(ObjectiveFunction))->AdjustTreeOutputs(tree, TreeLearner->Partitioning, *TrainingScores);

}

{

UpdateAllScores(tree);

}

Ensemble.AddTree(tree);

return Ensemble.Tree();

}

   

virtual void AdjustTreeOutputs(RegressionTree& tree, DocumentPartitioning& partitioning, ScoreTracker& trainingScores) override

{

//AutoTimer timer("dynamic_pointer_cast<IStepSearch>(ObjectiveFunction))->AdjustTreeOutputs");

for (int l = 0; l < tree.NumLeaves; l++)

{

Float output = 0.0;

if (_bestStepRankingRegressionTrees)

{

output = _learningRate * tree.GetOutput(l);

}

else

{ // this is the branch taken here

output = (_learningRate * (tree.GetOutput(l) + 1.4E-45)) / (partitioning.Mean(_weights, Dataset.SampleWeights, l, false) + 1.4E-45);

}

if (output > _maxTreeOutput)

{

output = _maxTreeOutput;

}

else if (output < -_maxTreeOutput)

{

output = -_maxTreeOutput;

}

tree.SetOutput(l, output);

}

}

   

(gdb) p _weights

$33 = std::vector of length 8, capacity 8 = {0.010000000000000002, 0.010000000000000002, 0.010000000000000002, 0.010000000000000002, 0.010000000000000002, 0.010000000000000002,

0.010000000000000002, 0.010000000000000002}

   

_learningRate * tree.GetOutput(1) / partitioning.Mean(_weights, ...) = 0.1 * 0.1 / 0.01 = 1
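As an aside, the constant weights of 0.01 match the logistic loss's second derivative |g|(2σ - |g|), which equals σ² = 0.01 when all scores are 0 (an observation, not stated in the source). Ignoring the 1.4E-45 epsilons, the adjustment is then a learning-rate-scaled Newton step; a quick numeric check:

#include <cstdio>
#include <vector>

int main() {
    // hypothetical check of AdjustTreeOutputs' arithmetic, epsilons omitted:
    // adjusted = learningRate * leafValue / meanLeafWeight
    const double learningRate = 0.1, meanLeafWeight = 0.01;
    std::vector<double> leafValue = {-0.1, 0.1, 0.033333333333333347};
    for (double v : leafValue)
        std::printf("%.17g\n", learningRate * v / meanLeafWeight);
    // prints -1, 1, 0.3333..., matching the adjusted _leafValue in the dump below
    return 0;
}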

   

(gdb) p tree

$35 = (gezi::RegressionTree &) @0x7fffffffd480: {<gezi::OnlineRegressionTree> = {

NumLeaves = 3, _gainPValue = std::vector of length 2, capacity 2 = {0.15304198078836101, 0.27523360741160119},

_lteChild = std::vector of length 2, capacity 2 = {1, -1}, _gtChild = std::vector of length 2, capacity 2 = {-2, -3}, _leafValue = std::vector of length 3, capacity 3 = {-1, 1,

0.33333333333333343}, _threshold = std::vector of length 2, capacity 2 = {4, 2}, _splitFeature = std::vector of length 2, capacity 2 = {0, 2},

_splitGain = std::vector of length 2, capacity 2 = {0.026666666666666675, 0.026666666666666679}, _maxOutput = 0.10000000000000002, _previousLeafValue = std::vector of length 2, capacity 2 = {0, -0.033333333333333333}, _weight = 1, _featureNames = 0x6e6a5a <gezi::FastRank::GetActiveFeatures(std::vector<bool, std::allocator<bool> >&)+34>}, _parent = std::vector of length 3, capacity 3 = {1, -1, -2}}

   

After that, UpdateAllScores(tree) updates the scores, i.e. computes F(x) for all 8 samples. Note that with multiple trees, each tree's output is accumulated into the scores:

virtual void AddScores(RegressionTree& tree, DocumentPartitioning& partitioning, Float multiplier = 1)

{

for (int l = 0; l < tree.NumLeaves; l++)

{

int begin;

int count;

ivec& documents = partitioning.ReferenceLeafDocuments(l, begin, count);

Float output = tree.LeafValue(l) * multiplier;

int end = begin + count;

#pragma omp parallel for

for (int i = begin; i < end; i++)

{

Scores[documents[i]] += output;

}

SendScoresUpdatedMessage();

}

}

   

After the first tree is finished, samples _1 and _3 land in leaf 1 (output 1), samples _4, _5 and _7 in leaf 0 (output -1), and samples _0, _2 and _6 in leaf 2 (output 0.3333...):

   

(gdb) p Scores

$7 = std::vector of length 8, capacity 8 = {0.33333333333333343, 1, 0.33333333333333343, 1, -1, -1, 0.33333333333333343, -1}

   

At this point the gradients are computed again:

for (int query = 0; query < Dataset.NumDocs; query++)

{

GetGradientInOneQuery(query, scores);

}

   

_gradient[0] =

2 * 1 * 0.1 / (1 + exp(2 * 1 * 0.1 * 0.33333333333333343))

which can be checked in a Python session: 0.2 / (1.0 + math.exp(2 * 0.1 / 3)) = 0.09666790068611772

   

The gradients to fit now become

(gdb) p _gradient

$9 = std::vector of length 8, capacity 8 = {0.096667900686117719, 0.090033200537504438,

0.096667900686117719, 0.090033200537504438, -0.090033200537504438, -0.090033200537504438,

-0.10333209931388229, -0.090033200537504438}

   

   

The second tree:

p tree

$10 = {<gezi::OnlineRegressionTree> = {NumLeaves = 3,

_gainPValue = std::vector of length 2, capacity 2 = {0.13944890100441296,

0.02357537149418417}, _lteChild = std::vector of length 2, capacity 2 = {-1, -2},

_gtChild = std::vector of length 2, capacity 2 = {1, -3},

_leafValue = std::vector of length 3, capacity 3 = {-0.9721949587186075,

-0.30312179217966367, 0.94840573799486361},

_threshold = std::vector of length 2, capacity 2 = {1, 1},

_splitFeature = std::vector of length 2, capacity 2 = {1, 2},

_splitGain = std::vector of length 2, capacity 2 = {0.024924858166579064,

0.023238200798742146}, _maxOutput = 0.094456333969913306,

_previousLeafValue = std::vector of length 2, capacity 2 = {0, 0.032222633562039242},

_weight = 1,

_featureNames = 0x6e6a5a <gezi::FastRank::GetActiveFeatures(std::vector<bool, std::allocator<bool> >&)+34>}, _parent = std::vector of length 3, capacity 3 = {0, 1, -2}}

   

These are the Scores after accumulating the second tree; if there were a third tree, its gradients would be computed on top of these Scores:

(gdb) p Scores

$11 = std::vector of length 8, capacity 8 = {1.2817390713281971, 0.69687820782033638,

1.2817390713281971, 1.9484057379948636, -1.3031217921796636, -1.9721949587186076,

-0.63886162538527413, -1.3031217921796636}
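As a closing check (plain arithmetic, not library code), each final score is just the sum of the sample's leaf outputs across the trees; for sample _0:

#include <cstdio>

int main() {
    double tree1 = 0.33333333333333343;  // sample _0's leaf output in tree 1
    double tree2 = 0.94840573799486361;  // sample _0's leaf output in tree 2
    std::printf("%.17g\n", tree1 + tree2);  // 1.2817390713281971 = Scores[0]
    return 0;
}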
