Evaluation function eval.py


Notes on voc_eval.py

http://gh606.com/shawncheer/article/details/78317711

Mean Average Precision
Also written mAP, used "to evaluate the ranked retrieval results".
This metric was originally used to evaluate information retrieval and recommendation algorithms. In recent object detection work, mAP is also used to score how good the detection results are. Unlike the plain precision / recall / F-score family, mAP takes the ordering of the results into account: among the positive results, we want the true positives to appear first.
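To make the role of ordering concrete, here is a minimal, self-contained Python sketch (not taken from the sources below; the helper name ranked_ap is made up) that computes non-interpolated AP for a ranked result list, where 1 marks a true positive and 0 a false positive. The same hits ranked earlier give a higher AP:

import numpy as np

def ranked_ap(labels, num_relevant):
    """Non-interpolated AP for a ranked result list.

    labels: 1 for a relevant (tp) result, 0 for an irrelevant (fp) one,
            ordered from highest to lowest score.
    num_relevant: total number of relevant items in the collection.
    """
    labels = np.asarray(labels, dtype=float)
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / num_relevant
    # sum precision * delta_recall over the ranking
    delta_recall = np.diff(np.concatenate(([0.], recall)))
    return float(np.sum(precision * delta_recall))

# the same two hits ranked first score higher than when ranked last
print(ranked_ap([1, 1, 0, 0], num_relevant=2))  # 1.0
print(ranked_ap([0, 0, 1, 1], num_relevant=2))  # ~0.42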

After going through quite a few references, I found the following two to be the most thorough and sufficient to answer most questions. The mAP computation is also analyzed below together with the code.

Source 1: http://blog.sina.com.cn/s/blog_9db078090102whzw.html 

Source 2 (Stanford NLP IR book): https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

Code source: py-faster-rcnn/lib/datasets/voc_eval.py, the AP computation part:
# tp = [1,0,1,0,0,0,...]  size(tp) = all positives
# fp = [0,1,0,1,1,1,...]
# compute precision recall
fp = np.cumsum(fp)        # fp = [0,1,1,2,3,4,...]
tp = np.cumsum(tp)        # tp = [1,1,2,2,2,2,...]
rec = tp / float(npos)    # npos = number of positives; rec is the cumulative recall
"""
Recall computation:
Suppose one run returns x (positive) results, the number of truly relevant results
is r (true/relevant), and the current position is i:
  if i is a fp, delta_rec = 0;
  if i is a tp:
    (1) when x >= r, delta_rec = 1/r;
    (2) otherwise,   delta_rec = 1/x.
In the example above, delta_rec is 1/x.
"""
# avoid divide by zero in case the first detection matches a difficult ground truth
# difficult samples are excluded from all of these computations
prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
ap = voc_ap(rec, prec, use_07_metric)

# compute Average Precision
def voc_ap(rec, prec, use_07_metric=False):
    """ap = voc_ap(rec, prec, [use_07_metric])
    Compute VOC AP given precision and recall.
    If use_07_metric is true, uses the VOC 07 11 point method (default: False).
    """
    # VOC07 version
    if use_07_metric:
        # 11 point metric
        ap = 0.
        # sample recall at 11 points: 0, 0.1, 0.2, ..., 1.0
        for t in np.arange(0., 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                # maximum precision over the range rec >= t
                p = np.max(prec[rec >= t])
            # delta_precision = p; delta_recall = 1/11
            ap = ap + p / 11.
    # versions after VOC07
    else:
        # correct AP calculation
        # first append sentinel values at the end
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))

        # compute the precision envelope
        # (the upper envelope of the precision curve; see "Source 2", Figure 8.2, the red line)
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

        # to calculate area under PR curve, look for points
        # where X axis (recall) changes value
        i = np.where(mrec[1:] != mrec[:-1])[0]

        # and sum (\Delta recall) * prec
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
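As a quick usage sketch (toy numbers, not from py-faster-rcnn), voc_ap can be called on cumulative recall/precision arrays like the ones built above:

import numpy as np

# toy cumulative recall/precision for 4 detections and 2 ground-truth positives
rec = np.array([0.5, 0.5, 1.0, 1.0])
prec = np.array([1.0, 0.5, 0.66, 0.5])

print(voc_ap(rec, prec, use_07_metric=True))   # 11-point AP (VOC07 style)
print(voc_ap(rec, prec, use_07_metric=False))  # area-under-PR-curve AP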

 

【PASCAL VOC】The Pascal Visual Object Classes Challenge: A Retrospective

http://www.javashuo.com/content/p-6577825.html

Overview

This paper was published by the PASCAL VOC organizers in IJCV 2015; it is mainly a retrospective of the 2008-2012 challenges.

Abstract

PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) consists of two main components: (1) a publicly available dataset with ground truth annotations and standardized evaluation software; (2) an annual competition and workshop. 
To evaluate the algorithms submitted on the VOC dataset, several novel methods are introduced: (1) a bootstrapping method for determining whether the difference in performance between two algorithms is significant; (2) a normalized average precision (AP), so that performance can be compared across classes with different proportions of positive instances; (3) a clustering method for visualizing performance across multiple algorithms, so that hard and easy images can be identified; (4) a joint classifier built from the submitted algorithms, used to measure their complementarity and combined performance.

Main body

This is a fairly long paper, so I did not read all of it in detail and mainly focused on the Submission and Evaluation part. 
Section 2 reviews the VOC challenge: a brief introduction to the VOC2012 competition, the dataset, the annotation procedure, and the evaluation criteria. 
Section 3 presents the VOC2012 results, analyzes them with the novel methods, and finally proposes a way to fairly compare methods that predict on the same test set, i.e. method (1) from the abstract. 
Section 4 carries out the evaluation and tries to answer broader questions about which classification and detection problems our field can or cannot solve; methods (2) and (3) from the abstract are used here. 
Section 5 investigates the level of complementarity between different methods, i.e. method (4) from the abstract (super methods). 
Section 6 considers progress over time.

Submission and evaluation

The competition has two phases. In the first phase, each participant receives a development kit: the train/val images with their annotation information, plus annotation files (XML) that can be accessed with the provided MATLAB software. In the second phase, results are submitted on the test set. The test data is only released roughly three months before results have to be submitted. 
Evaluation of the Detection task 
Another metric, IoU, is introduced here. Let Bp be the predicted region and Bgt the ground truth; the area of their intersection divided by the area of their union (intersection over union) is the IoU value. A detection is only considered valid when this value exceeds 50%. The results reported by PASCAL VOC are mAP values computed with IoU >= 0.5. A minimal IoU sketch follows below. 
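A minimal IoU sketch in plain Python (the iou helper name and the [xmin, ymin, xmax, ymax] box format are assumptions for illustration; voc_eval.py itself additionally uses a +1 pixel convention on widths and heights):

def iou(bp, bgt):
    """Intersection over union of predicted box Bp and ground-truth box Bgt,
    both given as [xmin, ymin, xmax, ymax]."""
    ix1, iy1 = max(bp[0], bgt[0]), max(bp[1], bgt[1])
    ix2, iy2 = min(bp[2], bgt[2]), min(bp[3], bgt[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_gt = (bgt[2] - bgt[0]) * (bgt[3] - bgt[1])
    return inter / (area_p + area_gt - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ~= 0.14, below the 0.5 threshold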
Evaluation of the Segmentation task 
IoU is computed for each class. 
mAP: the area under each class's precision-recall curve is used as that class's AP value; before 2007, ROC-AUC was used.

====================================================================================================================================

http://blog.csdn.net/weixin_35653315/article/details/71028523

An introduction to the Pascal VOC dataset:

  • Challenge and tasks (only the content related to Detection and Segmentation is covered)
  • Data format
  • Evaluation metric
  • voc2007, voc2012

Challenge and tasks

Given a natural image, recognize particular objects in it. 
There are 20 object classes to recognize:

  • person
  • bird, cat, cow, dog, horse, sheep
  • aeroplane, bicycle, boat, bus, car, motorbike, train
  • bottle, chair, dining table, potted plant, sofa, tv/monitor

There are the following tasks: 
* Classification (skipped here) 
* Detection: mark every target in the image with a bounding box (bbox) 
* Segmentation: segment every target in the image 
* Person Layout (skipped here)

The rest of this post only covers the content related to Detection and Segmentation.

Dataset

  • All annotated images have the labels needed for Detection, but only part of the data has Segmentation labels.
  • VOC2007 contains 9,963 annotated images, split into train/val/test, with 24,640 annotated objects in total.
  • The labels of the VOC2007 test data have been released; for later years they have not (images only, no labels).
  • For the detection task, VOC2012 trainval/test contains all the corresponding images from 2008-2011. trainval has 11,540 images with 27,450 objects.
  • For the segmentation task, VOC2012 trainval contains all the corresponding images from 2007-2011, while test only covers 2008-2011. trainval has 2,913 images with 6,929 objects.

Detection Ground Truth and Evaluation

Ground truth

<annotation>
    <folder>VOC2007</folder>
    <filename>009961.jpg</filename>
    <source>
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
        <flickrid>334575803</flickrid>
    </source>
    <owner>
        <flickrid>dictioncanary</flickrid>
        <name>Lucy</name>
    </owner>
    <size><!-- image shape -->
        <width>500</width>
        <height>374</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented><!-- whether segmentation labels exist -->
    <object>
        <name>dog</name><!-- class -->
        <pose>Unspecified</pose><!-- pose of the object -->
        <truncated>0</truncated><!-- whether the object is partially occluded (>15%) -->
        <difficult>0</difficult><!-- whether the object is hard to recognize, i.e. the class can only be judged from context; such objects are annotated but usually ignored -->
        <bndbox><!-- bounding box -->
            <xmin>69</xmin>
            <ymin>4</ymin>
            <xmax>392</xmax>
            <ymax>345</ymax>
        </bndbox>
    </object>
</annotation>
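As a hedged sketch (Python standard library only; it reads just the fields shown in the XML above), such an annotation file can be parsed like this:

import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    """Return a list of (class name, difficult flag, [xmin, ymin, xmax, ymax])."""
    tree = ET.parse(xml_path)
    objects = []
    for obj in tree.findall('object'):
        name = obj.find('name').text
        difficult = int(obj.find('difficult').text)
        bb = obj.find('bndbox')
        box = [int(bb.find(tag).text) for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((name, difficult, box))
    return objects

# e.g. parse_voc_annotation('VOCdevkit/VOC2007/Annotations/009961.xml')
# -> [('dog', 0, [69, 4, 392, 345])]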

Evaluation

The submitted results are stored in one file, one detection per line, in the format:

<image identifier> <confidence> <left> <top> <right> <bottom>

For example:

comp3_det_test_car.txt:
000004 0.702732 89 112 516 466
000006 0.870849 373 168 488 229
000006 0.852346 407 157 500 213
000006 0.914587 2 161 55 221
000008 0.532489 175 184 232 201
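A small parsing sketch (the load_detections helper is hypothetical, not part of the devkit) for reading such a result file back into a confidence-sorted list of detections:

def load_detections(result_file):
    """Parse '<image id> <confidence> <left> <top> <right> <bottom>' lines."""
    detections = []
    with open(result_file) as f:
        for line in f:
            fields = line.split()
            image_id, confidence = fields[0], float(fields[1])
            box = [float(v) for v in fields[2:6]]
            detections.append((image_id, confidence, box))
    # evaluation processes detections in descending confidence order
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections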
  • The confidence is used to compute the mean average precision (mAP). The brief procedure is as follows; for details see https://sanchom.wordpress.com/tag/average-precision/ 
    • Sort the results by confidence and compute the precision and recall of the top-1, 2, ..., N results
    • Divide recall into n levels t in [t1, ..., tn]
    • For each level, find the maximum precision among results with recall >= t
    • This gives n maximum precisions; their mean is the AP
        aps = []
        for t in np.arange(0., 1.1, 0.1):  # sample recall at 11 levels: 0, 0.1, ..., 1.0
            # among all precision values whose recall >= t, take the maximum
            mask = tf.greater_equal(recall, t)
            v = tf.reduce_max(tf.boolean_mask(precision, mask))
            aps.append(v / 11.)
        # each value is already divided by 11, so their sum is the average
        ap = tf.add_n(aps)
        return ap

The code above uses the VOC07 computation. VOC2010 and later change how the recall axis is sampled: if there are M positive examples, recall is sampled at the points where it actually changes, [1/M, 2/M, ..., 1]. The remaining steps are unchanged.

  • If an output bbox has IoU greater than 0.5 with a ground truth bbox and the predicted class matches, it is a True Positive; otherwise it is a False Positive.
  • For a given ground truth bbox there can be only one true positive; all additional detections of that object count as false positives. A simplified sketch of this matching rule follows below.
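A simplified sketch of the matching rule for one image and one class (it assumes an iou() helper like the sketch earlier; the real voc_eval.py additionally skips 'difficult' ground truth boxes):

import numpy as np

def match_detections(detections, gt_boxes, iou_thresh=0.5):
    """Greedy TP/FP assignment: detections is a list of (confidence, box),
    gt_boxes a list of ground-truth boxes for the same class and image.
    Returns tp/fp indicator lists in descending-confidence order."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    gt_used = [False] * len(gt_boxes)
    tp, fp = [], []
    for _, box in detections:
        overlaps = [iou(box, gt) for gt in gt_boxes]
        best = int(np.argmax(overlaps)) if overlaps else -1
        if best >= 0 and overlaps[best] >= iou_thresh and not gt_used[best]:
            gt_used[best] = True          # first good match of this ground truth is the TP
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)    # duplicates and low-IoU boxes count as FPs
    return tp, fp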

Segmentation

Ground Truth

The segmentation labels consist of two parts: 
* class segmentation: label the class of every pixel 
* object segmentation: label which object every pixel belongs to 

Evaluation

Per-class precision and overall precision.



 

Analysis of the PASCAL VOC dataset

http://blog.csdn.net/zhangjunbob/article/details/52769381

 

Analysis of the PASCAL VOC dataset
PASCAL VOC provides an excellent, standardized set of datasets for image recognition and classification; from 2005 to 2012 an image recognition challenge was held every year.
This article mainly analyzes the parts of the PASCAL VOC dataset related to object recognition in images.
 
PASCAL VOC2012 is used as the example here; the download link is given in the original post. (The system environment in this article is Ubuntu 14.04.)
After downloading and extracting, the following files can be seen under VOC2012 in the VOCdevkit directory:
For object recognition in images, the parts that matter most are Annotations, ImageSets and JPEGImages.
 
①JPEGImages
 
The JPEGImages folder contains all the images provided by PASCAL VOC, both training and test images.
The images are named in the format "year_id.jpg".
The image sizes vary, but landscape images are roughly 500*375 and portrait images roughly 375*500, usually deviating by less than 100 pixels. (In later training, the first step is to resize these images to 300*300 or 500*500, so no original image is far from this standard.)
These are the images used for training, testing and validation.
 
②Annotations
 
The Annotations folder contains the label files in XML format; each XML file corresponds to one image in the JPEGImages folder.
The concrete format of an XML file is as follows (for 2007_000392.jpg):
 
<annotation>
    <folder>VOC2012</folder>
    <filename>2007_000392.jpg</filename>                 <!-- file name -->
    <source>                                             <!-- image source (not important) -->
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
    </source>
    <size>                                               <!-- image size (width, height, channels) -->
        <width>500</width>
        <height>332</height>
        <depth>3</depth>
    </size>
    <segmented>1</segmented>                             <!-- whether used for segmentation (irrelevant for object recognition) -->
    <object>                                             <!-- a detected object -->
        <name>horse</name>                               <!-- object class -->
        <pose>Right</pose>                               <!-- viewing angle -->
        <truncated>0</truncated>                         <!-- whether truncated (0 = complete) -->
        <difficult>0</difficult>                         <!-- whether the target is hard to recognize (0 = easy) -->
        <bndbox>                                         <!-- bounding box (the two opposite corners as x/y coordinates) -->
            <xmin>100</xmin>
            <ymin>96</ymin>
            <xmax>355</xmax>
            <ymax>324</ymax>
        </bndbox>
    </object>
    <object>                                             <!-- multiple objects can be annotated -->
        <name>person</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>198</xmin>
            <ymin>58</ymin>
            <xmax>286</xmax>
            <ymax>197</ymax>
        </bndbox>
    </object>
</annotation>
The corresponding image:
 
③ImageSets
 
ImageSets contains, for each type of challenge, the corresponding lists of image data.
There are four folders under ImageSets:
Action contains data for human actions (e.g. running, jumping, etc.; this is also part of the VOC challenge).
Layout contains data with human body parts (head, hand, feet, etc.; also part of the VOC challenge).
Main contains the data for image object recognition, covering 20 classes in total.
Segmentation contains the data usable for segmentation.
 
Here we mainly examine the Main folder.
The Main folder contains ***_train.txt, ***_val.txt and ***_trainval.txt for each of the 20 classes.
The content of these txt files looks roughly as follows:
The first field is the image name; the trailing 1 marks a positive sample and -1 a negative sample.
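For illustration only (these image IDs and labels are made up; the real files list thousands of entries), a few lines of e.g. aeroplane_train.txt look like:

2008_000002 -1
2008_000007 -1
2008_000021  1
2008_000033 -1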
_train holds the data used for training; each class has 5,717 train entries.
_val holds the data used for validation; each class has 5,823 val entries.
_trainval merges the two; each class has 11,540 entries.
Train and val must not overlap, i.e. training data and validation data must not share images, and the training subset should be chosen at random.
 
④SegmentationClass and SegmentationObject
 
These two folders contain the segmented versions of the images; they are not used for object recognition, so they are not discussed in detail here.
 
The next thing to study is how to generate training and test data ourselves; a rough sketch follows below.
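As a rough sketch of one way to do this (the paths, the .jpg filter and the 50/50 split ratio are assumptions for illustration, not the official devkit procedure):

import os
import random

def make_split(image_dir, out_dir, val_ratio=0.5, seed=0):
    """Randomly split the image IDs in image_dir into non-overlapping train/val lists."""
    ids = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir) if f.endswith('.jpg'))
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_ratio)
    splits = {'val': ids[:n_val], 'train': ids[n_val:]}
    splits['trainval'] = splits['train'] + splits['val']
    os.makedirs(out_dir, exist_ok=True)
    for name, subset in splits.items():
        with open(os.path.join(out_dir, name + '.txt'), 'w') as f:
            f.write('\n'.join(sorted(subset)) + '\n')

# e.g. make_split('VOCdevkit/VOC2012/JPEGImages', 'VOCdevkit/VOC2012/ImageSets/Main')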
====================================================================================================================

Tag: average precision

It’s a bird… it’s a plane… it… depends on your classifier’s threshold

Evaluation of an information retrieval system (a search engine, for example) generally focuses on two things:
1. How relevant are the retrieved results? (precision)
2. Did the system retrieve many of the truly relevant documents? (recall)

For those that aren’t familiar, I’ll explain what precision and recall are, and for those that are familiar, I’ll explain some of the confusion in the literature when comparing precision-recall curves.

Geese and airplanes

Suppose you have an image collection consisting of airplanes and geese.

[Figure: images of geese and airplanes]
You want your system to retrieve all the airplane images and none of the geese images.
Given a set of images that your system retrieves from this collection, we can define four accuracy counts:
True positives: Airplane images that your system correctly retrieved
True negatives: Geese images that your system correctly did not retrieve
False positives: Geese images that your system incorrectly retrieved, believing them to be airplanes
False negatives: Airplane images that your system incorrectly did not retrieve, believing them to be geese

[Figure: an example retrieval from the collection of geese and airplanes, with three true positives and one false positive.]
Using the terms I just defined, in this example retrieval, there are three true positives and one false positive. How many false negatives are there? How many true negatives are there?

 

There are two false negatives (the airplanes that the system failed to retrieve) and four true negatives (the geese that the system did not retrieve).

Precision and recall

Now, you’ll be able to understand more exactly what precision and recall are.

Precision is the percentage of true positives in the retrieved results. That is:

precision = tp / n

where n is equal to the total number of images retrieved (tp + fp).

Recall is the percentage of the airplanes that the system retrieves. That is:

recall = tp / (tp + fn)
In our example above, with 3 true positives, 1 false positive, 4 true negatives, and 2 false negatives, precision = 0.75, and recall = 0.6.

75% of the retrieved results were airplanes, and 60% of the airplanes were retrieved.

Adjusting the threshold

What if we’re not happy with that performance? We could ask the system to return more examples. This would be done by relaxing our threshold of what we want our system to consider as an airplane. We could also ask our system to be more strict, and return fewer examples. In our example so far, the system retrieved four examples. That corresponds to a particular threshold (shown below by a blue line). The system retrieved the examples that appeared more airplane-like than that threshold.

This is a hypothetical ordering that our airplane retrieval system could give to the images in our collection. More airplane-like are at the top of the list. The blue line is the threshold that gave our example retrieval.

We can move that threshold up and down to get a different set of retrieved documents. At each position of the threshold, we would get a different precision and recall value. Specifically, if we retrieved only the top example, precision would be 100% and recall would be 20%. If we retrieved the top two examples, precision would still be 100%, and recall will have gone up to 40%. The following chart gives precision and recall for the above hypothetical ordering at all the possible thresholds.

Retrieval cutoff Precision Recall
Top 1 image 100% 20%
Top 2 images 100% 40%
Top 3 images 66% 40%
Top 4 images 75% 60%
Top 5 images 60% 60%
Top 6 images 66% 80%
Top 7 images 57% 80%
Top 8 images 50% 80%
Top 9 images 44% 80%
Top 10 images 50% 100%
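The same numbers can be reproduced programmatically; here is a small sketch (the 0/1 labels encode the hypothetical ranking above, 1 = airplane; the printed percentages match the table up to rounding):

import numpy as np

labels = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 1])  # ranked retrieval order, 1 = airplane
num_airplanes = labels.sum()

tp = np.cumsum(labels)
cutoffs = np.arange(1, len(labels) + 1)
precision = tp / cutoffs
recall = tp / num_airplanes
for k, p, r in zip(cutoffs, precision, recall):
    print("Top %2d images  precision=%3.0f%%  recall=%3.0f%%" % (k, 100 * p, 100 * r))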

Precision-recall curves

A good way to characterize the performance of a classifier is to look at how precision and recall change as you change the threshold. A good classifier will be good at ranking actual airplane images near the top of the list, and be able to retrieve a lot of airplane images before retrieving any geese: its precision will stay high as recall increases. A poor classifier will have to take a large hit in precision to get higher recall. Usually, a publication will present a precision-recall curve to show how this tradeoff looks for their classifier. This is a plot of precision p as a function of recall r.

The precision-recall curve for our example airplane classifier. It can achieve 40% recall without sacrificing any precision, but to get 100% recall, its precision drops to 50%.

Average precision

Rather than comparing curves, it’s sometimes useful to have a single number that characterizes the performance of a classifier. A common metric is the average precision. This can actually mean one of several things.

Average precision

Strictly, the average precision is precision averaged across all values of recall between 0 and 1:

AP = integral from 0 to 1 of p(r) dr

That’s equal to taking the area under the curve. In practice, the integral is closely approximated by a sum over the precisions at every possible threshold value, multiplied by the change in recall:

AP = sum over k = 1..N of P(k) * delta_r(k)

where N is the total number of images in the collection, P(k) is the precision at a cutoff of k images, and delta_r(k) is the change in recall that happened between cutoff k-1 and cutoff k.

In our example, this is (1 * 0.2) + (1 * 0.2) + (0.66 * 0) + (0.75 * 0.2) + (0.6 * 0) + (0.66 * 0.2) + (0.57 * 0) + (0.5 * 0) + (0.44 * 0) + (0.5 * 0.2) = 0.782.

Notice that the points at which the recall doesn’t change don’t contribute to this sum (in the graph, these points are on the vertical sections of the plot, where it’s dropping straight down). This makes sense, because since we’re computing the area under the curve, those sections of the curve aren’t adding any area.

Interpolated average precision

Some authors choose an alternate approximation that is called the interpolated average precision. Often, they still call it average precision. Instead of using P(k), the precision at a retrieval cutoff of k images, the interpolated average precision uses:

P_interp(k) = max over k' >= k of P(k')

In other words, instead of using the precision that was actually observed at cutoff k, the interpolated average precision uses the maximum precision observed across all cutoffs with higher recall. The full equation for computing the interpolated average precision is:

AP_interp = sum over k = 1..N of P_interp(k) * delta_r(k)
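A small sketch of the interpolated variant, assuming precision and recall are arrays indexed by the cutoff k as in the sums above:

import numpy as np

def interpolated_ap(precision, recall):
    """Interpolated AP: at each cutoff use the best precision achievable
    at that recall level or higher, then sum precision * delta recall."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    # running maximum of precision taken from the highest-recall end backwards
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]
    delta_recall = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(p_interp * delta_recall))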

Visually, here’s how the interpolated average precision compares to the approximated average precision (to show a more interesting plot, this one isn’t from the earlier example):

The approximated average precision closely hugs the actually observed curve. The interpolated average precision overestimates the precision at many points and produces a higher average precision value than the approximated average precision.

Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at a fixed 11 points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}. This is called the 11-point interpolated average precision. Others sample at every k where the recall changes.

Confusion

Some important publications use the interpolated average precision as their metric and still call it average precision. For example, the PASCAL Visual Objects Challenge has used this as their evaluation metric since 2007. I don’t think their justification is strong. They say, “the intention in interpolating the precision/recall curve in this way is to reduce the impact of the “wiggles” in the precision/recall curve”. Regardless, everyone compares against each other on this metric, so within the competition, this is not an issue. However, the rest of us need to be careful when comparing “average precision” values against other published results. Are we using the VOC’s interpolated average precision, while previous work had used the non-interpolated average precision? This would incorrectly show improvement of a new method when compared to the previous work.

Summary

Precision and recall are useful metrics for evaluating the performance of a classifier.

Precision and recall vary with the strictness of your classifier’s threshold.

There are several ways to summarize the precision-recall curve with a single number called average precision; be sure you’re using the same metric as the previous work that you’re comparing with.

===============================================================================================================================

http://blog.csdn.net/applecore123456/article/details/53164538

Fast-RCNN Code Walkthrough (1)

This post records some simple notes on the Fast-RCNN source code, based on my modest coding background, shared here as reading notes. This time it mainly covers the test procedure. Similar to training, the test procedure in Fast-RCNN mainly consists of test_net.py, test.py and a few other utility Python files (e.g. bbox_transform.py). test_net.py is the main entry point for testing: it parses the input arguments, contains "__main__", and calls the test_net() function in test.py, which implements the whole test procedure.


Fast-RCNN test code walkthrough

root/tools/test_net.py

root/lib/fast_rcnn/test.py

  • def im_detect(net, im, boxes=None)

    • This function implements detection. What I care about most here is how, at test time, a detected proposal is transformed into the target bbox. In Fast-RCNN, bounding-box regression actually regresses the transform (dx, dy, dw, dh) between a box and the target bbox closest to it, so at test time the detected box has to be mapped through the regressed transform to obtain the final bbox.
  • box_deltas

    if cfg.TEST.BBOX_REG:  # cfg.TEST.BBOX_REG = {bool} True
        # Apply bounding-box regression deltas
        box_deltas = blobs_out['bbox_pred']
        pred_boxes = bbox_transform_inv(boxes, box_deltas)
        pred_boxes = clip_boxes(pred_boxes, im.shape)
    else:
        # Simply repeat the boxes, once for each class
        pred_boxes = np.tile(boxes, (1, scores.shape[1]))

root/lib/fast_rcnn/bbox_transform.py

  • def bbox_transform_inv(boxes, deltas) 
    • This function applies the deltas output at test time to the selective search proposals to obtain the final bboxes.
    • The code is very simple, as shown below:
def bbox_transform_inv(boxes, deltas):
    if boxes.shape[0] == 0:
        return np.zeros((0, deltas.shape[1]), dtype=deltas.dtype)

    boxes = boxes.astype(deltas.dtype, copy=False)

    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]  # start from 0 and jump by 4: [0, 4, 8, ...]
    dy = deltas[:, 1::4]
    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    # x1
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    # y1
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    # x2
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    # y2
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h

    return pred_boxes
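A toy usage sketch (the numbers are made up): with dx = dy = dh = 0 and dw = log 2, the predicted box keeps its centre and doubles its width:

import numpy as np

boxes = np.array([[10., 10., 50., 30.]])         # one proposal: x1, y1, x2, y2
deltas = np.array([[0., 0., np.log(2.), 0.]])    # dx, dy, dw, dh for a single class

print(bbox_transform_inv(boxes, deltas))
# same centre, doubled width: approximately [[-10.5, 10., 71.5, 31.]]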
 

