Evaluation function eval.py


Analysis of voc_eval.py

http://gh606.com/shawncheer/article/details/78317711

Mean Average Precision
Also known as mAP, it is used "to evaluate the ranked retrieval results".
This metric was originally used to evaluate information retrieval and recommendation algorithms. While working on an object detection project recently, I found that mAP is also used to judge the quality of detection results. Unlike the usual precision / recall / F-score metrics, mAP takes the ranking order into account: among the positive results, we want the true positives to appear first.

After going through quite a few references, I found the following two to be the most thorough; they are enough to answer the main questions. The mAP computation is also analyzed below alongside the code.

Source 1: http://blog.sina.com.cn/s/blog_9db078090102whzw.html

Source 2 (Stanford NLP group, IR book): https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

Code source: py-faster-rcnn/lib/datasets/voc_eval.py, the AP computation part:
# tp = [1,0,1,0,0,0,...]   size(tp) = number of all positive detections
# fp = [0,1,0,1,1,1,...]
# compute precision recall
fp = np.cumsum(fp)          # fp = [0,1,1,2,3,4,...]
tp = np.cumsum(tp)          # tp = [1,1,2,2,2,2,...]
rec = tp / float(npos)      # npos = number of ground-truth positives; rec is now the cumulative recall
"""
Computing recall:
Suppose one query returns x (positive) results and the number of truly relevant results is r (true/relevant),
and let i be the current position in the ranked list:
  if position i is a fp, delta_rec = 0;
  if position i is a tp: (1) if x >= r, delta_rec = 1/r; (2) otherwise, delta_rec = 1/x.
In the example above, delta_rec is 1/x.
"""
# avoid divide by zero in case the first detection matches a difficult ground truth
# (all computations ignore "difficult" samples)
prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
ap = voc_ap(rec, prec, use_07_metric)   # compute Average Precision


def voc_ap(rec, prec, use_07_metric=False):
    """ap = voc_ap(rec, prec, [use_07_metric])
    Compute VOC AP given precision and recall.
    If use_07_metric is true, uses the VOC 07 11 point method (default: False).
    """
    # VOC07 version
    if use_07_metric:
        # 11 point metric
        ap = 0.
        # sample recall at 11 points: 0, 0.1, 0.2, ..., 1.0
        for t in np.arange(0., 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                # maximum precision over the range where recall >= t
                p = np.max(prec[rec >= t])
            # delta_precision = p; delta_recall = 1/11
            ap = ap + p / 11.
    # versions after VOC07
    else:
        # correct AP calculation
        # first append sentinel values at the end
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))
        # compute the precision envelope
        # (the goal is to find the envelope of the precision curve; see the red line in Figure 8.2 of "source 2")
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
        # to calculate area under PR curve, look for points
        # where X axis (recall) changes value
        i = np.where(mrec[1:] != mrec[:-1])[0]
        # and sum (\Delta recall) * prec
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
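To see the two pieces above working together, here is a minimal usage sketch; the tp/fp flags and npos are invented purely for illustration, and voc_ap is assumed to be the function defined above:

import numpy as np

# hypothetical ranked detections for one class: 1 = true positive, 0 = false positive
tp = np.array([1., 0., 1., 1., 0., 1.])
fp = 1. - tp
npos = 5                    # assumed number of ground-truth positives

tp = np.cumsum(tp)
fp = np.cumsum(fp)
rec = tp / float(npos)
prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)

print(voc_ap(rec, prec, use_07_metric=True))    # 11-point AP
print(voc_ap(rec, prec, use_07_metric=False))   # area-under-envelope AP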

 

[PASCAL VOC] The Pascal Visual Object Classes Challenge: A Retrospective

http://www.javashuo.com/content/p-6577825.html

Overview

This paper was published by the PASCAL VOC organizers in IJCV 2015 and is mainly a retrospective of the 2008-2012 challenges.

Abstract

PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) consists of two main parts: (1) a publicly available dataset, including ground-truth annotations and standardized evaluation software; (2) an annual competition and workshop.
To evaluate the algorithms submitted on the VOC dataset, several novel methods were introduced: (1) a bootstrapping method for determining whether the performance difference between two algorithms is significant; (2) a normalized average precision (AP), so that performance can be compared across classes with different proportions of positive instances; (3) a clustering method for visualizing performance across multiple algorithms, so that hard and easy images can be identified; (4) a joint classifier built from the submitted algorithms, used to measure their complementarity and combined performance.

Main body

This is a fairly long paper, so I did not read all of it in detail; I mainly focused on the Submission and Evaluation part.
Section 2 reviews the VOC challenge and gives a brief introduction to the VOC2012 competition, the dataset, the annotation procedure, and the evaluation criteria.
Section 3 presents the VOC2012 results, analyzes them with novel techniques, and finally proposes a way to fairly compare methods that predict on the same test set, i.e. method (1) from the abstract.
Section 4 carries out the evaluation and tries to answer broader questions about which classification and detection problems our field can or cannot solve; methods (2) and (3) from the abstract are used here.
Section 5 investigates the degree of complementarity of the different methods, i.e. method (4) from the abstract (super methods).
Section 6 considers progress over time.

Submission and evaluation

The competition has two phases. In the first phase, each participant receives a development kit: the train/val images with their annotations, plus MATLAB software for accessing the annotation files (XML). In the second phase, results are submitted on the test set; the test data is released roughly three months before results are due.
Evaluation of the Detection task
Another metric, IoU, is introduced here. Let Bp be the predicted region and Bgt the ground-truth region; IoU is the area of their intersection divided by the area of their union (intersection over union). A detection is considered correct only if this value is greater than 50%; the results reported by PASCAL VOC are mAP values at IoU >= 0.5.
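As a concrete illustration of the IoU described above, a minimal sketch (boxes are assumed to be [xmin, ymin, xmax, ymax]; the +1 terms follow the VOC convention of inclusive pixel coordinates):

def iou(box_p, box_gt):
    # intersection rectangle
    ix1 = max(box_p[0], box_gt[0])
    iy1 = max(box_p[1], box_gt[1])
    ix2 = min(box_p[2], box_gt[2])
    iy2 = min(box_p[3], box_gt[3])
    iw = max(ix2 - ix1 + 1.0, 0.0)
    ih = max(iy2 - iy1 + 1.0, 0.0)
    inter = iw * ih
    # union = area(Bp) + area(Bgt) - intersection
    area_p = (box_p[2] - box_p[0] + 1.0) * (box_p[3] - box_p[1] + 1.0)
    area_gt = (box_gt[2] - box_gt[0] + 1.0) * (box_gt[3] - box_gt[1] + 1.0)
    return inter / (area_p + area_gt - inter)

print(iou([89, 112, 516, 466], [100, 120, 500, 460]))  # values above 0.5 would count as a correct detection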
Evaluation of the Segmentation task
IoU is computed for each class (a per-class IoU sketch follows below).
For mAP, the area under each class's precision-recall curve is used as that class's AP, and mAP is the mean over classes; before 2007, ROC-AUC was used instead.
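For the segmentation task, here is a hedged sketch of the per-class IoU mentioned above. Label maps are assumed to be integer arrays of class indices, and treating 255 as a "void" label to be ignored is an assumption borrowed from the usual VOC convention:

import numpy as np

def per_class_iou(pred, gt, num_classes, void_label=255):
    """pred, gt: integer label maps of the same shape."""
    ious = []
    valid = gt != void_label
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            ious.append(np.nan)   # class not present in this image
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / float(union))
    return ious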

====================================================================================================================================

http://blog.csdn.net/weixin_35653315/article/details/71028523

An introduction to the Pascal VOC dataset:

  • Challenge and tasks; only the Detection and Segmentation parts are covered.
  • Data format
  • Evaluation metrics
  • voc2007, voc2012

Challenge and tasks

Given a natural image, recognize the specific objects it contains.
There are 20 object classes to recognize:

  • person
  • bird, cat, cow, dog, horse, sheep
  • aeroplane, bicycle, boat, bus, car, motorbike, train
  • bottle, chair, dining table, potted plant, sofa, tv/monitor

There are the following tasks:
* Classification (skipped)
* Detection: draw a bounding box (bbox) around every target object in the image
* Segmentation: segment every target object in the image
* Person Layout (skipped)

The rest of this article only covers the content related to Detection and Segmentation.

Dataset

  • All annotated images have the labels needed for Detection, but only part of the data has Segmentation labels.
  • VOC2007 contains 9,963 annotated images, split into train/val/test, with 24,640 annotated objects in total.
  • The labels for the VOC2007 test set have been released; for later years they have not (images only, no labels).
  • For the detection task, VOC2012 trainval/test contains all corresponding images from 2008-2011; trainval has 11,540 images with 27,450 objects.
  • For the segmentation task, VOC2012 trainval contains all corresponding images from 2007-2011, while test only covers 2008-2011; trainval has 2,913 images with 6,929 objects.

Detection Ground Truth and Evaluation

Ground truth

<annotation>
    <folder>VOC2007</folder>
    <filename>009961.jpg</filename>
    <source>
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
        <flickrid>334575803</flickrid>
    </source>
    <owner>
        <flickrid>dictioncanary</flickrid>
        <name>Lucy</name>
    </owner>
    <size><!-- image shape -->
        <width>500</width>
        <height>374</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented><!-- whether a segmentation label exists -->
    <object>
        <name>dog</name><!-- class -->
        <pose>Unspecified</pose><!-- pose of the object -->
        <truncated>0</truncated><!-- whether the object is partially truncated/occluded (>15%) -->
        <difficult>0</difficult><!-- whether the object is hard to recognize, mainly objects that can only be identified from context; annotated but usually ignored -->
        <bndbox><!-- bounding box -->
            <xmin>69</xmin>
            <ymin>4</ymin>
            <xmax>392</xmax>
            <ymax>345</ymax>
        </bndbox>
    </object>
</annotation>
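A minimal sketch for reading such an annotation file with Python's standard library (the file path is hypothetical):

import xml.etree.ElementTree as ET

tree = ET.parse('Annotations/009961.xml')   # hypothetical path
for obj in tree.findall('object'):
    name = obj.find('name').text
    difficult = int(obj.find('difficult').text)
    bbox = obj.find('bndbox')
    box = [int(bbox.find(tag).text) for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
    print(name, difficult, box)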

Evaluation

The submitted results are stored in a single file, one detection per line, in the format:

<image identifier> <confidence> <left> <top> <right> <bottom>

For example:

comp3_det_test_car.txt:
000004 0.702732 89 112 516 466
000006 0.870849 373 168 488 229
000006 0.852346 407 157 500 213
000006 0.914587 2 161 55 221
000008 0.532489 175 184 232 201
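A hedged sketch of writing detections in this per-class format; the detections list and the output filename here are made up for illustration:

# each detection: (image_id, confidence, xmin, ymin, xmax, ymax)
detections = [
    ('000004', 0.702732, 89, 112, 516, 466),
    ('000006', 0.870849, 373, 168, 488, 229),
]

with open('comp3_det_test_car.txt', 'w') as f:
    for image_id, conf, x1, y1, x2, y2 in detections:
        f.write('{:s} {:.6f} {:d} {:d} {:d} {:d}\n'.format(image_id, conf, x1, y1, x2, y2))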
  • The confidence is used to compute the mean average precision (mAP). A brief outline follows; see https://sanchom.wordpress.com/tag/average-precision/ for details.
    • Sort the results by confidence and compute the precision and recall of the top-1, 2, ..., N results.
    • Divide recall into n thresholds t in [t1, ..., tn].
    • For each t, find the maximum precision among the results with recall >= t.
    • Average the n maximum precisions to obtain the AP.
        aps = []
        for t in np.arange(0., 1.1, 0.1):   # split recall into the 11 fixed thresholds
            # among all precisions where recall >= t, take the maximum
            mask = tf.greater_equal(recall, t)
            v = tf.reduce_max(tf.boolean_mask(precision, mask))
            aps.append(v / 11.)
        # sum the per-threshold terms (each already divided by 11) to get the average
        ap = tf.add_n(aps)
        return ap

The code above implements the VOC07 computation. VOC2010 changed how the recall thresholds are chosen: if there are M positive examples, recall is sampled at the M points [1/M, 2/M, ..., 1] (the values at which recall can actually change). The remaining steps are unchanged.
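A hedged numpy sketch of this VOC2010-style sampling; rec and prec are assumed to be the cumulative recall/precision arrays computed as in the voc_eval snippet earlier, and M the number of positive examples. This follows the description above, not the official devkit code:

import numpy as np

def ap_voc2010_style(rec, prec, M):
    # sample recall at 1/M, 2/M, ..., 1 and average the max precision at each level
    ap = 0.0
    for t in np.arange(1, M + 1) / float(M):
        mask = rec >= t
        p = np.max(prec[mask]) if np.any(mask) else 0.0
        ap += p / float(M)
    return ap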

  • If an output bbox has IoU greater than 0.5 with a ground-truth bbox of the same class, it is a true positive; otherwise it is a false positive.
  • Each ground-truth bbox can yield only one true positive; any further detections matched to it are counted as false positives.
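A hedged sketch of this matching rule, written from the description above rather than copied from voc_eval.py; detections are assumed to be sorted by descending confidence, boxes are [xmin, ymin, xmax, ymax] arrays, and the "difficult" flag is ignored:

import numpy as np

def assign_tp_fp(det_boxes, gt_boxes, iou_thresh=0.5):
    """det_boxes: (D, 4) detections of one class in one image, sorted by confidence.
    gt_boxes:  (G, 4) ground-truth boxes of the same class in that image."""
    D, G = len(det_boxes), len(gt_boxes)
    tp = np.zeros(D)
    fp = np.zeros(D)
    matched = np.zeros(G, dtype=bool)
    for d in range(D):
        if G == 0:
            fp[d] = 1.0
            continue
        bb = det_boxes[d]
        # IoU of this detection against every ground-truth box
        ixmin = np.maximum(gt_boxes[:, 0], bb[0])
        iymin = np.maximum(gt_boxes[:, 1], bb[1])
        ixmax = np.minimum(gt_boxes[:, 2], bb[2])
        iymax = np.minimum(gt_boxes[:, 3], bb[3])
        iw = np.maximum(ixmax - ixmin + 1.0, 0.0)
        ih = np.maximum(iymax - iymin + 1.0, 0.0)
        inter = iw * ih
        union = ((bb[2] - bb[0] + 1.0) * (bb[3] - bb[1] + 1.0)
                 + (gt_boxes[:, 2] - gt_boxes[:, 0] + 1.0) * (gt_boxes[:, 3] - gt_boxes[:, 1] + 1.0)
                 - inter)
        overlaps = inter / union
        best = int(np.argmax(overlaps))
        if overlaps[best] > iou_thresh and not matched[best]:
            tp[d] = 1.0
            matched[best] = True   # a ground-truth box can be matched at most once
        else:
            fp[d] = 1.0
    return tp, fp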

Segmentation

Ground Truth

The segmentation labels consist of two parts:
* class segmentation: labels the class of every pixel
* object segmentation: labels which object instance every pixel belongs to

Evaluation

Per-class precision and overall precision.



Analysis of the PASCAL VOC dataset

http://blog.csdn.net/zhangjunbob/article/details/52769381

 

PASCAL VOC provides a complete set of standardized, high-quality datasets for image recognition and classification, and an image recognition challenge was held every year from 2005 to 2012.
This article mainly analyzes the parts of the PASCAL VOC dataset that are related to object recognition in images.
 
PASCAL VOC2012 is used as the example here; it can be downloaded from the official PASCAL VOC site. (The system environment in this article is Ubuntu 14.04.)
After downloading and extracting the archive, several folders can be found under VOCdevkit/VOC2012.
For image object recognition, the important ones are Annotations, ImageSets, and JPEGImages.
 
①JPEGImages
 
The JPEGImages folder contains all of the images provided by PASCAL VOC, including both training and test images.
The images are named in the form "year_id.jpg".
The image sizes vary, but landscape images are roughly 500*375 and portrait images roughly 375*500, usually within about 100 pixels of that. (In later training, the first step is typically to resize these images to 300*300 or 500*500, so the originals should not be too far from this standard.)
These are the images used for training, testing, and validation.
 
②Annotations
 
The Annotations folder stores the label files in XML format; each XML file corresponds to one image in the JPEGImages folder.
The XML format is as follows (for 2007_000392.jpg):
 
<annotation>
    <folder>VOC2012</folder>
    <filename>2007_000392.jpg</filename>                               // file name
    <source>                                                           // image source (not important)
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
    </source>
    <size>                                                             // image size (width, height, and number of channels)
        <width>500</width>
        <height>332</height>
        <depth>3</depth>
    </size>
    <segmented>1</segmented>                                           // whether used for segmentation (0/1; irrelevant for object recognition)
    <object>                                                           // an annotated object
        <name>horse</name>                                             // object class
        <pose>Right</pose>                                             // viewpoint
        <truncated>0</truncated>                                       // whether the object is truncated (0 means complete)
        <difficult>0</difficult>                                       // whether the object is difficult to recognize (0 means easy)
        <bndbox>                                                       // bounding box ((xmin, ymin) and (xmax, ymax) corner coordinates)
            <xmin>100</xmin>
            <ymin>96</ymin>
            <xmax>355</xmax>
            <ymax>324</ymax>
        </bndbox>
    </object>
    <object>                                                           // an image may contain multiple objects
        <name>person</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>198</xmin>
            <ymin>58</ymin>
            <xmax>286</xmax>
            <ymax>197</ymax>
        </bndbox>
    </object>
</annotation>
The corresponding image (not shown here) contains a horse and a person.
 
③ImageSets
 
ImageSets stores the image lists used by each type of challenge.
There are four folders under ImageSets:
Action contains the data for human actions (e.g. running, jumping; also part of the VOC challenge).
Layout contains the data with human body parts (head, hand, feet, etc.; also part of the VOC challenge).
Main contains the data for image object recognition, covering the 20 classes.
Segmentation contains the data usable for segmentation.
 
Here we mainly look at the Main folder.
The Main folder contains ***_train.txt, ***_val.txt and ***_trainval.txt files for each of the 20 classes.
The contents of these txt files all look roughly the same:
each line starts with an image name, followed by 1 for a positive sample or -1 for a negative sample (see the parsing sketch below).
_train holds the data used for training; each class's train list has 5,717 entries.
_val holds the data used for validation; each class's val list has 5,823 entries.
_trainval is the union of the two; each class has 11,540 entries.
train and val must be disjoint, i.e. the training data and the validation data must not overlap, and the training data should be selected at random.
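A hedged sketch for reading one of these per-class lists; the file path is hypothetical, and only the 1 / -1 labels described above are handled:

positives, negatives = [], []
with open('ImageSets/Main/car_train.txt') as f:   # hypothetical path
    for line in f:
        parts = line.split()
        if len(parts) < 2:
            continue
        image_id, label = parts[0], int(parts[1])
        (positives if label == 1 else negatives).append(image_id)
print(len(positives), len(negatives))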
 
④ SegmentationClass and SegmentationObject
 
These two folders store the segmentation label images; they are not used for object recognition and are not discussed further here.
 
The next thing to study is how to generate our own training and test data.
====================================================================================================================


It’s a bird… it’s a plane… it… depends on your classifier’s threshold

Evaluation of an information retrieval system (a search engine, for example) generally focuses on two things:
1. How relevant are the retrieved results? (precision)
2. Did the system retrieve many of the truly relevant documents? (recall)

For those that aren’t familiar, I’ll explain what precision and recall are, and for those that are familiar, I’ll explain some of the confusion in the literature when comparing precision-recall curves.

Geese and airplanes

Suppose you have an image collection consisting of airplanes and geese.

(Figure: images of geese and airplanes.)
You want your system to retrieve all the airplane images and none of the geese images.
Given a set of images that your system retrieves from this collection, we can define four accuracy counts:
True positives: Airplane images that your system correctly retrieved
True negatives: Geese images that your system correctly did not retrieve
False positives: Geese images that your system incorrectly retrieved, believing them to be airplanes
False negatives: Airplane images that your system incorrectly did not retrieve, believing them to be geese

(Figure: an example retrieval from the collection of geese and airplanes, with three true positives and one false positive.)
Using the terms I just defined, in this example retrieval, there are three true positives and one false positive. How many false negatives are there? How many true negatives are there?

 

There are two false negatives (the airplanes that the system failed to retrieve) and four true negatives (the geese that the system did not retrieve).

Precision and recall

Now, you’ll be able to understand more exactly what precision and recall are.

Precision is the percentage of true positives in the retrieved results. That is:

precision = tp / n

where n is equal to the total number of images retrieved (tp + fp).

Recall is the percentage of the airplanes that the system retrieves. That is:

recall = tp / (tp + fn)

where tp + fn is the total number of airplane images in the collection.
In our example above, with 3 true positives, 1 false positive, 4 true negatives, and 2 false negatives, precision = 0.75, and recall = 0.6.

75% of the retrieved results were airplanes, and 60% of the airplanes were retrieved.

Adjusting the threshold

What if we're not happy with that performance? We could ask the system to return more examples. This would be done by relaxing our threshold of what we want our system to consider as an airplane. We could also ask our system to be more strict, and return fewer examples. In our example so far, the system retrieved four examples. That corresponds to a particular threshold (shown below by a blue line). The system retrieved the examples that appeared more airplane-like than that threshold.

This is a hypothetical ordering that our airplane retrieval system could give to the images in our collection. More airplane-like are at the top of the list. The blue line is the threshold that gave our example retrieval.

We can move that threshold up and down to get a different set of retrieved documents. At each position of the threshold, we would get a different precision and recall value. Specifically, if we retrieved only the top example, precision would be 100% and recall would be 20%. If we retrieved the top two examples, precision would still be 100%, and recall will have gone up to 40%. The following chart gives precision and recall for the above hypothetical ordering at all the possible thresholds.

Retrieval cutoff Precision Recall
Top 1 image 100% 20%
Top 2 images 100% 40%
Top 3 images 66% 40%
Top 4 images 75% 60%
Top 5 images 60% 60%
Top 6 images 66% 80%
Top 7 images 57% 80%
Top 8 images 50% 80%
Top 9 images 44% 80%
Top 10 images 50% 100%

Precision-recall curves

A good way to characterize the performance of a classifier is to look at how precision and recall change as you change the threshold. A good classifier will be good at ranking actual airplane images near the top of the list, and be able to retrieve a lot of airplane images before retrieving any geese: its precision will stay high as recall increases. A poor classifier will have to take a large hit in precision to get higher recall. Usually, a publication will present a precision-recall curve to show how this tradeoff looks for their classifier. This is a plot of precision p as a function of recall r.

The precision-recall curve for our example airplane classifier. It can achieve 40% recall without sacrificing any precision, but to get 100% recall, its precision drops to 50%.

Average precision

Rather than comparing curves, it's sometimes useful to have a single number that characterizes the performance of a classifier. A common metric is the average precision. This can actually mean one of several things.

Average precision

Strictly, the average precision is precision averaged across all values of recall between 0 and 1:

average precision = ∫₀¹ p(r) dr

That's equal to taking the area under the curve. In practice, the integral is closely approximated by a sum over the precisions at every possible threshold value, multiplied by the change in recall:

average precision ≈ Σ_{k=1..N} P(k) · Δr(k)

where N is the total number of images in the collection, P(k) is the precision at a cutoff of k images, and Δr(k) is the change in recall that happened between cutoff k-1 and cutoff k.

In our example, this is (1 * 0.2) + (1 * 0.2) + (0.66 * 0) + (0.75 * 0.2) + (0.6 * 0) + (0.66 * 0.2) + (0.57 * 0) + (0.5 * 0) + (0.44 * 0) + (0.5 * 0.2) = 0.782.

Notice that the points at which the recall doesn’t change don’t contribute to this sum (in the graph, these points are on the vertical sections of the plot, where it’s dropping straight down). This makes sense, because since we’re computing the area under the curve, those sections of the curve aren’t adding any area.
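The same arithmetic can be reproduced with a short numpy sketch, using the (rounded) precision/recall values from the table above:

import numpy as np

precision = np.array([1.00, 1.00, 0.66, 0.75, 0.60, 0.66, 0.57, 0.50, 0.44, 0.50])
recall = np.array([0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 1.0])

delta_r = np.diff(recall, prepend=0.0)   # change in recall at each cutoff
ap = np.sum(precision * delta_r)
print(ap)   # approximately 0.782, matching the sum above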

Interpolated average precision

Some authors choose an alternate approximation that is called the interpolated average precision. Often, they still call it average precision. Instead of using P(k), the precision at a retrieval cutoff of k images, the interpolated average precision uses:

P_interp(k) = max_{k' ≥ k} P(k')

In other words, instead of using the precision that was actually observed at cutoff k, the interpolated average precision uses the maximum precision observed across all cutoffs with higher recall. The full equation for computing the interpolated average precision is:

interpolated average precision = Σ_{k=1..N} ( max_{k' ≥ k} P(k') ) · Δr(k)

Visually, here’s how the interpolated average precision compares to the approximated average precision (to show a more interesting plot, this one isn’t from the earlier example):

The approximated average precision closely hugs the actually observed curve. The interpolated average precision overestimates the precision at many points and produces a higher average precision value than the approximated average precision.

Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at a fixed 11 points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}. This is called the 11-point interpolated average precision. Others sample at every k where the recall changes.
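To make the difference concrete, here is a hedged sketch on a tiny made-up ranking (three relevant items retrieved in the order TP, FP, TP, TP), sampling at every cutoff where recall changes:

import numpy as np

# hypothetical ranked results: TP, FP, TP, TP with 3 relevant items in total
precision = np.array([1.0, 0.5, 2/3., 0.75])
recall = np.array([1/3., 1/3., 2/3., 1.0])
delta_r = np.diff(recall, prepend=0.0)

ap_approx = np.sum(precision * delta_r)                      # ~0.81, uses the observed precision
# interpolated: at each cutoff use the max precision at any cutoff with >= recall
prec_interp = np.maximum.accumulate(precision[::-1])[::-1]   # precision "envelope"
ap_interp = np.sum(prec_interp * delta_r)                    # ~0.83, never lower than ap_approx
print(ap_approx, ap_interp)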

Confusion

Some important publications use the interpolated average precision as their metric and still call it average precision. For example, the PASCAL Visual Objects Challenge has used this as their evaluation metric since 2007. I don’t think their justification is strong. They say, “the intention in interpolating the precision/recall curve in this way is to reduce the impact of the “wiggles” in the precision/recall curve”. Regardless, everyone compares against each other on this metric, so within the competition, this is not an issue. However, the rest of us need to be careful when comparing “average precision” values against other published results. Are we using the VOC’s interpolated average precision, while previous work had used the non-interpolated average precision? This would incorrectly show improvement of a new method when compared to the previous work.

Summary

Precision and recall are useful metrics for evaluating the performance of a classifier.

Precision and recall vary with the strictness of your classifier’s threshold.

There are several ways to summarize the precision-recall curve with a single number called average precision; be sure you’re using the same metric as the previous work that you’re comparing with.

===============================================================================================================================

http://blog.csdn.net/applecore123456/article/details/53164538

Fast-RCNN code walkthrough (1)

This post gives a simple reading of the Fast-RCNN source code based on my modest coding background; it is essentially a set of code-reading notes that I am sharing here. This time I record the testing process. Similar to training, testing in Fast-RCNN mainly consists of test_net.py, test.py and a few other utility Python files (for example bbox_transform.py). test_net.py is the main entry point for testing: it parses the input arguments, contains "__main__", and calls the test_net() function in test.py, which implements the whole testing process.


Fast-RCNN test code walkthrough

root/tools/test_net.py

root/lib/fast_rcnn/test.py

  • def im_detect(net, im, boxes=None)

    • This function implements the detection step. What I care about most here is how a detected proposal is transformed into the target bbox at test time. In Fast-RCNN, bounding-box regression actually regresses the transform (dx, dy, dw, dh) from the box closest to the target bbox to that target, so during testing the detected boxes must be transformed by the regression output to obtain the final bboxes (a sketch of the corresponding forward transform is given after the code block below).
  • box_deltas

    if cfg.TEST.BBOX_REG:   # cfg.TEST.BBOX_REG = {bool} True
        # Apply bounding-box regression deltas
        box_deltas = blobs_out['bbox_pred']
        pred_boxes = bbox_transform_inv(boxes, box_deltas)
        pred_boxes = clip_boxes(pred_boxes, im.shape)
    else:
        # Simply repeat the boxes, once for each class
        pred_boxes = np.tile(boxes, (1, scores.shape[1]))
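For reference, a hedged sketch of the forward transform that produces the regression targets (dx, dy, dw, dh) mentioned above. It is written from the formulas implied by bbox_transform_inv below rather than copied from the repository, and the function name is my own:

import numpy as np

def bbox_regression_targets(ex_boxes, gt_boxes):
    """ex_boxes, gt_boxes: (N, 4) arrays of [x1, y1, x2, y2]."""
    ex_w = ex_boxes[:, 2] - ex_boxes[:, 0] + 1.0
    ex_h = ex_boxes[:, 3] - ex_boxes[:, 1] + 1.0
    ex_cx = ex_boxes[:, 0] + 0.5 * ex_w
    ex_cy = ex_boxes[:, 1] + 0.5 * ex_h

    gt_w = gt_boxes[:, 2] - gt_boxes[:, 0] + 1.0
    gt_h = gt_boxes[:, 3] - gt_boxes[:, 1] + 1.0
    gt_cx = gt_boxes[:, 0] + 0.5 * gt_w
    gt_cy = gt_boxes[:, 1] + 0.5 * gt_h

    # center shift normalized by the proposal size, plus log scale change
    dx = (gt_cx - ex_cx) / ex_w
    dy = (gt_cy - ex_cy) / ex_h
    dw = np.log(gt_w / ex_w)
    dh = np.log(gt_h / ex_h)
    return np.stack([dx, dy, dw, dh], axis=1)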

root/lib/fast_rcnn/bbox_transform.py

  • def bbox_transform_inv(boxes, deltas)
    • This function applies the deltas predicted at test time to the proposals obtained from selective search to produce the final bboxes.
    • The code is very simple, as shown below:
def bbox_transform_inv(boxes, deltas):
    if boxes.shape[0] == 0:
        return np.zeros((0, deltas.shape[1]), dtype=deltas.dtype)

    boxes = boxes.astype(deltas.dtype, copy=False)

    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]   # start from 0 and jump by 4: [0, 4, 8, ...]
    dy = deltas[:, 1::4]
    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    # x1
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    # y1
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    # x2
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    # y2
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h

    return pred_boxes
 

