A TensorFlow Implementation of the R-CNN Algorithm
Reposted from: https://blog.csdn.net/MyJournal/article/details/77841348?locationNum=9&fps=1
The overall idea of the algorithm is roughly as follows:
1. Train a face classification model. Input: an image; output: the features of that image.
1-1. Pre-train on the Caltech256 dataset to build a reasonably large image-recognition model;
1-2. Fine-tune the pre-trained model with a face / non-face dataset to obtain a face classification model.
2. Train an SVM model (with the positive and negative samples redefined). Input: image features; output: image class.
3. Split the image into many rectangular proposals and classify each proposal region with the SVM model, i.e., decide whether the region contains a face.
4. Use a regressor to refine the positions of the candidate boxes.
Each step is explained in detail below.
1. Training the face classification model
Having just about mastered MNIST handwritten-digit recognition, a beginner's instinct is to define a neural network (usually borrowing the layer structure of a model that performs well on image classification, such as AlexNet or VGG16; VGG16 is said to be computationally expensive, and I have not tried it) and start training right away. But one question has to be considered first: how large is the dataset we have chosen for the model?
Suppose my network has seven layers: the first four are convolution-plus-pooling layers and the last three are fully connected layers. How much data does such a relatively complex network need? A few thousand images? Tens of thousands? Probably still not enough. With too few images the model is easy to train badly, so we borrow a model that someone else has already trained on a large dataset. Note, however, that once you borrow someone else's model, the network you define for fine-tuning must have exactly the same structure, except for the final number of output classes.
The neural network I defined is as follows:
def inference(input_tensor, train, regularizer, num):
    with tf.name_scope('layer1-conv1'):
        conv1_weights = tf.get_variable("weight1", [5, 5, 3, 32],
                                        initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv1_biases = tf.get_variable("bias1", [32], initializer=tf.constant_initializer(0.0))
        conv1 = tf.nn.conv2d(input_tensor, conv1_weights, strides=[1, 1, 1, 1], padding='SAME')
        relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv1_biases))
    with tf.name_scope("layer2-pool1"):
        pool1 = tf.nn.max_pool(relu1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
    with tf.variable_scope("layer3-conv2"):
        conv2_weights = tf.get_variable("weight2", [5, 5, 32, 64],
                                        initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv2_biases = tf.get_variable("bias2", [64], initializer=tf.constant_initializer(0.0))
        conv2 = tf.nn.conv2d(pool1, conv2_weights, strides=[1, 1, 1, 1], padding='SAME')
        relu2 = tf.nn.relu(tf.nn.bias_add(conv2, conv2_biases))
    with tf.name_scope("layer4-pool2"):
        pool2 = tf.nn.max_pool(relu2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
    with tf.variable_scope("layer5-conv3"):
        conv3_weights = tf.get_variable("weight3", [3, 3, 64, 128],
                                        initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv3_biases = tf.get_variable("bias3", [128], initializer=tf.constant_initializer(0.0))
        conv3 = tf.nn.conv2d(pool2, conv3_weights, strides=[1, 1, 1, 1], padding='SAME')
        relu3 = tf.nn.relu(tf.nn.bias_add(conv3, conv3_biases))
    with tf.name_scope("layer6-pool3"):
        pool3 = tf.nn.max_pool(relu3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
    with tf.variable_scope("layer7-conv4"):
        conv4_weights = tf.get_variable("weight4", [3, 3, 128, 128],
                                        initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv4_biases = tf.get_variable("bias4", [128], initializer=tf.constant_initializer(0.0))
        conv4 = tf.nn.conv2d(pool3, conv4_weights, strides=[1, 1, 1, 1], padding='SAME')
        relu4 = tf.nn.relu(tf.nn.bias_add(conv4, conv4_biases))
    with tf.name_scope("layer8-pool4"):
        pool4 = tf.nn.max_pool(relu4, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # With 100x100x3 inputs, four 2x2 poolings leave a 6x6x128 feature map
    nodes = 6 * 6 * 128
    reshaped = tf.reshape(pool4, [-1, nodes])

    with tf.variable_scope('layer9-fc1'):
        fc1_weights = tf.get_variable("weight5", [nodes, 1024],
                                      initializer=tf.truncated_normal_initializer(stddev=0.1))
        if regularizer != None:
            tf.add_to_collection('losses1', regularizer(fc1_weights))
        fc1_biases = tf.get_variable("bias5", [1024], initializer=tf.constant_initializer(0.1))
        fc1 = tf.nn.relu(tf.matmul(reshaped, fc1_weights) + fc1_biases)
        if train:
            fc1 = tf.nn.dropout(fc1, 0.5)
    with tf.variable_scope('layer10-fc2'):
        fc2_weights = tf.get_variable("weight6", [1024, 512],
                                      initializer=tf.truncated_normal_initializer(stddev=0.1))
        if regularizer != None:
            tf.add_to_collection('losses2', regularizer(fc2_weights))
        fc2_biases = tf.get_variable("bias6", [512], initializer=tf.constant_initializer(0.1))
        fc2 = tf.nn.relu(tf.matmul(fc1, fc2_weights) + fc2_biases)
        if train:
            fc2 = tf.nn.dropout(fc2, 0.5)
    with tf.variable_scope('layer11-fc3'):
        fc3_weights = tf.get_variable("weight7", [512, num],
                                      initializer=tf.truncated_normal_initializer(stddev=0.1))
        if regularizer != None:
            tf.add_to_collection('losses3', regularizer(fc3_weights))
        fc3_biases = tf.get_variable("bias7", [num], initializer=tf.constant_initializer(0.1))
        logit = tf.matmul(fc2, fc3_weights) + fc3_biases

    return logit  # fc3
1-1. Pre-training on Caltech256 (a dataset with 256 image categories, including objects, animals, people, and so on)
The final classifier has 256 classes, so simply set num in the network above to 256.
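The post does not show the pre-training objective itself. Below is a minimal sketch of how the logits returned by inference could be wired to a softmax cross-entropy loss for the 256 Caltech256 classes; the placeholder names, regularizer scale, optimizer and learning rate are my own assumptions rather than the original code.

import tensorflow as tf

# Hypothetical pre-training setup (names and hyperparameters are assumptions)
x = tf.placeholder(tf.float32, [None, 100, 100, 3], name='x-input')   # 100x100 RGB crops
y_ = tf.placeholder(tf.int64, [None], name='y-input')                 # integer labels 0..255

regularizer = tf.contrib.layers.l2_regularizer(0.0001)
logits = inference(x, True, regularizer, 256)                          # 256 Caltech256 classes

cross_entropy = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_, logits=logits))
# add the three regularization collections defined inside inference
loss = (cross_entropy
        + tf.add_n(tf.get_collection('losses1'))
        + tf.add_n(tf.get_collection('losses2'))
        + tf.add_n(tf.get_collection('losses3')))

train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)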
Save the trained model to model.ckpt; the fine-tuning step later reloads this pre-trained model.
checkpoint_file = os.path.join(log_dir, 'model.ckpt')
saver.save(sess, checkpoint_file)
1-2. Fine-tuning
Here I first describe what the paper does. (Since my computer is too slow, I did not follow it exactly; I simply pulled in a face / non-face dataset for fine-tuning, and the results were not great...)
Suppose the localization system is meant to locate four kinds of targets: men, women, cats, and dogs. Then the last layer of the fine-tuned network gets num = 5 (4 + 1), where the extra class represents background. Where does the background come from? First, the images must be annotated in advance with the target positions; each image yields one or more ground-truth rectangles (x, y, w, h: the minimum x coordinate, the minimum y coordinate, the box width, and the box height). Second, we generate many candidate boxes (proposals) with the selectivesearch function from the Python selectivesearch library, which merges pixels into regions according to colour changes, texture, and so on. Next, we compute the IoU between each proposal and the ground-truth box (IoU = overlap area / total area covered by the two rectangles, where one rectangle is the ground-truth box and the other is the proposal) and compare it with a threshold: if the IoU is above the threshold, the proposal is labelled as one of the four classes (man, woman, cat, or dog); if it is below the threshold, the proposal is labelled as background. The paper uses threshold = 0.5. Finally, we load the pre-trained model and train on these images, fine-tuning all parameters from the pre-trained starting point.
IoU is defined as follows:
def if_intersection(xmin_a, xmax_a, ymin_a, ymax_a, xmin_b, xmax_b, ymin_b, ymax_b):
    if_intersect = False
    # Four checks to see whether the two boxes overlap; if none of them holds, there is no intersection
    if xmin_a < xmax_b <= xmax_a and (ymin_a < ymax_b <= ymax_a or ymin_a <= ymin_b < ymax_a):
        if_intersect = True
    elif xmin_a <= xmin_b < xmax_a and (ymin_a < ymax_b <= ymax_a or ymin_a <= ymin_b < ymax_a):
        if_intersect = True
    elif xmin_b < xmax_a <= xmax_b and (ymin_b < ymax_a <= ymax_b or ymin_b <= ymin_a < ymax_b):
        if_intersect = True
    elif xmin_b <= xmin_a < xmax_b and (ymin_b < ymax_a <= ymax_b or ymin_b <= ymin_a < ymax_b):
        if_intersect = True
    else:
        return False
    # If they overlap, sort the coordinates of both boxes and compute the intersection area
    if if_intersect == True:
        x_sorted_list = sorted([xmin_a, xmax_a, xmin_b, xmax_b])  # from small to large
        y_sorted_list = sorted([ymin_a, ymax_a, ymin_b, ymax_b])
        x_intersect_w = x_sorted_list[2] - x_sorted_list[1]
        y_intersect_h = y_sorted_list[2] - y_sorted_list[1]
        area_inter = x_intersect_w * y_intersect_h
        return area_inter

def IOU(ver1, ver2):
    # convert (x, y, w, h) to (xmin, ymin, xmax, ymax)
    vertice1 = [ver1[0], ver1[1], ver1[0] + ver1[2], ver1[1] + ver1[3]]
    vertice2 = [ver2[0], ver2[1], ver2[0] + ver2[2], ver2[1] + ver2[3]]
    area_inter = if_intersection(vertice1[0], vertice1[2], vertice1[1], vertice1[3],
                                 vertice2[0], vertice2[2], vertice2[1], vertice2[3])
    # If there is an intersection, compute the IoU
    if area_inter:
        area_1 = ver1[2] * ver1[3]
        area_2 = ver2[2] * ver2[3]
        iou = float(area_inter) / (area_1 + area_2 - area_inter)
        return iou
    iou = 0
    return iou
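A quick sanity check with two made-up boxes in the (x, y, w, h) convention: two 100x100 boxes offset horizontally by 50 pixels overlap in a 50x100 region, so IoU = 5000 / (10000 + 10000 - 5000) = 1/3, which the fine-tuning rule (threshold 0.5) would label as background.

ground_truth = [0, 0, 100, 100]    # (x, y, w, h) of a labelled face
proposal = [50, 0, 100, 100]       # a proposal shifted 50 px to the right

iou_val = IOU(ground_truth, proposal)
print(iou_val)                     # 5000 / 15000 = 0.333...

# fine-tuning labelling rule from the paper: IoU >= 0.5 -> object class, otherwise background (0)
label = 1 if iou_val >= 0.5 else 0
print(label)                       # 0, i.e. background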
Load the pre-trained model and run the fine-tuning training:
def load_with_skip(data_path, session, skip_layer):
    # read every tensor from the checkpoint except the layers listed in skip_layer
    reader = pywrap_tensorflow.NewCheckpointReader(data_path)
    data_dict = reader.get_variable_to_shape_map()
    for key in data_dict:
        print("tensor_name: ", key)
        if key not in skip_layer:
            print(data_dict[key])
            print(reader.get_tensor(key))
            session.run([key])

saver = tf.train.Saver()
with tf.Session() as sess:
    restore = False
    sess.run(tf.global_variables_initializer())
    ckpt1 = tf.train.get_checkpoint_state(aim_dir)
    if ckpt1 and ckpt1.model_checkpoint_path:
        restore = True
        saver.restore(sess, ckpt1.model_checkpoint_path)
        print('fine-tuning model has already exist!')
        print("Continue training")
    else:
        ckpt = tf.train.get_checkpoint_state(log_dir)
        if ckpt and ckpt.model_checkpoint_path:
            restore = True
            print('original model has already exist!')
            print("Continue training")
            # skip the three fully connected scopes defined in inference so they are
            # re-initialised for the new class count
            load_with_skip(ckpt.model_checkpoint_path, sess, ['layer9-fc1', 'layer10-fc2', 'layer11-fc3'])
2. Training the SVM models. The paper puts it this way:
(1) How the SVM training data differs from the CNN fine-tuning data:
‘for finetuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled “background” (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.’
In the fine-tuning stage, crops whose proposal has IoU greater than 0.5 with the ground truth are used as positive samples, and crops with IoU below 0.5 as negative samples. In the per-class SVM training stage, by contrast, only the crops from the ground-truth boxes are used as positive samples, proposals with IoU below 0.3 are used as negative samples, and the remaining proposals are discarded.
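Expressed as code, the two labelling schemes differ only in what counts as a positive and in the thresholds. This is only an illustrative sketch; the helper names are mine, not part of the original code.

# Hypothetical helpers illustrating the two labelling schemes
def label_for_finetune(iou_val, class_index, threshold=0.5):
    # fine-tuning: IoU >= 0.5 -> object class, otherwise background (0)
    return class_index if iou_val >= threshold else 0

def label_for_svm(is_ground_truth, iou_val, class_index):
    # SVM training: only ground-truth boxes are positives;
    # proposals with IoU < 0.3 are negatives; the grey zone in between is ignored (None)
    if is_ground_truth:
        return class_index
    if iou_val < 0.3:
        return 0
    return None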
(2) One linear SVM is trained per object class
‘Once features are extracted and training labels are applied, we optimize one linear SVM per class.’
A simple way to think of an SVM (support vector machine): it looks for a (hyper)plane that separates one class from its opposite as cleanly as possible (a binary classification problem).
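As a toy illustration, here is scikit-learn's LinearSVC (the same class used in train_svms below) separating two made-up 2-D classes; the data points are invented for demonstration only.

from sklearn import svm

# Four made-up 2-D feature vectors: two "background" samples and two "face" samples
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]]
Y = [0, 0, 1, 1]

clf = svm.LinearSVC()
clf.fit(X, Y)                      # find a separating hyperplane
print(clf.predict([[0.8, 0.9]]))   # -> [1]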
We feed the samples into the fine-tuned model and take the output of one of the fully connected layers as the feature vector; these features, together with their labels (the positive and negative samples defined above), are used to train the SVM model.
(3) Why use an SVM?
‘In Appendix B we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss the trade-offs involved in training detection SVMs rather than simply using the outputs from the final softmax layer of the fine-tuned CNN.’
The appendix of the paper explains why the objects are classified with SVMs instead of directly using the fine-tuned CNN's softmax output.
def load_from_pkl(dataset_file):
    X, Y = pickle.load(open(dataset_file, 'rb'))
    return X, Y

def load_train_proposals(datafile, num_clss, threshold=0.5, svm=False, save=False, save_path='dataset.pkl'):
    train_list = open(datafile, 'r')
    labels = []
    images = []
    n = 0
    for line in train_list:
        n = n + 1
        print('n: ' + str(n))
        tmp = line.strip().split(' ')
        # tmp[0] = image path
        # tmp[1] = label
        # tmp[2] = ground-truth rectangle (x,y,w,h)
        img = skimage.io.imread(tmp[0])
        ref_rect = tmp[2].split(',')
        ref_rect_int = [int(i) for i in ref_rect]
        print(ref_rect)
        # selective_search parameters: scale - larger values keep larger segments;
        # sigma - width of the Gaussian kernel used by the Felzenszwalb segmentation;
        # min_size - minimum segment size after segmentation
        img_lbl, regions = selectivesearch.selective_search(img, scale=200, sigma=0.3, min_size=25)
        candidates = set()
        for r in regions:
            # exclude duplicate rectangles (the same box produced by different segments)
            if r['rect'] in candidates:
                continue
            # exclude boxes that are too small or too large
            if r['size'] < 220:
                continue
            if r['size'] > 4000:
                continue
            proposal_img, proposal_vertice = clip_pic(img, r['rect'])
            # exclude empty crops
            if len(proposal_img) == 0:
                continue
            x, y, w, h = r['rect']
            # exclude boxes with zero width or height
            if w == 0 or h == 0:
                continue
            # exclude boxes whose aspect ratio is far from square
            if h / w <= 0.7 or h / w >= 1.3:
                continue
            # exclude image arrays with a zero dimension
            [a, b, c] = np.shape(proposal_img)
            if a == 0 or b == 0 or c == 0:
                continue
            im = Image.fromarray(proposal_img)
            resized_proposal_img = resize_image(im, 100, 100, resize_mode=3)  # resize the crop
            candidates.add(r['rect'])
            img_float = pil_to_nparray(resized_proposal_img)
            images.append(img_float)
            # compute the IoU against the ground-truth box
            iou_val = IOU(ref_rect_int, proposal_vertice)
            # labels: 0 represents the default class, which is background
            index = int(tmp[1])
            if svm == False:
                # fine-tuning labels as one-hot vectors
                label = np.zeros(num_clss + 1)
                if iou_val < threshold:
                    label[0] = 1
                else:
                    label[index] = 1
                labels.append(label)
            else:
                # SVM labels as plain class indices
                if iou_val < threshold:
                    labels.append(0)
                else:
                    labels.append(index)
            print(r['rect'])
            print('iou_val: ' + str(iou_val))
            print('labels append!')
    if save:
        pickle.dump((images, labels), open(save_path, 'wb'))
    return images, labels

def generate_single_svm_train(one_class_train_file):
    # build (or restore) the training samples for one class's SVM
    trainfile = one_class_train_file
    savepath = one_class_train_file.replace('txt', 'pkl')
    print(savepath)
    images = []
    Y = []
    if os.path.isfile(savepath):
        print("restoring svm dataset " + savepath)
        images, Y = load_from_pkl(savepath)
    else:
        print("loading svm dataset " + savepath)
        images, Y = load_train_proposals(trainfile, 3, threshold=0.3, svm=True, save=True, save_path=savepath)
    return images, Y

def train_svms(train_file_folder, model):
    listings = os.listdir(train_file_folder)
    print(listings)
    svms = []
    for train_file in listings:
        if "pkl" in train_file:
            continue
        X, Y = generate_single_svm_train(train_file_folder + train_file)
        print(np.shape(X))
        print('success!')
        train_features = []
        for i in range(0, len(Y)):
            imgsvm = X[i]
            labelsvm = Y[i]
            print('svm LABEL:' + str(labelsvm))
            feats, prelabel = Restore_show(imgsvm)  # CNN features for this proposal
            train_features.append(feats[0])
        print("feature dimension")
        clf = svm.LinearSVC()
        print("fit svm")
        clf.fit(train_features, Y)
        print(clf)
        print(clf.score(train_features, Y))  # print the goodness of fit on the training set
        joblib.dump(clf, os.getcwd() + '/svm/filename.pkl')  # save the SVM model
        svms.append(clf)
    print(svms)
    return svms
3. Split the image into rectangular proposals with selectivesearch, classify each proposal region with the SVM model, i.e., decide whether it contains a face, and record the regions labelled 1 (the ones containing a face):
# image_proposal works like load_train_proposals above: it generates and filters the candidate boxes
imgs, verts = image_proposal(img_path)

with tf.Session() as sess:
    features = []
    box_images = []
    print("predict image:")
    results = []
    results_label = []
    results_ratio = []
    count = 0
    number = 0
    temp = []
    for f in imgs:
        # Restore_show feeds the crop through the CNN classification model; it returns
        # the features, the predicted label, and the probability of being a face
        feats, prelabel, ratio = Restore_show(f)
        clf = joblib.load(os.getcwd() + '/svm/filename.pkl')  # load the SVM model
        pred = clf.predict(feats[0])  # predict from the CNN features feats[0]
        print(pred)
        if pred[0] != 0:
            results.append(verts[count])
            results_label.append(pred[0])
            results_ratio.append(ratio)
            temp.append((ratio, verts[count][0], verts[count][1], verts[count][2], verts[count][3]))
            number += 1
        count += 1
4. Refine the candidate box positions with a regressor (bounding-box regression)
This part is explained carefully in the paper and in many blog posts, so I will not repeat the main formulas here. The rough idea is that there is always some error between the ground-truth box and the proposal, and we need to learn a relationship that re-estimates the proposal's centre and size. To keep that relationship (approximately) linear, the proposals used for ridge regression should have an IoU with the ground-truth box above 0.6 (the value chosen in the paper; I used 0.7 and the result also looked acceptable).
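For reference, these are the standard R-CNN bounding-box regression relations (P is a proposal, G its matched ground-truth box, and d_x, d_y, d_w, d_h the four regressors); the prediction code in 4-2 applies the second group to each recorded box's (x, y, w, h):

    training targets:        t_x = (G_x - P_x) / P_w,    t_y = (G_y - P_y) / P_h,
                             t_w = log(G_w / P_w),        t_h = log(G_h / P_h)

    applying the regressors: G'_x = P_w * d_x(P) + P_x,   G'_y = P_h * d_y(P) + P_y,
                             G'_w = P_w * exp(d_w(P)),    G'_h = P_h * exp(d_h(P))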
4-1. The inputs for ridge-regression training are: the features of the image's ground-truth crop, the ground-truth box's centre coordinates, width and height (x, y, w, h), and the proposal box's centre coordinates, width and height (x, y, w, h).
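The original post does not include the regression training code. Below is a minimal sketch, under my own assumptions, of how the four regressors later loaded as filenamex/filenamey/filenamew/filenameh.pkl could be trained with scikit-learn's Ridge. Here features, proposals and ground_truths are assumed to be prepared elsewhere (one CNN feature vector per proposal, with the proposals already filtered by the IoU threshold), and the helper name train_box_regressors is hypothetical.

from sklearn.linear_model import Ridge
import joblib
import numpy as np
import math
import os

def train_box_regressors(features, proposals, ground_truths):
    # proposals and ground_truths are lists of (x, y, w, h) boxes, matched pairwise
    tx, ty, tw, th = [], [], [], []
    for (px, py, pw, ph), (gx, gy, gw, gh) in zip(proposals, ground_truths):
        tx.append((gx - px) / pw)
        ty.append((gy - py) / ph)
        tw.append(math.log(gw / pw))
        th.append(math.log(gh / ph))
    X = np.array(features)
    # one ridge regressor per coordinate, saved under the names used by the prediction code
    for name, target in zip(['x', 'y', 'w', 'h'], [tx, ty, tw, th]):
        reg = Ridge(alpha=1.0)
        reg.fit(X, target)
        joblib.dump(reg, os.getcwd() + '/boxregression/filename' + name + '.pkl')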
4-2. Prediction:
# Output_show works like Restore_show: it feeds the image through the CNN classification model
# and returns the features and the predicted label
feature, classnum = Output_show(img_path, 0, 0, size[0], size[1])

clf = joblib.load(os.getcwd() + '/boxregression/filenamex.pkl')  # load the ridge regression models
predx = clf.predict(feature)
clf = joblib.load(os.getcwd() + '/boxregression/filenamey.pkl')
predy = clf.predict(feature)
clf = joblib.load(os.getcwd() + '/boxregression/filenamew.pkl')
predw = clf.predict(feature)
clf = joblib.load(os.getcwd() + '/boxregression/filenameh.pkl')
predh = clf.predict(feature)

for i in range(number - 1, -1, -1):
    if i not in flag_not:
        print(temp[i][1], temp[i][2], temp[i][3], temp[i][4])
        x = float(temp[i][1])
        y = float(temp[i][2])
        w = float(temp[i][3])
        h = float(temp[i][4])
        x1 = max(w * predx + x, 0)
        y1 = max(h * predy + y, 0)
        w1 = w * math.exp(predw)
        h1 = h * math.exp(predh)
        print(str(x1) + ' ' + str(y1) + ' ' + str(w1) + ' ' + str(h1))
        rect = mpatches.Rectangle((x1, y1), w1, h1, fill=False, edgecolor='red', linewidth=2)
        ax.add_patch(rect)   # draw the box after bounding-box regression
        rect1 = mpatches.Rectangle((x, y), w, h, fill=False, edgecolor='white', linewidth=2)
        ax.add_patch(rect1)  # draw the box before regression
        out_ratio = str(temp[i][1])
        plt.text(x1 + 15, y1 + 15, str(temp[i][0]), color='red')  # write the predicted probability next to the box
References:
1. http://blog.csdn.net/bixiwen_liu/article/details/53840913
2. http://blog.csdn.net/ture_dream/article/details/52896452
3. http://blog.csdn.net/daunxx/article/details/51578787
4. https://github.com/rbgirshick/rcnn
5. http://www.cnblogs.com/edwardbi/p/5647522.html