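The native xgboost package provides a dump_model method, which shows how each base learner's decision tree chooses features when splitting nodes. The base learners have two properties:
- each one is a binary tree;
- a feature can be selected more than once to split the data held by the current node.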
The booster dump generated by dump_model has the following format:
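As a rough sketch of that layout, assuming the plain-text dump that xgboost writes: one `booster[n]:` header per tree, one line per node, and one tab of indentation per level of depth. The sample string below is hand-written for illustration only (node ids, thresholds and leaf values are hypothetical), with three-digit feature names as the script in this post expects. `parse_node` is a helper name introduced here, not part of xgboost.

```python
import re

# Hand-written sample in the style of an xgboost text dump (all values made up).
SAMPLE_DUMP = (
    "booster[0]:\n"
    "0:[005<0.5] yes=1,no=2,missing=1\n"
    "\t1:[053<1.5] yes=3,no=4,missing=3\n"
    "\t\t3:leaf=0.1\n"
    "\t\t4:leaf=-0.2\n"
    "\t2:leaf=0.3\n"
)

# Leading tabs give the depth; then "<node id>:[<feature><<threshold>]".
NODE_RE = re.compile(r'^(\t*)(\d+):\[(\w+)<([^\]]+)\]')


def parse_node(line):
    """Return (depth, node_id, feature, threshold) for a split node, else None."""
    m = NODE_RE.match(line)
    if m is None:  # tree header or leaf line
        return None
    tabs, node_id, feature, threshold = m.groups()
    return len(tabs), int(node_id), feature, float(threshold)


nodes = [n for n in map(parse_node, SAMPLE_DUMP.splitlines()) if n is not None]
for node in nodes:
    print(node)
```

Only split nodes survive the filter; header and leaf lines return `None`, which is the same selection the script below makes before counting feature pairs.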
We can parse this tree structure to find out how often features are used together for splitting in the base learners. A simple script:
    # -*- coding: utf-8 -*-
    import re
    from collections import Counter

    with open('./tree_like.txt', 'r') as f:
        lines = f.readlines()

    # Preprocessing: keep only split nodes, encoded as '*' depth markers + feature name
    comp = []
    for line in lines:
        # One '*' per level of depth (assumes the dump indents each level with a tab)
        new_line = line.replace('\t', '*')
        # Skip leaves and "booster[n]:" headers; keep indented nodes and the root node 0
        if 'leaf' not in line and (new_line.startswith('*') or new_line.startswith('0')):
            # Drop the node id and threshold; keep the depth markers and the 3-digit feature name
            regular = re.sub(r'(\**)[0-9]{1,2}:\[([0-9]{3})<.*', r'\1\2', new_line).strip()
            comp.append(regular)

    # Parsing: count how often each parent-child feature pair occurs
    res = Counter()
    for i, cur in enumerate(comp):
        if i + 1 >= len(comp):
            break  # the last node has no successor to inspect
        cur_depth = cur.count('*')
        first_child = comp[i + 1]
        child_depth = first_child.count('*')
        if child_depth > cur_depth:
            # The next, deeper split node is a child of cur
            res[cur.replace('*', '') + '-' + first_child.replace('*', '')] += 1
            # The sibling, if it is a split node, is the next node at the same depth
            for x in comp[i + 2:]:
                if x.count('*') < child_depth:
                    break  # left this subtree: the other child was a leaf
                if x.count('*') == child_depth:
                    res[cur.replace('*', '') + '-' + x.replace('*', '')] += 1
                    break

    print(res)
The result is as follows:
Features 005 and 053 form a parent-child pair 3 times, features 053 and 017 form one 2 times, and so on.
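To make such counts easier to read, they can be sorted by frequency. A minimal sketch, using only the two counts quoted above (the remaining pairs are not listed here, so they are left out); `rank_pairs` is a hypothetical helper name, not part of the original script.

```python
from collections import Counter

# Only the two pair counts quoted in the text; the full result has more entries.
pair_counts = Counter({"005-053": 3, "053-017": 2})


def rank_pairs(counts):
    """Return (parent, child, count) tuples, most frequent pair first."""
    ranked = []
    for pair, n in counts.most_common():
        parent, child = pair.split("-")
        ranked.append((parent, child, n))
    return ranked


for parent, child, n in rank_pairs(pair_counts):
    print(f"{parent} -> {child}: {n}")
```

Splitting each key back into parent and child assumes feature names contain no `-` character, which holds for the three-digit names used here.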