數據集:seeds.tsv

15.26 14.84 0.871 5.763 3.312 2.221 5.22 Kama 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 Kama 14.29 14.09 0.905 5.291 3.337 2.699 4.825 Kama 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 Kama 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 Kama 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 Kama 14.69 14.49 0.8799 5.563 3.259 3.586 5.219 Kama 14.11 14.1 0.8911 5.42 3.302 2.7 5.0 Kama 16.63 15.46 0.8747 6.053 3.465 2.04 5.877 Kama 16.44 15.25 0.888 5.884 3.505 1.969 5.533 Kama 15.26 14.85 0.8696 5.714 3.242 4.543 5.314 Kama 14.03 14.16 0.8796 5.438 3.201 1.717 5.001 Kama 13.89 14.02 0.888 5.439 3.199 3.986 4.738 Kama 13.78 14.06 0.8759 5.479 3.156 3.136 4.872 Kama 13.74 14.05 0.8744 5.482 3.114 2.932 4.825 Kama 14.59 14.28 0.8993 5.351 3.333 4.185 4.781 Kama 13.99 13.83 0.9183 5.119 3.383 5.234 4.781 Kama 15.69 14.75 0.9058 5.527 3.514 1.599 5.046 Kama 14.7 14.21 0.9153 5.205 3.466 1.767 4.649 Kama 12.72 13.57 0.8686 5.226 3.049 4.102 4.914 Kama 14.16 14.4 0.8584 5.658 3.129 3.072 5.176 Kama 14.11 14.26 0.8722 5.52 3.168 2.688 5.219 Kama 15.88 14.9 0.8988 5.618 3.507 0.7651 5.091 Kama 12.08 13.23 0.8664 5.099 2.936 1.415 4.961 Kama 15.01 14.76 0.8657 5.789 3.245 1.791 5.001 Kama 16.19 15.16 0.8849 5.833 3.421 0.903 5.307 Kama 13.02 13.76 0.8641 5.395 3.026 3.373 4.825 Kama 12.74 13.67 0.8564 5.395 2.956 2.504 4.869 Kama 14.11 14.18 0.882 5.541 3.221 2.754 5.038 Kama 13.45 14.02 0.8604 5.516 3.065 3.531 5.097 Kama 13.16 13.82 0.8662 5.454 2.975 0.8551 5.056 Kama 15.49 14.94 0.8724 5.757 3.371 3.412 5.228 Kama 14.09 14.41 0.8529 5.717 3.186 3.92 5.299 Kama 13.94 14.17 0.8728 5.585 3.15 2.124 5.012 Kama 15.05 14.68 0.8779 5.712 3.328 2.129 5.36 Kama 16.12 15.0 0.9 5.709 3.485 2.27 5.443 Kama 16.2 15.27 0.8734 5.826 3.464 2.823 5.527 Kama 17.08 15.38 0.9079 5.832 3.683 2.956 5.484 Kama 14.8 14.52 0.8823 5.656 3.288 3.112 5.309 Kama 14.28 14.17 0.8944 5.397 3.298 6.685 5.001 Kama 13.54 13.85 0.8871 5.348 3.156 2.587 5.178 Kama 13.5 13.85 0.8852 5.351 3.158 2.249 5.176 Kama 13.16 13.55 0.9009 5.138 3.201 2.461 4.783 Kama 15.5 14.86 0.882 5.877 3.396 4.711 5.528 Kama 15.11 14.54 0.8986 5.579 3.462 3.128 5.18 Kama 13.8 14.04 0.8794 5.376 3.155 1.56 4.961 Kama 15.36 14.76 0.8861 5.701 3.393 1.367 5.132 Kama 14.99 14.56 0.8883 5.57 3.377 2.958 5.175 Kama 14.79 14.52 0.8819 5.545 3.291 2.704 5.111 Kama 14.86 14.67 0.8676 5.678 3.258 2.129 5.351 Kama 14.43 14.4 0.8751 5.585 3.272 3.975 5.144 Kama 15.78 14.91 0.8923 5.674 3.434 5.593 5.136 Kama 14.49 14.61 0.8538 5.715 3.113 4.116 5.396 Kama 14.33 14.28 0.8831 5.504 3.199 3.328 5.224 Kama 14.52 14.6 0.8557 5.741 3.113 1.481 5.487 Kama 15.03 14.77 0.8658 5.702 3.212 1.933 5.439 Kama 14.46 14.35 0.8818 5.388 3.377 2.802 5.044 Kama 14.92 14.43 0.9006 5.384 3.412 1.142 5.088 Kama 15.38 14.77 0.8857 5.662 3.419 1.999 5.222 Kama 12.11 13.47 0.8392 5.159 3.032 1.502 4.519 Kama 11.42 12.86 0.8683 5.008 2.85 2.7 4.607 Kama 11.23 12.63 0.884 4.902 2.879 2.269 4.703 Kama 12.36 13.19 0.8923 5.076 3.042 3.22 4.605 Kama 13.22 13.84 0.868 5.395 3.07 4.157 5.088 Kama 12.78 13.57 0.8716 5.262 3.026 1.176 4.782 Kama 12.88 13.5 0.8879 5.139 3.119 2.352 4.607 Kama 14.34 14.37 0.8726 5.63 3.19 1.313 5.15 Kama 14.01 14.29 0.8625 5.609 3.158 2.217 5.132 Kama 14.37 14.39 0.8726 5.569 3.153 1.464 5.3 Kama 12.73 13.75 0.8458 5.412 2.882 3.533 5.067 Kama 17.63 15.98 0.8673 6.191 3.561 4.076 6.06 Rosa 16.84 15.67 0.8623 5.998 3.484 4.675 5.877 Rosa 17.26 15.73 0.8763 5.978 3.594 4.539 5.791 Rosa 19.11 16.26 0.9081 6.154 3.93 2.936 6.079 Rosa 16.82 15.51 0.8786 6.017 3.486 4.004 5.841 Rosa 16.77 15.62 0.8638 5.927 3.438 4.92 5.795 Rosa 17.32 15.91 0.8599 6.064 3.403 3.824 5.922 Rosa 20.71 17.23 0.8763 6.579 3.814 4.451 6.451 Rosa 18.94 16.49 0.875 6.445 3.639 5.064 6.362 Rosa 17.12 15.55 0.8892 5.85 3.566 2.858 5.746 Rosa 16.53 15.34 0.8823 5.875 3.467 5.532 5.88 Rosa 18.72 16.19 0.8977 6.006 3.857 5.324 5.879 Rosa 20.2 16.89 0.8894 6.285 3.864 5.173 6.187 Rosa 19.57 16.74 0.8779 6.384 3.772 1.472 6.273 Rosa 19.51 16.71 0.878 6.366 3.801 2.962 6.185 Rosa 18.27 16.09 0.887 6.173 3.651 2.443 6.197 Rosa 18.88 16.26 0.8969 6.084 3.764 1.649 6.109 Rosa 18.98 16.66 0.859 6.549 3.67 3.691 6.498 Rosa 21.18 17.21 0.8989 6.573 4.033 5.78 6.231 Rosa 20.88 17.05 0.9031 6.45 4.032 5.016 6.321 Rosa 20.1 16.99 0.8746 6.581 3.785 1.955 6.449 Rosa 18.76 16.2 0.8984 6.172 3.796 3.12 6.053 Rosa 18.81 16.29 0.8906 6.272 3.693 3.237 6.053 Rosa 18.59 16.05 0.9066 6.037 3.86 6.001 5.877 Rosa 18.36 16.52 0.8452 6.666 3.485 4.933 6.448 Rosa 16.87 15.65 0.8648 6.139 3.463 3.696 5.967 Rosa 19.31 16.59 0.8815 6.341 3.81 3.477 6.238 Rosa 18.98 16.57 0.8687 6.449 3.552 2.144 6.453 Rosa 18.17 16.26 0.8637 6.271 3.512 2.853 6.273 Rosa 18.72 16.34 0.881 6.219 3.684 2.188 6.097 Rosa 16.41 15.25 0.8866 5.718 3.525 4.217 5.618 Rosa 17.99 15.86 0.8992 5.89 3.694 2.068 5.837 Rosa 19.46 16.5 0.8985 6.113 3.892 4.308 6.009 Rosa 19.18 16.63 0.8717 6.369 3.681 3.357 6.229 Rosa 18.95 16.42 0.8829 6.248 3.755 3.368 6.148 Rosa 18.83 16.29 0.8917 6.037 3.786 2.553 5.879 Rosa 18.85 16.17 0.9056 6.152 3.806 2.843 6.2 Rosa 17.63 15.86 0.88 6.033 3.573 3.747 5.929 Rosa 19.94 16.92 0.8752 6.675 3.763 3.252 6.55 Rosa 18.55 16.22 0.8865 6.153 3.674 1.738 5.894 Rosa 18.45 16.12 0.8921 6.107 3.769 2.235 5.794 Rosa 19.38 16.72 0.8716 6.303 3.791 3.678 5.965 Rosa 19.13 16.31 0.9035 6.183 3.902 2.109 5.924 Rosa 19.14 16.61 0.8722 6.259 3.737 6.682 6.053 Rosa 20.97 17.25 0.8859 6.563 3.991 4.677 6.316 Rosa 19.06 16.45 0.8854 6.416 3.719 2.248 6.163 Rosa 18.96 16.2 0.9077 6.051 3.897 4.334 5.75 Rosa 19.15 16.45 0.889 6.245 3.815 3.084 6.185 Rosa 18.89 16.23 0.9008 6.227 3.769 3.639 5.966 Rosa 20.03 16.9 0.8811 6.493 3.857 3.063 6.32 Rosa 20.24 16.91 0.8897 6.315 3.962 5.901 6.188 Rosa 18.14 16.12 0.8772 6.059 3.563 3.619 6.011 Rosa 16.17 15.38 0.8588 5.762 3.387 4.286 5.703 Rosa 18.43 15.97 0.9077 5.98 3.771 2.984 5.905 Rosa 15.99 14.89 0.9064 5.363 3.582 3.336 5.144 Rosa 18.75 16.18 0.8999 6.111 3.869 4.188 5.992 Rosa 18.65 16.41 0.8698 6.285 3.594 4.391 6.102 Rosa 17.98 15.85 0.8993 5.979 3.687 2.257 5.919 Rosa 20.16 17.03 0.8735 6.513 3.773 1.91 6.185 Rosa 17.55 15.66 0.8991 5.791 3.69 5.366 5.661 Rosa 18.3 15.89 0.9108 5.979 3.755 2.837 5.962 Rosa 18.94 16.32 0.8942 6.144 3.825 2.908 5.949 Rosa 15.38 14.9 0.8706 5.884 3.268 4.462 5.795 Rosa 16.16 15.33 0.8644 5.845 3.395 4.266 5.795 Rosa 15.56 14.89 0.8823 5.776 3.408 4.972 5.847 Rosa 15.38 14.66 0.899 5.477 3.465 3.6 5.439 Rosa 17.36 15.76 0.8785 6.145 3.574 3.526 5.971 Rosa 15.57 15.15 0.8527 5.92 3.231 2.64 5.879 Rosa 15.6 15.11 0.858 5.832 3.286 2.725 5.752 Rosa 16.23 15.18 0.885 5.872 3.472 3.769 5.922 Rosa 13.07 13.92 0.848 5.472 2.994 5.304 5.395 Canadian 13.32 13.94 0.8613 5.541 3.073 7.035 5.44 Canadian 13.34 13.95 0.862 5.389 3.074 5.995 5.307 Canadian 12.22 13.32 0.8652 5.224 2.967 5.469 5.221 Canadian 11.82 13.4 0.8274 5.314 2.777 4.471 5.178 Canadian 11.21 13.13 0.8167 5.279 2.687 6.169 5.275 Canadian 11.43 13.13 0.8335 5.176 2.719 2.221 5.132 Canadian 12.49 13.46 0.8658 5.267 2.967 4.421 5.002 Canadian 12.7 13.71 0.8491 5.386 2.911 3.26 5.316 Canadian 10.79 12.93 0.8107 5.317 2.648 5.462 5.194 Canadian 11.83 13.23 0.8496 5.263 2.84 5.195 5.307 Canadian 12.01 13.52 0.8249 5.405 2.776 6.992 5.27 Canadian 12.26 13.6 0.8333 5.408 2.833 4.756 5.36 Canadian 11.18 13.04 0.8266 5.22 2.693 3.332 5.001 Canadian 11.36 13.05 0.8382 5.175 2.755 4.048 5.263 Canadian 11.19 13.05 0.8253 5.25 2.675 5.813 5.219 Canadian 11.34 12.87 0.8596 5.053 2.849 3.347 5.003 Canadian 12.13 13.73 0.8081 5.394 2.745 4.825 5.22 Canadian 11.75 13.52 0.8082 5.444 2.678 4.378 5.31 Canadian 11.49 13.22 0.8263 5.304 2.695 5.388 5.31 Canadian 12.54 13.67 0.8425 5.451 2.879 3.082 5.491 Canadian 12.02 13.33 0.8503 5.35 2.81 4.271 5.308 Canadian 12.05 13.41 0.8416 5.267 2.847 4.988 5.046 Canadian 12.55 13.57 0.8558 5.333 2.968 4.419 5.176 Canadian 11.14 12.79 0.8558 5.011 2.794 6.388 5.049 Canadian 12.1 13.15 0.8793 5.105 2.941 2.201 5.056 Canadian 12.44 13.59 0.8462 5.319 2.897 4.924 5.27 Canadian 12.15 13.45 0.8443 5.417 2.837 3.638 5.338 Canadian 11.35 13.12 0.8291 5.176 2.668 4.337 5.132 Canadian 11.24 13.0 0.8359 5.09 2.715 3.521 5.088 Canadian 11.02 13.0 0.8189 5.325 2.701 6.735 5.163 Canadian 11.55 13.1 0.8455 5.167 2.845 6.715 4.956 Canadian 11.27 12.97 0.8419 5.088 2.763 4.309 5.0 Canadian 11.4 13.08 0.8375 5.136 2.763 5.588 5.089 Canadian 10.83 12.96 0.8099 5.278 2.641 5.182 5.185 Canadian 10.8 12.57 0.859 4.981 2.821 4.773 5.063 Canadian 11.26 13.01 0.8355 5.186 2.71 5.335 5.092 Canadian 10.74 12.73 0.8329 5.145 2.642 4.702 4.963 Canadian 11.48 13.05 0.8473 5.18 2.758 5.876 5.002 Canadian 12.21 13.47 0.8453 5.357 2.893 1.661 5.178 Canadian 11.41 12.95 0.856 5.09 2.775 4.957 4.825 Canadian 12.46 13.41 0.8706 5.236 3.017 4.987 5.147 Canadian 12.19 13.36 0.8579 5.24 2.909 4.857 5.158 Canadian 11.65 13.07 0.8575 5.108 2.85 5.209 5.135 Canadian 12.89 13.77 0.8541 5.495 3.026 6.185 5.316 Canadian 11.56 13.31 0.8198 5.363 2.683 4.062 5.182 Canadian 11.81 13.45 0.8198 5.413 2.716 4.898 5.352 Canadian 10.91 12.8 0.8372 5.088 2.675 4.179 4.956 Canadian 11.23 12.82 0.8594 5.089 2.821 7.524 4.957 Canadian 10.59 12.41 0.8648 4.899 2.787 4.975 4.794 Canadian 10.93 12.8 0.839 5.046 2.717 5.398 5.045 Canadian 11.27 12.86 0.8563 5.091 2.804 3.985 5.001 Canadian 11.87 13.02 0.8795 5.132 2.953 3.597 5.132 Canadian 10.82 12.83 0.8256 5.18 2.63 4.853 5.089 Canadian 12.11 13.27 0.8639 5.236 2.975 4.132 5.012 Canadian 12.8 13.47 0.886 5.16 3.126 4.873 4.914 Canadian 12.79 13.53 0.8786 5.224 3.054 5.483 4.958 Canadian 13.37 13.78 0.8849 5.32 3.128 4.67 5.091 Canadian 12.62 13.67 0.8481 5.41 2.911 3.306 5.231 Canadian 12.76 13.38 0.8964 5.073 3.155 2.828 4.83 Canadian 12.38 13.44 0.8609 5.219 2.989 5.472 5.045 Canadian 12.67 13.32 0.8977 4.984 3.135 2.3 4.745 Canadian 11.18 12.72 0.868 5.009 2.81 4.051 4.828 Canadian 12.7 13.41 0.8874 5.183 3.091 8.456 5.0 Canadian 12.37 13.47 0.8567 5.204 2.96 3.919 5.001 Canadian 12.19 13.2 0.8783 5.137 2.981 3.631 4.87 Canadian 11.23 12.88 0.8511 5.14 2.795 4.325 5.003 Canadian 13.2 13.66 0.8883 5.236 3.232 8.315 5.056 Canadian 11.84 13.21 0.8521 5.175 2.836 3.598 5.044 Canadian 12.3 13.34 0.8684 5.243 2.974 5.637 5.063 Canadian
第一步:加載數據
load.py
import numpy as np def load_dataset(dataset_name): data = [] label = [] with open('{0}.tsv'.format(dataset_name),'r') as f: lines = f.readlines() for line in lines: linedata = line.strip().split('\t') data.append([float(da) for da in linedata[:-1]]) label.append(linedata[-1]) data = np.array(data) label = np.array(label) return data,label
第二步:設計分類模型
閾值分類模型是在所有的訓練數據中找最佳的閾值,這個閾值使得訓練集的預測效果最好。
threshold.py
#coding:utf-8 import numpy as np def learn_model(features,labels): best_acc = -1.0 thresh = features.copy() for fi in range(features.shape[1]): # 逐列 thresh = features[:,fi].copy() thresh.sort() for t in thresh: # 列中每一個元素 pred = (features[:,fi]>t) acc = (pred == labels).mean() if acc > best_acc: best_acc = acc best_fi = fi best_t = t print 'model->best_fi,t,acc:',best_fi,best_t,best_acc return best_t,best_fi def apply_model(features,model): t,fi = model return features[:,fi] > t def accurcy(features,labels,model): predictions = apply_model(features,model) return (predictions == labels).mean() #prediction == labels 同為真或同為假
第三步:測試模型的預測准確性
在這里采用十折交叉驗證,即把樣本數據分成10份,每次取其中一份作為測試數據,其余9份作為訓練數據。這種方法的優點是充分利用了數據樣本資源,缺點是計算量大。
seeds_threshold.py
#coding:utf-8 from load import load_dataset import numpy as np from threshold import learn_model,accurcy,apply_model features,labels = load_dataset('seeds') labels = (labels =='Canadian') #相等就為 True 不相等就為 False sumacc = 0.0 for flod in xrange(10): print '第 ',flod+1,' 次交叉驗證' training = np.ones(len(features),bool) training[flod::10] = False testing = ~training model = learn_model(features[training],labels[training]) acc = accurcy(features[testing],labels[testing],model) print '測試集預測准確率{0:.1%}'.format(acc) sumacc += acc sumacc /= 10 print '平均測試集預測准確率{0:.1%}'.format(sumacc)
運行 seeds_threshold.py
這樣下來一個簡單的閾值分類模型就建好了。
第五次交叉驗證的准確率都是 81% 分類的閾值分別是第 fi=5 列,t = 4.308.我們就可以用這個閾值預測給定種子是否是 Canadian.
現在我們在已有的基礎上把三種 seed 的分類閾值都求出來:
seeds.py

#coding:utf-8 from load import load_dataset import numpy as np from threshold import learn_model,accurcy,apply_model features,rawlabels = load_dataset('seeds') labelset = set(rawlabels) print labelset #labels = (labels =='Canadian') #相等就為 True 不相等就為 False for label in labelset: print label labels = rawlabels.copy() labels = (labels == label) sumacc = 0.0 bestacc = 0.0 for flod in xrange(10): print '第 ',flod+1,' 次交叉驗證' training = np.ones(len(features),bool) training[flod::10] = False testing = ~training model = learn_model(features[training],labels[training]) acc = accurcy(features[testing],labels[testing],model) print '測試集預測准確率{0:.1%}'.format(acc) if acc > bestacc: bestacc = acc bestmodel = model sumacc += acc sumacc /= 10 print '平均測試集預測准確率{0:.1%}'.format(sumacc) print '最佳模型:',model
分別求得各個類別的分類閾值。
根據閾值可以畫出分類樹。
用data表示一個待分類數據 data 有 7個元素,分別表示 7 個不同的特征。
分類樹: