Notes on Building on the Weka API

This article is a repost. Originally published 2014-12-17 21:02.

1. The Weka data mining workflow

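After using the Weka GUI to try out data preprocessing, classification, and association mining, I wanted to do the same through the Weka API, since Weka is itself an open-source machine learning library.
Create a new project in Eclipse, add weka.jar to the build path, and you can start writing code. The setup is simple, and there are plenty of tutorials online if anything is unclear; this post only records the rough steps of my learning process.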

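As an open-source data mining platform, Weka packages many well-regarded machine learning algorithms. A data mining run with it generally goes as follows: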

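Load the training and test samples
Initialize the classifier
Train the classifier on the training samples
Evaluate the trained classifier on the test samples
Print the classification results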

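Below is the example code. In Eclipse, Ctrl+Shift+O (Organize Imports) adds the required import statements automatically; they are written out here for completeness.

import java.io.File;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class WekaTest {
	public static void main(String[] args) throws Exception {
		// Load the training data
		File file = new File("data/iris.arff");
		ArffLoader loader = new ArffLoader();
		loader.setFile(file);
		Instances ins = loader.getDataSet();
		// The class label is the last attribute
		ins.setClassIndex(ins.numAttributes() - 1);
		// Initialize the classifier
		Classifier cfs = new NaiveBayes();
		// Train the classifier on the training samples
		cfs.buildClassifier(ins);
		// Evaluate the trained classifier. For convenience this reuses the training
		// data; in real use, load a separate test set instead.
		Instance testInst;
		Evaluation testingEvaluation = new Evaluation(ins);
		int length = ins.numInstances();
		for (int i = 0; i < length; i++) {
			testInst = ins.instance(i);
			testingEvaluation.evaluateModelOnceAndRecordPrediction(cfs, testInst);
		}
		// Print the classification result: accuracy = 1 - error rate
		System.out.println("accuracy == " + (1 - testingEvaluation.errorRate()));
	}
}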

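That is the overall training and evaluation process. Different applications will call for different algorithms and parameters, which I am still exploring.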

Note:
When training a model with Weka, if there is no separate test set, k-fold cross-validation can be used instead.

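        // Equivalent to the per-instance loop above: evaluateModel scores every instance in one call.
        // A fresh Evaluation is used so the counts from the loop are not included a second time.
        Evaluation trainSetEval = new Evaluation(ins);
        trainSetEval.evaluateModel(cfs, ins);
        System.out.println(1 - trainSetEval.errorRate());

        // k-fold cross-validation (k = 10); needs java.util.Random
        Evaluation tencrosseva = new Evaluation(ins);
        tencrosseva.crossValidateModel(cfs, ins, 10, new Random(1));
        System.out.println(1 - tencrosseva.errorRate());

        // save the model (the file name is arbitrary); needs weka.core.SerializationHelper
        SerializationHelper.write("data/knn.model", cfs);

        // load the model
        Classifier clf_name = (Classifier) SerializationHelper.read("data/knn.model");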
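
To make the idea behind k-fold cross-validation concrete, here is a minimal plain-Java sketch of how the data can be split into folds. It is not Weka's implementation: the class and method names are made up for illustration, and unlike crossValidateModel it does not stratify the folds by class. Each fold is used once as the test set while the remaining folds form the training set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Weka.
final class KFoldSketch {
    /** Shuffles the indices 0..n-1 and deals them into k folds whose sizes differ by at most one. */
    static List<List<Integer>> split(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin into the k folds
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        int n = 150; // e.g. the number of instances in iris.arff
        List<List<Integer>> folds = split(n, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            int testSize = folds.get(f).size();
            System.out.println("round " + f + ": test on " + testSize + ", train on " + (n - testSize));
        }
    }
}
```

Every instance is tested exactly once, which is why the cross-validated error rate is a fairer estimate than evaluating on the training data itself.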

Commonly used classifiers, grouped by package (some of the names are rather obscure):

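bayes: NaiveBayes (naive Bayes) and BayesNet (Bayesian belief network).
functions: LibLINEAR and LibSVM (both require installing extension packages), Logistic (logistic regression), and LinearRegression (linear regression).
lazy: IB1 (1-NN) and IBk (k-NN).
meta: many boosting and bagging classifiers, such as AdaBoostM1.
trees: J48 (Weka's implementation of C4.5) and RandomForest.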

2. Attribute selection in Weka

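In data mining research, the similarity between samples is usually measured by a distance, and that distance is computed from attribute values. Different attributes carry different weight in the sample space, that is, they differ in how strongly they relate to the class, so it is necessary to filter out some attributes or to assign each attribute a weight. Attribute selection methods arose to meet this need. (Quoted from "Weka attribute selection".)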
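
As a small illustration of the point about weights, the following plain-Java sketch computes a weighted Euclidean distance. The class name and the sample values are made up for this example and are not taken from Weka; the sketch only shows that setting an attribute's weight to 0 has the same effect as removing that attribute.

```java
// Hypothetical helper, not part of Weka.
final class WeightedDistanceSketch {
    /** Weighted Euclidean distance; a weight of 0 removes that attribute from the result. */
    static double distance(double[] a, double[] b, double[] weights) {
        if (a.length != b.length || a.length != weights.length) {
            throw new IllegalArgumentException("all arrays must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += weights[i] * diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.0, 10.0};
        double[] y = {3.0, 4.0, -10.0};
        // All attributes count equally: the third one dominates the distance
        System.out.println(distance(x, y, new double[] {1, 1, 1}));
        // Third attribute dropped (weight 0): only the first two contribute
        System.out.println(distance(x, y, new double[] {1, 1, 0}));
    }
}
```

If the third attribute is noise with respect to the class, the second distance is the more meaningful one, which is the motivation for selecting or weighting attributes before running a distance-based method.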

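Here I use the 10% subset of the kdd99 network intrusion detection dataset (roughly 40,000-odd records), where each record has 41 feature attributes and one class label. Training on even this much data is a strain for Weka. Some attributes are correlated and largely redundant, so attribute selection is needed. I wondered whether this can be thought of as principal component analysis (PCA). The goal is similar, reducing dimensionality, but the mechanism differs: PCA builds new attributes as linear combinations of the originals, while attribute selection keeps a subset of the original attributes unchanged. Concepts like these are still a little fuzzy for me and need to be understood properly.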

Importing the kdd99 dataset

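A default installation sets the maximum heap to only 1024 MB, so running a large dataset may fail with a heap overflow (java.lang.OutOfMemoryError).
There are two ways to change the heap size:

Start Weka from the console with java -Xmx1500m -jar weka.jar.
Or edit the RunWeka.ini configuration file in the installation directory and change the maxheap value: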

# placeholders ("#bla#" in command gets replaced with content of key "bla")
# Note: "#wekajar#" gets replaced by the launcher class, since that jar gets
#       provided as parameter
maxheap=1024M
# The MDI GUI
#mainclass=weka.gui.Main
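
To check whether a heap setting actually took effect, a tiny standalone program can print the maximum heap the JVM reports. This is a minimal sketch using only the standard Runtime API; the class name is made up, and it checks the JVM it runs in, not a Weka process that is already running.

```java
// Hypothetical helper for checking the -Xmx setting; not part of Weka.
final class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with the same option used for Weka, for example java -Xmx1500m HeapCheck, should print a value close to 1500; the reported number can be somewhat lower than the requested one depending on the JVM.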

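Import the original comma-separated text file (CSV) into Weka and save it as an ARFF file; the ARFF header then shows clearly which attributes are continuous and which are discrete.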
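
The following plain-Java sketch illustrates the idea behind that header: a column whose values all parse as numbers becomes a numeric attribute, and any other column becomes a nominal attribute listing its distinct values. Weka's own CSV import does this for you; the class and method names below are made up, and this is not Weka's converter.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Weka.
final class ArffHeaderSketch {
    /** Returns an ARFF @attribute line: numeric if every value parses as a number, nominal otherwise. */
    static String attributeLine(String name, List<String> values) {
        boolean numeric = true;
        for (String v : values) {
            try {
                Double.parseDouble(v);
            } catch (NumberFormatException e) {
                numeric = false;
                break;
            }
        }
        if (numeric) {
            return "@attribute " + name + " numeric";
        }
        // LinkedHashSet keeps the distinct values in first-seen order
        Set<String> distinct = new LinkedHashSet<>(values);
        return "@attribute " + name + " {" + String.join(",", distinct) + "}";
    }

    public static void main(String[] args) {
        System.out.println(attributeLine("duration", Arrays.asList("0", "0", "12")));
        System.out.println(attributeLine("protocol_type", Arrays.asList("tcp", "udp", "tcp", "icmp")));
    }
}
```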

Overview of the kdd99 data

@relation attr

@attribute duration numeric
@attribute protocol_type {tcp,udp,icmp}
@attribute service {http,smtp,finger,domain_u,auth,telnet,ftp,eco_i,ntp_u,ecr_i,other,private,pop_3,ftp_data,rje,time,mtp,link,remote_job,gopher,ssh,name,whois,domain,login,imap4,daytime,ctf,nntp,shell,IRC,nnsp,http_443,exec,printer,efs,courier,uucp,klogin,kshell,echo,discard,systat,supdup,iso_tsap,hostnames,csnet_ns,pop_2,sunrpc,uucp_path,netbios_ns,netbios_ssn,netbios_dgm,sql_net,vmnet,bgp,Z39_50,ldap,netstat,urh_i,X11,urp_i,pm_dump,tftp_u,tim_i,red_i}
@attribute flag {SF,S1,REJ,S2,S0,S3,RSTO,RSTR,RSTOS0,OTH,SH}
@attribute src_bytes numeric
@attribute dst_bytes numeric
@attribute land numeric
@attribute wrong_fragment numeric
@attribute urgent numeric
...
...
@attribute dst_host_srv_serror_rate numeric
@attribute dst_host_rerror_rate numeric
@attribute dst_host_srv_rerror_rate numeric
@attribute lable {normal.,buffer_overflow.,loadmodule.,perl.,neptune.,smurf.,guess_passwd.,pod.,teardrop.,portsweep.,ipsweep.,land.,ftp_write.,back.,imap.,satan.,phf.,nmap.,multihop.,warezmaster.,warezclient.,spy.,rootkit.}

@data
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0,0,0,0,1,0,0,9,9,1,0,0.11,0,0,0,0,0,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0,0,0,0,1,0,0,19,19,1,0,0.05,0,0,0,0,0,normal.
...
...

Preprocessing

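Some methods place requirements on the attribute types of the dataset. Association methods, for example, require discrete (nominal) attributes, so any numeric attributes must be discretized first. Similarly, data sometimes needs to be normalized or standardized. The purpose is either to make a particular algorithm applicable or to improve accuracy and training speed.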
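
As a rough illustration of two such operations, the plain-Java sketch below shows min-max normalization and equal-width discretization. In Weka these jobs are done by filters such as weka.filters.unsupervised.attribute.Normalize and weka.filters.unsupervised.attribute.Discretize; the code here is not their implementation, and the class name and sample values are made up.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Weka.
final class PreprocessSketch {
    /** Min-max normalization: rescales the values linearly into [0, 1]. */
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has no spread; map it to 0
            out[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    /** Equal-width discretization: maps each value to a bin index in 0..bins-1. */
    static int[] discretize(double[] values, int bins) {
        double[] scaled = normalize(values);
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // The maximum value would land in bin 'bins', so clamp it into the last bin
            out[i] = Math.min((int) (scaled[i] * bins), bins - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] srcBytes = {0, 181, 239, 1000};
        System.out.println(Arrays.toString(normalize(srcBytes)));
        System.out.println(Arrays.toString(discretize(srcBytes, 4)));
    }
}
```

After discretization the numeric column can be treated as a nominal attribute whose values are the bin indices, which is the form association methods expect.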

Performing attribute selection

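Running attribute selection with the information gain algorithm on this dataset ran out of memory, so I resampled half of the original data and ran the selection on that, using 10-fold cross-validation as the selection mode.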

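The result list on the right shows the attribute ranking. The larger an attribute's information gain, the better it separates the classes, and the more valuable it is as an input attribute for classification.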
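
To show what the ranking score measures, here is a minimal plain-Java sketch of information gain for a nominal attribute, using the usual definition InfoGain(Class, A) = H(Class) - H(Class | A). It is not Weka's InfoGainAttributeEval; the class name is made up, it handles nominal values only, and the four toy rows are invented for illustration rather than taken from kdd99.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Weka.
final class InfoGainSketch {
    /** Shannon entropy of the labels, in bits. */
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Information gain of a nominal attribute: H(class) - H(class | attribute). */
    static double infoGain(List<String> attribute, List<String> labels) {
        // Group the class labels by attribute value
        Map<String, List<String>> byValue = new HashMap<>();
        for (int i = 0; i < attribute.size(); i++) {
            byValue.computeIfAbsent(attribute.get(i), k -> new ArrayList<>()).add(labels.get(i));
        }
        // Weighted average of the entropy inside each group
        double conditional = 0.0;
        for (List<String> subset : byValue.values()) {
            conditional += (double) subset.size() / labels.size() * entropy(subset);
        }
        return entropy(labels) - conditional;
    }

    public static void main(String[] args) {
        List<String> label = Arrays.asList("normal.", "normal.", "smurf.", "smurf.");
        List<String> protocol = Arrays.asList("tcp", "tcp", "icmp", "icmp"); // splits the classes cleanly
        List<String> flag = Arrays.asList("SF", "REJ", "SF", "REJ");         // unrelated to the class
        System.out.println("protocol_type: " + infoGain(protocol, label));
        System.out.println("flag: " + infoGain(flag, label));
    }
}
```

In this toy data the first attribute removes all uncertainty about the class and the second removes none, so a ranking by information gain puts the first attribute on top.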

Example code

import java.io.File;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class WekaASE {

	public static void main(String[] args) throws Exception {
		// 1. Load the training data
		File file = new File("data/kdd99.arff");
		ArffLoader loader = new ArffLoader();
		loader.setFile(file);
		Instances ins = loader.getDataSet();
		// The class label is the last attribute
		ins.setClassIndex(ins.numAttributes() - 1);
		// 2. Initialize the search method and the attribute evaluator
		Ranker rank = new Ranker();
		InfoGainAttributeEval eval = new InfoGainAttributeEval();
		// 3. Evaluate each attribute with the evaluator
		eval.buildEvaluator(ins);
		// 4. Filter the attributes with the search method.
		// Ranker simply sorts the attributes by their InfoGain score.
		int[] attrIndex = rank.search(eval, ins);
		// 5. Print the ranked attribute indices (0-based) and each attribute's InfoGain score
		StringBuilder attrIndexInfo = new StringBuilder();
		StringBuilder attrInfoGainInfo = new StringBuilder();
		attrIndexInfo.append("Selected attributes:");
		attrInfoGainInfo.append("Ranked attributes:\n");
		for (int i = 0; i < attrIndex.length; i++) {
			attrIndexInfo.append(attrIndex[i]);
			attrIndexInfo.append(",");
			attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]));
			attrInfoGainInfo.append("\t");
			attrInfoGainInfo.append(ins.attribute(attrIndex[i]).name());
			attrInfoGainInfo.append("\n");
		}
		System.out.println(attrIndexInfo.toString());
		System.out.println(attrInfoGainInfo.toString());
	}
}

3. Association analysis

4. Classification

5. Clustering analysis

6. Validation & evaluation

7. Feature engineering
