LingPipe Text Classification


What is Text Classification?

Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications--typically generated by a human--and learns how to classify further documents using what it learned with language models. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.


20 Newsgroups Demo

A publicly available data set to work with is the 20 newsgroups data available from the 20 Newsgroups Home Page.


4 Newsgroups Sample

We have included a sample of 4 newsgroups with the LingPipe distribution in order to allow you to run the tutorial out of the box. You may also download and run over the entire 20 newsgroup dataset. LingPipe's performance over the whole data set is state of the art.


Quick Start

Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:


 cd demos/tutorial/classify

You may then run the demo from the command line (placing all of the code on one line):


On Windows:

java
-cp "../../../lingpipe-4.1.0.jar;
     classifyNews.jar"
ClassifyNews

or through Ant:

ant classifyNews

The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/4newsgroups/4news-test. The results of scoring are printed to the command line and explained in the rest of this tutorial.


The Code

The entire source for the example is ClassifyNews.java. We will be using the API from Classifier and its subclasses to train the classifier, and Classification to evaluate it. The code should be pretty self-explanatory in terms of how training and evaluation are done. Below I go over the API calls.


Training

We are going to train up a set of character-based language models (one per newsgroup, as named in the static array CATEGORIES) that process data in 6-character sequences, as specified by the NGRAM_SIZE constant.


private static String[] CATEGORIES
    = { "soc.religion.christian",
        "talk.religion.misc",
        "alt.atheism",
        "misc.forsale" };

private static int NGRAM_SIZE = 6;

Generally, the smaller your data, the smaller the n-gram size should be, but you can play around with different values; reasonable ranges are from 1 to 16, with 6 being a good general starting place.

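Concretely, a process character language model with NGRAM_SIZE = 6 observes every overlapping 6-character window of the text. As an illustrative sketch (this helper is ours, not part of the LingPipe API), the windows can be enumerated like this:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: enumerate the overlapping character n-grams that a process
// language model is trained on. Hypothetical helper, not LingPipe code.
public class NGramDemo {
    public static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        // slide an n-character window across the text, one char at a time
        for (int i = 0; i + n <= text.length(); i++)
            grams.add(text.substring(i, i + n));
        return grams;
    }

    public static void main(String[] args) {
        // "atheism" (7 chars) yields two 6-grams: "atheis" and "theism"
        System.out.println(charNGrams("atheism", 6));
    }
}
```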

The actual classifier involves one language model per category. In this case, we are going to use process language models (LanguageModel.Process). There is a factory method in DynamicLMClassifier to construct the actual models.


DynamicLMClassifier classifier
  = DynamicLMClassifier
    .createNGramBoundary(CATEGORIES,
                         NGRAM_SIZE);

There are two other kinds of language model classifiers that may be constructed, for bounded character language models and tokenized language models.


Training a classifier simply involves providing examples of text for the various categories. These are passed in through the handle method, after first constructing a Classification from the category and a Classified object from the classification and the text:


Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text,classification);
classifier.handle(classified);

That's all you need to train up a language model classifier. Now we can see what it can do with some evaluation data.


Classifying News Articles

The DynamicLMClassifier is pretty slow when doing classification, so it is generally worth going through a compile step to produce the more efficient compiled version, which will classify character sequences into joint classification results. A simple way to do that in code is:


JointClassifier<CharSequence> compiledClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.compile(classifier);

Now the rubber hits the road and we can see how well the machine learning is doing. The example code both reports classifications to the console and evaluates the performance. The crucial lines of code are:


JointClassification jc = compiledClassifier.classifyJoint(text);
String bestCategory = jc.bestCategory();
String details = jc.toString();

The text is an article that was not trained on, and the JointClassification is the result of evaluating the text against all the language models. Its bestCategory() method returns the name of the highest-scoring language model for the text. Just to be sure that some statistics are involved, the toString() method dumps out all the results, presented as:


Testing on soc.religion.christian/21417
Best Cat: soc.religion.christian
Rank Cat Score P(Cat|In) log2 P(Cat,In)
0=soc.religion.christian -1.56 0.45 -1.56
1=talk.religion.misc -2.68 0.20 -2.68
2=alt.atheism -2.70 0.20 -2.70
3=misc.forsale -3.25 0.13 -3.25
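The P(Cat|In) column appears to follow from normalizing the per-category scores: treating each Score as a base-2 log value, exponentiating, and dividing by the sum across categories reproduces the conditional probabilities in the run above. This is our reading of the output, sketched with a hypothetical helper (not a LingPipe method):

```java
public class CondProbDemo {
    // Normalize per-category log2 scores into conditional probabilities:
    // P(cat | in) = 2^score(cat) / sum over all categories of 2^score.
    public static double[] conditionalize(double[] log2Scores) {
        double[] p = new double[log2Scores.length];
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.pow(2.0, log2Scores[i]);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++)
            p[i] /= sum;
        return p;
    }

    public static void main(String[] args) {
        // Scores from the sample run above
        double[] scores = { -1.56, -2.68, -2.70, -3.25 };
        for (double v : conditionalize(scores))
            System.out.printf("%.2f%n", v);
    }
}
```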

Scoring Accuracy

The remaining API of note is how the system is scored against a gold standard, in this case our testing data. Since we know which newsgroup each article came from, we can evaluate how well the software is doing with the JointClassifierEvaluator class.


boolean storeInputs = true;
JointClassifierEvaluator<CharSequence> evaluator
    = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                 CATEGORIES,
                                                 storeInputs);

This class wraps the compiledClassifier in an evaluation framework that provides very rich reporting of how well the system is doing. Later in the code it is populated with test cases via the handle() method, after first constructing a classified object just as for training:


Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text,classification);
evaluator.handle(classified);

This will get a JointClassification for the text and then keep track of the results for reporting later. After all the data has been run, many methods exist to see how well the software did. In the demo code we just print out the total accuracy via the ConfusionMatrix class, but it is well worth looking at the relevant Javadoc for what reporting is available.
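As a self-contained illustration of what total accuracy over a confusion matrix means (our own sketch, not the LingPipe ConfusionMatrix API), the figure is simply the diagonal count, i.e. the cases where the response category matches the reference category, divided by the total number of test cases:

```java
public class ConfusionAccuracy {
    // Total accuracy = sum of the diagonal (correct classifications)
    // divided by the sum of all cells (all classified cases).
    public static double totalAccuracy(int[][] matrix) {
        int correct = 0, total = 0;
        for (int i = 0; i < matrix.length; i++)
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j)
                    correct += matrix[i][j];
            }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical 4x4 matrix: rows = reference category, cols = response.
        int[][] m = {
            { 23,  1,  1,  0 },
            {  2, 20,  3,  0 },
            {  1,  4, 19,  1 },
            {  0,  0,  0, 25 }
        };
        // 87 correct out of 100 cases
        System.out.printf("accuracy=%.2f%n", totalAccuracy(m));
    }
}
```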

Cross-Validation

Running Cross-Validation

There's an ant target crossValidateNews which cross-validates the news classifier over 10 folds. Here's what a run looks like:


> cd $LINGPIPE/demos/tutorial/classify
> ant crossValidateNews

Reading data.
Num instances=250.
Permuting corpus.
 FOLD        ACCU
    0  1.00 +/- 0.00
    1  0.96 +/- 0.08
    2  0.84 +/- 0.14
    3  0.92 +/- 0.11
    4  1.00 +/- 0.00
    5  0.96 +/- 0.08
    6  0.88 +/- 0.13
    7  0.84 +/- 0.14
    8  0.88 +/- 0.13
    9  0.84 +/- 0.14
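The +/- column is consistent with a 95% normal-approximation confidence interval for a binomial proportion over each fold's 25 test cases (250 instances split into 10 folds). That interpretation is our reading of the output, not something the demo states; a quick check of the arithmetic:

```java
public class FoldStats {
    // 95% normal-approximation half-width for an accuracy estimate:
    // 1.96 * sqrt(p * (1 - p) / n). Assumption about the +/- column,
    // not taken from the LingPipe source.
    public static double halfWidth95(double accuracy, int n) {
        return 1.96 * Math.sqrt(accuracy * (1.0 - accuracy) / n);
    }

    public static void main(String[] args) {
        // 250 instances over 10 folds -> 25 test cases per fold,
        // matching the "0.96 +/- 0.08" and "0.84 +/- 0.14" lines above.
        System.out.printf("0.96 +/- %.2f%n", halfWidth95(0.96, 25));
        System.out.printf("0.84 +/- %.2f%n", halfWidth95(0.84, 25));
    }
}
```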

 

Original article: http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
