Computing TF-IDF and Cosine Similarity in Java (Learning Version)

Reposted article. 2016-03-31 20:54. Tags: JAVA / Cosine similarity / computation / TFIDF / Computer Vision

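A note up front: since this is a learning version, it is not a practical engineering implementation. The whole codebase uses List for matching, so you can imagine how inefficient it is.

[Original source]: http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html. Some bugs in it have been fixed.

P.S.: Unless you are forced to keep everything in one language, try not to use this project to compute TF-IDF. For 20,000 short texts, a Matlab implementation takes only a few seconds, while this Java project takes a long time: half an hour, or even longer. This program is therefore a learning version and is not suitable for production use.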

Beginners doing a project in text mining are often tormented by terms like:

TF-IDF
COSINE SIMILARITY
CLUSTERING
DOCUMENT VECTORS

In my earlier post I showed you what cosine similarity is. I will not explain it again here; instead, I will show a nice little piece of code that calculates cosine similarity in Java.

Many of you must be familiar with TF-IDF (Term Frequency-Inverse Document Frequency).
I will explain it in brief.

Term Frequency:
Suppose a document, “Tf-Idf Brief Introduction”, contains 60000 words overall and the term “Term-Frequency” occurs 60 times.
Then, mathematically, its Term Frequency is TF = 60/60000 = 0.001.

Inverse Document Frequency:
Suppose someone bought the complete Harry Potter series. Suppose the series has 7 books and the word “AbraKaDabra” appears in 2 of them.
Then, mathematically, its Inverse Document Frequency is IDF = 1 + log(7/2) = ……. (calculate it yourselves, guys; don’t be lazy. I am the lazy one, not you.)

And Finally, TFIDF = TF * IDF;
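The arithmetic in the two examples above can be checked with a short sketch. The class and method names here are made up for illustration, and the logarithm is assumed to be the natural logarithm, since that is what Java's Math.log computes. Note that the TF and IDF numbers come from two unrelated examples, so the final product only illustrates the multiplication.

```java
public class TfIdfWorkedExample {

    /** TF: occurrences of the term divided by the total number of words in the document. */
    static double tf(int termCount, int totalWords) {
        return (double) termCount / totalWords;
    }

    /** IDF as defined in this post: 1 + ln(total documents / documents containing the term). */
    static double idf(int totalDocs, int docsWithTerm) {
        return 1 + Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        double termFrequency = tf(60, 60000);       // 60 occurrences in 60000 words
        double inverseDocFrequency = idf(7, 2);     // 7 books, term appears in 2 of them
        System.out.println("TF  = " + termFrequency);
        System.out.println("IDF = " + inverseDocFrequency);
        // Illustration only: these two numbers come from different examples.
        System.out.println("TF * IDF = " + termFrequency * inverseDocFrequency);
    }
}
```

The casts to double matter: with plain int operands, 60 / 60000 would be integer division and yield 0.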

With the math in hand, I assume you now understand what it means in practice.

Document Vector:
There are various ways to calculate document vectors; I am just giving one example. Suppose I calculate the TF-IDF score of every term for a document A and store the scores in an array (or a list, a matrix, anything ordered; you guys are geniuses, you know how to create a vector). Then I get a document vector of TF-IDF scores for document A.
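As a rough sketch of that idea, the snippet below builds one TF-IDF vector per document over a fixed vocabulary. The two tiny documents, the vocabulary, and all names are invented for illustration; the TF and IDF formulas follow the definitions given above, with the natural logarithm assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocumentVectorSketch {

    /** Occurrences of term in doc, divided by the number of words in doc. */
    static double tf(String[] doc, String term) {
        double count = 0;
        for (String s : doc) {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return count / doc.length;
    }

    /** 1 + ln(number of documents / number of documents containing term). */
    static double idf(List<String[]> docs, String term) {
        double count = 0;
        for (String[] doc : docs) {
            for (String s : doc) {
                if (s.equalsIgnoreCase(term)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        return 1 + Math.log(docs.size() / count);
    }

    /** One TF-IDF score per vocabulary term, always in the vocabulary's order. */
    static double[] documentVector(String[] doc, List<String[]> docs, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            String term = vocabulary.get(i);
            vector[i] = tf(doc, term) * idf(docs, term);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<String[]>();
        docs.add(new String[] {"text", "mining", "is", "fun"});
        docs.add(new String[] {"text", "clustering"});
        // Assumed vocabulary: every distinct term that occurs in the corpus.
        List<String> vocabulary = Arrays.asList("text", "mining", "is", "fun", "clustering");
        for (String[] doc : docs) {
            System.out.println(Arrays.toString(documentVector(doc, docs, vocabulary)));
        }
    }
}
```

Because every vector uses the same vocabulary order, position i means the same term in every document, which is what makes the vectors comparable. A term that a document does not contain gets a score of 0 in that document's vector.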

The class shown below calculates the Term Frequency(TF) and Inverse Document Frequency(IDF).

//TfIdf.java
package com.computergodzilla.tfidf;

import java.util.List;

/**
 * Class to calculate the TF-IDF of a term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck.
     * @param totalterms : array of all the words in the document being processed
     * @param termToCheck : term whose tf is to be calculated
     * @return tf (term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0; // number of occurrences of termToCheck in the document
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates the idf of term termToCheck.
     * @param allTerms : the terms of all the documents, one array per document
     * @param termToCheck : term whose idf is to be calculated
     * @return idf (inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0; // number of documents that contain termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        // count is a double, so this is floating-point division;
        // Math.log is the natural logarithm
        return 1 + Math.log(allTerms.size() / count);
    }
}

The class shown below parses the text documents and splits them into tokens. It calls the TfIdf.java class to calculate TF-IDF, and the CosineSimilarity.java class to calculate the similarity between the parsed documents.


//DocumentParser.java
package com.computergodzilla.tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Class to read documents
 *
 * @author Mubin Shrestha
 */
public class DocumentParser {

    // Holds the terms of each document, one array per document.
    private List<String[]> termsDocsArray = new ArrayList<String[]>();
    private List<String> allTerms = new ArrayList<String>(); // all distinct terms
    private List<double[]> tfidfDocsVector = new ArrayList<double[]>();

    /**
     * Method to read files and store their terms in an array.
     * @param filePath : source folder path
     * @throws FileNotFoundException
     * @throws IOException
     */
    public void parseFiles(String filePath) throws FileNotFoundException, IOException {
        File[] allfiles = new File(filePath).listFiles();
        if (allfiles == null) { // listFiles() returns null if the path is not a readable directory
            throw new FileNotFoundException("Not a readable directory: " + filePath);
        }
        for (File f : allfiles) {
            if (f.getName().endsWith(".txt")) {
                StringBuilder sb = new StringBuilder();
                // try-with-resources closes the reader even if reading fails
                try (BufferedReader in = new BufferedReader(new FileReader(f))) {
                    String s;
                    while ((s = in.readLine()) != null) {
                        sb.append(s).append(' '); // keep words on adjacent lines apart
                    }
                }
                // strip punctuation, then split on whitespace to get individual terms
                String[] tokenizedTerms = sb.toString().trim()
                        .replaceAll("[\\W&&[^\\s]]", "").split("\\W+");
                for (String term : tokenizedTerms) {
                    if (!allTerms.contains(term)) { // avoid duplicate entry
                        allTerms.add(term);
                    }
                }
                termsDocsArray.add(tokenizedTerms);
            }
        }
    }

    /**
     * Method to create termVector according to its tfidf score.
     */
    public void tfIdfCalculator() {
        double tf; // term frequency
        double idf; // inverse document frequency
        double tfidf; // term frequency inverse document frequency
        for (String[] docTermsArray : termsDocsArray) {
            double[] tfidfvectors = new double[allTerms.size()];
            int count = 0;
            for (String terms : allTerms) {
                tf = new TfIdf().tfCalculator(docTermsArray, terms);
                idf = new TfIdf().idfCalculator(termsDocsArray, terms);
                tfidf = tf * idf;
                tfidfvectors[count] = tfidf;
                count++;
            }
            tfidfDocsVector.add(tfidfvectors); // storing document vectors
        }
    }

    /**
     * Method to calculate cosine similarity between all the documents.
     */
    public void getCosineSimilarity() {
        for (int i = 0; i < tfidfDocsVector.size(); i++) {
            for (int j = 0; j < tfidfDocsVector.size(); j++) {
                System.out.println("between " + i + " and " + j + " = "
                        + new CosineSimilarity().cosineSimilarity(
                                tfidfDocsVector.get(i),
                                tfidfDocsVector.get(j)));
            }
        }
    }
}

This is the class that calculates Cosine Similarity:


//CosineSimilarity.java
package com.computergodzilla.tfidf;

/**
 * Cosine similarity calculator class
 * @author Mubin Shrestha
 */
public class CosineSimilarity {

    /**
     * Method to calculate cosine similarity between two documents.
     * @param docVector1 : document vector 1 (a)
     * @param docVector2 : document vector 2 (b)
     * @return cosine similarity of a and b, or 0.0 if either vector has zero magnitude
     */
    public double cosineSimilarity(double[] docVector1, double[] docVector2) {
        double dotProduct = 0.0;
        double magnitude1 = 0.0;
        double magnitude2 = 0.0;
        double cosineSimilarity = 0.0;
        // docVector1 and docVector2 must be of the same length
        for (int i = 0; i < docVector1.length; i++) {
            dotProduct += docVector1[i] * docVector2[i]; // a.b
            magnitude1 += Math.pow(docVector1[i], 2); // sum of a^2
            magnitude2 += Math.pow(docVector2[i], 2); // sum of b^2
        }
        magnitude1 = Math.sqrt(magnitude1); // |a|
        magnitude2 = Math.sqrt(magnitude2); // |b|
        // both magnitudes must be non-zero, otherwise the division below yields NaN
        if (magnitude1 != 0.0 && magnitude2 != 0.0) {
            cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
        } else {
            return 0.0;
        }
        return cosineSimilarity;
    }
}
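To get a feel for the values this calculation produces, here is a small standalone sketch with hand-picked vectors. The class name, method name, and the vectors are invented for illustration; the sketch assumes both vectors have the same length and treats a zero vector as having similarity 0 with everything.

```java
public class CosineSimilarityDemo {

    /** Cosine of the angle between a and b; 0.0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double sumSqA = 0.0;
        double sumSqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumSqA += a[i] * a[i];
            sumSqB += b[i] * b[i];
        }
        if (sumSqA == 0.0 || sumSqB == 0.0) {
            return 0.0; // a zero vector has no direction to compare
        }
        return dot / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.0};
        double[] b = {0.0, 1.0};
        double[] zero = {0.0, 0.0};
        System.out.println(cosine(a, b));    // orthogonal vectors: 0.0
        System.out.println(cosine(a, a));    // identical vectors: 1.0
        System.out.println(cosine(a, zero)); // zero vector: 0.0 by convention
    }
}
```

Since TF-IDF scores are never negative, the similarity between two document vectors falls between 0 (no shared terms) and 1 (same direction).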

Here’s the main class to run the code:


//TfIdfMain.java
package com.computergodzilla.tfidf;

import java.io.FileNotFoundException;
import java.io.IOException;

/**
 * Entry point that runs the whole calculation.
 * @author Mubin Shrestha
 */
public class TfIdfMain {

    /**
     * Main method
     * @param args
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static void main(String args[]) throws FileNotFoundException, IOException {
        DocumentParser dp = new DocumentParser();
        dp.parseFiles("D:\\FolderToCalculateCosineSimilarityOf"); // give the location of the source folder
        dp.tfIdfCalculator(); // calculates tfidf
        dp.getCosineSimilarity(); // calculates cosine similarity
    }
}

You can also download the whole source code from here: Download (Google Drive).

Overall, what I did is this: I first calculated the TF-IDF matrix of all the documents, which gives a document vector for each document. Then I used those document vectors to calculate cosine similarity.
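In the code above, idfCalculator is called once per term per document, and each call scans every document again. One possible way to avoid the repeated scans, which is not part of the original code, is to count document frequencies once up front with a HashMap. The sketch below uses invented names and assumes that lower-casing with Locale.ROOT is an acceptable stand-in for the equalsIgnoreCase comparison used in TfIdf.java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class DocumentFrequencySketch {

    /** Counts, for each lower-cased term, how many documents contain it. */
    static Map<String, Integer> documentFrequencies(List<String[]> docs) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (String[] doc : docs) {
            Set<String> seen = new HashSet<String>(); // terms already counted for this document
            for (String term : doc) {
                String key = term.toLowerCase(Locale.ROOT);
                if (seen.add(key)) { // true only the first time the term appears in this document
                    Integer old = df.get(key);
                    df.put(key, old == null ? 1 : old + 1);
                }
            }
        }
        return df;
    }

    /** 1 + ln(total documents / documents containing term), using the precomputed counts. */
    static double idf(Map<String, Integer> df, int totalDocs, String term) {
        Integer count = df.get(term.toLowerCase(Locale.ROOT));
        if (count == null) {
            throw new IllegalArgumentException("Term not in corpus: " + term);
        }
        return 1 + Math.log((double) totalDocs / count);
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<String[]>();
        docs.add(new String[] {"Text", "mining", "text"});
        docs.add(new String[] {"text", "clustering"});
        Map<String, Integer> df = documentFrequencies(docs);
        System.out.println("df(text) = " + df.get("text"));
        System.out.println("idf(mining) = " + idf(df, docs.size(), "mining"));
    }
}
```

With the counts in a map, each IDF lookup is a single hash access instead of a pass over the whole corpus. The same idea could replace the allTerms.contains check in DocumentParser with a HashSet, though that change is not shown here.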

If you think the explanation is not enough, hit me up.
Happy Text-Mining!!

from: http://jacoxu.com/?p=1619
