Hadoop學習筆記(8) ——實戰做個倒排索引

本文轉載自查看原文 2014-08-14 22:04 5862

Hadoop學習筆記(8)

——實戰做個倒排索引

倒排索引是文檔檢索系統中最常用數據結構。根據單詞反過來查在文檔中出現的頻率，而不是根據文檔來，所以稱倒排索引(Inverted Index)。結構如下:

這張索引表中，每個單詞都對應着一系列的出現該單詞的文檔，權表示該單詞在該文檔中出現的次數。現在我們假定輸入的是以下的文件清單：

T1 ： hello world hello china

T2 : hello hadoop

T3 ： bye world bye hadoop bye bye

輸入這些文件，我們最終將會得到這樣的索引文件：

bye T3:4;

china T1:1;

hadoop T2:1;T3:1;

hello T1:2;T2:1;

world T1:1;T3:1;

接下來，我們就是要想辦法利用hadoop來把這個輸入，變成輸出。從上一章中，其實也就是分析如何將hadoop中的步驟個性化，讓其工作。整個步驟中，最主要的還是map和reduce過程，其它的都可稱之為配角，所以我們先來分析下map和reduce的過程將會是怎樣？

首先是Map的過程。Map的輸入是文本輸入，一條條的行記錄進入。輸出呢？應該包含：單詞、所在文件、單詞數。 Map的輸入是key-value。那這三個信息誰是key，誰是value呢？數量是需要累計的，單詞數肯定在value里，單詞在key中，文件呢？不同文件內的相同單詞也不能累加的，所以這個文件應該在key中。這樣key中就應該包含兩個值：單詞和文件，value則是默認的數量1，用於后面reduce來進行合並。

所以Map后的結果應該是這樣的：

Key value

Hello;T1 1

Hello:T1 1

World:T1 1

China:T1 1

Hello:T2 1

…

即然這個key是復合的，所以常歸的類型已經不能滿足我們的要求了，所以得設置一個復合健。復合健的寫法在上一章中描述到了。所以這里我們就直接上代碼：

public static class MyType implements WritableComparable<MyType>{
public MyType(){
}
private String word;
public String Getword(){return word;}
public void Setword(String value){ word = value;}
private String filePath;
public String GetfilePath(){return filePath;}
public void SetfilePath(String value){ filePath = value;}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(word);
out.writeUTF(filePath);
}
@Override
public void readFields(DataInput in) throws IOException {
word = in.readUTF();
filePath = in.readUTF();
}
@Override
public int compareTo(MyType arg0) {
if (word != arg0.word)
return word.compareTo(arg0.word);
return filePath.compareTo(arg0.filePath);
}
}

有了這個復合健的定義后，這個Map函數就好寫了：

public static class InvertedIndexMapper extends
Mapper<Object, Text, MyType, Text> {
public void map(Object key, Text value, Context context)
throws InterruptedException, IOException {
FileSplit split = (FileSplit) context.getInputSplit();
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
MyType keyInfo = new MyType();
keyInfo.Setword(itr.nextToken());
keyInfo.SetfilePath(split.getPath().toUri().getPath().replace("/user/zjf/in/", ""));
context.write(keyInfo, new Text("1"));
}
}
}

注意：第13行，路徑是全路徑的，為了看起來方便，我們把目錄替換掉，直接取文件名。

有了Map，接下來就可以考慮Recude了，以及在Map之后的Combine。Map的輸出的Key類型是MyType，所以Reduce以及Combine的輸入就必須是MyType了。

如果直接將Map的結果送到Reduce后，發現還需要做大量的工作來將Key中的單詞再重排一下。所以我們考慮在Reduce前加一個Combine，先將數量進行一輪合並。

這個Combine將會輸入下面的值：

Key value

bye T3:4;

china T1:1;

hadoop T2:1;

hadoop T3:1;

hello T1:2;

hello T2:1;

world T1:1;

world T3:1;

代碼如下：

public static class InvertedIndexCombiner extends
Reducer<MyType, Text, MyType, Text> {
public void reduce(MyType key, Iterable<Text> values, Context context)
throws InterruptedException, IOException {
int sum = 0;
for (Text value : values) {
sum += Integer.parseInt(value.toString());
}
context.write(key, new Text(key.GetfilePath()+ ":" + sum));
}
}

有了上面Combine后的結果，再進行Reduce就容易了，只需要將value結果進行合並處理：

public static class InvertedIndexReducer extends
Reducer<MyType, Text, Text, Text> {
public void reduce(MyType key, Iterable<Text> values, Context context)
throws InterruptedException, IOException {
Text result = new Text();
String fileList = new String();
for (Text value : values) {
fileList += value.toString() + ";";
}
result.set(fileList);
context.write(new Text(key.Getword()), result);
}
}

經過這個Reduce處理，就得到了下面的結果：

bye T3:4;

china T1:1;

hadoop T2:1;T3:1;

hello T1:2;T2:1;

world T1:1;T3:1;

最后，MapReduce函數都寫完后，就可以掛在Job中運行了。

public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
System.out.println("url:" + conf.get("fs.default.name"));
Job job = new Job(conf, "InvertedIndex");
job.setJarByClass(InvertedIndex.class);
job.setMapperClass(InvertedIndexMapper.class);
job.setMapOutputKeyClass(MyType.class);
job.setMapOutputValueClass(Text.class);
job.setCombinerClass(InvertedIndexCombiner.class);
job.setReducerClass(InvertedIndexReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
Path path = new Path("out");
FileSystem hdfs = FileSystem.get(conf);
if (hdfs.exists(path))
hdfs.delete(path, true);
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.waitForCompletion(true);
}

注：這里為了調試方便，我們把in和out都寫死，不用傳入執行參數了，並且，每次執行前，判斷out文件夾是否存在，如果存在則刪除。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Hadoop實戰-MapReduce之倒排索引(八) Hadoop之倒排索引 hadoop倒排索引 Information Retrieval 倒排索引學習筆記 MapReduce實戰--倒排索引 elasticsearch學習筆記-倒排索引以及中文分詞什么是倒排索引？ Elaticsearch倒排索引倒排索引倒排索引基礎

Hadoop學習筆記(8) ——實戰 做個倒排索引

免責聲明！

Hadoop學習筆記(8) ——實戰做個倒排索引