Building a Search Engine with Heritrix + Lucene (3): Extracting Page Content

Reposted article. 2013-01-06 23:06. Category: search engines.

A search engine exists to make Web content easy to retrieve, so that users can quickly find and browse the relevant pages.

So once a crawler such as Heritrix has fetched the Web resources, the first task is to extract the content of the pages.

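Many Java tools exist for content extraction. For HTML pages there are HtmlParser, Jsoup, and others; for Word, Excel, and similar files there are corresponding tools as well.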

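For page content extraction with HtmlParser, Jsoup, and similar tools, see the related articles, for example 《HTML抽取工具Jsoup》 ("Jsoup, an HTML Extraction Tool").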

For Word and similar files, the open-source tool POI is worth learning:

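Apache POI is an open-source Java project for reading and writing Microsoft OLE2 compound documents such as Excel and Word files. Ruby bindings for POI have also been made available.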

Components:

HSSF - reads and writes Microsoft Excel XLS files.
XSSF - reads and writes Microsoft Excel OOXML XLSX files.
HWPF - reads and writes Microsoft Word DOC files.
HSLF - reads and writes Microsoft PowerPoint files.
HDGF - reads Microsoft Visio files.
HPBF - reads Microsoft Publisher files.
HSMF - reads Microsoft Outlook files.

POI project site: http://poi.apache.org/
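The split between the OLE2-based components (HSSF, HWPF) and the OOXML-based one (XSSF) comes down to the container format of the file, which can be recognised from its first bytes. The sketch below is only an illustration using the JDK; `OfficeFormatSniffer` and its enum are hypothetical names, not part of POI, and a ZIP signature is only a hint that the file may be OOXML.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of POI): guesses an Office file's container from its header bytes. */
final class OfficeFormatSniffer {

    /** ZIP only suggests OOXML; any ZIP archive starts with the same bytes. */
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (legacy .xls, .doc, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // OOXML files (.xlsx, .docx) are ZIP archives, which start with "PK\3\4".
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    // True when data begins with every byte of prefix; null or short input never matches.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A crawler mirror often stores files without reliable extensions, so checking the header before picking an extractor is one possible safeguard.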

For PDF documents, one of the most common text-extraction tools is PDFBox: http://pdfbox.apache.org/

Below is some sample code for extracting document content.

HTML content extraction, HtmlParser.java (despite its name, the class uses Jsoup):

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import GEsearcher.encode.FileEncode;
import GEsearcher.index.FileDocument;

/**
 * Parses an HTML file and exposes its title, text content, and URL.
 * @author Shilong
 */
public class HtmlParser {

    private String title;
    private String content;
    private String url;
    private String path;
    private FileDocument filedocument;

    public HtmlParser(String path)
    {
        this.path=path;
        filedocument=new FileDocument(path);
        AnalysicDocument(); // parse immediately
        //System.out.println("test:"+filedocument.getUrl());
    }

    // Returns the File object of the file being analyzed
    public File getFile()
    {
        return filedocument.getFile();
    }

    // Returns the encoding name to use when reading the file
    public String getEncoding()
    {
        FileEncode fe=new FileEncode(path);
        String encode=fe.getEncode();
        if("GB-2312".equalsIgnoreCase(encode))
        {
            // the detector reports "GB-2312"; Java knows this charset as "GB2312"
            return "GB2312";
        }
        if(encode==null||encode.equals("UNKNOWN"))
        {
            // fall back to UTF-8 when detection fails
            return "UTF-8";
        }
        return encode;
    }

    // Parses the file and fills in title, content, and url
    public void AnalysicDocument()
    {
        File infile=getFile();
        try {
            Document doc = Jsoup.parse(infile, getEncoding());
            title=doc.title();  // page title
            content=doc.text();  // visible text of the page
            url=filedocument.getUrl(); // original URL of the page
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Accessors for the extracted data
    // Returns the URL of the analyzed file
    public String getUrl()
    {
        return this.url;
    }
    public String getTitle()
    {
        return this.title;
    }
    public String getContent()
    {
        return this.content;
    }
}
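The encoding lookup in the listing above compares a few label strings by hand. As an alternative sketch using only the JDK, the detected label can be checked against the charsets the running JVM actually supports. `CharsetLabels` and `resolve` are hypothetical names, and the "UNKNOWN" label is assumed to be what the detector returns on failure, as in the listing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: maps a detector's charset label to a Charset the JVM supports. */
final class CharsetLabels {

    static Charset resolve(String label) {
        String name = (label == null) ? "" : label.trim();
        // Assumption: the detector reports "UNKNOWN" when it cannot decide.
        if (name.isEmpty() || name.equalsIgnoreCase("UNKNOWN")) {
            return StandardCharsets.UTF_8;
        }
        // Some detectors report "GB-2312"; the usual Java name has no hyphen.
        if (name.equalsIgnoreCase("GB-2312")) {
            name = "GB2312";
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : StandardCharsets.UTF_8;
        } catch (IllegalArgumentException e) {
            // Also covers IllegalCharsetNameException for labels that are not legal charset names.
            return StandardCharsets.UTF_8;
        }
    }
}
```

Since `Jsoup.parse(File, String)` takes a charset name, `CharsetLabels.resolve(label).name()` could be passed where the listing passes `getEncoding()`.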

Word extraction using tm-extractors, WordReader.java:

import java.io.File;
import java.io.FileInputStream;

import org.textmining.text.extraction.WordExtractor;

public class WordReader {

    private String FilePath;

    public WordReader(String FilePath){
        this.FilePath =  FilePath;
    }

    // Extracts the plain text of the Word document; returns "" on failure
    public String getText() {
        String text = "";
        // try-with-resources closes the stream even if extraction fails
        try (FileInputStream in = new FileInputStream(new File(FilePath))) {
            WordExtractor extractor= new WordExtractor();
            text = extractor.extractText(in);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return text;
    }

    // Uses the file name as the title
    public String getTitle(){
        File f = new File(FilePath);
        return f.getName();
    }

    // Uses the file path as the URL
    public String getUrl(){
        return FilePath;
    }
}

Reading the content of a txt file, TxtReader.java:

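import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class TxtReader {

    private String FilePath;

    public TxtReader(String FilePath){
        this.FilePath = FilePath;
    }

    // Reads the whole file as text; returns what was read so far on failure
    public String getText(){
        StringBuilder sb = new StringBuilder();
        // FileReader decodes with the platform default charset;
        // try-with-resources closes the reader when done
        try (BufferedReader br = new BufferedReader(new FileReader(FilePath))) {
            String r;
            while ((r = br.readLine()) != null) {
                // readLine() strips the line terminator; put one back so that
                // words on adjacent lines do not run together
                sb.append(r).append('\n');
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    // Uses the file name as the title
    public String getTitle(){
        File f = new File(FilePath);
        return f.getName();
    }

    // Uses the file path as the URL
    public String getUrl(){
        return FilePath;
    }
}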
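`FileReader` decodes with the platform default charset, so the same text file can come out garbled on a machine with a different default, which matters for Chinese pages saved as GBK or UTF-8. A minimal sketch of reading with an explicit charset follows; `TextFiles` and `readAll` are hypothetical names, and the sketch assumes the files are small enough to load into memory at once.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: reads a whole text file with an explicitly chosen charset. */
final class TextFiles {

    static String readAll(Path file, Charset charset) {
        try {
            // Line breaks are preserved because the bytes are decoded as a whole.
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            // Rethrown unchecked so callers are not forced to declare IOException.
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `TextFiles.readAll(Paths.get("page.txt"), StandardCharsets.UTF_8)` would then give the same result regardless of the machine's default encoding.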

In short, before building the search index, first extract the text from the crawled Web resources.
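Because a crawl yields a mix of file types, the extraction step needs some way to route each file to the matching reader. The sketch below shows one possible routing by file extension; `ExtractorRouter` and `Kind` are hypothetical names, and the extension list is an assumption covering only the formats discussed in this article.

```java
import java.util.Locale;

/** Hypothetical helper: decides which kind of extractor a crawled file needs. */
final class ExtractorRouter {

    enum Kind { HTML, WORD, TEXT, PDF, UNSUPPORTED }

    static Kind kindOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No extension, or a trailing dot: nothing to go on.
        if (dot < 0 || dot == fileName.length() - 1) {
            return Kind.UNSUPPORTED;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "html":
            case "htm":
                return Kind.HTML;   // e.g. a Jsoup-based parser
            case "doc":
                return Kind.WORD;   // e.g. tm-extractors or POI HWPF
            case "txt":
                return Kind.TEXT;   // plain text reader
            case "pdf":
                return Kind.PDF;    // e.g. PDFBox
            default:
                return Kind.UNSUPPORTED;
        }
    }
}
```

The indexing code can then switch on the returned `Kind` and hand the file to the corresponding reader, skipping anything unsupported.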

