Tika supports several functions:
Document type detection, content extraction, metadata extraction, language detection
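Key features:
- Unified parser interface: Tika wraps third-party parser libraries behind a single Parser interface. As a result, users no longer have to choose a suitable parser library for each file type they encounter.
- Low memory usage: Tika consumes little memory, so it is easy to embed in Java applications. It can also run on resource-constrained platforms such as mobile PDAs.
- Fast processing: content detection and extraction from within an application can be expected to be quick.
- Flexible metadata: Tika understands the metadata models used to describe files.
- Parser integration: Tika can use the various parser libraries available for each document type within a single application.
- MIME type detection: Tika can detect, and extract content from, the media types included in the MIME standard.
- Language detection: Tika includes language identification, so it can be used to handle documents by language on a multilingual website.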
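Content extraction with the Parser interface
CompositeParser
Tika's general-purpose parser classes are CompositeParser and AutoDetectParser. Because CompositeParser follows the Composite design pattern, a group of parser instances can be used as a single parser. CompositeParser also gives access to every class that implements the Parser interface.
AutoDetectParser
This is a subclass of CompositeParser that adds automatic type detection. With it, AutoDetectParser automatically dispatches each incoming document to the appropriate parser class, using the composite approach.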
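The composite idea described above can be sketched without Tika at all. The example below is a minimal illustration, not Tika's implementation: SimpleParser, CompositeSimpleParser, and demo are hypothetical names, and the real Parser interface takes streams and handlers rather than strings.

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeSketch {

    /** Hypothetical, simplified parser contract; Tika's real Parser interface is richer. */
    interface SimpleParser {
        String parse(String mediaType, String input);
    }

    /** A composite: it holds many parsers yet is used as a single parser. */
    static final class CompositeSimpleParser implements SimpleParser {
        private final Map<String, SimpleParser> parsersByType = new HashMap<>();
        private final SimpleParser fallback;

        CompositeSimpleParser(SimpleParser fallback) {
            this.fallback = fallback;
        }

        void register(String mediaType, SimpleParser parser) {
            parsersByType.put(mediaType, parser);
        }

        @Override
        public String parse(String mediaType, String input) {
            // Delegate to the parser registered for this type, or to the fallback.
            return parsersByType.getOrDefault(mediaType, fallback).parse(mediaType, input);
        }
    }

    /** Builds a small composite and routes one input through it. */
    public static String demo(String mediaType, String input) {
        CompositeSimpleParser composite =
                new CompositeSimpleParser((type, in) -> "unknown:" + in);
        composite.register("text/plain", (type, in) -> "text:" + in);
        composite.register("text/html", (type, in) -> "html:" + in);
        return composite.parse(mediaType, input);
    }

    public static void main(String[] args) {
        System.out.println(demo("text/plain", "hello"));
        System.out.println(demo("video/mp4", "hello"));
    }
}
```

The caller only ever sees one parser object; which concrete parser does the work is decided inside the composite, which is the role the text assigns to AutoDetectParser once the type has been detected.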
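The parse() method
Besides parseToString(), you can also use the parse() method of the Parser interface. Its prototype is shown below.
void parse(InputStream stream, ContentHandler handler,
           Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException
The method parameters, briefly:
stream: the InputStream instance created from the document to be parsed.
handler: the ContentHandler object that receives the sequence of XHTML SAX events parsed from the input document; it is responsible for processing the events and exporting the result in a particular form.
metadata: the Metadata object used to pass metadata properties into and out of the parser.
context: a ParseContext instance carrying context-specific information, used to customize the parsing process.
parse() throws IOException if reading from the input stream fails, TikaException if the document obtained from the stream cannot be parsed, and SAXException if the handler cannot process an event.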
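To make the role of the handler parameter concrete, here is a small sketch of a handler that consumes XHTML SAX events and keeps only the text inside the body element. It is an illustration under assumptions: the JDK's own SAX parser stands in as the event source where a Tika parser would normally be, and SaxTextCollector, BodyTextHandler, and bodyText are hypothetical names, not Tika classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextCollector {

    /** Collects the character data found inside the body element. */
    static final class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the XHTML string through the JDK SAX parser and returns the body text. */
    public static String bodyText(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            // Wrapped only to keep the sketch easy to call.
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.toString();
    }
}
```

For example, bodyText("<html><head><title>T</title></head><body><p>Hello</p></body></html>") returns the body text without the title. Tika's BodyContentHandler, used in the examples in this article, plays a comparable role for the events a Tika parser emits.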
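When parsing documents, Tika tries to reuse existing parser libraries such as Apache POI or PDFBox. Most parser implementations are therefore just adapters for these external libraries. Below, we look at how to use the handler and metadata parameters to extract a document's content and metadata. For convenience, the Tika facade class can also be used to call the parser API.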
0. Maven coordinates for Tika:
<!-- Tika: parse text content -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>
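1. Simple usage
1.1 Getting the file type
Tika supports the Internet media types defined by MIME.
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

/**
 * Detects the media type of a file.
 */
public static void test1() {
    File file = new File("G:/tikatest/test.mp4");
    Tika tika = new Tika();
    String filetype = null;
    try {
        filetype = tika.detect(file);
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(filetype);
}
Result: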
video/mp4
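Renaming the file to test, with the extension removed, still produces the same result, because detection relies on both the file extension and the file content.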
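Content-based detection of this kind usually comes down to comparing the first bytes of a file against known signatures ("magic numbers"). The sketch below shows the idea for four well-known signatures. It is a simplified illustration with hypothetical names (MagicSniffer, sniff), not Tika's detector, which uses a far larger signature database together with file name hints.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffer {

    /** Returns a media type guessed from the leading bytes, or a generic fallback. */
    public static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** True if data begins with every byte of prefix, in order. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the decision is made from the bytes themselves, removing the file extension does not change the answer, which matches the behaviour observed above.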
1.2 Extracting txt content
To parse a file, you typically use the parseToString() method of the Tika facade class.
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

/**
 * Reads the content of a txt file.
 */
public static void test2() {
    File file = new File("G:/tikatest/test.txt");
    Tika tika = new Tika();
    String filecontent = null;
    try {
        filecontent = tika.parseToString(file);
    } catch (IOException | TikaException e) {
        e.printStackTrace();
    }
    System.out.println("Extracted Content: " + filecontent);
}
Result:
Extracted Content: 111
222
333
444
555
666
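Supplement: for a plain text file, the same result can be obtained with FileUtils from the commons-io package:
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

public static void test3() {
    File file = new File("G:/tikatest/test.txt");
    String s = null;
    try {
        // UTF-8 is an assumption; use the charset the file was written in.
        s = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(s);
}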
1.3 Extracting metadata
What is metadata? It is the additional information supplied with a file. For an audio file, for example, the artist name, album name, and title are all part of its metadata.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test4() {
    File file = new File("G:/tikatest/test.mp4");
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // try-with-resources closes the stream and never passes a null stream to parse()
    try (FileInputStream inputstream = new FileInputStream(file)) {
        parser.parse(inputstream, handler, metadata, context);
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
    System.out.println(handler.toString());
    // Print every metadata element
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Result:
Software: OnePlus3-user 7.1.1 NMF26F 76 dev-keys
GPS Altitude Ref: Unknown (2)
Metering Mode: Center weighted average
Model: ONEPLUS A3010
meta:save-date: 2017-09-02T16:32:15
File Name: apache-tika-4154811460990247864.tmp
Exposure Mode: Auto exposure
Exif Version: 2.20
Sensing Method: One-chip color area sensor
tiff:ImageLength: 540
exif:Flash: false
Creation-Date: 2017-09-02T16:32:15
Interoperability Version: 1.00
ISO Speed Ratings: 640
X Resolution: 72 dots per inch
Shutter Speed Value: 1/20 sec
tiff:ImageWidth: 720
Thumbnail Width Pixels: 0
tiff:XResolution: 72.0
Image Width: 720 pixels
Last-Save-Date: 2017-09-02T16:32:15
exif:FNumber: 2.0
Number of Tables: 4 Huffman tables
F-Number: f/2.0
Color Space: sRGB
meta:creation-date: 2017-09-02T16:32:15
Resolution Units: inch
Data Precision: 8 bits
File Modified Date: 星期二 十月 16 22:15:54 +08:00 2018
tiff:BitsPerSample: 8
Last-Modified: 2017-09-02T16:32:15
tiff:YResolution: 72.0
YCbCr Positioning: Center of pixel array
Compression Type: Baseline
Components Configuration: YCbCr
exif:IsoSpeedRatings: 640
X-Parsed-By: org.apache.tika.parser.DefaultParser
Focal Length 35: 28 mm
modified: 2017-09-02T16:32:15
Brightness Value: 0
Thumbnail Offset: 874 bytes
Exif Image Height: 3480 pixels
Focal Length: 4.3 mm
Thumbnail Length: 14211 bytes
White Balance Mode: Auto white balance
Content-Type: image/jpeg
Make: OnePlus
tiff:Make: OnePlus
Date/Time Original: 2017:09:02 08:32:15
Scene Capture Type: Standard
Exif Image Width: 4640 pixels
Makernote: [26 values]
dcterms:created: 2017-09-02T16:32:15
exif:ExposureTime: 0.05
date: 2017-09-02T16:32:15
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
tiff:ResolutionUnit: Inch
Interoperability Index: Recommended Exif Interoperability Rules (ExifR98)
Flash: Flash did not fire, auto
Date/Time Digitized: 2017:09:02 08:32:15
File Size: 50158 bytes
Thumbnail Height Pixels: 0
Resolution Unit: Inch
Sub-Sec Time Original: 994455
XMP Value Count: 4
tiff:Software: OnePlus3-user 7.1.1 NMF26F 76 dev-keys
Aperture Value: f/2.0
Number of Components: 3
dcterms:modified: 2017-09-02T16:32:15
tiff:Model: ONEPLUS A3010
Image Height: 540 pixels
Sub-Sec Time Digitized: 994455
Sub-Sec Time: 994455
Scene Type: Directly photographed image
Exposure Time: 0.05 sec
exif:DateTimeOriginal: 2017-09-02T16:32:15
exif:FocalLength: 4.26
Compression: JPEG (old-style)
FlashPix Version: 1.00
Date/Time: 2017:09:02 08:32:15
Exposure Program: Unknown (0)
Y Resolution: 72 dots per inch
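1.4 Language detection
Tika can detect a fixed set of languages (18 here; the exact number may vary by Tika version):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test6() {
    File file = new File("G:/tikatest/test.txt");
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // Parse the document to obtain its text
    try (FileInputStream content = new FileInputStream(file)) {
        parser.parse(content, handler, metadata, new ParseContext());
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
    // Identify the language of the extracted text
    LanguageIdentifier object = new LanguageIdentifier(handler.toString());
    // object.isReasonablyCertain() tells whether the guess can be trusted
    System.out.println("Language name :" + object.getLanguage());
}
Result (if test.txt is the digits-only file from 1.2, the detected language is not meaningful):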
Language name :lt
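1.5 Extracting PDF
The extraction is thorough enough to pick up the links and even small punctuation marks inside the document. Both the content and the metadata of a PDF can be obtained.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test7() throws IOException, TikaException, SAXException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // Parse the document with the PDF parser
    PDFParser pdfparser = new PDFParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/4.pdf"))) {
        pdfparser.parse(inputstream, handler, metadata, pcontext);
    }
    // Content of the document
    System.out.println("Contents of the PDF :" + handler.toString());
    // Metadata of the document
    System.out.println("Metadata of the PDF:");
    for (String name : metadata.names()) {
        System.out.println(name + " : " + metadata.get(name));
    }
}
Result: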
Contents of the PDF : 個人簡歷
...............................
Metadata of the PDF:
access_permission:extract_for_accessibility : true
pdf:docinfo:title : 個人簡歷
meta:save-date : 2018-06-12T07:41:54Z
pdf:docinfo:modified : 2018-06-12T07:41:54Z
dcterms:created : 2018-06-12T07:41:54Z
Author : liqiang qiao
date : 2018-06-12T07:41:54Z
access_permission:can_modify : true
access_permission:modify_annotations : true
creator : liqiang qiao
Creation-Date : 2018-06-12T07:41:54Z
title : 個人簡歷
meta:author : liqiang qiao
access_permission:fill_in_form : true
created : Tue Jun 12 15:41:54 CST 2018
pdf:docinfo:producer : Microsoft® Word 2013
dc:format : application/pdf; version=1.5
access_permission:can_print : true
pdf:docinfo:created : 2018-06-12T07:41:54Z
xmp:CreatorTool : Microsoft® Word 2013
Last-Save-Date : 2018-06-12T07:41:54Z
dc:title : 個人簡歷
access_permission:assemble_document : true
dcterms:modified : 2018-06-12T07:41:54Z
meta:creation-date : 2018-06-12T07:41:54Z
pdf:docinfo:creator : liqiang qiao
dc:creator : liqiang qiao
pdf:PDFVersion : 1.5
Last-Modified : 2018-06-12T07:41:54Z
modified : 2018-06-12T07:41:54Z
xmpTPg:NPages : 2
access_permission:can_print_degraded : true
pdf:encrypted : false
access_permission:extract_content : true
producer : Microsoft® Word 2013
pdf:docinfo:creator_tool : Microsoft® Word 2013
Content-Type : application/pdf
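1.6 Extracting MS Office documents (reading Word and Excel)
Extract the content and metadata from Microsoft Office documents.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    // OOXML parser for Office Open XML files such as .docx and .xlsx
    OOXMLParser msofficeparser = new OOXMLParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/test.docx"))) {
        msofficeparser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Result: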
Contents of the document: -Xms5200M -Xmx5200M -XX:PermSize=512M -XX:MaxPermSize=512M
http_load使用教程: https://www.cnblogs.com/shijingjing07/p/6539179.html
1.默認配置;
內存
線程數量:
1.只修改JVM參數
內存
2.並發
2.修改JVM與並發
JVM
並發
Metadata of the document:
cp:revision: 19
meta:last-author: liqiang qiao
Last-Author: liqiang qiao
meta:save-date: 2017-12-14T10:25:00Z
Application-Name: Microsoft Office Word
Author: liqiang qiao
dcterms:created: 2017-12-14T09:28:00Z
Application-Version: 15.0000
Character-Count-With-Spaces: 195
date: 2017-12-14T10:25:00Z
Total-Time: 57
extended-properties:Template: Normal.dotm
meta:line-count: 1
creator: liqiang qiao
publisher:
Word-Count: 29
meta:paragraph-count: 1
Creation-Date: 2017-12-14T09:28:00Z
extended-properties:AppVersion: 15.0000
meta:author: liqiang qiao
Line-Count: 1
extended-properties:Application: Microsoft Office Word
Paragraph-Count: 1
Last-Save-Date: 2017-12-14T10:25:00Z
Revision-Number: 19
dcterms:modified: 2017-12-14T10:25:00Z
meta:creation-date: 2017-12-14T09:28:00Z
Template: Normal.dotm
Page-Count: 1
meta:character-count: 167
dc:creator: liqiang qiao
meta:word-count: 29
Last-Modified: 2017-12-14T10:25:00Z
extended-properties:Company:
modified: 2017-12-14T10:25:00Z
xmpTPg:NPages: 1
extended-properties:TotalTime: 57
dc:publisher:
Character Count: 167
meta:page-count: 1
meta:character-count-with-spaces: 195
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
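Supplement: reading the content of an Excel file with Tika
For example, the content of an Excel file:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    OOXMLParser msofficeparser = new OOXMLParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/user.xlsx"))) {
        msofficeparser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
}
Result: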
Contents of the document:Sheet1
序號 用戶名字 用戶電話 用戶郵箱 用戶賬戶 用戶類型 密碼
1 rrrrrr 15888585954 954318308@qq.com root111 管理員 111222
2 001 15898569856 qiao_liqiang@163.com 001 普通用戶 111222
3 超級管理員 15898569856 5555@qq.com root8 管理員 111222
4 qqq 15898569856 qiao_liqiang@163.com 1231 普通用戶 111222
5 張三 18558458569 33335658@qq.com 333 普通用戶 111222
6 李四 15898569856 qiao_liqiang@163.com 4444 普通用戶 111222
7 超級管理員 15898569856 5555@qq.com root5 管理員 111222
8 張三 18434391711 qiao_liqiang@163.com root7 管理員 111222
9 張三 18434391711 qiao_liqiang@163.com root3 管理員 111222
10 超管 15898569856 qiao_liqiang@163.com root 管理員 111222
11 8888 15898569856 qiao_liqiang@163.com 8888 普通用戶 111222
12 超級管理員 15888585954 954318308@qq.com roo6 管理員 111222
13 張三 18434391711 qiao_liqiang@163.com root4 管理員 111222
1.7 Extracting the content of a txt document
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    TXTParser txtParser = new TXTParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/test.txt"))) {
        txtParser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
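1.8 Reading HTML
What you get is the text of the parsed HTML. If you need the raw source, read the file with FileUtils or IOUtils from commons-io.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    HtmlParser htmlParser = new HtmlParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/index.html"))) {
        htmlParser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
The HTML content is as follows:
Result: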
Contents of the document:
Welcome to nginx!
If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.
For online documentation and support please refer to
nginx.org.
Commercial support is available at
nginx.com.
Thank you for using nginx.
Metadata of the document:
title: Welcome to nginx!
Content-Encoding: ISO-8859-1
Content-Type: text/html; charset=ISO-8859-1
dc:title: Welcome to nginx!
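Supplement: reading the raw source with FileUtils
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

public static void test8() throws IOException {
    // UTF-8 is an assumption; use the charset the file was written in.
    String s = FileUtils.readFileToString(new File("G:/tikatest/index.html"), StandardCharsets.UTF_8);
    System.out.println(s);
}
Result: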
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
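1.9 Reading class files (a decompiler-like summary)
Decompiled view of the class file:
Extracting the class content with Tika (this yields a summary of the class's methods):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.asm.ClassParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    ClassParser parser = new ClassParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/UUIDUtil.class"))) {
        parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Result: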
Contents of the document:package cn.xm.jwxt.utils;
public synchronized class UUIDUtil {
public void UUIDUtil();
public static String getUUID();
public static String getUUID2();
}
Metadata of the document:
title: UUIDUtil
resourceName: UUIDUtil.class
dc:title: UUIDUtil
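1.10 Reading jar files
Tika can extract a summary and the metadata of the class files inside a jar. This can be used to list all the class information in an archive, or to write a utility that checks whether a given class is present in a given jar file.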
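The "is this class in that jar" utility mentioned above does not strictly need Tika, since a jar is a zip archive whose entry names mirror package paths. The following sketch uses only the JDK's java.util.jar API; JarClassFinder and find are hypothetical names chosen for illustration, and matching on the simple class name is an assumption about what such a tool would want.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarClassFinder {

    /** Lists the entries in the jar whose simple class name equals the given name. */
    public static List<String> find(File jar, String simpleClassName) {
        List<String> matches = new ArrayList<>();
        String fileName = simpleClassName + ".class";
        try (JarFile jarFile = new JarFile(jar)) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                // Match classes in a package ("a/b/Name.class") or in the default package.
                if (name.endsWith("/" + fileName) || name.equals(fileName)) {
                    matches.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }
}
```

Calling find(new File("G:/tikatest/t.jar"), "XMLReaderUtils") would return the matching entry paths, if any. Tika's PackageParser, shown next, goes further by also summarizing the members of each class.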
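import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pkg.PackageParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    // Raise the write limit so that large archives can be read in full
    BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    PackageParser parser = new PackageParser();
    try (FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/t.jar"))) {
        parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Result: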
..........................
org/apache/tika/utils/ServiceLoaderUtils.class
package org.apache.tika.utils;
public synchronized class ServiceLoaderUtils {
public void ServiceLoaderUtils();
public static void sortLoadedClasses(java.util.List);
public static Object newInstance(String);
public static Object newInstance(String, ClassLoader);
}
org/apache/tika/utils/XMLReaderUtils$1.class
package org.apache.tika.utils;
final synchronized class XMLReaderUtils$1 implements org.xml.sax.EntityResolver {
void XMLReaderUtils$1();
public org.xml.sax.InputSource resolveEntity(String, String) throws org.xml.sax.SAXException, java.io.IOException;
}
org/apache/tika/utils/XMLReaderUtils$2.class
package org.apache.tika.utils;
final synchronized class XMLReaderUtils$2 implements javax.xml.stream.XMLResolver {
void XMLReaderUtils$2();
public Object resolveEntity(String, String, String, String) throws javax.xml.stream.XMLStreamException;
}
org/apache/tika/utils/XMLReaderUtils.class
package org.apache.tika.utils;
public synchronized class XMLReaderUtils {
private static final java.util.logging.Logger LOG;
private static final org.xml.sax.EntityResolver IGNORING_SAX_ENTITY_RESOLVER;
private static final javax.xml.stream.XMLResolver IGNORING_STAX_ENTITY_RESOLVER;
public void XMLReaderUtils();
public static org.xml.sax.XMLReader getXMLReader() throws org.apache.tika.exception.TikaException;
public static javax.xml.parsers.SAXParser getSAXParser() throws org.apache.tika.exception.TikaException;
public static javax.xml.parsers.SAXParserFactory getSAXParserFactory();
public static javax.xml.parsers.DocumentBuilderFactory getDocumentBuilderFactory();
public static javax.xml.parsers.DocumentBuilder getDocumentBuilder() throws org.apache.tika.exception.TikaException;
public static javax.xml.stream.XMLInputFactory getXMLInputFactory();
private static void trySetSAXFeature(javax.xml.parsers.DocumentBuilderFactory, String, boolean);
private static void tryToSetStaxProperty(javax.xml.stream.XMLInputFactory, String, boolean);
public static javax.xml.transform.Transformer getTransformer() throws org.apache.tika.exception.TikaException;
static void <clinit>();
}
org/apache/tika/utils/package-info.class
package org.apache.tika.utils;
abstract interface package-info {
}
Metadata of the document:
Content-Type: application/zip
Supplement: if BodyContentHandler is created without a write-limit argument, reading a large document fails with the following error:
Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
The fix is to pass a larger write limit to the constructor (passing -1 disables the limit entirely):
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
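To show why the error above appears and why "text up to the limit is however available", here is a small sketch of a SAX handler with a character limit. It illustrates the general mechanism only and is not Tika's WriteOutContentHandler: WriteLimitSketch, LimitedTextHandler, LimitReached, and collect are hypothetical names, and the JDK SAX parser is used as the event source.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Signals that the handler refused further text. */
    static final class LimitReached extends SAXException {
        LimitReached(int limit) {
            super("write limit of " + limit + " characters reached");
        }
    }

    /** Collects text until the limit is hit, then aborts parsing. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit; // a negative value means "no limit" in this sketch

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (limit >= 0 && text.length() + length > limit) {
                // Keep what still fits, then stop the parse.
                text.append(ch, start, limit - text.length());
                throw new LimitReached(limit);
            }
            text.append(ch, start, length);
        }
    }

    /** Returns the collected text, truncated to the limit if it was exceeded. */
    public static String collect(String xml, int limit) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (LimitReached e) {
            // Expected when the limit is hit; the text gathered so far is still usable.
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.text.toString();
    }
}
```

With a limit of 5, collect("<a>hello world</a>", 5) keeps only the first five characters, while a negative limit returns the full text. Raising the limit on BodyContentHandler, as shown above, trades memory for completeness in the same way.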
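1.11 Extracting image information
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.jpeg.JpegParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    JpegParser jpegParser = new JpegParser();
    try (FileInputStream inputstream = new FileInputStream(new File("g:/tikatest/5.jpeg"))) {
        jpegParser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Result: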
Contents of the document:
Metadata of the document:
Number of Tables: 4 Huffman tables
Number of Components: 3
Image Height: 192 pixels
Resolution Units: inch
File Name: apache-tika-7234240523307196989.tmp
Data Precision: 8 bits
File Modified Date: 星期三 十月 17 21:43:39 +08:00 2018
tiff:BitsPerSample: 8
Compression Type: Baseline
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
tiff:ImageLength: 192
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
X Resolution: 96 dots
File Size: 9216 bytes
tiff:ImageWidth: 256
Thumbnail Height Pixels: 0
Thumbnail Width Pixels: 0
Image Width: 256 pixels
Y Resolution: 96 dots
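1.12 Extracting MP4 information
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mp4.MP4Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static void test8() throws TikaException, SAXException, IOException {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext pcontext = new ParseContext();
    MP4Parser mp4Parser = new MP4Parser();
    try (FileInputStream inputstream = new FileInputStream(new File("g:/tikatest/test.mp4"))) {
        mp4Parser.parse(inputstream, handler, metadata, pcontext);
    }
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }
}
Result: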
Contents of the document:
Metadata of the document:
dcterms:modified: 2017-07-20T10:25:23Z
xmpDM:duration: 39.5
meta:creation-date: 2017-07-20T10:25:23Z
meta:save-date: 2017-07-20T10:25:23Z
Last-Modified: 2017-07-20T10:25:23Z
dcterms:created: 2017-07-20T10:25:23Z
xmpDM:audioSampleRate: 10000
date: 2017-07-20T10:25:23Z
tiff:ImageLength: 578
modified: 2017-07-20T10:25:23Z
Creation-Date: 2017-07-20T10:25:23Z
tiff:ImageWidth: 442
Content-Type: video/mp4
Last-Save-Date: 2017-07-20T10:25:23Z
Supplement: when the content of a file is too large, the write limit has to be raised, as follows (for reading large files):
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
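Summary:
This completes the basic usage of Apache Tika. With the usage shown here, Tika does not return the images embedded in Word, PDF, and similar files, but it can parse their text, and the content of most common file types can be extracted. This is useful in some scenarios: a file server, for example, can extract the content and store it in a database or in files, then use Solr or database queries for fuzzy search.
For HTML, Office, and similar files, Tika extracts the text inside them. When the raw source is needed, FileUtils can be used instead, and combining the two works best.
Given time, one could use Swing to build a tool on top of Apache Tika for searching file contents and finding classes, similar to Everything but able to read inside files as well. This is only an idea for now, to be implemented gradually when time allows.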
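As a starting point for that idea, the search core could look something like the sketch below. It is only an assumption about how such a tool might be structured: ContentSearchSketch, extractText, and search are hypothetical names, and extractText simply reads files as UTF-8 text where a real tool might call the Tika facade's parseToString() instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ContentSearchSketch {

    /** Hypothetical extraction step; a real tool might call Tika's parseToString here. */
    static String extractText(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return ""; // unreadable files simply never match
        }
    }

    /** Returns the regular files under root whose extracted text contains the keyword. */
    public static List<Path> search(Path root, String keyword) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> extractText(p).contains(keyword))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This version re-reads every file on each query. Storing the extracted text in an index such as Solr, as the summary suggests, would avoid that cost.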