htmlcleaner Usage Guide

This article is a repost. 2019-08-12 20:35

Overview

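When programming or writing a web crawler, you often need to parse HTML and extract the useful data from it. A good tool helps a lot here, and there are many to choose from, such as htmlcleaner and htmlparser.
Having used both, I find htmlcleaner easier to work with than htmlparser, and its XPath support is especially handy.
htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip
The example below uses htmlcleaner to meet three requirements: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1".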

html-clean-demo.html

 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>

HtmlCleanerDemo.java

 
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the demo file, decoding it as GBK to match its declared charset.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name (true = search the whole subtree).
        Object[] ns = node.getElementsByName("title", true);    // title
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Select by XPath: every li under the div with class="d_1".
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Select by attribute value: name="my_href" (recursive, case-sensitive).
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}
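
To see what the XPath expression `//div[@class='d_1']//li` selects without needing the htmlcleaner jar, here is a minimal sketch that evaluates it with the JDK's built-in `javax.xml.xpath` instead of htmlcleaner. This is a swapped-in technique, not htmlcleaner's own evaluator. The class name `XPathSketch`, the helper `select`, and the shortened `DEMO` markup are assumptions made for illustration. The JDK parser only accepts well-formed XML, so the excerpt is kept strictly well-formed and omits the DOCTYPE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Assumption: a well-formed excerpt of html-clean-demo.html (no DOCTYPE, one link only).
    static final String DEMO =
        "<body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul><li><a name=\"my_href\" href=\"1.html\">text-1</a></li></ul></div>"
        + "</body>";

    // Hypothetical helper: returns the trimmed text of every node matched by expr.
    static List<String> select(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions so callers stay simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only the li elements inside the div with class="d_1" match.
        System.out.println(select(DEMO, "//div[@class='d_1']//li"));
        // The li that wraps the link sits in the other div, so it is not selected above.
        System.out.println(select(DEMO, "//a[@name='my_href']/@href"));
    }
}
```

The first call should list only bar, foo and gzz, because the predicate `[@class='d_1']` restricts the search to the first div. The second expression uses the JDK's XPath 1.0 implementation; whether htmlcleaner's evaluator accepts the same expression is not something this sketch shows.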

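The argument to cleaner.clean() can be a file, a URL, or a string of HTML content.
The most commonly used methods are probably evaluateXPath, getElementsByAttValue, and getElementsByName. Also worth noting: htmlcleaner copes well with non-standard, malformed HTML.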
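
As a complement to the file-based demo, here is a minimal sketch of passing HTML to `clean()` as a string. It assumes the htmlcleaner jar is on the classpath. The class name `CleanStringSketch`, the helper `hrefs`, and the `MESSY` input (which deliberately leaves its `li` tags unclosed) are made up for illustration. Exceptions are caught generically because, as far as I know, the checked exceptions declared by `clean(String)` differ between htmlcleaner versions.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class CleanStringSketch {

    // Hypothetical input: same shape as the demo page, but the <li> tags are never closed.
    static final String MESSY =
        "<html><head><title>html clean demo</title></head><body>"
        + "<ul><li><a name=\"my_href\" href=\"1.html\">text-1</a>"
        + "<li><a name=\"my_href\" href=\"2.html\">text-2</a></ul>"
        + "</body></html>";

    // Hypothetical helper: collects the href of every element with name="my_href".
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        try {
            // clean(String) parses HTML held in memory instead of reading a file.
            TagNode root = new HtmlCleaner().clean(html);
            // Arguments: attribute name, attribute value, recursive, case-sensitive.
            Object[] links = root.getElementsByAttValue("name", "my_href", true, true);
            for (Object o : links) {
                out.add(((TagNode) o).getAttributeByName("href"));
            }
        } catch (Exception e) {
            // Generic catch keeps the sketch compiling across htmlcleaner versions.
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(hrefs(MESSY));
    }
}
```

The same `TagNode` methods used in HtmlCleanerDemo apply here; only the way the HTML reaches `clean()` changes.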
