Implementing Sensitive-Word Filtering for Websites (Word Lists Included)

Reposted article · 2018-07-20 11:31 · java

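Almost every website now needs sensitive-word filtering; it has become a standard feature. If your site has none, or you have not handled it properly, expect a call from the relevant authorities.
I recently researched how to implement sensitive-word filtering for Java web sites. I collected material online, verified each approach myself, and am writing up the results here for reference.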

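1. Sensitive-Word Filter Utility Class

Load the word list into an ArrayList, then use a double loop to find every substring of the input that matches an entry in the list. Each match is replaced with asterisks (*), and the result is the masked string.

In my tests this approach matched accurately and ran at a good speed.

Initializing the word list: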

// Initialize the sensitive-word list
public void InitializationWork()
{
    // Pre-build the mask string that replacements are cut from
    replaceAll = new StringBuilder(replceSize);
    for (int x = 0; x < replceSize; x++)
    {
        replaceAll.append(replceStr);
    }
    // Load the word list
    arrayList = new ArrayList<String>();
    InputStreamReader read = null;
    BufferedReader bufferedReader = null;
    try
    {
        read = new InputStreamReader(SensitiveWord.class.getClassLoader().getResourceAsStream(fileName), encoding);
        bufferedReader = new BufferedReader(read);
        for (String txt = null; (txt = bufferedReader.readLine()) != null;)
        {
            // Skip duplicate entries
            if (!arrayList.contains(txt))
                arrayList.add(txt);
        }
    }
    catch (UnsupportedEncodingException e)
    {
        e.printStackTrace();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    finally
    {
        try
        {
            if (null != bufferedReader)
                bufferedReader.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        try
        {
            if (null != read)
                read.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
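
The loader above closes its streams by hand and calls arrayList.contains for every line, which rescans the list each time. As a minimal sketch of the same loading step, assuming Java 7 or later, try-with-resources and a LinkedHashSet can do both jobs. The class name WordListLoader and its load method are hypothetical and are not part of the downloadable SensitiveWord code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: reads one word per line, dropping blanks and duplicates. */
final class WordListLoader {
    static List<String> load(Reader source) {
        // LinkedHashSet removes duplicates in constant time and keeps file order
        Set<String> unique = new LinkedHashSet<>();
        // try-with-resources closes the reader even if readLine throws
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    unique.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(unique);
    }
}
```

To read the same classpath resource as the original code, the Reader passed in could be the InputStreamReader built from getResourceAsStream(fileName) and encoding, exactly as in the listing above.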

Filtering the text:

public String filterInfo(String str)
{
    sensitiveWordSet = new HashSet<String>();
    sensitiveWordList = new ArrayList<>();
    StringBuilder buffer = new StringBuilder(str);
    // Maps the start index of each match to its end index (exclusive)
    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(arrayList.size());
    String temp;
    for (int x = 0; x < arrayList.size(); x++)
    {
        temp = arrayList.get(x);
        if (temp.isEmpty())
        {
            // indexOf("") always succeeds, so a blank entry would loop forever
            continue;
        }
        int findIndexSize = 0;
        for (int start = -1; (start = buffer.indexOf(temp, findIndexSize)) > -1;)
        {
            findIndexSize = start + temp.length(); // resume searching after this match
            Integer mapStart = hash.get(start); // end index already recorded for this start, if any
            if (mapStart == null || findIndexSize > mapStart) // keep the longest match per start index
            {
                hash.put(start, findIndexSize);
            }
        }
    }
    for (Integer startIndex : hash.keySet())
    {
        Integer endIndex = hash.get(startIndex);
        // Extract the matched word and record it so matches can be counted
        String sensitive = buffer.substring(startIndex, endIndex);
        if (!sensitive.contains("*"))
        {
            // Add the word to the collections
            sensitiveWordSet.add(sensitive);
            sensitiveWordList.add(sensitive);
        }
        buffer.replace(startIndex, endIndex, replaceAll.substring(0, endIndex - startIndex));
    }
    hash.clear();
    return buffer.toString();
}
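
The two fragments above depend on fields that are not shown here (arrayList, replaceAll, sensitiveWordSet and so on). As a minimal self-contained sketch of the same list-scan idea, the class below masks every occurrence of every listed word with asterisks of the same length. ListScanFilter is a hypothetical name, and the code is a simplification, not the downloadable SensitiveWord class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified version of the list-scan approach. */
final class ListScanFilter {
    private final List<String> words = new ArrayList<>();

    ListScanFilter(List<String> wordList) {
        for (String w : wordList) {
            // Skip blank entries: indexOf("") always succeeds and would never terminate
            if (w != null && !w.isEmpty() && !words.contains(w)) {
                words.add(w);
            }
        }
    }

    /** Replaces each occurrence of each listed word with '*' characters of equal length. */
    String filter(String text) {
        char[] out = text.toCharArray();
        for (String word : words) {           // outer loop: every listed word
            int from = 0;
            int start;
            // inner loop: every occurrence of that word in the unmodified input
            while ((start = text.indexOf(word, from)) > -1) {
                Arrays.fill(out, start, start + word.length(), '*');
                from = start + word.length(); // continue after this hit
            }
        }
        return new String(out);
    }
}
```

With the words bad, badword and evil, filter("a badword and an evil plan") returns "a ******* and an **** plan". The cost grows with the number of words times the length of the text, which is worth keeping in mind for large word lists.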

Download: SensitiveWord
Link: https://pan.baidu.com/s/12RcZ8-jNHMAR__VscRUDfQ Password: qmcw

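2. Java Keyword Filtering

This approach matches with a regular expression. In my tests it was slightly slower than the first approach, and its matching accuracy was good.

Main code: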

// Build the regular expression string from keywords.properties
private static void initPattern()
{
    StringBuffer patternBuffer = new StringBuffer();
    try
    {
        InputStream in = KeyWordFilter.class.getClassLoader().getResourceAsStream("keywords.properties");
        Properties property = new Properties();
        property.load(in);
        Enumeration<?> enu = property.propertyNames();
        patternBuffer.append("(");
        while (enu.hasMoreElements())
        {
            String scontent = (String) enu.nextElement();
            patternBuffer.append(scontent + "|");
            keywordsCount++;
        }
        // Drop the trailing "|" before closing the group
        patternBuffer.deleteCharAt(patternBuffer.length() - 1);
        patternBuffer.append(")");
        // On Unix, convert to UTF-8:
        // pattern = Pattern.compile(new String(patternBuf.toString().getBytes("ISO-8859-1"), "UTF-8"));
        // On Windows, convert to gb2312:
        // pattern = Pattern.compile(new String(patternBuf.toString().getBytes("ISO-8859-1"), "gb2312"));
        // Encoding conversion is left disabled; the pattern is compiled as-is
        pattern = Pattern.compile(patternBuffer.toString());
    }
    catch (IOException ioEx)
    {
        ioEx.printStackTrace();
    }
}

private static String doFilter(String str)
{
    Matcher m = pattern.matcher(str);
    // while (m.find()) { // find substrings that match the pattern
    //     System.out.println("The result is here :" + m.group());
    // }
    // Choose the replacement; here each match becomes a single "*"
    str = m.replaceAll("*");
    return str;
}
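
The listing above joins the keywords into one alternation without escaping them, so a keyword containing a character such as "." or "(" would be read as regex syntax. The sketch below is one possible self-contained variant, not the KeyWordFilter source: it quotes each keyword with Pattern.quote, tries longer keywords first, and masks each match with one asterisk per character instead of a single "*". The class name RegexFilterSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical regex-based filter built from an in-memory keyword list. */
final class RegexFilterSketch {
    private final Pattern pattern;

    RegexFilterSketch(List<String> keywords) {
        String alternation = keywords.stream()
                .filter(k -> !k.isEmpty())
                // Alternation takes the first branch that matches, so put long words first
                .sorted(Comparator.comparingInt(String::length).reversed())
                // Treat every keyword as literal text
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // "(?!)" never matches; it stands in for an empty keyword list
        pattern = Pattern.compile(alternation.isEmpty() ? "(?!)" : alternation);
    }

    /** Replaces each match with as many '*' characters as the match is long. */
    String filter(String text) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char[] stars = new char[m.end() - m.start()];
            Arrays.fill(stars, '*');
            m.appendReplacement(out, new String(stars));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the keywords bad, badword and a.b, filter("badword a.b axb") returns "******* *** axb": the quoted "a.b" no longer matches "axb", and "badword" is masked as a whole instead of only its first three letters.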

Download: KeyWordFilter
Link: http://pan.baidu.com/s/1kVBl803 Password: xi24

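3. Filtering with the DFA Algorithm

This approach sounds impressive: it uses a DFA (deterministic finite automaton). I do not fully understand the algorithm myself. In my tests its matching accuracy was poor, although it was fast. It can probably be improved, and I would welcome improvements from anyone who knows it better.

There are two main files: SensitivewordFilter.java and SensitiveWordInit.java

Main code: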

public int CheckSensitiveWord(String txt, int beginIndex, int matchType)
{
    boolean flag = false; // end-of-word flag, originally meant for one-character words
    int matchFlag = 0;    // number of matched characters, 0 by default
    char word = 0;
    Map nowMap = sensitiveWordMap;
    for (int i = beginIndex; i < txt.length(); i++)
    {
        word = txt.charAt(i);
        nowMap = (Map) nowMap.get(word); // look up the entry for this character
        if (nowMap != null)
        {
            // Found: count the character, then check whether a word ends here
            matchFlag++;
            if ("1".equals(nowMap.get("isEnd")))
            {
                // A complete word ends at this character
                flag = true;
                if (SensitivewordFilter.minMatchTYpe == matchType)
                {
                    // Minimum-match rule returns at once; maximum-match keeps searching
                    break;
                }
            }
        }
        else
        {
            // Not found: stop
            break;
        }
    }
    if (matchFlag < 2 || !flag)
    {
        // As written, a match counts only if it is at least 2 characters long and a word ended
        matchFlag = 0;
    }
    return matchFlag;
}
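
As far as I can tell from the listing above, matchFlag counts every character that is found, so under the maximum-match rule it can include characters after the last complete word, and the matchFlag < 2 check discards one-character words. That may explain part of the poor accuracy, though I have not verified it against the full source. The sketch below is one possible correction under those assumptions: it uses typed nodes instead of nested raw Maps and remembers only the length of the last complete word. TrieFilterSketch and its members are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical DFA-style (trie) filter that reports the longest complete word. */
final class TrieFilterSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd; // true when the path from the root spells a complete word
    }

    private final Node root = new Node();

    TrieFilterSketch(List<String> words) {
        for (String w : words) {
            Node node = root;
            for (int i = 0; i < w.length(); i++) {
                node = node.next.computeIfAbsent(w.charAt(i), c -> new Node());
            }
            if (node != root) {
                node.isEnd = true; // ignore empty strings
            }
        }
    }

    /** Length of the longest listed word starting at beginIndex, or 0 if none starts there. */
    int matchLength(String txt, int beginIndex) {
        Node node = root;
        int longest = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            node = node.next.get(txt.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isEnd) {
                longest = i - beginIndex + 1; // record complete words only
            }
        }
        return longest;
    }

    /** Masks every longest match with '*' characters, scanning left to right. */
    String filter(String txt) {
        char[] out = txt.toCharArray();
        int i = 0;
        while (i < txt.length()) {
            int len = matchLength(txt, i);
            if (len > 0) {
                Arrays.fill(out, i, i + len, '*');
                i += len;
            } else {
                i++;
            }
        }
        return new String(out);
    }
}
```

With the words abc, abcde and x, matchLength("abcdf", 0) returns 3 because the trailing "d" belongs to no complete word, and filter("abcdf x abcde") returns "***df * *****". Each position is examined by walking the tree once, so the cost per position depends on the longest word and not on the size of the word list.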

Download: SensitivewordFilter
Link: http://pan.baidu.com/s/1ccsa66 Password: mc1x

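4. Multiway-Tree Search Algorithm

This approach uses a multiway-tree search. For how the algorithm works, see any data-structures reference. It ships as a jar, so filtering is a single call.

In my tests this approach matched accurately but was somewhat slow.

Usage: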

// Filter sensitive words
FilteredResult result = WordFilterUtil.filterText(str, '*');
// Get the filtered content
System.out.println("Filtered string:\n" + result.getFilteredContent());
// Get the original string
System.out.println("Original string:\n" + result.getOriginalContent());
// Get the sensitive words that were replaced
System.out.println("Replaced sensitive words:\n" + result.getBadWords());
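
The jar's internals are not shown in this article, so the following is only a guess at the general technique, not the WordFilterUtil implementation. In a multiway tree for this purpose, each node has one child per possible next character, and a word is a path from the root. The sketch returns both the masked text and the words it masked, loosely mirroring the kind of result printed above. All names (TreeFilterSketch, Result, addWord) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multiway-tree filter; not the API of the WordFilterUtil jar. */
final class TreeFilterSketch {
    /** One tree node; each child edge is labelled with a character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd;
    }

    /** Hypothetical result holder: the masked text plus the words that were masked. */
    static final class Result {
        final String filtered;
        final List<String> badWords;

        Result(String filtered, List<String> badWords) {
            this.filtered = filtered;
            this.badWords = badWords;
        }
    }

    private final Node root = new Node();

    void addWord(String word) {
        if (word.isEmpty()) {
            return;
        }
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.wordEnd = true;
    }

    Result filter(String text, char mask) {
        StringBuilder out = new StringBuilder(text);
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int end = -1; // exclusive end of the longest word starting at i
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.wordEnd) {
                    end = j + 1;
                }
            }
            if (end < 0) {
                i++;
                continue;
            }
            found.add(text.substring(i, end));
            for (int k = i; k < end; k++) {
                out.setCharAt(k, mask);
            }
            i = end;
        }
        return new Result(out.toString(), found);
    }
}
```

After adding foo, foobar and qux, filter("foobar foo quxx", '#') yields the text "###### ### ###x" and the word list [foobar, foo, qux].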

Download: WordFilterUtil
Link: http://pan.baidu.com/s/1nvftzeD Password: 5t2h

Those are my findings; I hope they are useful.

Finally, a large collection of sensitive-word lists can be downloaded here:
Link: https://pan.baidu.com/s/1n-GH-OO6nQ5oJk5h5qHVkA Password: qsv9

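This article drew on the following posts:
- "Efficient and Accurate" Sensitive Character & Word Filtering
- Java Keyword Filtering
- Implementing Sensitive-Word Filtering in Java
- Efficient Java Sensitive-Word and Keyword Filtering Toolkit: Filtering Illegal Words and Phrases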

Other
- Personal blog: http://www.sendtion.cn
- CSDN: http://blog.csdn.net/shuyou612
- GitHub: https://github.com/sendtion
