[Java Web] Sensitive-Word Filtering Algorithms


1. DFA algorithm

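The principle behind the DFA algorithm is explained here. In short, a Map is used to build a tree of sensitive words in which each path from the root to a node flagged as the end of a word spells out one sensitive word. That node is a leaf, or an inner node when one word is a prefix of another. For example, see the figure below: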

A simple implementation follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@SuppressWarnings("unchecked")
public class TextFilterUtil {
    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
    // Root of the sensitive-word tree
    private static Map<Object, Object> sensitiveWordMap = null;
    // Encoding of the word-list file
    private static final String ENCODING = "gbk";
    // Classpath location of the word-list file
    private static final String WORDS_PATH = "sensitive/keyWords.txt";

    /**
     * Initializes the sensitive-word tree.
     */
    private static void init() {
        // Read the word list
        Set<String> keyWords = readSensitiveWords();
        // Build the tree
        sensitiveWordMap = new HashMap<>(keyWords.size());
        for (String keyWord : keyWords) {
            createKeyWord(keyWord);
        }
    }

    /**
     * Adds one word to the tree.
     *
     * @param keyWord the sensitive word to add
     */
    private static void createKeyWord(String keyWord) {
        if (sensitiveWordMap == null) {
            LOG.error("sensitiveWordMap is not initialized!");
            return;
        }
        Map<Object, Object> nowMap = sensitiveWordMap;
        for (char c : keyWord.toCharArray()) {
            Object obj = nowMap.get(c);
            if (obj == null) {
                // No node for this character yet: create one
                Map<Object, Object> childMap = new HashMap<>();
                childMap.put("isEnd", Boolean.FALSE);
                nowMap.put(c, childMap);
                nowMap = childMap;
            } else {
                nowMap = (Map<Object, Object>) obj;
            }
        }
        // Flag the node of the last character as the end of a word
        nowMap.put("isEnd", Boolean.TRUE);
    }

    /**
     * Reads the sensitive-word file, one word per line.
     *
     * @return the set of words (empty if the file cannot be read)
     */
    private static Set<String> readSensitiveWords() {
        Set<String> keyWords = new HashSet<>();
        InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream(WORDS_PATH);
        if (in == null) {
            LOG.error("The sensitive-word file does not exist!");
            return keyWords;
        }
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    keyWords.add(word);
                }
            }
        } catch (UnsupportedEncodingException e) {
            LOG.error("Failed to decode the sensitive-word file!", e);
        } catch (IOException e) {
            LOG.error("Failed to read the sensitive-word file!", e);
        }
        return keyWords;
    }

    /**
     * Finds the sensitive words contained in the text.
     *
     * @param text the text to check
     * @return every match, in order of starting position
     */
    public static List<String> checkSensitiveWord(String text) {
        if (sensitiveWordMap == null) {
            init();
        }
        List<String> sensitiveWords = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Map<Object, Object> nowMap = sensitiveWordMap;
            // Walk down the tree from position i for as long as the text matches
            for (int j = i; j < text.length(); j++) {
                Object obj = nowMap.get(text.charAt(j));
                if (obj == null) {
                    break;
                }
                nowMap = (Map<Object, Object>) obj;
                // Check after every step, so a word ending at the last character is also found
                if (Boolean.TRUE.equals(nowMap.get("isEnd"))) {
                    sensitiveWords.add(text.substring(i, j + 1));
                }
            }
        }
        return sensitiveWords;
    }
}
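To try the tree lookup without a word file on the classpath, the following is a minimal sketch that builds the same kind of nested map from an in-memory list. The class name DfaSketch, its methods build and find, and the English sample words are assumptions made for illustration; they are not part of the original utility.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of the original utility.
class DfaSketch {
    // Marker key stored next to the Character keys of each node.
    static final String IS_END = "isEnd";

    // Builds the nested-map tree from an in-memory word list.
    static Map<Object, Object> build(List<String> words) {
        Map<Object, Object> root = new HashMap<>();
        for (String word : words) {
            Map<Object, Object> node = root;
            for (char c : word.toCharArray()) {
                @SuppressWarnings("unchecked")
                Map<Object, Object> child = (Map<Object, Object>)
                        node.computeIfAbsent(c, k -> new HashMap<Object, Object>());
                node = child;
            }
            // Flag the node reached by the last character as a word end.
            node.put(IS_END, Boolean.TRUE);
        }
        return root;
    }

    // Returns every word of the tree that occurs in the text, overlaps included.
    static List<String> find(Map<Object, Object> root, String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Map<Object, Object> node = root;
            for (int j = i; j < text.length(); j++) {
                Object next = node.get(text.charAt(j));
                if (next == null) {
                    break; // no word continues with this character
                }
                @SuppressWarnings("unchecked")
                Map<Object, Object> child = (Map<Object, Object>) next;
                node = child;
                // Checked after every step, so a word ending at the last character is reported.
                if (Boolean.TRUE.equals(node.get(IS_END))) {
                    hits.add(text.substring(i, j + 1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Object, Object> root = build(List.of("bad", "badge", "ad"));
        System.out.println(find(root, "a badge")); // prints [bad, badge, ad]
    }
}
```

Because "bad" is a prefix of "badge", the node for "d" is both a word end and an inner node, which is why the end flag has to live on the node rather than be inferred from the absence of children.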

2. TTMP algorithm

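The TTMP algorithm was created by a member of the online community; its origin is described here. The idea is to break the sensitive words down into a set of "dirty characters": a candidate substring is looked up in the word list only if it consists entirely of dirty characters, which reduces the number of comparisons. A simple implementation of the algorithm follows: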

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
    // Encoding of the word-list file
    private static final String ENCODING = "gbk";
    // Classpath location of the word-list file
    private static final String WORDS_PATH = "sensitive/keyWords.txt";
    // Set of "dirty characters" (every character that occurs in some sensitive word)
    private static Set<Character> sensitiveCharSet = null;
    // Set of sensitive words
    private static Set<String> sensitiveWordSet = null;

    /**
     * Initializes the word set and the dirty-character set.
     */
    private static void init() {
        // Create the containers
        sensitiveCharSet = new HashSet<>();
        sensitiveWordSet = new HashSet<>();
        // Read the file and fill both sets
        readSensitiveWords();
    }

    /**
     * Reads the local sensitive-word file, one word per line.
     */
    private static void readSensitiveWords() {
        InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream(WORDS_PATH);
        if (in == null) {
            LOG.error("The sensitive-word file does not exist!");
            return;
        }
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty()) {
                    continue;
                }
                sensitiveWordSet.add(word);
                for (char c : word.toCharArray()) {
                    sensitiveCharSet.add(c);
                }
            }
        } catch (UnsupportedEncodingException e) {
            LOG.error("Failed to decode the sensitive-word file!", e);
        } catch (IOException e) {
            LOG.error("Failed to read the sensitive-word file!", e);
        }
    }

    /**
     * Finds the sensitive words contained in the text.
     *
     * @param text the text to check
     * @return every match, in order of starting position
     */
    public static List<String> checkSensitiveWord(String text) {
        if (sensitiveWordSet == null || sensitiveCharSet == null) {
            init();
        }
        List<String> sensitiveWords = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // Extend the candidate only while the character at j is a dirty character.
            // (The original tested the character at i again, so the loop never stopped early.)
            for (int j = i; j < text.length(); j++) {
                if (!sensitiveCharSet.contains(text.charAt(j))) {
                    break;
                }
                String key = text.substring(i, j + 1);
                if (sensitiveWordSet.contains(key)) {
                    sensitiveWords.add(key);
                }
            }
        }
        return sensitiveWords;
    }
}
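The dirty-character pre-check can also be exercised on its own with an in-memory word list. The sketch below assumes a hypothetical class TtmpSketch with a find method and English sample words; none of these names come from the original post.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of the original utility.
class TtmpSketch {
    private final Set<String> words = new HashSet<>();
    private final Set<Character> dirtyChars = new HashSet<>();

    TtmpSketch(List<String> wordList) {
        for (String word : wordList) {
            if (word.isEmpty()) {
                continue;
            }
            words.add(word);
            // Every character of every word becomes a dirty character.
            for (char c : word.toCharArray()) {
                dirtyChars.add(c);
            }
        }
    }

    // Returns every listed word that occurs in the text, overlaps included.
    List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // The candidate grows only while it consists entirely of dirty characters.
            for (int j = i; j < text.length() && dirtyChars.contains(text.charAt(j)); j++) {
                String candidate = text.substring(i, j + 1);
                if (words.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TtmpSketch filter = new TtmpSketch(List.of("bad", "ad"));
        System.out.println(filter.find("a bad day")); // prints [bad, ad]
    }
}
```

In this example the space and the letter "y" are not dirty characters, so candidates such as "bad " or "day" are never built or looked up. A long run of dirty characters still produces many substrings; capping the candidate length at the longest word in the list would be one way to bound that cost.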

Note: the implementations above are only meant to illustrate the ideas; in real use there is plenty of room for optimization.
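As one possible refinement, the sketch below replaces the untyped Map and its "isEnd" key with a small typed node class, and masks the longest match at each position instead of only listing the matches. The class MaskingSketch, its mask method, and the choice of '*' as the replacement character are all assumptions for illustration; the original post does not prescribe a masking policy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of the original utility.
class MaskingSketch {
    // Typed tree node: children by character, plus a word-end flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    MaskingSketch(List<String> words) {
        for (String word : words) {
            if (word.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.end = true;
        }
    }

    // Length of the longest listed word starting at index start, or 0 if there is none.
    private int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int j = start; j < text.length(); j++) {
            node = node.children.get(text.charAt(j));
            if (node == null) {
                break;
            }
            if (node.end) {
                best = j - start + 1;
            }
        }
        return best;
    }

    // Replaces every character of each longest match with '*'.
    String mask(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(text, i);
            if (len == 0) {
                out.append(text.charAt(i));
                i++;
            } else {
                for (int k = 0; k < len; k++) {
                    out.append('*');
                }
                i += len; // skip past the masked word
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        MaskingSketch filter = new MaskingSketch(List.of("bad", "badge"));
        System.out.println(filter.mask("a badge, not bad")); // prints a *****, not ***
    }
}
```

Taking the longest match means "badge" is masked as a whole rather than as "bad" followed by "ge"; a different policy, such as masking every overlapping match, would need a different loop.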

