Filtering Garbled Chinese Text in Java


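While cleaning log data recently, I ran into garbled Chinese text. Discarding every string that contains any non-Chinese character is simple but not acceptable, because strings such as Xperia™主題天天四川麻將Ⅱ would be discarded as well.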

1. Unicode encoding

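Unicode is a character encoding standard that covers the characters and punctuation of the world's writing systems; put simply, it is a universal code. Its code space runs from U+0000 to U+10FFFF. Partitioned by fixed, contiguous code point ranges, Unicode is divided into blocks (Unicode block): each assigned character belongs to exactly one block, and blocks do not overlap. Grouped by the properties of the characters themselves, Unicode is divided into scripts (Unicode script). For example, the Han script, which covers Chinese characters and related symbols, draws on the following blocks: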

  • CJK Radicals Supplement
  • Kangxi Radicals
  • CJK Symbols and Punctuation (15 characters only)
  • CJK Unified Ideographs Extension A
  • CJK Unified Ideographs
  • CJK Compatibility Ideographs
  • CJK Unified Ideographs Extension B
  • CJK Unified Ideographs Extension C
  • CJK Unified Ideographs Extension D
  • CJK Unified Ideographs Extension E
  • CJK Compatibility Ideographs Supplement

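Common Chinese characters live in the CJK Unified Ideographs block. To cover additional traditional and rare characters, CJK also has five extensions, A through E. The Basic Latin block contains the full ASCII range: control characters, punctuation, digits, and English letters.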

For the detailed mapping between Unicode code points, blocks, and scripts, see the Unicode block tables (for example, [1]).

2. Character encoding in Java

The JDK fully implements Unicode blocks and scripts:

char c = '☎'; // U+260E BLACK TELEPHONE
Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);   // MISCELLANEOUS_SYMBOLS
Character.UnicodeScript uc = Character.UnicodeScript.of(c); // COMMON
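
To make the block/script distinction concrete, here is a small sketch that prints both properties for a few code points and counts how many code points in the CJK Symbols and Punctuation block (U+3000..U+303F) have the Han script. The class and method names (BlockVsScriptDemo, describe, countHan) are my own, chosen for illustration; only the Character.UnicodeBlock and Character.UnicodeScript calls come from the JDK.

```java
class BlockVsScriptDemo {
    // Formats one code point as "U+XXXX block=<block> script=<script>".
    public static String describe(int codePoint) {
        return String.format("U+%04X block=%s script=%s",
                codePoint,
                Character.UnicodeBlock.of(codePoint),
                Character.UnicodeScript.of(codePoint));
    }

    // Counts the code points in [first, last] whose script is Han.
    public static int countHan(int first, int last) {
        int count = 0;
        for (int cp = first; cp <= last; cp++) {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x4E2D)); // a common Chinese character: script=HAN
        System.out.println(describe(0x3005)); // ideographic iteration mark: script=HAN
        System.out.println(describe(0x3002)); // ideographic full stop: script=COMMON
        System.out.println(describe('A'));    // script=LATIN
        // CJK Symbols and Punctuation spans U+3000..U+303F.
        System.out.println(countHan(0x3000, 0x303F));
    }
}
```

U+3005 and U+3002 sit in the same block but belong to different scripts, which is why a script test is the better fit for "is this a Chinese character". The final count should come out as 15, matching the "15 characters" entry in the block list for the Han script.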

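A Java char is a UTF-16 code unit. Casting a char to int yields that code unit's numeric value, which equals the Unicode code point for characters in the Basic Multilingual Plane; characters beyond it are stored as surrogate pairs. UTF-8 bytes appear only when the string is encoded, for example with getBytes (which uses the platform default charset unless one is passed explicitly):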

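import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Hex;

String s = "\u00a0";
String.format("\\u%04x", (int) s.charAt(0)); // --> \u00a0
// Pass the charset explicitly; getBytes() alone uses the platform default
Hex.encodeHexString(s.getBytes(StandardCharsets.UTF_8)); // --> c2a0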
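
The same point can be checked without commons-codec. The sketch below is only an illustration under my own naming (EncodingDemo, toHex); it encodes the no-break space in two charsets and then shows what happens with a character outside the Basic Multilingual Plane, using U+20000 from CJK Unified Ideographs Extension B.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Renders bytes as lowercase hex, two digits per byte.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";
        System.out.println(toHex(nbsp.getBytes(StandardCharsets.UTF_8)));    // c2a0
        System.out.println(toHex(nbsp.getBytes(StandardCharsets.UTF_16BE))); // 00a0

        // U+20000 does not fit in one char, so it is stored as a surrogate pair.
        String extB = new String(Character.toChars(0x20000));
        System.out.println(extB.length());                         // 2 chars
        System.out.println(extB.codePointCount(0, extB.length())); // 1 code point
        System.out.println(toHex(extB.getBytes(StandardCharsets.UTF_8))); // f0a08080

        // A lone surrogate is not Han; the full code point is.
        System.out.println(Character.UnicodeScript.of(extB.charAt(0)));
        System.out.println(Character.UnicodeScript.of(extB.codePointAt(0))); // HAN
    }
}
```

The last two lines matter for any check written against char: a test such as Character.UnicodeScript.of(c) only sees one half of a surrogate pair, so Extension B and later characters are not recognized as Han unless the string is walked by code point.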

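UTF-8 is one implementation of a variable-length prefix code for Unicode characters: each code point maps to a sequence of one to four bytes. Returning to the opening problem of filtering garbled Chinese text, there is a basic approach: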

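  • Remove punctuation and control characters.
  • Compute the proportion of non-Chinese characters among those that remain; if it exceeds a threshold, treat the string as garbled.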

The complete code is as follows:

public class ChineseUtill {
	 
    // A character counts as Chinese when its Unicode script is Han.
    private static boolean isChinese(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }
    
    // Characters skipped before the ratio is computed: punctuation blocks plus all of ASCII.
    public static boolean isPunctuation(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return  // punctuation, spacing, and formatting characters
                ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                // symbols and punctuation in the unified Chinese, Japanese and Korean script
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                // fullwidth character or a halfwidth character
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                // vertical glyph variants for east Asian compatibility
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                // vertical punctuation for compatibility characters with the Chinese Standard GB 18030
                || ub == Character.UnicodeBlock.VERTICAL_FORMS
                // ASCII: control characters, punctuation, digits, and letters
                || ub == Character.UnicodeBlock.BASIC_LATIN;
    }
    
    // Extra characters accepted as harmless; the exact set depends on what appears in the logs.
    private static boolean isUserDefined(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.NUMBER_FORMS
                || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                || c == '\ufeff'   // ZERO WIDTH NO-BREAK SPACE (byte order mark)
                || c == '\u00a0';  // NO-BREAK SPACE
    }
    
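    // Returns true when more than 30% of the counted characters are not Chinese.
    public static boolean isMessy(String str) {
        int total = 0;       // characters left after skipping punctuation and accepted symbols
        int nonChinese = 0;  // of those, the ones that are not Han
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (isPunctuation(c) || isUserDefined(c)) {
                continue;
            }
            if (!isChinese(c)) {
                nonChinese++;
            }
            total++;
        }
        // Nothing left to judge (empty or punctuation-only input): not garbled
        if (total == 0) {
            return false;
        }
        return (double) nonChinese / total > 0.3;
    }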
    
}

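To get a more complete table of acceptable characters, the isUserDefined method is defined (the exact table depends on the characters that appear in the logs). It adds three blocks, Number Forms, Enclosed Alphanumerics, and Letterlike Symbols, plus the characters \u00a0 (NO-BREAK SPACE) and \ufeff (ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark).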
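
One limitation of the char-based loop is that characters stored as surrogate pairs (for example, those in CJK Unified Ideographs Extension B) are never seen as whole code points. The following is a minimal sketch of a code-point-based variant, not a replacement for the class above. The names MessyTextSketch, isIgnored, and NON_HAN_THRESHOLD are hypothetical, and the garbled sample assumes one particular kind of corruption (UTF-8 bytes decoded as ISO-8859-1), which may not match what appears in your logs.

```java
import java.nio.charset.StandardCharsets;

class MessyTextSketch {
    // Hypothetical constant name; 0.3 is the threshold used in the code above.
    static final double NON_HAN_THRESHOLD = 0.3;

    // True for code points skipped before the ratio is computed:
    // the punctuation blocks, the three extra blocks, and the two special spaces.
    public static boolean isIgnored(int cp) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(cp);
        return ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS
                || ub == Character.UnicodeBlock.BASIC_LATIN
                || ub == Character.UnicodeBlock.NUMBER_FORMS
                || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                || cp == 0xFEFF
                || cp == 0x00A0;
    }

    // Walks the string by code point so surrogate pairs are judged as one character.
    public static boolean isMessy(String str) {
        int total = 0;
        int nonHan = 0;
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            i += Character.charCount(cp);
            if (isIgnored(cp)) {
                continue;
            }
            total++;
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
                nonHan++;
            }
        }
        return total > 0 && (double) nonHan / total > NON_HAN_THRESHOLD;
    }

    public static void main(String[] args) {
        // The example string from the introduction, written with escapes.
        String ok = "Xperia\u2122\u4e3b\u984c\u5929\u5929\u56db\u5ddd\u9ebb\u5c07\u2161";
        // Assumed corruption: UTF-8 bytes of four Chinese characters read as ISO-8859-1.
        String garbled = new String(
                "\u4e2d\u6587\u4e82\u78bc".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(isMessy(ok));      // false
        System.out.println(isMessy(garbled)); // true
    }
}
```

In the first string the ASCII letters, ™, and Ⅱ are all skipped, leaving eight Han characters and a ratio of 0. In the second, all twelve resulting characters fall in the Latin-1 Supplement block, so the ratio is 1.0 and the string is flagged.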

3. References

[1] Wikipedia, Unicode block.
[2] Tong Zeng, Java 中文字符判斷 中文標點符號判斷 (Detecting Chinese characters and Chinese punctuation in Java).

