Extracting substrings from UTF-8 encoded mixed Chinese and English strings in Lua

This article is a repost (see the original). 2016-11-24 18:36, Lua

Reference blog post: "UTF8字符串在lua的截取和字數統計" (Substring extraction and character counting for UTF-8 strings in Lua) [repost]

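Requirements

Extract a substring by counting characters as they are displayed, not bytes.

function(string, start position, number of characters)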

utf8sub("你好1世界哈哈",2,5)    =    好1世界哈
utf8sub("1你好1世界哈哈",2,5)    =    你好1世界
utf8sub("你好世界1哈哈",1,5)    =    你好世界1
utf8sub("12345678",3,5)    =    34567
utf8sub("øpø你好pix",2,5)    =    pø你好p

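Incorrect approaches

I found several algorithms online, but none of them were quite right. They either produced garbled output or assumed that every Chinese character occupies 4 bytes, which is not general enough (in UTF-8, common Chinese characters actually occupy 3 bytes).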

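1. string.sub(s, 1, length*4)

Many snippets online simply call `string.sub(s, 1, length*4)`, which cannot be right, because `string.sub` counts bytes and UTF-8 characters differ in byte length. In the mixed string `你好1世界`, for example, the characters occupy `3,3,1,3,3` bytes. Whatever fixed multiplier is chosen, the byte count rarely lands on a character boundary. Taking 4 characters as 4*3 = 12 bytes gives 3+3+1+3+2, so only the first 2 bytes of `界` are kept and the output is garbled. Taking 4*4 = 16 bytes overshoots the 13-byte string and returns all five characters instead of four.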
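The failure is easy to reproduce. The following is a minimal sketch, assuming the source file is saved as UTF-8; it checks byte lengths with the `#` operator and shows what a fixed byte count does to the sample string. The common Chinese characters used here occupy 3 bytes each.

```lua
-- Assumes this file is saved as UTF-8.
local s = "你好1世界"

-- The # operator and string.sub both work on bytes, not characters.
print(#"你", #"1", #s)                 --> 3   1   13

-- Slicing to a fixed byte count can split a character:
-- 12 bytes = 3 + 3 + 1 + 3 + 2, so only 2 of the 3 bytes of "界" survive.
local cut = string.sub(s, 1, 4 * 3)
print(#cut)                            --> 12
print(string.sub(cut, 1, 10) == "你好1世")  --> true (2 stray bytes follow)

-- With a multiplier of 4 the slice overshoots instead:
-- 16 bytes is more than the whole string, so all 5 characters come back.
print(string.sub(s, 1, 4 * 4) == s)    --> true
```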

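2. if byte>128 then index = index + 4

This walks through the string byte by byte, but it assumes that every non-ASCII character is 4 bytes long. It breaks on 2-byte characters such as `ø` and on 3-byte characters such as common Chinese characters.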
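To make the problem concrete, here is a hypothetical reconstruction of such a routine. The name `naive_sub` and its exact shape are assumptions made for illustration, not code taken from any particular source.

```lua
-- Hypothetical reconstruction of the flawed approach, for illustration only.
-- Assumes this file is saved as UTF-8.
function naive_sub(s, numChars)
    local index = 1
    while numChars > 0 and index <= #s do
        if string.byte(s, index) > 128 then
            index = index + 4   -- wrong: treats every non-ASCII character as 4 bytes
        else
            index = index + 1
        end
        numChars = numChars - 1
    end
    return string.sub(s, 1, index - 1)
end

-- "ø" is 2 bytes, so asking for 1 character returns 4 bytes:
-- "øp" plus the first byte of the second "ø".
print(#naive_sub("øpø", 1))           --> 4
print(naive_sub("øpø", 1) == "ø")     --> false

-- "你" is 3 bytes, so 1 character comes back as "你" plus 1 stray byte of "好".
print(#naive_sub("你好", 1))           --> 4
```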

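Key points

1. UTF-8 characters are variable-length.

2. The byte length of a character follows a fixed pattern.

As described in the referenced article on character encoding, UTF-8 is an encoding scheme for the Unicode character set. Its variable-length byte patterns are as follows, where `*` marks a payload bit: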

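1 byte: 0*******

2 bytes: 110*****, 10******

3 bytes: 1110****, 10******, 10******

4 bytes: 11110***, 10******, 10******, 10******

5 bytes: 111110**, 10******, 10******, 10******, 10******

6 bytes: 1111110*, 10******, 10******, 10******, 10******, 10******

The 5-byte and 6-byte forms come from the original UTF-8 design. RFC 3629 limits UTF-8 to 4 bytes, so they do not occur in valid modern text.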
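These patterns can be observed directly by dumping the bytes of a few sample characters. The sketch below assumes the source file is saved as UTF-8 and uses only `string.byte`.

```lua
-- Assumes this file is saved as UTF-8.
-- Print the decimal value of every byte in each sample character.
for _, ch in ipairs({ "p", "ø", "你" }) do
    print(ch, string.byte(ch, 1, -1))
end
-- p    112            (01110000: 1-byte form)
-- ø    195 184        (11000011 10111000: 2-byte form)
-- 你   228 189 160    (11100100 10111101 10100000: 3-byte form)
```

Every byte after the first falls in the range 128 to 191, which is the `10******` continuation pattern.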

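Therefore, given a byte string, determining the byte length of a UTF-8 character only requires reading its first byte. The value of that byte tells you how many bytes the character occupies.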
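The decimal thresholds used to classify a first byte can be read off the bit patterns listed earlier: each one is the binary prefix followed by zeros. The following sketch converts them with `tonumber` and an explicit base.

```lua
-- Smallest first byte for each length: the binary prefix padded with zeros.
print(tonumber("11000000", 2))  --> 192  (2-byte characters start here)
print(tonumber("11100000", 2))  --> 224  (3-byte)
print(tonumber("11110000", 2))  --> 240  (4-byte)
print(tonumber("11111000", 2))  --> 248  (5-byte, legacy form)
print(tonumber("11111100", 2))  --> 252  (6-byte, legacy form)

-- Bytes 128..191 (10******) are continuation bytes and never start a character.
print(tonumber("10000000", 2), tonumber("10111111", 2))  --> 128  191
```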

The full code is as follows:

-- Return the byte length of the UTF-8 character whose first byte is ch.
-- ch is a byte value (0-255) as returned by string.byte, or nil.
local function charsize(ch)
    if not ch then return 0
    elseif ch >= 252 then return 6
    elseif ch >= 248 then return 5
    elseif ch >= 240 then return 4
    elseif ch >= 224 then return 3
    elseif ch >= 192 then return 2
    else return 1
    end
end

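-- Count the characters in a UTF-8 string; every character counts as one,
-- whatever its byte length. For example, utf8len("1你好") => 3
-- Also returns the number of single-byte and multi-byte characters.
function utf8len(str)
    local len = 0
    local aNum = 0 -- single-byte (ASCII) characters
    local hNum = 0 -- multi-byte characters, such as Chinese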
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        local cs = charsize(char)
        currentIndex = currentIndex + cs
        len = len +1
        if cs == 1 then 
            aNum = aNum + 1
        elseif cs >= 2 then 
            hNum = hNum + 1
        end
    end
    return len, aNum, hNum
end

-- Extract a substring from a UTF-8 string.
-- str:        the string to extract from
-- startChar:  index of the first character to take, starting at 1
-- numChars:   number of characters to take
function utf8sub(str, startChar, numChars)
    -- Skip startChar - 1 characters to find the starting byte index.
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + charsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    -- Advance over numChars characters, stopping at the end of the string.
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + charsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end

-- Self-test
function test()
    -- test utf8len
    assert(utf8len("你好1世界哈哈") == 7)
    assert(utf8len("你好世界1哈哈 ") == 8)
    assert(utf8len(" 你好世 界1哈哈") == 9)
    assert(utf8len("12345678") == 8)
    assert(utf8len("øpø你好pix") == 8)

    -- test utf8sub
    assert(utf8sub("你好1世界哈哈",2,5) == "好1世界哈")
    assert(utf8sub("1你好1世界哈哈",2,5) == "你好1世界")
    assert(utf8sub(" 你好1世界 哈哈",2,6) == "你好1世界 ")
    assert(utf8sub("你好世界1哈哈",1,5) == "你好世界1")
    assert(utf8sub("12345678",3,5) == "34567")
    assert(utf8sub("øpø你好pix",2,5) == "pø你好p")

    print("all test succ")
end

test()
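For comparison, the same two operations can be sketched with Lua patterns instead of a first-byte lookup. This is an alternative technique, not the method used in the listing above. It relies on continuation bytes always falling in the range 128 to 191, and it assumes valid UTF-8 input and `startChar >= 1`. The names `utf8len_pattern` and `utf8sub_pattern` are made up for this example.

```lua
-- Alternative sketch using Lua patterns. Assumes valid UTF-8 input
-- and that this file is saved as UTF-8.

-- Count characters by counting the bytes that are NOT continuation bytes.
function utf8len_pattern(str)
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

-- Take numChars characters starting at character index startChar (1-based).
function utf8sub_pattern(str, startChar, numChars)
    local chars = {}
    -- One non-continuation byte plus any continuation bytes is one character.
    for ch in string.gmatch(str, "[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    -- Clamp the end index so table.concat never reads past the table.
    local last = math.min(startChar + numChars - 1, #chars)
    if startChar > last then return "" end
    return table.concat(chars, "", startChar, last)
end

print(utf8len_pattern("øpø你好pix"))        --> 8
print(utf8sub_pattern("øpø你好pix", 2, 5))  --> pø你好p
```

This version builds a table holding every character, so it trades memory for simplicity. On Lua 5.3 and later, the standard `utf8` library (`utf8.len`, `utf8.offset`) provides built-in building blocks for the same task.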
