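What is character encoding?
A character encoding is the format in which a computer stores text. English letters, for example, are stored in ASCII, one byte per character. Other encodings include UTF-8 (a universal encoding that covers all of Unicode) and regional encodings such as ISO-8859-1 (Western European), windows-1251 (Russian) and the GB family (Chinese).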
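Why is conversion needed?
Because different regions use different encodings, exchanging information requires converting the same characters from one encoding to another.
UTF-8 is the universal choice: it can represent every Unicode character, so files that need to be portable (for example web pages that display several languages) are usually stored as UTF-8, and text in other encodings is usually converted to UTF-8.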
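To make the target format concrete, here is a minimal sketch in plain Lua of how one Unicode code point is laid out as UTF-8 bytes. The names utf8_encode and bytes_of are hypothetical helpers written for this illustration; they are not part of libiconv or lua-iconv, and the sketch does not reject invalid code points such as surrogates.

```lua
-- Encode one Unicode code point (0 .. 0x10FFFF) as a UTF-8 byte string.
-- Uses only arithmetic so that it also runs on Lua 5.1 (no bit library).
function utf8_encode(cp)
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Render a string as space-separated decimal byte values.
function bytes_of(s)
  return table.concat({ string.byte(s, 1, -1) }, " ")
end

print(bytes_of(utf8_encode(0x0410)))  -- Cyrillic capital A: 208 144
print(bytes_of(utf8_encode(0x20AC)))  -- Euro sign: 226 130 172
```

The decimal byte values printed here are the same form used by the windows-1251 to UTF-8 table later in this post.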
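The conversion library LIBICONV
http://www.gnu.org/software/libiconv/#introduction
The GNU project provides an open-source conversion library that converts between a number of encodings and Unicode. It can be used on systems that do not provide encoding conversion themselves.
Project page: http://savannah.gnu.org/projects/libiconv/
Recent versions of the Linux C library (glibc) already provide iconv, so libiconv does not have to be installed:
http://davidgao.github.io/LFSCN/chapter06/glibc.html
Some packages outside LFS recommend installing GNU libiconv for converting text between encodings. The project's home page (http://www.gnu.org/software/libiconv/) says: "This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode." Glibc provides an iconv() implementation and can convert from/to Unicode, so libiconv is not needed on an LFS system.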
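LUAICONV
Lua, being a mature language, has a dedicated library (lua-iconv) that wraps the iconv functionality and exposes it to Lua scripts.
Official documentation: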
http://ittner.github.io/lua-iconv/#download-and-installation
local iconv = require("iconv")
cd = iconv.new(to, from)
cd = iconv.open(to, from)
nstr, err = cd:iconv(str)
Converts the 'str' string to the desired charset. This method always returns two arguments: the converted string and an error code, which may have any of the following values:
nil: No error. Conversion was successful.
iconv.ERROR_NO_MEMORY: Failed to allocate enough memory in the conversion process.
iconv.ERROR_INVALID: An invalid character was found in the input sequence.
iconv.ERROR_INCOMPLETE: An incomplete character was found in the input sequence.
iconv.ERROR_FINALIZED: Trying to use an already-finalized converter. This usually means that the user was tweaking the garbage collector private methods.
iconv.ERROR_UNKNOWN: There was an unknown error.
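The calls documented above can be combined into a small wrapper. This is only a sketch: convert_text is a hypothetical helper name, the error messages are my own wording, and the module is loaded through pcall so the script still runs (and reports the problem) when lua-iconv is not installed.

```lua
-- Try to load lua-iconv; `ok` is false when the module cannot be loaded.
local ok, iconv = pcall(require, "iconv")

-- Hypothetical helper: convert `text` from charset `from` to charset `to`.
-- Returns the converted string, or nil plus a readable message.
function convert_text(to, from, text)
  if not ok then
    return nil, "lua-iconv is not installed"
  end
  local cd = iconv.new(to, from)
  if not cd then
    return nil, "no converter for " .. from .. " -> " .. to
  end
  local out, err = cd:iconv(text)
  if err == nil then
    return out
  elseif err == iconv.ERROR_INVALID then
    return nil, "invalid character in input"
  elseif err == iconv.ERROR_INCOMPLETE then
    return nil, "incomplete character in input"
  elseif err == iconv.ERROR_NO_MEMORY then
    return nil, "out of memory"
  else
    return nil, "unknown error or finalized converter"
  end
end

-- Byte 0xC0 is the Cyrillic capital A in windows-1251.
local result, msg = convert_text("utf-8", "windows-1251", string.char(0xC0))
print(result or ("conversion failed: " .. msg))
```

Returning nil plus a message follows the usual Lua error convention, so callers can wrap the call in assert when a failure should abort the script.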
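For Lua 5.1, the lua-iconv-5 release is recommended; the newest release, -7, is compatible with Lua 5.2.
https://github.com/ittner/lua-iconv/releases/tag/lua-iconv-5
After installation, running the test script fails with an error: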
:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ lua test_iconv.lua
lua: error loading module 'iconv' from file './iconv.so':
./iconv.so: undefined symbol: libiconv_open
stack traceback:
[C]: ?
[C]: in function 'require'
test_iconv.lua:1: in main chunk
[C]: ?
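After some investigation (prompted by this post: http://tonybai.com/2013/04/25/a-libiconv-linkage-problem/), the cause turned out to be the following. libiconv had been installed first, which copied its iconv.h to /usr/local/include/iconv.h.
When the lua-iconv project was built afterwards, gcc found /usr/local/include/iconv.h first while compiling the C source (luaiconv.c), used the function declarations in that file, and produced iconv.so from them.
The build should instead use the iconv.h provided by the system, which is /usr/include/iconv.h.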
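Search order. Note that the command below prints the linker's library search directories, not gcc's header search path. The header search path can be printed with gcc -xc -E -v /dev/null, and by default it also lists /usr/local/include before /usr/include: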
:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ ld -verbose | grep SEARCH
SEARCH_DIR("=/usr/i686-linux-gnu/lib32"); SEARCH_DIR("=/usr/local/lib32"); SEARCH_DIR("=/lib32"); SEARCH_DIR("=/usr/lib32"); SEARCH_DIR("=/usr/i686-linux-gnu/lib"); SEARCH_DIR("=/usr/local/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib/i386-linux-gnu"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/lib");
From libiconv's header, /usr/local/include/iconv.h:
#ifndef LIBICONV_PLUG
#define iconv_open libiconv_open
#endif
extern LIBICONV_DLL_EXPORTED iconv_t iconv_open (const char* tocode, const char* fromcode);
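In libiconv's iconv.c, the macro-guarded block below only adds the plain iconv_open, iconv and iconv_close aliases on FreeBSD, so it is not active here. The header above has already renamed iconv_open to libiconv_open, and that symbol is unresolved at load time, most likely because lua-iconv was built without linking against the libiconv library: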
#if defined __FreeBSD__ && !defined __gnu_freebsd__
/* GNU libiconv is the native FreeBSD iconv implementation since 2002.
It wants to define the symbols 'iconv_open', 'iconv', 'iconv_close'. */
#define strong_alias(name, aliasname) _strong_alias(name, aliasname)
#define _strong_alias(name, aliasname) \
extern __typeof (name) aliasname __attribute__ ((alias (#name)));
#undef iconv_open
#undef iconv
#undef iconv_close
strong_alias (libiconv_open, iconv_open)
strong_alias (libiconv, iconv)
strong_alias (libiconv_close, iconv_close)
#endif
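Fix: in the implementation file, change how iconv.h is included, from the standard angle-bracket form to a quoted include with the full path /usr/include/iconv.h.
Then run make && make install again; the module now loads and runs correctly.
vim luaiconv.c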
#include <lua.h>
#include <lauxlib.h>
#include <stdlib.h>
#include "/usr/include/iconv.h"
#include <errno.h>
For other installation and runtime errors, see:
https://github.com/ittner/lua-iconv/issues/3
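Experiment: generating a conversion table
Some embedded systems have no libiconv installed and no iconv implementation in their libc, yet still need to convert between character encodings.
In that case lua-iconv can be installed on the build server and the system's iconv used to generate a mapping table from one encoding to another; the conversion itself is then done by looking bytes up in that table.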
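The lookup step can be sketched in plain Lua as follows. The six entries are copied from the windows-1251 to UTF-8 table shown at the end of this post (index = byte value + 1); the names excerpt_1251_to_utf8 and convert_1251_to_utf8, the ASCII pass-through and the "?" fallback for unmapped bytes are assumptions made for this illustration.

```lua
-- Excerpt of the generated table: index = windows-1251 byte value + 1.
local excerpt_1251_to_utf8 = {
  [0xCF + 1] = "\208\159", -- index 208
  [0xE2 + 1] = "\208\178", -- index 227
  [0xE5 + 1] = "\208\181", -- index 230
  [0xE8 + 1] = "\208\184", -- index 233
  [0xF0 + 1] = "\209\128", -- index 241
  [0xF2 + 1] = "\209\130", -- index 243
}

-- Convert a windows-1251 string to UTF-8 by table lookup, one byte at a time.
-- `map` defaults to the excerpt above; pass the full table in real use.
function convert_1251_to_utf8(input, map)
  map = map or excerpt_1251_to_utf8
  local out = {}
  for i = 1, #input do
    local b = string.byte(input, i)
    if b < 128 then
      out[#out + 1] = string.char(b)    -- ASCII is identical in both encodings
    else
      out[#out + 1] = map[b + 1] or "?" -- assumed fallback for unmapped bytes
    end
  end
  return table.concat(out)
end

-- The Russian word for "hello" in windows-1251 bytes, followed by an ASCII "!".
local cp1251 = string.char(0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2) .. "!"
print(convert_1251_to_utf8(cp1251))     -- shows Привет! on a UTF-8 terminal
```

Collecting the pieces in a table and joining them once with table.concat avoids repeated string concatenation inside the loop, which matters on small embedded targets.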
For example, converting windows-1251 to UTF-8.
windows-1251 character code reference:
http://www.science.co.il/language/Character-code.asp?s=1251
Lua code that generates the table:
-- Serialize a Lua value as Lua source text; used to dump the result table.
function serializeTable(val, name, skipnewlines, depth)
    skipnewlines = skipnewlines or false
    depth = depth or 0

    local tmp = string.rep(" ", depth)
    if type(name) == "number" then
        -- numeric keys need brackets to be valid Lua
        tmp = tmp .. "[" .. name .. "] = "
    elseif name then
        tmp = tmp .. name .. " = "
    end

    if type(val) == "table" then
        tmp = tmp .. "{" .. (not skipnewlines and "\n" or "")
        for k, v in pairs(val) do
            tmp = tmp .. serializeTable(v, k, skipnewlines, depth + 1) .. "," .. (not skipnewlines and "\n" or "")
        end
        tmp = tmp .. string.rep(" ", depth) .. "}"
    elseif type(val) == "number" then
        tmp = tmp .. tostring(val)
    elseif type(val) == "string" then
        -- %q also escapes the backslashes in byteStr, so the printed
        -- entries contain doubled backslashes and need tidying before use
        tmp = tmp .. string.format("%q", val)
    elseif type(val) == "boolean" then
        tmp = tmp .. (val and "true" or "false")
    else
        tmp = tmp .. "\"[inserializeable datatype:" .. type(val) .. "]\""
    end
    return tmp
end

local iconv = require("iconv")

-- Set your terminal encoding here
-- local termcs = "iso-8859-1"
local termcs = "utf-8"

-- Convert `text` from charset `from` to charset `to`, report any error,
-- and return the converted string.
function check_one(to, from, text)
    print("\n-- Testing conversion from " .. from .. " to " .. to)
    local cd = iconv.new(to .. "//TRANSLIT", from)
    assert(cd, "Failed to create a converter object.")
    local ostr, err = cd:iconv(text)
    if err == iconv.ERROR_INCOMPLETE then
        print("ERROR: Incomplete input.")
    elseif err == iconv.ERROR_INVALID then
        print("ERROR: Invalid input.")
    elseif err == iconv.ERROR_NO_MEMORY then
        print("ERROR: Failed to allocate memory.")
    elseif err == iconv.ERROR_UNKNOWN then
        print("ERROR: There was an unknown error.")
    end
    print(ostr)
    return ostr
end

-- Convert every single-byte value 0..255 and record the resulting bytes
-- as a string of decimal escapes such as \208\144.
local result = {}
local num = 255
for i = 0, num do
    print("----------------------------------- i=" .. i)
    local char = string.char(i)
    local ostr = check_one(termcs, "windows-1251", char)
    print(string.len(ostr))
    local byteStr = ""
    for j = 1, string.len(ostr) do
        local byteVal = string.byte(ostr, j)
        print("byte j=" .. j .. " byteVal=" .. byteVal)
        byteStr = byteStr .. "\\" .. byteVal
    end
    print("char i=" .. i .. " byteStr=" .. byteStr)
    -- table.insert appends, so byte value i ends up at index i + 1
    table.insert(result, byteStr)
end
print("-----------------------------------!!")
local s = serializeTable(result)
print(s)
The tidied-up windows-1251 to UTF-8 table
-- Index = windows-1251 byte value + 1.
-- Entry 153 (byte 0x98) is empty because that byte is unassigned in windows-1251.
local transTbl_1251toutf8 = {
[1] = "\0", [2] = "\1", [3] = "\2", [4] = "\3", [5] = "\4", [6] = "\5", [7] = "\6", [8] = "\7",
[9] = "\8", [10] = "\9", [11] = "\10", [12] = "\11", [13] = "\12", [14] = "\13", [15] = "\14", [16] = "\15",
[17] = "\16", [18] = "\17", [19] = "\18", [20] = "\19", [21] = "\20", [22] = "\21", [23] = "\22", [24] = "\23",
[25] = "\24", [26] = "\25", [27] = "\26", [28] = "\27", [29] = "\28", [30] = "\29", [31] = "\30", [32] = "\31",
[33] = "\32", [34] = "\33", [35] = "\34", [36] = "\35", [37] = "\36", [38] = "\37", [39] = "\38", [40] = "\39",
[41] = "\40", [42] = "\41", [43] = "\42", [44] = "\43", [45] = "\44", [46] = "\45", [47] = "\46", [48] = "\47",
[49] = "\48", [50] = "\49", [51] = "\50", [52] = "\51", [53] = "\52", [54] = "\53", [55] = "\54", [56] = "\55",
[57] = "\56", [58] = "\57", [59] = "\58", [60] = "\59", [61] = "\60", [62] = "\61", [63] = "\62", [64] = "\63",
[65] = "\64", [66] = "\65", [67] = "\66", [68] = "\67", [69] = "\68", [70] = "\69", [71] = "\70", [72] = "\71",
[73] = "\72", [74] = "\73", [75] = "\74", [76] = "\75", [77] = "\76", [78] = "\77", [79] = "\78", [80] = "\79",
[81] = "\80", [82] = "\81", [83] = "\82", [84] = "\83", [85] = "\84", [86] = "\85", [87] = "\86", [88] = "\87",
[89] = "\88", [90] = "\89", [91] = "\90", [92] = "\91", [93] = "\92", [94] = "\93", [95] = "\94", [96] = "\95",
[97] = "\96", [98] = "\97", [99] = "\98", [100] = "\99", [101] = "\100", [102] = "\101", [103] = "\102", [104] = "\103",
[105] = "\104", [106] = "\105", [107] = "\106", [108] = "\107", [109] = "\108", [110] = "\109", [111] = "\110", [112] = "\111",
[113] = "\112", [114] = "\113", [115] = "\114", [116] = "\115", [117] = "\116", [118] = "\117", [119] = "\118", [120] = "\119",
[121] = "\120", [122] = "\121", [123] = "\122", [124] = "\123", [125] = "\124", [126] = "\125", [127] = "\126", [128] = "\127",
[129] = "\208\130", [130] = "\208\131", [131] = "\226\128\154", [132] = "\209\147",
[133] = "\226\128\158", [134] = "\226\128\166", [135] = "\226\128\160", [136] = "\226\128\161",
[137] = "\226\130\172", [138] = "\226\128\176", [139] = "\208\137", [140] = "\226\128\185",
[141] = "\208\138", [142] = "\208\140", [143] = "\208\139", [144] = "\208\143",
[145] = "\209\146", [146] = "\226\128\152", [147] = "\226\128\153", [148] = "\226\128\156",
[149] = "\226\128\157", [150] = "\226\128\162", [151] = "\226\128\147", [152] = "\226\128\148",
[153] = "", [154] = "\226\132\162", [155] = "\209\153", [156] = "\226\128\186",
[157] = "\209\154", [158] = "\209\156", [159] = "\209\155", [160] = "\209\159",
[161] = "\194\160", [162] = "\208\142", [163] = "\209\158", [164] = "\208\136",
[165] = "\194\164", [166] = "\210\144", [167] = "\194\166", [168] = "\194\167",
[169] = "\208\129", [170] = "\194\169", [171] = "\208\132", [172] = "\194\171",
[173] = "\194\172", [174] = "\194\173", [175] = "\194\174", [176] = "\208\135",
[177] = "\194\176", [178] = "\194\177", [179] = "\208\134", [180] = "\209\150",
[181] = "\210\145", [182] = "\194\181", [183] = "\194\182", [184] = "\194\183",
[185] = "\209\145", [186] = "\226\132\150", [187] = "\209\148", [188] = "\194\187",
[189] = "\209\152", [190] = "\208\133", [191] = "\209\149", [192] = "\209\151",
[193] = "\208\144", [194] = "\208\145", [195] = "\208\146", [196] = "\208\147",
[197] = "\208\148", [198] = "\208\149", [199] = "\208\150", [200] = "\208\151",
[201] = "\208\152", [202] = "\208\153", [203] = "\208\154", [204] = "\208\155",
[205] = "\208\156", [206] = "\208\157", [207] = "\208\158", [208] = "\208\159",
[209] = "\208\160", [210] = "\208\161", [211] = "\208\162", [212] = "\208\163",
[213] = "\208\164", [214] = "\208\165", [215] = "\208\166", [216] = "\208\167",
[217] = "\208\168", [218] = "\208\169", [219] = "\208\170", [220] = "\208\171",
[221] = "\208\172", [222] = "\208\173", [223] = "\208\174", [224] = "\208\175",
[225] = "\208\176", [226] = "\208\177", [227] = "\208\178", [228] = "\208\179",
[229] = "\208\180", [230] = "\208\181", [231] = "\208\182", [232] = "\208\183",
[233] = "\208\184", [234] = "\208\185", [235] = "\208\186", [236] = "\208\187",
[237] = "\208\188", [238] = "\208\189", [239] = "\208\190", [240] = "\208\191",
[241] = "\209\128", [242] = "\209\129", [243] = "\209\130", [244] = "\209\131",
[245] = "\209\132", [246] = "\209\133", [247] = "\209\134", [248] = "\209\135",
[249] = "\209\136", [250] = "\209\137", [251] = "\209\138", [252] = "\209\139",
[253] = "\209\140", [254] = "\209\141", [255] = "\209\142", [256] = "\209\143",
}