A simple JavaScript implementation of UTF-8 and Base64 encoding

This article is a repost. 2019-10-31 18:03

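This article implements UTF-8 encoding and Base64 encoding in JavaScript in a simple way. Reading it should help you understand the conversion between Unicode and UTF-8, and why Base64 encoding makes data longer.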

Outline:

A brief look at Unicode
UTF-8 encoding
Base64 encoding
Summary

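Character sets such as Unicode, ASCII and GB2312 work like dictionaries. A character's code is like the page and line on which a word appears in the dictionary. When different systems look up the same code in the same dictionary, they get the same character.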


1. A brief look at Unicode

From Wikipedia:

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

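Before Unicode was created, different languages had their own character sets; ASCII and GB2312 are examples from that history. These sets conflicted with one another, which made it troublesome for systems in different languages to exchange text, because the same code value could stand for completely different characters in two encoding systems. Unicode was created to solve this. It gives every character a unique number regardless of platform, program or language, and it has been widely adopted as a result.

JavaScript programs are written using the Unicode character set, and each character in a string normally comes from that set.

The Unicode character set is like a dictionary, and a character is like an entry in it. A character's Unicode code point is like the page and line on which the entry appears.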
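To make the dictionary analogy concrete, the short sketch below looks up the code point of a character and then turns the number back into the character. It uses only the standard codePointAt and String.fromCodePoint methods; the variable names are illustrative.

```javascript
// Look up the "page and line" (code point) of a character, then go back again.
const ch = '羅'
const codePoint = ch.codePointAt(0) // the number Unicode assigns to 羅
const hex = codePoint.toString(16).toUpperCase()

console.log('U+' + hex)                      // U+7F85
console.log(String.fromCodePoint(codePoint)) // 羅
```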

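2. UTF-8 encoding

2.1 Why is an encoding still needed for transmission when we already have the Unicode character set?

A Unicode code point converted to binary is just a run of 0s and 1s. When that run is sent to another party, a rule is needed to tell where one character ends and the next begins.

This is why the UTF-n family of encodings exists.

8 bits = 1 byte

Background

UTF (Unicode Transformation Format) leaves the code assigned to each character in the character set unchanged and defines a separate encoding scheme that maps those codes to the bytes used for transmission. Its main job is to save bandwidth and disk space while keeping the universality of the Unicode character set.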

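Storage

Unicode is a symbol set. It specifies the binary code of each symbol but not how that code should be stored (that is, how many bytes it occupies), so several storage schemes exist.

UTF-32

Every character is represented by four bytes.

UTF-16

A character is represented by two bytes or four bytes.

UTF-8

A variable-length encoding that uses 1 to 4 bytes per character as needed. Because only as many bytes are sent as each character requires, it saves bandwidth and disk space, which is why UTF-8 is the most widely used of the three.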
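The size difference between the three schemes can be checked with a small sketch. It assumes the TextEncoder global is available (browsers and recent Node.js versions); byteSizes is a hypothetical helper name, and byte order marks are not counted.

```javascript
// Bytes needed to store a string in each encoding form.
function byteSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // 1 to 4 bytes per code point
    utf16: str.length * 2,                      // str.length counts 16-bit code units
    utf32: [...str].length * 4                  // spreading iterates by code point
  }
}

for (const s of ['A', '羅', '𠮷']) {
  console.log(s, byteSizes(s))
}
// A  -> utf8: 1, utf16: 2, utf32: 4
// 羅 -> utf8: 3, utf16: 2, utf32: 4
// 𠮷 -> utf8: 4, utf16: 4, utf32: 4
```

For ASCII text UTF-8 is the smallest of the three, while for characters in the range U+0800 to U+FFFF UTF-16 is more compact.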

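2.2 UTF-8 encoding rules

For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the symbol's Unicode code point.
For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits are filled with the bits of the symbol's Unicode code point.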

Unicode range     | Unicode range        | UTF-8 encoding
(decimal)         | (hexadecimal)        | (binary)
------------------+----------------------+---------------------------------------------
0 ~ 127           | 0000 0000-0000 007F  | 0xxxxxxx
128 ~ 2047        | 0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
2048 ~ 65535      | 0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
65536 ~ 1114111   | 0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
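The ranges in the table translate directly into a lookup for the number of bytes a code point needs. The following is a minimal sketch; utf8ByteLength is a hypothetical helper name, not something from the original article.

```javascript
// Number of UTF-8 bytes for a code point, following the ranges in the table.
function utf8ByteLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError('not a Unicode code point: ' + codePoint)
  }
  if (codePoint <= 0x7f) return 1   // 0xxxxxxx
  if (codePoint <= 0x7ff) return 2  // 110xxxxx 10xxxxxx
  if (codePoint <= 0xffff) return 3 // 1110xxxx 10xxxxxx 10xxxxxx
  return 4                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

console.log(utf8ByteLength(0x41))    // 1 (A)
console.log(utf8ByteLength(0xe9))    // 2 (é)
console.log(utf8ByteLength(0x7f85))  // 3 (羅)
console.log(utf8ByteLength(0x20bb7)) // 4 (𠮷)
```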

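Example: the UTF-8 encoding of the character 羅

Use codePointAt to get the character's Unicode code point, determine how many bytes are needed, then fill in the bits according to the rules.

Encoding flow: get the code point, find its row in the table to decide the byte count, split the code point's bits into groups starting from the low end (6 bits for each trailing byte, the rest for the first byte), then put the marker bits in front of each group.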
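The flow can be traced by hand for 羅 (U+7F85), which falls in the three-byte row of the table. The sketch below works on bit strings so that each step is visible; it is meant as an illustration, and the variable names are assumptions of this example.

```javascript
// Walk through the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx for 羅 (U+7F85).
const code = '羅'.codePointAt(0)                // 0x7f85
const bits = code.toString(2).padStart(16, '0') // 16 payload bits: 4 + 6 + 6
const groups = [bits.slice(0, 4), bits.slice(4, 10), bits.slice(10)]
const bytes = [
  parseInt('1110' + groups[0], 2), // first byte: marker 1110 + top 4 bits
  parseInt('10' + groups[1], 2),   // continuation byte: marker 10 + 6 bits
  parseInt('10' + groups[2], 2)    // continuation byte: marker 10 + 6 bits
]

console.log(bits)                           // 0111111110000101
console.log(groups)                         // [ '0111', '111110', '000101' ]
console.log(bytes.map(b => b.toString(16))) // [ 'e7', 'be', '85' ]
```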

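Additional notes:

ES6 provides the codePointAt() method, which correctly handles characters stored in four bytes (two UTF-16 code units) and returns the character's code point (its Unicode number).

ES6 also provides the String.fromCodePoint() method, which correctly handles a code point, including values above 0xFFFF, and returns the character that corresponds to it.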
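The difference from the older charCodeAt and String.fromCharCode methods shows up with a character outside the Basic Multilingual Plane. A short comparison, using 𠮷 (U+20BB7) as the sample character:

```javascript
const s = '𠮷' // U+20BB7, stored as two UTF-16 code units

console.log(s.length)                      // 2 (two code units)
console.log(s.charCodeAt(0).toString(16))  // d842 (only the high surrogate)
console.log(s.codePointAt(0).toString(16)) // 20bb7 (the full code point)

console.log(String.fromCharCode(0x20bb7) === s)  // false (value truncated to 16 bits)
console.log(String.fromCodePoint(0x20bb7) === s) // true

console.log([...s].length)                 // 1 (iteration is by code point)
```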

2.3 A simple implementation of UTF-8 encoding and decoding

// Encode a string into an array of UTF-8 byte values.
// Input is assumed to be well-formed (no lone surrogates).
function encodeUtf8(str) {
  const bytes = []
  // for...of iterates by code point, so a surrogate pair such as 𠮷 is visited once.
  for (const ch of str) {
    const code = ch.codePointAt(0)
    if (code >= 0x10000) {
      // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      bytes.push((code >> 18) | 0xf0)
      bytes.push(((code >> 12) & 0x3f) | 0x80)
      bytes.push(((code >> 6) & 0x3f) | 0x80)
      bytes.push((code & 0x3f) | 0x80)
    } else if (code >= 0x800) {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      bytes.push((code >> 12) | 0xe0)
      bytes.push(((code >> 6) & 0x3f) | 0x80)
      bytes.push((code & 0x3f) | 0x80)
    } else if (code >= 0x80) {
      // 2 bytes: 110xxxxx 10xxxxxx
      bytes.push((code >> 6) | 0xc0)
      bytes.push((code & 0x3f) | 0x80)
    } else {
      // 1 byte: 0xxxxxxx
      bytes.push(code)
    }
  }
  return bytes
}

// Left-pad str with prefix up to len characters.
// String.prototype.padStart (ES2017) could be used instead.
function padStart(str, len, prefix) {
  return (new Array(len + 1).join(prefix) + str).slice(-len)
}

// Decode a hex string of UTF-8 bytes (two hex digits per byte) back into a string.
// No validation is done; the input is assumed to be well-formed UTF-8.
function decodeUtf8(hexStr) {
  const bytes = (hexStr.match(/[0-9a-f]{2}/gi) || []).map(h => parseInt(h, 16))
  let strValue = ''
  for (let i = 0; i < bytes.length; ) {
    const b = bytes[i]
    let code
    if ((b & 0xf8) === 0xf0) {
      // 11110xxx: keep 3 bits, then 6 bits from each of the next three bytes
      code = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3f) << 12) |
        ((bytes[i + 2] & 0x3f) << 6) | (bytes[i + 3] & 0x3f)
      i += 4
    } else if ((b & 0xf0) === 0xe0) {
      // 1110xxxx: keep 4 bits, then 6 bits from each of the next two bytes
      code = ((b & 0x0f) << 12) | ((bytes[i + 1] & 0x3f) << 6) | (bytes[i + 2] & 0x3f)
      i += 3
    } else if ((b & 0xe0) === 0xc0) {
      // 110xxxxx: keep 5 bits, then 6 bits from the next byte
      code = ((b & 0x1f) << 6) | (bytes[i + 1] & 0x3f)
      i += 2
    } else {
      // 0xxxxxxx: the byte is the code point
      code = b
      i += 1
    }
    strValue += String.fromCodePoint(code)
  }
  return strValue
}

// Convert a byte array into a hex string, always two digits per byte.
function transferHex(bytes) {
  return (bytes || []).map(b => padStart(b.toString(16), 2, '0')).join('')
}

const text = '羅小步 啊哈哈 𠮷 ssdf 34534 ASD'
const strHex = transferHex(encodeUtf8(text))
console.log(strHex)
const str = decodeUtf8(strHex)
console.log(str)

console.log('test ok?', text === str)
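A hand-written encoder is easiest to trust when it is compared against a reference. One option, sketched below, is the built-in TextEncoder and TextDecoder; this assumes they are available as globals (browsers and recent Node.js versions). The hex string printed here can be compared with the output of the listing above.

```javascript
const text = '羅小步 啊哈哈 𠮷 ssdf 34534 ASD'

// Reference encoding: a Uint8Array of UTF-8 bytes.
const bytes = new TextEncoder().encode(text)

// Two hex digits per byte, so bytes below 0x10 keep their leading zero.
const hex = Array.from(bytes, b => b.toString(16).padStart(2, '0')).join('')

// Reference decoding back to a string.
const roundTrip = new TextDecoder('utf-8').decode(bytes)

console.log(hex)
console.log(roundTrip === text) // true
```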

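3. Base64 encoding

3.1 Base64 encoding rules

Rule: Base64 converts every three 8-bit bytes into four 6-bit groups, then adds two leading zero bits to each 6-bit group so that it becomes an 8-bit byte, giving four bytes in total.

If the number of bytes to encode is not a multiple of 3, one or two bytes are left over at the end. Base64 pads the leftover bits with zeros to complete the last 6-bit group, then appends '=' to the output: two '=' when one byte is left over, one '=' when two bytes are left over.

Each group, read as a binary number, is encoded as CHARTS[parseInt(group, 2)].

CHARTS = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

In short, the flow is: string, then UTF-8 bytes, then one long bit string, then 6-bit groups, then one character from CHARTS per group, plus any '=' padding.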
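The rule is easiest to see on the classic three-letter input "Man" and its shorter prefixes. The sketch below works directly on byte values with bit operations instead of bit strings; base64FromBytes is a hypothetical helper written for this example.

```javascript
const CHARTS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

// Encode an array of byte values, three bytes (24 bits) at a time.
function base64FromBytes(bytes) {
  let out = ''
  for (let i = 0; i < bytes.length; i += 3) {
    const has2 = i + 1 < bytes.length
    const has3 = i + 2 < bytes.length
    // Missing bytes in the last group are treated as zero bits.
    const n = (bytes[i] << 16) | ((has2 ? bytes[i + 1] : 0) << 8) | (has3 ? bytes[i + 2] : 0)
    out += CHARTS[(n >> 18) & 0x3f]
    out += CHARTS[(n >> 12) & 0x3f]
    out += has2 ? CHARTS[(n >> 6) & 0x3f] : '='
    out += has3 ? CHARTS[n & 0x3f] : '='
  }
  return out
}

console.log(base64FromBytes([0x4d, 0x61, 0x6e])) // TWFu ("Man", no padding)
console.log(base64FromBytes([0x4d, 0x61]))       // TWE= ("Ma", two bytes left over)
console.log(base64FromBytes([0x4d]))             // TQ== ("M", one byte left over)
```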

3.2 A simple implementation of Base64 encoding

// Reuses encodeUtf8, decodeUtf8 and padStart from section 2.3.
const CHARTS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

// Right-pad str with fill up to len characters.
function padEnd(str, len, fill) {
  return (str + new Array(len + 1).join(fill)).slice(0, len)
}

function encodeBase64(str) {
  // Concatenate the UTF-8 bytes as one bit string, 8 bits per byte.
  let bitStr = ''
  for (const byte of encodeUtf8(str)) {
    bitStr += padStart(byte.toString(2), 8, '0')
  }
  // The remainder is 0 (nothing left over), 2 (one byte left over) or 4 (two bytes left over).
  const rest = bitStr.length % 6
  const padding = rest === 2 ? '==' : rest === 4 ? '=' : ''
  if (rest !== 0) {
    // Fill the last group up to 6 bits with zeros.
    bitStr = padEnd(bitStr, bitStr.length + (6 - rest), '0')
  }
  const groups = bitStr.match(/\d{6}/g) || []
  return groups.map(bits => CHARTS[parseInt(bits, 2)]).join('') + padding
}

// Assumes str contains only characters from CHARTS plus '=' padding.
function decodeBase64(str) {
  // Drop the '=' padding, then turn each character back into its 6-bit group.
  const bitStr = [...str.replace(/=/g, '')]
    .map(ch => padStart(CHARTS.indexOf(ch).toString(2), 6, '0'))
    .join('')
  // Regroup into whole bytes (leftover padding bits are discarded) and write them as hex.
  const hexStr = (bitStr.match(/\d{8}/g) || [])
    .map(bits => padStart(parseInt(bits, 2).toString(16), 2, '0'))
    .join('')
  return decodeUtf8(hexStr)
}

const input = '羅小步 啊哈哈 𠮷 ssdf 34534 ASD'
const encoded = encodeBase64(input)
console.log(encoded)
const decoded = decodeBase64(encoded)
console.log(decoded)

console.log('test ok?', input === decoded)
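As a cross-check, the same result can be produced with built-in functions. The sketch below assumes that btoa, atob, TextEncoder and TextDecoder exist as globals (browsers and recent Node.js versions); toBase64 and fromBase64 are hypothetical helper names. The string is UTF-8 encoded first because btoa only accepts characters in the range 0x00 to 0xFF.

```javascript
// Base64-encode a string via its UTF-8 bytes.
function toBase64(str) {
  const bytes = new TextEncoder().encode(str)
  let binary = ''
  for (const b of bytes) binary += String.fromCharCode(b) // one char per byte
  return btoa(binary)
}

// Reverse of toBase64: Base64 text to bytes to a UTF-8 decoded string.
function fromBase64(b64) {
  const binary = atob(b64)
  const bytes = Uint8Array.from(binary, c => c.charCodeAt(0))
  return new TextDecoder('utf-8').decode(bytes)
}

const text = '羅小步 啊哈哈 𠮷 ssdf 34534 ASD'
const encoded = toBase64(text)
console.log(encoded)
console.log(fromBase64(encoded) === text) // true
```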

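Because Base64 turns every three 8-bit bytes into four output characters, the encoded data is 4/3 the size of the original, that is, about one third (33%) larger, and slightly more than that when padding is needed.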
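The growth can be computed directly from the byte count. A minimal sketch, where base64Length is a hypothetical helper that returns the length of padded Base64 output:

```javascript
// Every group of up to 3 input bytes produces exactly 4 output characters.
function base64Length(byteCount) {
  return Math.ceil(byteCount / 3) * 4
}

for (const n of [1, 2, 3, 4, 300]) {
  console.log(n + ' bytes -> ' + base64Length(n) + ' characters')
}
// 1 bytes -> 4 characters
// 2 bytes -> 4 characters
// 3 bytes -> 4 characters
// 4 bytes -> 8 characters
// 300 bytes -> 400 characters
```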

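4. Summary

An encoding is just one way of representing a character set. This article implemented UTF-8 encoding and Base64 encoding in JavaScript in a simple way. The code is fairly rough, so corrections to anything inaccurate are welcome, as is further discussion.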
