1. Principle
Compression
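The LZ78 compression procedure is very simple. The encoder maintains a dynamic dictionary (Dictionary) that stores the index and content of every string seen so far. Encoding falls into three cases:
- If the current character c is not in the dictionary, it is encoded as (0, c).
- If the current character c is in the dictionary, take the longest match against the dictionary and encode it as (prefixIndex, lastChar), where prefixIndex is the dictionary index of the longest matching prefix string and lastChar is the first character after that match.
- The last characters of the input get special handling: if the input ends while the remaining string is still in the dictionary, it is encoded as (prefixIndex,).
In the first two cases the newly seen string (the matched prefix plus lastChar) is inserted into the dictionary under the next free index.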
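The three cases above can be sketched as a short generator. This is a minimal sketch rather than the implementation given in section 2; the name `lz78_encode` and the use of an empty string for the missing last character in case 3 are assumptions made for illustration.

```python
def lz78_encode(text):
    """Yield (prefix_index, char) pairs; index 0 means "no prefix"."""
    dictionary = {}   # phrase -> 1-based index
    phrase = ''       # longest dictionary match found so far
    for ch in text:
        if phrase + ch in dictionary:
            # still inside a known phrase: keep extending the match
            phrase += ch
        else:
            # case 1 (phrase is empty) or case 2 (phrase is a known prefix)
            yield (dictionary.get(phrase, 0), ch)
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ''
    if phrase:
        # case 3: the input ended in the middle of a match
        yield (dictionary[phrase], '')


pairs = list(lz78_encode('BABAABRRRA'))
print(pairs)
# [(0, 'B'), (0, 'A'), (1, 'A'), (2, 'B'), (0, 'R'), (5, 'R'), (2, '')]
```

Carrying the current match in `phrase` means every input character is looked at once, at the cost of building a new string for each dictionary lookup.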
If the procedure above is still somewhat unclear, three examples follow. Example one: the string "ABBCBCABABCAABCAAB" is encoded as follows:
1. A is not in the Dictionary; insert it.
2. B is not in the Dictionary; insert it.
3. B is in the Dictionary. BC is not in the Dictionary; insert it.
4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it.
5. B is in the Dictionary. BA is not in the Dictionary; insert it.
6. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it.
7. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is in the Dictionary. BCAAB is not in the Dictionary; insert it.
Example two: the string "BABAABRRRA" is encoded as follows:
1. B is not in the Dictionary; insert it.
2. A is not in the Dictionary; insert it.
3. B is in the Dictionary. BA is not in the Dictionary; insert it.
4. A is in the Dictionary. AB is not in the Dictionary; insert it.
5. R is not in the Dictionary; insert it.
6. R is in the Dictionary. RR is not in the Dictionary; insert it.
7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, )
Example three: the string "AAAAAAAAA" is encoded as follows:
1. A is not in the Dictionary; insert it.
2. A is in the Dictionary. AA is not in the Dictionary; insert it.
3. A is in the Dictionary. AA is in the Dictionary. AAA is not in the Dictionary; insert it.
4. A is in the Dictionary. AA is in the Dictionary. AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3, )
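The way the three example strings are cut into phrases can be checked with a few lines of Python. This is only a sketch of the parsing step (it produces the phrases, not the encoded pairs), and the helper name `lz78_phrases` is hypothetical.

```python
def lz78_phrases(text):
    """Split text into the phrases LZ78 processes, in order."""
    seen, phrases, phrase = set(), [], ''
    for ch in text:
        phrase += ch
        if phrase not in seen:
            # first time this string appears: it becomes a dictionary entry
            seen.add(phrase)
            phrases.append(phrase)
            phrase = ''
    if phrase:
        # trailing phrase that is already in the dictionary (the last-pattern case)
        phrases.append(phrase)
    return phrases


for message in ['ABBCBCABABCAABCAAB', 'BABAABRRRA', 'AAAAAAAAA']:
    print(message, lz78_phrases(message))
# ABBCBCABABCAABCAAB ['A', 'B', 'BC', 'BCA', 'BA', 'BCAA', 'BCAAB']
# BABAABRRRA ['B', 'A', 'BA', 'AB', 'R', 'RR', 'A']
# AAAAAAAAA ['A', 'AA', 'AAA', 'AAA']
```

Each printed list lines up with the numbered steps of the corresponding example; in examples two and three the final phrase is the one handled by the special last-character case.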
Decompression
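Decompression rebuilds the (compression-time) dynamic dictionary from the encoded pairs and then concatenates the strings by index to recover the decoded text. To make this concrete, take the encoded sequence from example one, (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B), and decode it pair by pair. Each pair yields one string, which becomes the next dictionary entry:
1. (0, A): no prefix, so the string is A.
2. (0, B): no prefix, so the string is B.
3. (2, C): entry 2 is B, so the string is BC.
4. (3, A): entry 3 is BC, so the string is BCA.
5. (2, A): entry 2 is B, so the string is BA.
6. (4, A): entry 4 is BCA, so the string is BCAA.
7. (6, B): entry 6 is BCAA, so the string is BCAAB.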
Concatenating these strings in order gives the decompressed text "ABBCBCABABCAABCAAB".
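The dictionary reconstruction can also be traced in code. The sketch below is an illustration under one assumption that is not in the text above: index 0 is stored as an empty prefix so that both kinds of pair go through the same lookup. The function name `lz78_decode_trace` is hypothetical.

```python
def lz78_decode_trace(pairs):
    """Rebuild the dictionary and the text from LZ78 (index, char) pairs."""
    dictionary = {0: ''}   # assumption: index 0 stands for the empty prefix
    pieces = []
    for index, ch in pairs:
        phrase = dictionary[index] + ch
        # len(dictionary) is the next free 1-based index because of the 0 entry
        dictionary[len(dictionary)] = phrase
        pieces.append(phrase)
    return dictionary, ''.join(pieces)


pairs = [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
dictionary, text = lz78_decode_trace(pairs)
for index in range(1, len(dictionary)):
    print(index, dictionary[index])
print(text)   # ABBCBCABABCAABCAAB
```

The decoder never receives the dictionary itself: every entry it needs has already been rebuilt by the time a pair refers to it, because a pair can only point at an earlier entry.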
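The LZ family of compression algorithms
The algorithms in the LZ family are all variants of LZ77 or LZ78, with optimizations added on top.
- LZ77: LZSS, LZR, LZB, LZH;
- LZ78: LZW, LZC, LZT, LZMW, LZJ, LZFG.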
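Of these, LZSS and LZW are the best-known algorithms in their respective camps. LZSS is Storer and Szymanski's [2] improvement on LZ77: it adds a minimum match length, and when the longest match is shorter than that limit, the encoder outputs the character uncompressed but still slides the window right by one character. Google's open-source Snappy compression library broadly follows the LZSS encoding scheme, with a number of engineering optimizations on top.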
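The minimum-match-length rule can be illustrated with a toy tokenizer. This is a rough sketch of the idea only, not the LZSS bit-level format and not how Snappy is implemented: the names `lzss_tokens` and `lzss_expand`, the window size of 16 and the threshold of 3 are all assumptions chosen for readability.

```python
def lzss_tokens(data, window=16, min_match=3):
    """Yield literal characters or (offset, length) back-references."""
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # brute-force search of the sliding window for the longest match
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            yield (best_off, best_len)
            i += best_len
        else:
            # match too short: emit the character as-is, slide by one
            yield data[i]
            i += 1


def lzss_expand(tokens):
    """Invert lzss_tokens; copies one character at a time so matches may overlap."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])
        else:
            out.append(token)
    return ''.join(out)


tokens = list(lzss_tokens('AAAAAAAAA'))
print(tokens)                 # ['A', (1, 8)]
print(lzss_expand(tokens))    # AAAAAAAAA
```

Short matches are left as literals because a back-reference costs more to store than one or two plain characters would.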
2. Implementation
An implementation of the LZ78 algorithm in Python 3.5:
    # -*- coding: utf-8 -*-
    # A simplified implementation of LZ78 algorithm
    # @Time : 2017/1/13
    # @Author : rain


    def compress(message):
        # tree_dict maps each string seen so far to its 1-based index
        tree_dict, m_len, i = {}, len(message), 0
        while i < m_len:
            # case I: the current character is not in the dictionary
            if message[i] not in tree_dict:
                yield (0, message[i])
                tree_dict[message[i]] = len(tree_dict) + 1
                i += 1
            # case III: a known character is the last one in the input
            elif i == m_len - 1:
                yield (tree_dict.get(message[i]), '')
                i += 1
            else:
                for j in range(i + 1, m_len):
                    # case II: message[i:j] is the longest match, message[j] follows it
                    if message[i:j + 1] not in tree_dict:
                        yield (tree_dict.get(message[i:j]), message[j])
                        tree_dict[message[i:j + 1]] = len(tree_dict) + 1
                        i = j + 1
                        break
                    # case III: the input ends while still inside a match
                    elif j == m_len - 1:
                        yield (tree_dict.get(message[i:j + 1]), '')
                        i = j + 1


    def uncompress(packed):
        # tree_dict maps each 1-based index back to its string
        unpacked, tree_dict = '', {}
        for index, ch in packed:
            if index == 0:
                unpacked += ch
                tree_dict[len(tree_dict) + 1] = ch
            else:
                term = tree_dict.get(index) + ch
                unpacked += term
                tree_dict[len(tree_dict) + 1] = term
        return unpacked


    if __name__ == '__main__':
        messages = ['ABBCBCABABCAABCAAB', 'BABAABRRRA', 'AAAAAAAAA']
        for m in messages:
            pack = compress(m)
            unpack = uncompress(pack)
            print(unpack == m)