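After proposing the sliding-window-based LZ77 algorithm, the two giants Jacob Ziv and Abraham Lempel introduced the LZ78 algorithm in a paper published in 1978 [1]. Unlike LZ77, LZ78 keeps its history strings in a dynamic, tree-structured dictionary.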
1. Principle
Compression
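The LZ78 compression process is very simple. The compressor maintains a dynamic dictionary, Dictionary, which stores the index and content of each history string. Encoding falls into three cases:
- If the current character c does not appear in the dictionary, it is encoded as (0, c) and c is inserted into the dictionary.
- If the current character c appears in the dictionary, take the longest match against the dictionary and encode it as (prefixIndex, lastChar), where prefixIndex is the dictionary index of the longest matching prefix string and lastChar is the first character after the longest match. The matched prefix followed by lastChar is then inserted as a new dictionary entry.
- As a special case for the end of the input, when the input ends inside a match there is no following character, so the match is encoded as (prefixIndex, ).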
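The three cases above can be sketched with an explicit tree (trie) dictionary. This is only a minimal illustration, not the implementation given later in this post: the function name lz78_encode and the node layout (a dict mapping a character to an (index, children) tuple) are assumptions made for the example.

```python
def lz78_encode(message):
    """Encode `message` into LZ78 (prefix_index, char) pairs using a trie."""
    # Hypothetical node layout: each node maps a character to
    # (entry_index, child_node). The root is the empty phrase, index 0.
    root = {}
    next_index = 1
    pairs = []

    node, prefix_index = root, 0
    for ch in message:
        if ch in node:
            # The phrase read so far is still in the dictionary: keep matching.
            prefix_index, node = node[ch]
        else:
            # Longest match ended (cases I and II): emit the pair and
            # insert the new phrase as a child of the matched node.
            pairs.append((prefix_index, ch))
            node[ch] = (next_index, {})
            next_index += 1
            node, prefix_index = root, 0

    if prefix_index != 0:
        # Case III: the input ended in the middle of a match.
        pairs.append((prefix_index, ''))
    return pairs


print(lz78_encode('ABBCBCABABCAABCAAB'))
```

Walking the trie one character at a time avoids re-slicing the input for every candidate phrase, which is the main reason to store the dictionary as a tree.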
If the compression process above is still a little puzzling, here are three examples. Example 1: the string "ABBCBCABABCAABCAAB" is compressed as follows:
1. A is not in the Dictionary; insert it.
2. B is not in the Dictionary; insert it.
3. B is in the Dictionary.
   BC is not in the Dictionary; insert it.
4. B is in the Dictionary.
   BC is in the Dictionary.
   BCA is not in the Dictionary; insert it.
5. B is in the Dictionary.
   BA is not in the Dictionary; insert it.
6. B is in the Dictionary.
   BC is in the Dictionary.
   BCA is in the Dictionary.
   BCAA is not in the Dictionary; insert it.
7. B is in the Dictionary.
   BC is in the Dictionary.
   BCA is in the Dictionary.
   BCAA is in the Dictionary.
   BCAAB is not in the Dictionary; insert it.
Example 2: the string "BABAABRRRA" is compressed as follows:
1. B is not in the Dictionary; insert it.
2. A is not in the Dictionary; insert it.
3. B is in the Dictionary.
   BA is not in the Dictionary; insert it.
4. A is in the Dictionary.
   AB is not in the Dictionary; insert it.
5. R is not in the Dictionary; insert it.
6. R is in the Dictionary.
   RR is not in the Dictionary; insert it.
7. A is in the Dictionary and it is the last input character; output a pair
   containing its index: (2, )
Example 3: the string "AAAAAAAAA" is compressed as follows:
1. A is not in the Dictionary; insert it.
2. A is in the Dictionary.
   AA is not in the Dictionary; insert it.
3. A is in the Dictionary.
   AA is in the Dictionary.
   AAA is not in the Dictionary; insert it.
4. A is in the Dictionary.
   AA is in the Dictionary.
   AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3, )
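To reproduce the three walk-throughs above, the following sketch records which phrase each step inserts and which pair it emits. The helper name lz78_parse and its (phrase, pair) return format are assumptions made for illustration, and it uses a flat dict keyed by phrase rather than a tree.

```python
def lz78_parse(message):
    """Return (phrase, pair) tuples showing how LZ78 splits `message`."""
    dictionary = {}      # phrase -> index; indices start at 1
    steps = []
    phrase = ''          # current match, always a dictionary entry or ''
    for ch in message:
        if phrase + ch in dictionary:
            phrase += ch                 # still matching: extend the phrase
            continue
        # phrase + ch is new: emit (index of phrase, ch) and insert it.
        steps.append((phrase + ch, (dictionary.get(phrase, 0), ch)))
        dictionary[phrase + ch] = len(dictionary) + 1
        phrase = ''
    if phrase:
        # The input ended inside a match: emit the index with no character.
        steps.append((phrase, (dictionary[phrase], '')))
    return steps


for text in ('ABBCBCABABCAABCAAB', 'BABAABRRRA', 'AAAAAAAAA'):
    print(text)
    for phrase, pair in lz78_parse(text):
        print('  ', phrase.ljust(6), pair)
```

Each printed row corresponds to one numbered step of the examples: the phrase column is the string inserted into the dictionary (or, on the last row of Examples 2 and 3, the final match), and the pair column is the emitted code.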
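Decompression
Decompression can rebuild the dynamic dictionary used during compression from the encoded pairs alone, and then concatenates the phrases looked up by index into the decoded string. To make this easier to follow, take the encoded sequence from Example 1, (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B), and decode it pair by pair. Each pair (index, c) expands to dictionary entry index followed by c, and the expanded phrase becomes the next dictionary entry. The seven pairs therefore expand to A, B, BC, BCA, BA, BCAA and BCAAB, which are dictionary entries 1 to 7.
Concatenating these phrases in order gives the decompressed string "ABBCBCABABCAABCAAB".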
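The decoding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the name lz78_decode is hypothetical, the dictionary is kept as a list whose position 0 holds the empty phrase, and a trailing (index, '') pair is assumed not to create a new entry, mirroring the compression rules above.

```python
def lz78_decode(pairs):
    """Rebuild the dictionary from LZ78 pairs; return (text, phrases)."""
    phrases = ['']                      # index 0 is the empty phrase
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch    # dictionary entry followed by ch
        out.append(phrase)
        if ch:
            # Only a complete (index, char) pair defines a new entry.
            phrases.append(phrase)
    return ''.join(out), phrases[1:]


pairs = [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
text, phrases = lz78_decode(pairs)
for i, phrase in enumerate(phrases, start=1):
    print(i, phrase)
print(text)
```

Because every index refers to an entry created by an earlier pair, the decoder never needs the dictionary to be transmitted; it grows in lockstep with the encoder's.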
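The LZ family of compression algorithms
The LZ-family compression algorithms are all variants of LZ77 and LZ78, with optimizations built on top of them.
- LZ77: LZSS, LZR, LZB, LZH;
- LZ78: LZW, LZC, LZT, LZMW, LZJ, LZFG.
Among these, LZSS and LZW are the best-known algorithms of the two camps. LZSS is Storer and Szymanski's [2] improvement on LZ77: it adds a minimum match length, and when the longest match is shorter than that limit, the character is output uncompressed while the window still slides one character to the right. Google's open-source Snappy compression library broadly follows the LZSS encoding scheme, with some engineering optimizations on top.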
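The minimum-match-length rule can be illustrated with a toy tokenizer. This is a rough sketch, not the LZSS bitstream format: the names lzss_encode and lzss_decode, the default window and min_match values, and the token representation (a plain character for a literal, an (offset, length) tuple for a match) are all assumptions made for the example.

```python
def lzss_encode(data, window=16, min_match=3):
    """Toy LZSS-style tokenizer: literals are str, matches are (offset, length)."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search the sliding window for the longest match starting at i.
        for start in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy).
            while i + length < len(data) and data[start + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        if best_len >= min_match:
            tokens.append((best_off, best_len))   # long enough: emit a match
            i += best_len
        else:
            tokens.append(data[i])                # too short: emit a literal
            i += 1                                # and slide by one character
    return tokens


def lzss_decode(tokens):
    """Invert lzss_encode by copying literals and back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            for _ in range(length):
                out.append(out[-offset])          # copy one char at a time
        else:
            out.append(tok)
    return ''.join(out)


tokens = lzss_encode('ABABABAB')
print(tokens)
print(lzss_decode(tokens))
```

In this sketch the threshold simply decides whether a match is worth a back-reference; a real encoder would also need a flag to tell literals and matches apart in the output stream.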
2. Implementation
An implementation of the LZ78 algorithm in Python 3.5:
    # -*- coding: utf-8 -*-
    # A simplified implementation of LZ78 algorithm
    # @Time    : 2017/1/13
    # @Author  : rain
    def compress(message):
        """Yield LZ78 (prefix_index, char) pairs for the string `message`."""
        tree_dict, m_len, i = {}, len(message), 0
        while i < m_len:
            # case I: the current character is not in the dictionary yet
            if message[i] not in tree_dict:
                yield (0, message[i])
                tree_dict[message[i]] = len(tree_dict) + 1
                i += 1
            # case III: the last character is already in the dictionary
            elif i == m_len - 1:
                yield (tree_dict[message[i]], '')
                i += 1
            else:
                for j in range(i + 1, m_len):
                    # case II: message[i:j] is the longest match, message[j] follows it
                    if message[i:j + 1] not in tree_dict:
                        yield (tree_dict[message[i:j]], message[j])
                        tree_dict[message[i:j + 1]] = len(tree_dict) + 1
                        i = j + 1
                        break
                    # case III: the input ends while still matching
                    elif j == m_len - 1:
                        yield (tree_dict[message[i:j + 1]], '')
                        i = j + 1


    def uncompress(packed):
        """Rebuild the dictionary from the pairs and return the decoded string."""
        unpacked, tree_dict = '', {}
        for index, ch in packed:
            if index == 0:
                # a new single-character entry
                unpacked += ch
                tree_dict[len(tree_dict) + 1] = ch
            else:
                # an existing entry extended by one character
                term = tree_dict[index] + ch
                unpacked += term
                tree_dict[len(tree_dict) + 1] = term
        return unpacked


    if __name__ == '__main__':
        messages = ['ABBCBCABABCAABCAAB', 'BABAABRRRA', 'AAAAAAAAA']
        for m in messages:
            pack = compress(m)
            unpack = uncompress(pack)
            print(unpack == m)
3. References
[1] Ziv, Jacob, and Abraham Lempel. "Compression of individual sequences via variable-rate coding." IEEE transactions on Information Theory 24.5 (1978): 530-536.
[2] Storer, James A., and Thomas G. Szymanski. "Data compression via textual substitution." Journal of the ACM (JACM) 29.4 (1982): 928-951.
[3] Welch, Terry A. "A technique for high-performance data compression." Computer 17.6 (1984): 8-19.
[4] Jauhar Ali, Unit31_LZ78.ppt.
[5] guyb, 15-853:Algorithms in the Real World - Data Compression III.