[Data Compression] The LZ78 Algorithm: Principle and Implementation


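1. Principle

Compression

The LZ78 compression process is very simple. During compression the encoder maintains a dynamic dictionary (Dictionary) that stores the index and content of every string seen so far. Compression falls into three cases: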

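  1. If the current character c does not appear in the dictionary, encode it as (0, c).
  2. If the current character c appears in the dictionary, find the longest match against the dictionary and encode it as (prefixIndex, lastChar), where prefixIndex is the index of the longest matching prefix string and lastChar is the first character after the longest match.
  3. The end of the input is handled specially: if the remaining input matches a dictionary entry completely, encode it as (prefixIndex, ).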
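The three cases above can be folded into a single pass over the input. The following is a minimal sketch rather than a reference implementation; the function name `lz78_encode` and the use of an empty string for the missing last character are assumptions made for illustration.

```python
def lz78_encode(text):
    """Yield (prefix_index, last_char) pairs; index 0 means "no prefix"."""
    dictionary = {}   # phrase -> 1-based index (assumed layout)
    phrase = ""       # longest dictionary match found so far
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            # Still inside a known phrase: keep extending the match.
            phrase = candidate
        else:
            # Cases 1 and 2: emit the prefix index (0 if empty) and the new char.
            yield (dictionary.get(phrase, 0), ch)
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Case 3: the input ended in the middle of a known phrase.
        yield (dictionary[phrase], "")


print(list(lz78_encode("BABAABRRRA")))
```

Accumulating the match one character at a time avoids re-scanning the input, at the cost of storing every phrase as a full string key.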

If the compression process above is still a little unclear, here are three examples. Example 1: the string "ABBCBCABABCAABCAAB" is encoded as follows:

1. A is not in the Dictionary; insert it. Output (0, A).
2. B is not in the Dictionary; insert it. Output (0, B).
3. B is in the Dictionary. BC is not in the Dictionary; insert it. Output (2, C).
4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it. Output (3, A).
5. B is in the Dictionary. BA is not in the Dictionary; insert it. Output (2, A).
6. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it. Output (4, A).
7. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is in the Dictionary. BCAAB is not in the Dictionary; insert it. Output (6, B).

Example 2: the string "BABAABRRRA" is encoded as follows:

1. B is not in the Dictionary; insert it. Output (0, B).
2. A is not in the Dictionary; insert it. Output (0, A).
3. B is in the Dictionary. BA is not in the Dictionary; insert it. Output (1, A).
4. A is in the Dictionary. AB is not in the Dictionary; insert it. Output (2, B).
5. R is not in the Dictionary; insert it. Output (0, R).
6. R is in the Dictionary. RR is not in the Dictionary; insert it. Output (5, R).
7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, ).

Example 3: the string "AAAAAAAAA" is encoded as follows:

1. A is not in the Dictionary; insert it. Output (0, A).
2. A is in the Dictionary. AA is not in the Dictionary; insert it. Output (1, A).
3. A is in the Dictionary. AA is in the Dictionary. AAA is not in the Dictionary; insert it. Output (2, A).
4. A is in the Dictionary. AA is in the Dictionary. AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3, ).
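To check the three walkthroughs mechanically, one option is a small tracing helper that records each dictionary insertion together with the pair it emits. This is only a sketch; `lz78_trace` and its tuple layout are hypothetical names chosen for this illustration.

```python
def lz78_trace(text):
    """Return a list of (entry_index, phrase, pair) steps.

    entry_index is None for a final step that matches an existing entry
    and therefore inserts nothing (the "last character" case).
    """
    dictionary = {}   # phrase -> 1-based index
    steps = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # known phrase, keep matching
            continue
        pair = (dictionary.get(phrase, 0), ch)
        dictionary[phrase + ch] = len(dictionary) + 1
        steps.append((len(dictionary), phrase + ch, pair))
        phrase = ""
    if phrase:
        steps.append((None, phrase, (dictionary[phrase], "")))
    return steps


for text in ("ABBCBCABABCAABCAAB", "BABAABRRRA", "AAAAAAAAA"):
    print(text)
    for entry, phrase, pair in lz78_trace(text):
        print("  ", entry, phrase, pair)
```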

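Decompression

Decompression rebuilds the dynamic dictionary (the one used during compression) from the encoded pairs, then looks up each index and concatenates the results into the decoded string. To make this concrete, take the encoded sequence from Example 1, (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B), and decode it step by step:

1. (0, A): no prefix; decode A and insert it as entry 1.
2. (0, B): no prefix; decode B and insert it as entry 2.
3. (2, C): entry 2 is B; decode BC and insert it as entry 3.
4. (3, A): entry 3 is BC; decode BCA and insert it as entry 4.
5. (2, A): entry 2 is B; decode BA and insert it as entry 5.
6. (4, A): entry 4 is BCA; decode BCAA and insert it as entry 6.
7. (6, B): entry 6 is BCAA; decode BCAAB and insert it as entry 7.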

Concatenating the decoded strings in order gives the decompressed string "ABBCBCABABCAABCAAB".
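The decoding procedure can be sketched in a few lines. This version is an illustration under one assumption not stated in the text: the dictionary is kept as a list whose entry 0 is the empty string, so that index 0 needs no special branch. The name `lz78_decode` is likewise hypothetical.

```python
def lz78_decode(pairs):
    """Rebuild the original string from (prefix_index, last_char) pairs."""
    entries = [""]   # entries[0] is the empty prefix (assumed convention)
    decoded = []
    for index, ch in pairs:
        phrase = entries[index] + ch   # look up the prefix, append the new char
        entries.append(phrase)         # mirror the encoder's dictionary insert
        decoded.append(phrase)
    return "".join(decoded)


pairs = [(0, "A"), (0, "B"), (2, "C"), (3, "A"), (2, "A"), (4, "A"), (6, "B")]
print(lz78_decode(pairs))
```

A final pair with an empty last character, such as (2, ), decodes correctly here because appending the empty string leaves the looked-up phrase unchanged.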

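The LZ Family of Compression Algorithms

The algorithms in the LZ family are all variants of LZ77 or LZ78, each adding optimizations on top of one of the two.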

  • LZ77: LZSS, LZR, LZB, LZH;
  • LZ78: LZW, LZC, LZT, LZMW, LZJ, LZFG.

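Of these, LZSS and LZW are the best known in their respective camps. LZSS is Storer and Szymanski's [2] improvement of LZ77: it adds a minimum match length, and when the longest match is shorter than that limit, the character is output uncompressed while the sliding window still moves right by one character. Google's open-source Snappy compression library broadly follows the LZSS encoding scheme, with some engineering optimizations on top.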
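The minimum-match-length rule can be illustrated with a toy encoder. This is a rough sketch and not the LZSS bitstream format: the window size, the threshold of 3, the brute-force match search, and the representation of tokens as either a literal character or an (offset, length) tuple are all assumptions made for readability.

```python
def lzss_encode(data, window=4096, min_match=3):
    """Return a list of tokens: a literal char or an (offset, length) tuple."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the sliding window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            # Match too short: emit the character as-is and slide by one.
            tokens.append(data[i])
            i += 1
    return tokens


def lzss_decode(tokens):
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])   # char-by-char copy allows overlap
        else:
            out.append(token)
    return "".join(out)


tokens = lzss_encode("ABABABABAB")
print(tokens, lzss_decode(tokens))
```

A production encoder would mark literals and matches with a flag bit and use a faster match finder such as a hash table; the tuple-versus-character distinction here stands in for that flag.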

2. Implementation

An implementation of the LZ78 algorithm in Python 3.5:

# -*- coding: utf-8 -*-
# A simplified implementation of LZ78 algorithm
# @Time    : 2017/1/13
# @Author  : rain


def compress(message):
    tree_dict, m_len, i = {}, len(message), 0
    while i < m_len:
        # case I: the current character is not in the dictionary
        if message[i] not in tree_dict:
            yield (0, message[i])
            tree_dict[message[i]] = len(tree_dict) + 1
            i += 1
        # case III: the last character is already in the dictionary
        elif i == m_len - 1:
            yield (tree_dict.get(message[i]), '')
            i += 1
        else:
            for j in range(i + 1, m_len):
                # case II: longest match found, emit (prefix index, next char)
                if message[i:j + 1] not in tree_dict:
                    yield (tree_dict.get(message[i:j]), message[j])
                    tree_dict[message[i:j + 1]] = len(tree_dict) + 1
                    i = j + 1
                    break
                # case III: the rest of the input matches an existing entry
                elif j == m_len - 1:
                    yield (tree_dict.get(message[i:j + 1]), '')
                    i = j + 1


def uncompress(packed):
    unpacked, tree_dict = '', {}
    for index, ch in packed:
        if index == 0:
            unpacked += ch
            tree_dict[len(tree_dict) + 1] = ch
        else:
            term = tree_dict.get(index) + ch
            unpacked += term
            tree_dict[len(tree_dict) + 1] = term
    return unpacked


if __name__ == '__main__':
    messages = ['ABBCBCABABCAABCAAB', 'BABAABRRRA', 'AAAAAAAAA']
    for m in messages:
        pack = compress(m)
        unpack = uncompress(pack)
        print(unpack == m)

