[Data Compression] LZ78: Principles and Implementation

Reposted article, dated 2017-01-13 16:37, filed under Information Theory and Coding.

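After proposing the sliding-window-based LZ77 algorithm, Jacob Ziv and Abraham Lempel introduced the LZ78 algorithm in a paper published in 1978 [1]. Unlike LZ77, LZ78 keeps track of previously seen strings in a dynamic, tree-structured dictionary.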


1. Principles

Compression

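The LZ78 compression procedure is simple. The encoder maintains a dynamic dictionary, Dictionary, that stores each previously seen string together with its index. Encoding falls into three cases: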

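If the current character c is not in the dictionary, it is encoded as (0, c) and c is added to the dictionary;
If the current character c is in the dictionary, the encoder finds the longest match against the dictionary and emits (prefixIndex, lastChar), where prefixIndex is the dictionary index of the longest matching prefix string and lastChar is the first character after that match; the matched prefix followed by lastChar is then added to the dictionary as a new entry;
If the input ends while the remaining characters exactly match a dictionary entry, this last pattern is handled as a special case and encoded as (prefixIndex, ), with no trailing character.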
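
The three cases above can be sketched with an explicit trie, which is one way to realize the tree-structured dictionary. This is a minimal illustration rather than a reference implementation; the function name lz78_encode and the node layout (a dict mapping a character to an index and a child node) are assumptions made for this example.

```python
def lz78_encode(text):
    """Encode text into LZ78 (prefix_index, char) pairs using a trie.

    Hypothetical helper for illustration only.
    """
    # Each trie node is a dict: next character -> (phrase index, child node).
    root = {}
    next_index = 1                 # index 0 is reserved for the empty prefix
    pairs = []
    node, prefix_index = root, 0
    for ch in text:
        if ch in node:
            # Still inside a known phrase: descend and keep matching.
            prefix_index, node = node[ch]
        else:
            # Cases 1 and 2: emit (longest matching prefix, next character)
            # and register the extended phrase as a new dictionary entry.
            pairs.append((prefix_index, ch))
            node[ch] = (next_index, {})
            next_index += 1
            node, prefix_index = root, 0
    if prefix_index != 0:
        # Case 3: the input ended while a match was still in progress.
        pairs.append((prefix_index, ''))
    return pairs


print(lz78_encode("ABBCBCABABCAABCAAB"))
# [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
```

Walking the trie costs one dictionary lookup per input character, whereas a flat dictionary keyed by whole substrings has to re-slice and re-hash the growing candidate at every step.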

If the procedure above is still unclear, the following three examples should help. Example 1: the string "ABBCBCABABCAABCAAB" is encoded as follows:

1. A is not in the Dictionary; insert it
2. B is not in the Dictionary; insert it
3. B is in the Dictionary.
    BC is not in the Dictionary; insert it.  
4. B is in the Dictionary.
    BC is in the Dictionary.
    BCA is not in the Dictionary; insert it.
5. B is in the Dictionary.
    BA is not in the Dictionary; insert it.
6. B is in the Dictionary.
    BC is in the Dictionary.
    BCA is in the Dictionary.
    BCAA is not in the Dictionary; insert it.
7. B is in the Dictionary.
    BC is in the Dictionary.
    BCA is in the Dictionary.
    BCAA is in the Dictionary.
    BCAAB is not in the Dictionary; insert it.

Example 2: the string "BABAABRRRA" is encoded as follows:

1.  B is not in the Dictionary; insert it
2.  A is not in the Dictionary; insert it
3.  B is in the Dictionary.
     BA is not in the Dictionary; insert it.    
4.  A is in the Dictionary.
     AB is not in the Dictionary; insert it.
5.  R is not in the Dictionary; insert it.
6.  R is in the Dictionary.
     RR is not in the Dictionary; insert it.
7.  A is in the Dictionary and it is the last input character; output a pair 
      containing its index: (2, )

Example 3: the string "AAAAAAAAA" is encoded as follows:

1.  A is not in the Dictionary; insert it
2.  A is in the Dictionary
     AA is not in the Dictionary; insert it
3.  A is in the Dictionary.
     AA is in the Dictionary.
     AAA is not in the Dictionary; insert it.
4.  A is in the Dictionary.
     AA is in the Dictionary.
     AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3,  )
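
The traces in the three examples amount to cutting the input into phrases, where each phrase is the shortest string not yet in the dictionary. As a rough check, the sketch below reproduces that parsing; lz78_phrases is a hypothetical helper name, and a plain set stands in for the dictionary because only membership matters here.

```python
def lz78_phrases(text):
    """Split text into the phrases an LZ78 encoder processes, in order.

    Hypothetical helper for illustration only.
    """
    seen = set()        # phrases inserted into the dictionary so far
    phrases = []
    start = 0
    for end in range(1, len(text) + 1):
        candidate = text[start:end]
        if candidate not in seen:
            # Shortest new string: insert it and start the next phrase.
            seen.add(candidate)
            phrases.append(candidate)
            start = end
    if start < len(text):
        # The tail is already in the dictionary (the "last pattern" case).
        phrases.append(text[start:])
    return phrases


for example in ("ABBCBCABABCAABCAAB", "BABAABRRRA", "AAAAAAAAA"):
    print(example, "->", " | ".join(lz78_phrases(example)))
# ABBCBCABABCAABCAAB -> A | B | BC | BCA | BA | BCAA | BCAAB
# BABAABRRRA -> B | A | BA | AB | R | RR | A
# AAAAAAAAA -> A | AA | AAA | AAA
```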

Decompression

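Decompression rebuilds, from the encoded pairs alone, the same dynamic dictionary that was built during compression, and then concatenates the indexed strings to recover the original text. To make this concrete, take the code sequence from Example 1, (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B), and decode it pair by pair.

[Figure: step-by-step decoding of the code sequence from Example 1]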

Concatenating the decoded phrases in order yields the string "ABBCBCABABCAABCAAB".
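
The decoding steps for this sequence can also be traced in code. The sketch below is an illustration under assumed names (lz78_decode_steps is hypothetical); it keeps the dictionary as a list whose position k holds entry k, with position 0 reserved for the empty prefix.

```python
def lz78_decode_steps(pairs):
    """Yield (step, pair, phrase) while rebuilding the LZ78 dictionary.

    Hypothetical helper for illustration only.
    """
    phrases = ['']      # phrases[0] is the empty prefix; phrases[k] is entry k
    for step, (index, ch) in enumerate(pairs, start=1):
        # Each phrase is an earlier entry extended by one character.
        phrase = phrases[index] + ch
        # A final pair with an empty character only repeats an existing
        # entry; appending it is harmless because nothing refers to it.
        phrases.append(phrase)
        yield step, (index, ch), phrase


codes = [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
decoded = []
for step, pair, phrase in lz78_decode_steps(codes):
    print(step, pair, phrase)
    decoded.append(phrase)
print(''.join(decoded))
# 1 (0, 'A') A
# 2 (0, 'B') B
# 3 (2, 'C') BC
# 4 (3, 'A') BCA
# 5 (2, 'A') BA
# 6 (4, 'A') BCAA
# 7 (6, 'B') BCAAB
# ABBCBCABABCAABCAAB
```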

The LZ family of compression algorithms

The algorithms in the LZ family are all variants of LZ77 or LZ78, each adding its own optimizations on top of the base scheme.

LZ77: LZSS, LZR, LZB, LZH;
LZ78: LZW, LZC, LZT, LZMW, LZJ, LZFG.

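Of these, LZSS and LZW (Welch [3]) are the best-known members of the two camps. LZSS, due to Storer and Szymanski [2], improves on LZ77 by adding a minimum match length: when the longest match is shorter than this limit, the character is emitted uncompressed, and the sliding window still advances by one character. Google's open-source Snappy compression library broadly follows the LZSS encoding scheme, with a number of engineering optimizations on top.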
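
The minimum-match-length rule can be sketched as follows. This is a simplified illustration of the idea only: the names lzss_tokens and lzss_expand, the parameters window_size and min_match, and the token layout are assumptions for this example and do not reflect the bit-level format of LZSS or of Snappy. The window search is brute force for clarity.

```python
def lzss_tokens(data, window_size=16, min_match=3):
    """Tokenize data into literals and (offset, length) back-references.

    Hypothetical helper for illustration only.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_offset = 0, 0
        window_start = max(0, pos - window_size)
        # Brute-force search of the window for the longest match.
        for candidate in range(window_start, pos):
            length = 0
            while (pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_offset = length, pos - candidate
        if best_len >= min_match:
            # Long enough: emit a back-reference and skip the matched text.
            tokens.append((best_offset, best_len))
            pos += best_len
        else:
            # Too short to pay off: emit the literal and advance by one.
            tokens.append(data[pos])
            pos += 1
    return tokens


def lzss_expand(tokens):
    """Inverse of lzss_tokens: rebuild the original string."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            # Copy one character at a time so overlapping matches work.
            for _ in range(length):
                out.append(out[-offset])
        else:
            out.append(token)
    return ''.join(out)


print(lzss_tokens("AAAAAAAAA"))
# ['A', (1, 8)]
```

Raising min_match trades fewer, longer back-references for more literals; the threshold that pays off depends on how many bits a reference costs in the actual output format.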

2. Implementation

An implementation of the LZ78 algorithm in Python 3.5:

# -*- coding: utf-8 -*-
# A simplified implementation of LZ78 algorithm
# @Time    : 2017/1/13
# @Author  : rain


def compress(message):
    # tree_dict maps each phrase seen so far to its 1-based index
    tree_dict, m_len, i = {}, len(message), 0
    while i < m_len:
        # case I: the current character is not in the dictionary yet
        if message[i] not in tree_dict:
            yield (0, message[i])
            tree_dict[message[i]] = len(tree_dict) + 1
            i += 1
        # case III: the last character is already in the dictionary
        elif i == m_len - 1:
            yield (tree_dict[message[i]], '')
            i += 1
        else:
            for j in range(i + 1, m_len):
                # case II: message[i:j] is the longest match; emit it plus the next character
                if message[i:j + 1] not in tree_dict:
                    yield (tree_dict[message[i:j]], message[j])
                    tree_dict[message[i:j + 1]] = len(tree_dict) + 1
                    i = j + 1
                    break
                # case III: the match runs to the end of the input
                elif j == m_len - 1:
                    yield (tree_dict[message[i:j + 1]], '')
                    i = j + 1


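def uncompress(packed):
    # tree_dict maps each 1-based index to the phrase decoded at that step
    unpacked, tree_dict = '', {}
    for index, ch in packed:
        if index == 0:
            # no prefix: the phrase is the single character ch
            unpacked += ch
            tree_dict[len(tree_dict) + 1] = ch
        else:
            # known prefix followed by ch (ch may be '' for the final pair)
            term = tree_dict[index] + ch
            unpacked += term
            tree_dict[len(tree_dict) + 1] = term
    return unpacked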


if __name__ == '__main__':
    messages = ['ABBCBCABABCAABCAAB', 'BABAABRRRA', 'AAAAAAAAA']
    for m in messages:
        pack = compress(m)
        unpack = uncompress(pack)
        print(unpack == m)

3. References

[1] Ziv, Jacob, and Abraham Lempel. "Compression of individual sequences via variable-rate coding." IEEE Transactions on Information Theory 24.5 (1978): 530-536.
[2] Storer, James A., and Thomas G. Szymanski. "Data compression via textual substitution." Journal of the ACM (JACM) 29.4 (1982): 928-951.
[3] Welch, Terry A. "A technique for high-performance data compression." Computer 17.6 (1984): 8-19.
[4] Jauhar Ali, Unit31_LZ78.ppt.
[5] guyb, 15-853:Algorithms in the Real World - Data Compression III.
