1. Introduction to the bitmap method

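The basic idea of a bitmap is to use a single bit to mark the state of one data item. Because storage is counted in bits, this saves a great deal of space. As a concrete example, an int in Java normally occupies 32 bits; if a single bit is enough to record that number, the storage needed shrinks substantially. This approach is usually called the bitmap method (Bitmap).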

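The bitmap method suits existence checks in which each element has only a few possible states but there are a great many elements. How is it done? Suppose there are 250 million integers. Maintain a bit string whose length is the maximum value plus one, and set the bit at an integer's position to 1 if that integer is present. For example, for the integers {2, 4, 5, 6, 67, 5}, a bit string 00…0000 of 68 bits (positions 0 to 67) is enough. If the maximum value is not known in advance, the bit string needs 2^32 bits, because an int occupies 4 bytes (32 bits) and can therefore take 2^32 distinct values. That requires at least 512 MB of memory: divide the number of bits by 8 to get bytes, then by 2^20 to get MB, so 2^32 / (8 × 2^20) = 512. That is all there is to the bitmap method.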
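The description above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name `SimpleBitmap` and its methods are hypothetical, the values are assumed to be non-negative, and a `long[]` is used as backing storage (64 bits per element).

```java
// Minimal bitmap over non-negative ints (hypothetical helper, not a library class).
class SimpleBitmap {
    private final long[] words;

    // Allocates enough 64-bit words to hold bits 0..maxValue inclusive.
    SimpleBitmap(int maxValue) {
        words = new long[(maxValue >>> 6) + 1];
    }

    // Marks value as present by setting its bit to 1.
    void set(int value) {
        words[value >>> 6] |= 1L << (value & 63);
    }

    // Returns true if the bit for value has been set.
    boolean contains(int value) {
        return (words[value >>> 6] & (1L << (value & 63))) != 0;
    }

    // Memory needed for the given number of bits, in MB (1 MB = 2^20 bytes).
    static long megabytesFor(long bits) {
        return bits / 8 / (1L << 20);
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 5, 6, 67, 5};
        SimpleBitmap bitmap = new SimpleBitmap(67);
        for (int v : data) {
            bitmap.set(v);
        }
        System.out.println(bitmap.contains(67));     // true
        System.out.println(bitmap.contains(3));      // false
        System.out.println(megabytesFor(1L << 32));  // 512
    }
}
```

Setting 5 twice has no further effect, which is why a plain bitmap can only answer "present or not" and cannot count occurrences.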

2. BitSet

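Because bitmaps are so space-efficient, many languages support them directly. The C++ standard library has a bitset container, and Java has the BitSet class in the java.util package. BitSet implements a vector of bits that grows as needed. Each bit of a BitSet has a boolean value. The bits are indexed by non-negative integers, and each indexed bit can be tested, set, or cleared. One BitSet can be used to modify the contents of another through logical AND, logical OR, and logical XOR operations.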
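As a brief illustration of these operations, the sketch below uses only standard `java.util.BitSet` methods (`set`, `get`, `clear`, `and`, `or`, `xor`). The class name `BitSetBasics` and the helper `intersection` are made up for this example.

```java
import java.util.BitSet;

class BitSetBasics {
    // Returns the indices set in both a and b, leaving the arguments untouched.
    static BitSet intersection(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(1);
        a.set(3);
        a.set(5);
        a.clear(5);                              // a is now {1, 3}
        System.out.println(a.get(3));            // true

        BitSet b = new BitSet();
        b.set(3);
        b.set(4);                                // b is {3, 4}

        System.out.println(intersection(a, b));  // {3}

        BitSet union = (BitSet) a.clone();
        union.or(b);
        System.out.println(union);               // {1, 3, 4}

        BitSet difference = (BitSet) a.clone();
        difference.xor(b);
        System.out.println(difference);          // {1, 4}
    }
}
```

Note that `and`, `or`, and `xor` modify the receiver in place, so the sketch clones before combining.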

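Note that BitSet stores its data in a long array, so the smallest unit by which it grows is the number of bits in one long, that is, 64 bits. Unless you have extreme requirements on storage space and are very confident in your fundamentals, writing your own BitSet-like class is not recommended: the JDK classes are compact and sensibly optimized, and the BitSet class itself is fairly long.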
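The 64-bit granularity can be observed through `BitSet.size()`, which reports the number of bits of storage currently allocated. The sketch below is a small experiment, not a specification: the exact growth policy is implementation-dependent, so the only property assumed here is that the reported size is a multiple of 64. The class and method names are hypothetical.

```java
import java.util.BitSet;

class BitSetGrowth {
    // Storage (in bits) held by a BitSet after setting only the bit at index.
    static int capacityAfterSetting(int index) {
        BitSet bits = new BitSet(1);
        bits.set(index);
        return bits.size();
    }

    public static void main(String[] args) {
        // A BitSet asked to hold a single bit still allocates a whole long.
        System.out.println(new BitSet(1).size());

        // Setting bit 64 forces at least a second long to be allocated.
        System.out.println(capacityAfterSetting(64));
    }
}
```

On OpenJDK the first line prints 64; the second prints a larger multiple of 64.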

3. Sorting without duplicates

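The sorting algorithms used by the container classes in the Java JDK are mainly insertion sort and merge sort, although the implementation differs between versions (for example, OpenJDK 7 and later sort object arrays with TimSort). The key code is as follows; SIMPLE_LENGTH and find are defined elsewhere in the same class: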

    /**
     * Performs a sort on the section of the array between the given indices
     * using a mergesort with exponential search algorithm (in which the merge
     * is performed by exponential search). n*log(n) performance is guaranteed
     * and in the average case it will be faster then any mergesort in which the
     * merge is performed by linear search.
     *
     * @param in -
     *            the array for sorting.
     * @param out -
     *            the result, sorted array.
     * @param start
     *            the start index
     * @param end
     *            the end index + 1
     */
    @SuppressWarnings("unchecked")
    private static void mergeSort(Object[] in, Object[] out, int start,
            int end) {
        int len = end - start;
        // use insertion sort for small arrays
        if (len <= SIMPLE_LENGTH) {
            for (int i = start + 1; i < end; i++) {
                Comparable<Object> current = (Comparable<Object>) out[i];
                Object prev = out[i - 1];
                if (current.compareTo(prev) < 0) {
                    int j = i;
                    do {
                        out[j--] = prev;
                    } while (j > start
                            && current.compareTo(prev = out[j - 1]) < 0);
                    out[j] = current;
                }
            }
            return;
        }
        int med = (end + start) >>> 1;
        mergeSort(out, in, start, med);
        mergeSort(out, in, med, end);

        // merging

        // if arrays are already sorted - no merge
        if (((Comparable<Object>) in[med - 1]).compareTo(in[med]) <= 0) {
            System.arraycopy(in, start, out, start, len);
            return;
        }
        int r = med, i = start;

        // use merging with exponential search
        do {
            Comparable<Object> fromVal = (Comparable<Object>) in[start];
            Comparable<Object> rVal = (Comparable<Object>) in[r];
            if (fromVal.compareTo(rVal) <= 0) {
                int l_1 = find(in, rVal, -1, start + 1, med - 1);
                int toCopy = l_1 - start + 1;
                System.arraycopy(in, start, out, i, toCopy);
                i += toCopy;
                out[i++] = rVal;
                r++;
                start = l_1 + 1;
            } else {
                int r_1 = find(in, fromVal, 0, r + 1, end - 1);
                int toCopy = r_1 - r + 1;
                System.arraycopy(in, r, out, i, toCopy);
                i += toCopy;
                out[i++] = fromVal;
                start++;
                r = r_1 + 1;
            }
        } while ((end - r) > 0 && (med - start) > 0);

        // copy rest of array
        if ((end - r) <= 0) {
            System.arraycopy(in, start, out, i, med - start);
        } else {
            System.arraycopy(in, r, out, i, end - r);
        }
    }

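Now for the idea behind bitmap sorting. It was already outlined at the start; an example makes it easier to follow. Suppose we have a sequence of distinct non-negative integers {n1, n2, ..., nn} whose maximum value is nx. We maintain a bit string of nx + 1 bits (positions 0 to nx). In a first pass over the sequence, we set to 1 the bit at the position of every number that occurs. In a second pass over the bitmap, we output, in order, the position of every bit whose value is 1. The index of each 1 bit in the bit string stands for a value in the sequence, and the order in which the 1 bits appear gives the order of the values by size.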
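The two passes just described can be sketched compactly with `java.util.BitSet`. This is a minimal illustration that assumes distinct, non-negative input values and a known upper bound; the class `BitmapSort` and the parameter `maxValue` are names chosen for the example.

```java
import java.util.Arrays;
import java.util.BitSet;

class BitmapSort {
    // Sorts distinct non-negative ints; maxValue must be >= every element.
    static int[] sort(int[] values, int maxValue) {
        // First pass: set the bit at the position of each value.
        BitSet seen = new BitSet(maxValue + 1);
        for (int v : values) {
            seen.set(v);
        }

        // Second pass: walk the set bits in ascending index order.
        int[] sorted = new int[seen.cardinality()];
        int k = 0;
        for (int i = seen.nextSetBit(0); i >= 0; i = seen.nextSetBit(i + 1)) {
            sorted[k++] = i;
            if (i == Integer.MAX_VALUE) {
                break; // avoid overflow of i + 1
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] sorted = sort(new int[]{6, 2, 67, 4, 5}, 67);
        System.out.println(Arrays.toString(sorted)); // [2, 4, 5, 6, 67]
    }
}
```

The running time is proportional to the number of values plus the size of the bitmap, so the approach pays off when the values are dense in their range.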

The following Java program implements this principle:

    package acm;

    import java.util.*;

    public class javaUniqueSort {
        // One flag per possible value 0..100000. An int[] is used for simplicity;
        // a true bitmap would spend only one bit per value.
        public static int[] temp = new int[100001];
        public static List<Integer> tempList = new ArrayList<Integer>();
        public static int count;
        public static long start;
        public static long end;

        // Sorts a list of distinct integers in the range 0..100000.
        public static List<Integer> uniqueSort(final List<Integer> uniqueList) {
            javaUniqueSort.tempList.clear();
            // Reset all flags left over from a previous call.
            for (int i = 0; i < javaUniqueSort.temp.length; i++) {
                javaUniqueSort.temp[i] = 0;
            }
            // First pass: flag every value that occurs.
            for (int i = 0; i < uniqueList.size(); i++) {
                javaUniqueSort.temp[uniqueList.get(i)] = 1;
            }
            // Second pass: collect the flagged indices in ascending order.
            for (int i = 0; i < javaUniqueSort.temp.length; i++) {
                if (javaUniqueSort.temp[i] == 1) {
                    javaUniqueSort.tempList.add(i);
                }
            }

            return javaUniqueSort.tempList;
        }

        public static void getStartTime() {
            javaUniqueSort.start = System.nanoTime();
        }

        public static void getEndTime(final String s) {
            javaUniqueSort.end = System.nanoTime();
            System.out.println(s + ": " + (javaUniqueSort.end - javaUniqueSort.start) + "ns");
        }

        public static void main(final String[] args) {
            List<Integer> firstNum = new ArrayList<Integer>();
            List<Integer> secondNum = new ArrayList<Integer>();

            // Two identical lists holding 1..100000, then shuffled independently.
            for (int i = 1; i <= 100000; i++) {
                firstNum.add(i);
                secondNum.add(i);
            }

            Collections.shuffle(firstNum);
            Collections.shuffle(secondNum);

            // Time the library sort.
            getStartTime();
            Collections.sort(firstNum);
            getEndTime("java sort run time  ");

            // Time the flag-array sort.
            getStartTime();
            secondNum = uniqueSort(secondNum);
            getEndTime("uniqueSort run time ");
        }
    }

Execution result:

4. Sorting with duplicates

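Sorting an integer sequence that contains duplicates falls into two cases: sorting while keeping the duplicates, and sorting while removing them.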

4.1 Sorting while keeping duplicates

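Idea: in the duplicate-free case above, each integer has only two states in the bit string: it either occurs in the sequence (1) or does not (0). For a sequence with duplicates we can still record a state per integer, namely the number of times it occurs; the only difference is that the number of possible states is no longer fixed, so each position needs a counter rather than a single bit. The implementation is similar to the one above.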

    package acm;

    import java.util.*;

    public class javaDuplicateSort {
        public static List<Integer> tempList = new ArrayList<Integer>();
        public static int count;
        public static long start;
        public static long end;

        public static void main(final String[] args) {
            Random random = new Random();
            List<Integer> firstNum = new ArrayList<Integer>();
            List<Integer> secondNum = new ArrayList<Integer>();

            // Build two lists of 200000 values in 0..100000 that contain duplicates.
            for (int i = 1; i <= 100000; i++) {
                firstNum.add(i);
                secondNum.add(i);
                firstNum.add(random.nextInt(i + 1));
                secondNum.add(random.nextInt(i + 1));
            }
            Collections.shuffle(firstNum);
            Collections.shuffle(secondNum);

            // Time the library sort.
            getStartTime();
            Collections.sort(firstNum);
            getEndTime("java sort run time  ");

            // Time the counting-based sort.
            getStartTime();
            secondNum = uniqueSort(secondNum);
            getEndTime("uniqueSort run time ");
        }

        // Despite its name, this method keeps duplicates: it counts how often each
        // value occurs and then emits every value that many times.
        public static List<Integer> uniqueSort(final List<Integer> uniqueList) {
            javaDuplicateSort.tempList.clear();
            // Java zero-initializes new arrays, so every counter starts at 0.
            int[] temp = new int[200002];
            // First pass: count the occurrences of each value.
            for (int i = 0; i < uniqueList.size(); i++) {
                temp[uniqueList.get(i)]++;
            }
            // Second pass: output each index as many times as it was counted.
            for (int i = 0; i < temp.length; i++) {
                for (int j = temp[i]; j > 0; j--) {
                    javaDuplicateSort.tempList.add(i);
                }
            }

            return javaDuplicateSort.tempList;
        }

        public static void getStartTime() {
            javaDuplicateSort.start = System.nanoTime();
        }

        public static void getEndTime(final String s) {
            javaDuplicateSort.end = System.nanoTime();
            System.out.println(s + ": " + (javaDuplicateSort.end - javaDuplicateSort.start) + "ns");
        }
    }

Execution result:

4.2 Sorting with duplicates removed

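Idea: removing duplicates means that an integer occurring several times in the sequence is kept only once. This is easy to handle by taking the previous method one step further: set every count greater than 1 back to 1, which removes the duplicates. Alternatively, add a condition during output so that any count greater than 1 is treated as 1, which gives the same result. There are many ways to do it and the choice is a matter of taste, so no implementation is given here.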
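One possible sketch of the second variant (treating any count greater than 1 as 1 during output) is shown below. It is an illustration rather than the author's code: the class `DedupSort`, the method `sortDistinct`, and the `maxValue` parameter are assumptions, and the input is assumed to contain only values in 0..maxValue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupSort {
    // Returns the distinct values of the input in ascending order.
    static List<Integer> sortDistinct(List<Integer> values, int maxValue) {
        // Count the occurrences of each value, as in the duplicate-keeping sort.
        int[] counts = new int[maxValue + 1];
        for (int v : values) {
            counts[v]++;
        }

        // Emit each value once, no matter how often it was counted.
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 2, 5, 67, 2, 4);
        System.out.println(sortDistinct(input, 67)); // [2, 4, 5, 67]
    }
}
```

If the counts are never needed, a plain bitmap (one bit per value) gives the same output with less memory, since setting an already-set bit discards the duplicate automatically.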

5. Data compression

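Suppose we have a data set that records, as key-value pairs, the name and birth year of everyone in the country born between 1990 and 1999. Say there are exactly ten million people, so ten million names and years have to be stored. How can the bitmap idea be used to compress this data? Here are a few approaches.

From the point of view of each person: there are only 10 distinct years, so 4 bits are enough to tell them apart, for example 0000 for 1990 and 1001 for 1999. One person's birth year then takes 4 bits, and the ten million years compress to ten million 4-bit groups.

From the point of view of each year: there are 10 years, and each person either was or was not born in a given year. For a given year, each person can therefore be reduced to a single bit, so the ten million years become 10 bit groups of ten million bits each. This compresses less than the per-person encoding, but it has an advantage for questions that start from the year. To find who was born in 1990, for instance, only the bit group for 1990 has to be scanned.

Whichever view is taken, bitmap compression relies on the data containing a lot of redundancy, as the years do here. In the problem above the years are in no particular order. If the data is sorted beforehand so that people with the same birth year are stored together, it can be compressed further: it is then enough to record the number of people for each year, and each person's birth year follows from their index.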
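The first two encodings can be sketched as follows. This is an illustrative sketch under assumptions: people are identified by their index in an array, all years lie in 1990..1999, and the names `BirthYearEncoding`, `packYears`, `yearAt`, and `indexByYear` are invented for the example.

```java
import java.util.BitSet;

class BirthYearEncoding {
    static final int BASE_YEAR = 1990;
    static final int YEAR_COUNT = 10;

    // Per-person view: store each year as a 4-bit code, two codes per byte.
    static byte[] packYears(int[] years) {
        byte[] packed = new byte[(years.length + 1) / 2];
        for (int i = 0; i < years.length; i++) {
            int code = years[i] - BASE_YEAR;      // 0..9 fits in 4 bits
            int shift = (i % 2 == 0) ? 0 : 4;     // low nibble, then high nibble
            packed[i / 2] |= (byte) (code << shift);
        }
        return packed;
    }

    // Reads back the birth year of the person at the given index.
    static int yearAt(byte[] packed, int person) {
        int shift = (person % 2 == 0) ? 0 : 4;
        return BASE_YEAR + ((packed[person / 2] >> shift) & 0x0F);
    }

    // Per-year view: one BitSet per year; bit p is set if person p was born then.
    static BitSet[] indexByYear(int[] years) {
        BitSet[] byYear = new BitSet[YEAR_COUNT];
        for (int y = 0; y < YEAR_COUNT; y++) {
            byYear[y] = new BitSet(years.length);
        }
        for (int p = 0; p < years.length; p++) {
            byYear[years[p] - BASE_YEAR].set(p);
        }
        return byYear;
    }

    public static void main(String[] args) {
        int[] years = {1990, 1999, 1995, 1990};

        byte[] packed = packYears(years);
        System.out.println(packed.length);       // 2
        System.out.println(yearAt(packed, 1));   // 1999

        BitSet[] byYear = indexByYear(years);
        System.out.println(byYear[0]);           // {0, 3}: the people born in 1990
    }
}
```

The per-person form uses 4 bits per person, while the per-year form uses 10 bits per person but answers "who was born in year Y" by scanning a single BitSet.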

Summary

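The bitmap method can be used to sort, deduplicate, and compress massive data sets. On dense data sets its advantages (low memory use, high speed) show clearly, but on sparse data sets it backfires. For example, for a sequence of 10 values whose maximum is 2 billion, building the bit string costs about 250 MB (2 × 10^9 bits / 8), while the data itself needs only 40 bytes (10 four-byte ints). The bitmap method also has other drawbacks, such as poor readability.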
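The dense-versus-sparse trade-off comes down to simple arithmetic, sketched below. The helper names are hypothetical; the bitmap is assumed to cover positions 0..maxValue, and a plain int array is taken as the baseline.

```java
class BitmapCost {
    // Bytes needed for a bitmap covering positions 0..maxValue, rounded up.
    static long bitmapBytes(long maxValue) {
        return (maxValue + 1 + 7) / 8;
    }

    // Bytes needed to store count values as plain 4-byte ints.
    static long intArrayBytes(long count) {
        return count * Integer.BYTES;
    }

    public static void main(String[] args) {
        // Sparse case: 10 values, maximum 2 billion.
        System.out.println(bitmapBytes(2_000_000_000L)); // 250000001 (about 250 MB)
        System.out.println(intArrayBytes(10));           // 40
    }
}
```

As a rule of thumb under these assumptions, the bitmap is smaller only when the number of values exceeds roughly 1/32 of the value range.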

