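Algorithm Basics Interview Questions 05: Hash Functions and Hash Tables, Generating Multiple Hash Functions, Hash Table Resizing, Using Hash Partitioning to Find Duplicate Content in a Large File, Designing the RandomPool Structure, Bloom Filters, Consistent Hashing, Union-Find, the Islands Problem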

This article is a repost (2019-01-28 11:36). Tags: data structures and algorithms, Java

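Today's topics: an introduction to hash functions, hash tables, Bloom filters, consistent hashing, and union-find, together with their applications.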

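Problem 1

Understanding hash functions and hash tables

1. The input domain is unbounded.

2. The output falls within a finite set S.

3. The same input always produces the same output.

4. Hash collisions occur: different inputs can produce the same output.

5. Outputs are spread evenly over S. This dispersal property of hash functions scrambles any regularity in the input.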

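import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;

public class Code_01_HashMap {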

    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();
        map.put("zuo", "31");

        System.out.println(map.containsKey("zuo"));
        System.out.println(map.containsKey("chengyun"));
        System.out.println("=========================");

        System.out.println(map.get("zuo"));
        System.out.println(map.get("chengyun"));
        System.out.println("=========================");

        System.out.println(map.isEmpty());
        System.out.println(map.size());
        System.out.println("=========================");

        System.out.println(map.remove("zuo"));
        System.out.println(map.containsKey("zuo"));
        System.out.println(map.get("zuo"));
        System.out.println(map.isEmpty());
        System.out.println(map.size());
        System.out.println("=========================");

        map.put("zuo", "31");
        System.out.println(map.get("zuo"));
        map.put("zuo", "32");
        System.out.println(map.get("zuo"));
        System.out.println("=========================");

        map.put("zuo", "31");
        map.put("cheng", "32");
        map.put("yun", "33");

        for (String key : map.keySet()) {
            System.out.println(key);
        }
        System.out.println("=========================");

        for (String values : map.values()) {
            System.out.println(values);
        }
        System.out.println("=========================");

        map.clear();
        map.put("A", "1");
        map.put("B", "2");
        map.put("C", "3");
        map.put("D", "1");
        map.put("E", "2");
        map.put("F", "3");
        map.put("G", "1");
        map.put("H", "2");
        map.put("I", "3");
        for (Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            System.out.println(key + "," + value);
        }
        System.out.println("=========================");

        // Calling map.remove() while a for-each loop is iterating over the map
        // typically makes the iterator throw ConcurrentModificationException:
//         for(Entry<String,String> entry : map.entrySet()){
//             if(!entry.getValue().equals("1")){
//                 map.remove(entry.getKey());
//             }
//         }

        // To remove items, collect their keys first, then remove them
        // after the iteration has finished.
        List<String> removeKeys = new ArrayList<String>();
        for (Entry<String, String> entry : map.entrySet()) {
            if (!entry.getValue().equals("1")) {
                removeKeys.add(entry.getKey());
            }
        }
        for (String removeKey : removeKeys) {
            map.remove(removeKey);
        }
        for (Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            System.out.println(key + "," + value);
        }
        System.out.println("=========================");
    }
}
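The collect-then-remove pattern in the listing above works. As an alternative sketch, the standard collections API also permits removal during traversal, either through the iterator's own remove() or through removeIf. The class name RemoveWhileIterating and its method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class RemoveWhileIterating {

    // Removes every entry whose value is not "1" through the iterator itself.
    static void removeWithIterator(Map<String, String> map) {
        Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (!it.next().getValue().equals("1")) {
                it.remove(); // safe: the iterator performs the removal itself
            }
        }
    }

    // Same effect with Collection.removeIf (Java 8 and later).
    static void removeWithRemoveIf(Map<String, String> map) {
        map.entrySet().removeIf(e -> !e.getValue().equals("1"));
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("A", "1");
        map.put("B", "2");
        map.put("C", "3");
        map.put("D", "1");
        removeWithIterator(map);
        System.out.println(map.keySet()); // only A and D remain (order not guaranteed)
    }
}
```

Both variants avoid the temporary key list; which one reads better is a matter of taste.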

Corollary: if every hash value is taken modulo M, the results are also uniformly distributed over 0 to M-1.

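How can we obtain 1000 relatively independent hash functions?

Take the 16-digit value computed by a hash function h and split it into the high 8 digits, h1, and the low 8 digits, h2. Then h1 + 1*h2 = h3

gives a new hash function, and changing the multiplier (h1 + 2*h2, h1 + 3*h2, and so on) gives as many more as needed. (Every position of the hash value is independent of the others, because each is produced by repeated XOR-style mixing inside the hash function.)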
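A minimal sketch of the h1 + k*h2 construction follows. It assumes MD5 as the base hash h and reads "16 digits" as the 16 bytes of an MD5 digest, split into two 64-bit halves; neither choice comes from the text. The names HashFamily, hash, and bucket are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashFamily {

    // Returns the k-th derived hash of the input: h1 + k * h2, where h1 and h2
    // are the high and low 8 bytes of one MD5 digest (MD5 is an assumption).
    static long hash(String input, int k) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8)); // 16 bytes
            ByteBuffer buf = ByteBuffer.wrap(digest);
            long h1 = buf.getLong(); // high 8 bytes
            long h2 = buf.getLong(); // low 8 bytes
            return h1 + (long) k * h2; // overflow simply wraps around
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Maps the k-th hash into the range 0..m-1.
    static int bucket(String input, int k, int m) {
        return (int) Math.floorMod(hash(input, k), (long) m);
    }
}
```

Calling hash("zuo", 1), hash("zuo", 2), and so on up to 1000 yields the family of functions; only one digest computation per input is strictly needed if h1 and h2 are cached.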

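The classic hash table structure: an array of buckets, where each bucket holds a linked list of the entries whose hash values map to that index.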

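Hash table resizing:

Resizing means taking the existing elements out, rehashing them, and placing them into the new space. To avoid hurting online performance, the resize can also be done offline: during the migration every push goes to both the old and the new table, and every get is served from the old table first.

Insert, delete, update, and lookup are all O(1).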

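Implementation in the JVM:

It makes use of a balanced binary search tree.

The hash table is an array plus red-black trees: since Java 8, HashMap converts a bucket with many colliding entries from a linked list into a red-black tree, the same kind of structure that TreeMap is built on.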

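How useful are hash tables? Consider this problem:

There is a huge file (100T) in which every line is a string. Print the content that appears more than once.

Ask the interviewer: how many machines do I get? Say 1000 machines.

Number the machines 0 to 999.

Read the text from the 100T file line by line, compute each line's hashcode with a hash function, and take it modulo 1000. A result of 0 goes to machine 0, and so on. This spreads the large file over the 1000 machines.

By the properties of hashing, identical lines always land on the same machine. Each machine then counts which of its own lines are duplicated.

If the data on one machine is still too large, split it again into several files on that machine. (The hash function is used for partitioning.)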
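The partitioning idea can be sketched in memory as follows. Each "machine" is modeled as a list, and String.hashCode stands in for the hash function; a real 100T job would write to separate machines or files instead. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class HashPartitionDuplicates {

    // Splits the lines into `machines` buckets by hash, then counts
    // occurrences inside each bucket independently.
    static Set<String> findDuplicates(List<String> lines, int machines) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < machines; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String line : lines) {
            // identical lines have identical hash codes, so they share a bucket
            int id = Math.floorMod(line.hashCode(), machines);
            buckets.get(id).add(line);
        }
        Set<String> duplicates = new TreeSet<>();
        for (List<String> bucket : buckets) { // each bucket could run on its own machine
            Map<String, Integer> counts = new HashMap<>();
            for (String line : bucket) {
                counts.merge(line, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > 1) {
                    duplicates.add(e.getKey());
                }
            }
        }
        return duplicates;
    }
}
```

Because no line can be split across buckets, the per-bucket results can simply be concatenated; no cross-machine comparison is needed.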

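Problem 2

Designing the RandomPool structure

[Problem] Design a structure that supports the following three operations: insert(key): add a key to the structure, without adding duplicates. delete(key): remove a key that is currently in the structure. getRandom(): return any key in the structure with equal probability.

[Requirement] insert, delete, and getRandom must all run in O(1) time.

Approach: keep two hash tables and an integer variable size. The first table maps key to index and the second maps index to key; every inserted key is recorded in both. getRandom uses Math.random to pick a random index in 0 to size-1 and returns the key stored under that index in the second table.

How is deletion handled?

Deleting a key leaves a hole in the index range. Move the last key into the hole, delete the last index, and decrement size.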

import java.util.HashMap;

public class Code_02_RandomPool {

    public static class Pool<K> {
        private HashMap<K, Integer> keyIndexMap;
        private HashMap<Integer, K> indexKeyMap;
        private int size;

        public Pool() {
            this.keyIndexMap = new HashMap<K, Integer>();
            this.indexKeyMap = new HashMap<Integer, K>();
            this.size = 0;
        }

        public void insert(K key) {
            if (!this.keyIndexMap.containsKey(key)) {
                this.keyIndexMap.put(key, this.size);
                this.indexKeyMap.put(this.size++, key);
            }
        }

        public void delete(K key) {
            if (this.keyIndexMap.containsKey(key)) {
                int deleteIndex = this.keyIndexMap.get(key);
                int lastIndex = --this.size;
                K lastKey = this.indexKeyMap.get(lastIndex);
                this.keyIndexMap.put(lastKey, deleteIndex);
                this.indexKeyMap.put(deleteIndex, lastKey);
                this.keyIndexMap.remove(key);
                this.indexKeyMap.remove(lastIndex);
            }
        }

        public K getRandom() {
            if (this.size == 0) {
                return null;
            }
            int randomIndex = (int) (Math.random() * this.size); // 0 ~ size -1
            return this.indexKeyMap.get(randomIndex);
        }

    }

    public static void main(String[] args) {
        Pool<String> pool = new Pool<String>();
        pool.insert("zuo");
        pool.insert("cheng");
        pool.insert("yun");
        System.out.println(pool.getRandom());
        System.out.println(pool.getRandom());
        System.out.println(pool.getRandom());
        System.out.println(pool.getRandom());
        System.out.println(pool.getRandom());
        System.out.println(pool.getRandom());
    }
}

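Problem 3

Understanding Bloom filters (asked at almost every search-related company)

A Bloom filter is a kind of set, but one with an error rate.

Implementing a bit array with positions 0 to m-1 (for the blacklist problem)

x | (1 << 16) sets bit 16 of a 32-bit int x to 1.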

public class c05_03BloemFilter {

    // Implement a bit array with positions 0 to m-1
    public static void main(String[] args) {
        // an int is 4 bytes, i.e. 32 bits
        int[] arr = new int[1000]; // 4 * 8 * 1000 = 32000 bits

        // if that is not enough, a two-dimensional array can be used
        long[][] map = new long[1000][1000];

        int index = 30000; // we want to mark bit 30000 (set it to 1)

        int intIndex = index / 4 / 8; // which int in the array holds this bit

        int bitIndex = index % 32; // which bit inside that int

        arr[intIndex] = arr[intIndex] | (1 << bitIndex);
    }

}

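A URL goes through K hash functions, and all K computed positions are marked. (The URL has now been added to the Bloom filter.)

Every subsequent URL is added to the bit array in the same way. (The array has to be large enough.)

How to query?

Run the URL through the same K hash functions to get K positions. If all K positions are marked, report the URL as being in the blacklist (this can be a false positive); if any one of them is unmarked, the URL is not in the blacklist.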
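The add and query steps can be sketched with java.util.BitSet as the bit array. The two base hashes used here (String.hashCode and an FNV-1a-style hash over chars) and the h1 + i*h2 derivation of the K positions are assumptions chosen for brevity, not the hash functions of any particular production filter. All names are hypothetical.

```java
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // FNV-1a-style hash over the chars of the key (second base hash).
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // Position of the i-th hash function for this key, in 0..m-1.
    private int position(String key, int i) {
        return Math.floorMod(key.hashCode() + i * fnv1a(key), m);
    }

    // Marks all k positions of the key.
    void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(position(key, i));
        }
    }

    // True if all k positions are marked: the key is probably present.
    // False means the key was definitely never added.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A key that was added always tests positive; a key that was never added can still test positive when other keys happen to have marked all of its positions, which is the error rate the text refers to.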

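The larger the array, the lower the error rate. How much space is needed depends on the sample size and the target error rate.

The array size M (in bits) is computed from a formula: M = -(n * ln P) / (ln 2)^2, where n is the sample size and P is the target error rate. In the worked example this comes to 22.3G.

Next determine the number of hash functions: K = ln 2 * (M / n). The real error rate P follows once M and K are fixed: P = (1 - e^(-n*K/M))^K.

If the interviewer finds the classic structure too expensive in space, ask whether an error rate is acceptable and how large it may be. If it is, explain how a Bloom filter works: a URL goes through K hash functions and the resulting positions in the array are marked; to check a URL, compute the same K positions. If all are marked the URL is present, otherwise it is not.

The array size follows from the sample size and the error rate. The formula gives bits, so divide by 8 to get bytes.

If the calculation gives 16G and the interviewer allows 20G, scale the array up moderately, say to 18G.

Then compute the number of hash functions K, rounding up.

Finally compute the resulting error rate.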
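The sizing procedure can be sketched with the standard Bloom filter formulas M = -(n ln P) / (ln 2)^2, K = ln 2 * M / n, and P = (1 - e^(-nK/M))^K. The sample size of 10 billion and the target rate of 0.0001 in main are illustrative assumptions, not values stated in the text; with them the result lands near the 22.3G figure quoted earlier. The class and method names are hypothetical.

```java
class BloomSizing {

    // Bits needed for n samples at target error rate p: M = -n ln p / (ln 2)^2.
    static long bitsNeeded(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions: K = ln 2 * M / n, rounded up.
    static int hashCount(long n, long m) {
        return (int) Math.ceil(Math.log(2) * m / n);
    }

    // Real error rate once M and K are fixed: P = (1 - e^(-nK/M))^K.
    static double errorRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) n * k / m), k);
    }

    public static void main(String[] args) {
        long n = 10_000_000_000L; // assumed sample size
        double p = 0.0001;        // assumed target error rate
        long m = bitsNeeded(n, p);
        int k = hashCount(n, m);
        double gib = m / 8.0 / (1L << 30); // bits -> bytes -> GiB
        System.out.printf("M = %d bits (%.1f GiB), K = %d, P = %.6f%n",
                m, gib, k, errorRate(n, m, k));
    }
}
```

If more memory is available than the formula asks for, pass the larger M to hashCount and errorRate to see how much the error rate improves.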

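Problem 4

Understanding consistent hashing (server design)

How does the classic server structure achieve load balancing? Every front end uses the same hash function: compute the hashcode of the data, take it modulo 3 to get 0, 1, or 2, and store the data on the corresponding server. By the properties of hash functions, the load across the servers is very even.

When machines need to be added or removed, this structure breaks down, just as with hash table resizing: the ownership of all the data changes. (The cost is very high.)

This is where the consistent hashing structure comes in.

Picture the range of hash values as a ring. Hash the IPs of machines M1/M2/M3 to place them on the ring. When a piece of data such as "zuo" arrives, hash it onto the ring and store it on the nearest machine in the clockwise direction.

How is this implemented?

Sort the machines' hash values into an array and keep a copy on every front-end server.

On each data access, compute the data's hash value and binary-search the machine array for the first machine whose hash is greater than or equal to it; if there is none, wrap around to the first machine.

The binary search on the front-end server is exactly the process of walking clockwise to the nearest server.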
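The clockwise lookup can be sketched as a lower-bound binary search over the sorted machine hashes. The hash values are passed in as plain long values; how they are computed from IPs is left out. RingLookup and locate are hypothetical names.

```java
class RingLookup {

    // Returns the index of the first machine whose hash is >= dataHash.
    // If every machine hash is smaller, wraps around to index 0.
    static int locate(long[] sortedMachineHashes, long dataHash) {
        int lo = 0;
        int hi = sortedMachineHashes.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedMachineHashes[mid] >= dataHash) {
                hi = mid;     // mid is a candidate, keep looking to the left
            } else {
                lo = mid + 1; // mid is too small, look to the right
            }
        }
        return lo == sortedMachineHashes.length ? 0 : lo;
    }
}
```

With machine hashes {100, 500, 900}, a data hash of 101 goes to the machine at index 1, and a data hash of 901 wraps around to index 0.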

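When a machine is added:

M4's position is computed from its IP, and only a small part of the data has to migrate. Adding and removing a machine both move only a small part of the data.

When the number of machines is small, there is no guarantee that they are evenly spread around the ring.

What technique solves this problem?

Virtual nodes.

Give each of M1/M2/M3 1000 virtual nodes.

Keep a routing table so that each virtual node can be mapped back to its physical machine.

Place the 3000 nodes on the ring; the machines are then responsible for roughly equal shares of the data.

After adding M4, add its 1000 nodes as well and adjust the corresponding data.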
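A minimal sketch of a ring with virtual nodes, using a TreeMap as both the sorted ring and the routing table from virtual node to machine. The bit-mixing hash is a placeholder assumption, and the "machine#i" naming of virtual nodes is made up for illustration; if two virtual nodes happen to collide on the same hash value, the later one simply overwrites the earlier one in this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

class VirtualNodeRing {
    // ring position of a virtual node -> physical machine that owns it
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    VirtualNodeRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Placeholder hash: String.hashCode with extra bit mixing (an assumption).
    private static int hash(String s) {
        int h = s.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }

    // Places all virtual nodes of one machine on the ring.
    void addMachine(String machine) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(machine + "#" + i), machine);
        }
    }

    // Takes all virtual nodes of one machine off the ring.
    void removeMachine(String machine) {
        ring.values().removeIf(machine::equals);
    }

    // Walks clockwise from the key's position to the nearest virtual node.
    String machineFor(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

After addMachine("M4"), every key whose owner changed now belongs to M4; keys owned by M1/M2/M3 never move between those three, which is why only a small part of the data migrates.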

Almost every system that needs to run as a cluster has been reworked to use consistent hashing.

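Problem 5

The islands problem

A matrix contains only the values 0 and 1. Every position is connected to its four neighbors: up, down, left, and right. A patch of 1s that are connected together is called an island. How many islands are there in a matrix?

Example:

0 0 1 0 1 0
1 1 1 0 1 0
1 0 0 1 0 0
0 0 0 0 0 0

This matrix has three islands.

If the matrix is enormous but several CPUs are available, design a parallel multi-task algorithm.

Classic solution:

Traverse the matrix. On reaching a 1, increment the island count and start the infect function (a recursive function that rewrites values), which turns that 1 and every 1 connected to it into 2. Continue until the traversal ends.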

public class Code_03_Islands {

    public static int countIslands(int[][] m) {
        // also guard against an empty matrix, where m[0] would throw
        if (m == null || m.length == 0 || m[0] == null) {
            return 0;
        }
        int N = m.length;
        int M = m[0].length;
        int res = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
                if (m[i][j] == 1) {
                    res++;
                    infect(m, i, j, N, M);
                }
            }
        }
        return res;
    }

    public static void infect(int[][] m, int i, int j, int N, int M) {
        if (i < 0 || i >= N || j < 0 || j >= M || m[i][j] != 1) {
            return;
        }
        m[i][j] = 2;
        infect(m, i + 1, j, N, M);
        infect(m, i - 1, j, N, M);
        infect(m, i, j + 1, N, M);
        infect(m, i, j - 1, N, M);
    }

    public static void main(String[] args) {
        int[][] m1 = {  { 0, 0, 0, 0, 0, 0, 0, 0, 0 }, 
                        { 0, 1, 1, 1, 0, 1, 1, 1, 0 }, 
                        { 0, 1, 1, 1, 0, 0, 0, 1, 0 },
                        { 0, 1, 1, 0, 0, 0, 0, 0, 0 }, 
                        { 0, 0, 0, 0, 0, 1, 1, 0, 0 }, 
                        { 0, 0, 0, 0, 1, 1, 1, 0, 0 },
                        { 0, 0, 0, 0, 0, 0, 0, 0, 0 }, };
        System.out.println(countIslands(m1));

        int[][] m2 = {  { 0, 0, 0, 0, 0, 0, 0, 0, 0 }, 
                        { 0, 1, 1, 1, 1, 1, 1, 1, 0 }, 
                        { 0, 1, 1, 1, 0, 0, 0, 1, 0 },
                        { 0, 1, 1, 0, 0, 0, 1, 1, 0 }, 
                        { 0, 0, 0, 0, 0, 1, 1, 0, 0 }, 
                        { 0, 0, 0, 0, 1, 1, 1, 0, 0 },
                        { 0, 0, 0, 0, 0, 0, 0, 0, 0 }, };
        System.out.println(countIslands(m2));

    }

}

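Approach for the multi-task version:

The problem to solve is merging islands.

For each part of the matrix, record its island count and its boundary information.

How to merge the boundary information:

Tag every boundary cell with the infection center it was reached from. (An application of union-find.)

Check whether boundary cells A and C have already been merged; if not, merge them (make them point to the same tag) and decrement the island count.

If, continuing along the boundary, B and C meet, check again, merge, and decrement the island count.

"Connected into one piece" is a notion that union-find handles very well. Structurally, union-find is what keeps already-merged parts from being subtracted from the island count a second time.

With more boundaries there is simply more information to collect; the merging approach is the same.

All the boundary information can be put into a single union-find structure and merged there. (This is the part to talk through with the interviewer.)

The matrix can be split into multiple parts handled by multiple CPUs, with the partial results merged at the end. (Depending on the situation, the parts can be merged pairwise, divide-and-conquer style.)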
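The merge step can be sketched for the simplest case: the matrix is cut into a left and a right half, each half is labelled independently (here one after the other; in practice each call could run on its own CPU), and the labels that touch across the cut are merged with a small array-based union-find. Cutting by columns into exactly two parts is an assumption of this sketch, and all names are hypothetical.

```java
class SplitIslands {

    // Labels every island inside columns [from, to) and returns how many it found.
    static int label(int[][] g, int[][] id, int from, int to, int[] nextId) {
        int count = 0;
        for (int i = 0; i < g.length; i++) {
            for (int j = from; j < to; j++) {
                if (g[i][j] == 1 && id[i][j] == 0) {
                    count++;
                    infect(g, id, i, j, from, to, ++nextId[0]);
                }
            }
        }
        return count;
    }

    // Spreads one label over a connected patch without leaving the column range.
    static void infect(int[][] g, int[][] id, int i, int j, int from, int to, int tag) {
        if (i < 0 || i >= g.length || j < from || j >= to
                || g[i][j] != 1 || id[i][j] != 0) {
            return;
        }
        id[i][j] = tag;
        infect(g, id, i + 1, j, from, to, tag);
        infect(g, id, i - 1, j, from, to, tag);
        infect(g, id, i, j + 1, from, to, tag);
        infect(g, id, i, j - 1, from, to, tag);
    }

    // Finds the representative label, halving the path along the way.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    static int countIslands(int[][] g) {
        if (g == null || g.length == 0 || g[0].length == 0) {
            return 0;
        }
        int cols = g[0].length;
        int mid = cols / 2;
        int[][] id = new int[g.length][cols];
        int[] nextId = {0};
        int total = label(g, id, 0, mid, nextId) + label(g, id, mid, cols, nextId);
        int[] parent = new int[nextId[0] + 1];
        for (int k = 0; k < parent.length; k++) {
            parent[k] = k;
        }
        // Walk down the cut: two touching 1s whose labels are not yet merged
        // belong to one island, so merge them and decrement the count once.
        for (int i = 0; mid > 0 && i < g.length; i++) {
            if (g[i][mid - 1] == 1 && g[i][mid] == 1) {
                int a = find(parent, id[i][mid - 1]);
                int b = find(parent, id[i][mid]);
                if (a != b) {
                    parent[a] = b;
                    total--;
                }
            }
        }
        return total;
    }
}
```

The union-find check is what prevents double counting: when two islands touch at several rows along the cut, only the first contact decrements the total.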

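Problem 6

Understanding the union-find structure (all the data must be handed to it before use)

1. Check very quickly whether two elements are in the same set: isSameSet

2. Merge the sets that two elements each belong to: union(element, element)

With lists, merging is fast but checking whether two elements are in the same set is slow.

With sets, checking is fast but merging is slow.

A node that points to itself is the representative node of its set.

Walk up from A and from B to their representative nodes; if the representatives are the same, A and B are in the same set.

How to merge:

Hang the set with fewer elements underneath the set with more elements.

Optimization (path compression):

After a lookup, flatten the path: make every node on it point directly at the representative node.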

import java.util.HashMap;
import java.util.List;
import java.util.Stack;

public class Code_04_UnionFind {

    public static class Node {
        // whatever you like
    }

    public static class UnionFindSet {
        public HashMap<Node, Node> fatherMap;
        public HashMap<Node, Integer> sizeMap;

        // all nodes must be supplied in one go at construction time
        public UnionFindSet(List<Node> nodes) {
            fatherMap = new HashMap<Node, Node>();
            sizeMap = new HashMap<Node, Integer>();
            makeSets(nodes);
        }

        private void makeSets(List<Node> nodes) {
            fatherMap.clear();
            sizeMap.clear();
            for (Node node : nodes) {
                fatherMap.put(node, node); // initially every node is its own parent
                sizeMap.put(node, 1); // and every set has size 1
            }
        }


        private Node findHead(Node node) {

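            // non-recursive version
            Stack<Node> path = new Stack<>();
            Node cur = node;
            Node parent = fatherMap.get(cur);

            // walk up until reaching the node that is its own parent
            while(cur != parent){
                path.push(cur);
                cur = parent;
                parent = fatherMap.get(cur);
            }
            // path compression: point every visited node directly at the head
            while(!path.isEmpty()){
                fatherMap.put(path.pop(),parent);
            }
            return parent;

            // recursive version
            /*
            // get the node's parent
            Node father = fatherMap.get(node);
            if (father != node) { // the head node is the one that points to itself
                // keep walking up through the parents
                father = findHead(father);
            }
            fatherMap.put(node, father); // path compression
            return father;*/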
        }
        
        public boolean isSameSet(Node a, Node b) {
            return findHead(a) == findHead(b);
        }

        public void union(Node a, Node b) {
            if (a == null || b == null) {
                return;
            }
            Node aHead = findHead(a);
            Node bHead = findHead(b);
            if (aHead != bHead) {
                int aSetSize= sizeMap.get(aHead);
                int bSetSize = sizeMap.get(bHead);
                if (aSetSize <= bSetSize) { // a's set is no larger: hang it under b's head
                    fatherMap.put(aHead, bHead);
                    sizeMap.put(bHead, aSetSize + bSetSize);
                } else { // a's set is larger: hang b's set under a's head
                    fatherMap.put(bHead, aHead);
                    sizeMap.put(aHead, aSetSize + bSetSize);
                }
            }
        }

    }

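    public static void main(String[] args) {
        // small usage example: three nodes, two of them merged
        Node a = new Node();
        Node b = new Node();
        Node c = new Node();
        UnionFindSet set = new UnionFindSet(java.util.Arrays.asList(a, b, c));
        set.union(a, b);
        System.out.println(set.isSameSet(a, b)); // true
        System.out.println(set.isSameSet(a, c)); // false
    }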

}

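Union-find was conceived in 1964, and the proof of its complexity was not finished until 1989, a long road for a proof.

Union-find is extremely efficient: with N elements, once the number of queries reaches N or more, the amortized cost of a single query is only O(1). (Strictly it is O(α(N)), where α is the inverse Ackermann function, which is effectively constant for any practical N.)

In other words, when the number of queries plus the number of unions reaches O(N) or more, the average time per operation is O(1).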
