Java Performance Optimization: Using HashCode

This article is a repost. 2016-07-10 12:09, category: Java internals

Background

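The alarm subsystem monitors certain metrics on every port of 40,000 large network elements and decides, based on configured thresholds, whether to raise an alarm. Every 5 minutes the collection and data-processing subsystem actively collects data 240,000 times and sends 240,000 messages to the alarm subsystem; these messages carry dozens of metrics for 1,000,000 entities. The alarm subsystem is deployed on multiple nodes to share the load, with each node handling data for different network element types, entities, and metrics. Filtering this volume of data inevitably relies heavily on set operations over collections, and careless use of them creates performance bottlenecks.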

Example

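The set of entities monitored by an alarm node changes dynamically, so each node has to maintain its own monitoring list. The code therefore uses Collection.removeAll to compute a set difference that yields the newly added entities, and then computes historical averages, variances, and other statistics for those entities.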

package com.coshaho.hash;

import java.util.ArrayList;
import java.util.List;

public class HashObject {
    
    public static void main(String[] args)
    {
        List<String> list1 = new ArrayList<String>();
        List<String> list2 = new ArrayList<String>();
        
        // Set difference of two lists of length 2000
        for(int i = 0; i < 2000; i++)
        {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        long startTime = System.currentTimeMillis();
        list1.removeAll(list2);
        long endTime = System.currentTimeMillis();
        System.out.println("2000 list remove all cost: " + (endTime - startTime) + "ms.");
        
        // Set difference of two lists of length 10000
        list1.clear();
        list2.clear();
        for(int i = 0; i < 10000; i++)
        {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        startTime = System.currentTimeMillis();
        list1.removeAll(list2);
        endTime = System.currentTimeMillis();
        System.out.println("10000 list remove all cost: " + (endTime - startTime) + "ms.");
        
        // Set difference of two lists of length 50000
        list1.clear();
        list2.clear();
        for(int i = 0; i < 50000; i++)
        {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        startTime = System.currentTimeMillis();
        list1.removeAll(list2);
        endTime = System.currentTimeMillis();
        System.out.println("50000 list remove all cost: " + (endTime - startTime) + "ms.");
    }
}

The code above computes the set difference for lists of length 2000, 10000, and 50000. The elapsed times were:

2000 list remove all cost: 46ms.
10000 list remove all cost: 1296ms.
50000 list remove all cost: 31028ms.

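As the output shows, each 5-fold increase in data size makes ArrayList's set difference roughly 25 to 30 times slower (46 ms, then 1296 ms, then 31028 ms). This matches quadratic growth: ArrayList.removeAll calls contains on the other list for every element, and each contains is a linear scan, so 5 times the data costs about 5 * 5 = 25 times the work. For set differences over hundreds of thousands of elements, the time cost is unacceptable.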
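
To make the cost visible without relying on wall-clock timings, the following sketch counts how many times equals is called during the same removeAll operation. The CountingKey class and the countComparisons helper are hypothetical names introduced only for this illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class RemoveAllCost {
    /** Hypothetical key type that counts how often equals is called. */
    static final class CountingKey {
        static long equalsCalls = 0;
        final String id;

        CountingKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof CountingKey && id.equals(((CountingKey) o).id);
        }

        @Override
        public int hashCode() { return id.hashCode(); }
    }

    /** Runs list1.removeAll(list2) on n elements and returns the number of equals calls. */
    static long countComparisons(int n) {
        List<CountingKey> list1 = new ArrayList<>();
        List<CountingKey> list2 = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            list1.add(new CountingKey("" + i));
            list2.add(new CountingKey("" + (i + 1)));
        }
        CountingKey.equalsCalls = 0;
        list1.removeAll(list2);
        return CountingKey.equalsCalls;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1000, 5000}) {
            System.out.println(n + " elements: " + countComparisons(n) + " equals calls");
        }
    }
}
```

Going from 1000 to 5000 elements should raise the number of equals calls by a factor of about 25, which is the same growth pattern seen in the timings.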

Equals

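When filtering entities, we inevitably use Collection.contains to match entity IDs, which in turn calls String.equals to decide whether two IDs are equal. Two strings are equal when they have the same length and the same character code at every position. Comparing large numbers of strings pairwise this way takes an enormous number of operations and costs a great deal of performance. This is where HashCode becomes especially important.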

HashCode

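A HashCode is an int. If two objects are equal (equals returns true), their HashCodes must be equal; conversely, two objects with different HashCodes can never be equal. With an ideal hash algorithm, unequal objects would never share a HashCode, so every equals comparison could be reduced to an integer comparison of HashCodes, greatly reducing the amount of computation. In practice no hash algorithm can achieve this: because HashCode is an int, its range is finite, and once there are more distinct objects than int values, some unequal objects must map to the same HashCode. Unequal objects that share a HashCode are called a hash collision.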
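
As a small illustration of both rules, the sketch below checks that equal strings share a hash code and shows a well-known pair of unequal strings, "Aa" and "BB", that collide under String.hashCode. The class and method names are made up for this example.

```java
class HashCollisionDemo {
    /** True when a and b are unequal objects that share a hash code. */
    static boolean isCollision(Object a, Object b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // Equal objects always have equal hash codes.
        String a = new String("port-1");
        String b = new String("port-1");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true

        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112 and 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println("Aa".hashCode() + " " + "BB".hashCode()); // 2112 2112
        System.out.println(isCollision("Aa", "BB")); // true
    }
}
```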

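A good hash algorithm, however, makes collisions extremely unlikely. With a collision probability of 0.01%, for example, 10,000 equals comparisons between unequal objects produce on average only one collision, so the main equals logic has to run only once. When designing an equals method, we can therefore compare the two objects' HashCodes first: return false if they differ, and run the main equals logic only if they match.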
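
A minimal sketch of such an equals method follows. EntityId is a hypothetical wrapper class, not something from the original system, and the sketch assumes the ID string is never null. The hash is computed once in the constructor so that the early check is a plain int comparison.

```java
class EntityId {
    private final String value;
    private final int hash; // computed once, so comparing hashes is a cheap int comparison

    EntityId(String value) {
        this.value = value;
        this.hash = value.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EntityId)) return false;
        EntityId other = (EntityId) o;
        if (hash != other.hash) return false; // different hashes: cannot be equal
        return value.equals(other.value);     // same hash: run the full comparison
    }

    @Override
    public int hashCode() { return hash; }

    public static void main(String[] args) {
        System.out.println(new EntityId("ne-1/port-3").equals(new EntityId("ne-1/port-3"))); // true
        System.out.println(new EntityId("ne-1/port-3").equals(new EntityId("ne-1/port-4"))); // false
    }
}
```

The shortcut only pays off when the hash is cached or cheap to compute; recomputing an expensive hash on every equals call would cancel the benefit.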

The original hashCode method is implemented natively by the virtual machine and may be computed from the object's address. String overrides hashCode as follows:

    // Object
    public native int hashCode();

    // String
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
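
To make the loop concrete, the sketch below recomputes the hash by hand with the same recurrence, h = 31 * h + c, and compares it with String.hashCode. The class and method names are illustrative only.

```java
class StringHashByHand {
    /** Recomputes the hash with the recurrence used by String.hashCode: h = 31 * h + c. */
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow simply wraps around
        }
        return h;
    }

    public static void main(String[] args) {
        // "abc": (97 * 31 + 98) * 31 + 99 = 96354
        System.out.println(polynomialHash("abc")); // 96354
        System.out.println("abc".hashCode());      // 96354
    }
}
```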

HashMap
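HashMap is a container that uses each key's HashCode to decide where the entry is stored. It stores data using an array -> linked list -> red-black tree structure.

[Figure: HashMap storage structure]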

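In the simplest scheme, a key's position in the array is HashCode % array length (the JDK actually uses a better spreading algorithm). With the same hashing scheme, a longer array means a lower probability of hash collisions but more memory used.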
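
The two index calculations can be sketched side by side. The second method follows what java.util.HashMap does in JDK 8 and later, as far as I understand it: XOR the high 16 bits of the hash into the low 16 bits, then mask with length - 1. The class and method names here are hypothetical.

```java
class BucketIndexDemo {
    /** Naive scheme from the text: remainder of the hash code by the array length. */
    static int moduloIndex(Object key, int length) {
        return Math.floorMod(key.hashCode(), length); // floorMod keeps the result non-negative
    }

    /** Spread-and-mask scheme; length must be a power of two. */
    static int spreadIndex(Object key, int length) {
        int h = key.hashCode();
        int spread = h ^ (h >>> 16);  // mix the high 16 bits into the low 16 bits
        return (length - 1) & spread; // same as a remainder when length is a power of two
    }

    public static void main(String[] args) {
        for (String key : new String[] {"ne-1", "ne-2", "ne-3"}) {
            System.out.println(key + " -> " + moduloIndex(key, 16) + " / " + spreadIndex(key, 16));
        }
    }
}
```

Masking is cheaper than a remainder, and mixing in the high bits helps keys that differ only in their upper bits land in different buckets of a small array.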

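By default the JDK uses 0.75 as the ratio of element count to array length (the load factor). The default initial array length is 16 (a power of two is used to make resizing efficient). Once the number of elements grows past 16 * 0.75 = 12, the array length automatically doubles and element positions are recomputed. With very large data sets, we should give the HashMap a sufficiently large initial array length when creating it; when performance is the priority, we can also lower the load factor somewhat. Part of the HashMap source: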

    /**
     * The default initial capacity - MUST be a power of two.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);

        this.loadFactor = loadFactor;
        threshold = initialCapacity;
        init();
    }

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and the default load factor (0.75).
     *
     * @param  initialCapacity the initial capacity.
     * @throws IllegalArgumentException if the initial capacity is negative.
     */
    public HashMap(int initialCapacity) {
        this(initialCapacity, DEFAULT_LOAD_FACTOR);
    }

    /**
     * Constructs an empty <tt>HashMap</tt> with the default initial capacity
     * (16) and the default load factor (0.75).
     */
    public HashMap() {
        this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
    }
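
As a rough sketch of the presizing advice, the helper below derives an initial capacity from an expected element count so that the map should not need to resize while it is being filled. The capacityFor helper, the entity count, and the map contents are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMapDemo {
    /** Smallest capacity that holds expectedSize entries without exceeding the load factor. */
    static int capacityFor(int expectedSize, float loadFactor) {
        return (int) Math.ceil(expectedSize / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expectedEntities = 100_000; // hypothetical number of monitored entities
        float loadFactor = 0.75f;
        Map<String, Double> averages =
                new HashMap<>(capacityFor(expectedEntities, loadFactor), loadFactor);
        for (int i = 0; i < expectedEntities; i++) {
            averages.put("entity-" + i, 0.0);
        }
        System.out.println(averages.size()); // 100000
    }
}
```

HashMap rounds the requested capacity up to the next power of two, so the array actually allocated may be larger than the value passed in.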

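Performance Considerations for Large Collection Operations
From the analysis above, when performance is the priority, operations on large collections should store their data in hash-based collections (HashMap, HashSet, Hashtable). We change the set difference from the beginning of the article to use HashSet.removeAll: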

package com.coshaho.hash;

import java.util.Collection;
import java.util.HashSet;

public class HashObject {
    
    public static void main(String[] args)
    {
        Collection<String> list1 = new HashSet<String>();
        Collection<String> list2 = new HashSet<String>();
        
        // Set difference of two sets of size 2000
        for(int i = 0; i < 2000; i++)
        {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        long startTime = System.currentTimeMillis();
        list1.removeAll(list2);
        long endTime = System.currentTimeMillis();
        System.out.println("2000 list remove all cost: " + (endTime - startTime) + "ms.");
        
        // Set difference of two sets of size 10000
        list1.clear();
        list2.clear();
        for(int i = 0; i < 10000; i++)
        {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        startTime = System.currentTimeMillis();
        list1.removeAll(list2);
        endTime = System.currentTimeMillis();
        System.out.println("10000 list remove all cost: " + (endTime - startTime) + "ms.");
        
        // Set difference of two sets of size 50000
        list1.clear();
        list2.clear();
        for(int i = 0; i < 50000; i++)
        {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        startTime = System.currentTimeMillis();
        list1.removeAll(list2);
        endTime = System.currentTimeMillis();
        System.out.println("50000 list remove all cost: " + (endTime - startTime) + "ms.");
    }
}

The results are as follows:

2000 list remove all cost: 31ms.
10000 list remove all cost: 0ms.
50000 list remove all cost: 16ms.
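
If the monitoring list has to stay a List, for example because element order matters, a possible variant is to keep the ArrayList and wrap only the collection being subtracted in a HashSet. This is a sketch under that assumption, not the approach used in the code above; the difference helper is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

class ListDifference {
    /** Returns the elements of current that are not in previous, keeping their order. */
    static <T> List<T> difference(List<T> current, Collection<T> previous) {
        List<T> result = new ArrayList<>(current);
        result.removeAll(new HashSet<>(previous)); // each contains() is now a hash lookup
        return result;
    }

    public static void main(String[] args) {
        List<String> list1 = new ArrayList<>();
        List<String> list2 = new ArrayList<>();
        for (int i = 0; i < 50000; i++) {
            list1.add("" + i);
            list2.add("" + (i + 1));
        }
        System.out.println(difference(list1, list2)); // [0]
    }
}
```

ArrayList.removeAll still visits every element of the list, but each membership test against the HashSet takes roughly constant time instead of a linear scan.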
