substring(start, end) is used all the time in Java programming, yet used carelessly it can cause a memory leak.
The best way to understand substring() is to read its source (JDK 6):
    /**
     * <blockquote><pre>
     * "hamburger".substring(4, 8) returns "urge"
     * "smiles".substring(1, 5) returns "mile"
     * </pre></blockquote>
     *
     * @param      beginIndex   the beginning index, inclusive.
     * @param      endIndex     the ending index, exclusive.
     * @return     the specified substring.
     * @exception  IndexOutOfBoundsException  if the
     *             <code>beginIndex</code> is negative, or
     *             <code>endIndex</code> is larger than the length of
     *             this <code>String</code> object, or
     *             <code>beginIndex</code> is larger than
     *             <code>endIndex</code>.
     */
    public String substring(int beginIndex, int endIndex) {
        if (beginIndex < 0) {
            throw new StringIndexOutOfBoundsException(beginIndex);
        }
        if (endIndex > count) {
            throw new StringIndexOutOfBoundsException(endIndex);
        }
        if (beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
        }
        return ((beginIndex == 0) && (endIndex == count)) ? this :
            new String(offset + beginIndex, endIndex - beginIndex, value);
    }
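As an aside, this substring() source is a good example of how to write an API. Its argument validation and exception handling reminded me of an article by Lao Zhao that takes a similar approach.
Note that calling substring(i, i) (that is, beginIndex == endIndex) or substring(stringLength) (beginIndex equal to the string's length) does not throw an exception. Both return an empty string, because the call reduces to new String(offset + beginIndex, 0, value).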
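The boundary behaviour described above can be checked directly against java.lang.String. The following is a small sketch; the class name `SubstringEdgeCases` and the helper `throwsOutOfBounds` are made up for this illustration.

```java
// Checks the boundary cases of String.substring discussed above.
// SubstringEdgeCases and throwsOutOfBounds are illustrative names only.
class SubstringEdgeCases {

    // Returns true if s.substring(begin, end) throws StringIndexOutOfBoundsException.
    static boolean throwsOutOfBounds(String s, int begin, int end) {
        try {
            s.substring(begin, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "hamburger";
        System.out.println(s.substring(4, 8));                       // urge
        System.out.println(s.substring(3, 3).isEmpty());             // true
        System.out.println(s.substring(s.length()).isEmpty());       // true
        System.out.println(throwsOutOfBounds(s, 5, 4));              // true
        System.out.println(throwsOutOfBounds(s, 0, s.length() + 1)); // true
    }
}
```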
Back to the point: the string is actually created by the String(int, int, char[]) constructor, whose source is:
    // Package private constructor which shares value array for speed.
    String(int offset, int count, char value[]) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }
A Java string (in JDK 6) is defined by three private fields:
    public final class String
        implements java.io.Serializable, Comparable<String>, CharSequence {
        /** The value is used for character storage. */
        private final char value[];

        /** The offset is the first index of the storage that is used. */
        private final int offset;

        /** The count is the number of characters in the String. */
        private final int count;
    }
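When memory is allocated for a string, the char array stores the characters, offset is 0, and count is the string's length. The problem is that when substring(start, end) calls the String(int, int, char[]) constructor, it obtains the substring merely by changing offset and count, while the substring's value[] still refers to the original string's array. Suppose the original string s occupies 1 GB and we only need a small piece such as s.substring(1, 10). Because the substring's value[] still points to the 1 GB array, that array cannot be reclaimed by the GC for as long as the substring is reachable, even after s itself is no longer referenced. That is the memory leak.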
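The sharing is easy to see in a small model. The class below is a hypothetical stand-in for the JDK 6 layout, not the real java.lang.String. The names `SharedString`, `retainedChars`, and `sharesArrayWith` are invented for the sketch, and a one-million-character string stands in for the 1 GB one.

```java
import java.util.Arrays;

// Hypothetical model of the JDK 6 String layout (value/offset/count).
// It is NOT the real java.lang.String; it only illustrates array sharing.
final class SharedString {
    private final char[] value; // backing array, possibly shared with other instances
    private final int offset;   // first used index in value
    private final int count;    // number of characters in this string

    SharedString(String s) {
        this(0, s.length(), s.toCharArray());
    }

    // Mirrors the package-private constructor: stores the array without copying it.
    private SharedString(int offset, int count, char[] value) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    SharedString substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        // Only offset and count change; the backing array is reused.
        return new SharedString(offset + beginIndex, endIndex - beginIndex, value);
    }

    int length() { return count; }

    // Size of the array this instance keeps alive, regardless of its own length.
    int retainedChars() { return value.length; }

    boolean sharesArrayWith(SharedString other) { return value == other.value; }

    @Override
    public String toString() { return new String(value, offset, count); }

    public static void main(String[] args) {
        char[] big = new char[1_000_000];
        Arrays.fill(big, 'x');
        SharedString original = new SharedString(new String(big));
        SharedString small = original.substring(1, 10);
        System.out.println(small.length());                  // 9
        System.out.println(small.retainedChars());           // 1000000
        System.out.println(small.sharesArrayWith(original)); // true
    }
}
```

In this model the nine-character result keeps the whole million-character array reachable, which is the situation the paragraph above describes.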
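Why was it designed this way? Because String is immutable, sharing a single character array has the following advantages:
substring() does not need to copy the array and can reuse value[]. It also runs in constant rather than linear time, which improves performance (this is what the comment in the second listing means: "shares value array for speed").
The drawback is the possible memory leak (in fact, this was reported in the Sun/Oracle bug database long ago: http://bugs.sun.com/view_bug.do?bug_id=4513622).
How can this be avoided? One workaround is to use a constructor that copies just the needed part of the array: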
    /**
     * Initializes a newly created {@code String} object so that it represents
     * the same sequence of characters as the argument; in other words, the
     * newly created string is a copy of the argument string. Unless an
     * explicit copy of {@code original} is needed, use of this constructor is
     * unnecessary since Strings are immutable.
     *
     * @param  original
     *         A {@code String}
     */
    public String(String original) {
        int size = original.count;
        char[] originalValue = original.value;
        char[] v;
        if (originalValue.length > size) {
            // The array representing the String is bigger than the new
            // String itself.  Perhaps this constructor is being called
            // in order to trim the baggage, so make a copy of the array.
            int off = original.offset;
            v = Arrays.copyOfRange(originalValue, off, off+size);
        } else {
            // The array representing the String is the same
            // size as the String, so no point in making a copy.
            v = originalValue;
        }
        this.offset = 0;
        this.count = size;
        this.value = v;
    }

    // smallStr no longer holds the 1 GB value[]
    String smallStr = new String(s.substring(1, 10));
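This constructor copies the relevant range into a new array v and assigns v to the new string's value, so the new string no longer references the large array and the leak is avoided.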
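The decision that constructor makes can be isolated into a few lines. This is only a sketch of the same branch on a bare char array. `TrimBaggage` and `trim` are hypothetical names, not JDK APIs.

```java
import java.util.Arrays;

// Sketch of the "trim the baggage" branch from the JDK 6 copy constructor,
// applied to a plain char[]; TrimBaggage and trim are illustrative names.
class TrimBaggage {

    // Returns an array holding exactly the count characters starting at offset.
    // Copies only when the backing array is larger than the string it represents.
    static char[] trim(char[] value, int offset, int count) {
        if (value.length > count) {
            return Arrays.copyOfRange(value, offset, offset + count);
        }
        return value; // already exactly the right size, no point in copying
    }

    public static void main(String[] args) {
        char[] big = new char[1_000_000];
        Arrays.fill(big, 'x');
        char[] small = trim(big, 1, 9);
        System.out.println(small.length);              // 9
        System.out.println(small == big);              // false
        System.out.println(trim(small, 0, 9) == small); // true
    }
}
```

Once only the trimmed array is referenced, nothing ties the small string to the large array any more, so the large array becomes eligible for collection.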
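In Java 7 (as of update 6) the String implementation changed: substring() no longer shares the array but makes a conventional copy. This eliminates the memory leak, at the cost of turning the running time from constant to linear: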
    public String substring(int beginIndex, int endIndex) {
        if (beginIndex < 0) {
            throw new StringIndexOutOfBoundsException(beginIndex);
        }
        if (endIndex > value.length) {
            throw new StringIndexOutOfBoundsException(endIndex);
        }
        int subLen = endIndex - beginIndex;
        if (subLen < 0) {
            throw new StringIndexOutOfBoundsException(subLen);
        }
        return ((beginIndex == 0) && (endIndex == value.length)) ? this
                : new String(value, beginIndex, subLen);
    }
    /**
     * Allocates a new {@code String} that contains characters from a subarray
     * of the character array argument. The {@code offset} argument is the
     * index of the first character of the subarray and the {@code count}
     * argument specifies the length of the subarray. The contents of the
     * subarray are copied; subsequent modification of the character array does
     * not affect the newly created string.
     *
     * @param  value
     *         Array that is the source of characters
     *
     * @param  offset
     *         The initial offset
     *
     * @param  count
     *         The length
     *
     * @throws  IndexOutOfBoundsException
     *          If the {@code offset} and {@code count} arguments index
     *          characters outside the bounds of the {@code value} array
     */
    public String(char value[], int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count < 0) {
            throw new StringIndexOutOfBoundsException(count);
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > value.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }
        this.value = Arrays.copyOfRange(value, offset, offset+count);
    }
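This constructor copies the array every time, unlike the Java 6 implementation. Which approach is better is hard to say: sharing makes substring() O(1) but can pin a large array in memory, while copying costs O(n) per call but lets the original array be collected.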
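The trade-off can still be chosen by hand on newer JDKs. As a sketch, `CharBuffer.wrap(CharSequence, int, int)` from java.nio gives a read-only view over part of a string without copying characters, whereas substring() copies. The class `SubstringViews` and its two helpers are hypothetical names, and whether a view is appropriate depends on how long the result lives.

```java
import java.nio.CharBuffer;

// Contrasts a copying substring with a non-copying view on a current JDK.
// SubstringViews, copy and view are illustrative names only.
class SubstringViews {

    // Copies the selected characters; the result is independent of s.
    static String copy(String s, int begin, int end) {
        return s.substring(begin, end);
    }

    // Wraps s without copying characters; the view keeps s reachable,
    // which is the same retention risk the Java 6 design had.
    static CharSequence view(String s, int begin, int end) {
        return CharBuffer.wrap(s, begin, end);
    }

    public static void main(String[] args) {
        String s = "hamburger";
        System.out.println(copy(s, 4, 8));          // urge
        CharSequence v = view(s, 4, 8);
        System.out.println(v.length());             // 4
        System.out.println(v.charAt(0));            // u
        System.out.println(v.toString());           // urge
    }
}
```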
There is reportedly a data structure called a rope that handles strings more efficiently. It is worth a closer look.
References:
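For orientation only, here is a minimal rope sketch: leaves hold text, and an inner node joins two ropes without copying any characters. All names (`Rope`, `Leaf`, `Concat`, `of`, `concat`) are chosen for this illustration. Balancing, substring, and other operations a practical rope needs are left out.

```java
// Minimal rope sketch: concatenation allocates one node and copies no characters.
// Unbalanced and incomplete on purpose; an illustration, not a production structure.
abstract class Rope {
    abstract int length();
    abstract char charAt(int index);
    abstract void appendTo(StringBuilder sb);

    static Rope of(String text) { return new Leaf(text); }

    // O(1): builds a new inner node over the two existing ropes.
    Rope concat(Rope other) { return new Concat(this, other); }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        appendTo(sb);
        return sb.toString();
    }

    private static final class Leaf extends Rope {
        private final String text;
        Leaf(String text) { this.text = text; }
        @Override int length() { return text.length(); }
        @Override char charAt(int index) { return text.charAt(index); }
        @Override void appendTo(StringBuilder sb) { sb.append(text); }
    }

    private static final class Concat extends Rope {
        private final Rope left;
        private final Rope right;
        private final int length; // cached so length() stays O(1)

        Concat(Rope left, Rope right) {
            this.left = left;
            this.right = right;
            this.length = left.length() + right.length();
        }

        @Override int length() { return length; }

        // Descends into whichever side contains the index.
        @Override char charAt(int index) {
            int leftLength = left.length();
            return index < leftLength ? left.charAt(index) : right.charAt(index - leftLength);
        }

        @Override void appendTo(StringBuilder sb) {
            left.appendTo(sb);
            right.appendTo(sb);
        }
    }

    public static void main(String[] args) {
        Rope r = Rope.of("ham").concat(Rope.of("bur")).concat(Rope.of("ger"));
        System.out.println(r);           // hamburger
        System.out.println(r.length());  // 9
        System.out.println(r.charAt(4)); // u
    }
}
```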
http://javarevisited.blogspot.hk/2011/10/how-substring-in-java-works.html
http://eyalsch.wordpress.com/2009/10/27/stringleaks/
http://blog.zhaojie.me/2013/03/string-and-rope-1-string-in-dotnet-and-java.html