1. IEEE 754 floating-point representation
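To understand why accumulating float values produces error, you first need a rough idea of how a float is stored in the machine. Only its components are covered here.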
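As the figure above shows (taken from [2]), a floating-point number consists of three parts: a sign bit, an exponent field, and a mantissa (fraction) field. Because the machine stores everything in binary, a decimal fraction first has to be expressed in binary. For example, 8.25 in binary is 1000.01, because 1000.01 (binary) = 1*2^3 + 0*2^2 + 0*2^1 + 0*2^0 + 0*2^-1 + 1*2^-2 = 8 + 0.25 = 8.25.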
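As a rough illustration of the three fields, the sketch below unpacks 8.25 as a 32-bit float using Python's standard `struct` module. The helper name `float32_fields` is made up for this example, and the field widths (1 sign bit, 8 exponent bits, 23 fraction bits) and the exponent bias of 127 are the usual IEEE 754 single-precision values rather than something stated in the text above.

```python
import struct

def float32_fields(x):
    """Split x, stored as an IEEE 754 single, into (sign, biased exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF      # 23 bits, the part after the implicit leading 1
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(8.25)
fraction_bits = format(fraction, "023b")
print(sign, exponent - 127, fraction_bits)

# Rebuild the value from the fields: (-1)^sign * 1.fraction * 2^(exponent - 127)
rebuilt = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
print(rebuilt)
```

Here 8.25 = 1000.01 (binary) = 1.00001 (binary) * 2^3, so the unbiased exponent is 3 and the fraction field starts with 00001.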
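(2) A float has 6 to 7 significant decimal digits. Why? The mantissa field has only 23 bits, so the smallest relative step it can resolve is 1*2^-23 ≈ 1.19*10^-7, which lies between 10^-6 and 10^-7 and is close to 10^-7. [3] also explains this.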
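The 2^-23 step can be checked with a small experiment. This is only a sketch: Python's own floats are double precision, so the hypothetical helper `to_float32` round-trips a value through `struct` to force it to the nearest single-precision value.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

step = 2.0 ** -23  # spacing between adjacent floats just above 1.0

# Adding one full step to 1.0 is visible in single precision...
visible = to_float32(1.0 + step)
# ...but a quarter of a step is rounded away.
lost = to_float32(1.0 + step / 4)

print(step)                 # about 1.19e-07
print(visible != 1.0, lost == 1.0)
print(math.log10(2 ** 24))  # about 7.2 decimal digits for 24 significant bits
```

The last line counts 24 bits because the 23 stored bits are preceded by an implicit leading 1.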
So why does accumulating float values produce error? The main cause lies in how two floating-point numbers are added.
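2. Adding two floating-point numbers
Adding or subtracting two floating-point numbers X and Y takes the following steps:
(1) Exponent alignment: line up the radix points of the two numbers by shifting the number with the smaller exponent to match the larger exponent.
(2) Mantissa addition: add (or subtract) the aligned mantissas using fixed-point arithmetic.
(3) Normalization: normalize the resulting mantissa so that it keeps as many significant digits as possible.
(4) Rounding: to improve accuracy, account for the digits lost when the mantissa is shifted right.
(5) Result check: determine whether the result overflowed.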
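The five steps can be sketched with a toy decimal format. Everything here is an illustrative assumption rather than how real hardware works: a number is a pair (mantissa, exponent) meaning mantissa * 10^exponent, the mantissa always has exactly `P` = 7 digits, only nonnegative operands are handled, and alignment simply truncates (real units keep extra guard digits). The names `P`, `MAX_EXPONENT`, and `toy_add` are made up for this example.

```python
P = 7               # mantissa digits kept (assumed, to mimic float's ~7 digits)
MAX_EXPONENT = 38   # hypothetical exponent limit for the overflow check

def toy_add(a, b):
    """Add two nonnegative toy floats given as (mantissa, exponent) pairs."""
    (ma, ea), (mb, eb) = a, b
    # (1) Alignment: the operand with the smaller exponent is shifted right.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    shift = ea - eb
    mb_aligned = mb // 10 ** shift  # digits shifted out are lost here
    # (2) Mantissa addition.
    m, e = ma + mb_aligned, ea
    # (3) Normalization and (4) rounding: if the sum needs P + 1 digits,
    # drop the last digit and round half up.
    if m >= 10 ** P:
        m, dropped = divmod(m, 10)
        e += 1
        if dropped >= 5:
            m += 1
    # (5) Result check.
    if e > MAX_EXPONENT:
        raise OverflowError("exponent out of range")
    return m, e

big = (1230000, -4)     # 123
small = (2345600, -10)  # 0.00023456
result = toy_add(big, small)
print(result)           # (1230002, -4), i.e. 123.0002
```

Aligning the small operand shifts it right by six digits, so only its leading digit 2 survives; the trailing 3456 never reaches the sum.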
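The key is the alignment step. Because a float carries only about 7 significant decimal digits, adding a large number to a small one can introduce a large error, since many digits of the small number's mantissa are cut off. For example (in decimal, keeping 7 significant digits):
123 + 0.00023456 = 1.230000*10^2 + 0.000002*10^2 = 1.230002*10^2 = 123.0002
This single addition already loses 0.00003456, and the error grows further when many such additions are accumulated.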
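To see how the per-step loss piles up, the following sketch repeats the addition with Python's `decimal` module set to 7 significant digits. It models the decimal example above, not binary float arithmetic, and the function name `naive_sum` and the repetition count of 10000 are arbitrary choices for illustration.

```python
from decimal import Decimal, localcontext

def naive_sum(start, term, count, digits=7):
    """Add `term` to `start` `count` times, rounding every sum to `digits` digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        total = Decimal(start)
        for _ in range(count):
            total = total + Decimal(term)  # each sum is rounded to 7 digits
        return total

one_step = naive_sum("123", "0.00023456", 1)
many_steps = naive_sum("123", "0.00023456", 10000)

# Reference value computed at the default precision of 28 digits.
exact = Decimal("123") + 10000 * Decimal("0.00023456")

print(one_step)            # 123.0002
print(many_steps)          # 125.0000
print(exact - many_steps)  # accumulated error, 0.3456 in value
```

Each addition effectively contributes only 0.0002, so after 10000 additions the lost 0.00003456 per step has grown into an error of 0.3456.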
So how can this error be dealt with?
Kahan summation algorithm
function KahanSum(input)
    var sum = 0.0
    var c = 0.0                 // A running compensation for lost low-order bits.
    for i = 1 to input.length do
        var y = input[i] - c    // So far, so good: c is zero.
        var t = sum + y         // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y       // (t - sum) cancels the high-order part of y; subtracting y recovers negative (low part of y)
        sum = t                 // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
    next i                      // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
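A runnable version of the pseudocode might look like the sketch below. Note the assumptions: Python floats are double precision rather than single-precision float, so the test values are chosen near the double-precision limit, and the naive loop is written out by hand because the built-in `sum()` in recent Python versions already applies its own compensation. The names `kahan_sum` and `naive_sum` are illustrative.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation, with no compensation."""
    total = 0.0
    for x in values:
        total += x
    return total

def kahan_sum(values):
    """Kahan (compensated) summation, following the pseudocode step by step."""
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the correction carried over from the last step
        t = total + y        # low-order digits of y may be lost here
        c = (t - total) - y  # recover (the negative of) what was lost
        total = t
    return total

# One large value followed by ten values too small to change it individually.
values = [1.0] + [1e-16] * 10

naive = naive_sum(values)
compensated = kahan_sum(values)
reference = math.fsum(values)  # correctly rounded sum from the standard library

print(naive)        # 1.0: every small term was rounded away
print(compensated)  # close to 1.000000000000001
print(reference)
```

Each 1e-16 is less than half the spacing between doubles near 1.0, so the naive loop never moves off 1.0, while the compensated loop accumulates the small terms in `c` until they are large enough to register.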
Example (six-digit decimal arithmetic; sum starts at 10000.0 and the inputs are 3.14159 and 2.71828):
1.
y = 3.14159 - 0                      y = input[i] - c
t = 10000.0 + 3.14159
  = 10003.14159                      But only six digits are retained.
  = 10003.1                          Many digits have been lost!
c = (10003.1 - 10000.0) - 3.14159    This must be evaluated as written!
  = 3.10000 - 3.14159                The assimilated part of y recovered, vs. the original full y.
  = -.0415900                        Trailing zeros shown because this is six-digit arithmetic.
sum = 10003.1                        Thus, few digits from input(i) met those of sum.
2.
y = 2.71828 - -.0415900              The shortfall from the previous stage gets included.
  = 2.75987                          It is of a size similar to y: most digits meet.
t = 10003.1 + 2.75987                But few meet the digits of sum.
  = 10005.85987                      And the result is rounded
  = 10005.9                          To six digits.
c = (10005.9 - 10003.1) - 2.75987    This extracts whatever went in.
  = 2.80000 - 2.75987                In this case, too much.
  = .040130                          But no matter, the excess would be subtracted off next time.
sum = 10005.9                        Exact result is 10005.85987, this is correctly rounded to 6 digits.
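The two steps above can be replayed mechanically. The sketch below uses Python's `decimal` module with a precision of 6 digits to stand in for the six-digit arithmetic of the example; the function names `kahan_trace` and `naive_total` are invented for illustration, and the naive result is included only for comparison.

```python
from decimal import Decimal, localcontext

def kahan_trace(start, inputs, digits=6):
    """Run Kahan summation in `digits`-digit decimal arithmetic, recording (y, t, c)."""
    with localcontext() as ctx:
        ctx.prec = digits
        total = Decimal(start)
        c = Decimal(0)
        steps = []
        for x in inputs:
            y = Decimal(x) - c
            t = total + y        # rounded to six digits
            c = (t - total) - y  # what the rounding added or removed
            total = t
            steps.append((y, t, c))
        return total, steps

def naive_total(start, inputs, digits=6):
    """Plain accumulation in the same six-digit arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits
        total = Decimal(start)
        for x in inputs:
            total = total + Decimal(x)
        return total

inputs = ["3.14159", "2.71828"]
kahan_result, steps = kahan_trace("10000.0", inputs)
naive_result = naive_total("10000.0", inputs)

for y, t, c in steps:
    print(y, t, c)
print(kahan_result)  # 10005.9
print(naive_result)  # 10005.8
```

In this arithmetic the plain loop ends at 10005.8, whereas the compensated loop reaches 10005.9, the exact sum 10005.85987 correctly rounded to six digits.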