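Python's GC module relies mainly on reference counting to track and reclaim garbage. On top of reference counting, mark and sweep resolves the reference cycles that container objects can create, and generational collection trades space for time to make collection more efficient.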
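    typedef struct _object {
        int ob_refcnt;                  /* reference count; Py_ssize_t since Python 2.5 */
        struct _typeobject *ob_type;
    } PyObject;

Every object begins with a PyObject, whose ob_refcnt field serves as the reference count. When an object gains a new reference, its ob_refcnt is incremented; when a reference to it is removed, its ob_refcnt is decremented. When the count reaches zero, the object is deallocated.

    #define Py_INCREF(op) ((op)->ob_refcnt++)   /* increment the count */

    /* decrement the count; deallocate the object when it reaches zero */
    #define Py_DECREF(op)                       \
        if (--(op)->ob_refcnt != 0)             \
            ;                                   \
        else                                    \
            _Py_Dealloc((PyObject *)(op))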
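The effect of these macros can be observed from Python itself. The following is a minimal sketch using sys.getrefcount; the absolute numbers it prints depend on the interpreter, so only the differences between them are meaningful.

```python
import sys

obj = object()
before = sys.getrefcount(obj)       # includes a temporary reference held by the call itself

alias = obj                         # a new reference: the count goes up by one
after_alias = sys.getrefcount(obj)

del alias                           # the reference is dropped: the count goes back down
after_del = sys.getrefcount(obj)

print(before, after_alias, after_del)
```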
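Reference counting alone cannot reclaim objects that refer to each other, because their counts never drop to zero. For example:

    list1 = []
    list2 = []
    list1.append(list2)
    list2.append(list1)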
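A small experiment makes the problem visible. This sketch assumes a hypothetical Node class (a list subclass, chosen only because plain list objects cannot be weakly referenced) and disables automatic collection so that nothing is reclaimed behind our back.

```python
import gc
import weakref


class Node(list):
    """Hypothetical list subclass; unlike plain list it supports weak references."""


gc.disable()                    # keep the cyclic collector from running on its own
a, b = Node(), Node()
a.append(b)
b.append(a)                     # a and b now reference each other
probe = weakref.ref(a)          # lets us observe whether the object is still alive

del a, b                        # both names are gone, but each object still holds the other
alive_after_del = probe() is not None

gc.collect()                    # the cyclic collector finds and frees the pair
alive_after_collect = probe() is not None
gc.enable()

print(alive_after_del, alive_after_collect)
```

After the del, the pair is still alive because each object keeps the other's count at one; only the explicit collection frees them.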
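As noted above, Python's collection mechanism relies primarily on reference counting, with mark-and-sweep and generational collection as supporting mechanisms.
1. Mark-and-sweep
As the name suggests, mark-and-sweep first marks objects (garbage detection) and then sweeps the garbage away (garbage reclamation), as shown in the figure:
Initially all objects are marked white. The root objects (objects that will not be deleted) are identified and marked black, meaning they are live. Objects referenced by live objects are marked gray, meaning they are reachable but the objects they reference have not been examined yet. Once the objects referenced by a gray object have been examined, the gray object is marked black. This repeats until no gray nodes remain. The nodes that are still white at the end are the objects to be reclaimed.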
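The coloring scheme can be sketched on a toy object graph. This is an illustration only, not CPython's code: the graph, the node names, and the tricolor_mark function are made up for the example, and roots start out gray (rather than black) so that a single loop can scan them too.

```python
from collections import deque


def tricolor_mark(graph, roots):
    """Return the white (unreachable) nodes; graph maps a node to the nodes it references."""
    color = {node: "white" for node in graph}
    gray = deque()
    for root in roots:
        color[root] = "gray"            # roots are known to be live, but not scanned yet
        gray.append(root)
    while gray:
        node = gray.popleft()
        for child in graph[node]:
            if color[child] == "white":
                color[child] = "gray"   # reachable, its own references still unchecked
                gray.append(child)
        color[node] = "black"           # all of its references have been examined
    return {node for node, c in color.items() if c == "white"}


# "c" and "d" reference each other but nothing reachable from "root" refers to them.
graph = {"root": ["a"], "a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
garbage = tricolor_mark(graph, ["root"])
print(sorted(garbage))
```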
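2. Organizing collectable objects
These more advanced mechanisms supplement reference counting and exist to resolve reference cycles. A cycle can only arise among objects that can hold references to other objects, such as lists and classes.
To organize these collectable objects, a linked list is needed. Naturally, each collected object must carry some extra information; the following code is present in every collectable object.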
    /* GC information is stored BEFORE the object structure. */
    typedef union _gc_head {
        struct {
            union _gc_head *gc_next;
            union _gc_head *gc_prev;
            Py_ssize_t gc_refs;
        } gc;
        long double dummy; /* force worst-case alignment */
    } PyGC_Head;

The actual layout of an object is shown in the figure:
The PyGC_Head pointers link the collectable objects into a linked list, which is the set of all objects that section 1 starts from.
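Whether a given object sits in one of these lists can be checked from Python with gc.is_tracked. A minimal sketch; only a list, an int, and a str are shown because the tracking of some other containers varies between CPython versions.

```python
import gc

tracked = {
    "list": gc.is_tracked([]),       # a container: linked into the collector's lists
    "int": gc.is_tracked(42),        # cannot reference other objects: not tracked
    "str": gc.is_tracked("text"),    # likewise not tracked
}
print(tracked)
```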
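3. Generational collection
Generational collection is a typical space-for-time technique, and it is also a key technique in Java. The idea, put simply: the longer an object has existed, the less likely it is to be garbage, and the less often it should be examined.
This idea reduces the extra work that mark-and-sweep incurs. The collectable objects are divided into several generations, each of which is a linked list (a set), and the interval between mark-and-sweep passes over a generation is proportional to how long the objects in it have survived.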
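    /*** Global GC state ***/

    struct gc_generation {
        PyGC_Head head;
        int threshold; /* collection threshold */
        int count; /* count of allocations or collections of younger
                      generations */
    }; /* the structure of each generation */

    #define NUM_GENERATIONS 3 /* number of generations */
    #define GEN_HEAD(n) (&generations[n].head)

    /* linked lists of container objects */
    static struct gc_generation generations[NUM_GENERATIONS] = {
        /* PyGC_Head,                           threshold,      count */
        {{{GEN_HEAD(0), GEN_HEAD(0), 0}},       700,            0},
        {{{GEN_HEAD(1), GEN_HEAD(1), 0}},       10,             0},
        {{{GEN_HEAD(2), GEN_HEAD(2), 0}},       10,             0},
    };

    PyGC_Head *_PyGC_generation0 = GEN_HEAD(0);

As the code shows, Python has three generations. A generation's threshold is the value its count must exceed for that generation to be collected. As the comment on count says, for generation 0 the count tracks allocations of container objects, while for generations 1 and 2 it counts collections of the next younger generation. By default, a collection is triggered when generation 0's count exceeds 700, or when the count of generation 1 or 2 exceeds 10.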
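Collecting a generation also collects every generation younger than it: a generation 0 collection examines only generation 0, a generation 1 collection examines generations 0 and 1, and a generation 2 collection examines all three.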
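The thresholds and counters are exposed through the gc module. The sketch below sets the thresholds to the values from the C source above and restores the previous setting afterwards; it deliberately does not assume what the defaults are on a given interpreter, since they can differ between CPython versions.

```python
import gc

old = gc.get_threshold()            # three values, one per generation
gc.set_threshold(700, 10, 10)       # the values from the C source above
current = gc.get_threshold()
counts = gc.get_count()             # current counters for generations 0, 1 and 2
collected = gc.collect(0)           # explicitly collect only the youngest generation
gc.set_threshold(*old)              # restore whatever was configured before

print(current, counts, collected)
```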
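This part walks through a complete collection pass: building the list, determining the root nodes, marking garbage, and reclaiming garbage.
1. Building the list
As described in the section on generational collection, collecting a generation also collects all younger generations. Before collecting generation 1, the collector merges the generation 0 list into the generation 1 list; before collecting generation 2, it merges all three lists into one. The next three steps all operate on this merged list.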
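2. Determining the root nodes
Figure 1 shows an example. list1 and list2 reference each other, as do list3 and list4. a is an external reference.
Given such a list, how are the root nodes found? On top of the reference count, Python introduces the notion of an effective reference count. As the name suggests, it is the count that remains once the references coming from other objects in the list have been removed.
The code that computes the effective reference count follows: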
    /* Set all gc_refs = ob_refcnt.  After this, gc_refs is > 0 for all objects
     * in containers, and is GC_REACHABLE for all tracked gc objects not in
     * containers.
     */
    static void
    update_refs(PyGC_Head *containers)
    {
        PyGC_Head *gc = containers->gc.gc_next;
        for (; gc != containers; gc = gc->gc.gc_next) {
            assert(gc->gc.gc_refs == GC_REACHABLE);
            gc->gc.gc_refs = Py_REFCNT(FROM_GC(gc));
            assert(gc->gc.gc_refs != 0);
        }
    }

    /* A traversal callback for subtract_refs. */
    static int
    visit_decref(PyObject *op, void *data)
    {
        assert(op != NULL);
        if (PyObject_IS_GC(op)) {
            PyGC_Head *gc = AS_GC(op);
            /* We're only interested in gc_refs for objects in the
             * generation being collected, which can be recognized
             * because only they have positive gc_refs.
             */
            assert(gc->gc.gc_refs != 0); /* else refcount was too small */
            if (gc->gc.gc_refs > 0)
                gc->gc.gc_refs--;
        }
        return 0;
    }

    /* Subtract internal references from gc_refs.  After this, gc_refs is >= 0
     * for all objects in containers, and is GC_REACHABLE for all tracked gc
     * objects not in containers.  The ones with gc_refs > 0 are directly
     * reachable from outside containers, and so can't be collected.
     */
    static void
    subtract_refs(PyGC_Head *containers)
    {
        traverseproc traverse;
        PyGC_Head *gc = containers->gc.gc_next;
        for (; gc != containers; gc = gc->gc.gc_next) {
            traverse = Py_TYPE(FROM_GC(gc))->tp_traverse;
            (void) traverse(FROM_GC(gc),
                            (visitproc)visit_decref,
                            NULL);
        }
    }

update_refs copies each object's reference count into gc_refs.
visit_decref decrements that copy by 1. In subtract_refs, traverse walks every reference held by an object and applies visit_decref to each referenced object.
In the end, the objects in the list whose reference count copy is nonzero are the root nodes.
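The two passes can be modeled in a few lines of Python. This is a toy model rather than CPython's implementation: find_roots, the string object names, and the starting counts are made up for illustration, and the counts assume that the external reference a points at list1.

```python
def find_roots(refcount, graph):
    """refcount: object -> real reference count; graph: object -> objects it references."""
    gc_refs = dict(refcount)                # update_refs: copy ob_refcnt into gc_refs
    for obj in graph:                       # subtract_refs: walk every object in the list
        for target in graph[obj]:           # tp_traverse visits each referenced object
            if gc_refs.get(target, 0) > 0:  # only objects in the list have a positive copy
                gc_refs[target] -= 1        # visit_decref: cancel one internal reference
    roots = {obj for obj, n in gc_refs.items() if n > 0}
    return gc_refs, roots


# list1 <-> list2 plus one outside reference to list1 (assumed); list3 <-> list4 with none.
refcount = {"list1": 2, "list2": 1, "list3": 1, "list4": 1}
graph = {
    "list1": ["list2"], "list2": ["list1"],
    "list3": ["list4"], "list4": ["list3"],
}
gc_refs, roots = find_roots(refcount, graph)
print(gc_refs, roots)
```

Only list1 keeps a positive copy, so it is the only root, and the real counts in refcount are left untouched.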
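Notes:
1. Why make a copy of the reference count?
Answer: This step only looks for root nodes, so it must not alter the real counts. subtract_refs applies visit_decref to every object that an object in the list references. If it decremented ob_refcnt directly, an object outside the list that is referenced from inside it would also lose a count and could well be deallocated, even though the collector should not touch objects outside the set being collected.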
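2. A question about traverse
Answer: At first I had a question. In the example above, after subtract_refs has processed list1, the result should be as follows:
Then gc moves on to list2. At this point list2's copy (0) cannot be reduced any further, but list2 still holds a real reference to list1, so is list1's copy decremented? At first sight that looks like a problem.
It is not. traverse calls visit_decref once for each object that the traversed object references, so processing list1 decrements list2's copy, and processing list2 decrements list1's copy. Each object in the list is traversed exactly once, so every reference held inside the list is subtracted exactly once, which is what the effective count requires.
A second question: suppose an external object b references list2. After subtract_refs has processed list1, the state is as in the figure below:
When subtract_refs reaches list2, is list2's copy decremented again? No. list2's copy only changes when an object that references list2 is traversed. Assuming b is not in the list being collected, b is never traversed, so the reference it holds stays in list2's copy, which remains above zero and makes list2 a root node.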
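What a type's tp_traverse actually visits can be inspected from Python with gc.get_referents, which returns the objects an argument refers to directly. A minimal sketch with the two lists from the example:

```python
import gc

list1, list2 = [], []
list1.append(list2)
list2.append(list1)

refs_of_list1 = gc.get_referents(list1)   # objects that list1 refers to directly
refs_of_list2 = gc.get_referents(list2)

# Compare by identity: the lists are recursive, so == on them would recurse.
print(refs_of_list1[0] is list2, refs_of_list2[0] is list1)
```

Traversing list1 reaches list2 and nothing else, and vice versa, so each traversal touches the counter of the other object rather than its own.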
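3. Marking garbage
Next, Python builds two lists: one holds the root nodes and the objects reachable from them, the other holds the unreachable objects.
The marking follows the approach described in the mark-and-sweep section; the code is as follows: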
    /* A traversal callback for move_unreachable. */
    static int
    visit_reachable(PyObject *op, PyGC_Head *reachable)
    {
        if (PyObject_IS_GC(op)) {
            PyGC_Head *gc = AS_GC(op);
            const Py_ssize_t gc_refs = gc->gc.gc_refs;

            if (gc_refs == 0) {
                /* This is in move_unreachable's 'young' list, but
                 * the traversal hasn't yet gotten to it.  All
                 * we need to do is tell move_unreachable that it's
                 * reachable.
                 */
                gc->gc.gc_refs = 1;
            }
            else if (gc_refs == GC_TENTATIVELY_UNREACHABLE) {
                /* This had gc_refs = 0 when move_unreachable got
                 * to it, but turns out it's reachable after all.
                 * Move it back to move_unreachable's 'young' list,
                 * and move_unreachable will eventually get to it
                 * again.
                 */
                gc_list_move(gc, reachable);
                gc->gc.gc_refs = 1;
            }
            /* Else there's nothing to do.
             * If gc_refs > 0, it must be in move_unreachable's 'young'
             * list, and move_unreachable will eventually get to it.
             * If gc_refs == GC_REACHABLE, it's either in some other
             * generation so we don't care about it, or move_unreachable
             * already dealt with it.
             * If gc_refs == GC_UNTRACKED, it must be ignored.
             */
            else {
                assert(gc_refs > 0
                       || gc_refs == GC_REACHABLE
                       || gc_refs == GC_UNTRACKED);
            }
        }
        return 0;
    }

    /* Move the unreachable objects from young to unreachable.  After this,
     * all objects in young have gc_refs = GC_REACHABLE, and all objects in
     * unreachable have gc_refs = GC_TENTATIVELY_UNREACHABLE.  All tracked
     * gc objects not in young or unreachable still have gc_refs = GC_REACHABLE.
     * All objects in young after this are directly or indirectly reachable
     * from outside the original young; and all objects in unreachable are
     * not.
     */
    static void
    move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
    {
        PyGC_Head *gc = young->gc.gc_next;

        /* Invariants:  all objects "to the left" of us in young have gc_refs
         * = GC_REACHABLE, and are indeed reachable (directly or indirectly)
         * from outside the young list as it was at entry.  All other objects
         * from the original young "to the left" of us are in unreachable now,
         * and have gc_refs = GC_TENTATIVELY_UNREACHABLE.  All objects to the
         * left of us in 'young' now have been scanned, and no objects here
         * or to the right have been scanned yet.
         */

        while (gc != young) {
            PyGC_Head *next;

            if (gc->gc.gc_refs) {
                /* gc is definitely reachable from outside the
                 * original 'young'.  Mark it as such, and traverse
                 * its pointers to find any other objects that may
                 * be directly reachable from it.  Note that the
                 * call to tp_traverse may append objects to young,
                 * so we have to wait until it returns to determine
                 * the next object to visit.
                 */
                PyObject *op = FROM_GC(gc);
                traverseproc traverse = Py_TYPE(op)->tp_traverse;
                assert(gc->gc.gc_refs > 0);
                gc->gc.gc_refs = GC_REACHABLE;
                (void) traverse(op,
                                (visitproc)visit_reachable,
                                (void *)young);
                next = gc->gc.gc_next;
            }
            else {
                /* This *may* be unreachable.  To make progress,
                 * assume it is.  gc isn't directly reachable from
                 * any object we've already traversed, but may be
                 * reachable from an object we haven't gotten to yet.
                 * visit_reachable will eventually move gc back into
                 * young if that's so, and we'll see it again.
                 */
                next = gc->gc.gc_next;
                gc_list_move(gc, unreachable);
                gc->gc.gc_refs = GC_TENTATIVELY_UNREACHABLE;
            }
            gc = next;
        }
    }

After marking, the lists look as in the figure above.
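The single pass with its "tentatively unreachable" state is easier to follow on a small model. The sketch below mirrors the control flow of the C code using Python lists instead of linked lists; the REACHABLE and TENTATIVE sentinel values, the object names, and the starting counts are assumptions made for the example.

```python
REACHABLE, TENTATIVE = -3, -4       # sentinel values chosen for this sketch


def move_unreachable(gc_refs, graph, order):
    """Split the objects in `order` into (young, unreachable) using effective counts."""
    state = dict(gc_refs)
    young, unreachable = list(order), []
    i = 0
    while i < len(young):
        obj = young[i]
        if state[obj] > 0:                      # definitely reachable from outside
            state[obj] = REACHABLE
            for child in graph[obj]:            # visit_reachable on each referenced object
                if state.get(child) == 0:
                    state[child] = 1            # not scanned yet: just flag it as reachable
                elif state.get(child) == TENTATIVE:
                    unreachable.remove(child)   # it was moved out too early: bring it back
                    young.append(child)
                    state[child] = 1
            i += 1
        else:                                   # may be unreachable: assume so for now
            young.pop(i)
            unreachable.append(obj)
            state[obj] = TENTATIVE
    return young, unreachable


gc_refs = {"list1": 1, "list2": 0, "list3": 0, "list4": 0}
graph = {
    "list1": ["list2"], "list2": ["list1"],
    "list3": ["list4"], "list4": ["list3"],
}
# list2 is scanned before list1, so it is first moved out and later rescued.
young, unreachable = move_unreachable(gc_refs, graph, ["list2", "list1", "list3", "list4"])
print(young, unreachable)
```

Putting list2 first in the scan order exercises the rescue path: it is moved to the unreachable list with a count of zero, then moved back once list1 is found to reference it.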
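4. Reclaiming garbage
Reclamation destroys the objects in the unreachable list. The collector breaks the cycles by calling each object's tp_clear; once an object's reference count drops to zero, its type's deallocator runs. The code below is the deallocator for list: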
    /* Methods */

    static void
    list_dealloc(PyListObject *op)
    {
        Py_ssize_t i;
        PyObject_GC_UnTrack(op);
        Py_TRASHCAN_SAFE_BEGIN(op)
        if (op->ob_item != NULL) {
            /* Do it backwards, for Christian Tismer.
               There's a simple test case where somehow this reduces
               thrashing when a *very* large list is created and
               immediately deleted. */
            i = Py_SIZE(op);
            while (--i >= 0) {
                Py_XDECREF(op->ob_item[i]);
            }
            PyMem_FREE(op->ob_item);
        }
        if (numfree < PyList_MAXFREELIST && PyList_CheckExact(op))
            free_list[numfree++] = op;
        else
            Py_TYPE(op)->tp_free((PyObject *)op);
        Py_TRASHCAN_SAFE_END(op)
    }

Reposted from: https://my.oschina.net/hebianxizao/blog/59896