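The projects at my company are gradually being migrated to the Laravel framework. While writing a scheduled-task script I used the chunk and chunkById APIs, and this post records the pitfalls I ran into.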

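1. Preface

The database engine is InnoDB.

A brief description of the table structure, listing only the fields used in this article.

Field	Type	Comment
id	int(11)	ID
type	int(11)	Type
mark_time	int(10)	Mark time (Unix timestamp)

Indexes, again listing only the relevant ones.

Index name	Columns
PRIMARY	id
idx_sid_blogdel_marktime	type, blog_del, mark_time
Idx_marktime	mark_time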

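2. Requirements

Every day at 1 a.m., fetch all rows that were marked yesterday with type 99, run some logic checks on them, and then perform further operations. This article focuses only on the data-fetching stage.

The data is split into one table per month, and each monthly table holds around 10 million rows.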

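3. Processing data with chunk

The code is as follows:

$this->dao->where('type', 99)
    ->whereBetween('mark_time', [$date, $date + 86399])
    ->select(array('mark_time', 'id'))
    ->chunk(1000, function ($rows) {
        // business logic
    });

From one month of data, this selects the rows with type 99 whose mark time falls between 00:00:00 and 23:59:59 on a given day. The query can use the indexes on mark_time and type.

For type 99, one day of data is roughly 150,000 to 250,000 rows. Using ->get()->toArray() blows straight through the memory limit, so I used the chunk method to fetch 1,000 rows at a time.

With chunk, memory is no longer a problem, but performance is poor. As a rough estimate, fetching the last day of a month and working through 200,000 rows takes one to two minutes.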

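Looking at the source, the underlying chunk method adds a limit and an offset to the SQL statement.

    select * from `users` order by `id` asc limit 500 offset 500;

With a lot of data, the later pages get slower and slower, because of how MySQL executes limit with an offset. For example,

    limit 10000, 10

scans the first 10,010 rows that match the conditions, discards the first 10,000, and returns only the last 10. With a lot of data, performance becomes very poor.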
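To put a rough number on that cost, here is a minimal sketch in plain PHP. It is only a counting model, not a measurement: it assumes every page re-reads and discards the rows before its offset, and the function names rowsReadWithOffset and rowsReadOnce are made up for this illustration.

```php
<?php
// Hypothetical cost model, not a benchmark: counts how many rows have to be
// read in total when a result set is paged with LIMIT/OFFSET.

/** Rows read when every page first reads and discards the rows before its offset. */
function rowsReadWithOffset(int $totalRows, int $pageSize): int
{
    $read = 0;
    for ($offset = 0; $offset < $totalRows; $offset += $pageSize) {
        // "limit $offset, $pageSize" reads $offset rows just to throw them
        // away, then reads the rows of the page itself.
        $read += $offset + min($pageSize, $totalRows - $offset);
    }
    return $read;
}

/** Baseline for comparison: every row is read exactly once. */
function rowsReadOnce(int $totalRows, int $pageSize): int
{
    $read = 0;
    for ($done = 0; $done < $totalRows; $done += $pageSize) {
        $read += min($pageSize, $totalRows - $done);
    }
    return $read;
}

echo rowsReadWithOffset(200000, 1000), PHP_EOL; // 20100000
echo rowsReadOnce(200000, 1000), PHP_EOL;       // 200000
```

Under this model, paging through 200,000 rows in chunks of 1,000 touches about 100 times as many rows as reading each row once. Real timings depend on indexes and caching, so treat the ratio as an illustration only.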

Checking the API docs, Laravel provides another API for this situation: chunkById.

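4. How chunkById works

Using limit with an offset slows down noticeably on large data sets, so chunkById pages by id instead. This is easy to understand from the SQL:

    select * from `users` where `id` > :last_id order by `id` asc limit 500;

The API remembers the last id of each chunk, and id > :last_id together with limit lets it page through the primary key index, reading only the rows it needs. Performance improves noticeably.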
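The paging loop can be imitated in plain PHP over an in-memory array. This is a simplified sketch, not Laravel's implementation: chunkByKey is a made-up name, the rows are assumed to be sorted by the key column already, and the inner foreach stands in for what the database would do with an index seek.

```php
<?php
// Simplified in-memory imitation of paging by key; not Laravel's code.
// $rows must already be sorted ascending by $column.
function chunkByKey(array $rows, int $count, callable $callback, string $column = 'id'): bool
{
    $lastKey = null;

    do {
        // Stand-in for: where $column > :last_key order by $column limit $count
        $page = [];
        foreach ($rows as $row) {
            if ($lastKey !== null && $row[$column] <= $lastKey) {
                continue; // already handed out in an earlier page
            }
            $page[] = $row;
            if (count($page) === $count) {
                break;
            }
        }

        if ($page === []) {
            break;
        }

        // Returning false from the callback stops the whole loop early.
        if ($callback($page) === false) {
            return false;
        }

        // Remember the last key so the next page starts right after it.
        $lastKey = $page[count($page) - 1][$column];
    } while (count($page) === $count);

    return true;
}

$rows = [];
foreach ([3, 7, 8, 12, 20] as $id) {
    $rows[] = ['id' => $id];
}

$pages = [];
chunkByKey($rows, 2, function (array $page) use (&$pages) {
    $pages[] = array_column($page, 'id');
});

echo json_encode($pages), PHP_EOL; // [[3,7],[8,12],[20]]
```

Note that the ids do not need to be consecutive: each page is defined by "greater than the last key seen", not by a position in the result set.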

5. The chunkById pitfall

According to the API docs, chunk and chunkById are used in exactly the same way, so I switched the script over to chunkById.

$this->dao->where('type', 99)
    ->whereBetween('mark_time', [$date, $date + 86399])
    ->select(array('mark_time', 'id'))
    ->chunkById(1000, function ($rows) {
        // business logic
    });

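When I ran the script, the data for January 2 and January 1 was processed without any problem, and much faster than before. But when I ran it for December 31, the script never finished.

After some digging I found that the script never reached the business-logic part; in other words, the SQL query itself never finished. This was puzzling: the runs just before had been fine, so why did December 31 cause trouble?

So I checked what was running on the database server.

    show full processlist;

That revealed the problem. As described in the previous section, chunkById orders by id under the hood and then uses limit to fetch the data piece by piece. The SQL I had in mind was this:

    select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date limit 500;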

The explain output for it is:

select_type	type	key	rows	Extra
SIMPLE	range	idx_marktime	2370258	Using index condition; Using where

The SQL that actually runs is this:

    select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by id limit 500;

And its explain output is:

select_type	type	key	rows	Extra
SIMPLE	index	PRIMARY	4379	Using where

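chunkById automatically adds order by id. With that order by combined with a limit, MySQL chose to walk the primary key index in id order, so the mark_time index was no longer used and the query became extremely slow. This would presumably also explain why only December 31 was affected: rows for the first days of a month sit at the start of that month's table, so a scan in id order finds them almost immediately, while rows for the last day sit at the very end.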

6. The fix

I looked at the chunkById source again.

    /**
     * Chunk the results of a query by comparing numeric IDs.
     *
     * @param int $count
     * @param callable $callback
     * @param string|null $column
     * @param string|null $alias
     * @return bool
     */
    public function chunkById($count, callable $callback, $column = null, $alias = null)
    {
        $column = is_null($column) ? $this->getModel()->getKeyName() : $column;

        $alias = is_null($alias) ? $column : $alias;

        $lastId = null;

        do {
            $clone = clone $this;

            // We'll execute the query for the given page and get the results. If there are
            // no results we can just break and return from here. When there are results
            // we will call the callback with the current chunk of these results here.
            $results = $clone->forPageAfterId($count, $lastId, $column)->get();

            $countResults = $results->count();

            if ($countResults == 0) {
                break;
            }

            // On each chunk result set, we will pass them to the callback and then let the
            // developer take care of everything within the callback, which allows us to
            // keep the memory low for spinning through large result sets for working.
            if ($callback($results) === false) {
                return false;
            }

            $lastId = $results->last()->{$alias};

            unset($results);
        } while ($countResults == $count);

        return true;
    }

The method takes four parameters: count, callback, column, and alias.

column defaults to null, and the first line of the method assigns its default value.

    $column = is_null($column) ? $this->getModel()->getKeyName() : $column;

Following the call further:

    /**
     * Get the primary key for the model.
     *
     * @return string
     */
    public function getKeyName()
    {
        return $this->primaryKey;
    }

    /**
     * The primary key for the model.
     *
     * @var string
     */
    protected $primaryKey = 'id';

So the default column is id.

Next, the forPageAfterId method.

    /**
     * Constrain the query to the next "page" of results after a given ID.
     *
     * @param int $perPage
     * @param int|null $lastId
     * @param string $column
     * @return \Illuminate\Database\Query\Builder|static
     */
    public function forPageAfterId($perPage = 15, $lastId = 0, $column = 'id')
    {
        $this->orders = $this->removeExistingOrdersFor($column);

        if (! is_null($lastId)) {
            $this->where($column, '>', $lastId);
        }

        return $this->orderBy($column, 'asc')
                    ->take($perPage);
    }

Here we can see that if lastId is not null, a where clause on the column is added automatically, and order by column is always added as well.
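To make the effect of that null check visible, the following plain-PHP sketch assembles the SQL text one page would be expected to produce. pageSql is a made-up helper for illustration, not part of Laravel, and it ignores bindings and any other where clauses on the query.

```php
<?php
// Illustration only: builds the SQL text that one page is expected to
// produce. pageSql() is a hypothetical helper, not a Laravel function.
function pageSql(string $table, string $column, ?int $lastId, int $perPage): string
{
    $sql = "select * from `{$table}`";

    // Same check as in forPageAfterId: the filter is skipped only for null.
    if (!is_null($lastId)) {
        $sql .= " where `{$column}` > {$lastId}";
    }

    // The order by on the paging column is added unconditionally.
    return $sql . " order by `{$column}` asc limit {$perPage}";
}

echo pageSql('users', 'id', null, 500), PHP_EOL;
echo pageSql('users', 'id', 500, 500), PHP_EOL;
echo pageSql('users', 'id', 0, 500), PHP_EOL;
```

The first call corresponds to the first chunk, where chunkById passes null and no filter is added. The last call shows that 0 is not treated like null: the filter id > 0 is still emitted.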

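At this point everything is clear. The chunkById call above did not pass the column parameter, so order by id was added under the hood. The query went through the primary key index instead of the mark_time index, which made it very slow.

The chunkById source shows that we can pass a column so that this column is used for the order by instead.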

The code was changed as follows:

$this->dao->where('type', 99)
    ->whereBetween('mark_time', [$date, $date + 86399])
    ->select(array('mark_time', 'id'))
    ->chunkById(1000, function ($rows) {
        // business logic
    }, 'mark_time');

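The SQL that finally runs is then (from the second chunk on, mark_time > :last_mark_time is added to the where clause as well):

    select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by mark_time limit 500;

Running the script again, a full run now takes only about ten seconds, a significant improvement.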
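One thing that may be worth checking before relying on this: forPageAfterId compares with a strict >, so if several rows share the same mark_time and a chunk boundary falls between them, the remaining rows with that value would not match the next page's filter. The plain-PHP sketch below imitates the paging loop on an in-memory array to show the effect. It is not Laravel's code, collectChunkedIds is a made-up name, and whether this matters in practice depends on how often mark_time values repeat in the real data.

```php
<?php
// Simplified in-memory imitation of paging on a chosen column; not
// Laravel's code. $rows must be sorted ascending by $column.
// Returns the ids of all rows that were handed out in some page.
function collectChunkedIds(array $rows, int $count, string $column): array
{
    $seen = [];
    $last = null;

    do {
        $page = [];
        foreach ($rows as $row) {
            // Strict ">" comparison, as in forPageAfterId.
            if ($last === null || $row[$column] > $last) {
                $page[] = $row;
                if (count($page) === $count) {
                    break;
                }
            }
        }

        foreach ($page as $row) {
            $seen[] = $row['id'];
        }

        if ($page !== []) {
            $last = $page[count($page) - 1][$column];
        }
    } while (count($page) === $count);

    return $seen;
}

// Three rows share mark_time 100; the chunk size of 2 splits them.
$rows = [
    ['id' => 1, 'mark_time' => 100],
    ['id' => 2, 'mark_time' => 100],
    ['id' => 3, 'mark_time' => 100],
    ['id' => 4, 'mark_time' => 101],
];

echo json_encode(collectChunkedIds($rows, 2, 'mark_time')), PHP_EOL; // [1,2,4]
echo json_encode(collectChunkedIds($rows, 2, 'id')), PHP_EOL;        // [1,2,3,4]
```

In this toy data set, paging on mark_time never returns the row with id 3, while paging on the unique id column returns every row. A column used for this kind of paging is safest when its values are unique.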

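7. Summary

Don't add custom ordering when using chunkById or chunk. The difference between the two is that chunk fetches data purely by offset, whereas chunkById is optimized to filter by id instead of using an offset, which is a huge performance gain. On large data sets the difference can be a factor of several tens.

chunk also has the problem that rows get skipped when you update them while chunking. See the article "解決Laravel中chunk方法分塊處理數據的坑" (on the pitfalls of processing data in chunks with Laravel's chunk method).

Also, when you don't pass the column parameter, chunkById adds order by id by default, which can keep the index you expected from being used. The fix is simply to pass the column parameter.

Personally I feel that chunkById is not only about chunking by id; it can chunk by any column you specify. The name chunkById is somewhat misleading, and chunkByColumn might be easier to understand. Consider it a small suggestion of mine.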

This article is not original; it is reposted from https://www.lqwang.net/13.html

The pitfalls of Laravel's chunk and chunkById