1. top shows that roughly 90% of the host CPU is being consumed, mainly by the mysqld process.
top - 14:46:26 up 266 days, 20:41, 4 users, load average: 17.14, 15.68, 10.69
Tasks: 264 total, 1 running, 263 sleeping, 0 stopped, 0 zombie
Cpu(s): 69.5%us, 21.2%sy, 0.0%ni, 9.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16333448k total, 9920660k used, 6412788k free, 488896k buffers
Swap: 2097148k total, 24448k used, 2072700k free, 3965104k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9386 mysql 20 0 14.6g 4.5g 9164 S 723.8 28.9 416:42.07 /db/mysql/app/mysql/bin/mysqld --defaults-file=/db/mysql/app/mysql/my.cnf --basedi
11846 root 20 0 15168 1380 940 R 0
2. Run the MySQL inspection script and check the current processlist.
The dominant wait states are Sending data and Waiting for table metadata lock.
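The inspection script itself is not shown here; a minimal equivalent (an illustrative sketch, not the actual script) is:
-- List all non-idle sessions, longest-running first
SELECT id, user, host, db, command, time, state, LEFT(info, 100) AS info
  FROM information_schema.processlist
 WHERE command <> 'Sleep'
 ORDER BY time DESC;
The processlist at the time looked like this: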
Id User Host db Command Time State Info
473 resync 56.16.3.21:56636 NULL Binlog Dump GTID 52765 Master has sent all binlog to slave; waiting for binlog to be updated NULL
47414 db 10.200.210.61:36735 user Sleep 142 NULL
79608 db 56.16.3.26:38200 cbrc Query 2694 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
79748 db 10.200.210.66:33393 cbrc Query 2563 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
79928 db 10.200.210.66:33529 cbrc Query 2457 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
80492 db 56.16.3.26:38395 cbrc Query 2211 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
80562 db 56.16.3.26:38410 cbrc Query 2182 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
80645 db 56.16.3.26:38425 cbrc Query 2151 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
80658 db 56.16.3.26:38434 cbrc Query 2130 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
80716 db 56.16.3.26:38449 cbrc Query 2095 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
80996 db 56.16.3.26:38496 cbrc Query 2070 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
81043 db 56.16.3.26:38501 cbrc Query 1895 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
81530 db 56.16.3.26:38621 cbrc Query 1508 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
81736 db 56.16.3.26:38685 cbrc Query 1463 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
82182 db 56.16.3.26:38757 cbrc Query 1448 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
82783 db 56.16.3.26:38901 cbrc Query 1156 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
82788 db 56.16.3.26:38902 cbrc Query 1116 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
82789 db 56.16.3.26:38903 cbrc Query 1088 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
82790 db 56.16.3.26:38904 cbrc Query 943 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
83054 db 10.200.210.66:36161 cbrc Query 996 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
83058 db 10.200.210.66:36162 cbrc Query 981 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
83059 db 10.200.210.66:36163 cbrc Query 883 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
83060 db 10.200.210.66:36164 cbrc Query 830 Sending data select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
83404 db 56.16.3.26:39045 cbrc Query 391 Waiting for table metadata lock select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl
83552 db 10.200.210.66:36670 cbrc Query 271 Waiting for table metadata lock insert into data_collect_detl (detl_id, data_indc_id, data_collect_id, data_result, data_coll
3. Handling the two wait states.
3.1 Sending data -> the SQL is actively running and consuming CPU; the statement likely needs tuning. As a temporary measure, all sessions currently running this SQL can be killed.
If these SELECT sessions are not killed, even a TRUNCATE of the table will block, because TRUNCATE is DDL and must wait for all long-running queries on the table to finish before it can proceed. A DELETE, by contrast, is plain DML and does not have to wait for the long-running queries.
Checking the slow query log shows that the following SQL is the main resource consumer; it should be optimized later.
select c.*, i.*, (select max(dc.data_collect_mil) from data_collect_detl dc where dc.data_indc_id =i.data_indc_id and dc.rmk='success') as data_collect_mil, (select dc2.data_result from data_collect_detl dc2 where dc2.data_indc_id=i.data_indc_id and dc2.data_collect_mil=( select max(data_collect_mil) from data_collect_detl d where d.rmk='success' and d.data_indc_id=i.data_indc_id)) as data_result from data_collect_conf c left join data_indc_conf i on c.data_collect_id = i.data_collect_id order by i.data_collect_id,i.data_indc_order
-- How to generate the KILL statements:
select concat('KILL ',id,';') from information_schema.processlist where db='cbrc' into outfile '/tmp/a.txt'
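The statement above generates KILL commands for every session connected to the cbrc schema. A narrower variant (a sketch; the 600-second threshold and output file name are arbitrary, and secure_file_priv must allow writing there) targets only the long-running SELECTs:
-- Generate KILL statements only for queries stuck in "Sending data" for more than 10 minutes
SELECT CONCAT('KILL ', id, ';')
  FROM information_schema.processlist
 WHERE db = 'cbrc'
   AND command = 'Query'
   AND state = 'Sending data'
   AND time > 600
  INTO OUTFILE '/tmp/kill_long_selects.sql';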
3.2 Waiting for table metadata lock -> a session is running DDL against the table, which blocks the others.
Inspection showed that a user was running an ADD INDEX on data_collect_detl; once that ADD INDEX session was killed, the metadata-lock waits disappeared.
| 84119 | db | 10.200.210.237:51793 | cbrc | Query | 2704 | Waiting for table metadata lock | alter table `cbrc`.`data_collect_detl` add index `index_data_indc_id` (`data_indc_id`)
Fix:
KILL 84119;
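When it is not obvious which session holds the metadata lock, MySQL 5.7+ can report it directly (a sketch; it assumes the wait/lock/metadata/sql/mdl instrument is enabled in performance_schema):
-- Who holds or is waiting for metadata locks on the affected table?
SELECT object_schema, object_name, lock_type, lock_status, owner_thread_id
  FROM performance_schema.metadata_locks
 WHERE object_schema = 'cbrc'
   AND object_name = 'data_collect_detl';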
### Credit: DBA學習記錄 (2018-01-14)
Batch-killing MySQL sessions
root@localhost > select concat('KILL ',id,';') from information_schema.processlist where user='sam' into outfile '/tmp/a.txt';
The generated script looks like this:
+------------------------+
| concat('KILL ',id,';') |
+------------------------+
| KILL 31964612; |
| KILL 31964609; |
| KILL 31964611; |
…...
| KILL 31966619; |
| KILL 31966620; |
+------------------------+
991 rows in set (0.02 sec)
root@localhost >
4. Execute the generated KILL script:
root@localhost > source /tmp/a.txt
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
……
https://blog.csdn.net/hellosunqi/article/details/79055383
##################
###############sample 2
The application reported that the MySQL database was very slow.
Approach:
1. Check the number of CPUs.
cat /proc/cpuinfo | grep name | cut -f2 -d: |uniq -c
2. Locate the slow queries.
Method 1: read the slow log directly.
cd <mysql_data_home>
# Exclude sub-second queries (Query_time 0.0x) and get a summary of the remaining slow SQL.
tail -100000 slow.log |grep Query_time |grep -v '# Query_time: 0.0'
## Check the detailed SQL log, including the statement text.
tail -100000 slow.log|more
Method 2: use a tool (pt-query-digest).
/home/mysql/script/tools/bin/pt-query-digest --since '2020-10-30 15:30:00' --until '2020-10-30 16:30:00' /db/db/mysql/data/query_time_1030_1w >/home/mysql/script/query_time_1030_1w.log
3. Detailed analysis: the slow SQL is concentrated in the EXISTS / NOT EXISTS clauses, and both statements follow the same pattern.
SQL 1: analysis of the 190-second query.
# User@Host: nuseropr[nuseropr] @ [58.2.2.220] Id: 1591914
# Query_time: 190.104912 Lock_time: 0.000261 Rows_sent: 0 Rows_examined: 421796704   <- 421,796,704 rows examined in 190.1 seconds
use dvb_workflow;
SET timestamp=1604031333;
SELECT dr.entity_type type, count(dr.id) count FROM wfl_document_ref dr WHERE EXISTS (SELECT 1 FROM wfl_task wd WHERE dr.tenant_id = 1292042641149067266 AND wd.status_code = 'CURRENT' AND dr.status_code = 1002 AND wd.person_id = 1292126974442536962 AND wd.entity_id = dr.entity_id AND wd.entity_type = dr.entity_type) AND dr.tenant_id = 1292042641149067266 GROUP BY dr.entity_type;
## Check the table row counts:
mysql> select count(1) from wfl_document_ref;
+----------+
| count(1) |
+----------+
| 8151 |
+----------+
1 row in set (0.01 sec)
mysql> select count(1) from wfl_task;
+----------+
| count(1) |
+----------+
| 52404 |
+----------+
1 row in set (0.01 sec)
Check the execution plan: table wd (wfl_task) is accessed with a full table scan.
mysql> explain SELECT dr.entity_type type, count(dr.id) count FROM wfl_document_ref dr WHERE EXISTS (SELECT 1 FROM wfl_task wd WHERE dr.tenant_id = 1292042641149067266 AND wd.status_code = 'CURRENT' AND dr.status_code = 1002 AND wd.person_id = 1292126974442536962 AND wd.entity_id = dr.entity_id AND wd.entity_type = dr.entity_type) AND dr.tenant_id = 1292042641149067266 GROUP BY dr.entity_type;
+----+--------------------+-------+------------+------+------------------------------------------+----------------------+---------+-------+-------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+------------------------------------------+----------------------+---------+-------+-------+----------+---------------------------------------------------------------------+
| 1 | PRIMARY | dr | NULL | ref | uk_wfl_document_ref,idx_wfl_document_ref | idx_wfl_document_ref | 8 | const | 3463 | 100.00 | Using index condition; Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | wd | NULL | ALL | NULL | NULL | NULL | NULL | 52361 | 0.01 | Using where |
+----+--------------------+-------+------------+------+------------------------------------------+----------------------+---------+-------+-------+----------+---------------------------------------------------------------------+
2 rows in set, 5 warnings (0.00 sec)
## Check the table DDL to see which indexes exist:
mysql> show create table wfl_document_ref;
| wfl_document_ref | CREATE TABLE `wfl_document_ref` (
`id` bigint(20) NOT NULL,
`entity_id` varchar(100) NOT NULL COMMENT '對象id',
`paper_invoice` varchar(2) DEFAULT NULL COMMENT '紙質發票',
PRIMARY KEY (`id`),
UNIQUE KEY `uk_wfl_document_ref` (`entity_id`,`entity_type`),
KEY `idx_wfl_document_ref` (`tenant_id`,`entity_id`,`entity_type`),
KEY `idx2_wfl_document_ref` (`id`,`tenant_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='工作流單據數據表' |
+------------------+-
| wfl_task | CREATE TABLE `wfl_task` (
`id` bigint(20) NOT NULL,
`instance_id` bigint(20) NOT NULL COMMENT '實例id',
`entity_id` varchar(100) NOT NULL COMMENT '對象id',
`order_time` timestamp NULL DEFAULT NULL COMMENT '取單時間',
PRIMARY KEY (`id`),
KEY `idx_wfl_task` (`tenant_id`,`instance_id`),
KEY `idx1_wfl_task` (`tenant_id`,`entity_id`,`entity_type`),
KEY `idx2_wfl_task` (`tenant_id`,`execution_id`),
KEY `idx3_wfl_task` (`id`,`tenant_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='工作流任務'
## Check the data distribution of the candidate columns to decide whether a single-column or a composite index would be more selective:
mysql> select count(1),person_id from wfl_task group by person_id having count(1) > 1000 order by count(1) DESC;
+----------+---------------------+
| count(1) | person_id |
+----------+---------------------+
| 2897 | 1292126842351321089 |
| 2897 | 1292126852912578562 |
| 2896 | 1292126845232807938 |
| 2896 | 1292126837733392385 |
| 2896 | 1292126872231542785 |
| 2896 | 1292126863452864513 |
| 1958 | 1318818474438348802 |
| 1958 | 1315904516237074433 |
| 1957 | 1292126858109321217 |
+----------+---------------------+
9 rows in set (0.10 sec)
mysql> select count(1),person_id from wfl_task group by person_id,status_code,entity_id having count(1) > 1000 order by count(1) DESC;
Empty set (0.52 sec)
mysql> select count(1),person_id from wfl_task group by person_id,status_code,entity_id having count(1) > 100 order by count(1) DESC;
Empty set (0.53 sec)
mysql> select count(1),person_id from wfl_task group by person_id,status_code,entity_id having count(1) > 10 order by count(1) DESC;
+----------+---------------------+
| count(1) | person_id |
+----------+---------------------+
| 11 | 1292126837733392385 |
| 11 | 1292126842351321089 |
| 11 | 1292126845232807938 |
| 11 | 1292126852912578562 |
| 11 | 1292126863452864513 |
| 11 | 1292126894964670465 |
| 11 | 1292126897258954753 |
| 11 | 1292126901889466370 |
| 11 | 1292126902296313857 |
+----------+---------------------+
9 rows in set (0.53 sec)
## Recommendation 1:
Create a composite index on wfl_task over (person_id, status_code, entity_id), as sketched below.
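A minimal sketch of that index (the index name is illustrative; on a busy table, pt-online-schema-change may be preferable to a direct ALTER):
ALTER TABLE wfl_task
  ADD INDEX idx_wfl_task_person_status_entity (person_id, status_code, entity_id);
With this index in place, the dependent subquery on wd can use an index lookup instead of the full table scan shown in the EXPLAIN above.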
######## SQL 2 analysis: similar to SQL 1; the AND NOT EXISTS portion of the statement is what needs optimizing.
# Time: 2020-10-30T13:47:31.983701+08:00
# User@Host: nuseropr[nuseropr] @ [58.2.2.221] Id: 1597123
# Query_time: 141.118841 Lock_time: 0.000734 Rows_sent: 2 Rows_examined: 146145070
use dvb_payment;
SET timestamp=1604036851;
select c.*,(c.amount - c.refundAmountCommit) as abledRefundAmount from (SELECT
a.*,
((select
COALESCE(sum( b.amount ), 0) AS refundAmountCommit
from csh_transaction_detail b
where a.id = b.ref_cash_detail_id
and b.operation_type = 'return'
and b.payment_status IN ( 'P', 'S' )
) +
(
SELECT
COALESCE(sum( c.write_off_amount ), 0) AS write_off_amount
FROM
csh_write_off c
WHERE
c.csh_transaction_detail_id = a.id
AND ( c.document_status = 'Y' OR ( c.document_status = 'P' AND c.operation_type = 'WRITE_OFF' ) )
)) as refundAmountCommit,
(select
COALESCE(sum( b.amount ), 0) AS refundAmount
from csh_transaction_detail b
where a.id = b.ref_cash_detail_id
and b.operation_type = 'return'
and b.payment_status = 'S'
) as refundAmount
FROM
csh_transaction_detail a
) c
WHERE (document_category = 'PREPAYMENT_REQUISITION' AND refund_status = 'N' AND operation_type = 'payment' AND payment_status = 'S' AND document_number IN ('JK7699992020100203517') AND NOT EXISTS ( -- <--
select 1 from csh_transaction_detail d
WHERE d.ref_cash_detail_id = c.id
AND d.operation_type = 'reserved'
AND d.payment_status in ('P','S')
) AND (amount - refundAmountCommit> 0) AND not exists (select 1 from csh_transaction_detail t where t.document_id = null and t.ref_cash_detail_id = c.id) AND employee_id = 1292126879034703873);
### Check the data distribution of csh_transaction_detail:
mysql> select count(1),ref_cash_detail_id,operation_type,payment_status from csh_transaction_detail group by ref_cash_detail_id,operation_type,payment_status having count(1) > 0;
+----------+---------------------+----------------+----------------+
| count(1) | ref_cash_detail_id | operation_type | payment_status |
+----------+---------------------+----------------+----------------+
| 21 | NULL | payment | F |
| 1 | NULL | payment | P |
| 9 | NULL | payment | R |
| 8518 | NULL | payment | S |
| 1 | 1316316695794552833 | return | P |
| 1 | 1316316739306262530 | return | P |
| 1 | 1317011126101217281 | return | P |
| 1 | 1318047328500518913 | refund | S |
+----------+---------------------+----------------+----------------+
8 rows in set (0.00 sec)
mysql> select * from csh_transaction_detail where operation_type = 'reserved';
Empty set (0.03 sec)
mysql>
#### Recommendation 2:
Create a composite index on csh_transaction_detail over (ref_cash_detail_id, operation_type, payment_status), as sketched below.
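A minimal sketch of that index (the index name is illustrative):
ALTER TABLE csh_transaction_detail
  ADD INDEX idx_csh_trx_detail_ref (ref_cash_detail_id, operation_type, payment_status);
This lets the correlated subqueries probe csh_transaction_detail by ref_cash_detail_id instead of scanning the whole table for every outer row.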
######## SQL 3 analysis
# Time: 2020-10-30T14:01:34.819561+08:00
# User@Host: nuseropr[nuseropr] @ [58.2.2.220] Id: 1598883
# Query_time: 50.834178 Lock_time: 0.000293 Rows_sent: 1 Rows_examined: 82541324
use dvb_workflow;
SET timestamp=1604037694;
SELECT COUNT(1) FROM WFL_INSTANCE wi, WFL_DOCUMENT_REF dr WHERE wi.ENTITY_ID = dr.ENTITY_ID AND wi.ENTITY_TYPE = dr.ENTITY_TYPE AND dr.tenant_id = 1292042641149067266 AND dr.entity_type = '801001' AND wi.approval_status = 'IN_APPROVAL' AND EXISTS (SELECT 1 FROM wfl_task wt WHERE wt.person_id = 1292126941320118273 AND wt.status_code = 'CURRENT' AND wt.instance_id = wi.id) AND wi.tenant_id = 1292042641149067266;
### Remaining issue
The Zabbix monitoring item "mysql processlist on 3306" does not report accurate values.
Appendix:
A warning was also found in the slow log:
throttle: 92 'index not used' warning(s) suppressed.;
# Time: 2020-10-30T14:00:31.740296+08:00
# User@Host: [] @ [] Id: 1598122
# Query_time: 51.209512 Lock_time: 0.008423 Rows_sent: 0 Rows_examined: 0
SET timestamp=1604037631;
throttle: 59 'index not used' warning(s) suppressed.;
# Time: 2020-10-30T14:01:31.743503+08:00
# User@Host: [] @ [] Id: 1598249
# Query_time: 145.619897 Lock_time: 0.017162 Rows_sent: 0 Rows_examined: 0
SET timestamp=1604037691;
throttle: 92 'index not used' warning(s) suppressed.;
Change the parameter log_queries_not_using_indexes from ON to OFF:
1) Edit my.cnf.
2).SET GLOBAL log_queries_not_using_indexes=off
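SET GLOBAL takes effect immediately but does not survive a restart, which is why step 1 also edits my.cnf. A quick check that the change took effect:
SHOW GLOBAL VARIABLES LIKE 'log_queries_not_using_indexes';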
########
The Slow Query Log
View mysql-slow.log directly:
# Query_time: 2.302229 Lock_time: 0.000346 Rows_sent: 1 Rows_examined: 1916238
SET timestamp=1588843823;
select count(1) as total from (select * from (select a.*,CONCAT(@rowno:=@rowno+1,'') as rownum_ from (
If the slow query log is enabled and FILE is selected as the output destination, each statement written to the log is preceded by a line that begins with a # character and contains the following fields (all on one line):
Query_time: duration
The statement execution time in seconds.
Lock_time: duration
The time to acquire locks, in seconds.
Rows_sent: N
The number of rows sent to the client.
Rows_examined: N
The number of rows examined by the server layer (not counting any processing internal to storage engines).
Each statement written to the slow query log file is preceded by a SET statement that includes a timestamp. As of MySQL 8.0.14, the timestamp indicates when the slow statement began executing; before 8.0.14, it indicates when the statement was logged (which happens after the statement finishes executing).
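If the slow query log is not already enabled, it can be switched on dynamically (a sketch; the 1-second long_query_time threshold is arbitrary):
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';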
## Thanks to https://www.percona.com/doc/percona-toolkit/LATEST/pt-query-digest.html
## Thanks to https://dev.mysql.com/doc/refman/8.0/en/slow-query-log.html
pt-query-digest inspects the columns in the table. The table must have at least the following columns:
CREATE TABLE query_review_history (
checksum CHAR(32) NOT NULL,
sample TEXT NOT NULL
);
Any columns not mentioned above are inspected to see if they follow a certain naming convention. The column is special if the name ends with an underscore followed by any of these values:
pct|avg|cnt|sum|min|max|pct_95|stddev|median|rank
If the column ends with one of those values, then the prefix is interpreted as the event attribute to store in that column, and the suffix is interpreted as the metric to be stored. For example, a column named Query_time_min will be used to store the minimum Query_time for the class of events.
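For example, a review-history table following this convention might look like the sketch below; the extra columns are illustrative, and pt-query-digest fills in any attribute_metric column it recognizes:
CREATE TABLE query_review_history (
  checksum          CHAR(32) NOT NULL,
  sample            TEXT     NOT NULL,
  ts_min            DATETIME,
  ts_max            DATETIME,
  Query_time_sum    FLOAT,
  Query_time_min    FLOAT,
  Query_time_max    FLOAT,
  Query_time_pct_95 FLOAT
);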
Next is the table of metrics about this class of queries.
# pct total min max avg 95% stddev median
# Count 0 2
# Exec time 13 1105s 552s 554s 553s 554s 2s 553s
# Lock time 0 216us 99us 117us 108us 117us 12us 108us
# Rows sent 20 6.26M 3.13M 3.13M 3.13M 3.13M 12.73 3.13M
# Rows exam 0 6.26M 3.13M 3.13M 3.13M 3.13M 12.73 3.13M
The first line is column headers for the table. The percentage is the percent of the total for the whole analysis run, and the total is the actual value of the specified metric. For example, in this case we can see that the query executed 2 times, which is 13% of the total number of queries in the file. The min, max and avg columns are self-explanatory. The 95% column shows the 95th percentile; 95% of the values are less than or equal to this value. The standard deviation shows you how tightly grouped the values are. The standard deviation and median are both calculated from the 95th percentile, discarding the extremely large values.
The stddev, median and 95th percentile statistics are approximate. Exact statistics require keeping every value seen, sorting, and doing some calculations on them. This uses a lot of memory. To avoid this, we keep 1000 buckets, each of them 5% bigger than the one before, ranging from .000001 up to a very big number. When we see a value we increment the bucket into which it falls. Thus we have fixed memory per class of queries. The drawback is the imprecision, which typically falls in the 5 percent range.
Finally, if you want to look at a sample of the query in the log file, the byte offset where it can be found is reported. (Because of anomalies in the slow-log format this is not always accurate, but it usually is.) The position refers to the worst sample, which is discussed further below.
##sample
6AE2E64878779CE70A1744847528E482 select distinct t.result_id,t.*,? a inspection db 2020-05-07 14:01:56 52 208.032 2.02074 7.04318 0.006151 0.000091 0.000176
query_time_sum=208.032 seconds
query_count=52
query_avg=4
query_time_min=2.02074 seconds
query_time_max =7.04318 seconds
##########
###
Thanks to Percona and Juan Arruti
https://www.percona.com/blog/2020/04/23/a-simple-approach-to-troubleshooting-high-cpu-in-mysql/
One of our customers recently asked whether it is possible to identify, from the MySQL side, the query that is causing high CPU usage on his system. The usage of simple OS tools to find the culprit has been a widely used technique for a long time by PostgreSQL and Oracle DBAs, but it didn’t work for MySQL as historically we’ve lacked the instrumentation to match an OS thread with an internal processlist thread – until recently.
Percona added support to map processlist ids to OS thread ids through column TID of the information_schema.processlist table starting on Percona Server for MySQL 5.6.27. With the release of 5.7, MySQL followed with its own implementation by extending the PERFORMANCE_SCHEMA.THREADS table and adding a new column named THREAD_OS_ID, which Percona Server for MySQL adopted in place of its own, as it usually does remain as close to upstream as possible.
The following approach is useful for cases where there is a query overloading one particular CPU while other cores are performing normally. For cases where it is a general CPU usage issue, different methods can be used, such as the one in this other blog post Reducing High CPU on MySQL: A Case Study.
How can we use this new column to find out which session is using the most CPU resources in my database?
Let’s use an example:
To troubleshoot CPU issues, we can use several tools, such as top or pidstat (requires the sysstat package). In the following example we will use pidstat. The tool has an option (-t) that changes its view from process (the default) to threads, showing the threads associated with a given process. We can use it to find out which thread is consuming the most CPU on our server. We add the -p parameter along with the mysql process id so the tool only shows MySQL threads, making it easier for us to troubleshoot. The last parameter (1) tells it to display one sample per second:
The command is pidstat -t -p <mysqld_pid> 1:
shell> pidstat -t -p 31258 1
03:31:06 PM UID TGID TID %usr %system %guest %CPU CPU Command
[...]
03:31:07 PM 10014 - 32039 5.00 1.00 0.00 6.00 22 |__mysqld
03:31:07 PM 10014 - 32040 5.00 1.00 0.00 6.00 23 |__mysqld
03:31:07 PM 10014 - 32042 6.00 1.00 0.00 7.00 8 |__mysqld
03:31:07 PM 10014 - 32047 5.00 1.00 0.00 6.00 6 |__mysqld
03:31:07 PM 10014 - 32048 5.00 1.00 0.00 6.00 15 |__mysqld
03:31:07 PM 10014 - 32049 5.00 1.00 0.00 6.00 14 |__mysqld
03:31:07 PM 10014 - 32052 5.00 1.00 0.00 6.00 14 |__mysqld
03:31:07 PM 10014 - 32053 94.00 0.00 0.00 94.00 9 |__mysqld
03:31:07 PM 10014 - 32055 4.00 1.00 0.00 5.00 10 |__mysqld
03:31:07 PM 10014 - 4275 5.00 1.00 0.00 6.00 10 |__mysqld
03:31:07 PM 10014 - 4276 5.00 1.00 0.00 6.00 7 |__mysqld
03:31:07 PM 10014 - 4277 6.00 1.00 0.00 7.00 15 |__mysqld
03:31:07 PM 10014 - 4278 5.00 1.00 0.00 6.00 18 |__mysqld
03:31:07 PM 10014 - 4279 5.00 1.00 0.00 6.00 10 |__mysqld
03:31:07 PM 10014 - 4280 5.00 1.00 0.00 6.00 12 |__mysqld
03:31:07 PM 10014 - 4281 5.00 1.00 0.00 6.00 11 |__mysqld
03:31:07 PM 10014 - 4282 4.00 1.00 0.00 5.00 2 |__mysqld
03:31:07 PM 10014 - 35261 0.00 0.00 0.00 0.00 4 |__mysqld
03:31:07 PM 10014 - 36153 0.00 0.00 0.00 0.00 5 |__mysqld
We can see that the thread 32053 is consuming the most CPU by quite a large margin, and we made sure to verify that the consumption is constant across multiple samples of pidstat. Using that information, we can log into the database and use the following query to find out which MySQL Thread is the culprit:
mysql > select * from performance_schema.threads where THREAD_OS_ID = 32053 \G
*************************** 1. row ***************************
THREAD_ID: 686
NAME: thread/sql/one_connection
TYPE: FOREGROUND
PROCESSLIST_ID: 590
PROCESSLIST_USER: msandbox
PROCESSLIST_HOST: localhost
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Query
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Sending data
PROCESSLIST_INFO: select * from test.joinit where b = 'a a eveniet ut.'
PARENT_THREAD_ID: NULL
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: SSL/TLS
THREAD_OS_ID: 32053
1 row in set (0.00 sec)
There we go! Now we know that the high CPU consumption comes from a query in the joinit table, being executed by the user msandbox from localhost in the database test. Using this information we can troubleshoot the query and check the execution plan with the EXPLAIN command to see if there is any room for improvement.
mysql > explain select * from test.joinit where b = 'a a eveniet ut.' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: joinit
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 7170836
filtered: 10.00
Extra: Using where
1 row in set, 1 warning (0.00 sec)
In this case, it was a simple index that was missing!
mysql > alter table test.joinit add index (b) ;
Query OK, 0 rows affected (15.18 sec)
Records: 0 Duplicates: 0 Warnings: 0
After we create the index, we are no longer seeing CPU spikes:
shell> pidstat -t -p 31258 1
03:37:53 PM UID TGID TID %usr %system %guest %CPU CPU Command
[...]
03:37:54 PM 10014 - 32039 25.00 6.00 0.00 31.00 0 |__mysqld
03:37:54 PM 10014 - 32040 25.00 5.00 0.00 30.00 21 |__mysqld
03:37:54 PM 10014 - 32042 25.00 6.00 0.00 31.00 20 |__mysqld
03:37:54 PM 10014 - 32047 25.00 4.00 0.00 29.00 23 |__mysqld
03:37:54 PM 10014 - 32048 25.00 7.00 0.00 32.00 22 |__mysqld
03:37:54 PM 10014 - 32049 23.00 6.00 0.00 29.00 4 |__mysqld
03:37:54 PM 10014 - 32052 23.00 7.00 0.00 30.00 14 |__mysqld
03:37:54 PM 10014 - 32053 10.00 2.00 0.00 12.00 11 |__mysqld
03:37:54 PM 10014 - 32055 24.00 6.00 0.00 30.00 1 |__mysqld
03:37:54 PM 10014 - 4275 25.00 6.00 0.00 31.00 7 |__mysqld
03:37:54 PM 10014 - 4276 25.00 6.00 0.00 31.00 1 |__mysqld
03:37:54 PM 10014 - 4277 24.00 5.00 0.00 29.00 14 |__mysqld
03:37:54 PM 10014 - 4278 24.00 6.00 0.00 30.00 9 |__mysqld
03:37:54 PM 10014 - 4279 25.00 5.00 0.00 30.00 6 |__mysqld
03:37:54 PM 10014 - 4280 26.00 5.00 0.00 31.00 14 |__mysqld
03:37:54 PM 10014 - 4281 24.00 6.00 0.00 30.00 10 |__mysqld
03:37:54 PM 10014 - 4282 25.00 6.00 0.00 31.00 10 |__mysqld
03:37:54 PM 10014 - 35261 0.00 0.00 0.00 0.00 4 |__mysqld
03:37:54 PM 10014 - 36153 0.00 0.00 0.00 0.00 5 |__mysqld
Why not use this approach to troubleshoot IO and Memory issues?
The issue with measuring thread IO from the OS side is that most of the MySQL IO operations are done by background threads, such as read, write, and page cleaner threads. To measure thread IO you can use tools like pidstat with the -d (IO instead of CPU) option or iostat with -H (per thread). If you have a very IO-consuming thread you might be able to see it, but be warned that the results can be misleading because of the background thread operations.
Memory consumption is an even trickier resource to measure from the OS side, as all memory is allocated under the MySQL process, and since it is MySQL that manages its memory access, it is transparent to the OS which thread is consuming the most memory. For this, we can use the performance_schema memory instrumentation available starting in 5.7.
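A sketch of per-thread memory attribution through the sys schema (5.7+; it assumes the memory instruments are enabled in performance_schema):
-- Top 10 threads by currently allocated memory
SELECT thread_id, user, current_allocated
  FROM sys.memory_by_thread_by_current_bytes
 LIMIT 10;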
Conclusion
There are many approaches to troubleshooting high CPU usage, but here we present a simple, widely-used approach on Oracle and PostgreSQL databases that can be adapted to MySQL, starting from version 5.7. By tracking down from the OS thread consumption to the database side, we can quickly detect CPU intensive queries that are affecting the system performance.
Correctly understanding the true cause of database performance problems allows for a quick and efficient resolution – yet enterprises often lack this crucial information. Without it, your solution could require more time and resources than necessary, or fail to address the issue efficiently. And contrary to popular belief, the problem is not always the database itself!
####
Thanks to Percona / Marcelo Altmann: Reducing High CPU on MySQL: A Case Study
https://www.percona.com/blog/2019/03/07/reducing-high-cpu-on-mysql-a-case-study/
In this blog post, I want to share a case we worked on a few days ago. I'll show you how we approached the resolution of a MySQL performance issue and used Percona Monitoring and Management (PMM) to support troubleshooting. The customer had noticed high CPU usage growing linearly on one of their MySQL instances and was not able to figure out why, as there was not much traffic hitting the app. We needed to reduce the high CPU usage on MySQL. The server is a small instance:
Models | 6xIntel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
10GB RAM
This symptom can have many different causes. Let's see how PMM can be used to troubleshoot the issue.
CPU

It’s important to understand where the CPU time is being consumed: user space, system space, iowait, and so on. Here we can see that CPU usage was hitting almost 100% and the majority of the time was being spent on user space. In other words, the time the CPU was executing user code, such as MySQL. Once we determined that the time was being spent on user space, we could discard other possible issues. For example, we could eliminate the possibility that a high amount of threads were competing for CPU resources, since that would cause an increase in context switches, which in turn would be taken care of by the kernel – system space.
With that we decided to look into MySQL metrics.
MySQL


As expected, there weren’t a lot of threads running—10 on average—and MySQL wasn’t being hammered with questions/transactions. It was running from 500 to 800 QPS (queries per second). Next step was to check the type of workload that was running on the instance:

In red we can see that almost all commands are SELECTS. With that in mind, we checked the handlers using SHOW STATUS LIKE 'Handler%' to verify if those selects were doing an index scan, a full table scan or what.
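The handler counters can be sampled with a plain status query; a Handler_read_rnd_next value that keeps climbing between two samples is the signature of full table scans (a sketch):
SHOW GLOBAL STATUS LIKE 'Handler_read%';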

Blue in this graph represents Handler_read_rnd_next , which is the counter MySQL increments every time it reads a row when it’s doing a full table scan. Bingo!!! Around 350 selects were reading 2.5 million rows. But wait—why was this causing CPU issues rather than IO issues? If you refer to the first graph (CPU graph) we cannot see iowait.
That is because the data was stored in the InnoDB Buffer Pool, so instead of having to read those 2.5M rows per second from disk, it was fetching them from memory. The stress had moved from disk to CPU. Now that we identified that the issue had been caused by some queries or query, we went to QAN to verify the queries and check their status:

The first query, a SELECT on table store.clients, was responsible for 98% of the load and was taking 20+ seconds to execute.

EXPLAIN confirmed our suspicions. The query was accessing the table using type ALL, which is the last type we want as it means “Full Table Scan”. Taking a look into the fingerprint of the query, we identified that it was a simple query:


The query was filtering clients based on the status field: SELECT * FROM store.clients WHERE status = ?. As the index listing showed, that column was not indexed. Talking with the customer, this turned out to be a query introduced as part of a new software release.
From that point, we were confident that we had identified the problem. There could be more, but this particular query was definitely hurting the performance of the server. We decided to add an index and also sent an annotation to PMM, so we could refer back to the graphs to check when the index has been added, check if CPU usage had dropped, and also check Handler_read_rnd_next.
To run the alter we decided to use pt-online-schema-change as it was a busy table, and the tool has safeguards to prevent the situation from becoming even worse. For example, we wanted to pause or even abort the alter in the case of the number of Threads_Running exceeding a certain threshold. The threshold is controlled by --max-load (25 by default) and --critical-load (50 by default):
pmm-admin annotate "Started ALTER store.clients ADD KEY (status)" && \
pt-online-schema-change --alter "ADD KEY (status)" --execute u=root,D=store,t=clients && \
pmm-admin annotate "Finished ALTER store.clients ADD KEY (status)"
Your annotation was successfully posted.
No slaves found. See --recursion-method if host localhost.localdomain has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `store`.`clients`...
Creating new table...
Created new table store._clients_new OK.
Altering new table...
Altered `store`.`_clients_new` OK.
2019-02-22T18:26:25 Creating triggers...
2019-02-22T18:27:14 Created triggers OK.
2019-02-22T18:27:14 Copying approximately 4924071 rows...
Copying `store`.`clients`: 7% 05:46 remain
Copying `store`.`clients`: 14% 05:47 remain
Copying `store`.`clients`: 22% 05:07 remain
Copying `store`.`clients`: 30% 04:29 remain
Copying `store`.`clients`: 38% 03:59 remain
Copying `store`.`clients`: 45% 03:33 remain
Copying `store`.`clients`: 52% 03:06 remain
Copying `store`.`clients`: 59% 02:44 remain
Copying `store`.`clients`: 66% 02:17 remain
Copying `store`.`clients`: 73% 01:50 remain
Copying `store`.`clients`: 79% 01:23 remain
Copying `store`.`clients`: 87% 00:53 remain
Copying `store`.`clients`: 94% 00:24 remain
2019-02-22T18:34:15 Copied rows OK.
2019-02-22T18:34:15 Analyzing new table...
2019-02-22T18:34:15 Swapping tables...
2019-02-22T18:34:27 Swapped original and new tables OK.
2019-02-22T18:34:27 Dropping old table...
2019-02-22T18:34:32 Dropped old table `store`.`_clients_old` OK.
2019-02-22T18:34:32 Dropping triggers...
2019-02-22T18:34:32 Dropped triggers OK.
Successfully altered `store`.`clients`.
Your annotation was successfully posted.
Results


As we can see above, CPU usage dropped to less than 25%, which is 1/4 of the previous usage level. Handler_read_rnd_next dropped and we can't even see it once pt-osc has finished. We had a small increase in Handler_read_next, as expected, because now MySQL is using the index to resolve the WHERE clause. One interesting outcome is that the instance was able to increase its QPS by 2x after the index was added, as CPU/full table scans were no longer limiting performance. On average, query time dropped from 20s to only 661ms.
Summary:
- Applying the correct troubleshooting steps to your problem is crucial:
a) Understand what resources have been saturated.
b) Understand what, if anything, is causing an error.
c) From there you can divert into the areas that are related to that resource and start to narrow down the issue.
d) Tackle the problem bit by bit.
- Having the right tools for the job is key to success. PMM is a great example of a tool that can help you quickly identify, drill in, and fix bottlenecks.
- Have realistic load tests. In this case, they had tested the new release at a concurrency level that was not representative of their production workload.
- By identifying the culprit query we were able to:
a.) Drop average query time from 20s to 661ms
b.) Increase QPS by 2x
c.) Reduce the usage of CPU to 1/4 of its level prior to our intervention
Disclosure: For security reasons, sensitive information, such as database, table, column names have been modified and graphs recreated to simulate a similar problem.
