基础信息
MySQL版本:5.7.27-30-log Percona Server (GPL), wsrep_31.39
MySQL参数:
innodb_purge_threads = 4
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_thread_concurrency = 24
innodb_thread_sleep_delay = 148358
innodb_concurrency_tickets = 5000
gtid_mode = ON
thread_handling = one-thread-per-connection
涉及SQL:
SELECT
`iast_agent_method_pool`.`id`, `iast_agent_method_pool`.`agent_id`,
`iast_agent_method_pool`.`url`, `iast_agent_method_pool`.`uri`,
`iast_agent_method_pool`.`http_method`, `iast_agent_method_pool`.`http_scheme`,
`iast_agent_method_pool`.`http_protocol`, `iast_agent_method_pool`.`req_header`,
`iast_agent_method_pool`.`req_params`, `iast_agent_method_pool`.`req_data`,
`iast_agent_method_pool`.`res_header`, `iast_agent_method_pool`.`res_body`,
`iast_agent_method_pool`.`req_header_for_search`, `iast_agent_method_pool`.`context_path`,
`iast_agent_method_pool`.`method_pool`, `iast_agent_method_pool`.`pool_sign`,
`iast_agent_method_pool`.`clent_ip`, `iast_agent_method_pool`.`create_time`,
`iast_agent_method_pool`.`update_time`, `iast_agent_method_pool`.`uri_sha1`
FROM `iast_agent_method_pool` FORCE INDEX(PRIMARY)
WHERE (
`iast_agent_method_pool`.`agent_id` IN (
SELECT U0.`id`
FROM `iast_agent` U0
WHERE ((U0.`project_name` IS NULL
OR U0.`bind_project_id` = 806)
AND U0.`online` = 1
AND U0.`project_version_id` = 806
AND U0.`user_id` = 1)
)
AND `iast_agent_method_pool`.`pool_sign` = 'd94e7aaee21470b45ceef899dc35c2f46e4dbd24'
)
ORDER BY `iast_agent_method_pool`.`id` ASC LIMIT 1;
涉及表结构:
CREATE TABLE `iast_agent_method_pool` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`agent_id` int(11) DEFAULT NULL COMMENT 'Agent',
`url` varchar(2000) DEFAULT NULL COMMENT 'URL',
`uri` varchar(2000) DEFAULT NULL COMMENT 'URI',
`http_method` varchar(10) DEFAULT NULL COMMENT 'HTTP请求方法',
`http_scheme` varchar(20) DEFAULT NULL COMMENT '协议',
`http_protocol` varchar(255) DEFAULT NULL COMMENT 'HTTP协议',
`req_header` varchar(2000) DEFAULT NULL COMMENT '请求头',
`req_params` varchar(2000) DEFAULT NULL COMMENT '请求参数',
`req_data` varchar(4000) DEFAULT NULL COMMENT '请求体',
`res_header` varchar(1000) DEFAULT NULL COMMENT '响应头',
`res_body` varchar(1000) DEFAULT NULL COMMENT '响应体',
`context_path` varchar(255) DEFAULT NULL COMMENT '应用上下文',
`method_pool` json DEFAULT NULL COMMENT '方法池',
`clent_ip` varchar(255) DEFAULT NULL COMMENT '客户端IP',
`create_time` int(11) DEFAULT NULL COMMENT '创建时间',
`update_time` int(11) DEFAULT NULL COMMENT '修改时间',
`pool_sign` varchar(40) DEFAULT NULL COMMENT '方法池签名',
`req_header_for_search` text COMMENT '用于搜索功能的请求头,请求方法、uri、参数、协议、请求头的拼接',
`uri_sha1` varchar(40) NOT NULL DEFAULT '',
PRIMARY KEY (`id`) USING BTREE,
UNIQUE KEY `pool_sign` (`pool_sign`,`agent_id`) USING BTREE,
KEY `agent_id` (`agent_id`) USING BTREE,
KEY `update_time` (`update_time`) USING BTREE,
KEY `iast_agent__uri_sha_aac6d7_idx` (`uri_sha1`,`http_method`,`agent_id`),
KEY `iast_agent_method_pool_uri_sha1_a64012e4` (`uri_sha1`),
KEY `iast_agent_method_pool_agent_id_30df78eb` (`agent_id`)
) ENGINE=InnoDB AUTO_INCREMENT=106703 DEFAULT CHARSET=utf8mb4;
CREATE TABLE `iast_agent` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`token` varchar(255) DEFAULT NULL COMMENT 'agent唯一标识',
`version` varchar(255) DEFAULT NULL COMMENT '版本',
`latest_time` int(11) DEFAULT NULL COMMENT '更新时间',
`user_id` int(11) NOT NULL COMMENT '用户ID',
`server_id` int(11) DEFAULT NULL COMMENT '服务器ID',
`is_running` int(1) DEFAULT NULL COMMENT 'agent运行状态',
`control` int(1) DEFAULT NULL COMMENT 'agent控制位,1-安装、2-卸载、0-无控制',
`is_control` int(1) DEFAULT NULL COMMENT '是否正处于控制中,0-否,1-是',
`bind_project_id` int(11) DEFAULT '0' COMMENT '捆绑项目ID,存在则为已捆绑',
`project_name` varchar(255) DEFAULT NULL COMMENT '项目名称,用于先启动agent后创建项目',
`project_version_id` int(11) DEFAULT '0' COMMENT '项目版本ID',
`online` tinyint(4) DEFAULT '0' COMMENT '1在线运行,0未运行,同token,仅一个online',
`is_core_running` int(11) DEFAULT NULL COMMENT '核心引擎是否启动',
`language` varchar(10) DEFAULT 'JAVA' COMMENT '语言,JAVA、PYTHON',
`startup_time` int(11) NOT NULL DEFAULT '0' COMMENT '启动时间',
`alias` varchar(255) NOT NULL DEFAULT '' COMMENT 'agent别名',
`register_time` int(11) NOT NULL DEFAULT '0' COMMENT '注册时间',
`is_audit` int(11) NOT NULL DEFAULT '0' COMMENT '0:未审核;1:已审核',
PRIMARY KEY (`id`) USING BTREE,
KEY `user_id` (`user_id`) USING BTREE,
KEY `server_id` (`server_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=29519 DEFAULT CHARSET=utf8mb4;
存储引擎信息
通过SHOW ENGINE INNODB STATUS输出可发现,大量进程处于"not started sleeping before entering InnoDB"
---TRANSACTION 421116931020976, not started sleeping before entering InnoDB
mysql tables in use 2, locked 0
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421116931082032, not started sleeping before entering InnoDB
mysql tables in use 2, locked 0
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421116931076272, not started sleeping before entering InnoDB
mysql tables in use 2, locked 0
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421116931034800, not started sleeping before entering InnoDB
mysql tables in use 2, locked 0
只有少量进程(innodb_thread_concurrency)进入的INNODB存储引擎层进行执行:
--------------
ROW OPERATIONS
--------------
24 queries inside InnoDB, 1182 queries in queue
25 read views open inside InnoDB
0 RW transactions active inside InnoDB
实例堆栈信息
从堆栈信息可以看到,上千个线程在使用srv_conc_enter_innodb方法:
Thread 539 (Thread 0x7f9036907700 (LWP 378610)):
#0 0x00007f9506517e3d in nanosleep () from /lib64/libpthread.so.0
#1 0x00000000010af8d0 in os_thread_sleep(unsigned long) ()
#2 0x000000000114a0b8 in srv_conc_enter_innodb(row_prebuilt_t*) ()
#3 0x000000000102306c in ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) ()
#4 0x000000000084e918 in handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function) ()
#5 0x0000000000851e5a in handler::ha_index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function) ()
#6 0x0000000000d0e993 in read_const(TABLE*, st_table_ref*) ()
#7 0x0000000000d0ead0 in join_read_const_table(JOIN_TAB*, st_position*) ()
#8 0x0000000000d23f38 in JOIN::extract_func_dependent_tables() ()
#9 0x0000000000d2f00e in JOIN::make_join_plan() ()
#10 0x0000000000d332a6 in JOIN::optimize() ()
#11 0x0000000000d7e322 in st_select_lex::optimize(THD*) ()
#12 0x0000000000d7e5cf in handle_query(THD*, LEX*, Query_result*, unsigned long long, unsigned long long) ()
#13 0x0000000000d39e33 in execute_sqlcom_select(THD*, TABLE_LIST*) ()
#14 0x0000000000d40f38 in mysql_execute_command(THD*, bool) ()
#15 0x0000000000d419c5 in mysql_parse(THD*, Parser_state*, bool) ()
#16 0x0000000000d41b10 in wsrep_mysql_parse(THD*, char const*, unsigned int, Parser_state*, bool) [clone .clone.0] ()
#17 0x0000000000d43d2f in dispatch_command(THD*, COM_DATA const*, enum_server_command) ()
#18 0x0000000000d44efb in do_command(THD*) ()
#19 0x0000000000e24ce4 in handle_connection ()
#20 0x0000000000f9ebe4 in pfs_spawn_thread ()
#21 0x00007f9506510dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f9504db7ead in clone () from /lib64/libc.so.6
有24个线程在使用btr_copy_blob_prefix方法:
Thread 1081 (Thread 0x7f8fd756d700 (LWP 41431)):
#0 0x00007f9506514965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000010ae7fb in os_event::wait_low(long) ()
#2 0x000000000115d6e9 in sync_array_wait_event(sync_array_t*, sync_cell_t*&) ()
#3 0x000000000115fb98 in rw_lock_s_lock_spin(rw_lock_t*, unsigned long, char const*, unsigned long) ()
#4 0x00000000011ea9d3 in buf_page_get_gen(page_id_t const&, page_size_t const&, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*, bool, dberr_t*) ()
#5 0x00000000011c63fe in btr_copy_blob_prefix(unsigned char*, unsigned long, unsigned long, unsigned long, unsigned long) ()
#6 0x00000000011c6a9a in btr_copy_externally_stored_field(unsigned long*, unsigned char const*, page_size_t const&, unsigned long, mem_block_info_t*) ()
#7 0x0000000001124900 in row_sel_store_mysql_field_func(unsigned char*, row_prebuilt_t*, unsigned char const*, unsigned long const*, unsigned long, mysql_row_templ_t const*, unsigned long) ()
#8 0x0000000001124d9b in row_sel_store_mysql_rec(unsigned char*, row_prebuilt_t*, unsigned char const*, dtuple_t const*, unsigned long, dict_index_t const*, unsigned long const*, bool) ()
#9 0x000000000112b18f in row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long) ()
#10 0x0000000001022889 in ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int) ()
#11 0x0000000000851bbb in handler::ha_index_next(unsigned char*) ()
#12 0x0000000000d0a6fc in join_read_next(READ_RECORD*) ()
#13 0x0000000000d0f0c6 in sub_select(JOIN*, QEP_TAB*, bool) ()
#14 0x0000000000d0e3aa in JOIN::exec() ()
#15 0x0000000000d7e600 in handle_query(THD*, LEX*, Query_result*, unsigned long long, unsigned long long) ()
#16 0x0000000000d39e33 in execute_sqlcom_select(THD*, TABLE_LIST*) ()
#17 0x0000000000d40f38 in mysql_execute_command(THD*, bool) ()
#18 0x0000000000d419c5 in mysql_parse(THD*, Parser_state*, bool) ()
#19 0x0000000000d41b10 in wsrep_mysql_parse(THD*, char const*, unsigned int, Parser_state*, bool) [clone .clone.0] ()
#20 0x0000000000d43d2f in dispatch_command(THD*, COM_DATA const*, enum_server_command) ()
#21 0x0000000000d44efb in do_command(THD*) ()
#22 0x0000000000e24ce4 in handle_connection ()
#23 0x0000000000f9ebe4 in pfs_spawn_thread ()
#24 0x00007f9506510dd5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f9504db7ead in clone () from /lib64/libc.so.6
实例监控信息
在异常阶段,数据库实例监控信息如下:
1、Thread Running爆涨至1200,同时QPS下滑接近为0:
2、大量数根读入Buffer Pool中:
3、Buffer Poll使用内存快速上涨:
4、故障时产生大量的s_os_wait和s_spin_waits等待
5、数据库服务器CPU使用率超过30%
6、数据库服务器磁盘使用率达到100%
7、数据库服务器磁盘读操作吞吐超过500MB/s,磁盘读操作IO超过35k/s,平均每次IO读16KB
异常原因分析
由于表iast_agent_method_pool上存在唯一索引pool_sign(pool_sign
,agent_id
),子查询返回的agent_id
数量有限,虽然表上存在诸多大字段和JSON,甚至部分记录总长度超过28MB,但异常期间出现的SQL正常情况下均能在50ms内返回。在异常期间,大量查询甚至按主键查询的SQL执行时间超过10分钟,结合各种监控数据分析,怀疑该SQL执行计划出现异常,由于查询中包含ORDER BY iast_agent_method_pool
.id
ASC LIMIT 1 语句,查询优化器可能选择使用主键索引扫描的方式来执行,导致查询需要扫描大量数据才能找到满足WHERE条件的第一条记录,由于所需数据未缓存于buffer pool中,需要从磁盘读取并加载到buffer pool,因此产生大量的磁盘读操作。
由于参数innodb_thread_concurrency的限制,每次只有24个请求能进入到INNODB存储引擎层进行处理,这些执行计划异常导致查询耗时严重的SQL长期在INNODB存储引擎层占用,其他SQL无法获得进入INNODB存储引擎层执行的机会。
问题优化建议
1、尽量避免使用TEXT或BLOB还是JSON等大字段列。
2、尽量避免使用ORDER BY语句,避免在不需要排序的场景中使用ORDER BY语句。
3、尽量避免使用子查询,查询优化器很难准确评估子查询的返回记录,很容易引发执行计划异常。
4、合理设置参数innodb_thread_concurrency的值,推荐使用默认值0(即不限制并发),当使用>0的值时,需考虑超慢查询对集群的影响。