面试官:MySQL权限表损坏导致无法启动怎么办?

一、背景

近期,公司RDS云产品的MySQL Server版本进行升级,由目前使用的5.7.26版本升级到最新版本5.7.31;升级后测试同学发现:在MySQL创建用户后,5.7.31版本重新启动集群会出现启动失败的现象;而5.7.26版本在相同测试场景下是正常启动的。这到底是为什么呢?

二、问题复现

2.1 实验环境

CentOS 7.5
MySQL 5.7.31
2.2 操作步骤

按照测试同学的测试步骤,首先创建一个用户:

mysql> create user test@’%’ identified by ‘1234’;
Query OK, 0 rows affected (0.01 sec)
然后关闭mysqld;这里需要介绍一下,我们集群的关闭方式是如下方式:

shell> kubectl delete pod yuelei-57-f3b6200-0 -n qfusion-admin
这种方式的内部实现类似于kill -9模式。所以我在线下环境使用kill -9的方式来复现,操作如下:

shell> kill -9 $(pidof mysqld)
然后重启mysqld,操作如下:

shell> mysqld_safe –defaults-file=/etc/mysql/my.cnf &
此时问题复现了,mysqld启动失败,我们查看了下error日志,信息如下:


2020-08-18T10:13:55.506415+08:00 0 [ERROR] /usr/local/mysql/bin/mysqld: Table ‘./mysql/user’ is marked as crashed and should be repaired
2020-08-18T10:13:55.506553+08:00 0 [ERROR] Fatal error: Can’t open and lock privilege tables: Table ‘./mysql/user’ is marked as crashed and should be repaired
2020-08-18T10:13:55.506658+08:00 0 [ERROR] Fatal error: Failed to initialize ACL/grant/time zones structures or failed to remove temporary table files.
2020-08-18T10:13:55.506789+08:00 0 [ERROR] Aborting

根据报错信息可以看出:MySQL的权限系统表发生了损坏,导致了mysqld启动失败;由于在MySQL 5.7及其之前版本该表是MyISAM引擎,且该引擎不支持事务,所以在mysqld异常崩溃会导致该类型引擎表的损坏;但在mysqld启动时是有参数控制MyISAM引擎的恢复模式,且该参数在我们产品中也配置到了my.cnf中,如下所示:

[mysqld]
myisam_recover_options=BACKUP,FORCE
2.3 参数解析

对于该参数的官方文档的解释如下:

设置MyISAM存储引擎恢复模式。选项值是OFF、DEFAULT、BACKUP、FORCE或QUICK的值的任意组合。如果指定多个值,请用逗号分隔。指定不带参数的选项与指定DEFAULT相同,指定显式值” “将禁用恢复(与OFF值相同)。如果启用了恢复,则mysqld每次打开MyISAM表时,都会检查该表是否标记为已崩溃或未正确关闭。(只有在禁用外部锁定的情况下运行,最后一个选项才起作用。)在这种情况下,mysqld在表上运行检查。如果表已损坏,mysqld将尝试对其进行修复。

OFF:No recovery.
DEFAULT:Recovery without backup, forcing, or quick checking.
BACKUP:If the data file was changed during recovery, save a backup of the tbl_name.MYD file as tbl_name-datetime.BAK.
FORCE:Run recovery even if we would lose more than one row from the .MYD file.
QUICK:Do not check the rows in the table if there are not any delete blocks.
服务器自动修复表之前,它将有关修复的注释写到错误日志中。如果您希望能够在无需用户干预的情况下从大多数问题中恢复,则应使用选项BACKUP,FORCE。即使某些行将被删除,这也会强制修复表,但是它将旧的数据文件保留为备份,以便您以后可以检查发生了什么。

全局变量,只读变量,默认为OFF。

三、问题修复

这类MySQL用户表损耗的问题解决方式也是有多种,我这里列举其中一种:

(1)my.cnf中的[mysqld]标签下添加skip_grant_tables,启动时跳过加载系统字典。

[mysqld]
skip_grant_tables
(2)重启mysqld,然后修复mysql schema下的所有表。

shell> mysqlcheck -uroot -pxxx mysql –auto-repair

mysql.user
warning : 1 client is using or hasn’t closed the table properly
status : OK
(3)在[mysqld]标签下注释或删除掉skip_grant_tables,然后重启mysqld。

shell> mysqld_safe –defaults-file=/etc/mysql/my.cnf &
此时mysqld是可以正常启动的,无异常。

四、深入排查

在产品化中,以上修复方式很不优雅,只是作为临时的解决方案;并且也存在一些令人疑惑的点:

通常情况下在my.cnf中设置myisam_recover_options=BACKUP,FORCE时,启动时会自动修复MyISAM损坏的表;在上述场景中为什么没有自动修复呢?
相同环境下,5.7.26版本mysqld启动是正常的,但是5.7.31版本mysqld启动出现如上述现象;难道是MySQL做了改动吗?
带着这些疑问,我们继续排查出现该现象的原因;此时Google也没有找到一些有效的信息,那么只能通过MySQL源代码来寻找一些答案。

首先需要下载mysql 5.7.31版本的源代码,并搭建mysql debug环境;具体步骤可以自动Google搜索一下,本文就不再赘述了。

在源代码中搜索一下关键词,用于打断点的位置,然后进行调试:

确认MySQL版本

shell> cat VERSION
MYSQL_VERSION_MAJOR=5
MYSQL_VERSION_MINOR=7
MYSQL_VERSION_PATCH=31
MYSQL_VERSION_EXTRA=

搜索一些关键信息

shell> grep -r “initialize ACL/grant/time” *
mysql-test/r/grant_debug.result:Pattern “Fatal error: Failed to initialize ACL/grant/time zones structures or failed to remove temporary table files.” found
mysql-test/t/grant_debug.test:let SEARCH_PATTERN=Fatal error: Failed to initialize ACL/grant/time zones structures or failed to remove temporary table files.;
sql/mysqld.cc: sql_print_error(“Fatal error: Failed to initialize ACL/grant/time zones “
搜索得出的结果很多,我们需要对此进行过滤。mysql-test目录是mysql的测试用例代码,我们可以直 接忽略;需要关注的是sql/mysqld.cc的文件,因为mysqld启动时调用main->mysqld_main->…,而mysqld_main函数是在sql/mysqld.cc目录下,此时我们查看具体的代码(sql/mysqld.cc:4958):

if (mysql_rm_tmp_tables() || acl_init(opt_noacl) ||
my_tz_init((THD *)0, default_tz_name, opt_bootstrap) ||
grant_init(opt_noacl))

{
abort_loop= true;
sql_print_error(“Fatal error: Failed to initialize ACL/grant/time zones “
“structures or failed to remove temporary table files.”);

delete_pid_file(MYF(MY_WME));

unireg_abort(MYSQLD_ABORT_EXIT);
}
定位到相关代码,大概是sql/mysqld.cc的4958行,且存在if条件判断,此时我们开始调试:

shell> gdb /usr/local/mysql/bin/mysqld

(gdb) b sql/mysqld.cc:4958
Breakpoint 1 at 0xe6f8d0: file /opt/mysql-server/sql/mysqld.cc, line 4958.

(gdb) r –defaults-file=/etc/mysql/my.cnf

(gdb) p mysql_rm_tmp_tables()
$1 = 0 ‘\000’

(gdb) p acl_init(opt_noacl)
$2 = 1 ‘\001’

(gdb) p my_tz_init((THD *)0, default_tz_name, opt_bootstrap)
$3 = 0 ‘\000’

(gdb) p grant_init(opt_noacl)
$4 = false
通过以上调试信息,可以判断出acl_init函数返回的值为真;此时我们查看该函数的代码 (
sql/auth/sql_auth_cache.cc:1365):

/*
Initialize structures responsible for user/db-level privilege checking and
load privilege information for them from tables in the ‘mysql’ database.

SYNOPSIS
acl_init()
dont_read_acl_tables TRUE if we want to skip loading data from
privilege tables and disable privilege checking.

NOTES
This function is mostly responsible for preparatory steps, main work
on initialization and grants loading is done in acl_reload().

RETURN VALUES
0 ok
1 Could not initialize grant’s
*/

my_bool acl_init(bool dont_read_acl_tables)
{
THD *thd;
my_bool return_val;
DBUG_ENTER(“acl_init”);

根据该函数的注释发现:该函数是初始化负责用户/数据库级特权检查的结构,并从mysql schema中的表中为其加载特权信息;且return值为1代表的是初始化权限失败。

此后开始逐步调试,观察return相关信息,当调试到lock_table_names函数时,我们发现在Phase 3时return值为true,且根据代码注释发现true代表是Failure;具体代码如下(sql/sql_base.cc:5549):

/**
Acquire “strong” (SRO, SNW, SNRW) metadata locks on tables used by
LOCK TABLES or by a DDL statement.

Acquire lock “S” on table being created in CREATE TABLE statement.

@note Under LOCK TABLES, we can’t take new locks, so use
open_tables_check_upgradable_mdl() instead.

@param thd Thread context.
@param tables_start Start of list of tables on which locks
should be acquired.
@param tables_end End of list of tables.
@param lock_wait_timeout Seconds to wait before timeout.
@param flags Bitmap of flags to modify how the tables will be
open, see open_table() description for details.

@retval false Success.
@retval true Failure (e.g. connection was killed)
*/

bool
lock_table_names(THD *thd,
TABLE_LIST *tables_start, TABLE_LIST *tables_end,
ulong lock_wait_timeout, uint flags)
{
MDL_request_list mdl_requests;
TABLE_LIST *table;
MDL_request global_request;
Hash_set schema_set(PSI_INSTRUMENT_ME);
bool need_global_read_lock_protection= false;

// Phase 3: Acquire the locks which have been requested so far.
if (thd->mdl_context.acquire_locks(&mdl_requests, lock_wait_timeout))
return true;

/*
Now when we have protection against concurrent change of read_only
option we can safely re-check its value.
Skip the check for FLUSH TABLES … WITH READ LOCK and
FLUSH TABLES … FOR EXPORT as they are not supposed to be affected
by read_only modes.
*/
if (need_global_read_lock_protection &&
!(flags & MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK) &&
!(flags & MYSQL_LOCK_IGNORE_GLOBAL_READ_ONLY) &&
check_readonly(thd, true))
return true;

调试信息如下:

(gdb) p flags
$7 = 0
(gdb) p need_global_read_lock_protection
$8 = true
(gdb) p check_readonly(thd, true)
$9 = true
可以看到flags的值为0,而
MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK为宏定义值0x1000,与flags的值 做按位与操作,结果自然也是0,当然MYSQL_LOCK_IGNORE_GLOBAL_READ_ONLY也是如此;need_global_read_lock_protection是bool类型值,代表是否需要全局读锁的保护,这个值是在table- >mdl_request.type不为MDL_SHARED_READ_ONLY发生改变;check_readonly函数相关信息⻅下面概述。

此时也查看了下MySQL 5.7.26版本代码作为对比,发现lock_table_names函数下的Phase 3后的部分代 码是在5.7.29版本后新增的。如果是git clone的MySQL代码可以用git blame命令查询文件变化的信息:

shell> git blame -L 5637,+10 sql/sql_base.cc
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5637) Now when we have protection against concurrent change of read_only
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5638) option we can safely re-check its value.
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5639) Skip the check for FLUSH TABLES … WITH READ LOCK and
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5640) FLUSH TABLES … FOR EXPORT as they are not supposed to be affected
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5641) by read_only modes.
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5642) */
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5643) if (need_global_read_lock_protection &&
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5644) !(flags & MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK) &&
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5645) !(flags & MYSQL_LOCK_IGNORE_GLOBAL_READ_ONLY) &&
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5646) check_readonly(thd, true))
上述展示的信息中,最左侧的列值为commit id为05824063和0405ebee,有兴趣的同学可以详细看下。

此功能解决的问题是 BUG#28438114: SET READ_ONLY=1 SOMETIMES DOESN’T BLOCK CONCURRENTDDL.;当然这个代码的变更功能也在5.7 Release Notes中有所体现,如下所示(https://dev.mysql.co
m/doc/relnotes/mysql/5.7/en/news-5-7-29.html):

Under certain conditions, enabling the read_only or super_read_only system variable did not block concurrent DDL statements executed by users without the SUPER privilege. (Bug #28438114, Bug #91852)
最后我们再查看下check_readonly函数,该函数是基于read_only和super_read_only状态执行标准化检查,是禁止(TRUE)还是允许(FALSE)操作。代码如下(
sql/auth/sql_authorization.cc:489):

/**
@brief Performs standardized check whether to prohibit (TRUE)
or allow (FALSE) operations based on read_only and super_read_only
state.
@param thd Thread handler
@param err_if_readonly Boolean indicating whether or not
to add the error to the thread context if read-only is
violated.

@returns Status code
@retval TRUE The operation should be prohibited.
@ retval FALSE The operation should be allowed.
*/
bool check_readonly(THD *thd, bool err_if_readonly)
{
DBUG_ENTER(“check_readonly”);

/* read_only=OFF, do not prohibit operation: */
if (!opt_readonly)
DBUG_RETURN(FALSE);

此时第一反应就是去检查my.cnf中是否包含read_only相关参数,检查之后发现确实是使用了该参数, 如下:

[mysqld]
read_only=1
此时注释掉该参数,然后再次启动mysqld,发现MyISAM表可以自动修复,且正常启动;error log信息如下:


2020-08-18T11:52:26.775553+08:00 0 [ERROR] /usr/local/mysql/bin/mysqld: Table ‘./mysql/user’ is marked as crashed and should be repaired
2020-08-18T11:52:26.776217+08:00 0 [Warning] Checking table: ‘./mysql/user’
2020-08-18T11:52:26.776273+08:00 0 [ERROR] 1 client is using or hasn’t closed the table properly
2020-08-18T11:52:26.882537+08:00 0 [Note] Failed to start slave threads for channel ”
2020-08-18T11:52:26.906018+08:00 0 [Note] Event Scheduler: Loaded 0 events
2020-08-18T11:52:26.906480+08:00 0 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: ‘5.7.31-debug-log’ socket: ‘/var/lib/mysql/sock/mysql.sock’ port: 3306 Source distribution
由于docker一些限制,我们在mysqld启动会涉及两次;所以解决该问题的方式为:第一次mysqld的启动时先关闭read_only参数,第二次启动时开启read_only参数。之所以选择默认开启read_only参数, 是为了避免在mysqld启动后,选主逻辑未完成时的保护措施;当然选主完成后,会自动对master执行 set global read_only=0 操作。

五、总结

MySQL小版本的升级也会有变化,一定要做好升级前的**度测试工作。
MySQL源代码量很多,想要全部了解也很困难;此时我们通过对比不同版本之间的现象差异分别 进行调试,找出差异点后再深入探索到某些函数,效果事半功倍。
六、附录

调试的栈帧信息如下,有兴趣的小伙伴可以研究下:

(gdb) bt

0 check_readonly (thd=0xcde33f0, err_if_readonly=true) at /opt/mysql- server/sql/auth/sql_authorization.cc:491

1 0x0000000001486262 in lock_table_names (thd=0xcde33f0, tables_start=0xcdc9100, tables_end=0x0, lock_wait_timeout=31536000, flags=0) at /opt/mysql-server/sql/sql_base.cc:5646

2 0x0000000001484be5 in Open_table_context::recover_from_failed_open (this=0x7fffffffcad0) at /opt/mysql-server/sql/sql_base.cc:4789

3 0x0000000001486758 in open_tables (thd=0xcde33f0, start=0x7fffffffcb90, counter=0x7fffffffcbd4, flags=2048, prelocking_strategy=0x7fffffffcc10) at /opt/mysql-server/sql/sql_base.cc:5891

4 0x0000000001487851 in open_and_lock_tables (thd=0xcde33f0, tables=0x7fffffffccd0, flags=2048, prelocking_strategy=0x7fffffffcc10) at /opt/mysql-server/sql/sql_base.cc:6583

5 0x0000000000ea2611 in open_and_lock_tables (thd=0xcde33f0, tables=0x7fffffffccd0, flags=2048) at /opt/mysql-server/sql/sql_base.h:486

6 0x0000000000e9c2cf in acl_reload (thd=0xcde33f0) at /opt/mysql- server/sql/auth/sql_auth_cache.cc:2091

7 0x0000000000e9a2fd in acl_init (dont_read_acl_tables=false) at /opt/mysql- server/sql/auth/sql_auth_cache.cc:1429

8 0x0000000000e6f8f2 in mysqld_main (argc=136, argv=0x2c136a8) at /opt/mysql-server/sql/mysqld.cc:4958

9 0x0000000000e66cdd in main (argc=2, argv=0x7fffffffe458) at /opt/mysql- server/sql/main.cc:32

熟悉MySQL体系结构和innodb存储引擎工作原理;以及MySQL备份恢复、复制、数据迁移等技术;专注于MySQL、MariaDB开源数据库,喜好开源技术。

面试官:MySQL权限表损坏导致无法启动怎么办?
原文链接:
https://www.heapdump.cn/articles

版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容, 请发送邮件至22018681@qq.com 举报,一经查实,本站将立刻删除。

Like (0)
Donate 微信扫一扫 微信扫一扫 支付宝扫一扫 支付宝扫一扫
森林服务号的头像森林服务号
Previous 2022年4月9日
Next 2022年4月9日

相关推荐

发表回复

Please Login to Comment