1

mysql slave io线程报错一例 -m6米乐安卓版下载

原创 冯刚 2022-10-21
717

前言

此案例介绍半同步复制环境中,因为主库binary log index file里面出现两个相同binlog文件名,导致该实例的slave库i/o线程停止。

1 环境信息

自建mysql:5.7.22

os 版本:centos linux release 7.6


2 错误描述

早晨接到报警,提示报表库(见上面拓扑图)i/o线程异常。

2.1 登录报表库,查看复制状态

[ root@报表库:~ ]# mysql
welcome to the mysql monitor.  commands end with ; or \g.
your mysql connection id is 132984
server version: 5.7.22-log mysql community server (gpl)
米乐app官网下载 copyright (c) 2000, 2018, oracle and/or its affiliates. all rights reserved.
oracle is a registered trademark of oracle corporation and/or its
affiliates. other names may be trademarks of their respective
owners.
type 'help;' or '\h' for help. type '\c' to clear the current input statement.
mysql> show slave status\g;
*************************** 1. row ***************************
               slave_io_state: 
                  master_host: 备份库地址
                  master_user: repl_user
                  master_port: 3306
                connect_retry: 1
              master_log_file: mysql_bin.003427
          read_master_log_pos: 4
               relay_log_file: relay.001627
                relay_log_pos: 241
        relay_master_log_file: mysql_bin.003427
             slave_io_running: no
            slave_sql_running: yes
              replicate_do_db: 
          replicate_ignore_db: 
           replicate_do_table: 
       replicate_ignore_table: 
      replicate_wild_do_table: 
  replicate_wild_ignore_table: 
                   last_errno: 0
                   last_error: 
                 skip_counter: 0
          exec_master_log_pos: 4
              relay_log_space: 525
              until_condition: none
               until_log_file: 
                until_log_pos: 0
           master_ssl_allowed: no
           master_ssl_ca_file: 
           master_ssl_ca_path: 
              master_ssl_cert: 
            master_ssl_cipher: 
               master_ssl_key: 
        seconds_behind_master: null
master_ssl_verify_server_cert: no
                last_io_errno: 1236
                last_io_error: got fatal error 1236 from master when reading data from binary log: 'could not open log file'
               last_sql_errno: 0
               last_sql_error: 
  replicate_ignore_server_ids: 
             master_server_id: 10002
                  master_uuid: 85cb9384-4db4-11ec-b654-fa1b79c33900
             master_info_file: mysql.slave_master_info
                    sql_delay: 0
          sql_remaining_delay: null
      slave_sql_running_state: slave has read all relay log; waiting for more updates
           master_retry_count: 86400
                  master_bind: 
      last_io_error_timestamp: 221021 04:29:36
     last_sql_error_timestamp: 
               master_ssl_crl: 
           master_ssl_crlpath: 
           retrieved_gtid_set: 8077181f-4db4-11ec-8225-fa417cfd7e00:169899218-205014272
            executed_gtid_set: 3c07e8e0-399a-11ed-9be9-5254006919a1:1-156029,
8077181f-4db4-11ec-8225-fa417cfd7e00:1-205014272,
88958222-4db4-11ec-b637-faa9e751b600:1-243629236
                auto_position: 1
         replicate_rewrite_db: 
                 channel_name: 
           master_tls_version: 
1 row in set (0.00 sec)

根据报错,提示打不开binlog文件,报错时间为:221021 04:29:36。

2.2 查看对应时间报表库错误日志:

2022-10-21t04:29:36.176599 08:00 1 [error] error reading packet from server for channel '': could not open log file (server_errno=1236)
2022-10-21t04:29:36.176638 08:00 1 [error] slave i/o for channel '': got fatal error 1236 from master when reading data from binary log: 'could not open log file', error_code: 1
236
2022-10-21t04:29:36.176642 08:00 1 [note] slave i/o thread exiting for channel '', read up to log 'mysql_bin.003427', position 4

可以看到读到mysql_bin.003427文件position 4的时候报错。

3 排查

3.1 登录备份库,查看binlog

[ root@备份库:/xxxxxx/binlog ]# ll
total 247390320
.................................省略..............................................................省略.............................
-rw-r----- 1 mysql mysql 1080514038 oct 21 04:23 mysql_bin.003425
-rw-r----- 1 mysql mysql 1080022172 oct 21 04:29 mysql_bin.003426
-rw-r----- 1 mysql mysql 1073745885 oct 21 10:45 mysql_bin.003427
-rw-r----- 1 mysql mysql 1073742926 oct 21 14:56 mysql_bin.003428
-rw-r----- 1 mysql mysql 1073749274 oct 21 16:56 mysql_bin.003429
-rw-r----- 1 mysql mysql  147230918 oct 21 17:32 mysql_bin.003430
-rw-r----- 1 mysql mysql      12642 oct 21 16:56 mysql_bin.index
[ root@备份库:/xxxxxxx/binlog ]# cat mysql_bin.index
.................................省略.............................
.................................省略.............................
/xxxxxx/mysql_bin.003425
/xxxxxx/mysql_bin.003426
/xxxxxx/mysql_bin.003427
/xxxxxx/mysql_bin.003427

看到备份库正在写mysql_bin.003427文件,且只有一个。但是mysql_bin.index里面记录了两个mysql_bin.003427文件。

3.2 查看错误日志

2022-10-21t04:29:36.171970 08:00 815671 [error] failed to open log (file '/xxxxxx/binlog/mysql_bin.003427', errno 2)
2022-10-21t04:29:36.171982 08:00 669961 [error] failed to open log (file '/xxxxxx/binlog/mysql_bin.003427', errno 2)
2022-10-21t04:29:36.172200 08:00 815671 [note] stop asynchronous binlog_dump to slave (server_id: 1)
2022-10-21t04:29:36.172200 08:00 669961 [note] stop asynchronous binlog_dump to slave (server_id: 330647)

可以看到报错和从库一致。

3.3 解析备份库binlog

根据2.1步骤,报表库/从库最后一个事务gtid为: '8077181f-4db4-11ec-8225-fa417cfd7e00:205014272'

先不考虑为什么有两个 mysql_bin.003427 文件,先解析备份库binlog,

mysql_bin.003427日志的第一个事物gtid为:'8077181f-4db4-11ec-8225-fa417cfd7e00:205221231'

[ root@备份库:/xxxxxx/binlog ]# mysqlbinlog --base64-output=decode-rows -v -v  mysql_bin.003427|more
/*!50530 set @@session.pseudo_slave_mode=1*/;
/*!50003 set @old_completion_type=@@completion_type,completion_type=0*/;
delimiter /*!*/;
# at 4
#221021  8:15:54 server id 10002  end_log_pos 123 crc32 0xd51f2be9 	start: binlog v 4, server v 5.7.22-log created 221021  8:15:54
# at 123
#221021  8:15:54 server id 10002  end_log_pos 234 crc32 0x7d2337cf 	previous-gtids
# 8077181f-4db4-11ec-8225-fa417cfd7e00:1-205221230,
# 88958222-4db4-11ec-b637-faa9e751b600:1-243629236
# at 234
#221021  8:15:54 server id 10003  end_log_pos 299 crc32 0xcaae95a6 	gtid	last_committed=0	sequence_number=1	rbr_only=yes
/*!50718 set transaction isolation level read committed*//*!*/;
set @@session.gtid_next= '8077181f-4db4-11ec-8225-fa417cfd7e00:205221231'/*!*/;
# at 299

上一个日志mysql_bin.003426的最后一个事物gtid为:'8077181f-4db4-11ec-8225-fa417cfd7e00:205014272'

可以看到,两个binlog日志文件差了 205221231-205014272 将近21万个gtid事务。

4 原因

之前有人复现过,主库手动删除在用的binlog日志文件 延迟 会复现上面的报错,参考文章:

文章链接,转载注明出处:https://cdn.modb.pro/db/401771

简单说就是手动删除主库在用的binlog文件,会一定概率导致从库i/o线程停止。

但是此次问题不是这种情形。从拓扑图可以看到,出现两个同名binlog的虽然是主库,但是是线上的备份库,只读的,不会有人手动清理binlog,并且日志清理默认是15天。

mysql> show variables like '%expire%';
 -------------------------------- ------- 
| variable_name                  | value |
 -------------------------------- ------- 
| disconnect_on_expired_password | on    |
| expire_logs_days               | 15    |
 -------------------------------- ------- 
2 rows in set (0.00 sec)

备份库每天凌晨1点物理备份全库,应该不是这个原因。

还有一个就是,发现该实例一直在新启动一个新的mysql进程(间隔大约是100秒一次),但是一直失败。不知道是不是这个原因(后面通过停止pid为73985的进程解决这个问题,没有再出现启动新进程)

[ root@备份库:/etc/systemd/system ]# ps -ef|grep mysqld
mysql      6708      1  5 jun24 ?        6-00:05:03 /usr/local/mysql57/sbin/mysqld --defaults-file=/etc/mysql57/my_3306.cnf --basedir=/usr/local/mysql57 --plugin-dir=/usr/local/mysql57/lib64/mysql/plugin
mysql     73985      1  3 14:23 ?        00:00:01 /usr/local/mysql57/sbin/mysqld --defaults-file=/etc/mysql57/my_3306.cnf --basedir=/usr/local/mysql57 --plugin-dir=/usr/local/mysql57/lib64/mysql/plugin
root      74130  69736  0 14:24 pts/2    00:00:00 grep --colour mysqld

错误日志:

2022-10-21t03:50:07.523347 08:00 0 [note] /usr/local/mysql57/sbin/mysqld (mysqld 5.7.22-log) starting as process 109889 ...
2022-10-21t03:50:07.527315 08:00 0 [warning] option 'innodb-sort-buffer-size': unsigned value 134217728 adjusted to 67108864
2022-10-21t03:50:07.527829 08:00 0 [note] innodb: punch hole support available
2022-10-21t03:50:07.527858 08:00 0 [note] innodb: mutexes and rw_locks use gcc atomic builtins
2022-10-21t03:50:07.527865 08:00 0 [note] innodb: uses event mutexes
2022-10-21t03:50:07.527870 08:00 0 [note] innodb: gcc builtin __atomic_thread_fence() is used for memory barrier
2022-10-21t03:50:07.527875 08:00 0 [note] innodb: compressed tables use zlib 1.2.3
2022-10-21t03:50:07.527881 08:00 0 [note] innodb: using linux native aio
2022-10-21t03:50:07.531363 08:00 0 [note] innodb: number of pools: 1
2022-10-21t03:50:07.531516 08:00 0 [note] innodb: using cpu crc32 instructions
2022-10-21t03:50:07.535524 08:00 0 [note] innodb: initializing buffer pool, total size = 8g, instances = 16, chunk size = 128m
2022-10-21t03:50:07.535555 08:00 0 [note] innodb: setting numa memory policy to mpol_interleave
2022-10-21t03:50:08.222880 08:00 0 [note] innodb: setting numa memory policy to mpol_default
2022-10-21t03:50:08.222925 08:00 0 [note] innodb: completed initialization of buffer pool
2022-10-21t03:50:08.434086 08:00 0 [note] innodb: if the mysqld execution user is authorized, page cleaner thread priority can be changed. see the man page of setpriority().
2022-10-21t03:50:08.434440 08:00 0 [error] innodb: unable to lock ./ibdata1 error: 11
2022-10-21t03:50:08.434463 08:00 0 [note] innodb: check that you do not already have another mysqld process using the same innodb data or log files.
2022-10-21t03:50:08.434470 08:00 0 [note] innodb: retrying to lock the first data file
2022-10-21t03:50:09.434584 08:00 0 [error] innodb: unable to lock ./ibdata1 error: 11

主机日志:

oct 21 14:20:04 systemd[1]: mysql57@3306.service: main process exited, code=exited, status=1/failure
oct 21 14:20:04 kill[73396]: usage:
oct 21 14:20:04 kill[73396]: kill [options]  [...]
oct 21 14:20:04 kill[73396]: options:
oct 21 14:20:04 kill[73396]: -a, --all              do not restrict the name-to-pid conversion to processes
oct 21 14:20:04 kill[73396]: with the same uid as the present process
oct 21 14:20:04 kill[73396]: -s, --signal      send specified signal
oct 21 14:20:04 kill[73396]: -q, --queue       use sigqueue(2) rather than kill(2)
oct 21 14:20:04 kill[73396]: -p, --pid              print pids without signaling them
oct 21 14:20:04 kill[73396]: -l, --list [=] list signal names, or convert one to a name
oct 21 14:20:04 kill[73396]: -l, --table            list signal names and numbers
oct 21 14:20:04 kill[73396]: -h, --help     display this help and exit
oct 21 14:20:04 kill[73396]: -v, --version  output version information and exit
oct 21 14:20:04 kill[73396]: for more details see kill(1).
oct 21 14:20:04 systemd[1]: mysql57@3306.service: control process exited, code=exited status=1
oct 21 14:20:04 systemd[1]: unit mysql57@3306.service entered failed state.
oct 21 14:20:04 systemd[1]: mysql57@3306.service failed.
oct 21 14:20:04 systemd[1]: mysql57@3306.service holdoff time over, scheduling restart.
oct 21 14:20:04 systemd[1]: stopped mysql community server.
oct 21 14:20:04 systemd[1]: starting mysql community server...
oct 21 14:20:05 systemd[1]: started mysql community server.

5 m6米乐安卓版下载的解决方案

既然备份库的binlog中缺失了gtid,并且缺失的gtid区间为:205014273-205221230 。那就解析线上主库对应时间段的binlog日志文件,确定对应gtid的postion,然后通过mysqlbinlog --start-positon和--stop-position生成sql文件,到报表库执行即可。

执行期间,通过show master status;可以看到executed_gtid_set的事务id一直在增长。

最后start slave即可。

最后修改时间:2022-10-24 08:57:03
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【米乐app官网下载的版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

网站地图