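Preface
Yesterday a colleague came to me: the cursed backups had gone wrong yet again and filled up the NAS.
Symptoms
pgbackrest, as mentioned in an earlier post, is an excellent backup tool. Its one flaw is that backing up from a standby to reduce load on the primary requires passwordless SSH, which I believe the vast majority of production environments cannot allow. I won't repeat the details here; see the earlier post if you're interested.
There is no way around it, though: our backup tooling is built on pgbackrest, and we are currently developing backup over TCP/IP to get around this fatal SSH shortcoming.
Back to the problem: why did the NAS fill up? The first thing that came to mind was the retention policy, and that was later confirmed to be the culprit. The official documentation gives a brief introduction: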
pgbackrest does full backup rotation based on the retention type which can be a count or a time period. when a count is specified, then expiration is not concerned with when the backups were created but with how many must be retained. differential and incremental backups are count-based but will always be expired when the backup they depend on is expired. see sections and for details and examples. archived wal is retained by default for backups that have not expired, however, although not recommended, this schedule can be modified per repository with the retention-archive options. see section for details and examples.
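In other words, pgbackrest rotates full backups according to the retention type, which is either a count or a time period. With a count, expiration does not care when the backups were created, only how many must be kept. Differential and incremental backups are count-based, but they always expire when the backup they depend on expires.
You can keep a given number of backups or keep them for a given length of time. This is controlled by repo1-retention-full-type, which takes count or time. For example:
repo1-retention-full-type=count, repo1-retention-full=2: at most 2 full backups are kept
repo1-retention-full-type=time, repo1-retention-full=2: full backups are kept for 2 days, and there must always be at least one full backup that is 2 or more days old
When the backup that a differential or incremental backup depends on expires, the dependent backup expires with it. WAL archives can also be given their own retention policy (the retention-archive options), but the documentation advises against it, so by default WAL is removed automatically when the corresponding backup expires.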
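To make the count-based rule concrete, here is a minimal sketch in shell. expired_fulls is a hypothetical helper of my own, not a pgbackrest command. It only mimics the rule "keep the newest N full backups" on a list of labels, and it assumes that labels sort chronologically, as timestamp-based labels such as 20220506-094821f do.

```shell
#!/bin/bash
# Hypothetical helper (not part of pgbackrest): given a retention count and a
# list of full-backup labels, print the labels that fall outside the count.
expired_fulls() {
    local keep=$1; shift
    local excess=$(( $# - keep ))
    # Nothing expires while the number of full backups is within the count.
    [ "$excess" -gt 0 ] || return 0
    # Labels begin with a timestamp, so a plain sort puts the oldest first.
    printf '%s\n' "$@" | sort | head -n "$excess"
}
```

For example, `expired_fulls 2 20220506-094821f 20220506-144430f 20220506-152039f` prints only the oldest label, 20220506-094821f. The real expiration logic in pgbackrest also removes the dependent differential and incremental backups and the WAL, which this sketch ignores.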
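In our backup script the plan was to keep 2 full backups and 6 differential backups (repo1-retention-diff is a count, so with one differential per day that is six days' worth):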
repo1-retention-full=2
repo1-retention-diff=6
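The script also checks the day of the week: on Saturday it takes a full backup, and on every other day a differential backup. By the analysis above, differential backups expire once their full backup expires, the related backups and WAL are deleted, and the NAS should never fill up.
Where is the problem
So where is the problem? On inspection, many instances turned out to have only a single full backup, and everything else was incremental: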
[postgres@xiongcc ~]$ pgbackrest info --config-path=/home/postgres | grep full | wc -l
1
[postgres@xiongcc ~]$ pgbackrest info --config-path=/home/postgres | grep incr | wc -l
561
The logic in the script looks roughly like this:
[postgres@xiongcc ~]$ cat test.sh
#!/bin/bash
week_day=$(date +%w)
if [ "$week_day" = 6 ]; then
pgbackrest --config-path=/home/postgres backup --type=full --stanza=mypg
else
pgbackrest --config-path=/home/postgres backup --stanza=mypg
fi
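The script checks whether today is Saturday. If it is, it takes a full backup; otherwise it runs a backup without specifying a type, which turns out to be an incremental one.
As you can see, the second command does not set the backup type explicitly, so pgbackrest defaults to an incremental backup. Here is an example, starting with two full backups: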
[postgres@xiongcc ~]$ cat pgbackrest.conf
[mypg]
pg1-path=/home/postgres/pgdata
[global]
repo1-path=/home/postgres/backrest_backup_dir
repo1-retention-full=2
repo1-retention-diff=4
log-level-file=debug
[postgres@xiongcc ~]$ pgbackrest info --config-path=/home/postgres
stanza: mypg
status: ok
cipher: none
db (current)
wal archive min/max (14): 000000010000000900000051/000000010000000900000057
full backup: 20220506-094821f
timestamp start/stop: 2022-05-06 09:48:21 / 2022-05-06 09:48:28
wal start/stop: 000000010000000900000051 / 000000010000000900000051
database size: 34.3mb, database backup size: 34.3mb
repo1: backup set size: 4.3mb, backup size: 4.3mb
full backup: 20220506-144430f
timestamp start/stop: 2022-05-06 14:44:30 / 2022-05-06 14:44:37
wal start/stop: 000000010000000900000057 / 000000010000000900000057
database size: 42.2mb, database backup size: 42.2mb
repo1: backup set size: 9mb, backup size: 9mb
Take a backup manually, without specifying a type:
[postgres@xiongcc ~]$ pgbackrest --stanza=mypg --config-path=/home/postgres backup --log-level-console=info
Check again, and you can see that pgbackrest took an incremental backup by default:
[postgres@xiongcc ~]$ pgbackrest info --config-path=/home/postgres
stanza: mypg
status: ok
cipher: none
db (current)
wal archive min/max (14): 000000010000000900000051/000000010000000900000059
full backup: 20220506-094821f
timestamp start/stop: 2022-05-06 09:48:21 / 2022-05-06 09:48:28
wal start/stop: 000000010000000900000051 / 000000010000000900000051
database size: 34.3mb, database backup size: 34.3mb
repo1: backup set size: 4.3mb, backup size: 4.3mb
full backup: 20220506-144430f
timestamp start/stop: 2022-05-06 14:44:30 / 2022-05-06 14:44:37
wal start/stop: 000000010000000900000057 / 000000010000000900000057
database size: 42.2mb, database backup size: 42.2mb
repo1: backup set size: 9mb, backup size: 9mb
incr backup: 20220506-144430f_20220506-144727i
timestamp start/stop: 2022-05-06 14:47:27 / 2022-05-06 14:47:29
wal start/stop: 000000010000000900000059 / 000000010000000900000059
database size: 42.2mb, database backup size: 4.8mb
repo1: backup set size: 9mb, backup size: 2.5mb
backup reference list: 20220506-144430f
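One way to avoid relying on the default is to always pass --type explicitly. The following is only a sketch of that idea, not our production script: backup_type_for_day and run_backup are hypothetical names, and the config path and stanza are the ones used in the examples in this post.

```shell
#!/bin/bash
# Map a day of the week (0-6, Sunday = 0, as printed by `date +%w`)
# to the backup type we intend to run on that day.
backup_type_for_day() {
    if [ "$1" = 6 ]; then
        echo full   # Saturday: full backup
    else
        echo diff   # every other day: differential backup
    fi
}

# Run today's backup with the type always spelled out.
run_backup() {
    local type
    type=$(backup_type_for_day "$(date +%w)")
    pgbackrest --config-path=/home/postgres --stanza=mypg --type="$type" backup
}
```

Calling `run_backup` from the scheduled job then takes a differential backup on weekdays instead of silently falling back to an incremental one.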
What makes it worse is that incremental backups have no "expiration" mechanism of their own. Run the same command a few more times:
[postgres@xiongcc ~]$ pgbackrest info --config-path=/home/postgres | grep incr | wc -l
8
There are now 8 incremental backups. So how about configuring an analogous repo1-retention-incr?
[postgres@xiongcc ~]$ cat pgbackrest.conf
[mypg]
pg1-path=/home/postgres/pgdata
[global]
repo1-path=/home/postgres/backrest_backup_dir
repo1-retention-full=2
repo1-retention-diff=4
repo1-retention-incr=4
log-level-file=debug
Awkwardly, pgbackrest does not support such an option and prints a warning:
[postgres@xiongcc ~]$ pgbackrest info --config-path=/home/postgres | grep incr | wc -l
warn: configuration file contains invalid option 'repo1-retention-incr'
9
Why is that? We have to go back to the differences between the backup types:
full backup: pgbackrest copies the entire contents of the database cluster to the backup. the first backup of the database cluster is always a full backup. pgbackrest is always able to restore a full backup directly. the full backup does not depend on any files outside of the full backup for consistency.
differential backup: pgbackrest copies only those database cluster files that have changed since the last full backup. pgbackrest restores a differential backup by copying all of the files in the chosen differential backup and the appropriate unchanged files from the previous full backup. the advantage of a differential backup is that it requires less disk space than a full backup, however, the differential backup and the full backup must both be valid to restore the differential backup.
incremental backup: pgbackrest copies only those database cluster files that have changed since the last backup (which can be another incremental backup, a differential backup, or a full backup). as an incremental backup only includes those files changed since the prior backup, they are generally much smaller than full or differential backups. as with the differential backup, the incremental backup depends on other backups to be valid to restore the incremental backup. since the incremental backup includes only those files since the last backup, all prior incremental backups back to the prior differential, the prior differential backup, and the prior full backup must all be valid to perform a restore of the incremental backup. if no differential backup exists then all prior incremental backups back to the prior full backup, which must exist, and the full backup itself must be valid to restore the incremental backup.
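From these definitions, the difference between differential and incremental backups lies in their reference point:
A differential backup is taken relative to the last full backup.
An incremental backup is taken relative to the last backup of any type: full, differential or incremental.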
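The two reference points can be illustrated with a bit of arithmetic. This is a rough sketch under simplifying assumptions of my own: one full backup is taken before day 1, one backup is taken per day, and each day's changes touch different files. backup_sizes is a hypothetical helper that takes the amount changed each day (in MB) and prints, per day, what a differential and an incremental backup would have to copy.

```shell
#!/bin/bash
# Print "day diff_size incr_size" for a list of daily change sizes in MB.
# A differential copies everything changed since the full backup (cumulative);
# an incremental copies only what changed since the previous backup.
backup_sizes() {
    local day=0 cumulative=0 change
    for change in "$@"; do
        day=$(( day + 1 ))
        cumulative=$(( cumulative + change ))
        echo "$day $cumulative $change"
    done
}
```

With `backup_sizes 100 0 0`, that is, 100MB written once and nothing afterwards, the differential column stays at 100 on all three days while the incremental column reads 100, 0, 0.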
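It follows that if only 100MB of data is added between t1 and t2, every differential backup taken afterwards is roughly the same size, because each one contains those changes again, whereas successive incremental backups shrink, because each one contains only what changed since the previous backup. An incremental backup therefore depends on the backups before it, which is the backup reference list in the info output further down. If incrementals had their own "expiration" mechanism and a backup they depend on were deleted (whether incremental, differential or full), the later incrementals would become useless.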
incremental backups cannot be expired independently — they are always expired with their related full or differential backup.
So once the backup an incremental depends on expires, the incremental expires with it, and pgbackrest takes care of this automatically.
incr backup: 20220506-152039f_20220506-152127i
timestamp start/stop: 2022-05-06 15:21:27 / 2022-05-06 15:21:29
wal start/stop: 00000001000000090000006e / 00000001000000090000006e
database size: 42.2mb, database backup size: 5mb
repo1: backup set size: 9mb, backup size: 2.6mb
backup reference list: 20220506-152039f
incr backup: 20220506-152039f_20220506-152141i
timestamp start/stop: 2022-05-06 15:21:41 / 2022-05-06 15:21:44
wal start/stop: 000000010000000900000070 / 000000010000000900000070
database size: 42.2mb, database backup size: 4.9mb
repo1: backup set size: 9mb, backup size: 2.5mb
backup reference list: 20220506-152039f, 20220506-152039f_20220506-152127i   <-- depends on the previous incremental
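Differential backups are different: their reference is the full backup and they depend only on it, which is why a retention count can be set for them with repo1-retention-diff. And of course, once a full backup expires, the incremental and differential backups that depend on it expire with it, together with the related WAL.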
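The dependency on the full backup is visible in the labels themselves: in the info output, each incremental label starts with the label of its full backup followed by an underscore. Based on that convention, here is a small sketch that lists which backups would go away together with a given full backup. dependents_of is a hypothetical helper for illustration, not something pgbackrest provides.

```shell
#!/bin/bash
# Hypothetical helper: print the labels that depend on a given full backup,
# assuming dependent labels have the form "<full label>_<timestamp><type>".
dependents_of() {
    local full=$1; shift
    local label
    for label in "$@"; do
        case $label in
            "${full}_"*) echo "$label" ;;
        esac
    done
}
```

For instance, passing 20220506-152039f followed by the three incremental labels shown in this post prints 20220506-152039f_20220506-152127i and 20220506-152039f_20220506-152141i, but not 20220506-144430f_20220506-144727i, which belongs to a different full backup.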
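The remaining question is why our environment had only ever taken one full backup. After some digging, it turned out that the logic in the script was flawed. We had been lazy and carried over the logic from our earlier pg_probackup script: first check whether a backup.lock file exists, then check whether a pgbackrest process exists, and if so exit the script immediately, because we naively assumed that a pgbackrest process was already running a backup.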
if [ -f backup.lock ]; then
# count the other pgbackrest processes running for this instance
backup_count=$(ps -ef | grep pgbackrest | grep "$port" | grep -v $$ | wc -l)
if [ "$backup_count" -eq 0 ]; then
rm backup.lock
else
echo "another backup is running!"
exit 3
fi
else
touch backup.lock
fi
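That was fine with pg_probackup, but not with pgbackrest, because with pgbackrest the archive command is itself a pgbackrest invocation that has to be configured separately, as follows: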
[postgres@xiongcc ~]$ psql -c "show archive_command"
archive_command
-----------------------------------------------------------------------
pgbackrest --stanza=mypg --config-path=/home/postgres archive-push %p
(1 row)
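So if WAL happens to be archiving at the moment the backup script runs, the check goes wrong: the archive-push process is counted, the count is greater than 0, and the script exits without taking a backup. With pg_probackup we never ran into this.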
[postgres@xiongcc ~]$ ps -ef | grep pgbackrest
postgres 9932 7442 0 15:49 pts/0 00:00:00 pgbackrest --stanza=mypg --config-path=/home/postgres/pg5432 archive-push pg_wal/000000010000000900000075
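One possible fix is to count only processes that are actually running the backup command, so that archive-push no longer trips the check. The sketch below is my own assumption about how to do that, not the final script: count_backup_procs is a hypothetical helper that reads `ps -ef` style text on stdin and counts the pgbackrest lines that have backup as a standalone argument.

```shell
#!/bin/bash
# Count pgbackrest processes whose command line contains the word "backup"
# as its own argument. Lines for archive-push, archive-get, info and so on
# have no such argument and are therefore not counted.
count_backup_procs() {
    awk '/pgbackrest/ {
             for (i = 1; i <= NF; i++)
                 if ($i == "backup") { n++; break }
         }
         END { print n + 0 }'
}
```

In the lock check this would be used as `backup_count=$(ps -ef | count_backup_procs)`, with an additional filter on the port or config path if several instances share a host.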
Summary
Quite the adventure in stepping on landmines. Off to fix the script.