Oracle ADG Troubleshooting – FAL Configuration
Current ADG architecture
Primary: one two-node RAC:
Node 1: real_ip: 10.3.xx.1
Node 2: real_ip: 10.3.xx.2
Standby: single-node host: 10.0.xx.11
Background:
DingTalk has been raising alerts frequently of late (for the alert code, see <
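Troubleshooting:
Analysis of the alert content:
The alerts point to three symptoms: the MRP0 process is in an abnormal state (WAIT_FOR_GAP), redo transport is lagging, and redo apply is lagging. The WAIT_FOR_GAP state indicates that some archived logs failed in transit to the standby. The next step is to check the standby's trace log, determine which primary node the missing log belongs to, and analyze from there.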
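Checking the standby trace log:
The log for thread 2 sequence 558445 - 558445 failed to transfer. The standby log also shows that the FAL[client] process tried to re-fetch the archived log from the primary through the servers defined in the ADG fal_server parameter, but still could not fetch thread 2 sequence 558445 - 558445. With a correct configuration, the error "fal[client]: all defined fal servers have been attempted" should not appear, so the standby configuration was suspected to be incomplete.
Following that lead, the standby's fal_server setting was checked. It names a single TNS alias, orcladg, which is defined as follows: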
cat $ORACLE_HOME/network/admin/tnsnames.ora
orcladg =
  (description =
    (address_list =
      (address = (protocol = tcp)(host = 10.3.xx.1)(port = 1521))
    )
    (connect_data =
      (server = dedicated)
      (service_name = orcl)
    )
  )
The alias lists only node 1's real IP, so when a node 2 log fails to transfer, the FAL process reports that it cannot re-fetch it. The orcladg entry was therefore changed to add node 2's real IP:
orcladg =
  (description =
    (address_list =
      (address = (protocol = tcp)(host = 10.3.xx.1)(port = 1521))
      (address = (protocol = tcp)(host = 10.3.xx.2)(port = 1521))
    )
    (connect_data =
      (server = dedicated)
      (service_name = orcl)
    )
  )
$ tnsping orcladg
used tnsnames adapter to resolve the alias
attempting to contact (description = (address_list = (address = (protocol = tcp)(host = 10.3.xx.1)(port = 1521)) (address = (protocol = tcp)(host = 10.3.xx.2)(port = 1521))) (connect_data = (server = dedicated) (service_name = orcl)))
ok (0 msec)
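The check that was done by eye here can be sketched as a small script. This is only an illustrative sketch: the helper names `hosts_in_descriptor` and `missing_nodes` are hypothetical, the descriptors are the two orcladg entries shown above, and the masked IPs are treated as plain strings.

```python
import re

def hosts_in_descriptor(descriptor):
    """Return every host value found in a tnsnames.ora connect descriptor."""
    # tnsnames.ora keywords are case-insensitive; spaces around '=' are optional.
    pattern = r"\(\s*host\s*=\s*([^)\s]+)\s*\)"
    return re.findall(pattern, descriptor, flags=re.IGNORECASE)

def missing_nodes(descriptor, rac_nodes):
    """Return the RAC node addresses that the descriptor does not list."""
    listed = set(hosts_in_descriptor(descriptor))
    return [node for node in rac_nodes if node not in listed]

# Real IPs of the two primary RAC nodes, masked as in the article.
RAC_NODES = ["10.3.xx.1", "10.3.xx.2"]

# The orcladg entry before the fix: node 1 only.
BEFORE = """orcladg = (description = (address_list =
    (address = (protocol = tcp)(host = 10.3.xx.1)(port = 1521)))
  (connect_data = (server = dedicated)(service_name = orcl)))"""

# The orcladg entry after the fix: both nodes.
AFTER = """orcladg = (description = (address_list =
    (address = (protocol = tcp)(host = 10.3.xx.1)(port = 1521))
    (address = (protocol = tcp)(host = 10.3.xx.2)(port = 1521)))
  (connect_data = (server = dedicated)(service_name = orcl)))"""

print(missing_nodes(BEFORE, RAC_NODES))  # ['10.3.xx.2']
print(missing_nodes(AFTER, RAC_NODES))   # []
```

A non-empty result means at least one primary node is unreachable through the alias that fal_server points at.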
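That completes the fix on the standby side. With the configuration correct, the standby log shows FAL re-fetching a log after a transport failure as follows: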
thu sep 22 22:19:19 2022
media recovery waiting for thread 2 sequence 558445
fetching gap sequence in thread 2, gap sequence 558445 - 558445
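To see which threads and sequences FAL had to re-fetch without reading the whole log by hand, the gap lines can be pulled out with a regular expression. The sketch below is a rough illustration: the function names `find_gaps` and `fal_exhausted` are hypothetical, and the patterns are modeled only on the two messages quoted in this article, so other Oracle versions may word them differently.

```python
import re

# Modeled on the "fetching gap sequence" line quoted above (case-insensitive).
GAP_RE = re.compile(
    r"fetching gap sequence in thread (\d+), gap sequence (\d+)\s*-\s*(\d+)",
    re.IGNORECASE,
)

# The message that appears when no configured FAL server could supply the log.
EXHAUSTED_MSG = "all defined fal servers have been attempted"

def find_gaps(log_text):
    """Return (thread, first_sequence, last_sequence) for every gap-fetch line."""
    return [tuple(int(g) for g in m.groups()) for m in GAP_RE.finditer(log_text)]

def fal_exhausted(log_text):
    """Return True if the log reports that every FAL server was tried."""
    return EXHAUSTED_MSG in log_text.lower()

SAMPLE = """thu sep 22 22:19:19 2022
media recovery waiting for thread 2 sequence 558445
fetching gap sequence in thread 2, gap sequence 558445 - 558445
"""

print(find_gaps(SAMPLE))      # [(2, 558445, 558445)]
print(fal_exhausted(SAMPLE))  # False
```

A thread number that keeps reappearing in the result points at the primary node whose logs are failing to arrive.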
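Tips:
According to reference material, when the primary is a RAC cluster, the standby's tnsnames.ora entry usually points at the cluster's SCAN IP rather than at the real IPs of the individual RAC nodes.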
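Investigating the node whose log transport failed:
Fixing fal_server on the standby only ensures that the standby can re-fetch logs from the primary after a transport failure. It remains to find out why the primary's archived logs were not shipped to the standby correctly in the first place.
The trace log on the node whose transport failed shows the following: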
arc0: archive log rejected (thread 2 sequence 558852) at host 'orclold'
fal[server, arc0]: fal archive failed, see trace file.
arch: fal archive failed. archiver continuing
oracle instance hostname - archival error. archiver continuing.
The log shows that the arc0 process failed while shipping the log. The arc0 trace file in the trace directory ends with the following error:
# Command to list the trace files:
ls -lt *arc0*.trc
# The actual error
kcrrwkx: unknown error:16401
error 16401 creating standby archive log file at host 'orclold'
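When rejections like this recur, it can help to collect the affected thread, sequence, host, and error code from the primary-side logs in one pass. The following is a minimal sketch under stated assumptions: the function names `rejected_logs` and `standby_errors` are hypothetical, and the patterns are written only against the lines quoted above.

```python
import re

# Modeled on the "archive log rejected" line from the primary's log.
REJECT_RE = re.compile(
    r"archive log rejected \(thread (\d+) sequence (\d+)\) at host '([^']+)'",
    re.IGNORECASE,
)

# Modeled on the "error NNNNN creating standby archive log file" trace line.
ERROR_RE = re.compile(
    r"error (\d+) creating standby archive log file at host '([^']+)'",
    re.IGNORECASE,
)

def rejected_logs(text):
    """Return (thread, sequence, host) for every rejected archive log."""
    return [(int(t), int(s), h) for t, s, h in REJECT_RE.findall(text)]

def standby_errors(text):
    """Return (error_code, host) for every standby file-creation error."""
    return [(int(code), host) for code, host in ERROR_RE.findall(text)]

SAMPLE = """arc0: archive log rejected (thread 2 sequence 558852) at host 'orclold'
fal[server, arc0]: fal archive failed, see trace file.
kcrrwkx: unknown error:16401
error 16401 creating standby archive log file at host 'orclold'
"""

print(rejected_logs(SAMPLE))   # [(2, 558852, 'orclold')]
print(standby_errors(SAMPLE))  # [(16401, 'orclold')]
```

Comparing the timestamps of these entries with load metrics on the standby is one way to check whether rejections line up with busy periods.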
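In this case the standby rejected the log during transport. The standby was under heavy load at the time.
Summary:
After a redo transport failure, Oracle ADG re-fetches the missing logs from the primary through the FAL process. As long as FAL is configured correctly and the standby data is not lagging, this error can be ignored.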