故障现象:
oracle linux 7.9 oracle 19.18 rac(集群和数据库补丁打了最新的20230117psu)
突然发现连续几天节点1在凌晨2:05左右节点1oracle alert日志报错如下:报错8次。
2023-03-18t02:05:11.597975 08:00
errors in file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19622.trc:
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
2023-03-18t02:05:11.602151 08:00
errors in file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19622.trc:
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
ora-01017: 用户名/口令无效; 登录被拒绝
2023-03-18t02:05:12.644246 08:00
errors in file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19727.trc:
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
2023-03-18t02:05:12.650235 08:00
errors in file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19727.trc:
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
ora-01017: 用户名/口令无效; 登录被拒绝
2023-03-18t02:05:13.246784 08:00
errors in file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19778.trc:
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
2023-03-18t02:05:13.251266 08:00
errors in file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19778.trc:
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
ora-01017: 用户名/口令无效; 登录被拒绝
trace file /oracle/app/oracle/diag/rdbms/jyc/jyc1/trace/jyc1_ora_19622.trc
oracle database 19c enterprise edition release 19.0.0.0.0 - production
version 19.18.0.0.0
build label: rdbms_19.18.0.0.0dbru_linux.x64_230111
oracle_home: /oracle/app/oracle/product/19c/dbhome_1
system name: linux
node name: rac1
release: 5.4.17-2102.201.3.el7uek.x86_64
version: #2 smp fri apr 23 09:05:55 pdt 2021
machine: x86_64
instance name: jyc1
redo thread mounted by this instance: 1
oracle process number: 170
unix process pid: 19622, image: oracle@rac1 (tns v1-v3)
*** 2023-03-18t02:05:11.597802 08:00 (cdb$root(1))
*** session id:(2826.63186) 2023-03-18t02:05:11.597839 08:00
*** client id:() 2023-03-18t02:05:11.597845 08:00
*** service name:(sys$users) 2023-03-18t02:05:11.597849 08:00
*** module name:(sqlplus@rac1 (tns v1-v3)) 2023-03-18t02:05:11.597854 08:00
*** action name:() 2023-03-18t02:05:11.597859 08:00
*** client driver:(server) 2023-03-18t02:05:11.597863 08:00
*** container id:(1) 2023-03-18t02:05:11.597868 08:00
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
ora-17503: ksfdopn: 2 未能打开文件 data/jyc/password/pwdjyc.263.1131705075
ora-27300: 操作系统系统相关操作: open 失败, 状态为: 13
ora-27301: 操作系统故障消息: permission denied
ora-27302: 错误发生在: sskgmsmr_7
at 0x7ffd140a6538 placed kzia.c@2649
ora-01017: 用户名/口令无效; 登录被拒绝
集群和数据库服务并没发现异常:
[oracle@rac1 trace]$ crsctl stat res -t
--------------------------------------------------------------------------------
name target state server state details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.listener.lsnr
online online rac1 stable
online online rac2 stable
ora.chad
online online rac1 stable
online online rac2 stable
ora.net1.network
online online rac1 stable
online online rac2 stable
ora.ons
online online rac1 stable
online online rac2 stable
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.asmnet1lsnr_asm.lsnr(ora.asmgroup)
1 online online rac1 stable
2 online online rac2 stable
3 online offline stable
ora.data.dg(ora.asmgroup)
1 online online rac1 stable
2 online online rac2 stable
3 offline offline stable
ora.listener_scan1.lsnr
1 online online rac1 stable
ora.reco.dg(ora.asmgroup)
1 online online rac1 stable
2 online online rac2 stable
3 offline offline stable
ora.asm(ora.asmgroup)
1 online online rac1 started,stable
2 online online rac2 started,stable
3 offline offline stable
ora.asmnet1.asmnetwork(ora.asmgroup)
1 online online rac1 stable
2 online online rac2 stable
3 offline offline stable
ora.jyc.db
1 online online rac1 open,home=/oracle/ap
p/oracle/product/19c
/dbhome_1,stable
2 online online rac2 open,home=/oracle/ap
p/oracle/product/19c
/dbhome_1,stable
ora.cvu
1 online online rac1 stable
ora.qosmserver
1 online online rac1 stable
ora.rac1.vip
1 online online rac1 stable
ora.rac2.vip
1 online online rac2 stable
ora.scan1.vip
1 online online rac1 stable
--------------------------------------------------------------------------------
[oracle@rac1 trace]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
name target state server state details
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.asm
1 online online rac1 stable
ora.cluster_interconnect.haip
1 online online rac1 stable
ora.crf
1 online online rac1 stable
ora.crsd
1 online online rac1 stable
ora.cssd
1 online online rac1 stable
ora.cssdmonitor
1 online online rac1 stable
ora.ctssd
1 online online rac1 active:0,stable
ora.diskmon
1 offline offline stable
ora.drivers.acfs
1 online online rac1 stable
ora.evmd
1 online online rac1 stable
ora.gipcd
1 online online rac1 stable
ora.gpnpd
1 online online rac1 stable
ora.mdnsd
1 online online rac1 stable
ora.storage
1 online online rac1 stable
--------------------------------------------------------------------------------
故障分析:
第一感觉一定是数据库的定时任务引起,否则不可能这么规律。
第二想法是数据库集群服务都正常,为何会报这样的错误?权限肯定都是正常的,实在没道理。
查遍数据库里的所有任务,并没有异常。都是success的。
生成2:00-2:30的awr报告:符合trc报错特征的信息如下:
set line 160
set wrap off
col owner for a10
col job_name for a30
col errors for a30
col run_duration for a10
col status for a10
select * from (select instance_id,status,owner,job_name,run_duration,actual_start_date
from dba_scheduler_job_run_details order by actual_start_date desc) a where rownum < 50;
select job_name,enabled,state,run_count,last_start_date,next_run_date,last_run_duration from dba_scheduler_jobs;
参考:
new 11g default jobs (doc id 755838.1)
1. ora$autotask_clean
the job is created by the 11g script catmwin.sql - catalog script for maintenance window
catmwin.sql defines maintenance window and stats collection job. this job is an autotask repository data ageing job and deletes obsolete autotask repository data. the delete of this old data will be performed by the procedure ora$age_autotask_data which will be executed by the job.
2. hm_create_offline_dictionary
the job is created by the 11g script catmwin.sql which mentions that this is a job for creation of offline dictionary for database repair advisor.
from unpublished note 602459.1 - 'what is offline_dictionary': 'the system job sys.hm_create_offline_dictionary executes the dbms_hm.create_offline_dictionary package which creates a logminer offline dictionary in the adr for dra name translation service. the job for generating the logminer dictionary is scheduled during the maintenance window. this job can be disabled. '
3. dra_reevaluate_open_failures
the job is created by the 11g script catmwin.sql which mentions that this is a job for reevaluate open failures for database recovery advisor. in other words it is used to get assistance in case of loss or corruption of datafiles, redo log files or controlfiles.
the job executes the procedure dbms_ir.reevaluateopenfailures.
note.466682.1 - 'data recovery advisor -reference guide'.
4. mgmt_config_job - comes with the ocm(oracle configuration manager) installation - this is a configuration collection job.
the job is created by the script ocmjb10.sql by running procedure 'oracle_ocm.mgmt_config.collect_config'.
5. mgmt_stats_config_job
this is an ocm statistics collection job created in ocmjb10.sql by running 'oracle_ocm.mgmt_config.collect_stats'.
6. bsln_maintain_stats_job
this job is a compute statistics job. this job runs the bsln_maintain_stats_prog program on the bsln_maintain_stats_sched schedule. the program bsln_maintain_stats_prog will keep the default baseline's statistics up-to-date.
7. xmldb_nfs_cleanup_job
the job is created in xdbu102.sql and runs the procedure dbms_xdbutil_int.cleanup_expired_nfsclients.
8. rse$clean_recoverable_script - this job is streams related.
it is created in execstr.sql and is new in 11.2. the script execstr.sql executes anonymous blocks for stream. the job is responsible with the cleaning of recoverable scripts and calls 'sys.dbms_streams_auto_int.clean_recoverable_script;'.
9. sm$clean_auto_split_merge - this job is streams related. it is created in file execstr.sql and is new in 11.2 too. the job is responsible with the cleaning of auto split merge views.
尝试处理:在cdb下禁用在2点中启动的任务sm$clean_auto_split_merge。
唯一有2点中开始的任务也显示是success的,尝试禁用,无效还原。
sql> execute dbms_scheduler.disable('sm$clean_auto_split_merge');
pl/sql procedure successfully completed.
检查oracle metalink文章也没有找到符合的情况:
ora-17503/ora-27300/ora-27301/ora-27302 occurred if connected database via an os user who is not in “oinstall” group (doc id 2310640.1)
sp execution fails with error - ora-29780: unable to connect to gpnp daemon [clsgpnp_err] in rac database. (doc id 2627124.1)
ora-27300: os system dependent operation:open failed with status: 13 (doc id 2347696.1)
甚至还尝试了如下操作,修改两节点/oracle/app/19c/grid/gpnp/节点名称/wallets目录权限,无效。(观察几天后还报错所以还原)
反复查阅资料发现:
查看ls -alt /tmp/下有个文件.orachk.loc的生成时间是在2:04左右的orachk相关名称的文件,时间上匹配。所以查阅了orachk相关的定时任务:orachk -autostatus果然时间上匹配到了,2:03开始执行。
再去查阅资料,发现有人也遇到过类似问题。
根本原因是run orachk with ora-17503/ora-27300/ora-27301/ora-27302
手动执行orachk,观察alert输出确定引起错误信息。
由于升级到当前最新版orachk23.2也未解决,并且orachk这个工具并不需要经常执行,一般初次安装的时候用来核查一下健康状况而已,所以暂时停用了该健康检查的任务。计划等后续再出最新版本后尝试升级看是否可修复该bug。
禁用任务命令:orachk -autostop
关于升级orachk请参看另外写的一篇文章:https://www.modb.pro/db/620052