Why call it a close shave? Because within the limited downtime window, errors and failures kept appearing, and it was hard not to get nervous. If we could not fix them, we would have had to consider rolling back; never mind how painful a wasted all-nighter is for everyone, if the cluster services were not back up by the deadline it would have been a major incident with real accountability.
This cluster had history: after an abnormal power loss and restart, node 1 could not auto-start the clusterware (node 2 was fine), and node 1's cluster services ended up being started by hand. So I had a feeling there might be some risk this time, and sure enough, trouble came...
Node 1's earlier manual-start workaround followed GI fails to start as process "init.ohasd run" is not running (Doc ID 1680406.1):
cd /etc/init.d
nohup ./init.ohasd run &
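Before retrying crsctl after such a manual start, it helps to confirm the wrapper really is up. A minimal check, assuming a standard Linux GI install (the scls_scr path embeds the node name, so the wildcard is a placeholder):
ps -ef | grep [i]nit.ohasd                 # should show "/bin/sh /etc/init.d/init.ohasd run"
cat /etc/oracle/scls_scr/*/root/ohasdstr   # autostart flag, normally "enable"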
Version before the upgrade: 19.11.
The patching process mainly followed the README.
The pre-patch check found patch conflicts, so the conflicting patches had to be rolled back first. We did not expect to be knocked back at step one.
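For context, this is the kind of pre-check that surfaces such conflicts; a sketch using OPatch's standard conflict check, with the unzipped patch directory left as a placeholder:
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir <unzipped_patch_dir>   # lists conflicting one-off patches, if any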
1. Rolling back the conflicting patches: starting CRS at the end failed with errors
Reference: CRS-41053: Checking Oracle Grid Infrastructure for file permission issues. Not able to start HAS after patching failure. (Doc ID 2894422.1)
The suggested actions are fairly invasive for a production environment, so we did not dare act rashly.
Oracle Database - Enterprise Edition - Version 19.16.0.0.0 and later
Information in this document applies to any platform.
Symptoms
The HAS was failing to start up with the following message:
[node1 bin]# ./crsctl start has
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Configurations we verified:
1) There were multiple ohasd processes running. There should be only one ohasd process, so we killed them all, but this action plan didn't help: even after validating the HAS processes with "ps -ef | grep has", killing the extra processes, and then running crsctl stop has and crsctl start has, the problem remained.
2) The clusterware alert log reported only the following message:
2022-09-02 17:46:51.671 [ohasd(31214)]CRS-0715: Oracle High Availability Service has timed out waiting for init.ohasd to be started.
3) No messages were reported in the ohasd.trc file.
4) No errors were reported in the OS logs (/var/log/messages on Linux).
5) Rebooting the node did not start the HAS even though the auto-start feature was enabled.
6) The permissions on the inventory.xml file's directory and its parent directory were incorrect, but fixing them did not bring the HAS online either.
7) The CRS failed to come online even after executing roothas.sh -lock or roothas.sh -patch.
Changes
After manual intervention during a failed patching process, the CRS was failing to start up.
Cause
The customer mentioned executing the "rootcrs.sh -lock" command in a standalone environment to bring the CRS online. In a standalone environment, roothas.sh should be used instead.
Solution
You will need to relink the binaries to resolve the issue.
We unlocked the Grid home and relinked the binaries. After relinking, the CRS started successfully:
How to Relink the Oracle Grid Infrastructure Standalone (Restart) Installation or Oracle Grid Infrastructure RAC/Cluster Installation (11.2 to 21c) (Doc ID 1536057.1)
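Per our reading of that note, the relink boils down to a few steps; a sketch for a RAC/cluster Grid home, assuming $GRID_HOME is /oracle/app/19c/grid:
# as root: unlock the Grid home
$GRID_HOME/crs/install/rootcrs.sh -unlock
# as the grid owner: relink the binaries
$GRID_HOME/bin/relink
# as root: restore root-owned bits and lock the home again
$GRID_HOME/rdbms/install/rootadd_rdbms.sh
$GRID_HOME/crs/install/rootcrs.sh -lock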
The fix that finally worked:
/oracle/app/19c/grid/bin/crsctl stop crs
systemctl stop oracle-ohasd
ps -ef|grep d.bin
kill -9 xxx
cd /var/tmp/.oracle/
rm -rf /var/tmp/.oracle/*
systemctl start oracle-ohasd
/oracle/app/19c/grid/bin/crsctl start crs
ps -ef|grep d.bin|wc -l   # seeing 20-odd d.bin processes usually means the stack is up
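Instead of eyeballing the process count, you can also poll for the stack explicitly; a small sketch that keys off the CRS-4537 line crsctl prints once Cluster Ready Services is online (the 10-second interval is arbitrary):
until /oracle/app/19c/grid/bin/crsctl check crs 2>/dev/null | grep -q CRS-4537; do
  sleep 10   # keep polling until Cluster Ready Services reports online
done
/oracle/app/19c/grid/bin/crsctl stat res -t   # then review individual resources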
Once node 1's cluster services were up, we rolled back the conflicting patches on node 2.
2. Patching node 1: starting CRS at the end failed with errors again
The fix that finally worked: the same as for issue 1.
/oracle/app/19c/grid/bin/crsctl stop crs
systemctl stop oracle-ohasd
ps -ef|grep d.bin
kill -9 xxx
cd /var/tmp/.oracle/
rm -rf /var/tmp/.oracle/*
systemctl start oracle-ohasd
systemctl status oracle-ohasd
/oracle/app/19c/grid/bin/crsctl start crs
ps -ef|grep d.bin|wc -l   # seeing 20-odd d.bin processes usually means the stack is up
Once node 1's cluster services were up, we went on to patch node 2.
3. Node 2 finished patching, but the cluster state was wrong
It took quite a while, which had to do with an existing backup job (we had asked a designated person to disable the job before patching, and it turned out not to have been done; careless of us).
Fix:
Reference: https://www.modb.pro/db/496096
Running this on node 1 was enough:
[root@rac1 .oracle]# /oracle/app/19c/grid/bin/clscfg -patch
clscfg: -patch mode specified
clscfg: EXISTING configuration version 19 detected.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
[root@rac1 .oracle]# /oracle/app/19c/grid/bin/crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [19.0.0.0.0]. The cluster upgrade state is [ROLLING PATCH]. The cluster active patch level is [3331580692].
[root@rac1 .oracle]# /oracle/app/19c/grid/bin/crsctl stop rollingpatch
CRS-1161: The cluster was successfully patched to patch level [3976270074].
[root@rac1 .oracle]# /oracle/app/19c/grid/bin/crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [19.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [3976270074].
[root@rac1 .oracle]#
The check on node 2 was also normal now.
[root@rac2 .oracle]# /oracle/app/19c/grid/bin/crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [19.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [3976270074].
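As an extra cross-check after crsctl stop rollingpatch, each node's software patch level can be compared with the cluster's active level; a sketch using the node names from this post:
/oracle/app/19c/grid/bin/crsctl query crs softwarepatch rac1
/oracle/app/19c/grid/bin/crsctl query crs softwarepatch rac2
/oracle/app/19c/grid/bin/crsctl query crs activeversion -f   # all reported patch levels should match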
4. Applying the OJVM patch failed with errors
Fix:
Reference: https://www.modb.pro/db/1688153507759742976
Set the PATH and PERL5LIB environment variables, then rerun opatch apply:
export PATH=$ORACLE_HOME/perl/bin:$PATH
export PERL5LIB=$ORACLE_HOME/perl/lib
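A quick sanity check that opatch will now pick up Oracle's bundled Perl (ORACLE_HOME is assumed to be set already):
which perl                         # should resolve under $ORACLE_HOME/perl/bin
perl -v | head -2                  # version banner of the Perl now in use
env | grep -E '^(PATH|PERL5LIB)='  # confirm both variables took effect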
Takeaways:
1. For production patching, try to get enough downtime reserved to absorb the risk (with backups in hand, rollback is nothing to fear).
Back up the Grid/Oracle software directories.
As root:
On nodes 1 and 2:
cd /oracle/app
tar -cvf /oracle/bak/oraInventory.tar ./oraInventory
tar -cvf /oracle/bak/grid.tar ./19c ./grid --exclude=./grid/admin --exclude=./grid/diag --exclude=./19c/grid/network/log --exclude=./19c/grid/log --exclude=./19c/grid/rdbms/audit
tar -cvf /oracle/bak/oracle.tar ./oracle --exclude=./oracle/admin --exclude=./oracle/diag
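For completeness, the matching restore during a rollback is roughly the reverse; a sketch assuming the backups above and a fully stopped clusterware stack (the excluded log/diag directories are not in the archives):
cd /oracle/app
tar -xvf /oracle/bak/oraInventory.tar   # restores ./oraInventory
tar -xvf /oracle/bak/grid.tar           # restores ./19c and ./grid
tar -xvf /oracle/bak/oracle.tar         # restores ./oracle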
2. Anticipate problems and do as much technical preparation as you can in advance.
Based on earlier experience:
We compared the directories and files between nodes 1 and 2. This time the oui-patch.xml permissions were fine on both nodes, but files under /oracle/app/19c/grid/inventory/oneoffs were missing on node 2, so they had to be copied over from node 1 ahead of time before patching node 2.
[grid@rac1 ~]$ ll /oracle/app/oraInventory/ContentsXML/oui-patch.xml
-rw-rw---- 1 grid oinstall 174 May 30 11:43 /oracle/app/oraInventory/ContentsXML/oui-patch.xml
[grid@rac1 ~]$ ll /oracle/app/19c/grid/inventory/oneoffs
[grid@rac1 ~]$ scp -rp /oracle/app/19c/grid/inventory/oneoffs/* rac2:/oracle/app/19c/grid/inventory/oneoffs/
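To catch such gaps ahead of time, the listings can be diffed across nodes; a sketch relying on the same user equivalence the scp above already uses:
diff <(ls /oracle/app/19c/grid/inventory/oneoffs) <(ssh rac2 ls /oracle/app/19c/grid/inventory/oneoffs)   # no output means the two nodes match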
Screenshot of the successful patch upgrade (image omitted).
To wrap up, here is the difference among crs, has, and cluster:
Reference:
crsctl start/stop crs
crsctl start/stop has
crsctl start/stop cluster
These three commands are all commonly used with Oracle clusterware, but there are some differences in how they behave.
HAS is a concept introduced in 11g; in 10g there was only CRS.
The following tests were all run in a 12c environment:
1. In a standalone (Oracle Restart) environment, only crsctl start/stop has works; crsctl start/stop crs cannot be used.
2. When stopping the cluster, crsctl stop has/crs does the job cleanly (but only for the local node), whereas crsctl stop cluster (which can stop all nodes at once) does not stop everything and leaves the HAS processes running.
3. When starting the cluster, crsctl start has/crs also does the job, whereas crsctl start cluster has no effect at all (because it cannot connect to HAS).
The reason is easy to understand once you look at the following commands:
crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
crsctl check has
CRS-4638: Oracle High Availability Services is online
crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
You can see that CRS = HAS + CLUSTER: HAS is the master daemon, and CLUSTER covers the daemons that handle cluster interaction.