A customer reported that clients in some departments could not connect to the database. A frontline colleague found that one node had been evicted from the cluster because its NIC kept dropping its link, and after the problem node (node 2) was rebooted, ASM could not mount the disk groups.
1. Fault environment
OS: CentOS 6.9
Database: Oracle RAC 11.2.0.3
2. NIC failure messages
Apr 4 16:24:28 db2 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Apr 4 16:24:31 db2 kernel: igb 0000:04:00.1: eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Apr 4 16:24:31 db2 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Apr 4 16:30:39 db2 init: oracle-ohasd main process (3029) killed by TERM signal
Apr 4 16:30:39 db2 init: tty (/dev/tty2) main process (5243) killed by TERM signal
Apr 4 16:30:39 db2 init: tty (/dev/tty3) main process (5245) killed by TERM signal
Apr 4 16:30:39 db2 init: tty (/dev/tty4) main process (5247) killed by TERM signal
Apr 4 16:30:39 db2 init: tty (/dev/tty5) main process (5249) killed by TERM signal
Apr 4 16:30:39 db2 init: tty (/dev/tty6) main process (5251) killed by TERM signal
Apr 4 16:30:49 db2 kernel: igb 0000:04:00.1: eth1: igb: eth1 NIC Link is Down
3. GI alert log
ORA-27508: IPC error sending a message
Wed Apr 04 16:29:23 2018
IPC Send timeout detected. Receiver ospid 6780 [oracle@db1 (LMD0)]
Wed Apr 04 16:29:23 2018
Errors in file /u01/app/grid/diag/asm/+asm/+asm1/trace/+ASM1_lmd0_6780.trc:
Wed Apr 04 16:31:08 2018
Detected an inconsistent instance membership by instance 2
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Because the NIC kept dropping its link, CRS evicted node 2. Once the colleague had traced the problem this far, the customer replaced the switch port and the cable, which fixed the NIC flapping.
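To confirm this kind of flapping on a live system, check the driver's link state and count the recent up/down transitions in the syslog; a minimal sketch, assuming the interface name eth1 and the default CentOS 6 log location:

# Current link state as seen by the driver (run as root)
ethtool eth1 | grep "Link detected"
# Recent link up/down transitions for eth1
grep -E "eth1.*(Link is (Up|Down)|link is not ready)" /var/log/messages | tail -20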
4. After the problem node 2 rebooted, ASM could not mount the disk groups
GI log on node 2:
[/u01/app/11.2.0/grid/bin/oraagent.bin(9051)]CRS-5019:All OCR locations are on ASM disk groups [VOTEDISK], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0/grid/log/db2/agent/ohasd/oraagent_grid/oraagent_grid.log".
ASM alert log on node 2:
NOTE: No asm libraries found in the system
NOTE: Assigning number (3,2) to disk (/dev/asm-vote3)
NOTE: Assigning number (3,0) to disk (/dev/asm-vote1)
WARNING: GMON has insufficient disks to maintain consensus. Minimum required is 3
GMON querying group 3 at 7 for pid 23, osid 10398
NOTE: group VOTEDISK: updated PST location: disk 0000 (PST copy 0)
NOTE: group VOTEDISK: updated PST location: disk 0002 (PST copy 1)
NOTE: cache dismounting (clean) group 3/0xdb585101 (VOTEDISK)
NOTE: messaging CKPT to quiesce pins Unix process pid: 10398, image: oracle@db2 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 3/0xdb585101 (VOTEDISK)
NOTE: cache ending mount (fail) of group VOTEDISK number=3 incarn=0xdb585101
NOTE: cache deleting context for group VOTEDISK 3/0xdb585101
GMON dismounting group 3 at 8 for pid 23, osid 10398
NOTE: Disk in mode 0x8 marked for de-assignment
NOTE: Disk in mode 0x8 marked for de-assignment
NOTE: Disk in mode 0x8 marked for de-assignment
ERROR: diskgroup VOTEDISK was not mounted
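Trying the mount by hand from the ASM instance fails the same way; a minimal check, assuming the diskgroup name VOTEDISK from the logs above:

sqlplus / as sysasm
SQL> alter diskgroup votedisk mount;
-- expected to fail here: GMON cannot reach consensus with only two of the three disks usable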
The disk group could not be mounted, and the VOTEDISK group had lost one disk, (3,1). The ASM alert log on node 1 showed failures at the same time, with disk groups being dismounted:
Wed Apr 04 16:31:09 2018
NOTE: waiting for instance recovery of group 2
Wed Apr 04 16:31:10 2018
NOTE: SMON starting instance recovery for group ARCH domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=41.7360 group=1 (ARCH)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (ARCH)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (ARCH) thread=2 ckpt=41.7360
NOTE: SMON did instance recovery for group ARCH domain 1
NOTE: SMON starting instance recovery for group DATA domain 2 (mounted)
NOTE: SMON skipping disk 0 - no header
NOTE: cache initiating offline of disk 0 group DATA
NOTE: process _smon_+asm1 (6796) initiating offline of disk 0.3915955645 (DATA_0000) with mask 0x7e in group 2
WARNING: Disk 0 (DATA_0000) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968bdbd, mask = 0x6a, op = clear
--
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.VOTEDISK.dg (unable to communicate with CRSD/OHASD)
All three disk groups were affected. Check v$asm_disk:
SQL> select name,path,header_status,mount_status from v$asm_disk;

NAME                    PATH                     HEADER_STATU MOUNT_S
----------------------- ------------------------ ------------ -------
DATA_0000               /dev/asm-data            PROVISIONED  CACHED
VOTEDISK_0001           /dev/asm-vote2           PROVISIONED  CACHED
VOTEDISK_0002           /dev/asm-vote3           PROVISIONED  CACHED
ARCH_0000               /dev/asm-arch            PROVISIONED  CACHED
VOTEDISK_0000           /dev/asm-vote1           MEMBER       CACHED
Of the five disks, four have bad headers: they show HEADER_STATUS = PROVISIONED, meaning they no longer belong to any disk group, so something outside ASM must have changed the disk headers. The Oracle documentation describes PROVISIONED as follows:
PROVISIONED - Disk is not part of a disk group and may be added to a disk group with the ALTER DISKGROUP statement. The PROVISIONED header status is different from the CANDIDATE header status in that PROVISIONED implies that an additional platform-specific action has been taken by an administrator to make the disk available for Oracle ASM.
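To pull out just the suspect disks, a quick filter on v$asm_disk is enough; a minimal sketch (the view and columns are standard, the predicate is just one way to slice it):

-- List every disk whose header no longer says MEMBER
SQL> select path, header_status, mount_status from v$asm_disk where header_status <> 'MEMBER';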
Having gotten this far, the next step is to inspect the ASM disk headers with kfed and see whether they have been damaged. For example:
kfed read /dev/asm-data
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483648 ; 0x008: disk=0
kfbh.check: 1267702279 ; 0x00c: 0x4b8f9a07
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISK ; 0x000: length=8
kfdhdb.driver.reserved[0]: 0 ; 0x008: 0x00000000
kfdhdb.driver.reserved[1]: 0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]: 0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000
kfdhdb.compat: 186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum: 0 ; 0x024: 0x0000
kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname: DATA_0000 ; 0x028: length=12
kfdhdb.grpname: DATA ; 0x048: length=7
kfdhdb.fgname: DATA_0000 ; 0x068: length=12
kfdhdb.capname: ; 0x088: length=0
kfdhdb.crestmp.hi: 33032846 ; 0x0a8: hour=0xe days=0x14 mnth=0x2 year=0x7e0
kfdhdb.crestmp.lo: 3868405760 ; 0x0ac: usec=0x0 msec=0xcc secs=0x29 mins=0x39
kfdhdb.mntstmp.hi: 33067153 ; 0x0b0: hour=0x11 days=0x4 mnth=0x4 year=0x7e2
kfdhdb.mntstmp.lo: 2590023680 ; 0x0b4: usec=0x0 msec=0x28 secs=0x26 mins=0x26
kfdhdb.secsize: 512 ; 0x0b8: 0x0200
kfdhdb.blksize: 4096 ; 0x0ba: 0x1000
kfdhdb.ausize: 1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact: 113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize: 2097152 ; 0x0c4: 0x00200000
kfdhdb.pmcnt: 20 ; 0x0c8: 0x00000014
kfdhdb.fstlocn: 1 ; 0x0cc: 0x00000001
kfdhdb.altlocn: 2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn: 2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]: 0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]: 0 ; 0x0da: 0x0000
kfdhdb.redomirrors[2]: 0 ; 0x0dc: 0x0000
kfdhdb.redomirrors[3]: 0 ; 0x0de: 0x0000
kfdhdb.dbcompat: 168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi: 33032846 ; 0x0e4: hour=0xe days=0x14 mnth=0x2 year=0x7e0
kfdhdb.grpstmp.lo: 3867648000 ; 0x0e8: usec=0x0 msec=0x1e8 secs=0x28 mins=0x39
kfdhdb.vfstart: 0 ; 0x0ec: 0x00000000
kfdhdb.vfend: 0 ; 0x0f0: 0x00000000
kfdhdb.spfile: 0 ; 0x0f4: 0x00000000
kfdhdb.spfflg: 0 ; 0x0f8: 0x00000000
kfdhdb.ub4spare[0]: 0 ; 0x0fc: 0x00000000
kfdhdb.ub4spare[1]: 0 ; 0x100: 0x00000000
kfdhdb.ub4spare[2]: 0 ; 0x104: 0x00000000
kfdhdb.ub4spare[3]: 0 ; 0x108: 0x00000000
kfdhdb.ub4spare[4]: 0 ; 0x10c: 0x00000000
kfdhdb.ub4spare[5]: 0 ; 0x110: 0x00000000
kfdhdb.ub4spare[34]: 0 ; 0x184: 0x00000000
kfdhdb.ub4spare[35]: 0 ; 0x188: 0x00000000
kfdhdb.ub4spare[36]: 0 ; 0x18c: 0x00000000
kfdhdb.ub4spare[37]: 0 ; 0x190: 0x00000000
kfdhdb.ub4spare[38]: 0 ; 0x194: 0x00000000
kfdhdb.ub4spare[39]: 4930648 ; 0x198: 0x004b3c58
kfdhdb.ub4spare[40]: 0 ; 0x19c: 0x00000000
kfdhdb.ub4spare[41]: 0 ; 0x1a0: 0x00000000
kfdhdb.acdb.aba.seq: 0 ; 0x1d4: 0x00000000
kfdhdb.acdb.aba.blk: 0 ; 0x1d8: 0x00000000
kfdhdb.acdb.ents: 0 ; 0x1dc: 0x0000
kfdhdb.acdb.ub2spare: 0 ; 0x1de: 0x0000
The header block type that kfed reports is normal, but kfdhdb.ub4spare[39] contains data where it should be 0. There are two known cases in which this field gets written:
Case #1] 0xAA55 on a little-endian server like Linux, or 0x55AA on a big-endian server like SUN SPARC, indicates the boot signature (or magic number) of an MBR (Master Boot Record) partition table.
Case #2] Local backup software (like Symantec image backup) touches the ASM disk header at kfdhdb.ub4spare[39] (as seen in kfed output).
This issue can happen outside ASM when some tool on the OS (or a human) puts partition information on the affected device.
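To see which case applies, one can dump the two bytes at offset 510 of each device (where an MBR boot signature would sit) and scan kfdhdb.ub4spare[39] across all headers; a minimal sketch, assuming the /dev/asm-* udev names used on this system:

for d in /dev/asm-*; do
  echo "== $d"
  # An MBR partition table ends with the signature 55 aa at offset 510
  dd if=$d bs=1 skip=510 count=2 2>/dev/null | od -An -tx1
  # A non-zero value here means something outside ASM wrote into the header's spare area
  kfed read $d | grep "ub4spare\[39\]"
done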
At this point the picture was fairly clear: the frontline team mentioned that the customer had just deployed a rapid-recovery disaster-recovery appliance, which contains a VM that mirrors the ASM disks one-to-one. Our guess is that when that mirror was initialized, it wrote partition information into the ASM disk headers.
5. Repairing the disk headers
To repair, we needed the original header contents of these disks. Fortunately, the backup copy of the disk header that ASM maintains automatically (a feature available only in 10.2.0.5 and later) turned out to be intact:
kfed read /dev/asm-data aun=1 blkn=254
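For the default 1 MB allocation unit, this backup copy of the header sits at block 254 of AU 1, which is what the aun=1 blkn=254 read above targets. Diffing the live header against the backup shows exactly which fields were clobbered; a minimal sketch, assuming /tmp is writable:

# Dump the live header (block 0) and the ASM backup copy, then compare them
kfed read /dev/asm-data blkn=0 > /tmp/asm-data.hdr.live
kfed read /dev/asm-data aun=1 blkn=254 > /tmp/asm-data.hdr.backup
diff /tmp/asm-data.hdr.live /tmp/asm-data.hdr.backup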
At this point, running the kfed repair command against each affected disk is enough to restore the headers from the backup copy.
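A minimal sketch of the repair step, assuming the four affected devices from the v$asm_disk output above and the default 1 MB AU size (for a non-default AU size, kfed repair needs an ausz=<bytes> argument):

for d in /dev/asm-data /dev/asm-vote2 /dev/asm-vote3 /dev/asm-arch; do
  # kfed repair rewrites block 0 from the backup header kept in AU 1
  kfed repair $d
done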
6. Cluster repair complete
We then restarted the cluster (crsctl stop crs followed by crsctl start crs), and ASM mounted the disk groups normally. Finally, we took fresh backups of the disk headers.
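For the header backups, saving the first allocation unit of every device plus a readable kfed dump is one simple approach; a minimal sketch, assuming a /backup directory and 1 MB AUs:

mkdir -p /backup/asm-headers
for d in /dev/asm-*; do
  # Keep a raw image of the first AU (which contains the disk header)
  dd if=$d of=/backup/asm-headers/$(basename $d).au0 bs=1M count=1
  # And a human-readable dump of the header block
  kfed read $d > /backup/asm-headers/$(basename $d).kfed
done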