客户在执行一条查询语句的时候,数据库报出ora-01578错误
报错截图如下:
通过错误可以看出,很明显是因为sql访问到的数据块(file#=7,block#=113906)损坏导致报错。
客户反馈出问题的是一个开发库,因该库多次处理过该库的问题,我对这个库的大概信息有所了解,该库没有备份,数据也不是很重要。
修复损坏的数据块,最好的方式是通过备份集,执行blockrecover来修复,数据不会丢失。或者dump该块的结构,通过bbed进行修复(技能要求较高)。
既然数据不是很重要,如果损坏的就这一个块,那么最快修复的方式是跳过损坏的数据块,然后重建表,一个数据块的数据损失对一个开发数据库来讲,几乎可以忽略不计。
与客户简单沟通后,同意该方案后开干。
(1)检查数据文件
通过backup validate检查数据文件,看是否还有其它block损坏,也可以通过dbv来检查。
rman> backup validate check logical datafile 7;
starting backup at 2022-04-14 16:33:38
using target database control file instead of recovery catalog
allocated channel: ora_disk_1
channel ora_disk_1: sid=1119 device type=disk
channel ora_disk_1: starting full datafile backup set
channel ora_disk_1: specifying datafile(s) in backup set
input datafile file number=00007 name=/u01/app/oracle/oradata/yktdb/gfmis_sys_data_2.dbf
channel ora_disk_1: backup set complete, elapsed time: 00:07:25
list of datafiles
=================
file status marked corrupt empty blocks blocks examined high scn
---- ------ -------------- ------------ --------------- ----------
7 ok 1 52893 4194302 18110743018595
file name: /u01/app/oracle/oradata/yktdb/gfmis_sys_data_2.dbf
block type blocks failing blocks processed
---------- -------------- ----------------
data 0 1399377
index 0 2569304
other 0 172728
finished backup at 2022-04-14 16:41:04
sql> select * from v$database_block_corruption;
file# block# blocks corruption_change# corruption_type
---------- ---------- ---------- ------------------ ---------------------------
7 113906 1 1.8111e 13 corrupt
可以看到该文件只有一个数据块损坏。
(2)确认对象
select owner,
segment_name,
partition_name,
segment_type,
tablespace_name,
file_id,
block_id,
blocks,
relative_fno
from dba_extents
where file_id = 7
and 113906 between block_id and block_id blocks - 1;
owner segment_name partition_name segment_type tablespace_name file_id block_id blocks relative_fno
--------------- -------------------- -------------------- -------------------- ------------------ ---------- ---------- ---------- ------------
projectlib glf_bal table gfmis_sys_data 7 113904 8 7
对象名与报错sql语句中要查的对象一致。
(3)跳过坏块
这一步,我们可以参考mos中提供的几种方法,任选一种来处理:
ora-1578 methods to skip block corruption (doc id 2199133.1)
我这里采用dbms_repair.skip_corrupt_blocks程序来进行跳过。
sql> col table_name for a15
sql> select owner,table_name,skip_corrupt from dba_tables where table_name='glf_bal';
owner table_name skip_corrupt
--------------- --------------- ------------------------
projectlib glf_bal disabled
sql> begin
dbms_repair.skip_corrupt_blocks
(
schema_name => 'projectlib',
object_name => 'glf_bal',
object_type => dbms_repair.table_object,
flags => dbms_repair.skip_flag
);
end;
/ 2 3 4 5 6 7 8 9 10
pl/sql procedure successfully completed.
执行后可以通过dba_tables里的skip_corrupt字段来验证。
sql> select owner,table_name,skip_corrupt from dba_tables where table_name='glf_bal';
owner table_name skip_corrupt
--------------- --------------- ------------------------
projectlib glf_bal enabled
官方文档中关于skip_corrupt字段的解释如下,一看就明白。
indicates whether oracle database ignores blocks marked corrupt during table and index scans (enabled) or raises an error (disabled). to enable this feature, run the dbms_repair.skip_corrupt_blocks procedure.
标记坏块后,查询之前报错sql句正常了。
但是观察alert log,发现还是在报错。
thu apr 14 17:17:08 2022
errors in file /u01/app/oracle/diag/rdbms/yktdb/yktdb/trace/yktdb_smon_22147.trc (incident=819024):
ora-01578: oracle data block corrupted (file # 7, block # 113906)
ora-01110: data file 7: '/u01/app/oracle/oradata/yktdb/gfmis_sys_data_2.dbf'
incident details in: /u01/app/oracle/diag/rdbms/yktdb/yktdb/incident/incdir_819024/yktdb_smon_22147_i819024.trc
oracle instance yktdb (pid = 15) - error 1578 encountered while recovering transaction (58, 11) on object 172902.
errors in file /u01/app/oracle/diag/rdbms/yktdb/yktdb/trace/yktdb_smon_22147.trc:
ora-01578: oracle data block corrupted (file # 7, block # 113906)
ora-01110: data file 7: '/u01/app/oracle/oradata/yktdb/gfmis_sys_data_2.dbf'
thu apr 14 17:17:09 2022
dumping diagnostic data in directory=[cdmp_20220414171709], requested by (instance=1, osid=22147 (smon)), summary=[incident=819024].
从alert log里的报错来看,错误是smon进程抛出来的,while recovering transaction表明正在进行事务恢复。查询可以跳过,但是事务恢复不能,符合常理。
不得不说mos实在太强大了,这个问题如下文档有完整的记录:
ora-01578 encountered while recovering transaction on object (doc id 2470377.1)
(1)为了避免smon导致数据库崩溃,设置_smu_debug_mode
sql> alter system set "_smu_debug_mode"=1024;
设置"_smu_debug_mode"=1024来停止事务恢复,该参数可以动态修改。
(2)导出表
[oracle@yth132 ~]$ exp \'\/ as sysdba\' tables=projectlib.glf_bal log=glf_bal.log
export: release 11.2.0.4.0 - production on fri apr 15 18:03:59 2022
米乐app官网下载 copyright (c) 1982, 2011, oracle and/or its affiliates. all rights reserved.
connected to: oracle database 11g enterprise edition release 11.2.0.4.0 - 64bit production
with the partitioning, olap, data mining and real application testing options
export done in al32utf8 character set and al16utf16 nchar character set
server uses zhs16gbk character set (possible charset conversion)
about to export specified tables via conventional path ...
current user changed to projectlib
. . exporting table glf_bal
export terminated successfully without warnings.
(3)删除表
sql> drop table projectlib.glf_bal;
table dropped.
(4)导入表
[oracle@yth132 ~]$ imp \'\/ as sysdba\' log=imp_glf_bal.log file=glf_bal.dmp ignore=y fromuser=projectlib touser=projectlib
import: release 11.2.0.4.0 - production on fri apr 15 18:24:37 2022
米乐app官网下载 copyright (c) 1982, 2011, oracle and/or its affiliates. all rights reserved.
connected to: oracle database 11g enterprise edition release 11.2.0.4.0 - 64bit production
with the partitioning, olap, data mining and real application testing options
export file created by export:v11.02.00 via conventional path
import done in al32utf8 character set and al16utf16 nchar character set
import server uses zhs16gbk character set (possible charset conversion)
. importing projectlib's objects into projectlib
. . importing table "glf_bal" 273155 rows imported
import terminated successfully without warnings.
(5)取消表上的skip corrupt标记
sql> begin
dbms_repair.skip_corrupt_blocks
(
schema_name => 'projectlib',
object_name => 'glf_bal',
flags => dbms_repair.noskip_flag
);
end; 2 3 4 5 6 7 8
9 /
pl/sql procedure successfully completed.
(6)重置_smu_debug_mode参数
sql> alter system reset "_smu_debug_mode";
system altered.
完毕后再观察alert log,没有再报错。