直方图竟然是导致业务执行缓慢的元凶！ -m6米乐安卓版下载

原创王小那个鑫 all china database union 2023-07-12

726

一背景

7月10日，大概在下午5点左右，应用bpc告警针对行里的某系统出现了一个短暂交易成功率低的告警，本能的对数据库进行一个先行检查，发现确实存在一条sql执行缓慢的情况。同时对数据库的dbtime进行查看，发现异常时间点的dbtime比正常时候高了上百倍。

二排查过程

对数据库的dbtime进行查询，可以看到该数据库在平常时候的dbtime大概是在1.5min/h左右，但是在异常时候，dbtime达到了最高580min/h。说明在异常的一个小时中，确实有任务导致数据库的dbtime急剧增加。


      inst inst_name        begin interval time  end interval time          db time (min)
---------- ---------------- -------------------- -------------------- -------------------
         1 mbfedb1          07/11/2023 18:00:37  07/11/2023 19:00:03                 1.23
         1 mbfedb1          07/11/2023 17:00:34  07/11/2023 18:00:37                 1.25
         1 mbfedb1          07/11/2023 16:00:32  07/11/2023 17:00:34                 1.31
         1 mbfedb1          07/11/2023 15:00:30  07/11/2023 16:00:32                 1.29
         1 mbfedb1          07/11/2023 14:00:28  07/11/2023 15:00:30                 1.78
         1 mbfedb1          07/11/2023 13:00:26  07/11/2023 14:00:28                 1.28
         1 mbfedb1          07/11/2023 12:00:23  07/11/2023 13:00:26                 1.26
         1 mbfedb1          07/11/2023 11:00:20  07/11/2023 12:00:23                 1.30
         1 mbfedb1          07/11/2023 10:00:17  07/11/2023 11:00:20                 1.44
         1 mbfedb1          07/11/2023 09:00:13  07/11/2023 10:00:17                 1.41
         1 mbfedb1          07/11/2023 08:00:10  07/11/2023 09:00:13                 1.40
         1 mbfedb1          07/11/2023 07:00:07  07/11/2023 08:00:10                 1.27
         1 mbfedb1          07/11/2023 06:00:03  07/11/2023 07:00:07                 1.27
         1 mbfedb1          07/11/2023 05:00:01  07/11/2023 06:00:03                 1.28
         1 mbfedb1          07/11/2023 04:00:06  07/11/2023 05:00:01                 1.22
         1 mbfedb1          07/11/2023 03:00:03  07/11/2023 04:00:06                 1.29
         1 mbfedb1          07/11/2023 02:00:01  07/11/2023 03:00:03                 1.83
         1 mbfedb1          07/11/2023 01:00:43  07/11/2023 02:00:01                 1.28
         1 mbfedb1          07/11/2023 00:00:40  07/11/2023 01:00:43                 1.24
         1 mbfedb1          07/10/2023 23:00:37  07/11/2023 00:00:40                 1.24
         1 mbfedb1          07/10/2023 22:00:28  07/10/2023 23:00:37                 1.27
         1 mbfedb1          07/10/2023 21:00:25  07/10/2023 22:00:28                 1.25
         1 mbfedb1          07/10/2023 20:00:22  07/10/2023 21:00:25                 1.26
         1 mbfedb1          07/10/2023 19:00:13  07/10/2023 20:00:22                 1.26
         1 mbfedb1          07/10/2023 18:00:10  07/10/2023 19:00:13                 1.35
         1 mbfedb1          07/10/2023 17:00:04  07/10/2023 18:00:10               355.41
         1 mbfedb1          07/10/2023 16:00:01  07/10/2023 17:00:04               580.73
         1 mbfedb1          07/10/2023 15:00:12  07/10/2023 16:00:01               207.50
         1 mbfedb1          07/10/2023 14:00:09  07/10/2023 15:00:12                22.04
         1 mbfedb1          07/10/2023 13:00:05  07/10/2023 14:00:09                 5.05

通过对awr查询，我们发现了sqlid为bhyt8t15xy9c1的sql在过去的一小时内其执行的采样次数也很高。这可能说明两个问题：一是该sql执行频率极高，每次采样都能采样到该sql执行；二是该sql执行效率低，在采样时间内该大量的sql都没有执行完。而根据该系统的使用情况、历史awr情况以及dbtime的正常值情况，我们基本可以排除第一种该sql执行次数极高的情况。

sql_id          count(*)
------------- ----------
bhyt8t15xy9c1      32942
                     364
fdpk3nacn4xm0         32
4dh15ry33grtn         20
cqsmuuvmg9y0c         13
51xpvzuzzw8gj         12
gq4u3dppjkmyr          9
4h6a78z4rj8bf          9
aqunvmrz8dwy3          9

在初步确认了是该sql执行效率较低的方向后，我们则可以着重看一下该sql为什么会出现执行效率低的可能，通过对该sql当前的执行计划进行查看：

plan_table_output
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sql_id  bhyt8t15xy9c1, child number 1
-------------------------------------
select messagein0_.proccnt as col_0_0_ from mbfe.messagein messagein0_
where messagein0_.msgid=:1
plan hash value: 2678161237
-------------------------------------------------------------------------------
| id  | operation         | name      | rows  | bytes | cost (%cpu)| time     |
-------------------------------------------------------------------------------
|   0 | select statement  |           |       |       | 57522 (100)|          |
|*  1 |  table access full| messagein |  1371k|    71m| 57522   (1)| 00:11:31 |
-------------------------------------------------------------------------------
predicate information (identified by operation id):
---------------------------------------------------
   1 - filter("messagein0_"."msgid"=:1)

发现该sql构成非常简单，就是通过msgid条件对messagein表进行但表查询，而该sql慢的原因也非常简单，就是在messagein表上，msgid谓词条件只起到了过滤作用，而没起到数据快速定位的作用。
我们通过上述执行计划判断初步可以怀疑两个点：
1）messagein表可能是一张大表；
2）谓词条件msgid没有索引（或者条件中的可选择度极低）；

基于这两个怀疑的点，我们对该表的统计信息及索引情况进行查询：

表统计信息：


                                                                                     avg       avg                                        degree
owner                               tablespace           trans                       row sampl table    table      last        locked     pct_free
table_name                          name            part ini_max    temp   num_rows  len size  size     size kb    analyzed    olded      pct_used
----------------------------------- --------------- ---- ---------- ---- ---------- ---- ----- -------- ---------- ----------- ---------- ----------
mbfe:messagein                      ts_mbfe_tab     no   1:255      n       2033367  742 100%  1438mb   1721mb     23-07-08 16 .no        1:10:

表索引情况：

owner
table_name                                    tablespace                                                          dinsinct                          par
index_name                                    name                 status     index_type uniquenes  pct log  b        keys leaf_blocks     num_rows ti  post columnname
--------------------------------------------- -------------------- ---------- ---------- --------- ---- --- -- ----------- ----------- ------------ --- ---- --------------------
indexmsgin_1                                  ts_mbfe_tab          valid      normal     nonunique   10 yes  3     2029696       25297      2096619 no     1 msgid
messagein_pk                                  ts_mbfe_tab          valid      normal     unique      10 yes  2     2081296        7672      2081296 no     1 id

对该表的统计信息及索引情况进行查询后，我们可以发现几个重要信息：
1）该表的统计信息还是比较新，数据量也是比较准确的，数据量在200w左右；
2）在msgid列上是存在索引的；
3）msgid列的值其distinct值非常高，几乎等同于主键；

因此，这样分析下来，这条sql正确的走法应该是走索引indexmsgin_1的范围扫描，从大量的数据中获取少量的数据才对，但是为什么实际上的执行计划却是全表扫描呢？全表扫描的代价理论上一定会非常大的。

于是针对该索引进行测试，即走全表和加hint强制让该sql走索引，观察效率及执行计划：
1）走全表

select * from mbfe.messagein where msgid='id:414d5120514d454d424645202020202065603b6102799920';
execution plan
----------------------------------------------------------
plan hash value: 2678161237
-------------------------------------------------------------------------------
| id  | operation         | name      | rows  | bytes | cost (%cpu)| time     |
-------------------------------------------------------------------------------
|   0 | select statement  |           |   617k|   437m| 59740   (1)| 00:11:57 |
|*  1 |  table access full| messagein |   617k|   437m| 59740   (1)| 00:11:57 |
-------------------------------------------------------------------------------
predicate information (identified by operation id):
---------------------------------------------------
   1 - filter("msgid"='id:414d5120514d454d424645202020202065603b61027999
              20')
statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     219879  consistent gets
     117255  physical reads
          0  redo size
       1672  bytes sent via sql*net to client
        811  bytes received via sql*net from client
          4  sql*net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

2）强制索引

select /* index(a,indexmsgin_1)*/ * from mbfe.messagein a where a.msgid='id:414d5120514d454d424645202020202065603b6102799920'
execution plan
----------------------------------------------------------
plan hash value: 4204216909
--------------------------------------------------------------------------------------------
| id  | operation                   | name         | rows  | bytes | cost (%cpu)| time     |
--------------------------------------------------------------------------------------------
|   0 | select statement            |              |   617k|   437m|   592k  (1)| 01:58:26 |
|   1 |  table access by index rowid| messagein    |   617k|   437m|   592k  (1)| 01:58:26 |
|*  2 |   index range scan          | indexmsgin_1 |   637k|       |  7696   (1)| 00:01:33 |
--------------------------------------------------------------------------------------------
predicate information (identified by operation id):
---------------------------------------------------
   2 - access("a"."msgid"='id:414d5120514d454d424645202020202065603b6102799920')
statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          8  consistent gets
          2  physical reads
          0  redo size
       1672  bytes sent via sql*net to client
        811  bytes received via sql*net from client
          4  sql*net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

进行分析对比后，发现确实强制走索引的实际执行时间更短更高，执行过程中的各类信息更优(比如逻辑读和物理度，强制索引分别为8和2，而全表扫描则为21w和11w)。

但是从cbo所需判断的cost值上来看，走索引的cost值为592000，全表扫描的cost值为59740；也就是说走索引时的cost值要远大于全表扫描的cost值，因此cbo一定会选择以全表扫描的方式来执行该sql。

这种不合理的现象到底是为什么呢？为什么在这么明显的条件下，cbo会算出走索引的cost值要远大于走全表扫描的cost值呢？为了找到具体的原因，我们则需要借助oracle的10053的事件跟踪器来进行分析了。

三分析

通过10053事件跟踪，我们对该sql的执行计划cost计算值进行查看分析：

......
pm: considering predicate move-around in query block sel$1 (#0)
**************************
predicate move-around (pm)
**************************
optimizer information
******************************************
----- current sql statement for this session (sql_id=4dh15ry33grtn) -----
select messagein0_.proccnt as col_0_0_ from mbfe.messagein messagein0_ where messagein0_.msgid='id:414d5120514d454d424645202020202065603b6102799920'
*******************************************
legend
the following abbreviations are used by optimizer trace.
cbqt - cost-based query transformation
jppd - join predicate push-down
ojppd - old-style (non-cost-based) jppd
......
***************************************
base statistical information
***********************
table stats::
  table: messagein  alias: messagein0_
    #rows: 1943151  #blks:  212183  avgrowlen:  741.00  chaincnt:  0.00
index stats::
  index: indexmsgin_1  col#: 2
    lvls: 3  #lb: 23750  #dk: 1931136  lb/k: 1.00  db/k: 1.00  cluf: 1838851.00
  index: messagein_pk  col#: 1
    lvls: 2  #lb: 6848  #dk: 1939843  lb/k: 1.00  db/k: 1.00  cluf: 1210515.00
  index: sys_il0000145004c00006$$  col#:    (not analyzed)
    lvls: 1  #lb: 25  #dk: 100  lb/k: 1.00  db/k: 1.00  cluf: 800.00
access path analysis for messagein
***************************************
single table access path 
  single table cardinality estimation for messagein[messagein0_] 
  column (#2): 
    newdensity:0.146996, olddensity:0.000000 bktcnt:5609, popbktcnt:5609, popvalcnt:2, ndv:1931136
  column (#2): msgid(
    avglen: 52 ndv: 1931136 nulls: 0 density: 0.146996
    histogram: freq  #bkts: 2  uncompbkts: 5609  endptvals: 2
  table: messagein  alias: messagein0_
    card: original: 1943151.000000  rounded: 571097  computed: 571097.24  non adjusted: 571097.24
  access path: tablescan
    cost:  57521.17  resp: 57521.17  degree: 0
      cost_io: 57468.00  cost_cpu: 1961385604
      resp_io: 57468.00  resp_cpu: 1961385604
  access path: index (alleqrange)
    index: indexmsgin_1
    resc_io: 547428.00  resc_cpu: 4140263726
    ix_sel: 0.293903  ix_sel_with_filters: 0.293903 
    cost: 547540.24  resp: 547540.24  degree: 1
  best:: accesspath: tablescan
         cost: 57521.17  degree: 1  resp: 57521.17  card: 571097.24  bytes: 0
***************************************
......

从10053的trace中可以看到cbo对该表的访问方式同样也是两种，即tablescan和index (alleqrange)，但是，整体的cost计算下来，tablescan’s cost=57521.17 << index’s cost=547540.24，通过index方式的消耗，cbo主要预估在了resc_io上，也就是说通过索引的方式，cbo认为在io上会消耗大量的资源(通过hint的执行计划也可以发现，在回表上cbo预算的值非常大)。

此时我们可以注意到两个值：ix_sel和ix_sel_with_filters均为0.293903。

ix_sel的值为查询中参考到的所有被索引的字段的distinct值累乘，比如说如在查询中涉及到3个索引字段cola，colb，colc,则ix_sel：1/ndv(colaxcolbxcolc),其中ndv为列的唯一之数量。

因此，我们可以按照该公式计算下ix_sel列的值应该为：

ix_sel=1/1931136=0.00000051782992=5.1783e-07

而由于在该查询中没有其他的索引条件过滤，所以最终的ix_sel_with_filters应该等于ix_sel，即

ix_sel_with_filters=ix_sel=5.1783e-07

根据该值对成本进行计算

cost ~= (ix_sel: * #lb: )   (ix_sel_with_filters: * cluf: ) lvl cpu
     ~= (5.1783e-07 * 23750)   (5.1783e-07 * 1838851) 3 8
     ~= 0.0122984606 0.95221206622192 3 8
     ~=11

而按照10053给出的值对成本进行计算

cost ~= (ix_sel: * #lb: )   (ix_sel_with_filters: * cluf: ) lvl cpu
     ~= (0.293903 * 23750)   (0.293903 * 1838851) 3 8
     ~= 6980 540444 3 8
     ~=547435

很明显，根本原因找到了，即优化器对于索引过滤的成本出现了错误，原本的索引列选择度成本为5.1783e-07，却被评估为0.293903，最终导致成本被放大，执行计划选择错误。

那么更进一步分析，什么东西会影响索引列成本的评估呢？我们知道索引列的成本预估和列的可选择率、cardinality有非常紧密的关系，那么在oracle中排除bug的影响，也就只有直方图可能会对列的可选择率、cardinality的判断有影响了。

我们知道直方图统计信息主要是为了准确评估一些分布不均匀的列的可选择率、cardinality而被oracle引入的。当目标列上有了直方图统计信息后，cbo则不会认为该列的数据是均匀分布的，因此会使用该列上的直方图统计信息来计算该列的cardinality值。

通过对表中的索引列统计信息进行查询：

table_name                     column_name                    num_distinct    density num_buckets histogram
------------------------------ ------------------------------ ------------ ---------- ----------- ---------------
messagein                      id                                  2042111 4.8969e-07           1 none
messagein                      msgid                               2038144 2.3974e-07           2 frequency
messagein                      status                                    1          1           1 none
messagein                      proccnt                                   3 .333333333           1 none
messagein                      accepttime                          1645952 6.0755e-07           1 none
messagein                      msgbody                                   0          0           0 none
messagein                      queue                                     4        .25           1 none

可以看到此时统计信息列信息中确实该列存在直方图统计信息，且当histogram值为frequency时(存在直方图时)，density值为：2.3974e-07

其中，density值实际上可以近似看作是目标列在进行等值查询条件后的可选择率，那么在没有直方图的情况下，density ～= 1/num_distinct。当存在直方图的时候，density ～= 1/（2 * num_rows * null_adjust），其中null_adjust=(num_rows-null_rows)/num_rows。

按照该方式运算，当我们执行sql的时候，其cardinality的值应该为：2033367*2.3974e-07=0.4～=1。但实际情况却为(从全表扫描的执行计划中得到)：600k。因此可以发现，此时cbo没有使用选择使用density作为该列成本预估基数。

也就是说当该列中存在直方图统计信息的时候，成本预估并不是按照我们原本的方式进行，最终导致执行计划错误的原因找到后，我们则可以进行相应的处理。

四处理

整个问题的处理可以分为两个步骤：
1）清理该列的直方图统计信息：

begin
   dbms_stats.delete_column_stats(ownname=>'mbfe', tabname=>'messagein', colname=>'msgid', col_stat_type=>'histogram');
end;
/

2）从内存中删除原执行计划：

sql> select inst_id,sql_id, address, hash_value from gv$sqlarea v where v.sql_id='&sql_id';
enter value for sql_id: bhyt8t15xy9c1
 i sql_id             address          hash_value
-- ------------------ ---------------- ----------
 1 bhyt8t15xy9c1      000000025d2956c8 1272915329
exec dbms_shared_pool.purge('000000025d2956c8,1272915329','c');

进行处理后，查看最新统计信息：

table_name                     column_name                    num_distinct    density num_buckets histogram
------------------------------ ------------------------------ ------------ ---------- ----------- ---------------
messagein                      id                                  2033367 4.9180e-07           1 none
messagein                      msgid                               2029696 4.9268e-07           1 none
messagein                      status                                    1          1           1 none
messagein                      proccnt                                   3 .333333333           1 none
messagein                      accepttime                          1639552 6.0992e-07           1 none
messagein                      msgbody                                   0          0           0 none
messagein                      queue                                     4        .25           1 none

查看最新的执行计划：

****************************************************************************************
plan stat from ash
****************************************************************************************
sql_id  bhyt8t15xy9c1, child number 0
-------------------------------------
select messagein0_.proccnt as col_0_0_ from mbfe.messagein messagein0_
where messagein0_.msgid=:1
plan hash value: 4204216909
--------------------------------------------------------------------------------------------
| id  | operation                   | name         | rows  | bytes | cost (%cpu)| time     |
--------------------------------------------------------------------------------------------
|   0 | select statement            |              |       |       |     5 (100)|          |
|   1 |  table access by index rowid| messagein    |     1 |    55 |     5   (0)| 00:00:01 |
|*  2 |   index range scan          | indexmsgin_1 |     1 |       |     4   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
predicate information (identified by operation id):
---------------------------------------------------
2 - access("messagein0_"."msgid"=:1)
pl/sql procedure successfully completed.

生成最新的10053进行查看：

.......
***************************************
base statistical information
***********************
table stats::
  table: messagein  alias: messagein0_
    #rows: 2033367  #blks:  220367  avgrowlen:  742.00  chaincnt:  0.00
index stats::
  index: indexmsgin_1  col#: 2
    lvls: 3  #lb: 25297  #dk: 2029696  lb/k: 1.00  db/k: 1.00  cluf: 1922889.00
  index: messagein_pk  col#: 1
    lvls: 2  #lb: 7672  #dk: 2081296  lb/k: 1.00  db/k: 1.00  cluf: 1314367.00
  index: sys_il0000145004c00006$$  col#:    (not analyzed)
    lvls: 1  #lb: 25  #dk: 100  lb/k: 1.00  db/k: 1.00  cluf: 800.00
access path analysis for messagein
***************************************
single table access path 
  single table cardinality estimation for messagein[messagein0_] 
column (#2): msgid(
    avglen: 52 ndv: 2029696 nulls: 0 density: 0.000000
table: messagein  alias: messagein0_
    card: original: 2033367.000000  rounded: 1  computed: 1.00  non adjusted: 1.00
access path: tablescan
  cost:  59738.67  resp: 59738.67  degree: 0
    cost_io: 59684.00  cost_cpu: 2016671148
    resp_io: 59684.00  resp_cpu: 2016671148
access path: index (alleqrange)
  index: indexmsgin_1
  resc_io: 5.00  resc_cpu: 36427
  ix_sel: 0.000000  ix_sel_with_filters: 0.000000 
  cost: 5.00  resp: 5.00  degree: 1
best:: accesspath: indexrange
index: indexmsgin_1
         cost: 5.00  degree: 1  resp: 5.00  card: 1.00  bytes: 0
***************************************

五结论

直方图往往可以在列分布不均匀的情况下，为我们提供更好的执行计划选择。但是有几种情况，我们要避免直方图的产生，否则会由于cbo的算法评估给我们带来更差的执行计划：
1）条件列为主键的时候；
2）条件列为唯一键的时候；
3）条件列虽然不唯一，但是量大且分布比较均匀的时候(本次案例的场景)；

这几种情况都有一个共性，就是列的量大，且数据分布均匀，可以被用于主键、唯一键或者近似为唯一键。

附文中出现的相关查询sql

1）查询数据库dbtime:

select 
    i.instance_number inst,
    i.instance_name inst_name,
    to_char(s.begin_interval_time, 'mm/dd/yyyy hh24:mi:ss') begin_interval_time,
    to_char(s.end_interval_time, 'mm/dd/yyyy hh24:mi:ss') end_interval_time,
    round((e.value - b.value) / 1000000 / 60, 2) db_time
from dba_hist_snapshot s,
    v$instance i,
    dba_hist_sys_time_model e,
    dba_hist_sys_time_model b
where i.instance_number = s.instance_number
    and e.snap_id = s.snap_id
    and b.snap_id = s.snap_id - 1
    and e.stat_id = b.stat_id
    and e.instance_number = b.instance_number
    and e.instance_number = s.instance_number
    and e.stat_name = 'db time'
order by i.instance_name,
    s.snap_id desc ;

2）查询sql采样次数to10:

select * from (select sql_id,count(*)
  from v$active_session_history ash
 where to_char(ash.sample_time, 'yyyy-mm-dd hh24:mi"ss') between
       '2023-07-10 14:00:00' and '2023-07-10 17:00:00' group by sql_id order by 2 desc) where rownum < 10;

3）查询执行计划

select * from table(dbms_xplan.display_cursor('&sql_id'));

4）查看索引列统计信息

select table_name,column_name,num_distinct,density,num_buckets,histogram from dba_tab_col_statistics where table_name='table_name';

墨力计划

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

直方图竟然是导致业务执行缓慢的元凶！ -m6米乐安卓版下载

一 背景

二 排查过程

三 分析

四 处理

五 结论

附 文中出现的相关查询sql

评论

一背景

二排查过程

三分析

四处理

五结论

附文中出现的相关查询sql