深入浅出 VACUUM 内核原理(中)

VACUUM 原理

Word count: 3.8kReading time: 19 min

 2023/10/14 

前言

之前写了一篇深入浅出VACUUM内核原理(上)，赶巧最近有位读者和我探讨了一个问题，恰巧也和vacuum的有趣特性有关，借此再增加一篇深入浅出VACUUM内核原理(中)。那么是什么样的特性呢？没错，又是我们的老朋友——页剪枝。

现象

简单重温下页剪枝，有两种情况会进行剪枝，以删除在任何快照中不再可见的元组

之前的 UPDATE 操作没有找到足够的空间将新元组放入同一页面。
堆页中包含的数据多于 fillfactor 存储参数所允许的数据，比如 fillfactor 是 80，那么预留 20% 的空间用于更新操作，假如剩余的空间不足 20%，就会执行页剪枝，另外最低不能低于单个数据块大小的 10%

之前的文章已经反复提及，此处不再赘述细节。让我们先看个场景，样例来自《PostgreSQL 14 internal》。

postgres=# CREATE TABLE hot(id integer, s char(2000)) WITH (fillfactor = 75);
CREATE TABLE
postgres=# CREATE INDEX hot_id ON hot(id);
CREATE INDEX
postgres=# CREATE INDEX hot_s ON hot(s);
CREATE INDEX
postgres=# INSERT INTO hot VALUES (1, 'A');
INSERT 0 1
postgres=# UPDATE hot SET s = 'B';
UPDATE 1
postgres=# UPDATE hot SET s = 'C';
UPDATE 1
postgres=# UPDATE hot SET s = 'D';
UPDATE 1
postgres=# SELECT * FROM heap_page('hot',0);
 ctid  | state  |  xmin  |  xmax  
-------+--------+--------+--------
 (0,1) | normal | 1158 c | 1159 c
 (0,2) | normal | 1159 c | 1160 c
 (0,3) | normal | 1160 c | 1161
 (0,4) | normal | 1161   | 0 a
(4 rows)

由于 fillfactor 设置的为 75%，并且 s 字段是定长的 2000 字节，所以每个页面只能容纳 4 条元组，插入 3 条，还有 1 条留作更新。然后下次页面访问将触发页剪枝，删除所有不可见的元组

postgres=# UPDATE hot SET s = 'E';
UPDATE 1
postgres=# SELECT * FROM heap_page('hot',0);
 ctid  | state  |  xmin  | xmax 
-------+--------+--------+------
 (0,1) | dead   |        | 
 (0,2) | dead   |        | 
 (0,3) | dead   |        | 
 (0,4) | normal | 1161 c | 1164
 (0,5) | normal | 1164   | 0 a
(5 rows)

可以看到，前面 3 条元组已经被移除。通过这个例子，想必各位已经大致知晓了页剪枝的原理。

现在让我们变换一下

postgres=# truncate table hot;
TRUNCATE TABLE
postgres=# insert into hot values(1,'A');
INSERT 0 1
postgres=# insert into hot values(1,'B');
INSERT 0 1
postgres=# insert into hot values(1,'C');
INSERT 0 1
postgres=# update hot set s = 'D';
UPDATE 3
postgres=# SELECT * FROM heap_page('hot',0);
 ctid  | state  |  xmin  | xmax 
-------+--------+--------+------
 (0,1) | normal | 1187 c | 1190
 (0,2) | normal | 1188 c | 1190
 (0,3) | normal | 1189 c | 1190
 (0,4) | normal | 1190   | 0 a
(4 rows)

postgres=# SELECT * FROM heap_page('hot',1);
 ctid  | state  | xmin | xmax 
-------+--------+------+------
 (1,1) | normal | 1190 | 0 a
 (1,2) | normal | 1190 | 0 a
(2 rows)

postgres=# select * from pg_visibility_map('hot');
 blkno | all_visible | all_frozen 
-------+-------------+------------
     0 | f           | f
     1 | f           | f
(2 rows)

postgres=# explain analyze select * from hot;			---触发了页剪枝
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Seq Scan on hot  (cost=0.00..10.10 rows=10 width=8008) (actual time=0.026..0.031 rows=3 loops=1)
 Planning Time: 0.079 ms
 Execution Time: 0.048 ms
(3 rows)

postgres=# SELECT * FROM heap_page('hot',0);			---触发了页剪枝
 ctid  | state  |  xmin  | xmax 
-------+--------+--------+------
 (0,1) | dead   |        | 
 (0,2) | dead   |        | 
 (0,3) | dead   |        | 
 (0,4) | normal | 1175 c | 0 a
(4 rows)

当执行了查询之后，触发了页剪枝，前面 3 条死元组都被清理了。但是！你会发现此时 vm 文件中依旧不是 all_visible 的，因为页剪枝期间不会去更新 vm 和 fsm，指针还在，索引对其可能还有引用。

postgres=# select * from pg_visibility_map('hot');
 blkno | all_visible | all_frozen 
-------+-------------+------------
     0 | f           | f
     1 | f           | f
(2 rows)

所以这个时候，假如执行计划选择了 index only scan ，注意并不是真正的 index only scan，还是需要去回表判断可见性的，即我们看到的 Heap Fetchs，参照下例。至于为什么是 6，各位读者可以思考一下🤔，并且为什么再次执行之后变成了 3，各位也可以仔细思考一下。

postgres=# set enable_seqscan to off;
SET
postgres=# explain (analyze,buffers) select id from hot where id = 1;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using hot_id on hot  (cost=0.14..8.15 rows=1 width=4) (actual time=0.010..0.014 rows=3 loops=1)
   Index Cond: (id = 1)
   Heap Fetches: 6
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.115 ms
 Execution Time: 0.032 ms
(8 rows)

postgres=# explain (analyze,buffers) select id from hot where id = 1;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using hot_id on hot  (cost=0.14..8.15 rows=1 width=4) (actual time=0.019..0.022 rows=3 loops=1)
   Index Cond: (id = 1)
   Heap Fetches: 3
   Buffers: shared hit=3
 Planning Time: 0.077 ms
 Execution Time: 0.040 ms
(6 rows)

各位思考 5 分钟……

叮，答案在这 👉🏻 扫描前的索引状态

postgres=# SELECT * FROM index_page('hot_id',1);
 itemoffset | htid  | dead 
------------+-------+------
          1 | (0,1) | f
          2 | (0,2) | f
          3 | (0,3) | f
          4 | (0,4) | f
          5 | (1,1) | f
          6 | (1,2) | f
(6 rows)

扫描后的索引状态，点到为止，全部说明就失去了意思。

postgres=# SELECT * FROM index_page('hot_id',1);
 itemoffset | htid  | dead 
------------+-------+------
          1 | (0,1) | t
          2 | (0,2) | t
          3 | (0,3) | t
          4 | (0,4) | f
          5 | (1,1) | f
          6 | (1,2) | f
(6 rows)

当然，随着你执行了 vacuum，all_visible 就变成了 true，此时 Heap Fetchs 变成了 0，也就是真正意义上的”仅索引扫描”了，通过 vm 文件我已经知晓了数据块中所有元组均是可见的了，不需要再回表判断可见性了。这里其实也可以引申出一个 PostgreSQL 中常见的面试题——索引中是否包含可见性信息？

postgres=# vacuum hot;
VACUUM
postgres=# select * from pg_visibility_map('hot');
 blkno | all_visible | all_frozen 
-------+-------------+------------
     0 | t           | f
     1 | t           | f
(2 rows)

postgres=# explain (analyze,buffers) select id from hot where id = 1;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using hot_id on hot  (cost=0.13..4.15 rows=1 width=4) (actual time=0.012..0.013 rows=3 loops=1)
   Index Cond: (id = 1)
   Heap Fetches: 0
   Buffers: shared hit=2
 Planning:
   Buffers: shared hit=18
 Planning Time: 0.253 ms
 Execution Time: 0.037 ms
(8 rows)

index bypass

那么到此就完了吗？非也，让我们看个更加有趣的例子。

postgres=# drop table test;
DROP TABLE
postgres=# create table test(id int,info text);
CREATE TABLE
postgres=# alter table test set (autovacuum_enabled = off);
ALTER TABLE
postgres=# create index on test(id);
CREATE INDEX
postgres=# insert into test select n,'test' from generate_series(1,100000) as n;
INSERT 0 100000
postgres=# analyze test;
ANALYZE
postgres=# update test set info = 'hello' where id = 99;
UPDATE 1
postgres=# SELECT lp,lp_flags,t_xmin,t_xmax,t_ctid,t_data FROM heap_page_items(get_raw_page('test', 0)) limit 2 offset 98;
 lp  | lp_flags | t_xmin | t_xmax |  t_ctid   |        t_data        
-----+----------+--------+--------+-----------+----------------------
  99 |        1 |   1199 |   1201 | (540,101) | \x630000000b74657374
 100 |        1 |   1199 |      0 | (0,100)   | \x640000000b74657374
(2 rows)

postgres=# explain (analyze,buffers) select id from test where id = 99;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using test_id_idx on test  (cost=0.29..8.31 rows=1 width=4) (actual time=0.051..0.053 rows=1 loops=1)
   Index Cond: (id = 99)
   Heap Fetches: 2
   Buffers: shared hit=4
 Planning Time: 0.075 ms
 Execution Time: 0.070 ms
(6 rows)

postgres=# SELECT lp,lp_flags,t_xmin,t_xmax,t_ctid,t_data FROM heap_page_items(get_raw_page('test', 0)) limit 2 offset 98;
 lp  | lp_flags | t_xmin | t_xmax | t_ctid  |        t_data        
-----+----------+--------+--------+---------+----------------------
  99 |        3 |        |        |         | 
 100 |        1 |   1199 |      0 | (0,100) | \x640000000b74657374
(2 rows)

同样，经过了查询之后，也触发了页剪枝，注意观察 Heap Fetchs 值的变化。

postgres=# explain (analyze,buffers) select id from test where id = 99;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using test_id_idx on test  (cost=0.29..8.31 rows=1 width=4) (actual time=0.048..0.050 rows=1 loops=1)
   Index Cond: (id = 99)
   Heap Fetches: 2
   Buffers: shared hit=4
 Planning Time: 0.072 ms
 Execution Time: 0.066 ms
(6 rows)

postgres=# explain (analyze,buffers) select id from test where id = 99;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using test_id_idx on test  (cost=0.29..8.31 rows=1 width=4) (actual time=0.021..0.023 rows=1 loops=1)
   Index Cond: (id = 99)
   Heap Fetches: 1
   Buffers: shared hit=3
 Planning Time: 0.075 ms
 Execution Time: 0.039 ms
(6 rows)

那么让我们依葫芦画瓢，也执行一下 vacuum

postgres=# select * from pg_visibility_map('test',0);
 all_visible | all_frozen 
-------------+------------
 f           | f
(1 row)

postgres=# vacuum test;
VACUUM
postgres=# select * from pg_visibility_map('test',0);
 all_visible | all_frozen 
-------------+------------
 f           | f
(1 row)

纳尼！怎么 all_visible 还是 false？让我们打印 verbose 出来看一下

postgres=# vacuum verbose test;
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 541 remain, 2 scanned (0.37% of total)
tuples: 0 removed, 100000 remain, 0 are dead but not yet removable
removable cutoff: 1214, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan bypassed: 1 pages from table (0.18% of total) have 1 dead item identifiers
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 11 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  vacuuming "postgres.pg_toast.pg_toast_16707"
INFO:  finished vacuuming "postgres.pg_toast.pg_toast_16707": index scans: 0
pages: 0 removed, 0 remain, 0 scanned (100.00% of total)
tuples: 0 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 1214, which was 0 XIDs old when operation ended
frozen: 0 pages from table (100.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (100.00% of total) had 0 dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 3 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

各位以往可能没有注意这个显眼包

index scan bypassed: 1 pages from table (0.18% of total) have 1 dead item identifiers

index scan bypassed，顾名思义，跳过了索引扫描，有一个页面中含有一条死元组，但是被跳过了！为什么会这样呢？其实官网已经有了简单的介绍，关于 index_cleanup

Normally, VACUUM will skip index vacuuming when there are very few dead tuples in the table. The cost of processing all of the table’s indexes is expected to greatly exceed the benefit of removing dead index tuples when this happens. This option can be used to force VACUUM to process indexes when there are more than zero dead tuples. The default is AUTO, which allows VACUUM to skip index vacuuming when appropriate. If INDEX_CLEANUP is set to ON, VACUUM will conservatively remove all dead tuples from indexes. This may be useful for backwards compatibility with earlier releases of PostgreSQL where this was the standard behavior.

INDEX_CLEANUP can also be set to OFF to force VACUUM to always skip index vacuuming, even when there are many dead tuples in the table. This may be useful when it is necessary to make VACUUM run as quickly as possible to avoid imminent transaction ID wraparound (see Section 25.1.5). However, the wraparound failsafe mechanism controlled by vacuum_failsafe_age will generally trigger automatically to avoid transaction ID wraparound failure, and should be preferred. If index cleanup is not performed regularly, performance may suffer, because as the table is modified indexes will accumulate dead tuples and the table itself will accumulate dead line pointers that cannot be removed until index cleanup is completed.

This option has no effect for tables that have no index and is ignored if the FULL option is used. It also has no effect on the transaction ID wraparound failsafe mechanism. When triggered it will skip index vacuuming, even when INDEX_CLEANUP is set to ON.

注意这一段，Normally, VACUUM will skip index vacuuming when there are very few dead tuples in the table. ，通常情况下，当只有很少的死元组的时候，vacuum 会跳过索引的清理，发生这种情况时，处理所有表索引的成本预计将大大超过删除死索引元组的好处，当遇到即将事务回卷时，将该参数设为 off，将跳过索引的清理，加速事务的回收。默认是 auto，意味着在合适的时候，自动跳过。那么问题来了，什么时候是”合适”的时候？让我们分析下源码，代码也好找

		/*
		 * This crossover point at which we'll start to do index vacuuming is
		 * expressed as a percentage of the total number of heap pages in the
		 * table that are known to have at least one LP_DEAD item.  This is
		 * much more important than the total number of LP_DEAD items, since
		 * it's a proxy for the number of heap pages whose visibility map bits
		 * cannot be set on account of bypassing index and heap vacuuming.
		 *
		 * We apply one further precautionary test: the space currently used
		 * to store the TIDs (TIDs that now all point to LP_DEAD items) must
		 * not exceed 32MB.  This limits the risk that we will bypass index
		 * vacuuming again and again until eventually there is a VACUUM whose
		 * dead_items space is not CPU cache resident.
		 *
		 * We don't take any special steps to remember the LP_DEAD items (such
		 * as counting them in our final update to the stats system) when the
		 * optimization is applied.  Though the accounting used in analyze.c's
		 * acquire_sample_rows() will recognize the same LP_DEAD items as dead
		 * rows in its own stats report, that's okay. The discrepancy should
		 * be negligible.  If this optimization is ever expanded to cover more
		 * cases then this may need to be reconsidered.
		 */
		threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
		bypass = (vacrel->lpdead_item_pages < threshold &&
				  vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
	}

	if (bypass)
	{
		/*
		 * There are almost zero TIDs.  Behave as if there were precisely
		 * zero: bypass index vacuuming, but do index cleanup.
		 *
		 * We expect that the ongoing VACUUM operation will finish very
		 * quickly, so there is no point in considering speeding up as a
		 * failsafe against wraparound failure. (Index cleanup is expected to
		 * finish very quickly in cases where there were no ambulkdelete()
		 * calls.)
		 */
		vacrel->do_index_vacuuming = false;
	}

/*
 * Threshold that controls whether we bypass index vacuuming and heap
 * vacuuming as an optimization
 */
#define BYPASS_THRESHOLD_PAGES	0.02	/* i.e. 2% of rel_pages */

可以看到，默认的阈值是 2% 的总页面数量，当包含有 LP_DEAD 标记的页面数量小于 2% 的页面数量时，就会设置 bypass，设置之后，do_index_vacuuming 状态位就会设置为 false（一个 LP 才占 4 个字节，不清理影响也不大），最终进入到 failsafe 的逻辑中，failsafe 是 14 引入的特性，紧急情况下切换到全速 VACUUM 模式，同时忽略 vacuum_cost_delay，防止 xid wraparound。

if (vacrel->do_index_vacuuming)
{
	if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
		appendStringInfoString(&buf, _("index scan not needed: "));
	else
		appendStringInfoString(&buf, _("index scan needed: "));

	msgfmt = _("%u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
}
else
{
	if (!vacrel->failsafe_active)
		appendStringInfoString(&buf, _("index scan bypassed: "));
	else
		appendStringInfoString(&buf, _("index scan bypassed by failsafe: "));

	msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}

最终，就是我们日志中看到的

index scan bypassed: 1 pages from table (0.18% of total) have 1 dead item identifiers

所以，假如我们强制让其清理索引就不一样了，index only scan 变成了真正的 index only scan。

index scan needed: 1 pages from table (0.18% of total) had 1 dead item identifiers removed

postgres=# vacuum (verbose,index_cleanup true) test;
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 1
pages: 0 removed, 541 remain, 2 scanned (0.37% of total)
tuples: 0 removed, 100000 remain, 0 are dead but not yet removable
removable cutoff: 1214, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 1 pages from table (0.18% of total) had 1 dead item identifiers removed
index "test_id_idx": pages: 276 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.000 MB/s, avg write rate: 31.376 MB/s
buffer usage: 288 hits, 0 misses, 5 dirtied
WAL usage: 4 records, 4 full page images, 31197 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  vacuuming "postgres.pg_toast.pg_toast_16707"
INFO:  finished vacuuming "postgres.pg_toast.pg_toast_16707": index scans: 0
pages: 0 removed, 0 remain, 0 scanned (100.00% of total)
tuples: 0 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 1214, which was 0 XIDs old when operation ended
frozen: 0 pages from table (100.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (100.00% of total) had 0 dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 1 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

postgres=# select * from pg_visibility_map('test',0);
 all_visible | all_frozen 
-------------+------------
 t           | f
(1 row)

postgres=# explain (analyze,buffers) select id from test where id = 99;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using test_id_idx on test  (cost=0.29..4.31 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=1)
   Index Cond: (id = 99)
   Heap Fetches: 0
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.101 ms
 Execution Time: 0.026 ms
(8 rows)