自省 自行 自醒

从一个小案例,看 PostgreSQL 日臻完善的可观测性能力

Word count: 1.5kReading time: 6 min
2024/08/05
loading

前言

前阵子,一位读者朋友咨询了我这样一个问题

我用的 pg_stat_statments 视图,里面比如那个 shared_blks_read 字段,意思应该是一条 SQL 的内存读到的数据块数,然后我数据库一个块 8kb,算了算他这个值怎么比我的数据库的 shared_buffers 参数还大,我的 shared buffers 是 262144 * 8kb = 2GB

简而言之,就是 pg_stat_statments 视图中的 shared_blks_read 远大于 shared_buffers。分析一下这个问题 ~

pg_stat_statments

关于 pg_stat_statments 的案例,之前也曾写过一篇 👉🏻 pg_stat_statements 的有趣案例,不过那次案例是由于应用传入了大量绑定变量,导致查询的时候,巨慢无比。pg_stat_statments 视图中有许多有价值的指标,并且还在不断完善中

以 17 为例,除了常规的 buffer、WAL 等指标,还加上了 JIT 相关的指标。其中关于 local、temp 和 shared 相关字段,需要说明一下:

1
2
3
4
5
6
postgres=# \o | egrep "blk_write_time|blk_read_time"
postgres=# \d pg_stat_statements
blk_read_time | double precision | | |
blk_write_time | double precision | | |
temp_blk_read_time | double precision | | |
temp_blk_write_time | double precision | | |

这三个指标分别对应了不同的对象,其实 local 对应的才是临时表,不要搞混了。

  1. Shared blocks contain data from regular tables and indexes,shared 对应常规的表和索引
  2. local blocks contain data from temporary tables and indexes,local 对应的是临时表及其索引
  3. while temp blocks contain short-term working data used in sorts, hashes, Materialize plan nodes, and similar cases,temp 对应的是排序、哈希等会用到的临时数据块

回到此例,shared_blks_read 远大于 shared_buffers,第一反应,可能是这个值累计的原因,让我们验证一下:

1
2
3
4
5
6
7
8
9
10
11
12
13
postgres=# create table t1(id int);
CREATE TABLE
postgres=# insert into t1 values(generate_series(1,10000));
INSERT 0 10000
postgres=# explain analyze select pg_buffercache_evict(bufferid) from pg_buffercache where relfilenode = 't1'::regclass;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_buffercache_pages p (cost=0.00..12.51 rows=5 width=1) (actual time=81.836..120.876 rows=49 loops=1)
Filter: (relfilenode = '16775'::oid)
Rows Removed by Filter: 166351
Planning Time: 0.090 ms
Execution Time: 123.014 ms
(5 rows)

为了观察 read,插完数据之后做一下缓冲逐出。然后我们就可以观察到效果了

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
postgres=# explain (analyze,buffers) select * from t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=4) (actual time=0.046..1.532 rows=10000 loops=1)
Buffers: shared read=45
Planning Time: 0.039 ms
Execution Time: 2.259 ms
(4 rows)

postgres=# select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1';
-[ RECORD 1 ]----+-------------------------------------------
query | explain (analyze,buffers) select * from t1
calls | 1
shared_blks_hit | 0
shared_blks_read | 45

postgres=# explain (analyze,buffers) select * from t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=4) (actual time=0.021..1.546 rows=10000 loops=1)
Buffers: shared hit=45
Planning Time: 0.044 ms
Execution Time: 2.376 ms
(4 rows)

postgres=# select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1';
-[ RECORD 1 ]----+-------------------------------------------
query | explain (analyze,buffers) select * from t1
calls | 2
shared_blks_hit | 45
shared_blks_read | 45

postgres=# explain (analyze,buffers) select * from t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=4) (actual time=0.018..1.302 rows=10000 loops=1)
Buffers: shared hit=45
Planning Time: 0.037 ms
Execution Time: 1.971 ms
(4 rows)

postgres=# select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1';
-[ RECORD 1 ]----+-------------------------------------------
query | explain (analyze,buffers) select * from t1
calls | 3
shared_blks_hit | 90
shared_blks_read | 45

结果很清晰,这个值是累计值,和 explain + buffers 观察到的结果一致。那么我就很自然地回复了他:”因为这个值是累计值,运行久了自然就超过了 shared_buffers。”,但是这位读者反馈

我拿到的 topsql 单个 SQL 执行了 3 次,我看的shared blks hit 64069336,shared blks read 6185765

也就是说,除以 3 次,shared_blks_read 的值还是远大于 shared_buffers。因此,说明还有其他原因。到这里,想必不少读者已经猜到了原因,举个栗子:如果一个表 10 GB,shared_buffers 就 1GB,全表查询会如何?没错,自然是缓冲逐出了。并且为了预防缓冲区被污染,PostgreSQL 还有一个 ring buffer 的机制:

  1. 当大表的大小超过了缓冲区缓存的 1/4 时便会使用。环形缓冲区占用 256 kB (32 个标准页面)。此策略不允许将脏页写入磁盘以释放缓冲区;相反,缓冲区从环中排除,并被另一个取代。因此,读取不必等待写入完成,因此执行得更快。

  2. COPY FROM、CREATE TABLE AS SELECT 和 CREATE MATERIALIZED VIEW 命令,以及那些导致表重写的 ALTER TABLE 变体。默认大小是 16MB (2048 个标准页面),但不会超过 shared buffers 总大小的 1/8。

  3. VACUUM 在执行全表扫描而不考虑可见性映射时使用。默认分配了 256 kB 的内存 (32 个标准页面),在 17 中进一步提升到了 2MB。

让我们验证一下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
postgres=# insert into t1 values(generate_series(1,10000000));
INSERT 0 10000000
postgres=# \dt+ t1
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+----------+-------------+---------------+--------+-------------
public | t1 | table | postgres | permanent | heap | 346 MB |
(1 row)

postgres=# show shared_buffers ;
shared_buffers
----------------
128MB
(1 row)

现在表大小已经超过了 shared_buffers,然后执行一个全表扫描:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
postgres=# explain (analyze,buffers) select * from t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..144248.48 rows=10000048 width=4) (actual time=0.048..1442.240 rows=10000000 loops=1)
Buffers: shared hit=16232 read=28016
Planning Time: 0.047 ms
Execution Time: 2166.871 ms
(4 rows)

postgres=# select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1';
-[ RECORD 1 ]----+-------------------------------------------
query | explain (analyze,buffers) select * from t1
calls | 1
shared_blks_hit | 16232
shared_blks_read | 28016

postgres=# select name,unit,setting from pg_settings where name = 'shared_buffers';
-[ RECORD 1 ]-----------
name | shared_buffers
unit | 8kB
setting | 16384

很明显,shared_blks_read 远大于 shared_buffers,说明发生了置换,怎么验证呢?那就要再次提一下 pg_stat_io 了,大杀器,使得 PostgreSQL 的可观测性能力更上一层楼。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
postgres=# select * from pg_stat_io where object = 'relation' and context = 'bulkread' and backend_type = 'client backend'; 
-[ RECORD 1 ]--+------------------------------
backend_type | client backend
object | relation
context | bulkread
reads | 28016
read_time | 0
writes | 0
write_time | 0
writebacks | 0
writeback_time | 0
extends |
extend_time |
op_bytes | 8192
hits | 16232
evictions | 32
reuses | 27984
fsyncs |
fsync_time |
stats_reset | 2024-08-05 23:07:13.689509+08

evictions: 32,一目了然。

小结

这么一个小案例中,蕴藏了大量的知识点:

  1. ring buffer,另外还有一个鲜为人知的点,TOAST 是会绕过 ring buffer 的。
  2. pg_stat_io 学会利用起来,可以帮助我们理解底层原理。

为 PostgreSQL 日益完善的可观测性能力点赞 👍🏻

参考

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=dc9f8a798307244d791b71f666f50de319799e7c

https://pgpedia.info/p/pg_stat_statements.html

CATALOG
  1. 1. 前言
  2. 2. pg_stat_statments
  3. 3. 小结
  4. 4. 参考