前言 前阵子,一位读者朋友咨询了我这样一个问题
我用的 pg_stat_statments 视图,里面比如那个 shared_blks_read 字段,意思应该是一条 SQL 的内存读到的数据块数,然后我数据库一个块 8kb,算了算他这个值怎么比我的数据库的 shared_buffers 参数还大,我的 shared buffers 是 262144 * 8kb = 2GB
简而言之,就是 pg_stat_statments 视图中的 shared_blks_read 远大于 shared_buffers。分析一下这个问题 ~
pg_stat_statments 关于 pg_stat_statments 的案例,之前也曾写过一篇 👉🏻 pg_stat_statements 的有趣案例 ,不过那次案例是由于应用传入了大量绑定变量,导致查询的时候,巨慢无比。pg_stat_statments 视图中有许多有价值的指标,并且还在不断完善中
以 17 为例,除了常规的 buffer、WAL 等指标,还加上了 JIT 相关的指标。其中关于 local、temp 和 shared 相关字段,需要说明一下:
1 2 3 4 5 6 postgres= # \o | egrep "blk_write_time|blk_read_time" postgres= # \d pg_stat_statements blk_read_time | double precision | | | blk_write_time | double precision | | | temp_blk_read_time | double precision | | | temp_blk_write_time | double precision | | |
这三个指标分别对应了不同的对象,其实 local 对应的才是临时表,不要搞混了。
Shared blocks contain data from regular tables and indexes,shared 对应常规的表和索引
local blocks contain data from temporary tables and indexes,local 对应的是临时表及其索引
while temp blocks contain short-term working data used in sorts, hashes, Materialize plan nodes, and similar cases,temp 对应的是排序、哈希等会用到的临时数据块
回到此例,shared_blks_read 远大于 shared_buffers,第一反应,可能是这个值累计的原因,让我们验证一下:
1 2 3 4 5 6 7 8 9 10 11 12 13 postgres= # create table t1(id int ); CREATE TABLE postgres= # insert into t1 values (generate_series(1 ,10000 )); INSERT 0 10000 postgres= # explain analyze select pg_buffercache_evict(bufferid) from pg_buffercache where relfilenode = 't1' ::regclass; QUERY PLAN Function Scan on pg_buffercache_pages p (cost= 0.00 ..12 .51 rows = 5 width= 1 ) (actual time = 81.836 ..120 .876 rows = 49 loops= 1 ) Filter : (relfilenode = '16775' ::oid) Rows Removed by Filter : 166351 Planning Time : 0.090 ms Execution Time : 123.014 ms (5 rows )
为了观察 read,插完数据之后做一下缓冲逐出。然后我们就可以观察到效果了
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 postgres= # explain (analyze,buffers) select * from t1; QUERY PLAN Seq Scan on t1 (cost= 0.00 ..145 .00 rows = 10000 width= 4 ) (actual time = 0.046 ..1 .532 rows = 10000 loops= 1 ) Buffers: shared read= 45 Planning Time : 0.039 ms Execution Time : 2.259 ms (4 rows ) postgres= # select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1' ; - [ RECORD 1 ]query | explain (analyze,buffers) select * from t1 calls | 1 shared_blks_hit | 0 shared_blks_read | 45 postgres= # explain (analyze,buffers) select * from t1; QUERY PLAN Seq Scan on t1 (cost= 0.00 ..145 .00 rows = 10000 width= 4 ) (actual time = 0.021 ..1 .546 rows = 10000 loops= 1 ) Buffers: shared hit= 45 Planning Time : 0.044 ms Execution Time : 2.376 ms (4 rows ) postgres= # select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1' ; - [ RECORD 1 ]query | explain (analyze,buffers) select * from t1 calls | 2 shared_blks_hit | 45 shared_blks_read | 45 postgres= # explain (analyze,buffers) select * from t1; QUERY PLAN Seq Scan on t1 (cost= 0.00 ..145 .00 rows = 10000 width= 4 ) (actual time = 0.018 ..1 .302 rows = 10000 loops= 1 ) Buffers: shared hit= 45 Planning Time : 0.037 ms Execution Time : 1.971 ms (4 rows ) postgres= # select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1' ; - [ RECORD 1 ]query | explain (analyze,buffers) select * from t1 calls | 3 shared_blks_hit | 90 shared_blks_read | 45
结果很清晰,这个值是累计值,和 explain + buffers 观察到的结果一致。那么我就很自然地回复了他:”因为这个值是累计值,运行久了自然就超过了 shared_buffers。”,但是这位读者反馈
我拿到的 topsql 单个 SQL 执行了 3 次,我看的shared blks hit 64069336,shared blks read 6185765
也就是说,除以 3 次,shared_blks_read 的值还是远大于 shared_buffers。因此,说明还有其他原因。到这里,想必不少读者已经猜到了原因,举个栗子:如果一个表 10 GB,shared_buffers 就 1GB,全表查询会如何?没错,自然是缓冲逐出了。并且为了预防缓冲区被污染,PostgreSQL 还有一个 ring buffer 的机制:
当大表的大小超过了缓冲区缓存的 1/4 时便会使用。环形缓冲区占用 256 kB (32 个标准页面)。此策略不允许将脏页写入磁盘以释放缓冲区;相反,缓冲区从环中排除,并被另一个取代。因此,读取不必等待写入完成,因此执行得更快。
COPY FROM、CREATE TABLE AS SELECT 和 CREATE MATERIALIZED VIEW 命令,以及那些导致表重写的 ALTER TABLE 变体。默认大小是 16MB (2048 个标准页面),但不会超过 shared buffers 总大小的 1/8。
VACUUM 在执行全表扫描而不考虑可见性映射时使用。默认分配了 256 kB 的内存 (32 个标准页面),在 17 中进一步提升到了 2MB。
让我们验证一下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 postgres= # insert into t1 values (generate_series(1 ,10000000 )); INSERT 0 10000000 postgres= # \dt+ t1 List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description public | t1 | table | postgres | permanent | heap | 346 MB | (1 row ) postgres= # show shared_buffers ; shared_buffers 128 MB (1 row )
现在表大小已经超过了 shared_buffers,然后执行一个全表扫描:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 postgres= # explain (analyze,buffers) select * from t1; QUERY PLAN Seq Scan on t1 (cost= 0.00 ..144248 .48 rows = 10000048 width= 4 ) (actual time = 0.048 ..1442 .240 rows = 10000000 loops= 1 ) Buffers: shared hit= 16232 read= 28016 Planning Time : 0.047 ms Execution Time : 2166.871 ms (4 rows ) postgres= # select query,calls,shared_blks_hit,shared_blks_read from pg_stat_statements where query like '%t1' ; - [ RECORD 1 ]query | explain (analyze,buffers) select * from t1 calls | 1 shared_blks_hit | 16232 shared_blks_read | 28016 postgres= # select name,unit,setting from pg_settings where name = 'shared_buffers' ; - [ RECORD 1 ]name | shared_buffers unit | 8 kB setting | 16384
很明显,shared_blks_read 远大于 shared_buffers,说明发生了置换,怎么验证呢?那就要再次提一下 pg_stat_io 了,大杀器,使得 PostgreSQL 的可观测性能力更上一层楼。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 postgres= # select * from pg_stat_io where object = 'relation' and context = 'bulkread' and backend_type = 'client backend' ; - [ RECORD 1 ]backend_type | client backend object | relation context | bulkread reads | 28016 read_time | 0 writes | 0 write_time | 0 writebacks | 0 writeback_time | 0 extends | extend_time | op_bytes | 8192 hits | 16232 evictions | 32 reuses | 27984 fsyncs | fsync_time | stats_reset | 2024 -08 -05 23 :07 :13.689509 + 08
evictions: 32,一目了然。
小结 这么一个小案例中,蕴藏了大量的知识点:
ring buffer,另外还有一个鲜为人知的点,TOAST 是会绕过 ring buffer 的。
pg_stat_io 学会利用起来,可以帮助我们理解底层原理。
为 PostgreSQL 日益完善的可观测性能力点赞 👍🏻
参考 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=dc9f8a798307244d791b71f666f50de319799e7c
https://pgpedia.info/p/pg_stat_statements.html