PostgreSQL学徒

庖丁解牛,深入浅出无日志表

Word count: 1.6kReading time: 7 min
2024/12/07
loading

前言

在 PostgreSQL 中,支持多种类型的表 — 临时表、普通表,以及无日志表,顾名思义,”无日志”其优势在于不用记录 WAL,那么写入速度自然也杠杠的,同理,无日志表在备库上没有数据 (只有一个壳),也无法进行访问,会提示 ERROR: cannot access temporary or unlogged relations during recovery,pg_basebackup 的时候也会跳过无日志表 (除了 init 分支)。那么无日志表又有哪些鲜为人知的细节呢?

分析

首先,让我们思考一下,无日志表的数据会落盘吗?或者说,按照常识,执行正常检查点的时候会落盘吗?

However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.

根据官网解释,在崩溃或者非正常关闭的时候,表文件会被截断,也就是说,在正常关闭的情况下,数据是会正常落盘的,让我们验证一下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
postgres=# create unlogged table t1(id int,info text);
CREATE TABLE
postgres=# insert into t1 select n,md5(random()::text) from generate_series(1,10000) as n;
INSERT 0 10000
postgres=# select pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/5/51158
(1 row)

postgres=# checkpoint ;
CHECKPOINT
[postgres@mypg 5]$ ls -lrth 51158*
-rw------- 1 postgres postgres 0 Dec 7 15:04 51158_init
-rw------- 1 postgres postgres 24K Dec 7 15:04 51158_fsm
-rw------- 1 postgres postgres 672K Dec 7 15:04 51158
-rw------- 1 postgres postgres 8.0K Dec 7 15:04 51158_vm

[postgres@mypg 5]$ hexdump -C 51158
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000a8000

hexdump 会将相同的都是 0 的行标记为 *,用 vim 打开也确实全是 0。

1
2
3
4
0000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
....

让我们正常关闭,再观察一下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
postgres=# \q
[postgres@mypg ~]$ pg_ctl -D 17data stop
waiting for server to shut down.... done
server stopped

[postgres@mypg 5]$ hexdump -C 51158 | head -n 10
00000000 00 00 00 00 00 00 00 00 00 00 04 00 f8 01 00 02 |................|
00000010 00 20 04 20 00 00 00 00 c0 9f 7a 00 80 9f 7a 00 |. . ......z...z.|
00000020 40 9f 7a 00 00 9f 7a 00 c0 9e 7a 00 80 9e 7a 00 |@.z...z...z...z.|
00000030 40 9e 7a 00 00 9e 7a 00 c0 9d 7a 00 80 9d 7a 00 |@.z...z...z...z.|
00000040 40 9d 7a 00 00 9d 7a 00 c0 9c 7a 00 80 9c 7a 00 |@.z...z...z...z.|
00000050 40 9c 7a 00 00 9c 7a 00 c0 9b 7a 00 80 9b 7a 00 |@.z...z...z...z.|
00000060 40 9b 7a 00 00 9b 7a 00 c0 9a 7a 00 80 9a 7a 00 |@.z...z...z...z.|
00000070 40 9a 7a 00 00 9a 7a 00 c0 99 7a 00 80 99 7a 00 |@.z...z...z...z.|
00000080 40 99 7a 00 00 99 7a 00 c0 98 7a 00 80 98 7a 00 |@.z...z...z...z.|
00000090 40 98 7a 00 00 98 7a 00 c0 97 7a 00 80 97 7a 00 |@.z...z...z...z.|

这次就可以很清楚地看到,数据都被刷盘了。那么这二者有什么区别?其实代码的注释就很清楚

1
2
3
4
5
6
7
8
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
* we write only permanent, dirty buffers. But at shutdown or end of
* recovery, we write all dirty buffers.
*/
if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FLUSH_ALL))))
mask |= BM_PERMANENT;

也就是说,如果 checkpoint 的时候,其 flags 不是 CHECKPOINT_IS_SHUTDOWNCHECKPOINT_END_OF_RECOVERY 或者 CHECKPOINT_FLUSH_ALL,那么只会刷持久化的表。让我们验证一下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
(gdb) b CreateCheckPoint
Breakpoint 1 at 0x569430: file xlog.c, line 6827.
(gdb) signal SIGINT
Continuing with signal SIGINT.

Program received signal SIGINT, Interrupt.
0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) c
Continuing.

Breakpoint 1, CreateCheckPoint (flags=108) at xlog.c:6827
6827 xlog.c: No such file or directory.
(gdb) p/x flags
$1 = 0x6c
(gdb) p flags
$2 = 108

108 等于:

  • CHECKPOINT_REQUESTED (0x0040 = 64)
  • CHECKPOINT_WAIT (0x0020 = 32)
  • CHECKPOINT_FORCE (0x0008 = 8)
  • CHECKPOINT_IMMEDIATE (0x0004 = 4)

说明不带有 CHECKPOINT_IS_SHUTDOWN 或者 CHECKPOINT_FLUSH_ALL,对应到我们前面的现象,没有刷新 unlogged tables。再看看正常关机的情况 (注意此处需要用 SIGUSR2 的信号):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
(gdb) b CreateCheckPoint
Breakpoint 1 at 0x569430: file xlog.c, line 6827.
(gdb) signal SIGUSR2
Continuing with signal SIGUSR2.

Program received signal SIGUSR2, User defined signal 2.
0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) c
Continuing.

Breakpoint 1, CreateCheckPoint (flags=5) at xlog.c:6827
6827 xlog.c: No such file or directory.
(gdb) p/x flags
$1 = 0x5

flags = 5 表示 CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE,所以对应到我们前面的现象,正常关机会进行刷盘。

另外上面细心的读者可能发现了,执行正常的 checkpoint,其大小也变为了同样的大小,我没有看具体代码,猜测是类似于打开 wal_init_zero 参数,调用 pg_pwrite_zeros,直接调用 mdextend -> FileWrite

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
/*
* mdextend() -- Add a block to the specified relation.
*
* The semantics are nearly the same as mdwrite(): write at the
* specified position. However, this is to be used for the case of
* extending a relation (i.e., blocknum is at or beyond the current
* EOF). Note that we assume writing a block beyond current EOF
* causes intervening file space to become filled with zeroes.
*/
void
mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
const void *buffer, bool skipFsync)
{
off_t seekpos;
int nbytes;
MdfdVec *v;

见微知著

简单浏览了一下代码,CHECKPOINT_FLUSH_ALL 一共在两处进行了调用

  • CreateDatabaseUsingFileCopy:Create a new database using the FILE_COPY strategy. FILE_COPY 可以参照官网

  • movedb:对应到 ALTER DATABASE SET TABLESPACE

所以,对应到这些场景的时候,包括 unlogged tables,也需要进行刷盘,注意潜在的性能影响。

1
2
3
4
5
6
7
8
9
10
11
12
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
* flags is a bitwise OR of the following:
* CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
* CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
* CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
* ignoring checkpoint_completion_target parameter.
* CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
* since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
* CHECKPOINT_END_OF_RECOVERY).
* CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.

表的转换

还有需要注意的点在于,普通表和无日志表之间的转换

1
2
3
ALTER TABLE mytable SET UNLOGGED; -- cheap! 

ALTER TABLE mytable SET LOGGED; -- expensive!

也就是将无日志表转为普通表的时候,会需要写大量的 REDO,不难理解,不然崩溃了就无法恢复数据了。反之则很快,只需要修改一下元数据即可。因此,这种行为也可以进行发散,直接通过修改元数据的方式,避免将无日志表转为普通表的时候写入大量 WAL,但是有什么幺蛾子就不好说了,比如 PITR,都没有这个表的 WAL,那自然恢复出来的数据也不一致了。

块结构

还有一点值得注意的细节是,块上面的 LSN👇🏻

PostgreSQL 中的已记录和未记录表头

无日志表的 LSN 为 0,这意味着无日志表由其自己的数据文件中的数据表示,但不由日志文件中的数据表示。

小结

最后就用一张引用的图做总结吧!

未记录表

参考

https://www.crunchydata.com/blog/postgresl-unlogged-tables

https://levelup.gitconnected.com/logged-unlogged-and-temporary-tables-in-postgresql-d390d9a4ef15

https://blog.japinli.top/2022/11/postgresql-debug-checkpointer/

https://github.com/digoal/blog/blob/master/202405/20240510_03.md

https://blog.csdn.net/baidu_41642080/article/details/132835382

https://www.modb.pro/db/11892

CATALOG
  1. 1. 前言
  2. 2. 分析
  3. 3. 见微知著
  4. 4. 表的转换
  5. 5. 块结构
  6. 6. 小结
  7. 7. 参考