庖丁解牛，深入浅出无日志表

数据库原理

Word count: 1.6kReading time: 7 min

 2024/12/07 

前言

在 PostgreSQL 中，支持多种类型的表 — 临时表、普通表，以及无日志表，顾名思义，”无日志”其优势在于不用记录 WAL，那么写入速度自然也杠杠的，同理，无日志表在备库上没有数据 (只有一个壳)，也无法进行访问，会提示 ERROR: cannot access temporary or unlogged relations during recovery，pg_basebackup 的时候也会跳过无日志表 (除了 init 分支)。那么无日志表又有哪些鲜为人知的细节呢？

分析

首先，让我们思考一下，无日志表的数据会落盘吗？或者说，按照常识，执行正常检查点的时候会落盘吗？

However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.

根据官网解释，在崩溃或者非正常关闭的时候，表文件会被截断，也就是说，在正常关闭的情况下，数据是会正常落盘的，让我们验证一下：

postgres=# create unlogged table t1(id int,info text);
CREATE TABLE
postgres=# insert into t1 select n,md5(random()::text) from generate_series(1,10000) as n;
INSERT 0 10000
postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath 
----------------------
 base/5/51158
(1 row)

postgres=# checkpoint ;
CHECKPOINT
[postgres@mypg 5]$ ls -lrth 51158*
-rw------- 1 postgres postgres    0 Dec  7 15:04 51158_init
-rw------- 1 postgres postgres  24K Dec  7 15:04 51158_fsm
-rw------- 1 postgres postgres 672K Dec  7 15:04 51158
-rw------- 1 postgres postgres 8.0K Dec  7 15:04 51158_vm

[postgres@mypg 5]$ hexdump -C 51158
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000a8000

hexdump 会将相同的都是 0 的行标记为 *，用 vim 打开也确实全是 0。

0000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
....

让我们正常关闭，再观察一下

postgres=# \q
[postgres@mypg ~]$ pg_ctl -D 17data stop
waiting for server to shut down.... done
server stopped

[postgres@mypg 5]$ hexdump -C 51158 | head -n 10
00000000  00 00 00 00 00 00 00 00  00 00 04 00 f8 01 00 02  |................|
00000010  00 20 04 20 00 00 00 00  c0 9f 7a 00 80 9f 7a 00  |. . ......z...z.|
00000020  40 9f 7a 00 00 9f 7a 00  c0 9e 7a 00 80 9e 7a 00  |@.z...z...z...z.|
00000030  40 9e 7a 00 00 9e 7a 00  c0 9d 7a 00 80 9d 7a 00  |@.z...z...z...z.|
00000040  40 9d 7a 00 00 9d 7a 00  c0 9c 7a 00 80 9c 7a 00  |@.z...z...z...z.|
00000050  40 9c 7a 00 00 9c 7a 00  c0 9b 7a 00 80 9b 7a 00  |@.z...z...z...z.|
00000060  40 9b 7a 00 00 9b 7a 00  c0 9a 7a 00 80 9a 7a 00  |@.z...z...z...z.|
00000070  40 9a 7a 00 00 9a 7a 00  c0 99 7a 00 80 99 7a 00  |@.z...z...z...z.|
00000080  40 99 7a 00 00 99 7a 00  c0 98 7a 00 80 98 7a 00  |@.z...z...z...z.|
00000090  40 98 7a 00 00 98 7a 00  c0 97 7a 00 80 97 7a 00  |@.z...z...z...z.|

这次就可以很清楚地看到，数据都被刷盘了。那么这二者有什么区别？其实代码的注释就很清楚

/*
 * Unless this is a shutdown checkpoint or we have been explicitly told,
 * we write only permanent, dirty buffers.  But at shutdown or end of
 * recovery, we write all dirty buffers.
 */
if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
				CHECKPOINT_FLUSH_ALL))))
	mask |= BM_PERMANENT;

也就是说，如果 checkpoint 的时候，其 flags 不是 CHECKPOINT_IS_SHUTDOWN、CHECKPOINT_END_OF_RECOVERY 或者 CHECKPOINT_FLUSH_ALL，那么只会刷持久化的表。让我们验证一下：

(gdb) b CreateCheckPoint
Breakpoint 1 at 0x569430: file xlog.c, line 6827.
(gdb) signal SIGINT
Continuing with signal SIGINT.

Program received signal SIGINT, Interrupt.
0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) c
Continuing.

Breakpoint 1, CreateCheckPoint (flags=108) at xlog.c:6827
6827    xlog.c: No such file or directory.
(gdb) p/x flags
$1 = 0x6c
(gdb) p flags
$2 = 108

108 等于：

CHECKPOINT_REQUESTED (0x0040 = 64)
CHECKPOINT_WAIT (0x0020 = 32)
CHECKPOINT_FORCE (0x0008 = 8)
CHECKPOINT_IMMEDIATE (0x0004 = 4)

说明不带有 CHECKPOINT_IS_SHUTDOWN 或者 CHECKPOINT_FLUSH_ALL，对应到我们前面的现象，没有刷新 unlogged tables。再看看正常关机的情况 (注意此处需要用 SIGUSR2 的信号)：

(gdb) b CreateCheckPoint
Breakpoint 1 at 0x569430: file xlog.c, line 6827.
(gdb) signal SIGUSR2
Continuing with signal SIGUSR2.

Program received signal SIGUSR2, User defined signal 2.
0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) c
Continuing.

Breakpoint 1, CreateCheckPoint (flags=5) at xlog.c:6827
6827    xlog.c: No such file or directory.
(gdb) p/x flags
$1 = 0x5

flags = 5 表示 CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE，所以对应到我们前面的现象，正常关机会进行刷盘。

另外上面细心的读者可能发现了，执行正常的 checkpoint，其大小也变为了同样的大小，我没有看具体代码，猜测是类似于打开 wal_init_zero 参数，调用 pg_pwrite_zeros，直接调用 mdextend -> FileWrite

/*
 * mdextend() -- Add a block to the specified relation.
 *
 * The semantics are nearly the same as mdwrite(): write at the
 * specified position.  However, this is to be used for the case of
 * extending a relation (i.e., blocknum is at or beyond the current
 * EOF).  Note that we assume writing a block beyond current EOF
 * causes intervening file space to become filled with zeroes.
 */
void
mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
		 const void *buffer, bool skipFsync)
{
	off_t		seekpos;
	int			nbytes;
	MdfdVec    *v;

见微知著

简单浏览了一下代码，CHECKPOINT_FLUSH_ALL 一共在两处进行了调用

CreateDatabaseUsingFileCopy：Create a new database using the FILE_COPY strategy. FILE_COPY 可以参照官网
movedb：对应到 ALTER DATABASE SET TABLESPACE

所以，对应到这些场景的时候，包括 unlogged tables，也需要进行刷盘，注意潜在的性能影响。

/*
 * Perform a checkpoint --- either during shutdown, or on-the-fly
 *
 * flags is a bitwise OR of the following:
 *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
 *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
 *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
 *		ignoring checkpoint_completion_target parameter.
 *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
 *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
 *		CHECKPOINT_END_OF_RECOVERY).
 *	CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.

表的转换

还有需要注意的点在于，普通表和无日志表之间的转换

1
2
3

ALTER TABLE mytable SET UNLOGGED; -- cheap! 

ALTER TABLE mytable SET LOGGED; -- expensive!

也就是将无日志表转为普通表的时候，会需要写大量的 REDO，不难理解，不然崩溃了就无法恢复数据了。反之则很快，只需要修改一下元数据即可。因此，这种行为也可以进行发散，直接通过修改元数据的方式，避免将无日志表转为普通表的时候写入大量 WAL，但是有什么幺蛾子就不好说了，比如 PITR，都没有这个表的 WAL，那自然恢复出来的数据也不一致了。