Taking the PostgreSQL 16 code as an example, the TRUNCATE logic is easy to find: DropRelationsAllBuffers. The overall code path is fairly long, so let's walk through it step by step. The header comment states its job plainly: remove from the buffer pool all pages of all forks of the specified relations.
/* ---------------------------------------------------------------------
 *      DropRelationsAllBuffers
 *
 *      This function removes from the buffer pool all the pages of all
 *      forks of the specified relations. It's equivalent to calling
 *      DropRelationBuffers once per fork per relation with firstDelBlock = 0.
 * --------------------------------------------------------------------
 */
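Before the individual excerpts, here is the function's signature and local setup, condensed from src/backend/storage/buffer/bufmgr.c (PostgreSQL 16); these are the variables (smgr_reln, nlocators, rels, n, block, cached, and so on) that the snippets below keep referring to:

void
DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
    int         i;
    int         n = 0;                          /* number of non-local relations */
    SMgrRelation *rels;
    BlockNumber (*block)[MAX_FORKNUM + 1] = NULL;   /* cached fork sizes */
    uint64      nBlocksToInvalidate = 0;
    RelFileLocator *locators;
    bool        cached = true;
    bool        use_bsearch;

    if (nlocators == 0)
        return;

    /* non-local (shared-buffer) relations get collected into rels[] below */
    rels = palloc(sizeof(SMgrRelation) * nlocators);
    ...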
/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nlocators; i++)
{
    if (RelFileLocatorBackendIsTemp(smgr_reln[i]->smgr_rlocator))
    {
        if (smgr_reln[i]->smgr_rlocator.backend == MyBackendId)
            DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
    }
    else
        rels[n++] = smgr_reln[i];
}
/*
 * If there are no non-local relations, then we're done. Release the
 * memory and return.
 */
if (n == 0)
{
    pfree(rels);
    return;
}
Next, a two-dimensional array is allocated to record the number of blocks in every fork of each relation being dropped or truncated:
/*
 * This is used to remember the number of blocks for all the relations
 * forks.
 */
block = (BlockNumber (*)[MAX_FORKNUM + 1])
    palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
Next comes the core of the function. Start with the comment:
We can avoid scanning the entire buffer pool if we know the exact size of each of the given relation forks. See DropRelationBuffers.
/*
 * We can avoid scanning the entire buffer pool if we know the exact size
 * of each of the given relation forks. See DropRelationBuffers.
 */
for (i = 0; i < n && cached; i++)
{
    for (int j = 0; j <= MAX_FORKNUM; j++)
    {
        /* Get the number of blocks for a relation's fork. */
        block[i][j] = smgrnblocks_cached(rels[i], j);   /* outside recovery this returns InvalidBlockNumber */
        /* We need to only consider the relation forks that exists. */
        if (block[i][j] == InvalidBlockNumber)
        {
            if (!smgrexists(rels[i], j))    /* fork file doesn't exist: skip it */
                continue;
            cached = false;                 /* fork exists but its size isn't cached */
            break;
        }
        /* calculate the total number of blocks to be invalidated */
        nBlocksToInvalidate += block[i][j];
    }
}
/*
 * To remove all the pages of the specified relation forks from the buffer
 * pool, we need to scan the entire buffer pool but we can optimize it by
 * finding the buffers from BufMapping table provided we know the exact
 * size of each fork of the relation. The exact size is required to ensure
 * that we don't leave any buffer for the relation being dropped as
 * otherwise the background writer or checkpointer can lead to a PANIC
 * error while flushing buffers corresponding to files that don't exist.
 *
 * To know the exact size, we rely on the size cached for each fork by us
 * during recovery which limits the optimization to recovery and on
 * standbys but we can easily extend it once we have shared cache for
 * relation size.
 *
 * In recovery, we cache the value returned by the first lseek(SEEK_END)
 * and the future writes keeps the cached value up-to-date. See
 * smgrextend. It is possible that the value of the first lseek is smaller
 * than the actual number of existing blocks in the file due to buggy
 * Linux kernels that might not have accounted for the recent write. But
 * that should be fine because there must not be any buffers after that
 * file size.
 */
/*
 * smgrnblocks_cached() -- Get the cached number of blocks in the supplied
 *                         relation.
 *
 * Returns an InvalidBlockNumber when not in recovery and when the relation
 * fork size is not cached.
 */
BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
    /*
     * For now, we only use cached values in recovery due to lack of a shared
     * invalidation mechanism for changes in file size.
     */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

    return InvalidBlockNumber;
}
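For context on how that cache gets filled: outside of this path, smgrnblocks() remembers the result of the real size probe in smgr_cached_nblocks, and smgrextend()/smgrtruncate() keep it up to date as the file grows or shrinks during recovery. A rough sketch, paraphrased from src/backend/storage/smgr/smgr.c with details trimmed:

BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    /* Check and return if we get the cached value for the number of blocks. */
    result = smgrnblocks_cached(reln, forknum);
    if (result != InvalidBlockNumber)
        return result;

    /* ask the storage manager; for md.c this is an lseek(SEEK_END) */
    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    /* remember it so the cached value can be reused next time */
    reln->smgr_cached_nblocks[forknum] = result;

    return result;
}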
If the number of pages to invalidate is below BUF_DROP_FULL_SCAN_THRESHOLD (NBuffers / 32), the buffers are looked up in the hash table; otherwise the whole buffer pool is scanned.

/*
 * We apply the optimization iff the total number of blocks to invalidate
 * is below the BUF_DROP_FULL_SCAN_THRESHOLD.
 */
if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
    for (i = 0; i < n; i++)
    {
        for (int j = 0; j <= MAX_FORKNUM; j++)
        {
            /* ignore relation forks that doesn't exist */
            if (!BlockNumberIsValid(block[i][j]))
                continue;
            /* drop all the buffers for a particular relation fork */
            FindAndDropRelationBuffers(rels[i]->smgr_rlocator.locator,
                                       j, block[i][j], 0);
        }
    }
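FindAndDropRelationBuffers is where the "look it up instead of scanning" optimization actually happens: for every block of the fork it builds the buffer tag, probes the BufMapping hash table, and only invalidates buffers it actually finds. A condensed sketch, paraphrased from bufmgr.c (PostgreSQL 16) with comments and some rechecks trimmed:

static void
FindAndDropRelationBuffers(RelFileLocator rlocator, ForkNumber forkNum,
                           BlockNumber nForkBlock, BlockNumber firstDelBlock)
{
    BlockNumber curBlock;

    for (curBlock = firstDelBlock; curBlock < nForkBlock; curBlock++)
    {
        BufferTag   bufTag;
        uint32      bufHash;
        LWLock     *bufPartitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        /* build the tag and probe the BufMapping hash table */
        InitBufferTag(&bufTag, &rlocator, forkNum, curBlock);
        bufHash = BufTableHashCode(&bufTag);
        bufPartitionLock = BufMappingPartitionLock(bufHash);

        LWLockAcquire(bufPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&bufTag, bufHash);
        LWLockRelease(bufPartitionLock);

        /* block not in shared buffers at all: nothing to drop */
        if (buf_id < 0)
            continue;

        /* recheck under the buffer header lock, then invalidate */
        bufHdr = GetBufferDescriptor(buf_id);
        buf_state = LockBufHdr(bufHdr);
        if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) &&
            BufTagGetForkNum(&bufHdr->tag) == forkNum &&
            bufHdr->tag.blockNum >= firstDelBlock)
            InvalidateBuffer(bufHdr);   /* releases the spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}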
The threshold to use is rather a guess than an exactly determined value
For example, with drop test1, test2, test…: if more than 20 relations are being dropped, binary search is used to match each buffer against the relation list; otherwise the binary search is skipped to avoid its overhead.
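The value 20 is the constant RELS_BSEARCH_THRESHOLD, defined near the top of bufmgr.c; the comment below is mine, but the value matches PostgreSQL 16:

/* above this many relations, use bsearch() instead of a linear walk */
#define RELS_BSEARCH_THRESHOLD      20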
/*
 * For low number of relations to drop just use a simple walk through, to
 * save the bsearch overhead. The threshold to use is rather a guess than
 * an exactly determined value, as it depends on many factors (CPU and RAM
 * speeds, amount of shared buffers etc.).
 */
use_bsearch = n > RELS_BSEARCH_THRESHOLD;
/* sort the list of rlocators if necessary */
if (use_bsearch)
    pg_qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
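Right after this comes the part that gives DROP/TRUNCATE its O(shared_buffers) cost: every buffer descriptor in the pool is examined, and its tag is compared against the locator list either with a linear walk or with bsearch(). A condensed sketch, paraphrased from bufmgr.c (PostgreSQL 16):

for (i = 0; i < NBuffers; i++)
{
    RelFileLocator *rlocator = NULL;
    BufferDesc *bufHdr = GetBufferDescriptor(i);
    uint32      buf_state;

    /* unlocked precheck: does this buffer belong to one of the relations? */
    if (!use_bsearch)
    {
        int         j;

        for (j = 0; j < n; j++)
        {
            if (BufTagMatchesRelFileLocator(&bufHdr->tag, &locators[j]))
            {
                rlocator = &locators[j];
                break;
            }
        }
    }
    else
    {
        RelFileLocator locator;

        locator = BufTagGetRelFileLocator(&bufHdr->tag);
        rlocator = bsearch((const void *) &(locator),
                           locators, n, sizeof(RelFileLocator),
                           rlocator_comparator);
    }

    /* buffer doesn't belong to any of the given relfilelocators; skip it */
    if (rlocator == NULL)
        continue;

    /* recheck under the header spinlock and invalidate */
    buf_state = LockBufHdr(bufHdr);
    if (BufTagMatchesRelFileLocator(&bufHdr->tag, rlocator))
        InvalidateBuffer(bufHdr);   /* releases spinlock */
    else
        UnlockBufHdr(bufHdr, buf_state);
}

This is the loop that runs NBuffers times no matter how small the dropped table is.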
For temporary relations, as we saw at the very beginning, the work is handed off to localbuf.c, where DropRelationAllLocalBuffers simply walks every local buffer of the backend:

/*
 * DropRelationAllLocalBuffers
 *      This function removes from the buffer pool all pages of all forks
 *      of the specified relation.
 *
 * See DropRelationsAllBuffers in bufmgr.c for more notes.
 */
void
DropRelationAllLocalBuffers(RelFileLocator rlocator)
{
    int         i;

    for (i = 0; i < NLocBuffer; i++)
    {
        BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
        LocalBufferLookupEnt *hresult;
        uint32      buf_state;

        buf_state = pg_atomic_read_u32(&bufHdr->state);

        if ((buf_state & BM_TAG_VALID) &&
            BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
        {
            if (LocalRefCount[i] != 0)
                elog(ERROR, "block %u of %s is still referenced (local %u)",
                     bufHdr->tag.blockNum,
                     relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
                                    MyBackendId,
                                    BufTagGetForkNum(&bufHdr->tag)),
                     LocalRefCount[i]);
            /* Remove entry from hashtable */
            hresult = (LocalBufferLookupEnt *)
                hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
            if (!hresult)       /* shouldn't happen */
                elog(ERROR, "local buffer hash table corrupted");
            /* Mark buffer invalid */
            ClearBufferTag(&bufHdr->tag);
            buf_state &= ~BUF_FLAG_MASK;
            buf_state &= ~BUF_USAGECOUNT_MASK;
            pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
        }
    }
}
One more thing: the count of objects to drop, computed at the very beginning, covers all objects belonging to the table, such as its TOAST table and the TOAST table's index. You can verify this yourself: for create table test1(id int), n is 1; for create table test2(info text), n is 3 (the table, its TOAST table, and the TOAST index). Every one of these objects has to go through the scan, so the more complex the table and the more indexes it has, the slower the drop becomes!
/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nlocators; i++)
{
    if (RelFileLocatorBackendIsTemp(smgr_reln[i]->smgr_rlocator))
    {
        if (smgr_reln[i]->smgr_rlocator.backend == MyBackendId)
            DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
    }
    else
        rels[n++] = smgr_reln[i];
}
Here I used gdb's until command to run past the loop, and in the end you can see that it iterated 16384 times, which is the entire size of shared buffers (16384 buffers × 8kB = 128MB, the default shared_buffers setting).