
Warm strategy #270

Status: Open. Wants to merge 195 commits into base branch REL_14_STABLE_neon.
Conversation

knizhnik
Contributor

The Neon architecture is very sensitive to fast prewarming of shared buffers,
because it cannot rely on the file system cache. Prewarming can be done with the
pg_prewarm extension, but that requires manually invoking the pg_prewarm function
for each relation (including indexes and TOAST tables), which is very inconvenient.
Also, Neon can restart a compute at any moment (for example, because of inactivity),
so it is not clear who should do the prewarming, and when.

To avoid flushing shared buffers during a seqscan of a huge table, Postgres uses ring buffer strategies,
which limit the number of buffers used to a small fraction of shared buffers.
So it is not possible to prewarm the cache with queries that perform a seqscan.
Some extensions and Postgres background tasks also use the BULKREAD strategy,
which likewise prevents loading all data into shared buffers even when there is enough free space.

This PR avoids using the ring buffer while there is still free space in shared buffers.

lubennikovaav and others added 30 commits February 10, 2023 16:25
Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs.
Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads
f_smgr implementation using smgr_hook.

Also add smgr_init_hook and smgr_shutdown_hook.
And a lot of mechanical changes in smgr.c functions.

This patch has been proposed to the community: https://commitfest.postgresql.org/33/3216/

Author: anastasia <[email protected]>
Add contrib/zenith that handles interaction with remote pagestore.
To use it, add shared_preload_libraries = 'zenith' to postgresql.conf.

It adds a protocol for network communications - see libpagestore.c;
and implements smgr API.

Also it adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo

Authors:
Stas Kelvich <[email protected]>
Konstantin Knizhnik <[email protected]>
Heikki Linnakangas <[email protected]>
Add a WAL redo helper for zenith: an alternative postgres operation mode that replays WAL at the pageserver's request.

To start postgres in wal-redo mode, run postgres with the --wal-redo option.
It requires the zenith shared library and zenith.wal_redo.

Author: Heikki Linnakangas <[email protected]>
Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver.

Authors:
Konstantin Knizhnik <[email protected]>
Heikki Linnakangas <[email protected]>
In the test_createdb test, we created a new database, and created a new
branch after that. I was seeing the test fail with:

    PANIC:  could not open critical system index 2662

The WAL contained records like this:

    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Database    len (rec/tot):     42/    42, tx:        540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Transaction len (rec/tot):     66/    66, tx:        540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

The compute node had correctly replayed all the WAL up to the last
record, and opened up. But when you tried to connect to the new
database, the very first requests for the critical relations, like
pg_class, were made with request LSN 0/01642990. That's the last
record that's applicable to a particular block. Because the database
CREATE record didn't bump up the "last written LSN", the getpage
requests were made with too old LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of
database CREATE record. It probably wouldn't hurt to also throw in a
call at the end of WAL replay, but let's see if we bump into more
cases like this first.

This doesn't seem to be happening with page server as of 'main'; I was
testing with a version where I had temporarily reverted all the recent
changes to reconstruct control file, checkpoints, relmapper files
etc. from the WAL records in the page server, so that the compute node
was redoing all the WAL. I'm pretty sure we need this fix even with
'main', even though this test case wasn't failing there right now.
Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits)
or delay WAL-logging until the end of the operation (e.g. index build).
So if such a page is evicted, we will lose the update.

To fix this, we introduce a PD_WAL_LOGGED bit to track whether the page has been WAL-logged.
If the page is evicted before it has been WAL-logged, the zenith smgr creates an FPI for it.

Authors:
Konstantin Knizhnik <[email protected]>
anastasia <[email protected]>
Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors

Author: Konstantin Knizhnik <[email protected]>
Ignore unlogged table qualifier. Add respective changes to regression test outputs.

Author: Konstantin Knizhnik <[email protected]>
Request relation size via smgr function, not just stat(filepath).
Author: Konstantin Knizhnik <[email protected]>
Author: Konstantin Knizhnik <[email protected]>
…mmon error. TODO: add a comment, why this is fine for zenith.
…d of WAL page header, then return it back to the page origin
…of WAL at compute node

+ Check for presence of replication slot
…t inside.

WAL proposer (as a bgworker without BGWORKER_BACKEND_DATABASE_CONNECTION) previously
ignored SetLatch, so once caught up it got stuck inside WalProposerPoll indefinitely.

Further, WaitEventSetWait didn't have a timeout, so we also didn't try to reconnect
when all connections were dead. Fix that.

Also move the break on latch set to the end of the loop, to attempt
ReconnectWalKeepers even if the latch is constantly set.

Per test_race_conditions (Python version now).
…kpoint from WAL

+ Check for the presence of the zenith.signal file to allow skipping reading the checkpoint record from WAL

+ Pass prev_record_ptr through the zenith.signal file to postgres
This patch aims to make our bespoke WAL redo machinery more robust
in the presence of untrusted (in other words, possibly malicious) inputs.

Pageserver delegates complex WAL decoding duties to postgres,
which means that the latter might fall victim to carefully designed
malicious WAL records and start doing harmful things to the system.
To prevent this, it has been decided to limit possible interactions
with the outside world using the Secure Computing BPF mode.

We use this mode to disable all syscalls not in the allowlist.
Please refer to src/backend/postmaster/seccomp.c to learn more
about the pros & cons of the current approach.

+ Fix some bugs in seccomp bpf wrapper

* Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
* Add a missing variant of select() syscall (thx to @knizhnik).
* Write error messages to an fd stderr's currently pointing to.
…ause it causes a memory leak in wal-redo-postgres

2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR

+ Call smgr_init_standard from smgr_init_zenith
This patch adds support for the zenith_tenant variable. It has a similar
format to zenith_timeline. It is used in the callmemaybe query to pass the
tenant to the pageserver, and in the ServerInfo structure passed to the wal acceptor.
…recovery.

Rust's postgres_backend is currently too simplistic to handle it properly: reading
happens in a separate thread which just ignores CopyDone. Instead, the writer thread
must become aware of termination and send CommandComplete. Also, the reading socket must
be transferred back to postgres_backend (or the connection terminated completely
after COPY). Let's do that after more basic safekeeper refactoring, and for now
cover this up to make tests pass.

ref #388
…ion position in wal_proposer to segment boundary
…ugging.

Now it contains only one function test_consume_xids() for xid wraparound testing.
knizhnik and others added 21 commits February 10, 2023 16:57
* Fix shared memory initialization for last written LSN cache

Replace (from,till) with (from,n_blocks) for SetLastWrittenLSNForBlockRange function

* Fast exit from SetLastWrittenLSNForBlockRange for n_blocks == 0
- Refactor the way the WalProposerMain function is called when started
  with --sync-safekeepers. The postgres binary now explicitly loads
  the 'neon.so' library and calls the WalProposerMain in it. This is
  simpler than the global function callback "hook" we previously used.

- Move the WAL redo process code to a new library, neon_walredo.so,
  and use the same mechanism as for --sync-safekeepers to call the
  WalRedoMain function, when launched with --walredo argument.

- Also move the seccomp code to neon_walredo.so library. I kept the
  configure check in the postgres side for now, though.
Fix indentation, remove unused definitions, resolve some FIXMEs.
Previously, we called PrefetchBuffer [NBlkScanned * seqscan_prefetch_buffers]
times in each of those situations, but now only NBlkScanned times.

In addition, the prefetch mechanism for vacuum scans is now based on
blocks instead of tuples, improving efficiency.
Parallel seqscans didn't take their parallelism into account when determining
which block to prefetch, and vacuum's cleanup scan didn't correctly determine
which blocks would need to be prefetched, and could get into an infinite loop.
* Use prefetch in pg_prewarm extension

* Change prefetch order as suggested in review
* Update prefetch mechanisms:

- **Enable enable_seqscan_prefetch by default**
- Store prefetch distance in the relevant scan structs
- Slow start sequential scan, to accommodate LIMIT clauses.
- Replace seqscan_prefetch_buffer with the relations' tablespaces'
  *_io_concurrency; and drop seqscan_prefetch_buffer as a result.
- Clarify enable_seqscan_prefetch GUC description
- Fix prefetch in pg_prewarm
- Add prefetching to autoprewarm worker
- Fix an issue where we'd incorrectly not prefetch data when hitting a table wraparound. The same issue also resulted in assertion failures in debug builds.
- Fix parallel scan prefetching - we didn't take into account that parallel scans have scan synchronization, too.
#244)

* Maintain last written LSN for each page to enable prefetch on vacuum, delete and other massive update operations

* Move PageSetLSN in heap_xlog_visible before MarkBufferDirty
- Prefetch the pages in index vacuum's sequential scans
   Implemented in NBTREE, GIST and SP-GIST.
   BRIN does not have a 2nd phase of vacuum, and both GIN and HASH clean up
   their indexes in a non-seqscan fashion: GIN scans the btree from left to
   right, and HASH only scans the initial buckets sequentially.
The compiler warning was correct, and the underlying issue had the potential to disable prefetching.
* Show prefetch statistic in EXPLAIN

refer #2994

* Collect per-node prefetch statistics

* Show number of prefetch duplicates in explain
* Implement efficient prefetch for parallel bitmap heap scan

* Change MAX_IO_CONCURRENCY to be power of 2
* Avoid errors when accessing indexes of unlogged tables after compute restart

* Address review complaints: add comment to mdopenfork

* Initialize unlogged index under exclusive lock
They will be handled in pageserver, ref neondatabase/neon#3706

Reverts a9f5034
Reverts 7d7a547
Now a similar kind of hack (using malloc() instead of shmem) is
done in the wal-redo extension.
@knizhnik knizhnik changed the base branch from REL_15_STABLE_neon to REL_14_STABLE_neon March 15, 2023 17:19
@MMeent left a comment:

I approve of this; strategies don't make much sense if there are empty buffers available.

Some comments on the spelling and contents of the comment.

Could you also send the patch to -hackers@? It's a trivial patch, and I think there are several people interested in this as well: I believe that at least one hacker was complaining about the same issue when we met at the PG Developers day in February.

Review comment on src/backend/storage/buffer/freelist.c (outdated, resolved)