Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temp file offload v14 #257

Open
wants to merge 205 commits into
base: REL_14_STABLE_neon
Choose a base branch
from
Open

Conversation

knizhnik
Copy link
Contributor

@knizhnik knizhnik commented Feb 6, 2023

No description provided.

lubennikovaav and others added 30 commits November 21, 2022 12:59
Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs.
Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads
f_smgr implementation using smgr_hook.

Also add smgr_init_hook and smgr_shutdown_hook.
And a lot of mechanical changes in smgr.c functions.

This patch is proposed to community: https://commitfest.postgresql.org/33/3216/

Author: anastasia <[email protected]>
Add contrib/zenith that handles interaction with remote pagestore.
To use it add 'shared_preload_library = zenith' to postgresql.conf.

It adds a protocol for network communications - see libpagestore.c;
and implements smgr API.

Also it adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo

Authors:
Stas Kelvich <[email protected]>
Konstantin Knizhnik <[email protected]>
Heikki Linnakangas <[email protected]>
Add WAL redo helper for zenith - alternative postgres operation mode to replay wal by pageserver request.

To start postgres in wal-redo mode, run postgres with --wal-redo option
It requires zenith shared library and zenith.wal_redo

Author: Heikki Linnakangas <[email protected]>
Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver.

Authors:
Konstantin Knizhnik <[email protected]>
Heikki Linnakangas <[email protected]>
In the test_createdb test, we created a new database, and created a new
branch after that. I was seeing the test fail with:

    PANIC:  could not open critical system index 2662

The WAL contained records like this:

    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Database    len (rec/tot):     42/    42, tx:        540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Transaction len (rec/tot):     66/    66, tx:        540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

The compute node had correctly replayed all the WAL up to the last
record, and opened up. But when you tried to connect to the new
database, the very first requests for the critical relations, like
pg_class, were made with request LSN 0/01642990. That's the last
record that's applicable to a particular block. Because the database
CREATE record didn't bump up the "last written LSN", the getpage
requests were made with too old LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of
database CREATE record. It probably wouldn't hurt to also throw in a
call at the end of WAL replay, but let's see if we bump into more
cases like this first.

This doesn't seem to be happening with page server as of 'main'; I was
testing with a version where I had temporarily reverted all the recent
changes to reconstruct control file, checkpoints, relmapper files
etc. from the WAL records in the page server, so that the compute node
was redoing all the WAL. I'm pretty sure we need this fix even with
'main', even though this test case wasn't failing there right now.
Some operations in PostgreSQL are not WAL-logged at all (i.e. hint bits)
or delay wal-logging till the end of operation (i.e. index build).
So if such page is evicted, we will lose the update.

To fix it, we introduce PD_WAL_LOGGED bit to track whether the page was wal-logged.
If the page is evicted before it has been wal-logged, then zenith smgr creates FPI for it.

Authors:
Konstantin Knizhnik <[email protected]>
anastasia <[email protected]>
Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors

Author: Konstantin Knizhnik <[email protected]>
Ignore unlogged table qualifier. Add respective changes to regression test outputs.

Author: Konstantin Knizhnik <[email protected]>
Request relation size via smgr function, not just stat(filepath).
Author: Konstantin Knizhnik <[email protected]>
Author: Konstantin Knizhnik <[email protected]>
…mmon error. TODO: add a comment, why this is fine for zenith.
…d of WAL page header, then return it back to the page origin
…of WAL at compute node

+ Check for presence of replication slot
…t inside.

WAL proposer (as bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously
ignored SetLatch, so once caught up it stuck inside WalProposerPoll infinitely.

Futher, WaitEventSetWait didn't have timeout, so we didn't try to reconnect if
all connections are dead as well. Fix that.

Also move break on latch set to the end of the loop to attempt
ReconnectWalKeepers even if latch is constantly set.

Per test_race_conditions (Python version now).
…kpoint from WAL

+ Check for presence of zenith.signal file to allow skip reading checkpoint record from WAL

+ Pass prev_record_ptr through zenith.signal file to postgres
This patch aims to make our bespoke WAL redo machinery more robust
in the presence of untrusted (in other words, possibly malicious) inputs.

Pageserver delegates complex WAL decoding duties to postgres,
which means that the latter might fall victim to carefully designed
malicious WAL records and start doing harmful things to the system.
To prevent this, it has been decided to limit possible interactions
with the outside world using the Secure Computing BPF mode.

We use this mode to disable all syscalls not in the allowlist.
Please refer to src/backend/postmaster/seccomp.c to learn more
about the pros & cons of the current approach.

+ Fix some bugs in seccomp bpf wrapper

* Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
* Add a missing variant of select() syscall (thx to @knizhnik).
* Write error messages to an fd stderr's currently pointing to.
…ause it cause memory leak in wal-redo-postgres

2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR

+ Call smgr_init_standard from smgr_init_zenith
this patch adds support for zenith_tenant variable. it has similar
format as zenith_timeline. It is used in callmemaybe query to pass
tenant to pageserver and in ServerInfo structure passed to wal acceptor
…recovery.

Rust's postgres_backend currently is too dummy to handle it properly: reading
happens in separate thread which just ignores CopyDone. Instead, writer thread
must get aware of termination and send CommandComplete. Also reading socket must
be transferred back to postgres_backend (or connection terminated completely
after COPY). Let's do that after more basic safkeeper refactoring and right now
cover this up to make tests pass.

ref #388
…ion position in wal_proppser to segment boundary
…ugging.

Now it contains only one function test_consume_xids() for xid wraparound testing.
MMeent and others added 5 commits December 5, 2022 16:19
- Prefetch the pages in index vacuum's sequential scans
   Implemented in NBTREE, GIST and SP-GIST.
   BRIN does not have a 2nd phase of vacuum, and both GIN and HASH clean up
   their indexes in a non-seqscan fashion: GIN scans the btree from left to
   right, and HASH only scans the initial buckets sequentially.
The compiler warning was correct and would have the potential to disable prefetching.
* Show prefetch statistic in EXPLAIN

refer #2994

* Collect per-node prefetch statistics

* Show number of prefetch duplicates in explain
@knizhnik knizhnik changed the base branch from REL_15_STABLE_neon to REL_14_STABLE_neon February 7, 2023 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.