
Store timelines in internal catalogs #848

Conversation

hanefi
Contributor

@hanefi hanefi commented Jul 19, 2024

This is a WIP PR that still has some rough edges.

TODO:

  • It does not look acceptable to include "schema.h" or "catalog.h" in file pgsql.c. I'll try to find a workaround
    • I ended up creating a new module.
  • Having to pass the internal database details in a context does not look elegant.
    • I guess that is a compromise that is necessary.
  • Missing some catalog functions that can be used to populate data structures on current timeline. I do not see the value in storing a value that is not used anywhere.
    • I store the current timeline details in memory now. I believe having only the current timeline in memory is enough for today.
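For illustration, the in-memory shape could be as small as the following sketch. The struct and field names here are my assumption for the example, not the PR's actual definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: keep only the current timeline in memory,
 * instead of a full timeline history array. Names are assumed.
 */
typedef struct CurrentTimeline
{
	uint32_t tli;          /* timeline id, as reported by IDENTIFY_SYSTEM */
	char startLSN[18];     /* "FFFFFFFF/FFFFFFFF" needs 17 chars + NUL */
} CurrentTimeline;
```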

@dimitri
Owner

dimitri commented Jul 22, 2024

  • It does not look acceptable to include "schema.h" or "catalog.h" in file pgsql.c. I'll try to find a workaround

Usually that involves creating a new C module that will include all the needed headers. Here, that would be something like a “pgsql_timeline.c” module I suppose.

  • Missing some catalog functions that can be used to populate data structures on current timeline. I do not see the value in storing a value that is not used anywhere.

At the moment we fetch the timeline mostly for DEBUG purposes. Supporting a TLI change in Logical Decoding is still an open topic in pgcopydb. We need to have enough information to decide if we can follow a TLI change, depending on what we replayed already in the past, etc. So for the moment, we just store the information in a way that's easy to review/debug, but we do not use the information anywhere yet.

@dimitri dimitri added bug Something isn't working enhancement New feature or request labels Jul 22, 2024
@dimitri dimitri added this to the v0.17 milestone Jul 22, 2024
@dimitri
Owner

dimitri commented Jul 23, 2024

See #834

@dimitri dimitri left a comment

Looks good! I think there is one major aspect that needs a revisit: the allocation strategy for the timeline history bits. We receive them in a single chunk of memory, but a static size won't be good enough.

Other than that, the usual amount of nitpicking.

@hanefi
Contributor Author

hanefi commented Jul 23, 2024

  • It does not look acceptable to include "schema.h" or "catalog.h" in file pgsql.c. I'll try to find a workaround

Usually that involves creating a new C module that will include all the needed headers. Here, that would be something like a “pgsql_timeline.c” module I suppose.

Thanks for the idea of a new module. Here is some information on how I created the new module:

  • pgsql_timeline.c contains all code related to interacting with PG timelines.
  • pgsql_utils.h contains some helper functions that are shared by pgsql.c and pgsql_timeline.c; they used to be static functions.
  • pgsql.c remains agnostic of our internal catalogs, unlike pgsql_timeline.c.
  • Missing some catalog functions that can be used to populate data structures on current timeline. I do not see the value in storing a value that is not used anywhere.

At the moment we fetch the timeline mostly for DEBUG purposes. Supporting a TLI change in Logical Decoding is still an open topic in pgcopydb. We need to have enough information to decide if we can follow a TLI change, depending on what we replayed already in the past, etc. So for the moment, we just store the information in a way that's easy to review/debug, but we do not use the information anywhere yet.

I only store the current timeline details in memory now. Since some of the code moved to a new module, these changes may escape the attention of the reader/reviewer:

  • We no longer have the TimeLineHistory struct.
  • We only store the current timeline details in memory.

@hanefi hanefi marked this pull request as ready for review July 23, 2024 13:10
The function catalog_lookup_timeline is renamed to
catalog_lookup_timeline_history for consistency with the other
functions in the catalog module. The memory representation is called a
TimelineHistoryEntry, so the function name should reflect that.
@dimitri dimitri left a comment

Another round of review, the PR is now taking good shape! Thanks

Comment on lines 7711 to 7714
bool
catalog_add_timeline_history(void *ctx, TimelineHistoryEntry *entry)
{
DatabaseCatalog *catalog = (DatabaseCatalog *) ctx;
Owner

Why use a void * context that actually is a DatabaseCatalog * argument? It would be better to skip the cast and have the compiler validation for us... is there a reason to do this that I can't see and that is better than having the compiler checks?

Contributor Author

I could not figure out the circular dependency in our headers. In the end I ended up writing some forward declaration for DatabaseCatalog to circumvent that.

Owner

We might need a new header pgsql_timeline.h that could use both the catalogs and the lower-level pgsql header?

Contributor Author

I created a new header in a new commit, and removed the forward declaration for DatabaseCatalog

It contains the declaration of a function defined in pgsql.c, just because I cannot include the file that defines the struct DatabaseCatalog in pgsql.h.

/* pgsql.c */
bool pgsql_start_replication(LogicalStreamClient *client, DatabaseCatalog *catalog,
							 char *cdcPathDir);

Comment on lines 258 to 261
bool
parseTimelineHistory(const char *filename, const char *content,
IdentifySystem *system, void *context)
{
Owner

Here too I believe using DatabaseCatalog * rather than void * would make sense. What do you think?

Owner

Also is the filename necessary here at all?

Contributor Author

I also think that it makes sense to use DatabaseCatalog * here. However, it is not trivial to solve the circular dependency. We have similar issues in other occurrences of void pointers in this branch. Sharing an example:

  • parseTimelineHistory declaration is at pgsql.h, implementation is at pgsql_timeline.c
  • DatabaseCatalog definition is in schema.h
  • schema.h includes pgsql.h
  • If I want to add an include directive for schema.h in pgsql.h, we get a circular dependency.

Shall I move things to a separate module and try to come up with a better hierarchy?

```mermaid
graph TD;
    parseTimelineHistory --> pgsql.h
    parseTimelineHistory --> pgsql_timeline.c
    DatabaseCatalog --> schema.h
    schema.h --> pgsql.h
    pgsql.h ==does not work==> schema.h
```

Contributor Author

I can also use forward declarations for problematic structs. I never liked them, and I avoid them usually, but there may be no other option.

I guess there is no other way around it. I'll add the following to pgsql.h:

/* forward declaration */
typedef struct DatabaseCatalog DatabaseCatalog;
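As a self-contained illustration of why the forward declaration is enough (catalog_bump and the callCount field are hypothetical, purely for the example): prototypes only need the pointer type, and only the code that actually dereferences the struct needs its full definition, which in the PR lives in schema.h.

```c
#include <assert.h>

/* forward declaration, as pgsql.h would carry it */
typedef struct DatabaseCatalog DatabaseCatalog;

/* a prototype only needs the pointer type, not the layout */
static int catalog_bump(DatabaseCatalog *catalog);

/* the full definition normally lives in another header (schema.h) */
struct DatabaseCatalog
{
	int callCount;
};

/* dereferencing requires the full definition to be visible here */
static int
catalog_bump(DatabaseCatalog *catalog)
{
	return ++(catalog->callCount);
}
```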

Owner

See my other comment about pgsql_timeline.h maybe?

Comment on lines 293 to 295
log_trace("parseTimelineHistory line %lld is \"%s\"",
(long long) lineNumber,
lbuf.lines[lineNumber]);
Owner

If we get to store the history file as-is on-disk then we should be able to remove that line entirely.

Contributor Author

It is doable, but not easy right now. Currently, I have the following lines of code on my branch:

		if (!pgsql_start_replication(&stream, specs->sourceDB))
		{
			/* errors have already been logged */
			return false;
		}

		/* write the wal_segment_size and timeline history files */
		if (!stream_write_context(specs, &stream))
		{
			/* errors have already been logged */
			return false;
		}

The first function call reads the contents of the file, parses them, and stores records in the internal catalog. We no longer store the file contents in memory like we used to.

The second function writes the information in memory to files on disk. There is no elegant way I can think of to uphold separation of concerns here. I will add a separate commit that changes the first function to write the files to disk. If you object, I can revert that and come up with another method. We can eventually move away from writing timeline history to disk if we cannot come up with an elegant solution.

Owner

Yeah that's why the placement of my comment about storing to disk was at this seemingly random place. I would like to store the file exactly as Postgres makes it available to us, only for debugging purposes, in case our parsing of the file were to fail.

Owner

I see now what you've done, and I think that's good. When implementing debug traces that are not used in the normal flow of operations, or even in the code, I don't think separation of concerns is as imperative. Here, we want to log-to-disk the actual result from the Postgres protocol, before any kind of processing...

Contributor Author

After reading the following sentence in the last comment, I realized that I did not print a debug message when writing the tli history file.

When implementing debug traces that are not used in the normal flow of operations, or even in the code, I don't think separation of concerns is as imperative.

I pushed a new commit 2eacb24 that adds a debug message similar to what we had prior to this PR.

Owner

Only thinking about it now, sorry: what could make sense is to store the timeline data received from the protocol directly to disk, no parsing, and then have a function that reads the file one line at a time and parses the content etc.

This would be better at memory management, and also separation of concerns and internal API/headers.

What do you think?
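For a concrete sense of what "read the file one line at a time and parse" involves: each entry line of a Postgres timeline history file has the shape `tli<TAB>switchpoint LSN<TAB>reason`, and lines starting with `#` are comments. A per-line parser along these lines needs no large in-memory buffer; the function and parameter names below are assumptions for illustration, not the PR's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch: parse one entry line of a timeline history file, which
 * Postgres formats as "tli<TAB>switchpoint LSN<TAB>reason".
 */
static bool
parse_history_line(const char *line, unsigned int *tli, char *lsn, size_t size)
{
	char buffer[64] = { 0 };

	/* comment lines and blank lines carry no timeline entry */
	if (line[0] == '#' || line[0] == '\0')
	{
		return false;
	}

	if (sscanf(line, "%u\t%63s", tli, buffer) != 2)
	{
		return false;
	}

	strncpy(lsn, buffer, size - 1);
	lsn[size - 1] = '\0';

	return true;
}
```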

Contributor Author

I actually write the contents to disk before parsing them.

Currently, the pgsql_identify_system function makes the following calls:

  • parseTimelineHistoryResult parses the PGresult into a TimelineHistoryResult
  • writeTimelineHistoryFile writes the TimelineHistoryResult content that holds the raw data to disk
  • parseTimelineHistoryFile parses the TimelineHistoryResult and stores it in the internal catalog

Owner

See the other comments I have added, with more details. I think we have an opportunity to implement things in a way that start_logical_decoding does not need access to our SQLite catalogs.

@hanefi hanefi requested a review from dimitri July 25, 2024 09:00
Comment on lines 128 to 130
log_debug("Wrote tli %s timeline history file \"%s/%s\"",
system->currentTimeline.tli, cdcPathDir, hContext.filename);

Owner

This should be done in writeTimelineHistoryFile and before the call to write_file, so that if the write to file fails or takes a long time, we would already have printed the log information about it.

Contributor Author

Ok. I'll move this before the call to write_file and change wording slightly (e.g. Wrote -> Writing).

In my defense, in stream_write_context we print the logs only after successfully writing the files to disk.

Comment on lines 266 to 267
context->content = strdup(value);

Owner

I was thinking we might want to write to file at this place in the code, skip the strdup entirely, and keep only the filename in our internal structure.

Contributor Author

Done. After implementing the iterator API from scratch, I needed to move the filename to another structure. I think that is fine, as I no longer duplicate the file content and store only the file name.

Comment on lines 288 to 295
LinesBuffer lbuf = { 0 };

if (!splitLines(&lbuf, (char *) content))
{
/* errors have already been logged */
return false;
}

Owner

If we write the file to disk as soon as we receive it, without duplicating the contents in memory, then we don't need splitLines here; instead of looping through the array of lines, we could read the file line-by-line using the iterator and a callback function.
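The callback shape suggested here could look roughly like the following self-contained sketch. pgcopydb's actual file_iter_lines API differs in its details; all names below are illustrative, and the lines come from an in-memory buffer where a file-backed version would fgets() from the history file instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* callback invoked for every line, with a caller-provided context */
typedef bool (LineFun)(void *context, const char *line, long lineNumber);

/* sketch of a line iterator driving a callback over buffer contents */
static bool
iter_lines(const char *content, LineFun *callback, void *context)
{
	long lineNumber = 0;
	const char *p = content;

	while (*p != '\0')
	{
		const char *newline = strchr(p, '\n');
		size_t len = newline ? (size_t) (newline - p) : strlen(p);

		char line[256] = { 0 };
		memcpy(line, p, len < sizeof(line) - 1 ? len : sizeof(line) - 1);

		if (!(*callback)(context, line, lineNumber++))
		{
			return false;
		}

		p = newline ? newline + 1 : p + len;
	}

	return true;
}

/* example callback: count the lines we were given */
static bool
count_lines(void *context, const char *line, long lineNumber)
{
	(void) line;
	(void) lineNumber;

	*((int *) context) += 1;

	return true;
}
```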

Contributor Author

I implemented the iterator API, and removed the splitLines call here in favor of our file iterators.

If I can share some context, I did not really write this piece of code myself. I just moved it from one file to another, as it made sense to group all timeline-related functions in the same module. We had a function that parsed the complete file by itself, and now we have timeline_iter_history[|_init|_next|_finish] functions that each handle a portion of it.

Comment on lines 493 to 494
if (!pgsql_start_replication(&stream))
if (!pgsql_start_replication(&stream, specs->sourceDB, specs->paths.dir))
Owner

The whole idea behind the latest discussion opened in this review is to allow skipping the sourceDB when doing pgsql_start_replication(), getting back to your point about separation of concerns. At replication start, only store the timeline history to a file.

In stream_write_context() (or maybe in a new function, if needed?) we can then parse the history file with a file iterator and store our own format in the SQLite catalogs.

Comment on lines 293 to 295
log_trace("parseTimelineHistory line %lld is \"%s\"",
(long long) lineNumber,
lbuf.lines[lineNumber]);
Owner

See the other comments I have added, with more details. I think we have an opportunity to implement things in a way that start_logical_decoding does not need access to our SQLite catalogs.

@dimitri dimitri left a comment

Quick review from the metro, hope it's useful. I am surprised we need to define our own iterator. I expected that a callback to the file line iterator would do.

@@ -4182,7 +3816,7 @@ pgsql_timestamptz_to_string(TimestampTz ts, char *str, size_t size)
* Send the START_REPLICATION logical replication command.
*/
bool
pgsql_start_replication(LogicalStreamClient *client)
pgsql_start_replication(LogicalStreamClient *client, char *cdcPathDir)
Owner

Can we add the cdcPathDir to the LogicalStreamClient structure?

bool timeline_iter_history_next(TimelineHistoryIterator *iter);
bool timeline_iter_history_finish(TimelineHistoryIterator *iter);

bool timeline_history_add_hook(void *context, TimelineHistoryEntry *entry);
Owner

I'm not fond of the name. Not a showstopper. Will think. The idea is to describe what the hook does, not where we call it...

Contributor Author

I ended up removing this completely. Relevant discussion at other comment: #848 (comment)

}

iter->filename = filename;
iter->currentTimeline = context->currentTimeline;
Owner

The iterator should never look into the context.

Contributor Author

This is a great point. Applying a fix for this allowed me to make the following improvements:

  • Once I passed the value as an argument, I had a slimmer context.
  • Now that the context had a single attribute, I deleted it completely.
  • A function with a bad name, timeline_history_add_hook, that was a wrapper for catalog_add_timeline_history, was no longer needed, as I no longer had to access a subset of values in a context.

When the number of parameters gets large, I start to think of context structs as a means to group them. That was the wrong way to think about it, and I see that now.

@hanefi
Contributor Author

hanefi commented Jul 31, 2024

I am surprised we need to define our own iterator. I expected that a callback to the file line iterator would do.

I had two main issues that were not easy to overcome with the current file line iterators and the file_iter_lines function.

Problem 1:

When parsing a line, I needed to store some values in the iterator, and it was not possible with the current file iterator. I believe we need to be able to pass custom iterators for that to be useful.

I considered updating the current implementation of file_iter_lines to accept an iterator instead of some fixed set of arguments that are used to construct the iterator in the function body. That would have worked out for me.

Problem 2:

I need to be able to call the callback function after all the lines are processed. This way, we are able to add a final timeline entry for the current timeline. This was not easy to accomplish in the current implementation of file_iter_lines.
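One way to meet that requirement without changing the iterator itself is to keep the running state (the previous switch point) in the caller's context and emit the final entry after the loop. The following sketch illustrates the pattern; the struct names, the prevend field, and the fixed-size array are assumptions for the example, not the PR's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ENTRIES 8

typedef struct TimelineEntry
{
	uint32_t tli;
	uint64_t begin;    /* inclusive start LSN */
	uint64_t end;      /* exclusive end LSN, UINT64_MAX when open-ended */
} TimelineEntry;

typedef struct HistoryContext
{
	uint64_t prevend;  /* end of the previous timeline seen so far */
	int count;
	TimelineEntry entries[MAX_ENTRIES];
} HistoryContext;

/* called once per parsed history line, as the per-line callback */
static bool
add_parsed_entry(HistoryContext *ctx, uint32_t tli, uint64_t switchpoint)
{
	if (ctx->count >= MAX_ENTRIES)
	{
		return false;
	}

	TimelineEntry *entry = &ctx->entries[ctx->count++];

	entry->tli = tli;
	entry->begin = ctx->prevend;
	entry->end = switchpoint;

	ctx->prevend = switchpoint;

	return true;
}

/* called once after the loop: the current timeline has no switch point */
static bool
add_current_entry(HistoryContext *ctx, uint32_t currentTli)
{
	if (ctx->count >= MAX_ENTRIES)
	{
		return false;
	}

	TimelineEntry *entry = &ctx->entries[ctx->count++];

	entry->tli = currentTli;
	entry->begin = ctx->prevend;
	entry->end = UINT64_MAX;

	return true;
}
```

After iterating all history lines, the caller invokes add_current_entry once with the timeline reported by IDENTIFY_SYSTEM, so the iterator itself never needs a "finish" callback.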


Below is a comment from @dimitri as he mistakenly edited the original comment instead of quoting it in a new comment:

I had two main issues that were not easy to overcome with the current file line iterators and the file_iter_lines function.

It seems to me that there is still confusion about the responsibilities of the iterator parts of the code, which should only know how to read a line at a time, and the callback/context side of things, where I expected we would deal with prevend and all and have all the needed information for the last entry, after the iterator is finished.

It might be my confusion though, so I will have another look later today and maybe try to actually write it the way I thought. Sometimes it's the only way to understand the difference between ideas and reality...

@dimitri dimitri merged commit 6da67d7 into dimitri:main Aug 2, 2024
19 checks passed
@dimitri
Owner

dimitri commented Aug 2, 2024

Thanks for pushing all the work Hanefi, I like what we ended up with!

@hanefi hanefi deleted the 834-parsing-of-the-history-file-contents-needs-dynamic-memory-management branch August 2, 2024 10:52