
Store timelines in internal catalogs #848

Conversation

hanefi
Contributor

@hanefi hanefi commented Jul 19, 2024

This is a WIP PR that still has some rough edges.

TODO:

  • It does not look acceptable to include "schema.h" or "catalog.h" in file pgsql.c. I'll try to find a workaround
    • I ended up creating a new module.
  • Having to pass the internal database details in a context does not look elegant.
    • I guess that is a compromise that is necessary.
  • Missing some catalog functions that can be used to populate data structures on current timeline. I do not see the value in storing a value that is not used anywhere.
    • I store the current timeline details in memory now. I believe having only the current timeline in memory is enough for today.
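For illustration, the in-memory shape could be as small as the following sketch. The struct and field names here are my assumption for the example, not the PR's actual definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: keep only the current timeline in memory,
 * instead of a full timeline history array. Names are assumed.
 */
typedef struct CurrentTimeline
{
	uint32_t tli;          /* timeline id, as reported by IDENTIFY_SYSTEM */
	char startLSN[18];     /* "FFFFFFFF/FFFFFFFF" needs 17 chars + NUL */
} CurrentTimeline;
```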

@dimitri
Owner

dimitri commented Jul 22, 2024

  • It does not look acceptable to include "schema.h" or "catalog.h" in file pgsql.c. I'll try to find a workaround

Usually that involves creating a new C module that will include all the needed headers. Here, that would be something like a “pgsql_timeline.c” module I suppose.

  • Missing some catalog functions that can be used to populate data structures on current timeline. I do not see the value in storing a value that is not used anywhere.

At the moment we fetch the timeline mostly for DEBUG purposes. Supporting a TLI change in Logical Decoding is still an open topic in pgcopydb. We need to have enough information to decide if we can follow a TLI change, depending on what we replayed already in the past, etc. So for the moment, we just store the information in a way that's easy to review/debug, but we do not use the information anywhere yet.

@dimitri dimitri added bug Something isn't working enhancement New feature or request labels Jul 22, 2024
@dimitri dimitri added this to the v0.17 milestone Jul 22, 2024
@dimitri
Owner

dimitri commented Jul 23, 2024

See #834

@dimitri dimitri left a comment

Looks good! I think there is one major aspect that needs a revisit: the allocation strategy for the timeline history bits. We receive them in a single chunk of memory, but a static size won't be good enough.

Other than that, the usual amount of nitpicking.

@hanefi
Contributor Author

hanefi commented Jul 23, 2024

  • It does not look acceptable to include "schema.h" or "catalog.h" in file pgsql.c. I'll try to find a workaround

Usually that involves creating a new C module that will include all the needed headers. Here, that would be something like a “pgsql_timeline.c” module I suppose.

Thanks for the idea of a new module. Here is some information on how I created the new module:

  • pgsql_timeline.c contains all code related to interacting with PG timelines.
  • pgsql_utils.h contains some helper functions that are shared by pgsql.c and pgsql_timeline.c; they used to be static functions.
  • pgsql.c remains agnostic of our internal catalogs, unlike pgsql_timeline.c.
  • Missing some catalog functions that can be used to populate data structures on current timeline. I do not see the value in storing a value that is not used anywhere.

At the moment we fetch the timeline mostly for DEBUG purposes. Supporting a TLI change in Logical Decoding is still an open topic in pgcopydb. We need to have enough information to decide if we can follow a TLI change, depending on what we replayed already in the past, etc. So for the moment, we just store the information in a way that's easy to review/debug, but we do not use the information anywhere yet.

I only store the current timeline details in memory now. Since some of the code moved to a new module, these changes may escape the attention of the reader/reviewer:

  • We no longer have the TimeLineHistory struct.
  • We only store the current timeline details in memory.

@hanefi hanefi marked this pull request as ready for review July 23, 2024 13:10
The function catalog_lookup_timeline is renamed to
catalog_lookup_timeline_history for consistency with the other
functions in the catalog module. The memory representation is called a
TimelineHistoryEntry, so the function name should reflect that.
@dimitri dimitri left a comment

Another round of review, the PR is now taking good shape! Thanks

Comment on lines 7711 to 7714
bool
catalog_add_timeline_history(void *ctx, TimelineHistoryEntry *entry)
{
DatabaseCatalog *catalog = (DatabaseCatalog *) ctx;
Owner

Why use a void * context that actually is a DatabaseCatalog * argument? It would be better to skip the cast and have the compiler validation for us... is there a reason to do this that I can't see and that is better than having the compiler checks?

Contributor Author

I could not figure out the circular dependency in our headers. In the end I ended up writing some forward declaration for DatabaseCatalog to circumvent that.

Owner

We might need a new header pgsql_timeline.h that could use both the catalogs and the lower-level pgsql header?

Contributor Author

I created a new header in a new commit, and removed the forward declaration for DatabaseCatalog

It contains the declaration of a function defined in pgsql.c, just because I cannot include the file that defines the struct DatabaseCatalog in pgsql.h.

/* pgsql.c */
bool pgsql_start_replication(LogicalStreamClient *client, DatabaseCatalog *catalog,
							 char *cdcPathDir);

Comment on lines 258 to 261
bool
parseTimelineHistory(const char *filename, const char *content,
IdentifySystem *system, void *context)
{
Owner

Here too I believe using DatabaseCatalog * rather than void * would make sense. What do you think?

Owner

Also is the filename necessary here at all?

Contributor Author

I also think that it makes sense to use DatabaseCatalog * here. However, it is not trivial to solve the circular dependency. We have similar issues in other occurrences of void pointers in this branch. Sharing an example:

  • parseTimelineHistory declaration is at pgsql.h, implementation is at pgsql_timeline.c
  • DatabaseCatalog definition is in schema.h
  • schema.h includes pgsql.h
  • If I want to add an include directive for schema.h in pgsql.h, we get a circular dependency.

Shall I move things to a separate module and try to come up with a better hierarchy?

```mermaid
graph TD;
    parseTimelineHistory --> pgsql.h
    parseTimelineHistory --> pgsql_timeline.c
    DatabaseCatalog --> schema.h
    schema.h --> pgsql.h
    pgsql.h ==does not work==> schema.h
```

Contributor Author

I can also use forward declarations for problematic structs. I never liked them, and I avoid them usually, but there may be no other option.

I guess there is no other way around it. I'll add the following to pgsql.h:

/* forward declaration */
typedef struct DatabaseCatalog DatabaseCatalog;
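As a self-contained illustration of why the forward declaration is enough (catalog_bump and the callCount field are hypothetical, purely for the example): prototypes only need the pointer type, and only the code that actually dereferences the struct needs its full definition, which in the PR lives in schema.h.

```c
#include <assert.h>

/* forward declaration, as pgsql.h would carry it */
typedef struct DatabaseCatalog DatabaseCatalog;

/* a prototype only needs the pointer type, not the layout */
static int catalog_bump(DatabaseCatalog *catalog);

/* the full definition normally lives in another header (schema.h) */
struct DatabaseCatalog
{
	int callCount;
};

/* dereferencing requires the full definition to be visible here */
static int
catalog_bump(DatabaseCatalog *catalog)
{
	return ++(catalog->callCount);
}
```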

Owner

See my other comment about pgsql_timeline.h maybe?

Comment on lines 293 to 295
log_trace("parseTimelineHistory line %lld is \"%s\"",
(long long) lineNumber,
lbuf.lines[lineNumber]);
Owner

If we get to store the history file as-is on-disk then we should be able to remove that line entirely.

Contributor Author

It is doable, but not easy right now. Currently, I have the following lines of code on my branch:

		if (!pgsql_start_replication(&stream, specs->sourceDB))
		{
			/* errors have already been logged */
			return false;
		}

		/* write the wal_segment_size and timeline history files */
		if (!stream_write_context(specs, &stream))
		{
			/* errors have already been logged */
			return false;
		}

The first function call reads the contents of the file, parses them, and stores records in the internal catalog. We no longer store the file contents in memory like we used to.

The second function writes the information in memory to files on disk. There is no elegant way I can think of to uphold separation of concerns here. I will add a separate commit that changes the first function to write the files to disk. If you object, I can revert that and come up with another method. We can eventually move away from writing timeline history to disk if we cannot come up with an elegant solution.

Owner

Yeah that's why the placement of my comment about storing to disk was at this seemingly random place. I would like to store the file exactly as Postgres makes it available to us, only for debugging purposes, in case our parsing of the file were to fail.

Owner

I see now what you've done, and I think that's good. When implementing debug traces that are not used in the normal flow of operations, or even in the code, I don't think separation of concerns is as imperative. Here, we want to log-to-disk the actual result from the Postgres protocol, before any kind of processing...

Contributor Author

After reading the following sentence in the last comment, I realized that I did not print a debug message when writing the tli history file.

When implementing debug traces that are not used in the normal flow of operations, or even in the code, I don't think separation of concerns is as imperative.

I pushed a new commit 2eacb24 that adds a debug message similar to what we had prior to this PR.

Owner

Only thinking about it now, sorry: what could make sense is to store the timeline data received from the protocol directly to disk, no parsing, and then have a function that reads the file one line at a time and parses the content etc.

This would be better at memory management, and also separation of concerns and internal API/headers.

What do you think?
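For a concrete sense of what "read the file one line at a time and parse" involves: each entry line of a Postgres timeline history file has the shape `tli<TAB>switchpoint LSN<TAB>reason`, and lines starting with `#` are comments. A per-line parser along these lines needs no large in-memory buffer; the function and parameter names below are assumptions for illustration, not the PR's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch: parse one entry line of a timeline history file, which
 * Postgres formats as "tli<TAB>switchpoint LSN<TAB>reason".
 */
static bool
parse_history_line(const char *line, unsigned int *tli, char *lsn, size_t size)
{
	char buffer[64] = { 0 };

	/* comment lines and blank lines carry no timeline entry */
	if (line[0] == '#' || line[0] == '\0')
	{
		return false;
	}

	if (sscanf(line, "%u\t%63s", tli, buffer) != 2)
	{
		return false;
	}

	strncpy(lsn, buffer, size - 1);
	lsn[size - 1] = '\0';

	return true;
}
```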

Contributor Author

I actually write the contents to disk before parsing them.

Currently, the pgsql_identify_system function makes the following calls:

  • parseTimelineHistoryResult parses the PGresult into a TimelineHistoryResult
  • writeTimelineHistoryFile writes the TimelineHistoryResult content that holds the raw data to disk
  • parseTimelineHistoryFile parses the TimelineHistoryResult and stores it in the internal catalog

Owner

See the other comments I have added, with more details. I think we have an opportunity to implement things in a way that start_logical_decoding does not need access to our SQLite catalogs.

@hanefi hanefi requested a review from dimitri July 25, 2024 09:00
Comment on lines 128 to 130
log_debug("Wrote tli %s timeline history file \"%s/%s\"",
system->currentTimeline.tli, cdcPathDir, hContext.filename);

Owner

This should be done in writeTimelineHistoryFile and before the call to write_file, so that if the write to file fails or takes a long time, we would already have printed the log information about it.

Contributor Author

Ok. I'll move this before the call to write_file and change wording slightly (e.g. Wrote -> Writing).

In my defense, in stream_write_context we print the logs only after successfully writing the files to disk.

Comment on lines 266 to 267
context->content = strdup(value);

Owner

I was thinking we might want to write to file at this place in the code, skip the strdup entirely, and keep only the filename in our internal structure.

Contributor Author

Done. After implementing the iterator API from scratch, I needed to move the filename to another structure. I think that is fine, as I no longer duplicate the file content and store only the file name.

Comment on lines 288 to 295
LinesBuffer lbuf = { 0 };

if (!splitLines(&lbuf, (char *) content))
{
/* errors have already been logged */
return false;
}

Owner

If we write the file to disk as soon as we receive it, without duplicating the contents in memory, then we don't need splitLines here; instead of looping through the array of lines, we could read the file line-by-line using the iterator and a callback function.
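The callback shape suggested here could look roughly like the following self-contained sketch. pgcopydb's actual file_iter_lines API differs in its details; all names below are illustrative, and the lines come from an in-memory buffer where a file-backed version would fgets() from the history file instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* callback invoked for every line, with a caller-provided context */
typedef bool (LineFun)(void *context, const char *line, long lineNumber);

/* sketch of a line iterator driving a callback over buffer contents */
static bool
iter_lines(const char *content, LineFun *callback, void *context)
{
	long lineNumber = 0;
	const char *p = content;

	while (*p != '\0')
	{
		const char *newline = strchr(p, '\n');
		size_t len = newline ? (size_t) (newline - p) : strlen(p);

		char line[256] = { 0 };
		memcpy(line, p, len < sizeof(line) - 1 ? len : sizeof(line) - 1);

		if (!(*callback)(context, line, lineNumber++))
		{
			return false;
		}

		p = newline ? newline + 1 : p + len;
	}

	return true;
}

/* example callback: count the lines we were given */
static bool
count_lines(void *context, const char *line, long lineNumber)
{
	(void) line;
	(void) lineNumber;

	*((int *) context) += 1;

	return true;
}
```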

Contributor Author

I implemented the iterator API, and removed the splitLines call here in favor of our file iterators.

If I can share some context, I did not really write this piece of code myself. I just moved it from one file to another, as it made sense to group all timeline-related functions in the same module. We had a function that parsed the complete file by itself, and now we have timeline_iter_history[|_init|_next|_finish] functions that each handle a portion of it.

Comment on lines 493 to 494
if (!pgsql_start_replication(&stream))
if (!pgsql_start_replication(&stream, specs->sourceDB, specs->paths.dir))
Owner

The whole idea behind the latest discussion opened in this review is to allow skipping the sourceDB when doing pgsql_start_replication(), getting back to your point about separation of concerns. At replication start, only store the timeline history to a file.

In stream_write_context() (or maybe in a new function, if needed?) we can then parse the history file with a file iterator and store our own format in the SQLite catalogs.

Comment on lines 293 to 295
log_trace("parseTimelineHistory line %lld is \"%s\"",
(long long) lineNumber,
lbuf.lines[lineNumber]);
Owner

See the other comments I have added, with more details. I think we have an opportunity to implement things in a way that start_logical_decoding does not need access to our SQLite catalogs.

@dimitri dimitri left a comment

Quick review from the metro, hope it's useful. I am surprised we need to define our own iterator. I expected that a callback to the file line iterator would do.

@@ -4182,7 +3816,7 @@ pgsql_timestamptz_to_string(TimestampTz ts, char *str, size_t size)
* Send the START_REPLICATION logical replication command.
*/
bool
pgsql_start_replication(LogicalStreamClient *client)
pgsql_start_replication(LogicalStreamClient *client, char *cdcPathDir)
Owner

Can we add the cdcPathDir to the LogicalStreamClient structure?

bool timeline_iter_history_next(TimelineHistoryIterator *iter);
bool timeline_iter_history_finish(TimelineHistoryIterator *iter);

bool timeline_history_add_hook(void *context, TimelineHistoryEntry *entry);
Owner

I'm not fond of the name. Not a showstopper. Will think. The idea is to describe what the hook does, not where we call it...

Contributor Author

I ended up removing this completely. Relevant discussion at other comment: #848 (comment)

}

iter->filename = filename;
iter->currentTimeline = context->currentTimeline;
Owner

The iterator should never look into the context.

Contributor Author

This is a great point. Applying a fix for this allowed me to make the following improvements:

  • Once I passed the value as an argument, I had a slimmer context.
  • Now that the context had a single attribute, I deleted it completely.
  • A function with a bad name, timeline_history_add_hook, that was a wrapper for catalog_add_timeline_history, was no longer needed, as I no longer had to access a subset of values in a context.

When the number of parameters gets large, I start to think of context structs as a means to group them. That was the wrong way to think about it, and I see that now.

@hanefi
Contributor Author

hanefi commented Jul 31, 2024

I am surprised we need to define our own iterator. I expected that a callback to the file line iterator would do.

I had two main issues that were not easy to overcome with the current file line iterators and the file_iter_lines function.

Problem 1:

When parsing a line, I needed to store some values in the iterator, and it was not possible with the current file iterator. I believe we need to be able to pass custom iterators for that to be useful.

I considered updating the current implementation of file_iter_lines to accept an iterator instead of some fixed set of arguments that are used to construct the iterator in the function body. That would have worked out for me.

Problem 2:

I need to be able to call the callback function after all the lines are processed. This way, we are able to add a final timeline entry for the current timeline. This was not easy to accomplish in the current implementation of file_iter_lines.
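One way to meet that requirement without changing the iterator itself is to keep the running state (the previous switch point) in the caller's context and emit the final entry after the loop. The following sketch illustrates the pattern; the struct names, the prevend field, and the fixed-size array are assumptions for the example, not the PR's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ENTRIES 8

typedef struct TimelineEntry
{
	uint32_t tli;
	uint64_t begin;    /* inclusive start LSN */
	uint64_t end;      /* exclusive end LSN, UINT64_MAX when open-ended */
} TimelineEntry;

typedef struct HistoryContext
{
	uint64_t prevend;  /* end of the previous timeline seen so far */
	int count;
	TimelineEntry entries[MAX_ENTRIES];
} HistoryContext;

/* called once per parsed history line, as the per-line callback */
static bool
add_parsed_entry(HistoryContext *ctx, uint32_t tli, uint64_t switchpoint)
{
	if (ctx->count >= MAX_ENTRIES)
	{
		return false;
	}

	TimelineEntry *entry = &ctx->entries[ctx->count++];

	entry->tli = tli;
	entry->begin = ctx->prevend;
	entry->end = switchpoint;

	ctx->prevend = switchpoint;

	return true;
}

/* called once after the loop: the current timeline has no switch point */
static bool
add_current_entry(HistoryContext *ctx, uint32_t currentTli)
{
	if (ctx->count >= MAX_ENTRIES)
	{
		return false;
	}

	TimelineEntry *entry = &ctx->entries[ctx->count++];

	entry->tli = currentTli;
	entry->begin = ctx->prevend;
	entry->end = UINT64_MAX;

	return true;
}
```

After iterating all history lines, the caller invokes add_current_entry once with the timeline reported by IDENTIFY_SYSTEM, so the iterator itself never needs a "finish" callback.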


Below is a comment from @dimitri as he mistakenly edited the original comment instead of quoting it in a new comment:

I had two main issues that were not easy to overcome with the current file line iterators and the file_iter_lines function.

It seems to me that there is still confusion about the responsibilities of the iterator parts of the code, which should only know how to read a line at a time, and the callback/context side of things, where I expected we would deal with prevend and all and have all the needed information for the last entry, after the iterator is finished.

It might be my confusion though, so I will have another look later today and maybe try to actually write it the way I thought. Sometimes it's the only way to understand the difference between ideas and reality...

@dimitri dimitri merged commit 6da67d7 into dimitri:main Aug 2, 2024
19 checks passed
@dimitri
Owner

dimitri commented Aug 2, 2024

Thanks for pushing all the work Hanefi, I like what we ended up with!

@hanefi hanefi deleted the 834-parsing-of-the-history-file-contents-needs-dynamic-memory-management branch August 2, 2024 10:52