Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maxwell breaking data integrity of datetime fields on bootstraps #1754

Open
lpmarques opened this issue Oct 4, 2021 · 4 comments
Open

Maxwell breaking data integrity of datetime fields on bootstraps #1754

lpmarques opened this issue Oct 4, 2021 · 4 comments

Comments

@lpmarques
Copy link

lpmarques commented Oct 4, 2021

Hey there,

Last week I noticed some oddly valued (but still valid) datetime fields in a source table were being reproduced as invalid datetimes by Maxwell's bootstrap process.

The original value stored in the source mysql database for these fields is 9999-12-31 23:59:59, while resulting bootstrap-insert events show them as 10000-01-01 02:59:59, probably over-assuming that:
1 - these values were recorded in the time zone specified by GLOBAL.time_zone mysql variable (-03:00), which is not necessarily the case;
2 - they should be represented as some sort of timestamp in the target, therefore needing to be converted to UTC first, which is not necessarily the desired behavior and is also, afaik, not informed in the documentation.

Later I could confirm this is consistently happening to all datetime values (not only the oddly valued ones) during bootstrap runs, but the ongoing CDC process shows the expected behavior of keeping all datetime values as recorded in the source.

Maxwell version: 1.33.1

@osheroff
Copy link
Collaborator

I think that JDBC (which controls the bootstrapping process) is probably adding a timezone to the datetimes returned, which maxwell yeah, then attempts to convert to UTC times.

If you have a surgical fix here to suggest I'll listen, but I've gone so many damn rounds on mysql and its timezone support (and dumbness thereof) that mostly I just throw up my hands and say "Listen. If your server and client aren't in UTC and your datetimes aren't in UTC you're opening up a big can of trouble anyway. I can't even begin to guess what your desired behavior is." And normally I don't do that kind of daddy-engineer-knows-best shit, but in this case... well, I am doing it. sorry.

You might try passing different values of "serverTimezone=SOMETHING" in jdbc_options. It might help.

@lpmarques
Copy link
Author

So, I get your point, but in my case (and I suppose most data engineers' too) I have no decision power on timezone configuration of transactional source databases.

On the desired behavior of datetime manipulation, maybe I got it wrong (which would be no surprise given mysql's confusing concepts), but my unsderstanding is that mysql datetimes have no timezone data associated to them whatsoever. If that's correct, any datetime value should be read as stored, almost like a simple string would: without any regard to server or client timezones and without any conversion either.

Therefore, if the JDBC is adding timezone to mysql datetimes based on server's global time_zone variable, it sounds like unexpected behavior. But I'll tinker with jdbc_options and report any effect. Thanks for the suggestion!

@lpmarques
Copy link
Author

lpmarques commented Nov 27, 2021

Hey @osheroff!

I tried many different things on jdbc_options with no success. When set to UTC, the serverTimezone parameter you mentioned actually did ensure datetime values were fetched as stored, but had the side-effect of fetching timestamp values with database's timezone, not UTC.

By tinkering with the source code (SynchronousBootstrapper.java, more specifically), I noticed JDBC is indeed responsible for adding timezone to both datetime and timestamp values as you correctly guessed. I'm sure you may come with a more elegant solution, but I created a pull request to avoid the proceeding timezone conversion of datetime values (only). I've run some tests and confirmed the fix works for datetimes of any precision, making bootstrapping and CDC processes coherent.

Please, consider reviewing it for inclusion in Maxwell's next releases.

Thank you!

@lpmarques
Copy link
Author

PS: another thing I noticed is this timezone conversion of both datetimes and timestamps during bootstrapping is not necessarily to UTC, it's actually from the server database to the client host timezone, which I guess can cause differences from the CDC process behavior too. Since at least I can ensure client host timezone to be UTC, I did not tackle this problem in my pull request, but still I thought it was worth mentioning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants