Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster/easier loading of types #2046

Merged
merged 1 commit into from
Jul 1, 2024
Merged

Conversation

nicois
Copy link
Contributor

@nicois nicois commented Jun 15, 2024

As discussed in #2030 , improve the LoadType method to use a single SQL call.

Add LoadTypes() which adds multiple types in a single call, and supports recursively loading of other implicitly-required types in the same SQL call.

@nicois
Copy link
Contributor Author

nicois commented Jun 15, 2024

@jackc when tests run against cockroachdb, I see:

ERROR: unknown signature: array_agg(name[]) (SQLSTATE 42883)

I don't have any experience with this database, and I don't see any mention of being incompatible with this function. If pgx needs to support cockroachdb, how would you suggest we do so? I could try using an alternate syntax, but I'd like your advice before doing so.

@nicois
Copy link
Contributor Author

nicois commented Jun 15, 2024

All regression tests apart from cockroachdb are passing: https://github.com/nicois/pgx/actions/runs/9526586068

@jackc
Copy link
Owner

jackc commented Jun 15, 2024

Regarding Cockroach DB, if it is something that is not supported by that database engine then there is a helper method to skip tests. But as far as I know, array_agg is supported by CRDB. I'm not sure where that error is coming from but the name[] is suspicious. I would expect the call to aggregate name not name[]. As in, it looks like it is trying to aggregate an array of name into an array of arrays of names instead of aggregating name into an array of name.

I also was wondering if the loading all custom types includes table types.

@nicois
Copy link
Contributor Author

nicois commented Jun 15, 2024

array_agg is being used by the new single-SQL statement, to select everything in one go - so it's not viable to simply skip the test.
This is related to supporting composite fields, where I am collecting multiple 2-tuples of composite fields into an array of 2-tuples, so it is in fact intended that this is an array of arrays of strings.
I will try to flatten the structure to a single array of 2X elements, where X is the number of composite fields.

Regarding the loading of all types: yes, there is further work needed there. I've added a simple test and there are complications. I think this needs some experimentation/discussion, and it's probably better to cover that in its own PR. I will remove the public method to invoke the loading of all types from this PR.

@nicois nicois force-pushed the nicois/load-types branch 3 times, most recently from 9fba6cf to f238d1d Compare June 16, 2024 00:45
@nicois
Copy link
Contributor Author

nicois commented Jun 16, 2024

It's passing with cockroachdb now. I've also reinstated the original code for LoadType to minimise the chance of causing any regressions for existing users.
When loading multiple types, it's pretty much essential to load each type into the map as it's found, so I've also renamed the methods to more clearly capture this fact.

@nicois nicois marked this pull request as ready for review June 16, 2024 00:52
@nicois nicois marked this pull request as draft June 17, 2024 03:53
@nicois
Copy link
Contributor Author

nicois commented Jun 17, 2024

I have refined the SQL further, to deal with more complex heirachies of types/domains/arrays/composite types.

@nicois nicois force-pushed the nicois/load-types branch 2 times, most recently from 8b6bf61 to 924f798 Compare June 17, 2024 10:49
@nicois nicois marked this pull request as ready for review June 17, 2024 10:50
@jackc
Copy link
Owner

jackc commented Jun 17, 2024

I suppose there is no way around the bit recursive query. But, wow, that is scary looking. I'd hate to have to debug that. 🤷

I also think the server version should be an internal function.

@nicois
Copy link
Contributor Author

nicois commented Jun 17, 2024

Sure, I'll make it internal. I thought it might be helpful for a client, but perhaps it's better to not expose it.

That query isn't nice, and it's quite possible it could be optimised in some way too; it takes longer than I'd like it to. However, my upcoming pgxpool PR has support for caching this mapping (for non-replicated environments), meaning the overhead is only felt once per invocation.

@nicois
Copy link
Contributor Author

nicois commented Jun 17, 2024

I'll add a few comments to the scary SQL to make it a little less nightmarish.

@nicois
Copy link
Contributor Author

nicois commented Jun 17, 2024

I've changed the serverVersion to being private, added some comments to the SQL, and also modified the structure a little, so there is a LoadTypes and RegisterTypes call. This is both more similar to the LoadType which exists, and will be a better structure for what's coming.

@jackc
Copy link
Owner

jackc commented Jun 18, 2024

Hmmm... The functionality looks good, but I'm not so sure about the interface.

At first glance, I didn't like TypeInfo. I didn't understand why pgtype.Type wouldn't be used directly. But then I realized that since types dependent on other types directly refer to the registered Codec of those types it wouldn't be possible to safely share the type. Be that as it may, it's still confusing to have two types that are so similar. Not sure what the solution is.

Then LoadTypes and RegisterTypes are named such that I would expect they are exactly the same as the singular versions except they do multiple at a time. But they are different from their singular counterparts. It is counterintuitive that LoadType returns a *pgtype.Type but LoadTypes returns []*TypeInfo.

I think I understand at this point why *pgtype.Type couldn't be used directly, but the naming, at least, is very unfortunate.

Upon further reflection, I also question if LoadTypes and RegisterTypes should be members of *pgx.Conn at all. LoadTypes doesn't use any private values or functions of the *pgx.Conn except serverVersion and that doesn't need to be a member either. RegisterTypes has even less justification to be a member function. Not only does it not use the receiver it actually uses the registerWith argument instead.


So here's some suggestions for your consideration. I'm not sure if they are good ideas, but I think they are worth looking into.

  • Rename LoadTypes to disambiguate loading a pgtype.Type and pgx.TypeInfo.
  • Rename RegisterTypes to disambiguate registering a pgtype.Type and pgx.TypeInfo.
  • Make LoadTypes a function instead of a method.
  • Make RegisterTypes a function instead of a method. Or perhaps it should be a method on *pgtype.Map.
  • Perhaps TypeInfo should be part of the pgtype instead of the pgx package.
  • I wish there was a overall naming scheme. Perhaps something to do with derived or dependent types would be better. LoadType(s) only works on types that are derived from known types. e.g. You can't use LoadType for an entirely unknown type like something from PostGIS. But you do use pgtype.Map.RegisterType. Maybe there should be something like LoadDerivedTypes / RegisterDerivedType / DerivedTypeInfo. 🤷‍♂️
  • Possibly Conn.LoadType should be deprecated in favor of this new functionality.

Also, one other random thing. I noticed LoadTypes is purposely using the simple protocol. Any particular reason?

@nicois
Copy link
Contributor Author

nicois commented Jun 21, 2024

I've addressed the feedback.

There are a couple of reasons I didn't change the original LoadType:

  • when only a single type is being loaded, in some situations it might be faster, if the database connection is low latency and there are many other types which would be scanned by the mega query
  • people who want speed will switch over to the new syntax, probably via the (coming) pgxpool PR
  • if there are any bugs, they can be ironed out without upsetting existing users

@nicois
Copy link
Contributor Author

nicois commented Jun 21, 2024

Regarding the use of the simple protocol: when I remove this, at least when running locally the test fails:

While scanning type information: can't scan into dest[1]: cannot scan char (OID 18) in binary format into *string

I am quite new to pgx so perhaps I am missing something obvious which would allow the connection to work in the default mode.

@nicois nicois force-pushed the nicois/load-types branch 2 times, most recently from c2503dd to f68972a Compare June 21, 2024 21:42
@jackc
Copy link
Owner

jackc commented Jun 22, 2024

Regarding scanning "char", that is a special type. I believe the issue is scanning pg_type.typtype into a string. Try scanning it into a byte or rune instead.


I like the progress that is being made here. But I've been reviewing this PR, looking back at the existing pgx type code, and mulling it over. And I think there might be more fundamental improvements that can be made.

As I understand it, you are trying to improve two issues. 1. This PR is making loading types faster. 2. The pgxpool PR is trying to make it unnecessary to introspect the database for each connection. They are somewhat related and improvements in one may affect what is desirable or necessary for the other.

Regarding loading types faster, as we've discussed above, I consider it unfortunate that we end up with 2 similar things pgtype.Type and pgtype.DerivedTypeInfo. The naming, at least, is better now, but two similar things still bother me. That's makes me wonder what if instead of adding pgtype.DerivedTypeInfo we instead added all of its fields to pgtype.Type? LoadDerivedTypes would again become LoadTypes.

... 🤔 ... 🤔 ... 🤔 ...

Actually, wait a second. Why do we need the DerivedTypeInfo fields at all?

My previous observation was wrong:

At first glance, I didn't like TypeInfo. I didn't understand why pgtype.Type wouldn't be used directly. But then I realized that since types dependent on other types directly refer to the registered Codec of those types it wouldn't be possible to safely share the type.

It is possible to safely share the *pgtype.Type. According to the docs *pgtype.Type and *pgtype.Codec must not be altered once they are registered with a pgtype.Map.

I think this means PR this and the pgxpool PR can be significantly simplified. We still need the complicated SQL query, but I think much of the rest could go away.

Here's an idea: We add a LoadTypes function. It runs your big SQL query, but it returns []*pgtype.Type. It's slightly tricky because it needs to build up its own map of types in addition to the *pgtype.Map. For example, _mytypedepends onmytypewhich is a composite which contains_myothertype. The *pgtype.Types would need to be created in the order myothertype, _myothertype, mytype, _mytype. So LoadTypeswould have to keep track of what*pgtype.Type` it has created.

RegisterDerivedTypes goes away. The existing *Map.RegisterType function is used.

Add another small change and we also solve the _hstore problem discussed on the other PR. LoadTypes could support types where typtype = 'b' (base type). It would leave the Codec nil in that case. Derived types like _hstore would still be initialized because they would have a pointer to the pgtype.Type for hstore. The caller of LoadTypes would be responsible for filling in the Codec before registering the types.

This should allow for all type introspection to be done in a single SQL query regardless of how many types or what kind types are used.

Regarding the pgxpool change to avoid having to introspect the database for each connection, I'm not sure how much of that would still be necessary either. Since pgtype.Type and pgtype.Codec are immutable once registered, there's no reason we can't just copy one pgtype.Map's type information to another pgtype.Map.

@nicois
Copy link
Contributor Author

nicois commented Jun 22, 2024

Ok, I see where you're going with this. I will try to avoid the need to define a second type, if I can.

Regarding pgxpool: the type map has to be populated for each connection, doesn't it? And there is still the need to support situations where the pool does not always connect to servers sharing the same OID mapping. Assuming that, there are wins in making it easy to let the custom and automatic loaders be defined more easily, and to cache the results when it's safe to do so.

Either way, it's good to keep the PRs separate as the first one is needed either way, and once it's solid, it's easier to look at the incremental benefit of the other.

@nicois
Copy link
Contributor Author

nicois commented Jun 23, 2024

I've made some changes which didn't take very long at all, and am pretty happy with the results:

  • there is now LoadTypes which returns []*pgtype.Type. It copies the connection's Map and then mutates that copy while it builds the list of types, preventing types from complaining about not being able to find other types which they expect to already be registered. But this is just an temporary internal map; what is returned is a clean list of Types. Note that this also deals with the hstore issue: as long as someone has registered hstore in their connection's Map before calling LoadTypes, that hstore type will be found and respected.
  • I created a Map.Copy() method to support the above
  • A Map.RegisterTypes() convenience method is trivial, and can be removed if you think it's too trivial.

@nicois
Copy link
Contributor Author

nicois commented Jun 23, 2024

Regarding scanning "char", that is a special type. I believe the issue is scanning pg_type.typtype into a string. Try scanning it into a byte or rune instead.

I've tried removing the simple protocol designator, then using byte and rune, and they both error. I'll leave it as-is for now, unless you can suggest something else.

@jackc
Copy link
Owner

jackc commented Jun 25, 2024

I mentioned this idea on #2056, but I will restate it here.

What if pgx.ConnConfig had a list of PG type names to load when the connection is established. For base types like hstore or PostGIS types, a map of name to function could be provided that would contain the loading logic.

This would remove the need to manually load types after connection, and would probably remove most uses of the after connect hooks provided by pgxpool and stdlib.

It would also remove the need to have any new functions or methods. The only new interface would be additional fields on *pgx.ConnConfig.

This would be an even simpler interface for loading types.

@nicois
Copy link
Contributor Author

nicois commented Jun 27, 2024

@jackc I have tried to modify these 3 PRs to minimise the change in the interface, as you suggested. It did work out a bit less messy than I anticipated, so let me know what you think.

@nicois nicois force-pushed the nicois/load-types branch 2 times, most recently from abda9f0 to 22fb4e4 Compare June 27, 2024 12:18
@jackc
Copy link
Owner

jackc commented Jun 29, 2024

@nicois

Thanks for making all these changes. I still think this configuration should not live on the pool. But now actually seeing it on the connection and especially seeing how it complicates #2048 (ConnectConfigReusingTypeMap is a rather inelegant necessity caused by this change), I'm not sure AutoLoadTypes on the pgx.ConnConfig is right either.

Here's what I think we should do. Let's go back to some of your original design.

  • Back out this last change of AutoLoadTypes on pgx.ConnConfig.
  • Make LoadTypes a method of *pgx.Conn again. Given that it now really is the collection version of LoadType it should live next to it. If I could do it all over again I might not make LoadType a Conn method, but at this point consistency is probably the priority.

I'll merge just that core functionality. Then we can evaluate what is the most convenient interface for exposing the functionality.


With regard to that convenient interface, the pgxpool interface and possibly configuration on Conn, I would like us to evaluate using https://pkg.go.dev/sync#OnceValue to load the types. This, possibly with a wrapping helper function, might be a better abstraction than the config options.

@nicois
Copy link
Contributor Author

nicois commented Jun 29, 2024

OK, I will undo those changes. It is hard to get this right; there are competing desires and we have to choose a single implementation to service them all.

There is one refinement I think I need to make to the autoloader, which I am testing locally at the moment: I believe that we need to register both the namespace and non-namespace name with the type map. I have been running into trouble where some types are defined in namespaces not in the search path, and I think this is needed. It would be nicer if a single type could be assigned multiple names (to result in a single OID being associated with a single typemap entry), but I don't think this is currently possible.

@nicois nicois force-pushed the nicois/load-types branch 2 times, most recently from 84103a1 to dcb32f9 Compare June 29, 2024 12:46
@nicois
Copy link
Contributor Author

nicois commented Jun 29, 2024

I've pushed the earlier version back, with the duplicate creation of types for the namespaced version of each type's name. You can more easily see this here: https://github.com/jackc/pgx/compare/4660f1d7546adb2f6091625ed9fe42f393daf522..dcb32f94021386bf4f450b858281687faa4a0641

Copy link
Owner

@jackc jackc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few notes above in the review, but none of them are critical. We also might want to add a test for the namespaced type path.

But I think this is mergeable at this point. Just let me know if you are planning on making any more changes.

supportsMultirange := (pgVersion >= 14)
var typeNamesClause string

if typeNames == nil {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this branch still valid? As far as I can see the only caller, LoadTypes, enforces len(typeNames) > 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer this branch to be kept, as the alternative is to fail at runtime if a nil is passed in. It is an unexercised branch right now, and the compiler will optimise it away.
If there were a way to make it a compile-time error, if a nil was passed in, it would be even better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have modified this branch to run, but not match any records. This shouldn't usually happen, but at least avoids a runtime error.

derived_types.go Outdated Show resolved Hide resolved
derived_types.go Outdated
for i, fieldName := range ti.Attnames {
//if fieldOID64, err = strconv.ParseUint(composite_fields[i+1], 10, 32); err != nil {
// return nil, fmt.Errorf("While extracting OID used in composite field: %w", err)
//}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed?

@nicois
Copy link
Contributor Author

nicois commented Jun 30, 2024

I'll address your feedback above, then let you know. We can follow up further if I need be.

@nicois
Copy link
Contributor Author

nicois commented Jun 30, 2024

ok, this should be good to merge now.

When loading even a single type into pgx's type map, multiple SQL
queries are performed in series. Over a slow link, this is not ideal.
Worse, if multiple types are being registered, this is repeated multiple
times.

This commit add LoadTypes, which can retrieve type
mapping information for multiple types in a single SQL call, including
recursive fetching of dependent types.
RegisterTypes performs the second stage of this operation.
@jackc jackc merged commit 161ce73 into jackc:master Jul 1, 2024
14 checks passed
@jackc
Copy link
Owner

jackc commented Jul 1, 2024

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants