Improve order_by performance #7
Good job! Great improvement.
I left some suggestions, mostly about comments to make it easier to understand.
Another question about the compound index: as we discussed recently, MySQL is able to merge multiple indices, see https://dev.mysql.com/doc/refman/8.0/en/index-merge-optimization.html
Would it thus be enough to define an index on the custom order column only, e.g. only on the "author" column?
Works perfectly, great job!
The previous logic relied on concatenation to generate a unique field that could be both filtered and sorted on. This assumed that databases would allow adding an index on such a concatenated field. But this is not the case: neither MySQL nor Postgres support indices on functions like `CONCAT`. Therefore, the queries used by the gem had very poor performance.

To combat this, the queries now filter on the given `order_by` field and the ID field separately and also list them separately in the `ORDER BY` clause. This allows making use of a compound (multicolumn) index on the combination of the two fields.

Since both MySQL and Postgres can use compound indices efficiently when accessing, filtering over, or ordering by either all fields of the index or only its leftmost fields, an index on `(<custom_column>, id)` will ensure high performance of these queries.

Furthermore, abandoning the concatenation also ensures that non-string columns are ordered as expected. Before, e.g. integer columns were treated as text columns, leading to weird orders like `1, 10, 11, 2, 3, 4`. This has also been solved with these changes. However, because this changes the order of the returned results, releasing it causes a breaking change.

Since the set of tables this gem can be used on will surely be fairly limited, we want the string returned by the `Paginator#id_column` method to be frozen. In Ruby < 3.0 this happens automatically through the magic comment at the top of the file; in Ruby 3.0, however, this is no longer the case for interpolated strings. Therefore it has to be frozen manually, and the Rubocop rule (which checks for Ruby 2.5 syntax) has to be adjusted accordingly.
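The freezing behavior described above can be illustrated with a minimal sketch. The module and method below are hypothetical stand-ins, not the gem's actual `Paginator#id_column`; only the `frozen_string_literal` magic comment and `String#freeze` are standard Ruby.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for a method that interpolates a table name into a
# column reference. This is not the gem's actual implementation.
module ColumnNames
  def self.id_column(table_name)
    # With the magic comment above, Ruby < 3.0 returns a frozen string here
    # even without `.freeze`. Ruby >= 3.0 does not freeze interpolated
    # strings, so the explicit call keeps the behavior consistent.
    "#{table_name}.id".freeze
  end
end

column = ColumnNames.id_column('posts')
puts column         # posts.id
puts column.frozen? # true
```

Because the explicit `.freeze` is redundant on older Rubies, a linter configured for an older target version may flag it, which would explain why a Rubocop rule needs adjusting.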
Since we just used the pre-assembled relations and then called `#size` on them, queries like this were sent to the database:

```
SELECT COUNT(*)
FROM (
  SELECT 1 AS one
  FROM `posts`
  ORDER BY `posts`.`author` DESC, `posts`.`id` DESC
  LIMIT 3
) subquery_for_count
```

The `ORDER BY` part never influences the number of returned records. But depending on the database used and its query optimization settings, it can make the query less performant. Therefore, remove any order from the queries before calling `#size` on them. With `ActiveRecord` this can be achieved by calling `#reorder` on the relation with an empty string.
One of the optimizations that cursor pagination allows is fetching one more record than required from the database in order to efficiently determine whether there is another page (which would contain at least this extra item). However, the way this was implemented, it actually triggered two separate queries: one to load the records and another `COUNT` query to determine the number of records including the additional one. This was due to how ActiveRecord delays fetching the records of a relation until the latest possible point. To prevent this, we can call `#fetch` manually to request the records to be loaded earlier.
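The "fetch one extra record" idea can be sketched without a database. Everything below is hypothetical: `FakeRelation` and `fetch_rows` merely stand in for a lazily evaluated relation and count how often it is queried; they are not ActiveRecord APIs.

```ruby
# Hypothetical stand-in for a lazily loaded relation. It counts how many
# times rows are requested so the number of "queries" can be observed.
class FakeRelation
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  # Simulates one round trip that returns at most `limit` rows.
  def fetch_rows(limit)
    @query_count += 1
    @rows.first(limit)
  end
end

# Loads page_size + 1 rows once, then works on the in-memory array only.
def fetch_page(relation, page_size)
  records_plus_one = relation.fetch_rows(page_size + 1)
  # Array#size on already loaded rows needs no further query.
  has_next_page = records_plus_one.size > page_size
  [records_plus_one.first(page_size), has_next_page]
end

relation = FakeRelation.new((1..10).to_a)
records, has_next_page = fetch_page(relation, 3)
p records              # [1, 2, 3]
p has_next_page        # true
p relation.query_count # 1
```

The point of the sketch is that the next-page check runs against rows that are already in memory, so no separate `COUNT` round trip is needed.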
The gem in its current version does not allow ordering by arbitrary SQL statements. Only SQL columns can be used. This is partly for performance reasons, but also because of the high added complexity of supporting arbitrary SQL queries. We would have to ensure that it doesn't open the door for SQL injection attacks. We would also have to return the value of this query so that it can be properly encoded into the cursor. For now, this is out of scope for this gem.
This method is invoked multiple times to retrieve the `start_cursor`, the `end_cursor`, and ultimately to return the actual `page` record. By memoizing it, we avoid mapping over the `records` again and again and rebuilding all the cursors every time it is invoked.
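A minimal sketch of this kind of memoization, assuming a simplified page structure: the class name `PageBuilder`, the `build_count` counter, and the cursor encoding are all hypothetical and only serve to show the `||=` pattern.

```ruby
# Hypothetical, simplified stand-in for a paginator. Not the gem's class.
class PageBuilder
  attr_reader :build_count

  def initialize(records)
    @records = records
    @build_count = 0
  end

  # Builds the page once. Later calls return the cached array, because an
  # Array (even an empty one) is truthy and so `||=` skips the block.
  def page
    @page ||= begin
      @build_count += 1
      @records.map { |record| { cursor: encode_cursor(record), data: record } }
    end
  end

  def start_cursor
    page.first&.fetch(:cursor)
  end

  def end_cursor
    page.last&.fetch(:cursor)
  end

  private

  # Illustrative encoding only: Base64 of the record's id.
  def encode_cursor(record)
    [record[:id].to_s].pack('m0')
  end
end

builder = PageBuilder.new([{ id: 1 }, { id: 2 }])
builder.start_cursor
builder.end_cursor
builder.page
p builder.build_count # 1
```

Without the `@page ||=` guard, each of the three calls at the end would map over the records and rebuild every cursor again.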
As raised in issue #3 by @domangi, the performance of the previous implementation when passing an `order_by` parameter was pretty bad. It suggested users should set an index on `CONCAT(<order_by_column>, '-', id)`, but it turns out that neither MySQL nor Postgres support indices on a function like that.

The solution that was suggested in the issue was to remove custom ordering for non-unique columns from the gem altogether and leave it up to the user to ensure that a column is unique. But in my opinion this would make the barrier to entry for this gem quite high, as most columns on normal relations are not unique by default, so users would have to manually add and maintain a unique column for each of the columns they'd want to sort on.

This PR suggests an improved mechanism to make the old behavior work. It also brings some other performance improvements, and the sorting now happens based on the actual column type. It does not add support for arbitrary SQL queries in the `order_by` param, but adds some description to the `README.md` file around this.

The new mechanism works like this (as also detailed in #3):

If this gem gets called like this:

and our cursor encodes something like `['Jane', 4]`, the generated SQL query so far was

which was performing poorly since, contrary to this gem's documentation, neither MySQL nor Postgres support indices on functions like `CONCAT`.

But we can rewrite this query to something like this:

If we then add a compound index to `(author, id)`, the database can use this to resolve both the `WHERE` clauses as well as the `ORDER BY` condition. I created an index on MySQL like this:

And then an `EXPLAIN` on the `SELECT` query returned this (on a database with 10000 records):

So it managed to use the index for a `range` query.

In detail:
Change queries used for pagination when `order_by` parameter is passed

The previous logic relied on concatenation to generate a unique field that could be both filtered and sorted on. This assumed that databases would allow adding an index on such a concatenated field. But this is not the case: neither MySQL nor Postgres support indices on functions like `CONCAT`.

Therefore, the queries used by the gem had very poor performance.

To combat this, the queries now filter on the given `order_by` field and the ID field separately and also list them separately in the `ORDER BY` clause. This allows making use of a compound (multicolumn) index on the combination of the two fields.

Since both MySQL and Postgres can use compound indices efficiently when accessing, filtering over, or ordering by either all fields of the index or only its leftmost fields, an index on `(<custom_column>, id)` will ensure high performance of these queries.

Furthermore, abandoning the concatenation also ensures that non-string columns are ordered as expected. Before, e.g. integer columns were treated as text columns, leading to weird orders like `1, 10, 11, 2, 3, 4`. This has also been solved with these changes.

However, because this changes the order of the returned results, releasing it causes a breaking change.
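As a rough illustration of the separate filter and ordering described in this section, the two SQL fragments could be assembled along the following lines. The helper name `after_cursor_clause`, the `?` placeholder style, and the exact SQL shape are assumptions made for this sketch and may differ from what the gem actually generates.

```ruby
# Hypothetical helper, not part of the gem: returns the WHERE and ORDER BY
# fragments for "records after the cursor" when ordering by a custom column
# with `id` as the tie-breaker.
def after_cursor_clause(column, direction: :asc)
  # Only plain column names are accepted, since the name is interpolated.
  unless column.match?(/\A[a-z_][a-z0-9_]*\z/i)
    raise ArgumentError, "unexpected column name: #{column.inspect}"
  end

  operator = direction == :asc ? '>' : '<'
  keyword = direction == :asc ? 'ASC' : 'DESC'

  # Column and id are compared separately instead of through CONCAT, so
  # each keeps its own type and an index on (column, id) can be used.
  where = "(#{column} #{operator} ? OR (#{column} = ? AND id #{operator} ?))"
  order = "#{column} #{keyword}, id #{keyword}"
  [where, order]
end

where, order = after_cursor_clause('author')
cursor_value, cursor_id = ['Jane', 4]
binds = [cursor_value, cursor_value, cursor_id]

puts where # (author > ? OR (author = ? AND id > ?))
puts order # author ASC, id ASC
p binds    # ["Jane", "Jane", 4]
```

Both fragments reference only `author` and `id`, which is what would let a compound index on `(author, id)` serve the filter and the ordering together.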
Remove `ORDER BY` clause from `COUNT` queries

Since we just used the pre-assembled relations and then called `#size` on them, queries like this were sent to the database:

The `ORDER BY` part never influences the number of returned records. But depending on the database used and its query optimization settings, it can make the query less performant.

Therefore, remove any order from the queries before calling `#size` on them. With `ActiveRecord` this can be achieved by calling `#reorder` on the relation with an empty string.

Ensure records are fetched early to avoid multiple DB queries
One of the optimizations that cursor pagination allows is fetching one more record than required from the database in order to efficiently determine whether there is another page (which would contain at least this extra item). However, the way this was implemented, it actually triggered two separate queries: one to load the records and another `COUNT` query to determine the number of records including the additional one. This was due to how ActiveRecord delays fetching the records of a relation until the latest possible point.

To prevent this, we can call `#fetch` manually to request the records to be loaded earlier.

Add explanation to README.md about ordering by arbitrary SQL
The gem in its current version does not allow ordering by arbitrary SQL statements. Only SQL columns can be used.
This is partly for performance reasons, but also because of the high added complexity of supporting arbitrary SQL queries. We would have to ensure that it doesn't open the door for SQL injection attacks. We would also have to return the value of this query so that it can be properly encoded into the cursor.
For now, this is out of scope for this gem.
Memoize the `Paginator#page` method

This method is invoked multiple times to retrieve the `start_cursor`, the `end_cursor`, and ultimately to return the actual `page` record. By memoizing it, we avoid mapping over the `records` again and again and rebuilding all the cursors every time it is invoked.

Resolves #3