Improve order_by performance #7

nicolas-fricke · 2021-04-15T08:14:19Z

As raised in issue #3 by @domangi, the performance of the previous implementation when passing an order_by parameter was pretty bad. It suggested users should set an index onto CONCAT(<order_by_column>, '-', id) – but it turns out that neither MySQL nor Postgres support indices on a function like that.

The solution that was suggested in the issue was to remove custom ordering for non-unique columns from the gem all together and leave it up to the user to ensure that a column is unique. But in my opinion this would make the barrier of entry for this gem quite high as most columns on normal relations are not unique by default, so users would have to manually add and maintain a unique column for each of the columns they'd want to sort on.

This PR suggests an improved mechanism to make the old behavior work. It also brings some other performance improvements and the sorting now happens based on the actual column type. It does not add support for arbitrary SQL queries in the order_by param, but adds some description to the README.md file around this.

The new mechanism works like this (as also detailed in #3):

If this gem gets called like this:

RailsCursorPagination::Paginator
  .new(relation, order_by: :author, first: 2, after: "WyJKYW5lIiw0XQ==")
  .fetch

and our cursor encodes something like ['Jane', 4], the generated SQL query so far was

SELECT *
FROM "posts"
WHERE CONCAT(author, '-', "posts"."id") > 'Jane-4'
ORDER BY CONCAT(author, '-', "posts"."id") ASC
LIMIT 2

which was performing poorly since contrary to this gem's documentation neither MySQL nor Postgres support indices on functions like CONCAT.

But we can rewrite this query to something like this:

SELECT *
FROM "posts"
WHERE (author > 'Jane') OR (author = 'Jane' AND "posts"."id" > 4)
ORDER BY author ASC, "posts"."id" ASC
LIMIT 2

If we then add a compound index to (author, id), the database can use this to resolve both the WHERE clauses as well as the ORDER BY condition. I created an index on MySQL like this:

CREATE INDEX index_posts_on_author_and_id ON posts (author, id);

And then an EXPLAIN on the SELECT query returned this (on a database with 10000 records):

+----+-------------+-------+------------+-------+--------------------------------------+------------------------------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions | type  | possible_keys                        | key                          | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-------+------------+-------+--------------------------------------+------------------------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | posts | NULL       | range | PRIMARY,index_posts_on_author_and_id | index_posts_on_author_and_id | 776     | NULL | 5016 |   100.00 | Using index condition |
+----+-------------+-------+------------+-------+--------------------------------------+------------------------------+---------+------+------+----------+-----------------------+

So it managed to use the index for a range query.

In detail:

Change queries used for pagination when order_by parameter is passed

The previous logic relied on using concatenation to generate a unique field that could both be filtered and sorted over. This was done with
the assumption that databases would allow adding an index on such a concatenated field. But this is actually not the case, neither MySQL nor Postgres support indices on functions like CONCAT.

Therefore, the queries used by the gem had a vert poor performance.

To combat this, the queries are now changed to query the given order_by field and the ID field separately and also have them separately in the ORDER BY clause. This allows making use of a compound or multicolumn index on the combination of the two fields.

Since both MySQL and Postgres support efficient use of compound indices when accessing / filtering over / ordering by either all fields of the index or the leftmost fields only, and index on (<custom_column>, id) will ensure a high performance of these queries.

Furthermore, abandoning the concatenation now also ensures that non-string columns are ordered as expected. Before, e.g. integer columns were treated as text columns, leading to weird orders like 1, 10, 11, 2, 3, 4. This has also been solved with these changes.

However, due to this changing the order of the returned results, releasing this cause a breaking change.

Remove ORDER BY clause from COUNT queries

Since we just used the pre-assembled relations to then call #size on them, this caused queries like this to be sent to the database:

SELECT COUNT(*) FROM (
  SELECT 1 AS one
  FROM `posts`
  ORDER BY `posts`.`author` DESC, `posts`.`id` DESC
  LIMIT 3
) subquery_for_count

The ORDER BY part will never influence the number of returned records. But depending on the used database and their query optimization settings, it will make the query less performant.

Therefore, remove any order from the queries before calling #size on them. With ActiveRecord this can be achieved by calling #reorder on the relation with an empty string.

Ensure records are fetched early to avoid multiple DB queries

One of the optimizations that cursor pagination allows is fetching one more record than required from the database to then be able to efficiently determine if there is another page (which would contain at the least this extra item). However, the way this was implemented it actually triggered individual queries, one to load the records and another COUNT query to determine the amount of records with the additional one. This was due to how ActiveRecord tries to delay fetching records of a relation to the latest possible point.

To prevent this, we can call #fetch manually to request the records to be loaded earlier.

Add explanation to README.md about ordering by arbitrary SQL

The gem in its current version does not allow ordering by any arbitrary SQL statements. Only SQL columns can be used.

This is on one side due to performance reasons, but on the other side also due to the added high complexity of supporting arbitrary SQL queries. We would have to ensure that it doesn't open the door for SQL injection attacks. We would also have to return the value of this query so that it can be properly encoded into the cursor.

For now, this is out of scope for this gem.

Memoize the Paginator#page method

This method is invoked multiple times to retrieve the start_cursor, the end_cursor, and ultimately to return the actual page record. By memoizing it, we avoid it from mapping over the records again and again and rebuilding all the cursors every time it is invoked.

Resolves #3

domangi

Good Job! Great improvement.
I left some suggestions, mostly about comments to make it easier to understand.

Another question about the compound index. As we discussed lately, mysql is able to merge multiple indices, see https://dev.mysql.com/doc/refman/8.0/en/index-merge-optimization.html

Would it thus be enough to define an index on custom order column e.g. only on the "author" column?

README.md

lib/rails_cursor_pagination/paginator.rb

Marta83 · 2021-04-19T08:00:10Z

Works perfectly, great job!

The previous logic relied on using concatenation to generate a unique field that could both be filtered and sorted over. This was done with the assumption that databases would allow adding an index on such a concatenated field. But this is actually not the case, neither MySQL nor Postgres support indices on functions like `CONCAT`. Therefore, the queries used by the gem had a vert poor performance. To combat this, the queries are now changed to query the given `order_by` field and the ID field separately and also have them separately in the `ORDER BY` clause. This allows making use of a compound or multicolumn index on the combination of the two fields. Since both MySQL and Postgres support efficient use of compound indices when accessing / filtering over / ordering by either all fields of the index or the leftmost fields only, and index on `(<custom_column>, id)` will ensure a high performance of these queries. Furthermore, abandoning the concatenation now also ensures that non-string columns are ordered as expected. Before, e.g. integer columns were treated as text columns, leading to weird orders like `1, 10, 11, 2, 3, 4`. This has also been solved with these changes. However, due to this changing the order of the returned results, releasing this cause a breaking change. Since the set of tables this gem can be used on will surely be fairly limited, we want the string returned by the `Paginator#id_column` method to be frozen. In Ruby < 3.0 this is automatic through the magic comment on top of the file, however in Ruby 3.0 this is no longer the case for interpolated strings. Therefore it has to be manually frozen and the Rubocop rule (which checks for Ruby 2.5 syntax) has to be adjusted accordingly.

Since we just used the pre-assembled relations to then call `#size` on them, this caused queries like this to be sent to the database: ``` SELECT COUNT(*) FROM ( SELECT 1 AS one FROM `posts` ORDER BY `posts`.`author` DESC, `posts`.`id` DESC LIMIT 3 ) subquery_for_count ``` The `ORDER BY` part will never influence the number of returned records. But depending on the used database and their query optimization settings, it will make the query less performant. Therefore, remove any order from the queries before calling `#size` on them. With `ActiveRecord` this can be achieved by calling `#reorder` on the relation with an empty string.

One of the optimizations that cursor pagination allows is fetching one more record than required from the database to then be able to efficiently determine if there is another page (which would contain at the least this extra item). However, the way this was implemented it actually triggered individual queries, one to load the records and another `COUNT` query to determine the amount of records with the additional one. This was due to how ActiveRecord tries to delay fetching records of a relation to the latest possible point. To prevent this, we can call `#fetch` manually to request the records to be loaded earlier.

The gem in its current version does not allow ordering by any arbitrary SQL statements. Only SQL columns can be used. This is on one side due to performance reasons, but on the other side also due to the added high complexity of supporting arbitrary SQL queries. We would have to ensure that it doesn't open the door for SQL injection attacks. We would also have to return the value of this query so that it can be properly encoded into the cursor. For now, this is out of scope for this gem.

This method is invoked multiple times to retrieve the `start_cursor`, the `end_cursor`, and ultimately to return the actual `page` record. By memoizing it, we avoid it from mapping over the `records` again and again and rebuilding all the cursors every time it is invoked.

nicolas-fricke · 2021-04-19T14:25:59Z

Thanks for the reviews and testing @domangi and @Marta83! The Postgres tests on this PR kept failing due to an issue in the Github Actions step that was used. I disabled the failing PG versions in #8, rebased this PR and will now proceed with merging it :)

nicolas-fricke added the enhancement New feature or request label Apr 15, 2021

nicolas-fricke requested a review from domangi April 15, 2021 08:14

domangi reviewed Apr 15, 2021

View reviewed changes

README.md Outdated Show resolved Hide resolved

lib/rails_cursor_pagination/paginator.rb Show resolved Hide resolved

lib/rails_cursor_pagination/paginator.rb Show resolved Hide resolved

nicolas-fricke force-pushed the fix-order-by-performance branch 2 times, most recently from 6f1f5f2 to d24fd65 Compare April 15, 2021 16:31

Marta83 approved these changes Apr 19, 2021

View reviewed changes

nicolas-fricke added 5 commits April 19, 2021 16:21

nicolas-fricke force-pushed the fix-order-by-performance branch from d24fd65 to 3c56399 Compare April 19, 2021 14:23

nicolas-fricke merged commit 29c1ce8 into master Apr 19, 2021

nicolas-fricke deleted the fix-order-by-performance branch February 18, 2022 13:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve order_by performance #7

Improve order_by performance #7

nicolas-fricke commented Apr 15, 2021 •

edited

Loading

domangi left a comment •

edited

Loading

Marta83 commented Apr 19, 2021

nicolas-fricke commented Apr 19, 2021

Improve order_by performance #7

Improve order_by performance #7

Conversation

nicolas-fricke commented Apr 15, 2021 • edited Loading

domangi left a comment • edited Loading

Choose a reason for hiding this comment

Marta83 commented Apr 19, 2021

nicolas-fricke commented Apr 19, 2021

nicolas-fricke commented Apr 15, 2021 •

edited

Loading

domangi left a comment •

edited

Loading