Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve order_by performance #7

Merged
merged 5 commits into from
Apr 19, 2021
Merged

Conversation

nicolas-fricke
Copy link
Member

@nicolas-fricke nicolas-fricke commented Apr 15, 2021

As raised in issue #3 by @domangi, the performance of the previous implementation when passing an order_by parameter was pretty bad. It suggested users should set an index onto CONCAT(<order_by_column>, '-', id) – but it turns out that neither MySQL nor Postgres support indices on a function like that.

The solution that was suggested in the issue was to remove custom ordering for non-unique columns from the gem all together and leave it up to the user to ensure that a column is unique. But in my opinion this would make the barrier of entry for this gem quite high as most columns on normal relations are not unique by default, so users would have to manually add and maintain a unique column for each of the columns they'd want to sort on.

This PR suggests an improved mechanism to make the old behavior work. It also brings some other performance improvements and the sorting now happens based on the actual column type. It does not add support for arbitrary SQL queries in the order_by param, but adds some description to the README.md file around this.

The new mechanism works like this (as also detailed in #3):

If this gem gets called like this:

RailsCursorPagination::Paginator
  .new(relation, order_by: :author, first: 2, after: "WyJKYW5lIiw0XQ==")
  .fetch

and our cursor encodes something like ['Jane', 4], the generated SQL query so far was

SELECT *
FROM "posts"
WHERE CONCAT(author, '-', "posts"."id") > 'Jane-4'
ORDER BY CONCAT(author, '-', "posts"."id") ASC
LIMIT 2

which was performing poorly since contrary to this gem's documentation neither MySQL nor Postgres support indices on functions like CONCAT.

But we can rewrite this query to something like this:

SELECT *
FROM "posts"
WHERE (author > 'Jane') OR (author = 'Jane' AND "posts"."id" > 4)
ORDER BY author ASC, "posts"."id" ASC
LIMIT 2

If we then add a compound index to (author, id), the database can use this to resolve both the WHERE clauses as well as the ORDER BY condition. I created an index on MySQL like this:

CREATE INDEX index_posts_on_author_and_id ON posts (author, id);

And then an EXPLAIN on the SELECT query returned this (on a database with 10000 records):

+----+-------------+-------+------------+-------+--------------------------------------+------------------------------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions | type  | possible_keys                        | key                          | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-------+------------+-------+--------------------------------------+------------------------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | posts | NULL       | range | PRIMARY,index_posts_on_author_and_id | index_posts_on_author_and_id | 776     | NULL | 5016 |   100.00 | Using index condition |
+----+-------------+-------+------------+-------+--------------------------------------+------------------------------+---------+------+------+----------+-----------------------+

So it managed to use the index for a range query.

In detail:

Change queries used for pagination when order_by parameter is passed

The previous logic relied on using concatenation to generate a unique field that could both be filtered and sorted over. This was done with
the assumption that databases would allow adding an index on such a concatenated field. But this is actually not the case, neither MySQL nor Postgres support indices on functions like CONCAT.

Therefore, the queries used by the gem had a vert poor performance.

To combat this, the queries are now changed to query the given order_by field and the ID field separately and also have them separately in the ORDER BY clause. This allows making use of a compound or multicolumn index on the combination of the two fields.

Since both MySQL and Postgres support efficient use of compound indices when accessing / filtering over / ordering by either all fields of the index or the leftmost fields only, and index on (<custom_column>, id) will ensure a high performance of these queries.

Furthermore, abandoning the concatenation now also ensures that non-string columns are ordered as expected. Before, e.g. integer columns were treated as text columns, leading to weird orders like 1, 10, 11, 2, 3, 4. This has also been solved with these changes.

However, due to this changing the order of the returned results, releasing this cause a breaking change.

Remove ORDER BY clause from COUNT queries

Since we just used the pre-assembled relations to then call #size on them, this caused queries like this to be sent to the database:

SELECT COUNT(*) FROM (
  SELECT 1 AS one
  FROM `posts`
  ORDER BY `posts`.`author` DESC, `posts`.`id` DESC
  LIMIT 3
) subquery_for_count

The ORDER BY part will never influence the number of returned records. But depending on the used database and their query optimization settings, it will make the query less performant.

Therefore, remove any order from the queries before calling #size on them. With ActiveRecord this can be achieved by calling #reorder on the relation with an empty string.

Ensure records are fetched early to avoid multiple DB queries

One of the optimizations that cursor pagination allows is fetching one more record than required from the database to then be able to efficiently determine if there is another page (which would contain at the least this extra item). However, the way this was implemented it actually triggered individual queries, one to load the records and another COUNT query to determine the amount of records with the additional one. This was due to how ActiveRecord tries to delay fetching records of a relation to the latest possible point.

To prevent this, we can call #fetch manually to request the records to be loaded earlier.

Add explanation to README.md about ordering by arbitrary SQL

The gem in its current version does not allow ordering by any arbitrary SQL statements. Only SQL columns can be used.

This is on one side due to performance reasons, but on the other side also due to the added high complexity of supporting arbitrary SQL queries. We would have to ensure that it doesn't open the door for SQL injection attacks. We would also have to return the value of this query so that it can be properly encoded into the cursor.

For now, this is out of scope for this gem.

Memoize the Paginator#page method

This method is invoked multiple times to retrieve the start_cursor, the end_cursor, and ultimately to return the actual page record. By memoizing it, we avoid it from mapping over the records again and again and rebuilding all the cursors every time it is invoked.

Resolves #3

@nicolas-fricke nicolas-fricke added the enhancement New feature or request label Apr 15, 2021
@nicolas-fricke nicolas-fricke requested a review from domangi April 15, 2021 08:14
Copy link

@domangi domangi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Job! Great improvement.
I left some suggestions, mostly about comments to make it easier to understand.

Another question about the compound index. As we discussed lately, mysql is able to merge multiple indices, see https://dev.mysql.com/doc/refman/8.0/en/index-merge-optimization.html

Would it thus be enough to define an index on custom order column e.g. only on the "author" column?

README.md Outdated Show resolved Hide resolved
lib/rails_cursor_pagination/paginator.rb Show resolved Hide resolved
lib/rails_cursor_pagination/paginator.rb Show resolved Hide resolved
@nicolas-fricke nicolas-fricke force-pushed the fix-order-by-performance branch 2 times, most recently from 6f1f5f2 to d24fd65 Compare April 15, 2021 16:31
@Marta83
Copy link
Collaborator

Marta83 commented Apr 19, 2021

Works perfectly, great job!

The previous logic relied on using concatenation to generate a unique
field that could both be filtered and sorted over. This was done with
the assumption that databases would allow adding an index on such a
concatenated field. But this is actually not the case, neither MySQL nor
Postgres support indices on functions like `CONCAT`.

Therefore, the queries used by the gem had a vert poor performance.

To combat this, the queries are now changed to query the given
`order_by` field and the ID field separately and also have them
separately in the `ORDER BY` clause. This allows making use of a
compound or multicolumn index on the combination of the two fields.

Since both MySQL and Postgres support efficient use of compound indices
when accessing / filtering over / ordering by either all fields of the
index or the leftmost fields only, and index on `(<custom_column>, id)`
will ensure a high performance of these queries.

Furthermore, abandoning the concatenation now also ensures that
non-string columns are ordered as expected. Before, e.g. integer columns
were treated as text columns, leading to weird orders like
`1, 10, 11, 2, 3, 4`. This has also been solved with these changes.

However, due to this changing the order of the returned results,
releasing this cause a breaking change.

Since the set of tables this gem can be used on will surely be fairly
limited, we want the string returned by the `Paginator#id_column`
method to be frozen. In Ruby < 3.0 this is automatic through the magic
comment on top of the file, however in Ruby 3.0 this is no longer the
case for interpolated strings. Therefore it has to be manually frozen
and the Rubocop rule (which checks for Ruby 2.5 syntax) has to be
adjusted accordingly.
Since we just used the pre-assembled relations to then call `#size` on
them, this caused queries like this to be sent to the database:
```
SELECT COUNT(*) FROM (
  SELECT 1 AS one
  FROM `posts`
  ORDER BY `posts`.`author` DESC, `posts`.`id` DESC
  LIMIT 3
) subquery_for_count
```

The `ORDER BY` part will never influence the number of returned records.
But depending on the used database and their query optimization
settings, it will make the query less performant.

Therefore, remove any order from the queries before calling `#size` on
them. With `ActiveRecord` this can be achieved by calling `#reorder` on
the relation with an empty string.
One of the optimizations that cursor pagination allows is fetching one
more record than required from the database to then be able to
efficiently determine if there is another page (which would contain at
the least this extra item). However, the way this was implemented it
actually triggered individual queries, one to load the records and
another `COUNT` query to determine the amount of records with the
additional one. This was due to how ActiveRecord tries to delay fetching
records of a relation to the latest possible point.

To prevent this, we can call `#fetch` manually to request the records to
be loaded earlier.
The gem in its current version does not allow ordering by any arbitrary
SQL statements. Only SQL columns can be used.

This is on one side due to performance reasons, but on the other side
also due to the added high complexity of supporting arbitrary SQL
queries. We would have to ensure that it doesn't open the door for SQL
injection attacks. We would also have to return the value of this query
so that it can be properly encoded into the cursor.

For now, this is out of scope for this gem.
This method is invoked multiple times to retrieve the `start_cursor`,
the `end_cursor`, and ultimately to return the actual `page` record.
By memoizing it, we avoid it from mapping over the `records` again and
again and rebuilding all the cursors every time it is invoked.
@nicolas-fricke nicolas-fricke force-pushed the fix-order-by-performance branch from d24fd65 to 3c56399 Compare April 19, 2021 14:23
@nicolas-fricke
Copy link
Member Author

Thanks for the reviews and testing @domangi and @Marta83! The Postgres tests on this PR kept failing due to an issue in the Github Actions step that was used. I disabled the failing PG versions in #8, rebased this PR and will now proceed with merging it :)

@nicolas-fricke nicolas-fricke merged commit 29c1ce8 into master Apr 19, 2021
@nicolas-fricke nicolas-fricke deleted the fix-order-by-performance branch February 18, 2022 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use order_by argument as cursor
3 participants