[Ideas] Optimize SQL: select distinct count(a) from t1; #677

avamingli · 2024-10-16T14:47:52Z

avamingli
Oct 16, 2024
Collaborator

Description

A user has SQL auto-generated by a third-party tool, like:

select DISTINCT count(distinct a) from t1;

However, such a query (an aggregate without GROUP BY) returns at most one row.
In theory the first DISTINCT can be removed, and then the Unique and Sort nodes (or whatever other nodes the planner picks) on top of Finalize Aggregate can be avoided.

explain(costs off) select distinct count(distinct a) from t1;
                         QUERY PLAN
------------------------------------------------------------
 Unique
   Group Key: (count(DISTINCT a))
   ->  Sort
         Sort Key: (count(DISTINCT a))
         ->  Finalize Aggregate
               ->  Gather Motion 3:1  (slice1; segments: 3)
                     ->  Partial Aggregate
                           ->  Seq Scan on t1
 Optimizer: Postgres query optimizer
(9 rows)
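
The redundancy of the outer DISTINCT can be checked on any engine that follows standard SQL aggregate semantics. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module instead of CBDB so that it is self-contained, and the rows in t1 are made up. It only compares results with and without the outer DISTINCT and says nothing about plan shape.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Assumption: an in-memory SQLite database stands in for CBDB here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (1,), (2,), (3,), (3,)])

with_distinct = run(conn, "SELECT DISTINCT count(DISTINCT a) FROM t1")
without_distinct = run(conn, "SELECT count(DISTINCT a) FROM t1")

# An aggregate without GROUP BY yields a single row,
# so the outer DISTINCT has nothing to remove.
print(with_distinct, without_distinct)

# The same holds for an empty table: still one row, containing 0.
conn.execute("DELETE FROM t1")
empty_result = run(conn, "SELECT DISTINCT count(DISTINCT a) FROM t1")
print(empty_result)
```

Both queries return the same single row, which is why dropping the outer DISTINCT does not change the answer.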

Currently, the Postgres planner doesn't have that feature.

Shall we do it in CBDB?

Pro:
Simplifies the plan tree a little.
Con:
The user's SQL is changed, and there may be other risks for the planner.

Use case/motivation

No response

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

my-ship-it · 2024-10-16T16:19:53Z

my-ship-it
Oct 16, 2024
Collaborator

ORCA already has this feature

postgres=# explain select distinct count(v) from tbl;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..431.00 rows=1 width=8)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=8)
         ->  Partial Aggregate  (cost=0.00..431.00 rows=1 width=8)
               ->  Seq Scan on tbl  (cost=0.00..431.00 rows=1 width=4)
 Optimizer: Pivotal Optimizer (GPORCA)
(5 rows)

It's a special case of a more general one.
If the target list is already unique, for example a count result or columns with unique indexes, it seems the optimization should apply?

1 reply

avamingli Oct 17, 2024
Collaborator Author

Yes, in general such a query (an aggregate without GROUP BY) returns at most one row, so we can optimize away the DISTINCT.
As far as we know, that case is simple.

columns with unique indexes

This one is harder, and I'm not sure whether Postgres already optimizes it.
It depends on what the expression is. A simple column is OK, e.g. select distinct a from t1; (where a has a unique index).
But for a complex expression, we don't know how to decide whether DISTINCT can be removed.
Either way, we have to work out the rules for deciding.
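
To make the difference between a simple column and an expression concrete, here is a minimal sketch. It again assumes SQLite (via Python's sqlite3) as a stand-in for CBDB; the index name t1_a_idx and the rows are made up. It also shows one extra case, stated here as an assumption about what a real rule would need to handle: a unique index accepts several NULLs by default, while DISTINCT treats them as duplicates.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for CBDB; names and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER)")
conn.execute("CREATE UNIQUE INDEX t1_a_idx ON t1 (a)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,), (4,)])

def row_count(sql):
    """Return how many rows a query produces."""
    return len(conn.execute(sql).fetchall())

# Simple unique column: DISTINCT removes nothing.
plain_col = (row_count("SELECT a FROM t1"),
             row_count("SELECT DISTINCT a FROM t1"))

# Expression over the unique column: uniqueness is not preserved.
expr = (row_count("SELECT a % 2 FROM t1"),
        row_count("SELECT DISTINCT a % 2 FROM t1"))

# The unique index accepts several NULLs, but DISTINCT collapses them.
conn.executemany("INSERT INTO t1 VALUES (?)", [(None,), (None,)])
with_nulls = (row_count("SELECT a FROM t1"),
              row_count("SELECT DISTINCT a FROM t1"))

print(plain_col, expr, with_nulls)
```

So any rule based on unique indexes would have to look at both the shape of the expression and the nullability of the column.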

avamingli · 2024-10-22T15:42:05Z

avamingli
Oct 22, 2024
Collaborator Author

ORDER BY could also be optimized in the same case: an aggregate without GROUP BY.
And Postgres' special DISTINCT ON usually relies on ordering.

0 replies

leborchuk · 2024-10-23T11:43:17Z

leborchuk
Oct 23, 2024

I don't know whether this helps, but in ClickHouse the same query looks like this:

 select count(distinct sess_id) from sessions;

SELECT countDistinct(sess_id)
FROM sessions

Query id: f1c0eb0e-b0f3-4477-98e9-58c1aba11b72

   ┌─uniqExact(sess_id)─┐
1. │           16975496 │ -- 16.98 million
   └────────────────────┘

1 row in set. Elapsed: 0.295 sec. Processed 48.97 million rows, 391.74 MB (165.84 million rows/s., 1.33 GB/s.)
Peak memory usage: 621.76 MiB.

 explain plan actions=1 select count(distinct sess_id) from sessions;

EXPLAIN actions = 1
SELECT countDistinct(sess_id)
FROM sessions

Query id: b09c677d-bb95-47a9-855d-cb8bc82333b0

    ┌─explain──────────────────────────────────────────────┐
 1. │ Expression ((Projection + Before ORDER BY))          │
 2. │ Actions: INPUT :: 0 -> uniqExact(sess_id) UInt64 : 0 │
 3. │ Positions: 0                                         │
 4. │   Aggregating                                        │
 5. │   Keys:                                              │
 6. │   Aggregates:                                        │
 7. │       uniqExact(sess_id)                             │
 8. │         Function: uniqExact(UInt64) → UInt64         │
 9. │         Arguments: sess_id                           │
10. │   Skip merging: 0                                    │
11. │     Expression (Before GROUP BY)                     │
12. │     Actions: INPUT :: 0 -> sess_id UInt64 : 0        │
13. │     Positions: 0                                     │
14. │       ReadFromMergeTree (yagpcc.sessions_part)       │
15. │       ReadType: Default                              │
16. │       Parts: 169                                     │
17. │       Granules: 8365                                 │
    └──────────────────────────────────────────────────────┘

 explain pipeline select count(distinct sess_id) from sessions;

EXPLAIN PIPELINE
SELECT countDistinct(sess_id)
FROM sessions

Query id: 1fbb7d32-4373-45e9-be35-d262c9883c31

    ┌─explain─────────────────────────────────────────────────────────────────┐
 1. │ (Expression)                                                            │
 2. │ ExpressionTransform × 16                                                │
 3. │   (Aggregating)                                                         │
 4. │   Resize 16 → 16                                                        │
 5. │     AggregatingTransform × 16                                           │
 6. │       StrictResize 16 → 16                                              │
 7. │         (Expression)                                                    │
 8. │         ExpressionTransform × 16                                        │
 9. │           (ReadFromMergeTree)                                           │
10. │           MergeTreeSelect(pool: ReadPool, algorithm: Thread) × 16 0 → 1 │
    └─────────────────────────────────────────────────────────────────────────┘

The small description - https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/uniq

4 replies

avamingli Oct 24, 2024
Collaborator Author

Thanks @leborchuk

avamingli Oct 24, 2024
Collaborator Author

The small description - https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/uniq

I wasn't familiar with ClickHouse, so I took a look.

Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values up to 65536. This algorithm is very accurate and very efficient on the CPU. When the query contains several of these functions, using uniq is almost as fast as using other aggregate functions.

It seems ClickHouse's uniq() uses a sampling algorithm and cannot guarantee an exact result every time.

this algorithm is very accurate

That means the result is not guaranteed to be 100% correct every time. Do I understand that right?

ClickHouse's uniq() aims for the best efficiency at the cost of precision.
That's a good trade-off in some cases. However, in a relational database like Postgres and our CBDB, we must guarantee a 100% correct result every time.

We're trying to remove useless plan nodes/processing for DISTINCT/ORDER BY when the SQL is an aggregate without GROUP BY.
But suggestions and knowledge are always welcome, thanks again @leborchuk
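
The exact-versus-approximate trade-off mentioned above can be illustrated with a small sketch. This is not ClickHouse's uniq or uniqExact implementation; it swaps in a simple k-minimum-values estimator purely for illustration, and the function names exact_distinct and approx_distinct are hypothetical.

```python
import hashlib

def exact_distinct(values):
    """Exact count: remembers every distinct value, so memory grows with cardinality."""
    return len(set(values))

def approx_distinct(values, k=256):
    """K-minimum-values estimate: keeps only the k smallest 64-bit hashes."""
    smallest = set()
    for v in values:
        digest = hashlib.sha256(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the value
        if h in smallest:
            continue
        if len(smallest) < k:
            smallest.add(h)
        else:
            largest = max(smallest)
            if h < largest:
                smallest.remove(largest)
                smallest.add(h)
    if len(smallest) < k:
        # Fewer than k distinct hashes were seen: the count is exact.
        return len(smallest)
    # The k-th smallest hash, scaled to [0, 1), indicates how densely
    # the distinct values fill the hash space.
    return (k - 1) / (max(smallest) / 2**64)

data = [i % 5000 for i in range(20000)]
exact = exact_distinct(data)
estimate = approx_distinct(data)
print(exact, round(estimate))
```

The estimator uses bounded memory but only lands near the true count, which is the kind of imprecision a relational database cannot accept for count(distinct ...).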

leborchuk Oct 25, 2024

You are right. For this query ClickHouse uses the uniqExact function. It's similar to uniq but uses more memory: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/uniqexact#agg_function-uniqexact (I linked the uniq function earlier because the uniqExact page does not contain a proper description of the algorithm.)

I also found discussions on pg hackers about removing DISTINCT:
https://www.postgresql.org/message-id/CAKU4AWqZvSyxroHkbpiHSCEAY2C41dG7VWs%3Dc188KKznSK_2Zg%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CAKJS1f-wH83Fi2coEVNUWFxOGQ4BJRRTGqDMvidCoiR9WEwxsw%40mail.gmail.com#56a08b441cc61afaf85c6232c5d40a3f.

Maybe they will be helpful too.

avamingli Oct 28, 2024
Collaborator Author

@leborchuk Thanks a lot.

avamingli · 2024-10-24T13:29:01Z

avamingli
Oct 24, 2024
Collaborator Author

Also optimize DISTINCT ON, see #685

0 replies

avamingli · 2024-11-04T02:47:28Z

avamingli
Nov 4, 2024
Collaborator Author

However, for such query: Aggregate SQL without Group by, there is one row returned at most.

However, a set-returning function (SRF) breaks this:

 select count(*), generate_series(1, 4) from t1;
 count | generate_series
-------+-----------------
     3 |               1
     3 |               2
     3 |               3
     3 |               4
(4 rows)
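
Putting the thread's observations together, the condition for dropping DISTINCT can be sketched as a small predicate. This is a hypothetical model, not the code in #685 and not a Postgres or CBDB data structure: QueryShape and distinct_is_redundant are made-up names, and the rule is deliberately conservative (it answers False whenever the one-row guarantee is not obvious).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryShape:
    """Hypothetical summary of a parsed query; not a CBDB or Postgres structure."""
    has_aggregates: bool
    has_group_by: bool
    has_srf_in_target_list: bool

def distinct_is_redundant(q: QueryShape) -> bool:
    """True when the query returns at most one row, so DISTINCT removes nothing."""
    if not q.has_aggregates or q.has_group_by:
        return False
    # A set-returning function in the target list expands the single
    # aggregate row into several rows, so the one-row guarantee is lost.
    return not q.has_srf_in_target_list

plain_agg = QueryShape(has_aggregates=True, has_group_by=False,
                       has_srf_in_target_list=False)
agg_with_srf = QueryShape(has_aggregates=True, has_group_by=False,
                          has_srf_in_target_list=True)
grouped_agg = QueryShape(has_aggregates=True, has_group_by=True,
                         has_srf_in_target_list=False)

print(distinct_is_redundant(plain_agg),
      distinct_is_redundant(agg_with_srf),
      distinct_is_redundant(grouped_agg))
```

Here plain_agg corresponds to select distinct count(distinct a) from t1, and agg_with_srf to the generate_series query above.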

1 reply

avamingli Nov 4, 2024
Collaborator Author

And WITH ORDINALITY is fixed too, in #685.

avamingli · 2024-12-04T03:31:13Z

avamingli
Dec 4, 2024
Collaborator Author

Implemented by #685.

0 replies
