chore: improve SQL parsing #26767
@@ -252,6 +253,163 @@ def __eq__(self, __o: object) -> bool:
         return str(self) == str(__o)


+def extract_tables_from_statement(
This function was originally a method in ParsedQuery; I just converted it into a function without any big changes (see below).
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #26767 +/- ##
==========================================
- Coverage 69.73% 69.69% -0.04%
==========================================
Files 1909 1909
Lines 74692 74734 +42
Branches 8325 8325
==========================================
Hits 52086 52086
- Misses 20556 20598 +42
Partials 2050 2050
Few comments:
- Let's remove sqlparse from setup.py and requirements. I noticed pip-compile-multi was SUPER slow last time I tried it.
- About the description and the now improved support for multi-statements: are all the cases covered by new tests in sql_parse_tests.py?
I think this is a welcome change and a move in the right direction, but I feel a big change like this should probably be discussed in the form of a SIP first. I anticipate the final implementation to end up looking similar to this PR, but it may be beneficial to discuss the change first before jumping directly into the code.
SIP here: #26786
Thanks @betodealmeida!
#26786 approved!
Left some comments. At a high level, the main thing for me would be to clarify the object model and inheritance scheme for the SQL-related utilities. Thinking about this at a high level, it sits close to db_engine_spec, or at least relates to it in some ways.
@@ -22,12 +22,13 @@
 import urllib.parse
 from collections.abc import Iterable, Iterator
 from dataclasses import dataclass
-from typing import Any, cast, Optional
+from typing import Any, cast, Optional, Union
It could be good to give this module a better name. The current name can be mistaken for having a relationship with sqlparse, and overall just isn't a good name. It could be utils/sql.py, or more directly superset/sql/* if we need to grow this into a package with multiple modules. superset/sql_parser.py (?)
Ah, I like this!
@john-bodley, you did some work on reorganizing the codebase, what are your thoughts here?
@@ -537,7 +537,7 @@ def test_mssql_engine_spec_pymssql(self):
         )

     def test_comments_in_sqlatable_query(self):
-        clean_query = "SELECT '/* val 1 */' as c1, '-- val 2' as c2 FROM tbl"
+        clean_query = "SELECT\n '/* val 1 */' AS c1,\n '-- val 2' AS c2\nFROM tbl"
[suggesting] It could be good to have some sort of reusable compare_sql(strict=False, case_sensitive=False, disregard_schema_prefix=True) function that could be reused in unit tests. Maybe it's a method of ParsedQuery (is_identical or is_similar).
Yeah, good point, I'll add that!
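A rough idea of what the helper suggested in this thread could look like, as a minimal sketch rather than anything taken from this PR. The name compare_sql and the case_sensitive flag come from the suggestion above; the strict and disregard_schema_prefix options are left out, and the regex-based normalization is an assumption that a real version would probably replace with the parser itself. Only single-quoted literals are protected from normalization.

```python
import re

# Matches a single-quoted SQL literal, including doubled quotes ('it''s').
_LITERAL = re.compile(r"('(?:[^']|'')*')")


def _normalize(sql: str, case_sensitive: bool) -> str:
    """Collapse whitespace (and optionally case) outside quoted literals."""
    # re.split with a capture group puts literals at the odd indices.
    parts = _LITERAL.split(sql.strip())
    normalized = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            normalized.append(part)  # literal: keep verbatim
            continue
        part = re.sub(r"\s+", " ", part)
        normalized.append(part if case_sensitive else part.lower())
    return "".join(normalized).strip()


def compare_sql(left: str, right: str, case_sensitive: bool = False) -> bool:
    """Return True if both strings are the same SQL up to formatting."""
    return _normalize(left, case_sensitive) == _normalize(right, case_sensitive)
```

With a helper along these lines, the two clean_query strings in the diff above would compare as equal by default, so the test would not need to change when only the pretty-printing changes.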
Where are we at with this work? There are some additional changes that I'd be willing to take on (#19572), but it looks like this is a dependency. We've made and validated significant performance improvements in our Superset fork, but it makes sense to consolidate those into SQLScript and SQLStatement rather than directly replace sqlparse calls with sqlglot.
LGTM
I reviewed just the cypress test portion as a code-owner requirement. That LGTM. 👍
Hey @betodealmeida, this PR might have broken a Hive-specific test -> https://github.com/apache/superset/actions/runs/8272398041/job/22634142747#step:11:608 It appears to fail after this PR was merged (I noticed since it was failing on some other unrelated PR and traced it back here). Note that this particular GHA test for Hive runs only when Hive seems to be complaining about a
SUMMARY
This is the first PR in a series to improve our SQL parsing, in an effort to clean up the code base and make it more secure.
Currently, we have a class called superset.sql_parse.ParsedQuery that is used as an interface every time we need to parse or manipulate SQL. There are a few problems with the status quo:
- Despite the name ParsedQuery, it's designed to accept a single SQL statement, not necessarily a query. For example, it has a set_or_update_query_limit method that works only on the first statement, if there are multiple.
- Code outside of sql_parse.py still uses sqlparse directly to manipulate SQL. For example, when inserting RLS into a query, or when modifying a query to insert a LIMIT or TOP.
Ideally we'd have a single abstraction that handles SQL parsing. Superset code should never parse code directly by importing a 3rd party library like sqlparse, sqloxide, or sqlglot; instead, it should only call the abstraction. Similarly, the abstraction should wrap library-specific exceptions (like sqlglot.errors.ParseError) with its own.
This PR introduces two new classes that aim to be those abstractions: SQLScript and SQLStatement. The SQLStatement strictly represents a single SQL statement. These classes are instantiated with SQL and an optional engine (a BaseEngineSpec.engine attribute), and should provide all the methods needed for manipulating and introspecting SQL queries and statements. The initial implementation uses sqlglot, but the clean interface allows the parser to be changed in the future if we want.
My plan for improving the SQL parsing is first to get rid of code outside of sql_parse.py that uses sqlparse directly:
- Introduce SQLScript and SQLStatement, and replace any simple calls to sqlparse with calls to the new classes (this PR).
- A SQLStatement class for Kusto, since it doesn't use conventional SQL.
- insert_rls_in_predicate inside SQLStatement.
- get_cte_query inside SQLStatement.
- apply_top_to_sql in SQLStatement.
At this point we should have sqlparse being used only inside sql_parse.py by ParsedQuery. The last steps would be:
- Replace calls to ParsedQuery with calls to the new classes.
- Remove the sqlparse dependency.
Note that despite the big changes, no public interfaces will change. Also, we've been gradually changing the parser due to security issues, having introduced sqlglot as a dependency in #26476. Because of this, I thought it wouldn't be necessary to write a SIP. I'm happy to write one if people think it's necessary.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A
TESTING INSTRUCTIONS
Updated unit tests, with mostly cosmetic changes because the SQL pretty-printing has changed.
ADDITIONAL INFORMATION
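To make the shape of the abstraction described in the summary more concrete, here is a minimal, self-contained sketch. It is not the code in this PR: only the class names SQLScript and SQLStatement and the optional engine argument come from the description. The SQLParseError name and the _split_statements helper are hypothetical, and the naive semicolon splitter merely stands in for the real parser (sqlglot in the PR) so the example runs without dependencies. The point it illustrates is that callers see only the wrapper classes and the wrapper exception, never the underlying library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class SQLParseError(Exception):
    """Hypothetical wrapper; callers never see parser-specific errors."""


def _split_statements(sql: str) -> list[str]:
    """Stand-in for a real parser: split on ';' outside single quotes."""
    statements, current, in_quote = [], [], False
    for char in sql:
        if char == "'":
            in_quote = not in_quote
        if char == ";" and not in_quote:
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    if in_quote:
        raise ValueError("unterminated string literal")
    statements.append("".join(current).strip())
    return [statement for statement in statements if statement]


@dataclass(frozen=True)
class SQLStatement:
    """Strictly a single SQL statement."""

    sql: str
    engine: Optional[str] = None


class SQLScript:
    """One or more statements, parsed behind a library-agnostic interface."""

    def __init__(self, sql: str, engine: Optional[str] = None) -> None:
        try:
            parts = _split_statements(sql)
        except ValueError as ex:
            # Wrap the backend's error so the parser can be swapped later.
            raise SQLParseError(str(ex)) from ex
        self.engine = engine
        self.statements = [SQLStatement(part, engine) for part in parts]
```

Under these assumptions, SQLScript("SELECT 1; SELECT 2", engine="postgresql").statements would hold two SQLStatement objects, and malformed input would surface as SQLParseError regardless of which parser sits underneath.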