Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: improve SQL parsing #26767

Merged
merged 5 commits into from
Mar 13, 2024
Merged

chore: improve SQL parsing #26767

merged 5 commits into from
Mar 13, 2024

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Jan 23, 2024

SUMMARY

This is the first PR in a series to improve our SQL parsing, in effort to clean up the code base and make it more secure.

Currently, we have a class called superset.sql_parse.ParsedQuery that is used as an interface every time we need to parse or manipulate SQL. There are a few problems with the status quo:

  1. Even though the class is called ParsedQuery, it's designed to accept a single SQL statement, not necessarily a query. For example, it has a set_or_update_query_limit method that works only on the first statement, if there are multiple.
  2. Even though it's designed to accept a single statement, it doesn't raise an error when instantiated with a query that has multiple statements, so in several places the class is used with a full query.
  3. There are multiple places that import sqlparse directly to manipulate SQL. For example, when inserting RLS into a query, or when modifying a query to insert a LIMIT or TOP.
  4. In many places we manipulate SQL inefficiently, serializing and deserializing the SQL multiple times.

Ideally we'd have a single abstraction that handles SQL parsing. Superset code should never parse code directly by importing a 3rd party library like sqlparse, sqloxide, or sqlglot; instead, it should only call the abstraction. Similarly, the abstraction should wrap library specific exceptions (like sqlglot.errors.ParseError) with its own.

This PR introduces two new classes that aim to be those abstractions: SQLScript and SQLStatement. The SQLStatement strictly represents a single SQL statement. These classes are instantiated with SQL and an optional engine (a BaseEngineSpec.engine attribute), and should provide all the methods needed for manipulating and introspecting SQL queries and statements. The initial implementation uses sqlglot, but the clean interface allows the parser to be changed in the future if we want.

My plan for improving the SQL parsing is first to get rid of code outside of sql_parse.py that using sqlparse directly:

  • Introduce SQLScript and SQLStatement, and replace any simple calls to sqlparse with calls to the new classes (this PR).
  • Implement a custom SQLStatement class for Kusto, since it doesn't use conventional SQL.
  • Implement insert_rls_in_predicate inside SQLStatement.
  • Implement get_cte_query inside SQLStatement.
  • Implement apply_top_to_sql in SQLStatement.

At this point we should have sqlparse being used only inside sql_parse.py by ParsedQuery. The last steps would be:

  • Replace all calls to ParsedQuery with calls to the new classes.
  • Get rid of the sqlparse dependency.

Note that despite the big changes, no public interfaces will change. Also, we've been gradually changing the parser due to security issues, having introduced sqlglot as a dependency in #26476. Because of this, I thought it wouldn't be necessary to write a SIP. I'm happy to write one if people think it's necessary.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TESTING INSTRUCTIONS

Updated unit tests, with mostly cosmetic changes because the SQL pretty-printing has changed.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@@ -252,6 +253,163 @@ def __eq__(self, __o: object) -> bool:
return str(self) == str(__o)


def extract_tables_from_statement(
Copy link
Member Author

@betodealmeida betodealmeida Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was originally a method in ParsedQuery, I just converted it into a function without any big changes (see below).

@betodealmeida betodealmeida force-pushed the remove-sqlparse-step-1 branch 4 times, most recently from a3715d4 to 276ccfb Compare January 23, 2024 22:39
Copy link

codecov bot commented Jan 23, 2024

Codecov Report

Attention: Patch coverage is 96.25000% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 69.69%. Comparing base (735b895) to head (d1a5b7e).

Files Patch % Lines
superset/sql_parse.py 96.42% 2 Missing ⚠️
superset/models/helpers.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #26767      +/-   ##
==========================================
- Coverage   69.73%   69.69%   -0.04%     
==========================================
  Files        1909     1909              
  Lines       74692    74734      +42     
  Branches     8325     8325              
==========================================
  Hits        52086    52086              
- Misses      20556    20598      +42     
  Partials     2050     2050              
Flag Coverage Δ
hive ?
javascript 57.22% <ø> (ø)
mysql 78.03% <76.25%> (+0.01%) ⬆️
postgres 78.15% <86.25%> (+0.03%) ⬆️
presto 53.69% <67.50%> (+0.03%) ⬆️
python 83.05% <96.25%> (-0.10%) ⬇️
sqlite 77.58% <76.25%> (+0.01%) ⬆️
unit 56.67% <82.50%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@betodealmeida betodealmeida changed the title chore: get rid of sqlparse chore: improve SQL parsing Jan 24, 2024
@betodealmeida betodealmeida force-pushed the remove-sqlparse-step-1 branch 3 times, most recently from f5ebc7f to 02c5093 Compare January 24, 2024 16:39
Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments:

  • Let's remove sqlparse from setup.py and requirements. I noticed pip-compile-multi was SUPER slow last time I tried it
  • about the description and now improved support for multi-statements, are all the cases covered by new tests in sql_parse_tests.py?

@villebro
Copy link
Member

I think this is a welcome change and a move in the right direction, but I feel a big change like this should probably be discussed in the form of a SIP first. I anticipate the final implementation to end up looking similar to this PR, but it may be beneficial to discuss the change first before jumping directly into the code.

@betodealmeida
Copy link
Member Author

I think this is a welcome change and a move in the right direction, but I feel a big change like this should probably be discussed in the form of a SIP first. I anticipate the final implementation to end up looking similar to this PR, but it may be beneficial to discuss the change first before jumping directly into the code.

SIP here: #26786

@villebro
Copy link
Member

Thanks @betodealmeida !

@betodealmeida betodealmeida force-pushed the remove-sqlparse-step-1 branch 2 times, most recently from b5f06ea to d6bc9c9 Compare January 24, 2024 19:50
@betodealmeida betodealmeida force-pushed the remove-sqlparse-step-1 branch 4 times, most recently from ae687cc to 4769289 Compare January 24, 2024 21:38
@betodealmeida
Copy link
Member Author

#26786 approved!

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. At a high level the main thing for me would be to clarify the object model and inheritance scheme for the SQL-related utilities. At a high level thinking about this, it sits close to db_engine_spec or at least relate to it in some ways.

@@ -22,12 +22,13 @@
import urllib.parse
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from typing import Any, cast, Optional
from typing import Any, cast, Optional, Union
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be good to rename this module to a better name, the current name can be confused with having a relationship with sqlparse, and overall just isn't a good name. It could be utils/sql.py, or more directly superset/sql/* if we need to grow this into a package with multiple modules. superset/sql_parser.py (?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I like this!

@john-bodley, you did some work on reorganizing the codebase, what are your thought here?

superset/sql_parse.py Show resolved Hide resolved
superset/sql_parse.py Show resolved Hide resolved
@@ -537,7 +537,7 @@ def test_mssql_engine_spec_pymssql(self):
)

def test_comments_in_sqlatable_query(self):
clean_query = "SELECT '/* val 1 */' as c1, '-- val 2' as c2 FROM tbl"
clean_query = "SELECT\n '/* val 1 */' AS c1,\n '-- val 2' AS c2\nFROM tbl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggesting] it could be good to have some sort of reusable compare_sql(strict=False, case_sensitive=False, disregard_schema_prefix=True) function that could be reused in unit tests. Maybe it's a method of ParseQuery (is_identical or is_similar)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point, I'll add that!

@github-actions github-actions bot added the api Related to the REST API label Feb 7, 2024
@rafehi
Copy link

rafehi commented Feb 28, 2024

Where are we at with this work? There's some additional changes that I'd be willing to take on (#19572) but looks like this is a dependency.

We've made and validated significant performance improvements in our Superset fork, but makes sense to consolidate those into SQLScript and SQLStatement rather than directly replace sqlparse calls with sqlglot.

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@eschutho eschutho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed just the cypress test portion as a code-owner requirement. That LGTM. 👍

@betodealmeida betodealmeida merged commit 26d8077 into master Mar 13, 2024
36 of 37 checks passed
@mistercrunch
Copy link
Member

Hey @betodealmeida , this PR might have broken a Hive-specific test -> https://github.com/apache/superset/actions/runs/8272398041/job/22634142747#step:11:608

It appears to fail after this PR was merged (I noticed since it was failing on some other unrelated PR) and trace it back here. Note that this particular GHA test for Hive runs only when superset/** is altered by a PR, that explains why other builds succeeded on master since then.

Hive seems to be complaining about a GROUP BY 1 but from memory it totally supports that. Unclear what's up with the test but I thought you might know more given this PR.

@joshuaray joshuaray mentioned this pull request Mar 19, 2024
3 tasks
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
EandrewJones pushed a commit to UMD-ARLIS/superset that referenced this pull request Apr 5, 2024
@rusackas rusackas deleted the remove-sqlparse-step-1 branch April 16, 2024 16:52
qleroy pushed a commit to qleroy/superset that referenced this pull request Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Related to the REST API size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants