PySQL Connector split into connector and sqlalchemy #444

Open
wants to merge 19 commits into
base: main

Conversation

jprakash-db
Contributor

@jprakash-db jprakash-db commented Sep 24, 2024

Major Change - v4.x.x

Related Links

databricks_sqlalchemy split is present in this PR - databricks/databricks-sqlalchemy#1

Description

The databricks-sql-python library is being split into two packages to satisfy the following business needs:

  • PyArrow should be optional, both for users who do not intend to work with large volumes of data and for users who want a smaller package.
  • The SQLAlchemy code is moved to a separate library, databricks-sqlalchemy, so that users can use either SQLAlchemy v1 or SQLAlchemy v2 with the latest version of the connector.

The Split

The two packages after the split are:

databricks-sql-python

  • It is the core part of the library and stays in this GitHub repo.
  • PyArrow is an optional dependency and is not installed by default.
    pip install databricks-sql-connector installs the lean connector, and pip install databricks-sql-connector[pyarrow] installs the complete connector.

! Without PyArrow, features that depend on Arrow, such as Cloudfetch, are disabled, and only inline results are supported.
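
A minimal sketch of how calling code might detect the optional dependency before relying on Arrow-backed features. The helper names `pyarrow_available` and `describe_result_mode` are hypothetical and are not part of the connector's API; the only real call used is `importlib.util.find_spec` from the standard library.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Return True if the optional pyarrow package can be imported."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec("pyarrow") is not None


def describe_result_mode(has_pyarrow: bool) -> str:
    """Summarize which result paths this PR description says are available."""
    if has_pyarrow:
        return "Arrow features available (for example Cloudfetch)"
    return "inline results only"


if __name__ == "__main__":
    print(describe_result_mode(pyarrow_available()))
```

A check like this lets an application decide up front whether to install the `[pyarrow]` extra rather than discovering the missing dependency at query time.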

databricks-sqlalchemy

  • The SQLAlchemy code is moved to a separate repository so that its release flow can be controlled independently.
  • The databricks-sqlalchemy library has a core dependency on the connector with PyArrow, so installing databricks-sqlalchemy also installs databricks-sql-python and PyArrow.
  • You can install the latest SQLAlchemy v1 based library using pip install databricks-sqlalchemy~=1.0, or the SQLAlchemy v2 based library using pip install databricks-sqlalchemy.

Published Library on PyPI

Development Details

  • Going forward, all PRs related to databricks-sql-python will be raised on this repo.
  • The SQLAlchemy v1 based library is not under active development and has therefore been moved to the v1/main branch in the databricks-sqlalchemy repo. All future PRs for the v1 library must be raised against this branch.
  • The SQLAlchemy v2 based library is under active development and lives on the default main branch in the databricks-sqlalchemy repo.

PR Details

Tasks Completed

  • Refactored the code into its respective folders based on the proposed design doc
  • Changed the pyproject.toml file to reflect the proper dependencies for the split
  • Made sure that all the existing e2e and unit tests work both pre and post split, ensuring parity
  • Added benchmarking queries to test performance pre and post split, and created a dashboard for visualization
  • Added dependency tests to check how the library behaves when certain libraries are not available and the user requests their functions
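
One way such a dependency test could look, as a sketch only and not the tests added in this PR: the helper `load_arrow_table_class` is hypothetical, and the test simulates a missing PyArrow by placing `None` in `sys.modules`, which makes any import of that name raise `ImportError`.

```python
import importlib
import sys
from unittest import mock


def load_arrow_table_class():
    """Hypothetical helper: import pyarrow lazily and fail with a clear message."""
    try:
        pyarrow = importlib.import_module("pyarrow")
    except ImportError as exc:
        raise ImportError(
            "pyarrow is required for this feature; "
            "install it with: pip install databricks-sql-connector[pyarrow]"
        ) from exc
    return pyarrow.Table


def test_clear_error_when_pyarrow_is_absent():
    # A None entry in sys.modules makes importing that name raise ImportError,
    # which simulates an environment where pyarrow is not installed.
    with mock.patch.dict(sys.modules, {"pyarrow": None}):
        try:
            load_arrow_table_class()
        except ImportError as exc:
            assert "pyarrow" in str(exc)
        else:
            raise AssertionError("expected ImportError when pyarrow is missing")
```

Because the function name starts with `test_` and uses plain assertions, pytest would collect it without extra configuration. `mock.patch.dict` restores `sys.modules` on exit, so the simulated absence does not leak into other tests.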

How to Test

The testing pipeline remains the same as it was before the split.
pytest can be used to run both the integration and the unit tests directly: pytest [directory_name or file_name]

Performance Comparison - Benchmarking

Pre-split and post-split performance has been compared using large and small queries to make sure there is no performance regression.
A dashboard has been created so that every time the benchmarking is run the results are stored in the benchfood, and comparisons can be made easily.
Screenshot 2024-09-03 at 2 48 19 PM

…ore part (#417)

* Implemented ColumnQueue to test the fetchall without pyarrow

Removed token

removed token

* order of fields in row corrected

* Changed the folder structure and tested the basic setup to work

* Refactored the code to make connector to work

* Basic Setup of connector, core and sqlalchemy is working

* Basic integration of core, connect and sqlalchemy is working

* Setup working dynamic change from ColumnQueue to ArrowQueue

* Refactored the test code and moved to respective folders

* Added the unit test for column_queue

Fixed __version__

Fix

* venv_main added to git ignore

* Added code for merging columnar table

* Merging code for columnar

* Fixed the retry_close session test issue with logging

* Fixed the databricks_sqlalchemy tests and introduced pytest.ini for the sqla_testing

* Added pyarrow_test mark on pytest

* Fixed databricks.sqlalchemy to databricks_sqlalchemy imports

* Added poetry.lock

* Added dist folder

* Changed the pyproject.toml

* Minor Fix

* Added the pyarrow skip tag on unit tests and tested their working

* Fixed the Decimal and timestamp conversion issue in non arrow pipeline

* Removed not required files and reformatted

* Fixed test_retry error

* Changed the folder structure to src / databricks

* Removed the columnar non arrow flow to another PR

* Moved the README to the root

* removed columnQueue instance

* Removed databricks_sqlalchemy dependency in core

* Changed the pysql_supports_arrow predicate, introduced changes in the pyproject.toml

* Ran the black formatter with the original version

* Extra .py removed from all the __init__.py files names

* Undo formatting check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* Check

* BIG UPDATE

* Refactor code

* Refactor

* Fixed versioning

* Minor refactoring

* Minor refactoring
Print warning message if pyarrow is not installed

Signed-off-by: Jacky Hu <[email protected]>
Remove sqlalchemy and update README.md

Signed-off-by: Jacky Hu <[email protected]>

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@jprakash-db jprakash-db changed the title PySQL Connector split into core and non core part PySQL Connector split into connector and sqlalchemy Dec 11, 2024
@@ -52,7 +52,7 @@ jobs:
 # install your root project, if required
 #----------------------------------------------
 - name: Install library
-  run: poetry install --no-interaction
+  run: poetry install --no-interaction --all-extras
Collaborator

I think we also want to check and test for non-extra dep scenario.
