PySQL Connector split into connector and sqlalchemy #444

jprakash-db · 2024-09-24T05:04:17Z

Major Change - v4.x.x

Description

The databricks-sql-python library is being split into two packages to meet two business needs:

PyArrow becomes optional, both for users who do not intend to handle large volumes of data and for users who want a smaller package.
The SQLAlchemy code moves to a separate library, databricks-sqlalchemy, so that users can use either SQLAlchemy v1 or SQLAlchemy v2 with the latest version of the connector.

The Split

The two packages after the split are:

databricks-sql-python

It is the core part of the library and remains in this GitHub repo.
It has an optional dependency on PyArrow, which is not installed by default.
pip install databricks-sql-connector installs the lean connector, and pip install databricks-sql-connector[pyarrow] installs the complete connector.

! Not installing PyArrow disables CloudFetch and other features that require Arrow. Without PyArrow, only inline results are supported.
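To illustrate the lean versus complete install, here is a minimal sketch of how calling code might check whether the optional PyArrow dependency is present. The helper names pyarrow_available and describe_install are hypothetical and are not part of the connector's API. The sketch uses only the standard library.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Return True if the optional PyArrow package can be found."""
    # find_spec returns None when no importable module named "pyarrow" exists.
    return importlib.util.find_spec("pyarrow") is not None


def describe_install() -> str:
    """Summarize which feature set the current environment supports."""
    if pyarrow_available():
        return "complete connector: Arrow-based features such as CloudFetch can be used"
    return (
        "lean connector: inline results only; "
        "run `pip install databricks-sql-connector[pyarrow]` for Arrow features"
    )


print(describe_install())
```

The check happens at call time rather than at import time, so a lean install can still import the calling module without PyArrow.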

databricks-sqlalchemy

The SQLAlchemy code is moved to a separate repository to control its release flow.
The databricks-sqlalchemy library has a core dependency on the connector with PyArrow, so installing databricks-sqlalchemy also installs databricks-sql-python and PyArrow.
You can install the latest SQLAlchemy v1 based library with pip install databricks-sqlalchemy~=1.0, or the SQLAlchemy v2 based library with pip install databricks-sqlalchemy.
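The version mapping above can be sketched as a small helper that picks the pip requirement string for a given SQLAlchemy major version. The function names are hypothetical. Only the two requirement strings come from this description. The installed-version lookup uses the standard library's importlib.metadata.

```python
from importlib import metadata
from typing import Optional


def databricks_sqlalchemy_requirement(sqlalchemy_major: int) -> str:
    """Map a SQLAlchemy major version to the pip requirement named in this PR."""
    if sqlalchemy_major == 1:
        return "databricks-sqlalchemy~=1.0"  # stays on the 1.x releases
    if sqlalchemy_major == 2:
        return "databricks-sqlalchemy"  # latest release, SQLAlchemy v2 based
    raise ValueError(f"unsupported SQLAlchemy major version: {sqlalchemy_major}")


def installed_sqlalchemy_major() -> Optional[int]:
    """Return the major version of the installed SQLAlchemy, or None if absent."""
    try:
        return int(metadata.version("SQLAlchemy").split(".")[0])
    except metadata.PackageNotFoundError:
        return None


print(databricks_sqlalchemy_requirement(1))
print(databricks_sqlalchemy_requirement(2))
```

The ~=1.0 specifier is a compatible-release clause, so it accepts any 1.x release of databricks-sqlalchemy but never a 2.x release.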

Published Library on PyPi

Development Details

Going forward, all PRs related to databricks-sql-python will be raised on this repo.
The SQLAlchemy v1 based library is not under active development and has been moved to the v1/main branch in the databricks-sqlalchemy repo. Any future PRs for the v1 based library must be raised against this branch.
The SQLAlchemy v2 based library is under active development and lives on the default main branch of the databricks-sqlalchemy repo.

PR Details

Tasks Completed

Refactored the code into its respective folders based on the proposed design doc
Updated pyproject.toml to reflect the proper dependencies for the split
Verified that all existing e2e and unit tests pass before and after the split, ensuring parity
Added benchmarking queries to compare performance before and after the split, and created a dashboard for visualization
Added dependency tests to check how the library behaves when certain libraries are not available and the user requests their functions
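As an illustration of the dependency tests mentioned in the last task, the sketch below simulates a missing PyArrow and checks that an Arrow-only function fails with a clear error. The function fetch_as_arrow is a hypothetical stand-in, not the connector's real API, and the tests in this PR may be structured differently.

```python
import sys
from unittest import mock


def fetch_as_arrow():
    """Hypothetical Arrow-only entry point; not the connector's real API."""
    try:
        import pyarrow  # noqa: F401  (optional dependency)
    except ImportError as exc:
        raise ImportError(
            "This feature needs PyArrow; install databricks-sql-connector[pyarrow]"
        ) from exc
    return "arrow results"


def test_clear_error_without_pyarrow():
    # A None entry in sys.modules makes `import pyarrow` raise ImportError,
    # which simulates an environment where PyArrow is not installed.
    with mock.patch.dict(sys.modules, {"pyarrow": None}):
        try:
            fetch_as_arrow()
        except ImportError as exc:
            assert "pyarrow" in str(exc).lower()
        else:
            raise AssertionError("expected ImportError when PyArrow is missing")
```

Because mock.patch.dict restores sys.modules on exit, the simulated absence does not leak into other tests in the same run.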

How to Test

The testing pipeline remains the same as before the split.
pytest can run both the integration and unit tests directly: pytest [directory_name or file_name]
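For scripted runs, the same invocation can be wrapped in Python. This is a minimal sketch that assumes pytest is installed in the active environment. The helper names are hypothetical, and tests/unit is a placeholder path rather than a confirmed directory in this repo.

```python
import subprocess
import sys
from typing import List


def pytest_command(target: str) -> List[str]:
    """Build the command line for running pytest on a directory or file."""
    # Using `python -m pytest` runs pytest from the current interpreter.
    return [sys.executable, "-m", "pytest", target]


def run_tests(target: str) -> int:
    """Run pytest on the target and return its exit code."""
    return subprocess.run(pytest_command(target), check=False).returncode


# "tests/unit" is a placeholder; substitute a real directory or file name.
print(pytest_command("tests/unit")[1:])
```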

Performance Comparison - Benchmarking

The pre-split and post-split performance has been compared using large and small queries to make sure there is no performance regression.
A dashboard has been created so that every time the benchmarking is run, the results are stored in the benchfood and comparisons can be made easily.
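A pre-split versus post-split comparison of this kind could be sketched as below. This is not the benchmarking code from this PR. The run_query argument is a hypothetical callable that executes one query, and the median-based ratio is an assumed way of summarizing the timings.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run the query callable several times and return wall-clock timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def compare(pre: List[float], post: List[float]) -> Dict[str, float]:
    """Summarize two timing samples; a ratio above 1.0 means post-split is slower."""
    pre_median = statistics.median(pre)
    post_median = statistics.median(post)
    ratio = post_median / pre_median if pre_median > 0 else float("nan")
    return {"pre_median_s": pre_median, "post_median_s": post_median, "ratio": ratio}


# Stand-in workloads; a real run would call the connector with large and small queries.
pre_timings = time_query(lambda: sum(range(10_000)))
post_timings = time_query(lambda: sum(range(10_000)))
print(compare(pre_timings, post_timings))
```

Medians are used instead of means so that a single slow run, for example a cold start, does not dominate the comparison.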

…ore part (#417)
* Implemented ColumnQueue to test the fetchall without pyarrow Removed token removed token
* order of fields in row corrected
* Changed the folder structure and tested the basic setup to work
* Refractored the code to make connector to work
* Basic Setup of connector, core and sqlalchemy is working
* Basic integration of core, connect and sqlalchemy is working
* Setup working dynamic change from ColumnQueue to ArrowQueue
* Refractored the test code and moved to respective folders
* Added the unit test for column_queue Fixed __version__ Fix
* venv_main added to git ignore
* Added code for merging columnar table
* Merging code for columnar
* Fixed the retry_close sesssion test issue with logging
* Fixed the databricks_sqlalchemy tests and introduced pytest.ini for the sqla_testing
* Added pyarrow_test mark on pytest
* Fixed databricks.sqlalchemy to databricks_sqlalchemy imports
* Added poetry.lock
* Added dist folder
* Changed the pyproject.toml
* Minor Fix
* Added the pyarrow skip tag on unit tests and tested their working
* Fixed the Decimal and timestamp conversion issue in non arrow pipeline
* Removed not required files and reformatted
* Fixed test_retry error
* Changed the folder structure to src / databricks
* Removed the columnar non arrow flow to another PR
* Moved the README to the root
* removed columnQueue instance
* Revmoved databricks_sqlalchemy dependency in core
* Changed the pysql_supports_arrow predicate, introduced changes in the pyproject.toml
* Ran the black formatter with the original version
* Extra .py removed from all the __init__.py files names
* Undo formatting check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* Check
* BIG UPDATE
* Refeactor code
* Refractor
* Fixed versioning
* Minor refractoring
* Minor refractoring

…ave pyarrow as optional

Print warning message if pyarrow is not installed Signed-off-by: Jacky Hu <[email protected]>

Remove sqlalchemy and update README.md Signed-off-by: Jacky Hu <[email protected]>

github-actions · 2024-12-10T06:26:56Z