DB engine for pandas: sql.connect or sqlalchemy #476

rth · 2024-11-27T13:04:42Z

Hello,

I was wondering what's the best practice for using this package with pandas.

It's possible to create a connection with databricks.sql.connect and pass it to pandas.read_sql. This works, but it raises the following warning:

UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.

Alternatively, it's possible to use SQLAlchemy with a databricks:// URL and pass that to pandas. Doesn't that add an extra serialization step, though, which would hurt performance?
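For reference, a minimal sketch of the SQLAlchemy route. It assumes the Databricks SQLAlchemy dialect is installed and that the URL follows the `databricks://token:<token>@<host>?http_path=<path>` form. The helper names `databricks_url` and `read_with_sqlalchemy` are hypothetical, and the host, path and token values are placeholders.

```python
from urllib.parse import quote, urlencode


def databricks_url(host, http_path, token, catalog=None, schema=None):
    """Build a databricks:// SQLAlchemy URL (hypothetical helper, not part of the package)."""
    params = {"http_path": http_path}
    if catalog:
        params["catalog"] = catalog
    if schema:
        params["schema"] = schema
    # Percent-encode the token and the query values so special characters survive URL parsing.
    return f"databricks://token:{quote(token, safe='')}@{host}?{urlencode(params)}"


def read_with_sqlalchemy(url, query):
    """Run `query` through a SQLAlchemy engine and return a pandas DataFrame."""
    # Imported here so the URL helper stays usable without pandas/SQLAlchemy installed.
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine(url)
    with engine.connect() as conn:
        return pd.read_sql(text(query), conn)
```

Usage would be along the lines of `read_with_sqlalchemy(databricks_url(host, http_path, token), "SELECT 1")`. Passing a SQLAlchemy connection avoids the UserWarning, but this sketch says nothing about whether it is faster or slower than the DBAPI connection.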

What's the recommended way, in particular regarding performance? Would both use CloudFetch for larger queries? I see that some pandas-related fixes and improvements have been made in PRs, so which API should be used to benefit from those?

Thanks!

cc @kravets-levko


rth · 2024-11-29T10:56:58Z

Unless one is supposed to use fetchall_arrow and convert the resulting PyArrow table to pandas? An example would be good (see also #21).

Edit: Or actually, a utility function would be even better, as proposed in #134.
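Something like the following is what I have in mind for the Arrow route: a rough sketch, not an existing API of the package. `read_sql_via_arrow` and `example_usage` are hypothetical names, and the hostname, HTTP path and token below are placeholders. It relies only on the cursor's `fetchall_arrow` and on PyArrow's `Table.to_pandas`.

```python
def read_sql_via_arrow(connection, query):
    """Execute `query` and return a pandas DataFrame via a PyArrow table.

    `connection` is expected to be a databricks.sql connection (or any object
    whose cursor offers `execute` and `fetchall_arrow`).
    """
    with connection.cursor() as cursor:
        cursor.execute(query)
        # fetchall_arrow returns a pyarrow.Table; to_pandas converts it to a DataFrame.
        return cursor.fetchall_arrow().to_pandas()


def example_usage():
    """Illustrative only: the connection details are placeholders."""
    from databricks import sql  # requires databricks-sql-connector

    with sql.connect(
        server_hostname="<workspace-hostname>",
        http_path="<http-path>",
        access_token="<access-token>",
    ) as connection:
        return read_sql_via_arrow(connection, "SELECT 1 AS x")
```

Since pandas.read_sql is not involved here, the UserWarning does not come up. Whether this path is the recommended one is exactly the open question.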
