-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DuckDB for aggregations? #39
Comments
did some simple tests with the following query: SELECT AuthorId, AffiliationId, Year
FROM (
SELECT *,
MAX(PaperCount) OVER(PARTITION BY AuthorId, Year) AS MaxPaperCount
FROM (
SELECT a.AuthorId,
a.AffiliationId,
c.Year,
count(PaperId) AS PaperCount
FROM PaperAuthorAffiliations a
INNER JOIN (
SELECT AuthorId
FROM author_sample
) b USING (AuthorId)
INNER JOIN (
SELECT PaperId, Year
FROM Papers
WHERE DocType IN ('Journal', 'Conference')
) c USING (PaperId)
WHERE AffiliationId != ''
GROUP BY a.AuthorId, a.AffiliationId, c.Year
)
)
WHERE PaperCount = MaxPaperCount When run through sqlite, this takes 700s; when run through duckdb, it takes 77s. The syntax for duckb is SET GLOBAL sqlite_all_varchar = true;
load sqlite;
attach 'path/to/db/AcademicGraph.sqlite' (type sqlite);
use AcademicGraph;
show tables;
/* then execute the query above */ open questions
|
in python, this works: https://gist.github.com/f-hafner/f49ee843b61d395adf1afc532df8e5db |
how about setting storing the entire data in duckdb instead? CREATE TABLE PaperAuthorAffiliations(PaperId BIGINT, AuthorId BIGINT, AffiliationId BIGINT);
CREATE TABLE Papers(PaperId BIGINT, DocType VARCHAR, "Year" BIGINT);
CREATE TABLE author_sample(AuthorId BIGINT, YearLastPub INTEGER, YearFirstPub INTEGER, PaperCount INTEGER, FirstName VARCHAR); (this schema is created with the following inter-acting queries on the sqlite database and on the duckdb database): /*make table author_sample */
/* -------------------------------------------------------------------------*/
/* in sqlite */
create table flavio_test (AuthorId INT, YearLastPub INT, YearFirstPub INT, PaperCount INT, FirstName VARCHAR);
insert into flavio_test
select * from author_sample;
/* in duckdb */
/* I think the install and load of sqlite is missing here */
create table author_sample(AuthorId bigint, YearLastPub INT, YearFirstPub INT, PaperCount INT, FirstName varchar);
insert into author_sample
select * from sqlite_scan("/mnt/ssd/AcademicGraph/AcademicGraph.sqlite", "flavio_test");
/* make table Papers*/
/* -------------------------------------------------------------------------*/
/* in sqlite */
drop table flavio_test;
create table flavio_test(PaperId INT, DocType TEXT, Year INT);
insert into flavio_test
select PaperId, DocType, Year from Papers
where DocType != "" and Year is not NULL and Year != "";
/* in duckdb */
install sqlite;
load sqlite;
create table papers as select * from sqlite_scan("/mnt/ssd/AcademicGraph/AcademicGraph.sqlite", "flavio_test");
/* make table PaperAuthorAffiliations*/
/* -------------------------------------------------------------------------*/
/* in sqlite */
drop table flavio_test;
create table flavio_test(PaperId INT, AuthorId INT, AffiliationId INT);
insert into flavio_test
select PaperId, AuthorId, AffiliationId
from PaperAuthorAffiliations
where AffiliationId != "";
/* in duckdb */
create table PaperAuthorAffiliations as select * from sqlite_scan("/mnt/ssd/AcademicGraph/AcademicGraph.sqlite", "flavio_test");
then the following query takes 11 seconds create temp table flavio_test as
SELECT AuthorId, AffiliationId, Year
FROM (
SELECT *,
MAX(PaperCount) OVER(PARTITION BY AuthorId, Year) AS MaxPaperCount
FROM (
SELECT a.AuthorId,
a.AffiliationId,
c.Year,
count(PaperId) AS PaperCount
FROM PaperAuthorAffiliations a
INNER JOIN (
SELECT AuthorId
FROM author_sample
) b USING (AuthorId)
INNER JOIN (
SELECT PaperId, Year
FROM Papers
WHERE DocType IN ('Journal', 'Conference')
) c USING (PaperId)
GROUP BY a.AuthorId, a.AffiliationId, c.Year
)
)
WHERE PaperCount = MaxPaperCount ; I dropped the |
instead of using pandas in-memory with batches.
See here: https://motherduck.com/blog/analyze-sqlite-databases-duckdb/
do some speed comparisons
The text was updated successfully, but these errors were encountered: