Vector search support #1402

Closed
wants to merge 11 commits into from

penberg commented May 20, 2024

This pull request adds initial support for vector search in libSQL.

Highlights

  • Vector column type for storing vectors in tables.
  • Vector indexes that are automatically kept up to date when the table changes.
  • Exact vector search with metadata filtering using plain SQL.
  • Approximate vector search using the new vector_top_k() function, which is backed by a DiskANN-based vector index.

Usage

Creating a table with a vector column:

CREATE TABLE movies (
  title TEXT, 
  year INT, 
  embedding FLOAT32(3)
);

Inserting vector data:

INSERT INTO movies (title, year, embedding) 
VALUES 
  (
    'Napoleon', 
    2023, 
    vector('[1,2,3]')
  ), 
  (
    'Black Hawk Down', 
    2001, 
    vector('[10,11,12]')
  ), 
  (
    'Gladiator', 
    2000, 
    vector('[7,8,9]')
  ), 
  (
    'Blade Runner', 
    1982, 
    vector('[4,5,6]')
  );

Creating an index on a vector column:

CREATE INDEX movies_idx USING vector_cosine_ops ON movies (embedding);

Finding top-k similar rows (exact):

SELECT title, year FROM movies ORDER BY vector_distance_cos(embedding, '[3,1,2]') LIMIT 3;

Finding top-k similar rows (approximate):

SELECT 
  title, 
  year 
FROM 
  vector_top_k('movies_idx', '[4,5,6]', 3) 
JOIN
  movies 
ON 
  movies.rowid = id;

Limitations

  • Index key is always rowid, primary keys not supported.
  • CREATE INDEX does not index rows that already exist in the base table.
  • The vector index uses 32 bits per vector element, which causes redundant I/O and space amplification.

penberg force-pushed the vector branch 4 times, most recently from 33eb181 to 2bc5121 (May 23, 2024 10:36)
penberg force-pushed the vector branch 9 times, most recently from 90b9191 to 56bfe98 (June 5, 2024 14:32)

pax-k commented Jun 11, 2024

Is there any status update on this PR? We can hardly wait for vector support 🙏🏻 Thanks for the work, by the way!

penberg commented Jun 12, 2024

@pax-k I am actively working with folks to iron out some bugs and then get this merged.

penberg force-pushed the vector branch 4 times, most recently from 980582f to 1eff387 (June 18, 2024 06:50)
penberg marked this pull request as ready for review (June 18, 2024 08:59)
penberg force-pushed the vector branch 2 times, most recently from 8c8a696 to 3201dc5 (June 20, 2024 06:22)
libsql-sqlite3/src/build.c (outdated)

assert( pUsing!= 0);

for( i=0; i<pUsing->nId; i++ ){

haaawk commented Jun 26, 2024

Is it ever allowed for a vector index to have pUsing->nId != 1?

return -1;
}
zSql = sqlite3MPrintf(db, "CREATE TABLE IF NOT EXISTS %s_shadow (index_key INT, data BLOB)", pIdx->zName);
rc = sqlite3_exec(db, zSql, 0, 0, 0);

I may still be missing something, but it seems that the return -1 at lines 393 and 399 will leave the shadow table in place even though index creation failed. Is this a problem? Probably nothing bad happens apart from some junk being kept around. Creating an index with the same name later will just work because of the IF NOT EXISTS part of the SQL.

if( !sqlite3Isdigit(*z) ){
return -1;
}
dims = dims*10 + (*z - '0');

Is it possible to define FLOAT32(X) with X larger than MAX_INT? Would it make sense to check for overflow here and report a user-readable error explaining that the number of dimensions is too large when X > MAX_INT, or even when X > MAX_VECTOR_DIMS?
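One way to address the overflow concern is to reject a digit before it is accumulated, so that dims*10 + digit can never exceed the limit. The following is a minimal sketch, not the code from this PR: the function name parse_vector_dims, its return codes, and the limit SKETCH_MAX_VECTOR_DIMS (set to 65536 purely for illustration) are all assumptions.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical upper bound on vector dimensions, chosen for illustration. */
#define SKETCH_MAX_VECTOR_DIMS 65536

/* Parse a decimal dimension count such as the "3" in FLOAT32(3).
** Returns 0 and stores the value in *pDims on success,
** -1 if z is empty or contains a non-digit character,
** -2 if the value would exceed SKETCH_MAX_VECTOR_DIMS. */
static int parse_vector_dims(const char *z, int *pDims){
  int dims = 0;
  if( z==NULL || *z=='\0' ) return -1;
  for( ; *z; z++ ){
    int digit;
    if( !isdigit((unsigned char)*z) ) return -1;
    digit = *z - '0';
    /* Check before multiplying: dims*10 + digit <= limit holds exactly
    ** when dims <= (limit - digit)/10, so dims itself never overflows. */
    if( dims > (SKETCH_MAX_VECTOR_DIMS - digit)/10 ) return -2;
    dims = dims*10 + digit;
  }
  *pDims = dims;
  return 0;
}
```

A caller could map the -2 result to an error message that says the dimension count is too large, rather than reporting a generic parse failure.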

penberg commented Jul 24, 2024

This work has been merged as part of the following PRs:

#1531

#1551

#1557

#1560

Therefore, closing this.

penberg closed this Jul 24, 2024

6 participants