Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new Databricks Vector Search langchain native tool VectorSearchRetrieverTool #24

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

leonbi100
Copy link

@leonbi100 leonbi100 commented Dec 5, 2024

What does this PR do?

This PR introduces the class VectorSearchRetrieverTool into the databricks-langchain package which allows the user to instantiate a langchain native tool that calls Databricks Vector Search when invoked. Under the hood, the core logic is from DatabricksVectorSearch in the langchain-databricks package.

How was it tested?

  • New unit tests in test_vector_search_retriever_tool.py which verifies the new tool object can be instantiated correctly and can be invoked by an arbitrary langchain llm
  • Tested e2e in notebook: image
  • Can see the populated tool description when no description is provided: image

@leonbi100 leonbi100 changed the title Add VectorSearchRetrieverTool Add new Databricks Vector Search langchain native tool VectorSearchRetrieverTool Dec 5, 2024
@leonbi100 leonbi100 marked this pull request as ready for review December 5, 2024 18:26
text_column: Optional[str] = Field(None, description="If using a direct-access index or delta-sync index, specify the text column.")
embedding: Optional[Embeddings] = Field(None, description="Embedding model for self-managed embeddings.")
# TODO: Confirm if we can add this endpoint field
endpoint: Optional[str] = Field(None, description="Endpoint for DatabricksVectorSearch.")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field was added because of this restriction in databricks-langchain. I felt that if we threw this error without giving the ability for the user to rectify it, it would be a poor user experience. Alternatively maybe we pin databricks-vectorsearch to be >=0.35.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's valid to require databricks-vectorsearch >= 0.35 especially because this is new - that might be the better considering we don't need endpoint for any other reason.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah reasonable to require new versions of other clients!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we already mark the "databricks-vectorsearch>=0.40" as a dependency here, so I'll just remove this argument.

Comment on lines 26 to 27
text_column: Optional[str] = Field(None, description="If using a direct-access index or delta-sync index, specify the text column.")
embedding: Optional[Embeddings] = Field(None, description="Embedding model for self-managed embeddings.")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two fields are required for direct-access indexes or delta-sync indexes with self-managed embeddings. Should we support these additional fields?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like if we support it for DatabricksVectorSearch it makes sense to support it here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems reasonable to support these, though I'd say it's worth asking vector search folks how commonly direct access indexes are used, if it's infrequent we could drop this to start with to simplify the API/testing surface

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to block this PR on that though, I figure we'll need this eventually anyways, would just be good for us to know

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def get_tool_description():
default_tool_description = "A vector search-based retrieval tool for querying indexed embeddings."
index_details = IndexDetails(dbvs.index)
if index_details.is_delta_sync_index():
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

direct access indexes don't have an associated source table so we'll just use the default tool description.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what the existing langchain-databricks DatabricksVectorSearch.as_retriever(...).as_tool(...) ends up generating as the tool description

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to tell would be to use it as a tool with payload logging enabled & see what the tools argument to the LLM API in model serving looks like

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks reasonable, just curious if we can keep it in sync with the existing behavior/default

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image Tested it out in a notebook. The default seems to be extremely basic and lacking in content. Does this answer your question?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol yep makes sense, the updated version in this PR is definitely better

index_name: str = Field(..., description="The name of the index to use, format: 'catalog.schema.index'.")
num_results: int = Field(10, description="The number of results to return.")
columns: Optional[List[str]] = Field(None, description="Columns to return when doing the search.")
filters: Optional[Dict[str, Any]] = Field(None, description="Filters to apply to the search.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ, does this get sent to the LLM as the parameter description? If so I wonder if it's worth including examples like the ones in https://docs.databricks.com/api/workspace/vectorsearchindexes/queryindex

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nvm, this is in the init, not in the tool call

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But seems like there is a way we can specify the description of the params for the LLM too: https://chatgpt.com/share/6764d76f-69a0-8009-8a8f-f58977753057

Copy link
Collaborator

@smurching smurching Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to include VectorSearchRetrieverToolInput as an args_schema

Copy link
Collaborator

@smurching smurching left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good! Just had some small comments

@leonbi100 leonbi100 requested a review from smurching December 20, 2024 05:48
Copy link

@annzhang-db annzhang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment - we should update the tests to reflect the most recent changes. 🙇‍♀️

tool_description: Optional[str],
embedding: Optional[Any],
text_column: Optional[str],
endpoint: Optional[str],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need the endpoint argument anymore right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch updated!

from typing import Generator
from unittest import mock

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update these tests to also assert that the tool description + args description are properly set? Lmk if it's already done and I missed it (just looking at changes since my last review rn)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new test test_vector_search_retriever_tool_description_generation to explicitly test these changes.

Copy link
Collaborator

@smurching smurching left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will stamp after my testing comment is addressed ,thanks Leon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants