Add new Databricks Vector Search langchain native tool VectorSearchRetrieverTool #24

leonbi100 · 2024-12-05T03:00:03Z

What does this PR do?

This PR adds the VectorSearchRetrieverTool class to the databricks-langchain package. It lets the user instantiate a LangChain-native tool that calls Databricks Vector Search when invoked. Under the hood, the core logic comes from DatabricksVectorSearch in the langchain-databricks package.

How was it tested?

New unit tests in test_vector_search_retriever_tool.py verify that the new tool object can be instantiated correctly and can be invoked by an arbitrary LangChain LLM.
Tested end to end in a notebook.
The populated tool description is visible when no description is provided.

leonbi100 · 2024-12-19T18:33:25Z

integrations/langchain/src/databricks_langchain/vector_search_retriever_tool.py

+    text_column: Optional[str] = Field(None, description="If using a direct-access index or delta-sync index, specify the text column.")
+    embedding: Optional[Embeddings] = Field(None, description="Embedding model for self-managed embeddings.")
+    # TODO: Confirm if we can add this endpoint field
+    endpoint: Optional[str] = Field(None, description="Endpoint for DatabricksVectorSearch.")


This field was added because of this restriction in databricks-langchain. I felt that throwing this error without giving the user a way to fix it would be a poor user experience. Alternatively, we could pin databricks-vectorsearch to >=0.35.

I think it's valid to require databricks-vectorsearch >= 0.35, especially because this is new. That might be the better option, considering we don't need endpoint for any other reason.

Yeah reasonable to require new versions of other clients!

Turns out we already list "databricks-vectorsearch>=0.40" as a dependency here, so I'll just remove this argument.
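
For illustration, a minimum-version requirement like the one discussed in this thread could also be checked at runtime. This is only a minimal sketch under stated assumptions: `parse_version`, `meets_minimum`, and `check_installed` are hypothetical helpers that are not part of databricks-langchain, and the parsing assumes plain dotted numeric versions such as 0.40. In the PR itself the requirement is handled by the package's dependency pin.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '0.40' or '0.40.1' into a tuple of ints (assumes numeric segments only)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '0.35' and '0.35.0' compare equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(package: str, minimum: str) -> bool:
    """Return False if `package` is missing or older than `minimum`.

    Pre-release suffixes such as '0.40rc1' are not handled by this sketch.
    """
    try:
        return meets_minimum(version(package), minimum)
    except PackageNotFoundError:
        return False
```

With a check like this, `check_installed("databricks-vectorsearch", "0.35")` would report whether the installed client is new enough before the tool is constructed.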

leonbi100 · 2024-12-19T18:35:04Z

integrations/langchain/src/databricks_langchain/vector_search_retriever_tool.py

+    text_column: Optional[str] = Field(None, description="If using a direct-access index or delta-sync index, specify the text column.")
+    embedding: Optional[Embeddings] = Field(None, description="Embedding model for self-managed embeddings.")


These two fields are required for direct-access indexes or delta-sync indexes with self-managed embeddings. Should we support these additional fields?

I feel like if we support it for DatabricksVectorSearch it makes sense to support it here.

Yeah, seems reasonable to support these, though I'd say it's worth asking the vector search folks how commonly direct-access indexes are used. If it's infrequent, we could drop this to start with to simplify the API/testing surface.

No need to block this PR on that though. I figure we'll need this eventually anyway; it would just be good for us to know.

Linking the slack thread I started here
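
The requirement described in this thread (text_column and embedding are needed for direct-access indexes and for delta-sync indexes with self-managed embeddings) can be sketched as a small validation step. This is an illustration only: `check_index_arguments`, the `index_type` strings, and the `managed_embeddings` flag are hypothetical names chosen for the example, not the actual implementation in this PR or in DatabricksVectorSearch.

```python
from typing import Any, Optional


def check_index_arguments(
    index_type: str,
    managed_embeddings: bool,
    text_column: Optional[str] = None,
    embedding: Optional[Any] = None,
) -> None:
    """Raise ValueError if the index needs text_column/embedding but they are missing."""
    # Hypothetical index_type values: "direct_access" or "delta_sync".
    needs_both = index_type == "direct_access" or (
        index_type == "delta_sync" and not managed_embeddings
    )
    if not needs_both:
        return
    missing = [
        name
        for name, value in (("text_column", text_column), ("embedding", embedding))
        if value is None
    ]
    if missing:
        raise ValueError(
            f"{', '.join(missing)} must be provided for this index type"
        )
```

A delta-sync index with managed embeddings passes with neither argument set, while a direct-access index raises until both are supplied.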

leonbi100 · 2024-12-19T18:40:24Z

integrations/langchain/src/databricks_langchain/vector_search_retriever_tool.py

+        def get_tool_description():
+            default_tool_description = "A vector search-based retrieval tool for querying indexed embeddings."
+            index_details = IndexDetails(dbvs.index)
+            if index_details.is_delta_sync_index():


Direct-access indexes don't have an associated source table, so we'll just use the default tool description.

Curious what the existing langchain-databricks DatabricksVectorSearch.as_retriever(...).as_tool(...) ends up generating as the tool description

One way to tell would be to use it as a tool with payload logging enabled & see what the tools argument to the LLM API in model serving looks like

This generally looks reasonable, just curious if we can keep it in sync with the existing behavior/default

Tested it out in a notebook. The default seems to be extremely basic and lacking in content. Does this answer your question?

Lol yep makes sense, the updated version in this PR is definitely better
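
The fallback behavior discussed in this thread can be sketched roughly as follows. The default string is taken from the diff above; everything else is an assumption made for illustration: `build_tool_description` is a hypothetical standalone function, the index details are passed in as plain arguments instead of an IndexDetails object, and the wording appended for delta-sync indexes is invented and is not the text generated by the PR.

```python
from typing import Optional

DEFAULT_TOOL_DESCRIPTION = (
    "A vector search-based retrieval tool for querying indexed embeddings."
)


def build_tool_description(
    user_description: Optional[str],
    is_delta_sync: bool,
    source_table: Optional[str],
) -> str:
    """Pick a tool description, preferring the caller's own text."""
    # An explicit description from the caller always wins.
    if user_description:
        return user_description
    # Delta-sync indexes have a source table that can be mentioned
    # (the sentence below is illustrative wording only).
    if is_delta_sync and source_table:
        return (
            f"{DEFAULT_TOOL_DESCRIPTION} "
            f"The queried index uses the source table {source_table}."
        )
    # Direct-access indexes have no source table, so use the default.
    return DEFAULT_TOOL_DESCRIPTION
```

Keeping the choice in one function makes the three cases (user-provided, delta-sync, direct-access) easy to cover with separate unit tests.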

smurching · 2024-12-20T02:30:28Z

integrations/langchain/src/databricks_langchain/vector_search_retriever_tool.py

+    index_name: str = Field(..., description="The name of the index to use, format: 'catalog.schema.index'.")
+    num_results: int = Field(10, description="The number of results to return.")
+    columns: Optional[List[str]] = Field(None, description="Columns to return when doing the search.")
+    filters: Optional[Dict[str, Any]] = Field(None, description="Filters to apply to the search.")


QQ, does this get sent to the LLM as the parameter description? If so I wonder if it's worth including examples like the ones in https://docs.databricks.com/api/workspace/vectorsearchindexes/queryindex

Oh nvm, this is in the init, not in the tool call

But seems like there is a way we can specify the description of the params for the LLM too: https://chatgpt.com/share/6764d76f-69a0-8009-8a8f-f58977753057

See also https://python.langchain.com/docs/how_to/custom_tools/#subclass-basetool (we can use args_schema)

Updated to include VectorSearchRetrieverToolInput as an args_schema
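
To make the distinction in this thread concrete: the Field descriptions on index_name, num_results, columns, and filters describe constructor arguments, while the LLM only sees the tool's name, its description, and the schema of its call-time arguments. The sketch below builds an OpenAI-style function tool spec by hand to show the shape of that payload. It is an assumption-laden illustration: `build_tool_spec` is a hypothetical helper, the single `query` argument and its description are assumed, and the payload LangChain actually produces from VectorSearchRetrieverToolInput may differ.

```python
import json
from typing import Any, Dict


def build_tool_spec(name: str, description: str) -> Dict[str, Any]:
    """Assemble an OpenAI-style function tool spec with one `query` argument."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    # Assumed call-time argument; its description is what the LLM reads.
                    "query": {
                        "type": "string",
                        "description": "The string used to query the index.",
                    }
                },
                "required": ["query"],
            },
        },
    }


spec = build_tool_spec(
    "vector_search_retriever",  # hypothetical tool name
    "A vector search-based retrieval tool for querying indexed embeddings.",
)
print(json.dumps(spec, indent=2))
```

Constructor-only settings such as num_results never appear under `parameters`, which is why their Field descriptions are not sent to the model.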

smurching

Mostly looks good! Just had some small comments

annzhang-db

LGTM with one comment - we should update the tests to reflect the most recent changes. 🙇‍♀️

annzhang-db · 2024-12-20T18:21:06Z

integrations/langchain/tests/unit_tests/test_vector_search_retriever_tool.py

+    tool_description: Optional[str],
+    embedding: Optional[Any],
+    text_column: Optional[str],
+    endpoint: Optional[str],


We shouldn't need the endpoint argument anymore, right?

Good catch, updated!

smurching · 2024-12-20T19:34:30Z

integrations/langchain/tests/utils/chat_models.py

 from typing import Generator
 from unittest import mock
+


Should we update these tests to also assert that the tool description + args description are properly set? Lmk if it's already done and I missed it (just looking at changes since my last review rn)

Added a new test test_vector_search_retriever_tool_description_generation to explicitly test these changes.

smurching

Will stamp after my testing comment is addressed, thanks Leon!

Add VectorSearchRetrieverTool

fc34a39

leonbi100 changed the title ~~Add VectorSearchRetrieverTool~~ Add new Databricks Vector Search langchain native tool VectorSearchRetrieverTool Dec 5, 2024

leonbi100 marked this pull request as ready for review December 5, 2024 18:26

leonbi100 added 8 commits December 5, 2024 15:35

Add docs and fix tool creation

1e1798e

Fix bug

d662b46

Merge master

9d32449

Refactor

60a5eb5

Merge branch 'main' into langchain-vs-tool

fe09b25

Update API to be uniform

2105f55

Refactor based on Pydantic

5b5361a

Rename files

66fe9a7

leonbi100 commented Dec 19, 2024

annzhang-db requested review from smurching and annzhang-db December 19, 2024 23:19

smurching reviewed Dec 20, 2024

PR feedback and lint

9166b5a

leonbi100 requested a review from smurching December 20, 2024 05:48

Fix lint

53f9924

annzhang-db approved these changes Dec 20, 2024

Remove endpoint arg from tests

ef765b1

smurching reviewed Dec 20, 2024

Add new test checking tool descriptions

8798989

smurching approved these changes Dec 20, 2024
