
[Cookbook improvement] Working PR #31

Open · wants to merge 22 commits into main

Conversation

@epec254 (Collaborator) commented Sep 27, 2024

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

How is this PR tested?

shared_config = AgentCookbookConfig(
    uc_catalog_name=f"{default_catalog}",
    uc_schema_name=f"{user_name}_agents",
    uc_asset_prefix="agent_app_name",  # Prefix to every created UC asset, typically the Agent's name
)

Unclear how to scale this to multiple models - IMO schema should be the “agent app name” (where agent apps contain one or more models/data sources)

epec254 (Collaborator, Author):

That would simplify the confusing idea of uc_asset_prefix... and avoid long names for each asset

epec254 (Collaborator, Author):

Is that the common pattern you see @FMurray - one agent == one schema?


By agent app, I mean one use case, which might have several agent models - for example, 3 distinct RAG models and a supervisor model.

I see customers who have less permissive UC setups typically apply grants on the schema (occasionally on catalog), so I wouldn't want to have one schema per model, because that would slow them down.
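A minimal sketch of what this could look like, with hypothetical names (one UC schema per agent app, no per-asset prefix):

```python
# Hypothetical sketch: with "one agent app == one UC schema", the schema name
# scopes every asset the app creates, so a per-asset prefix is unnecessary
# and grants can be applied once at the schema level.
def uc_asset_name(catalog: str, app_name: str, asset: str) -> str:
    """Build a three-level Unity Catalog name, e.g. 'main.my_app.source_docs'."""
    return f"{catalog}.{app_name}.{asset}"

print(uc_asset_name("main", "agent_app_name", "source_docs"))
# -> main.agent_app_name.source_docs
```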

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors.platform import ResourceAlreadyExists, ResourceDoesNotExist, NotFound, PermissionDenied

class AgentCookbookConfig(BaseModel):

Pydantic settings might get us more out of the box here: https://docs.pydantic.dev/latest/concepts/pydantic_settings/#usage


I'd prefer "lazy" config where validation happens when the config actually is used vs needing to explicitly run this notebook

Collaborator:

Yeah, the tradeoff the current logic is trying to address is failing fast: surfacing all permissions issues/blockers early on, since folks found it confusing to tab back and forth between notebooks to make fixes.
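The two approaches aren't mutually exclusive. A rough sketch (hypothetical class and field names, not the cookbook's actual config) of validation that is lazy by default but can be forced eagerly from a setup notebook:

```python
# Sketch: validate() runs at most once. Code that uses the config triggers it
# lazily on first access; a setup notebook can call it up front to fail fast
# and surface all blockers in one place.
class LazyValidatedConfig:
    def __init__(self, uc_catalog_name: str):
        self.uc_catalog_name = uc_catalog_name
        self._validated = False

    def validate(self) -> None:
        if self._validated:
            return
        # Real checks would hit UC for permissions; here we only check shape.
        if not self.uc_catalog_name:
            raise ValueError("uc_catalog_name must be non-empty")
        self._validated = True

    def catalog(self) -> str:
        self.validate()  # lazy path: validated on first real use
        return self.uc_catalog_name

cfg = LazyValidatedConfig("main")
cfg.validate()  # eager path: fail fast in the setup notebook
```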

# MAGIC %md
# MAGIC **Important note:** Throughout this notebook, we indicate which cells you:
# MAGIC - ✅✏️ *should* customize - these cells contain config settings to change
# MAGIC - 🚫✏️ *typically will not* customize - these cells contain boilerplate code required to validate / save the configuration

Why not put the things that shouldn't be customized in python files?

epec254 (Collaborator, Author):

We can move more into the Python files, but I tried to strike a balance of not hiding code that the user should be aware of. For example, I made a cell with MLflow logging boilerplate that shouldn't be modified, but I also didn't want to hide it from the user, since we got feedback before that hiding the details of how model logging works was confusing.


Ack - probably out of scope for this PR, but I'd prefer better logging, output rendering, and maybe the ability to change the config inline using an input ipywidget or similar. Fewer notebook cells, less to think about.

Collaborator:

I kind of agree with @FMurray here: as a reader these notes are quite helpful, but they seem hard for us to maintain. I wonder if we could recommend folks use IDEs to view the cookbook (that's what I'm doing right now as I review and try out changes to this PR), since IDEs have better jump-to-definition support for easily reading the code in the Python modules, if readers are interested. That way we keep maintainability but avoid the "hidden magic code" issue.

Collaborator:

TBH though, there are other things that I think affect maintainability more, so it's fine to try this to start - it is quite helpful.

pymupdf4llm==0.0.5 pymupdf==1.24.5 `# PDF parsing` \
markdownify==0.12.1 `# HTML parsing` \
pypandoc_binary==1.13 `# DOCX parsing` \
transformers==4.41.1 torch==2.3.0 tiktoken==0.7.0 langchain-text-splitters==0.2.0 `# For get_recursive_character_text_splitter()`

Customer feedback: it's not uncommon for this to break in customer environments without external internet access. Can we put it behind a flag and toggle the text splitter behavior based on the data pipeline config?

epec254 (Collaborator, Author):

We could, but we would need to write a default text splitter that doesn't use any external packages. These are all required by the default splitter :(
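One possible shape for the flag-based fallback, assuming a hypothetical `use_external_packages` config field; the fallback splits on raw character counts using only the standard library, which is cruder than the tiktoken-based default:

```python
from typing import Callable, List

def simple_character_splitter(chunk_size: int, overlap: int) -> Callable[[str], List[str]]:
    """Dependency-free fallback: overlapping fixed-size character chunks."""
    def split(text: str) -> List[str]:
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
    return split

def get_text_splitter(use_external_packages: bool) -> Callable[[str], List[str]]:
    if use_external_packages:
        # Default path: langchain-text-splitters + tiktoken (omitted in this sketch).
        raise NotImplementedError("external-package splitter not shown here")
    return simple_character_splitter(chunk_size=100, overlap=20)

chunks = get_text_splitter(use_external_packages=False)("a" * 250)
```

The tradeoff is chunk quality: character counts only approximate token budgets, so offline environments would get slightly worse chunk boundaries in exchange for zero external downloads.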

source_config = UnstructuredDataPipelineSourceConfig(
    uc_catalog_name=cookbook_shared_config.uc_catalog_name,
    uc_schema_name=cookbook_shared_config.uc_schema_name,
    uc_volume_name=f"{cookbook_shared_config.uc_asset_prefix}_source_docs",
)

My source docs are in a different catalog - do I need to move them? I will try referencing the other catalog and see what breaks.


The implementation of this config dataclass would need to change if I want to use data from another catalog without creating a new volume; as written, this would require copying the data.


When I tried a configuration from scratch I got something like:

app_name: alphaledger
data_sources: 
  sec_10k:
    catalog: field_ai_examples
    schema: alphaledger
    source_type: volume
    path: '/pdf'
    format: pdf
    glob: '*10k.pdf'
  sec_10q:
    catalog: field_ai_examples
    schema: alphaledger
    source_type: volume
    path: '/pdf'
    format: pdf
    glob: '*10k.pdf'
  marketdata:
    catalog: dev
    schema: alphaledger
    source_type: delta

data_pipeline:
  stages:
    - name: ingest_sec
      sources:
        - sec_10k
        - sec_10q

This makes it easier to define semantics of data in the Volume, but makes the ingest job more complex
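For illustration, a small sketch of how an ingest job could resolve a stage's sources under that proposed config (expressed as a Python dict here instead of YAML; names taken from the example above):

```python
# Trimmed version of the proposed config, as a plain dict.
config = {
    "data_sources": {
        "sec_10k": {"catalog": "field_ai_examples", "schema": "alphaledger", "format": "pdf"},
        "sec_10q": {"catalog": "field_ai_examples", "schema": "alphaledger", "format": "pdf"},
        "marketdata": {"catalog": "dev", "schema": "alphaledger", "source_type": "delta"},
    },
    "data_pipeline": {
        "stages": [{"name": "ingest_sec", "sources": ["sec_10k", "sec_10q"]}],
    },
}

def resolve_stage_sources(config: dict, stage_name: str) -> dict:
    """Map each source name listed in a stage to its full data_sources entry."""
    stage = next(s for s in config["data_pipeline"]["stages"] if s["name"] == stage_name)
    return {name: config["data_sources"][name] for name in stage["sources"]}

sources = resolve_stage_sources(config, "ingest_sec")
```

The indirection is where the extra ingest-job complexity comes from: each stage has to look up and dispatch on per-source type/format instead of reading a single volume.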

print(json.dumps(config_dump, indent=4))


def validate_or_create_uc_catalog(self) -> bool:
@smurching (Collaborator) commented Oct 8, 2024:

I found that the default error message from the SDK in case of permission failures is actually decent; e.g., for permission denied on schema creation I got:

PermissionDenied: User does not have CREATE SCHEMA and USE CATALOG on Catalog 'default'. Config: host=https://oregon.staging.cloud.databricks.com/, auth_type=runtime

Will propose a simplification to this code (I think the print statements when we're attempting to create are still useful, but we probably don't need to wrap the permission-denied case) that reduces the number of branches to test/maintain and produces a stacktrace directly where the failure happens.

        return True
    except Exception as e:
        print(
            f"\nFAIL: `{self.mlflow_experiment_name}` is not a valid directory for an MLflow experiment. An experiment name must be an absolute path within the Databricks workspace, e.g. '/Users/<some-username>/my-experiment'.\n\nIf you tried to specify a directory, either remove the `mlflow_experiment_name` parameter to try the default value or manually specify a valid path for `mlflow_experiment_name` to `AgentCookbookConfig(...)`.\n\nIf you did not pass a value for `mlflow_experiment_name` and are seeing this message, pass a valid workspace directory for `mlflow_experiment_name` and try again."
Collaborator:

The error message here was also decent when I passed a bad experiment name:

RestException: INVALID_PARAMETER_VALUE: Got an invalid experiment name 'abcd/Users/[email protected]/agent_app_name_mlflow_experiment'. An experiment name must be an absolute path within the Databricks workspace, e.g. '/Users/<some-username>/my-experiment'.

Will push some suggestions to simplify this, since there are other potential causes of failure here, e.g. PermissionDenied (so catching Exception and printing this new message may be inaccurate).

Collaborator:

Basically I'll just push a suggestion that follows the pattern mentioned above (it's helpful to print when we're trying to create a new resource, etc., but we can just let the SDK/API tell us what happened during failures and push error-message improvements in the backend if needed).
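That pattern might look roughly like this generic sketch (stand-in callables, not the actual Databricks SDK calls; `KeyError` stands in for the SDK's `NotFound`):

```python
from typing import Callable

def validate_or_create(resource_name: str, get: Callable[[], object], create: Callable[[], object]) -> bool:
    """Print what we're attempting; let the client's own errors propagate."""
    try:
        get()
        print(f"Found existing resource `{resource_name}`")
    except KeyError:  # stand-in for the SDK's NotFound
        print(f"Creating resource `{resource_name}`...")
        create()  # e.g. PermissionDenied propagates with the SDK's own message
    return True

store: dict = {}
validate_or_create("my_catalog", lambda: store["my_catalog"], lambda: store.update(my_catalog=True))
```

Only the not-found branch is handled; any other failure surfaces a stacktrace exactly where it happened, which keeps the number of branches to test small.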

resulting_prompt = self.config.get("prompt_template").format(context=context)

return resulting_prompt
return context.strip()


I couldn't make a comment on line 79, but class Document can be replaced with: https://mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.Document

@smurching smurching mentioned this pull request Nov 1, 2024