[Cookbook improvement] Working PR #31
base: main
Conversation
shared_config = AgentCookbookConfig(
    uc_catalog_name=f"{default_catalog}",
    uc_schema_name=f"{user_name}_agents",
    uc_asset_prefix="agent_app_name",  # Prefix to every created UC asset, typically the Agent's name
Unclear how to scale this to multiple models - IMO schema should be the “agent app name” (where agent apps contain one or more models/data sources)
That would simplify the confusing idea of `uc_asset_prefix`... and avoid long names for each asset
Is that the common pattern you see @FMurray - one agent == one schema?
By agent app, I mean one use case, which might have several agent models - for example, 3 distinct RAG models and a supervisor model.
I see customers who have less permissive UC setups typically apply grants on the schema (occasionally on catalog), so I wouldn't want to have one schema per model, because that would slow them down.
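To make the suggestion concrete, a sketch of what a schema-per-agent-app config might look like (hypothetical: this assumes `AgentCookbookConfig` drops `uc_asset_prefix` and lets the schema name carry the app identity; the actual class in this PR still requires the prefix):

```python
# Hypothetical layout: one UC schema per agent app. The schema holds every
# model/data source for the use case, so grants stay at the schema level and
# per-asset name prefixes become unnecessary.
shared_config = AgentCookbookConfig(
    uc_catalog_name=default_catalog,
    uc_schema_name="agent_app_name",  # schema == agent app
)
```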
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors.platform import ResourceAlreadyExists, ResourceDoesNotExist, NotFound, PermissionDenied


class AgentCookbookConfig(BaseModel):
Pydantic settings might get us more out of the box here: https://docs.pydantic.dev/latest/concepts/pydantic_settings/#usage
I'd prefer a "lazy" config where validation happens when the config is actually used, rather than needing to explicitly run this notebook
Yeah, the tradeoff I think the current logic is trying to address is failing fast: surfacing all permission issues/blockers early on, since folks found it confusing to tab back and forth between notebooks to make fixes
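One middle ground (my sketch, not what the PR implements): defer each check to first use via a cached property, so a notebook that never touches a resource never validates it, but the failure still surfaces at the point of use.

```python
import functools

class LazyValidatedConfig:
    """Sketch of 'lazy' validation: each resource is checked the first time
    it is accessed instead of all at once when the config notebook runs."""

    def __init__(self, uc_catalog_name: str):
        self.uc_catalog_name = uc_catalog_name

    @functools.cached_property
    def catalog(self) -> str:
        # Hypothetical check; the real version would call the Databricks SDK
        # to verify the catalog exists and is accessible.
        if not self.uc_catalog_name:
            raise ValueError("uc_catalog_name must be set before use")
        return self.uc_catalog_name
```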
# MAGIC %md
# MAGIC **Important note:** Throughout this notebook, we indicate which cells you:
# MAGIC - ✅✏️ *should* customize - these cells contain config settings to change
# MAGIC - 🚫✏️ *typically will not* customize - these cells contain boilerplate code required to validate / save the configuration
Why not put the things that shouldn't be customized in python files?
We can move more into the Python files, but I tried to strike a balance of not hiding code the user should be aware of. For example, I made a cell with MLflow logging boilerplate that shouldn't be modified, but I also didn't want to hide it from the user, since we got feedback before that hiding the details of how model logging works was confusing.
Ack - probably out of scope for this PR, but I'd prefer better logging, output rendering, and maybe the ability to change the config inline using an input ipywidget or similar. Fewer notebook cells, less to think about.
I kinda agree with @FMurray here. As a reader these notes are quite helpful, but they seem hard for us to maintain. I wonder if we could recommend folks use IDEs to view the cookbook (that's what I'm doing right now as I review and try out changes to this PR), since IDEs have better jump-to-definition support for things like reading the code in Python modules. That way we keep maintainability but avoid the "hidden magic code" issue.
TBH though, there are other things that I think affect maintainability more, so fine to try this to start, it is quite helpful
pymupdf4llm==0.0.5 pymupdf==1.24.5 `# PDF parsing` \
markdownify==0.12.1 `# HTML parsing` \
pypandoc_binary==1.13 `# DOCX parsing` \
transformers==4.41.1 torch==2.3.0 tiktoken==0.7.0 langchain-text-splitters==0.2.0 `# For get_recursive_character_text_splitter()`
customer feedback - it's not uncommon for this to break in customer environments without external internet access. Can we put it behind a flag and toggle the text splitter behavior based on data pipeline config?
We could, but we would need to write a default text splitter that doesn't use any external packages. These are all required by the default splitter :(
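A dependency-free fallback is feasible, though, at the cost of token-accurate chunking. A minimal recursive character splitter (my sketch, not the cookbook's `get_recursive_character_text_splitter`) could look like:

```python
def simple_recursive_split(text: str, chunk_size: int = 1000,
                           separators=("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing with finer
    separators on oversized pieces. Character-based, so chunk sizes are
    approximate compared to a tokenizer-aware splitter."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks: list[str] = []
            for piece in text.split(sep):
                chunks.extend(
                    simple_recursive_split(piece, chunk_size, separators[i + 1:])
                )
            return chunks
    # No separator left: hard-cut by characters.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

Behind a pipeline-config flag, environments without internet access could fall back to this instead of the transformers/tiktoken stack.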
source_config = UnstructuredDataPipelineSourceConfig(
    uc_catalog_name=cookbook_shared_config.uc_catalog_name,
    uc_schema_name=cookbook_shared_config.uc_schema_name,
    uc_volume_name=f"{cookbook_shared_config.uc_asset_prefix}_source_docs"
My source docs are in a different catalog - do I need to move them? I will try it referencing the other catalog and see what breaks
The implementation of this config dataclass would need to change for me to use data from another catalog without creating a new volume; as written, it would require copying the data.
When I tried a configuration from scratch I got something like:

app_name: alphaledger
data_sources:
  sec_10k:
    catalog: field_ai_examples
    schema: alphaledger
    source_type: volume
    path: '/pdf'
    format: pdf
    glob: '*10k.pdf'
  sec_10q:
    catalog: field_ai_examples
    schema: alphaledger
    source_type: volume
    path: '/pdf'
    format: pdf
    glob: '*10k.pdf'
  marketdata:
    catalog: dev
    schema: alphaledger
    source_type: delta
data_pipeline:
  stages:
    - name: ingest_sec
      sources:
        - sec_10k
        - sec_10q

This makes it easier to define the semantics of data in the Volume, but makes the ingest job more complex.
print(json.dumps(config_dump, indent=4))


def validate_or_create_uc_catalog(self) -> bool:
I found the default error message from the SDK in case of permission failures is actually decent, e.g. for permission denied on schema creation I got:
PermissionDenied: User does not have CREATE SCHEMA and USE CATALOG on Catalog 'default'. Config: host=https://oregon.staging.cloud.databricks.com/, auth_type=runtime
Will propose a simplification to this code (I think the print statements when we're attempting to create are still useful, but we probably don't need to wrap the permission denied case) that reduces the # of branches to test/maintain and produces a stacktrace directly where the failure happens
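The simplification could follow this shape (a sketch, not the actual proposal; `ensure_catalog` is a hypothetical helper, `client` stands in for a `WorkspaceClient`, and `NotFound` is a stand-in for `databricks.sdk.errors.platform.NotFound`): attempt the read, create only on NotFound, and let anything else (e.g. PermissionDenied) propagate with the SDK's own message and stack trace.

```python
class NotFound(Exception):
    """Stand-in for databricks.sdk.errors.platform.NotFound."""

def ensure_catalog(client, name: str) -> None:
    """Get the catalog, creating it only if it doesn't exist.
    Other failures propagate directly from the SDK call."""
    try:
        client.catalogs.get(name)
    except NotFound:
        print(f"Catalog `{name}` not found; attempting to create it...")
        client.catalogs.create(name=name)
```

This keeps the helpful "attempting to create" print while removing the per-error-code branches.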
            return True
        except Exception as e:
            print(
                f"\nFAIL: `{self.mlflow_experiment_name}` is not a valid directory for an MLflow experiment. An experiment name must be an absolute path within the Databricks workspace, e.g. '/Users/<some-username>/my-experiment'.\n\nIf you tried to specify a directory, either remove the `mlflow_experiment_name` parameter to try the default value or manually specify a valid path for `mlflow_experiment_name` to `AgentCookbookConfig(...)`.\n\nIf you did not pass a value for `mlflow_experiment_name` and are seeing this message, pass a valid workspace directory for `mlflow_experiment_name` and try again."
The error message here was also decent when I passed a bad experiment name:
RestException: INVALID_PARAMETER_VALUE: Got an invalid experiment name 'abcd/Users/[email protected]/agent_app_name_mlflow_experiment'. An experiment name must be an absolute path within the Databricks workspace, e.g. '/Users/<some-username>/my-experiment'.
Will push some suggestions to simplify this, since there are other potential causes of failure here, e.g. PermissionDenied (so catching `Exception` and printing this new message may be inaccurate)
Basically I'll just push a suggestion that follows the pattern mentioned above (it's helpful to print when we're trying to create a new resource etc, but we can just let the SDK/API tell us what happened during failures and push error message improvements in the backend if needed)
        resulting_prompt = self.config.get("prompt_template").format(context=context)

        return resulting_prompt
        return context.strip()
I couldn't make a comment on line 79, but class Document can be replaced with: https://mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.Document
Related Issues/PRs
#xxx

What changes are proposed in this pull request?

How is this PR tested?