
[Cookbook improvement] Working PR #31

Open · wants to merge 22 commits into main

Conversation

@epec254 (Collaborator) commented Sep 27, 2024

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

How is this PR tested?

shared_config = AgentCookbookConfig(
    uc_catalog_name=f"{default_catalog}",
    uc_schema_name=f"{user_name}_agents",
    uc_asset_prefix="agent_app_name",  # Prefix to every created UC asset, typically the Agent's name
)

Unclear how to scale this to multiple models - IMO schema should be the “agent app name” (where agent apps contain one or more models/data sources)

epec254 (Collaborator, Author):

That would simplify the confusing idea of uc_asset_prefix... and avoid long names for each asset

epec254 (Collaborator, Author):

Is that the common pattern you see @FMurray - one agent == one schema?


By agent app, I mean one use case, which might have several agent models - for example, 3 distinct RAG models and a supervisor model.

I see customers who have less permissive UC setups typically apply grants on the schema (occasionally on catalog), so I wouldn't want to have one schema per model, because that would slow them down.
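A minimal sketch of what this could look like, with hypothetical names (one UC schema per agent app, no per-asset prefix):

```python
# Hypothetical sketch: with "one agent app == one UC schema", the schema name
# scopes every asset the app creates, so a per-asset prefix is unnecessary
# and grants can be applied once at the schema level.
def uc_asset_name(catalog: str, app_name: str, asset: str) -> str:
    """Build a three-level Unity Catalog name, e.g. 'main.my_app.source_docs'."""
    return f"{catalog}.{app_name}.{asset}"

print(uc_asset_name("main", "agent_app_name", "source_docs"))
# -> main.agent_app_name.source_docs
```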

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors.platform import ResourceAlreadyExists, ResourceDoesNotExist, NotFound, PermissionDenied

class AgentCookbookConfig(BaseModel):

Pydantic settings might get us more out of the box here: https://docs.pydantic.dev/latest/concepts/pydantic_settings/#usage


I'd prefer "lazy" config where validation happens when the config actually is used vs needing to explicitly run this notebook

Collaborator:

Yeah, the tradeoff the current logic is trying to address is failing fast: surfacing all permissions issues/blockers early on, since folks found it confusing to tab back and forth between notebooks to make fixes.
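The two approaches aren't mutually exclusive. A rough sketch (hypothetical class and field names, not the cookbook's actual config) of validation that is lazy by default but can be forced eagerly from a setup notebook:

```python
# Sketch: validate() runs at most once. Code that uses the config triggers it
# lazily on first access; a setup notebook can call it up front to fail fast
# and surface all blockers in one place.
class LazyValidatedConfig:
    def __init__(self, uc_catalog_name: str):
        self.uc_catalog_name = uc_catalog_name
        self._validated = False

    def validate(self) -> None:
        if self._validated:
            return
        # Real checks would hit UC for permissions; here we only check shape.
        if not self.uc_catalog_name:
            raise ValueError("uc_catalog_name must be non-empty")
        self._validated = True

    def catalog(self) -> str:
        self.validate()  # lazy path: validated on first real use
        return self.uc_catalog_name

cfg = LazyValidatedConfig("main")
cfg.validate()  # eager path: fail fast in the setup notebook
```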

# MAGIC %md
# MAGIC **Important note:** Throughout this notebook, we indicate which cells you:
# MAGIC - ✅✏️ *should* customize - these cells contain config settings to change
# MAGIC - 🚫✏️ *typically will not* customize - these cells contain boilerplate code required to validate / save the configuration

Why not put the things that shouldn't be customized in python files?

epec254 (Collaborator, Author):

We can move more into the Python files, but I tried to strike a balance of not hiding code that the user should be aware of. For example, I made a cell with MLflow logging boilerplate that shouldn't be modified, but I also didn't want to hide it from the user, since we got feedback before that hiding the details of how model logging works was confusing.


Ack - probably out of scope for this PR, but I'd prefer better logging, output rendering, and maybe the ability to change the config inline using an input ipywidget or similar. Fewer notebook cells, less to think about.

Collaborator:

I kind of agree with @FMurray here: as a reader these notes are quite helpful, but they seem hard for us to maintain. I wonder if we could recommend folks use IDEs to view the cookbook (that's what I'm doing right now as I review and try out changes to this PR), since IDEs have better jump-to-definition support for easily reading the code in the Python modules, if readers are interested. That way we keep maintainability but avoid the "hidden magic code" issue.

Collaborator:

TBH though, there are other things that I think affect maintainability more, so it's fine to try this to start - it is quite helpful.

pymupdf4llm==0.0.5 pymupdf==1.24.5 `# PDF parsing` \
markdownify==0.12.1 `# HTML parsing` \
pypandoc_binary==1.13 `# DOCX parsing` \
transformers==4.41.1 torch==2.3.0 tiktoken==0.7.0 langchain-text-splitters==0.2.0 `# For get_recursive_character_text_splitter()`

Customer feedback: it's not uncommon for this to break in customer environments without external internet access. Can we put it behind a flag and toggle the text splitter behavior based on the data pipeline config?

epec254 (Collaborator, Author):

We could, but we would need to write a default text splitter that doesn't use any external packages. These are all required by the default splitter :(
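One possible shape for the flag-based fallback, assuming a hypothetical `use_external_packages` config field; the fallback splits on raw character counts using only the standard library, which is cruder than the tiktoken-based default:

```python
from typing import Callable, List

def simple_character_splitter(chunk_size: int, overlap: int) -> Callable[[str], List[str]]:
    """Dependency-free fallback: overlapping fixed-size character chunks."""
    def split(text: str) -> List[str]:
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
    return split

def get_text_splitter(use_external_packages: bool) -> Callable[[str], List[str]]:
    if use_external_packages:
        # Default path: langchain-text-splitters + tiktoken (omitted in this sketch).
        raise NotImplementedError("external-package splitter not shown here")
    return simple_character_splitter(chunk_size=100, overlap=20)

chunks = get_text_splitter(use_external_packages=False)("a" * 250)
```

The tradeoff is chunk quality: character counts only approximate token budgets, so offline environments would get slightly worse chunk boundaries in exchange for zero external downloads.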

source_config = UnstructuredDataPipelineSourceConfig(
    uc_catalog_name=cookbook_shared_config.uc_catalog_name,
    uc_schema_name=cookbook_shared_config.uc_schema_name,
    uc_volume_name=f"{cookbook_shared_config.uc_asset_prefix}_source_docs",
)

My source docs are in a different catalog - do I need to move them? I will try referencing the other catalog and see what breaks.


The implementation of this config dataclass would need to change if I want to use data from another catalog without creating a new volume; as written, this would require copying the data.


When I tried a configuration from scratch I got something like:

app_name: alphaledger
data_sources: 
  sec_10k:
    catalog: field_ai_examples
    schema: alphaledger
    source_type: volume
    path: '/pdf'
    format: pdf
    glob: '*10k.pdf'
  sec_10q:
    catalog: field_ai_examples
    schema: alphaledger
    source_type: volume
    path: '/pdf'
    format: pdf
    glob: '*10k.pdf'
  marketdata:
    catalog: dev
    schema: alphaledger
    source_type: delta

data_pipeline:
  stages:
    - name: ingest_sec
      sources:
        - sec_10k
        - sec_10q

This makes it easier to define semantics of data in the Volume, but makes the ingest job more complex
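For illustration, a small sketch of how an ingest job could resolve a stage's sources under that proposed config (expressed as a Python dict here instead of YAML; names taken from the example above):

```python
# Trimmed version of the proposed config, as a plain dict.
config = {
    "data_sources": {
        "sec_10k": {"catalog": "field_ai_examples", "schema": "alphaledger", "format": "pdf"},
        "sec_10q": {"catalog": "field_ai_examples", "schema": "alphaledger", "format": "pdf"},
        "marketdata": {"catalog": "dev", "schema": "alphaledger", "source_type": "delta"},
    },
    "data_pipeline": {
        "stages": [{"name": "ingest_sec", "sources": ["sec_10k", "sec_10q"]}],
    },
}

def resolve_stage_sources(config: dict, stage_name: str) -> dict:
    """Map each source name listed in a stage to its full data_sources entry."""
    stage = next(s for s in config["data_pipeline"]["stages"] if s["name"] == stage_name)
    return {name: config["data_sources"][name] for name in stage["sources"]}

sources = resolve_stage_sources(config, "ingest_sec")
```

The indirection is where the extra ingest-job complexity comes from: each stage has to look up and dispatch on per-source type/format instead of reading a single volume.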

print(json.dumps(config_dump, indent=4))


def validate_or_create_uc_catalog(self) -> bool:
@smurching (Collaborator) commented Oct 8, 2024:

I found that the default error message from the SDK in case of permission failures is actually decent; e.g., for permission denied on schema creation I got:

PermissionDenied: User does not have CREATE SCHEMA and USE CATALOG on Catalog 'default'. Config: host=https://oregon.staging.cloud.databricks.com/, auth_type=runtime

Will propose a simplification to this code (I think the print statements when we're attempting to create are still useful, but we probably don't need to wrap the permission-denied case) that reduces the number of branches to test/maintain and produces a stacktrace directly where the failure happens.

        return True
    except Exception as e:
        print(
            f"\nFAIL: `{self.mlflow_experiment_name}` is not a valid directory for an MLflow experiment. An experiment name must be an absolute path within the Databricks workspace, e.g. '/Users/<some-username>/my-experiment'.\n\nIf you tried to specify a directory, either remove the `mlflow_experiment_name` parameter to try the default value or manually specify a valid path for `mlflow_experiment_name` to `AgentCookbookConfig(...)`.\n\nIf you did not pass a value for `mlflow_experiment_name` and are seeing this message, pass a valid workspace directory for `mlflow_experiment_name` and try again."
Collaborator:

The error message here was also decent when I passed a bad experiment name:

RestException: INVALID_PARAMETER_VALUE: Got an invalid experiment name 'abcd/Users/[email protected]/agent_app_name_mlflow_experiment'. An experiment name must be an absolute path within the Databricks workspace, e.g. '/Users/<some-username>/my-experiment'.

Will push some suggestions to simplify this, since there are other potential causes of failure here, e.g. PermissionDenied (so catching Exception and printing this new message may be inaccurate).

Collaborator:

Basically I'll just push a suggestion that follows the pattern mentioned above (it's helpful to print when we're trying to create a new resource, etc., but we can just let the SDK/API tell us what happened during failures and push error-message improvements in the backend if needed).
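That pattern might look roughly like this generic sketch (stand-in callables, not the actual Databricks SDK calls; `KeyError` stands in for the SDK's `NotFound`):

```python
from typing import Callable

def validate_or_create(resource_name: str, get: Callable[[], object], create: Callable[[], object]) -> bool:
    """Print what we're attempting; let the client's own errors propagate."""
    try:
        get()
        print(f"Found existing resource `{resource_name}`")
    except KeyError:  # stand-in for the SDK's NotFound
        print(f"Creating resource `{resource_name}`...")
        create()  # e.g. PermissionDenied propagates with the SDK's own message
    return True

store: dict = {}
validate_or_create("my_catalog", lambda: store["my_catalog"], lambda: store.update(my_catalog=True))
```

Only the not-found branch is handled; any other failure surfaces a stacktrace exactly where it happened, which keeps the number of branches to test small.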

resulting_prompt = self.config.get("prompt_template").format(context=context)

return resulting_prompt
return context.strip()


I couldn't make a comment on line 79, but class Document can be replaced with: https://mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.Document

@smurching smurching mentioned this pull request Nov 1, 2024