CMF (Common Metadata Framework) collects and stores information associated with Machine Learning (ML) pipelines. It also implements APIs to query this metadata. The CMF adopts a data-first approach: all artifacts (such as datasets, ML models and performance metrics) recorded by the framework are versioned and identified by their content hash.
"},{"location":"#installation","title":"Installation","text":""},{"location":"#1-pre-requisites","title":"1. Pre-Requisites:","text":"conda create -n cmf python=3.10\nconda activate cmf\n
virtualenv --python=3.10 .cmf\nsource .cmf/bin/activate\n
"},{"location":"#3-install-cmf","title":"3. Install CMF:","text":"Latest version form GitHubStable version form PyPI pip install git+https://github.com/HewlettPackard/cmf\n
# pip install cmflib\n
"},{"location":"#next-steps","title":"Next Steps","text":"After installing CMF, proceed to configure CMF server and client. For detailed configuration instructions, refer to the Quick start with cmf-client page.
"},{"location":"#introduction","title":"Introduction","text":"Complex ML projects rely on ML pipelines
to train and test ML models. An ML pipeline is a sequence of stages, where each stage performs a particular task such as data loading, pre-processing, ML model training or testing. Each stage can have multiple executions. Each execution consumes inputs and produces outputs. CMF uses the abstractions of Pipeline, Context and Execution to store the metadata of complex ML pipelines. Each pipeline has a name, which users provide when they initialize CMF. Each stage is represented by a Context object. Metadata associated with each run of a stage is captured in an Execution object. Inputs and outputs of executions can be logged as dataset, model or metrics artifacts, while parameters of executions are recorded as properties of the execution.
Start tracking the pipeline metadata by initializing the CMF runtime. The metadata will be associated with the pipeline named test_pipeline
.
from cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n\ncmf = Cmf(\n filepath=\"mlmd\",\n pipeline_name=\"test_pipeline\",\n) \n
Before we can start tracking metadata, we need to let CMF know about the stage type. This context is not yet associated with any particular execution.
context: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"train\"\n)\n
Now we can create a new stage execution associated with the train
stage. CMF always creates a new execution and adjusts its name so that it is unique. This is also the place where we can log execution parameters such as the seed and hyper-parameters.
execution: mlmd.proto.Execution = cmf.create_execution(\n execution_type=\"train\",\n custom_properties = {\"num_epochs\": 100, \"learning_rate\": 0.01}\n)\n
Finally, we can log an input artifact (the training dataset) and, once the model is trained, an output artifact (the ML model).
cmf.log_dataset(\n 'artifacts/test_dataset.csv', # Dataset path \n \"input\" # This is INPUT artifact\n)\ncmf.log_model(\n \"artifacts/model.pkl\", # Model path \n event=\"output\" # This is OUTPUT artifact\n)\n
"},{"location":"#quick-example","title":"Quick Example","text":"Go through Getting Started page to learn more about CMF API usage.
"},{"location":"#api-overview","title":"API Overview","text":"Import CMF.
from cmflib import cmf\n
Initialize CMF. The CMF object is responsible for managing a CMF backend to record the pipeline metadata. Internally, it creates a pipeline abstraction that groups individual stages and their executions. All stages, their executions and produced artifacts will be associated with a pipeline with the given name.
cmf = cmf.Cmf(\n filepath=\"mlmd\", # Path to ML Metadata file.\n pipeline_name=\"mnist\" # Name of an ML pipeline.\n) \n
Define a stage. An ML pipeline can have multiple stages, and each stage can be associated with multiple executions. A stage is like a class in the world of object-oriented programming languages. A context (stage description) defines what this stage looks like (name and optional properties), and is created with the create_context method.
context = cmf.create_context(\n pipeline_stage=\"download\", # Stage name\n custom_properties={ # Optional properties\n \"uses_network\": True, # Downloads from the Internet\n \"disk_space\": \"10GB\" # Needs this much space\n }\n)\n
Create a stage execution. A stage in an ML pipeline can have multiple executions. Every run is marked as an execution. This API helps to track the metadata associated with the execution, such as stage parameters (e.g., number of epochs and learning rate for training stages). The stage execution name does not need to be the same as the name of its context. Moreover, CMF will adjust this name to ensure every execution has a unique name. CMF will internally associate this execution with the context created previously. Stage executions are created by calling the create_execution method.
execution = cmf.create_execution(\n execution_type=\"download\", # Execution name.\n custom_properties = { # Execution parameters\n \"url\": \"https://a.com/mnist.gz\" # Data URL.\n }\n)\n
Log artifacts. A stage execution can consume (inputs) and produce (outputs) multiple artifacts (datasets, models and performance metrics). The path of these artifacts must be relative to the project (repository) root path. Artifacts might have optional metadata associated with them. This metadata could include feature statistics for ML datasets, or useful parameters for ML models (such as the number of trees in a random forest classifier).
Datasets are logged with the log_dataset method.
cmf.log_dataset('data/mnist.gz', \"input\", custom_properties={\"name\": \"mnist\", \"type\": 'raw'})\ncmf.log_dataset('data/train.csv', \"output\", custom_properties={\"name\": \"mnist\", \"type\": \"train_split\"})\ncmf.log_dataset('data/test.csv', \"output\", custom_properties={\"name\": \"mnist\", \"type\": \"test_split\"})\n
ML models produced by training stages are logged using the log_model API. ML models can be both input and output artifacts. The metadata associated with the artifact can be logged as an optional argument.
# In train stage\ncmf.log_model(\n path=\"model/rf.pkl\", event=\"output\", model_framework=\"scikit-learn\", model_type=\"RandomForestClassifier\", \n model_name=\"RandomForestClassifier:default\" \n)\n\n# In test stage\ncmf.log_model(\n path=\"model/rf.pkl\", event=\"input\" \n)\n
Metrics of every optimization step (one epoch of Stochastic Gradient Descent, or one boosting round in Gradient Boosting Trees) are logged using the log_metric API.
#Can be called at every epoch or every step in the training. This is logged to a parquet file and committed at the \n# commit stage.\n\n#Inside training loop\nwhile True: \n cmf.log_metric(\"training_metrics\", {\"loss\": loss}) \ncmf.commit_metrics(\"training_metrics\")\n
Stage metrics, or final metrics, are logged with the log_execution_metrics method. These are final metrics of a stage, such as final train or test accuracy.
cmf.log_execution_metrics(\"metrics\", {\"avg_prec\": avg_prec, \"roc_auc\": roc_auc})\n
Dataslices are intended to track subsets of the data. For instance, they can be used to track and compare the accuracy of ML models on these subsets to identify model bias. Data slices are created with the create_dataslice method.
import random\n\ndataslice = cmf.create_dataslice(\"slice-a\")\nfor _ in range(1, 20):\n    j = random.randrange(100)\n    dataslice.add_data(\"data/raw_data/\" + str(j) + \".xml\")\ndataslice.commit()\n
"},{"location":"#graph-layer-overview","title":"Graph Layer Overview","text":"CMF library has an optional graph layer
which stores the relationships in a Neo4J graph database. To use the graph layer, the graph
parameter in the library init call must be set to true (it is set to false by default). The library reads the configuration parameters of the graph database from cmf config
generated by cmf init
command.
cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for minioS3 accordingly.
To use the graph layer, instantiate the CMF with graph=True
parameter:
from cmflib import cmf\n\ncmf = cmf.Cmf(\n filename=\"mlmd\",\n pipeline_name=\"anomaly_detection_pipeline\", \n graph=True\n)\n
"},{"location":"#jupyter-lab-docker-container-with-cmf-pre-installed","title":"Jupyter Lab docker container with CMF pre-installed","text":""},{"location":"#use-a-jupyterlab-docker-environment-with-cmf-pre-installed","title":"Use a Jupyterlab Docker environment with CMF pre-installed","text":"CMF has a docker-compose file which creates two docker containers, - JupyterLab Notebook Environment with CMF pre installed. - Accessible at http://[HOST.IP.AD.DR]:8888 (default token: docker
) - Within the Jupyterlab environment, a startup script switches context to $USER:$GROUP
as specified in .env
- example-get-started
from this repo is bind mounted into /home/jovyan/example-get-started
- Neo4j Docker container to store and access lineages.
Create a .env file in the current folder using env-example as a template. Modify the .env file to set the following variables: USER, UID, GROUP, GID, GIT_USER_NAME, GIT_USER_EMAIL, GIT_REMOTE_URL (these are used by docker-compose.yml).
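For illustration, a filled-in .env might look like the following (all values are placeholders; substitute your own user, group and Git details):
USER=jovyan\nUID=1000\nGROUP=users\nGID=100\nGIT_USER_NAME=yourname\nGIT_USER_EMAIL=you@example.com\nGIT_REMOTE_URL=https://github.com/user/experiment-repo.git\n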
Update docker-compose.yml
as needed. Your .ssh folder is mounted inside the Docker container to enable you to push and pull code from Git. Create these directories in your home folder:
mkdir $HOME/workspace \nmkdir $HOME/dvc_remote \n
workspace - will be mounted inside the CMF pre-installed Docker container (can be your code directory). dvc_remote - remote data store for DVC. Alternatively, change the lines below in docker-compose.yml to reflect the appropriate directories:
If your workspace is named \"experiment\" change the below line\n$HOME/workspace:/home/jovyan/workspace to \n$HOME/experiment:/home/jovyan/wokspace\n
If your remote is /extmount/data change the line \n$HOME/dvc_remote:/home/jovyan/dvc_remote to \n/extmount/data:/home/jovyan/dvc_remote \n
Start the Docker containers: docker-compose up --build -d\n
Access the Jupyter notebook at http://[HOST.IP.AD.DR]:8888 (default token: docker). Click the terminal icon. Quick Start:
cd example-get-started\ncmf init local --path /home/user/local-storage --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\nsh test_script.sh\ncmf artifact push -p 'Test-env'\n
The above steps run a pre-coded example pipeline, and the metadata is stored in a file named \"mlmd\". The artifacts created are pushed to the configured DVC remote (default: /home/dvc_remote). The stored metadata can then be viewed, and metadata lineage can be accessed in Neo4j. Open http://host:7475/browser/ and connect to the server with the default password neo4j123 (to change this, modify the .env file). Run the query:
MATCH (a:Execution)-[r]-(b) WHERE (b:Dataset or b:Model or b:Metrics) RETURN a,r, b \n
Expected output: a lineage graph in the Neo4j browser. Jupyter Lab Notebook: select the kernel Python[conda env:python37].
Shutdown/remove (Remove volumes as well)
docker-compose down -v\n
"},{"location":"#license","title":"License","text":"CMF is an open source project hosted on GitHub and distributed according to the Apache 2.0 licence. We are welcome user contributions - send us a message on the Slack channel or open a GitHub issue or a pull request on GitHub.
"},{"location":"#citation","title":"Citation","text":"@mist{foltin2022cmf,\n title={Self-Learning Data Foundation for Scientific AI},\n author={Martin Foltin, Annmary Justine, Sergey Serebryakov, Cong Xu, Aalap Tripathy, Suparna Bhattacharya, \n Paolo Faraboschi},\n year={2022},\n note = {Presented at the \"Monterey Data Conference\"},\n URL={https://drive.google.com/file/d/1Oqs0AN0RsAjt_y9ZjzYOmBxI8H0yqSpB/view},\n}\n
"},{"location":"#community","title":"Community","text":"Help
Common Metadata Framework and its documentation are in an active stage of development and are very new. If anything is unclear or missing, or there is a typo, please open an issue or a pull request on GitHub.
"},{"location":"_src/","title":"CMF docs development resources","text":"This directory contains files that are used to create some content for the CMF documentation. This process is not automated yet. Files in this directory are not supposed to be referenced from documentation pages.
It should also not be necessary to automatically redeploy the documentation (e.g., with GitHub Actions) when documentation files change only in this particular directory.
This call initializes the library and also creates a pipeline object with the name provided. Arguments to be passed to CMF:
\ncmf = cmf.Cmf(filepath=\"mlmd\", pipeline_name=\"Test-env\") \n\nReturns a Context object of mlmd.proto.Context\nArguments filepath String Path to the sqlite file to store the metadata pipeline_name String Name to uniquely identify the pipeline. Note that the name is the unique identifier for a pipeline. If a pipeline already exists with the same name, the existing pipeline object is reused custom_properties Dictionary (Optional Parameter) - Additional properties of the pipeline that need to be stored graph Bool (Optional Parameter) If set to true, the library also stores the relationships in the provided graph database. The following environment variables should be set: NEO4J_URI - The value should be set to the graph server URI. export NEO4J_URI=\"bolt://ip:port\" User name and password: export NEO4J_USER_NAME=neo4j export NEO4J_PASSWD=neo4j
Return Object mlmd.proto.Context
|mlmd.proto.Context Attributes| | |------|------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties| |type| string type| |type_id| int64 type_id| ### 2. create_context - Creates a Stage with properties A pipeline may include multiple stages. A unique name should be provided for every stage in a pipeline. Arguments to be passed to CMF:\ncontext = cmf.create_context(pipeline_stage=\"Prepare\", custom_properties={\"user-metadata1\":\"metadata_value\"})\nArguments pipeline_stage String Name of the pipeline stage custom_properties Dictionary (Optional Parameter) - Developers can provide key value pairs of additional properties of the stage that need to be stored.
Return Object mlmd.proto.Context |mlmd.proto.Context Attributes| | |------|------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties| |type| string type| |type_id| int64 type_id|
"},{"location":"api/public/API/#3-create_execution-creates-an-execution-with-properties","title":"3. create_execution - Creates an Execution with properties","text":"A stage can have multiple executions. A unique name should ne provided for exery execution. Properties of the execution can be paased as key value pairs in the custom properties. Eg: The hyper parameters used for the execution can be passed.
\n\nexecution = cmf.create_execution(execution_type=\"Prepare\",\n custom_properties = {\"Split\":split, \"Seed\":seed})\nexecution_type:String - Name of the execution\ncustom_properties:Dictionary (Optional Parameter)\nReturn Execution object of type mlmd.proto.Execution\nArguments execution_type String Name of the execution custom_properties Dictionary (Optional Parameter)
Return object of type mlmd.proto.Execution | mlmd.proto.Execution Attributes| | |---------------|-------------| |create_time_since_epoch |int64 create_time_since_epoch| |custom_properties |repeated CustomPropertiesEntry custom_properties| |id |int64 id| |last_known_state |State last_known_state| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name |string name| |properties |repeated PropertiesEntry properties [Git_Repo, Context_Type, Git_Start_Commit, Pipeline_Type, Context_ID, Git_End_Commit, Execution(Command used), Pipeline_id| |type |string type| |type_id| int64 type_id|
"},{"location":"api/public/API/#4-log_dataset-logs-a-dataset-and-its-properties","title":"4. log_dataset - Logs a Dataset and its properties","text":"Tracks a Dataset and its version. The version of the dataset is automatically obtained from the versioning software(DVC) and tracked as a metadata.
\nartifact = cmf.log_dataset(\"/repo/data.xml\", \"input\", custom_properties={\"Source\":\"kaggle\"})\nArguments url String The path to the dataset event String Takes arguments INPUT OR OUTPUT custom_properties Dictionary The Dataset properties
Returns an Artifact object of type mlmd.proto.Artifact
|mlmd.proto.Artifact Attributes| | |------|------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties(Commit, Git_Repo)| |state| State state| |type| string type| |type_id| int64 type_id| |uri| string uri|"},{"location":"api/public/API/#5-log_model-logs-a-model-and-its-properties","title":"5. log_model - Logs a model and its properties.","text":"\ncmf.log_model(path=\"path/to/model.pkl\", event=\"output\", model_framework=\"SKlearn\", model_type=\"RandomForestClassifier\", model_name=\"RandomForestClassifier:default\")\n\nReturns an Artifact object of type mlmd.proto.Artifact\nArguments path String Path to the model file event String Takes arguments INPUT OR OUTPUT model_framework String Framework used to create the model model_type String Type of model algorithm used model_name String Name of the algorithm used custom_properties Dictionary The model properties
Returns an Artifact object of type mlmd.proto.Artifact |mlmd.proto.Artifact Attributes| | |-----------|---------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties(commit, model_framework, model_type, model_name)| |state| State state| |type| string type| |type_id| int64 type_id| |uri| string uri|
"},{"location":"api/public/API/#6-log_execution_metrics-logs-the-metrics-for-the-execution","title":"6. log_execution_metrics Logs the metrics for the execution","text":"\ncmf.log_execution_metrics(metrics_name :\"Training_Metrics\", {\"auc\":auc,\"loss\":loss}\nArguments metrics_name String Name to identify the metrics custom_properties Dictionary Metrics"},{"location":"api/public/API/#7-log_metrics-logs-the-per-step-metrics-for-fine-grained-tracking","title":"7. log_metrics Logs the per Step metrics for fine grained tracking","text":"
The metrics provided are stored in a parquet file. The commit_metrics call adds the parquet file to the version control framework. The metrics written to the parquet file can be retrieved using the read_metrics call.
\n# Can be called at every epoch or every step in the training. This is logged to a parquet file and committed at the commit stage.\nwhile True: # Inside training loop\n metawriter.log_metric(\"training_metrics\", {\"loss\":loss}) \nmetawriter.commit_metrics(\"training_metrics\")\nArguments for log_metric metrics_name String Name to identify the metrics custom_properties Dictionary Metrics Arguments for commit_metrics metrics_name String Name to identify the metrics"},{"location":"api/public/API/#8-create_dataslice","title":"8. create_dataslice","text":"
This helps to track a subset of the data. Currently supported only for file abstractions. For example, the accuracy of the model for a slice of data (gender, ethnicity, etc.).
\ndataslice = cmf.create_dataslice(\"slice-a\")\nArguments for create_dataslice name String Name to identify the dataslice Returns a Dataslice object"},{"location":"api/public/API/#9-add_data-adds-data-to-a-dataslice","title":"9. add_data Adds data to a dataslice.","text":"
Currently supported only for file abstractions. Precondition: the parent folder containing the file should already be versioned.
\ndataslice.add_data(\"data/raw_data/\"+str(j)+\".xml\")\nArguments name String Name to identify the file to be added to the dataslice"},{"location":"api/public/API/#10-dataslice-commit-commits-the-created-dataslice","title":"10. Dataslice Commit - Commits the created dataslice","text":"
The created dataslice is versioned and added to the underlying data versioning software.
\ndataslice.commit()\n"},{"location":"api/public/cmf/","title":"cmflib.cmf","text":"
This class provides methods to log metadata for distributed AI pipelines. The class instance creates an ML metadata store to store the metadata. It creates a driver to store nodes and their relationships in Neo4j. The user has to provide the name of the pipeline that needs to be recorded with CMF.
cmflib.cmf.Cmf(\n filepath=\"mlmd\",\n pipeline_name=\"test_pipeline\",\n custom_properties={\"owner\": \"user_a\"},\n graph=False\n)\n
Args: filepath: Path to the sqlite file to store the metadata pipeline_name: Name to uniquely identify the pipeline. Note that the name is the unique identifier for a pipeline. If a pipeline already exists with the same name, the existing pipeline object is reused. custom_properties: Additional properties of the pipeline that need to be stored. graph: If set to true, the library also stores the relationships in the provided graph database. The following variables should be set: neo4j_uri
(graph server URI), neo4j_user
(user name) and neo4j_password
(user password), e.g.: cmf init local --path /home/user/local-storage --git-remote-url https://github.com/XXX/exprepo.git --neo4j-user neo4j --neo4j-password neo4j\n --neo4j-uri bolt://localhost:7687\n
Source code in cmflib/cmf.py
def __init__(\n self,\n filepath: str = \"mlmd\",\n pipeline_name: str = \"\",\n custom_properties: t.Optional[t.Dict] = None,\n graph: bool = False,\n is_server: bool = False,\n ):\n #path to directory\n self.cmf_init_path = filepath.rsplit(\"/\",1)[0] \\\n\t\t\t\t if len(filepath.rsplit(\"/\",1)) > 1 \\\n\t\t\t\t\telse os.getcwd()\n\n logging_dir = change_dir(self.cmf_init_path)\n if is_server is False:\n Cmf.__prechecks()\n if custom_properties is None:\n custom_properties = {}\n if not pipeline_name:\n # assign folder name as pipeline name \n cur_folder = os.path.basename(os.getcwd())\n pipeline_name = cur_folder\n config = mlpb.ConnectionConfig()\n config.sqlite.filename_uri = filepath\n self.store = metadata_store.MetadataStore(config)\n self.filepath = filepath\n self.child_context = None\n self.execution = None\n self.execution_name = \"\"\n self.execution_command = \"\"\n self.metrics = {}\n self.input_artifacts = []\n self.execution_label_props = {}\n self.graph = graph\n #last token in filepath\n self.branch_name = filepath.rsplit(\"/\", 1)[-1]\n\n if is_server is False:\n git_checkout_new_branch(self.branch_name)\n self.parent_context = get_or_create_parent_context(\n store=self.store,\n pipeline=pipeline_name,\n custom_properties=custom_properties,\n )\n if is_server:\n Cmf.__get_neo4j_server_config()\n if graph is True:\n Cmf.__load_neo4j_params()\n self.driver = graph_wrapper.GraphDriver(\n Cmf.__neo4j_uri, Cmf.__neo4j_user, Cmf.__neo4j_password\n )\n self.driver.create_pipeline_node(\n pipeline_name, self.parent_context.id, custom_properties\n )\n os.chdir(logging_dir)\n
This module contains all the public API for CMF
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.create_context","title":"create_context(pipeline_stage, custom_properties=None)
","text":"Create's a context(stage). Every call creates a unique pipeline stage. Updates Pipeline_stage name. Example:
#Create context\n# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Create context\ncontext: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n)\n
Args: Pipeline_stage: Name of the Stage. custom_properties: Developers can provide key value pairs with additional properties of the execution that need to be stored. Returns: Context object from ML Metadata library associated with the new context for this stage. Source code in cmflib/cmf.py
def create_context(\n self, pipeline_stage: str, custom_properties: t.Optional[t.Dict] = None\n) -> mlpb.Context:\n \"\"\"Create's a context(stage).\n Every call creates a unique pipeline stage.\n Updates Pipeline_stage name.\n Example:\n ```python\n #Create context\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Create context\n context: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n )\n\n ```\n Args:\n Pipeline_stage: Name of the Stage.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\n Returns:\n Context object from ML Metadata library associated with the new context for this stage.\n \"\"\"\n custom_props = {} if custom_properties is None else custom_properties\n pipeline_stage = self.parent_context.name + \"/\" + pipeline_stage\n ctx = get_or_create_run_context(\n self.store, pipeline_stage, custom_props)\n self.child_context = ctx\n associate_child_to_parent_context(\n store=self.store, parent_context=self.parent_context, child_context=ctx\n )\n if self.graph:\n self.driver.create_stage_node(\n pipeline_stage, self.parent_context, ctx.id, custom_props\n )\n return ctx\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.merge_created_context","title":"merge_created_context(pipeline_stage, custom_properties=None)
","text":"Merge created context. Every call creates a unique pipeline stage. Created for metadata push purpose. Example:
```python\n#Create context\n# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Create context\ncontext: mlmd.proto.Context = cmf.merge_created_context(\n pipeline_stage=\"Test-env/prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n```\nArgs:\n Pipeline_stage: Pipeline_Name/Stage_name.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\nReturns:\n Context object from ML Metadata library associated with the new context for this stage.\n
Source code in cmflib/cmf.py
def merge_created_context(\n self, pipeline_stage: str, custom_properties: t.Optional[t.Dict] = None\n) -> mlpb.Context:\n \"\"\"Merge created context.\n Every call creates a unique pipeline stage.\n Created for metadata push purpose.\n Example:\n\n ```python\n #Create context\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Create context\n context: mlmd.proto.Context = cmf.merge_created_context(\n pipeline_stage=\"Test-env/prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n ```\n Args:\n Pipeline_stage: Pipeline_Name/Stage_name.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\n Returns:\n Context object from ML Metadata library associated with the new context for this stage.\n \"\"\"\n\n custom_props = {} if custom_properties is None else custom_properties\n ctx = get_or_create_run_context(\n self.store, pipeline_stage, custom_props)\n self.child_context = ctx\n associate_child_to_parent_context(\n store=self.store, parent_context=self.parent_context, child_context=ctx\n )\n if self.graph:\n self.driver.create_stage_node(\n pipeline_stage, self.parent_context, ctx.id, custom_props\n )\n return ctx\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.create_execution","title":"create_execution(execution_type, custom_properties=None, cmd=None, create_new_execution=True)
","text":"Create execution. Every call creates a unique execution. Execution can only be created within a context, so create_context must be called first. Example:
# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Create or reuse context for this stage\ncontext: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n)\n# Create a new execution for this stage run\nexecution: mlmd.proto.Execution = cmf.create_execution(\n execution_type=\"Prepare\",\n custom_properties = {\"split\": split, \"seed\": seed}\n)\n
Args: execution_type: Type of the execution.(when create_new_execution is False, this is the name of execution) custom_properties: Developers can provide key value pairs with additional properties of the execution that need to be stored. cmd: command used to run this execution.\n\ncreate_new_execution:bool = True, This can be used by advanced users to re-use executions\n This is applicable, when working with framework code like mmdet, pytorch lightning etc, where the\n custom call-backs are used to log metrics.\n if create_new_execution is True(Default), execution_type parameter will be used as the name of the execution type.\n if create_new_execution is False, if existing execution exist with the same name as execution_type.\n it will be reused.\n Only executions created with create_new_execution as False will have \"name\" as a property.\n
Returns:
Type DescriptionExecution
Execution object from ML Metadata library associated with the new execution for this stage.
Source code incmflib/cmf.py
def create_execution(\n self,\n execution_type: str,\n custom_properties: t.Optional[t.Dict] = None,\n cmd: str = None,\n create_new_execution: bool = True,\n) -> mlpb.Execution:\n \"\"\"Create execution.\n Every call creates a unique execution. Execution can only be created within a context, so\n [create_context][cmflib.cmf.Cmf.create_context] must be called first.\n Example:\n ```python\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Create or reuse context for this stage\n context: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n )\n # Create a new execution for this stage run\n execution: mlmd.proto.Execution = cmf.create_execution(\n execution_type=\"Prepare\",\n custom_properties = {\"split\": split, \"seed\": seed}\n )\n ```\n Args:\n execution_type: Type of the execution.(when create_new_execution is False, this is the name of execution)\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\n\n cmd: command used to run this execution.\n\n create_new_execution:bool = True, This can be used by advanced users to re-use executions\n This is applicable, when working with framework code like mmdet, pytorch lightning etc, where the\n custom call-backs are used to log metrics.\n if create_new_execution is True(Default), execution_type parameter will be used as the name of the execution type.\n if create_new_execution is False, if existing execution exist with the same name as execution_type.\n it will be reused.\n Only executions created with create_new_execution as False will have \"name\" as a property.\n\n\n Returns:\n Execution object from ML Metadata library associated with the new execution for this stage.\n \"\"\"\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # Initializing the execution related fields\n\n self.metrics = {}\n self.input_artifacts = []\n self.execution_label_props = {}\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = git_get_repo()\n git_start_commit = git_get_commit()\n cmd = str(sys.argv) if cmd is None else cmd\n python_env=get_python_env()\n self.execution = create_new_execution_in_existing_run_context(\n store=self.store,\n # Type field when re-using executions\n execution_type_name=self.child_context.name,\n execution_name=execution_type, \n #Name field if we are re-using executions\n #Type field , if creating new executions always \n context_id=self.child_context.id,\n execution=cmd,\n pipeline_id=self.parent_context.id,\n pipeline_type=self.parent_context.name,\n git_repo=git_repo,\n git_start_commit=git_start_commit,\n python_env=python_env,\n custom_properties=custom_props,\n create_new_execution=create_new_execution,\n )\n uuids = self.execution.properties[\"Execution_uuid\"].string_value\n if uuids:\n self.execution.properties[\"Execution_uuid\"].string_value = uuids+\",\"+str(uuid.uuid1())\n 
else:\n self.execution.properties[\"Execution_uuid\"].string_value = str(uuid.uuid1()) \n self.store.put_executions([self.execution])\n self.execution_name = str(self.execution.id) + \",\" + execution_type\n self.execution_command = cmd\n for k, v in custom_props.items():\n k = re.sub(\"-\", \"_\", k)\n self.execution_label_props[k] = v\n self.execution_label_props[\"Execution_Name\"] = (\n execution_type + \":\" + str(self.execution.id)\n )\n\n self.execution_label_props[\"execution_command\"] = cmd\n if self.graph:\n self.driver.create_execution_node(\n self.execution_name,\n self.child_context.id,\n self.parent_context,\n cmd,\n self.execution.id,\n custom_props,\n )\n os.chdir(logging_dir)\n return self.execution\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.update_execution","title":"update_execution(execution_id, custom_properties=None)
","text":"Updates an existing execution. The custom properties can be updated after creation of the execution. The new custom properties is merged with earlier custom properties. Example
# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Update a execution\nexecution: mlmd.proto.Execution = cmf.update_execution(\n execution_id=8,\n custom_properties = {\"split\": split, \"seed\": seed}\n)\n
Args: execution_id: id of the execution. custom_properties: Developers can provide key value pairs with additional properties of the execution that need to be updated. Returns: Execution object from ML Metadata library associated with the updated execution for this stage. Source code in cmflib/cmf.py
def update_execution(\n self, execution_id: int, custom_properties: t.Optional[t.Dict] = None\n):\n \"\"\"Updates an existing execution.\n The custom properties can be updated after creation of the execution.\n The new custom properties is merged with earlier custom properties.\n Example\n ```python\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Update a execution\n execution: mlmd.proto.Execution = cmf.update_execution(\n execution_id=8,\n custom_properties = {\"split\": split, \"seed\": seed}\n )\n ```\n Args:\n execution_id: id of the execution.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be updated.\n Returns:\n Execution object from ML Metadata library associated with the updated execution for this stage.\n \"\"\"\n self.execution = self.store.get_executions_by_id([execution_id])[0]\n if self.execution is None:\n print(\"Error - no execution id\")\n return\n execution_type = self.store.get_execution_types_by_id([self.execution.type_id])[\n 0\n ]\n\n if custom_properties:\n for key, value in custom_properties.items():\n if isinstance(value, int):\n self.execution.custom_properties[key].int_value = value\n else:\n self.execution.custom_properties[key].string_value = str(\n value)\n self.store.put_executions([self.execution])\n c_props = {}\n for k, v in self.execution.custom_properties.items():\n key = re.sub(\"-\", \"_\", k)\n val_type = str(v).split(\":\", maxsplit=1)[0]\n if val_type == \"string_value\":\n val = self.execution.custom_properties[k].string_value\n else:\n val = str(v).split(\":\")[1]\n # The properties value are stored in the format type:value hence,\n # taking only value\n self.execution_label_props[key] = val\n c_props[key] = val\n self.execution_name = str(self.execution.id) + \\\n \",\" + execution_type.name\n self.execution_command = self.execution.properties[\"Execution\"]\n self.execution_label_props[\"Execution_Name\"] = (\n execution_type.name + \":\" + str(self.execution.id)\n )\n self.execution_label_props[\"execution_command\"] = self.execution.properties[\n \"Execution\"\n ].string_value\n if self.graph:\n self.driver.create_execution_node(\n self.execution_name,\n self.child_context.id,\n self.parent_context,\n self.execution.properties[\"Execution\"].string_value,\n self.execution.id,\n c_props,\n )\n return self.execution\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_dataset","title":"log_dataset(url, event, custom_properties=None, external=False)
","text":"Logs a dataset as artifact. This call adds the dataset to dvc. The dvc metadata file created (.dvc) will be added to git and committed. The version of the dataset is automatically obtained from the versioning software(DVC) and tracked as a metadata. Example:
artifact: mlmd.proto.Artifact = cmf.log_dataset(\n url=\"/repo/data.xml\",\n event=\"input\",\n custom_properties={\"source\":\"kaggle\"}\n)\n
Args: url: The path to the dataset. event: Takes arguments INPUT
OR OUTPUT
. custom_properties: Dataset properties (key/value pairs). Returns: Artifact object from ML Metadata library associated with the new dataset artifact. Source code in cmflib/cmf.py
def log_dataset(\n self,\n url: str,\n event: str,\n custom_properties: t.Optional[t.Dict] = None,\n external: bool = False,\n) -> mlpb.Artifact:\n \"\"\"Logs a dataset as artifact.\n This call adds the dataset to dvc. The dvc metadata file created (.dvc) will be added to git and committed. The\n version of the dataset is automatically obtained from the versioning software(DVC) and tracked as a metadata.\n Example:\n ```python\n artifact: mlmd.proto.Artifact = cmf.log_dataset(\n url=\"/repo/data.xml\",\n event=\"input\",\n custom_properties={\"source\":\"kaggle\"}\n )\n ```\n Args:\n url: The path to the dataset.\n event: Takes arguments `INPUT` OR `OUTPUT`.\n custom_properties: Dataset properties (key/value pairs).\n Returns:\n Artifact object from ML Metadata library associated with the new dataset artifact.\n \"\"\"\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.execution:\n self.create_execution(execution_type=name_without_extension)\n assert self.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n ### To Do : Technical Debt. \n # If the dataset already exist , then we just link the existing dataset to the execution\n # We do not update the dataset properties . \n # We need to append the new properties to the existing dataset properties\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = git_get_repo()\n name = re.split(\"/\", url)[-1]\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n commit_output(url, self.execution.id)\n c_hash = dvc_get_hash(url)\n\n if c_hash == \"\":\n print(\"Error in getting the dvc hash,return without logging\")\n return\n\n dataset_commit = c_hash\n dvc_url = dvc_get_url(url)\n dvc_url_with_pipeline = f\"{self.parent_context.name}:{dvc_url}\"\n url = url + \":\" + c_hash\n if c_hash and c_hash.strip:\n existing_artifact.extend(self.store.get_artifacts_by_uri(c_hash))\n\n # To Do - What happens when uri is the same but names are different\n if existing_artifact and len(existing_artifact) != 0:\n existing_artifact = existing_artifact[0]\n\n # Quick fix- Updating only the name\n if custom_properties is not None:\n self.update_existing_artifact(\n existing_artifact, custom_properties)\n uri = c_hash\n # update url for existing artifact\n self.update_dataset_url(existing_artifact, dvc_url_with_pipeline)\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=uri,\n input_name=url,\n event_type=event_type,\n )\n else:\n # if((existing_artifact and len(existing_artifact )!= 0) and c_hash != \"\"):\n # url = url + \":\" + str(self.execution.id)\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=url,\n type_name=\"Dataset\",\n event_type=event_type,\n properties={\n \"git_repo\": str(git_repo),\n # passing c_hash value 
to commit\n \"Commit\": str(dataset_commit),\n \"url\": str(dvc_url_with_pipeline),\n },\n artifact_type_properties={\n \"git_repo\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n custom_props[\"git_repo\"] = git_repo\n custom_props[\"Commit\"] = dataset_commit\n self.execution_label_props[\"git_repo\"] = git_repo\n self.execution_label_props[\"Commit\"] = dataset_commit\n\n if self.graph:\n self.driver.create_dataset_node(\n name,\n url,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n self.driver.create_execution_links(uri, name, \"Dataset\")\n else:\n child_artifact = {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n os.chdir(logging_dir)\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_dataset_with_version","title":"log_dataset_with_version(url, version, event, props=None, custom_properties=None)
","text":"Logs a dataset when the version (hash) is known. Example:
artifact: mlpb.Artifact = cmf.log_dataset_with_version( \n url=\"path/to/dataset\", \n version=\"abcdef\",\n event=\"output\",\n props={ \"git_repo\": \"https://github.com/example/repo\",\n \"url\": \"/path/in/repo\", },\n custom_properties={ \"custom_key\": \"custom_value\", }, \n ) \n
Args: url: Path to the dataset. version: Hash or version identifier for the dataset. event: Takes arguments INPUT
or OUTPUT
. props: Optional properties for the dataset (e.g., git_repo, url). custom_properties: Optional custom properties for the dataset. Returns: Artifact object from the ML Protocol Buffers library associated with the new dataset artifact. Source code in cmflib/cmf.py
def log_dataset_with_version(\n self,\n url: str,\n version: str,\n event: str,\n props: t.Optional[t.Dict] = None,\n custom_properties: t.Optional[t.Dict] = None,\n) -> mlpb.Artifact:\n \"\"\"Logs a dataset when the version (hash) is known.\n Example: \n ```python \n artifact: mlpb.Artifact = cmf.log_dataset_with_version( \n url=\"path/to/dataset\", \n version=\"abcdef\",\n event=\"output\",\n props={ \"git_repo\": \"https://github.com/example/repo\",\n \"url\": \"/path/in/repo\", },\n custom_properties={ \"custom_key\": \"custom_value\", }, \n ) \n ```\n Args: \n url: Path to the dataset. \n version: Hash or version identifier for the dataset. \n event: Takes arguments `INPUT` or `OUTPUT`. \n props: Optional properties for the dataset (e.g., git_repo, url). \n custom_properties: Optional custom properties for the dataset.\n Returns:\n Artifact object from the ML Protocol Buffers library associated with the new dataset artifact. \n \"\"\"\n\n props = {} if props is None else props\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = props.get(\"git_repo\", \"\")\n name = url\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n c_hash = version\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n # dataset_commit = commit_output(url, self.execution.id)\n\n dataset_commit = version\n url = url + \":\" + c_hash\n if c_hash and c_hash.strip:\n existing_artifact.extend(self.store.get_artifacts_by_uri(c_hash))\n\n # To Do - What happens when uri is the same but names are different\n if existing_artifact and len(existing_artifact) != 0:\n existing_artifact = existing_artifact[0]\n\n # Quick fix- Updating only the name\n if custom_properties is not None:\n self.update_existing_artifact(\n existing_artifact, custom_properties)\n uri = c_hash\n # update url for existing artifact\n self.update_dataset_url(existing_artifact, props.get(\"url\", \"\"))\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=uri,\n input_name=url,\n event_type=event_type,\n )\n else:\n # if((existing_artifact and len(existing_artifact )!= 0) and c_hash != \"\"):\n # url = url + \":\" + str(self.execution.id)\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=url,\n type_name=\"Dataset\",\n event_type=event_type,\n properties={\n \"git_repo\": str(git_repo),\n \"Commit\": str(dataset_commit),\n \"url\": props.get(\"url\", \" \"),\n },\n artifact_type_properties={\n \"git_repo\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n custom_props[\"git_repo\"] = git_repo\n custom_props[\"Commit\"] = dataset_commit\n self.execution_label_props[\"git_repo\"] = git_repo\n self.execution_label_props[\"Commit\"] = dataset_commit\n\n if self.graph:\n self.driver.create_dataset_node(\n name,\n url,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n 
self.driver.create_execution_links(uri, name, \"Dataset\")\n else:\n child_artifact = {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_model","title":"log_model(path, event, model_framework='Default', model_type='Default', model_name='Default', custom_properties=None)
","text":"Logs a model. The model is added to dvc and the metadata file (.dvc) gets committed to git. Example:
artifact: mlmd.proto.Artifact= cmf.log_model(\n path=\"path/to/model.pkl\",\n event=\"output\",\n model_framework=\"SKlearn\",\n model_type=\"RandomForestClassifier\",\n model_name=\"RandomForestClassifier:default\"\n)\n
Args: path: Path to the model file. event: Takes arguments INPUT
OR OUTPUT
. model_framework: Framework used to create the model. model_type: Type of model algorithm used. model_name: Name of the algorithm used. custom_properties: The model properties. Returns: Artifact object from ML Metadata library associated with the new model artifact. Source code in cmflib/cmf.py
def log_model(\n self,\n path: str,\n event: str,\n model_framework: str = \"Default\",\n model_type: str = \"Default\",\n model_name: str = \"Default\",\n custom_properties: t.Optional[t.Dict] = None,\n) -> mlpb.Artifact:\n \"\"\"Logs a model.\n The model is added to dvc and the metadata file (.dvc) gets committed to git.\n Example:\n ```python\n artifact: mlmd.proto.Artifact= cmf.log_model(\n path=\"path/to/model.pkl\",\n event=\"output\",\n model_framework=\"SKlearn\",\n model_type=\"RandomForestClassifier\",\n model_name=\"RandomForestClassifier:default\"\n )\n ```\n Args:\n path: Path to the model file.\n event: Takes arguments `INPUT` OR `OUTPUT`.\n model_framework: Framework used to create the model.\n model_type: Type of model algorithm used.\n model_name: Name of the algorithm used.\n custom_properties: The model properties.\n Returns:\n Artifact object from ML Metadata library associated with the new model artifact.\n \"\"\"\n\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.execution:\n self.create_execution(execution_type=name_without_extension)\n assert self.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n\n # To Do : Technical Debt. \n # If the model already exist , then we just link the existing model to the execution\n # We do not update the model properties . 
\n # We need to append the new properties to the existing model properties\n if custom_properties is None:\n custom_properties = {}\n custom_props = {} if custom_properties is None else custom_properties\n # name = re.split('/', path)[-1]\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n commit_output(path, self.execution.id)\n c_hash = dvc_get_hash(path)\n\n if c_hash == \"\":\n print(\"Error in getting the dvc hash,return without logging\")\n return\n\n model_commit = c_hash\n\n # If connecting to an existing artifact - The name of the artifact is\n # used as path/steps/key\n model_uri = path + \":\" + c_hash\n dvc_url = dvc_get_url(path, False)\n url = dvc_url\n url_with_pipeline = f\"{self.parent_context.name}:{url}\"\n uri = \"\"\n if c_hash and c_hash.strip():\n uri = c_hash.strip()\n existing_artifact.extend(self.store.get_artifacts_by_uri(uri))\n else:\n raise RuntimeError(\"Model commit failed, Model uri empty\")\n\n if (\n existing_artifact\n and len(existing_artifact) != 0\n ):\n # update url for existing artifact\n existing_artifact = self.update_model_url(\n existing_artifact, url_with_pipeline\n )\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=c_hash,\n input_name=model_uri,\n event_type=event_type,\n )\n model_uri = artifact.name\n else:\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n model_uri = model_uri + \":\" + str(self.execution.id)\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=model_uri,\n type_name=\"Model\",\n event_type=event_type,\n properties={\n \"model_framework\": str(model_framework),\n \"model_type\": str(model_type),\n \"model_name\": str(model_name),\n # passing c_hash value to commit\n \"Commit\": str(model_commit),\n \"url\": str(url_with_pipeline),\n },\n artifact_type_properties={\n \"model_framework\": mlpb.STRING,\n \"model_type\": mlpb.STRING,\n \"model_name\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n # custom_properties[\"Commit\"] = model_commit\n self.execution_label_props[\"Commit\"] = model_commit\n #To DO model nodes should be similar to dataset nodes when we create neo4j\n if self.graph:\n self.driver.create_model_node(\n model_uri,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n self.driver.create_execution_links(uri, model_uri, \"Model\")\n else:\n child_artifact = {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n os.chdir(logging_dir)\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_model_with_version","title":"log_model_with_version(path, event, props=None, custom_properties=None)
","text":"Logs a model when the version(hash) is known The model is added to dvc and the metadata file (.dvc) gets committed to git. Example:
artifact: mlmd.proto.Artifact= cmf.log_model_with_version(\n path=\"path/to/model.pkl\",\n event=\"output\",\n props={\n \"url\": \"/home/user/local-storage/bf/629ccd5cd008066b72c04f9a918737\",\n \"model_type\": \"RandomForestClassifier\",\n \"model_name\": \"RandomForestClassifier:default\",\n \"Commit\": \"commit 1146dad8b74cae205db6a3132ea403db1e4032e5\",\n \"model_framework\": \"SKlearn\",\n },\n custom_properties={\n \"uri\": \"bf629ccd5cd008066b72c04f9a918737\",\n },\n\n)\n
Args: path: Path to the model file. event: Takes arguments INPUT
OR OUTPUT
. props: Model artifact properties. custom_properties: The model properties. Returns: Artifact object from ML Metadata library associated with the new model artifact. Source code in cmflib/cmf.py
def log_model_with_version(\n self,\n path: str,\n event: str,\n props=None,\n custom_properties: t.Optional[t.Dict] = None,\n) -> object:\n \"\"\"Logs a model when the version(hash) is known\n The model is added to dvc and the metadata file (.dvc) gets committed to git.\n Example:\n ```python\n artifact: mlmd.proto.Artifact= cmf.log_model_with_version(\n path=\"path/to/model.pkl\",\n event=\"output\",\n props={\n \"url\": \"/home/user/local-storage/bf/629ccd5cd008066b72c04f9a918737\",\n \"model_type\": \"RandomForestClassifier\",\n \"model_name\": \"RandomForestClassifier:default\",\n \"Commit\": \"commit 1146dad8b74cae205db6a3132ea403db1e4032e5\",\n \"model_framework\": \"SKlearn\",\n },\n custom_properties={\n \"uri\": \"bf629ccd5cd008066b72c04f9a918737\",\n },\n\n )\n ```\n Args:\n path: Path to the model file.\n event: Takes arguments `INPUT` OR `OUTPUT`.\n props: Model artifact properties.\n custom_properties: The model properties.\n Returns:\n Artifact object from ML Metadata library associated with the new model artifact.\n \"\"\"\n\n if custom_properties is None:\n custom_properties = {}\n custom_props = {} if custom_properties is None else custom_properties\n name = re.split(\"/\", path)[-1]\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n # props[\"commit\"] = \"\" # To do get from incoming data\n c_hash = props.get(\"uri\", \" \")\n # If connecting to an existing artifact - The name of the artifact is used as path/steps/key\n model_uri = path + \":\" + c_hash\n # dvc_url = dvc_get_url(path, False)\n url = props.get(\"url\", \"\")\n # uri = \"\"\n if c_hash and c_hash.strip():\n uri = c_hash.strip()\n existing_artifact.extend(self.store.get_artifacts_by_uri(uri))\n else:\n raise RuntimeError(\"Model commit failed, Model uri empty\")\n\n if (\n existing_artifact\n and len(existing_artifact) != 0\n ):\n # update url for existing artifact\n existing_artifact = self.update_model_url(existing_artifact, url)\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=c_hash,\n input_name=model_uri,\n event_type=event_type,\n )\n model_uri = artifact.name\n else:\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n model_uri = model_uri + \":\" + str(self.execution.id)\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=model_uri,\n type_name=\"Model\",\n event_type=event_type,\n properties={\n \"model_framework\": props.get(\"model_framework\", \"\"),\n \"model_type\": props.get(\"model_type\", \"\"),\n \"model_name\": props.get(\"model_name\", \"\"),\n \"Commit\": props.get(\"Commit\", \"\"),\n \"url\": str(url),\n },\n artifact_type_properties={\n \"model_framework\": mlpb.STRING,\n \"model_type\": mlpb.STRING,\n \"model_name\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n # custom_properties[\"Commit\"] = model_commit\n # custom_props[\"url\"] = url\n self.execution_label_props[\"Commit\"] = props.get(\"Commit\", \"\")\n if self.graph:\n self.driver.create_model_node(\n model_uri,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": 
self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n self.driver.create_execution_links(uri, model_uri, \"Model\")\n else:\n child_artifact = {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_execution_metrics_from_client","title":"log_execution_metrics_from_client(metrics_name, custom_properties=None)
","text":"Logs execution metrics from a client. Data from pre-existing metrics from client side is used to create identical metrics on server side. Example:
artifact: mlpb.Artifact = cmf.log_execution_metrics_from_client( \n metrics_name=\"example_metrics:uri:123\", \n custom_properties={\"custom_key\": \"custom_value\"}, \n )\n
Args: metrics_name: Name of the metrics in the format \"name:uri:execution_id\". custom_properties: Optional custom properties for the metrics. Returns: Artifact object from the ML Protocol Buffers library associated with the metrics artifact. Source code in cmflib/cmf.py
def log_execution_metrics_from_client(self, metrics_name: str,\n custom_properties: t.Optional[t.Dict] = None) -> mlpb.Artifact:\n \"\"\" Logs execution metrics from a client.\n Data from pre-existing metrics from client side is used to create identical metrics on server side. \n Example: \n ```python \n artifact: mlpb.Artifact = cmf.log_execution_metrics_from_client( \n metrics_name=\"example_metrics:uri:123\", \n custom_properties={\"custom_key\": \"custom_value\"}, \n )\n ``` \n Args: \n metrics_name: Name of the metrics in the format \"name:uri:execution_id\". \n custom_properties: Optional custom properties for the metrics. \n Returns: \n Artifact object from the ML Protocol Buffers library associated with the metrics artifact.\n \"\"\"\n\n metrics = None\n custom_props = {} if custom_properties is None else custom_properties\n existing_artifact = []\n name_tokens = metrics_name.split(\":\")\n if name_tokens and len(name_tokens) > 2:\n name = name_tokens[0]\n uri = name_tokens[1]\n execution_id = name_tokens[2]\n else:\n print(f\"Error : metrics name {metrics_name} is not in the correct format\")\n return \n\n #we need to add the execution id to the metrics name\n new_metrics_name = f\"{name}:{uri}:{str(self.execution.id)}\"\n existing_artifacts = self.store.get_artifacts_by_uri(uri)\n\n existing_artifact = existing_artifacts[0] if existing_artifacts else None\n if not existing_artifact or \\\n ((existing_artifact) and not\n (existing_artifact.name == new_metrics_name)): #we need to add the artifact otherwise its already there \n metrics = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=new_metrics_name,\n type_name=\"Metrics\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\"metrics_name\": metrics_name},\n artifact_type_properties={\"metrics_name\": mlpb.STRING},\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.graph:\n # To do create execution_links\n self.driver.create_metrics_node(\n metrics_name,\n uri,\n \"output\",\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n child_artifact = {\n \"Name\": metrics_name,\n \"URI\": uri,\n \"Event\": \"output\",\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Metrics\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n return metrics\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_execution_metrics","title":"log_execution_metrics(metrics_name, custom_properties=None)
","text":"Log the metadata associated with the execution (coarse-grained tracking). It is stored as a metrics artifact. This does not have a backing physical file, unlike other artifacts that we have. Example:
exec_metrics: mlpb.Artifact = cmf.log_execution_metrics(\n    metrics_name=\"Training_Metrics\",\n    custom_properties={\"auc\": auc, \"loss\": loss}\n)\n
Args: metrics_name: Name to identify the metrics. custom_properties: Dictionary with metric values. Returns: Artifact object from ML Metadata library associated with the new coarse-grained metrics artifact. Source code in cmflib/cmf.py
def log_execution_metrics(\n self, metrics_name: str, custom_properties: t.Optional[t.Dict] = None\n) -> mlpb.Artifact:\n \"\"\"Log the metadata associated with the execution (coarse-grained tracking).\n It is stored as a metrics artifact. This does not have a backing physical file, unlike other artifacts that we\n have.\n Example:\n ```python\n exec_metrics: mlpb.Artifact = cmf.log_execution_metrics(\n metrics_name=\"Training_Metrics\",\n {\"auc\": auc, \"loss\": loss}\n )\n ```\n Args:\n metrics_name: Name to identify the metrics.\n custom_properties: Dictionary with metric values.\n Returns:\n Artifact object from ML Metadata library associated with the new coarse-grained metrics artifact.\n \"\"\"\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.execution:\n self.create_execution(execution_type=name_without_extension)\n assert self.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n custom_props = {} if custom_properties is None else custom_properties\n uri = str(uuid.uuid1())\n metrics_name = metrics_name + \":\" + uri + \":\" + str(self.execution.id)\n metrics = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=metrics_name,\n type_name=\"Metrics\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\"metrics_name\": metrics_name},\n artifact_type_properties={\"metrics_name\": mlpb.STRING},\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.graph:\n # To do create execution_links\n self.driver.create_metrics_node(\n metrics_name,\n uri,\n \"output\",\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n child_artifact = {\n \"Name\": metrics_name,\n \"URI\": uri,\n \"Event\": \"output\",\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Metrics\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n os.chdir(logging_dir)\n return metrics\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_metric","title":"log_metric(metrics_name, custom_properties=None)
","text":"Stores the fine-grained (per step or per epoch) metrics to memory. The metrics provided are stored in a parquet file. The commit_metrics
call adds the parquet file to the version control framework. The metrics written to the parquet file can be retrieved using the read_metrics
call. Example:
# Can be called at every epoch or every step in the training. This is logged to a parquet file and committed\n# at the commit stage.\n# Inside training loop\nwhile True:\n cmf.log_metric(\"training_metrics\", {\"train_loss\": train_loss})\ncmf.commit_metrics(\"training_metrics\")\n
Args: metrics_name: Name to identify the metrics. custom_properties: Dictionary with metrics. Source code in cmflib/cmf.py
def log_metric(\n self, metrics_name: str, custom_properties: t.Optional[t.Dict] = None\n) -> None:\n \"\"\"Stores the fine-grained (per step or per epoch) metrics to memory.\n The metrics provided are stored in a parquet file. The `commit_metrics` call add the parquet file in the version\n control framework. The metrics written in the parquet file can be retrieved using the `read_metrics` call.\n Example:\n ```python\n # Can be called at every epoch or every step in the training. This is logged to a parquet file and committed\n # at the commit stage.\n # Inside training loop\n while True:\n cmf.log_metric(\"training_metrics\", {\"train_loss\": train_loss})\n cmf.commit_metrics(\"training_metrics\")\n ```\n Args:\n metrics_name: Name to identify the metrics.\n custom_properties: Dictionary with metrics.\n \"\"\"\n if metrics_name in self.metrics:\n key = max((self.metrics[metrics_name]).keys()) + 1\n self.metrics[metrics_name][key] = custom_properties\n else:\n self.metrics[metrics_name] = {}\n self.metrics[metrics_name][1] = custom_properties\n
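As a complement to the example above, the following hedged sketch shows how log_metric and commit_metrics are typically paired in a finite training loop; the cmf object, the epoch count and the train_one_epoch helper are placeholders for illustration, not part of the library.
for epoch in range(10):\n    # Placeholder for the user's own training step.\n    train_loss = train_one_epoch()\n    # One row of fine-grained metrics per epoch, accumulated in memory.\n    cmf.log_metric(\"training_metrics\", {\"epoch\": epoch, \"train_loss\": train_loss})\n# Write the accumulated rows to a parquet file and version it.\ncmf.commit_metrics(\"training_metrics\")\n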
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.commit_existing_metrics","title":"commit_existing_metrics(metrics_name, uri, props=None, custom_properties=None)
","text":"Commits existing metrics associated with the given URI to MLMD. Example:
artifact: mlpb.Artifact = cmf.commit_existing_metrics(\"existing_metrics\", \"abc123\",\n {\"custom_key\": \"custom_value\"})\n
Args: metrics_name: Name of the metrics. uri: Unique identifier associated with the metrics. custom_properties: Optional custom properties for the metrics. Returns: Artifact object from the ML Protocol Buffers library associated with the existing metrics artifact. Source code in cmflib/cmf.py
def commit_existing_metrics(self, metrics_name: str, uri: str, props: t.Optional[t.Dict] = None, custom_properties: t.Optional[t.Dict] = None):\n \"\"\"\n Commits existing metrics associated with the given URI to MLMD.\n Example:\n ```python\n artifact: mlpb.Artifact = cmf.commit_existing_metrics(\"existing_metrics\", \"abc123\",\n {\"custom_key\": \"custom_value\"})\n ```\n Args:\n metrics_name: Name of the metrics.\n uri: Unique identifier associated with the metrics.\n custom_properties: Optional custom properties for the metrics.\n Returns:\n Artifact object from the ML Protocol Buffers library associated with the existing metrics artifact.\n \"\"\"\n\n custom_props = {} if custom_properties is None else custom_properties\n c_hash = uri.strip()\n existing_artifact = []\n existing_artifact.extend(self.store.get_artifacts_by_uri(c_hash))\n if (existing_artifact\n and len(existing_artifact) != 0 ):\n metrics = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=c_hash,\n input_name=metrics_name,\n event_type=mlpb.Event.Type.OUTPUT,\n )\n else:\n metrics = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=metrics_name,\n type_name=\"Step_Metrics\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\n # passing uri value to commit\n \"Commit\": props.get(\"Commit\", \"\"),\n \"url\": props.get(\"url\", \"\"),\n },\n artifact_type_properties={\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.graph:\n self.driver.create_metrics_node(\n metrics_name,\n uri,\n \"output\",\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n child_artifact = {\n \"Name\": metrics_name,\n \"URI\": uri,\n \"Event\": \"output\",\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Metrics\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n return metrics\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.create_dataslice","title":"create_dataslice(name)
","text":"Creates a dataslice object. Once created, users can add data instances to this data slice with add_data method. Users are also responsible for committing data slices by calling the commit method. Example:
dataslice = cmf.create_dataslice(\"slice-a\")\n
Args: name: Name to identify the dataslice. Returns:
Type DescriptionDataSlice
Instance of a newly created DataSlice.
Source code incmflib/cmf.py
def create_dataslice(self, name: str) -> \"Cmf.DataSlice\":\n \"\"\"Creates a dataslice object.\n Once created, users can add data instances to this data slice with [add_data][cmflib.cmf.Cmf.DataSlice.add_data]\n method. Users are also responsible for committing data slices by calling the\n [commit][cmflib.cmf.Cmf.DataSlice.commit] method.\n Example:\n ```python\n dataslice = cmf.create_dataslice(\"slice-a\")\n ```\n Args:\n name: Name to identify the dataslice.\n\n Returns:\n Instance of a newly created [DataSlice][cmflib.cmf.Cmf.DataSlice].\n \"\"\"\n return Cmf.DataSlice(name, self)\n
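Putting the dataslice calls documented later in this reference together, a minimal sketch (assuming an existing cmf instance and an already-versioned data/raw_data folder) might look like:
dataslice = cmf.create_dataslice(\"slice-a\")\n# Add already-versioned files to the named slice, optionally with per-file properties.\nfor j in range(1, 4):\n    dataslice.add_data(f\"data/raw_data/{j}.xml\", {\"split\": \"train\"})\n# Version the slice and record it as a Dataslice artifact.\ndataslice.commit(custom_properties={\"mean\": 2.5, \"median\": 2.6})\n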
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.update_dataslice","title":"update_dataslice(name, record, custom_properties)
","text":"Updates a dataslice record in a Parquet file with the provided custom properties. Example:
dataslice=cmf.update_dataslice(\"dataslice_file.parquet\", \"record_id\", \n {\"key1\": \"updated_value\"})\n
Args: name: Name of the Parquet file. record: Identifier of the dataslice record to be updated. custom_properties: Dictionary containing custom properties to update. Returns:
Type DescriptionNone
Source code incmflib/cmf.py
def update_dataslice(self, name: str, record: str, custom_properties: t.Dict):\n \"\"\"Updates a dataslice record in a Parquet file with the provided custom properties.\n Example:\n ```python\n dataslice=cmf.update_dataslice(\"dataslice_file.parquet\", \"record_id\", \n {\"key1\": \"updated_value\"})\n ```\n Args:\n name: Name of the Parquet file.\n record: Identifier of the dataslice record to be updated.\n custom_properties: Dictionary containing custom properties to update.\n\n Returns:\n None\n \"\"\"\n directory_path = os.path.join(ARTIFACTS_PATH, self.execution.properties[\"Execution_uuid\"].string_value.split(',')[0], DATASLICE_PATH)\n name = os.path.join(directory_path, name)\n df = pd.read_parquet(name)\n temp_dict = df.to_dict(\"index\")\n temp_dict[record].update(custom_properties)\n dataslice_df = pd.DataFrame.from_dict(temp_dict, orient=\"index\")\n dataslice_df.index.names = [\"Path\"]\n dataslice_df.to_parquet(name)\n
"},{"location":"api/public/cmf/#cmflib.cmf.cmf_init_show","title":"cmf_init_show()
","text":"Initializes and shows details of the CMF command. Example:
result = cmf_init_show() \n
Returns: Output from the _cmf_cmd_init function. Source code in cmflib/cmf.py
def cmf_init_show():\n \"\"\" Initializes and shows details of the CMF command. \n Example: \n ```python \n result = cmf_init_show() \n ``` \n Returns: \n Output from the _cmf_cmd_init function. \n \"\"\"\n\n output=_cmf_cmd_init()\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.cmf_init","title":"cmf_init(type='', path='', git_remote_url='', cmf_server_url='', neo4j_user='', neo4j_password='', neo4j_uri='', url='', endpoint_url='', access_key_id='', secret_key='', session_token='', user='', password='', port=0, osdf_path='', key_id='', key_path='', key_issuer='')
","text":"Initializes the CMF configuration based on the provided parameters. Example:
cmf_init( type=\"local\", \n path=\"/path/to/re\",\n git_remote_url=\"git@github.com:user/repo.git\",\n cmf_server_url=\"http://cmf-server\"\n neo4j_user\", \n neo4j_password=\"password\",\n neo4j_uri=\"bolt://localhost:76\"\n )\n
Args: type: Type of repository (\"local\", \"minioS3\", \"amazonS3\", \"sshremote\") path: Path for the local repository. git_remote_url: Git remote URL for version control. cmf_server_url: CMF server URL. neo4j_user: Neo4j database username. neo4j_password: Neo4j database password. neo4j_uri: Neo4j database URI. url: URL for MinioS3 or AmazonS3. endpoint_url: Endpoint URL for MinioS3. access_key_id: Access key ID for MinioS3 or AmazonS3. secret_key: Secret key for MinioS3 or AmazonS3. session_token: Session token for AmazonS3. user: SSH remote username. password: SSH remote password. port: SSH remote port Returns: Output based on the initialized repository type. Source code in cmflib/cmf.py
def cmf_init(type: str = \"\",\n path: str = \"\",\n git_remote_url: str = \"\",\n cmf_server_url: str = \"\",\n neo4j_user: str = \"\",\n neo4j_password: str = \"\",\n neo4j_uri: str = \"\",\n url: str = \"\",\n endpoint_url: str = \"\",\n access_key_id: str = \"\",\n secret_key: str = \"\",\n session_token: str = \"\",\n user: str = \"\",\n password: str = \"\",\n port: int = 0,\n osdf_path: str = \"\",\n key_id: str = \"\",\n key_path: str = \"\",\n key_issuer: str = \"\",\n ):\n\n \"\"\" Initializes the CMF configuration based on the provided parameters. \n Example:\n ```python\n cmf_init( type=\"local\", \n path=\"/path/to/re\",\n git_remote_url=\"git@github.com:user/repo.git\",\n cmf_server_url=\"http://cmf-server\"\n neo4j_user\", \n neo4j_password=\"password\",\n neo4j_uri=\"bolt://localhost:76\"\n )\n ```\n Args: \n type: Type of repository (\"local\", \"minioS3\", \"amazonS3\", \"sshremote\")\n path: Path for the local repository. \n git_remote_url: Git remote URL for version control.\n cmf_server_url: CMF server URL.\n neo4j_user: Neo4j database username.\n neo4j_password: Neo4j database password.\n neo4j_uri: Neo4j database URI.\n url: URL for MinioS3 or AmazonS3.\n endpoint_url: Endpoint URL for MinioS3.\n access_key_id: Access key ID for MinioS3 or AmazonS3.\n secret_key: Secret key for MinioS3 or AmazonS3. \n session_token: Session token for AmazonS3.\n user: SSH remote username.\n password: SSH remote password. \n port: SSH remote port\n Returns:\n Output based on the initialized repository type.\n \"\"\"\n\n if type == \"\":\n return print(\"Error: Type is not provided\")\n if type not in [\"local\",\"minioS3\",\"amazonS3\",\"sshremote\",\"osdfremote\"]:\n return print(\"Error: Type value is undefined\"+ \" \"+type+\".Expected: \"+\",\".join([\"local\",\"minioS3\",\"amazonS3\",\"sshremote\",\"osdfremote\"]))\n\n if neo4j_user != \"\" and neo4j_password != \"\" and neo4j_uri != \"\":\n pass\n elif neo4j_user == \"\" and neo4j_password == \"\" and neo4j_uri == \"\":\n pass\n else:\n return print(\"Error: Enter all neo4j parameters.\") \n\n args={'path': path,\n 'git_remote_url': git_remote_url,\n 'url': url,\n 'endpoint_url': endpoint_url,\n 'access_key_id': access_key_id,\n 'secret_key': secret_key,\n 'session_token': session_token,\n 'user': user,\n 'password': password,\n 'osdf_path': osdf_path,\n 'key_id': key_id,\n 'key_path': key_path, \n 'key-issuer': key_issuer,\n }\n\n status_args=non_related_args(type, args)\n\n if type == \"local\" and path != \"\" and git_remote_url != \"\" :\n \"\"\"Initialize local repository\"\"\"\n output = _init_local(\n path, git_remote_url, cmf_server_url, neo4j_user, neo4j_password, neo4j_uri\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n return output\n\n elif type == \"minioS3\" and url != \"\" and endpoint_url != \"\" and access_key_id != \"\" and secret_key != \"\" and git_remote_url != \"\":\n \"\"\"Initialize minioS3 repository\"\"\"\n output = _init_minioS3(\n url,\n endpoint_url,\n access_key_id,\n secret_key,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n return output\n\n elif type == \"amazonS3\" and url != \"\" and access_key_id != \"\" and secret_key != \"\" and git_remote_url != \"\":\n \"\"\"Initialize amazonS3 repository\"\"\"\n output = _init_amazonS3(\n url,\n access_key_id,\n 
secret_key,\n session_token,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n\n return output\n\n elif type == \"sshremote\" and path != \"\" and user != \"\" and port != 0 and password != \"\" and git_remote_url != \"\":\n \"\"\"Initialize sshremote repository\"\"\"\n output = _init_sshremote(\n path,\n user,\n port,\n password,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n\n return output\n\n elif type == \"osdfremote\" and osdf_path != \"\" and key_id != \"\" and key_path != 0 and key_issuer != \"\" and git_remote_url != \"\":\n \"\"\"Initialize osdfremote repository\"\"\"\n output = _init_osdfremote(\n osdf_path,\n key_id,\n key_path,\n key_issuer,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n\n return output\n\n else:\n print(\"Error: Enter all arguments\")\n
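For comparison with the local example above, a hedged sketch of a minioS3 configuration is shown below; every value is a placeholder and should be replaced with the settings of the actual bucket, Git remote and cmf-server.
cmf_init(\n    type=\"minioS3\",\n    url=\"s3://bucket-name\",\n    endpoint_url=\"http://localhost:9000\",\n    access_key_id=\"minioadmin\",\n    secret_key=\"minioadmin\",\n    git_remote_url=\"git@github.com:user/repo.git\",\n    cmf_server_url=\"http://cmf-server:8080\",\n)\n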
"},{"location":"api/public/cmf/#cmflib.cmf.metadata_push","title":"metadata_push(pipeline_name, filepath='./mlmd', tensorboard_path='', execution_id='')
","text":"Pushes MLMD file to CMF-server. Example:
result = metadata_push(\"example_pipeline\", \"mlmd_file\", \"3\")\n
Args: pipeline_name: Name of the pipeline. filepath: Path to the MLMD file. execution_id: Optional execution ID. tensorboard_path: Path to tensorboard logs. Returns:
Type DescriptionResponse output from the _metadata_push function.
Source code incmflib/cmf.py
def metadata_push(pipeline_name: str, filepath = \"./mlmd\", tensorboard_path: str = \"\", execution_id: str = \"\"):\n \"\"\" Pushes MLMD file to CMF-server.\n Example:\n ```python\n result = metadata_push(\"example_pipeline\", \"mlmd_file\", \"3\")\n ```\n Args:\n pipeline_name: Name of the pipeline.\n filepath: Path to the MLMD file.\n execution_id: Optional execution ID.\n tensorboard_path: Path to tensorboard logs.\n\n Returns:\n Response output from the _metadata_push function.\n \"\"\"\n # Required arguments: pipeline_name\n # Optional arguments: Execution_ID, filepath (mlmd file path, tensorboard_path\n output = _metadata_push(pipeline_name, filepath, execution_id, tensorboard_path)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.metadata_pull","title":"metadata_pull(pipeline_name, filepath='./mlmd', execution_id='')
","text":"Pulls MLMD file from CMF-server. Example:
result = metadata_pull(\"example_pipeline\", \"./mlmd_directory\", \"execution_123\") \n
Args: pipeline_name: Name of the pipeline. filepath: File path to store the MLMD file. execution_id: Optional execution ID. Returns: Message from the _metadata_pull function. Source code in cmflib/cmf.py
def metadata_pull(pipeline_name: str, filepath = \"./mlmd\", execution_id: str = \"\"):\n \"\"\" Pulls MLMD file from CMF-server. \n Example: \n ```python \n result = metadata_pull(\"example_pipeline\", \"./mlmd_directory\", \"execution_123\") \n ``` \n Args: \n pipeline_name: Name of the pipeline. \n filepath: File path to store the MLMD file. \n execution_id: Optional execution ID. \n Returns: \n Message from the _metadata_pull function. \n \"\"\"\n # Required arguments: pipeline_name \n #Optional arguments: Execution_ID, filepath(file path to store mlmd file) \n output = _metadata_pull(pipeline_name, filepath, execution_id)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.artifact_pull","title":"artifact_pull(pipeline_name, filepath='./mlmd')
","text":"Pulls artifacts from the initialized repository.
Example:
result = artifact_pull(\"example_pipeline\", \"./mlmd_directory\")\n
Parameters:
Name Type Description Defaultpipeline_name
str
Name of the pipeline.
requiredfilepath
Path to store artifacts.
'./mlmd'
Returns: Output from the _artifact_pull function.
Source code incmflib/cmf.py
def artifact_pull(pipeline_name: str, filepath = \"./mlmd\"):\n \"\"\" Pulls artifacts from the initialized repository.\n\n Example:\n ```python\n result = artifact_pull(\"example_pipeline\", \"./mlmd_directory\")\n ```\n\n Args:\n pipeline_name: Name of the pipeline.\n filepath: Path to store artifacts.\n Returns:\n Output from the _artifact_pull function.\n \"\"\"\n\n # Required arguments: Pipeline_name\n # Optional arguments: filepath( path to store artifacts)\n output = _artifact_pull(pipeline_name, filepath)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.artifact_pull_single","title":"artifact_pull_single(pipeline_name, filepath, artifact_name)
","text":"Pulls a single artifact from the initialized repository. Example:
result = artifact_pull_single(\"example_pipeline\", \"./mlmd_directory\", \"example_artifact\") \n
Args: pipeline_name: Name of the pipeline. filepath: Path to store the artifact. artifact_name: Name of the artifact. Returns: Output from the _artifact_pull_single function. Source code in cmflib/cmf.py
def artifact_pull_single(pipeline_name: str, filepath: str, artifact_name: str):\n \"\"\" Pulls a single artifact from the initialized repository. \n Example: \n ```python \n result = artifact_pull_single(\"example_pipeline\", \"./mlmd_directory\", \"example_artifact\") \n ```\n Args: \n pipeline_name: Name of the pipeline. \n filepath: Path to store the artifact. \n artifact_name: Name of the artifact. \n Returns:\n Output from the _artifact_pull_single function. \n \"\"\"\n\n # Required arguments: Pipeline_name\n # Optional arguments: filepath( path to store artifacts), artifact_name\n output = _artifact_pull_single(pipeline_name, filepath, artifact_name)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.artifact_push","title":"artifact_push(pipeline_name, filepath='./mlmd')
","text":"Pushes artifacts to the initialized repository.
Example:
result = artifact_push(\"example_pipeline\", \"./mlmd_directory\")\n
Args: pipeline_name: Name of the pipeline. filepath: Path to store the artifact. Returns: Output from the _artifact_push function. Source code in cmflib/cmf.py
def artifact_push(pipeline_name: str, filepath = \"./mlmd\"):\n \"\"\" Pushes artifacts to the initialized repository.\n\n Example:\n ```python\n result = artifact_push(\"example_pipeline\", \"./mlmd_directory\")\n ```\n Args: \n pipeline_name: Name of the pipeline. \n filepath: Path to store the artifact. \n Returns:\n Output from the _artifact_push function.\n \"\"\"\n\n output = _artifact_push(pipeline_name, filepath)\n return output\n
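Taken together, the push helpers above are typically called once a run has finished; a minimal sketch, assuming the functions are importable from cmflib.cmf as the anchors in this reference suggest, with the pipeline name and paths as placeholders:
from cmflib.cmf import metadata_push, artifact_push\n\n# Push the recorded metadata to the cmf-server, then push the versioned artifacts\n# to the configured artifact repository.\nmetadata_push(\"example_pipeline\", \"./mlmd\")\nartifact_push(\"example_pipeline\", \"./mlmd\")\n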
"},{"location":"api/public/cmfquery/","title":"cmflib.cmfquery.CmfQuery","text":" Bases: object
CMF Query communicates with the MLMD database and implements basic search and retrieval functionality.
This class has been designed to work with the CMF framework. CMF alters the names of pipelines, stages and artifacts in various ways, which means that the actual names in the MLMD database will differ from those originally provided by users via the CMF API. When methods in this class accept name
parameters, it is expected that the values of these parameters are the fully-qualified names of the respective entities.
Parameters:
Name Type Description Defaultfilepath
str
Path to the MLMD database file.
'mlmd'
Source code in cmflib/cmfquery.py
def __init__(self, filepath: str = \"mlmd\") -> None:\n config = mlpb.ConnectionConfig()\n config.sqlite.filename_uri = filepath\n self.store = metadata_store.MetadataStore(config)\n
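A short, hedged example of opening a local MLMD file and browsing it with the query methods documented below; the file path and pipeline contents depend on what was previously logged.
from cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\nfor pipeline_name in query.get_pipeline_names():\n    # List the stages recorded for each pipeline.\n    print(pipeline_name, query.get_pipeline_stages(pipeline_name))\n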
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_pipeline_names","title":"get_pipeline_names()
","text":"Return names of all pipelines.
Returns:
Type DescriptionList[str]
List of all pipeline names.
Source code incmflib/cmfquery.py
def get_pipeline_names(self) -> t.List[str]:\n \"\"\"Return names of all pipelines.\n\n Returns:\n List of all pipeline names.\n \"\"\"\n return [ctx.name for ctx in self._get_pipelines()]\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_pipeline_id","title":"get_pipeline_id(pipeline_name)
","text":"Return pipeline identifier for the pipeline names pipeline_name
. Args: pipeline_name: Name of the pipeline. Returns: Pipeline identifier or -1 if one does not exist.
cmflib/cmfquery.py
def get_pipeline_id(self, pipeline_name: str) -> int:\n \"\"\"Return pipeline identifier for the pipeline names `pipeline_name`.\n Args:\n pipeline_name: Name of the pipeline.\n Returns:\n Pipeline identifier or -1 if one does not exist.\n \"\"\"\n pipeline: t.Optional[mlpb.Context] = self._get_pipeline(pipeline_name)\n return -1 if not pipeline else pipeline.id\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_pipeline_stages","title":"get_pipeline_stages(pipeline_name)
","text":"Return list of pipeline stages for the pipeline with the given name.
Parameters:
Name Type Description Defaultpipeline_name
str
Name of the pipeline for which stages need to be returned. In CMF, there are no different pipelines with the same name.
requiredReturns: List of stage names associated with the given pipeline.
Source code incmflib/cmfquery.py
def get_pipeline_stages(self, pipeline_name: str) -> t.List[str]:\n \"\"\"Return list of pipeline stages for the pipeline with the given name.\n\n Args:\n pipeline_name: Name of the pipeline for which stages need to be returned. In CMF, there are no different\n pipelines with the same name.\n Returns:\n List of stage names associated with the given pipeline.\n \"\"\"\n stages = []\n for pipeline in self._get_pipelines(pipeline_name):\n stages.extend(stage.name for stage in self._get_stages(pipeline.id))\n return stages\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_exe_in_stage","title":"get_all_exe_in_stage(stage_name)
","text":"Return list of all executions for the stage with the given name.
Parameters:
Name Type Description Defaultstage_name
str
Name of the stage. Before stages are recorded in MLMD, they are modified (e.g., pipeline name will become part of the stage name). So stage names from different pipelines will not collide.
requiredReturns: List of executions for the given stage.
Source code incmflib/cmfquery.py
def get_all_exe_in_stage(self, stage_name: str) -> t.List[mlpb.Execution]:\n \"\"\"Return list of all executions for the stage with the given name.\n\n Args:\n stage_name: Name of the stage. Before stages are recorded in MLMD, they are modified (e.g., pipeline name\n will become part of the stage name). So stage names from different pipelines will not collide.\n Returns:\n List of executions for the given stage.\n \"\"\"\n for pipeline in self._get_pipelines():\n for stage in self._get_stages(pipeline.id):\n if stage.name == stage_name:\n return self.store.get_executions_by_context(stage.id)\n return []\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_executions_by_ids_list","title":"get_all_executions_by_ids_list(exe_ids)
","text":"Return executions for given execution ids list as a pandas data frame.
Parameters:
Name Type Description Defaultexe_ids
List[int]
List of execution identifiers.
requiredReturns:
Type DescriptionDataFrame
Data frame with all executions for the list of given execution identifiers.
Source code incmflib/cmfquery.py
def get_all_executions_by_ids_list(self, exe_ids: t.List[int]) -> pd.DataFrame:\n \"\"\"Return executions for given execution ids list as a pandas data frame.\n\n Args:\n exe_ids: List of execution identifiers.\n\n Returns:\n Data frame with all executions for the list of given execution identifiers.\n \"\"\"\n\n df = pd.DataFrame()\n executions = self.store.get_executions_by_id(exe_ids)\n for exe in executions:\n d1 = self._transform_to_dataframe(exe)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts_by_context","title":"get_all_artifacts_by_context(pipeline_name)
","text":"Return artifacts for given pipeline name as a pandas data frame.
Parameters:
Name Type Description Defaultpipeline_name
str
Name of the pipeline.
requiredReturns:
Type DescriptionDataFrame
Data frame with all artifacts associated with given pipeline name.
Source code incmflib/cmfquery.py
def get_all_artifacts_by_context(self, pipeline_name: str) -> pd.DataFrame:\n \"\"\"Return artifacts for given pipeline name as a pandas data frame.\n\n Args:\n pipeline_name: Name of the pipeline.\n\n Returns:\n Data frame with all artifacts associated with given pipeline name.\n \"\"\"\n df = pd.DataFrame()\n contexts = self.store.get_contexts_by_type(\"Parent_Context\")\n context_id = self.get_pipeline_id(pipeline_name)\n for ctx in contexts:\n if ctx.id == context_id:\n child_contexts = self.store.get_children_contexts_by_context(ctx.id)\n for cc in child_contexts:\n artifacts = self.store.get_artifacts_by_context(cc.id)\n for art in artifacts:\n d1 = self.get_artifact_df(art)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts_by_ids_list","title":"get_all_artifacts_by_ids_list(artifact_ids)
","text":"Return all artifacts for the given artifact ids list.
Parameters:
Name Type Description Defaultartifact_ids
List[int]
List of artifact identifiers
requiredReturns:
Type DescriptionDataFrame
Data frame with all artifacts for the given artifact ids list.
Source code incmflib/cmfquery.py
def get_all_artifacts_by_ids_list(self, artifact_ids: t.List[int]) -> pd.DataFrame:\n \"\"\"Return all artifacts for the given artifact ids list.\n\n Args:\n artifact_ids: List of artifact identifiers\n\n Returns:\n Data frame with all artifacts for the given artifact ids list.\n \"\"\"\n df = pd.DataFrame()\n artifacts = self.store.get_artifacts_by_id(artifact_ids)\n for art in artifacts:\n d1 = self.get_artifact_df(art)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_executions_in_stage","title":"get_all_executions_in_stage(stage_name)
","text":"Return executions of the given stage as pandas data frame. Args: stage_name: Stage name. See doc strings for the prev method. Returns: Data frame with all executions associated with the given stage.
Source code incmflib/cmfquery.py
def get_all_executions_in_stage(self, stage_name: str) -> pd.DataFrame:\n \"\"\"Return executions of the given stage as pandas data frame.\n Args:\n stage_name: Stage name. See doc strings for the prev method.\n Returns:\n Data frame with all executions associated with the given stage.\n \"\"\"\n df = pd.DataFrame()\n for pipeline in self._get_pipelines():\n for stage in self._get_stages(pipeline.id):\n if stage.name == stage_name:\n for execution in self._get_executions(stage.id):\n ex_as_df: pd.DataFrame = self._transform_to_dataframe(\n execution, {\"id\": execution.id, \"name\": execution.name}\n )\n df = pd.concat([df, ex_as_df], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_artifact_df","title":"get_artifact_df(artifact, d=None)
","text":"Return artifact's data frame representation.
Parameters:
Name Type Description Defaultartifact
Artifact
MLMD entity representing artifact.
requiredd
Optional[Dict]
Optional initial content for data frame.
None
Returns: A data frame with the single row containing attributes of this artifact.
Source code incmflib/cmfquery.py
def get_artifact_df(self, artifact: mlpb.Artifact, d: t.Optional[t.Dict] = None) -> pd.DataFrame:\n \"\"\"Return artifact's data frame representation.\n\n Args:\n artifact: MLMD entity representing artifact.\n d: Optional initial content for data frame.\n Returns:\n A data frame with the single row containing attributes of this artifact.\n \"\"\"\n if d is None:\n d = {}\n d.update(\n {\n \"id\": artifact.id,\n \"type\": self.store.get_artifact_types_by_id([artifact.type_id])[0].name,\n \"uri\": artifact.uri,\n \"name\": artifact.name,\n \"create_time_since_epoch\": artifact.create_time_since_epoch,\n \"last_update_time_since_epoch\": artifact.last_update_time_since_epoch,\n }\n )\n return self._transform_to_dataframe(artifact, d)\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts","title":"get_all_artifacts()
","text":"Return names of all artifacts.
Returns:
Type DescriptionList[str]
List of all artifact names.
Source code incmflib/cmfquery.py
def get_all_artifacts(self) -> t.List[str]:\n \"\"\"Return names of all artifacts.\n\n Returns:\n List of all artifact names.\n \"\"\"\n return [artifact.name for artifact in self.store.get_artifacts()]\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_artifact","title":"get_artifact(name)
","text":"Return artifact's data frame representation using artifact name.
Parameters:
Name Type Description Defaultname
str
Artifact name.
requiredReturns: Pandas data frame with one row containing attributes of this artifact.
Source code incmflib/cmfquery.py
def get_artifact(self, name: str) -> t.Optional[pd.DataFrame]:\n \"\"\"Return artifact's data frame representation using artifact name.\n\n Args:\n name: Artifact name.\n Returns:\n Pandas data frame with one row containing attributes of this artifact.\n \"\"\"\n artifact: t.Optional[mlpb.Artifact] = self._get_artifact(name)\n if artifact:\n return self.get_artifact_df(artifact)\n return None\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts_for_execution","title":"get_all_artifacts_for_execution(execution_id)
","text":"Return input and output artifacts for the given execution.
Parameters:
Name Type Description Defaultexecution_id
int
Execution identifier.
requiredReturn: Data frame containing input and output artifacts for the given execution, one artifact per row.
Source code incmflib/cmfquery.py
def get_all_artifacts_for_execution(self, execution_id: int) -> pd.DataFrame:\n \"\"\"Return input and output artifacts for the given execution.\n\n Args:\n execution_id: Execution identifier.\n Return:\n Data frame containing input and output artifacts for the given execution, one artifact per row.\n \"\"\"\n df = pd.DataFrame()\n for event in self.store.get_events_by_execution_ids([execution_id]):\n event_type = \"INPUT\" if event.type == mlpb.Event.Type.INPUT else \"OUTPUT\"\n for artifact in self.store.get_artifacts_by_id([event.artifact_id]):\n df = pd.concat(\n [df, self.get_artifact_df(artifact, {\"event\": event_type})], sort=True, ignore_index=True\n )\n return df\n
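Chaining this method with the stage queries above, a hedged sketch that lists the artifacts consumed and produced by every execution of a pipeline (reusing the query object from the earlier sketch; the pipeline name is a placeholder) could be:
for stage_name in query.get_pipeline_stages(\"example_pipeline\"):\n    executions = query.get_all_executions_in_stage(stage_name)\n    for _, execution in executions.iterrows():\n        # Fetch the input and output artifacts recorded for this execution.\n        artifacts = query.get_all_artifacts_for_execution(int(execution[\"id\"]))\n        print(stage_name, execution[\"id\"], list(artifacts.get(\"name\", [])))\n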
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifact_types","title":"get_all_artifact_types()
","text":"Return names of all artifact types.
Returns:
Type DescriptionList[str]
List of all artifact types.
Source code incmflib/cmfquery.py
def get_all_artifact_types(self) -> t.List[str]:\n \"\"\"Return names of all artifact types.\n\n Returns:\n List of all artifact types.\n \"\"\"\n artifact_list = self.store.get_artifact_types()\n types=[i.name for i in artifact_list]\n return types\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_executions_for_artifact","title":"get_all_executions_for_artifact(artifact_name)
","text":"Return executions that consumed and produced given artifact.
Parameters:
Name Type Description Defaultartifact_name
str
Artifact name.
requiredReturns: Pandas data frame containing stage executions, one execution per row.
Source code incmflib/cmfquery.py
def get_all_executions_for_artifact(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return executions that consumed and produced given artifact.\n\n Args:\n artifact_name: Artifact name.\n Returns:\n Pandas data frame containing stage executions, one execution per row.\n \"\"\"\n df = pd.DataFrame()\n\n artifact: t.Optional = self._get_artifact(artifact_name)\n if not artifact:\n return df\n\n for event in self.store.get_events_by_artifact_ids([artifact.id]):\n stage_ctx = self.store.get_contexts_by_execution(event.execution_id)[0]\n linked_execution = {\n \"Type\": \"INPUT\" if event.type == mlpb.Event.Type.INPUT else \"OUTPUT\",\n \"execution_id\": event.execution_id,\n \"execution_name\": self.store.get_executions_by_id([event.execution_id])[0].name,\n \"execution_type_name\":self.store.get_executions_by_id([event.execution_id])[0].properties['Execution_type_name'],\n \"stage\": stage_ctx.name,\n \"pipeline\": self.store.get_parent_contexts_by_context(stage_ctx.id)[0].name,\n }\n d1 = pd.DataFrame(\n linked_execution,\n index=[\n 0,\n ],\n )\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_one_hop_child_artifacts","title":"get_one_hop_child_artifacts(artifact_name, pipeline_id=None)
","text":"Get artifacts produced by executions that consume given artifact.
Parameters:
Name Type Description Defaultartifact
name
Name of an artifact.
requiredReturn: Output artifacts of all executions that consumed given artifact.
Source code incmflib/cmfquery.py
def get_one_hop_child_artifacts(self, artifact_name: str, pipeline_id: str = None) -> pd.DataFrame:\n \"\"\"Get artifacts produced by executions that consume given artifact.\n\n Args:\n artifact name: Name of an artifact.\n Return:\n Output artifacts of all executions that consumed given artifact.\n \"\"\"\n artifact: t.Optional = self._get_artifact(artifact_name)\n if not artifact:\n return pd.DataFrame()\n\n # Get output artifacts of executions consumed the above artifact.\n artifacts_ids = self._get_output_artifacts(self._get_executions_by_input_artifact_id(artifact.id,pipeline_id))\n return self._as_pandas_df(\n self.store.get_artifacts_by_id(artifacts_ids), lambda _artifact: self.get_artifact_df(_artifact)\n )\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_child_artifacts","title":"get_all_child_artifacts(artifact_name)
","text":"Return all downstream artifacts starting from the given artifact.
Parameters:
Name Type Description Defaultartifact_name
str
Artifact name.
requiredReturns: Data frame containing all child artifacts.
Source code incmflib/cmfquery.py
def get_all_child_artifacts(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return all downstream artifacts starting from the given artifact.\n\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing all child artifacts.\n \"\"\"\n df = pd.DataFrame()\n d1 = self.get_one_hop_child_artifacts(artifact_name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n for row in d1.itertuples():\n d1 = self.get_all_child_artifacts(row.name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n df = df.drop_duplicates(subset=None, keep=\"first\", inplace=False)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_one_hop_parent_artifacts","title":"get_one_hop_parent_artifacts(artifact_name)
","text":"Return input artifacts for the execution that produced the given artifact. Args: artifact_name: Artifact name. Returns: Data frame containing immediate parent artifactog of given artifact.
Source code incmflib/cmfquery.py
def get_one_hop_parent_artifacts(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return input artifacts for the execution that produced the given artifact.\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing immediate parent artifactog of given artifact.\n \"\"\"\n artifact: t.Optional = self._get_artifact(artifact_name)\n if not artifact:\n return pd.DataFrame()\n\n artifact_ids: t.List[int] = self._get_input_artifacts(self._get_executions_by_output_artifact_id(artifact.id))\n\n return self._as_pandas_df(\n self.store.get_artifacts_by_id(artifact_ids), lambda _artifact: self.get_artifact_df(_artifact)\n )\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_parent_artifacts","title":"get_all_parent_artifacts(artifact_name)
","text":"Return all upstream artifacts. Args: artifact_name: Artifact name. Returns: Data frame containing all parent artifacts.
Source code incmflib/cmfquery.py
def get_all_parent_artifacts(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return all upstream artifacts.\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing all parent artifacts.\n \"\"\"\n df = pd.DataFrame()\n d1 = self.get_one_hop_parent_artifacts(artifact_name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n for row in d1.itertuples():\n d1 = self.get_all_parent_artifacts(row.name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n df = df.drop_duplicates(subset=None, keep=\"first\", inplace=False)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_parent_executions","title":"get_all_parent_executions(artifact_name)
","text":"Return all executions that produced upstream artifacts for the given artifact. Args: artifact_name: Artifact name. Returns: Data frame containing all parent executions.
Source code incmflib/cmfquery.py
def get_all_parent_executions(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return all executions that produced upstream artifacts for the given artifact.\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing all parent executions.\n \"\"\"\n parent_artifacts: pd.DataFrame = self.get_all_parent_artifacts(artifact_name)\n if parent_artifacts.shape[0] == 0:\n # If it's empty, there's no `id` column and the code below raises an exception.\n return pd.DataFrame()\n\n execution_ids = set(\n event.execution_id\n for event in self.store.get_events_by_artifact_ids(parent_artifacts.id.values.tolist())\n if event.type == mlpb.Event.OUTPUT\n )\n\n return self._as_pandas_df(\n self.store.get_executions_by_id(execution_ids),\n lambda _exec: self._transform_to_dataframe(_exec, {\"id\": _exec.id, \"name\": _exec.name}),\n )\n
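As a hedged illustration of these provenance queries, assuming the fully-qualified name of a model artifact (path plus content hash, as CMF stores it); the name below is hypothetical:
model_name = \"artifacts/model.pkl:bf629ccd5cd008066b72c04f9a918737\"  # hypothetical fully-qualified artifact name\nparents = query.get_one_hop_parent_artifacts(model_name)\nlineage = query.get_all_parent_artifacts(model_name)\nproducers = query.get_all_parent_executions(model_name)\n# Each call returns a pandas DataFrame; len() gives the number of rows.\nprint(len(parents), len(lineage), len(producers))\n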
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_metrics","title":"get_metrics(metrics_name)
","text":"Return metric data frame. Args: metrics_name: Metrics name. Returns: Data frame containing all metrics.
Source code incmflib/cmfquery.py
def get_metrics(self, metrics_name: str) -> t.Optional[pd.DataFrame]:\n \"\"\"Return metric data frame.\n Args:\n metrics_name: Metrics name.\n Returns:\n Data frame containing all metrics.\n \"\"\"\n for metric in self.store.get_artifacts_by_type(\"Step_Metrics\"):\n if metric.name == metrics_name:\n name: t.Optional[str] = metric.custom_properties.get(\"Name\", None)\n if name:\n return pd.read_parquet(name)\n break\n return None\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.dumptojson","title":"dumptojson(pipeline_name, exec_id=None)
","text":"Return JSON-parsable string containing details about the given pipeline. Args: pipeline_name: Name of an AI pipelines. exec_id: Optional stage execution ID - filter stages by this execution ID. Returns: Pipeline in JSON format.
Source code incmflib/cmfquery.py
def dumptojson(self, pipeline_name: str, exec_id: t.Optional[int] = None) -> t.Optional[str]:\n \"\"\"Return JSON-parsable string containing details about the given pipeline.\n Args:\n pipeline_name: Name of an AI pipelines.\n exec_id: Optional stage execution ID - filter stages by this execution ID.\n Returns:\n Pipeline in JSON format.\n \"\"\"\n if exec_id is not None:\n exec_id = int(exec_id)\n\n def _get_node_attributes(_node: t.Union[mlpb.Context, mlpb.Execution, mlpb.Event], _attrs: t.Dict) -> t.Dict:\n for attr in CONTEXT_LIST:\n #Artifacts getattr call on Type was giving empty string, which was overwriting \n # the defined types such as Dataset, Metrics, Models\n if getattr(_node, attr, None) is not None and not getattr(_node, attr, None) == \"\":\n _attrs[attr] = getattr(_node, attr)\n\n if \"properties\" in _attrs:\n _attrs[\"properties\"] = CmfQuery._copy(_attrs[\"properties\"])\n if \"custom_properties\" in _attrs:\n # TODO: (sergey) why do we need to rename \"type\" to \"user_type\" if we just copy into a new dictionary?\n _attrs[\"custom_properties\"] = CmfQuery._copy(\n _attrs[\"custom_properties\"], key_mapper={\"type\": \"user_type\"}\n )\n return _attrs\n\n pipelines: t.List[t.Dict] = []\n for pipeline in self._get_pipelines(pipeline_name):\n pipeline_attrs = _get_node_attributes(pipeline, {\"stages\": []})\n for stage in self._get_stages(pipeline.id):\n stage_attrs = _get_node_attributes(stage, {\"executions\": []})\n for execution in self._get_executions(stage.id, execution_id=exec_id):\n # name will be an empty string for executions that are created with\n # create new execution as true(default)\n # In other words name property will there only for execution\n # that are created with create new execution flag set to false(special case)\n exec_attrs = _get_node_attributes(\n execution,\n {\n \"type\": self.store.get_execution_types_by_id([execution.type_id])[0].name,\n \"name\": execution.name if execution.name != \"\" else \"\",\n \"events\": [],\n },\n )\n for event in self.store.get_events_by_execution_ids([execution.id]):\n event_attrs = _get_node_attributes(event, {})\n # An event has only a single Artifact associated with it. \n # For every artifact we create an event to link it to the execution.\n\n artifacts = self.store.get_artifacts_by_id([event.artifact_id])\n artifact_attrs = _get_node_attributes(\n artifacts[0], {\"type\": self.store.get_artifact_types_by_id([artifacts[0].type_id])[0].name}\n )\n event_attrs[\"artifact\"] = artifact_attrs\n exec_attrs[\"events\"].append(event_attrs)\n stage_attrs[\"executions\"].append(exec_attrs)\n pipeline_attrs[\"stages\"].append(stage_attrs)\n pipelines.append(pipeline_attrs)\n\n return json.dumps({\"Pipeline\": pipelines})\n
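Since dumptojson returns a JSON string, it can be parsed directly; a minimal sketch (the pipeline name is a placeholder) is:
import json\n\npipeline_json = query.dumptojson(\"example_pipeline\")\ndocument = json.loads(pipeline_json)\nfor pipeline in document[\"Pipeline\"]:\n    for stage in pipeline[\"stages\"]:\n        # Each stage carries its recorded executions and their events.\n        print(stage.get(\"name\"), len(stage[\"executions\"]))\n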
"},{"location":"api/public/dataslice/","title":"cmflib.cmf.Cmf.DataSlice","text":"A data slice represents a named subset of data. It can be used to track performance of an ML model on different slices of the training or testing dataset splits. This can be useful from different perspectives, for instance, to mitigate model bias.
Instances of data slices are not meant to be created manually by users. Instead, use Cmf.create_dataslice method.
Source code incmflib/cmf.py
def __init__(self, name: str, writer):\n self.props = {}\n self.name = name\n self.writer = writer\n
"},{"location":"api/public/dataslice/#cmflib.cmf.Cmf.DataSlice.add_data","title":"add_data(path, custom_properties=None)
","text":"Add data to create the dataslice. Currently supported only for file abstractions. Pre-condition - the parent folder, containing the file should already be versioned. Example:
dataslice.add_data(f\"data/raw_data/{j}.xml)\n
Args: path: Name to identify the file to be added to the dataslice. custom_properties: Properties associated with this datum. Source code in cmflib/cmf.py
def add_data(\n self, path: str, custom_properties: t.Optional[t.Dict] = None\n) -> None:\n \"\"\"Add data to create the dataslice.\n Currently supported only for file abstractions. Pre-condition - the parent folder, containing the file\n should already be versioned.\n Example:\n ```python\n dataslice.add_data(f\"data/raw_data/{j}.xml)\n ```\n Args:\n path: Name to identify the file to be added to the dataslice.\n custom_properties: Properties associated with this datum.\n \"\"\"\n\n self.props[path] = {}\n self.props[path]['hash'] = dvc_get_hash(path)\n parent_path = path.rsplit(\"/\", 1)[0]\n self.data_parent = parent_path.rsplit(\"/\", 1)[1]\n if custom_properties:\n for k, v in custom_properties.items():\n self.props[path][k] = v\n
"},{"location":"api/public/dataslice/#cmflib.cmf.Cmf.DataSlice.commit","title":"commit(custom_properties=None)
","text":"Commit the dataslice. The created dataslice is versioned and added to underneath data versioning software. Example:
dataslice.commit()\n
Args: custom_properties: Dictionary to store key-value pairs associated with the dataslice, for example {"mean": 2.5, "median": 2.6}.
Source code incmflib/cmf.py
def commit(self, custom_properties: t.Optional[t.Dict] = None) -> None:\n \"\"\"Commit the dataslice.\n The created dataslice is versioned and added to underneath data versioning software.\n Example:\n\n dataslice.commit()\n ```\n Args:\n custom_properties: Dictionary to store key value pairs associated with Dataslice\n Example{\"mean\":2.5, \"median\":2.6}\n \"\"\"\n\n logging_dir = change_dir(self.writer.cmf_init_path)\n # code for nano cmf\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.writer.child_context:\n self.writer.create_context(pipeline_stage=name_without_extension)\n assert self.writer.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.writer.execution:\n self.writer.create_execution(execution_type=name_without_extension)\n assert self.writer.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n directory_path = os.path.join(self.writer.ARTIFACTS_PATH, self.writer.execution.properties[\"Execution_uuid\"].string_value.split(',')[0], self.writer.DATASLICE_PATH)\n os.makedirs(directory_path, exist_ok=True)\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = git_get_repo()\n dataslice_df = pd.DataFrame.from_dict(self.props, orient=\"index\")\n dataslice_df.index.names = [\"Path\"]\n dataslice_path = os.path.join(directory_path,self.name)\n dataslice_df.to_parquet(dataslice_path)\n existing_artifact = []\n\n commit_output(dataslice_path, self.writer.execution.id)\n c_hash = dvc_get_hash(dataslice_path)\n if c_hash == \"\":\n print(\"Error in getting the dvc hash,return without logging\")\n return\n\n dataslice_commit = c_hash\n url = dvc_get_url(dataslice_path)\n dvc_url_with_pipeline = f\"{self.writer.parent_context.name}:{url}\"\n if c_hash and c_hash.strip():\n existing_artifact.extend(\n self.writer.store.get_artifacts_by_uri(c_hash))\n if existing_artifact and len(existing_artifact) != 0:\n print(\"Adding to existing data slice\")\n # Haven't added event type in this if cond, is it not needed??\n slice = link_execution_to_input_artifact(\n store=self.writer.store,\n execution_id=self.writer.execution.id,\n uri=c_hash,\n input_name=dataslice_path + \":\" + c_hash,\n )\n else:\n props={\n \"git_repo\": str(git_repo),\n # passing c_hash value to commit\n \"Commit\": str(dataslice_commit),\n \"url\": str(dvc_url_with_pipeline),\n },\n slice = create_new_artifact_event_and_attribution(\n store=self.writer.store,\n execution_id=self.writer.execution.id,\n context_id=self.writer.child_context.id,\n uri=c_hash,\n name=dataslice_path + \":\" + c_hash,\n type_name=\"Dataslice\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\n \"git_repo\": str(git_repo),\n # passing c_hash value to commit\n \"Commit\": str(dataslice_commit),\n \"url\": str(dvc_url_with_pipeline),\n },\n artifact_type_properties={\n \"git_repo\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.writer.graph:\n self.writer.driver.create_dataslice_node(\n self.name, dataslice_path + \":\" + c_hash, c_hash, self.data_parent, props\n )\n os.chdir(logging_dir)\n return slice\n
"},{"location":"architecture/advantages/","title":"Advantages","text":"Common metadata framework has the following components:
The APIs and the abstractions provided by the library enable tracking of pipeline metadata: the stages in the pipeline, the input and output artifacts at each stage, and metrics. The framework allows metrics to be tracked at both coarse and fine granularity. A stage metric can be captured at the end of a stage, while fine-grained metrics can be tracked per step (epoch) or at regular intervals during the execution of the stage.
The metadata logged through the APIs is written to a backend relational database. The library also provides APIs to query the metadata stored in the relational database so that users can inspect pipelines.
In addition to explicit tracking through the APIs, the library also provides implicit tracking. Implicit tracking automatically records the software version used in the pipelines. Function arguments and function return values can be tracked automatically by adding the metadata tracker class decorators on the functions.
Before writing the metadata to relational database, the metadata operations are journaled in the metadata journal log. This enables the framework to transfer the local metadata to the central server.
All artifacts are versioned with a data versioning framework (e.g., DVC). The content hash of each artifact is generated and stored along with the user-provided metadata. A special artifact metadata file, the \u201c.dvc\u201d file, is created for every artifact (file or folder) added to the data version management system. The .dvc file contains the content hash of the artifact.
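For illustration, a .dvc metadata file is a small YAML document whose outs entry records the content hash of the tracked artifact. The sketch below reads such a file; the path is hypothetical and the exact field layout is representative and may differ across DVC versions.
import yaml # provided by the PyYAML package\n\n# Hypothetical path; a <artifact>.dvc file is created when the artifact is added to DVC\nwith open(\"artifacts/model.pkl.dvc\") as f:\n meta = yaml.safe_load(f)\n\n# A representative .dvc file looks roughly like:\n# outs:\n# - md5: 3de41915f6f55f...\n#   path: model.pkl\nprint(meta[\"outs\"][0][\"md5\"]) # content hash of the artifact\n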
For every new execution, the metadata tracker creates a new branch to track the code. The special metadata file created for each artifact, the \u201c.dvc\u201d file, is also committed to Git and its commit id is tracked as metadata. The artifacts are therefore versioned through the versioning of their metadata files. Whenever an artifact changes, the metadata file is modified to reflect its current content hash, and the file is tracked as a new version of the metadata file.
The metadata tracker automatically records the start commit when the library is initialized and creates a separate commit for each change to an artifact during the experiment. This helps track the transformations of the artifacts across the different stages of the pipeline.
"},{"location":"architecture/components/#local-client","title":"Local Client","text":"The metadata client interacts with the metadata server. It communicates with the server, for synchronization of metadata.
After the experiment is completed, the user invokes the \u201cCmf push\u201d command to push the collected metadata to the remote. This transfers the existing metadata journal to the server.
The metadata from the central repository can be pulled to the local repository, either using the artifacts or using the project as the identifier or both.
When artifact is used as the identifier, all metadata associated with the artifacts currently present in the branch of the cloned Git repository is pulled from the central repository to the local repository. The pulled metadata consist of not only the immediate metadata associated with the artifacts, it contains the metadata of all the artifacts in its chain of lineage.
When project is used as the identifier, all the metadata associated with the current branch of the pipeline code that is checked out is pulled to the local repository.
"},{"location":"architecture/components/#central-server","title":"Central Server","text":"The central server, exposes REST API\u2019s that can be called from the remote clients. This can help in situations where the connectivity between the core datacenter and the remote client is robust. The remote client calls the API\u2019s exposed by the central server to log the metadata directly to the central metadata repository.
Where the connectivity with the central server is intermittent, the remote clients log the metadata to the local repository. The journaled metadata is pushed by the remote client to the central server. The central server, will replay the journal and merge the incoming metadata with the metadata already existing in the central repository. The ability to accurately identify the artifacts anywhere using their content hash, makes this merge robust.
"},{"location":"architecture/components/#central-repositories","title":"Central Repositories","text":"The common metadata framework consist of three central repositories for the code, data and metadata.
"},{"location":"architecture/components/#central-metadata-repository","title":"Central Metadata repository","text":"Central metadata repository holds the metadata pushed from the distributed sites. It holds metadata about all the different pipelines that was tracked using the common metadata tracker. The consolidated view of the metadata stored in the central repository, helps the users to learn across various stages in the pipeline executed at different locations. Using the query layer that is pointed to the central repository, the users gets the global view of the metadata which provides them with a deeper understanding of the pipelines and its metadata. The metadata helps to understand nonobvious results like performance of a dataset with respect to other datasets, Performance of a particular pipeline with respect to other pipelines etc.
"},{"location":"architecture/components/#central-artifact-storage-repository","title":"Central Artifact storage repository","text":"Central Artifact storage repository stores all the artifacts related to experiment. The data versioning framework (DVC) stores the artifacts in a content addressable layout. The artifacts are stored inside the folder with name as the first two characters of the content hash and the name of the artifact as the remaining part of the content hash. This helps in efficient retrieval of the artifacts.
"},{"location":"architecture/components/#git-repository","title":"Git Repository","text":"Git repository is used to track the code. Along with the code, the metadata file of the artifacts which contain the content hash of the artifacts are also stored in GIT. The Data versioning framework (dvc) would use these files to retrieve the artifacts from the artifact storage repository.
"},{"location":"architecture/overview/","title":"Architecture Overview","text":"Interactions in data pipelines can be complex. The Different stages in the pipeline, (which may not be next to each other) may have to interact to produce or transform artifacts. As the artifacts navigates and undergo transformations through this pipeline, it can take a complicated path, which might also involve bidirectional movement across these stages. Also, there could be dependencies between the multiple stages, where the metrics produced by a stage could influence the metrics at a subsequent stage. It is important to track the metadata across a pipeline to provide features like, lineage tracking, provenance and reproducibility.
Tracking metadata through these complex pipelines has multiple challenges, some of them being:
The Common Metadata Framework (CMF) addresses the problems associated with tracking pipeline metadata from distributed sites and tracks code, data and metadata together for end-to-end traceability.
The framework automatically tracks the code version as part of the metadata for an execution. Additionally, the data artifacts are also versioned automatically using a data versioning framework (such as DVC), and the metadata regarding the data version is stored along with the code. The framework stores the Git commit id of the metadata file associated with the artifact and the content hash of the artifact as metadata. The framework provides APIs to track the hyperparameters and other metadata of pipelines. Therefore, from the stored metadata, users can zero in on the hyperparameters, code version and artifact version used for the experiment.
Identifying artifacts by their content hash allows the framework to uniquely identify an artifact anywhere across the distributed sites. This enables the metadata from the distributed sites to be precisely merged into a central repository, thereby providing a single global view of the metadata from the distributed sites.
On this backbone, we build a Git-like experience for metadata: users can push their local metadata to the remote repository, where it is merged to create the global metadata, and pull metadata from the global repository to create a local view that contains only the metadata of interest.
The framework can be used to track various types of pipelines such as data pipelines or AI pipelines.
"},{"location":"cmf_client/Getting%20Started%20with%20cmf/","title":"Getting started with cmf","text":"Common metadata framework (cmf) has the following components:
cmf-client is a tool that facilitates metadata collaboration between different teams or two team members. It allows users to pull or push metadata from or to the cmf-server.
Follow the below-mentioned steps for the end-to-end setup of cmf-client:-
Pre-Requisites
Install cmf library i.e. cmflib
pip install git+https://github.com/HewlettPackard/cmf\n
OR pip install cmflib\n
Check here for more details."},{"location":"cmf_client/Getting%20Started%20with%20cmf/#install-cmf-server","title":"Install cmf-server","text":"cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
Follow here for details on how to setup a cmf-server.
"},{"location":"cmf_client/Getting%20Started%20with%20cmf/#how-to-effectively-use-cmf-client","title":"How to effectively use cmf-client?","text":"Let's assume we are tracking the metadata for a pipeline named Test-env
with minio S3 bucket as the artifact repository and a cmf-server.
Create a folder
mkdir example-folder\n
Initialize cmf
CMF initialization is the first step required to use the cmf-client commands. This single command completes the whole initialization process, making cmf-client user friendly. Execute cmf init
in the example-folder
directory created in the above step.
cmf init minioS3 --url s3://bucket-name --endpoint-url http://localhost:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://X.X.X.X:7687\n
Check here for more details. Check status of CMF initialization (Optional)
cmf init show\n
Check here for more details. Track metadata using cmflib
Use Sample projects as a reference to create a new project to track metadata for ML pipelines.
More info is available here.
Push artifacts
Push artifacts in the artifact repo initialised in the Initialize cmf step.
cmf artifact push -p 'Test-env'\n
Check here for more details. Push metadata to cmf-server
cmf metadata push -p 'Test-env'\n
Check here for more details."},{"location":"cmf_client/Getting%20Started%20with%20cmf/#cmf-client-with-collaborative-development","title":"cmf-client with collaborative development","text":"In the case of collaborative development, in addition to the above commands, users can follow the commands below to pull metadata and artifacts from a common cmf server and a central artifact repository.
Pull metadata from the server
Execute cmf metadata
command in the example-folder
.
cmf metadata pull -p 'Test-env'\n
Check here for more details. Pull artifacts from the central artifact repo
Execute cmf artifact
command in the example-folder
.
cmf artifact pull -p \"Test-env\"\n
Check here for more details."},{"location":"cmf_client/Getting%20Started%20with%20cmf/#flow-chart-for-cmf","title":"Flow Chart for cmf","text":""},{"location":"cmf_client/cmf_client/","title":"Getting started with cmf-client commands","text":""},{"location":"cmf_client/cmf_client/#cmf-init","title":"cmf init","text":"Usage: cmf init [-h] {minioS3,amazonS3,local,sshremote,osdfremote,show}\n
cmf init
initializes an artifact repository for cmf. Local directory, Minio S3 bucket, Amazon S3 bucket, SSH Remote and Remote OSDF directory are the options available. Additionally, user can provide cmf-server url."},{"location":"cmf_client/cmf_client/#cmf-init-show","title":"cmf init show","text":"Usage: cmf init show\n
cmf init show
displays current cmf configuration."},{"location":"cmf_client/cmf_client/#cmf-init-minios3","title":"cmf init minioS3","text":"Usage: cmf init minioS3 [-h] --url [url] \n --endpoint-url [endpoint_url]\n --access-key-id [access_key_id] \n --secret-key [secret_key] \n --git-remote-url[git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init minioS3
configures Minio S3 bucket as a cmf artifact repository. Refer minio-server.md to set up a minio server. cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for MinIOS3 accordingly.
Required Arguments
--url [url] Specify bucket url.\n --endpoint-url [endpoint_url] Specify the endpoint url of minio UI.\n --access-key-id [access_key_id] Specify Access Key Id.\n --secret-key [secret_key] Specify Secret Key.\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-local","title":"cmf init local","text":"Usage: cmf init local [-h] --path [path] -\n --git-remote-url [git_remote_url]\n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init local
initialises local directory as a cmf artifact repository. cmf init local --path /home/XXXX/local-storage --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Replace 'XXXX' with your system username in the following path: /home/XXXX/local-storage
Required Arguments
--path [path] Specify local directory path.\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-amazons3","title":"cmf init amazonS3","text":"Before setting up, obtain AWS temporary security credentials using the AWS Security Token Service (STS). These credentials are short-term and can last from minutes to hours. They are dynamically generated and provided to trusted users upon request, and expire after use. Users with appropriate permissions can request new credentials before or upon expiration. For further information, refer to the Temporary security credentials in IAM page.
To retrieve temporary security credentials using multi-factor authentication (MFA) for an IAM user, you can use the below command.
aws sts get-session-token --duration-seconds <duration> --serial-number <MFA_device_serial_number> --token-code <MFA_token_code>\n
Required Arguments --serial-number Specifies the serial number of the MFA device associated with the IAM user.\n --token-code Specifies the one-time code generated by the MFA device.\n
Optional Arguments
--duration-seconds Specifies the duration for which the temporary credentials will be valid, in seconds.\n
Example
aws sts get-session-token --duration-seconds 3600 --serial-number arn:aws:iam::123456789012:mfa/user --token-code 123456\n
This will return output like
{\n \"Credentials\": {\n \"AccessKeyId\": \"ABCDEFGHIJKLMNO123456\",\n \"SecretAccessKey\": \"PQRSTUVWXYZ789101112131415\",\n \"SessionToken\": \"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijlmnopqrstuvwxyz12345678910ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijlmnopqrstuvwxyz12345678910ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijlmnopqrstuvwxyz12345678910\",\n \"Expiration\": \"2021-05-10T15:31:08+00:00\"\n }\n}\n
Initialization of amazonS3 Usage: cmf init amazonS3 [-h] --url [url] \n --access-key-id [access_key_id]\n --secret-key [secret_key]\n --session-token [session_token]\n --git-remote-url [git_remote_url]\n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init amazonS3
initialises Amazon S3 bucket as a CMF artifact repository. cmf init amazonS3 --url s3://bucket-name --access-key-id XXXXXXXXXXXXX --secret-key XXXXXXXXXXXXX --session-token XXXXXXXXXXXXX --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687 \n
Here, use the --access-key-id, --secret-key and --session-token generated from the aws sts
command which is mentioned above.
The bucket-name must exist within Amazon S3 before executing the cmf artifact push
command.
Required Arguments
--url [url] Specify bucket url.\n --access-key-id [access_key_id] Specify Access Key Id.\n --secret-key [secret_key] Specify Secret Key.\n --session-token Specify session token. (default: )\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-sshremote","title":"cmf init sshremote","text":"Usage: cmf init sshremote [-h] --path [path] \n --user [user]\n --port [port]\n --password [password] \n --git-remote-url [git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init sshremote
command initialises remote ssh directory as a cmf artifact repository. cmf init sshremote --path ssh://127.0.0.1/home/user/ssh-storage --user XXXXX --port 22 --password example@123 --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Required Arguments --path [path] Specify ssh directory path.\n --user [user] Specify user.\n --port [port] Specify Port.\n --password [password] Specify password.\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-osdfremote","title":"cmf init osdfremote","text":"Usage: cmf init osdfremote [-h] --path [path] \n --key-id [key_id]\n --key-path [key_path] \n --key-issuer [key_issuer] \n --git-remote-url[git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init osdfremote
configures an OSDF Origin as a cmf artifact repository. cmf init osdfremote --path https://[Some Origin]:8443/nrp/fdp/ --key-id c2a5 --key-path ~/.ssh/fdp.pem --key-issuer https://[Token Issuer] --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Required Arguments --path [path] Specify FQDN for OSDF origin including including port and directory path\n --key-id [key_id] Specify key_id for provided private key. eg. b2d3\n --key-path [key_path] Specify path for private key on local filesystem. eg. ~/.ssh/XXX.pem\n --key-issuer [key_issuer] Specify URL for Key Issuer. eg. https://t.nationalresearchplatform.org/XXX\n --git-remote-url [git_remote_url] Specify git repo url. eg: https://github.com/XXX/example.git\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-artifact","title":"cmf artifact","text":"Usage: cmf artifact [-h] {pull,push}\n
cmf artifact
pull or push artifacts from or to the user configured artifact repository, respectively."},{"location":"cmf_client/cmf_client/#cmf-artifact-pull","title":"cmf artifact pull","text":"Usage: cmf artifact pull [-h] -p [pipeline_name] -f [file_name] -a [artifact_name]\n
cmf artifact pull
command pulls artifacts from the user-configured repository to the user's local machine. cmf artifact pull -p 'pipeline-name' -f '/path/to/mlmd-file-name' -a 'artifact-name'\n
Required Arguments -p [pipeline_name], --pipeline-name [pipeline_name] Specify Pipeline name.\n
Optional Arguments -h, --help show this help message and exit\n -a [artifact_name], --artifact_name [artifact_name] Specify artifact name only; don't use folder name or absolute path.\n -f [file_name],--file-name [file_name] Specify mlmd file name.\n
"},{"location":"cmf_client/cmf_client/#cmf-artifact-push","title":"cmf artifact push","text":"Usage: cmf artifact push [-h] -p [pipeline_name] -f [file_name]\n
cmf artifact push
command pushes artifacts from the user's local machine to the user-configured artifact repository. cmf artifact push -p 'pipeline_name' -f '/path/to/mlmd-file-name'\n
Required Arguments -p [pipeline_name], --pipeline-name [pipeline_name] Specify Pipeline name.\n
Optional Arguments -h, --help show this help message and exit.\n -f [file_name],--file-name [file_name] Specify mlmd file name.\n
"},{"location":"cmf_client/cmf_client/#cmf-metadata","title":"cmf metadata","text":"Usage: cmf metadata [-h] {pull,push,export}\n
cmf metadata
push, pull or export the metadata file to and from the cmf-server, respectively."},{"location":"cmf_client/cmf_client/#cmf-metadata-pull","title":"cmf metadata pull","text":"Usage: cmf metadata pull [-h] -p [pipeline_name] -f [file_name] -e [exec_id]\n
cmf metadata pull
command pulls the metadata file from the cmf-server to the user's local machine. cmf metadata pull -p 'pipeline-name' -f '/path/to/mlmd-file-name' -e 'execution_id'\n
Required Arguments -p [pipeline_name], --pipeline_name [pipeline_name] Specify Pipeline name.\n
Optional Arguments -h, --help show this help message and exit\n-e [exec_id], --execution [exec_id] Specify execution id\n-f [file_name], --file_name [file_name] Specify mlmd file name with full path(either relative or absolute).\n
"},{"location":"cmf_client/cmf_client/#cmf-metadata-push","title":"cmf metadata push","text":"Usage: cmf metadata push [-h] -p [pipeline_name] -f [file_name] -e [exec_id] -t [tensorboard]\n
cmf metadata push
command pushes the metadata file from the local machine to the cmf-server. cmf metadata push -p 'pipeline-name' -f '/path/to/mlmd-file-name' -e 'execution_id' -t '/path/to/tensorboard-log'\n
Required Arguments -p [pipeline_name], --pipeline_name [pipeline_name] Specify Pipeline name.\n
Optional Arguments
-h, --help show this help message and exit\n -f [file_name], --file_name [file_name] Specify mlmd file name.\n -e [exec_id], --execution [exec_id] Specify execution id.\n -t [tensorboard], --tensorboard [tensorboard] Specify path to tensorboard logs for the pipeline.\n
"},{"location":"cmf_client/cmf_client/#cmf-metadata-export","title":"cmf metadata export","text":"Usage: cmf metadata export [-h] -p [pipeline_name] -j [json_file_name] -f [file_name]\n
cmf metadata export
exports the local mlmd's metadata in JSON format to a JSON file. cmf metadata export -p 'pipeline-name' -j '/path/to/json-file-name' -f '/path/to/mlmd-file-name'\n
Required Arguments -p [pipeline_name], --pipeline_name [pipeline_name] Specify Pipeline name.\n
Optional Arguments
-h, --help show this help message and exit\n -f [file_name], --file_name [file_name] Specify mlmd file name.\n -j [json_file_name], --json_file_name [json_file_name] Specify json file name with full path.\n
"},{"location":"cmf_client/minio-server/","title":"MinIO S3 Artifact Repo Setup","text":""},{"location":"cmf_client/minio-server/#steps-to-set-up-a-minio-server","title":"Steps to set up a MinIO server","text":"Object storage is an abstraction layer above the file system and helps to work with data using API. MinIO is the fastest way to start working with object storage. It is compatible with S3, easy to deploy, manage locally, and upscale if needed.
Follow the below mentioned steps to set up a MinIO server:
Copy contents of the example-get-started
directory to a separate directory outside the cmf repository.
Check whether cmf is initialized.
cmf init show\n
If cmf is not initialized, the following message will appear on the screen. 'cmf' is not configured.\nExecute the 'cmf init' command.\n
Execute the following command to initialize the MinIO S3 bucket as a CMF artifact repository.
cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for minioS3 accordingly.
Execute cmf init show
to check the CMF configuration. The sample output looks as follows:
remote.minio.url=s3://bucket-name\nremote.minio.endpointurl=http://localhost:9000\nremote.minio.access_key_id=minioadmin\nremote.minio.secret_access_key=minioadmin\ncore.remote=minio\n
Build a MinIO server using a Docker container. docker-compose.yml
available in example-get-started
directory provides two services: minio
and aws-cli
. User will initialise the repository with bucket name, storage URL, and credentials to access MinIO.
MYIP= XX.XX.XXX.XXX docker-compose up\n
or MYIP= XX.XX.XXX.XXX docker compose up\n
After executing the above command, the following messages confirm that MinIO is up and running. Also, you can adjust $MYIP
in examples/example-get-started/docker-compose.yml
to reflect the server IP and run the docker compose
command without specifying MYIP.
Login into remote.minio.endpointurl
(in the above example - http://localhost:9000) using access-key and secret-key mentioned in cmf configuration.
Following image is an example snapshot of the MinIO server with bucket named 'dvc-art'.
SSH (Secure Shell) remote storage refers to using the SSH protocol to securely access and manage files and data on a remote server or storage system over a network. SSH is a cryptographic network protocol that allows secure communication and data transfer between a local computer and a remote server.
Proceed with the following steps to set up a SSH Remote Repository:
project directory
with SSH repo. Check whether cmf is initialized in your project directory with following command.
cmf init show\n
If cmf is not initialized, the following message will appear on the screen. 'cmf' is not configured.\nExecute the 'cmf init' command.\n
Execute the following command to initialize the SSH remote storage as a CMF artifact repository.
cmf init sshremote --path ssh://127.0.0.1/home/user/ssh-storage --user XXXXX --port 22 --password example@123 --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 \n
> When running cmf init sshremote
, please ensure that the specified IP address has the necessary permissions to allow access using the specified user ('XXXX'). If the IP address or user lacks the required permissions, the command will fail. Execute cmf init show
to check the CMF configuration.
/etc/ssh/sshd_config file
. This configuration file serves as the primary starting point for diagnosing and resolving SSH permission-related challenges.Common metadata framework (cmf) has the following components:
Before proceeding, ensure that the CMF library is installed on your system. If not, follow the installation instructions provided inside the CMF in a nutshell page.
"},{"location":"cmf_client/step-by-step/#install-cmf-server","title":"Install cmf-server","text":"cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
Follow the instructions on the Getting started with cmf-server page for details on how to setup a cmf-server.
"},{"location":"cmf_client/step-by-step/#setup-a-cmf-client","title":"Setup a cmf-client","text":"cmf-client is a tool that facilitates metadata collaboration between different teams or two team members. It allows users to pull or push metadata from or to the cmf-server.
Follow the below-mentioned steps for the end-to-end setup of cmf-client:-
Configuration
mkdir <workdir>
cmf init
to configure dvc remote directory, git remote url, cmf server and neo4j. Follow the Overview page for more details.Let's assume we are tracking the metadata for a pipeline named Test-env
with minio S3 bucket as the artifact repository and a cmf-server.
Create a folder
mkdir example-folder\n
Initialize cmf
CMF initialization is the first step required to use the cmf-client commands. This single command completes the whole initialization process, making cmf-client user friendly. Execute cmf init
in the example-folder
directory created in the above step.
cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for minioS3 accordingly.
Check Overview page for more details.
Check status of CMF initialization (Optional)
cmf init show\n
Check Overview page for more details. Track metadata using cmflib
Use Sample projects as a reference to create a new project to track metadata for ML pipelines.
More information is available inside Getting Started.
Before pushing artifacts or metadata, ensure that the cmf server and minioS3 are up and running.
Push artifacts
Push artifacts in the artifact repo initialised in the Initialize cmf step.
cmf artifact push -p 'Test-env'\n
Check Overview page for more details. Push metadata to cmf-server
cmf metadata push -p 'Test-env'\n
Check Overview page for more details."},{"location":"cmf_client/step-by-step/#cmf-client-with-collaborative-development","title":"cmf-client with collaborative development","text":"In the case of collaborative development, in addition to the above commands, users can follow the commands below to pull metadata and artifacts from a common cmf server and a central artifact repository.
Pull metadata from the server
Execute cmf metadata
command in the example-folder
.
cmf metadata pull -p 'Test-env'\n
Check Overview page for more details. Pull artifacts from the central artifact repo
Execute cmf artifact
command in the example-folder
.
cmf artifact pull -p 'Test-env'\n
Check Overview page for more details."},{"location":"cmf_client/step-by-step/#flow-chart-for-cmf","title":"Flow Chart for cmf","text":""},{"location":"cmf_client/tensorflow_guide/","title":"How to Use TensorBoard with CMF","text":"Copy the contents of the 'example-get-started' directory from cmf/examples/example-get-started
into a separate directory outside cmf repository.
Execute the following command to install the TensorFlow library in the current directory:
pip install tensorflow\n
Create a new Python file (e.g., tensorflow_log.py
) and copy the following code:
import datetime\n import tensorflow as tf\n\n mnist = tf.keras.datasets.mnist\n (x_train, y_train),(x_test, y_test) = mnist.load_data()\n x_train, x_test = x_train / 255.0, x_test / 255.0\n\n def create_model():\n return tf.keras.models.Sequential([\n tf.keras.layers.Flatten(input_shape=(28, 28), name='layers_flatten'),\n tf.keras.layers.Dense(512, activation='relu', name='layers_dense'), \n tf.keras.layers.Dropout(0.2, name='layers_dropout'),\n tf.keras.layers.Dense(10, activation='softmax', name='layers_dense_2')\n ])\n\n model = create_model()\n model.compile(optimizer='adam',\n loss='sparse_categorical_crossentropy',\n metrics=['accuracy'])\n\n log_dir = \"logs/fit/\" + datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)\n model.fit(x=x_train,y=y_train,epochs=5,validation_data=(x_test, y_test),callbacks=[tensorboard_callback])\n\n train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))\n test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))\n\n train_dataset = train_dataset.shuffle(60000).batch(64)\n test_dataset = test_dataset.batch(64)\n\n loss_object = tf.keras.losses.SparseCategoricalCrossentropy()\n optimizer = tf.keras.optimizers.Adam()\n\n # Define our metrics\n train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)\n train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy('train_accuracy')\n test_loss = tf.keras.metrics.Mean('test_loss', dtype=tf.float32)\n test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy('test_accuracy')\n\n def train_step(model, optimizer, x_train, y_train):\n with tf.GradientTape() as tape:\n predictions = model(x_train, training=True)\n loss = loss_object(y_train, predictions)\n grads = tape.gradient(loss, model.trainable_variables)\n optimizer.apply_gradients(zip(grads, model.trainable_variables))\n train_loss(loss)\n train_accuracy(y_train, predictions)\n\n def test_step(model, x_test, y_test):\n predictions = model(x_test)\n loss = loss_object(y_test, predictions)\n test_loss(loss)\n test_accuracy(y_test, predictions)\n\n current_time = datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n train_log_dir = 'logs/gradient_tape/' + current_time + '/train'\n test_log_dir = 'logs/gradient_tape/' + current_time + '/test'\n train_summary_writer = tf.summary.create_file_writer(train_log_dir)\n test_summary_writer = tf.summary.create_file_writer(test_log_dir)\n\n model = create_model() # reset our model\n EPOCHS = 5\n for epoch in range(EPOCHS):\n for (x_train, y_train) in train_dataset:\n train_step(model, optimizer, x_train, y_train)\n with train_summary_writer.as_default():\n tf.summary.scalar('loss', train_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)\n\n for (x_test, y_test) in test_dataset:\n test_step(model, x_test, y_test)\n with test_summary_writer.as_default():\n tf.summary.scalar('loss', test_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', test_accuracy.result(), step=epoch)\n template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'\n print (template.format(epoch+1,\n train_loss.result(),\n train_accuracy.result()*100,\n test_loss.result(),\n test_accuracy.result()*100))\n
For more detailed information, check out the TensorBoard documentation. Execute the TensorFlow log script using the following command:
python3 tensorflow_log.py\n
The above script will automatically create a logs
directory inside your current directory.
Start the CMF server and configure the CMF client.
Use the following command to run the test script, which will generate the MLMD file:
sh test_script.sh\n
Use the following command to push the generated MLMD and TensorFlow log files to the CMF server:
cmf metadata push -p 'pipeline-name' -t 'tensorboard-log-file-name'\n
Go to the CMF server and navigate to the TensorBoard tab. You will see an interface similar to the following image.
cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
"},{"location":"cmf_server/cmf-server/#setup-a-cmf-server","title":"Setup a cmf-server","text":"There are two ways to start cmf server -
Clone the Github repository.
git clone https://github.com/HewlettPackard/cmf\n
Install Docker Engine with non root user privileges.
In earlier versions of docker compose, docker compose
was independent of docker. Hence, docker-compose
was the command. However, after the introduction of Docker Compose Desktop V2, the compose command became part of the Docker engine. The recommended way to install Docker Compose is to install the Docker Compose plugin on the Docker engine. For more information - Docker Compose Reference.
docker compose
file","text":"This is the recommended way as docker compose starts both ui-server and cmf-server in one go.
cmf
directory. Replace xxxx
with user-name in docker-compose-server.yml available in the root cmf directory.
......\nservices:\nserver:\n image: server:latest\n volumes:\n - /home/xxxx/cmf-server/data:/cmf-server/data # for example /home/hpe-user/cmf-server/data:/cmf-server/data \n - /home/xxxx/cmf-server/data/static:/cmf-server/data/static # for example /home/hpe-user/cmf-server/data/static:/cmf-server/data/static\n container_name: cmf-server\n build:\n....\n
Execute the following command to start both the containers. IP
variable is the IP address and hostname
is the host name of the machine on which you are executing the following command. You can use either one.
IP=200.200.200.200 docker compose -f docker-compose-server.yml up\n OR\nhostname=host_name docker compose -f docker-compose-server.yml up\n
Replace docker compose
with docker-compose
for older versions. Also you can adjust $IP
in docker-compose-server.yml
to reflect the server IP and run the docker compose
command without specifying IP=200.200.200.200.
.......\nenvironment:\nREACT_APP_MY_IP: ${IP}\n......\n
Stop the containers.
docker compose -f docker-compose-server.yml stop\n
It is necessary to rebuild the images for cmf-server and ui-server after a cmf version update
or after pulling latest cmf code from git.
OR
"},{"location":"cmf_server/cmf-server/#using-docker-run-command","title":"Usingdocker run
command","text":"Install cmflib on your system.
Go to cmf/server
directory.
cd server\n
List all docker images.
docker images\n
Execute the below-mentioned command to create a cmf-server
docker image.
Usage: docker build -t [image_name] -f ./Dockerfile ../\n
Example: docker build -t server_image -f ./Dockerfile ../\n
Note
- '../'
represents the Build context for the docker image. Launch a new docker container using the image with directory /home/user/cmf-server/data mounted. Pre-requisite: mkdir /home/<user>/cmf-server/data/static
Usage: docker run --name [container_name] -p 0.0.0.0:8080:80 -v /home/<user>/cmf-server/data:/cmf-server/data -e MYIP=XX.XX.XX.XX [image_name]\n
Example: docker run --name cmf-server -p 0.0.0.0:8080:80 -v /home/user/cmf-server/data:/cmf-server/data -e MYIP=0.0.0.0 server_image\n
After cmf-server container is up, start ui-server
, Go to cmf/ui
folder.
cd cmf/ui\n
Execute the below-mentioned command to create a ui-server
docker image.
Usage: docker build -t [image_name] -f ./Dockerfile ./\n
Example: docker build -t ui_image -f ./Dockerfile ./\n
Launch a new docker container for UI.
Usage: docker run --name [container_name] -p 0.0.0.0:3000:3000 -e REACT_APP_MY_IP=XX.XX.XX.XX [image_name]\n
Example: docker run --name ui-server -p 0.0.0.0:3000:3000 -e REACT_APP_MY_IP=0.0.0.0 ui_image\n
Note: If you face issue regarding Libzbar-dev
similar to the snapshot, add proxies to '/.docker/config.json' {\n proxies: {\n \"default\": {\n \"httpProxy\": \"http://web-proxy.labs.xxxx.net:8080\",\n \"httpsProxy\": \"http://web-proxy.labs.xxxx.net:8080\",\n \"noProxy\": \".labs.xxxx.net,127.0.0.0/8\"\n }\n }\n }\n
To stop the docker container.
docker stop [container_name]\n
To delete the docker container.
docker rm [container_name] \n
To remove the docker image.
docker image rm [image_name] \n
cmf-server APIs are organized around FastAPI. They accept and return JSON-encoded request bodies and responses and return standard HTTP response codes.
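A hedged sketch of calling one of the endpoints listed in the table below from Python; the server URL and the exact shape of the returned JSON are assumptions and should be adapted to your cmf-server deployment.
import requests\n\nCMF_SERVER = \"http://127.0.0.1:8080\" # assumed cmf-server URL; adjust to your deployment\n\n# Retrieve all executions known to the server (endpoint from the table below)\nresponse = requests.get(f\"{CMF_SERVER}/display_executions\")\nresponse.raise_for_status()\nprint(response.json()) # JSON-encoded response body\n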
"},{"location":"cmf_server/cmf-server/#list-of-apis","title":"List of APIs","text":"Method URL DescriptionPost
/mlmd_push
Used to push Json Encoded data to cmf-server Get
/mlmd_pull/{pipeline_name}
Retrieves a mlmd file from cmf-server Get
/display_executions
Retrieves all executions from cmf-server Get
/display_artifacts/{pipeline_name}/{data_type}
Retrieves all artifacts from cmf-server for resp datat type Get
/display_lineage/{lineage_type}/{pipeline_name}
Creates lineage data from cmf-server Get
/display_pipelines
Retrieves all pipelines present in mlmd file"},{"location":"cmf_server/cmf-server/#http-response-status-codes","title":"HTTP Response Status codes","text":"Code Title Description 200
OK
mlmd is successfully pushed (e.g. when using GET
, POST
). 400
Bad request
When the cmf-server is not available. 500
Internal server error
When an internal error has happened"},{"location":"common-metadata-ontology/readme/","title":"Ontology","text":""},{"location":"common-metadata-ontology/readme/#common-metadata-ontology","title":"Common Metadata Ontology","text":"Common Metadata Ontology (CMO) is proposed to integrate and aggregate the pipeline metadata from various sources such as Papers-with-code, OpenML and Huggingface. CMF's data model is a manifestation of CMO which is specifically designed to capture the pipeline-centric metadata of AI pipelines. It consists of nodes to represent a pipeline, components of a pipeline (stages), relationships to capture interaction among pipeline entities and properties. CMO offers interoperability of diverse metadata, search and recommendation with reasoning capabilities. CMO offers flexibility to incorporate various executions implemented for each stage such as dataset preprocessing, feature engineering, training (including HPO), testing and evaluation. This enables robust search capabilities to identify the best execution path for a given pipeline. Additionally, CMO also facilitates the inclusion of additional semantic and statistical properties to enhance the richness and comprehensiveness of the metadata associated with them. The overview of CMO can be found below.
The external link to arrows.app can be found here
"},{"location":"common-metadata-ontology/readme/#sample-pipeline-represented-using-cmo","title":"Sample pipeline represented using CMO","text":"The sample figure shows a pipeline titled \"Robust outlier detection by de-biasing VAE likelihoods\" executed for \"Outlier Detection\" task for the stage train/test. The model used in the pipeline was \"Variational Autoencoder\". Several datasets were used in the pipeline implementation which are as follows (i) German Traffic Sign, (ii) Street View House Numbers and (iii) CelebFaces Arrtibutes dataset. The corresponding hyperparameters used and the metrics generated as a result of execution are included in the figure. The external link to source figure created using arrows.app can be found here
"},{"location":"common-metadata-ontology/readme/#turtle-syntax","title":"Turtle Syntax","text":"The Turtle format of formal ontology can be found here
"},{"location":"common-metadata-ontology/readme/#properties-of-each-nodes","title":"Properties of each nodes","text":"The properties of each node can be found below.
"},{"location":"common-metadata-ontology/readme/#pipeline","title":"Pipeline","text":"AI pipeline executed to solve a machine or deep learning Task
"},{"location":"common-metadata-ontology/readme/#properties","title":"Properties","text":"Any published text document regarding the pipeline implementation
"},{"location":"common-metadata-ontology/readme/#properties_1","title":"Properties","text":"The AI Task for which the pipeline is implemented. Example: image classification
"},{"location":"common-metadata-ontology/readme/#properties_2","title":"Properties","text":"The framework used to implement the pipeline and their code repository
"},{"location":"common-metadata-ontology/readme/#properties_3","title":"Properties","text":"Various stages of the pipeline such as data preprocessing, training, testing or evaluation
"},{"location":"common-metadata-ontology/readme/#properties_4","title":"Properties","text":"Multiple executions of a given stage in a pipeline
"},{"location":"common-metadata-ontology/readme/#properties_5","title":"Properties","text":"Artifacts such as model, dataset and metric generated at the end of each execution
"},{"location":"common-metadata-ontology/readme/#properties_6","title":"Properties","text":"Subclass of artifact. The dataset used in each Execution of a Pipeline
"},{"location":"common-metadata-ontology/readme/#properties_7","title":"Properties","text":"Subclass of artifact. The model used in each execution or produced as a result of an execution
"},{"location":"common-metadata-ontology/readme/#properties_8","title":"Properties","text":"Subclass of artifact. The evaluation result of each execution
"},{"location":"common-metadata-ontology/readme/#properties_9","title":"Properties","text":"Parameter setting using for each Execution of a Stage
"},{"location":"common-metadata-ontology/readme/#properties_10","title":"Properties","text":"NOTE: * are optional properties * There additional information on each node, different for each source. As of now, there are included in the KG for efficient search. But they are available to be used in the future to extract the data and populate as node properties. * *For metric, there are umpteen possible metric names and values. Therefore, we capture all of them as a key value pair under evaluations * custom_properties are where user can enter custom properties for each node while executing a pipeline * source is the source from which the node is obtained - papers-with-code, openml, huggingface
"},{"location":"common-metadata-ontology/readme/#published-works","title":"Published works","text":"This example depends on the following packages: git
. We also recommend installing anaconda to manage python virtual environments. This example was tested in the following environments:
Ubuntu-22.04 with python-3.10
This example demonstrates how CMF tracks a metadata associated with executions of various machine learning (ML) pipelines. ML pipelines differ from other pipelines (e.g., data Extract-Transform-Load pipelines) by the presence of ML steps, such as training and testing ML models. More comprehensive ML pipelines may include steps such as deploying a trained model and tracking its inference parameters (such as response latency, memory consumption etc.). This example, located here implements a simple pipeline consisting of four steps:
train
and test
raw datasets for training and testing a machine learning model. This step registers one input artifact (raw dataset
) and two output artifacts (train and test datasets
). train
step. This step registers two input artifacts (ML model and test dataset) and one output artifact (performance metrics).We start by creating (1) a workspace directory that will contain all files for this example and (2) a python virtual environment. Then we will clone the CMF project that contains this example project.
# Create workspace directory\nmkdir cmf_getting_started_example\ncd cmf_getting_started_example\n\n# Create and activate Python virtual environment (the Python version may need to be adjusted depending on your system)\nconda create -n cmf_getting_started_example python=3.10 \nconda activate cmf_getting_started_example\n\n# Clone the CMF project from GitHub and install CMF\ngit clone https://github.com/HewlettPackard/cmf\npip install ./cmf\n
"},{"location":"examples/getting_started/#setup-a-cmf-server","title":"Setup a cmf-server","text":"cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
Follow here to setup a common cmf-server.
"},{"location":"examples/getting_started/#project-initialization","title":"Project initialization","text":"We need to copy the source tree of the example in its own directory (that must be outside the CMF source tree), and using cmf init
command initialize dvc remote directory, git remote url, cmf server and neo4j with appropriate dvc backend for this project .
# Create a separate copy of the example project\ncp -r ./cmf/examples/example-get-started/ ./example-get-started\ncd ./example-get-started\n
"},{"location":"examples/getting_started/#cmf-init","title":"cmf init","text":"\nUsage: cmf init minioS3 [-h] --url [url] \n --endpoint-url [endpoint_url]\n --access-key-id [access_key_id] \n --secret-key [secret_key] \n --git-remote-url[git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init minioS3 --url s3://bucket-name --endpoint-url http://localhost:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Follow here for more details."},{"location":"examples/getting_started/#project-execution","title":"Project execution","text":"To execute the example pipeline, run the test_script.sh file (before that, study the contents of that file). Basically, that script will run a sequence of steps common for a typical machine learning project - getting raw data, converting it into machine learning train/test splits, training and testing a model. The execution of these steps (and parent pipeline) will be recorded by the CMF.
# Run the example pipeline\nsh ./test_script.sh\n
This script will run the pipeline and will store its metadata in a sqlite file named mlmd. Verify that all stages are done using git log
command. You should see commits corresponding to the artifacts that were created.
Under normal conditions, the next steps would be to: (1) execute the cmf artifact push
command to push the artifacts to the central artifact repository and (2) execute the cmf metadata push
command to track the metadata of the generated artifacts on a common cmf server.
Follow here for more details on cmf artifact
and cmf metadata
commands.
The stored metadata can be explored using the query layer. Example Jupyter notebook Query_Tester-base_mlmd.ipynb can be found in this directory.
"},{"location":"examples/getting_started/#clean-up","title":"Clean Up","text":"Metadata is stored in sqlite file named \"mlmd\". To clean up, delete the \"mlmd\" file.
"},{"location":"examples/getting_started/#steps-to-test-dataslice","title":"Steps to test dataslice","text":"Run the following command: python test-data-slice.py
.
CMF (Common Metadata Framework) collects and stores information associated with Machine Learning (ML) pipelines. It also implements APIs to query this metadata. The CMF adopts a data-first approach: all artifacts (such as datasets, ML models and performance metrics) recorded by the framework are versioned and identified by their content hash.
"},{"location":"#installation","title":"Installation","text":""},{"location":"#1-pre-requisites","title":"1. Pre-Requisites:","text":"conda create -n cmf python=3.10\nconda activate cmf\n
virtualenv --python=3.10 .cmf\nsource .cmf/bin/activate\n
"},{"location":"#3-install-cmf","title":"3. Install CMF:","text":"Latest version form GitHubStable version form PyPI pip install git+https://github.com/HewlettPackard/cmf\n
# pip install cmflib\n
"},{"location":"#next-steps","title":"Next Steps","text":"After installing CMF, proceed to configure CMF server and client. For detailed configuration instructions, refer to the Quick start with cmf-client page.
"},{"location":"#introduction","title":"Introduction","text":"Complex ML projects rely on ML pipelines
to train and test ML models. An ML pipeline is a sequence of stages where each stage performs a particular task, such as data loading, pre-processing, ML model training and testing stages. Each stage can have multiple Executions. Each Execution,
inputs
and produce outputs
.CMF uses the abstractions of Pipeline
,Context
and Executions
to store the metadata of complex ML pipelines. Each pipeline has a name. Users provide it when they initialize the CMF. Each stage is represented by a Context
object. Metadata associated with each run of a stage is captured in the Execution object. Inputs and outputs of Executions can be logged as dataset, model or metrics. While parameters of executions are recorded as properties of executions.
Start tracking the pipeline metadata by initializing the CMF runtime. The metadata will be associated with the pipeline named test_pipeline
.
from cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n\ncmf = Cmf(\n filename=\"mlmd\",\n pipeline_name=\"test_pipeline\",\n) \n
Before we can start tracking metadata, we need to let CMF know about stage type. This is not yet associated with this particular execution.
context: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"train\"\n)\n
Now we can create a new stage execution associated with the train
stage. The CMF always creates a new execution, and will adjust its name, so it's unique. This is also the place where we can log execution parameters
like seed, hyper-parameters etc .
execution: mlmd.proto.Execution = cmf.create_execution(\n execution_type=\"train\",\n custom_properties = {\"num_epochs\": 100, \"learning_rate\": 0.01}\n)\n
Finally, we can log an input (train dataset), and once trained, an output (ML model) artifacts.
cmf.log_dataset(\n 'artifacts/test_dataset.csv', # Dataset path \n \"input\" # This is INPUT artifact\n)\ncmf.log_model(\n \"artifacts/model.pkl\", # Model path \n event=\"output\" # This is OUTPUT artifact\n)\n
"},{"location":"#quick-example","title":"Quick Example","text":"Go through Getting Started page to learn more about CMF API usage.
"},{"location":"#api-overview","title":"API Overview","text":"Import CMF.
from cmflib import cmf\n
Initialize CMF. The CMF object is responsible for managing a CMF backend to record the pipeline metadata. Internally, it creates a pipeline abstraction that groups individual stages and their executions. All stages, their executions and produced artifacts will be associated with a pipeline with the given name.
cmf = cmf.Cmf(\n filename=\"mlmd\", # Path to ML Metadata file.\n pipeline_name=\"mnist\" # Name of a ML pipeline.\n) \n
Define a stage. An ML pipeline can have multiple stages, and each stage can be associated with multiple executions. A stage is like a class in the world of object-oriented programming languages. A context (stage description) defines what this stage looks like (name and optional properties), and is created with the create_context method.
context = cmf.create_context(\n pipeline_stage=\"download\", # Stage name\n custom_properties={ # Optional properties\n \"uses_network\": True, # Downloads from the Internet\n \"disk_space\": \"10GB\" # Needs this much space\n }\n)\n
Create a stage execution. A stage in ML pipeline can have multiple executions. Every run is marked as an execution. This API helps to track the metadata associated with the execution, like stage parameters (e.g., number of epochs and learning rate for train stages). The stage execution name does not need to be the same as the name of its context. Moreover, the CMF will adjust this name to ensure every execution has a unique name. The CMF will internally associate this execution with the context created previously. Stage executions are created by calling the create_execution method.
execution = cmf.create_execution(\n execution_type=\"download\", # Execution name.\n custom_properties = { # Execution parameters\n \"url\": \"https://a.com/mnist.gz\" # Data URL.\n }\n)\n
Log artifacts. A stage execution can consume (inputs) and produce (outputs) multiple artifacts (datasets, models and performance metrics). The path of these artifacts must be relative to the project (repository) root path. Artifacts might have optional metadata associated with them. These metadata could include feature statistics for ML datasets, or useful parameters for ML models (such as, for instance, number of trees in a random forest classifier).
Datasets are logged with the log_dataset method.
cmf.log_dataset('data/mnist.gz', \"input\", custom_properties={\"name\": \"mnist\", \"type\": 'raw'})\ncmf.log_dataset('data/train.csv', \"output\", custom_properties={\"name\": \"mnist\", \"type\": \"train_split\"})\ncmf.log_dataset('data/test.csv', \"output\", custom_properties={\"name\": \"mnist\", \"type\": \"test_split\"})\n
ML models produced by training stages are logged using the log_model API. ML models can be both input and output artifacts. The metadata associated with the artifact can be logged as an optional argument.
# In train stage\ncmf.log_model(\n path=\"model/rf.pkl\", event=\"output\", model_framework=\"scikit-learn\", model_type=\"RandomForestClassifier\", \n model_name=\"RandomForestClassifier:default\" \n)\n\n# In test stage\ncmf.log_model(\n path=\"model/rf.pkl\", event=\"input\" \n)\n
Metrics of every optimization step (one epoch of Stochastic Gradient Descent, or one boosting round in Gradient Boosting Trees) are logged using the log_metric API.
#Can be called at every epoch or every step in the training. This is logged to a parquet file and committed at the \n# commit stage.\n\n#Inside training loop\nwhile True: \n cmf.log_metric(\"training_metrics\", {\"loss\": loss}) \ncmf.commit_metrics(\"training_metrics\")\n
Stage metrics, or final metrics, are logged with the log_execution_metrics method. These are final metrics of a stage, such as final train or test accuracy.
cmf.log_execution_metrics(\"metrics\", {\"avg_prec\": avg_prec, \"roc_auc\": roc_auc})\n
Dataslices are intended to track subsets of the data. For instance, they can be used to track and compare the accuracy of ML models on these subsets to identify model bias. Data slices are created with the create_dataslice method.
import random\n\ndataslice = cmf.create_dataslice(\"slice-a\")\nfor i in range(1, 20, 1):\n j = random.randrange(100)\n dataslice.add_data(\"data/raw_data/\"+str(j)+\".xml\")\ndataslice.commit()\n
"},{"location":"#graph-layer-overview","title":"Graph Layer Overview","text":"CMF library has an optional graph layer
which stores the relationships in a Neo4J graph database. To use the graph layer, the graph
parameter in the library init call must be set to true (it is set to false by default). The library reads the configuration parameters of the graph database from cmf config
generated by cmf init
command.
cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for minioS3 accordingly.
To use the graph layer, instantiate CMF with the graph=True parameter:
from cmflib import cmf\n\ncmf = cmf.Cmf(\n filename=\"mlmd\",\n pipeline_name=\"anomaly_detection_pipeline\", \n graph=True\n)\n
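If the Neo4j connection is not picked up from the cmf config, the API reference later on this page notes that it can also be supplied through environment variables. A minimal sketch, assuming those environment variable names and using placeholder connection details (the parameter names follow the example above):
import os\nfrom cmflib import cmf\n\n# Placeholder connection details; point these at your own Neo4j instance.\nos.environ[\"NEO4J_URI\"] = \"bolt://localhost:7687\"\nos.environ[\"NEO4J_USER_NAME\"] = \"neo4j\"\nos.environ[\"NEO4J_PASSWD\"] = \"password\"\n\ncmf_logger = cmf.Cmf(\n filename=\"mlmd\",\n pipeline_name=\"anomaly_detection_pipeline\",\n graph=True, # also store lineage relationships in Neo4j\n)\n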
"},{"location":"#jupyter-lab-docker-container-with-cmf-pre-installed","title":"Jupyter Lab docker container with CMF pre-installed","text":""},{"location":"#use-a-jupyterlab-docker-environment-with-cmf-pre-installed","title":"Use a Jupyterlab Docker environment with CMF pre-installed","text":"CMF has a docker-compose file which creates two docker containers, - JupyterLab Notebook Environment with CMF pre installed. - Accessible at http://[HOST.IP.AD.DR]:8888 (default token: docker
) - Within the Jupyterlab environment, a startup script switches context to $USER:$GROUP
as specified in .env
- example-get-started
from this repo is bind mounted into /home/jovyan/example-get-started
- Neo4j Docker container to store and access lineages.
Create a .env file in the current folder using env-example as a template. Modify the .env file to set the following variables: USER, UID, GROUP, GID, GIT_USER_NAME, GIT_USER_EMAIL, GIT_REMOTE_URL (these are used by docker-compose.yml).
Update docker-compose.yml as needed. Your .ssh folder is mounted inside the Docker container to enable you to push and pull code from git. Create these directories in your home folder:
mkdir $HOME/workspace \nmkdir $HOME/dvc_remote \n
workspace - will be mounted inside the CMF pre-installed Docker container (can be your code directory). dvc_remote - remote data store for DVC. Alternatively, change the lines below in docker-compose to reflect the appropriate directories:
If your workspace is named \"experiment\" change the below line\n$HOME/workspace:/home/jovyan/workspace to \n$HOME/experiment:/home/jovyan/workspace\n
If your remote is /extmount/data change the line \n$HOME/dvc_remote:/home/jovyan/dvc_remote to \n/extmount/data:/home/jovyan/dvc_remote \n
Start the containers: docker-compose up --build -d\n
Access the Jupyter notebook at http://[HOST.IP.AD.DR]:8888 (default token: docker). Click the terminal icon. Quick Start:
cd example-get-started\ncmf init local --path /home/user/local-storage --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\nsh test_script.sh\ncmf artifact push -p 'Test-env'\n
The above steps will run a pre-coded example pipeline, and the metadata is stored in a file named \"mlmd\". The artifacts created will be pushed to the configured dvc remote (default: /home/dvc_remote). The metadata lineage can be accessed in Neo4j: open http://host:7475/browser/, connect to the server with the default password neo4j123 (to change this, modify the .env file), and run the query:
MATCH (a:Execution)-[r]-(b) WHERE (b:Dataset or b:Model or b:Metrics) RETURN a,r, b \n
The expected output is a graph of executions and their related artifacts. In the JupyterLab notebook, select the kernel Python[conda env:python37].
Shutdown/remove (Remove volumes as well)
docker-compose down -v\n
"},{"location":"#license","title":"License","text":"CMF is an open source project hosted on GitHub and distributed according to the Apache 2.0 licence. We are welcome user contributions - send us a message on the Slack channel or open a GitHub issue or a pull request on GitHub.
"},{"location":"#citation","title":"Citation","text":"@mist{foltin2022cmf,\n title={Self-Learning Data Foundation for Scientific AI},\n author={Martin Foltin, Annmary Justine, Sergey Serebryakov, Cong Xu, Aalap Tripathy, Suparna Bhattacharya, \n Paolo Faraboschi},\n year={2022},\n note = {Presented at the \"Monterey Data Conference\"},\n URL={https://drive.google.com/file/d/1Oqs0AN0RsAjt_y9ZjzYOmBxI8H0yqSpB/view},\n}\n
"},{"location":"#community","title":"Community","text":"Help
Common Metadata Framework and its documentation are in an early, active stage of development. If anything is unclear, missing, or contains a typo, please open an issue or a pull request on GitHub.
"},{"location":"_src/","title":"CMF docs development resources","text":"This directory contains files that are used to create some content for the CMF documentation. This process is not automated yet. Files in this directory are not supposed to be referenced from documentation pages.
It should also not be necessary to automatically redeploy the documentation (e.g., with GitHub Actions) when documentation files change only in this particular directory.
This call initializes the library and also creates a pipeline object with the provided name. Arguments to be passed to Cmf:
\ncmf = cmf.Cmf(filename=\"mlmd\", pipeline_name=\"Test-env\") \n\nReturns a Context object of mlmd.proto.Context\nArguments: |Argument|Type|Description| |------|------|------| |filename|String|Path to the sqlite file to store the metadata| |pipeline_name|String|Name to uniquely identify the pipeline. Note that the name is the unique identifier for a pipeline; if a pipeline already exists with the same name, the existing pipeline object is reused| |custom_properties|Dictionary|(Optional) Additional properties of the pipeline that need to be stored| |graph|Bool|(Optional) If set to true, the library also stores the relationships in the provided graph database. The following environment variables should be set: NEO4J_URI (the graph server URI, e.g. export NEO4J_URI=\"bolt://ip:port\") and the user name and password (export NEO4J_USER_NAME=neo4j, export NEO4J_PASSWD=neo4j)|
Return Object mlmd.proto.Context
|mlmd.proto.Context Attributes| | |------|------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties| |type| string type| |type_id| int64 type_id| ### 2. create_context - Creates a Stage with properties A pipeline may include multiple stages. A unique name should be provided for every stage in a pipeline. Arguments to be passed to create_context:\ncontext = cmf.create_context(pipeline_stage=\"Prepare\", custom_properties={\"user-metadata1\":\"metadata_value\"})\nArguments: |Argument|Type|Description| |------|------|------| |pipeline_stage|String|Name of the pipeline stage| |custom_properties|Dictionary|(Optional) Key-value pairs of additional properties of the stage that need to be stored|
Return Object mlmd.proto.Context |mlmd.proto.Context Attributes| | |------|------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties| |type| string type| |type_id| int64 type_id|
"},{"location":"api/public/API/#3-create_execution-creates-an-execution-with-properties","title":"3. create_execution - Creates an Execution with properties","text":"A stage can have multiple executions. A unique name should ne provided for exery execution. Properties of the execution can be paased as key value pairs in the custom properties. Eg: The hyper parameters used for the execution can be passed.
\n\nexecution = cmf.create_execution(execution_type=\"Prepare\",\n custom_properties = {\"Split\":split, \"Seed\":seed})\nReturns an Execution object of type mlmd.proto.Execution\nArguments: |Argument|Type|Description| |------|------|------| |execution_type|String|Name of the execution| |custom_properties|Dictionary|(Optional Parameter)|
Returns an object of type mlmd.proto.Execution | mlmd.proto.Execution Attributes| | |---------------|-------------| |create_time_since_epoch |int64 create_time_since_epoch| |custom_properties |repeated CustomPropertiesEntry custom_properties| |id |int64 id| |last_known_state |State last_known_state| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name |string name| |properties |repeated PropertiesEntry properties [Git_Repo, Context_Type, Git_Start_Commit, Pipeline_Type, Context_ID, Git_End_Commit, Execution(Command used), Pipeline_id]| |type |string type| |type_id| int64 type_id|
"},{"location":"api/public/API/#4-log_dataset-logs-a-dataset-and-its-properties","title":"4. log_dataset - Logs a Dataset and its properties","text":"Tracks a Dataset and its version. The version of the dataset is automatically obtained from the versioning software(DVC) and tracked as a metadata.
\nartifact = cmf.log_dataset(\"/repo/data.xml\", \"input\", custom_properties={\"Source\":\"kaggle\"})\nArguments: |Argument|Type|Description| |------|------|------| |url|String|The path to the dataset| |event|String|Takes arguments INPUT or OUTPUT| |custom_properties|Dictionary|The dataset properties|
Returns an Artifact object of type mlmd.proto.Artifact
|mlmd.proto.Artifact Attributes| | |------|------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties (Commit, Git_Repo)| |state| State state| |type| string type| |type_id| int64 type_id| |uri| string uri|"},{"location":"api/public/API/#5-log_model-logs-a-model-and-its-properties","title":"5. log_model - Logs a model and its properties.","text":"\ncmf.log_model(path=\"path/to/model.pkl\", event=\"output\", model_framework=\"SKlearn\", model_type=\"RandomForestClassifier\", model_name=\"RandomForestClassifier:default\")\n\nReturns an Artifact object of type mlmd.proto.Artifact\nArguments: |Argument|Type|Description| |------|------|------| |path|String|Path to the model file| |event|String|Takes arguments INPUT or OUTPUT| |model_framework|String|Framework used to create the model| |model_type|String|Type of model algorithm used| |model_name|String|Name of the algorithm used| |custom_properties|Dictionary|The model properties|
Returns an Artifact object of type mlmd.proto.Artifact |mlmd.proto.Artifact Attributes| | |-----------|---------| |create_time_since_epoch| int64 create_time_since_epoch| |custom_properties| repeated CustomPropertiesEntry custom_properties| |id| int64 id| |last_update_time_since_epoch| int64 last_update_time_since_epoch| |name| string name| |properties| repeated PropertiesEntry properties (commit, model_framework, model_type, model_name)| |state| State state| |type| string type| |type_id| int64 type_id| |uri| string uri|
"},{"location":"api/public/API/#6-log_execution_metrics-logs-the-metrics-for-the-execution","title":"6. log_execution_metrics Logs the metrics for the execution","text":"\ncmf.log_execution_metrics(metrics_name :\"Training_Metrics\", {\"auc\":auc,\"loss\":loss}\nArguments metrics_name String Name to identify the metrics custom_properties Dictionary Metrics"},{"location":"api/public/API/#7-log_metrics-logs-the-per-step-metrics-for-fine-grained-tracking","title":"7. log_metrics Logs the per Step metrics for fine grained tracking","text":"
The metrics provided are stored in a parquet file. The commit_metrics call adds the parquet file to the version control framework. The metrics written to the parquet file can be retrieved using the read_metrics call.
\n# Can be called at every epoch or every step in the training. This is logged to a parquet file and committed at the commit stage.\nwhile True: # Inside training loop\n metawriter.log_metric(\"training_metrics\", {\"loss\":loss}) \nmetawriter.commit_metrics(\"training_metrics\")\nArguments for log_metric: |Argument|Type|Description| |------|------|------| |metrics_name|String|Name to identify the metrics| |custom_properties|Dictionary|Metrics| Arguments for commit_metrics: |Argument|Type|Description| |------|------|------| |metrics_name|String|Name to identify the metrics|
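For illustration only, a slightly fuller sketch of per-step logging with the same API, assuming a context and execution are created as shown earlier on this page; the loop bounds and loss values are placeholders, not part of the library:
from cmflib.cmf import Cmf\n\nmetawriter = Cmf(filename=\"mlmd\", pipeline_name=\"Test-env\")\nmetawriter.create_context(pipeline_stage=\"train\")\nmetawriter.create_execution(execution_type=\"train\")\n\nfor epoch in range(10): # placeholder training loop\n loss = 1.0 / (epoch + 1) # placeholder loss value\n metawriter.log_metric(\"training_metrics\", {\"loss\": loss})\n\n# Writes the accumulated per-step metrics to a parquet file and versions it.\nmetawriter.commit_metrics(\"training_metrics\")\n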
This helps to track a subset of the data. Currently supported only for file abstractions. For example, the accuracy of the model for a slice of the data (by gender, ethnicity, etc.).
\ndataslice = cmf.create_dataslice(\"slice-a\")\nArguments for create_dataslice: |Argument|Type|Description| |------|------|------| |name|String|Name to identify the dataslice| Returns a Dataslice object
Currently supported only for file abstractions. Precondition: the parent folder containing the file should already be versioned.
\ndataslice.add_data(\"data/raw_data/\"+str(j)+\".xml\")\nArguments: |Argument|Type|Description| |------|------|------| |name|String|Name to identify the file to be added to the dataslice|
The created dataslice is versioned and added to the underlying data versioning software.
\ndataslice.commit()\n"},{"location":"api/public/cmf/","title":"cmflib.cmf","text":"
This class provides methods to log metadata for distributed AI pipelines. The class instance creates an ML metadata store to store the metadata. It also creates a driver to store nodes and their relationships in Neo4j. The user has to provide the name of the pipeline that needs to be recorded with CMF.
cmflib.cmf.Cmf(\n filepath=\"mlmd\",\n pipeline_name=\"test_pipeline\",\n custom_properties={\"owner\": \"user_a\"},\n graph=False\n)\n
Args: filepath: Path to the sqlite file to store the metadata. pipeline_name: Name to uniquely identify the pipeline. Note that the name is the unique identifier for a pipeline. If a pipeline already exists with the same name, the existing pipeline object is reused. custom_properties: Additional properties of the pipeline that need to be stored. graph: If set to true, the library also stores the relationships in the provided graph database. The following variables should be set: neo4j_uri
(graph server URI), neo4j_user
(user name) and neo4j_password
(user password), e.g.: cmf init local --path /home/user/local-storage --git-remote-url https://github.com/XXX/exprepo.git --neo4j-user neo4j --neo4j-password neo4j\n --neo4j-uri bolt://localhost:7687\n
Source code in cmflib/cmf.py
def __init__(\n self,\n filepath: str = \"mlmd\",\n pipeline_name: str = \"\",\n custom_properties: t.Optional[t.Dict] = None,\n graph: bool = False,\n is_server: bool = False,\n ):\n #path to directory\n self.cmf_init_path = filepath.rsplit(\"/\",1)[0] \\\n\t\t\t\t if len(filepath.rsplit(\"/\",1)) > 1 \\\n\t\t\t\t\telse os.getcwd()\n\n logging_dir = change_dir(self.cmf_init_path)\n if is_server is False:\n Cmf.__prechecks()\n if custom_properties is None:\n custom_properties = {}\n if not pipeline_name:\n # assign folder name as pipeline name \n cur_folder = os.path.basename(os.getcwd())\n pipeline_name = cur_folder\n config = mlpb.ConnectionConfig()\n config.sqlite.filename_uri = filepath\n self.store = metadata_store.MetadataStore(config)\n self.filepath = filepath\n self.child_context = None\n self.execution = None\n self.execution_name = \"\"\n self.execution_command = \"\"\n self.metrics = {}\n self.input_artifacts = []\n self.execution_label_props = {}\n self.graph = graph\n #last token in filepath\n self.branch_name = filepath.rsplit(\"/\", 1)[-1]\n\n if is_server is False:\n git_checkout_new_branch(self.branch_name)\n self.parent_context = get_or_create_parent_context(\n store=self.store,\n pipeline=pipeline_name,\n custom_properties=custom_properties,\n )\n if is_server:\n Cmf.__get_neo4j_server_config()\n if graph is True:\n Cmf.__load_neo4j_params()\n self.driver = graph_wrapper.GraphDriver(\n Cmf.__neo4j_uri, Cmf.__neo4j_user, Cmf.__neo4j_password\n )\n self.driver.create_pipeline_node(\n pipeline_name, self.parent_context.id, custom_properties\n )\n os.chdir(logging_dir)\n
This module contains all the public API for CMF
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.create_context","title":"create_context(pipeline_stage, custom_properties=None)
","text":"Create's a context(stage). Every call creates a unique pipeline stage. Updates Pipeline_stage name. Example:
#Create context\n# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Create context\ncontext: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n)\n
Args: Pipeline_stage: Name of the Stage. custom_properties: Developers can provide key value pairs with additional properties of the execution that need to be stored. Returns: Context object from ML Metadata library associated with the new context for this stage. Source code in cmflib/cmf.py
def create_context(\n self, pipeline_stage: str, custom_properties: t.Optional[t.Dict] = None\n) -> mlpb.Context:\n \"\"\"Create's a context(stage).\n Every call creates a unique pipeline stage.\n Updates Pipeline_stage name.\n Example:\n ```python\n #Create context\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Create context\n context: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n )\n\n ```\n Args:\n Pipeline_stage: Name of the Stage.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\n Returns:\n Context object from ML Metadata library associated with the new context for this stage.\n \"\"\"\n custom_props = {} if custom_properties is None else custom_properties\n pipeline_stage = self.parent_context.name + \"/\" + pipeline_stage\n ctx = get_or_create_run_context(\n self.store, pipeline_stage, custom_props)\n self.child_context = ctx\n associate_child_to_parent_context(\n store=self.store, parent_context=self.parent_context, child_context=ctx\n )\n if self.graph:\n self.driver.create_stage_node(\n pipeline_stage, self.parent_context, ctx.id, custom_props\n )\n return ctx\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.merge_created_context","title":"merge_created_context(pipeline_stage, custom_properties=None)
","text":"Merge created context. Every call creates a unique pipeline stage. Created for metadata push purpose. Example:
```python\n#Create context\n# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Create context\ncontext: mlmd.proto.Context = cmf.merge_created_context(\n pipeline_stage=\"Test-env/prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n)\n```\nArgs:\n Pipeline_stage: Pipeline_Name/Stage_name.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\nReturns:\n Context object from ML Metadata library associated with the new context for this stage.\n
Source code in cmflib/cmf.py
def merge_created_context(\n self, pipeline_stage: str, custom_properties: t.Optional[t.Dict] = None\n) -> mlpb.Context:\n \"\"\"Merge created context.\n Every call creates a unique pipeline stage.\n Created for metadata push purpose.\n Example:\n\n ```python\n #Create context\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Create context\n context: mlmd.proto.Context = cmf.merge_created_context(\n pipeline_stage=\"Test-env/prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n ```\n Args:\n Pipeline_stage: Pipeline_Name/Stage_name.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\n Returns:\n Context object from ML Metadata library associated with the new context for this stage.\n \"\"\"\n\n custom_props = {} if custom_properties is None else custom_properties\n ctx = get_or_create_run_context(\n self.store, pipeline_stage, custom_props)\n self.child_context = ctx\n associate_child_to_parent_context(\n store=self.store, parent_context=self.parent_context, child_context=ctx\n )\n if self.graph:\n self.driver.create_stage_node(\n pipeline_stage, self.parent_context, ctx.id, custom_props\n )\n return ctx\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.create_execution","title":"create_execution(execution_type, custom_properties=None, cmd=None, create_new_execution=True)
","text":"Create execution. Every call creates a unique execution. Execution can only be created within a context, so create_context must be called first. Example:
# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Create or reuse context for this stage\ncontext: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n)\n# Create a new execution for this stage run\nexecution: mlmd.proto.Execution = cmf.create_execution(\n execution_type=\"Prepare\",\n custom_properties = {\"split\": split, \"seed\": seed}\n)\n
Args: execution_type: Type of the execution (when create_new_execution is False, this is the name of the execution). custom_properties: Developers can provide key value pairs with additional properties of the execution that need to be stored. cmd: Command used to run this execution.\n\ncreate_new_execution: bool = True. This can be used by advanced users to re-use executions.\n This is applicable when working with framework code like mmdet, pytorch lightning etc., where\n custom call-backs are used to log metrics.\n If create_new_execution is True (default), the execution_type parameter will be used as the name of the execution type.\n If create_new_execution is False and an existing execution exists with the same name as execution_type,\n it will be reused.\n Only executions created with create_new_execution as False will have \"name\" as a property.\n
Returns:
|Type|Description| |------|------| |Execution|Execution object from ML Metadata library associated with the new execution for this stage.|
Source code in cmflib/cmf.py
def create_execution(\n self,\n execution_type: str,\n custom_properties: t.Optional[t.Dict] = None,\n cmd: str = None,\n create_new_execution: bool = True,\n) -> mlpb.Execution:\n \"\"\"Create execution.\n Every call creates a unique execution. Execution can only be created within a context, so\n [create_context][cmflib.cmf.Cmf.create_context] must be called first.\n Example:\n ```python\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Create or reuse context for this stage\n context: mlmd.proto.Context = cmf.create_context(\n pipeline_stage=\"prepare\",\n custom_properties ={\"user-metadata1\": \"metadata_value\"}\n )\n # Create a new execution for this stage run\n execution: mlmd.proto.Execution = cmf.create_execution(\n execution_type=\"Prepare\",\n custom_properties = {\"split\": split, \"seed\": seed}\n )\n ```\n Args:\n execution_type: Type of the execution.(when create_new_execution is False, this is the name of execution)\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be stored.\n\n cmd: command used to run this execution.\n\n create_new_execution:bool = True, This can be used by advanced users to re-use executions\n This is applicable, when working with framework code like mmdet, pytorch lightning etc, where the\n custom call-backs are used to log metrics.\n if create_new_execution is True(Default), execution_type parameter will be used as the name of the execution type.\n if create_new_execution is False, if existing execution exist with the same name as execution_type.\n it will be reused.\n Only executions created with create_new_execution as False will have \"name\" as a property.\n\n\n Returns:\n Execution object from ML Metadata library associated with the new execution for this stage.\n \"\"\"\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # Initializing the execution related fields\n\n self.metrics = {}\n self.input_artifacts = []\n self.execution_label_props = {}\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = git_get_repo()\n git_start_commit = git_get_commit()\n cmd = str(sys.argv) if cmd is None else cmd\n python_env=get_python_env()\n self.execution = create_new_execution_in_existing_run_context(\n store=self.store,\n # Type field when re-using executions\n execution_type_name=self.child_context.name,\n execution_name=execution_type, \n #Name field if we are re-using executions\n #Type field , if creating new executions always \n context_id=self.child_context.id,\n execution=cmd,\n pipeline_id=self.parent_context.id,\n pipeline_type=self.parent_context.name,\n git_repo=git_repo,\n git_start_commit=git_start_commit,\n python_env=python_env,\n custom_properties=custom_props,\n create_new_execution=create_new_execution,\n )\n uuids = self.execution.properties[\"Execution_uuid\"].string_value\n if uuids:\n self.execution.properties[\"Execution_uuid\"].string_value = uuids+\",\"+str(uuid.uuid1())\n 
else:\n self.execution.properties[\"Execution_uuid\"].string_value = str(uuid.uuid1()) \n self.store.put_executions([self.execution])\n self.execution_name = str(self.execution.id) + \",\" + execution_type\n self.execution_command = cmd\n for k, v in custom_props.items():\n k = re.sub(\"-\", \"_\", k)\n self.execution_label_props[k] = v\n self.execution_label_props[\"Execution_Name\"] = (\n execution_type + \":\" + str(self.execution.id)\n )\n\n self.execution_label_props[\"execution_command\"] = cmd\n if self.graph:\n self.driver.create_execution_node(\n self.execution_name,\n self.child_context.id,\n self.parent_context,\n cmd,\n self.execution.id,\n custom_props,\n )\n os.chdir(logging_dir)\n return self.execution\n
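As a hedged illustration of the create_new_execution flag described in the arguments above (useful when framework callbacks log metrics against an execution that already exists); the execution name and property values below are placeholders:
# Re-use an existing execution instead of creating a new one.\n# When create_new_execution is False, execution_type is treated as the execution name.\nexecution = cmf.create_execution(\n execution_type=\"train\", # name of the execution to re-use (placeholder)\n create_new_execution=False,\n custom_properties={\"callback\": \"on_epoch_end\"}, # placeholder property\n)\n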
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.update_execution","title":"update_execution(execution_id, custom_properties=None)
","text":"Updates an existing execution. The custom properties can be updated after creation of the execution. The new custom properties is merged with earlier custom properties. Example
# Import CMF\nfrom cmflib.cmf import Cmf\nfrom ml_metadata.proto import metadata_store_pb2 as mlpb\n# Create CMF logger\ncmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n# Update a execution\nexecution: mlmd.proto.Execution = cmf.update_execution(\n execution_id=8,\n custom_properties = {\"split\": split, \"seed\": seed}\n)\n
Args: execution_id: id of the execution. custom_properties: Developers can provide key value pairs with additional properties of the execution that need to be updated. Returns: Execution object from ML Metadata library associated with the updated execution for this stage. Source code in cmflib/cmf.py
def update_execution(\n self, execution_id: int, custom_properties: t.Optional[t.Dict] = None\n):\n \"\"\"Updates an existing execution.\n The custom properties can be updated after creation of the execution.\n The new custom properties is merged with earlier custom properties.\n Example\n ```python\n # Import CMF\n from cmflib.cmf import Cmf\n from ml_metadata.proto import metadata_store_pb2 as mlpb\n # Create CMF logger\n cmf = Cmf(filepath=\"mlmd\", pipeline_name=\"test_pipeline\")\n # Update a execution\n execution: mlmd.proto.Execution = cmf.update_execution(\n execution_id=8,\n custom_properties = {\"split\": split, \"seed\": seed}\n )\n ```\n Args:\n execution_id: id of the execution.\n custom_properties: Developers can provide key value pairs with additional properties of the execution that\n need to be updated.\n Returns:\n Execution object from ML Metadata library associated with the updated execution for this stage.\n \"\"\"\n self.execution = self.store.get_executions_by_id([execution_id])[0]\n if self.execution is None:\n print(\"Error - no execution id\")\n return\n execution_type = self.store.get_execution_types_by_id([self.execution.type_id])[\n 0\n ]\n\n if custom_properties:\n for key, value in custom_properties.items():\n if isinstance(value, int):\n self.execution.custom_properties[key].int_value = value\n else:\n self.execution.custom_properties[key].string_value = str(\n value)\n self.store.put_executions([self.execution])\n c_props = {}\n for k, v in self.execution.custom_properties.items():\n key = re.sub(\"-\", \"_\", k)\n val_type = str(v).split(\":\", maxsplit=1)[0]\n if val_type == \"string_value\":\n val = self.execution.custom_properties[k].string_value\n else:\n val = str(v).split(\":\")[1]\n # The properties value are stored in the format type:value hence,\n # taking only value\n self.execution_label_props[key] = val\n c_props[key] = val\n self.execution_name = str(self.execution.id) + \\\n \",\" + execution_type.name\n self.execution_command = self.execution.properties[\"Execution\"]\n self.execution_label_props[\"Execution_Name\"] = (\n execution_type.name + \":\" + str(self.execution.id)\n )\n self.execution_label_props[\"execution_command\"] = self.execution.properties[\n \"Execution\"\n ].string_value\n if self.graph:\n self.driver.create_execution_node(\n self.execution_name,\n self.child_context.id,\n self.parent_context,\n self.execution.properties[\"Execution\"].string_value,\n self.execution.id,\n c_props,\n )\n return self.execution\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_dataset","title":"log_dataset(url, event, custom_properties=None, external=False)
","text":"Logs a dataset as artifact. This call adds the dataset to dvc. The dvc metadata file created (.dvc) will be added to git and committed. The version of the dataset is automatically obtained from the versioning software(DVC) and tracked as a metadata. Example:
artifact: mlmd.proto.Artifact = cmf.log_dataset(\n url=\"/repo/data.xml\",\n event=\"input\",\n custom_properties={\"source\":\"kaggle\"}\n)\n
Args: url: The path to the dataset. event: Takes arguments INPUT
OR OUTPUT
. custom_properties: Dataset properties (key/value pairs). Returns: Artifact object from ML Metadata library associated with the new dataset artifact. Source code in cmflib/cmf.py
def log_dataset(\n self,\n url: str,\n event: str,\n custom_properties: t.Optional[t.Dict] = None,\n external: bool = False,\n) -> mlpb.Artifact:\n \"\"\"Logs a dataset as artifact.\n This call adds the dataset to dvc. The dvc metadata file created (.dvc) will be added to git and committed. The\n version of the dataset is automatically obtained from the versioning software(DVC) and tracked as a metadata.\n Example:\n ```python\n artifact: mlmd.proto.Artifact = cmf.log_dataset(\n url=\"/repo/data.xml\",\n event=\"input\",\n custom_properties={\"source\":\"kaggle\"}\n )\n ```\n Args:\n url: The path to the dataset.\n event: Takes arguments `INPUT` OR `OUTPUT`.\n custom_properties: Dataset properties (key/value pairs).\n Returns:\n Artifact object from ML Metadata library associated with the new dataset artifact.\n \"\"\"\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.execution:\n self.create_execution(execution_type=name_without_extension)\n assert self.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n ### To Do : Technical Debt. \n # If the dataset already exist , then we just link the existing dataset to the execution\n # We do not update the dataset properties . \n # We need to append the new properties to the existing dataset properties\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = git_get_repo()\n name = re.split(\"/\", url)[-1]\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n commit_output(url, self.execution.id)\n c_hash = dvc_get_hash(url)\n\n if c_hash == \"\":\n print(\"Error in getting the dvc hash,return without logging\")\n return\n\n dataset_commit = c_hash\n dvc_url = dvc_get_url(url)\n dvc_url_with_pipeline = f\"{self.parent_context.name}:{dvc_url}\"\n url = url + \":\" + c_hash\n if c_hash and c_hash.strip:\n existing_artifact.extend(self.store.get_artifacts_by_uri(c_hash))\n\n # To Do - What happens when uri is the same but names are different\n if existing_artifact and len(existing_artifact) != 0:\n existing_artifact = existing_artifact[0]\n\n # Quick fix- Updating only the name\n if custom_properties is not None:\n self.update_existing_artifact(\n existing_artifact, custom_properties)\n uri = c_hash\n # update url for existing artifact\n self.update_dataset_url(existing_artifact, dvc_url_with_pipeline)\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=uri,\n input_name=url,\n event_type=event_type,\n )\n else:\n # if((existing_artifact and len(existing_artifact )!= 0) and c_hash != \"\"):\n # url = url + \":\" + str(self.execution.id)\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=url,\n type_name=\"Dataset\",\n event_type=event_type,\n properties={\n \"git_repo\": str(git_repo),\n # passing c_hash value 
to commit\n \"Commit\": str(dataset_commit),\n \"url\": str(dvc_url_with_pipeline),\n },\n artifact_type_properties={\n \"git_repo\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n custom_props[\"git_repo\"] = git_repo\n custom_props[\"Commit\"] = dataset_commit\n self.execution_label_props[\"git_repo\"] = git_repo\n self.execution_label_props[\"Commit\"] = dataset_commit\n\n if self.graph:\n self.driver.create_dataset_node(\n name,\n url,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n self.driver.create_execution_links(uri, name, \"Dataset\")\n else:\n child_artifact = {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n os.chdir(logging_dir)\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_dataset_with_version","title":"log_dataset_with_version(url, version, event, props=None, custom_properties=None)
","text":"Logs a dataset when the version (hash) is known. Example:
artifact: mlpb.Artifact = cmf.log_dataset_with_version( \n url=\"path/to/dataset\", \n version=\"abcdef\",\n event=\"output\",\n props={ \"git_repo\": \"https://github.com/example/repo\",\n \"url\": \"/path/in/repo\", },\n custom_properties={ \"custom_key\": \"custom_value\", }, \n ) \n
Args: url: Path to the dataset. version: Hash or version identifier for the dataset. event: Takes arguments INPUT
or OUTPUT
. props: Optional properties for the dataset (e.g., git_repo, url). custom_properties: Optional custom properties for the dataset. Returns: Artifact object from the ML Protocol Buffers library associated with the new dataset artifact. Source code in cmflib/cmf.py
def log_dataset_with_version(\n self,\n url: str,\n version: str,\n event: str,\n props: t.Optional[t.Dict] = None,\n custom_properties: t.Optional[t.Dict] = None,\n) -> mlpb.Artifact:\n \"\"\"Logs a dataset when the version (hash) is known.\n Example: \n ```python \n artifact: mlpb.Artifact = cmf.log_dataset_with_version( \n url=\"path/to/dataset\", \n version=\"abcdef\",\n event=\"output\",\n props={ \"git_repo\": \"https://github.com/example/repo\",\n \"url\": \"/path/in/repo\", },\n custom_properties={ \"custom_key\": \"custom_value\", }, \n ) \n ```\n Args: \n url: Path to the dataset. \n version: Hash or version identifier for the dataset. \n event: Takes arguments `INPUT` or `OUTPUT`. \n props: Optional properties for the dataset (e.g., git_repo, url). \n custom_properties: Optional custom properties for the dataset.\n Returns:\n Artifact object from the ML Protocol Buffers library associated with the new dataset artifact. \n \"\"\"\n\n props = {} if props is None else props\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = props.get(\"git_repo\", \"\")\n name = url\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n c_hash = version\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n # dataset_commit = commit_output(url, self.execution.id)\n\n dataset_commit = version\n url = url + \":\" + c_hash\n if c_hash and c_hash.strip:\n existing_artifact.extend(self.store.get_artifacts_by_uri(c_hash))\n\n # To Do - What happens when uri is the same but names are different\n if existing_artifact and len(existing_artifact) != 0:\n existing_artifact = existing_artifact[0]\n\n # Quick fix- Updating only the name\n if custom_properties is not None:\n self.update_existing_artifact(\n existing_artifact, custom_properties)\n uri = c_hash\n # update url for existing artifact\n self.update_dataset_url(existing_artifact, props.get(\"url\", \"\"))\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=uri,\n input_name=url,\n event_type=event_type,\n )\n else:\n # if((existing_artifact and len(existing_artifact )!= 0) and c_hash != \"\"):\n # url = url + \":\" + str(self.execution.id)\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=url,\n type_name=\"Dataset\",\n event_type=event_type,\n properties={\n \"git_repo\": str(git_repo),\n \"Commit\": str(dataset_commit),\n \"url\": props.get(\"url\", \" \"),\n },\n artifact_type_properties={\n \"git_repo\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n custom_props[\"git_repo\"] = git_repo\n custom_props[\"Commit\"] = dataset_commit\n self.execution_label_props[\"git_repo\"] = git_repo\n self.execution_label_props[\"Commit\"] = dataset_commit\n\n if self.graph:\n self.driver.create_dataset_node(\n name,\n url,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n 
self.driver.create_execution_links(uri, name, \"Dataset\")\n else:\n child_artifact = {\n \"Name\": name,\n \"Path\": url,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Dataset\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_model","title":"log_model(path, event, model_framework='Default', model_type='Default', model_name='Default', custom_properties=None)
","text":"Logs a model. The model is added to dvc and the metadata file (.dvc) gets committed to git. Example:
artifact: mlmd.proto.Artifact= cmf.log_model(\n path=\"path/to/model.pkl\",\n event=\"output\",\n model_framework=\"SKlearn\",\n model_type=\"RandomForestClassifier\",\n model_name=\"RandomForestClassifier:default\"\n)\n
Args: path: Path to the model file. event: Takes arguments INPUT
OR OUTPUT
. model_framework: Framework used to create the model. model_type: Type of model algorithm used. model_name: Name of the algorithm used. custom_properties: The model properties. Returns: Artifact object from ML Metadata library associated with the new model artifact. Source code in cmflib/cmf.py
def log_model(\n self,\n path: str,\n event: str,\n model_framework: str = \"Default\",\n model_type: str = \"Default\",\n model_name: str = \"Default\",\n custom_properties: t.Optional[t.Dict] = None,\n) -> mlpb.Artifact:\n \"\"\"Logs a model.\n The model is added to dvc and the metadata file (.dvc) gets committed to git.\n Example:\n ```python\n artifact: mlmd.proto.Artifact= cmf.log_model(\n path=\"path/to/model.pkl\",\n event=\"output\",\n model_framework=\"SKlearn\",\n model_type=\"RandomForestClassifier\",\n model_name=\"RandomForestClassifier:default\"\n )\n ```\n Args:\n path: Path to the model file.\n event: Takes arguments `INPUT` OR `OUTPUT`.\n model_framework: Framework used to create the model.\n model_type: Type of model algorithm used.\n model_name: Name of the algorithm used.\n custom_properties: The model properties.\n Returns:\n Artifact object from ML Metadata library associated with the new model artifact.\n \"\"\"\n\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.execution:\n self.create_execution(execution_type=name_without_extension)\n assert self.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n\n # To Do : Technical Debt. \n # If the model already exist , then we just link the existing model to the execution\n # We do not update the model properties . 
\n # We need to append the new properties to the existing model properties\n if custom_properties is None:\n custom_properties = {}\n custom_props = {} if custom_properties is None else custom_properties\n # name = re.split('/', path)[-1]\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n commit_output(path, self.execution.id)\n c_hash = dvc_get_hash(path)\n\n if c_hash == \"\":\n print(\"Error in getting the dvc hash,return without logging\")\n return\n\n model_commit = c_hash\n\n # If connecting to an existing artifact - The name of the artifact is\n # used as path/steps/key\n model_uri = path + \":\" + c_hash\n dvc_url = dvc_get_url(path, False)\n url = dvc_url\n url_with_pipeline = f\"{self.parent_context.name}:{url}\"\n uri = \"\"\n if c_hash and c_hash.strip():\n uri = c_hash.strip()\n existing_artifact.extend(self.store.get_artifacts_by_uri(uri))\n else:\n raise RuntimeError(\"Model commit failed, Model uri empty\")\n\n if (\n existing_artifact\n and len(existing_artifact) != 0\n ):\n # update url for existing artifact\n existing_artifact = self.update_model_url(\n existing_artifact, url_with_pipeline\n )\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=c_hash,\n input_name=model_uri,\n event_type=event_type,\n )\n model_uri = artifact.name\n else:\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n model_uri = model_uri + \":\" + str(self.execution.id)\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=model_uri,\n type_name=\"Model\",\n event_type=event_type,\n properties={\n \"model_framework\": str(model_framework),\n \"model_type\": str(model_type),\n \"model_name\": str(model_name),\n # passing c_hash value to commit\n \"Commit\": str(model_commit),\n \"url\": str(url_with_pipeline),\n },\n artifact_type_properties={\n \"model_framework\": mlpb.STRING,\n \"model_type\": mlpb.STRING,\n \"model_name\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n # custom_properties[\"Commit\"] = model_commit\n self.execution_label_props[\"Commit\"] = model_commit\n #To DO model nodes should be similar to dataset nodes when we create neo4j\n if self.graph:\n self.driver.create_model_node(\n model_uri,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n self.driver.create_execution_links(uri, model_uri, \"Model\")\n else:\n child_artifact = {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n os.chdir(logging_dir)\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_model_with_version","title":"log_model_with_version(path, event, props=None, custom_properties=None)
","text":"Logs a model when the version(hash) is known The model is added to dvc and the metadata file (.dvc) gets committed to git. Example:
artifact: mlmd.proto.Artifact= cmf.log_model_with_version(\n path=\"path/to/model.pkl\",\n event=\"output\",\n props={\n \"url\": \"/home/user/local-storage/bf/629ccd5cd008066b72c04f9a918737\",\n \"model_type\": \"RandomForestClassifier\",\n \"model_name\": \"RandomForestClassifier:default\",\n \"Commit\": \"commit 1146dad8b74cae205db6a3132ea403db1e4032e5\",\n \"model_framework\": \"SKlearn\",\n },\n custom_properties={\n \"uri\": \"bf629ccd5cd008066b72c04f9a918737\",\n },\n\n)\n
Args: path: Path to the model file. event: Takes arguments INPUT
OR OUTPUT
. props: Model artifact properties. custom_properties: The model properties. Returns: Artifact object from ML Metadata library associated with the new model artifact. Source code in cmflib/cmf.py
def log_model_with_version(\n self,\n path: str,\n event: str,\n props=None,\n custom_properties: t.Optional[t.Dict] = None,\n) -> object:\n \"\"\"Logs a model when the version(hash) is known\n The model is added to dvc and the metadata file (.dvc) gets committed to git.\n Example:\n ```python\n artifact: mlmd.proto.Artifact= cmf.log_model_with_version(\n path=\"path/to/model.pkl\",\n event=\"output\",\n props={\n \"url\": \"/home/user/local-storage/bf/629ccd5cd008066b72c04f9a918737\",\n \"model_type\": \"RandomForestClassifier\",\n \"model_name\": \"RandomForestClassifier:default\",\n \"Commit\": \"commit 1146dad8b74cae205db6a3132ea403db1e4032e5\",\n \"model_framework\": \"SKlearn\",\n },\n custom_properties={\n \"uri\": \"bf629ccd5cd008066b72c04f9a918737\",\n },\n\n )\n ```\n Args:\n path: Path to the model file.\n event: Takes arguments `INPUT` OR `OUTPUT`.\n props: Model artifact properties.\n custom_properties: The model properties.\n Returns:\n Artifact object from ML Metadata library associated with the new model artifact.\n \"\"\"\n\n if custom_properties is None:\n custom_properties = {}\n custom_props = {} if custom_properties is None else custom_properties\n name = re.split(\"/\", path)[-1]\n event_type = mlpb.Event.Type.OUTPUT\n existing_artifact = []\n if event.lower() == \"input\":\n event_type = mlpb.Event.Type.INPUT\n\n # props[\"commit\"] = \"\" # To do get from incoming data\n c_hash = props.get(\"uri\", \" \")\n # If connecting to an existing artifact - The name of the artifact is used as path/steps/key\n model_uri = path + \":\" + c_hash\n # dvc_url = dvc_get_url(path, False)\n url = props.get(\"url\", \"\")\n # uri = \"\"\n if c_hash and c_hash.strip():\n uri = c_hash.strip()\n existing_artifact.extend(self.store.get_artifacts_by_uri(uri))\n else:\n raise RuntimeError(\"Model commit failed, Model uri empty\")\n\n if (\n existing_artifact\n and len(existing_artifact) != 0\n ):\n # update url for existing artifact\n existing_artifact = self.update_model_url(existing_artifact, url)\n artifact = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=c_hash,\n input_name=model_uri,\n event_type=event_type,\n )\n model_uri = artifact.name\n else:\n uri = c_hash if c_hash and c_hash.strip() else str(uuid.uuid1())\n model_uri = model_uri + \":\" + str(self.execution.id)\n artifact = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=model_uri,\n type_name=\"Model\",\n event_type=event_type,\n properties={\n \"model_framework\": props.get(\"model_framework\", \"\"),\n \"model_type\": props.get(\"model_type\", \"\"),\n \"model_name\": props.get(\"model_name\", \"\"),\n \"Commit\": props.get(\"Commit\", \"\"),\n \"url\": str(url),\n },\n artifact_type_properties={\n \"model_framework\": mlpb.STRING,\n \"model_type\": mlpb.STRING,\n \"model_name\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n # custom_properties[\"Commit\"] = model_commit\n # custom_props[\"url\"] = url\n self.execution_label_props[\"Commit\"] = props.get(\"Commit\", \"\")\n if self.graph:\n self.driver.create_model_node(\n model_uri,\n uri,\n event,\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n if event.lower() == \"input\":\n self.input_artifacts.append(\n {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": 
self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n )\n self.driver.create_execution_links(uri, model_uri, \"Model\")\n else:\n child_artifact = {\n \"Name\": model_uri,\n \"URI\": uri,\n \"Event\": event.lower(),\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Model\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n\n return artifact\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_execution_metrics_from_client","title":"log_execution_metrics_from_client(metrics_name, custom_properties=None)
","text":"Logs execution metrics from a client. Data from pre-existing metrics from client side is used to create identical metrics on server side. Example:
artifact: mlpb.Artifact = cmf.log_execution_metrics_from_client( \n metrics_name=\"example_metrics:uri:123\", \n custom_properties={\"custom_key\": \"custom_value\"}, \n )\n
Args: metrics_name: Name of the metrics in the format \"name:uri:execution_id\". custom_properties: Optional custom properties for the metrics. Returns: Artifact object from the ML Protocol Buffers library associated with the metrics artifact. Source code in cmflib/cmf.py
def log_execution_metrics_from_client(self, metrics_name: str,\n custom_properties: t.Optional[t.Dict] = None) -> mlpb.Artifact:\n \"\"\" Logs execution metrics from a client.\n Data from pre-existing metrics from client side is used to create identical metrics on server side. \n Example: \n ```python \n artifact: mlpb.Artifact = cmf.log_execution_metrics_from_client( \n metrics_name=\"example_metrics:uri:123\", \n custom_properties={\"custom_key\": \"custom_value\"}, \n )\n ``` \n Args: \n metrics_name: Name of the metrics in the format \"name:uri:execution_id\". \n custom_properties: Optional custom properties for the metrics. \n Returns: \n Artifact object from the ML Protocol Buffers library associated with the metrics artifact.\n \"\"\"\n\n metrics = None\n custom_props = {} if custom_properties is None else custom_properties\n existing_artifact = []\n name_tokens = metrics_name.split(\":\")\n if name_tokens and len(name_tokens) > 2:\n name = name_tokens[0]\n uri = name_tokens[1]\n execution_id = name_tokens[2]\n else:\n print(f\"Error : metrics name {metrics_name} is not in the correct format\")\n return \n\n #we need to add the execution id to the metrics name\n new_metrics_name = f\"{name}:{uri}:{str(self.execution.id)}\"\n existing_artifacts = self.store.get_artifacts_by_uri(uri)\n\n existing_artifact = existing_artifacts[0] if existing_artifacts else None\n if not existing_artifact or \\\n ((existing_artifact) and not\n (existing_artifact.name == new_metrics_name)): #we need to add the artifact otherwise its already there \n metrics = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=new_metrics_name,\n type_name=\"Metrics\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\"metrics_name\": metrics_name},\n artifact_type_properties={\"metrics_name\": mlpb.STRING},\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.graph:\n # To do create execution_links\n self.driver.create_metrics_node(\n metrics_name,\n uri,\n \"output\",\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n child_artifact = {\n \"Name\": metrics_name,\n \"URI\": uri,\n \"Event\": \"output\",\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Metrics\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n return metrics\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_execution_metrics","title":"log_execution_metrics(metrics_name, custom_properties=None)
","text":"Log the metadata associated with the execution (coarse-grained tracking). It is stored as a metrics artifact. This does not have a backing physical file, unlike other artifacts that we have. Example:
exec_metrics: mlpb.Artifact = cmf.log_execution_metrics(\n    metrics_name=\"Training_Metrics\",\n    custom_properties={\"auc\": auc, \"loss\": loss}\n)\n
Args: metrics_name: Name to identify the metrics. custom_properties: Dictionary with metric values. Returns: Artifact object from ML Metadata library associated with the new coarse-grained metrics artifact. Source code in cmflib/cmf.py
def log_execution_metrics(\n self, metrics_name: str, custom_properties: t.Optional[t.Dict] = None\n) -> mlpb.Artifact:\n \"\"\"Log the metadata associated with the execution (coarse-grained tracking).\n It is stored as a metrics artifact. This does not have a backing physical file, unlike other artifacts that we\n have.\n Example:\n ```python\n exec_metrics: mlpb.Artifact = cmf.log_execution_metrics(\n metrics_name=\"Training_Metrics\",\n {\"auc\": auc, \"loss\": loss}\n )\n ```\n Args:\n metrics_name: Name to identify the metrics.\n custom_properties: Dictionary with metric values.\n Returns:\n Artifact object from ML Metadata library associated with the new coarse-grained metrics artifact.\n \"\"\"\n logging_dir = change_dir(self.cmf_init_path)\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.child_context:\n self.create_context(pipeline_stage=name_without_extension)\n assert self.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.execution:\n self.create_execution(execution_type=name_without_extension)\n assert self.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n custom_props = {} if custom_properties is None else custom_properties\n uri = str(uuid.uuid1())\n metrics_name = metrics_name + \":\" + uri + \":\" + str(self.execution.id)\n metrics = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=metrics_name,\n type_name=\"Metrics\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\"metrics_name\": metrics_name},\n artifact_type_properties={\"metrics_name\": mlpb.STRING},\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.graph:\n # To do create execution_links\n self.driver.create_metrics_node(\n metrics_name,\n uri,\n \"output\",\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n child_artifact = {\n \"Name\": metrics_name,\n \"URI\": uri,\n \"Event\": \"output\",\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Metrics\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n \"Pipeline_Name\": self.parent_context.name,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n os.chdir(logging_dir)\n return metrics\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.log_metric","title":"log_metric(metrics_name, custom_properties=None)
","text":"Stores the fine-grained (per step or per epoch) metrics to memory. The metrics provided are stored in a parquet file. The commit_metrics
call adds the parquet file to the version control framework. The metrics written in the parquet file can be retrieved using the read_metrics
call. Example:
# Can be called at every epoch or every step in the training. This is logged to a parquet file and committed\n# at the commit stage.\n# Inside training loop\nwhile True:\n cmf.log_metric(\"training_metrics\", {\"train_loss\": train_loss})\ncmf.commit_metrics(\"training_metrics\")\n
Args: metrics_name: Name to identify the metrics. custom_properties: Dictionary with metrics. Source code in cmflib/cmf.py
def log_metric(\n self, metrics_name: str, custom_properties: t.Optional[t.Dict] = None\n) -> None:\n \"\"\"Stores the fine-grained (per step or per epoch) metrics to memory.\n The metrics provided are stored in a parquet file. The `commit_metrics` call add the parquet file in the version\n control framework. The metrics written in the parquet file can be retrieved using the `read_metrics` call.\n Example:\n ```python\n # Can be called at every epoch or every step in the training. This is logged to a parquet file and committed\n # at the commit stage.\n # Inside training loop\n while True:\n cmf.log_metric(\"training_metrics\", {\"train_loss\": train_loss})\n cmf.commit_metrics(\"training_metrics\")\n ```\n Args:\n metrics_name: Name to identify the metrics.\n custom_properties: Dictionary with metrics.\n \"\"\"\n if metrics_name in self.metrics:\n key = max((self.metrics[metrics_name]).keys()) + 1\n self.metrics[metrics_name][key] = custom_properties\n else:\n self.metrics[metrics_name] = {}\n self.metrics[metrics_name][1] = custom_properties\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.commit_existing_metrics","title":"commit_existing_metrics(metrics_name, uri, props=None, custom_properties=None)
","text":"Commits existing metrics associated with the given URI to MLMD. Example:
artifact: mlpb.Artifact = cmf.commit_existing_metrics(\"existing_metrics\", \"abc123\",\n {\"custom_key\": \"custom_value\"})\n
Args: metrics_name: Name of the metrics. uri: Unique identifier associated with the metrics. props: Optional properties for the metrics (e.g., Commit and url). custom_properties: Optional custom properties for the metrics. Returns: Artifact object from the ML Protocol Buffers library associated with the existing metrics artifact. Source code in cmflib/cmf.py
def commit_existing_metrics(self, metrics_name: str, uri: str, props: t.Optional[t.Dict] = None, custom_properties: t.Optional[t.Dict] = None):\n \"\"\"\n Commits existing metrics associated with the given URI to MLMD.\n Example:\n ```python\n artifact: mlpb.Artifact = cmf.commit_existing_metrics(\"existing_metrics\", \"abc123\",\n {\"custom_key\": \"custom_value\"})\n ```\n Args:\n metrics_name: Name of the metrics.\n uri: Unique identifier associated with the metrics.\n custom_properties: Optional custom properties for the metrics.\n Returns:\n Artifact object from the ML Protocol Buffers library associated with the existing metrics artifact.\n \"\"\"\n\n custom_props = {} if custom_properties is None else custom_properties\n c_hash = uri.strip()\n existing_artifact = []\n existing_artifact.extend(self.store.get_artifacts_by_uri(c_hash))\n if (existing_artifact\n and len(existing_artifact) != 0 ):\n metrics = link_execution_to_artifact(\n store=self.store,\n execution_id=self.execution.id,\n uri=c_hash,\n input_name=metrics_name,\n event_type=mlpb.Event.Type.OUTPUT,\n )\n else:\n metrics = create_new_artifact_event_and_attribution(\n store=self.store,\n execution_id=self.execution.id,\n context_id=self.child_context.id,\n uri=uri,\n name=metrics_name,\n type_name=\"Step_Metrics\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\n # passing uri value to commit\n \"Commit\": props.get(\"Commit\", \"\"),\n \"url\": props.get(\"url\", \"\"),\n },\n artifact_type_properties={\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.graph:\n self.driver.create_metrics_node(\n metrics_name,\n uri,\n \"output\",\n self.execution.id,\n self.parent_context,\n custom_props,\n )\n child_artifact = {\n \"Name\": metrics_name,\n \"URI\": uri,\n \"Event\": \"output\",\n \"Execution_Name\": self.execution_name,\n \"Type\": \"Metrics\",\n \"Execution_Command\": self.execution_command,\n \"Pipeline_Id\": self.parent_context.id,\n }\n self.driver.create_artifact_relationships(\n self.input_artifacts, child_artifact, self.execution_label_props\n )\n return metrics\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.create_dataslice","title":"create_dataslice(name)
","text":"Creates a dataslice object. Once created, users can add data instances to this data slice with add_data method. Users are also responsible for committing data slices by calling the commit method. Example:
dataslice = cmf.create_dataslice(\"slice-a\")\n
Args: name: Name to identify the dataslice. Returns:
Type DescriptionDataSlice
Instance of a newly created DataSlice.
Source code incmflib/cmf.py
def create_dataslice(self, name: str) -> \"Cmf.DataSlice\":\n \"\"\"Creates a dataslice object.\n Once created, users can add data instances to this data slice with [add_data][cmflib.cmf.Cmf.DataSlice.add_data]\n method. Users are also responsible for committing data slices by calling the\n [commit][cmflib.cmf.Cmf.DataSlice.commit] method.\n Example:\n ```python\n dataslice = cmf.create_dataslice(\"slice-a\")\n ```\n Args:\n name: Name to identify the dataslice.\n\n Returns:\n Instance of a newly created [DataSlice][cmflib.cmf.Cmf.DataSlice].\n \"\"\"\n return Cmf.DataSlice(name, self)\n
"},{"location":"api/public/cmf/#cmflib.cmf.Cmf.update_dataslice","title":"update_dataslice(name, record, custom_properties)
","text":"Updates a dataslice record in a Parquet file with the provided custom properties. Example:
dataslice=cmf.update_dataslice(\"dataslice_file.parquet\", \"record_id\", \n {\"key1\": \"updated_value\"})\n
Args: name: Name of the Parquet file. record: Identifier of the dataslice record to be updated. custom_properties: Dictionary containing custom properties to update. Returns:
Type DescriptionNone
Source code incmflib/cmf.py
def update_dataslice(self, name: str, record: str, custom_properties: t.Dict):\n \"\"\"Updates a dataslice record in a Parquet file with the provided custom properties.\n Example:\n ```python\n dataslice=cmf.update_dataslice(\"dataslice_file.parquet\", \"record_id\", \n {\"key1\": \"updated_value\"})\n ```\n Args:\n name: Name of the Parquet file.\n record: Identifier of the dataslice record to be updated.\n custom_properties: Dictionary containing custom properties to update.\n\n Returns:\n None\n \"\"\"\n directory_path = os.path.join(self.ARTIFACTS_PATH, self.execution.properties[\"Execution_uuid\"].string_value.split(',')[0], self.DATASLICE_PATH)\n name = os.path.join(directory_path, name)\n df = pd.read_parquet(name)\n temp_dict = df.to_dict(\"index\")\n temp_dict[record].update(custom_properties)\n dataslice_df = pd.DataFrame.from_dict(temp_dict, orient=\"index\")\n dataslice_df.index.names = [\"Path\"]\n dataslice_df.to_parquet(name)\n
"},{"location":"api/public/cmf/#cmflib.cmf.cmf_init_show","title":"cmf_init_show()
","text":"Initializes and shows details of the CMF command. Example:
result = cmf_init_show() \n
Returns: Output from the _cmf_cmd_init function. Source code in cmflib/cmf.py
def cmf_init_show():\n \"\"\" Initializes and shows details of the CMF command. \n Example: \n ```python \n result = cmf_init_show() \n ``` \n Returns: \n Output from the _cmf_cmd_init function. \n \"\"\"\n\n output=_cmf_cmd_init()\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.cmf_init","title":"cmf_init(type='', path='', git_remote_url='', cmf_server_url='', neo4j_user='', neo4j_password='', neo4j_uri='', url='', endpoint_url='', access_key_id='', secret_key='', session_token='', user='', password='', port=0, osdf_path='', key_id='', key_path='', key_issuer='')
","text":"Initializes the CMF configuration based on the provided parameters. Example:
cmf_init( type=\"local\", \n path=\"/path/to/re\",\n git_remote_url=\"git@github.com:user/repo.git\",\n cmf_server_url=\"http://cmf-server\"\n neo4j_user\", \n neo4j_password=\"password\",\n neo4j_uri=\"bolt://localhost:76\"\n )\n
Args: type: Type of repository (\"local\", \"minioS3\", \"amazonS3\", \"sshremote\", \"osdfremote\"). path: Path for the local repository. git_remote_url: Git remote URL for version control. cmf_server_url: CMF server URL. neo4j_user: Neo4j database username. neo4j_password: Neo4j database password. neo4j_uri: Neo4j database URI. url: URL for MinioS3 or AmazonS3. endpoint_url: Endpoint URL for MinioS3. access_key_id: Access key ID for MinioS3 or AmazonS3. secret_key: Secret key for MinioS3 or AmazonS3. session_token: Session token for AmazonS3. user: SSH remote username. password: SSH remote password. port: SSH remote port. Returns: Output based on the initialized repository type. Source code in cmflib/cmf.py
def cmf_init(type: str = \"\",\n path: str = \"\",\n git_remote_url: str = \"\",\n cmf_server_url: str = \"\",\n neo4j_user: str = \"\",\n neo4j_password: str = \"\",\n neo4j_uri: str = \"\",\n url: str = \"\",\n endpoint_url: str = \"\",\n access_key_id: str = \"\",\n secret_key: str = \"\",\n session_token: str = \"\",\n user: str = \"\",\n password: str = \"\",\n port: int = 0,\n osdf_path: str = \"\",\n key_id: str = \"\",\n key_path: str = \"\",\n key_issuer: str = \"\",\n ):\n\n \"\"\" Initializes the CMF configuration based on the provided parameters. \n Example:\n ```python\n cmf_init( type=\"local\", \n path=\"/path/to/re\",\n git_remote_url=\"git@github.com:user/repo.git\",\n cmf_server_url=\"http://cmf-server\"\n neo4j_user\", \n neo4j_password=\"password\",\n neo4j_uri=\"bolt://localhost:76\"\n )\n ```\n Args: \n type: Type of repository (\"local\", \"minioS3\", \"amazonS3\", \"sshremote\")\n path: Path for the local repository. \n git_remote_url: Git remote URL for version control.\n cmf_server_url: CMF server URL.\n neo4j_user: Neo4j database username.\n neo4j_password: Neo4j database password.\n neo4j_uri: Neo4j database URI.\n url: URL for MinioS3 or AmazonS3.\n endpoint_url: Endpoint URL for MinioS3.\n access_key_id: Access key ID for MinioS3 or AmazonS3.\n secret_key: Secret key for MinioS3 or AmazonS3. \n session_token: Session token for AmazonS3.\n user: SSH remote username.\n password: SSH remote password. \n port: SSH remote port\n Returns:\n Output based on the initialized repository type.\n \"\"\"\n\n if type == \"\":\n return print(\"Error: Type is not provided\")\n if type not in [\"local\",\"minioS3\",\"amazonS3\",\"sshremote\",\"osdfremote\"]:\n return print(\"Error: Type value is undefined\"+ \" \"+type+\".Expected: \"+\",\".join([\"local\",\"minioS3\",\"amazonS3\",\"sshremote\",\"osdfremote\"]))\n\n if neo4j_user != \"\" and neo4j_password != \"\" and neo4j_uri != \"\":\n pass\n elif neo4j_user == \"\" and neo4j_password == \"\" and neo4j_uri == \"\":\n pass\n else:\n return print(\"Error: Enter all neo4j parameters.\") \n\n args={'path': path,\n 'git_remote_url': git_remote_url,\n 'url': url,\n 'endpoint_url': endpoint_url,\n 'access_key_id': access_key_id,\n 'secret_key': secret_key,\n 'session_token': session_token,\n 'user': user,\n 'password': password,\n 'osdf_path': osdf_path,\n 'key_id': key_id,\n 'key_path': key_path, \n 'key-issuer': key_issuer,\n }\n\n status_args=non_related_args(type, args)\n\n if type == \"local\" and path != \"\" and git_remote_url != \"\" :\n \"\"\"Initialize local repository\"\"\"\n output = _init_local(\n path, git_remote_url, cmf_server_url, neo4j_user, neo4j_password, neo4j_uri\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n return output\n\n elif type == \"minioS3\" and url != \"\" and endpoint_url != \"\" and access_key_id != \"\" and secret_key != \"\" and git_remote_url != \"\":\n \"\"\"Initialize minioS3 repository\"\"\"\n output = _init_minioS3(\n url,\n endpoint_url,\n access_key_id,\n secret_key,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n return output\n\n elif type == \"amazonS3\" and url != \"\" and access_key_id != \"\" and secret_key != \"\" and git_remote_url != \"\":\n \"\"\"Initialize amazonS3 repository\"\"\"\n output = _init_amazonS3(\n url,\n access_key_id,\n 
secret_key,\n session_token,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n\n return output\n\n elif type == \"sshremote\" and path != \"\" and user != \"\" and port != 0 and password != \"\" and git_remote_url != \"\":\n \"\"\"Initialize sshremote repository\"\"\"\n output = _init_sshremote(\n path,\n user,\n port,\n password,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n\n return output\n\n elif type == \"osdfremote\" and osdf_path != \"\" and key_id != \"\" and key_path != 0 and key_issuer != \"\" and git_remote_url != \"\":\n \"\"\"Initialize osdfremote repository\"\"\"\n output = _init_osdfremote(\n osdf_path,\n key_id,\n key_path,\n key_issuer,\n git_remote_url,\n cmf_server_url,\n neo4j_user,\n neo4j_password,\n neo4j_uri,\n )\n if status_args != []:\n print(\"There are non-related arguments: \"+\",\".join(status_args)+\".Please remove them.\")\n\n return output\n\n else:\n print(\"Error: Enter all arguments\")\n
"},{"location":"api/public/cmf/#cmflib.cmf.metadata_push","title":"metadata_push(pipeline_name, filepath='./mlmd', tensorboard_path='', execution_id='')
","text":"Pushes MLMD file to CMF-server. Example:
result = metadata_push(\"example_pipeline\", \"mlmd_file\", \"3\")\n
Args: pipeline_name: Name of the pipeline. filepath: Path to the MLMD file. execution_id: Optional execution ID. tensorboard_path: Path to tensorboard logs. Returns:
Type DescriptionResponse output from the _metadata_push function.
Source code incmflib/cmf.py
def metadata_push(pipeline_name: str, filepath = \"./mlmd\", tensorboard_path: str = \"\", execution_id: str = \"\"):\n \"\"\" Pushes MLMD file to CMF-server.\n Example:\n ```python\n result = metadata_push(\"example_pipeline\", \"mlmd_file\", \"3\")\n ```\n Args:\n pipeline_name: Name of the pipeline.\n filepath: Path to the MLMD file.\n execution_id: Optional execution ID.\n tensorboard_path: Path to tensorboard logs.\n\n Returns:\n Response output from the _metadata_push function.\n \"\"\"\n # Required arguments: pipeline_name\n # Optional arguments: Execution_ID, filepath (mlmd file path, tensorboard_path\n output = _metadata_push(pipeline_name, filepath, execution_id, tensorboard_path)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.metadata_pull","title":"metadata_pull(pipeline_name, filepath='./mlmd', execution_id='')
","text":"Pulls MLMD file from CMF-server. Example:
result = metadata_pull(\"example_pipeline\", \"./mlmd_directory\", \"execution_123\") \n
Args: pipeline_name: Name of the pipeline. filepath: File path to store the MLMD file. execution_id: Optional execution ID. Returns: Message from the _metadata_pull function. Source code in cmflib/cmf.py
def metadata_pull(pipeline_name: str, filepath = \"./mlmd\", execution_id: str = \"\"):\n \"\"\" Pulls MLMD file from CMF-server. \n Example: \n ```python \n result = metadata_pull(\"example_pipeline\", \"./mlmd_directory\", \"execution_123\") \n ``` \n Args: \n pipeline_name: Name of the pipeline. \n filepath: File path to store the MLMD file. \n execution_id: Optional execution ID. \n Returns: \n Message from the _metadata_pull function. \n \"\"\"\n # Required arguments: pipeline_name \n #Optional arguments: Execution_ID, filepath(file path to store mlmd file) \n output = _metadata_pull(pipeline_name, filepath, execution_id)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.artifact_pull","title":"artifact_pull(pipeline_name, filepath='./mlmd')
","text":"Pulls artifacts from the initialized repository.
Example:
result = artifact_pull(\"example_pipeline\", \"./mlmd_directory\")\n
Parameters:
Name Type Description Defaultpipeline_name
str
Name of the pipeline.
requiredfilepath
Path to store artifacts.
'./mlmd'
Returns: Output from the _artifact_pull function.
Source code incmflib/cmf.py
def artifact_pull(pipeline_name: str, filepath = \"./mlmd\"):\n \"\"\" Pulls artifacts from the initialized repository.\n\n Example:\n ```python\n result = artifact_pull(\"example_pipeline\", \"./mlmd_directory\")\n ```\n\n Args:\n pipeline_name: Name of the pipeline.\n filepath: Path to store artifacts.\n Returns:\n Output from the _artifact_pull function.\n \"\"\"\n\n # Required arguments: Pipeline_name\n # Optional arguments: filepath( path to store artifacts)\n output = _artifact_pull(pipeline_name, filepath)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.artifact_pull_single","title":"artifact_pull_single(pipeline_name, filepath, artifact_name)
","text":"Pulls a single artifact from the initialized repository. Example:
result = artifact_pull_single(\"example_pipeline\", \"./mlmd_directory\", \"example_artifact\") \n
Args: pipeline_name: Name of the pipeline. filepath: Path to store the artifact. artifact_name: Name of the artifact. Returns: Output from the _artifact_pull_single function. Source code in cmflib/cmf.py
def artifact_pull_single(pipeline_name: str, filepath: str, artifact_name: str):\n \"\"\" Pulls a single artifact from the initialized repository. \n Example: \n ```python \n result = artifact_pull_single(\"example_pipeline\", \"./mlmd_directory\", \"example_artifact\") \n ```\n Args: \n pipeline_name: Name of the pipeline. \n filepath: Path to store the artifact. \n artifact_name: Name of the artifact. \n Returns:\n Output from the _artifact_pull_single function. \n \"\"\"\n\n # Required arguments: Pipeline_name\n # Optional arguments: filepath( path to store artifacts), artifact_name\n output = _artifact_pull_single(pipeline_name, filepath, artifact_name)\n return output\n
"},{"location":"api/public/cmf/#cmflib.cmf.artifact_push","title":"artifact_push(pipeline_name, filepath='./mlmd')
","text":"Pushes artifacts to the initialized repository.
Example:
result = artifact_push(\"example_pipeline\", \"./mlmd_directory\")\n
Args: pipeline_name: Name of the pipeline. filepath: Path to store the artifact. Returns: Output from the _artifact_push function. Source code in cmflib/cmf.py
def artifact_push(pipeline_name: str, filepath = \"./mlmd\"):\n \"\"\" Pushes artifacts to the initialized repository.\n\n Example:\n ```python\n result = artifact_push(\"example_pipeline\", \"./mlmd_directory\")\n ```\n Args: \n pipeline_name: Name of the pipeline. \n filepath: Path to store the artifact. \n Returns:\n Output from the _artifact_push function.\n \"\"\"\n\n output = _artifact_push(pipeline_name, filepath)\n return output\n
"},{"location":"api/public/cmfquery/","title":"cmflib.cmfquery.CmfQuery","text":" Bases: object
CMF Query communicates with the MLMD database and implements basic search and retrieval functionality.
This class has been designed to work with the CMF framework. CMF alters names of pipelines, stages and artifacts in various ways. This means that actual names in the MLMD database will be different from those originally provided by users via CMF API. When methods in this class accept name
parameters, it is expected that values of these parameters are fully-qualified names of respective entities.
Parameters:
Name Type Description Defaultfilepath
str
Path to the MLMD database file.
'mlmd'
Source code in cmflib/cmfquery.py
def __init__(self, filepath: str = \"mlmd\") -> None:\n config = mlpb.ConnectionConfig()\n config.sqlite.filename_uri = filepath\n self.store = metadata_store.MetadataStore(config)\n
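A minimal usage sketch of the query layer, assuming a local MLMD file named mlmd produced by the logging APIs; the pipelines, stages and executions printed are whatever was recorded there:
from cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\n# Walk pipelines -> stages -> executions recorded in the metadata store.\nfor pipeline_name in query.get_pipeline_names():\n    # Stage names are fully qualified by CMF (the pipeline name becomes part of the stage name).\n    for stage_name in query.get_pipeline_stages(pipeline_name):\n        executions = query.get_all_exe_in_stage(stage_name)\n        print(pipeline_name, stage_name, len(executions))\n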
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_pipeline_names","title":"get_pipeline_names()
","text":"Return names of all pipelines.
Returns:
Type DescriptionList[str]
List of all pipeline names.
Source code incmflib/cmfquery.py
def get_pipeline_names(self) -> t.List[str]:\n \"\"\"Return names of all pipelines.\n\n Returns:\n List of all pipeline names.\n \"\"\"\n return [ctx.name for ctx in self._get_pipelines()]\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_pipeline_id","title":"get_pipeline_id(pipeline_name)
","text":"Return pipeline identifier for the pipeline names pipeline_name
. Args: pipeline_name: Name of the pipeline. Returns: Pipeline identifier or -1 if one does not exist.
cmflib/cmfquery.py
def get_pipeline_id(self, pipeline_name: str) -> int:\n \"\"\"Return pipeline identifier for the pipeline names `pipeline_name`.\n Args:\n pipeline_name: Name of the pipeline.\n Returns:\n Pipeline identifier or -1 if one does not exist.\n \"\"\"\n pipeline: t.Optional[mlpb.Context] = self._get_pipeline(pipeline_name)\n return -1 if not pipeline else pipeline.id\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_pipeline_stages","title":"get_pipeline_stages(pipeline_name)
","text":"Return list of pipeline stages for the pipeline with the given name.
Parameters:
Name Type Description Defaultpipeline_name
str
Name of the pipeline for which stages need to be returned. In CMF, there are no different pipelines with the same name.
requiredReturns: List of stage names associated with the given pipeline.
Source code incmflib/cmfquery.py
def get_pipeline_stages(self, pipeline_name: str) -> t.List[str]:\n \"\"\"Return list of pipeline stages for the pipeline with the given name.\n\n Args:\n pipeline_name: Name of the pipeline for which stages need to be returned. In CMF, there are no different\n pipelines with the same name.\n Returns:\n List of stage names associated with the given pipeline.\n \"\"\"\n stages = []\n for pipeline in self._get_pipelines(pipeline_name):\n stages.extend(stage.name for stage in self._get_stages(pipeline.id))\n return stages\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_exe_in_stage","title":"get_all_exe_in_stage(stage_name)
","text":"Return list of all executions for the stage with the given name.
Parameters:
Name Type Description Defaultstage_name
str
Name of the stage. Before stages are recorded in MLMD, they are modified (e.g., pipeline name will become part of the stage name). So stage names from different pipelines will not collide.
requiredReturns: List of executions for the given stage.
Source code incmflib/cmfquery.py
def get_all_exe_in_stage(self, stage_name: str) -> t.List[mlpb.Execution]:\n \"\"\"Return list of all executions for the stage with the given name.\n\n Args:\n stage_name: Name of the stage. Before stages are recorded in MLMD, they are modified (e.g., pipeline name\n will become part of the stage name). So stage names from different pipelines will not collide.\n Returns:\n List of executions for the given stage.\n \"\"\"\n for pipeline in self._get_pipelines():\n for stage in self._get_stages(pipeline.id):\n if stage.name == stage_name:\n return self.store.get_executions_by_context(stage.id)\n return []\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_executions_by_ids_list","title":"get_all_executions_by_ids_list(exe_ids)
","text":"Return executions for given execution ids list as a pandas data frame.
Parameters:
Name Type Description Defaultexe_ids
List[int]
List of execution identifiers.
requiredReturns:
Type DescriptionDataFrame
Data frame with all executions for the list of given execution identifiers.
Source code incmflib/cmfquery.py
def get_all_executions_by_ids_list(self, exe_ids: t.List[int]) -> pd.DataFrame:\n \"\"\"Return executions for given execution ids list as a pandas data frame.\n\n Args:\n exe_ids: List of execution identifiers.\n\n Returns:\n Data frame with all executions for the list of given execution identifiers.\n \"\"\"\n\n df = pd.DataFrame()\n executions = self.store.get_executions_by_id(exe_ids)\n for exe in executions:\n d1 = self._transform_to_dataframe(exe)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts_by_context","title":"get_all_artifacts_by_context(pipeline_name)
","text":"Return artifacts for given pipeline name as a pandas data frame.
Parameters:
Name Type Description Defaultpipeline_name
str
Name of the pipeline.
requiredReturns:
Type DescriptionDataFrame
Data frame with all artifacts associated with given pipeline name.
Source code incmflib/cmfquery.py
def get_all_artifacts_by_context(self, pipeline_name: str) -> pd.DataFrame:\n \"\"\"Return artifacts for given pipeline name as a pandas data frame.\n\n Args:\n pipeline_name: Name of the pipeline.\n\n Returns:\n Data frame with all artifacts associated with given pipeline name.\n \"\"\"\n df = pd.DataFrame()\n contexts = self.store.get_contexts_by_type(\"Parent_Context\")\n context_id = self.get_pipeline_id(pipeline_name)\n for ctx in contexts:\n if ctx.id == context_id:\n child_contexts = self.store.get_children_contexts_by_context(ctx.id)\n for cc in child_contexts:\n artifacts = self.store.get_artifacts_by_context(cc.id)\n for art in artifacts:\n d1 = self.get_artifact_df(art)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
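A short usage sketch, assuming a pipeline named example_pipeline exists in the local mlmd file:
from cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\nartifacts_df = query.get_all_artifacts_by_context(\"example_pipeline\")\n# Columns include id, name, type, uri and the create/update timestamps.\nprint(artifacts_df if artifacts_df.empty else artifacts_df[[\"name\", \"type\", \"uri\"]])\n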
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts_by_ids_list","title":"get_all_artifacts_by_ids_list(artifact_ids)
","text":"Return all artifacts for the given artifact ids list.
Parameters:
Name Type Description Defaultartifact_ids
List[int]
List of artifact identifiers
requiredReturns:
Type DescriptionDataFrame
Data frame with all artifacts for the given artifact ids list.
Source code incmflib/cmfquery.py
def get_all_artifacts_by_ids_list(self, artifact_ids: t.List[int]) -> pd.DataFrame:\n \"\"\"Return all artifacts for the given artifact ids list.\n\n Args:\n artifact_ids: List of artifact identifiers\n\n Returns:\n Data frame with all artifacts for the given artifact ids list.\n \"\"\"\n df = pd.DataFrame()\n artifacts = self.store.get_artifacts_by_id(artifact_ids)\n for art in artifacts:\n d1 = self.get_artifact_df(art)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_executions_in_stage","title":"get_all_executions_in_stage(stage_name)
","text":"Return executions of the given stage as pandas data frame. Args: stage_name: Stage name. See doc strings for the prev method. Returns: Data frame with all executions associated with the given stage.
Source code incmflib/cmfquery.py
def get_all_executions_in_stage(self, stage_name: str) -> pd.DataFrame:\n \"\"\"Return executions of the given stage as pandas data frame.\n Args:\n stage_name: Stage name. See doc strings for the prev method.\n Returns:\n Data frame with all executions associated with the given stage.\n \"\"\"\n df = pd.DataFrame()\n for pipeline in self._get_pipelines():\n for stage in self._get_stages(pipeline.id):\n if stage.name == stage_name:\n for execution in self._get_executions(stage.id):\n ex_as_df: pd.DataFrame = self._transform_to_dataframe(\n execution, {\"id\": execution.id, \"name\": execution.name}\n )\n df = pd.concat([df, ex_as_df], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_artifact_df","title":"get_artifact_df(artifact, d=None)
","text":"Return artifact's data frame representation.
Parameters:
Name Type Description Defaultartifact
Artifact
MLMD entity representing artifact.
requiredd
Optional[Dict]
Optional initial content for data frame.
None
Returns: A data frame with the single row containing attributes of this artifact.
Source code incmflib/cmfquery.py
def get_artifact_df(self, artifact: mlpb.Artifact, d: t.Optional[t.Dict] = None) -> pd.DataFrame:\n \"\"\"Return artifact's data frame representation.\n\n Args:\n artifact: MLMD entity representing artifact.\n d: Optional initial content for data frame.\n Returns:\n A data frame with the single row containing attributes of this artifact.\n \"\"\"\n if d is None:\n d = {}\n d.update(\n {\n \"id\": artifact.id,\n \"type\": self.store.get_artifact_types_by_id([artifact.type_id])[0].name,\n \"uri\": artifact.uri,\n \"name\": artifact.name,\n \"create_time_since_epoch\": artifact.create_time_since_epoch,\n \"last_update_time_since_epoch\": artifact.last_update_time_since_epoch,\n }\n )\n return self._transform_to_dataframe(artifact, d)\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts","title":"get_all_artifacts()
","text":"Return names of all artifacts.
Returns:
Type DescriptionList[str]
List of all artifact names.
Source code incmflib/cmfquery.py
def get_all_artifacts(self) -> t.List[str]:\n \"\"\"Return names of all artifacts.\n\n Returns:\n List of all artifact names.\n \"\"\"\n return [artifact.name for artifact in self.store.get_artifacts()]\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_artifact","title":"get_artifact(name)
","text":"Return artifact's data frame representation using artifact name.
Parameters:
Name Type Description Defaultname
str
Artifact name.
requiredReturns: Pandas data frame with one row containing attributes of this artifact.
Source code incmflib/cmfquery.py
def get_artifact(self, name: str) -> t.Optional[pd.DataFrame]:\n \"\"\"Return artifact's data frame representation using artifact name.\n\n Args:\n name: Artifact name.\n Returns:\n Pandas data frame with one row containing attributes of this artifact.\n \"\"\"\n artifact: t.Optional[mlpb.Artifact] = self._get_artifact(name)\n if artifact:\n return self.get_artifact_df(artifact)\n return None\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifacts_for_execution","title":"get_all_artifacts_for_execution(execution_id)
","text":"Return input and output artifacts for the given execution.
Parameters:
Name Type Description Defaultexecution_id
int
Execution identifier.
requiredReturn: Data frame containing input and output artifacts for the given execution, one artifact per row.
Source code incmflib/cmfquery.py
def get_all_artifacts_for_execution(self, execution_id: int) -> pd.DataFrame:\n \"\"\"Return input and output artifacts for the given execution.\n\n Args:\n execution_id: Execution identifier.\n Return:\n Data frame containing input and output artifacts for the given execution, one artifact per row.\n \"\"\"\n df = pd.DataFrame()\n for event in self.store.get_events_by_execution_ids([execution_id]):\n event_type = \"INPUT\" if event.type == mlpb.Event.Type.INPUT else \"OUTPUT\"\n for artifact in self.store.get_artifacts_by_id([event.artifact_id]):\n df = pd.concat(\n [df, self.get_artifact_df(artifact, {\"event\": event_type})], sort=True, ignore_index=True\n )\n return df\n
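A sketch combining this call with get_all_executions_in_stage above; the fully-qualified stage name used here is a hypothetical placeholder and should be looked up with get_pipeline_stages first:
from cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\n# Hypothetical fully-qualified stage name; look it up with get_pipeline_stages() first.\nexecutions_df = query.get_all_executions_in_stage(\"example_pipeline/train\")\nfor execution_id in ([] if executions_df.empty else executions_df[\"id\"].tolist()):\n    artifacts_df = query.get_all_artifacts_for_execution(int(execution_id))\n    print(execution_id)\n    print(artifacts_df)  # one row per input/output artifact, with an extra \"event\" column\n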
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_artifact_types","title":"get_all_artifact_types()
","text":"Return names of all artifact types.
Returns:
Type DescriptionList[str]
List of all artifact types.
Source code incmflib/cmfquery.py
def get_all_artifact_types(self) -> t.List[str]:\n \"\"\"Return names of all artifact types.\n\n Returns:\n List of all artifact types.\n \"\"\"\n artifact_list = self.store.get_artifact_types()\n types=[i.name for i in artifact_list]\n return types\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_executions_for_artifact","title":"get_all_executions_for_artifact(artifact_name)
","text":"Return executions that consumed and produced given artifact.
Parameters:
Name Type Description Defaultartifact_name
str
Artifact name.
requiredReturns: Pandas data frame containing stage executions, one execution per row.
Source code incmflib/cmfquery.py
def get_all_executions_for_artifact(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return executions that consumed and produced given artifact.\n\n Args:\n artifact_name: Artifact name.\n Returns:\n Pandas data frame containing stage executions, one execution per row.\n \"\"\"\n df = pd.DataFrame()\n\n artifact: t.Optional = self._get_artifact(artifact_name)\n if not artifact:\n return df\n\n for event in self.store.get_events_by_artifact_ids([artifact.id]):\n stage_ctx = self.store.get_contexts_by_execution(event.execution_id)[0]\n linked_execution = {\n \"Type\": \"INPUT\" if event.type == mlpb.Event.Type.INPUT else \"OUTPUT\",\n \"execution_id\": event.execution_id,\n \"execution_name\": self.store.get_executions_by_id([event.execution_id])[0].name,\n \"execution_type_name\":self.store.get_executions_by_id([event.execution_id])[0].properties['Execution_type_name'],\n \"stage\": stage_ctx.name,\n \"pipeline\": self.store.get_parent_contexts_by_context(stage_ctx.id)[0].name,\n }\n d1 = pd.DataFrame(\n linked_execution,\n index=[\n 0,\n ],\n )\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_one_hop_child_artifacts","title":"get_one_hop_child_artifacts(artifact_name, pipeline_id=None)
","text":"Get artifacts produced by executions that consume given artifact.
Parameters:
Name Type Description Defaultartifact
name
Name of an artifact.
requiredReturn: Output artifacts of all executions that consumed given artifact.
Source code incmflib/cmfquery.py
def get_one_hop_child_artifacts(self, artifact_name: str, pipeline_id: str = None) -> pd.DataFrame:\n \"\"\"Get artifacts produced by executions that consume given artifact.\n\n Args:\n artifact name: Name of an artifact.\n Return:\n Output artifacts of all executions that consumed given artifact.\n \"\"\"\n artifact: t.Optional = self._get_artifact(artifact_name)\n if not artifact:\n return pd.DataFrame()\n\n # Get output artifacts of executions consumed the above artifact.\n artifacts_ids = self._get_output_artifacts(self._get_executions_by_input_artifact_id(artifact.id,pipeline_id))\n return self._as_pandas_df(\n self.store.get_artifacts_by_id(artifacts_ids), lambda _artifact: self.get_artifact_df(_artifact)\n )\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_child_artifacts","title":"get_all_child_artifacts(artifact_name)
","text":"Return all downstream artifacts starting from the given artifact.
Parameters:
Name Type Description Defaultartifact_name
str
Artifact name.
requiredReturns: Data frame containing all child artifacts.
Source code incmflib/cmfquery.py
def get_all_child_artifacts(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return all downstream artifacts starting from the given artifact.\n\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing all child artifacts.\n \"\"\"\n df = pd.DataFrame()\n d1 = self.get_one_hop_child_artifacts(artifact_name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n for row in d1.itertuples():\n d1 = self.get_all_child_artifacts(row.name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n df = df.drop_duplicates(subset=None, keep=\"first\", inplace=False)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_one_hop_parent_artifacts","title":"get_one_hop_parent_artifacts(artifact_name)
","text":"Return input artifacts for the execution that produced the given artifact. Args: artifact_name: Artifact name. Returns: Data frame containing immediate parent artifactog of given artifact.
Source code incmflib/cmfquery.py
def get_one_hop_parent_artifacts(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return input artifacts for the execution that produced the given artifact.\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing immediate parent artifactog of given artifact.\n \"\"\"\n artifact: t.Optional = self._get_artifact(artifact_name)\n if not artifact:\n return pd.DataFrame()\n\n artifact_ids: t.List[int] = self._get_input_artifacts(self._get_executions_by_output_artifact_id(artifact.id))\n\n return self._as_pandas_df(\n self.store.get_artifacts_by_id(artifact_ids), lambda _artifact: self.get_artifact_df(_artifact)\n )\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_parent_artifacts","title":"get_all_parent_artifacts(artifact_name)
","text":"Return all upstream artifacts. Args: artifact_name: Artifact name. Returns: Data frame containing all parent artifacts.
Source code incmflib/cmfquery.py
def get_all_parent_artifacts(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return all upstream artifacts.\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing all parent artifacts.\n \"\"\"\n df = pd.DataFrame()\n d1 = self.get_one_hop_parent_artifacts(artifact_name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n for row in d1.itertuples():\n d1 = self.get_all_parent_artifacts(row.name)\n # df = df.append(d1, sort=True, ignore_index=True)\n df = pd.concat([df, d1], sort=True, ignore_index=True)\n df = df.drop_duplicates(subset=None, keep=\"first\", inplace=False)\n return df\n
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_all_parent_executions","title":"get_all_parent_executions(artifact_name)
","text":"Return all executions that produced upstream artifacts for the given artifact. Args: artifact_name: Artifact name. Returns: Data frame containing all parent executions.
Source code incmflib/cmfquery.py
def get_all_parent_executions(self, artifact_name: str) -> pd.DataFrame:\n \"\"\"Return all executions that produced upstream artifacts for the given artifact.\n Args:\n artifact_name: Artifact name.\n Returns:\n Data frame containing all parent executions.\n \"\"\"\n parent_artifacts: pd.DataFrame = self.get_all_parent_artifacts(artifact_name)\n if parent_artifacts.shape[0] == 0:\n # If it's empty, there's no `id` column and the code below raises an exception.\n return pd.DataFrame()\n\n execution_ids = set(\n event.execution_id\n for event in self.store.get_events_by_artifact_ids(parent_artifacts.id.values.tolist())\n if event.type == mlpb.Event.OUTPUT\n )\n\n return self._as_pandas_df(\n self.store.get_executions_by_id(execution_ids),\n lambda _exec: self._transform_to_dataframe(_exec, {\"id\": _exec.id, \"name\": _exec.name}),\n )\n
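A lineage-traversal sketch using the methods above; artifact names in the store carry CMF-specific suffixes, so the name is looked up rather than hard-coded:
from cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\nall_names = query.get_all_artifacts()\nif all_names:\n    artifact_name = all_names[0]  # pick any recorded artifact\n    children = query.get_all_child_artifacts(artifact_name)    # downstream artifacts\n    parents = query.get_all_parent_artifacts(artifact_name)    # upstream artifacts\n    producers = query.get_all_parent_executions(artifact_name) # executions behind the upstream artifacts\n    print(len(children), len(parents), len(producers))\n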
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.get_metrics","title":"get_metrics(metrics_name)
","text":"Return metric data frame. Args: metrics_name: Metrics name. Returns: Data frame containing all metrics.
Source code incmflib/cmfquery.py
def get_metrics(self, metrics_name: str) -> t.Optional[pd.DataFrame]:\n \"\"\"Return metric data frame.\n Args:\n metrics_name: Metrics name.\n Returns:\n Data frame containing all metrics.\n \"\"\"\n for metric in self.store.get_artifacts_by_type(\"Step_Metrics\"):\n if metric.name == metrics_name:\n name: t.Optional[str] = metric.custom_properties.get(\"Name\", None)\n if name:\n return pd.read_parquet(name)\n break\n return None\n
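A hedged retrieval sketch for fine-grained metrics committed earlier with commit_metrics; the substring match below is only a convenience for finding the full, suffixed artifact name:
from cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\nstep_metric_names = [n for n in query.get_all_artifacts() if \"training_metrics\" in str(n)]\nif step_metric_names:\n    metrics_df = query.get_metrics(step_metric_names[0])\n    if metrics_df is not None:\n        print(metrics_df.head())\n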
"},{"location":"api/public/cmfquery/#cmflib.cmfquery.CmfQuery.dumptojson","title":"dumptojson(pipeline_name, exec_id=None)
","text":"Return JSON-parsable string containing details about the given pipeline. Args: pipeline_name: Name of an AI pipelines. exec_id: Optional stage execution ID - filter stages by this execution ID. Returns: Pipeline in JSON format.
Source code incmflib/cmfquery.py
def dumptojson(self, pipeline_name: str, exec_id: t.Optional[int] = None) -> t.Optional[str]:\n \"\"\"Return JSON-parsable string containing details about the given pipeline.\n Args:\n pipeline_name: Name of an AI pipelines.\n exec_id: Optional stage execution ID - filter stages by this execution ID.\n Returns:\n Pipeline in JSON format.\n \"\"\"\n if exec_id is not None:\n exec_id = int(exec_id)\n\n def _get_node_attributes(_node: t.Union[mlpb.Context, mlpb.Execution, mlpb.Event], _attrs: t.Dict) -> t.Dict:\n for attr in CONTEXT_LIST:\n #Artifacts getattr call on Type was giving empty string, which was overwriting \n # the defined types such as Dataset, Metrics, Models\n if getattr(_node, attr, None) is not None and not getattr(_node, attr, None) == \"\":\n _attrs[attr] = getattr(_node, attr)\n\n if \"properties\" in _attrs:\n _attrs[\"properties\"] = CmfQuery._copy(_attrs[\"properties\"])\n if \"custom_properties\" in _attrs:\n # TODO: (sergey) why do we need to rename \"type\" to \"user_type\" if we just copy into a new dictionary?\n _attrs[\"custom_properties\"] = CmfQuery._copy(\n _attrs[\"custom_properties\"], key_mapper={\"type\": \"user_type\"}\n )\n return _attrs\n\n pipelines: t.List[t.Dict] = []\n for pipeline in self._get_pipelines(pipeline_name):\n pipeline_attrs = _get_node_attributes(pipeline, {\"stages\": []})\n for stage in self._get_stages(pipeline.id):\n stage_attrs = _get_node_attributes(stage, {\"executions\": []})\n for execution in self._get_executions(stage.id, execution_id=exec_id):\n # name will be an empty string for executions that are created with\n # create new execution as true(default)\n # In other words name property will there only for execution\n # that are created with create new execution flag set to false(special case)\n exec_attrs = _get_node_attributes(\n execution,\n {\n \"type\": self.store.get_execution_types_by_id([execution.type_id])[0].name,\n \"name\": execution.name if execution.name != \"\" else \"\",\n \"events\": [],\n },\n )\n for event in self.store.get_events_by_execution_ids([execution.id]):\n event_attrs = _get_node_attributes(event, {})\n # An event has only a single Artifact associated with it. \n # For every artifact we create an event to link it to the execution.\n\n artifacts = self.store.get_artifacts_by_id([event.artifact_id])\n artifact_attrs = _get_node_attributes(\n artifacts[0], {\"type\": self.store.get_artifact_types_by_id([artifacts[0].type_id])[0].name}\n )\n event_attrs[\"artifact\"] = artifact_attrs\n exec_attrs[\"events\"].append(event_attrs)\n stage_attrs[\"executions\"].append(exec_attrs)\n pipeline_attrs[\"stages\"].append(stage_attrs)\n pipelines.append(pipeline_attrs)\n\n return json.dumps({\"Pipeline\": pipelines})\n
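The returned string can be parsed with the standard json module; a minimal sketch, assuming a pipeline named example_pipeline and the top-level structure built in the source above (a Pipeline list with nested stages and executions):
import json\nfrom cmflib.cmfquery import CmfQuery\n\nquery = CmfQuery(filepath=\"mlmd\")\npipeline_json = query.dumptojson(\"example_pipeline\")\nfor pipeline in json.loads(pipeline_json)[\"Pipeline\"]:\n    for stage in pipeline[\"stages\"]:\n        print(stage.get(\"name\"), len(stage[\"executions\"]))\n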
"},{"location":"api/public/dataslice/","title":"cmflib.cmf.Cmf.DataSlice","text":"A data slice represents a named subset of data. It can be used to track performance of an ML model on different slices of the training or testing dataset splits. This can be useful from different perspectives, for instance, to mitigate model bias.
Instances of data slices are not meant to be created manually by users. Instead, use the Cmf.create_dataslice method.
Source code incmflib/cmf.py
def __init__(self, name: str, writer):\n self.props = {}\n self.name = name\n self.writer = writer\n
"},{"location":"api/public/dataslice/#cmflib.cmf.Cmf.DataSlice.add_data","title":"add_data(path, custom_properties=None)
","text":"Add data to create the dataslice. Currently supported only for file abstractions. Pre-condition - the parent folder, containing the file should already be versioned. Example:
dataslice.add_data(f\"data/raw_data/{j}.xml)\n
Args: path: Name to identify the file to be added to the dataslice. custom_properties: Properties associated with this datum. Source code in cmflib/cmf.py
def add_data(\n self, path: str, custom_properties: t.Optional[t.Dict] = None\n) -> None:\n \"\"\"Add data to create the dataslice.\n Currently supported only for file abstractions. Pre-condition - the parent folder, containing the file\n should already be versioned.\n Example:\n ```python\n dataslice.add_data(f\"data/raw_data/{j}.xml)\n ```\n Args:\n path: Name to identify the file to be added to the dataslice.\n custom_properties: Properties associated with this datum.\n \"\"\"\n\n self.props[path] = {}\n self.props[path]['hash'] = dvc_get_hash(path)\n parent_path = path.rsplit(\"/\", 1)[0]\n self.data_parent = parent_path.rsplit(\"/\", 1)[1]\n if custom_properties:\n for k, v in custom_properties.items():\n self.props[path][k] = v\n
"},{"location":"api/public/dataslice/#cmflib.cmf.Cmf.DataSlice.commit","title":"commit(custom_properties=None)
","text":"Commit the dataslice. The created dataslice is versioned and added to underneath data versioning software. Example:
dataslice.commit()\n
Args: custom_properties: Dictionary of key-value pairs associated with the dataslice, for example {\"mean\": 2.5, \"median\": 2.6}.
Source code incmflib/cmf.py
def commit(self, custom_properties: t.Optional[t.Dict] = None) -> None:\n \"\"\"Commit the dataslice.\n The created dataslice is versioned and added to underneath data versioning software.\n Example:\n\n dataslice.commit()\n ```\n Args:\n custom_properties: Dictionary to store key value pairs associated with Dataslice\n Example{\"mean\":2.5, \"median\":2.6}\n \"\"\"\n\n logging_dir = change_dir(self.writer.cmf_init_path)\n # code for nano cmf\n # Assigning current file name as stage and execution name\n current_script = sys.argv[0]\n file_name = os.path.basename(current_script)\n name_without_extension = os.path.splitext(file_name)[0]\n # create context if not already created\n if not self.writer.child_context:\n self.writer.create_context(pipeline_stage=name_without_extension)\n assert self.writer.child_context is not None, f\"Failed to create context for {self.pipeline_name}!!\"\n\n # create execution if not already created\n if not self.writer.execution:\n self.writer.create_execution(execution_type=name_without_extension)\n assert self.writer.execution is not None, f\"Failed to create execution for {self.pipeline_name}!!\"\n\n directory_path = os.path.join(self.writer.ARTIFACTS_PATH, self.writer.execution.properties[\"Execution_uuid\"].string_value.split(',')[0], self.writer.DATASLICE_PATH)\n os.makedirs(directory_path, exist_ok=True)\n custom_props = {} if custom_properties is None else custom_properties\n git_repo = git_get_repo()\n dataslice_df = pd.DataFrame.from_dict(self.props, orient=\"index\")\n dataslice_df.index.names = [\"Path\"]\n dataslice_path = os.path.join(directory_path,self.name)\n dataslice_df.to_parquet(dataslice_path)\n existing_artifact = []\n\n commit_output(dataslice_path, self.writer.execution.id)\n c_hash = dvc_get_hash(dataslice_path)\n if c_hash == \"\":\n print(\"Error in getting the dvc hash,return without logging\")\n return\n\n dataslice_commit = c_hash\n url = dvc_get_url(dataslice_path)\n dvc_url_with_pipeline = f\"{self.writer.parent_context.name}:{url}\"\n if c_hash and c_hash.strip():\n existing_artifact.extend(\n self.writer.store.get_artifacts_by_uri(c_hash))\n if existing_artifact and len(existing_artifact) != 0:\n print(\"Adding to existing data slice\")\n # Haven't added event type in this if cond, is it not needed??\n slice = link_execution_to_input_artifact(\n store=self.writer.store,\n execution_id=self.writer.execution.id,\n uri=c_hash,\n input_name=dataslice_path + \":\" + c_hash,\n )\n else:\n props={\n \"git_repo\": str(git_repo),\n # passing c_hash value to commit\n \"Commit\": str(dataslice_commit),\n \"url\": str(dvc_url_with_pipeline),\n },\n slice = create_new_artifact_event_and_attribution(\n store=self.writer.store,\n execution_id=self.writer.execution.id,\n context_id=self.writer.child_context.id,\n uri=c_hash,\n name=dataslice_path + \":\" + c_hash,\n type_name=\"Dataslice\",\n event_type=mlpb.Event.Type.OUTPUT,\n properties={\n \"git_repo\": str(git_repo),\n # passing c_hash value to commit\n \"Commit\": str(dataslice_commit),\n \"url\": str(dvc_url_with_pipeline),\n },\n artifact_type_properties={\n \"git_repo\": mlpb.STRING,\n \"Commit\": mlpb.STRING,\n \"url\": mlpb.STRING,\n },\n custom_properties=custom_props,\n milliseconds_since_epoch=int(time.time() * 1000),\n )\n if self.writer.graph:\n self.writer.driver.create_dataslice_node(\n self.name, dataslice_path + \":\" + c_hash, c_hash, self.data_parent, props\n )\n os.chdir(logging_dir)\n return slice\n
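Putting create_dataslice, add_data and commit together: a minimal end-to-end sketch, assuming CMF has already been initialized for this repository, the files under data/raw_data/ are already versioned, and the pipeline name and custom property values are placeholders:
from cmflib.cmf import Cmf\n\ncmf = Cmf(filename=\"mlmd\", pipeline_name=\"example_pipeline\")\ndataslice = cmf.create_dataslice(\"slice-a\")\nfor j in range(3):\n    # Pre-condition: the parent folder of each file is already under data version control.\n    dataslice.add_data(f\"data/raw_data/{j}.xml\", custom_properties={\"split\": \"train\"})\ndataslice.commit()\n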
"},{"location":"architecture/advantages/","title":"Advantages","text":"Common metadata framework has the following components:
The APIs and abstractions provided by the library enable tracking of pipeline metadata. They track the stages in the pipeline, the input and output artifacts of each stage, and metrics. The framework allows metrics to be tracked at both coarse and fine granularity: stage metrics captured at the end of a stage, or fine-grained metrics tracked per step (epoch) or at regular intervals during the execution of the stage.
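For example, using the APIs documented in the reference above, coarse-grained stage metrics and fine-grained per-step metrics would be logged roughly as follows; cmf is an initialized Cmf object and the metric names and values are placeholders:
from cmflib.cmf import Cmf\n\ncmf = Cmf(filename=\"mlmd\", pipeline_name=\"example_pipeline\")\n\n# Coarse-grained: one metrics artifact recorded at the end of a stage.\ncmf.log_execution_metrics(\"eval_metrics\", {\"accuracy\": 0.92})\n\n# Fine-grained: accumulate per-step values, then commit them as a single parquet artifact.\nfor step in range(3):\n    cmf.log_metric(\"training_metrics\", {\"train_loss\": 1.0 / (step + 1)})\ncmf.commit_metrics(\"training_metrics\")\n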
The metadata logged through the APIs is written to a backend relational database. The library also provides APIs to query the metadata stored in the relational database, so that users can inspect their pipelines.
In addition to explicit tracking through the APIs, the library also provides implicit tracking. Implicit tracking automatically records the software version used in the pipelines. Function arguments and return values can be tracked automatically by adding the metadata tracker class decorators on the functions.
Before the metadata is written to the relational database, the metadata operations are journaled in the metadata journal log. This enables the framework to transfer the local metadata to the central server.
All artifacts are versioned with a data versioning framework (e.g., DVC). The content hash of each artifact is generated and stored along with the user-provided metadata. A special artifact metadata file, the \u201c.dvc\u201d file, is created for every artifact (file or folder) added to the data version management system. The .dvc file contains the content hash of the artifact.
For every new execution, the metadata tracker creates a new branch to track the code. The special metadata file created for each artifact, the \u201c.dvc\u201d file, is also committed to Git, and its commit id is tracked as metadata. Artifacts are versioned through the versioning of their metadata files: whenever an artifact changes, the metadata file is modified to reflect its current content hash, and the file is tracked as a new version of the metadata file.
The metadata tracker automatically records the commit at which the library was initialized and creates a separate commit for each change in an artifact over the course of the experiment. This helps track the transformations of the artifacts across the different stages of the pipeline.
"},{"location":"architecture/components/#local-client","title":"Local Client","text":"The metadata client interacts with the metadata server. It communicates with the server, for synchronization of metadata.
After the experiment is completed, the user invokes the \u201cCmf push\u201d command to push the collected metadata to the remote. This transfers the existing metadata journal to the server.
Metadata from the central repository can be pulled to the local repository using either the artifacts or the project as the identifier, or both.
When an artifact is used as the identifier, all metadata associated with the artifacts currently present in the branch of the cloned Git repository is pulled from the central repository to the local repository. The pulled metadata consists not only of the immediate metadata associated with those artifacts; it also contains the metadata of all artifacts in their chain of lineage.
When the project is used as the identifier, all metadata associated with the currently checked-out branch of the pipeline code is pulled to the local repository.
"},{"location":"architecture/components/#central-server","title":"Central Server","text":"The central server, exposes REST API\u2019s that can be called from the remote clients. This can help in situations where the connectivity between the core datacenter and the remote client is robust. The remote client calls the API\u2019s exposed by the central server to log the metadata directly to the central metadata repository.
Where connectivity with the central server is intermittent, the remote clients log the metadata to the local repository. The journaled metadata is then pushed by the remote client to the central server. The central server replays the journal and merges the incoming metadata with the metadata already present in the central repository. The ability to accurately identify artifacts anywhere by their content hash makes this merge robust.
"},{"location":"architecture/components/#central-repositories","title":"Central Repositories","text":"The common metadata framework consist of three central repositories for the code, data and metadata.
"},{"location":"architecture/components/#central-metadata-repository","title":"Central Metadata repository","text":"Central metadata repository holds the metadata pushed from the distributed sites. It holds metadata about all the different pipelines that was tracked using the common metadata tracker. The consolidated view of the metadata stored in the central repository, helps the users to learn across various stages in the pipeline executed at different locations. Using the query layer that is pointed to the central repository, the users gets the global view of the metadata which provides them with a deeper understanding of the pipelines and its metadata. The metadata helps to understand nonobvious results like performance of a dataset with respect to other datasets, Performance of a particular pipeline with respect to other pipelines etc.
"},{"location":"architecture/components/#central-artifact-storage-repository","title":"Central Artifact storage repository","text":"Central Artifact storage repository stores all the artifacts related to experiment. The data versioning framework (DVC) stores the artifacts in a content addressable layout. The artifacts are stored inside the folder with name as the first two characters of the content hash and the name of the artifact as the remaining part of the content hash. This helps in efficient retrieval of the artifacts.
"},{"location":"architecture/components/#git-repository","title":"Git Repository","text":"Git repository is used to track the code. Along with the code, the metadata file of the artifacts which contain the content hash of the artifacts are also stored in GIT. The Data versioning framework (dvc) would use these files to retrieve the artifacts from the artifact storage repository.
"},{"location":"architecture/overview/","title":"Architecture Overview","text":"Interactions in data pipelines can be complex. The Different stages in the pipeline, (which may not be next to each other) may have to interact to produce or transform artifacts. As the artifacts navigates and undergo transformations through this pipeline, it can take a complicated path, which might also involve bidirectional movement across these stages. Also, there could be dependencies between the multiple stages, where the metrics produced by a stage could influence the metrics at a subsequent stage. It is important to track the metadata across a pipeline to provide features like, lineage tracking, provenance and reproducibility.
Tracking metadata through these complex pipelines poses multiple challenges, some of them being:
Common metadata framework (CMF) addresses the problems associated with tracking of pipeline metadata from distributed sites and tracks code, data and metadata together for end-to-end traceability.
The framework automatically records the code version as part of the metadata for an execution. The data artifacts are also versioned automatically using a data versioning framework (such as DVC), and the metadata about the data version is stored along with the code. The framework stores the Git commit id of the metadata file associated with the artifact and the content hash of the artifact as metadata. It also provides APIs to track the hyperparameters and other metadata of pipelines. From the stored metadata, users can therefore zero in on the hyperparameters, the code version and the artifact versions used for an experiment.
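As a standalone illustration of the two identifiers mentioned above (the Git commit id and the artifact's content hash), here is a small sketch; it is not CMF's internal code, and the md5 choice is an assumption.
import hashlib\nimport subprocess\n\ndef artifact_identifiers(path: str) -> dict:\n    # The content hash identifies the artifact itself; the Git commit id identifies\n    # the code (and the committed metadata file) it was produced with.\n    with open(path, 'rb') as f:\n        content_hash = hashlib.md5(f.read()).hexdigest()  # assumed hash algorithm\n    commit_id = subprocess.check_output(['git', 'rev-parse', 'HEAD'], text=True).strip()\n    return {'content_hash': content_hash, 'git_commit': commit_id}\n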
Identifying artifacts by their content hash allows the framework to uniquely identify an artifact anywhere across the distributed sites. This enables the metadata from the distributed sites to be precisely merged into a central repository, providing a single global view of the metadata from all sites.
On this backbone, we build a Git-like experience for metadata: users can push their local metadata to the remote repository, where it is merged to create the global metadata, and pull metadata from the global store to the local one, creating a local view that contains only the metadata of interest.
The framework can be used to track various types of pipelines such as data pipelines or AI pipelines.
"},{"location":"cmf_client/Getting%20Started%20with%20cmf/","title":"Getting started with cmf","text":"Common metadata framework (cmf) has the following components:
cmf-client is a tool that facilitates metadata collaboration between different teams or two team members. It allows users to pull or push metadata from or to the cmf-server.
Follow the below-mentioned steps for the end-to-end setup of cmf-client:-
Pre-Requisites
Install cmf library i.e. cmflib
pip install git+https://github.com/HewlettPackard/cmf\n
OR pip install cmflib\n
Check here for more details."},{"location":"cmf_client/Getting%20Started%20with%20cmf/#install-cmf-server","title":"Install cmf-server","text":"cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
Follow here for details on how to setup a cmf-server.
"},{"location":"cmf_client/Getting%20Started%20with%20cmf/#how-to-effectively-use-cmf-client","title":"How to effectively use cmf-client?","text":"Let's assume we are tracking the metadata for a pipeline named Test-env
with minio S3 bucket as the artifact repository and a cmf-server.
Create a folder
mkdir example-folder\n
Initialize cmf
CMF initialization is the first step before using cmf-client commands. This single command completes the whole initialization process, making cmf-client user friendly. Execute cmf init
in the example-folder
directory created in the above step.
cmf init minioS3 --url s3://bucket-name --endpoint-url http://localhost:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://X.X.X.X:7687\n
Check here for more details. Check status of CMF initialization (Optional)
cmf init show\n
Check here for more details. Track metadata using cmflib
Use Sample projects as a reference to create a new project to track metadata for ML pipelines.
More info is available here.
Push artifacts
Push artifacts in the artifact repo initialised in the Initialize cmf step.
cmf artifact push \n
Check here for more details. Push metadata to cmf-server
cmf metadata push -p 'Test-env'\n
Check here for more details."},{"location":"cmf_client/Getting%20Started%20with%20cmf/#cmf-client-with-collaborative-development","title":"cmf-client with collaborative development","text":"In the case of collaborative development, in addition to the above commands, users can follow the commands below to pull metadata and artifacts from a common cmf server and a central artifact repository.
Pull metadata from the server
Execute cmf metadata
command in the example_folder
.
cmf metadata pull -p 'Test-env'\n
Check here for more details. Pull artifacts from the central artifact repo
Execute cmf artifact
command in the example_folder
.
cmf artifact pull -p \"Test-env\"\n
Check here for more details."},{"location":"cmf_client/Getting%20Started%20with%20cmf/#flow-chart-for-cmf","title":"Flow Chart for cmf","text":""},{"location":"cmf_client/cmf_client/","title":"Getting started with cmf-client commands","text":""},{"location":"cmf_client/cmf_client/#cmf-init","title":"cmf init","text":"Usage: cmf init [-h] {minioS3,amazonS3,local,sshremote,osdfremote,show}\n
cmf init
initializes an artifact repository for cmf. Local directory, Minio S3 bucket, Amazon S3 bucket, SSH remote and remote OSDF directory are the available options. Additionally, users can provide the cmf-server url."},{"location":"cmf_client/cmf_client/#cmf-init-show","title":"cmf init show","text":"Usage: cmf init show\n
cmf init show
displays current cmf configuration."},{"location":"cmf_client/cmf_client/#cmf-init-minios3","title":"cmf init minioS3","text":"Usage: cmf init minioS3 [-h] --url [url] \n --endpoint-url [endpoint_url]\n --access-key-id [access_key_id] \n --secret-key [secret_key] \n --git-remote-url[git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init minioS3
configures Minio S3 bucket as a cmf artifact repository. Refer minio-server.md to set up a minio server. cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for MinIOS3 accordingly.
Required Arguments
--url [url] Specify bucket url.\n --endpoint-url [endpoint_url] Specify the endpoint url of minio UI.\n --access-key-id [access_key_id] Specify Access Key Id.\n --secret-key [secret_key] Specify Secret Key.\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-local","title":"cmf init local","text":"Usage: cmf init local [-h] --path [path] -\n --git-remote-url [git_remote_url]\n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init local
initialises local directory as a cmf artifact repository. cmf init local --path /home/XXXX/local-storage --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Replace 'XXXX' with your system username in the following path: /home/XXXX/local-storage
Required Arguments
--path [path] Specify local directory path.\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-amazons3","title":"cmf init amazonS3","text":"Before setting up, obtain AWS temporary security credentials using the AWS Security Token Service (STS). These credentials are short-term and can last from minutes to hours. They are dynamically generated and provided to trusted users upon request, and expire after use. Users with appropriate permissions can request new credentials before or upon expiration. For further information, refer to the Temporary security credentials in IAM page.
To retrieve temporary security credentials using multi-factor authentication (MFA) for an IAM user, you can use the below command.
aws sts get-session-token --duration-seconds <duration> --serial-number <MFA_device_serial_number> --token-code <MFA_token_code>\n
Required Arguments --serial-number Specifies the serial number of the MFA device associated with the IAM user.\n --token-code Specifies the one-time code generated by the MFA device.\n
Optional Arguments
--duration-seconds Specifies the duration for which the temporary credentials will be valid, in seconds.\n
Example
aws sts get-session-token --duration-seconds 3600 --serial-number arn:aws:iam::123456789012:mfa/user --token-code 123456\n
This will return output like
{\n \"Credentials\": {\n \"AccessKeyId\": \"ABCDEFGHIJKLMNO123456\",\n \"SecretAccessKey\": \"PQRSTUVWXYZ789101112131415\",\n \"SessionToken\": \"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijlmnopqrstuvwxyz12345678910ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijlmnopqrstuvwxyz12345678910ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijlmnopqrstuvwxyz12345678910\",\n \"Expiration\": \"2021-05-10T15:31:08+00:00\"\n }\n}\n
Initialization of amazonS3 Usage: cmf init amazonS3 [-h] --url [url] \n --access-key-id [access_key_id]\n --secret-key [secret_key]\n --session-token [session_token]\n --git-remote-url [git_remote_url]\n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init amazonS3
initialises Amazon S3 bucket as a CMF artifact repository. cmf init amazonS3 --url s3://bucket-name --access-key-id XXXXXXXXXXXXX --secret-key XXXXXXXXXXXXX --session-token XXXXXXXXXXXXX --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687 \n
Here, use the --access-key-id, --secret-key and --session-token generated from the aws sts
command which is mentioned above.
The bucket-name must exist within Amazon S3 before executing the cmf artifact push
command.
Required Arguments
--url [url] Specify bucket url.\n --access-key-id [access_key_id] Specify Access Key Id.\n --secret-key [secret_key] Specify Secret Key.\n --session-token Specify session token. (default: )\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-sshremote","title":"cmf init sshremote","text":"Usage: cmf init sshremote [-h] --path [path] \n --user [user]\n --port [port]\n --password [password] \n --git-remote-url [git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init sshremote
command initialises remote ssh directory as a cmf artifact repository. cmf init sshremote --path ssh://127.0.0.1/home/user/ssh-storage --user XXXXX --port 22 --password example@123 --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Required Arguments --path [path] Specify ssh directory path.\n --user [user] Specify user.\n --port [port] Specify Port.\n --password [password] Specify password.\n --git-remote-url [git_remote_url] Specify git repo url.\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-init-osdfremote","title":"cmf init osdfremote","text":"Usage: cmf init osdfremote [-h] --path [path] \n --key-id [key_id]\n --key-path [key_path] \n --key-issuer [key_issuer] \n --git-remote-url[git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init osdfremote
configures an OSDF Origin as a cmf artifact repository. cmf init osdfremote --path https://[Some Origin]:8443/nrp/fdp/ --key-id c2a5 --key-path ~/.ssh/fdp.pem --key-issuer https://[Token Issuer] --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Required Arguments --path [path] Specify FQDN for OSDF origin including port and directory path\n --key-id [key_id] Specify key_id for provided private key. eg. b2d3\n --key-path [key_path] Specify path for private key on local filesystem. eg. ~/.ssh/XXX.pem\n --key-issuer [key_issuer] Specify URL for Key Issuer. eg. https://t.nationalresearchplatform.org/XXX\n --git-remote-url [git_remote_url] Specify git repo url. eg: https://github.com/XXX/example.git\n
Optional Arguments -h, --help show this help message and exit\n --cmf-server-url [cmf_server_url] Specify cmf-server url. (default: http://127.0.0.1:80)\n --neo4j-user [neo4j_user] Specify neo4j user. (default: None)\n --neo4j-password [neo4j_password] Specify neo4j password. (default: None)\n --neo4j-uri [neo4j_uri] Specify neo4j uri. Eg bolt://localhost:7687 (default: None)\n
"},{"location":"cmf_client/cmf_client/#cmf-artifact","title":"cmf artifact","text":"Usage: cmf artifact [-h] {pull,push}\n
cmf artifact
pulls or pushes artifacts from or to the user-configured artifact repository, respectively."},{"location":"cmf_client/cmf_client/#cmf-artifact-pull","title":"cmf artifact pull","text":"Usage: cmf artifact pull [-h] -p [pipeline_name] -f [file_name] -a [artifact_name]\n
cmf artifact pull
command pulls artifacts from the user-configured repository to the user's local machine. cmf artifact pull -p 'pipeline-name' -f '/path/to/mlmd-file-name' -a 'artifact-name'\n
Required Arguments -p [pipeline_name], --pipeline-name [pipeline_name] Specify Pipeline name.\n
Optional Arguments -h, --help show this help message and exit\n -a [artifact_name], --artifact_name [artifact_name] Specify artifact name only; don't use folder name or absolute path.\n -f [file_name],--file-name [file_name] Specify mlmd file name.\n
"},{"location":"cmf_client/cmf_client/#cmf-artifact-push","title":"cmf artifact push","text":"Usage: cmf artifact push [-h] -p [pipeline_name] -f [file_name]\n
cmf artifact push
command pushes artifacts from the user's local machine to the user-configured artifact repository. cmf artifact push -p 'pipeline_name' -f '/path/to/mlmd-file-name'\n
Required Arguments -p [pipeline_name], --pipeline-name [pipeline_name] Specify Pipeline name.\n
Optional Arguments -h, --help show this help message and exit.\n -f [file_name],--file-name [file_name] Specify mlmd file name.\n
"},{"location":"cmf_client/cmf_client/#cmf-metadata","title":"cmf metadata","text":"Usage: cmf metadata [-h] {pull,push,export}\n
cmf metadata
pushes, pulls or exports the metadata file to or from the cmf-server, respectively."},{"location":"cmf_client/cmf_client/#cmf-metadata-pull","title":"cmf metadata pull","text":"Usage: cmf metadata pull [-h] -p [pipeline_name] -f [file_name] -e [exec_id]\n
cmf metadata pull
command pulls the metadata file from the cmf-server to the user's local machine. cmf metadata pull -p 'pipeline-name' -f '/path/to/mlmd-file-name' -e 'execution_id'\n
Required Arguments -p [pipeline_name], --pipeline_name [pipeline_name] Specify Pipeline name.\n
Optional Arguments -h, --help show this help message and exit\n-e [exec_id], --execution [exec_id] Specify execution id\n-f [file_name], --file_name [file_name] Specify mlmd file name with full path(either relative or absolute).\n
"},{"location":"cmf_client/cmf_client/#cmf-metadata-push","title":"cmf metadata push","text":"Usage: cmf metadata push [-h] -p [pipeline_name] -f [file_name] -e [exec_id] -t [tensorboard]\n
cmf metadata push
command pushes the metadata file from the local machine to the cmf-server. cmf metadata push -p 'pipeline-name' -f '/path/to/mlmd-file-name' -e 'execution_id' -t '/path/to/tensorboard-log'\n
Required Arguments -p [pipeline_name], --pipeline_name [pipeline_name] Specify Pipeline name.\n
Optional Arguments
-h, --help show this help message and exit\n -f [file_name], --file_name [file_name] Specify mlmd file name.\n -e [exec_id], --execution [exec_id] Specify execution id.\n -t [tensorboard], --tensorboard [tensorboard] Specify path to tensorboard logs for the pipeline.\n
"},{"location":"cmf_client/cmf_client/#cmf-metadata-export","title":"cmf metadata export","text":"Usage: cmf metadata export [-h] -p [pipeline_name] -j [json_file_name] -f [file_name]\n
cmf metadata export
exports the local mlmd metadata in JSON format to a JSON file. cmf metadata export -p 'pipeline-name' -j '/path/to/json-file-name' -f '/path/to/mlmd-file-name'\n
Required Arguments -p [pipeline_name], --pipeline_name [pipeline_name] Specify Pipeline name.\n
Optional Arguments
-h, --help show this help message and exit\n -f [file_name], --file_name [file_name] Specify mlmd file name.\n -j [json_file_name], --json_file_name [json_file_name] Specify json file name with full path.\n
"},{"location":"cmf_client/minio-server/","title":"MinIO S3 Artifact Repo Setup","text":""},{"location":"cmf_client/minio-server/#steps-to-set-up-a-minio-server","title":"Steps to set up a MinIO server","text":"Object storage is an abstraction layer above the file system and helps to work with data using API. MinIO is the fastest way to start working with object storage. It is compatible with S3, easy to deploy, manage locally, and upscale if needed.
Follow the below mentioned steps to set up a MinIO server:
Copy contents of the example-get-started
directory to a separate directory outside the cmf repository.
Check whether cmf is initialized.
cmf init show\n
If cmf is not initialized, the following message will appear on the screen. 'cmf' is not configured.\nExecute the 'cmf init' command.\n
Execute the following command to initialize the MinIO S3 bucket as a CMF artifact repository.
cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for minioS3 accordingly.
Execute cmf init show
to check the CMF configuration. The sample output looks as follows:
remote.minio.url=s3://bucket-name\nremote.minio.endpointurl=http://localhost:9000\nremote.minio.access_key_id=minioadmin\nremote.minio.secret_access_key=minioadmin\ncore.remote=minio\n
Build a MinIO server using a Docker container. docker-compose.yml
available in example-get-started
directory provides two services: minio
and aws-cli
. The user initializes the repository with the bucket name, storage URL, and credentials to access MinIO.
MYIP=XX.XX.XXX.XXX docker-compose up\n
or MYIP=XX.XX.XXX.XXX docker compose up\n
After executing the above command, the following messages confirm that MinIO is up and running. You can also adjust $MYIP
in examples/example-get-started/docker-compose.yml
to reflect the server IP and run the docker compose
command without specifying MYIP on the command line.
Login into remote.minio.endpointurl
(in the above example - http://localhost:9000) using access-key and secret-key mentioned in cmf configuration.
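Instead of (or in addition to) logging in through the UI, you can sanity-check the endpoint and credentials from Python before pushing artifacts; this is a sketch assuming the boto3 package, which is an extra dependency and not required by cmf.
import boto3\n\n# Sanity-check the MinIO endpoint and credentials used in 'cmf init minioS3'.\ns3 = boto3.client(\n    's3',\n    endpoint_url='http://localhost:9000',\n    aws_access_key_id='minioadmin',\n    aws_secret_access_key='minioadmin',\n)\nprint([b['Name'] for b in s3.list_buckets()['Buckets']])  # should list 'dvc-art'\n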
The following image is an example snapshot of the MinIO server with a bucket named 'dvc-art'.
SSH (Secure Shell) remote storage refers to using the SSH protocol to securely access and manage files and data on a remote server or storage system over a network. SSH is a cryptographic network protocol that allows secure communication and data transfer between a local computer and a remote server.
Proceed with the following steps to set up a SSH Remote Repository:
project directory
with the SSH repo. Check whether cmf is initialized in your project directory with the following command.
cmf init show\n
If cmf is not initialized, the following message will appear on the screen. 'cmf' is not configured.\nExecute the 'cmf init' command.\n
Execute the following command to initialize the SSH remote storage as a CMF artifact repository.
cmf init sshremote --path ssh://127.0.0.1/home/user/ssh-storage --user XXXXX --port 22 --password example@123 --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 \n
> When running cmf init sshremote
, please ensure that the specified IP address has the necessary permissions to allow access using the specified user ('XXXX'). If the IP address or user lacks the required permissions, the command will fail. Execute cmf init show
to check the CMF configuration.
/etc/ssh/sshd_config file
. This configuration file serves as the primary starting point for diagnosing and resolving SSH permission-related challenges.Common metadata framework (cmf) has the following components:
Before proceeding, ensure that the CMF library is installed on your system. If not, follow the installation instructions provided inside the CMF in a nutshell page.
"},{"location":"cmf_client/step-by-step/#install-cmf-server","title":"Install cmf-server","text":"cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
Follow the instructions on the Getting started with cmf-server page for details on how to setup a cmf-server.
"},{"location":"cmf_client/step-by-step/#setup-a-cmf-client","title":"Setup a cmf-client","text":"cmf-client is a tool that facilitates metadata collaboration between different teams or two team members. It allows users to pull or push metadata from or to the cmf-server.
Follow the below-mentioned steps for the end-to-end setup of cmf-client:-
Configuration
mkdir <workdir>
cmf init
to configure dvc remote directory, git remote url, cmf server and neo4j. Follow the Overview page for more details.Let's assume we are tracking the metadata for a pipeline named Test-env
with minio S3 bucket as the artifact repository and a cmf-server.
Create a folder
mkdir example-folder\n
Initialize cmf
CMF initialization is the first step before using cmf-client commands. This single command completes the whole initialization process, making cmf-client user friendly. Execute cmf init
in the example-folder
directory created in the above step.
cmf init minioS3 --url s3://dvc-art --endpoint-url http://x.x.x.x:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://x.x.x.x:8080 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Here, \"dvc-art\" is provided as an example bucket name. However, users can change it as needed, if the user chooses to change it, they will need to update the Dockerfile for minioS3 accordingly.
Check Overview page for more details.
Check status of CMF initialization (Optional)
cmf init show\n
Check Overview page for more details. Track metadata using cmflib
Use Sample projects as a reference to create a new project to track metadata for ML pipelines.
More information is available inside Getting Started.
Before pushing artifacts or metadata, ensure that the cmf server and minioS3 are up and running.
Push artifacts
Push artifacts in the artifact repo initialised in the Initialize cmf step.
cmf artifact push -p 'Test-env'\n
Check Overview page for more details. Push metadata to cmf-server
cmf metadata push -p 'Test-env'\n
Check Overview page for more details."},{"location":"cmf_client/step-by-step/#cmf-client-with-collaborative-development","title":"cmf-client with collaborative development","text":"In the case of collaborative development, in addition to the above commands, users can follow the commands below to pull metadata and artifacts from a common cmf server and a central artifact repository.
Pull metadata from the server
Execute cmf metadata
command in the example_folder
.
cmf metadata pull -p 'Test-env'\n
Check Overview page for more details. Pull artifacts from the central artifact repo
Execute cmf artifact
command in the example_folder
.
cmf artifact pull -p 'Test-env'\n
Check Overview page for more details."},{"location":"cmf_client/step-by-step/#flow-chart-for-cmf","title":"Flow Chart for cmf","text":""},{"location":"cmf_client/tensorflow_guide/","title":"How to Use TensorBoard with CMF","text":"Copy the contents of the 'example-get-started' directory from cmf/examples/example-get-started
into a separate directory outside cmf repository.
Execute the following command to install the TensorFlow library in the current directory:
pip install tensorflow\n
Create a new Python file (e.g., tensorflow_log.py
) and copy the following code:
import datetime\n import tensorflow as tf\n\n mnist = tf.keras.datasets.mnist\n (x_train, y_train),(x_test, y_test) = mnist.load_data()\n x_train, x_test = x_train / 255.0, x_test / 255.0\n\n def create_model():\n return tf.keras.models.Sequential([\n tf.keras.layers.Flatten(input_shape=(28, 28), name='layers_flatten'),\n tf.keras.layers.Dense(512, activation='relu', name='layers_dense'), \n tf.keras.layers.Dropout(0.2, name='layers_dropout'),\n tf.keras.layers.Dense(10, activation='softmax', name='layers_dense_2')\n ])\n\n model = create_model()\n model.compile(optimizer='adam',\n loss='sparse_categorical_crossentropy',\n metrics=['accuracy'])\n\n log_dir = \"logs/fit/\" + datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)\n model.fit(x=x_train,y=y_train,epochs=5,validation_data=(x_test, y_test),callbacks=[tensorboard_callback])\n\n train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))\n test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))\n\n train_dataset = train_dataset.shuffle(60000).batch(64)\n test_dataset = test_dataset.batch(64)\n\n loss_object = tf.keras.losses.SparseCategoricalCrossentropy()\n optimizer = tf.keras.optimizers.Adam()\n\n # Define our metrics\n train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)\n train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy('train_accuracy')\n test_loss = tf.keras.metrics.Mean('test_loss', dtype=tf.float32)\n test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy('test_accuracy')\n\n def train_step(model, optimizer, x_train, y_train):\n with tf.GradientTape() as tape:\n predictions = model(x_train, training=True)\n loss = loss_object(y_train, predictions)\n grads = tape.gradient(loss, model.trainable_variables)\n optimizer.apply_gradients(zip(grads, model.trainable_variables))\n train_loss(loss)\n train_accuracy(y_train, predictions)\n\n def test_step(model, x_test, y_test):\n predictions = model(x_test)\n loss = loss_object(y_test, predictions)\n test_loss(loss)\n test_accuracy(y_test, predictions)\n\n current_time = datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n train_log_dir = 'logs/gradient_tape/' + current_time + '/train'\n test_log_dir = 'logs/gradient_tape/' + current_time + '/test'\n train_summary_writer = tf.summary.create_file_writer(train_log_dir)\n test_summary_writer = tf.summary.create_file_writer(test_log_dir)\n\n model = create_model() # reset our model\n EPOCHS = 5\n for epoch in range(EPOCHS):\n for (x_train, y_train) in train_dataset:\n train_step(model, optimizer, x_train, y_train)\n with train_summary_writer.as_default():\n tf.summary.scalar('loss', train_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)\n\n for (x_test, y_test) in test_dataset:\n test_step(model, x_test, y_test)\n with test_summary_writer.as_default():\n tf.summary.scalar('loss', test_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', test_accuracy.result(), step=epoch)\n template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'\n print (template.format(epoch+1,\n train_loss.result(),\n train_accuracy.result()*100,\n test_loss.result(),\n test_accuracy.result()*100))\n
For more detailed information, check out the TensorBoard documentation. Execute the TensorFlow log script using the following command:
python3 tensorflow_log.py\n
The above script will automatically create a logs
directory inside your current directory.
Start the CMF server and configure the CMF client.
Use the following command to run the test script, which will generate the MLMD file:
sh test_script.sh\n
Use the following command to push the generated MLMD and TensorFlow log files to the CMF server:
cmf metadata push -p 'pipeline-name' -t 'tensorboard-log-file-name'\n
Go to the CMF server and navigate to the TensorBoard tab. You will see an interface similar to the following image.
cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
"},{"location":"cmf_server/cmf-server/#setup-a-cmf-server","title":"Setup a cmf-server","text":"There are two ways to start cmf server -
Clone the Github repository.
git clone https://github.com/HewlettPackard/cmf\n
Install Docker Engine with non root user privileges.
In earlier versions, Docker Compose (docker compose
) was independent of Docker; hence, docker-compose
was the command. However, after the introduction of Docker Compose Desktop V2, the compose command became part of the Docker engine. The recommended way to install Docker Compose is to install the Docker Compose plugin for the Docker engine. For more information, see the Docker Compose Reference.
docker compose
file","text":"This is the recommended way as docker compose starts both ui-server and cmf-server in one go.
cmf
directory.Replace xxxx
with user-name in docker-compose-server.yml available in the root cmf directory.
......\nservices:\n  server:\n    image: server:latest\n    volumes:\n      - /home/xxxx/cmf-server/data:/cmf-server/data                 # for example /home/hpe-user/cmf-server/data:/cmf-server/data\n      - /home/xxxx/cmf-server/data/static:/cmf-server/data/static   # for example /home/hpe-user/cmf-server/data/static:/cmf-server/data/static\n    container_name: cmf-server\n    build:\n....\n
Execute the following command to start both containers. The IP
variable is the IP address and hostname
is the host name of the machine on which you are executing the command; you can use either one.
IP=200.200.200.200 docker compose -f docker-compose-server.yml up\n OR\nhostname=host_name docker compose -f docker-compose-server.yml up\n
Replace docker compose
with docker-compose
for older versions. Also you can adjust $IP
in docker-compose-server.yml
to reflect the server IP and run the docker compose
command without specifying IP=200.200.200.200.
.......\nenvironment:\nREACT_APP_MY_IP: ${IP}\n......\n
Stop the containers.
docker compose -f docker-compose-server.yml stop\n
It is necessary to rebuild the images for cmf-server and ui-server after a cmf version update
or after pulling latest cmf code from git.
OR
"},{"location":"cmf_server/cmf-server/#using-docker-run-command","title":"Usingdocker run
command","text":"Install cmflib on your system.
Go to cmf/server
directory.
cd server\n
List all docker images.
docker images\n
Execute the below-mentioned command to create a cmf-server
docker image.
Usage: docker build -t [image_name] -f ./Dockerfile ../\n
Example: docker build -t server_image -f ./Dockerfile ../\n
Note
- '../'
represents the build context for the docker image. Launch a new docker container using the image, with the directory /home/user/cmf-server/data mounted. Pre-requisite: mkdir /home/<user>/cmf-server/data/static
Usage: docker run --name [container_name] -p 0.0.0.0:8080:80 -v /home/<user>/cmf-server/data:/cmf-server/data -e MYIP=XX.XX.XX.XX [image_name]\n
Example: docker run --name cmf-server -p 0.0.0.0:8080:80 -v /home/user/cmf-server/data:/cmf-server/data -e MYIP=0.0.0.0 server_image\n
After cmf-server container is up, start ui-server
, Go to cmf/ui
folder.
cd cmf/ui\n
Execute the below-mentioned command to create a ui-server
docker image.
Usage: docker build -t [image_name] -f ./Dockerfile ./\n
Example: docker build -t ui_image -f ./Dockerfile ./\n
Launch a new docker container for UI.
Usage: docker run --name [container_name] -p 0.0.0.0:3000:3000 -e REACT_APP_MY_IP=XX.XX.XX.XX [image_name]\n
Example: docker run --name ui-server -p 0.0.0.0:3000:3000 -e REACT_APP_MY_IP=0.0.0.0 ui_image\n
Note: If you face issue regarding Libzbar-dev
similar to the snapshot, add proxies to '/.docker/config.json' {\n proxies: {\n \"default\": {\n \"httpProxy\": \"http://web-proxy.labs.xxxx.net:8080\",\n \"httpsProxy\": \"http://web-proxy.labs.xxxx.net:8080\",\n \"noProxy\": \".labs.xxxx.net,127.0.0.0/8\"\n }\n }\n }\n
To stop the docker container.
docker stop [container_name]\n
To delete the docker container.
docker rm [container_name] \n
To remove the docker image.
docker image rm [image_name] \n
cmf-server APIs are organized around FastAPI. They accept and return JSON-encoded request bodies and responses and return standard HTTP response codes.
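As a quick illustration of how these endpoints can be called from Python, the sketch below assumes the requests package and a server reachable at the cmf-server url used elsewhere in this guide; the exact shape of the JSON responses is not documented here. The endpoints themselves are listed in the table that follows.
import requests\n\nSERVER = 'http://127.0.0.1:80'  # cmf-server url used elsewhere in this guide\n\n# List the pipelines known to the server and all recorded executions.\npipelines = requests.get(f'{SERVER}/display_pipelines', timeout=10)\nexecutions = requests.get(f'{SERVER}/display_executions', timeout=10)\nprint(pipelines.status_code, executions.status_code)  # 200 on success\n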
"},{"location":"cmf_server/cmf-server/#list-of-apis","title":"List of APIs","text":"Method URL DescriptionPost
/mlmd_push
Used to push Json Encoded data to cmf-server Get
/mlmd_pull/{pipeline_name}
Retrieves a mlmd file from cmf-server Get
/display_executions
Retrieves all executions from cmf-server Get
/display_artifacts/{pipeline_name}/{data_type}
Retrieves all artifacts from cmf-server for the respective data type Get
/display_lineage/{lineage_type}/{pipeline_name}
Creates lineage data from cmf-server Get
/display_pipelines
Retrieves all pipelines present in mlmd file"},{"location":"cmf_server/cmf-server/#http-response-status-codes","title":"HTTP Response Status codes","text":"Code Title Description 200
OK
mlmd is successfully pushed (e.g. when using GET
, POST
). 400
Bad request
When the cmf-server is not available. 500
Internal server error
When an internal error has happened"},{"location":"common-metadata-ontology/readme/","title":"Ontology","text":""},{"location":"common-metadata-ontology/readme/#common-metadata-ontology","title":"Common Metadata Ontology","text":"Common Metadata Ontology (CMO) is proposed to integrate and aggregate the pipeline metadata from various sources such as Papers-with-code, OpenML and Huggingface. CMF's data model is a manifestation of CMO which is specifically designed to capture the pipeline-centric metadata of AI pipelines. It consists of nodes to represent a pipeline, components of a pipeline (stages), relationships to capture interaction among pipeline entities and properties. CMO offers interoperability of diverse metadata, search and recommendation with reasoning capabilities. CMO offers flexibility to incorporate various executions implemented for each stage such as dataset preprocessing, feature engineering, training (including HPO), testing and evaluation. This enables robust search capabilities to identify the best execution path for a given pipeline. Additionally, CMO also facilitates the inclusion of additional semantic and statistical properties to enhance the richness and comprehensiveness of the metadata associated with them. The overview of CMO can be found below.
The external link to arrows.app can be found here
"},{"location":"common-metadata-ontology/readme/#sample-pipeline-represented-using-cmo","title":"Sample pipeline represented using CMO","text":"The sample figure shows a pipeline titled \"Robust outlier detection by de-biasing VAE likelihoods\" executed for \"Outlier Detection\" task for the stage train/test. The model used in the pipeline was \"Variational Autoencoder\". Several datasets were used in the pipeline implementation which are as follows (i) German Traffic Sign, (ii) Street View House Numbers and (iii) CelebFaces Arrtibutes dataset. The corresponding hyperparameters used and the metrics generated as a result of execution are included in the figure. The external link to source figure created using arrows.app can be found here
"},{"location":"common-metadata-ontology/readme/#turtle-syntax","title":"Turtle Syntax","text":"The Turtle format of formal ontology can be found here
"},{"location":"common-metadata-ontology/readme/#properties-of-each-nodes","title":"Properties of each nodes","text":"The properties of each node can be found below.
"},{"location":"common-metadata-ontology/readme/#pipeline","title":"Pipeline","text":"AI pipeline executed to solve a machine or deep learning Task
"},{"location":"common-metadata-ontology/readme/#properties","title":"Properties","text":"Any published text document regarding the pipeline implementation
"},{"location":"common-metadata-ontology/readme/#properties_1","title":"Properties","text":"The AI Task for which the pipeline is implemented. Example: image classification
"},{"location":"common-metadata-ontology/readme/#properties_2","title":"Properties","text":"The framework used to implement the pipeline and their code repository
"},{"location":"common-metadata-ontology/readme/#properties_3","title":"Properties","text":"Various stages of the pipeline such as data preprocessing, training, testing or evaluation
"},{"location":"common-metadata-ontology/readme/#properties_4","title":"Properties","text":"Multiple executions of a given stage in a pipeline
"},{"location":"common-metadata-ontology/readme/#properties_5","title":"Properties","text":"Artifacts such as model, dataset and metric generated at the end of each execution
"},{"location":"common-metadata-ontology/readme/#properties_6","title":"Properties","text":"Subclass of artifact. The dataset used in each Execution of a Pipeline
"},{"location":"common-metadata-ontology/readme/#properties_7","title":"Properties","text":"Subclass of artifact. The model used in each execution or produced as a result of an execution
"},{"location":"common-metadata-ontology/readme/#properties_8","title":"Properties","text":"Subclass of artifact. The evaluation result of each execution
"},{"location":"common-metadata-ontology/readme/#properties_9","title":"Properties","text":"Parameter setting using for each Execution of a Stage
"},{"location":"common-metadata-ontology/readme/#properties_10","title":"Properties","text":"NOTE: * are optional properties * There additional information on each node, different for each source. As of now, there are included in the KG for efficient search. But they are available to be used in the future to extract the data and populate as node properties. * *For metric, there are umpteen possible metric names and values. Therefore, we capture all of them as a key value pair under evaluations * custom_properties are where user can enter custom properties for each node while executing a pipeline * source is the source from which the node is obtained - papers-with-code, openml, huggingface
"},{"location":"common-metadata-ontology/readme/#published-works","title":"Published works","text":"This example depends on the following packages: git
. We also recommend installing anaconda to manage python virtual environments. This example was tested in the following environments:
Ubuntu-22.04 with python-3.10
This example demonstrates how CMF tracks a metadata associated with executions of various machine learning (ML) pipelines. ML pipelines differ from other pipelines (e.g., data Extract-Transform-Load pipelines) by the presence of ML steps, such as training and testing ML models. More comprehensive ML pipelines may include steps such as deploying a trained model and tracking its inference parameters (such as response latency, memory consumption etc.). This example, located here implements a simple pipeline consisting of four steps:
train
and test
raw datasets for training and testing a machine learning model. This step registers one input artifact (raw dataset
) and two output artifacts (train and test datasets
). train
step. This step registers two input artifacts (ML model and test dataset) and one output artifact (performance metrics).We start by creating (1) a workspace directory that will contain all files for this example and (2) a python virtual environment. Then we will clone the CMF project that contains this example project.
# Create workspace directory\nmkdir cmf_getting_started_example\ncd cmf_getting_started_example\n\n# Create and activate Python virtual environment (the Python version may need to be adjusted depending on your system)\nconda create -n cmf_getting_started_example python=3.10 \nconda activate cmf_getting_started_example\n\n# Clone the CMF project from GitHub and install CMF\ngit clone https://github.com/HewlettPackard/cmf\npip install ./cmf\n
"},{"location":"examples/getting_started/#setup-a-cmf-server","title":"Setup a cmf-server","text":"cmf-server is a key interface for the user to explore and track their ML training runs. It allows users to store the metadata file on the cmf-server. The user can retrieve the saved metadata file and can view the content of the saved metadata file using the UI provided by the cmf-server.
Follow here to setup a common cmf-server.
"},{"location":"examples/getting_started/#project-initialization","title":"Project initialization","text":"We need to copy the source tree of the example in its own directory (that must be outside the CMF source tree), and using cmf init
command initialize dvc remote directory, git remote url, cmf server and neo4j with appropriate dvc backend for this project .
# Create a separate copy of the example project\ncp -r ./cmf/examples/example-get-started/ ./example-get-started\ncd ./example-get-started\n
"},{"location":"examples/getting_started/#cmf-init","title":"cmf init","text":"\nUsage: cmf init minioS3 [-h] --url [url] \n --endpoint-url [endpoint_url]\n --access-key-id [access_key_id] \n --secret-key [secret_key] \n --git-remote-url[git_remote_url] \n --cmf-server-url [cmf_server_url]\n --neo4j-user [neo4j_user]\n --neo4j-password [neo4j_password]\n --neo4j-uri [neo4j_uri]\n
cmf init minioS3 --url s3://bucket-name --endpoint-url http://localhost:9000 --access-key-id minioadmin --secret-key minioadmin --git-remote-url https://github.com/user/experiment-repo.git --cmf-server-url http://127.0.0.1:80 --neo4j-user neo4j --neo4j-password password --neo4j-uri bolt://localhost:7687\n
Follow here for more details."},{"location":"examples/getting_started/#project-execution","title":"Project execution","text":"To execute the example pipeline, run the test_script.sh file (before that, study the contents of that file). Basically, that script will run a sequence of steps common for a typical machine learning project - getting raw data, converting it into machine learning train/test splits, training and testing a model. The execution of these steps (and parent pipeline) will be recorded by the CMF.
# Run the example pipeline\nsh ./test_script.sh\n
This script will run the pipeline and will store its metadata in a sqlite file named mlmd. Verify that all stages are done using git log
command. You should see commits corresponding to the artifacts that were created.
Under normal conditions, the next steps would be to: (1) execute the cmf artifact push
command to push the artifacts to the central artifact repository and (2) execute the cmf metadata push
command to track the metadata of the generated artifacts on a common cmf server.
Follow here for more details on cmf artifact
and cmf metadata
commands.
The stored metadata can be explored using the query layer. Example Jupyter notebook Query_Tester-base_mlmd.ipynb can be found in this directory.
"},{"location":"examples/getting_started/#clean-up","title":"Clean Up","text":"Metadata is stored in sqlite file named \"mlmd\". To clean up, delete the \"mlmd\" file.
"},{"location":"examples/getting_started/#steps-to-test-dataslice","title":"Steps to test dataslice","text":"Run the following command: python test-data-slice.py
.