Data Model

André Jesus edited this page Jun 29, 2023 · 15 revisions

Welcome to the Data Model page of the PHYLOViZ Web Platform! This page provides an overview of the data model used by the platform to represent genetic or epidemiological data. Understanding the data model is essential for effectively using the platform's features and analyzing your datasets.

The PHYLOViZ Web Platform follows a flexible and extensible data model that can accommodate various types of data. The data model consists of the following key entities:

  1. Project: A project serves as a container for organizing and managing datasets, analyses, and visualizations. Each project has a unique name and description.

  2. Dataset: A dataset represents a collection of genetic or epidemiological data. It consists of samples or isolates and their associated attributes, such as genetic markers, patient information, and geographic location. Datasets provide the basis for performing phylogenetic analysis and visualization.

  3. File: Files are resources stored within a project and can be categorized as either Typing Data or Isolate Data. Typing Data files contain information about genetic markers or allelic profiles, while Isolate Data files contain information about individual isolates or samples. Each dataset in a project can reference a single Typing Data file and optionally reference a single Isolate Data file. Files are uploaded directly into the project and can be shared among datasets.

  4. Distance Matrix: A Distance Matrix represents the calculated distances between each pair of allelic profiles in a dataset. It is derived from the Typing Data and serves as the basis for assessing the relatedness between profiles. Multiple Distance Matrices can be generated using different distance functions or models.

  5. Tree: A Tree is the result of applying an inference algorithm to a Distance Matrix, generating a hierarchical representation of the allelic profiles. It shows the evolutionary relationships between the profiles and can be visualized as a phylogenetic tree. Multiple Trees can be created using different algorithms or distance matrices.

  6. Tree Visualization: A Tree Visualization or Tree View is a visual representation of a Tree, where a layout is applied to arrange the nodes and edges in a meaningful way. It allows for the exploration and analysis of the phylogenetic relationships in a visually informative manner. Multiple Tree Views can be created from a single Tree, each with different layouts or transformations applied.
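The entities above can be sketched as plain Python dataclasses. This is only an illustration: every field name, and the choice to nest each entity inside the one it is derived from, is an assumption rather than the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeView:
    layout: str  # hypothetical field: layout applied to the tree


@dataclass
class Tree:
    algorithm: str  # hypothetical field: inference algorithm used
    tree_views: List[TreeView] = field(default_factory=list)


@dataclass
class DistanceMatrix:
    function: str  # hypothetical field: distance function or model used
    trees: List[Tree] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    typing_data_file: str                    # a single Typing Data file
    isolate_data_file: Optional[str] = None  # an optional Isolate Data file
    distance_matrices: List[DistanceMatrix] = field(default_factory=list)


@dataclass
class Project:
    name: str
    description: str
    files: List[str] = field(default_factory=list)  # files can be shared among datasets
    datasets: List[Dataset] = field(default_factory=list)


# Placeholder values only; they do not name real platform options.
view = TreeView(layout="some-layout")
tree = Tree(algorithm="some-algorithm", tree_views=[view])
matrix = DistanceMatrix(function="some-distance-function", trees=[tree])
dataset = Dataset(name="ds1", typing_data_file="profiles.txt",
                  distance_matrices=[matrix])
project = Project(name="demo", description="example project",
                  files=["profiles.txt"], datasets=[dataset])
```

The nesting mirrors the derivation chain described above: a Distance Matrix comes from a dataset's Typing Data, a Tree from a Distance Matrix, and a Tree View from a Tree.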

The entities described above form the basis of the data model used by the PHYLOViZ Web Platform. They provide a structured representation of genetic or epidemiological data and facilitate the organization, analysis, and visualization of datasets. The relationships between these entities allow users to navigate and explore the data efficiently.

By understanding the data model and its components, users can effectively utilize the platform's features and leverage its capabilities for their data analysis and research purposes. For more detailed information on the data model and its usage, refer to the platform's documentation or consult the support team for assistance.

Workflow documents

Workflows can be defined as procedures composed of groups of tasks or steps that are typically dependent on each other. Each task's output may serve as input for future tasks, creating a pipeline-like structure. The PHYLOViZ Web Platform utilizes workflows to execute computational procedures efficiently.
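As a rough illustration of the pipeline-like structure, the sketch below orders tasks so that each one runs after the tasks whose output it consumes. The task names and the dependency mapping are hypothetical, and the platform's workflow engine may order tasks in a different way.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def execution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return the task names in an order that respects their dependencies.

    `dependencies` maps each task to the tasks whose output it needs.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-step workflow: each task depends on the previous one.
tasks = {
    "compute-distance-matrix": [],
    "infer-tree": ["compute-distance-matrix"],
    "compute-layout": ["infer-tree"],
}
order = execution_order(tasks)
```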

To manage workflows effectively, the platform employs a Workflow Management System (WMS) consisting of a workflow engine and a workflow scheduler. The workflow engine handles the execution of workflows, while the scheduler manages their scheduling.

Workflow Templates

Workflow templates serve as blueprints for creating workflows. Each template represents a workflow type, so adding a workflow template adds a new workflow type that the application can use. A template defines the structure and details of a workflow, including the following information:

  • Workflow type: An identifier that categorizes the workflows built from the template.
  • Name and description: Clear and descriptive information about the purpose and functionality of the workflow.
  • Schema of input arguments: Specifies the expected input arguments for the workflow and defines their validation rules.
  • Tasks: Each workflow consists of tasks associated with specific tools or commands. The template includes the commands to be executed in each task, with placeholders that are later replaced with the appropriate input arguments specified in the schema during workflow creation.

Each argument in the schema has a type, which can be:

  • string: argument is a string. It requires an additional "allowed-values" field that restricts the accepted values. This restriction prevents code injection, because argument values are substituted directly into the placeholders of the tools' command arguments;
  • uuid: argument is a UUID;
  • objectid: argument is an ObjectId;
  • number: argument is a number;
  • datasetId: argument is the id of a dataset;
  • distanceMatrixId: argument is the id of a distance matrix;
  • treeId: argument is the id of a tree;
  • treeViewId: argument is the id of a tree view.
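A minimal sketch of how such type checks might look is shown below. The function name and the shape of the `spec` dictionary are assumptions, and the ObjectId check assumes a MongoDB-style 24-character hexadecimal string. The resource id types (datasetId, distanceMatrixId, treeId, treeViewId) are left out because validating them would presumably require looking the resource up.

```python
import re
import uuid
from typing import Any, Dict

# Assumption: ObjectIds are MongoDB-style 24-character hexadecimal strings.
OBJECT_ID_RE = re.compile(r"^[0-9a-fA-F]{24}$")


def validate_argument(spec: Dict[str, Any], value: Any) -> bool:
    """Check one argument value against its (hypothetical) schema entry."""
    kind = spec["type"]
    if kind == "string":
        # Only values listed in "allowed-values" are accepted, so arbitrary
        # text can never reach the command line of a tool.
        return isinstance(value, str) and value in spec.get("allowed-values", [])
    if kind == "uuid":
        if not isinstance(value, str):
            return False
        try:
            uuid.UUID(value)
            return True
        except ValueError:
            return False
    if kind == "objectid":
        return isinstance(value, str) and OBJECT_ID_RE.match(value) is not None
    if kind == "number":
        # bool is a subclass of int in Python, so it is excluded explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"type not covered by this sketch: {kind}")
```

With a schema entry such as `{"type": "string", "allowed-values": ["a", "b"]}`, the value `"a"` passes while an injection attempt like `"a; rm -rf /"` is rejected.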

"projectId" and "workflowId" are reserved arguments: whenever a task contains placeholders for them, those placeholders are always replaced with the IDs of the project and the workflow.
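The substitution step can be sketched as follows. The `${name}` placeholder syntax, the function name, and the example command are all assumptions made for illustration, since the page does not specify the placeholder format.

```python
import re
from typing import Dict

# Hypothetical placeholder syntax: ${argumentName}
PLACEHOLDER_RE = re.compile(r"\$\{(\w+)\}")


def fill_command(command: str, arguments: Dict[str, str],
                 project_id: str, workflow_id: str) -> str:
    """Replace every placeholder in a task command with its argument value."""
    # The reserved arguments are applied last, so they always take precedence
    # over user-supplied arguments with the same name.
    values = {**arguments, "projectId": project_id, "workflowId": workflow_id}

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder: {name}")
        return str(values[name])

    return PLACEHOLDER_RE.sub(replace, command)


# Hypothetical task command with one reserved and one schema argument.
filled = fill_command(
    "tool --project ${projectId} --dataset ${datasetId}",
    {"datasetId": "d1"},
    project_id="p1",
    workflow_id="w1",
)
```

Raising an error for an unknown placeholder, rather than leaving it in the command, is a design choice of this sketch: it keeps a half-filled command from ever being executed.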

By utilizing workflow templates, users can easily create workflows with consistent structures and predefined behaviors.

Tool Templates

Similar to workflow templates, tool templates define individual tools used within workflows. They provide the following information:

  • Name and description: Identifiers and descriptions of the tools, which are referenced by tasks in the workflow.
  • Access instructions: Instructions for accessing the tools, including information about APIs or libraries. Docker image details and Docker volume configuration for storing output data are also specified. The Docker volume path contains placeholders that are substituted with the project ID and workflow ID, ensuring each workflow has a dedicated directory for task outputs.
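To illustrate the dedicated output directory, the sketch below resolves a volume path from a tool template. The field names, the image name, the path layout, and the `${...}` placeholder syntax are hypothetical.

```python
from string import Template
from typing import Dict

# Hypothetical tool template document; none of these values are real.
tool_template = {
    "name": "example-tool",
    "description": "Example tool referenced by workflow tasks",
    "image": "example/image:latest",
    "volume": "/data/${projectId}/${workflowId}",
}


def resolve_volume(template: Dict[str, str], project_id: str, workflow_id: str) -> str:
    """Substitute the project and workflow IDs into the volume path."""
    return Template(template["volume"]).substitute(
        projectId=project_id, workflowId=workflow_id
    )
```

Because both IDs are part of the path, two workflows in the same project resolve to different directories and cannot overwrite each other's task outputs.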

Tool templates enable the integration of various tools into workflows, allowing for the customization of specific actions and functionalities.

Workflow Instances

Workflow instances represent concrete workflows and record their state, whether or not they have been executed. They are generated from workflow templates and report the workflow's status and progress. Each workflow instance document includes the following details:

  • Workflow ID: A unique identifier that distinguishes each specific workflow instance.
  • Workflow type: Corresponding to the type of the workflow template it was built upon.
  • Workflow name: The name of the workflow.
  • Project ID: Every workflow belongs to a project.
  • Workflow object: Contains the tasks to be executed, along with their associated tools and actions (commands).
  • Workflow status: Represents the current state of the workflow (e.g., RUNNING, SUCCESS, or FAILED).
  • Failure reason and log: If the workflow failed, the reason for the failure and the accompanying task log.
  • Workflow progress: A percentage value indicating the progress of the workflow, managed directly by the tasks.
  • Additional data from tasks: Any additional information or results generated by the tasks during the computation.
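The fields listed above could be modelled roughly as follows. This is a sketch with assumed field names, not the platform's actual document schema, and the status enum contains only the three states given as examples.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class WorkflowStatus(Enum):
    # Only the states mentioned as examples; others may exist.
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


@dataclass
class WorkflowInstance:
    workflow_id: str
    workflow_type: str
    name: str
    project_id: str
    workflow: Dict[str, Any]             # tasks with their tools and commands
    status: WorkflowStatus = WorkflowStatus.RUNNING
    failure_reason: Optional[str] = None
    log: Optional[str] = None
    progress: float = 0.0                # percentage, updated by the tasks
    data: Dict[str, Any] = field(default_factory=dict)  # extra task results

    def fail(self, reason: str, log: str) -> None:
        """Record a failure together with its reason and the task log."""
        self.status = WorkflowStatus.FAILED
        self.failure_reason = reason
        self.log = log


# Hypothetical instance; all identifiers are placeholders.
instance = WorkflowInstance(
    workflow_id="w1",
    workflow_type="example-type",
    name="Example workflow",
    project_id="p1",
    workflow={"tasks": []},
)
instance.progress = 50.0
instance.fail(reason="task exited with an error", log="example log output")
```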

Workflow instances provide insights into the execution and results of workflows, facilitating better integration with the application.

We hope this overview of the data model provides you with a solid foundation for working with the PHYLOViZ Web Platform and maximizing its capabilities in analyzing and visualizing your genetic or epidemiological data.