
Proposal: Extensible Artifact Model #318

Open
okennedy opened this issue Aug 8, 2024 · 0 comments
Labels
layer-api An issue involving the vizier API layer layer-python An issue involving the Python compatibility code layer-scala An issue involving Scala compatibility code

okennedy commented Aug 8, 2024

Challenge

Vizier's current data model is:

  1. Tightly coupled to Apache Spark: This brings in a 600MB dependency (technically 1.2GB, since pip ends up installing it a second time for Python compatibility).
  2. Very ad hoc: Type translations are developed organically, on an as-needed basis.
  3. Reliant on 'canonical' types: Every data value has a canonical type. This often necessitates redundant or unnecessarily eager translations, most commonly with the Dataset type. For example, instead of simply letting Pandas interpret a LoadDataset('csv') with pd.read_csv, we have to go through Spark.
  4. No notion of multiple-role objects: For example, a CSV file is a file, but could also represent a dataframe defined over the file. Presently, it's possible to have both, but you need a separate artifact for each.
  5. No support for transient artifacts --- artifacts created temporarily as a cache.

Proposal Summary

  1. Provide Interfaces, Implementations, and Rust-style Into[]/From[] adaptors, mainly with an eye towards decoupling how Vizier and language servers interact with artifacts (Interfaces/Mixins) from the underlying representation of the artifact.
  2. Introduce the notion of 'cache' artifacts.

Concrete Proposal

The core idea is to decouple the physical representation of an artifact from the ways in which user code interacts with it. This breaks down into four concepts:

  • Encoding: The physical representation of the artifact
  • Interface: A conceptual 'role' that an artifact may play (e.g., Dataset, Image, or Integer), defined as a set of methods.
  • Implementation: Implementations of the methods of an Interface for a specific Encoding (or for an Interface).
  • Conversion: Code that translates one Encoding into another Encoding (or an Interface into an Encoding)
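As a rough sketch only, the four concepts could be modeled as Scala types roughly like the following. All names here (`Encoding`, `Implementation`, `CsvFile`, etc.) are hypothetical placeholders, not existing Vizier identifiers:

```scala
/** Encoding: the physical representation of an artifact (how it is stored). */
sealed trait Encoding
case class CsvFile(url: String) extends Encoding
case class InlineJson(data: String) extends Encoding

/** Interface: a conceptual role an artifact can play, as a set of methods. */
trait Dataset { def rowCount: Long }

/** Implementation: binds an Interface to a specific Encoding. */
trait Implementation[E <: Encoding, I] { def apply(enc: E): I }

/** Conversion: translates one Encoding into another Encoding. */
trait Conversion[A <: Encoding, B <: Encoding] { def convert(a: A): B }

// Example: a (stubbed) Dataset implementation for CSV-encoded artifacts.
object CsvAsDataset extends Implementation[CsvFile, Dataset] {
  def apply(enc: CsvFile): Dataset = new Dataset { def rowCount = 0L }
}
```

The key design point is that `Dataset` says nothing about storage, and `CsvFile` says nothing about behavior; only the `Implementation` edge connects them.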

Encoding

At present, Vizier's representation of an artifact consists of a small, opaque blob of text data (typically JSON). These blobs are interpreted based on the specific type of artifact, but the interpretation is entirely unstructured and performed on read. There is no common structure to the artifacts. This, in particular, makes things like reachability checks hard, since inter-artifact dependencies (e.g., a SQL query over existing tables) always need to be handled ad hoc.

The first major goal is to define a schema definition language for Artifacts. The schema definition needs to capture:

  • Serialization Standards (e.g., how the structure maps to JSON)
  • Type constraints (e.g., signed-ness and bit length of integers)
  • Nested dependencies (e.g., references to artifacts, and large-content/blob data on which the artifact depends)

Then, we define encodings for all of the existing artifact types, perhaps strengthening them somewhat (e.g., explicitly typed primitives, instead of generic parameters).

To emphasize the point: an encoding simply gives a name to the physical manifestation of the artifact and dictates how it is stored in the database. This should be the minimum required to reproduce the artifact (see Artifact Caching below), and should disregard any data that is only needed for efficiency (e.g., the URL of a file, but not its contents).

Some TODOs:

  • Design the schema language; implement it as Scala Case Classes, or similar.
  • Map all existing Artifact Types into the Encoding framework
  • Replace ArtifactType and its kin in the columns of the Artifact table with a reference to the Encoding used for the artifact.
  • Replace the hodgepodge Artifact.describe / summarize with something more sensible based on the encoding.
  • Elide all references to SparkSchema/SparkPrimitive, replacing them with references to Encodings. In particular, Dataset schemas should be based on Encodings rather than Spark DataTypes.
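As one possible shape for the first TODO, a schema language could be sketched with Scala case classes. Everything below (`FieldType`, `ArtifactRef`, `EncodingSchema`) is a hypothetical illustration of the three requirements (serialization structure, type constraints, nested dependencies), not a committed design:

```scala
/** Type constraints are explicit (e.g., signedness and bit length). */
sealed trait FieldType
case class IntType(bits: Int, signed: Boolean) extends FieldType
case object StringType extends FieldType
/** A nested dependency: a reference to another artifact. */
case class ArtifactRef(interface: String) extends FieldType

case class EncodingSchema(name: String, fields: Seq[(String, FieldType)]) {
  /** Inter-artifact dependencies become discoverable without ad-hoc parsing,
    * which is what makes reachability checks tractable. */
  def dependencies: Seq[String] =
    fields.collect { case (field, _: ArtifactRef) => field }
}

// A SQL query over an existing table declares its dependency structurally:
val sqlQuery = EncodingSchema("sql_query", Seq(
  "query"      -> StringType,
  "inputTable" -> ArtifactRef("dataset")
))
```

With this, a reachability check reduces to walking `dependencies` over stored schemas rather than re-implementing dependency extraction per artifact type.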

Interface

At present, Vizier uses ArtifactType and MIME types to differentiate the roles that an artifact can play. The Interface plays a similar role, by dictating a specific API to which an artifact can conform (i.e., governing how Vizier, its subsystems, and the user interact with it). Some examples include:

  • Dataset (Consider the many types of dataset we have right now)
  • Image (png, jpg, etc...)
  • File (independent of format)
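To make the multiple-role point from the Challenge section concrete: with Interfaces as traits, a single CSV-encoded artifact could expose both the Dataset and File roles without duplicating artifacts. The trait names and method sets below are illustrative, not proposed APIs:

```scala
/** Hypothetical Dataset role: column structure and cardinality. */
trait DatasetInterface {
  def schema: Seq[(String, String)]   // (column name, type name)
  def rowCount: Long
}

/** Hypothetical File role: raw contents, independent of format. */
trait FileInterface {
  def bytes: Array[Byte]
}

/** One artifact, two roles: a CSV is simultaneously a file and a dataframe. */
class CsvArtifact(header: Seq[String], rows: Seq[Seq[String]])
    extends DatasetInterface with FileInterface {
  def schema   = header.map(col => col -> "string")
  def rowCount = rows.size.toLong
  def bytes    = (header +: rows).map(_.mkString(",")).mkString("\n").getBytes
}
```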

Some TODOs:

  • Design the interface language; implement it as Scala Case Classes, or similar
  • Design Interfaces for all existing ArtifactTypes
  • Add a 'Summary' interface.
  • Allow Interfaces to provide descriptions (e.g., to replace Artifact.describe)

Implementation

(An Encoding -> Interface, or Interface -> Interface edge)

In order to decouple Encoding and Interface, we need a binding between the two. Somewhere in the code, we need to be able to define code that implements a specific interface for a specific encoding (e.g., how do I get the Spark dataframe for a CSV file? How do I get the Arrow dataframe? etc.).

Some TODOs:

  • We'll need a router: something to figure out which Implementation to invoke for a particular Encoding/Interface pair. This becomes harder if we want to allow Implementations from Interface to Interface.
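A minimal sketch of such a router, assuming string keys and untyped handlers purely for brevity (a real design would want typed keys, and Interface-to-Interface edges would additionally require a graph search over registered edges rather than a single map lookup):

```scala
/** Hypothetical registry keyed by (encoding name, interface name). */
object Router {
  private var impls = Map.empty[(String, String), Any => Any]

  /** Register an Implementation for an Encoding/Interface pair. */
  def register(encoding: String, interface: String)(f: Any => Any): Unit =
    impls += ((encoding, interface) -> f)

  /** Look up which Implementation to invoke for a requested pair. */
  def resolve(encoding: String, interface: String): Option[Any => Any] =
    impls.get((encoding, interface))
}

// Registration site, e.g. in the CSV encoding's module:
Router.register("csv", "dataset") { url => s"dataframe over $url" }
```

Resolution failure (a `None` from `resolve`) is the point at which the tiered fallbacks described under Platform Interactions below would kick in.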

Conversion

(An Encoding -> Encoding edge)

This is more or less the same as an Implementation, save that it generates a new Encoding (and, consequently, additional data).
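For illustration, a Conversion edge between two hypothetical encodings might look like this (the encoding names and the naive CSV parsing are placeholders; note that the output is a genuinely new Encoding with its own materialized data, not just a view):

```scala
/** Two hypothetical encodings of tabular data. */
case class CsvEncoding(text: String)
case class JsonEncoding(rows: Seq[Map[String, String]])

/** An Encoding -> Encoding edge: translate CSV text into row objects. */
def csvToJson(csv: CsvEncoding): JsonEncoding = {
  val lines  = csv.text.split("\n").toSeq
  val header = lines.head.split(",").toSeq
  JsonEncoding(lines.tail.map(line => header.zip(line.split(",")).toMap))
}
```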

Platform Interactions

Generic artifacts necessitate decoupling Vizier from its target platforms, including Spark (but also Scala and Python). This means that we need a code component to translate an Encoding of an artifact into the platform-native equivalent. The natural approach here is to define a set of tiered fallbacks:

  1. Platform-provided logic for directly translating an encoding into a platform-native representation (e.g., CSV File -> Spark Dataframe)
  2. Fall back to platform-provided logic for translating an encoding that implements a specific interface into a platform-native representation (e.g., Function).
  3. Fall back through Conversions to an encoding that is supported by case 1 or 2 (e.g., convert the dataframe to Arrow, then Arrow to Spark).
  4. Fall back to just providing the encoding directly (e.g., as the JSON-serialized artifact).
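The tiering above can be sketched as a chain of options, where each tier is tried in order and tier 4 always succeeds. The `Artifact` shape and the tier bodies are stand-ins; only the fallback structure is the point:

```scala
case class Artifact(encoding: String, json: String)

def exportToPlatform(a: Artifact, platform: String): String = {
  // Tier 1: direct encoding -> platform-native translation.
  def tier1: Option[String] =
    if (a.encoding == "csv" && platform == "spark") Some("spark.read.csv(...)")
    else None
  // Tier 2: interface-level translation (elided in this sketch).
  def tier2: Option[String] = None
  // Tier 3: convert to an encoding supported by tier 1 or 2 (elided).
  def tier3: Option[String] = None
  // Tier 4: last resort -- hand over the raw serialized encoding.
  def tier4: String = a.json

  tier1.orElse(tier2).orElse(tier3).getOrElse(tier4)
}
```

Because tier 4 is total, every artifact can always be exported in *some* form; the earlier tiers only improve fidelity and ergonomics on the platform side.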

Artifact Caching

[more to come]

@okennedy okennedy added layer-api An issue involving the vizier API layer layer-python An issue involving the Python compatibility code layer-scala An issue involving Scala compatibility code labels Aug 8, 2024
@okennedy okennedy added this to the Eventually milestone Aug 8, 2024
@okennedy okennedy changed the title Epic: Extensible Artifact Model Proposal: Extensible Artifact Model Aug 9, 2024