Skip to content

Latest commit

 

History

History
157 lines (127 loc) · 8.95 KB

DATA_MODEL.md

File metadata and controls

157 lines (127 loc) · 8.95 KB

ShardsDB Data Model

All relevant information is stored in a PostgreSQL database.

The database is the single source of truth and contains a pronounced data model which ensures consistency of values and references inside the database. Thus the database is independent of validations provided by ORM. Constraints and triggers are in place to make sure the data is inherently consistent, regardless of how it is accessed.

Shards

                                      Table "public.shards"
   Column    |           Type           | Collation | Nullable |             Default
-------------+--------------------------+-----------+----------+----------------------------------
 id          | bigint                   |           | not null | generated by default as identity
 name        | citext                   |           | not null |
 qualifier   | citext                   |           | not null | ''::citext
 description | text                     |           |          |
 created_at  | timestamp with time zone |           | not null | now()
 updated_at  | timestamp with time zone |           | not null | now()
 categories  | bigint[]                 |           | not null | '{}'::bigint[]
 archived_at | timestamp with time zone |           |          |
Check constraints:
    "shards_name_check" CHECK (name ~ '^[A-Za-z0-9_\-.]{1,100}$'::text)
    "shards_qualifier_check" CHECK (qualifier ~ '^[A-Za-z0-9_\-.]{0,100}$'::citext)

Simple, fault-tolerant, global naming schema for shards

Shards are generally called by their name. The name is defined in shard.yml and also applied when used as a dependency both in dependencies: mapping as well as the folder name in ./lib.

Because of shard's decentralized design, name clashes can't be ruled out.

Mirrors

Name clashes are often caused by mirrors of a shard's repository. This is usually not that problematic because they can be viewed as alternate repositories of the same shard.

Example: kemal

  • Main repo location: github:kemalcr/kemal. This is considered the canonical repo.
  • Old repo location: github:sdogruyol/kemal
  • There are also development forks like github:straight-shoota/kemal

They all reference the same shard. Mirrors could be considered as individual instances but unless they have separate releases, they are considered the same shard, just provided at a different location. When a mirror releases independently, it must be considered a fork and a separate shard.

Shardbox doesn't take care of this automatically. Mirrors need to be listed as mirrors belonging to the canonical repository in the catalog. Shardbox only every synchronizes releases from the canonical repository. To separate a mirror to form its own shard, it simply needs to be removed from the mirrors list.

Homonymous Shards

There might also be unrelated shards which share the same name. This is obviously not ideal, but can't really be avoided when there is no centralized registry that assigns names.

This problem is approached as follows:

  • Shards are generally identified by their name as specified in shard.yml (e.g. kemal).
  • When there are multiple shards of the same name (related or unrelated) in the database, an additional qualifier is used to tell them apart (e.g. kemalcr or straight-shoota). This qualifier is interpreted as #{qualifier}'s version of #{name}
  • Qualifiers can be omitted when there is no ambiguity.
  • Name and qualifier combined form a slug which could look like kemal (no qualifier) or kemal~straight-shoota.
  • Avoids / as delimiter for easier use in HTTP routes and to distinguish from <org>/<project> scheme typically used for source code hosting.

Shardbox automatically assigns qualifiers derived from the URL. Qualifiers are only assigned when a shard name has already been taken (first come – first serve). There is currently no mechanism for modifying qualifiers (but might be implemented in the future).

Repos

                                        Table "public.repos"
     Column     |           Type           | Collation | Nullable |             Default
----------------+--------------------------+-----------+----------+----------------------------------
 id             | bigint                   |           | not null | generated by default as identity
 shard_id       | bigint                   |           |          |
 resolver       | repo_resolver            |           | not null |
 url            | citext                   |           | not null |
 role           | repo_role                |           | not null | 'canonical'::repo_role
 metadata       | jsonb                    |           | not null | '{}'::jsonb
 sync_failed_at | timestamp with time zone |           |          |
 synced_at      | timestamp with time zone |           |          |
 created_at     | timestamp with time zone |           | not null | now()
 updated_at     | timestamp with time zone |           | not null | now()
Check constraints:
    "repos_obsolete_role_shard_id_null" CHECK (role <> 'obsolete'::repo_role OR shard_id IS NULL)
    "repos_resolvers_service_url" CHECK (NOT (resolver = ANY (ARRAY['github'::repo_resolver, 'gitlab'::repo_resolver, 'bitbucket'::repo_resolver])) OR url ~ '^[A-Za-z0-9_\-.]{1,100}/[A-Za-z0-9_\-.]{1,100}$'::citext AND url !~~ '%.git'::citext)
    "repos_shard_id_null_role" CHECK (shard_id IS NOT NULL OR role = 'canonical'::repo_role OR role = 'obsolete'::repo_role)

NOTES:

  • role specifies the role of this repo for the shard (defaults to canonical). Other values are mirror, legacy and obsolete. Thus, multiple repositories can be linked to the same shard. This is important for example to preserve continuity when a repository is transferred to a different location (for example github:sdogruyol/kemal to github:kemalcr/kemal) and to map mirrors of the same shard.

Releases

                                      Table "public.releases"
    Column     |           Type           | Collation | Nullable |             Default
---------------+--------------------------+-----------+----------+----------------------------------
 id            | bigint                   |           | not null | generated by default as identity
 shard_id      | bigint                   |           | not null |
 version       | character varying        |           | not null |
 revision_info | jsonb                    |           | not null |
 spec          | jsonb                    |           | not null |
 position      | integer                  |           | not null |
 latest        | boolean                  |           |          |
 released_at   | timestamp with time zone |           | not null |
 yanked_at     | timestamp with time zone |           |          |
 created_at    | timestamp with time zone |           | not null | now()
 updated_at    | timestamp with time zone |           | not null | now()
Check constraints:
    "releases_latest_check" CHECK (latest <> false)
    "releases_version_check" CHECK (version::text ~ '^[0-9]+(\.[0-9a-zA-Z]+)*(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$'::text OR version::text = 'HEAD'::text)

NOTES:

  • Releases are bound to a shard (shard_id), not an individual repo because repo locations may change. We consider each shard to have a unique release history determined by the releases provided by the canonical repo.
  • position is a utility column used to sort versions because PostgreSQL doesn't provide proper comparison operator for version strings. There is a semver extension, but it requires versions to follow SEMVER, which is not enforced by shards. So we need to implement sort order externally using Service::OrderReleases. As a benefit, result sorting is simple integer comparison.
  • If a release has been deleted from the repo (i.e. the tag was removed) it is marked as yanked. This procedure needs refinement. Yanked releases should still be addressable.
  • When a tag is changed to point to a different commit, it is simply updated. This also needs refinement.

Dependencies

                      Table "public.dependencies"
   Column   |           Type           | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
 release_id | bigint                   |           | not null |
 name       | citext                   |           | not null |
 repo_id    | bigint                   |           |          |
 spec       | jsonb                    |           | not null |
 scope      | dependency_scope         |           | not null |
 created_at | timestamp with time zone |           | not null | now()
 updated_at | timestamp with time zone |           | not null | now()

NOTES:

  • repo_id points to the repo referenced as dependency. This is only NULL if it cannot be resolved every (path scheme). In all other cases, repo_id points to a repository record, but might not be actually resolvable (when the repository is not available).
  • The dependent shard is available through joining releases on release_id.
  • Scope is either runtime or dependency.