# Docs: promoting to production #162

_Merged, 10 commits, Aug 20, 2024._
**File: `.markdownlint-cli2.jsonc`** (5 changes: 4 additions & 1 deletion)
```jsonc
{
  "config": {
    "MD013": false, // disable line length checks
    "MD033": {
      "allowed_elements": [ "a", "b", "details", "summary" ]
    }
  }
}
```
**File: `README.md`** (98 changes: 70 additions & 28 deletions)

[![GitHub Workflow Status (with event)](https://img.shields.io/github/actions/workflow/status/nasa-impact/veda-data/ci.yaml?style=for-the-badge&label=CI)](https://github.com/NASA-IMPACT/veda-data/actions/workflows/ci.yaml)

This repository houses config data used to load datasets into the [VEDA catalog](https://nasa-impact.github.io/veda-docs/services/apis.html). Inclusion in the VEDA catalog is a prerequisite for displaying datasets in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/).

The config data provided here gets processed in the [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow) ingestion system. See [Dataset Submission Process](#dataset-submission-process) for details about submitting work to the ingestion system.

## Dataset submission process

![veda-data-publication][veda-data-publication]

To add data to VEDA, you will:

1. **Stage your files:** Upload files to the staging bucket `s3://veda-data-store-staging` (possible with a VEDA JupyterHub account; request access [here](https://nasa-impact.github.io/veda-docs/services/jupyterhub.html)) or to a self-hosted S3 bucket that shares read access with the VEDA service. See the upload sketch after this list.

2. **Generate STAC metadata in the staging catalog:** Metadata must first be added to the staging catalog at [staging.openveda.cloud/api/stac](https://staging.openveda.cloud/api/stac). You will need to create a dataset config file and submit it to the `/workflows/dataset/publish` endpoint to generate STAC Collection metadata and Item records for the files you uploaded in Step 1. See the detailed steps for the [dataset submission process](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/) in the contributing section of [veda-docs](https://nasa-impact.github.io/veda-docs).

3. **Acceptance testing:** Perform acceptance testing appropriate for your data. In most cases this means opening a dataset PR in [veda-config](https://github.com/NASA-IMPACT/veda-config) to generate a dashboard preview of the data; see [veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html) for instructions on generating a dashboard preview.

4. **Promote to production!** Open a PR in the [veda-data](https://github.com/NASA-IMPACT/veda-data) repo with the dataset config metadata you used to add your data to the staging catalog in Step 2. Add your config to `ingestion-data/production/dataset-config`. When your PR is approved, this configuration will be used to generate records in the production VEDA catalog!

5. **[Optional] Share your data:** Share your data in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/) by submitting a PR to [veda-config](https://github.com/NASA-IMPACT/veda-config) (see [veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html)), and add JupyterHub-hosted usage examples to [veda-docs/contributing/docs-and-notebooks](https://nasa-impact.github.io/veda-docs/contributing/docs-and-notebooks.html).
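For step 1, a minimal upload sketch with boto3 (the local path, key, and dataset name are hypothetical; only the bucket name comes from the list above):

```python
# Minimal sketch of staging a file (step 1). Paths are hypothetical;
# only the bucket name comes from the documentation above.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="local/no2-monthly/OMI_trno2_2023.tif",  # local file to stage (made up)
    Bucket="veda-data-store-staging",                 # VEDA staging bucket
    Key="no2-monthly/OMI_trno2_2023.tif",             # key prefix typically names the dataset
)
```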

## Project ingestion data structure

To submit STAC records for ingestion, open a pull request with the data structured as described below. The `ingestion-data/` directory contains artifacts of the ingestion configuration used to publish to the staging and production catalogs.

> **Note**
> Various ingestion workflows are supported and documented below, but only the configuration metadata used to publish to the VEDA catalog is stored in this repo. Not every ingestion follows exactly the same pattern, nor will each ingested collection have every type of configuration metadata here. The primary ingestion method is [**`dataset-config`**](#stagedataset-config).

### `<stage>/collections/`

The `ingestion-data/<stage>/collections/` directory holds JSON files representing the data for VEDA collection metadata (STAC). STAC Collection metadata can be generated from an id, title, and description using pystac. See this [veda-docs/contributing notebook example](https://nasa-impact.github.io/veda-docs/notebooks/veda-operations/stac-collection-creation.html) to get started.
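For example, a minimal pystac sketch (the id, title, description, extents, and license are placeholder values):

```python
# Minimal STAC Collection built with pystac; all values are placeholders.
from datetime import datetime

import pystac

collection = pystac.Collection(
    id="no2-monthly",
    title="NO2 Monthly",
    description="Hypothetical monthly NO2 dataset used for illustration.",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime(2016, 1, 1), None]]),
    ),
    license="CC0-1.0",
)

print(collection.to_dict())  # JSON-ready dict to save under collections/
```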

Each file should follow this format:

<details>
<summary><b>/collections/collection_id.json</b></summary>

```json
{
"id": "<collection-id>",
## …remainder of the collection JSON truncated in the diff view…
}
```

</details>

### `<stage>/discovery-items/`

The `ingestion-data/<stage>/discovery-items/` directory holds JSON files representing the inputs for initiating the discovery, ingest, and publication workflows. Each file can contain either a single input event or a list of input events.

Each file should follow this format:

<details>
<summary><b>/discovery-items/collection_id.json</b></summary>

```json
{
"collection": "<collection-id>",
"discovery": "<s3/cmr>",

## for s3 discovery
"prefix": "<s3-key-prefix>",
"bucket": "<s3-bucket>",
"filename_regex": "<filename-regex>",
"datetime_range": "<month/day/year>",

## for cmr discovery
"version": "<collection-version>",
"temporal": ["<start-date>", "<end-date>"],
"bounding_box": ["<bounding-box-as-comma-separated-LBRT>"],
"include": "<filename-pattern>",

### misc
"cogify": "<true/false>",
"upload": "<true/false>",
"dry_run": "<true/false>",
"dry_run": "<true/false>"
}
```

### `dataset-config/`
</details>
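For illustration, a hypothetical s3 discovery input (all values are made up; the fields mirror the template above):

```python
# Hypothetical s3 discovery input event; a list of such events is also accepted.
discovery_event = {
    "collection": "no2-monthly",           # assumed collection id
    "discovery": "s3",
    "prefix": "no2-monthly/",
    "bucket": "veda-data-store-staging",
    "filename_regex": "^.*\\.tif$",        # match staged GeoTIFFs
    "datetime_range": "month",
}

# Multiple inputs can be batched together:
discovery_events = [discovery_event]
```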

### `<stage>/dataset-config/`

The `ingestion-data/<stage>/dataset-config/` directory holds JSON files that can be used with the `dataset/publish` workflows endpoint, combining both collection metadata and discovery items. For an example of this ingestion workflow, see the [GEOGLAM ingest notebook](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/transformation-scripts/example-template/example-geoglam-ingest.ipynb) in the contributing section of veda-docs.

<details>
<summary><b>/dataset-config/collection_id.json</b></summary>

```json
{
## …collection metadata and discovery-items fields truncated in the diff view…
}
]
}

```

</details>
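As a sketch of the publish step (the full endpoint URL, auth scheme, and file path below are assumptions; follow the veda-docs ingestion guide for the authoritative flow):

```python
# Sketch: submit a dataset config to the staging publish endpoint.
# The URL, auth scheme, and file path are assumptions for illustration.
import json
import os

import requests

token = os.environ["VEDA_API_TOKEN"]  # obtained out of band (assumption)

with open("ingestion-data/staging/dataset-config/no2-monthly.json") as f:
    dataset_config = json.load(f)

resp = requests.post(
    "https://staging.openveda.cloud/api/workflows/dataset/publish",  # assumed URL
    headers={"Authorization": f"Bearer {token}"},
    json=dataset_config,
)
resp.raise_for_status()
print(resp.json())
```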

### `production/transfer-config/`

This directory contains the configuration needed to execute a stand-alone Airflow DAG that copies data from a specified staging bucket and prefix to a permanent location in `s3://veda-data-store`, using the `collection_id` as a prefix.

Each file should follow this format:

<details>
<summary><b>/production/transfer-config/collection_id.json</b></summary>

```json
{
"collection": "<collection-id>",

## the location of the staged files
"origin_bucket": "<s3-bucket>",
"origin_prefix": "<s3-key-prefix>",
"bucket": "<s3-bucket>",
"filename_regex": "<filename-regex>",

### misc
"dry_run": "<true/false>"
}
```

</details>
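Illustratively, the copy this DAG performs resembles the boto3 sketch below (the collection id and key layout are assumptions; the bucket names come from the text above):

```python
# Illustration only: roughly what the transfer DAG does, sketched with boto3.
import boto3

s3 = boto3.resource("s3")
collection_id = "no2-monthly"  # hypothetical

staged = s3.Bucket("veda-data-store-staging").objects.filter(Prefix=f"{collection_id}/")
for obj in staged:
    s3.meta.client.copy(
        CopySource={"Bucket": obj.bucket_name, "Key": obj.key},
        Bucket="veda-data-store",                             # permanent location
        Key=f"{collection_id}/{obj.key.rsplit('/', 1)[-1]}",  # collection_id as prefix
    )
```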

## Validation
> **@smohiudd** (Contributor) commented on Aug 15, 2024:
>
> @anayeaye as part of the validation should we also create a JSON schema that includes our minimum requirements for collection metadata, to maintain consistency across the production catalog? For example, all collections in VEDA must include:
>
> - providers
> - thumbnail assets
> - renders

> **@anayeaye** (Contributor, Author) replied:
>
> Ooof, I just included those in an optional section in the updated GEOGLAM notebook that is now runnable in the hub: https://github.com/NASA-IMPACT/veda-docs/blob/c9dc33a8f2960fa29a59db2974f979b12055f8ca/contributing/dataset-ingestion/example-template/example-geoglam-ingest.ipynb

> **@anayeaye** (Contributor, Author) replied:
>
> I don't disagree on a firm nudge to create renders and thumbnails, but I don't think we should require them until we have great docs on how to set them up plus a way for users to upload thumbs (both of which are doable).
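A minimal sketch of the kind of check proposed in this thread (the schema and file path are hypothetical, mirroring only the fields named above; this is not an agreed VEDA standard):

```python
# Hypothetical minimum-requirements check for collection metadata.
import json

from jsonschema import validate  # pip install jsonschema

minimum_collection_schema = {
    "type": "object",
    "required": ["providers", "assets", "renders"],
    "properties": {
        "assets": {
            "type": "object",
            "required": ["thumbnail"],  # a thumbnail asset must be present
        }
    },
}

with open("ingestion-data/production/collections/no2-monthly.json") as f:  # made-up path
    collection = json.load(f)

validate(instance=collection, schema=minimum_collection_schema)  # raises on failure
```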


This repository provides a script for validating all collections in the `ingestion-data/` directory.
First, install the requirements (preferably in a virtual environment):

```shell
# …install and validation commands truncated in the diff view…
pip-compile
```

This will update `requirements.txt` with a complete, realized set of Python dependencies.

[veda-data-publication]: ./docs/publishing-data.excalidraw.png
*Binary file added: `docs/publishing-data.excalidraw.png`*