
# Docs: promoting to production #162

**Merged** · 10 commits · Aug 20, 2024

**README.md** (+60, −29)

This repository houses data used to define a VEDA dataset to load into the [VEDA catalog](https://nasa-impact.github.io/veda-docs/services/apis.html). Inclusion in the VEDA catalog is a prerequisite for displaying the dataset in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/).

The data provided here gets processed in the ingestion system [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow); see [Dataset submission process](#dataset-submission-process) for details about submitting work to the ingestion system.

## Dataset submission process
![][veda-data-publication]

To add data to VEDA you will:

1. **Stage your files:** Upload files to the staging bucket (you can do this with a VEDA JupyterHub account; request access [here](https://nasa-impact.github.io/veda-docs/services/jupyterhub.html)) or to a self-hosted bucket in S3 (see the upload sketch after this list).
2. **Generate STAC metadata in the staging catalog:** Use the workflows `dataset/publish` endpoint to generate STAC Collection metadata and Item records for the files you have uploaded. See detailed steps for the [dataset submission process](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/) in the contributing section of [veda-docs](https://nasa-impact.github.io/veda-docs).
3. **Acceptance testing\*:** Perform acceptance testing appropriate for your data. \*In most cases this will mean opening a dataset PR in [veda-config](https://github.com/NASA-IMPACT/veda-config) to generate a dashboard preview of the data ([see veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html) for instructions on generating a dashboard preview).
4. **Promote to production!** Open a PR <ins>in this project</ins> with the dataset-config metadata you used to add your data to the staging catalog via the workflows `dataset/publish` API (add your config to `ingestion-data/production/dataset-config`). When your PR is approved, this configuration will be used to generate records in the production VEDA catalog!
5. **[Optional] Share your data:** Share your data in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/) by submitting a PR to [veda-config](https://github.com/NASA-IMPACT/veda-config) ([see veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html)) and adding JupyterHub-hosted usage examples to [veda-docs/contributing/docs-and-notebooks](https://nasa-impact.github.io/veda-docs/contributing/docs-and-notebooks.html).
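
For step 1, a minimal upload sketch using boto3 (the bucket name and key prefix are placeholders; use the staging bucket and prefix you were granted access to):

```python
import boto3

# Placeholders: substitute the staging bucket and prefix you were given.
STAGING_BUCKET = "<staging-bucket>"
COLLECTION_ID = "<collection-id>"

s3 = boto3.client("s3")

# Upload a local COG so the discovery workflow can find it later.
s3.upload_file(
    Filename="CropMonitor_2023_06_28.tif",
    Bucket=STAGING_BUCKET,
    Key=f"{COLLECTION_ID}/CropMonitor_2023_06_28.tif",
)
```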

## Project ingestion data structure

To submit STAC records for ingestion, open a pull request with the data structured as described below. The `ingestion-data/` directory contains artifacts of the ingestion configuration used to publish to the staging and production catalogs.

> **Note**
> Various ingestion workflows are supported and documented below, but only the configuration metadata used to publish to the VEDA catalog is stored in this repo. It is not expected that every ingestion will follow exactly the same pattern, nor will each ingested collection have all types of configuration metadata here. <ins>The primary method used to ingest is [**`dataset-config`**](#stagedataset-config)</ins>.

### `<stage>/collections/`


The `ingestion-data/collections/` directory holds JSON files representing the data for VEDA collection metadata (STAC). STAC Collection metadata can be generated from an id, title, and description using pystac. See this [veda-docs/contributing notebook example](https://nasa-impact.github.io/veda-docs/notebooks/veda-operations/stac-collection-creation.html) to get started.

Files should follow this format:

<details>
<summary><b>/collections/collection_id.json</b></summary>

```json
{
"id": "<collection-id>",
  ...
}

```
</details>
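
A minimal pystac sketch for generating a collection skeleton in this format (the id, title, description, license, and extent values are placeholders, and the output path assumes the staging stage):

```python
import json
from datetime import datetime, timezone

import pystac

# Placeholder values; replace with your collection's metadata.
collection = pystac.Collection(
    id="<collection-id>",
    title="<collection-title>",
    description="<collection-description>",
    license="CC0-1.0",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent(
            [[datetime(2020, 1, 1, tzinfo=timezone.utc), None]]
        ),
    ),
)

# Write the STAC Collection JSON where the ingestion tooling expects it.
with open("ingestion-data/staging/collections/<collection-id>.json", "w") as f:
    json.dump(collection.to_dict(), f, indent=2)
```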

### `<stage>/discovery-items/`

The `ingestion-data/discovery-items/` directory holds JSON files representing the Step Function inputs for initiating the discovery, ingest, and publication workflows. Each file can contain either a single input event or a list of input events.

Files should follow this format:

<details>
<summary><b>/discovery-items/collection_id.json</b></summary>

```json
{
"collection": "<collection-id>",
"discovery": "<s3/cmr>",

## for s3 discovery
"prefix": "<s3-key-prefix>",
"bucket": "<s3-bucket>",
"filename_regex": "<filename-regex>",
"datetime_range": "<month/day/year>",

## for cmr discovery
"version": "<collection-version>",
"temporal": ["<start-date>", "<end-date>"],
"bounding_box": ["<bounding-box-as-comma-separated-LBRT>"],
"include": "<filename-pattern>",

### misc
"cogify": "<true/false>",
"upload": "<true/false>",
"dry_run": "<true/false>",
"dry_run": "<true/false>"
}
```
</details>
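
For instance, a hypothetical S3 discovery input might look like the following (the collection id, bucket, prefix, and regex are all illustrative):

```json
{
    "collection": "no2-monthly",
    "discovery": "s3",
    "prefix": "no2-monthly/",
    "bucket": "veda-data-store-staging",
    "filename_regex": "^(.*).tif$",
    "datetime_range": "month",
    "cogify": false,
    "upload": false,
    "dry_run": true
}
```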

### `<stage>/dataset-config/`

The `ingestion-data/dataset-config/` directory holds JSON files that can be used with the `dataset/publish` workflows endpoint, combining both collection metadata and discovery items. For an example of this ingestion workflow, see this [Jupyter notebook](./transformation-scripts/example-template/example-geoglam-ingest.ipynb).

<details>
<summary><b>/dataset-config/collection_id.json</b></summary>

```json
{
  ...
]
}
```
</details>
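
As a sketch of that workflow (it assumes you already have a Cognito bearer token, e.g. from the login step in the example notebook, and that the endpoint accepts a standard `Authorization` header):

```python
import json

import requests

STAGING_WORKFLOWS_API = "https://staging.openveda.cloud/api/workflows/"
TOKEN = "<token-from-cognito-login>"  # placeholder; obtain via the Cognito client

# Load a dataset-config file from this repository.
with open("ingestion-data/production/dataset-config/<collection-id>.json") as f:
    dataset = json.load(f)

headers = {
    "Authorization": f"Bearer {TOKEN}",  # assumption: bearer-token auth
    "content-type": "application/json",
    "accept": "application/json",
}

# Validate first; publish only once validation passes.
for endpoint in ("dataset/validate", "dataset/publish"):
    response = requests.post(STAGING_WORKFLOWS_API + endpoint, json=dataset, headers=headers)
    response.raise_for_status()
    print(response.text)
```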

### `production/transfer-config`

This directory contains the configuration needed to execute a standalone Airflow DAG that copies data from a specified staging bucket and prefix to a permanent location in `s3://veda-data-store`, using the `collection_id` as a prefix.

Files should follow this format:

<details>
<summary><b>/production/transfer-config/collection_id.json</b></summary>

```json
{
"collection": "<collection-id>",

## the location of the staged files
"origin_bucket": "<s3-bucket>",
"origin_prefix": "<s3-key-prefix>",
"bucket": "<s3-bucket>",
"filename_regex": "<filename-regex>",

### misc
"dry_run": "<true/false>"
}
```
</details>
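
A filled-in example (all values illustrative) might look like:

```json
{
    "collection": "no2-monthly",
    "origin_bucket": "veda-data-store-staging",
    "origin_prefix": "no2-monthly/",
    "bucket": "veda-data-store",
    "filename_regex": "^(.*).tif$",
    "dry_run": true
}
```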

## Validation
> **@smohiudd** (Contributor) commented on Aug 15, 2024:
>
> @anayeaye as part of the validation should we also create a JSON schema that includes our minimum requirements for collection metadata to maintain consistency across the production catalog? For example, all collections in VEDA must include:
>
> - providers
> - thumbnail assets
> - renders

> **@anayeaye** (Author) replied:
>
> Ooof I just included those there in an optional section in the updated geoglam notebook that is now runnable in the hub: https://github.com/NASA-IMPACT/veda-docs/blob/c9dc33a8f2960fa29a59db2974f979b12055f8ca/contributing/dataset-ingestion/example-template/example-geoglam-ingest.ipynb

> **@anayeaye** (Author) replied:
>
> I don't disagree on a firm nudge to create renders and thumbnails, but I don't think we should require them until we have great docs on how to set them up + a way for users to upload thumbs (which are both doable).
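
A minimal sketch of the kind of schema proposed above (JSON Schema draft 2020-12; the required fields are illustrative, not an agreed standard):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["id", "description", "license", "extent", "providers", "assets", "renders"],
  "properties": {
    "providers": { "type": "array", "minItems": 1 },
    "assets": {
      "type": "object",
      "required": ["thumbnail"]
    },
    "renders": { "type": "object" }
  }
}
```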


This repository provides a script for validating all collections in the `ingestion-data` directory.
First, install the requirements (preferably in a virtual environment):

```shell
...
```

...

```shell
pip-compile
```

This will update `requirements.txt` with a complete, realized set of Python dependencies.


[veda-data-publication]: ./docs/publishing-data.excalidraw.png
**docs/publishing-data.excalidraw.png** (binary file added)
**transformation-scripts/example-template/example-geoglam-ingest.ipynb**
"source": [
> **@anayeaye** (Author) commented:
>
> Should this notebook be moved to veda-docs?
>
> **Contributor** replied:
>
> Agree that this should move to veda-docs
"from cognito_client import CognitoClient\n",
"\n",
"STAGING_CLIENT_ID = \"4rhmpnmnk3rgd9qtiuarllppau\"\n",
"STAGING_USERPOOL_ID = \"us-west-2_0G3VRilt1\"\n",
"STAGING_IDENTITY_POOL_ID = \"us-west-2:ad6647b6-b410-4e73-8205-28a066c290fb\"\n",
"\n",
"# Obtain a token from the cognito client \n",
"client = CognitoClient(\n",
" client_id=\"o8c93cebc17upumgstlbqm44f\",\n",
" user_pool_id=\"us-west-2_9mMSsMcxw\",\n",
" identity_pool_id=\"us-west-2:40f39c19-ab88-4d0b-85a3-3bad4eacbfc0\",\n",
" client_id = STAGING_CLIENT_ID,\n",
" user_pool_id = STAGING_USERPOOL_ID,\n",
" identity_pool_id = STAGING_IDENTITY_POOL_ID\n",
")\n",
"_ = client.login()\n",
"\n",
...
"metadata": {},
"outputs": [],
"source": [
"API = \"https://ig9v64uky8.execute-api.us-west-2.amazonaws.com/staging/\"\n",
"STAGING_WORKFLOWS_API = \"https://staging.openveda.cloud/api/workflows\"\n",
"\n",
"LOCAL_FILE_PATH = \"CropMonitor_2023_06_28.tif\"\n",
"YEAR, MONTH = 2023, 6\n",
...
" \"content-type\": \"application/json\",\n",
" \"accept\": \"application/json\",\n",
"}\n",
"response = requests.post((API + \"dataset/validate\"), json=dataset, headers=headers)\n",
"response = requests.post((STAGING_WORKFLOWS_API + \"dataset/validate\"), json=dataset, headers=headers)\n",
"response.raise_for_status()\n",
"print(response.text)"
]
...
}
],
"source": [
"response = requests.post((API + \"dataset/publish\"), json=dataset, headers=headers)\n",
"response = requests.post((STAGING_WORKFLOWS_API + \"dataset/publish\"), json=dataset, headers=headers)\n",
"response.raise_for_status()\n",
"print(response.text)"
]