diff --git a/apps/docs/docs/contribute/connect-data/bigquery/_category_.json b/apps/docs/docs/contribute/connect-data/bigquery/_category_.json
new file mode 100644
index 000000000..db77c4917
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/bigquery/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "Connect via BigQuery",
+  "position": 1,
+  "link": {
+    "type": "doc",
+    "id": "index"
+  }
+}
diff --git a/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png b/apps/docs/docs/contribute/connect-data/bigquery/bigquery-open-perms.png
similarity index 100%
rename from apps/docs/docs/contribute/connect-data/bigquery-open-perms.png
rename to apps/docs/docs/contribute/connect-data/bigquery/bigquery-open-perms.png
diff --git a/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png b/apps/docs/docs/contribute/connect-data/bigquery/bigquery-set-perms.png
similarity index 100%
rename from apps/docs/docs/contribute/connect-data/bigquery-set-perms.png
rename to apps/docs/docs/contribute/connect-data/bigquery/bigquery-set-perms.png
diff --git a/apps/docs/docs/contribute/connect-data/bigquery.md b/apps/docs/docs/contribute/connect-data/bigquery/index.md
similarity index 85%
rename from apps/docs/docs/contribute/connect-data/bigquery.md
rename to apps/docs/docs/contribute/connect-data/bigquery/index.md
index 8aa081178..73ae685d2 100644
--- a/apps/docs/docs/contribute/connect-data/bigquery.md
+++ b/apps/docs/docs/contribute/connect-data/bigquery/index.md
@@ -5,7 +5,11 @@ sidebar_position: 1
 
 BigQuery's built-in data-sharing capabilities make it
 trivially easy to integrate any public dataset into
-the OSO data pipeline.
+the OSO data pipeline, provided the dataset exists in
+the US multi-region.
+
+If the dataset needs to be replicated, see our guide on
+the [BigQuery Data Transfer Service](./replication.md).
 
 ## Make the data available in the US region
 
@@ -22,9 +26,9 @@ you can do this directly from the
 [BigQuery Studio](https://console.cloud.google.com/bigquery/transfers?project=opensource-observer).
 
 OSO will also copy certain valuable datasets into the
-`opensource-observer` project via Dagster assets.
-See the [Dataset replication](#oso-dataset-replication)
-section below to add a Dagster asset to OSO.
+`opensource-observer` project via the BigQuery Data Transfer Service.
+See the guide on the [BigQuery Data Transfer Service](./replication.md)
+to add dataset replication as a Dagster asset to OSO.
 
 ## Make the data accessible to our Google service account
 
@@ -123,17 +127,3 @@ Coming soon... This section is a work in progress.
 :::warning
 Coming soon... This section is a work in progress.
 :::
-
-## OSO Dataset Replication
-
-In order to make the OSO data pipeline more robust,
-we can copy datasets into the `opensource-observer` Google Cloud project.
-
-:::warning
-Coming soon... This section is a work in progress.
-To track progress, see this
-[GitHub issue](https://github.com/opensource-observer/oso/issues/1311).
-:::
-
-Dagster also has an excellent tutorial on integrating
-[BigQuery with Dagster](https://docs.dagster.io/integrations/bigquery/using-bigquery-with-dagster).
diff --git a/apps/docs/docs/contribute/connect-data/dagster.md b/apps/docs/docs/contribute/connect-data/dagster.md
index 722fbda4b..1f21c9ce3 100644
--- a/apps/docs/docs/contribute/connect-data/dagster.md
+++ b/apps/docs/docs/contribute/connect-data/dagster.md
@@ -7,7 +7,7 @@ import NextSteps from "./dagster-config.mdx"
 
 Before writing a fully custom Dagster asset,
 we recommend you first see if the previous guides on
-[BigQuery datasets](./bigquery.md),
+[BigQuery datasets](./bigquery/index.md),
 [database replication](./database.md),
 [API crawling](./api.md)
 may be a better fit.
diff --git a/apps/docs/docs/contribute/connect-data/index.md b/apps/docs/docs/contribute/connect-data/index.md
index f7aca8874..3c807d173 100644
--- a/apps/docs/docs/contribute/connect-data/index.md
+++ b/apps/docs/docs/contribute/connect-data/index.md
@@ -10,7 +10,7 @@ We're always looking for new data sources to integrate with OSO and deepen our c
 
 There are currently the following patterns for integrating new data sources into OSO, in order of preference:
 
-1. [**BigQuery public datasets**](./bigquery.md): If you can maintain a BigQuery public dataset, this is the preferred and easiest route.
+1. [**BigQuery public datasets**](./bigquery/index.md): If you can maintain a BigQuery public dataset, this is the preferred and easiest route.
 2. [**Database replication**](./database.md): Replicate your database into an OSO dataset (e.g. from Postgres).
 3. [**API crawling**](./api.md): Crawl an API by writing a plugin.
 4. [**Files into Google Cloud Storage (GCS)**](./gcs.md): You can drop Parquet/CSV files in our GCS bucket for loading into BigQuery.
diff --git a/apps/docs/docs/contribute/connect-data/bigquery/replication.md b/apps/docs/docs/contribute/connect-data/bigquery/replication.md
new file mode 100644
index 000000000..845501e3b
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/bigquery/replication.md
@@ -0,0 +1,56 @@
+---
+title: 🏗️ Using the BigQuery Data Transfer Service
+sidebar_position: 2
+---
+
+import NextSteps from "../dagster-config.mdx"
+
+BigQuery comes with a built-in data transfer service
+for replicating datasets between BigQuery projects/regions,
+from Amazon S3, and from various Google services.
+In this guide, we'll copy an existing BigQuery dataset into the
+`opensource-observer` Google Cloud project on a regular schedule.
+
+If you already maintain a public dataset in
+the US multi-region, you can simply create a dbt source
+as shown in [this guide](./index.md).
+
+## OSO Dataset Replication
+
+:::warning
+Coming soon... This section is a work in progress.
+To track progress, see this
+[GitHub issue](https://github.com/opensource-observer/oso/issues/1311).
+:::
+
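+In the meantime, here is a minimal sketch of the kind of
+scheduled copy we have in mind, using the official
+[Python client for the Data Transfer Service](https://cloud.google.com/bigquery/docs/copying-datasets).
+The source project, dataset names, and schedule below are placeholders:
+
+```python
+from google.cloud import bigquery_datatransfer
+
+client = bigquery_datatransfer.DataTransferServiceClient()
+
+# Schedule a recurring copy of the source dataset into `opensource-observer`.
+transfer_config = bigquery_datatransfer.TransferConfig(
+    destination_dataset_id="my_dataset_copy",  # placeholder
+    display_name="Copy my_dataset",  # placeholder
+    data_source_id="cross_region_copy",  # built-in dataset copy service
+    params={
+        "source_project_id": "my-source-project",  # placeholder
+        "source_dataset_id": "my_dataset",  # placeholder
+    },
+    schedule="every 24 hours",  # placeholder
+)
+transfer_config = client.create_transfer_config(
+    parent=client.common_project_path("opensource-observer"),
+    transfer_config=transfer_config,
+)
+print(f"Created transfer config: {transfer_config.name}")
+```
+
+Once created, the transfer appears in the
+[BigQuery Studio transfers page](https://console.cloud.google.com/bigquery/transfers?project=opensource-observer),
+where you can monitor each scheduled run.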