diff --git a/apps/docs/docs/contribute/connect-data/_category_.json b/apps/docs/docs/contribute/connect-data/_category_.json
new file mode 100644
index 000000000..d6a76a91e
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "Connect Your Data",
+  "position": 4,
+  "link": {
+    "type": "doc",
+    "id": "index"
+  }
+}
diff --git a/apps/docs/docs/contribute/connect-data/airbyte.md b/apps/docs/docs/contribute/connect-data/airbyte.md
new file mode 100644
index 000000000..3fe0d97d5
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/airbyte.md
@@ -0,0 +1,173 @@
+---
+title: Connect via Airbyte
+sidebar_position: 2
+---
+
+## Replicating external databases
+
+If your data exists in an off-the-shelf database,
+you can replicate data to OSO via an Airbyte connector or
+Singer.io tap integration through Meltano. This section provides the details
+necessary to add a connector or a tap from an existing Postgres database into
+our system. Other databases or data sources should be similar.
+
+### Setting up your Postgres database for connection
+
+We will set up the Postgres connection to use Change Data Capture (CDC), which is
+recommended for very large databases. You will need the following in order
+to connect your Postgres database to OSO for replication:
+
+- `wal_level` must be set to `logical`
+- You need to create a username of your choosing and share the associated
+  credentials with a maintainer at OSO
+- You need to grant `REPLICATION` privileges to that username
+- You need to create a replication slot
+- You need to create a publication for OSO for the tables you wish to have replicated
+
+#### Setting your `wal_level`
+
+:::warning
+Please ensure that you understand what changing the `wal_level` will mean for your
+database's system requirements and/or performance.
+:::
+
+Before you begin, it's possible your settings are already correct. To check your
+`wal_level` setting, run the following query:
+
+```SQL
+SHOW wal_level;
+```
+
+The output from `psql` would look something like this:
+
+```
+ wal_level
+-----------
+ logical
+```
+
+If the output shows some other value instead of `logical`, you will need
+to change this. Please ensure that this `wal_level` change is actually what you
+want for your database. Setting this value to `logical` will likely affect
+performance, as it increases the disk writes made by the database process. If you are
+comfortable with this, then you can change the `wal_level` by executing the
+following:
+
+```SQL
+ALTER SYSTEM SET wal_level = logical;
+```
+
+#### Creating a user for OSO
+
+To create a user, choose a username and password. Here we've chosen `oso_user`
+with a placeholder password `somepassword`:
+
+```SQL
+CREATE USER oso_user WITH PASSWORD 'somepassword';
+```
+
+#### Granting replication privileges
+
+The user we just created will need replication privileges:
+
+```SQL
+ALTER USER oso_user WITH REPLICATION;
+```
+
+#### Create a replication slot
+
+Create a replication slot for the `oso_user`. Here we named it `oso_slot`, but
+it can have any name.
+
+```SQL
+SELECT * FROM pg_create_logical_replication_slot('oso_slot', 'pgoutput');
+```
+
+#### Create a publication
+
+For the final step, we will create the publication, which subscribes to
+a specific table or tables. Those tables should already exist. If they do not, you
+will need to create them _before_ creating the publication. Once you've ensured
+that the tables in question have been created, run the following to
+create the publication:
+
+_This assumes that you're creating the publication for `table1` and `table2`._
+
+```SQL
+CREATE PUBLICATION oso_publication FOR TABLE table1, table2;
+```
+
+You can also create a publication for _all_ tables. To do this, run the following
+query:
+
+```SQL
+CREATE PUBLICATION oso_publication FOR ALL TABLES;
+```
+
+For more details about this command, see
+https://www.postgresql.org/docs/current/sql-createpublication.html
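+Before sharing credentials with OSO, it can help to sanity-check the setup.
+The following is a minimal sketch using standard Postgres catalog views; it
+assumes the `oso_slot` and `oso_publication` names from the examples above,
+so adjust them if you chose different names:
+
+```SQL
+-- Confirm the replication slot exists and which plugin it uses
+SELECT slot_name, plugin, slot_type, active
+FROM pg_replication_slots
+WHERE slot_name = 'oso_slot';
+
+-- Confirm which tables are covered by the publication
+SELECT * FROM pg_publication_tables
+WHERE pubname = 'oso_publication';
+```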
+### Adding your postgres replication data to the OSO meltano configuration
+
+Assuming that you've created the publication, you're now ready to connect your
+Postgres data source to OSO.
+
+#### Add the extractor to `meltano.yml`
+
+The `meltano.yml` file details all of the required configuration for the
+Meltano "extractors", which are either Airbyte connectors or Singer.io taps.
+
+For Postgres data sources we use the Postgres Airbyte connector. Underneath the
+`extractors:` section, add the following as a new list item (you should choose a
+name other than `tap-my-postgres-datasource`):
+
+```yaml
+extractors:
+  # ... other items may be above
+  # Choose any arbitrary name tap-# that is related to your datasource
+  - name: tap-my-postgres-datasource
+    inherit_from: tap-postgres
+    variant: airbyte
+    pip_url: git+https://github.com/MeltanoLabs/tap-airbyte-wrapper.git
+    config:
+      airbyte_config:
+        jdbc_url_params: "replication=postgres"
+        ssl_mode: # Update with your SSL configuration
+          mode: enable
+        schemas: # Update with your schemas
+          - public
+        replication_method:
+          plugin: pgoutput
+          method: CDC
+          publication: publication_name
+          replication_slot: oso_slot
+          initial_waiting_seconds: 5
+```
+
+#### Send the read-only credentials to OSO maintainers
+
+For now, once this is all completed, it is best to open a pull request; an OSO
+maintainer will reach out with a method to accept the read-only credentials.
+
+### Adding to Dagster
+
+:::warning
+Coming soon... This section is a work in progress.
+To track progress, see this
+[GitHub issue](https://github.com/opensource-observer/oso/issues/1318)
+:::
+
+## Writing a new Airbyte connector
+
+Airbyte provides one of the best ways to write data connectors
+that ingest data from HTTP APIs and other Python sources via the
+[Airbyte Python CDK](https://docs.airbyte.com/connector-development/cdk-python/).
+
+:::warning
+Coming soon... This section is a work in progress.
+:::
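+Until this section is written, here is a rough, untested sketch of the general
+shape of a connector built with the Python CDK. The `ExampleStream` class, the
+`https://api.example.com` base URL, and the `items` endpoint are purely
+illustrative placeholders, and the CDK API changes between releases, so treat
+the official CDK documentation as the source of truth:
+
+```python
+from typing import Any, Iterable, List, Mapping, Optional, Tuple
+
+import requests
+from airbyte_cdk.sources import AbstractSource
+from airbyte_cdk.sources.streams import Stream
+from airbyte_cdk.sources.streams.http import HttpStream
+
+
+class ExampleStream(HttpStream):
+    # Hypothetical API; replace with the service you are crawling.
+    url_base = "https://api.example.com/v1/"
+    primary_key = "id"
+
+    def path(self, **kwargs) -> str:
+        return "items"
+
+    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        # No pagination in this sketch; return a token here to page through results.
+        return None
+
+    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
+        # A real connector also declares a JSON schema for each stream.
+        yield from response.json()
+
+
+class ExampleSource(AbstractSource):
+    def check_connection(self, logger, config) -> Tuple[bool, Optional[Any]]:
+        # A real connector would make a cheap authenticated request here.
+        return True, None
+
+    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
+        return [ExampleStream()]
+```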
+## Airbyte examples in OSO
+
+:::warning
+Coming soon... This section is a work in progress.
+:::
diff --git a/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png b/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png
new file mode 100644
index 000000000..ada7e680d
Binary files /dev/null and b/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png differ
diff --git a/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png b/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png
new file mode 100644
index 000000000..fa7e0b046
Binary files /dev/null and b/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png differ
diff --git a/apps/docs/docs/contribute/connect-data/bigquery.md b/apps/docs/docs/contribute/connect-data/bigquery.md
new file mode 100644
index 000000000..8aa081178
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/bigquery.md
@@ -0,0 +1,139 @@
+---
+title: Connect via BigQuery
+sidebar_position: 1
+---
+
+BigQuery's built-in data-sharing capabilities make it
+trivially easy to integrate any public dataset into
+the OSO data pipeline.
+
+## Make the data available in the US region
+
+In order for our data pipeline to operate on the data,
+it must be in the
+[US multi-region](https://cloud.google.com/bigquery/docs/locations#multi-regions).
+
+If you have reason to keep the dataset in a different region,
+you can use the
+[BigQuery Data Transfer Service](https://cloud.google.com/bigquery/docs/dts-introduction)
+to easily copy the dataset to the US region.
+To define this as a transfer job in your own Google project,
+you can do so directly from the
+[BigQuery Studio](https://console.cloud.google.com/bigquery/transfers?project=opensource-observer).
+
+OSO will also copy certain valuable datasets into the
+`opensource-observer` project via Dagster assets.
+See the [Dataset replication](#oso-dataset-replication)
+section below to add a Dagster asset to OSO.
+
+## Make the data accessible to our Google service account
+
+The easiest way to do this is to make the BigQuery dataset publicly accessible.
+
+![Open BigQuery permissions](./bigquery-open-perms.png)
+
+Add `allAuthenticatedUsers` as a "BigQuery Data Viewer":
+
+![Set BigQuery permissions](./bigquery-set-perms.png)
+
+If you have reasons to keep your dataset private,
+you can reach out to us directly on our
+[Discord](https://www.opensource.observer/discord).
+
+## Defining a dbt source
+
+For example, Google maintains a
+[public dataset](https://cloud.google.com/blog/products/data-analytics/ethereum-bigquery-public-dataset-smart-contract-analytics)
+for Ethereum mainnet.
+
+As long as the dataset is publicly available in the US region,
+we can create a dbt source in `oso/warehouse/dbt/models/`
+(see [source](https://github.com/opensource-observer/oso/blob/main/warehouse/dbt/models/ethereum_sources.yml)):
+
+```yaml
+sources:
+  - name: ethereum
+    database: bigquery-public-data
+    schema: crypto_ethereum
+    tables:
+      - name: transactions
+        identifier: transactions
+      - name: traces
+        identifier: traces
+```
+
+We can then reference these tables in a downstream model with
+the `source` macro:
+
+```sql
+select
+  block_timestamp,
+  `hash` as transaction_hash,
+  from_address,
+  receipt_contract_address
+from {{ source("ethereum", "transactions") }}
+```
+
+## Creating a playground dataset (optional)
+
+If the source table is large, we will want to
+extract a subset of the data into a playground dataset
+for testing and development.
+ +For example for GitHub event data, +we copy just the last 14 days of data +into a playground dataset, which is used +when the dbt target is set to `playground` +(see [source](https://github.com/opensource-observer/oso/blob/main/warehouse/dbt/models/github_sources.yml)): + +```yaml +sources: + - name: github_archive + database: | + {%- if target.name in ['playground', 'dev'] -%} opensource-observer + {%- elif target.name == 'production' -%} githubarchive + {%- else -%} invalid_database + {%- endif -%} + schema: | + {%- if target.name in ['playground', 'dev'] -%} oso + {%- elif target.name == 'production' -%} day + {%- else -%} invalid_schema + {%- endif -%} + tables: + - name: events + identifier: | + {%- if target.name in ['playground', 'dev'] -%} stg_github__events + {%- elif target.name == 'production' -%} 20* + {%- else -%} invalid_table + {%- endif -%} +``` + +### Choosing a playground window size + +There is a fine balance between choosing a playground data set window +that is sufficiently small for affordable testing and development, +yet produces meaningful results to detect issues in your queries. + +:::warning +Coming soon... This section is a work in progress. +::: + +### Copying the playground dataset + +:::warning +Coming soon... This section is a work in progress. +::: + +## OSO Dataset Replication + +In order to make the OSO data pipeline more robust, +we can copy datasets into the `opensource-observer` Google Cloud project. + +:::warning +Coming soon... This section is a work in progress. +To track progress, see this +[GitHub issue](https://github.com/opensource-observer/oso/issues/1311). +::: + +Dagster also has an excellent tutorial on integrating +[BigQuery with Dagster](https://docs.dagster.io/integrations/bigquery/using-bigquery-with-dagster). diff --git a/apps/docs/docs/contribute/connect-data.md b/apps/docs/docs/contribute/connect-data/cloudquery.md similarity index 62% rename from apps/docs/docs/contribute/connect-data.md rename to apps/docs/docs/contribute/connect-data/cloudquery.md index bb800d59e..2485e6a14 100644 --- a/apps/docs/docs/contribute/connect-data.md +++ b/apps/docs/docs/contribute/connect-data/cloudquery.md @@ -1,18 +1,10 @@ --- -title: Connect Your Data -sidebar_position: 4 +title: Connect via CloudQuery +sidebar_position: 3 --- -:::info -We're always looking for new data sources to integrate with OSO and deepen our community's understanding of open source impact. If you're a developer or data engineer, we'd love to partner with you to connect your database (or other external data sources) to the OSO data warehouse. -::: - -## CloudQuery Plugins - ---- - -[CloudQuery](https://cloudquery.io) is used to integrate external data sources -into the Open Source Observer platform. At this time we are limiting the +[CloudQuery](https://cloudquery.io) can be used to integrate external data sources +into the OSO platform. At this time we are limiting the CloudQuery plugins in the OSO repository to Python or Typescript. This page will go over writing a plugin with Python, which is our suggested plugin language. @@ -266,146 +258,17 @@ In the future we intend to improve the experience of adding a plugin to the pipeline, but for now these docs are consistent with the current state of the pipeline. -## Connecting external databases - -The easiest way to connect data to OSO is to use our AirByte Connector or -Singer.io Tap integration through meltano. 
This section provides the details -necessary to add a connector or a tap from an existing postgres database into -our system. Other databases or datasources should be similar. - -### Settings up your postgres database for connection - -We will setup the postgre connection to use Change Data Capture which is -suggested for very large databases. You will need to have the following in order -to connect your postgres database to OSO for replication. - -- `wal_level` must be set to `logical` -- You need to create a username of your choosing and share the associated - credentials with a maintainer at OSO -- You need to grant `REPLICATION` privileges to a username of your choosing -- You need to create a replication slot -- You need to create a publication for OSO for the tables you wish to have replicated. - -#### Setting your `wal_level` +### Adding to Dagster :::warning -Please ensure that you understand what changing the `wal_level` will do for your -database system requirements and/or performance. +Coming soon... This section is a work in progress. +To track progress, see this +[GitHub issue](https://github.com/opensource-observer/oso/issues/1325) ::: -Before you begin, it's possible your settings are already correct. To check your -`wal_level` settings, run the following query: - -```SQL -SHOW wal_level; -``` - -The output would look something like this from `psql`: - -``` - wal_level ------------ - logical -``` - -If doesn't have the word `logical` but instead some other value, you will need -to change this. Please ensure that this `wal_level` change is actually what you -want for your database. Setting this value to `logical` will likely affect -performance as it increases the disk writes by the database process. If you are -comfortable with this, then you can change the `wal_level` by executing the -following: - -```SQL -ALTER SYSTEM SET wal_level = logical; -``` - -#### Creating a user for OSO - -To create a user, choose a username and password, here we've chosen `oso_user` -and have a placeholder password `somepassword`: - -```SQL -CREATE USER oso_user WITH PASSWORD 'somepassword'; -``` - -#### Granting replication privileges - -The user we just created will need replication privileges - -```SQL -ALTER USER oso_user WITH REPLICATION; -``` - -#### Create a replication slot - -Create a replication slot for the `oso_user`. Here we named it `oso_slot`, but -it can have any name. - -```SQL -SELECT * FROM pg_create_logical_replication_slot('oso_slot', 'pgoutput'); -``` - -#### Create a publication - -For the final step, we will be creating the publication which will subscribe to -a specific table or tables. That table should already exist. If it does not, you -will need to create it _before_ creating the publication. Once you've ensured -that the table or tables in question have been created, run the following to -create the publication: - -_This assumes that you're creating the publication for table1 and table2._ - -```SQL -CREATE PUBLICATION oso_publication FOR TABLE table1, table2; -``` - -You can also create a publication for _all_ tables. To do this run the following -query: - -```SQL -CREATE PUBLICATION oso_publication FOR ALL TABLES; -``` - -For more details about this command see: https://www.postgresql.org/docs/current/sql-createpublication.html - -### Adding your postgres replication data to the OSO meltano configuration - -Assuming that you've created the publication you're now ready to connect your -postgres data source to OSO. 
- -#### Add the extractor to `meltano.yml` - -The `meltano.yml` YAML file details all of the required configuration for the -meltano "extractors" which are either airbyte connectors or singer.io taps. - -For postgres data sources we use the postgres airbyte connector. Underneath the -`extractors:` section. Add the following as a new list item (you should choose a -name other than `tap-my-postgres-datasource`): - -```yaml -extractors: - # ... other items my be above - # Choose any arbitrary name tap-# that is related to your datasource - - name: tap-my-postgres-datasource - inherit_from: tap-postgres - variant: airbyte - pip_url: git+https://github.com/MeltanoLabs/tap-airbyte-wrapper.git - config: - airbyte_config: - jdbc_url_params: "replication=postgres" - ssl_mode: # Update with your SSL configuration - mode: enable - schemas: # Update with your schemas - - public - replication_method: - plugin: pgoutput - method: CDC - publication: publication_name - replication_slot: oso_slot - initial_waiting_seconds: 5 -``` +## CloudQuery examples in OSO -#### Send the read only credentials to OSO maintainers +Here are a few examples of CloudQuery plugins currently in use: -For now, once this is all completed it is best to open a pull request and an OSO -maintainer will reach out with a method to accept the read only credentials. +- [Importing oss-directory](https://github.com/opensource-observer/oso/tree/main/warehouse/cloudquery-oss-directory) +- [Fetch GitHub data missing from GHArchive](https://github.com/opensource-observer/oso/tree/main/warehouse/cloudquery-github-resolve-repos) diff --git a/apps/docs/docs/contribute/funding-data.md b/apps/docs/docs/contribute/connect-data/funding-data.md similarity index 86% rename from apps/docs/docs/contribute/funding-data.md rename to apps/docs/docs/contribute/connect-data/funding-data.md index 15e250251..991f95b74 100644 --- a/apps/docs/docs/contribute/funding-data.md +++ b/apps/docs/docs/contribute/connect-data/funding-data.md @@ -1,6 +1,6 @@ --- title: Add Funding Data -sidebar_position: 3 +sidebar_position: 10 --- :::info @@ -11,12 +11,12 @@ We are coordinating with several efforts to collect, clean, and visualize OSS fu --- -Add or update OSS funding data by making a pull request to [OSS Funding](https://github.com/opensource-observer/oss-funding). +Add or update funding data by making a pull request to [oss-funding](https://github.com/opensource-observer/oss-funding). -1. Fork [OSS Funding](https://github.com/opensource-observer/oss-funding/fork). +1. Fork [oss-funding](https://github.com/opensource-observer/oss-funding/fork). 2. Add static data in CSV (or JSON) format to `./uploads/`. 3. Ensure the data contains links to one or more project artifacts such as GitHub repos or wallet addresses. This is necessary in order for one of the repo maintainers to link funding events to OSS projects. -4. Submit a pull request from your fork back to [OSS Funding](https://github.com/opensource-observer/oss-funding). +4. Submit a pull request from your fork back to [oss-funding](https://github.com/opensource-observer/oss-funding). ## Contributing Clean Data @@ -28,7 +28,7 @@ Submissions will be validated to ensure they conform to the schema and don't con Additions to the `./clean/` directory should include as many of the following columns as possible: -- `oso_slug`: The OSO project slug (leave blank or null if the project doesn't exist yet). +- `oso_slug`: The OSO project name (leave blank or null if the project doesn't exist yet). 
 - `project_name`: The name of the project (according to the funder's data).
 - `project_id`: The unique identifier for the project (according to the funder's data).
 - `project_url`: The URL of the project's grant application or profile.
@@ -46,6 +46,6 @@ Additions to the `./clean/` directory should include as many of the following co
 
 ---
 
-You can read or copy the latest version of the funding data directly from the [OSS Funding](https://github.com/opensource-observer/oss-funding) repo.
+You can read or copy the latest version of the funding data directly from the [oss-funding](https://github.com/opensource-observer/oss-funding) repo.
 
 If you do something cool with the data (eg, a visualization or analysis), please share it with us!
diff --git a/apps/docs/docs/contribute/connect-data/gcs.md b/apps/docs/docs/contribute/connect-data/gcs.md
new file mode 100644
index 000000000..c2832049b
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/gcs.md
@@ -0,0 +1,37 @@
+---
+title: Connect via Google Cloud Storage (GCS)
+sidebar_position: 4
+---
+
+Depending on the data, we may accept data dumps
+into our Google Cloud Storage (GCS).
+If you believe your data storage qualifies to be sponsored
+by OSO, please reach out to us on
+[Discord](https://www.opensource.observer/discord).
+
+## Get write access
+
+Coordinate with the OSO engineering team directly on
+[Discord](https://www.opensource.observer/discord)
+to give your Google service account write permissions to
+our GCS bucket.
+
+## Defining a Dagster Asset
+
+:::warning
+Coming soon... This section is a work in progress
+and will likely be refactored soon.
+:::
+
+To see an example of this in action,
+you can look into our Dagster asset for
+[Gitcoin passport scores](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py).
+
+For more details on defining Dagster assets,
+see the [Dagster tutorial](https://docs.dagster.io/tutorial).
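+In the meantime, the sketch below shows the rough shape of a Dagster asset
+that loads a Parquet dump from a GCS bucket into BigQuery. The bucket path,
+dataset, and table names are hypothetical placeholders, and the real OSO
+assets in `warehouse/oso_dagster/assets.py` carry more configuration, so
+treat this only as an illustration:
+
+```python
+from dagster import asset
+from google.cloud import bigquery
+
+
+@asset
+def example_gcs_dataset() -> None:
+    """Load a Parquet dump from GCS into a BigQuery table (illustrative only)."""
+    # Hypothetical locations; replace with your actual bucket and dataset.
+    gcs_uri = "gs://oso-example-bucket/exports/*.parquet"
+    table_id = "opensource-observer.example_dataset.example_table"
+
+    client = bigquery.Client()
+    job_config = bigquery.LoadJobConfig(
+        source_format=bigquery.SourceFormat.PARQUET,
+        # Overwrite the table on each run; use WRITE_APPEND for incremental loads.
+        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
+    )
+
+    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
+    load_job.result()  # Wait for the load job to finish.
+```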
+## GCS import examples in OSO
+
+- [Superchain data](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py)
+- [Gitcoin Passport scores](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py)
+- [OpenRank reputations on Farcaster](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py)
diff --git a/apps/docs/docs/contribute/connect-data/index.md b/apps/docs/docs/contribute/connect-data/index.md
new file mode 100644
index 000000000..457e1fcd7
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/index.md
@@ -0,0 +1,27 @@
+---
+title: Connect Your Data
+sidebar_position: 0
+---
+
+:::info
+We're always looking for new data sources to integrate with OSO and deepen our community's understanding of open source impact. If you're a developer or data engineer, please reach out to us on [Discord](https://www.opensource.observer/discord). We'd love to partner with you to connect your database (or other external data sources) to the OSO data warehouse.
+:::
+
+There are currently several patterns for integrating new data sources into OSO,
+listed in order of preference:
+
+1. [BigQuery public datasets](./bigquery.md): If you can maintain a BigQuery public dataset, this is the preferred and easiest route.
+2. [Airbyte plugins](./airbyte.md): Airbyte plugins are the preferred method for crawling APIs.
+3. [Database replication via Airbyte](./airbyte.md): Airbyte maintains off-the-shelf plugins for database replication (e.g. from Postgres).
+4. [CloudQuery plugins](./cloudquery.md): CloudQuery offers another, more flexible avenue for writing data import plugins.
+5. [Files into Google Cloud Storage (GCS)](./gcs.md): You can drop Parquet/CSV files in our GCS bucket for loading into BigQuery.
+6. Static files: If the data is high quality and can only be imported via static files, please reach out to us on [Discord](https://www.opensource.observer/discord) to coordinate hand-off. This path is predominantly used for [grant funding data](./funding-data.md).
+
+We generally prefer to work with data partners that can help us regularly
+index live data to feed our daily data pipeline.
+All data sources should be defined as
+[software-defined assets](https://docs.dagster.io/concepts/assets/software-defined-assets) in our Dagster configuration.
+
+ETL is the messiest, most high-touch part of the OSO data pipeline.
+Please reach out to us for help on [Discord](https://www.opensource.observer/discord).
+We will happily work with you to get it working.
diff --git a/apps/docs/docs/contribute/impact-models.md b/apps/docs/docs/contribute/impact-models.md
index 47ea8ebae..79cc10743 100644
--- a/apps/docs/docs/contribute/impact-models.md
+++ b/apps/docs/docs/contribute/impact-models.md
@@ -1,5 +1,5 @@
 ---
-title: Propose an Impact Model
+title: Write a Data Model
 sidebar_position: 5
 ---
 
@@ -85,7 +85,8 @@ poetry install && poetry run oso_lets_go
 :::tip
 Under the hood, `oso_lets_go` will create a GCP project and BigQuery dataset
 if they don't already exist,
-and copy a small subset of the OSO data for you to develop against.
+and copy a small subset of the OSO data for you to develop against,
+called `playground`.
 It will also create a dbt profile to connect to this dataset
 (stored in `~/.dbt/profiles.yml`).
 The script is idempotent, so you can safely run it again
@@ -107,7 +108,8 @@ Finally, you can test that everything is working by running the following comman
 dbt run
 ```
 
----
+This will run the full dbt pipeline against your own
+copy of the OSO playground dataset.
 
 ## Working with OSO dbt Models
 
@@ -129,8 +131,8 @@ here for a fuller explanation](https://docs.getdbt.com/best-practices/how-we-str
 - `marts` - This directory contains transformations that should be fairly
   minimal and mostly be aggregations. In general, `marts` shouldn't depend on
   other marts unless they're just coarser grained aggregations of an upstream
-  mart. Marts are also automatically copied to the postgresql database that runs
-  the OSO website.
+  mart. Marts are also automatically copied to the frontend database that
+  powers the OSO API and website.
 
 ### OSO data sources
 
@@ -159,11 +161,11 @@ oso_source('ossd', '{TABLE_NAME}') }}` where `{TABLE_NAME}` could be one of the
 following tables:
 
 - `collections` - This data is pulled directly from the [oss-directory
-  Repository][oss-directory] and is
+  repository][oss-directory] and contains
   groups of projects. You can view this table [here][collections_table]
 
 - `projects` - This data is also pulled directly from the oss-directory
-  Repository. It describes a project's repositories, blockchain addresses, and
+  repository. It describes a project's repositories, blockchain addresses, and
   public packages. You can view this table [here][projects_table]
 
 - `repositories` - This data is derived by gathering repository data of all the
@@ -217,8 +219,6 @@ namespace for a collection named `foo` would be as follows:
 {{ oso_id('collection', 'foo')}}
 ```
 
----
-
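+To tie the `oso_source` tables and macros together, here is a deliberately
+trivial sketch of a model that selects from one of the sources above. The file
+name and the aggregation are hypothetical illustrations only, not an actual OSO
+model; check the table in the BigQuery console for the real schema before
+building on it:
+
+```sql
+-- e.g. warehouse/dbt/models/example_ossd_project_count.sql (hypothetical)
+-- Counts the projects currently defined in oss-directory
+select count(*) as project_count
+from {{ oso_source('ossd', 'projects') }}
+```
+
+Once a model like this exists, downstream models can build on it with dbt's
+`ref` macro, e.g. `{{ ref('example_ossd_project_count') }}`.
+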
 ## Adding Your dbt Model
 
 Now you're armed with enough information to add your model! Add your model to
@@ -238,13 +238,16 @@ dbt run
 
 _Note: If you configured the dbt profile as shown in this document, this `dbt
 run` will write to the `opensource-observer.oso_playground` dataset._
 
-It is likely best to target a specific model so things don't take so long on
-some of our materializations:
+It is likely best to target a specific model when developing
+so things don't take so long on some of our materializations:
 
 ```bash
 dbt run --select {name_of_your_model}
 ```
 
+If `dbt run` runs without issue and you feel that you've completed something
+you'd like to contribute, it's time to open a PR!
+
 ### Using the BigQuery UI to check your queries
 
 During your development process, it may be useful to use the BigQuery UI to
@@ -272,26 +275,12 @@ The presence of the compiled model does not necessarily mean your SQL will work
 simply that it was rendered by `dbt` correctly. To test your model it's likely
 cheapest to copy the query into the [BigQuery
 Console](https://console.cloud.google.com/bigquery) and run that query there.
-However, if you need more validation you'll need to [Setup GCP with your own
-playground](https://docs.opensource.observer/docs/contribute/transform/setting-up-gcp.md#setting-up-your-own-playground-copy-of-the-dataset)
-
-### Testing your dbt models
-
-When you're ready, you can test your dbt models against your playground by
-simply running dbt like so:
-
-```bash
-dbt run --select {name_of_your_model}
-```
-
-If this runs without issue and you feel that you've completed something you'd
-like to contribute. It's time to open a PR!
 
 ### Submit a PR
 
 Once you've developed your model and you feel comfortable that it will properly
-run. you can submit it a PR to the [oso Repository][oso] to be tested by the OSO
-github CI workflows (_still under development_).
+run, you can submit a PR to the [oso repository][oso] to be tested by the OSO
+GitHub CI workflows.
 
 ### DBT model execution schedule
 
@@ -301,228 +290,13 @@ pipelines are executed once a day by the OSO CI at 02:00 UTC. The pipeline
 currently takes a number of hours and any materializations or views would
 likely be ready for use by 4-6 hours after that time.
 
-## Model Examples
-
-Here are a few examples of dbt models currently in production:
-
-### Developers
-
-This is an intermediate model available in the data warehouse as `int_devs`.
-
-```sql
-SELECT
-  e.project_id,
-  e.to_namespace AS repository_source,
-  e.from_id,
-  1 AS amount,
-  TIMESTAMP_TRUNC(e.time, MONTH) AS bucket_month,
-  CASE
-    WHEN
-      COUNT(DISTINCT CASE WHEN e.event_type = 'COMMIT_CODE' THEN e.time END)
-      >= 10
-      THEN 'FULL_TIME_DEV'
-    WHEN
-      COUNT(DISTINCT CASE WHEN e.event_type = 'COMMIT_CODE' THEN e.time END)
-      >= 1
-      THEN 'PART_TIME_DEV'
-    ELSE 'OTHER_CONTRIBUTOR'
-  END AS user_segment_type
-FROM {{ ref('int_events_to_project') }} AS e
-WHERE
-  e.event_type IN (
-    'PULL_REQUEST_CREATED',
-    'PULL_REQUEST_MERGED',
-    'COMMIT_CODE',
-    'ISSUE_CLOSED',
-    'ISSUE_CREATED'
-  )
-GROUP BY e.project_id, bucket_month, e.from_id, repository_source
-```
-
-### Events to a Project
+You can monitor all pipeline runs in
+[GitHub actions](https://github.com/opensource-observer/oso/actions/workflows/warehouse-run-data-pipeline.yml).
 
-This is an intermediate model available in the data warehouse as `int_events_to_project`.
+## Model References -```sql -SELECT - e.*, - a.project_id -FROM {{ ref('int_events_with_artifact_id') }} AS e -INNER JOIN {{ ref('stg_ossd__artifacts_by_project') }} AS a - ON - e.to_source_id = a.artifact_source_id - AND e.to_namespace = a.artifact_namespace - AND e.to_type = a.artifact_type -``` +All OSO models can be found in +[`warehouse/dbt/models`](https://github.com/opensource-observer/oso/tree/main/warehouse/dbt/models). -### Summary Onchain Metrics by Project - -This is a mart model available in the data warehouse as `onchain_metrics_by_project_v1`. - -```sql -WITH txns AS ( - SELECT - a.project_id, - c.to_namespace AS onchain_network, - c.from_source_id AS from_id, - c.l2_gas, - c.tx_count, - DATE(TIMESTAMP_TRUNC(c.time, MONTH)) AS bucket_month - FROM {{ ref('stg_dune__contract_invocation') }} AS c - INNER JOIN {{ ref('stg_ossd__artifacts_by_project') }} AS a - ON c.to_source_id = a.artifact_source_id -), -metrics_all_time AS ( - SELECT - project_id, - onchain_network, - MIN(bucket_month) AS first_txn_date, - COUNT(DISTINCT from_id) AS total_users, - SUM(l2_gas) AS total_l2_gas, - SUM(tx_count) AS total_txns - FROM txns - GROUP BY project_id, onchain_network -), -metrics_6_months AS ( - SELECT - project_id, - onchain_network, - COUNT(DISTINCT from_id) AS users_6_months, - SUM(l2_gas) AS l2_gas_6_months, - SUM(tx_count) AS txns_6_months - FROM txns - WHERE bucket_month >= DATE_ADD(CURRENT_DATE(), INTERVAL -6 MONTH) - GROUP BY project_id, onchain_network -), -new_users AS ( - SELECT - project_id, - onchain_network, - SUM(is_new_user) AS new_user_count - FROM ( - SELECT - project_id, - onchain_network, - from_id, - CASE - WHEN - MIN(bucket_month) >= DATE_ADD(CURRENT_DATE(), INTERVAL -3 MONTH) - THEN - 1 - END AS is_new_user - FROM txns - GROUP BY project_id, onchain_network, from_id - ) - GROUP BY project_id, onchain_network -), -user_txns_aggregated AS ( - SELECT - project_id, - onchain_network, - from_id, - SUM(tx_count) AS total_tx_count - FROM txns - WHERE bucket_month >= DATE_ADD(CURRENT_DATE(), INTERVAL -3 MONTH) - GROUP BY project_id, onchain_network, from_id -), -multi_project_users AS ( - SELECT - onchain_network, - from_id, - COUNT(DISTINCT project_id) AS projects_transacted_on - FROM user_txns_aggregated - GROUP BY onchain_network, from_id -), -user_segments AS ( - SELECT - project_id, - onchain_network, - COUNT(DISTINCT CASE - WHEN user_segment = 'HIGH_FREQUENCY_USER' THEN from_id - END) AS high_frequency_users, - COUNT(DISTINCT CASE - WHEN user_segment = 'MORE_ACTIVE_USER' THEN from_id - END) AS more_active_users, - COUNT(DISTINCT CASE - WHEN user_segment = 'LESS_ACTIVE_USER' THEN from_id - END) AS less_active_users, - COUNT(DISTINCT CASE - WHEN projects_transacted_on >= 3 THEN from_id - END) AS multi_project_users - FROM ( - SELECT - uta.project_id, - uta.onchain_network, - uta.from_id, - mpu.projects_transacted_on, - CASE - WHEN uta.total_tx_count >= 1000 THEN 'HIGH_FREQUENCY_USER' - WHEN uta.total_tx_count >= 10 THEN 'MORE_ACTIVE_USER' - ELSE 'LESS_ACTIVE_USER' - END AS user_segment - FROM user_txns_aggregated AS uta - INNER JOIN multi_project_users AS mpu - ON uta.from_id = mpu.from_id - ) - GROUP BY project_id, onchain_network -), -contracts AS ( - SELECT - project_id, - artifact_namespace AS onchain_network, - COUNT(artifact_source_id) AS num_contracts - FROM {{ ref('stg_ossd__artifacts_by_project') }} - GROUP BY project_id, onchain_network -), -project_by_network AS ( - SELECT - p.project_id, - ctx.onchain_network, - p.project_name - FROM {{ ref('projects_v1') 
}} AS p - INNER JOIN contracts AS ctx - ON p.project_id = ctx.project_id -) - -SELECT - p.project_id, - p.onchain_network AS network, - p.project_name, - c.num_contracts, - ma.first_txn_date, - ma.total_txns, - ma.total_l2_gas, - ma.total_users, - m6.txns_6_months, - m6.l2_gas_6_months, - m6.users_6_months, - nu.new_user_count, - us.high_frequency_users, - us.more_active_users, - us.less_active_users, - us.multi_project_users, - ( - us.high_frequency_users + us.more_active_users + us.less_active_users - ) AS active_users -FROM project_by_network AS p -LEFT JOIN metrics_all_time AS ma - ON - p.project_id = ma.project_id - AND p.onchain_network = ma.onchain_network -LEFT JOIN metrics_6_months AS m6 - ON - p.project_id = m6.project_id - AND p.onchain_network = m6.onchain_network -LEFT JOIN new_users AS nu - ON - p.project_id = nu.project_id - AND p.onchain_network = nu.onchain_network -LEFT JOIN user_segments AS us - ON - p.project_id = us.project_id - AND p.onchain_network = us.onchain_network -LEFT JOIN contracts AS c - ON - p.project_id = c.project_id - AND p.onchain_network = c.onchain_network -``` +We also continuously deploy model reference documentation at +[https://models.opensource.observer/](https://models.opensource.observer/) diff --git a/apps/docs/docs/contribute/index.mdx b/apps/docs/docs/contribute/index.mdx index 0da77a078..0028d0b75 100644 --- a/apps/docs/docs/contribute/index.mdx +++ b/apps/docs/docs/contribute/index.mdx @@ -19,37 +19,37 @@ There are a variety of ways you can contribute to OSO. This doc features some of