diff --git a/apps/docs/docs/contribute/connect-data/_category_.json b/apps/docs/docs/contribute/connect-data/_category_.json
new file mode 100644
index 000000000..d6a76a91e
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "Connect Your Data",
+  "position": 4,
+  "link": {
+    "type": "doc",
+    "id": "index"
+  }
+}
diff --git a/apps/docs/docs/contribute/connect-data/airbyte.md b/apps/docs/docs/contribute/connect-data/airbyte.md
new file mode 100644
index 000000000..3fe0d97d5
--- /dev/null
+++ b/apps/docs/docs/contribute/connect-data/airbyte.md
@@ -0,0 +1,173 @@
+---
+title: Connect via Airbyte
+sidebar_position: 2
+---
+
+## Replicating external databases
+
+If your data exists in an off-the-shelf database,
+you can replicate data to OSO via an Airbyte Connector or
+Singer.io Tap integration through Meltano. This section provides the details
+necessary to add a connector or a tap from an existing Postgres database into
+our system. Other databases or data sources should be similar.
+
+### Setting up your Postgres database for connection
+
+We will set up the Postgres connection to use Change Data Capture (CDC), which is
+recommended for very large databases. You will need the following in order
+to connect your Postgres database to OSO for replication:
+
+- `wal_level` must be set to `logical`
+- You need to create a username of your choosing and share the associated
+  credentials with a maintainer at OSO
+- You need to grant `REPLICATION` privileges to the username you created
+- You need to create a replication slot
+- You need to create a publication for OSO for the tables you wish to have replicated
+
+#### Setting your `wal_level`
+
+:::warning
+Please ensure that you understand what changing the `wal_level` will mean for your
+database's system requirements and/or performance.
+:::
+
+Before you begin, it's possible your settings are already correct. To check your
+`wal_level` settings, run the following query:
+
+```SQL
+SHOW wal_level;
+```
+
+The output would look something like this from `psql`:
+
+```
+ wal_level
+-----------
+ logical
+```
+
+If the output shows some other value instead of `logical`, you will need
+to change this. Please ensure that this `wal_level` change is actually what you
+want for your database. Setting this value to `logical` will likely affect
+performance, as it increases the disk writes by the database process. If you are
+comfortable with this, then you can change the `wal_level` by executing the
+following:
+
+```SQL
+ALTER SYSTEM SET wal_level = logical;
+```
+
+#### Creating a user for OSO
+
+To create a user, choose a username and password. Here we've chosen `oso_user`
+with a placeholder password `somepassword`:
+
+```SQL
+CREATE USER oso_user WITH PASSWORD 'somepassword';
+```
+
+#### Granting replication privileges
+
+The user we just created will need replication privileges:
+
+```SQL
+ALTER USER oso_user WITH REPLICATION;
+```
+
+#### Create a replication slot
+
+Create a replication slot for the `oso_user`. Here we named it `oso_slot`, but
+it can have any name.
+
+```SQL
+SELECT * FROM pg_create_logical_replication_slot('oso_slot', 'pgoutput');
+```
+
+#### Create a publication
+
+For the final step, we will create the publication, which subscribes to
+a specific table or tables. Those tables should already exist. If they do not, you
+will need to create them _before_ creating the publication.
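+
+If you want to double-check which tables already exist, a quick query against
+`information_schema` will list them (shown here for the `public` schema; adjust
+the schema name to match your setup):
+
+```SQL
+SELECT table_name
+FROM information_schema.tables
+WHERE table_schema = 'public'
+ORDER BY table_name;
+```
+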
+Once you've ensured that the table or tables in question have been created,
+run the following to create the publication:
+
+_This assumes that you're creating the publication for table1 and table2._
+
+```SQL
+CREATE PUBLICATION oso_publication FOR TABLE table1, table2;
+```
+
+You can also create a publication for _all_ tables. To do this, run the following
+query:
+
+```SQL
+CREATE PUBLICATION oso_publication FOR ALL TABLES;
+```
+
+For more details about this command, see https://www.postgresql.org/docs/current/sql-createpublication.html
+
+### Adding your Postgres replication data to the OSO Meltano configuration
+
+Assuming that you've created the publication, you're now ready to connect your
+Postgres data source to OSO.
+
+#### Add the extractor to `meltano.yml`
+
+The `meltano.yml` YAML file details all of the required configuration for the
+Meltano "extractors", which are either Airbyte connectors or Singer.io taps.
+
+For Postgres data sources we use the Postgres Airbyte connector. Underneath the
+`extractors:` section, add the following as a new list item (you should choose a
+name other than `tap-my-postgres-datasource`):
+
+```yaml
+extractors:
+  # ... other items may be above
+  # Choose any arbitrary name tap-# that is related to your datasource
+  - name: tap-my-postgres-datasource
+    inherit_from: tap-postgres
+    variant: airbyte
+    pip_url: git+https://github.com/MeltanoLabs/tap-airbyte-wrapper.git
+    config:
+      airbyte_config:
+        jdbc_url_params: "replication=postgres"
+        ssl_mode: # Update with your SSL configuration
+          mode: enable
+        schemas: # Update with your schemas
+          - public
+        replication_method:
+          plugin: pgoutput
+          method: CDC
+          publication: publication_name
+          replication_slot: oso_slot
+          initial_waiting_seconds: 5
+```
+
+#### Send the read-only credentials to OSO maintainers
+
+For now, once this is all completed, it is best to open a pull request; an OSO
+maintainer will reach out with a method to accept the read-only credentials.
+
+### Adding to Dagster
+
+:::warning
+Coming soon... This section is a work in progress.
+To track progress, see this
+[GitHub issue](https://github.com/opensource-observer/oso/issues/1318)
+:::
+
+## Writing a new Airbyte connector
+
+Airbyte provides one of the best ways to write data connectors
+that ingest data from HTTP APIs and other Python sources via the
+[Airbyte Python CDK](https://docs.airbyte.com/connector-development/cdk-python/).
+
+:::warning
+Coming soon... This section is a work in progress.
+:::
+
+## Airbyte examples in OSO
+
+:::warning
+Coming soon... This section is a work in progress.
+::: diff --git a/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png b/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png new file mode 100644 index 000000000..ada7e680d Binary files /dev/null and b/apps/docs/docs/contribute/connect-data/bigquery-open-perms.png differ diff --git a/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png b/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png new file mode 100644 index 000000000..fa7e0b046 Binary files /dev/null and b/apps/docs/docs/contribute/connect-data/bigquery-set-perms.png differ diff --git a/apps/docs/docs/contribute/connect-data/bigquery.md b/apps/docs/docs/contribute/connect-data/bigquery.md new file mode 100644 index 000000000..8aa081178 --- /dev/null +++ b/apps/docs/docs/contribute/connect-data/bigquery.md @@ -0,0 +1,139 @@ +--- +title: Connect via BigQuery +sidebar_position: 1 +--- + +BigQuery's built-in data-sharing capabilities make it +trivially easy to integrate any public dataset into +the OSO data pipeline. + +## Make the data available in the US region + +In order for our data pipeline to operate on the data, +it must be in the +[US multi-region](https://cloud.google.com/bigquery/docs/locations#multi-regions). + +If you have reason to keep the dataset in a different region, +you can use the +[BigQuery Data Transfer Service](https://cloud.google.com/bigquery/docs/dts-introduction) +to easily copy the dataset to the US region. +To manually define this as a transfer job in your own Google project, +you can do this directly from the +[BigQuery Studio](https://console.cloud.google.com/bigquery/transfers?project=opensource-observer). + +OSO will also copy certain valuable datasets into the +`opensource-observer` project via Dagster assets. +See the [Dataset replication](#oso-dataset-replication) +section below to add a Dagster asset to OSO. + +## Make the data accessible to our Google service account + +The easiest way to do this is to make the BigQuery dataset publicly accessible. + +![Open BigQuery permissions](./bigquery-open-perms.png) + +Add the `allAuthenticatedUsers` as the "BigQuery Data Viewer" + +![Set BigQuery permissions](./bigquery-set-perms.png) + +If you have reasons to keep your dataset private, +you can reach out to us directly on our +[Discord](https://www.opensource.observer/discord). + +## Defining a dbt source + +For example, Google maintains a +[public dataset](https://cloud.google.com/blog/products/data-analytics/ethereum-bigquery-public-dataset-smart-contract-analytics) +for Ethereum mainnet. + +As long as the dataset is publicly available in the US region, +we can create a dbt source in `oso/warehouse/dbt/models/` +(see [source](https://github.com/opensource-observer/oso/blob/main/warehouse/dbt/models/ethereum_sources.yml)): + +```yaml +sources: + - name: ethereum + database: bigquery-public-data + schema: crypto_ethereum + tables: + - name: transactions + identifier: transactions + - name: traces + identifier: traces +``` + +We can then reference these tables in a downstream model with +the `source` macro: + +```sql +select + block_timestamp, + `hash` as transaction_hash, + from_address, + receipt_contract_address +from {{ source("ethereum", "transactions") }} +``` + +## Creating a playground dataset (optional) + +If the source table is large, we will want to +extract a subset of the data into a playground dataset +for testing and development. 
+ +For example for GitHub event data, +we copy just the last 14 days of data +into a playground dataset, which is used +when the dbt target is set to `playground` +(see [source](https://github.com/opensource-observer/oso/blob/main/warehouse/dbt/models/github_sources.yml)): + +```yaml +sources: + - name: github_archive + database: | + {%- if target.name in ['playground', 'dev'] -%} opensource-observer + {%- elif target.name == 'production' -%} githubarchive + {%- else -%} invalid_database + {%- endif -%} + schema: | + {%- if target.name in ['playground', 'dev'] -%} oso + {%- elif target.name == 'production' -%} day + {%- else -%} invalid_schema + {%- endif -%} + tables: + - name: events + identifier: | + {%- if target.name in ['playground', 'dev'] -%} stg_github__events + {%- elif target.name == 'production' -%} 20* + {%- else -%} invalid_table + {%- endif -%} +``` + +### Choosing a playground window size + +There is a fine balance between choosing a playground data set window +that is sufficiently small for affordable testing and development, +yet produces meaningful results to detect issues in your queries. + +:::warning +Coming soon... This section is a work in progress. +::: + +### Copying the playground dataset + +:::warning +Coming soon... This section is a work in progress. +::: + +## OSO Dataset Replication + +In order to make the OSO data pipeline more robust, +we can copy datasets into the `opensource-observer` Google Cloud project. + +:::warning +Coming soon... This section is a work in progress. +To track progress, see this +[GitHub issue](https://github.com/opensource-observer/oso/issues/1311). +::: + +Dagster also has an excellent tutorial on integrating +[BigQuery with Dagster](https://docs.dagster.io/integrations/bigquery/using-bigquery-with-dagster). diff --git a/apps/docs/docs/contribute/connect-data.md b/apps/docs/docs/contribute/connect-data/cloudquery.md similarity index 62% rename from apps/docs/docs/contribute/connect-data.md rename to apps/docs/docs/contribute/connect-data/cloudquery.md index bb800d59e..2485e6a14 100644 --- a/apps/docs/docs/contribute/connect-data.md +++ b/apps/docs/docs/contribute/connect-data/cloudquery.md @@ -1,18 +1,10 @@ --- -title: Connect Your Data -sidebar_position: 4 +title: Connect via CloudQuery +sidebar_position: 3 --- -:::info -We're always looking for new data sources to integrate with OSO and deepen our community's understanding of open source impact. If you're a developer or data engineer, we'd love to partner with you to connect your database (or other external data sources) to the OSO data warehouse. -::: - -## CloudQuery Plugins - ---- - -[CloudQuery](https://cloudquery.io) is used to integrate external data sources -into the Open Source Observer platform. At this time we are limiting the +[CloudQuery](https://cloudquery.io) can be used to integrate external data sources +into the OSO platform. At this time we are limiting the CloudQuery plugins in the OSO repository to Python or Typescript. This page will go over writing a plugin with Python, which is our suggested plugin language. @@ -266,146 +258,17 @@ In the future we intend to improve the experience of adding a plugin to the pipeline, but for now these docs are consistent with the current state of the pipeline. -## Connecting external databases - -The easiest way to connect data to OSO is to use our AirByte Connector or -Singer.io Tap integration through meltano. 
This section provides the details -necessary to add a connector or a tap from an existing postgres database into -our system. Other databases or datasources should be similar. - -### Settings up your postgres database for connection - -We will setup the postgre connection to use Change Data Capture which is -suggested for very large databases. You will need to have the following in order -to connect your postgres database to OSO for replication. - -- `wal_level` must be set to `logical` -- You need to create a username of your choosing and share the associated - credentials with a maintainer at OSO -- You need to grant `REPLICATION` privileges to a username of your choosing -- You need to create a replication slot -- You need to create a publication for OSO for the tables you wish to have replicated. - -#### Setting your `wal_level` +### Adding to Dagster :::warning -Please ensure that you understand what changing the `wal_level` will do for your -database system requirements and/or performance. +Coming soon... This section is a work in progress. +To track progress, see this +[GitHub issue](https://github.com/opensource-observer/oso/issues/1325) ::: -Before you begin, it's possible your settings are already correct. To check your -`wal_level` settings, run the following query: - -```SQL -SHOW wal_level; -``` - -The output would look something like this from `psql`: - -``` - wal_level ------------ - logical -``` - -If doesn't have the word `logical` but instead some other value, you will need -to change this. Please ensure that this `wal_level` change is actually what you -want for your database. Setting this value to `logical` will likely affect -performance as it increases the disk writes by the database process. If you are -comfortable with this, then you can change the `wal_level` by executing the -following: - -```SQL -ALTER SYSTEM SET wal_level = logical; -``` - -#### Creating a user for OSO - -To create a user, choose a username and password, here we've chosen `oso_user` -and have a placeholder password `somepassword`: - -```SQL -CREATE USER oso_user WITH PASSWORD 'somepassword'; -``` - -#### Granting replication privileges - -The user we just created will need replication privileges - -```SQL -ALTER USER oso_user WITH REPLICATION; -``` - -#### Create a replication slot - -Create a replication slot for the `oso_user`. Here we named it `oso_slot`, but -it can have any name. - -```SQL -SELECT * FROM pg_create_logical_replication_slot('oso_slot', 'pgoutput'); -``` - -#### Create a publication - -For the final step, we will be creating the publication which will subscribe to -a specific table or tables. That table should already exist. If it does not, you -will need to create it _before_ creating the publication. Once you've ensured -that the table or tables in question have been created, run the following to -create the publication: - -_This assumes that you're creating the publication for table1 and table2._ - -```SQL -CREATE PUBLICATION oso_publication FOR TABLE table1, table2; -``` - -You can also create a publication for _all_ tables. To do this run the following -query: - -```SQL -CREATE PUBLICATION oso_publication FOR ALL TABLES; -``` - -For more details about this command see: https://www.postgresql.org/docs/current/sql-createpublication.html - -### Adding your postgres replication data to the OSO meltano configuration - -Assuming that you've created the publication you're now ready to connect your -postgres data source to OSO. 
- -#### Add the extractor to `meltano.yml` - -The `meltano.yml` YAML file details all of the required configuration for the -meltano "extractors" which are either airbyte connectors or singer.io taps. - -For postgres data sources we use the postgres airbyte connector. Underneath the -`extractors:` section. Add the following as a new list item (you should choose a -name other than `tap-my-postgres-datasource`): - -```yaml -extractors: - # ... other items my be above - # Choose any arbitrary name tap-# that is related to your datasource - - name: tap-my-postgres-datasource - inherit_from: tap-postgres - variant: airbyte - pip_url: git+https://github.com/MeltanoLabs/tap-airbyte-wrapper.git - config: - airbyte_config: - jdbc_url_params: "replication=postgres" - ssl_mode: # Update with your SSL configuration - mode: enable - schemas: # Update with your schemas - - public - replication_method: - plugin: pgoutput - method: CDC - publication: publication_name - replication_slot: oso_slot - initial_waiting_seconds: 5 -``` +## CloudQuery examples in OSO -#### Send the read only credentials to OSO maintainers +Here are a few examples of CloudQuery plugins currently in use: -For now, once this is all completed it is best to open a pull request and an OSO -maintainer will reach out with a method to accept the read only credentials. +- [Importing oss-directory](https://github.com/opensource-observer/oso/tree/main/warehouse/cloudquery-oss-directory) +- [Fetch GitHub data missing from GHArchive](https://github.com/opensource-observer/oso/tree/main/warehouse/cloudquery-github-resolve-repos) diff --git a/apps/docs/docs/contribute/funding-data.md b/apps/docs/docs/contribute/connect-data/funding-data.md similarity index 86% rename from apps/docs/docs/contribute/funding-data.md rename to apps/docs/docs/contribute/connect-data/funding-data.md index 15e250251..991f95b74 100644 --- a/apps/docs/docs/contribute/funding-data.md +++ b/apps/docs/docs/contribute/connect-data/funding-data.md @@ -1,6 +1,6 @@ --- title: Add Funding Data -sidebar_position: 3 +sidebar_position: 10 --- :::info @@ -11,12 +11,12 @@ We are coordinating with several efforts to collect, clean, and visualize OSS fu --- -Add or update OSS funding data by making a pull request to [OSS Funding](https://github.com/opensource-observer/oss-funding). +Add or update funding data by making a pull request to [oss-funding](https://github.com/opensource-observer/oss-funding). -1. Fork [OSS Funding](https://github.com/opensource-observer/oss-funding/fork). +1. Fork [oss-funding](https://github.com/opensource-observer/oss-funding/fork). 2. Add static data in CSV (or JSON) format to `./uploads/`. 3. Ensure the data contains links to one or more project artifacts such as GitHub repos or wallet addresses. This is necessary in order for one of the repo maintainers to link funding events to OSS projects. -4. Submit a pull request from your fork back to [OSS Funding](https://github.com/opensource-observer/oss-funding). +4. Submit a pull request from your fork back to [oss-funding](https://github.com/opensource-observer/oss-funding). ## Contributing Clean Data @@ -28,7 +28,7 @@ Submissions will be validated to ensure they conform to the schema and don't con Additions to the `./clean/` directory should include as many of the following columns as possible: -- `oso_slug`: The OSO project slug (leave blank or null if the project doesn't exist yet). +- `oso_slug`: The OSO project name (leave blank or null if the project doesn't exist yet). 
- `project_name`: The name of the project (according to the funder's data). - `project_id`: The unique identifier for the project (according to the funder's data). - `project_url`: The URL of the project's grant application or profile. @@ -46,6 +46,6 @@ Additions to the `./clean/` directory should include as many of the following co --- -You can read or copy the latest version of the funding data directly from the [OSS Funding](https://github.com/opensource-observer/oss-funding) repo. +You can read or copy the latest version of the funding data directly from the [oss-funding](https://github.com/opensource-observer/oss-funding) repo. If you do something cool with the data (eg, a visualization or analysis), please share it with us! diff --git a/apps/docs/docs/contribute/connect-data/gcs.md b/apps/docs/docs/contribute/connect-data/gcs.md new file mode 100644 index 000000000..c2832049b --- /dev/null +++ b/apps/docs/docs/contribute/connect-data/gcs.md @@ -0,0 +1,37 @@ +--- +title: Connect via Google Cloud Storage (GCS) +sidebar_position: 4 +--- + +Depending on the data, we may accept data dumps +into our Google Cloud Storage (GCS). +If you believe your data storage qualifies to be sponsored +by OSO, please reach out to us on +[Discord](https://www.opensource.observer/discord). + +## Get write access + +Coordinate with the OSO engineering team directly on +[Discord](https://www.opensource.observer/discord) +to give your Google service account write permissions to +our GCS bucket. + +## Defining a Dagster Asset + +:::warning +Coming soon... This section is a work in progress +and will be likely refactored soon. +::: + +To see an example of this in action, +you can look into our Dagster asset for +[Gitcoin passport scores](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py). + +For more details on defining Dagster assets, +see the [Dagster tutorial](https://docs.dagster.io/tutorial). + +## GCS import examples in OSO + +- [Superchain data](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py) +- [Gitcoin Passport scores](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py) +- [OpenRank reputations on Farcaster](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/assets.py) diff --git a/apps/docs/docs/contribute/connect-data/index.md b/apps/docs/docs/contribute/connect-data/index.md new file mode 100644 index 000000000..457e1fcd7 --- /dev/null +++ b/apps/docs/docs/contribute/connect-data/index.md @@ -0,0 +1,27 @@ +--- +title: Connect Your Data +sidebar_position: 0 +--- + +:::info +We're always looking for new data sources to integrate with OSO and deepen our community's understanding of open source impact. If you're a developer or data engineer, please reach out to us on [Discord](https://www.opensource.observer/discord). We'd love to partner with you to connect your database (or other external data sources) to the OSO data warehouse. +::: + +There are currently the following patterns for integrating new data sources into OSO, +in order of preference: + +1. [BigQuery public datasets](./bigquery.md): If you can maintain a BigQuery public dataset, this is the preferred and easiest route. +2. [Airbyte plugins](./airbyte.md): Airbyte plugins are the preferred method for crawling APIs. +3. [Database replication via Airbyte](./airbyte.md): Airbyte maintains off-the-shelf plugins for database replication (e.g. from Postgres). +4. 
[CloudQuery plugins](./cloudquery.md): CloudQuery offers another, more flexible avenue for writing data import plugins. +5. [Files into Google Cloud Storage (GCS)](./gcs.md): You can drop Parquet/CSV files in our GCS bucket for loading into BigQuery. +6. Static files: If the data is high quality and can only be imported via static files, please reach out to us on [Discord](https://www.opensource.observer/discord) to coordinate hand-off. This path is predominantly used for [grant funding data](./funding-data.md). + +We generally prefer to work with data partners that can help us regularly +index live data that can feed our daily data pipeline. +All data sources should be defined as +[software-defined assets](https://docs.dagster.io/concepts/assets/software-defined-assets) in our Dagster configuration. + +ETL is the messiest, most high-touch part of the OSO data pipeline. +Please reach out to us for help on [Discord](https://www.opensource.observer/discord). +We will happily work with you to get it working. diff --git a/apps/docs/docs/contribute/impact-models.md b/apps/docs/docs/contribute/impact-models.md index 47ea8ebae..79cc10743 100644 --- a/apps/docs/docs/contribute/impact-models.md +++ b/apps/docs/docs/contribute/impact-models.md @@ -1,5 +1,5 @@ --- -title: Propose an Impact Model +title: Write a Data Model sidebar_position: 5 --- @@ -85,7 +85,8 @@ poetry install && poetry run oso_lets_go :::tip Under the hood, `oso_lets_go` will create a GCP project and BigQuery dataset if they don't already exist, -and copy a small subset of the OSO data for you to develop against. +and copy a small subset of the OSO data for you to develop against, +called `playground`. It will also create a dbt profile to connect to this dataset (stored in `~/.dbt/profiles.yml`). The script is idempotent, so you can safely run it again @@ -107,7 +108,8 @@ Finally, you can test that everything is working by running the following comman dbt run ``` ---- +This will run the full dbt pipeline against your own +copy of the OSO playground dataset. ## Working with OSO dbt Models @@ -129,8 +131,8 @@ here for a fuller explanation](https://docs.getdbt.com/best-practices/how-we-str - `marts` - This directory contains transformations that should be fairly minimal and mostly be aggregations. In general, `marts` shouldn't depend on other marts unless they're just coarser grained aggregations of an upstream - mart. Marts are also automatically copied to the postgresql database that runs - the OSO website. + mart. Marts are also automatically copied to the frontend database that + powers the OSO API and website. ### OSO data sources @@ -159,11 +161,11 @@ oso_source('ossd', '{TABLE_NAME}') }}` where `{TABLE_NAME}` could be one of the following tables: - `collections` - This data is pulled directly from the [oss-directory - Repository][oss-directory] and is + repository][oss-directory] and is groups of projects. You can view this table [here][collections_table] - `projects` - This data is also pulled directly from the oss-directory - Repository. It describes a project's repositories, blockchain addresses, and + repository. It describes a project's repositories, blockchain addresses, and public packages. You can view this table [here][projects_table] - `repositories` - This data is derived by gathering repository data of all the @@ -217,8 +219,6 @@ namespace for a collection named `foo` would be as follows: {{ oso_id('collection', 'foo')}} ``` ---- - ## Adding Your dbt Model Now you're armed with enough information to add your model! 
Add your model to @@ -238,13 +238,16 @@ dbt run _Note: If you configured the dbt profile as shown in this document, this `dbt run` will write to the `opensource-observer.oso_playground` dataset._ -It is likely best to target a specific model so things don't take so long on -some of our materializations: +It is likely best to target a specific model when developing +so things don't take so long on some of our materializations: ```bash dbt run --select {name_of_your_model} ``` +If `dbt run` runs without issue and you feel that you've completed something you'd +like to contribute. It's time to open a PR! + ### Using the BigQuery UI to check your queries During your development process, it may be useful to use the BigQuery UI to @@ -272,26 +275,12 @@ The presence of the compiled model does not necessarily mean your SQL will work simply that it was rendered by `dbt` correctly. To test your model it's likely cheapest to copy the query into the [BigQuery Console](https://console.cloud.google.com/bigquery) and run that query there. -However, if you need more validation you'll need to [Setup GCP with your own -playground](https://docs.opensource.observer/docs/contribute/transform/setting-up-gcp.md#setting-up-your-own-playground-copy-of-the-dataset) - -### Testing your dbt models - -When you're ready, you can test your dbt models against your playground by -simply running dbt like so: - -```bash -dbt run --select {name_of_your_model} -``` - -If this runs without issue and you feel that you've completed something you'd -like to contribute. It's time to open a PR! ### Submit a PR Once you've developed your model and you feel comfortable that it will properly -run. you can submit it a PR to the [oso Repository][oso] to be tested by the OSO -github CI workflows (_still under development_). +run, you can submit it a PR to the [oso repository][oso] to be tested by the OSO +GitHub CI workflows. ### DBT model execution schedule @@ -301,228 +290,13 @@ pipelines are executed once a day by the OSO CI at 02:00 UTC. The pipeline currently takes a number of hours and any materializations or views would likely be ready for use by 4-6 hours after that time. -## Model Examples - -Here are a few examples of dbt models currently in production: - -### Developers - -This is an intermediate model available in the data warehouse as `int_devs`. - -```sql -SELECT - e.project_id, - e.to_namespace AS repository_source, - e.from_id, - 1 AS amount, - TIMESTAMP_TRUNC(e.time, MONTH) AS bucket_month, - CASE - WHEN - COUNT(DISTINCT CASE WHEN e.event_type = 'COMMIT_CODE' THEN e.time END) - >= 10 - THEN 'FULL_TIME_DEV' - WHEN - COUNT(DISTINCT CASE WHEN e.event_type = 'COMMIT_CODE' THEN e.time END) - >= 1 - THEN 'PART_TIME_DEV' - ELSE 'OTHER_CONTRIBUTOR' - END AS user_segment_type -FROM {{ ref('int_events_to_project') }} AS e -WHERE - e.event_type IN ( - 'PULL_REQUEST_CREATED', - 'PULL_REQUEST_MERGED', - 'COMMIT_CODE', - 'ISSUE_CLOSED', - 'ISSUE_CREATED' - ) -GROUP BY e.project_id, bucket_month, e.from_id, repository_source -``` - -### Events to a Project +You can monitor all pipeline runs in +[GitHub actions](https://github.com/opensource-observer/oso/actions/workflows/warehouse-run-data-pipeline.yml). -This is an intermediate model available in the data warehouse as `int_events_to_project`. 
+## Model References -```sql -SELECT - e.*, - a.project_id -FROM {{ ref('int_events_with_artifact_id') }} AS e -INNER JOIN {{ ref('stg_ossd__artifacts_by_project') }} AS a - ON - e.to_source_id = a.artifact_source_id - AND e.to_namespace = a.artifact_namespace - AND e.to_type = a.artifact_type -``` +All OSO models can be found in +[`warehouse/dbt/models`](https://github.com/opensource-observer/oso/tree/main/warehouse/dbt/models). -### Summary Onchain Metrics by Project - -This is a mart model available in the data warehouse as `onchain_metrics_by_project_v1`. - -```sql -WITH txns AS ( - SELECT - a.project_id, - c.to_namespace AS onchain_network, - c.from_source_id AS from_id, - c.l2_gas, - c.tx_count, - DATE(TIMESTAMP_TRUNC(c.time, MONTH)) AS bucket_month - FROM {{ ref('stg_dune__contract_invocation') }} AS c - INNER JOIN {{ ref('stg_ossd__artifacts_by_project') }} AS a - ON c.to_source_id = a.artifact_source_id -), -metrics_all_time AS ( - SELECT - project_id, - onchain_network, - MIN(bucket_month) AS first_txn_date, - COUNT(DISTINCT from_id) AS total_users, - SUM(l2_gas) AS total_l2_gas, - SUM(tx_count) AS total_txns - FROM txns - GROUP BY project_id, onchain_network -), -metrics_6_months AS ( - SELECT - project_id, - onchain_network, - COUNT(DISTINCT from_id) AS users_6_months, - SUM(l2_gas) AS l2_gas_6_months, - SUM(tx_count) AS txns_6_months - FROM txns - WHERE bucket_month >= DATE_ADD(CURRENT_DATE(), INTERVAL -6 MONTH) - GROUP BY project_id, onchain_network -), -new_users AS ( - SELECT - project_id, - onchain_network, - SUM(is_new_user) AS new_user_count - FROM ( - SELECT - project_id, - onchain_network, - from_id, - CASE - WHEN - MIN(bucket_month) >= DATE_ADD(CURRENT_DATE(), INTERVAL -3 MONTH) - THEN - 1 - END AS is_new_user - FROM txns - GROUP BY project_id, onchain_network, from_id - ) - GROUP BY project_id, onchain_network -), -user_txns_aggregated AS ( - SELECT - project_id, - onchain_network, - from_id, - SUM(tx_count) AS total_tx_count - FROM txns - WHERE bucket_month >= DATE_ADD(CURRENT_DATE(), INTERVAL -3 MONTH) - GROUP BY project_id, onchain_network, from_id -), -multi_project_users AS ( - SELECT - onchain_network, - from_id, - COUNT(DISTINCT project_id) AS projects_transacted_on - FROM user_txns_aggregated - GROUP BY onchain_network, from_id -), -user_segments AS ( - SELECT - project_id, - onchain_network, - COUNT(DISTINCT CASE - WHEN user_segment = 'HIGH_FREQUENCY_USER' THEN from_id - END) AS high_frequency_users, - COUNT(DISTINCT CASE - WHEN user_segment = 'MORE_ACTIVE_USER' THEN from_id - END) AS more_active_users, - COUNT(DISTINCT CASE - WHEN user_segment = 'LESS_ACTIVE_USER' THEN from_id - END) AS less_active_users, - COUNT(DISTINCT CASE - WHEN projects_transacted_on >= 3 THEN from_id - END) AS multi_project_users - FROM ( - SELECT - uta.project_id, - uta.onchain_network, - uta.from_id, - mpu.projects_transacted_on, - CASE - WHEN uta.total_tx_count >= 1000 THEN 'HIGH_FREQUENCY_USER' - WHEN uta.total_tx_count >= 10 THEN 'MORE_ACTIVE_USER' - ELSE 'LESS_ACTIVE_USER' - END AS user_segment - FROM user_txns_aggregated AS uta - INNER JOIN multi_project_users AS mpu - ON uta.from_id = mpu.from_id - ) - GROUP BY project_id, onchain_network -), -contracts AS ( - SELECT - project_id, - artifact_namespace AS onchain_network, - COUNT(artifact_source_id) AS num_contracts - FROM {{ ref('stg_ossd__artifacts_by_project') }} - GROUP BY project_id, onchain_network -), -project_by_network AS ( - SELECT - p.project_id, - ctx.onchain_network, - p.project_name - FROM {{ ref('projects_v1') 
}} AS p - INNER JOIN contracts AS ctx - ON p.project_id = ctx.project_id -) - -SELECT - p.project_id, - p.onchain_network AS network, - p.project_name, - c.num_contracts, - ma.first_txn_date, - ma.total_txns, - ma.total_l2_gas, - ma.total_users, - m6.txns_6_months, - m6.l2_gas_6_months, - m6.users_6_months, - nu.new_user_count, - us.high_frequency_users, - us.more_active_users, - us.less_active_users, - us.multi_project_users, - ( - us.high_frequency_users + us.more_active_users + us.less_active_users - ) AS active_users -FROM project_by_network AS p -LEFT JOIN metrics_all_time AS ma - ON - p.project_id = ma.project_id - AND p.onchain_network = ma.onchain_network -LEFT JOIN metrics_6_months AS m6 - ON - p.project_id = m6.project_id - AND p.onchain_network = m6.onchain_network -LEFT JOIN new_users AS nu - ON - p.project_id = nu.project_id - AND p.onchain_network = nu.onchain_network -LEFT JOIN user_segments AS us - ON - p.project_id = us.project_id - AND p.onchain_network = us.onchain_network -LEFT JOIN contracts AS c - ON - p.project_id = c.project_id - AND p.onchain_network = c.onchain_network -``` +We also continuously deploy model reference documentation at +[https://models.opensource.observer/](https://models.opensource.observer/) diff --git a/apps/docs/docs/contribute/index.mdx b/apps/docs/docs/contribute/index.mdx index 0da77a078..0028d0b75 100644 --- a/apps/docs/docs/contribute/index.mdx +++ b/apps/docs/docs/contribute/index.mdx @@ -19,37 +19,37 @@ There are a variety of ways you can contribute to OSO. This doc features some of Update Project Data -OSS Directory +oss-directory Add a new project or update info for an existing project. OSS Projects, Analysts, General Public Add Funding Data -OSS Funding +oss-funding Add to our database of OSS funding via CSV upload. OSS Funders, Analysts Connect Your Data -OSO Monorepo +oso Write a plugin or help us replicate your data in the OSO data warehouse. Data Engineers, Developers -Propose an Impact Model -OSO Monorepo -Submit a dbt model for tracking open source impact metrics. +Propose an Impact Data Model +oso +Submit a dbt data model for tracking open source impact metrics. Data Scientists, Analysts Share Insights -Insights +insights Contribute to our library of data visualizations and Jupyter notebooks. Data Scientists, Analysts Join a Data Challenge -Insights +insights Work on a specific data challenge and get paid for your contributions. Data Scientists, Analysts diff --git a/apps/docs/docs/contribute/project-data.md b/apps/docs/docs/contribute/project-data.md index ecd932657..a27e9e255 100644 --- a/apps/docs/docs/contribute/project-data.md +++ b/apps/docs/docs/contribute/project-data.md @@ -3,33 +3,34 @@ title: Update Project Data sidebar_position: 2 --- -:::info -Add or update data about a project by making a pull request to the OSS Directory. -When a new project is added to OSS directory, we automatically index relevant -data about its history and ongoing activity so it can be queried via our API, included -in metrics dashboards, and analyzed by data scientists. -::: +We maintain a repository of open source projects called +[oss-directory](https://github.com/opensource-observer/oss-directory). +Think of it as an awesome list of reputable open source +projects that lists all known 'artifacts' related to a project, +from GitHub repos, to social media profiles, to software deployments. +We use this as the starting point of the OSO data pipeline, +which is run automatically daily to produce metrics +for our API and dashboards. 
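+
+To make this concrete, here is a minimal, hypothetical project file. The names,
+URL, and address below are illustrative only; see the schema overview below and
+existing files under `./data/projects/` for the authoritative structure:
+
+```yaml
+version: 7
+name: my-project
+display_name: My Project
+description: A short description of the project
+github:
+  - url: https://github.com/my-org/my-project
+blockchain:
+  - address: "0x1234567890abcdef1234567890abcdef12345678"
+    networks:
+      - any_evm
+    tags:
+      - eoa
+      - deployer
+```
+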
## Quick Steps ---- - -Add or update project data by making a pull request to [OSS Directory](https://github.com/opensource-observer/oss-directory). +Add or update project data by making a pull request to +[oss-directory](https://github.com/opensource-observer/oss-directory). -1. Fork [OSS Directory](https://github.com/opensource-observer/oss-directory/fork). +1. Fork [oss-directory](https://github.com/opensource-observer/oss-directory/fork). 2. Locate or create a new project `.yaml` file under `./data/projects/`. -3. Link artifacts (ie, GitHubs, npm packages, blockchain addresses) in the project `.yaml` file. -4. Submit a pull request from your fork back to [OSS Directory](https://github.com/opensource-observer/oss-directory). -5. Once your pull request is approved, your project will automatically be added to our daily indexers. It may take longer for some historical data (eg, GitHub events) to show up as we run backfill jobs less frequently. +3. Link artifacts (ie, GitHub repositories, npm packages, blockchain addresses) in the project `.yaml` file. +4. Submit a pull request from your fork back to [oss-directory](https://github.com/opensource-observer/oss-directory). +5. Once your pull request is approved, your project will automatically be added to our daily indexers. It may take longer for some historical data (e.g. GitHub events) to show up as we run backfill jobs less frequently. ## Schema Overview ---- - -Make sure to use the latest version of the OSS Directory schema. You can see the latest version by opening any project YAML file and getting the version from the top of file. Note, since Version 3, we have replace the field `slug` with `name` and the previous `name` field with `display_name`. +Make sure to use the latest version of the oss-directory schema. You can see the latest version by opening any project YAML file and getting the version from the top of file. :::important The `name` field is the unique identifier for the project and **must** match the name of the project file. For example, if the project file is `./data/projects/m/my-project.yaml`, then the `name` field should be `my-project`. As a convention, we usually take the GitHub organization name as the project `name`. If the project is a standalone repo within a larger GitHub organization or personal account, you can use the project name followed by the repo owner as the name, separated by hyphens. + +Note: since version 3, we have replaced the `slug` field with `name` and the previous `name` field with `display_name`. ::: ### Fields @@ -38,11 +39,16 @@ The schema currently contains the following fields: - `version`: The latest version of the OSS Directory schema. This is a required field. To find the latest version, open any project YAML file and get the version from the top of the file. As of writing (2024-06-05), the latest version is Version 7. - `name`: The unique identifier for the project. This is usually the GitHub organization name or the project name followed by the repo owner, separated by hyphens. This is a required field. -- `display_name`: The name of the project. This is a required field. +- `display_name`: The display name of the project. This is a required field. - `description`: A brief description of the project. +- `websites`: A list of associated websites +- `social`: A list of social channels (e.g. Twitter, Telegram, Discord) - `github`: The GitHub URL of the project. This is a list of URLs, as a project can have multiple GitHub URLs. 
In most cases, the first and only URL will be the main GitHub organization URL. You don't need to include all the repositories that belong to the organization, as we will automatically index all of them. -- `npm`: The npm URL of a package owned the project. This is a list of URLs, as a project can have multiple npm URLs. +- `npm`: The npm URL of a package owned the project. This is a list of URLs, as a project can have multiple npm packages. - `blockchain`: A list of blockchain addresses associated with the project. Each address should include the address itself, the networks it is associated with, and any tags that describe the address. The most important addresses to include are deployers and wallets. We use deployers to trace all contracts deployed by a project, and wallets to trace all transactions made by a project. +- `comments`: Feel free to store any useful comments for maintainers here. + +For the latest fields, see the [project schema](https://github.com/opensource-observer/oss-directory/blob/main/src/resources/schema/project.json) ### Supported Blockchain Networks and Tags @@ -50,19 +56,19 @@ The schema currently contains the following fields: The simplest way to add all contracts and factories associated with your project is to just add the deployer address in the project file. We will then automatically index all contracts and factories associated with the deployer address. If the deployer is on multiple EVM networks, you can use the `any_evm` tag instead of listing each network individually. ::: -The OSS Directory currently supports the following blockchain networks, which can be enumerated in the `networks` field of a blockchain address: +OSO currently supports the following blockchain networks, which can be enumerated in the `networks` field of a blockchain address: - `any_evm`: Any Ethereum Virtual Machine (EVM) network. This is the recommended tag for EOAs that deploy contracts on multiple EVM networks. - `mainnet`: The Ethereum mainnet. - `arbitrum_one`: The Arbitrum L2 network. -- `optimism`: The Optimism L2 network. - `base`: The Base L2 network. +- `frax`: The Frax L2 network. - `metal`: The Metal L2 network. - `mode`: The Mode L2 network. -- `frax`: The Frax L2 network. +- `optimism`: The Optimism L2 network. - `zora`: The Zora L2 network. -We do not support testnets for any of these networks and do not intend to. +Note: We do not support testnets for any of these networks and do not intend to. The following tags can be used to describe blockchain addresses: @@ -80,13 +86,11 @@ Read below for more detailed steps on how to add or update project data or consu ## Detailed Steps ---- - Here's a more detailed set of instructions for first-time contributors. -### 1. Fork OSS Directory +### 1. Fork oss-directory -- Navigate to the [OSS Directory](https://github.com/opensource-observer/oss-directory) repo. +- Navigate to the [oss-directory](https://github.com/opensource-observer/oss-directory) repo. - Click the "Fork" button in the upper right corner of this page. This will create a copy of the repository in your GitHub account. It's best practice to keep the same repository name, but you can change it if you want. @@ -175,10 +179,11 @@ Some projects may own a lot of blockchain addresses. The most important addresse ### 4. Submit a pull request from your fork to our repository -- Save your changes and open a pull request from your fork to the [OSS Directory](https://github.com/opensource-observer/oss-directory). 
+- Save your changes and open a pull request from your fork to [oss-directory](https://github.com/opensource-observer/oss-directory). - If you are adding multiple new projects, you can include them all in the same pull request, but please provide some comments to help us understand your changes. +- Opening the pull request will trigger automated validation of the artifacts you added. If there are any issues or duplicates found, the pull request will be rejected and you will be notified in the pull request thread. - Your submission will be reviewed by a maintainer before approving the pull request. If there are any issues, you will be notified in the pull request thread. -- Your submission will be merged once it is approved by a maintainer. Merging the pull request will trigger automated validation of the artifacts you added. If there are any issues or duplicates found, the pull request will be rejected and you will be notified in the pull request thread. +- Your submission will be merged once it is approved by a maintainer. - Once the pull request is merged successfully, your project will be added to the indexing queue for inclusion in all subsequent data updates. :::tip @@ -193,9 +198,7 @@ Note that our indexer currently runs every 24 hours at 02:00 UTC. Therefore, it ## Bulk Updates ---- - -To make bulk updates, we recommend cloning the [OSS Directory](https://github.com/opensource-observer/oss-directory) repo and making changes locally. Then, submit a complete set of project updates via one pull request. +To make bulk updates, we recommend cloning the [oss-directory](https://github.com/opensource-observer/oss-directory) repo and making changes locally. Then, submit a complete set of project updates via one pull request. Given that the project data may come in all sorts of formats, we have not included a script that will work for all cases. We have included a [few scripts](https://github.com/opensource-observer/oss-directory/tree/main/src/scripts) as examples. These take CSV, TOML, or JSON files that contain a list of projects and associated artifacts. diff --git a/apps/docs/docs/references/_category_.json b/apps/docs/docs/references/_category_.json new file mode 100644 index 000000000..a77336457 --- /dev/null +++ b/apps/docs/docs/references/_category_.json @@ -0,0 +1,8 @@ +{ + "label": "References", + "position": 5, + "link": { + "type": "doc", + "id": "index" + } +} diff --git a/apps/docs/docs/references/index.md b/apps/docs/docs/references/index.md new file mode 100644 index 000000000..aeb663e68 --- /dev/null +++ b/apps/docs/docs/references/index.md @@ -0,0 +1,21 @@ +--- +title: References +sidebar_position: 1 +--- + +## Data Model Reference Documentation + +[https://models.opensource.observer](https://models.opensource.observer/) + +These are the auto-generated dbt docs for each model +in `warehouse/dbt/models/`. + +## Dagster dashboard + +[https://dagster.opensource.observer](https://dagster.opensource.observer) + +Use this to view the entire data infrastructure, +as well as the current status of every stage of the pipeline. + +Admins can trigger runs +[here](https://admin-dagster.opensource.observer/)