docs: update data science section
ccerv1 committed Mar 26, 2024
1 parent e7d74c0 commit 8465044
Showing 2 changed files with 84 additions and 59 deletions.
143 changes: 84 additions & 59 deletions apps/docs/docs/integrate/data-science.md
Notebooks are a great way for data scientists to explore data, organize ad-hoc analyses, and share insights.

:::note
You will need access to the OSO data warehouse to do data science. See our getting started guide [here](../get-started/#login-to-bigquery).
:::

## Fetching Data

There are three common ways to fetch data from the OSO data warehouse so you can run analysis on it:

1. **Google Colab**: Run your analysis in the cloud using Google Colab.
2. **Jupyter on Your Machine**: Run your analysis locally using Jupyter.
3. **Export to CSV / JSON**: Export your data from BigQuery to a CSV or JSON file and then import it into your preferred tool.

The following sections will walk you through each of these methods.

### Using Google Colab

---

You can also create a new notebook from scratch and run it in the cloud. Here's how:

1. Create a new Colab notebook [here](https://colab.research.google.com/#create=true).

2. Authenticate with BigQuery. In the first block at the top of the Colab notebook, copy and execute the following code:

```python
# @title Setup
from google.colab import auth

# Authenticate this notebook with your Google account
auth.authenticate_user()

# Load the BigQuery cell magic (%%bigquery) used in the next steps
%load_ext google.cloud.bigquery
```

You will be prompted to give this notebook access to your Google account. Once you have authenticated, you can start querying the OSO data warehouse.

3. Create a new code block for your query. In this block, we will use the `%%bigquery` magic command to run a SQL query and store the results in a Pandas dataframe (named `df` in the example below).

Here's an example of how to fetch the latest code metrics for all projects in the OSO data warehouse. **Remember to replace `my-oso-playground` with your project id.**

```python
%%bigquery df --project my-oso-playground
# replace 'my-oso-playground' with your project id
SELECT *
FROM `opensource-observer.oso_playground.code_metrics_by_project`
ORDER BY last_commit_date DESC
```

Execute the code block. The query will run in a few seconds and the results will be stored in the `df` dataframe.

4. Create a new code block to preview the first few rows of your dataframe:

```python
df.head()
```

This will show you the first few rows of your dataframe so you can get a sense of the data you're working with.

5. Move from the "playground" to the "production" dataset. Once you have a working query, you can replace `oso_playground` with `oso` to fetch data from the production dataset.

```python
%%bigquery df --project my-oso-playground
# replace 'my-oso-playground' with your project id
SELECT *
FROM `opensource-observer.oso.code_metrics_by_project`
ORDER BY last_commit_date DESC
```

6. Import other common data science libraries like `pandas`, `numpy`, `matplotlib`, and `seaborn` to help you analyze and visualize your data.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

You can execute these imports in a new code block after you've grabbed your data, or add them to the setup block at the top of your notebook.
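For instance, here's a minimal sketch of a chart you could draw once `df` is loaded (the `star_count` and `fork_count` column names are assumptions; check `df.columns` for the actual schema):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed column names; inspect df.columns for the actual schema
sns.scatterplot(data=df, x='star_count', y='fork_count')
plt.title('Stars vs. forks for OSO projects')
plt.show()
```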

That's it! You're ready to start analyzing the OSO dataset in a Google Colab notebook. You can [skip ahead to the tutorial](./data-science#tutorial-github-stars--forks-analysis) to see an example of how to analyze the data.

:::tip
You can also download your Colab notebooks to your local machine and run them in Jupyter.
:::

### Using Jupyter on your machine

---

This section will walk you through setting up a local Jupyter notebook environment, storing your GCP service account key on your machine, and connecting to the OSO data warehouse.

#### Install Anaconda

For new users, we highly recommend [installing Anaconda](https://www.anaconda.com/download). Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for working with data.

If you already have Jupyter installed, you can skip steps 1 and 2 below.

1. Download [Anaconda](https://www.anaconda.com/download). We recommend downloading Anaconda’s latest Python 3 version.

Congratulations! You're in. You should have an empty Jupyter notebook on your computer.

:::tip
If you run into issues getting set up with Jupyter, check out the [Jupyter docs](https://docs.jupyter.org/en/latest/install.html).
:::

#### Install standard dependencies

If you just installed Anaconda, you should have all the standard data science packages installed. Skip ahead to [the next section](#install-the-bigquery-python-client-library).

If you're here, we will assume you have some familiarity with setting up a local Python environment.

Remember, it is a best practice to use a Python virtual environment tool such as [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage dependencies.
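For example, one way to set up and activate a virtual environment (a minimal sketch assuming a Unix-like shell):

```
pip install virtualenv
virtualenv .venv
source .venv/bin/activate
```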

##### Install pip and jupyter

First, ensure that you have the latest pip; older versions may have trouble with some dependencies:

```
pip install --upgrade pip
```

Then install the Jupyter Notebook using:

```
pip install jupyter
```

##### For working with dataframes and vector operations

The following packages are used in almost every Python data science application:

- [pandas](https://pandas.pydata.org/)
- [numpy](https://numpy.org/)

Install them:

```
pip install pandas numpy
```

##### For statistical analysis and vector operations

The following packages are used for statistical analysis and vector operations:

- [scikit-learn](https://scikit-learn.org/)
- [scipy](https://scipy.org/)

Install them:

```
pip install scikit-learn scipy
```

##### For working with graph data

If you plan on doing graph-based analysis, you may want to install [networkx](https://networkx.org/).

```
pip install networkx
```
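For example, a tiny sketch of how you might build a graph from a dataframe of edges (the `source` and `target` columns here are illustrative, not taken from OSO data):

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list: each row links a contributor to a project
edges = pd.DataFrame({
    "source": ["alice", "bob", "alice"],
    "target": ["project-a", "project-a", "project-b"],
})
G = nx.from_pandas_edgelist(edges, source="source", target="target")
print(G.number_of_nodes(), G.number_of_edges())
```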

##### For charting and data visualization

These are the most popular packages for static data visualization:

- [matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)

Install them:

```
pip install matplotlib seaborn
```

For interactive data visualization, you may want to install [plotly](https://plotly.com/):

```
pip install plotly
```

#### Install the BigQuery Python Client Library

We recommend using the [Google Cloud BigQuery Python Client Library](https://cloud.google.com/python/docs/reference/bigquery/latest/index.html) to connect to the OSO data warehouse. This library provides a convenient way to interact with BigQuery from your Jupyter notebook.

```
pip install google-cloud-bigquery
```

Alternatively, you can stick to static analysis and export your data from BigQuery to a CSV or JSON file and then import it into your notebook.

#### Obtain a GCP Service Account Key

This section will walk you through the process of obtaining a GCP service account key and connecting to BigQuery from a Jupyter notebook. If you don't have a GCP account, you will need to create one (see [here](../get-started) for instructions).

Once you create a service account key in the GCP console, it will download a JSON file with your private key info. You should be able to find it in your downloads folder.

Now you're ready to authenticate with BigQuery using your service account key.

#### Connect to BigQuery from Jupyter

From the command line, open a Jupyter notebook:

```
jupyter notebook
```

Create a new notebook. In the first cell, point the client at your service account key:

```python
import os
from google.cloud import bigquery

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''  # path to your service account key
client = bigquery.Client()
```

Try a sample query to test your connection:
```python
query = """
SELECT *
FROM `opensource-observer.oso_playground.code_metrics_by_project`
ORDER BY last_commit_date DESC
"""
results = client.query(query)
results.to_dataframe()
```

If everything is working, you should see a dataframe with the results of your query.

#### Keep your service account key safe

You should never commit your service account key to a public repository. Instead, you can store it in a secure location on your local machine and reference it in your code using an environment variable.

If you plan on sharing your notebook with others, you can use a package like [python-dotenv](https://pypi.org/project/python-dotenv/) to load your environment variables from a `.env` file.
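For example, a minimal sketch assuming your `.env` file sets `GOOGLE_APPLICATION_CREDENTIALS` to the path of your key file:

```python
from dotenv import load_dotenv
from google.cloud import bigquery

# Reads key=value pairs from a local .env file into environment variables,
# e.g. GOOGLE_APPLICATION_CREDENTIALS=./credentials.json
load_dotenv()

# The client picks up GOOGLE_APPLICATION_CREDENTIALS from the environment
client = bigquery.Client()
```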

Always remember to add your `.env` or `credentials.json` file to your `.gitignore` file to prevent it from being committed to your repository.
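For example, the relevant `.gitignore` entries might look like this:

```
.env
credentials.json
```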

### Exporting CSV / JSON files from BigQuery

If you prefer to work with static data, you can export your data from BigQuery to a CSV or JSON file and then import it into your notebook, spreadsheet, or other tool.

1. Navigate to the BigQuery console [here](https://console.cloud.google.com/bigquery).
2. Try a sample query and click **Run** to execute it. For example, you can fetch the latest code metrics for all projects in the OSO data warehouse:
```sql
SELECT *
FROM `opensource-observer.oso_playground.code_metrics_by_project`
ORDER BY last_commit_date DESC
```
3. Click the **Save Results** button to export your data in your preferred format. Note that there are limits to the amount of data you can download locally vs the amount you can save on Google Drive. If you have a large dataset (above 10MB), you may need to save it to Google Drive and then download it from there.

![GCP Save Results](./gcp_save_results.png)

4. Finally, you can import your data into your analysis tool of choice. For example, you can import a CSV file into a Pandas dataframe in a Jupyter notebook:

```python
import pandas as pd

df = pd.read_csv('path/to/your/file.csv')
```

If this is your preferred workflow, you can [skip the first part](./data-science#transform) of the next section.

These notebooks typically have the following structure: a query to fetch data from the data warehouse, a transform step to clean and reshape it, and an analysis or visualization step.

This next section will help you create a notebook from scratch, performing each of these steps using the OSO playground dataset.


### Prepare your notebook

From the command line, create a new Jupyter notebook:

```bash
jupyter notebook
```

A Jupyter directory will open in your browser. Navigate to the directory where you want to store your notebook. Create a new notebook.
At the top of the notebook, import the BigQuery client library and point it at your service account key:

```python
import os
from google.cloud import bigquery

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''  # path to your service account key
client = bigquery.Client()
```


#### Query
Now you can fetch the data you want.

In this example, we will fetch the latest code metrics for all projects on OSO.

```python
query = """
    SELECT *
    FROM `opensource-observer.oso_playground.code_metrics_by_project`
    ORDER BY last_commit_date DESC
"""
results = client.query(query)
```

We recommend exploring the data in [the BigQuery console](https://console.cloud.google.com/bigquery) before running your query in your notebook. This will help you understand the structure of the data and the fields you want to fetch.

[![GCP Playground](./gcp_playground.png)](https://console.cloud.google.com/bigquery)


Store the results of your query in a dataframe and preview the first few rows:

```python
df = results.to_dataframe()
df.head()
```

### Tutorial: GitHub Stars & Forks Analysis

The example below analyzes the latest code metrics for all projects in the OSO data warehouse and generates a scatter plot of the number of forks vs the number of stars for each project.

If you're running locally, follow the earlier steps to authenticate with BigQuery and fetch your data. You can find the full notebook [here](https://github.com/opensource-observer/insights/blob/main/community/notebooks/oso_starter_tutorial.ipynb).

If you're using Colab, you can copy and execute our [tutorial notebook](https://colab.research.google.com/drive/1v318jtHyuU55JMx2vR9QEXENwFCITBrh?usp=drive_link). Just remember to replace `opensource-observer` with the name of your project in the `%%bigquery` magic command.

#### Transform

Once you have fetched your data, you can transform it into a format that is ready for analysis.

Let's apply some basic cleaning and transformation to the data. This block will:

- Remove any rows where the number of forks or stars is 0; copy to a new dataframe
- Create a new column to indicate whether the project has had recent activity (commits in the last 6 months)
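A minimal sketch of that transformation (the `forks` and `stars` column names are assumptions; check `df.columns` for the actual schema):

```python
import pandas as pd

# Keep only projects with at least one fork and one star (assumed column names)
dff = df[(df['forks'] > 0) & (df['stars'] > 0)].copy()

# Flag projects with commits in the last 6 months
cutoff = pd.Timestamp.now(tz='UTC') - pd.DateOffset(months=6)
dff['recent_activity'] = pd.to_datetime(dff['last_commit_date'], utc=True) >= cutoff
```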
Binary file modified apps/docs/docs/integrate/gcp_save_results.png
