ETL Group Project 2

(Using Jupyter Notebook and PostgreSQL)

DU - DA Module 13
Crowdfunding ETL Mini Project

by M. Aparisio and L. Vara

INDEX

Content of the repository
Instructions for the NoSQL Challenge
References

Content of the repository

Resources directory:
- campaign.csv <-- Database created via ETL_Mini_Project_MAparisio_LVara.ipynb
- category.csv <-- Database created via ETL_Mini_Project_MAparisio_LVara.ipynb
- contacts.csv <-- Database created via ETL_Mini_Project_MAparisio_LVara.ipynb
- contacts.xlsx <-- Provided Excel Database
- crowdfunding.xlsx <-- Provided Excel Database
- subcategory.csv <-- Database created via ETL_Mini_Project_MAparisio_LVara.ipynb
Database_Crowdfunding_db.png <-- Image displaying that a Database named Crowdfunding_db was created in PostgreSQL
ERD_Crowdfunding.png <-- Image of the ERD created to initiate the schema
ETL_Mini_Project_MAparisio_LVara.ipynb <-- ETL initial process via Jupyter Notebook file
crowdfunding_db_schema.sql <-- Schema used to create the Crowdfunding_db Database to finalize the ETL process

Instructions

The instructions for this mini project are divided into the following subsections:

Create the Category and Subcategory DataFrames
Create the Campaign DataFrame
Create the Contacts DataFrame
Create the Crowdfunding Database

CREATE THE CATEGORY AND SUBCATEGORY DATAFRAMES

Extract and transform the crowdfunding.xlsx Excel data to create a category DataFrame that has the following columns:

A "category_id" column that has entries going sequentially from "cat1" to "catn", where n is the number of unique categories
A "category" column that contains only the category titles
The following image shows this category DataFrame:

Export the category DataFrame as category.csv and save it to your GitHub repository.
Extract and transform the crowdfunding.xlsx Excel data to create a subcategory DataFrame that has the following columns:

A "subcategory_id" column that has entries going sequentially from "subcat1" to "subcatn", where n is the number of unique subcategories
A "subcategory" column that contains only the subcategory titles
The following image shows this subcategory DataFrame:

Export the subcategory DataFrame as subcategory.csv and save it to your GitHub repository.

CREATE THE CAMPAIGN DATAFRAME

Extract and transform the crowdfunding.xlsx Excel data to create a campaign DataFrame has the following columns:

The "cf_id" column
The "contact_id" column
The "company_name" column
The "blurb" column, renamed to "description"
The "goal" column, converted to the float data type
The "pledged" column, converted to the float data type
The "outcome" column
The "backers_count" column
The "country" column
The "currency" column
The "launched_at" column, renamed to "launch_date" and with the UTC times converted to the datetime format
The "deadline" column, renamed to "end_date" and with the UTC times converted to the datetime format
The "category_id" column, with unique identification numbers matching those in the "category_id" column of the category DataFrame
The "subcategory_id" column, with the unique identification numbers matching those in the "subcategory_id" column of the subcategory DataFrame
The following image shows this campaign DataFrame:

Export the campaign DataFrame as campaign.csv and save it to your GitHub repository.

CREATE THE CONTACTS DATAFRAME

1. Choose one of the following two options for extracting and transforming the data from the **contacts.xlsx** Excel data: - **Option 1:** Use Python dictionary methods. - **Option 2:** Use regular expressions.

If you chose Option 1, complete the following steps:

Import the contacts.xlsx file into a DataFrame.
Iterate through the DataFrame, converting each row to a dictionary.
Iterate through each dictionary, doing the following:
Extract the dictionary values from the keys by using a Python list comprehension.
Add the values for each row to a new list.
Create a new DataFrame that contains the extracted data.
Split each "name" column value into a first and last name, and place each in a new column.
Clean and export the DataFrame as contacts.csv and save it to your GitHub repository.

If you chose Option 2, complete the following steps:

Import the contacts.xlsx file into a DataFrame.
Extract the "contact_id", "name", and "email" columns by using regular expressions.
Create a new DataFrame with the extracted data.
Convert the "contact_id" column to the integer type.
Split each "name" column value into a first and a last name, and place each in a new column.
Clean and then export the DataFrame as contacts.csv and save it to your GitHub repository.

Check that your final DataFrame resembles the one in the following image:

CREATE THE CROWDFUNDING DATABASE

Inspect the four CSV files, and then sketch an ERD of the tables by using QuickDBDLinks to an external site..
- https://app.quickdatabasediagrams.com/
Use the information from the ERD to create a table schema for each CSV file. Note: Remember to specify the data types, primary keys, foreign keys, and other constraints.
Save the database schema as a Postgres file named crowdfunding_db_schema.sql, and save it to your GitHub repository.
Create a new Postgres database, named crowdfunding_db.
Using the database schema, create the tables in the correct order to handle the foreign keys.
Verify the table creation by running a SELECT statement for each table.
Import each CSV file into its corresponding SQL table.
Verify that each table has the correct data by running a SELECT statement for each.

Hints

To split each "category & sub-category" column value into "category" and "subcategory" column values, use df[["new_column1","new_column2"]] = df["column"].str.split(). Make sure to pass the correct parameters to the split() function.
To get the unique category and subcategory values from the "category" and "subcategory" columns, create a NumPy array where the array length equals the number of unique categories and unique subcategories from each column. For information about how to do so, see numpy.arangeLinks to an external site. in the NumPy documentation.
To create the category and subcategory identification numbers, use a list comprehension to add the "cat" string or the "subcat" string to each number in the category or the subcategory array, respectively.
For more information about creating a new Pandas DataFrame, see the pandas.DataFrameLinks to an external site. in the Pandas documentation.
To convert the "goal" and "pledged" columns to the float data type, use the astype() method.
To convert the "launch_date" and "end_date" UTC times to the datetime format, see the Transform_Grocery_Orders_Solved.ipynb activity solution.
For more information about how to add the "category_id" and "subcategory_id" unique identification numbers to the campaign DataFrame, see the pandas.DataFrame.mergeLinks to an external site. in the Pandas documentation.

References for NoSQL UK Food Ratings

Most of what's in this challenge, was covered in class.
The few things that either weren't or we had to reference to, are described with it's source right below.

To create a dictionary and use json.loads() method:
- Reference: https://www.geeksforgeeks.org/json-loads-in-python/
ERD references using these two videos:
- Part 1: https://youtu.be/xsg9BDiwiJE?si=u5p3yiQeC447DUhU
- Part 2: https://youtu.be/hktyW5Lp0Vo?si=u-t-066w0SxvuMvQ

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ETL Group Project 2

DU - DA Module 13
Crowdfunding ETL Mini Project

by M. Aparisio and L. Vara

INDEX

Content of the repository

Instructions

CREATE THE CATEGORY AND SUBCATEGORY DATAFRAMES

CREATE THE CAMPAIGN DATAFRAME

CREATE THE CONTACTS DATAFRAME

CREATE THE CROWDFUNDING DATABASE

References for NoSQL UK Food Ratings

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Resources		Resources
Database_Crowdfunding_db.png		Database_Crowdfunding_db.png
ERD_Crowdfunding.png		ERD_Crowdfunding.png
ETL_Mini_Project_MAparisio_LVara.ipynb		ETL_Mini_Project_MAparisio_LVara.ipynb
README.md		README.md
crowdfunding_db_schema.sql		crowdfunding_db_schema.sql

vara-co/Crowdfunding_ETL

Folders and files

Latest commit

History

Repository files navigation

ETL Group Project 2

DU - DA Module 13 Crowdfunding ETL Mini Project by M. Aparisio and L. Vara

INDEX

Content of the repository

Instructions

CREATE THE CATEGORY AND SUBCATEGORY DATAFRAMES

CREATE THE CAMPAIGN DATAFRAME

CREATE THE CONTACTS DATAFRAME

CREATE THE CROWDFUNDING DATABASE

References for NoSQL UK Food Ratings

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

DU - DA Module 13
Crowdfunding ETL Mini Project

by M. Aparisio and L. Vara

Packages