This project demonstrates data engineering tasks using basketball data within a Databricks environment. The main goals are to process raw data, cleanse it, and analyze it using SQL to gain insights into player demographics across different teams.
- Databricks Workspace
- PySpark
- SQL
Purpose: Handle the initial data processing tasks, including reading raw data from a CSV file, performing data cleansing, and writing the cleaned data to a Parquet file.
- Import basketball data from a CSV file.
- Rename columns for better readability and consistency.
- Replace blank or null values with appropriate placeholders or default values.
- Add new columns that may be required for further analysis.
- Write the processed data to the Parquet file format for efficient querying and storage (a PySpark sketch of these steps follows this list).
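A minimal PySpark sketch of the cleansing pipeline is shown below. The file paths, the original and renamed column names (such as `player_name`, `team`, `age`, `height_inches`), and the default fill values are assumptions for illustration; adjust them to match the actual basketball CSV schema. The `spark` session is pre-created in Databricks notebooks.

```python
from pyspark.sql import functions as F

# Read the raw basketball data (path and schema are assumed for illustration).
raw_df = spark.read.csv("/FileStore/tables/basketball.csv",
                        header=True, inferSchema=True)

# Rename columns for readability and consistency (hypothetical original names).
df = (raw_df
      .withColumnRenamed("PLAYER", "player_name")
      .withColumnRenamed("TEAM", "team")
      .withColumnRenamed("AGE", "age")
      .withColumnRenamed("HT_IN", "height_inches"))

# Replace blank or null values with placeholder defaults.
df = df.na.fill({"player_name": "unknown", "team": "unassigned"})
df = df.na.fill(0, subset=["age", "height_inches"])

# Add a derived column that later analysis can use, e.g. height in feet.
df = df.withColumn("height_feet", F.col("height_inches") / 12)

# Write the cleansed data to Parquet for efficient querying and storage.
df.write.mode("overwrite").parquet("/FileStore/tables/basketball_clean.parquet")
```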
Purpose: Create a table from the Parquet file generated in the first notebook and perform SQL queries to extract specific insights.
- Create a table in Databricks using the Parquet file as the data source.
- Query 1: View the oldest player on each team.
- Query 2: Display the players on each team with a height greater than 6 feet (see the sketch after this list).
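The sketch below illustrates the table creation and both queries, run through `spark.sql` from a Databricks notebook. The table name `basketball_players`, the Parquet path, and the column names carry over from the assumed schema in the earlier sketch.

```python
# Create a table backed by the Parquet file produced by the first notebook.
spark.sql("""
    CREATE TABLE IF NOT EXISTS basketball_players
    USING PARQUET
    LOCATION '/FileStore/tables/basketball_clean.parquet'
""")

# Query 1: the oldest player on each team, found via a per-team maximum age.
spark.sql("""
    SELECT p.team, p.player_name, p.age
    FROM basketball_players p
    JOIN (SELECT team, MAX(age) AS max_age
          FROM basketball_players
          GROUP BY team) m
      ON p.team = m.team AND p.age = m.max_age
""").show()

# Query 2: players on each team taller than 6 feet.
spark.sql("""
    SELECT team, player_name, height_feet
    FROM basketball_players
    WHERE height_feet > 6
    ORDER BY team, height_feet DESC
""").show()
```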
- Databricks environment setup
- Required libraries installed (e.g., pandas and pyspark)
- Access to the basketball CSV data file
- Upload the basketball CSV data file to Databricks.
- Execute the first notebook to perform data cleansing and generate the Parquet file.
- Verify the output Parquet file in the Databricks file system (see the verification sketch after these steps).
- Execute the second notebook to create a table from the Parquet file.
- Run the provided SQL queries to extract insights and verify the results.
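For the verification steps, `dbutils.fs.ls` can confirm that the Parquet output exists, and a quick row count checks that the table was created. The path and table name below match the assumed examples above, not confirmed project values.

```python
# List the Parquet output directory to confirm the first notebook's write succeeded.
display(dbutils.fs.ls("/FileStore/tables/basketball_clean.parquet"))

# Sanity-check the table created by the second notebook.
spark.sql("SELECT COUNT(*) AS row_count FROM basketball_players").show()
```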