This is an optional honors project for the IBM Exploratory Data Analysis for Machine Learning course on Coursera. The aim is to demonstrate the skills and knowledge gained from the course, such as data cleaning, feature engineering, exploratory data visualization, and hypothesis testing. The assignment requirements are:
- Select a dataset that you are curious about.
- Provide a brief description of the data set and a summary of its attributes.
- Provide an initial plan for data exploration.
- Describe actions taken for data cleaning and feature engineering.
- Provide key findings and insights that synthesize the results of the exploratory data analysis in an insightful and actionable manner.
- Formulate at least 3 hypotheses about this data.
- Conduct a formal significance test for one of the hypotheses and discuss the results.
- Provide suggestions for next steps in analyzing this data.
- Include a paragraph that summarizes the quality of this data set and a request for additional data if needed.
This project uses the Kaggle dataset High School Alcoholism and Academic Performance.
The goal is to explore what causes teenage alcoholism, how it affects academic performance, and which factors could reduce it.
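A formal significance test for one hypothesis about alcohol use and grades could be sketched as below. This is a minimal illustration, not necessarily the test used in `src/exploratory_data_analysis.py`: it uses a two-sided permutation test on the difference in group means, built only on the Python standard library. The function name `permutation_test`, the variable names, and the grade values are all assumptions made up for this example and are not taken from the Kaggle dataset.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up grades purely for illustration, NOT values from the dataset.
low_alcohol_grades = [14, 12, 15, 11, 13, 16, 12, 14]
high_alcohol_grades = [10, 11, 9, 12, 8, 10, 11, 9]

observed_diff, p_value = permutation_test(low_alcohol_grades, high_alcohol_grades)
print(f"difference in means: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

With real data, the two lists would be replaced by the grade column split on an alcohol-consumption threshold of your choosing. A small p-value would suggest that the observed difference is unlikely under the null hypothesis of no difference between groups, but it would not by itself show that alcohol use causes lower grades.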
- Download the Kaggle dataset and extract its contents into `./data`.
- Create and activate a virtual environment following this tutorial: https://docs.python.org/3/tutorial/venv.html
- Install the requirements:
  - Windows: `pip install -r .\requirements.txt`
  - macOS/Linux: `pip install -r ./requirements.txt`
- Run the script:
  - Windows: `python .\src\exploratory_data_analysis.py`
  - macOS/Linux: `python src/exploratory_data_analysis.py`