Welcome to the MATE-T580 course portal! This repository will serve as the home page for the course. Lectures, assignments, datasets for analysis, and solutions to assignments & quizzes will be posted here. Your first order of business should be to carefully read this page to learn about the course's content, structure, and week-by-week plan of study.
MATE-T580 is designed as an introduction to data science for students with no prior knowledge of the subject. It covers all steps of the data science pipeline: from getting and cleaning data, to performing exploratory analysis and producing meaningful visualizations, all the way to building predictive models. Students will be introduced to some of the most powerful machine learning algorithms used in industry, including artificial neural networks (ANN) and gradient boosting machines (GBM). Throughout the course, the emphasis will be on the practical aspects of data science (as opposed to the theoretical/statistical aspects). Students will work with real datasets and will do a lot of coding in R.
Some knowledge of R is required before taking the course. To start with, I suggest completing the following courses on DataCamp:
- Introduction to R (4 hours, Free)
- Intermediate R (6 hours, Premium content)
And this interactive course within R:
Note that Drexel students who are registered for the course will receive a special invitation for accessing premium content on DataCamp to complete the prerequisites. When signing up for DataCamp, use your Drexel shortname email so that your account can be linked to your Drexel identity.
At its core, Data Science is a combination of coding and performing statistical analysis. As the focus of the course is on the R implementation of various Data Science techniques, students learn best through hands-on coding projects. As such, the course is structured to ensure that students complete ~100 hours of coding over the 10-week duration of the course. This is achieved through the following components:
- Classroom-lab hybrid: Classroom time will be run as a lab rather than a traditional lecture. Students are required to always have a laptop with R and RStudio installed and ready to use in class. The instructor will present certain concepts and the students will follow by working on related coding tasks. Within this environment, I expect that we'll have lots of stimulating discussions and opportunities to collaborate.
- Weekly DataCamp assignments (~4 hours each): A blend of mini video lectures and guided hands-on exercises. The topics closely follow what we learn in class, so they reinforce in-class learning and provide an additional avenue for hands-on practice in R.
- Bi-weekly assignments: More challenging assignments using real datasets. These assignments test the students' ability to apply their subject knowledge to a problem with little guidance (relative to the DataCamp exercises and the in-class activities). The assignments present an opportunity to do work at the level expected in an actual professional setting.
- Capstone project: The capstone project provides a chance for students to perform analysis on a dataset of their own choosing. The analysis should include an advanced topic that was not covered in class. A list of potential topics is included here, but students are welcome to come up with a topic not included in the list (subject to approval of the course instructor). The deliverable is a 10-minute presentation followed by a Q&A session during the last week of the course.
Week, Lesson | Overview | DataCamp Assignment |
---|---|---|
Prereq | Prior to our very first meeting, you are expected to install R and RStudio on a laptop that you will always have with you in class, and to complete some interactive learning exercises on DataCamp. | |
(4/2) 1 | We'll learn how to manipulate data frames using the dplyr package. dplyr offers many advantages over equivalent functions in base R, including performance, compactness of code, and readability of code. | |
(4/9) 2 | We'll continue building on the work done in the previous week. However, we'll start to pay more attention to the structure of the data and what it means to have 'tidy data'. We'll also discuss how to deal with missing values. | |
(4/16) 3 | We'll be introduced to the different R packages used to extract data from csv files, webpages, and text files. We'll also learn how to use regular expressions to parse text. | |
(4/23) 4 | We'll learn about producing high-quality and meaningful visualizations in R using the ggplot2 package. The focus will be on scatter plots, line plots, and bar plots. | |
(4/30) 5 | We'll be introduced to important concepts in statistical learning, including the bias-variance trade-off, feature selection, regularization, and cross-validation. | |
(5/7) 6, 7 | We'll focus on linear and logistic regression as the two baseline methods used in regression and classification problems, respectively. | |
(5/14) 8 | We'll focus on tree-based machine learning methods to solve classification and regression problems. We'll learn how to train models using random forest and gbm (gradient boosting machine). | |
(5/21) 9 | We'll be introduced to another very popular machine learning algorithm: the artificial neural network (ANN). We'll learn how to train ANNs in R using the nnet package. | |
(5/28) 10 | We'll be introduced to text mining and we'll learn how to perform machine learning on unstructured data (text). | |
(6/4) 11 | We'll conclude with a survey of advanced topics not discussed in this course, and with advice on how to continue your journey to become a data scientist. | No DataCamp assignment final week |
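
As a rough illustration of the compactness and readability point made for lesson 1, the sketch below computes the same summary twice, once in base R and once with dplyr. It assumes the dplyr package is installed and uses R's built-in mtcars dataset, which is an illustrative choice and not necessarily one of the course datasets. It does not attempt to demonstrate the performance claim.

```r
# Assumption: dplyr is installed (install.packages("dplyr")).
library(dplyr)

# Goal: average miles per gallon by number of cylinders,
# restricted to cars with more than 100 horsepower.

# Base R: subset with logical indexing, then aggregate by group.
powerful <- mtcars[mtcars$hp > 100, ]
base_result <- aggregate(mpg ~ cyl, data = powerful, FUN = mean)

# dplyr: the same steps written as a pipeline of verbs.
dplyr_result <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

# Both approaches should produce the same group means.
print(all.equal(base_result$mpg, dplyr_result$mpg))
```

Each dplyr verb names one step of the analysis, which is the readability argument in a nutshell; whether that outweighs base R in a given script is a judgment call.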
Students will be assessed according to the following scheme:
Component | Share |
---|---|
DataCamp assignments | 20% |
In-class short quizzes | 20% |
Assignments | 40% |
Capstone project | 20% |
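
To see how the shares above combine, here is a minimal sketch in R. The weights come from the table; the component scores are made-up numbers for a hypothetical student, and the exact way scores are aggregated within each component is an assumption.

```r
# Weights taken from the assessment table (they must sum to 1).
weights <- c(datacamp = 0.20, quizzes = 0.20, assignments = 0.40, capstone = 0.20)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical component scores on a 0-100 scale (illustrative only).
scores <- c(datacamp = 95, quizzes = 80, assignments = 85, capstone = 90)

# Final grade is the weighted sum, matching components by name.
final_grade <- sum(weights * scores[names(weights)])
print(final_grade)  # 87
```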
Even before our first meeting, you have an opportunity to prepare for the course. Here's what you can do to get started:
- Install R on your computer
- Install RStudio on your computer
- Open RStudio and install the swirl package by typing the following in the console:
install.packages("swirl")
- Load swirl by typing the following in the console:
library(swirl)
- Begin your learning journey by typing the following in the console:
swirl()
- R for Data Science by Grolemund and Wickham
- An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani
- Text Mining with R by Julia Silge and David Robinson
- TensorFlow Playground: use this simulator to experiment with building and tuning a neural network to solve some prototype problems
I'm an Assistant Professor of Physics at Drexel University. I have an interest in Data Science, Machine Learning, and Text Mining, and prior consulting experience in these fields. I am also an avid competitor on Kaggle with the rank of Competitions Master.
The lessons of this course are for the most part original material that I developed specifically for the first offering of MATE-T580 at Drexel University. There is one notable exception: lesson 5 on Introduction to Statistical Learning borrows heavily from the book of the same name by James, Witten, Hastie, and Tibshirani.
The datasets used in the various demos and assignments were collected from different sources on the Internet and placed in a single folder for students' convenience. The datasets and scripts are used solely for educational purposes.
Educators interested in obtaining the .Rmd files used to create the lessons should contact me with their request.
- To avoid any confusion over the term "University", here's what I believe a University is (the words of Robert M. Pirsig).