MATE-T580: Practical Data Science using R
(A Drexel University¹ course)

Welcome to the MATE-T580 course portal! This repository serves as the home page for the course. Lectures, assignments, datasets for analysis, and solutions to assignments and quizzes will be posted here. Your first order of business should be to read this page carefully to learn about the course's content, structure, and week-by-week plan of study.

Overview

MATE-T580 is designed as an introduction to data science for students with no prior knowledge of the subject. The course covers all steps of the data science pipeline, from getting and cleaning data, through exploratory analysis and meaningful visualizations, to building predictive models. Students will be introduced to some of the most powerful machine learning algorithms used in industry, including artificial neural networks (ANN) and gradient boosting machines (GBM). Throughout the course, the emphasis will be on the practical aspects of data science (as opposed to the theoretical/statistical aspects). Students will work with real datasets and will do a lot of coding in R.
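
To give a feel for what these pipeline steps look like in R, here is a minimal sketch that uses only base R and the built-in airquality dataset. The dataset and the specific choices (dropping incomplete rows, regressing ozone on temperature) are assumptions made for illustration and are not taken from the course material.

```r
# Minimal pipeline sketch on the built-in airquality dataset (base R only).

# 1. Get the data: airquality ships with R, so no download is needed.
aq <- airquality

# 2. Clean: keep only rows with no missing values in any column.
aq_clean <- aq[complete.cases(aq), ]

# 3. Explore: mean ozone level for each month.
ozone_by_month <- aggregate(Ozone ~ Month, data = aq_clean, FUN = mean)
print(ozone_by_month)

# 4. Visualize: ozone against temperature.
plot(aq_clean$Temp, aq_clean$Ozone,
     xlab = "Temperature (F)", ylab = "Ozone (ppb)")

# 5. Model: a simple linear regression of ozone on temperature.
fit <- lm(Ozone ~ Temp, data = aq_clean)
print(coef(fit))
```

Each numbered step corresponds loosely to one stage of the pipeline; the course itself goes well beyond this with dedicated packages for each stage.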

Prerequisites

Some knowledge of R is required before taking the course. To start with, I suggest completing the following courses on DataCamp:

Introduction to R (4 hours, Free)
Intermediate R (6 hours, Premium content)

And this interactive course within R:

swirl

Note that Drexel students who are registered for the course will receive a special invitation for accessing premium content on DataCamp to complete the prerequisites. When signing up for DataCamp, use your Drexel shortname email so that your account can be linked to your Drexel identity.

Course structure

At its core, Data Science is a combination of coding and statistical analysis. Because the focus of the course is on the R implementation of various Data Science techniques, you will learn best through hands-on coding projects. The course is therefore structured to ensure that students complete ~100 hours of coding over its 10-week duration. This is achieved through the following components:

Classroom-lab hybrid: Classroom time will be run as a lab rather than a traditional lecture. Students are required to always bring a laptop with R and RStudio installed and ready to use in class. The instructor will present certain concepts and the students will follow along by working on related coding tasks. Within this environment, I expect that we'll have lots of stimulating discussions and opportunities to collaborate.
Weekly DataCamp assignments (~4 hours each): A blend of mini video lectures and guided hands-on exercises. The topics closely follow what we learn in class, so they reinforce in-class learning and provide an additional avenue for hands-on practice in R.
Bi-weekly assignments: More challenging assignments using real datasets. These assignments test the students' ability to apply their subject knowledge to a problem with little guidance (relative to the DataCamp exercises and the in-class activities). The assignments present an opportunity to do work at the level expected in an actual professional setting.
Capstone project: The capstone project provides a chance for students to perform analysis on a dataset of their own choosing. The analysis should include an advanced topic that was not covered in class. A list of potential topics is included here, but students are welcome to come up with a topic not included in the list (subject to approval of the course instructor). The deliverable is a 10-minute presentation followed by a Q&A session during the last week of the course.

Course Plan

Week, Lesson	Overview	DataCamp Assignment
Prereq	Prior to our very first meeting, you are expected to install R and RStudio on a laptop that you will always have with you in class, and to complete some interactive learning exercises on DataCamp.	Introduction to R; Intermediate R; swirl
(4/2) 1	We'll learn how to manipulate data frames using the dplyr package. dplyr offers many advantages over equivalent functions in base R, including performance, compactness of code, and readability of code.	Data Manipulation with dplyr
(4/9) 2	We'll continue building on the work done in the previous week. However, we'll start to pay more attention to the structure of the data and what it means to have 'tidy data'. We'll also discuss how to deal with missing values.	Cleaning Data in R
(4/16) 3	We'll get introduced to the different R packages used to extract data from csv files, webpages, and text files. We'll also learn how to use regular expressions to parse text.	Importing data from flat files with utils readr & data.table; Importing data from the web (Part 1); Importing data from the web (Part 2)
(4/23) 4	We'll learn about producing high-quality and meaningful visualizations in R using the ggplot2 package. The focus will be on scatter plots, line plots, and bar plots.	Data Visualization with ggplot2 (Part 1); Coordinates and Facets
(4/30) 5	We'll get introduced to important concepts in statistical learning, including the bias-variance trade-off, feature selection, regularization, and cross-validation.	Correlation and Regression
(5/7) 6, 7	We'll focus on linear and logistic regression as the two baseline methods used in regression and classification problems, respectively.	Multiple Regression; Logistic Regression; Case study: Italian restaurants in NYC
(5/14) 8	We'll focus on tree-based machine learning methods for solving classification and regression problems. We'll learn how to train models using random forests and gradient boosting machines (GBM).	Machine Learning Toolbox
(5/21) 9	We'll get introduced to another very popular machine learning algorithm: the artificial neural network (ANN). We'll learn how to train ANNs in R using the nnet package.	Machine Learning with Tree-Based Models in R
(5/28) 10	We'll get introduced to text mining and we'll learn how to perform machine learning on unstructured data (text).	Text Mining: Bag of Words
(6/4) 11	We'll conclude with a survey of advanced topics not discussed in this course, and with advice on how to continue your journey to become a data scientist.	No DataCamp assignment in the final week
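
As a small taste of the first lesson, the sketch below chains a few dplyr verbs on the built-in mtcars dataset. It assumes the dplyr package is already installed; the dataset, the hp > 100 cutoff, and the names efficiency_by_cyl and kpl are illustrative choices and are not taken from the course lessons.

```r
library(dplyr)

# Summarise fuel efficiency by cylinder count for the more powerful cars.
efficiency_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                 # keep cars with more than 100 horsepower
  mutate(kpl = mpg * 0.4251) %>%       # convert miles per gallon to km per litre
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(
    n_cars   = n(),                    # number of cars in the group
    mean_kpl = mean(kpl)               # average efficiency in the group
  ) %>%
  arrange(desc(mean_kpl))              # most efficient group first

print(efficiency_by_cyl)
```

Each verb does one thing and the pipe reads top to bottom, which is the compactness and readability advantage mentioned for lesson 1.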

Grading criteria

Students will be assessed according to the following scheme:

Component	Share
DataCamp assignments	20%
In-class short quizzes	20%
Assignments	40%
Capstone project	20%
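
To see how the shares combine into a final grade, here is a short worked example in R. The weights come from the table above; the component scores are hypothetical, and treating the final grade as a plain weighted sum of scores on a 0-100 scale is an assumption, since the exact formula is not spelled out here.

```r
# Component weights from the grading table (they sum to 1).
weights <- c(datacamp = 0.20, quizzes = 0.20, assignments = 0.40, capstone = 0.20)

# Hypothetical component scores for one student, each on a 0-100 scale.
scores <- c(datacamp = 95, quizzes = 80, assignments = 85, capstone = 90)

# Final grade as a weighted sum; scores are matched to weights by name.
final_grade <- sum(weights * scores[names(weights)])
print(final_grade)
```

With these made-up scores the weighted sum is 19 + 16 + 34 + 18 = 87.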

Getting started

Even before our first meeting, you have an opportunity to prepare for the course. Here's what you can do to get started:

Install R on your computer
Install RStudio on your computer
Open RStudio and install the swirl package by typing the following in the console:

install.packages("swirl")

Load swirl by typing the following in the console:

library(swirl)

Begin your learning journey by typing the following in the console:

swirl()

Resources

R for Data Science by Grolemund and Wickham
An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani
Text Mining with R by Julia Silge and David Robinson
TensorFlow Playground: use this simulator to experiment with building and tuning a neural network to solve some prototype problems

About the Instructor

I'm an Assistant Professor of Physics at Drexel University. I have an interest in Data Science, Machine Learning, and Text Mining, and prior consulting experience in these fields. I am also an avid competitor on Kaggle with the rank of Competitions Master.

Credit

The lessons of this course are for the most part original material that I developed specifically for the first offering of MATE-T580 at Drexel University. There is one notable exception: lesson 5, on Introduction to Statistical Learning, borrows heavily from the book of the same name by James, Witten, Hastie, and Tibshirani.

The datasets used in the various demos and assignments were collected from different sources on the Internet and placed in a single folder for the convenience of students. The datasets and scripts are used solely for educational purposes.

Educators interested in obtaining the .Rmd files used to create the lessons should contact me with their request.

Footnotes

¹ To avoid any confusion over the term "University", here's what I believe a University is (the words of Robert M. Pirsig).
