CS7641 Machine Learning Assignment 1

The base code comes from the template provided by Jonathan Tay in his repository at https://github.com/JonathanTay/CS-7641-assignment-1 and has been modified to use a different dataset.

Initial testing of the base code used his datasets, as shown below, before the code was modified to use the new dataset.

This file describes the structure of this assignment submission.

Spring 2018

This is the code for Assignment 1 for the OMSCS CS7641 Machine Learning course taught in the Spring of 2018.

Requirements

The assignment code was originally written for Python 3.5.1; for this submission it is run under Python 3.6.0 on Microsoft Windows using the PyCharm IDE.

Hardware and Platforms

The code was processed on the following hardware, both running Windows 7 Professional 64-bit Edition:

  • Lenovo Thinkpad T460P with 32 GB RAM, SSD and an i7-6820HQ (4 cores, 8 threads)
  • Dell Precision T1650 with 16 GB RAM, SSD and an i7-3770 (4 cores, 8 threads)

The Dell desktop is slightly faster at computational processing by about 11% overall. The Lenovo laptop is my primary development environment and follows me everywhere.

Processor and memory access speeds seem to be the gating factors for processing time.

Python Environment

Lenovo Thinkpad

  • Python 3.6.0 for Windows x86-64, retrieved from https://www.python.org/downloads/windows/ in December 2016.
  • PyCharm 2017.3.3 Professional Edition IDE, retrieved in January 2018.
  • VirtualEnv used to isolate the environment from other projects
  • GitHub account used to host this code and data

Dell Precision

Note: I did not use the Anaconda build of Python; I used the python.org distribution.

Libraries

Library dependencies are:

  • scikit-learn 0.19.1
  • numpy 1.14.0
  • pandas 0.22.0
  • matplotlib 2.1.2
  • tables 3.4.2
  • scipy 1.0.0

Other libraries used are part of the Python standard library.
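
As a quick optional check, the following sketch prints the installed versions of the pinned dependencies so they can be compared against the list above (a convenience sketch only, not part of the submission):

    # Convenience sketch: print installed versions of the pinned dependencies.
    import sklearn, numpy, pandas, matplotlib, tables, scipy

    for module in (sklearn, numpy, pandas, matplotlib, tables, scipy):
        print(module.__name__, module.__version__)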

Contents

The main folder contains the following files:

  1. ./jtay-data (adult.*, madelon_.*) -> The original datasets from Jonathan Tay's template, as downloaded from the UCI Machine Learning Repository
  2. ./jmm-data (parkinsons.data, parkinsons_updrs.data) -> The replacement datasets used for this assignment, as downloaded from the UCI Machine Learning Repository
  3. datasets.hdf -> A pre-processed/cleaned-up copy of the datasets. This file is created by the parse-xxx-data.py code. Note: Migrate datasets.hdf manually to the repository root for processing. (A loading sketch follows this list.)
  4. "parse-xxx-data.py" -> This Python script pre-processes the original UCI ML repo files into a cleaner form for the experiments
  5. "xxx-analysis.pdf" -> The analysis for this assignment
  6. helpers.py -> A collection of helper functions used for this assignment
  7. ANN.py -> Code for the Neural Network experiments
  8. Boosting.py -> Code for the Boosted Tree experiments
  9. "Decision Tree.py" -> Code for the Decision Tree experiments
  10. KNN.py -> Code for the K-nearest Neighbours experiments
  11. SVM.py -> Code for the Support Vector Machine (SVM) experiments
  12. plotter.py -> Code to plot the learning and validation curves in the report
  13. README.txt -> This file
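
The cleaned datasets in datasets.hdf can be inspected and loaded with pandas. A minimal sketch, assuming the file sits in the repository root; the key names depend on what parse-xxx-data.py wrote and are not hard-coded here:

    # Sketch only: list whatever keys parse-xxx-data.py stored and load the first one.
    import pandas as pd

    with pd.HDFStore('datasets.hdf', mode='r') as store:
        keys = store.keys()      # dataset names written by the parse script
        print(keys)
        df = store[keys[0]]      # load one dataset as a DataFrame

    print(df.shape)
    print(df.head())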

Supplemental Content

Outputs

There is also a subfolder called "output". This folder contains the experimental results.

Here, I use DT/ANN/BT/KNN/SVM_Lin/SVM_RBF to refer to decision trees, artificial neural networks, boosted trees, K-nearest neighbours, linear and RBF kernel SVMs respectively. A suffix of _OF indicates a deliberately "overfitted" version of the model where regularisation is turned off.

The dataset labels are adult/madelon, referring to the two datasets used (the UCI Adult dataset and the UCI Madelon dataset).

There are 83 files in this folder. They come in the following types:

  1. <algorithm>_<dataset>_reg.csv -> The validation curve tests for <algorithm> on <dataset>
  2. <algorithm>_<dataset>_LC_train.csv -> Table of # of examples vs. CV training accuracy (for 5 folds) for <algorithm> on <dataset>. For learning curves. (A generation sketch follows this list.)
  3. <algorithm>_<dataset>_LC_test.csv -> Table of # of examples vs. CV testing accuracy (for 5 folds) for <algorithm> on <dataset>. For learning curves.
  4. <algorithm>_<dataset>_timing.csv -> Table of fraction of training set vs. training and evaluation times. If the full training set is of size T and a fraction f is used for training, then the evaluation set is of size (T - fT) = (1 - f)T.
  5. ITER_base_<algorithm>_<dataset>.csv -> Table of results for learning curves based on number of iterations/epochs.
  6. ITERtestSET_<algorithm>_<dataset>.csv -> Table showing training and test set accuracy as the number of iterations/epochs is varied. NOT USED in report.
  7. "test results.csv" -> Table showing the optimal hyper-parameters chosen, as well as the final accuracy on the held-out test set.
  8. "test results Madelon No feature selection.csv" -> Table showing the optimal hyper-parameters chosen, as well as the final accuracy on the held-out test set on Madelon with feature selection turned off. (Feature selection can be turned off by removing the "Cull" stages in the experiment pipelines, the pipeM objects.) Note that these results were generated before random seeds were fixed throughout the code, so any attempt to regenerate them will give slightly different numbers due to different random seeds.
