This is the Practical Machine Learning Project Repository. For a web-based version of the project, see this RPubs page. The basic outline of my solution is:
- Clean the training and testing datasets
- Remove non-relevant variables
- Remove variables with more than 95% NAs
- Do a Principal Component Analysis (PCA) on the clean training dataset in order to reduce the number of variables
- Select the leading PCs whose cumulative proportion of explained variance reaches 95%
- Predict the new PCs on the training dataset
- Use the predictions to build a Random Forest model (parallelization is required)
- Compute a confusion matrix to assess the performance on the training dataset
- Predict the class outcome on the testing set, first calculating the PCs and then using the Random Forest model
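
Three of the steps above (the 95% NA filter, the 95% cumulative-variance cut-off, and the confusion matrix) can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in this repository: the function names `drop_sparse_columns`, `n_components_for_variance`, and `confusion_matrix` are hypothetical, missing values are assumed to be represented as `None`, and the per-component variances are assumed to come from a PCA that has already been fitted elsewhere. The PCA fit and the Random Forest training themselves are left to a library and are not shown.

```python
def drop_sparse_columns(rows, max_na_fraction=0.95):
    """Return the column names to keep.

    `rows` is a list of dicts sharing the same keys. A column is dropped
    when more than `max_na_fraction` of its values are None.
    """
    n = len(rows)
    kept = []
    for col in rows[0].keys():
        na_count = sum(1 for row in rows if row[col] is None)
        if na_count / n <= max_na_fraction:
            kept.append(col)
    return kept


def n_components_for_variance(variances, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    total = sum(variances)
    cumulative = 0.0
    # Components are taken in order of decreasing variance.
    for k, variance in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(variances)


def confusion_matrix(actual, predicted):
    """Return nested dicts: matrix[actual_label][predicted_label] = count."""
    labels = sorted(set(actual) | set(predicted))
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix
```

The diagonal of the returned matrix holds the correctly classified cases, so accuracy is the sum of the diagonal divided by the number of observations. The same component count chosen on the training data would also be applied when projecting the testing set.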