
5 analytical tasks have been completed using VAT validated gower-PAM clustering, Correspondence Analysis (CA), Asym-Biplot, Multiple Correspondence Analysis (MCA), Chi-Squared test, Regression, and predictive classification models with KNN, SVM, and Random Forest.


KAR-NG/Human-Resource-Data-Mining



Summary

This project applies a series of data mining techniques, including clustering, principal component methods, regression, and classification algorithms, to uncover hidden trends in the dataset. Numerous visualisations support each of these methods. In total, 5 analytical tasks were completed using VAT-validated Gower-PAM clustering, correspondence analysis (CA), asymmetric biplot, multiple correspondence analysis (MCA), Chi-squared test, regression, and predictive classification models with KNN, SVM, and Random Forest.
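Gower-PAM clustering relies on the Gower dissimilarity, which handles numeric and categorical columns together. The following is a minimal sketch of that distance in plain Python, not the project's own code: the function name `gower_distance`, the column names, and the three employee records are all hypothetical and chosen only for illustration.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two records given as dicts.

    numeric_ranges maps each numeric column to its (max - min) range over the
    dataset; every other column is treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            rng = numeric_ranges[key]
            # Numeric column: absolute difference scaled to [0, 1] by the range.
            total += abs(a[key] - b[key]) / rng if rng else 0.0
        else:
            # Categorical column: 0 if the values match, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    # Average the per-column dissimilarities.
    return total / len(a)


# Hypothetical records, not taken from the HR dataset.
employees = [
    {"age": 30, "salary": 50000, "department": "IT", "sex": "F"},
    {"age": 50, "salary": 70000, "department": "IT", "sex": "M"},
    {"age": 40, "salary": 90000, "department": "Sales", "sex": "F"},
]
numeric_cols = ["age", "salary"]
ranges = {
    c: max(e[c] for e in employees) - min(e[c] for e in employees)
    for c in numeric_cols
}

d01 = gower_distance(employees[0], employees[1], ranges)
d12 = gower_distance(employees[1], employees[2], ranges)
print(d01, d12)
```

A matrix of such pairwise distances is what a PAM (k-medoids) routine would take as input in place of Euclidean distances.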

The results show no statistical evidence (p-value = 0.249) that some managers are better or worse than others at training their employees. Instead, most managers train their subordinates to the “fully meet” standard. The company does actively hire employees from diverse backgrounds, and its overall diversity level is good, at 76%. 40% of the employees are female and 40% are from diverse backgrounds. The company recruits from 8 sources: the diversity job fair is the best choice for hiring an employee from a diverse background, and employee referral is the worst (Chi-squared test for independence: X-squared = 21.989, df = 5, p-value = 0.0005).
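As a rough illustration of how a Chi-squared test for independence arrives at its statistic and degrees of freedom, here is a small sketch in plain Python. The contingency table is hypothetical (it is not the project's recruitment data), and the helper name `chi_squared_independence` is an assumption made for this example. The p-value step is left out because it needs a chi-squared distribution function that the standard library does not provide.

```python
def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table.

    table is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows are recruitment sources,
# columns are (diverse background, not diverse background).
observed = [
    [20, 10],
    [10, 20],
    [15, 15],
]
stat, df = chi_squared_independence(observed)
print(stat, df)
```

The larger the statistic relative to its degrees of freedom, the stronger the evidence that recruitment source and diversity status are not independent.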

Inferential regression was applied to study the relationship between salary and factors that could relate to unequal pay, such as age, years of working, race, and gender. The result shows that the company pays its employees equally, supported by extensive visualisation and p-values higher than 0.05. Finally, the dataset provides sufficient data to train a model with strong predictive power. K-Nearest Neighbor (KNN), polynomial-kernel Support Vector Machine (SVM), and Random Forest were selected as modeling candidates. The output shows that a Random Forest model with a probability cut-off of 0.405 is the best at predicting who will leave the company. It achieves an overall accuracy of 95.7%, a sensitivity of 93.5% (the metric of greatest interest), and a specificity of 96.7%.
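To make the role of the probability cut-off concrete, the sketch below computes accuracy, sensitivity, and specificity from predicted probabilities at a given threshold. It is an illustration under stated assumptions, not the project's model: the labels and probabilities are hypothetical, the helper name `classification_metrics` is invented here, and treating a probability equal to the cut-off as a positive prediction is also an assumption.

```python
def classification_metrics(y_true, probs, cutoff=0.405):
    """Accuracy, sensitivity and specificity at a probability cut-off.

    y_true holds 1 for employees who left and 0 for those who stayed;
    probs holds the predicted probability of leaving.
    """
    tp = fp = tn = fn = 0
    for actual, p in zip(y_true, probs):
        # Assumption: probabilities at or above the cut-off count as "leaving".
        predicted = 1 if p >= cutoff else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of leavers correctly flagged
        "specificity": tn / (tn + fp),  # share of stayers correctly cleared
    }


# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.90, 0.45, 0.30, 0.42, 0.20, 0.10, 0.05, 0.35]

at_cutoff = classification_metrics(y_true, probs, cutoff=0.405)
at_half = classification_metrics(y_true, probs, cutoff=0.5)
print(at_cutoff)
print(at_half)
```

Lowering the cut-off below 0.5 flags more employees as likely leavers, which generally raises sensitivity at some cost to specificity. That trade-off is consistent with sensitivity being the metric of greatest interest here.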

Highlight

pic2_highlights

References

Clustering and dimensionality reduction techniques on the Berlin Airbnb data and the problem of mixed data n.d., viewed 15 May 2022, https://rstudio-pubs-static.s3.amazonaws.com/579984_6b9efbf84ee24f00985c29e24265d2ba.html

Forest picture in section 8.5.2, credit: Michael Thirnbeck 2010, https://www.flickr.com/photos/thirnbeck/4547405603

Kassambara, A 2017, Practical Guide To Principal Component Methods in R, Edition 1, sthda.com

Lovelytics 2020, HR Diversity Scorecard, viewed 15 May 2022, https://www.youtube.com/watch?v=oaLp5eBi6E8

Nancy Chelaru 2019, Factor analysis of mixed data, viewed 4 June 2022, https://rpubs.com/nchelaru/famd

Rich Huebner 2020, Human Resources Data Set, viewed 2 May 2022, https://www.kaggle.com/datasets/rhuebner/human-resources-data-set?resource=download

Rich Huebner 2021, Codebook - HR Dataset v14, viewed 3 May 2022, https://rpubs.com/rhuebner/hrd_cb_v14

Wicked Good Data - r 2016, Clustering mixed data types in R, viewed 8 May 2022, https://www.r-bloggers.com/2016/06/clustering-mixed-data-types-in-r/

Will Tracz 2021, HR Tech Is the Key: Here’s How to Get It Right, viewed 14 May 2022, https://hrdailyadvisor.blr.com/2021/08/03/hr-tech-is-the-key-heres-how-to-get-it-right/
