-
Notifications
You must be signed in to change notification settings - Fork 3
/
Objective_1_MLR_Final.Rmd
282 lines (203 loc) · 10.6 KB
/
Objective_1_MLR_Final.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
---
title: "Objective 1 MLR Final Version"
author: "Greg Madden"
date: '2022-05-05'
output: pdf_document
---
***Group 2 Members:***
Gregory Madden
Christina Kuang
Chi Do
Trey Hamilton
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, tidy = TRUE, cache =TRUE)
library(multcomp)
library(lawstat)
library(leaps)
library(tidyverse)
library(faraway)
library(Stat2Data)
#turning off scientific notation
options(scipen = 99999999)
```
***Our dataset contains characteristics of nursing homes in New Mexico, with 52 observations and 7 variables. Each row represents an individual nursing home.***
***Description of Dataset (provided by Stat2Data Package)***
***Variables/Columns: ***
Beds Number of beds in the nursing home
InPatientDays Annual medical in-patient days (in hundreds)
AllPatientDays Annual total patient days (in hundreds)
PatientRevenue Annual patient care revenue (in hundreds of dollars)
NurseSalaries Annual nursing salaries (in hundreds of dollars)
FacilitiesExpend Annual facilities expenditure (in hundreds of dollars)
Rural 1=rural or 0=non-rural
```{r}
#reading in data
data(Nursing)
Data<-Nursing
Data$Rural <- factor(Data$Rural)
levels(Data$Rural) <- c("Non-Rural", "Rural")
#checking categorical variable classifiers
contrasts(Data$Rural)
```
Glimpse of data overview:
```{r}
glimpse(Data)
```
## ***Question 1.*** What characteristics of nursing homes in New Mexico dictate annual nurse salaries at those institutions?
***Practical implications of a linear model for predicting cumulative annual nurse salaries for a given nursing home could be used by policymakers to rationally distribute subsidy funds to institutions that are expected to contribute the lowest salaries to a particular area.***
***Objective 1: Fit a multiple linear regression model with cumulative annual nurse salaries for individual nursing homes using the available financial characteristics for each institution. The goal is to develop a model using these available data to reliably predict institutions with low annual nursing salaries among the larger group of nursing homes across the state. ***
## Exploratory Data Analysis:
Based on the scatter plots and correlation table, it appears that Nurse Salaries has a moderate correlation with Beds, All Patient Days, and Patient Revenue. There also appears to be a strong linear relationship between Beds and AllPatientDays, Beds and Patient Revenue, In Patient Days and All Patient Days, In Patient Days and Patient Revenue, and All Patient Days and Patient Revenue.
```{r}
cor(Data[1:6])
```
```{r}
pairs(~ NurseSalaries + PatientRevenue + FacilitiesExpend + Beds + InPatientDays + AllPatientDays, data = Data, lower.panel = NULL)
```
Nursing home salaries by census in Annual total patient days (in hundreds) and Nursing home size (by number of beds). There appears to be a moderately strong correlation between annual total patient days and average nurse salary, with at least one apparent outlier in terms of high patient days and low salary, observation #26.
```{r}
Data %>%
ggplot(aes(x=AllPatientDays, y = NurseSalaries, size = PatientRevenue)) +
geom_point(alpha = 0.4) +
labs(x="Annual total patient days (hundreds)", y="Average Nurse Salary (hundreds $)", title = "Nursing Home Nurse Salaries") +
guides(size = guide_legend(title = "Annual patient \ncare revenue\n (hundreds $)"))
```
Boxplot demonstrating differences in Institutional nurse's salaries in New Mexico for Rural Areas compared with Non-Rural:
Based on the box plot, it appears there is a greater variability for non-rural nurse salary. The nurses in non-rural regions also have a higher median salary.
```{r}
Data$Rural <- factor(Data$Rural)
levels(Data$Rural) <- c("Non-Rural", "Rural")
Data %>%
ggplot(aes(x=Rural, y=NurseSalaries))+
geom_boxplot()+
labs(x="", y="Nurse Salary", title = "Nursing Home Nurse Salaries (hundreds $)")
```
Scatter plot of Patient Revenue versus Nurse Salaries:
The slopes are not parallel, which indicates there is an interaction effect between Patient Revenue and Nurse Salaries. Again seen is outlying observation #26.
```{r}
ggplot(Data,aes(x=PatientRevenue,y=NurseSalaries,color=Rural))+
geom_point()+
geom_smooth(method ="lm", se=FALSE)+
labs(x="PatientRevenue",y="NurseSalaries",title="PatientRevenue versus NurseSalaries")
```
## Carrying out initial automated search procedures:
Using backward selection to find the best model according to AIC. Start with the first-order model with all the predictors.
***The model selected is: NurseSalaries ~ PatientRevenue + FacilitiesExpend +
Rural***
```{r, echo=FALSE}
#intercept only
regnull <- lm(NurseSalaries~1,data=Data)
#full
regfull <- lm(NurseSalaries~.,data=Data)
#backward elimination
step(regfull, scope=list(lower=regnull,upper=regfull),direction="backward")
```
Reduced model summary
```{r}
reduced <- lm(NurseSalaries~PatientRevenue + FacilitiesExpend +
Rural,data=Data)
summary(reduced)
```
## Identifying outlying observations.
***Observation #26 is outlying.***
```{r}
## externally studentized residuals, t_i
ext.student.res<-rstudent(reduced)
## identify outliers with t_i##
## critical value using Bonferroni procedure
n<-dim(Data)[1]
#p is the number of predictors + 1 for the intercept
p<-4
crit<-qt(1-0.05/(2*n), n-1-p)
## identify
ext.student.res[abs(ext.student.res)>crit]
```
## Identifying observations that have high leverage.
Calculating the leverage values $h_{ii}$ below and identifying ones that are >2*p/n.
***Observations 26 and 31 have high leverage. ***
```{r}
lev <-lm.influence(reduced)$hat
lev[lev>2*p/n]
```
***Observation # 26 is both outlying in the predictors and appears to have high leverage.*** Unfortunately, this dataset nor the primary reference [Smith et al. "A Comparison of Financial Performance, Organizational Characteristics, and Management Strategy Among Rural and Urban Nursing Facilities," Journal of Rural Health, 1992, pp 27-40.] do not provide identifying information for individual Nursing Home facilities, so we cannot make specific conclusions about the facility for observation #26. What we can say about observation 26 is that it is a relatively large facility with 221 beds, especially among other rural facilities. For example, the number of beds for this facility is 79% higher than the next largest rural nursing home (123 beds). Also, despite very high patient revenue and Patient census, the nurses salary is the lowest of the entire dataset. Therefore, we suspect this facility may not be comparable with other institutions, perhaps owing to it's unique combination of large facility and rural classification. Since our primary objective is to identify low nursing salaries particularly in rural areas, we proposed re-running the regression while excluding observation #26 on the basis that is is an extraordinarily large (>200 beds) rural facility. Future predictions for such institutions will not be made using this model unless rural facilities are under 200 beds. Separate policy considerations will then be made for rare large/rural institutions.
```{r}
Data <- Data[-26,]
```
Interestingly, after re-running the EDA plots, we noticed that the slope and the intercept of the relationship between PatientRevenue and NurseSalaries no longer appear to change depending on rural vs. non-rural status. This interaction effect appears gone after excluding observation #26.
```{r}
Data %>%
ggplot(aes(x=PatientRevenue,y=NurseSalaries,color=Rural))+
geom_point()+
geom_smooth(method ="lm", se=FALSE)+
labs(x="PatientRevenue",y="NurseSalaries",title="PatientRevenue versus NurseSalaries")
```
Next, we repeated the automated search procedures with backward selection again after outlier observation #26 was removed.
***The model selected is: NurseSalaries ~ InPatientDays + AllPatientDays + FacilitiesExpend ***
```{r}
#intercept only
regnull <- lm(NurseSalaries~1,data=Data)
#full
regfull <- lm(NurseSalaries~.,data=Data)
#backward elimination
step(regfull, scope=list(lower=regnull,upper=regfull),direction="backward")
```
### Final Reduced model summary
```{r}
reduced <- lm(NurseSalaries~InPatientDays + AllPatientDays + FacilitiesExpend,data=Data)
summary(reduced)
```
***All VIFs below 4 so multicollinearity does not appear to be an issue. ***
```{r}
#requires package faraway
vif(reduced)
```
## Checking linear regression assumptions:
***Assumptions 1/2 appear to be generally met: Mean of Errors = 0, constant variance. ***
```{r}
#storing fitted y residuals
yhat <- reduced$fitted.values
res <- reduced$residuals
#add to data frame
Data <-data.frame(Data,yhat,res)
#residual plot
Data %>%
ggplot(aes(x=yhat,y=res)) +
geom_point() +
geom_hline(yintercept=0,color="red")+
labs(x="fitted y", y = "residuals", title="residual plot")
```
***ACF Plot: no autocorrelation seen.***
```{r}
acf(res,main="ACF Plot of Residuals")
```
***Errors are fairly normally distributed.***
```{r}
qqnorm(res)
qqline(res,col="red")
```
Using boxcox method, we see that 1 lies just within the 95% CI for lambda so we did not transform the y variable.
```{r}
boxcox(reduced)
```
Since we will be using this model to predict nursing salaries for new data, we are interested in the $R^2_{prediction}$.
Based on this value, the final model might be able to explain 66.15% of the variability in the new observations (as long as they are not rural facilities >200 beds). The R2 is 0.7176. Both values are fairly close to each other, so overfitting is not a major concern.
```{r}
#creating function to calculate PRESS statistic
PRESS <- function(model) {
i <- residuals(model)/(1 - lm.influence(model)$hat)
sum(i^2)
}
#PRESS(reduced)
## Find SST
anova_result<-anova(reduced)
SST<-sum(anova_result$"Sum Sq")
## R2 pred
Rsq_pred <-1-PRESS(reduced)/SST
Rsq_pred
```
## So, our final regression equation is:
$NursingSalary=366.08897-6.77821*InpatientDays+15.73249*AllPatientDays+0.15552*FacilitiesExpenditure$
***Where bed size for rural institutions is <200, NursingSalary = Estimated Annual nursing salaries (in hundreds of dollars), InpatientDays represents annual medical in-patient days (in hundreds), AllPatientDays represents annual total in-patient days (in hundreds), and FacilitiesExpenditure = hundreds of $.***
## \underline{Objective 1 Conclusions:}
1) Patient census parameters (both total inpatient days as well as medical inpatient days) and total facilities expenditures appear to be the most important factors in determining nursing salaries among the predictors we looked at (others included PatientRevenue, Rural vs. Non-Rural).
2) Based on the $R^2_{prediction}$, our final model summarized above might be able to explain 66% of the variability in nursing salaries of future nursing homes, as long as they are not rural facilities >200 beds.