### Introduction

Predicting in-hospital mortality or length of hospital stay can help clinicians make informed decisions and empower patients and their families to weigh their risks and preferences in clinical decision making. In this study, we used the 2012 National Inpatient Sample (NIS) data, publicly available through the Healthcare Cost and Utilization Project (HCUP), to train multiple machine learning algorithms to predict both in-hospital mortality and length of stay. The NIS contains discharge-level information and represents approximately 90% of hospitals across the U.S. ^{1}. The initial dataset contains 200,000 patients and 175 variables, with missing data in 199,998 observations and 120 features. It codes diagnoses with the International Classification of Diseases, Ninth Revision (ICD-9) and the Clinical Classifications Software (CCS) developed by the Agency for Healthcare Research and Quality (AHRQ).

### Methods

*Data pre-processing:*

We selected variables that are of critical clinical importance and that clinicians can use to predict in-hospital mortality and length of stay in a routine clinical setting. The outcome variables are “DIED” (in-hospital mortality) and “LOS” (length of stay). Since the values of these two variables typically become known around the same time point, we did not use “LOS” to predict “DIED” or vice versa. Among the selected variables, patient-level characteristics include comorbidity variables (“CM_AIDS”, “CM_ALCOHOL”, “CM_ANEMDEF”, “CM_ARTH”, “CM_BLDLOSS”, “CM_CHF”, “CM_CHRNLUNG”, “CM_COAG”, “CM_DEPRESS”, “CM_DM”, “CM_DMCX”, “CM_DRUG”, “CM_HTN_C”, “CM_HYPOTHY”, “CM_LIVER”, “CM_LYMPH”, “CM_LYTES”, “CM_METS”, “CM_NEURO”, “CM_OBESE”, “CM_PARA”, “CM_PERIVASC”, “CM_PSYCH”, “CM_PULMCIRC”, “CM_RENLFAIL”, “CM_TUMOR”, “CM_ULCER”, “CM_VALVE”, “CM_WGHTLOSS”), demographics (“AGE”, “FEMALE”, “RACE”), insurance type (“PAY1”), income (“ZIPINC_QRTL”), patient location (“PL_NCHS2006”), elective admission (“ELECTIVE”) and transfer into hospital (“TRAN_IN”). One hospital-level variable was included: hospital location (“HOSP_DIVISION”). Variables related to diagnoses or in-hospital events are the primary diagnosis with CCS classification (“DXCCS1”), number of chronic conditions (“NCHRONIC”), number of diagnoses on record (“NDX”), number of procedures on record (“NPR”), and major operation (“ORPROC”). We used the CCS classification for the primary diagnosis instead of ICD-9 codes because CCS collapses ICD categories into far fewer categories, which aids interpretation and preserves statistical power. Based on the CCS guide, we further collapsed the CCS classifications into 17 types: infectious and parasitic disease, neoplasms, metabolic, hematologic, mental illness, nervous system, circulatory, respiratory, digestive, genitourinary, pregnancy, skin, musculoskeletal, congenital, perinatal, injury and poisoning, and other.
As we aimed to generate our own prediction models in this study, we did not include existing risk prediction scores such as the All Patient Refined DRG (APRDRG).

Patients younger than 18 years old or with a missing age were excluded, as were any patients with missing values (NA, A, B, C) for the selected variables. The resulting analytic sample has 149,627 observations and 45 variables. When predicting LOS, since the outcome is highly skewed (ranging from 0 to over 300 days), we excluded the small proportion of patients with LOS above 30 days (1,080 patients, <1% of the total), leaving 148,547 patients in the analytic sample for LOS.

We coded all continuous variables as numeric and all categorical variables as factors. Each categorical variable was further expanded into numeric dummy variables, and we dropped one dummy for each variable with multiple categories to avoid multicollinearity.
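As a minimal sketch of this dummy-coding step (using base R's `model.matrix`, which drops the reference level of each factor automatically; the toy data frame and its values are hypothetical):

```r
# toy stand-in for the analytic sample (hypothetical values)
toy <- data.frame(
  AGE  = c(45, 67, 23),
  RACE = factor(c("1", "2", "3")),
  PAY1 = factor(c("1", "1", "2"))
)
# model.matrix expands each factor into numeric dummies and drops one
# reference level per factor, which avoids perfect multicollinearity
X <- model.matrix(~ AGE + RACE + PAY1, data = toy)[, -1]  # drop the intercept column
colnames(X)  # "AGE" "RACE2" "RACE3" "PAY12"
```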

We tested the bivariate association between each potential predictor and our outcomes, and excluded variables with a p-value above 0.2. Since our data are highly skewed, a nonparametric test, the Wilcoxon rank-sum (Mann-Whitney) test, was used for continuous variables against DIED, and the chi-square test was used for categorical variables against DIED. Similarly, nonparametric tests (Wilcoxon rank-sum and Kruskal-Wallis) were used to assess the association between categorical variables and LOS, and the Pearson correlation test was used for the association between continuous variables and LOS.
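The screening logic for the mortality outcome can be sketched as below (a simplified, self-contained illustration with simulated data; the helper name `screen_predictor` and the toy columns are not from the analysis code):

```r
set.seed(1)
# simulated stand-in for the analytic sample
toy <- data.frame(
  AGE    = rnorm(200, 60, 15),
  FEMALE = factor(sample(0:1, 200, replace = TRUE)),
  DIED   = factor(sample(0:1, 200, replace = TRUE, prob = c(0.9, 0.1)))
)
screen_predictor <- function(data, var, outcome = "DIED") {
  x <- data[[var]]
  y <- factor(data[[outcome]])
  p <- if (is.numeric(x)) {
    # nonparametric rank-sum test for a continuous predictor vs. binary outcome
    wilcox.test(x ~ y)$p.value
  } else {
    # chi-square test for a categorical predictor vs. binary outcome
    chisq.test(table(x, y))$p.value
  }
  p <= 0.2  # keep the variable only if it passes the 0.2 screen
}
keep <- sapply(c("AGE", "FEMALE"), screen_predictor, data = toy)
</imports-placeholder>
```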

*Machine learning methods:*

To predict in-hospital mortality, we applied four supervised machine-learning approaches: (1) logistic regression with forward selection, (2) balanced random forest, which also identifies variable importance for mortality, (3) bagging and (4) gradient boosting. To predict length of stay, we applied three machine-learning methods: (1) k-nearest neighbors (KNN) regression, (2) random forest and (3) decision tree.

*In-hospital mortality:* We split the data into a training and a test set in a 7:3 ratio using the Monte Carlo cross-validation (CV) subsampling method in the mlr package. When applying forward logistic regression, we computed the CV test area under the ROC curve (AUC) using three-fold CV, and the features retained by forward selection were used to predict mortality in the test set. In addition, we trained models on the training set using the random forest method (mtry set to the number of features: 500 trees and 40 variables at each split) and the bagging method (500 trees and six variables at each split). Moreover, we fit a gradient boosting model with the Bernoulli loss function, three-fold cross-validation and a maximum of 5,000 iterations; the best cross-validation iteration from the training set was used to predict mortality in the test set. We also generated 1D and 2D marginal plots to assess the effects of the top three variables and their two-way interactions. Test AUC, misclassification error, sensitivity and specificity were computed for all four methods; the out-of-bag error rate was estimated for random forest and bagging, and variable importance was summarized for random forest, bagging and boosting. We selected the best of the four models by comparing their test AUC, misclassification error, sensitivity and specificity.

*Length of stay:* To select the best k for KNN regression, we iterated over each k between 1 and 15 and compared the root mean square errors (RMSE). We used the k with the lowest RMSE to predict LOS in the test set, and we also computed the test R-square (R²). Second, we trained a random forest model with default parameters on the training set (500 trees and 26 variables at each split) and predicted LOS in the test set; test R² and RMSE were computed and variable importance was summarized. Third, we tuned a decision tree using the optimal cp value, chosen where the cross-validated relative error was smallest, then used the tuned tree to predict LOS in the test set and again computed R² and RMSE. We selected the best of these three models by comparing their test R² and test RMSE.
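The cp-based pruning step can be sketched with `rpart` as follows (a self-contained illustration on simulated data; the column names and sample are hypothetical stand-ins for the LOS analytic sample):

```r
library(rpart)
set.seed(2020)
# simulated stand-in for the LOS analytic sample
sim <- data.frame(LOS = rpois(500, 5), NPR = rpois(500, 2), NDX = rpois(500, 9))
train <- sample(1:500, 350)
# grow a deliberately deep regression tree, then prune at the cp with the
# smallest cross-validated relative error (the "xerror" column of cptable)
fit <- rpart(LOS ~ ., data = sim[train, ], method = "anova", cp = 0.001)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
pred    <- predict(pruned, newdata = sim[-train, ])
```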

### Results

Summary statistics of our variables are presented in Table 1 and Table 2. Table 1 shows summary statistics stratified by mortality along with bivariate test results. Table 2 shows the summary statistics of LOS within each variable category along with bivariate test results. We excluded “CM_AIDS”, “CM_ARTH”, and “CM_ULCER” from the four machine-learning models for mortality prediction because their p-values in Table 1 were above 0.2.

Figure 1 shows 1D and 2D marginal plots of the three most important variables (“NDX”, “NPR” and “AGE”) from gradient boosting. The mortality risk increased steadily with age. It also increased with the number of diagnoses, jumping to a much higher level when the number of diagnoses reached around 35. Mortality risk increased with the number of procedures but dropped abruptly once the number of procedures reached 20. The 2D plots show that mortality risk is highest among patients with older age, a larger number of procedures (below 20) and a larger number of diagnoses. Figure 2 shows that iteration 4987 has the lowest CV error among the 5,000 iterations, so we used this iteration number when predicting on the test data.

Figure 3 shows the top 10 most important variables identified by bagging, boosting and random forest. For bagging they are age, hospital location, number of diagnoses, diagnosis type, number of procedures, number of chronic conditions, patient location, patient income, insurance type and race. For boosting they are number of diagnoses, number of procedures, age, CM_LYTES, diagnosis type, number of chronic conditions, major operation, CM_COAG, CM_METS, and transfer into hospital. For random forest they are age, number of diagnoses, hospital location, number of chronic conditions, number of procedures, patient location, diagnosis type, income, insurance type and race. In addition, the features selected by forward regression are age, number of chronic conditions, number of diagnoses, number of procedures and major operation; these features were then used on the test set.

Figures 4 and 5 present the ROC curves and test metrics. Boosting has the highest AUC (0.883) and the lowest misclassification error (0.02). All four methods have very low sensitivity but very high specificity (Figure 5). Random forest and bagging have out-of-bag error rates of 2.12% and 2.11%, respectively. In summary, gradient boosting outperforms the other three models.

Figure 6 shows the selection of k for the KNN method for LOS prediction. The RMSE decreases as k increases from 1 to 15, and the curve flattens at around k = 12; we selected k = 15 as our optimal value. Figure 7 lists the 10 most important variables identified by the random forest method for predicting LOS: number of procedures, elective admission, age, diagnosis of infectious disease, number of diagnoses on record, genitourinary diagnosis, number of chronic conditions, diagnosis of injury, major operation, and CM_WGHTLOSS. Figure 8 shows that the optimal cp value for tuning the decision tree is 0.01, with four splits. Figure 9 presents the tuned decision tree; the variables actually used in tree construction were diagnosis of injury, number of diagnoses and number of procedures. Figures 10 and 11 show how we compared the three methods. As Figure 10 shows, the decision tree classified predicted LOS into five categories that were correlated with observed LOS, while the KNN method shows a completely random pattern with no correlation. The random forest method outperformed the other two, showing a strong correlation between predicted and observed LOS; it also has the largest test R² (0.3624) and the lowest test RMSE (3.279) of the three methods.

### Conclusions/discussion

The final models we selected to predict mortality and LOS are gradient boosting and random forest, respectively. The gradient boosting method outperforms the other models, showing a high AUC, low misclassification error and high specificity. However, all four mortality models in this study show very low sensitivity. If we are mainly interested in identifying patients who would not die during the hospital stay, our model may be good enough to apply to new observations; if identifying patients at high mortality risk is the main concern, our model may not be sufficiently accurate. The random forest model selected for LOS prediction shows a strong correlation between predicted and observed values. However, since we only included patients with LOS of no more than 30 days, it is questionable whether this model can be applied to patients who may stay longer.

### Code

load packages

```
library(tidyverse)
library(knitr)
library(tableone)
library(dplyr)
library(mlr)
library(forcats)
library(rpart)
library(rpart.plot)
library(randomForest)
library(Cairo)
library(parallel)
library(doSNOW)
library(ggplot2)
library(wesanderson)
library(FNN)
# library(pROC)
theme_set(theme_minimal())
```

tidy data

```
mydata <- read.csv("NIS2012-200K.csv", na.strings=c("", "NA"))
dim(mydata) # a total of 200,000 patients and 175 variables
dput(sort(names(mydata)))
df <- mydata %>% select(c("AGE", "CM_AIDS", "CM_ALCOHOL", "CM_ANEMDEF",
"CM_ARTH", "CM_BLDLOSS", "CM_CHF", "CM_CHRNLUNG", "CM_COAG",
"CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", "CM_HTN_C", "CM_HYPOTHY",
"CM_LIVER", "CM_LYMPH", "CM_LYTES", "CM_METS", "CM_NEURO", "CM_OBESE",
"CM_PARA", "CM_PERIVASC", "CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL",
"CM_TUMOR", "CM_ULCER", "CM_VALVE", "CM_WGHTLOSS", "DIED",
"DXCCS1", "ELECTIVE", "FEMALE", "HOSP_DIVISION", "LOS", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL")) %>%
filter(DIED !="A" & ELECTIVE != "A" & FEMALE != "C" & LOS != "C" & PAY1 != "A" & RACE != "A" & ZIPINC_QRTL != "A", AGE != "C") %>%
mutate_at(c("AGE", "LOS", "NCHRONIC", "NDX", "NPR"), as.numeric) %>%
mutate_at(c("CM_AIDS", "CM_ALCOHOL", "CM_ANEMDEF",
"CM_ARTH", "CM_BLDLOSS", "CM_CHF", "CM_CHRNLUNG", "CM_COAG",
"CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", "CM_HTN_C", "CM_HYPOTHY",
"CM_LIVER", "CM_LYMPH", "CM_LYTES", "CM_METS", "CM_NEURO", "CM_OBESE",
"CM_PARA", "CM_PERIVASC", "CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL",
"CM_TUMOR", "CM_ULCER", "CM_VALVE", "CM_WGHTLOSS", "DIED", "DXCCS1", "ELECTIVE", "FEMALE", "HOSP_DIVISION", "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL"), as.factor) %>%
filter(AGE >= 18) %>%
mutate(
# convert via as.character first: as.numeric() on a factor returns level codes, not the CCS values
DXCCS1 = as.numeric(as.character(DXCCS1)), diagnosis = case_when(
DXCCS1 %in% c(1:10) ~ "Infectious and parasitic disease",
DXCCS1 %in% c(11:47) ~ "Neoplasms",
DXCCS1 %in% c(48:58) ~ "Metabolic",
DXCCS1 %in% c(59:64) ~ "Hematologic",
DXCCS1 %in% c(650:663, 670) ~ "Mental illness",
DXCCS1 %in% c(76:95) ~ "Nervous system",
DXCCS1 %in% c(96:121) ~ "Circulatory",
DXCCS1 %in% c(122:134) ~ "Respiratory",
DXCCS1 %in% c(135:155) ~ "Digestive",
DXCCS1 %in% c(156:175) ~ "Genitourinary",
DXCCS1 %in% c(176:196) ~ "Pregnancy",
DXCCS1 %in% c(197:200) ~ "Skin",
DXCCS1 %in% c(201:212) ~ "Musculoskeletal",
DXCCS1 %in% c(213:217) ~ "Congenital",
DXCCS1 %in% c(218:224) ~ "Perinatal",
DXCCS1 %in% c(225:244) ~ "Injury and poisoning",
DXCCS1 %in% c(245:260) ~ "Other"
)) %>% select(-"DXCCS1") %>% mutate(diagnosis = as.factor(diagnosis))
dim(df)
# [1] 155147 45
# remove na
df <- df %>% filter(complete.cases(.))
dim(df)
# [1] 149627 45
# write out the df
write.csv(df, "mydata.csv", row.names = FALSE)
```

make table one

```
# read in dataset
mydata <- read.csv("mydata.csv")
# get variable names
dput(names(mydata))
# variables that I would like to summarize
myVars <- c("AGE", "CM_AIDS",
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF",
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG",
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES",
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC",
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER",
"CM_VALVE", "CM_WGHTLOSS", "ELECTIVE", "FEMALE",
"HOSP_DIVISION", "LOS", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1",
"PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")
## Vector of categorical variables that need transformation
catVars <- c("CM_AIDS",
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF",
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG",
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES",
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC",
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER",
"CM_VALVE", "CM_WGHTLOSS", "ELECTIVE", "FEMALE",
"HOSP_DIVISION", "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")
## stratified by death
tab1 <- CreateTableOne(vars = myVars, strata = "DIED" , data = mydata, factorVars = catVars)
kable(print(tab1, showAllLevels = TRUE, test = FALSE), caption = "Descriptive statistics stratified by mortality")
tab1mat <- print(tab1, showAllLevels = TRUE, test = T, quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
write.csv(tab1mat, file = "Tableone.csv")
```

Bivariate associations for LOS

```
mydata <- read.csv("mydata.csv")
mydata$diagnosis <- as.numeric(as.factor(mydata$diagnosis))
function_compute_meansd <- function(dataset, group_vars){
dataset %>%
filter(!is.na(!!enquo(group_vars))) %>%
mutate(rank_LOS = rank(LOS)) %>%
group_by(!!enquo(group_vars)) %>%
summarise(mean_LOS = mean(LOS), sd_LOS = sd(LOS), mean_rank_LOS = mean(rank_LOS))
}
function_bivariate_test <- function(dataset, i){
if(class(dataset[, i]) == "numeric"){
cor.test(dataset$LOS, dataset[, i], method = "pearson") %>% .[c("method", "estimate", "p.value")]
} else {
if (nlevels(dataset[, i]) > 2){
kruskal.test(dataset$LOS ~ dataset[, i]) %>% .[c("method", "p.value")]
} else{
wilcox.test(dataset$LOS ~ dataset[, i], exact = FALSE, correct = FALSE) %>% .[c("method", "p.value")]
}
}
}
tab_meansd <- as.list(c( "CM_AIDS",
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF",
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG",
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES",
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC",
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER",
"CM_VALVE", "CM_WGHTLOSS", "DIED" , "ELECTIVE", "FEMALE",
"HOSP_DIVISION", "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")) %>%
map(as.name) %>%
map(function_compute_meansd, dataset = mydata) %>%
map(~ mutate(., Variable = colnames(.)[1])) %>%
map_dfr(~ rename(., level = 1)) %>%
select(Variable, everything())
mydata <- mydata %>% mutate_at(c("CM_AIDS", "CM_ALCOHOL", "CM_ANEMDEF",
"CM_ARTH", "CM_BLDLOSS", "CM_CHF", "CM_CHRNLUNG", "CM_COAG",
"CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", "CM_HTN_C", "CM_HYPOTHY",
"CM_LIVER", "CM_LYMPH", "CM_LYTES", "CM_METS", "CM_NEURO", "CM_OBESE",
"CM_PARA", "CM_PERIVASC", "CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL",
"CM_TUMOR", "CM_ULCER", "CM_VALVE", "CM_WGHTLOSS", "DIED", "ELECTIVE", "FEMALE", "HOSP_DIVISION", "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis"), as.factor) %>%
mutate_at(c("AGE", "LOS", "NCHRONIC", "NDX", "NPR"), as.numeric)
str(mydata)
tab_bivariate_test <- as.list(c("AGE", "CM_AIDS",
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF",
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "^CM_DM$", "CM_DMCX", "CM_DRUG",
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES",
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC",
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER",
"CM_VALVE", "CM_WGHTLOSS", "ELECTIVE", "FEMALE",
"HOSP_DIVISION", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1",
"PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")) %>%
map(grep, colnames(mydata)) %>%
map_dfr(function_bivariate_test, dataset = mydata) %>%
mutate(Variable = c("AGE", "CM_AIDS",
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF",
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG",
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES",
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC",
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER",
"CM_VALVE", "CM_WGHTLOSS", "ELECTIVE", "FEMALE",
"HOSP_DIVISION", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1",
"PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")) %>%
select(Variable, everything())
write.csv(tab_meansd, file = "tab_meansd.csv")
write.csv(tab_bivariate_test, file = "tab_bivariate_test.csv")
```

classification with forward selection logistic regression

```
cl <- makeCluster(6, type="SOCK") # parallel processing for 6 cores machine
registerDoSNOW(cl)
# The task, learner, resampling and train/test objects used below were not
# defined in the original chunk; this is a minimal reconstruction of the setup
# described in Methods (7:3 split, 3-fold CV, logistic regression learner).
set.seed(2020)
n <- nrow(mydata)
train <- sample(1:n, floor(0.7 * n))
test <- setdiff(1:n, train)
DIED_tsk <- makeClassifTask(
data = mydata %>% select(-"LOS") %>% mutate_if(is.character, as.factor) %>% mutate(DIED = as.factor(DIED)),
target = "DIED", positive = "1")
DIED_tsk_train <- subsetTask(DIED_tsk, subset = train)
logreg_lnr <- makeLearner("classif.logreg", predict.type = "prob")
rdesc <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
ctrl_forward <- makeFeatSelControlSequential(method = "sfs")
set.seed(2020)
DIED_forwd_select <- selectFeatures(learner = logreg_lnr, task = DIED_tsk_train, resampling = rdesc, measure = auc, control = ctrl_forward, show.info = F)
DIED_forwd_select
# cross-validation AUC
cv_auc_test = DIED_forwd_select$y
cv_auc_test
DIED_forwd_select$x
#Features (5): AGE, NCHRONIC, NDX, NPR, ORPROC
#auc.test.mean=0.8478
# "AGE" "NCHRONIC" "NDX" "NPR" "ORPROC"
DIED_forwd_tsk = subsetTask(DIED_tsk, features=DIED_forwd_select$x) # this redefines the task to only use the features selected above
DIED_forwd = train(learner=logreg_lnr, task=DIED_forwd_tsk , subset=train) #this trains the best CV model on the entire training set
DIED_forwd_predicttest = predict(DIED_forwd, task=DIED_forwd_tsk, subset=test) #this predicts on the test set
DIED_forwd_predicttest
# Prediction: 49876 observations
# predict.type: prob
# threshold: 1=0.50,0=0.50
# time: 0.00
confMatrix_forwd= table(true = mydata[test, ]$DIED, predicted = DIED_forwd_predicttest %>% as.data.frame() %>% .$response)
confMatrix_forwd
classificationError <- function(confMatrix, dataset){
confMatrix
error= (confMatrix[1,2] + confMatrix[2,1])/nrow(dataset)
error <- round(error, 2)
sens <- confMatrix[2, 2]/(confMatrix[2, 2] + confMatrix[2, 1])
spec <- confMatrix[1, 1]/(confMatrix[1, 1] + confMatrix[1, 2])
c(error=error, sensitivity=sens, specificity=spec)
}
classificationError(confMatrix_forwd, mydata[test, ])
# test error
#error sensitivity specificity
# 0.02000 0.99922 0.01613
pred <- DIED_forwd_predicttest %>% as.data.frame() %>% .$prob.1 %>% as.numeric()
roc <- pROC::roc(mydata[test, ]$DIED , pred)
#Data: pred in 1054 controls (mydata[test, ]$DIED 1) > 48822 cases (mydata[test, ]$DIED 0).
#Area under the curve: 0.852
roc_tab_forwd <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("logistic_forward", length(roc$specificities)))
write.csv(roc_tab_forwd, "roc_tab_forwd.csv", row.names = FALSE)
stopCluster(cl) # stop the cluster in the end
```

random forest, bagging and boost

```
cl <- makeCluster(6, type="SOCK") # parallel processing for 6 cores machine
registerDoSNOW(cl)
set.seed(2020)
df <- mydata %>% select(-c("LOS", "CM_AIDS", "CM_ARTH", "CM_ULCER"))
DIED_rf = randomForest(DIED ~ ., data = df[train, ], mtry = dim(df[train, ])[2] - 1, strata = df$DIED[train], sampsize = as.vector(table(df$DIED[train]))) # oob error rate
DIED_rf
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 40
# OOB estimate of error rate: 2.12%
#Confusion matrix:
# 1 0 class.error
#1 23 2084 0.9890840
#0 30 97614 0.0003072
sum(DIED_rf$err.rate[, 1])
# [1] 10.87
# For random forests use randomForest with the default parameters. Generate the variable importance plot using varImpPlot and extract variable importance from the randomForest fitted object using the importance function.
varImpPlot(DIED_rf, cex.lab = 0.5, cex.axis = 0.5, cex = 0.75, n.var = 20, main = "", pch = 16, col = "red4")
importance = randomForest::importance
important.par = importance(DIED_rf)[order(importance(DIED_rf)[, 1], decreasing = TRUE), ]
important.par.df<- important.par %>% data.frame()
write.csv(important.par.df, "importance_rf.csv", row.names = T)
# bagging
set.seed(2020)
DIED_bg = randomForest(DIED ~ ., data = df[train, ], strata = df$DIED[train], sampsize = as.vector(table(df$DIED[train]))) # oob error rate
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 6
# OOB estimate of error rate: 2.11%
#Confusion matrix:
# 1 0 class.error
#1 0 2107 1.000e+00
#0 1 97643 1.024e-05
sum(DIED_bg$err.rate[, 1])
## oob error
# [1] 10.75
varImpPlot(DIED_bg, cex.lab = 0.5, cex.axis = 0.5, cex = 0.75, n.var = 20, main = "", pch = 16, col = "red4")
importance = randomForest::importance
important.par = importance(DIED_bg)[order(importance(DIED_bg)[, 1], decreasing = TRUE), ]
important.par.df<- important.par %>% data.frame()
write.csv(important.par.df, "importance_bg.csv", row.names = T)
# For boosting use gbm with cv.folds=3 to perform 3-fold cross-validation, and set class.stratify.cv to DIED so that cross-validation is performed stratifying by DIED .
library(gbm, warn.conflicts = FALSE)
df$DIED = -(as.numeric(df$DIED)-2) # DIED factor levels are ("1", "0"), so this maps them back to numeric 1/0 for gbm
set.seed(2020)
DIED_boost = gbm(DIED ~ ., data = df[train, ], distribution = "bernoulli", n.trees = 5000, interaction.depth = 1, shrinkage = 0.01, cv.folds = 3, class.stratify.cv = TRUE)
str(DIED_boost)
# A gradient boosted model with bernoulli loss function.
# 5000 iterations were performed.
# The best cross-validation iteration was 4987.
# There were 40 predictors of which 39 had non-zero influence.
# Plot the cross-validation error as a function of the boosting iteration (the $cv.error component of the gbm object) and determine whether additional boosting iterations are warranted. If so, run additional iterations with gbm.more (use the R help to check its syntax). Choose the optimal number of iterations.
plot(DIED_boost$train.error, cex.lab = 2, cex.axis = 2, col = "red4", type = "l", lwd = 3, ylim = c(0.1, 0.2), ylab = "error")
lines(DIED_boost$cv.error, col = "steelblue", lwd = 3)
boost_itr_error <- data.frame(train.error = DIED_boost$train.error, cv.error = DIED_boost$cv.error)
write.csv(boost_itr_error, "boost_itr_error.csv", row.names = T)
# Use the summary.gbm function to generate the variable importance plot and extract variable importance/influence (summary.gbm does both). Generate 1D and 2D marginal plots with plot.gbm to assess the effect of the top three variables and their 2-way interactions.
importance_boost<- summary(DIED_boost) %>% as.data.frame()
write.csv(importance_boost, "importance_boost.csv", row.names = T)
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[1, 1]))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[2, 1]))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[3, 1]))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[1:2, 1]), main = paste0("2D Plot: interaction between NDX and NPR"))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[2:3, 1]), main = paste0("2D Plot: interaction between NPR and AGE"))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[c(1, 3), 1]), main = paste0("2D Plot: interaction between NDX and AGE"))
# Compute the test misclassification error for the 4 methods and comment on their relative performance.
# predict rf
DIED_rf_predict_class = predict(DIED_rf, newdata = df[test, ], type = "class") # class labels for the confusion matrix
DIED_rf_predict_prob = predict(DIED_rf, newdata = df[test, ], type = "prob") # class probabilities for the ROC curve
table(DIED_rf_predict_class)
confMatrix_rf = table(true = df[test, ]$DIED, predicted = DIED_rf_predict_class)
classificationError(confMatrix_rf, df[test, ])
# test error
# error sensitivity specificity
# 0.02000 0.99973 0.00759
pred <- DIED_rf_predict_prob[, "1"] %>% as.numeric()
roc <- pROC::roc(df[test, ]$DIED , pred)
# Data: pred in 1054 controls (df[test, ]$DIED 1) > 48822 cases (df[test, ]$DIED 0).
# Area under the curve: 0.864
roc_tab_rf <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("random_forest", length(roc$specificities)))
write.csv(roc_tab_rf, "roc_tab_rf.csv", row.names = F)
# predict bagging
DIED_bg_predict = predict(DIED_bg, newdata = df[test, ], type = "response") # class labels for the confusion matrix
DIED_bg_predict_prob = predict(DIED_bg, newdata = df[test, ], type = "prob") # class probabilities for the ROC curve
table(DIED_bg_predict)
confMatrix_bg = table(true = df[test, ]$DIED, predicted = DIED_bg_predict)
classificationError(confMatrix_bg, df[test, ])
# test error
# error sensitivity specificity
# 0.02 1.00 0.00
pred <- DIED_bg_predict_prob[, "1"] %>% as.numeric()
roc <- pROC::roc(df[test, ]$DIED , pred)
# Data: pred in 1054 controls (df[test, ]$DIED 1) > 48822 cases (df[test, ]$DIED 0).
# Area under the curve: 0.881
roc_tab_bag <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("bagging", length(roc$specificities)))
write.csv(roc_tab_bag, "roc_tab_bag.csv", row.names = F)
# boosting
DIED_boost_predict = predict(DIED_boost, newdata = df[test, ], type = "response", n.trees = which.min(DIED_boost$cv.error))
DIED_boost_predict_df<- DIED_boost_predict %>% as.data.frame() %>% mutate(response = ifelse(`.`>0.5, 1, 0))
DIED_boost_predict_df %>% count(response)
confMatrix_boost = table(true = df[test, ]$DIED, predicted = DIED_boost_predict_df$response)
classificationError(confMatrix_boost, df[test, ])
pred <- DIED_boost_predict %>% as.data.frame() %>% .$`.` %>% as.numeric()
roc <- pROC::roc(df[test, ]$DIED , pred)
# Data: pred in 48822 controls (df[test, ]$DIED 0) < 1054 cases (df[test, ]$DIED 1).
# Area under the curve: 0.883
roc_tab_boost <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("boost", length(roc$specificities)))
write.csv(roc_tab_boost, "roc_tab_boost.csv", row.names = F)
```

draw roc graphs

```
df1 <- read.csv("roc_tab_forwd.csv")
df2 <- read.csv("roc_tab_rf.csv")
df3 <- read.csv("roc_tab_bag.csv")
df4 <- read.csv("roc_tab_boost.csv")
roc_df <- rbind(df1, df2, df3, df4)
png("roc.png", units="in", width=7, height=5, res=300, type="cairo")
roc_df %>% ggplot(aes(x = 1-specificity, y = sensitivity, group = test, color = test)) + geom_line(size=1, alpha=0.9)
dev.off()
```

plot variable importance

```
df1 <- read.csv("importance_rf.csv") %>% head(10)
df2 <- read.csv("importance_bg.csv") %>% head(10)
df3 <- read.csv("importance_boost.csv") %>% head(10)
variable_df <- rbind(df1, df2, df3)
variable_df %>% ggplot(aes(x = fct_reorder2(var, rel.inf, test), y = rel.inf))+
geom_col(aes(fill = var)) + coord_flip() + facet_wrap(~ test, ncol = 1, scales = "free")
new_order <- variable_df %>%
group_by(test) %>%
do(data_frame(al=levels(reorder(interaction(.$test, .$var, drop=TRUE), .$rel.inf)))) %>%
pull(al)
png("importance.png", units="in", width=7, height=7, res=300, type="cairo")
variable_df %>%
mutate(al=factor(interaction(test, var), levels=new_order)) %>%
ggplot(aes(x = al, y = rel.inf)) +
geom_col(aes(fill = var)) + coord_flip() + facet_wrap(~test, ncol = 1, scales = "free") +
scale_x_discrete(breaks= new_order, labels=gsub("^.*\\.", "", new_order)) + ylab("variable importance") + xlab("variables")
dev.off()
```

parameter plot

```
parameter <- read.csv("parameters.csv")
head(parameter)
parameter %>% ggplot(aes(x = test, y = value)) + geom_point(aes(color = test), size=5, alpha=0.7, shape=21, stroke=2) + coord_flip() + facet_wrap(~parameter, ncol = 2, scales = "free")
png("parameter.png", units="in", width=8, height=5, res=300, type="cairo")
parameter %>% ggplot(aes(x = test, y = value)) + geom_point(aes(color = test), size=3, alpha=0.7, shape=21, stroke=2) + coord_flip() + facet_wrap(~parameter, ncol = 2, scales = "free")
dev.off()
```

boosting iteration error

```
boost_itr_error <- read.csv("boost_itr_error.csv") %>% rename(itr = X) # the csv was written with row.names = TRUE, so the iteration number sits in column "X"
head(boost_itr_error)
png("itr_boost.png", units="in", width=8, height=5, res=300, type="cairo")
boost_itr_error %>%
pivot_longer(-itr, names_to = "error_type", values_to = "value") %>% ggplot(aes(x=itr, y=value, group = error_type)) +
geom_line(aes(color = error_type), size=2, alpha=0.9, linetype=2) + geom_vline(xintercept = 4987, size = 2, alpha = 0.5, col = "steelblue") + annotate("text", x = 4200, y = 0.2, label = "the best cross validation \n iteration = 4987") + ylab("error") + xlab("iteration number")
dev.off()
```

knn

```
cl <- makeCluster(6, type="SOCK") # parallel processing for 6 cores machine
registerDoSNOW(cl)
set.seed(2020)
df2 <- mydata %>% select(-"DIED") %>% filter(LOS <= 30)
# [1] 148547
n = nrow(df2)
n
train = sample(1:n, floor(0.7*n))
df2 <- fastDummies::dummy_cols(df2, remove_first_dummy = TRUE)
df2 <- df2[, !sapply(df2, is.factor)]
nrow(df2) # [1] 148450
rmse = function(observed, predicted) sqrt(mean((observed - predicted)^2))
# exclude the outcome LOS from the predictor matrix, and pass only the
# training-set outcome as y (knn.reg expects y to match the training rows)
X <- df2 %>% select(-LOS)
rmse_validation <- c(100:150) %>% map_dbl(function(k){
fit <- knn.reg(train = X[train, ], test = X[-train, ], y = df2[train, ]$LOS, k = k)
rmse(df2[-train, ]$LOS, fit$pred)
})
rmse_df <- data.frame(k = c(100:150), rmse = rmse_validation)
write.csv(rmse_df, "rmse_df2.csv")
rmse_df0 <- read.csv("rmse_df0.csv")
rmse_df <- read.csv("rmse_df.csv")
rmse_df2 <- read.csv("rmse_df2.csv")
png("rmse_k.png", units="in", width=8, height=5, res=300, type="cairo")
rbind(rmse_df0, rmse_df, rmse_df2) %>% ggplot(aes(x = k, y = rmse)) + geom_line(size = 2, color = "steelblue")
dev.off()
# k = 15 # rmse = 4.230
fit <- knn.reg(train = df2[train, setdiff(names(df2), "LOS")], test = df2[-train, setdiff(names(df2), "LOS")], y = df2$LOS[train], k = 15)
R2_test_knn = 1 - sum((df2[-train, ]$LOS - fit$pred)^2)/sum((df2[-train, ]$LOS - mean(df2[-train, ]$LOS))^2)
# [1] -0.06108
knn_obs_pred <- data.frame(obs = df2[-train, ]$LOS, pred = fit$pred, test = rep("knn", length(fit$pred)))
write.csv(knn_obs_pred, "knn_obs_pred.csv")
# histogram of test-set LOS (the original call omitted the data and mapping; LOS assumed)
data.frame(LOS = df2[-train, ]$LOS) %>%
  ggplot(aes(x = LOS)) + geom_histogram(binwidth = 3, fill="#69b3a2", color="#e9ecef", alpha=0.9)
```
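The test R² above is computed as 1 − SSE/SST, so it can go negative when the model predicts worse than simply using the grand mean, which is exactly what the knn value of −0.061 indicates. A small self-contained illustration of both metrics on toy vectors (not the NIS data):

```
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
r2   <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

obs  <- c(2, 4, 6, 8)
good <- c(2.5, 3.5, 6.5, 7.5)  # close to obs -> positive R2
bad  <- c(8, 2, 9, 1)          # worse than predicting mean(obs) -> negative R2

rmse(obs, good)  # 0.5
r2(obs, good)    # 0.95
r2(obs, bad)     # negative: model loses to the mean
```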

random forest

```
cl <- makeCluster(6, type="SOCK") # parallel processing for 6 cores machine
registerDoSNOW(cl)
set.seed(2020)
df2 <- read.csv("df2.csv")
LOS_rf = randomForest(LOS ~ ., data = df2[train, ], importance = TRUE)
# oob error rate
LOS_rf
# Number of trees: 500
#No. of variables tried at each split: 26
# Mean of squared residuals: 10.74
# % Var explained: 36.19
sqrt(tail(LOS_rf$mse, 1)) # oob rmse; regression forests store per-tree oob mse in $mse (there is no $err.rate)
# [1] 3.279
# variable importance: plot with varImpPlot and extract from the fitted object with importance()
varImpPlot(LOS_rf, cex.lab = 0.5, cex.axis = 0.5, cex = 0.75, n.var = 20, main = "", pch = 16, col = "red4")
importance = randomForest::importance
important.par = importance(LOS_rf)[order(importance(LOS_rf)[, 1], decreasing = TRUE), ]
important.par.df<- important.par %>% data.frame()
write.csv(important.par.df, "linear_importance_rf.csv", row.names = T)
test.pred.forest <- predict(LOS_rf, newdata = df2[-train, ])
rf_obs_pred <- data.frame(obs = df2[-train, ]$LOS, pred = test.pred.forest, test = rep("rf", length(test.pred.forest)))
write.csv(rf_obs_pred, "rf_obs_pred.csv")
RMSE.forest <- sqrt(mean((test.pred.forest-df2[-train, ]$LOS)^2))
RMSE.forest
# 3.279
R2_test_rf = 1 - sum((df2[-train, ]$LOS - test.pred.forest)^2)/sum((df2[-train, ]$LOS - mean(df2[-train, ]$LOS))^2)
# [1] 0.3624
# %IncMSE
# %IncMSE is the most robust and informative importance measure: the increase in
# MSE of predictions (estimated with out-of-bag CV) when variable j is permuted
# (its values randomly shuffled).
# 1. grow the regression forest; compute the OOB MSE, call it mse0
# 2. for each variable j: permute the values of column j, predict, compute OOB mse(j)
# 3. %IncMSE of the j-th variable is (mse(j) - mse0)/mse0 * 100%
# the higher the number, the more important the variable
png("importance_linear.png", units="in", width=6, height=4, res=300, type="cairo")
important.par.df %>% rownames_to_column("variables") %>% head(10) %>%
ggplot(aes(x = reorder(variables, X.IncMSE), y = X.IncMSE)) +
geom_col(aes(fill = variables)) + coord_flip() + ylab("%IncMSE(variable importance)") + xlab("variables") + theme(legend.position = "none")
dev.off()
```
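The permutation-importance recipe described in the comments above can be sketched in a few lines of base R. For brevity this uses a plain `lm()` as the fitted model and re-predicts on the training data rather than out-of-bag samples, so it shows the mechanics only, not randomForest's exact %IncMSE:

```
set.seed(2020)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 3 * x1 + 0.2 * x2 + rnorm(n)   # x1 carries far more signal than x2
dat <- data.frame(y, x1, x2)

fit  <- lm(y ~ x1 + x2, data = dat)
mse0 <- mean((dat$y - predict(fit, dat))^2)          # baseline mse

perm_importance <- sapply(c("x1", "x2"), function(j) {
  shuffled      <- dat
  shuffled[[j]] <- sample(shuffled[[j]])             # permute column j
  mse_j <- mean((dat$y - predict(fit, shuffled))^2)  # mse after permuting
  (mse_j - mse0) / mse0 * 100                        # % increase in mse
})
perm_importance  # x1 dominates x2, mirroring its larger coefficient
```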

decision tree

```
library(rpart)
library(rattle)
cl <- makeCluster(6, type="SOCK") # parallel processing for 6 cores machine
registerDoSNOW(cl)
set.seed(2020)
rt <- rpart(LOS ~ ., data = df2[train,])
# n= 103982
# node), split, n, deviance, yval
# * denotes terminal node
#1) root 103982 1751000 4.348
# 2) NDX< 14.5 81031 848300 3.612
# 4) NDX< 8.5 48208 348100 3.054
# 8) diagnosis_Injury< 0.5 43773 229400 2.839 *
# 9) diagnosis_Injury>=0.5 4435 96710 5.177 *
# 5) NDX>=8.5 32823 463200 4.430 *
# 3) NDX>=14.5 22951 703400 6.948
# 6) NPR< 3.5 17571 334900 5.765 *
# 7) NPR>=3.5 5380 263600 10.810 *
test.pred.rtree <- predict(rt,df2[-train,])
RMSE.rtree <- sqrt(mean((test.pred.rtree-df2[-train, ]$LOS)^2))
RMSE.rtree
# [1] 3.666
#Variables actually used in tree construction:
#[1] diagnosis_Injury NDX NPR
#Root node error: 1750880/103982 = 17
#n= 103982
printcp(rt)
png("rt_tree.png", units="in", width=8, height=5, res=300, type="cairo")
fancyRpartPlot(rt)
dev.off()
# Get the optimal CP programmatically...
min.xerror <- rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"]
min.xerror
# ...and use it to prune the tree
rt.pruned <- prune(rt, cp = min.xerror)
fancyRpartPlot(rt.pruned)
# Evaluate the new pruned tree on the test set
test.pred.rtree.p <- predict(rt.pruned, df2[-train, ])
tree_obs_pred <- data.frame(obs = df2[-train, ]$LOS, pred = test.pred.rtree.p, test = rep("decision_tree_tuned", length(test.pred.rtree.p)))
write.csv(tree_obs_pred, "tree_obs_pred.csv")
RMSE.rtree.pruned <- sqrt(mean((test.pred.rtree.p-df2[-train,]$LOS)^2))
RMSE.rtree.pruned
# [1] 3.666
R2_test_tree = 1 - sum((df2[-train, ]$LOS - test.pred.rtree.p)^2)/sum((df2[-train, ]$LOS - mean(df2[-train, ]$LOS))^2)
# [1] 0.2028
```
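`prune()` keeps only the splits whose complexity parameter exceeds the chosen cp, and picking the cp with the smallest cross-validated error is a one-line lookup in the cptable. A toy cptable (made-up numbers, same columns as rpart's) showing the selection:

```
# columns mirror rpart's cptable: CP, nsplit, rel error, xerror, xstd
cptable <- rbind(
  c(0.114, 0, 1.000, 1.001, 0.012),
  c(0.060, 1, 0.886, 0.888, 0.011),
  c(0.021, 2, 0.826, 0.829, 0.010),
  c(0.013, 3, 0.805, 0.809, 0.010),
  c(0.010, 4, 0.792, 0.797, 0.010)
)
colnames(cptable) <- c("CP", "nsplit", "rel error", "xerror", "xstd")

min.xerror <- cptable[which.min(cptable[, "xerror"]), "CP"]
min.xerror  # 0.010 here: the full-grown tree already minimizes xerror
```

When the minimum xerror sits at the smallest cp, pruning leaves the tree unchanged, which is why the pruned and unpruned test RMSE above are both 3.666.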

cp plot

```
cp_df <- rt$cptable %>% data.frame() %>%
  mutate(CP = factor(round(CP, 3), levels = c("0.114", "0.06", "0.021", "0.013", "0.01")))
my_xlab <- paste0(levels(cp_df$CP), "\n(split=", cp_df$nsplit, ")")
png("cp_table.png", units="in", width=8, height=5, res=300, type="cairo")
cp_df %>% ggplot(aes(x = CP, y = xerror, group = 1)) +
  geom_line(color = "grey") +
  geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 6) +
  scale_x_discrete(labels = my_xlab) + ylab("X-val Relative Error")
dev.off()
```

plot for regression

```
df3 <- read.csv("tree_obs_pred.csv")
df1 <- read.csv("knn_obs_pred.csv")
df2 <- read.csv("rf_obs_pred.csv")
df <- rbind(df1, df2, df3)
png("correlation.png", units="in", width=10, height=3, res=300, type="cairo")
df %>% ggplot(aes(x = obs, y = pred)) + geom_point(alpha = 0.1) + geom_smooth(method = "lm") + labs( x = "observed values in testset", y = "predicted values in testset")+
facet_wrap(~test, ncol = 3, scales = "free")
dev.off()
```

parameters

```
parameters_linear <- read.csv("parameters_linear.csv")
png("parameter_linear.png", units="in", width=8, height=3, res=300, type="cairo")
parameters_linear %>% ggplot(aes(x = test, y = value)) + geom_point(aes(color = test), size=3, alpha=0.7, shape=21, stroke=2) + coord_flip() + facet_wrap(~parameter, ncol = 3, scales = "free") + theme(legend.position = "none")
dev.off()
```

1. HCUP-US NIS Overview [Internet]. [cited 2020 Apr 30]. Available from: https://www.hcup-us.ahrq.gov/nisoverview.jsp.
