Predicting In-hospital Mortality and Length of Stay through Machine Learning Methods

Keren Xu

2020/05/06

Introduction

Predicting in-hospital mortality or length of hospital stay can help clinicians make informed decisions and can empower patients and their families to weigh their risks and preferences in clinical decision making. In this study, we used the 2012 National Inpatient Sample (NIS), publicly available through the Healthcare Cost and Utilization Project (HCUP), to train multiple machine learning algorithms to predict both in-hospital mortality and length of stay. The NIS contains discharge-level information and represents approximately 90% of hospitals across the U.S. [1]. The raw dataset has 200,000 patients and 175 variables; missing data appear in 199,998 of the observations and in 120 of the features. The data are coded with the International Classification of Diseases, Ninth Revision (ICD-9) and the Clinical Classifications Software (CCS) developed by the Agency for Healthcare Research and Quality (AHRQ).

Methods

Data pre-processing:
We selected variables that are of critical clinical importance and that clinicians can use to predict in-hospital mortality and length of stay in a routine clinical setting. The outcome variables are “DIED” (in-hospital mortality) and “LOS” (length of stay). Since we typically only learn the values of these two variables around the same time point, we did not use “LOS” to predict “DIED” or vice versa. Among the selected variables, patient-level characteristics include comorbidity variables (“CM_AIDS”, “CM_ALCOHOL”, “CM_ANEMDEF”, “CM_ARTH”, “CM_BLDLOSS”, “CM_CHF”, “CM_CHRNLUNG”, “CM_COAG”, “CM_DEPRESS”, “CM_DM”, “CM_DMCX”, “CM_DRUG”, “CM_HTN_C”, “CM_HYPOTHY”, “CM_LIVER”, “CM_LYMPH”, “CM_LYTES”, “CM_METS”, “CM_NEURO”, “CM_OBESE”, “CM_PARA”, “CM_PERIVASC”, “CM_PSYCH”, “CM_PULMCIRC”, “CM_RENLFAIL”, “CM_TUMOR”, “CM_ULCER”, “CM_VALVE”, “CM_WGHTLOSS”), demographics (“AGE”, “FEMALE”, “RACE”), insurance type “PAY1”, income “ZIPINC_QRTL”, patient location “PL_NCHS2006”, elective admission “ELECTIVE” and transfer into hospital “TRAN_IN”. One hospital-level variable was included – hospital location “HOSP_DIVISION”. Variables related to diagnoses or in-hospital events are the primary diagnosis with CCS classification “DXCCS1”, number of chronic conditions “NCHRONIC”, number of diagnoses on record “NDX”, number of procedures on record “NPR”, and major operation “ORPROC”. We used the CCS classification for the primary diagnosis instead of ICD-9 because CCS collapses the ICD categories into fewer groups, which aids interpretation and preserves statistical power. Based on the CCS guide, we further collapsed the CCS classifications into 17 types: infectious and parasitic disease, neoplasms, metabolic, hematologic, mental illness, nervous system, circulatory, respiratory, digestive, genitourinary, pregnancy, skin, musculoskeletal, congenital, perinatal, injury and poisoning, and other. As we aimed to build our own prediction models in this study, we did not include existing risk prediction scores such as the All Patient Refined DRG (APRDRG).
Patients younger than 18 years old or with missing age were excluded, as were patients with missing values (NA, A, B, C) for any of the selected variables. Our analytic sample therefore has 149,627 observations and 45 variables. When predicting LOS, because the outcome is highly right-skewed (ranging from 0 to over 300 days), we excluded the small proportion of patients with LOS above 30 days (1,080 patients, <1% of the total), leaving 148,547 patients in the analytic sample for LOS.
We coded all continuous variables as numeric and all categorical variables as factors. Categorical variables were further expanded into numeric dummy variables, and we dropped one dummy for each multi-category variable to avoid multicollinearity.
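For illustration, the dummy coding can be done with the fastDummies package (the same helper the LOS code below uses); here df stands for the cleaned analytic data frame:

library(fastDummies)
df_num <- dummy_cols(df, remove_first_dummy = TRUE)  # one dummy per level, first level dropped
df_num <- df_num[, !sapply(df_num, is.factor)]       # keep only the numeric columns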
We tested the bivariate association between each potential predictor and our outcomes, and excluded variables with a p-value above 0.2. Since our data are highly skewed, the nonparametric Wilcoxon rank-sum (Mann-Whitney) test was used for continuous variables against DIED, and the chi-square test was used for categorical variables against DIED. Similarly, nonparametric tests were used to assess the association between categorical variables and LOS (Wilcoxon rank-sum for binary variables, Kruskal-Wallis for variables with more than two levels), and the Pearson correlation test was used for continuous variables against LOS.
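As a concrete example, the screening amounted to checks of the following form (a minimal sketch; mydata is the analytic data set built in the code chunks below):

chisq.test(table(mydata$CM_AIDS, mydata$DIED))$p.value   # categorical predictor vs DIED
wilcox.test(AGE ~ DIED, data = mydata)$p.value           # continuous predictor vs DIED
# predictors with p > 0.2 (CM_AIDS, CM_ARTH and CM_ULCER) were dropped from the mortality models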
Machine learning methods:
To predict in-hospital mortality, we applied four supervised machine-learning approaches: (1) classification with forward-selection logistic regression, (2) balanced random forest, which also ranks variable importance for mortality, (3) bagging and (4) gradient boosting. To predict length of stay, we applied three machine-learning methods: (1) k-nearest neighbor (knn) regression, (2) random forest and (3) decision tree.
In-hospital mortality: We split the data into a training and a test set in a 7:3 ratio using the Monte-Carlo cross-validation (CV) subsampling method in the mlr package. For forward logistic regression, we computed the CV test area under the ROC curve (auc) using three-fold CV and then used the features chosen by forward selection to predict mortality in the test set. In addition, we trained a random forest in the training set (500 trees, with mtry set to the number of features, i.e., 40 variables tried at each split) and a bagging model (500 trees and six variables tried at each split). Moreover, we fit a gradient boosted model with the Bernoulli loss function, three-fold cross-validation and a maximum of 5000 iterations; the best cross-validation iteration from the training set was used to predict mortality in the test set. We also generated 1D and 2D marginal plots to assess the effect of the top three variables and their 2-way interactions. Test auc, misclassification error, sensitivity and specificity were computed for all four methods. The out-of-bag error rate was estimated for random forest and bagging, and variable importance was summarized for random forest, bagging and boosting. We selected the best of the four models by comparing their test auc, misclassification error, sensitivity and specificity.
Length of stay: To select the best k for knn regression, we iterated over each k between 1 and 15 and compared the root mean square errors (rmse). We used the k with the lowest rmse to predict LOS in the test set and also computed the test R-square (r2). Second, we trained a random forest with default parameters in the training set (500 trees and 26 variables tried at each split) and predicted LOS in the test set; test r2 and rmse were computed and variable importance was summarized. Third, we tuned a decision tree using the cp value at which the x-val relative error was smallest, used the tuned tree to predict LOS in the test set, and computed r2 and rmse. We selected the best of the three models by comparing their test r2 and test rmse.

Results

Summary statistics of our variables are presented in tables 1 and 2. Table 1 shows summary statistics stratified by mortality together with the bivariate test results. Table 2 shows summary statistics of LOS within each variable category together with the bivariate test results. We excluded “CM_AIDS”, “CM_ARTH”, and “CM_ULCER” from the four machine-learning models for mortality prediction because their p-values in table 1 were above 0.2.
Figure 1 shows 1D and 2D marginal plots of the three most important variables (“NDX”, “NPR” and “AGE”) from gradient boosting. The mortality risk increased steadily with age. It also increased with the number of diagnoses, jumping to a much higher level once the number of diagnoses reached around 35. Mortality risk increased with the number of procedures but dropped sharply once the number of procedures reached about 20. The 2D plots show that mortality risk is highest among patients with older age, a larger number of procedures (below 20) and a larger number of diagnoses. Figure 2 shows that iteration 4987 has the lowest CV error among the 5000 iterations, so we used this number of iterations on the test data. Figure 3 shows the top 10 important variables identified by bagging, boosting and random forest. The top 10 from bagging are age, hospital location, number of diagnoses, diagnosis type, number of procedures, number of chronic conditions, patient location, patient income, insurance type and race. The top 10 from boosting are number of diagnoses, number of procedures, age, CM_LYTES, diagnosis type, number of chronic conditions, major operation, CM_COAG, CM_METS, and transfer into hospital. The top 10 from random forest are age, number of diagnoses, hospital location, number of chronic conditions, number of procedures, patient location, diagnosis type, income, insurance type and race. In addition, the features selected by forward regression are age, number of chronic conditions, number of diagnoses, number of procedures and major operation; these features were used in the test set. Figures 4 and 5 present ROC curves and test parameters. Boosting has the highest auc (0.883) and the lowest misclassification error (0.02). All four methods have very low sensitivity but very high specificity (figure 5). Random forest and bagging have out-of-bag error rates of 2.12% and 2.11%, respectively. In summary, gradient boosting outperforms the other three models.
Figure 6 shows the selection of k for the knn method for LOS prediction. The rmse decreases as k increases from 1 to 15, and the curve flattens at around k = 12; we selected k = 15 as our optimal value. Figure 7 lists the 10 most important variables identified by the random forest for predicting LOS: number of procedures, elective admission, age, diagnosis of infectious disease, number of diagnoses on record, diagnosis of genitourinary disease, number of chronic conditions, diagnosis of injury, major operation, and CM_WGHTLOSS. Figure 8 shows that the optimal cp value for tuning the decision tree is 0.01, with four splits. Figure 9 presents the tuned decision tree; the variables actually used in tree construction were diagnosis of injury, number of diagnoses and number of procedures. Figures 10 and 11 show how we compared the three methods. As figure 10 shows, the decision tree classifies predicted LOS into five values, which are correlated with observed LOS; the knn predictions show an essentially random pattern with no correlation; and the random forest outperforms the other two methods, showing a strong correlation between predicted and observed LOS. Additionally, the random forest has the largest test r2 (0.3624) and the lowest test rmse (3.279) among the three methods.

Conclusions/discussion

The final models we selected to predict mortality and LOS are gradient boosting and random forest, respectively. Gradient boosting outperforms the other models, showing high auc, low misclassification error and high specificity. However, all four mortality models in this study show very low sensitivity. If we are mainly interested in identifying patients who would not die during the hospital stay, our model might be good enough to apply to new observations; if identifying patients at high mortality risk is the main concern, our model might not be sufficiently accurate. The random forest model selected for LOS prediction shows a strong correlation between predicted and observed values. However, since we only included patients with LOS of 30 days or less, it is questionable whether this model can be applied to patients who stay longer.


Figure 1. 1D and 2D marginal plots of the three most important variables (NDX, NPR and AGE) from gradient boosting.
Figure 2. CV error and train error for each iteration from gradient boosting.
Figure 3. Variable importance from three methods (bagging, boosting and random forest).
Figure 4. ROC curves for four methods (bagging, boosting, forward classification, and random forest).
Figure 5. Test parameters for four methods (bagging, boosting, forward classification, and random forest).
Figure 6. rmse for each k from 1 to 15 for the knn method.
Figure 7. Variable importance from the random forest method.
Figure 8. X-val relative error for each cp value from the decision tree method.
Figure 9. The tuned decision tree.
Figure 10. Correlation plots of observed LOS versus predicted LOS from each method in the test set.
Figure 11. Test parameters (r2 and rmse) from three methods.


code chunk


load packages

library(tidyverse)
library(knitr)
library(tableone)
library(dplyr)
library(mlr)
library(forcats)
library(rpart)
library(rpart.plot)
library(randomForest)
library(Cairo)
library(parallel)
library(doSNOW)
library(ggplot2)
library(wesanderson)
library(FNN)
# library(pROC)
theme_set(theme_minimal())

tidy data

mydata <- read.csv("NIS2012-200K.csv", na.strings=c("", "NA"))    

dim(mydata)  # a total of 200,000 patients and 175 variables 

dput(sort(names(mydata)))

df <- mydata %>% select(c("AGE", "CM_AIDS", "CM_ALCOHOL", "CM_ANEMDEF", 
"CM_ARTH", "CM_BLDLOSS", "CM_CHF", "CM_CHRNLUNG", "CM_COAG", 
"CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", "CM_HTN_C", "CM_HYPOTHY", 
"CM_LIVER", "CM_LYMPH", "CM_LYTES", "CM_METS", "CM_NEURO", "CM_OBESE", 
"CM_PARA", "CM_PERIVASC", "CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", 
"CM_TUMOR", "CM_ULCER", "CM_VALVE", "CM_WGHTLOSS", "DIED",  
"DXCCS1", "ELECTIVE", "FEMALE", "HOSP_DIVISION", "LOS", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL")) %>% 
  filter(DIED !="A" & ELECTIVE != "A" & FEMALE != "C" & LOS != "C" & PAY1 != "A" & RACE != "A" & ZIPINC_QRTL != "A", AGE != "C") %>% 
  mutate_at(c("AGE", "LOS", "NCHRONIC", "NDX", "NPR"), as.numeric) %>% 
  mutate_at(c("CM_AIDS", "CM_ALCOHOL", "CM_ANEMDEF", 
"CM_ARTH", "CM_BLDLOSS", "CM_CHF", "CM_CHRNLUNG", "CM_COAG", 
"CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", "CM_HTN_C", "CM_HYPOTHY", 
"CM_LIVER", "CM_LYMPH", "CM_LYTES", "CM_METS", "CM_NEURO", "CM_OBESE", 
"CM_PARA", "CM_PERIVASC", "CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", 
"CM_TUMOR", "CM_ULCER", "CM_VALVE", "CM_WGHTLOSS", "DIED",  "DXCCS1", "ELECTIVE", "FEMALE", "HOSP_DIVISION",  "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL"), as.factor) %>% 
  filter(AGE >= 18) %>% 
  mutate(
  DXCCS1 = as.numeric(as.character(DXCCS1)), # via as.character: as.numeric() on a factor returns level codes, not the CCS codes
  diagnosis = case_when(
  DXCCS1 %in% c(1:10) ~ "Infectious and parasitic disease",
  DXCCS1 %in% c(11:47) ~ "Neoplasms",
  DXCCS1 %in% c(48:58) ~ "Metabolic", 
  DXCCS1 %in% c(59:64) ~ "Hematologic", 
  DXCCS1 %in% c(650:663, 670) ~ "Mental illness",   
  DXCCS1 %in% c(76:95) ~ "Nervous system",
  DXCCS1 %in% c(96:121) ~ "Circulatory",
  DXCCS1 %in% c(122:134) ~ "Respiratory",
  DXCCS1 %in% c(135:155) ~ "Digestive",   
  DXCCS1 %in% c(156:175) ~ "Genitourinary",  
  DXCCS1 %in% c(176:196) ~ "Pregnancy",   
  DXCCS1 %in% c(197:200) ~ "Skin", 
  DXCCS1 %in% c(201:212) ~ "Musculoskeletal", 
  DXCCS1 %in% c(213:217) ~ "Congenital",
  DXCCS1 %in% c(218:224) ~ "Perinatal",
  DXCCS1 %in% c(225:244) ~ "Injury and poisoning",
  DXCCS1 %in% c(245:260) ~ "Other"
)) %>% select(-"DXCCS1") %>% mutate(diagnosis = as.factor(diagnosis))

dim(df)
# [1] 155147    45  

# remove na   

df <- df %>% filter(complete.cases(.))

dim(df)      

# [1] 149627  45

# write out the df   
write.csv(df, "mydata.csv", row.names = FALSE)

make table one

# read in dataset   
mydata <- read.csv("mydata.csv")   

# get variable names  
dput(names(mydata))

# variables that I would like to summarize  
myVars <- c("AGE", "CM_AIDS", 
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF", 
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", 
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES", 
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC", 
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER", 
"CM_VALVE", "CM_WGHTLOSS",  "ELECTIVE", "FEMALE", 
"HOSP_DIVISION", "LOS", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1", 
"PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")   

## Vector of categorical variables that need transformation
catVars <- c("CM_AIDS", 
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF", 
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", 
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES", 
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC", 
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER", 
"CM_VALVE", "CM_WGHTLOSS",  "ELECTIVE", "FEMALE", 
"HOSP_DIVISION", "ORPROC", "PAY1",  "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")  


## stratified by death 

tab1 <- CreateTableOne(vars = myVars, strata = "DIED" , data = mydata, factorVars = catVars)

kable(print(tab1, showAllLevels = TRUE, test = FALSE), caption = "Descriptive statistics stratified by mortality")  

tab1mat <- print(tab1, showAllLevels = TRUE, test = T, quote = FALSE, noSpaces = TRUE, printToggle = FALSE)

write.csv(tab1mat, file = "Tableone.csv")  

Bivariate associations for LOS


mydata <- read.csv("mydata.csv")     

mydata$diagnosis <- as.numeric(as.factor(mydata$diagnosis))   

function_compute_meansd <- function(dataset, group_vars){
dataset %>%
  filter(!is.na(!!enquo(group_vars))) %>% 
  mutate(rank_LOS = rank(LOS)) %>% 
  group_by(!!enquo(group_vars)) %>% 
  summarise(mean_LOS = mean(LOS), sd_LOS = sd(LOS), mean_rank_LOS = mean(rank_LOS))
} 


function_bivariate_test <- function(dataset, i){ 
  if(class(dataset[, i]) == "numeric"){
    cor.test(dataset$LOS, dataset[, i], method = "pearson") %>% .[c("method", "estimate", "p.value")]
  } else {
      if (nlevels(dataset[, i]) > 2){
   kruskal.test(dataset$LOS ~ dataset[, i]) %>% .[c("method", "p.value")]
 } else{
   wilcox.test(dataset$LOS ~ dataset[, i], exact = FALSE, correct = FALSE) %>% .[c("method", "p.value")]
 } 
  }
}


tab_meansd <- as.list(c( "CM_AIDS", 
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF", 
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", 
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES", 
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC", 
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER", 
"CM_VALVE", "CM_WGHTLOSS", "DIED" , "ELECTIVE", "FEMALE", 
"HOSP_DIVISION", "ORPROC", "PAY1",  "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")) %>% 
  map(as.name) %>% 
  map(function_compute_meansd, dataset = mydata) %>% 
  map(~ mutate(., Variable = colnames(.)[1])) %>% 
  map_dfr(~ rename(., level = 1)) %>% 
  select(Variable, everything()) 

mydata <- mydata %>%  mutate_at(c("CM_AIDS", "CM_ALCOHOL", "CM_ANEMDEF", 
"CM_ARTH", "CM_BLDLOSS", "CM_CHF", "CM_CHRNLUNG", "CM_COAG", 
"CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", "CM_HTN_C", "CM_HYPOTHY", 
"CM_LIVER", "CM_LYMPH", "CM_LYTES", "CM_METS", "CM_NEURO", "CM_OBESE", 
"CM_PARA", "CM_PERIVASC", "CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", 
"CM_TUMOR", "CM_ULCER", "CM_VALVE", "CM_WGHTLOSS", "DIED",  "ELECTIVE", "FEMALE", "HOSP_DIVISION",  "ORPROC", "PAY1", "PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis"), as.factor) %>% 
  mutate_at(c("AGE", "LOS", "NCHRONIC", "NDX", "NPR"), as.numeric)  

str(mydata)

tab_bivariate_test <- as.list(c("AGE", "CM_AIDS", 
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF", 
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "^CM_DM$", "CM_DMCX", "CM_DRUG", 
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES", 
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC", 
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER", 
"CM_VALVE", "CM_WGHTLOSS", "ELECTIVE", "FEMALE", 
"HOSP_DIVISION", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1", 
"PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")) %>% 
  map(grep, colnames(mydata)) %>% 
  map_dfr(function_bivariate_test, dataset = mydata) %>% 
  mutate(Variable = c("AGE",  "CM_AIDS", 
"CM_ALCOHOL", "CM_ANEMDEF", "CM_ARTH", "CM_BLDLOSS", "CM_CHF", 
"CM_CHRNLUNG", "CM_COAG", "CM_DEPRESS", "CM_DM", "CM_DMCX", "CM_DRUG", 
"CM_HTN_C", "CM_HYPOTHY", "CM_LIVER", "CM_LYMPH", "CM_LYTES", 
"CM_METS", "CM_NEURO", "CM_OBESE", "CM_PARA", "CM_PERIVASC", 
"CM_PSYCH", "CM_PULMCIRC", "CM_RENLFAIL", "CM_TUMOR", "CM_ULCER", 
"CM_VALVE", "CM_WGHTLOSS", "ELECTIVE", "FEMALE", 
"HOSP_DIVISION", "NCHRONIC", "NDX", "NPR", "ORPROC", "PAY1", 
"PL_NCHS2006", "RACE", "TRAN_IN", "ZIPINC_QRTL", "diagnosis")) %>% 
  select(Variable, everything())    

write.csv(tab_meansd, file = "tab_meansd.csv")  

write.csv(tab_bivariate_test, file = "tab_bivariate_test.csv")  

classification with forward selection logistic regression

cl <- makeCluster(6, type="SOCK") # parallel processing on a 6-core machine
registerDoSNOW(cl)
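# The objects used below (logreg_lnr, DIED_tsk, DIED_tsk_train, rdesc, train,
# test) are not defined in this chunk; a minimal sketch of the assumed setup
# (70/30 split as described in Methods, logistic learner, 3-fold CV):
mydata <- read.csv("mydata.csv", stringsAsFactors = TRUE)
mydata$DIED <- as.factor(mydata$DIED)
set.seed(2020)
n <- nrow(mydata)
train <- sample(1:n, floor(0.7 * n))
test <- setdiff(1:n, train)
DIED_tsk <- makeClassifTask(data = mydata %>% select(-LOS), target = "DIED") # LOS is not used to predict DIED
DIED_tsk_train <- subsetTask(DIED_tsk, subset = train)
logreg_lnr <- makeLearner("classif.logreg", predict.type = "prob")
rdesc <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)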
ctrl_forward <- makeFeatSelControlSequential(method = "sfs") 
set.seed(2020) 
DIED_forwd_select <- selectFeatures(learner = logreg_lnr, task = DIED_tsk_train, resampling = rdesc, measures = auc, control = ctrl_forward, show.info = FALSE)
DIED_forwd_select

# cross-validation AUC   
cv_auc_test = DIED_forwd_select$y
cv_auc_test    

DIED_forwd_select$x  

#Features (5): AGE, NCHRONIC, NDX, NPR, ORPROC
#auc.test.mean=0.8478
# "AGE"      "NCHRONIC" "NDX"      "NPR"      "ORPROC"


DIED_forwd_tsk = subsetTask(DIED_tsk, features=DIED_forwd_select$x) # this redefines the task to only use the features selected above   

DIED_forwd = train(learner=logreg_lnr, task=DIED_forwd_tsk , subset=train) #this trains the best CV model on the entire training set  

DIED_forwd_predicttest = predict(DIED_forwd, task=DIED_forwd_tsk, subset=test) #this predicts on the test set       

DIED_forwd_predicttest  

# Prediction: 49876 observations
# predict.type: prob
# threshold: 1=0.50,0=0.50
# time: 0.00  

confMatrix_forwd= table(true = mydata[test, ]$DIED, predicted = DIED_forwd_predicttest %>% as.data.frame() %>% .$response)    

confMatrix_forwd  

classificationError <- function(confMatrix, dataset){
  # confMatrix is table(true, predicted); rows/columns follow DIED's factor level order
  error= (confMatrix[1,2] + confMatrix[2,1])/nrow(dataset) 
  error <- round(error, 2)
  # with the level ordering seen in the printed confusion matrices below
  # (row 1 = "1"/died, row 2 = "0"/survived), "sens" here tracks the survivor class
  sens <- confMatrix[2, 2]/(confMatrix[2, 2] + confMatrix[2, 1])
  spec <- confMatrix[1, 1]/(confMatrix[1, 1] + confMatrix[1, 2])
  c(error=error, sensitivity=sens, specificity=spec)  
}


classificationError(confMatrix_forwd, mydata[test, ])      

# test error  
#error sensitivity specificity 
#    0.02000     0.99922     0.01613 

pred <- DIED_forwd_predicttest %>% as.data.frame() %>% .$prob.1 %>% as.numeric()


roc <- pROC::roc(mydata[test, ]$DIED  , pred)   

#Data: pred in 1054 controls (mydata[test, ]$DIED 1) > 48822 cases (mydata[test, ]$DIED 0).
#Area under the curve: 0.852

roc_tab_forwd <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("logistic_forward", length(roc$specificities)))    

write.csv(roc_tab_forwd, "roc_tab_forwd.csv", row.names = FALSE)
stopCluster(cl) # stop the cluster in the end

random forest, bagging and boost

cl <- makeCluster(6, type="SOCK") # parallel processing on a 6-core machine
registerDoSNOW(cl)
set.seed(2020)  
df <- mydata %>% select(-c("LOS", "CM_AIDS", "CM_ARTH", "CM_ULCER"))  


DIED_rf = randomForest(DIED ~ ., data = df[train, ], mtry = dim(df[train, ])[2] - 1, strata = df$DIED[train], sampsize = as.vector(table(df$DIED[train]))) # oob error rate

DIED_rf

#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 40

#         OOB estimate of error rate: 2.12%

# Confusion matrix:
#      1     0 class.error
# 1   23  2084   0.9890840
# 0   30 97614   0.0003072

sum(DIED_rf$err.rate[, 1])

# [1] 10.87

# Variable importance: plot with varImpPlot and extract it from the fitted object with randomForest::importance.

varImpPlot(DIED_rf, cex.lab = 0.5, cex.axis = 0.5, cex = 0.75, n.var = 20, main = "", pch = 16, col = "red4")

importance = randomForest::importance
important.par = importance(DIED_rf)[order(importance(DIED_rf)[, 1], decreasing = TRUE), ]
important.par.df<- important.par %>% data.frame()  

write.csv(important.par.df, "importance_rf.csv", row.names = T)   


# bagging
set.seed(2020)
DIED_bg = randomForest(DIED ~ ., data = df[train, ], strata = df$DIED[train], sampsize = as.vector(table(df$DIED[train]))) # oob error rate

#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 6

#         OOB estimate of error rate: 2.11%

# Confusion matrix:
#     1     0 class.error
# 1   0  2107   1.000e+00
# 0   1 97643   1.024e-05

sum(DIED_bg$err.rate[, 1])

## oob error 
# [1] 10.75


varImpPlot(DIED_bg, cex.lab = 0.5, cex.axis = 0.5, cex = 0.75, n.var = 20, main = "", pch = 16, col = "red4")

importance = randomForest::importance
important.par = importance(DIED_bg)[order(importance(DIED_bg)[, 1], decreasing = TRUE), ]
important.par.df<- important.par %>% data.frame()  
write.csv(important.par.df, "importance_bg.csv", row.names = T)     


# Boosting: gbm with cv.folds = 3 performs 3-fold cross-validation; class.stratify.cv = TRUE stratifies the CV folds by DIED.

library(gbm, warn.conflicts = FALSE) 

df$DIED = -(as.numeric(df$DIED)-2) # recode the factor to numeric 0/1 for gbm (assumes levels ordered ("1", "0"))


set.seed(2020)

DIED_boost = gbm(DIED ~ ., data = df[train, ], distribution = "bernoulli", n.trees = 5000, interaction.depth = 1, shrinkage = 0.01, cv.folds = 3, class.stratify.cv = TRUE)

str(DIED_boost)

# A gradient boosted model with bernoulli loss function.
# 5000 iterations were performed.
# The best cross-validation iteration was 4987.
# There were 40 predictors of which 39 had non-zero influence.

# Plot the cross-validation error against the boosting iteration (the $cv.error component of the gbm object) to determine whether additional iterations are warranted (gbm.more can extend the run), and choose the optimal number of iterations.

plot(DIED_boost$train.error, cex.lab = 2, cex.axis = 2, col = "red4", type = "l", lwd = 3, ylim = c(0.1, 0.2), ylab = "error") 
lines(DIED_boost$cv.error, col = "steelblue", lwd = 3)    

boost_itr_error <- data.frame(train.error = DIED_boost$train.error, cv.error = DIED_boost$cv.error)  

write.csv(boost_itr_error, "boost_itr_error.csv", row.names = T)     

# summary.gbm generates the variable importance plot and returns the influence table; gbm's plot method draws the 1D and 2D marginal plots for the top three variables and their 2-way interactions.

importance_boost<- summary(DIED_boost) %>% as.data.frame()  
write.csv(importance_boost, "importance_boost.csv", row.names = T)     


plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[1, 1])) 
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[2, 1]))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[3, 1]))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[1:2, 1]), main = paste0("2D Plot: interaction between NDX and NPR"))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[2:3, 1]), main = paste0("2D Plot: interaction between NPR and AGE"))
plot(DIED_boost, as.character(summary(DIED_boost, plotit = FALSE)[c(1, 3), 1]), main = paste0("2D Plot: interaction between NDX and AGE"))

# Compute the test misclassification error for the four methods and compare their relative performance.

# predict rf  

DIED_rf_predict_class = predict(DIED_rf, newdata = df[test, ], type = "response") # class labels for the confusion matrix
DIED_rf_predict = predict(DIED_rf, newdata = df[test, ], type = "prob") # probabilities for the ROC curve
table(DIED_rf_predict_class)

confMatrix_rf = table(true = df[test, ]$DIED, predicted = DIED_rf_predict_class)    

classificationError(confMatrix_rf, df[test, ])      

# test error  
#  error sensitivity specificity 
#    0.02000     0.99973     0.00759   

pred <- DIED_rf_predict %>% as.data.frame() %>% .$`1` %>% as.numeric()


roc <- pROC::roc(df[test, ]$DIED  , pred)   

# Data: pred in 1054 controls (df[test, ]$DIED 1) > 48822 cases (df[test, ]$DIED 0).
# Area under the curve: 0.864

roc_tab_rf <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("random_forest", length(roc$specificities)))    

write.csv(roc_tab_rf, "roc_tab_rf.csv", row.names = F)     


# predict bagging

DIED_bg_predict_class = predict(DIED_bg, newdata = df[test, ], type = "response") # class labels for the confusion matrix
DIED_bg_predict = predict(DIED_bg, newdata = df[test, ], type = "prob") # probabilities for the ROC curve
table(DIED_bg_predict_class)

confMatrix_bg = table(true = df[test, ]$DIED, predicted = DIED_bg_predict_class)    

classificationError(confMatrix_bg, df[test, ])      

# test error  
#      error sensitivity specificity 
#       0.02        1.00        0.00   

pred <- DIED_bg_predict %>% as.data.frame() %>% .$`1` %>% as.numeric()


roc <- pROC::roc(df[test, ]$DIED  , pred)   

# Data: pred in 1054 controls (df[test, ]$DIED 1) > 48822 cases (df[test, ]$DIED 0).
# Area under the curve: 0.881

roc_tab_bag <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("bagging", length(roc$specificities)))    

write.csv(roc_tab_bag, "roc_tab_bag.csv", row.names = F)  


# boosting  

DIED_boost_predict = predict(DIED_boost, newdata = df[test, ], type = "response", n.trees = which.min(DIED_boost$cv.error))

DIED_boost_predict_df<- DIED_boost_predict %>% as.data.frame() %>% mutate(response = ifelse(`.`>0.5, 1, 0))
DIED_boost_predict_df %>% count(response)

confMatrix_boost = table(true = df[test, ]$DIED, predicted = DIED_boost_predict_df$response)    

classificationError(confMatrix_boost, df[test, ])      

pred <- DIED_boost_predict %>% as.data.frame() %>% .$`.` %>% as.numeric()

roc <- pROC::roc(df[test, ]$DIED  , pred)   

# Data: pred in 48822 controls (df[test, ]$DIED 0) < 1054 cases (df[test, ]$DIED 1).
# Area under the curve: 0.883

roc_tab_boost <- data.frame(sensitivity = roc$sensitivities, specificity = roc$specificities, test = rep("boost", length(roc$specificities)))    

write.csv(roc_tab_boost, "roc_tab_boost.csv", row.names = F)  

draw roc graphs


df1 <- read.csv("roc_tab_forwd.csv")   
df2 <- read.csv("roc_tab_rf.csv")   
df3 <- read.csv("roc_tab_bag.csv")     
df4 <- read.csv("roc_tab_boost.csv")   

roc_df <- rbind(df1, df2, df3, df4)    

Cairo::Cairo(file="roc.png",
      type="png",
      units="px", 
      width=800, 
      height=500, 
      pointsize=12, 
      dpi="auto")

roc_df %>% ggplot(aes(x = 1-specificity, y = sensitivity, group = test, color = test)) + geom_line()
dev.off()

png("roc.png", units="in", width=7, height=5, res=300, type="cairo")
roc_df %>% ggplot(aes(x = 1-specificity, y = sensitivity, group = test, color = test)) + geom_line(size=1, alpha=0.9)
dev.off()


plot variable importance

df1 <- read.csv("importance_rf.csv") %>% head(10)
df2 <- read.csv("importance_bg.csv")  %>% head(10)
df3 <- read.csv("importance_boost.csv")  %>% head(10)

variable_df <- rbind(df1, df2, df3) 


variable_df %>% ggplot(aes(x = fct_reorder2(var, rel.inf, test), y = rel.inf))+
  geom_col(aes(fill = var)) + coord_flip() + facet_wrap(~ test, ncol = 1, scales = "free")


new_order <- variable_df %>% 
  group_by(test) %>% 
  do(tibble(al=levels(reorder(interaction(.$test, .$var, drop=TRUE), .$rel.inf)))) %>% 
  pull(al)  

png("importance.png", units="in", width=7, height=7, res=300, type="cairo")
variable_df %>% 
  mutate(al=factor(interaction(test, var), levels=new_order)) %>%
  ggplot(aes(x = al, y = rel.inf)) +
    geom_col(aes(fill = var)) + coord_flip() + facet_wrap(~test, ncol = 1,  scales = "free") +
    scale_x_discrete(breaks= new_order, labels=gsub("^.*\\.", "", new_order)) + ylab("variable importance") + xlab("variables")
dev.off()

parameter plot
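# parameters.csv is not created anywhere above; it was presumably assembled by
# hand from the test metrics reported in the Results. A hypothetical skeleton:
parameter <- tibble::tribble(
  ~test,      ~parameter, ~value,
  "boosting", "auc",      0.883,
  "boosting", "error",    0.02
  # ... one row per method and metric
)
write.csv(parameter, "parameters.csv", row.names = FALSE)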

parameter <- read.csv("parameters.csv")  

head(parameter)

parameter %>% ggplot(aes(x = test, y = value)) + geom_point(aes(color = test), size=5,  alpha=0.7, shape=21, stroke=2) + coord_flip() + facet_wrap(~parameter, ncol = 2,  scales = "free")   

png("parameter.png", units="in", width=8, height=5, res=300, type="cairo")
parameter %>% ggplot(aes(x = test, y = value)) + geom_point(aes(color = test), size=3,  alpha=0.7, shape=21, stroke=2) + coord_flip() + facet_wrap(~parameter, ncol = 2,  scales = "free")  
dev.off()

boosting iteration error

boost_itr_error <- read.csv("boost_itr_error.csv")  

head(boost_itr_error)


png("itr_boost.png", units="in", width=8, height=5, res=300, type="cairo")
boost_itr_error %>%
 rename(itr = X) %>% # row names were written out as column "X"; use them as the iteration index
 pivot_longer(-itr, names_to = "error_type", values_to = "value") %>% ggplot(aes(x=itr, y=value, group = error_type)) +
  geom_line(aes(color = error_type), size=2, alpha=0.9, linetype=2) + geom_vline(xintercept = 4987, size = 2, alpha = 0.5, col = "steelblue") + annotate("text", x = 4200, y = 0.2, label = "the best cross validation \n iteration = 4987") + ylab("error") + xlab("iteration number")
dev.off()

knn

cl <- makeCluster(6, type="SOCK") # parallel processing on a 6-core machine
registerDoSNOW(cl)
set.seed(2020)  

df2 <- mydata %>% select(-"DIED") %>% filter(LOS <= 30)
# [1] 148547  

n = nrow(df2) 
n
train = sample(1:n, floor(0.7*n)) 


df2 <- fastDummies::dummy_cols(df2, remove_first_dummy = TRUE)
df2 <- df2[, !sapply(df2, is.factor)]    # drop the original factor columns, keep the dummies
dim(df2) # rows are unchanged by dummy coding; only the column count grows
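# Save the dummy-coded frame: the random forest and decision tree chunks below
# re-read it as "df2.csv", but the write step was missing from the original.
write.csv(df2, "df2.csv", row.names = FALSE)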

rmse = function(observed, predicted) sqrt(mean((observed - predicted)^2))

X <- df2 %>% select(-LOS) # LOS itself must not sit among the knn predictors

# this scan covers k = 100:150 (written to rmse_df2.csv); the k = 1:15 scan
# plotted in figure 6 (rmse_df0.csv) was produced the same way
mylist <- c(100:150) %>% map(function(x){ 
  
fit <- knn.reg(train = X[train, ], test = X[-train, ], y = df2$LOS[train], k = x) # y must match the training rows

rmse(df2[-train, ]$LOS, fit$pred)    

})



rmse_df <- data.frame(k = c(100:150), rmse = mylist %>% unlist()) 

write.csv(rmse_df, "rmse_df2.csv")  

rmse_df0 <- read.csv("rmse_df0.csv")
rmse_df <- read.csv("rmse_df.csv")
rmse_df2 <- read.csv("rmse_df2.csv")


png("rmse_k.png", units="in", width=8, height=5, res=300, type="cairo")
rmse_df0 %>% ggplot(aes(x = k, y = rmse)) + geom_line(size = 2, color = "steelblue")
dev.off()


# k = 15	# rmse = 4.230	
fit <- knn.reg(train = X[train, ], test = X[-train, ], y = df2$LOS[train], k = 15) 

R2_test_knn = 1 - sum((df2[-train, ]$LOS - fit$pred)^2)/sum((df2[-train, ]$LOS - mean(df2[-train, ]$LOS))^2)

# [1] -0.06108 

knn_obs_pred <- data.frame(obs = df2[-train, ]$LOS, pred = fit$pred, test = rep("knn", length(fit$pred)))   

write.csv(knn_obs_pred, "knn_obs_pred.csv") 



# histogram of LOS in the test set (the data and mapping were missing from this fragment)
ggplot(df2[-train, ], aes(x = LOS)) + geom_histogram(binwidth = 3, fill="#69b3a2", color="#e9ecef", alpha=0.9)  




random forest

cl <- makeCluster(6, type="SOCK") # parallel processing on a 6-core machine
registerDoSNOW(cl)
set.seed(2020)    
df2 <- read.csv("df2.csv")

LOS_rf = randomForest(LOS ~ ., data = df2[train, ], importance = TRUE)
# oob error rate
LOS_rf

#                      Number of trees: 500
# No. of variables tried at each split: 26

#           Mean of squared residuals: 10.74
#                     % Var explained: 36.19

sqrt(tail(LOS_rf$mse, 1)) # OOB RMSE; regression forests store $mse per tree rather than $err.rate

# Variable importance: plot with varImpPlot and extract it with randomForest::importance.

varImpPlot(LOS_rf, cex.lab = 0.5, cex.axis = 0.5, cex = 0.75, n.var = 20, main = "", pch = 16, col = "red4")

importance = randomForest::importance
important.par = importance(LOS_rf)[order(importance(LOS_rf)[, 1], decreasing = TRUE), ]
important.par.df<- important.par %>% data.frame()  


write.csv(important.par.df, "linear_importance_rf.csv", row.names = T)   


test.pred.forest <- predict(LOS_rf, newdata = df2[-train, ])
rf_obs_pred <- data.frame(obs = df2[-train, ]$LOS, pred = test.pred.forest, test = rep("rf", length(test.pred.forest)))   

write.csv(rf_obs_pred, "rf_obs_pred.csv") 

RMSE.forest <- sqrt(mean((test.pred.forest-df2[-train, ]$LOS)^2)) 
RMSE.forest  
# 3.279

R2_test_rf = 1 - sum((df2[-train, ]$LOS - test.pred.forest)^2)/sum((df2[-train, ]$LOS - mean(df2[-train, ]$LOS))^2)

# [1] 0.3624

# %IncMSE

# %IncMSE is the most robust and informative measure: the increase in MSE of
# predictions (estimated with out-of-bag CV) when variable j is permuted
# (its values randomly shuffled).

# 1. Grow the regression forest and compute the OOB MSE; call it mse0.
# 2. For each variable j: permute the values of column j, re-predict, and
#    compute OOB mse(j).
# 3. %IncMSE of the j-th variable is (mse(j) - mse0)/mse0 * 100%.
# The higher the number, the more important the variable.
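# Illustrative sketch of the permutation idea on the test set (randomForest
# computes %IncMSE from per-tree OOB samples internally; this is the analogue):
mse0 <- mean((predict(LOS_rf, df2[-train, ]) - df2[-train, ]$LOS)^2)
perm <- df2[-train, ]
perm$NPR <- sample(perm$NPR)  # shuffle one predictor, e.g. number of procedures
mse_j <- mean((predict(LOS_rf, perm) - perm$LOS)^2)
100 * (mse_j - mse0) / mse0   # percent increase in MSE attributable to NPR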


png("importance_linear.png", units="in", width=6, height=4, res=300, type="cairo")
important.par.df %>% rownames_to_column("variables") %>% head(10) %>% 
  ggplot(aes(x = reorder(variables, X.IncMSE), y = X.IncMSE)) +
    geom_col(aes(fill = variables)) + coord_flip() + ylab("%IncMSE(variable importance)") + xlab("variables") + theme(legend.position = "none")
dev.off()


decision tree

library(rpart)
library(rattle)

cl <- makeCluster(6, type="SOCK") # parallel processing on a 6-core machine
registerDoSNOW(cl)
set.seed(2020) 

rt <- rpart(LOS ~ ., data = df2[train,])

# n = 103982 

# node), split, n, deviance, yval
#       * denotes terminal node

#  1) root 103982 1751000  4.348  
#    2) NDX< 14.5 81031  848300  3.612  
#      4) NDX< 8.5 48208  348100  3.054  
#        8) diagnosis_Injury< 0.5 43773  229400  2.839 *
#        9) diagnosis_Injury>=0.5 4435   96710  5.177 *
#      5) NDX>=8.5 32823  463200  4.430 *
#    3) NDX>=14.5 22951  703400  6.948  
#      6) NPR< 3.5 17571  334900  5.765 *
#      7) NPR>=3.5 5380  263600 10.810 *

test.pred.rtree <- predict(rt,df2[-train,])
RMSE.rtree <- sqrt(mean((test.pred.rtree-df2[-train, ]$LOS)^2))
RMSE.rtree  
# [1] 3.666  


printcp(rt)  

# Variables actually used in tree construction:
# [1] diagnosis_Injury NDX              NPR             

# Root node error: 1750880/103982 = 16.84

# n= 103982 


png("rt_tree.png", units="in", width=8, height=5, res=300, type="cairo")
fancyRpartPlot(rt)
dev.off()


# Get the optimal CP programmatically...  

min.xerror <- rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"]  

min.xerror

# ...and use it to prune the tree
rt.pruned <- prune(rt, cp = min.xerror) 

fancyRpartPlot(rt.pruned)

# Evaluate the new pruned tree on the test set

test.pred.rtree.p <- predict(rt.pruned, df2[-train, ])    

tree_obs_pred <- data.frame(obs = df2[-train, ]$LOS, pred = test.pred.rtree.p, test = rep("decision_tree_tuned", length(test.pred.rtree.p)))   

write.csv(tree_obs_pred, "tree_obs_pred.csv") 

RMSE.rtree.pruned <- sqrt(mean((test.pred.rtree.p-df2[-train,]$LOS)^2))
RMSE.rtree.pruned
# [1] 3.666  

R2_test_tree = 1 - sum((df2[-train, ]$LOS - test.pred.rtree.p)^2)/sum((df2[-train, ]$LOS - mean(df2[-train, ]$LOS))^2)

# [1] 0.2028

cp plot


cp_tab <- rt$cptable %>% data.frame() %>% 
  mutate(CP = factor(round(CP, 3), levels = c("0.114", "0.06", "0.021", "0.013", "0.01")))

my_xlab <- paste(levels(cp_tab$CP), "\n(split=", cp_tab$nsplit, ")", sep="")  


png("cp_table.png", units="in", width=8, height=5, res=300, type="cairo")
cp_tab %>% ggplot(aes(x = CP, y = xerror, group = 1)) + geom_line( color="grey") +
    geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
    scale_x_discrete(labels=my_xlab) + ylab("X-val Relative Error") 
dev.off()

plot for regression


df3 <- read.csv("tree_obs_pred.csv")
df1 <- read.csv("knn_obs_pred.csv")
df2 <- read.csv("rf_obs_pred.csv")
df <- rbind(df1, df2, df3)  

png("correlation.png", units="in", width=10, height=3, res=300, type="cairo")
df %>%  ggplot(aes(x = obs, y = pred)) + geom_point(alpha = 0.1) + geom_smooth(method = "lm") + labs( x = "observed values in test set", y = "predicted values in test set")+ 
  facet_wrap(~test, ncol = 3,  scales = "free") 
dev.off()

parameters
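# As with parameters.csv above, parameters_linear.csv was presumably assembled
# by hand; a hypothetical skeleton using the reported test metrics:
parameters_linear <- tibble::tribble(
  ~test, ~parameter, ~value,
  "rf",  "r2",       0.3624,
  "rf",  "rmse",     3.279
  # ... one row per method and metric
)
write.csv(parameters_linear, "parameters_linear.csv", row.names = FALSE)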

parameters_linear <- read.csv("parameters_linear.csv")

png("parameter_linear.png", units="in", width=8, height=3, res=300, type="cairo")
parameters_linear %>% ggplot(aes(x = test, y = value)) + geom_point(aes(color = test), size=3,  alpha=0.7, shape=21, stroke=2) + coord_flip() + facet_wrap(~parameter, ncol = 3,  scales = "free") + theme(legend.position = "none")
dev.off()


  1. HCUP-US NIS Overview [Internet]. [cited 2020 Apr 30]. Available from: https://www.hcup-us.ahrq.gov/nisoverview.jsp.