library(tidyverse)
library(dplyr)
library(gt)
library(SmartEDA)
library(ggplot2)
library(hrbrthemes)
library(dataPreparation)
library(caret)
library(corrplot)
library(Boruta)
library(ROSE)
library(plotROC)
library(MLmetrics)
library(jtools)
library(kableExtra)
library(gridExtra)
library(glue)

1. Background

Hospital readmission costs make for a significant amount of hospital inpatient care expenses. Diabetes is not just one of the world’s top 10 leading causes of mortality, but it is also the most costly chronic disease in the United States. Diabetes patients in the hospital have a greater risk of readmission than those who do not have diabetes. As a result, lowering readmission rates for diabetes patients has the potential to considerably cut medical costs. The goal of this research is to forecast the possibility of a diabetic patient being readmitted. The dataset was collected from the University of California, Irvine’s Center for Machine Learning and Intelligent Systems and comprises over 100,000 characteristics and 50 variables such as the number of operations, drugs, and time spent in hospital etc.


2. Problem Statement

Identifying which factors are the strong predictors for predicting the readmission of diabetic patients and build a classification model with good accuracy for predicting readmission.


3. Description of Data

Feature name Type Description and values
encounter_id Numeric unique identifier of a patient visit
patient_nbr Numeric unique identifier of a patient
race Categorical values: Caucasian, Asian, African American, Hispanic, and other
gender Categorical values: male, female, and unknown/invalid
age Categorical grouped in 10-year intervals up to 100
weight Numeric patient weight in pounds
admission_type_id Categorical integer indicating emergency, urgent, elective, newborn, and not available
discharge_disposition_id Categorical integer indicating discharged to home, expired, and not available
admission_source_id Categorical integer indicating physician referral, emergency room, and transfer from a hospital
time_in_hospital Numeric number of days between admission and discharge
payer_code Categorical Integer identifier corresponding to 23 distinct values, for example, Blue Cross\Blue Shield, Medicare, and self-pay
medical_specialty Categorical identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family\general practice, and surgeon.
num_lab_procedures Numeric Number of lab tests performed during the encounter
num_procedures Numeric number of procedures (other than lab tests) performed
num_medications Numeric number of medications administered during the encounter
number_outpatient Numeric number of outpatient visits in the year preceding the encounter
number_emergency Numeric number of emergency visits in the year preceding the encounter
number_inpatient Numeric number of inpatient visits in the year preceding the encounter
diag_1 Categorical primary diagnosis, 848 distinct values/codes
diag_2 Categorical secondary diagnosis, 923 distinct values/codes
diag_3 Categorical additional secondary diagnosis, 954 distinct values/codes
number_diagnoses Numeric number of diagnoses entered to the system
max_glu_serum Categorical Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured
A1c_result Categorical Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.
change of medications Categorical Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”
diabetesMed Categorical dichotomous indicator of diabetic medication being prescribed
24 features for medications Categorical

For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage.

Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed

Readmitted Categorical Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.


db <- read.csv("D:/DS Project/Dataset/diabetic_data.csv")


4. Feature Engineering

4.1. Deal with Missing Values

Missing values in this dataset is recorded as “?” so replacing this value with “NA”.

db1 <- db%>% 
  filter(!(gender %in% "Unknown/Invalid")) %>%
  mutate_all(~replace(., .== "?", NA)) %>% 
  mutate_at(c("diag_1","diag_2","diag_3"),~replace(., is.na(.), 0))


Glimpse of Dataset

db1 %>% 
  head() %>% 
  gt()
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
2278392 8222157 Caucasian Female [0-10) NA 6 25 1 1 NA Pediatrics-Endocrinology 41 0 1 0 0 0 250.83 0 0 1 None None No No No No No No No No No No No No No No No No No No No No No No No No No NO
149190 55629189 Caucasian Female [10-20) NA 1 1 7 3 NA NA 59 0 18 0 0 0 276 250.01 255 9 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes >30
64410 86047875 AfricanAmerican Female [20-30) NA 1 1 7 2 NA NA 11 5 13 2 0 1 648 250 V27 6 None None No No No No No No Steady No No No No No No No No No No No No No No No No No Yes NO
500364 82442376 Caucasian Male [30-40) NA 1 1 7 2 NA NA 44 1 16 0 0 0 8 250.43 403 7 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes NO
16680 42519267 Caucasian Male [40-50) NA 1 1 7 1 NA NA 51 0 8 0 0 0 197 157 250 5 None None No No No No No No Steady No No No No No No No No No No Steady No No No No No Ch Yes NO
35754 82637451 Caucasian Male [50-60) NA 2 1 2 3 NA NA 31 6 16 0 0 0 414 411 250 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes >30


Counting Missing Value for each Variable

db1 %>% 
  ExpData(type = 2) %>% 
  gt()
Index Variable_Name Variable_Type Sample_n Missing_Count Per_of_Missing No_of_distinct_values
1 encounter_id integer 101763 0 0.000 101763
2 patient_nbr integer 101763 0 0.000 71515
3 race character 99492 2271 0.022 5
4 gender character 101763 0 0.000 2
5 age character 101763 0 0.000 10
6 weight character 3197 98566 0.969 9
7 admission_type_id integer 101763 0 0.000 8
8 discharge_disposition_id integer 101763 0 0.000 26
9 admission_source_id integer 101763 0 0.000 17
10 time_in_hospital integer 101763 0 0.000 14
11 payer_code character 61508 40255 0.396 17
12 medical_specialty character 51816 49947 0.491 72
13 num_lab_procedures integer 101763 0 0.000 118
14 num_procedures integer 101763 0 0.000 7
15 num_medications integer 101763 0 0.000 75
16 number_outpatient integer 101763 0 0.000 39
17 number_emergency integer 101763 0 0.000 33
18 number_inpatient integer 101763 0 0.000 21
19 diag_1 character 101763 0 0.000 717
20 diag_2 character 101763 0 0.000 749
21 diag_3 character 101763 0 0.000 790
22 number_diagnoses integer 101763 0 0.000 16
23 max_glu_serum character 101763 0 0.000 4
24 A1Cresult character 101763 0 0.000 4
25 metformin character 101763 0 0.000 4
26 repaglinide character 101763 0 0.000 4
27 nateglinide character 101763 0 0.000 4
28 chlorpropamide character 101763 0 0.000 4
29 glimepiride character 101763 0 0.000 4
30 acetohexamide character 101763 0 0.000 2
31 glipizide character 101763 0 0.000 4
32 glyburide character 101763 0 0.000 4
33 tolbutamide character 101763 0 0.000 2
34 pioglitazone character 101763 0 0.000 4
35 rosiglitazone character 101763 0 0.000 4
36 acarbose character 101763 0 0.000 4
37 miglitol character 101763 0 0.000 4
38 troglitazone character 101763 0 0.000 2
39 tolazamide character 101763 0 0.000 3
40 examide character 101763 0 0.000 1
41 citoglipton character 101763 0 0.000 1
42 insulin character 101763 0 0.000 4
43 glyburide.metformin character 101763 0 0.000 4
44 glipizide.metformin character 101763 0 0.000 2
45 glimepiride.pioglitazone character 101763 0 0.000 2
46 metformin.rosiglitazone character 101763 0 0.000 2
47 metformin.pioglitazone character 101763 0 0.000 2
48 change character 101763 0 0.000 2
49 diabetesMed character 101763 0 0.000 2
50 readmitted character 101763 0 0.000 3


4.2. Removing Columns

Variable Reason
Weight 96% Missing Value
Payer Code 40% Missing Value
Medical Speciality 50% Missing Value
Encounter ID Not Necessary because we have Unique Patient Number

examide, citoglipton, glimepiride.pioglitazone,

metformin.rosiglitazone

Homogenity of Data
db1$weight <- NULL
db1$payer_code <- NULL
db1$medical_specialty <- NULL
db1$encounter_id <- NULL
db1$examide <- NULL
db1$citoglipton <- NULL
db1$glimepiride.pioglitazone <- NULL
db1$metformin.rosiglitazone <- NULL


Removing remaining rows with missing value

db1 <- na.omit(db1)
dim(db1)
## [1] 99492    42


4.3. Feature Transformations

4.3.1. Patient Number

db1 <- db1[!duplicated(db1$patient_nbr),]
db1$patient_nbr <- NULL


4.3.2. Admission Type

Categorizing Admission type by Grouping similar classification into a major classification

db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 2, 1)
db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 6, 5)
db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 7, 1)
db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 8, 5)
db1$admission_type_id <- str_replace(db1$admission_type_id,"1", "Emergency")
db1$admission_type_id <- str_replace(db1$admission_type_id,"3", "Elective")
db1$admission_type_id <- str_replace(db1$admission_type_id,"4", "Newborn")
db1$admission_type_id <- str_replace(db1$admission_type_id,"5", "Others")
db1 <- rename(db1, admission_type = admission_type_id)
db1$admission_type <- as.factor(db1$admission_type)
db1$admission_type_id <- NULL


4.3.3. Admission Source

Categorizing Admission Source by Grouping similar classifications into a major classification

db1$admission_source_id <- case_when(
  db1$admission_source_id %in% c(1,2,3) ~ "Referral",
  db1$admission_source_id %in% c(4,5,6,10,18,19,22,25,26) ~ "Transfer",
  db1$admission_source_id %in% 3 ~ "Emergency Room",
  db1$admission_source_id %in% 4 ~ "Law Enforcement",
  db1$admission_source_id %in% c(11,12,13,14,23,24) ~ "Pregnancy Related",
  TRUE ~ "Unknown/NotAvailable")
db1 <- rename(db1, admission_source = admission_source_id)
db1$admission_source <- as.factor(db1$admission_source)
db1$admission_source_id <- NULL


4.3.4. Discharge Disposition

Categorizing Discharge Disposition by Grouping similar classifications into a major classification

Also Discharge codes with 8,11,13,14,19,20,21 were deleted because they were related to death or hospice.

db1 <- db1[!db1$discharge_disposition_id %in% c(8,11,13,14,19,20,21), ]
db1$discharge_disposition_id <- case_when(
  db1$discharge_disposition_id %in% 1 ~ "Discharged",
  db1$discharge_disposition_id %in% 12 ~ "Outpatient",
  db1$discharge_disposition_id %in% 3 ~ "Left",
  db1$discharge_disposition_id %in% 9 ~ "Admitted",
  db1$discharge_disposition_id %in% c(2,3,4,5,6,10,15,16,17,22,23,24,27,28,29,30) ~ "Transferred", TRUE ~ "Unknown/NotAvailable")
db1 <- rename(db1, discharge_disposition = discharge_disposition_id)
db1$discharge_disposition <- as.factor(db1$discharge_disposition)
db1$discharge_disposition_id <- NULL


4.3.4. Diagnosis Type

Categorizing Diagnosis type by Grouping similar classifications into a major classification for Primary, Secondary and Additional Diagnosis


Primary Diagnosis

db1$diag_1 <- as.character(db1$diag_1)

db1<- mutate(db1, primary_diagnosis =
    ifelse(str_detect(diag_1, "V") | str_detect(diag_1, "E"),"Other",     ifelse(str_detect(diag_1, "250"), "Diabetes",
    ifelse((as.integer(diag_1) >= 390 & as.integer(diag_1) <= 459) | as.integer(diag_1) == 785, "Circulatory",
    ifelse((as.integer(diag_1) >= 460 & as.integer(diag_1) <= 519) | as.integer(diag_1) == 786, "Respiratory", 
    ifelse((as.integer(diag_1) >= 520 & as.integer(diag_1) <= 579) | as.integer(diag_1) == 787, "Digestive", 
    ifelse((as.integer(diag_1) >= 580 & as.integer(diag_1) <= 629) | as.integer(diag_1) == 788, "Genitourinary",
    ifelse((as.integer(diag_1) >= 140 & as.integer(diag_1) <= 239), "Neoplasms",  
    ifelse((as.integer(diag_1) >= 710 & as.integer(diag_1) <= 739), "Musculoskeletal",          
    ifelse((as.integer(diag_1) >= 800 & as.integer(diag_1) <= 999), "Injury","Other"))))))))))

db1$diag_1 <- NULL


Secondary Diagnosis

db1$diag_2 <- as.character(db1$diag_2)

db1<- mutate(db1, secondary_diagnosis =
    ifelse(str_detect(diag_2, "V") | str_detect(diag_2, "E"),"Other",     ifelse(str_detect(diag_2, "250"), "Diabetes",
    ifelse((as.integer(diag_2) >= 390 & as.integer(diag_2) <= 459) | as.integer(diag_2) == 785, "Circulatory",
    ifelse((as.integer(diag_2) >= 460 & as.integer(diag_2) <= 519) | as.integer(diag_2) == 786, "Respiratory", 
    ifelse((as.integer(diag_2) >= 520 & as.integer(diag_2) <= 579) | as.integer(diag_2) == 787, "Digestive", 
    ifelse((as.integer(diag_2) >= 580 & as.integer(diag_2) <= 629) | as.integer(diag_2) == 788, "Genitourinary",
    ifelse((as.integer(diag_2) >= 140 & as.integer(diag_2) <= 239), "Neoplasms",  
    ifelse((as.integer(diag_2) >= 710 & as.integer(diag_2) <= 739), "Musculoskeletal",          
    ifelse((as.integer(diag_2) >= 800 & as.integer(diag_2) <= 999), "Injury","Other"))))))))))

db1$diag_2 <- NULL


Additional Diagnosis

db1$diag_3 <- as.character(db1$diag_3)

db1<- mutate(db1, additional_diagnosis =
    ifelse(str_detect(diag_3, "V") | str_detect(diag_3, "E"),"Other",     ifelse(str_detect(diag_3, "250"), "Diabetes",
    ifelse((as.integer(diag_3) >= 390 & as.integer(diag_3) <= 459) | as.integer(diag_3) == 785, "Circulatory",
    ifelse((as.integer(diag_3) >= 460 & as.integer(diag_3) <= 519) | as.integer(diag_3) == 786, "Respiratory", 
    ifelse((as.integer(diag_3) >= 520 & as.integer(diag_3) <= 579) | as.integer(diag_3) == 787, "Digestive", 
    ifelse((as.integer(diag_3) >= 580 & as.integer(diag_3) <= 629) | as.integer(diag_3) == 788, "Genitourinary",
    ifelse((as.integer(diag_3) >= 140 & as.integer(diag_3) <= 239), "Neoplasms",  
    ifelse((as.integer(diag_3) >= 710 & as.integer(diag_3) <= 739), "Musculoskeletal",          
    ifelse((as.integer(diag_3) >= 800 & as.integer(diag_3) <= 999), "Injury","Other"))))))))))

db1$diag_3 <- NULL


4.3.5. Age

Transforming the age to be the middle value in each given range

db1$age <- case_when(
  db1$age %in% "[0-10)" ~ 5,
  db1$age %in% "[10-20)" ~ 15,
  db1$age %in% "[20-30)" ~ 25,
  db1$age %in% "[30-40)" ~ 35,
  db1$age %in% "[40-50)" ~ 45,
  db1$age %in% "[50-60)" ~ 55,
  db1$age %in% "[60-70)" ~ 65,
  db1$age %in% "[70-80)" ~ 75,
  db1$age %in% "[80-90)" ~ 85,
  TRUE ~ 95
  )
db1$age <- as.numeric(db1$age)


4.3.6. Readmission

Transforming the readmitted column values to be 0 if patients were not readmitted, and 1 if patients were both readmitted within and after 30 days.

db1$readmitted <- case_when(db1$readmitted %in% c(">30","<30") ~ "1",
                            TRUE ~ "0")
db1$readmitted <- as.factor(db1$readmitted)


4.3.6 Converting Characters to Factors

Changing Categorical Variables into Factors with multiple “levels”.

cols <- c("race" ,"gender","max_glu_serum", "A1Cresult","metformin","repaglinide","nateglinide" ,"chlorpropamide" ,"glimepiride","acetohexamide","glipizide","glyburide","tolbutamide","pioglitazone","rosiglitazone" ,"acarbose","miglitol","troglitazone" ,"tolazamide",  "insulin","glyburide.metformin","glipizide.metformin","metformin.pioglitazone" ,"change","diabetesMed" , "primary_diagnosis","secondary_diagnosis","additional_diagnosis" )
db1 <- db1 %>% 
  mutate_each_(funs(factor(.)),cols)
db1 <- db1%>%
  mutate_if(is.integer,as.numeric)


4.4 Outlier Treatment

Identifing outliers in the numerical columns and removing them. To keep it simple, I have decided to only keep values that are within 3 standard deviations away from the mean for each feature of the dataset.

par(mfrow = c(2,4))
boxplot(db1$time_in_hospital, main = "time_in_hospital",col='Sky Blue')
boxplot(db1$num_lab_procedures, main = "num_lab_procedures",col='Sky Blue')
boxplot(db1$num_procedures, main = "num_procedures",col='Sky Blue')
boxplot(db1$num_medications, main = "num_medications",col='Sky Blue')
boxplot(db1$number_outpatient, main = "number_outpatient",col='Sky Blue')
boxplot(db1$number_emergency, main = "number_emergency",col='Sky Blue')
boxplot(db1$number_inpatient, main = "number_inpatient",col='Sky Blue')
boxplot(db1$number_diagnoses, main = "number_diagnoses",col='Sky Blue')

db2 <- remove_sd_outlier(db1, cols = "auto", n_sigmas = 3, verbose = FALSE)
##  [1] "race"                   "gender"                 "admission_type"        
##  [4] "discharge_disposition"  "admission_source"       "max_glu_serum"         
##  [7] "A1Cresult"              "metformin"              "repaglinide"           
## [10] "nateglinide"            "chlorpropamide"         "glimepiride"           
## [13] "acetohexamide"          "glipizide"              "glyburide"             
## [16] "tolbutamide"            "pioglitazone"           "rosiglitazone"         
## [19] "acarbose"               "miglitol"               "troglitazone"          
## [22] "tolazamide"             "insulin"                "glyburide.metformin"   
## [25] "glipizide.metformin"    "metformin.pioglitazone" "change"                
## [28] "diabetesMed"            "readmitted"             "primary_diagnosis"     
## [31] "secondary_diagnosis"    "additional_diagnosis"  
## [1] "remove_sd_outlier: c(\"race\", \"gender\", \"admission_type\", \"discharge_disposition\", \"admission_source\", \"max_glu_serum\", \"A1Cresult\", \"metformin\", \"repaglinide\", \"nateglinide\", \"chlorpropamide\", \"glimepiride\", \"acetohexamide\", \"glipizide\", \"glyburide\", \"tolbutamide\", \"pioglitazone\", \"rosiglitazone\", \"acarbose\", \"miglitol\", \"troglitazone\", \"tolazamide\", \"insulin\", \"glyburide.metformin\", \"glipizide.metformin\", \"metformin.pioglitazone\", \"change\", \"diabetesMed\", \"readmitted\", \"primary_diagnosis\", \"secondary_diagnosis\", \"additional_diagnosis\"\n) aren't columns of types numeric i do nothing for those variables."


4.5. Exploratory Data Analysis

#creating a function to plot numerical variable
plot_histogram <- function(df,var1,var2) {
  df %>%
    ggplot(aes(x = {{var1}},fill = {{var2}})) + 
    geom_histogram(alpha = 0.75,position = "stack",color = "black",bins = 30) +
    geom_vline(aes(xintercept = median({{var1}})), linetype = 2,size = 1) + 
    theme(legend.position = "none")
}

plot_density <- function(df,var1,var2){
  df %>%
    ggplot(aes(x = {{var1}},fill = {{var2}})) + 
    geom_density(alpha = 0.5,color = "black") +
    geom_vline(data = df %>%
                 group_by({{var2}}) %>%
                 summarize(mean.grp = mean({{var1}})),
               aes(xintercept = mean.grp,color ={{var2}}),
               linetype = "dashed",size = 1) +
    labs(caption = paste0("Lines represent average by group"),
         y = element_blank(), x = element_blank(), title = "")
    #theme(axis.ticks.y = element_blank(),axis.text.y = element_blank())
}

plot_numerical <- function(df,var1,var2) {
  p1 <- plot_histogram({{df}},{{var1}},{{var2}})
  p2 <- plot_density({{df}},{{var1}},{{var2}})
  
  grid.arrange(p1,p2,ncol = 2)
}


4.5.1. Exploring Numerical Variables

Readmission Rate

#Creating a function to count percent to re-admissions
summarise_readm <- function(tbl) {
  tbl %>%
    summarise(readm = sum(readmitted == 1),
              n = n(),
              low = qbeta(0.025,readm + 0.5,n - readm + 0.5),
              high = qbeta(0.975,readm + 0.5,n - readm + 0.5)) %>% 
              mutate(pct_readm = readm/n)
}
db2 %>% 
  group_by(readmitted) %>% 
  summarise(count = n()) %>%
  mutate(freq = formattable::percent(count/sum(count))) %>% 
  ggplot(aes(x = 2, y = count, fill = readmitted)) +
  geom_col() +
  geom_text(aes(label = formattable::percent(count/sum(count))),position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  xlim(c(0.2,2 + 0.5)) +
  theme_void() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.border = element_blank(),panel.grid=element_blank(),
        axis.ticks = element_blank(),
        plot.title=element_text(size=14, face="bold")) +
  labs(title = "Around 4 out of every 10 patients get Readmission" )


Age of Patients

Age has a higher impact on readmission?

plot_numerical(db2,age,readmitted)

As observed, Higher the age, higher will be the chance of Readmission.

This Hypothesis is TRUE


Time Spent in Hospital

plot_numerical(db2,time_in_hospital,readmitted)


Number of Lab Tests

plot_numerical(db2,num_lab_procedures,readmitted)


Number of Tests (Other than lab tests)

plot_numerical(db2,num_procedures,readmitted)


Number of Diagnoses

plot_numerical(db2,number_diagnoses,readmitted)


Summary Statistics of Numerical Variables

gt(ExpNumStat(db2,by="G",gp="readmitted",MesofShape=2,Outlier=FALSE,round=2))
Vname Group TN nNeg nZero nPos NegInf PosInf NA_Value Per_of_Missing sum min max mean median SD CV IQR Skewness Kurtosis
num_lab_procedures readmitted:0 37026 0 0 37026 0 0 0 0 1534860 1 102 41.45 43 19.70 0.48 26 -0.19 -0.35
num_lab_procedures readmitted:1 24311 0 0 24311 0 0 0 0 1053052 1 102 43.32 44 19.76 0.46 25 -0.27 -0.32
num_medications readmitted:0 37026 0 0 37026 0 0 0 0 548110 1 40 14.80 14 7.26 0.49 9 0.79 0.51
num_medications readmitted:1 24311 0 0 24311 0 0 0 0 377863 1 40 15.54 15 7.01 0.45 10 0.72 0.42
number_diagnoses readmitted:0 37026 0 0 37026 0 0 0 0 261338 2 13 7.06 8 2.01 0.28 4 -0.59 -0.89
number_diagnoses readmitted:1 24311 0 0 24311 0 0 0 0 181789 2 13 7.48 8 1.85 0.25 3 -0.93 -0.26
time_in_hospital readmitted:0 37026 0 0 37026 0 0 0 0 147217 1 13 3.98 3 2.71 0.68 3 1.18 1.04
time_in_hospital readmitted:1 24311 0 0 24311 0 0 0 0 105427 1 13 4.34 4 2.80 0.65 4 1.01 0.51


4.5.2. Exploring Categorical Variables

#creating a function to plot categorical variable
plot_categorical <- function(tbl,var1) 
  {
  tbl %>%
    group_by({{var1}}) %>%
    summarise_readm() %>%
    mutate({{var1}} := glue("{pull(., {{var1}})}\n ({n})")) %>%
    mutate({{var1}} := fct_reorder({{var1}},desc(pct_readm))) %>%
    ggplot(aes(x = {{var1}}, y = pct_readm)) + 
    geom_point(size = 4,shape = 18,aes(color = {{var1}})) +
    geom_errorbar(aes(ymin = low,ymax = high,
                      color = {{var1}}),size = 1) +
    theme(legend.position = "none") +
    geom_hline(aes(yintercept = sum(readm)/sum(n)),linetype = 3) +
    scale_y_continuous(labels = scales::percent) +
    labs(y=NULL,x=NULL)+
    coord_flip()
}


Gender

Women patients are more likely to be readmitted than men?

plot_categorical(db2,gender)

As observed, Women have higher chance of being readmitted than men.

This Hypothesis is TRUE


Race of Patients

African Americans are more likely to be re-admitted than other ethnic groups?

plot_categorical(db2,race)

As observed, African Americans have relatively low chance of Readmission than the Caucasian group.

This Hypothesis is FALSE


ggplot(data=db2,aes(x=db2$race,y = prop.table(stat(count)),fill = db2$race,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") + 
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  geom_text(stat = 'count',position = position_dodge(.5),vjust = -0.5,hjust = 0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(y=NULL,x=NULL)+
  theme(legend.position = "none",plot.title = element_text(size = 20, face = "bold"))


Admission Type

ggplot(data=db2,aes(x=db2$admission_type,y = prop.table(stat(count)),fill = db2$admission_type,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") +
   scale_fill_viridis_d(option = "viridis", direction = 1) +
  facet_wrap(~db2$readmitted) +
  geom_text(stat = 'count',position = position_dodge(.5),vjust = -0.5,hjust = 0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(y=NULL,x=NULL)+
  theme(legend.position = "none")


Admission Source

ggplot(data=db2,aes(x=db2$admission_source,y = prop.table(stat(count)),fill = db2$admission_source,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") + 
   scale_fill_viridis_d(option = "viridis", direction = 1) +
  facet_wrap(~db2$readmitted) +
  geom_text(stat = 'count',position = position_dodge(.5),vjust = -0.5,hjust = 0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(y=NULL,x=NULL)+
  theme(legend.position = "none")


HbA1c Result

Normal result of HbA1c Measurement leads to lower Hospital Readmission Rates?

plot_categorical(db2,A1Cresult)

Patients with no Hb1Ac obtained have higher chance of readmission than those who have obtained normal results.

This Hypothesis is TRUE


Blood Sugar Level

High Blood sugar level patients have higher chance of readmission?

plot_categorical(db2,max_glu_serum)

As observed, Patients with high blood sugar level have higher chance of Readmission.

This Hypothesis is TRUE


Insulin

plot_categorical(db2,insulin)

Diagnosis Type

Primary Diagnosis

db2 %>% 
  group_by(primary_diagnosis,readmitted) %>% 
  summarise(count = n()) %>%
  ggplot(.,mapping = aes(x=primary_diagnosis,y=readmitted,fill=count)) +
  geom_tile()+
  geom_text(aes(label=count),color="white")+
  labs(title = "Primary Diagnosis v/s Readmission") +
  theme_ipsum()


Secondary Diagnosis

db2 %>% 
  group_by(secondary_diagnosis,readmitted) %>% 
  summarise(count = n()) %>%
  ggplot(.,mapping = aes(x=secondary_diagnosis,y=readmitted,fill=count)) +
  geom_tile()+
  geom_text(aes(label=count),color="white")+
  labs(title = "Secondary Diagnosis v/s Readmission") +
  theme_ipsum()


Additional Diagnosis

db2 %>% 
  group_by(additional_diagnosis,readmitted) %>% 
  summarise(count = n()) %>%
  ggplot(.,mapping = aes(x=additional_diagnosis,y=readmitted,fill=count)) +
  geom_tile()+
  geom_text(aes(label=count),color="white")+
  labs(title = "Additional Diagnosis v/s Readmission") +
  theme_ipsum()


Key highlights

  • Majority Patients fall under Category of Discharged or Transferred.

  • On an Average patient spends 4.4 days in hospital.

  • Around 76% of the patients we have in the data are Caucasian i.e. White, with African-Americans patients in the data being around 19%.

  • Majority Patients fall under the ages from 50-90 years.

  • Our Target variable Readmitted shows 60% patients were not readmitted while 40% patients were readmitted.

  • We can see that higher diagnosis done on patients are related to Circulatory, Neoplasm and Diabetic in diagnosis 1, 2 and 3 as well. People suffering from Circulatory and Diabetes kind of diagnoses are more likely to get readmitted.


Hypothesis Results

HYPOTHESIS RESULT
Age has a higher impact on readmission? TRUE
Women patients are more likely to be re-admitted than men? TRUE
African Americans are more likely to be re-admitted than other ethnic groups? FALSE
Normal result of HbA1c Measurement leads to lower Hospital Readmission Rates? TRUE
High Blood sugar level patients have higher chance of readmission? TRUE


4.7. Check for Multicollinearity

corr <- db2 %>% 
  select(where(is.numeric)) 
cor_matrix <- cor(corr)
corrplot(cor_matrix, method = 'color', order = 'alphabet',addCoef.col = "navy blue")

We can see that there is No Multicollinearity in our Data.


6. Model Building

6.1. Feature Selection

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest

Below is the step wise working of boruta algorithm:

  1. Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).

  2. Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.

  3. At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z score than the maximum Z score of its shadow features) and constantly removes features which are deemed highly unimportant.

  4. Finally, the algorithm stops either when all features gets confirmed or rejected or it reaches a specified limit of random forest runs.

(Source : Analytics Vidya)


set.seed(111)
boruta <- Boruta(readmitted ~ ., data = db2, doTrace = 2, maxRuns = 100)
print(boruta)
#Output
#Boruta performed 99 iterations in 5.126252 hours.
 #25 attributes confirmed important: A1Cresult, additional_diagnosis,
#admission_source, admission_type, age and 20 more;
 #14 attributes confirmed unimportant: acarbose, acetohexamide,
#chlorpropamide, glimepiride, glipizide.metformin and 9 more;
 #1 tentative attributes left: rosiglitazone;

Final Features which are affecting my target are : 26

plot(boruta, las = 2, cex.axis = 0.7)
attStats(boruta)

Blue boxplots correspond to minimal, average and maximum Z score of a shadow attribute. Red, yellow and green boxplots represent Z scores of rejected, tentative and confirmed attributes respectively.

getNonRejectedFormula(boruta)
#Output
#readmitted ~ race + gender + age + admission_type + discharge_disposition + admission_source + time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + max_glu_serum + A1Cresult + metformin + repaglinide + glipizide + rosiglitazone + insulin + change + diabetesMed + primary_diagnosis + secondary_diagnosis + additional_diagnosis


Variable Importance


6.2. Preparation of final Dataset

Only Keeping that feature which were selected by boruta algorithm

final_df = select(db2,-19:-22,-24:-26,-28:-31,-33:-35)


Spliting Data into Training Set and Validation Set

set.seed(2022)
trainIndex1 <- createDataPartition(final_df$readmitted,p=0.70,list=FALSE)
#splitting data into training/testing data using the trainIndex object
TRAIN <- final_df[trainIndex1,] #training data (70% of data)
TEST <- final_df[-trainIndex1,] #testing data (30% of data)


6.3. Data Balancing

Imbalanced classification is a supervised learning problem where one class outnumbers other class by a large proportion. This problem is faced more frequently in binary classification problems than multi-level classification problems.

Random oversampling balances the data by randomly oversampling the minority class.

data_rose <- ROSE(readmitted ~., data = TRAIN)$data
table(data_rose$readmitted)
## 
##     0     1 
## 21418 21519
r1 <- ggplot(TRAIN,mapping = aes(x=readmitted,y = prop.table(stat(count)),fill = readmitted,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge",width = .5) + 
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  geom_text(stat = 'count',position = position_dodge(1),vjust = 0,hjust = 0,size = 4) +
  scale_y_continuous(labels = scales::percent)+
  labs(title = "Before ROSE",x=NULL,y=NULL)+
  theme(legend.position = "none")

r2 <- ggplot(data_rose,mapping = aes(x=readmitted,y = prop.table(stat(count)),fill = readmitted,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge",width = .5) + 
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  geom_text(stat = 'count',position = position_dodge(1),vjust = 0,hjust = 0,size = 4) +
  scale_y_continuous(labels = scales::percent)+
  labs(title = "After ROSE",x=NULL,y=NULL)+
  theme(legend.position = "none"
) 
grid.arrange(r1,r2, ncol = 2, nrow = 1)


6.4 Model Building

6.4.1. Model 1

logitMod1 <- glm(readmitted ~ race + gender + age + admission_type + discharge_disposition + admission_source + time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + max_glu_serum + A1Cresult + metformin + repaglinide + glipizide + rosiglitazone + insulin + change + diabetesMed + primary_diagnosis + secondary_diagnosis + additional_diagnosis ,data = data_rose,family = binomial)
summ(logitMod1,vifs = TRUE)
Observations 42937
Dependent variable readmitted
Type Generalized linear model
Family binomial
Link logit
χ²(72) 2169.99
Pseudo-R² (Cragg-Uhler) 0.07
Pseudo-R² (McFadden) 0.04
AIC 57499.10
BIC 58131.82
Est. S.E. z val. p VIF
(Intercept) -13.24 59.41 -0.22 0.82 NA
raceAsian -0.64 0.13 -5.11 0.00 1.09
raceCaucasian 0.01 0.03 0.28 0.78 1.09
raceHispanic -0.19 0.07 -2.62 0.01 1.09
raceOther -0.22 0.08 -2.78 0.01 1.09
genderMale -0.06 0.02 -3.17 0.00 1.04
age 0.01 0.00 9.09 0.00 1.26
admission_typeEmergency 0.17 0.03 5.07 0.00 2.44
admission_typeNewborn 0.21 1.00 0.21 0.83 2.44
admission_typeOthers 0.40 0.05 8.80 0.00 2.44
discharge_dispositionDischarged 0.51 1.23 0.42 0.68 1.37
discharge_dispositionLeft 0.57 1.23 0.46 0.64 1.37
discharge_dispositionOutpatient 11.17 119.47 0.09 0.93 1.37
discharge_dispositionTransferred 0.73 1.23 0.59 0.56 1.37
discharge_dispositionUnknown/NotAvailable 0.50 1.24 0.41 0.68 1.37
admission_sourceReferral 10.27 59.39 0.17 0.86 1.80
admission_sourceTransfer 9.60 59.39 0.16 0.87 1.80
admission_sourceUnknown/NotAvailable 10.23 59.39 0.17 0.86 1.80
time_in_hospital 0.02 0.00 5.91 0.00 1.38
num_lab_procedures 0.00 0.00 3.43 0.00 1.30
num_procedures -0.02 0.01 -3.93 0.00 1.28
num_medications 0.00 0.00 2.42 0.02 1.52
number_outpatient 0.14 0.02 8.57 0.00 1.03
number_emergency 0.35 0.04 8.54 0.00 1.02
number_inpatient 0.53 0.03 15.83 0.00 1.02
number_diagnoses 0.06 0.01 10.42 0.00 1.31
max_glu_serum>300 0.08 0.13 0.60 0.55 1.47
max_glu_serumNone 0.17 0.09 1.91 0.06 1.47
max_glu_serumNorm -0.01 0.11 -0.13 0.90 1.47
A1Cresult>8 0.11 0.06 1.76 0.08 1.19
A1CresultNone 0.19 0.05 3.80 0.00 1.19
A1CresultNorm -0.06 0.06 -0.98 0.33 1.19
metforminNo 0.22 0.13 1.71 0.09 1.49
metforminSteady 0.08 0.13 0.64 0.52 1.49
metforminUp 0.24 0.16 1.51 0.13 1.49
repaglinideNo 0.20 0.45 0.44 0.66 1.03
repaglinideSteady 0.42 0.46 0.92 0.36 1.03
repaglinideUp 0.09 0.59 0.16 0.87 1.03
glipizideNo -0.01 0.14 -0.07 0.94 1.20
glipizideSteady 0.04 0.14 0.28 0.78 1.20
glipizideUp 0.06 0.18 0.32 0.75 1.20
rosiglitazoneNo 0.61 0.36 1.72 0.09 1.11
rosiglitazoneSteady 0.84 0.36 2.34 0.02 1.11
rosiglitazoneUp 0.76 0.42 1.83 0.07 1.11
insulinNo -0.12 0.05 -2.52 0.01 2.92
insulinSteady -0.15 0.04 -3.78 0.00 2.92
insulinUp -0.09 0.04 -1.94 0.05 2.92
changeNo 0.01 0.03 0.32 0.75 2.50
diabetesMedYes 0.27 0.03 8.43 0.00 1.93
primary_diagnosisDiabetes 0.13 0.05 2.93 0.00 2.60
primary_diagnosisDigestive -0.06 0.04 -1.60 0.11 2.60
primary_diagnosisGenitourinary -0.15 0.05 -3.07 0.00 2.60
primary_diagnosisInjury -0.17 0.04 -3.85 0.00 2.60
primary_diagnosisMusculoskeletal -0.25 0.05 -5.20 0.00 2.60
primary_diagnosisNeoplasms -0.41 0.06 -6.83 0.00 2.60
primary_diagnosisOther -0.12 0.03 -3.65 0.00 2.60
primary_diagnosisRespiratory -0.11 0.03 -3.32 0.00 2.60
secondary_diagnosisDiabetes 0.01 0.04 0.36 0.72 2.14
secondary_diagnosisDigestive -0.18 0.06 -3.16 0.00 2.14
secondary_diagnosisGenitourinary -0.14 0.04 -3.32 0.00 2.14
secondary_diagnosisInjury -0.31 0.07 -4.55 0.00 2.14
secondary_diagnosisMusculoskeletal -0.14 0.08 -1.85 0.06 2.14
secondary_diagnosisNeoplasms 0.16 0.07 2.16 0.03 2.14
secondary_diagnosisOther -0.09 0.03 -3.20 0.00 2.14
secondary_diagnosisRespiratory -0.10 0.04 -2.68 0.01 2.14
additional_diagnosisDiabetes -0.06 0.03 -1.85 0.06 1.56
additional_diagnosisDigestive 0.03 0.05 0.64 0.52 1.56
additional_diagnosisGenitourinary -0.03 0.05 -0.72 0.47 1.56
additional_diagnosisInjury -0.34 0.08 -4.49 0.00 1.56
additional_diagnosisMusculoskeletal -0.21 0.08 -2.75 0.01 1.56
additional_diagnosisNeoplasms -0.14 0.08 -1.68 0.09 1.56
additional_diagnosisOther -0.15 0.03 -5.66 0.00 1.56
additional_diagnosisRespiratory -0.05 0.04 -1.12 0.26 1.56
Standard errors: MLE
logit_pred1 <- predict(logitMod1, TEST,type = "response")
pred1 <- ifelse(logit_pred1 >= 0.5, 1, 0)
cm1 <- confusionMatrix(factor(pred1), factor(TEST$readmitted),positive = "1")
draw_confusion_matrix(cm1)


AUC(logit_pred1,TEST$readmitted)
## [1] 0.6215771

58.9% Accuracy and 62.1% AUC


6.4.2. Model 2

Removed statistically Insignificant variables from the previous model to check if there is any improvement.

logitMod2 <- glm(readmitted ~ race + gender + age + admission_type +  time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + A1Cresult + rosiglitazone + insulin + change + diabetesMed + primary_diagnosis + secondary_diagnosis + additional_diagnosis,data = data_rose,family = binomial)
summ(logitMod2,vifs = TRUE)
Observations 42937
Dependent variable readmitted
Type Generalized linear model
Family binomial
Link logit
χ²(52) 1830.75
Pseudo-R² (Cragg-Uhler) 0.06
Pseudo-R² (McFadden) 0.03
AIC 57798.33
BIC 58257.71
Est. S.E. z val. p VIF
(Intercept) -1.99 0.37 -5.42 0.00 NA
raceAsian -0.63 0.12 -5.08 0.00 1.08
raceCaucasian 0.01 0.03 0.49 0.63 1.08
raceHispanic -0.17 0.07 -2.41 0.02 1.08
raceOther -0.20 0.08 -2.50 0.01 1.08
genderMale -0.06 0.02 -3.06 0.00 1.03
age 0.01 0.00 10.46 0.00 1.17
admission_typeEmergency 0.14 0.03 5.01 0.00 1.28
admission_typeNewborn -0.51 0.93 -0.55 0.58 1.28
admission_typeOthers 0.36 0.04 9.40 0.00 1.28
time_in_hospital 0.02 0.00 6.01 0.00 1.29
num_lab_procedures 0.00 0.00 3.79 0.00 1.26
num_procedures -0.03 0.01 -4.90 0.00 1.26
num_medications 0.00 0.00 3.05 0.00 1.50
number_outpatient 0.15 0.02 9.22 0.00 1.03
number_emergency 0.36 0.04 8.88 0.00 1.01
number_inpatient 0.52 0.03 15.73 0.00 1.01
number_diagnoses 0.07 0.01 12.63 0.00 1.28
A1Cresult>8 0.10 0.06 1.76 0.08 1.17
A1CresultNone 0.19 0.05 3.85 0.00 1.17
A1CresultNorm -0.06 0.06 -0.93 0.35 1.17
rosiglitazoneNo 0.60 0.35 1.69 0.09 1.10
rosiglitazoneSteady 0.83 0.36 2.33 0.02 1.10
rosiglitazoneUp 0.66 0.42 1.58 0.11 1.10
insulinNo -0.17 0.04 -4.18 0.00 2.19
insulinSteady -0.20 0.04 -5.20 0.00 2.19
insulinUp -0.08 0.04 -1.74 0.08 2.19
changeNo 0.04 0.03 1.56 0.12 1.93
diabetesMedYes 0.24 0.03 8.08 0.00 1.69
primary_diagnosisDiabetes 0.15 0.04 3.40 0.00 2.40
primary_diagnosisDigestive -0.07 0.04 -1.69 0.09 2.40
primary_diagnosisGenitourinary -0.15 0.05 -2.97 0.00 2.40
primary_diagnosisInjury -0.14 0.04 -3.12 0.00 2.40
primary_diagnosisMusculoskeletal -0.19 0.05 -3.92 0.00 2.40
primary_diagnosisNeoplasms -0.36 0.06 -6.16 0.00 2.40
primary_diagnosisOther -0.12 0.03 -3.60 0.00 2.40
primary_diagnosisRespiratory -0.12 0.03 -3.54 0.00 2.40
secondary_diagnosisDiabetes 0.03 0.04 0.98 0.33 2.10
secondary_diagnosisDigestive -0.17 0.06 -3.01 0.00 2.10
secondary_diagnosisGenitourinary -0.13 0.04 -3.08 0.00 2.10
secondary_diagnosisInjury -0.31 0.07 -4.57 0.00 2.10
secondary_diagnosisMusculoskeletal -0.15 0.08 -2.00 0.05 2.10
secondary_diagnosisNeoplasms 0.17 0.07 2.25 0.02 2.10
secondary_diagnosisOther -0.08 0.03 -2.89 0.00 2.10
secondary_diagnosisRespiratory -0.10 0.04 -2.61 0.01 2.10
additional_diagnosisDiabetes -0.05 0.03 -1.49 0.14 1.54
additional_diagnosisDigestive 0.04 0.05 0.74 0.46 1.54
additional_diagnosisGenitourinary -0.02 0.05 -0.43 0.66 1.54
additional_diagnosisInjury -0.33 0.07 -4.37 0.00 1.54
additional_diagnosisMusculoskeletal -0.20 0.08 -2.59 0.01 1.54
additional_diagnosisNeoplasms -0.13 0.08 -1.53 0.13 1.54
additional_diagnosisOther -0.14 0.03 -5.46 0.00 1.54
additional_diagnosisRespiratory -0.05 0.04 -1.19 0.23 1.54
Standard errors: MLE
logit_pred2 <- predict(logitMod2, TEST,type = "response")
pred2 <- ifelse(logit_pred2 >= 0.5, 1, 0)
cm2 <- confusionMatrix(factor(pred2), factor(TEST$readmitted),positive = "1")
draw_confusion_matrix(cm2)

AUC(logit_pred2,TEST$readmitted)
## [1] 0.6147122

58.3% Accuracy and 61.5% AUC


6.4.3. ROC Curve

plotdata.test <- data.frame(cbind(TEST$readmitted,
                                  logit_pred1,
                                  logit_pred2))
names(plotdata.test) <- c("Readmitted", "Model1", "Model2")

comparison.plot <- ggplot(plotdata.test, aes(m = Model1, d = Readmitted))+ 
  geom_roc(labels=FALSE, aes(color = 'Logit_Model1')) +
  labs(x = "False Positive Rate (FPR)", y = "True Positive Rate (TPR)") + geom_abline() +
  scale_color_discrete(name = 'Model') +
  geom_roc(data = plotdata.test, labels = FALSE, aes(m = Model2, d = Readmitted, color = 'Logit_Model2')) 
comparison.plot


7. Conclusion

This Model concludes that some key factors that drove readmission were time_in_hospital, num_lab_procedures , num_medications, number_diagnoses ,max_glu_serum ,A1Cresult, metformin, insulin, Medication_change and diagnosis.

We saw that when we removed some of the statistically insignificant variables from our first model, the Accuracy of model dropped from 58.9% to 58.3% and the AUC also got decreased from 62.1% to 61.5%. Therefore we consider our 1st model to be the best model.

This Model can provide doctors with updates on patients and predict which patient is more likely to readmit.

Hospital re-admissions are expensive and avoiding them is still a major challenge. It also creates a financial burden on patients and their families. Also, In US,the legislation penalises those hospitals with high readmission rate. These are the reasons why reducing readmission rate becomes necessity.