Risk Prediction Model for Hospital Readmission

library(tidyverse)
library(dplyr)
library(gt)
library(SmartEDA)
library(ggplot2)
library(hrbrthemes)
library(dataPreparation)
library(caret)
library(corrplot)
library(Boruta)
library(ROSE)
library(plotROC)
library(MLmetrics)
library(jtools)
library(kableExtra)
library(gridExtra)
library(glue)

1. Background

Hospital readmission costs make for a significant amount of hospital inpatient care expenses. Diabetes is not just one of the world’s top 10 leading causes of mortality, but it is also the most costly chronic disease in the United States. Diabetes patients in the hospital have a greater risk of readmission than those who do not have diabetes. As a result, lowering readmission rates for diabetes patients has the potential to considerably cut medical costs. The goal of this research is to forecast the possibility of a diabetic patient being readmitted. The dataset was collected from the University of California, Irvine’s Center for Machine Learning and Intelligent Systems and comprises over 100,000 characteristics and 50 variables such as the number of operations, drugs, and time spent in hospital etc.

2. Problem Statement

Identifying which factors are the strong predictors for predicting the readmission of diabetic patients and build a classification model with good accuracy for predicting readmission.

3. Description of Data

Feature name	Type	Description and values
encounter_id	Numeric	unique identifier of a patient visit
patient_nbr	Numeric	unique identifier of a patient
race	Categorical	values: Caucasian, Asian, African American, Hispanic, and other
gender	Categorical	values: male, female, and unknown/invalid
age	Categorical	grouped in 10-year intervals up to 100
weight	Numeric	patient weight in pounds
admission_type_id	Categorical	integer indicating emergency, urgent, elective, newborn, and not available
discharge_disposition_id	Categorical	integer indicating discharged to home, expired, and not available
admission_source_id	Categorical	integer indicating physician referral, emergency room, and transfer from a hospital
time_in_hospital	Numeric	number of days between admission and discharge
payer_code	Categorical	Integer identifier corresponding to 23 distinct values, for example, Blue Cross\Blue Shield, Medicare, and self-pay
medical_specialty	Categorical	identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family\general practice, and surgeon.
num_lab_procedures	Numeric	Number of lab tests performed during the encounter
num_procedures	Numeric	number of procedures (other than lab tests) performed
num_medications	Numeric	number of medications administered during the encounter
number_outpatient	Numeric	number of outpatient visits in the year preceding the encounter
number_emergency	Numeric	number of emergency visits in the year preceding the encounter
number_inpatient	Numeric	number of inpatient visits in the year preceding the encounter
diag_1	Categorical	primary diagnosis, 848 distinct values/codes
diag_2	Categorical	secondary diagnosis, 923 distinct values/codes
diag_3	Categorical	additional secondary diagnosis, 954 distinct values/codes
number_diagnoses	Numeric	number of diagnoses entered to the system
max_glu_serum	Categorical	Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured
A1c_result	Categorical	Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.
change of medications	Categorical	Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”
diabetesMed	Categorical	dichotomous indicator of diabetic medication being prescribed
24 features for medications	Categorical	For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed
Readmitted	Categorical	Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.

db <- read.csv("D:/DS Project/Dataset/diabetic_data.csv")

4. Feature Engineering

4.1. Deal with Missing Values

Missing values in this dataset is recorded as “?” so replacing this value with “NA”.

db1 <- db%>% 
  filter(!(gender %in% "Unknown/Invalid")) %>%
  mutate_all(~replace(., .== "?", NA)) %>% 
  mutate_at(c("diag_1","diag_2","diag_3"),~replace(., is.na(.), 0))

Glimpse of Dataset

db1 %>% 
  head() %>% 
  gt()

encounter_id	patient_nbr	race	gender	age	weight	admission_type_id	discharge_disposition_id	admission_source_id	time_in_hospital	payer_code	medical_specialty	num_lab_procedures	num_procedures	num_medications	number_outpatient	number_inpatient	diag_1	diag_2	diag_3	number_diagnoses	max_glu_serum	A1Cresult	metformin	repaglinide	nateglinide	chlorpropamide	glimepiride	acetohexamide	glipizide	glyburide	tolbutamide	pioglitazone	rosiglitazone	acarbose	miglitol	troglitazone	tolazamide	examide	citoglipton	insulin	glyburide.metformin	glipizide.metformin	glimepiride.pioglitazone	metformin.rosiglitazone	metformin.pioglitazone	change	diabetesMed	readmitted
2278392	8222157	Caucasian	Female	[0-10)	NA	6	25	1	1	NA	Pediatrics-Endocrinology	41	0	1	0	0	250.83	0	0	1	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	NO
149190	55629189	Caucasian	Female	[10-20)	NA	1	1	7	3	NA	NA	59	0	18	0	0	276	250.01	255	9	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Up	No	No	No	No	No	Ch	Yes	>30
64410	86047875	AfricanAmerican	Female	[20-30)	NA	1	1	7	2	NA	NA	11	5	13	2	1	648	250	V27	6	None	None	No	No	No	No	No	No	Steady	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Yes	NO
500364	82442376	Caucasian	Male	[30-40)	NA	1	1	7	2	NA	NA	44	1	16	0	0	8	250.43	403	7	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Up	No	No	No	No	No	Ch	Yes	NO
16680	42519267	Caucasian	Male	[40-50)	NA	1	1	7	1	NA	NA	51	0	8	0	0	197	157	250	5	None	None	No	No	No	No	No	No	Steady	No	No	No	No	No	No	No	No	No	No	Steady	No	No	No	No	No	Ch	Yes	NO
35754	82637451	Caucasian	Male	[50-60)	NA	2	1	2	3	NA	NA	31	6	16	0	0	414	411	250	9	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Steady	No	No	No	No	No	No	Yes	>30

Counting Missing Value for each Variable

db1 %>% 
  ExpData(type = 2) %>% 
  gt()

Index	Variable_Name	Variable_Type	Sample_n	Missing_Count	Per_of_Missing	No_of_distinct_values
1	encounter_id	integer	101763	0	0.000	101763
2	patient_nbr	integer	101763	0	0.000	71515
3	race	character	99492	2271	0.022	5
4	gender	character	101763	0	0.000	2
5	age	character	101763	0	0.000	10
6	weight	character	3197	98566	0.969	9
7	admission_type_id	integer	101763	0	0.000	8
8	discharge_disposition_id	integer	101763	0	0.000	26
9	admission_source_id	integer	101763	0	0.000	17
10	time_in_hospital	integer	101763	0	0.000	14
11	payer_code	character	61508	40255	0.396	17
12	medical_specialty	character	51816	49947	0.491	72
13	num_lab_procedures	integer	101763	0	0.000	118
14	num_procedures	integer	101763	0	0.000	7
15	num_medications	integer	101763	0	0.000	75
16	number_outpatient	integer	101763	0	0.000	39
17	number_emergency	integer	101763	0	0.000	33
18	number_inpatient	integer	101763	0	0.000	21
19	diag_1	character	101763	0	0.000	717
20	diag_2	character	101763	0	0.000	749
21	diag_3	character	101763	0	0.000	790
22	number_diagnoses	integer	101763	0	0.000	16
23	max_glu_serum	character	101763	0	0.000	4
24	A1Cresult	character	101763	0	0.000	4
25	metformin	character	101763	0	0.000	4
26	repaglinide	character	101763	0	0.000	4
27	nateglinide	character	101763	0	0.000	4
28	chlorpropamide	character	101763	0	0.000	4
29	glimepiride	character	101763	0	0.000	4
30	acetohexamide	character	101763	0	0.000	2
31	glipizide	character	101763	0	0.000	4
32	glyburide	character	101763	0	0.000	4
33	tolbutamide	character	101763	0	0.000	2
34	pioglitazone	character	101763	0	0.000	4
35	rosiglitazone	character	101763	0	0.000	4
36	acarbose	character	101763	0	0.000	4
37	miglitol	character	101763	0	0.000	4
38	troglitazone	character	101763	0	0.000	2
39	tolazamide	character	101763	0	0.000	3
40	examide	character	101763	0	0.000	1
41	citoglipton	character	101763	0	0.000	1
42	insulin	character	101763	0	0.000	4
43	glyburide.metformin	character	101763	0	0.000	4
44	glipizide.metformin	character	101763	0	0.000	2
45	glimepiride.pioglitazone	character	101763	0	0.000	2
46	metformin.rosiglitazone	character	101763	0	0.000	2
47	metformin.pioglitazone	character	101763	0	0.000	2
48	change	character	101763	0	0.000	2
49	diabetesMed	character	101763	0	0.000	2
50	readmitted	character	101763	0	0.000	3

4.2. Removing Columns

Variable	Reason
Weight	96% Missing Value
Payer Code	40% Missing Value
Medical Speciality	50% Missing Value
Encounter ID	Not Necessary because we have Unique Patient Number
examide, citoglipton, glimepiride.pioglitazone, metformin.rosiglitazone	Homogenity of Data

db1$weight <- NULL
db1$payer_code <- NULL
db1$medical_specialty <- NULL
db1$encounter_id <- NULL
db1$examide <- NULL
db1$citoglipton <- NULL
db1$glimepiride.pioglitazone <- NULL
db1$metformin.rosiglitazone <- NULL

Removing remaining rows with missing value

db1 <- na.omit(db1)
dim(db1)

## [1] 99492    42

4.3. Feature Transformations

4.3.1. Patient Number

db1 <- db1[!duplicated(db1$patient_nbr),]
db1$patient_nbr <- NULL

4.3.2. Admission Type

Categorizing Admission type by Grouping similar classification into a major classification

db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 2, 1)
db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 6, 5)
db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 7, 1)
db1$admission_type_id <- replace(db1$admission_type_id,db1$admission_type_id == 8, 5)

db1$admission_type_id <- str_replace(db1$admission_type_id,"1", "Emergency")
db1$admission_type_id <- str_replace(db1$admission_type_id,"3", "Elective")
db1$admission_type_id <- str_replace(db1$admission_type_id,"4", "Newborn")
db1$admission_type_id <- str_replace(db1$admission_type_id,"5", "Others")

db1 <- rename(db1, admission_type = admission_type_id)
db1$admission_type <- as.factor(db1$admission_type)
db1$admission_type_id <- NULL

4.3.3. Admission Source

Categorizing Admission Source by Grouping similar classifications into a major classification

db1$admission_source_id <- case_when(
  db1$admission_source_id %in% c(1,2,3) ~ "Referral",
  db1$admission_source_id %in% c(4,5,6,10,18,19,22,25,26) ~ "Transfer",
  db1$admission_source_id %in% 3 ~ "Emergency Room",
  db1$admission_source_id %in% 4 ~ "Law Enforcement",
  db1$admission_source_id %in% c(11,12,13,14,23,24) ~ "Pregnancy Related",
  TRUE ~ "Unknown/NotAvailable")

db1 <- rename(db1, admission_source = admission_source_id)
db1$admission_source <- as.factor(db1$admission_source)
db1$admission_source_id <- NULL

4.3.4. Discharge Disposition

Categorizing Discharge Disposition by Grouping similar classifications into a major classification

Also Discharge codes with 8,11,13,14,19,20,21 were deleted because they were related to death or hospice.

db1 <- db1[!db1$discharge_disposition_id %in% c(8,11,13,14,19,20,21), ]

db1$discharge_disposition_id <- case_when(
  db1$discharge_disposition_id %in% 1 ~ "Discharged",
  db1$discharge_disposition_id %in% 12 ~ "Outpatient",
  db1$discharge_disposition_id %in% 3 ~ "Left",
  db1$discharge_disposition_id %in% 9 ~ "Admitted",
  db1$discharge_disposition_id %in% c(2,3,4,5,6,10,15,16,17,22,23,24,27,28,29,30) ~ "Transferred", TRUE ~ "Unknown/NotAvailable")

db1 <- rename(db1, discharge_disposition = discharge_disposition_id)
db1$discharge_disposition <- as.factor(db1$discharge_disposition)
db1$discharge_disposition_id <- NULL

4.3.4. Diagnosis Type

Categorizing Diagnosis type by Grouping similar classifications into a major classification for Primary, Secondary and Additional Diagnosis

Primary Diagnosis

db1$diag_1 <- as.character(db1$diag_1)

db1<- mutate(db1, primary_diagnosis =
    ifelse(str_detect(diag_1, "V") | str_detect(diag_1, "E"),"Other",     ifelse(str_detect(diag_1, "250"), "Diabetes",
    ifelse((as.integer(diag_1) >= 390 & as.integer(diag_1) <= 459) | as.integer(diag_1) == 785, "Circulatory",
    ifelse((as.integer(diag_1) >= 460 & as.integer(diag_1) <= 519) | as.integer(diag_1) == 786, "Respiratory", 
    ifelse((as.integer(diag_1) >= 520 & as.integer(diag_1) <= 579) | as.integer(diag_1) == 787, "Digestive", 
    ifelse((as.integer(diag_1) >= 580 & as.integer(diag_1) <= 629) | as.integer(diag_1) == 788, "Genitourinary",
    ifelse((as.integer(diag_1) >= 140 & as.integer(diag_1) <= 239), "Neoplasms",  
    ifelse((as.integer(diag_1) >= 710 & as.integer(diag_1) <= 739), "Musculoskeletal",          
    ifelse((as.integer(diag_1) >= 800 & as.integer(diag_1) <= 999), "Injury","Other"))))))))))

db1$diag_1 <- NULL

Secondary Diagnosis

db1$diag_2 <- as.character(db1$diag_2)

db1<- mutate(db1, secondary_diagnosis =
    ifelse(str_detect(diag_2, "V") | str_detect(diag_2, "E"),"Other",     ifelse(str_detect(diag_2, "250"), "Diabetes",
    ifelse((as.integer(diag_2) >= 390 & as.integer(diag_2) <= 459) | as.integer(diag_2) == 785, "Circulatory",
    ifelse((as.integer(diag_2) >= 460 & as.integer(diag_2) <= 519) | as.integer(diag_2) == 786, "Respiratory", 
    ifelse((as.integer(diag_2) >= 520 & as.integer(diag_2) <= 579) | as.integer(diag_2) == 787, "Digestive", 
    ifelse((as.integer(diag_2) >= 580 & as.integer(diag_2) <= 629) | as.integer(diag_2) == 788, "Genitourinary",
    ifelse((as.integer(diag_2) >= 140 & as.integer(diag_2) <= 239), "Neoplasms",  
    ifelse((as.integer(diag_2) >= 710 & as.integer(diag_2) <= 739), "Musculoskeletal",          
    ifelse((as.integer(diag_2) >= 800 & as.integer(diag_2) <= 999), "Injury","Other"))))))))))

db1$diag_2 <- NULL

Additional Diagnosis

db1$diag_3 <- as.character(db1$diag_3)

db1<- mutate(db1, additional_diagnosis =
    ifelse(str_detect(diag_3, "V") | str_detect(diag_3, "E"),"Other",     ifelse(str_detect(diag_3, "250"), "Diabetes",
    ifelse((as.integer(diag_3) >= 390 & as.integer(diag_3) <= 459) | as.integer(diag_3) == 785, "Circulatory",
    ifelse((as.integer(diag_3) >= 460 & as.integer(diag_3) <= 519) | as.integer(diag_3) == 786, "Respiratory", 
    ifelse((as.integer(diag_3) >= 520 & as.integer(diag_3) <= 579) | as.integer(diag_3) == 787, "Digestive", 
    ifelse((as.integer(diag_3) >= 580 & as.integer(diag_3) <= 629) | as.integer(diag_3) == 788, "Genitourinary",
    ifelse((as.integer(diag_3) >= 140 & as.integer(diag_3) <= 239), "Neoplasms",  
    ifelse((as.integer(diag_3) >= 710 & as.integer(diag_3) <= 739), "Musculoskeletal",          
    ifelse((as.integer(diag_3) >= 800 & as.integer(diag_3) <= 999), "Injury","Other"))))))))))

db1$diag_3 <- NULL

4.3.5. Age

Transforming the age to be the middle value in each given range

db1$age <- case_when(
  db1$age %in% "[0-10)" ~ 5,
  db1$age %in% "[10-20)" ~ 15,
  db1$age %in% "[20-30)" ~ 25,
  db1$age %in% "[30-40)" ~ 35,
  db1$age %in% "[40-50)" ~ 45,
  db1$age %in% "[50-60)" ~ 55,
  db1$age %in% "[60-70)" ~ 65,
  db1$age %in% "[70-80)" ~ 75,
  db1$age %in% "[80-90)" ~ 85,
  TRUE ~ 95
  )
db1$age <- as.numeric(db1$age)

4.3.6. Readmission

Transforming the readmitted column values to be 0 if patients were not readmitted, and 1 if patients were both readmitted within and after 30 days.

db1$readmitted <- case_when(db1$readmitted %in% c(">30","<30") ~ "1",
                            TRUE ~ "0")
db1$readmitted <- as.factor(db1$readmitted)

4.3.6 Converting Characters to Factors

Changing Categorical Variables into Factors with multiple “levels”.

cols <- c("race" ,"gender","max_glu_serum", "A1Cresult","metformin","repaglinide","nateglinide" ,"chlorpropamide" ,"glimepiride","acetohexamide","glipizide","glyburide","tolbutamide","pioglitazone","rosiglitazone" ,"acarbose","miglitol","troglitazone" ,"tolazamide",  "insulin","glyburide.metformin","glipizide.metformin","metformin.pioglitazone" ,"change","diabetesMed" , "primary_diagnosis","secondary_diagnosis","additional_diagnosis" )
db1 <- db1 %>% 
  mutate_each_(funs(factor(.)),cols)

db1 <- db1%>%
  mutate_if(is.integer,as.numeric)

4.4 Outlier Treatment

Identifing outliers in the numerical columns and removing them. To keep it simple, I have decided to only keep values that are within 3 standard deviations away from the mean for each feature of the dataset.

par(mfrow = c(2,4))
boxplot(db1$time_in_hospital, main = "time_in_hospital",col='Sky Blue')
boxplot(db1$num_lab_procedures, main = "num_lab_procedures",col='Sky Blue')
boxplot(db1$num_procedures, main = "num_procedures",col='Sky Blue')
boxplot(db1$num_medications, main = "num_medications",col='Sky Blue')
boxplot(db1$number_outpatient, main = "number_outpatient",col='Sky Blue')
boxplot(db1$number_emergency, main = "number_emergency",col='Sky Blue')
boxplot(db1$number_inpatient, main = "number_inpatient",col='Sky Blue')
boxplot(db1$number_diagnoses, main = "number_diagnoses",col='Sky Blue')

db2 <- remove_sd_outlier(db1, cols = "auto", n_sigmas = 3, verbose = FALSE)

##  [1] "race"                   "gender"                 "admission_type"        
##  [4] "discharge_disposition"  "admission_source"       "max_glu_serum"         
##  [7] "A1Cresult"              "metformin"              "repaglinide"           
## [10] "nateglinide"            "chlorpropamide"         "glimepiride"           
## [13] "acetohexamide"          "glipizide"              "glyburide"             
## [16] "tolbutamide"            "pioglitazone"           "rosiglitazone"         
## [19] "acarbose"               "miglitol"               "troglitazone"          
## [22] "tolazamide"             "insulin"                "glyburide.metformin"   
## [25] "glipizide.metformin"    "metformin.pioglitazone" "change"                
## [28] "diabetesMed"            "readmitted"             "primary_diagnosis"     
## [31] "secondary_diagnosis"    "additional_diagnosis"  
## [1] "remove_sd_outlier: c(\"race\", \"gender\", \"admission_type\", \"discharge_disposition\", \"admission_source\", \"max_glu_serum\", \"A1Cresult\", \"metformin\", \"repaglinide\", \"nateglinide\", \"chlorpropamide\", \"glimepiride\", \"acetohexamide\", \"glipizide\", \"glyburide\", \"tolbutamide\", \"pioglitazone\", \"rosiglitazone\", \"acarbose\", \"miglitol\", \"troglitazone\", \"tolazamide\", \"insulin\", \"glyburide.metformin\", \"glipizide.metformin\", \"metformin.pioglitazone\", \"change\", \"diabetesMed\", \"readmitted\", \"primary_diagnosis\", \"secondary_diagnosis\", \"additional_diagnosis\"\n) aren't columns of types numeric i do nothing for those variables."

4.5. Exploratory Data Analysis

#creating a function to plot numerical variable
plot_histogram <- function(df,var1,var2) {
  df %>%
    ggplot(aes(x = {{var1}},fill = {{var2}})) + 
    geom_histogram(alpha = 0.75,position = "stack",color = "black",bins = 30) +
    geom_vline(aes(xintercept = median({{var1}})), linetype = 2,size = 1) + 
    theme(legend.position = "none")
}

plot_density <- function(df,var1,var2){
  df %>%
    ggplot(aes(x = {{var1}},fill = {{var2}})) + 
    geom_density(alpha = 0.5,color = "black") +
    geom_vline(data = df %>%
                 group_by({{var2}}) %>%
                 summarize(mean.grp = mean({{var1}})),
               aes(xintercept = mean.grp,color ={{var2}}),
               linetype = "dashed",size = 1) +
    labs(caption = paste0("Lines represent average by group"),
         y = element_blank(), x = element_blank(), title = "")
    #theme(axis.ticks.y = element_blank(),axis.text.y = element_blank())
}

plot_numerical <- function(df,var1,var2) {
  p1 <- plot_histogram({{df}},{{var1}},{{var2}})
  p2 <- plot_density({{df}},{{var1}},{{var2}})
  
  grid.arrange(p1,p2,ncol = 2)
}

4.5.1. Exploring Numerical Variables

Readmission Rate

#Creating a function to count percent to re-admissions
summarise_readm <- function(tbl) {
  tbl %>%
    summarise(readm = sum(readmitted == 1),
              n = n(),
              low = qbeta(0.025,readm + 0.5,n - readm + 0.5),
              high = qbeta(0.975,readm + 0.5,n - readm + 0.5)) %>% 
              mutate(pct_readm = readm/n)
}

db2 %>% 
  group_by(readmitted) %>% 
  summarise(count = n()) %>%
  mutate(freq = formattable::percent(count/sum(count))) %>% 
  ggplot(aes(x = 2, y = count, fill = readmitted)) +
  geom_col() +
  geom_text(aes(label = formattable::percent(count/sum(count))),position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  xlim(c(0.2,2 + 0.5)) +
  theme_void() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.border = element_blank(),panel.grid=element_blank(),
        axis.ticks = element_blank(),
        plot.title=element_text(size=14, face="bold")) +
  labs(title = "Around 4 out of every 10 patients get Readmission" )

Age of Patients

Age has a higher impact on readmission?

plot_numerical(db2,age,readmitted)

As observed, Higher the age, higher will be the chance of Readmission.

This Hypothesis is TRUE

Time Spent in Hospital

plot_numerical(db2,time_in_hospital,readmitted)

Number of Lab Tests

plot_numerical(db2,num_lab_procedures,readmitted)

Number of Tests (Other than lab tests)

plot_numerical(db2,num_procedures,readmitted)

Number of Diagnoses

plot_numerical(db2,number_diagnoses,readmitted)

Summary Statistics of Numerical Variables

gt(ExpNumStat(db2,by="G",gp="readmitted",MesofShape=2,Outlier=FALSE,round=2))

Vname	Group	TN	nPos	sum	min	max	mean	median	SD	CV	IQR	Skewness	Kurtosis
num_lab_procedures	readmitted:0	37026	37026	1534860	1	102	41.45	43	19.70	0.48	26	-0.19	-0.35
num_lab_procedures	readmitted:1	24311	24311	1053052	1	102	43.32	44	19.76	0.46	25	-0.27	-0.32
num_medications	readmitted:0	37026	37026	548110	1	40	14.80	14	7.26	0.49	9	0.79	0.51
num_medications	readmitted:1	24311	24311	377863	1	40	15.54	15	7.01	0.45	10	0.72	0.42
number_diagnoses	readmitted:0	37026	37026	261338	2	13	7.06	8	2.01	0.28	4	-0.59	-0.89
number_diagnoses	readmitted:1	24311	24311	181789	2	13	7.48	8	1.85	0.25	3	-0.93	-0.26
time_in_hospital	readmitted:0	37026	37026	147217	1	13	3.98	3	2.71	0.68	3	1.18	1.04
time_in_hospital	readmitted:1	24311	24311	105427	1	13	4.34	4	2.80	0.65	4	1.01	0.51

4.5.2. Exploring Categorical Variables

#creating a function to plot categorical variable
plot_categorical <- function(tbl,var1) 
  {
  tbl %>%
    group_by({{var1}}) %>%
    summarise_readm() %>%
    mutate({{var1}} := glue("{pull(., {{var1}})}\n ({n})")) %>%
    mutate({{var1}} := fct_reorder({{var1}},desc(pct_readm))) %>%
    ggplot(aes(x = {{var1}}, y = pct_readm)) + 
    geom_point(size = 4,shape = 18,aes(color = {{var1}})) +
    geom_errorbar(aes(ymin = low,ymax = high,
                      color = {{var1}}),size = 1) +
    theme(legend.position = "none") +
    geom_hline(aes(yintercept = sum(readm)/sum(n)),linetype = 3) +
    scale_y_continuous(labels = scales::percent) +
    labs(y=NULL,x=NULL)+
    coord_flip()
}

Gender

Women patients are more likely to be readmitted than men?

plot_categorical(db2,gender)

As observed, Women have higher chance of being readmitted than men.

This Hypothesis is TRUE

Race of Patients

African Americans are more likely to be re-admitted than other ethnic groups?

plot_categorical(db2,race)

As observed, African Americans have relatively low chance of Readmission than the Caucasian group.

This Hypothesis is FALSE

ggplot(data=db2,aes(x=db2$race,y = prop.table(stat(count)),fill = db2$race,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") + 
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  geom_text(stat = 'count',position = position_dodge(.5),vjust = -0.5,hjust = 0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(y=NULL,x=NULL)+
  theme(legend.position = "none",plot.title = element_text(size = 20, face = "bold"))

Admission Type

ggplot(data=db2,aes(x=db2$admission_type,y = prop.table(stat(count)),fill = db2$admission_type,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") +
   scale_fill_viridis_d(option = "viridis", direction = 1) +
  facet_wrap(~db2$readmitted) +
  geom_text(stat = 'count',position = position_dodge(.5),vjust = -0.5,hjust = 0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(y=NULL,x=NULL)+
  theme(legend.position = "none")

Admission Source

ggplot(data=db2,aes(x=db2$admission_source,y = prop.table(stat(count)),fill = db2$admission_source,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") + 
   scale_fill_viridis_d(option = "viridis", direction = 1) +
  facet_wrap(~db2$readmitted) +
  geom_text(stat = 'count',position = position_dodge(.5),vjust = -0.5,hjust = 0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(y=NULL,x=NULL)+
  theme(legend.position = "none")

HbA1c Result

Normal result of HbA1c Measurement leads to lower Hospital Readmission Rates?

plot_categorical(db2,A1Cresult)

Patients with no Hb1Ac obtained have higher chance of readmission than those who have obtained normal results.

This Hypothesis is TRUE

Blood Sugar Level

High Blood sugar level patients have higher chance of readmission?

plot_categorical(db2,max_glu_serum)

As observed, Patients with high blood sugar level have higher chance of Readmission.

This Hypothesis is TRUE

Insulin

plot_categorical(db2,insulin)

Diagnosis Type

Primary Diagnosis

db2 %>% 
  group_by(primary_diagnosis,readmitted) %>% 
  summarise(count = n()) %>%
  ggplot(.,mapping = aes(x=primary_diagnosis,y=readmitted,fill=count)) +
  geom_tile()+
  geom_text(aes(label=count),color="white")+
  labs(title = "Primary Diagnosis v/s Readmission") +
  theme_ipsum()

Secondary Diagnosis

db2 %>% 
  group_by(secondary_diagnosis,readmitted) %>% 
  summarise(count = n()) %>%
  ggplot(.,mapping = aes(x=secondary_diagnosis,y=readmitted,fill=count)) +
  geom_tile()+
  geom_text(aes(label=count),color="white")+
  labs(title = "Secondary Diagnosis v/s Readmission") +
  theme_ipsum()

Additional Diagnosis

db2 %>% 
  group_by(additional_diagnosis,readmitted) %>% 
  summarise(count = n()) %>%
  ggplot(.,mapping = aes(x=additional_diagnosis,y=readmitted,fill=count)) +
  geom_tile()+
  geom_text(aes(label=count),color="white")+
  labs(title = "Additional Diagnosis v/s Readmission") +
  theme_ipsum()

Key highlights

Majority Patients fall under Category of Discharged or Transferred.
On an Average patient spends 4.4 days in hospital.
Around 76% of the patients we have in the data are Caucasian i.e. White, with African-Americans patients in the data being around 19%.
Majority Patients fall under the ages from 50-90 years.
Our Target variable Readmitted shows 60% patients were not readmitted while 40% patients were readmitted.
We can see that higher diagnosis done on patients are related to Circulatory, Neoplasm and Diabetic in diagnosis 1, 2 and 3 as well. People suffering from Circulatory and Diabetes kind of diagnoses are more likely to get readmitted.

Hypothesis Results

HYPOTHESIS	RESULT
Age has a higher impact on readmission?	TRUE
Women patients are more likely to be re-admitted than men?	TRUE
African Americans are more likely to be re-admitted than other ethnic groups?	FALSE
Normal result of HbA1c Measurement leads to lower Hospital Readmission Rates?	TRUE
High Blood sugar level patients have higher chance of readmission?	TRUE

4.7. Check for Multicollinearity

corr <- db2 %>% 
  select(where(is.numeric)) 
cor_matrix <- cor(corr)
corrplot(cor_matrix, method = 'color', order = 'alphabet',addCoef.col = "navy blue")

We can see that there is No Multicollinearity in our Data.

6. Model Building

6.1. Feature Selection

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest

Below is the step wise working of boruta algorithm:

Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.
At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z score than the maximum Z score of its shadow features) and constantly removes features which are deemed highly unimportant.
Finally, the algorithm stops either when all features gets confirmed or rejected or it reaches a specified limit of random forest runs.

(Source : Analytics Vidya)

set.seed(111)
boruta <- Boruta(readmitted ~ ., data = db2, doTrace = 2, maxRuns = 100)
print(boruta)
#Output
#Boruta performed 99 iterations in 5.126252 hours.
 #25 attributes confirmed important: A1Cresult, additional_diagnosis,
#admission_source, admission_type, age and 20 more;
 #14 attributes confirmed unimportant: acarbose, acetohexamide,
#chlorpropamide, glimepiride, glipizide.metformin and 9 more;
 #1 tentative attributes left: rosiglitazone;

Final Features which are affecting my target are : 26

plot(boruta, las = 2, cex.axis = 0.7)
attStats(boruta)

Blue boxplots correspond to minimal, average and maximum Z score of a shadow attribute. Red, yellow and green boxplots represent Z scores of rejected, tentative and confirmed attributes respectively.

getNonRejectedFormula(boruta)
#Output
#readmitted ~ race + gender + age + admission_type + discharge_disposition + admission_source + time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + max_glu_serum + A1Cresult + metformin + repaglinide + glipizide + rosiglitazone + insulin + change + diabetesMed + primary_diagnosis + secondary_diagnosis + additional_diagnosis

Variable Importance

6.2. Preparation of final Dataset

Only Keeping that feature which were selected by boruta algorithm

final_df = select(db2,-19:-22,-24:-26,-28:-31,-33:-35)

Spliting Data into Training Set and Validation Set

set.seed(2022)
trainIndex1 <- createDataPartition(final_df$readmitted,p=0.70,list=FALSE)
#splitting data into training/testing data using the trainIndex object
TRAIN <- final_df[trainIndex1,] #training data (70% of data)
TEST <- final_df[-trainIndex1,] #testing data (30% of data)

6.3. Data Balancing

Imbalanced classification is a supervised learning problem where one class outnumbers other class by a large proportion. This problem is faced more frequently in binary classification problems than multi-level classification problems.

Random oversampling balances the data by randomly oversampling the minority class.

data_rose <- ROSE(readmitted ~., data = TRAIN)$data
table(data_rose$readmitted)

## 
##     0     1 
## 21418 21519

r1 <- ggplot(TRAIN,mapping = aes(x=readmitted,y = prop.table(stat(count)),fill = readmitted,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge",width = .5) + 
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  geom_text(stat = 'count',position = position_dodge(1),vjust = 0,hjust = 0,size = 4) +
  scale_y_continuous(labels = scales::percent)+
  labs(title = "Before ROSE",x=NULL,y=NULL)+
  theme(legend.position = "none")

r2 <- ggplot(data_rose,mapping = aes(x=readmitted,y = prop.table(stat(count)),fill = readmitted,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge",width = .5) + 
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  geom_text(stat = 'count',position = position_dodge(1),vjust = 0,hjust = 0,size = 4) +
  scale_y_continuous(labels = scales::percent)+
  labs(title = "After ROSE",x=NULL,y=NULL)+
  theme(legend.position = "none"
) 
grid.arrange(r1,r2, ncol = 2, nrow = 1)

6.4 Model Building

6.4.1. Model 1

logitMod1 <- glm(readmitted ~ race + gender + age + admission_type + discharge_disposition + admission_source + time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + max_glu_serum + A1Cresult + metformin + repaglinide + glipizide + rosiglitazone + insulin + change + diabetesMed + primary_diagnosis + secondary_diagnosis + additional_diagnosis ,data = data_rose,family = binomial)
summ(logitMod1,vifs = TRUE)

Observations	42937
Dependent variable	readmitted
Type	Generalized linear model
Family	binomial
Link	logit

χ²(72)	2169.99
Pseudo-R² (Cragg-Uhler)	0.07
Pseudo-R² (McFadden)	0.04
AIC	57499.10
BIC	58131.82

	Est.	S.E.	z val.	p	VIF
(Intercept)	-13.24	59.41	-0.22	0.82	NA
raceAsian	-0.64	0.13	-5.11	0.00	1.09
raceCaucasian	0.01	0.03	0.28	0.78	1.09
raceHispanic	-0.19	0.07	-2.62	0.01	1.09
raceOther	-0.22	0.08	-2.78	0.01	1.09
genderMale	-0.06	0.02	-3.17	0.00	1.04
age	0.01	0.00	9.09	0.00	1.26
admission_typeEmergency	0.17	0.03	5.07	0.00	2.44
admission_typeNewborn	0.21	1.00	0.21	0.83	2.44
admission_typeOthers	0.40	0.05	8.80	0.00	2.44
discharge_dispositionDischarged	0.51	1.23	0.42	0.68	1.37
discharge_dispositionLeft	0.57	1.23	0.46	0.64	1.37
discharge_dispositionOutpatient	11.17	119.47	0.09	0.93	1.37
discharge_dispositionTransferred	0.73	1.23	0.59	0.56	1.37
discharge_dispositionUnknown/NotAvailable	0.50	1.24	0.41	0.68	1.37
admission_sourceReferral	10.27	59.39	0.17	0.86	1.80
admission_sourceTransfer	9.60	59.39	0.16	0.87	1.80
admission_sourceUnknown/NotAvailable	10.23	59.39	0.17	0.86	1.80
time_in_hospital	0.02	0.00	5.91	0.00	1.38
num_lab_procedures	0.00	0.00	3.43	0.00	1.30
num_procedures	-0.02	0.01	-3.93	0.00	1.28
num_medications	0.00	0.00	2.42	0.02	1.52
number_outpatient	0.14	0.02	8.57	0.00	1.03
number_emergency	0.35	0.04	8.54	0.00	1.02
number_inpatient	0.53	0.03	15.83	0.00	1.02
number_diagnoses	0.06	0.01	10.42	0.00	1.31
max_glu_serum>300	0.08	0.13	0.60	0.55	1.47
max_glu_serumNone	0.17	0.09	1.91	0.06	1.47
max_glu_serumNorm	-0.01	0.11	-0.13	0.90	1.47
A1Cresult>8	0.11	0.06	1.76	0.08	1.19
A1CresultNone	0.19	0.05	3.80	0.00	1.19
A1CresultNorm	-0.06	0.06	-0.98	0.33	1.19
metforminNo	0.22	0.13	1.71	0.09	1.49
metforminSteady	0.08	0.13	0.64	0.52	1.49
metforminUp	0.24	0.16	1.51	0.13	1.49
repaglinideNo	0.20	0.45	0.44	0.66	1.03
repaglinideSteady	0.42	0.46	0.92	0.36	1.03
repaglinideUp	0.09	0.59	0.16	0.87	1.03
glipizideNo	-0.01	0.14	-0.07	0.94	1.20
glipizideSteady	0.04	0.14	0.28	0.78	1.20
glipizideUp	0.06	0.18	0.32	0.75	1.20
rosiglitazoneNo	0.61	0.36	1.72	0.09	1.11
rosiglitazoneSteady	0.84	0.36	2.34	0.02	1.11
rosiglitazoneUp	0.76	0.42	1.83	0.07	1.11
insulinNo	-0.12	0.05	-2.52	0.01	2.92
insulinSteady	-0.15	0.04	-3.78	0.00	2.92
insulinUp	-0.09	0.04	-1.94	0.05	2.92
changeNo	0.01	0.03	0.32	0.75	2.50
diabetesMedYes	0.27	0.03	8.43	0.00	1.93
primary_diagnosisDiabetes	0.13	0.05	2.93	0.00	2.60
primary_diagnosisDigestive	-0.06	0.04	-1.60	0.11	2.60
primary_diagnosisGenitourinary	-0.15	0.05	-3.07	0.00	2.60
primary_diagnosisInjury	-0.17	0.04	-3.85	0.00	2.60
primary_diagnosisMusculoskeletal	-0.25	0.05	-5.20	0.00	2.60
primary_diagnosisNeoplasms	-0.41	0.06	-6.83	0.00	2.60
primary_diagnosisOther	-0.12	0.03	-3.65	0.00	2.60
primary_diagnosisRespiratory	-0.11	0.03	-3.32	0.00	2.60
secondary_diagnosisDiabetes	0.01	0.04	0.36	0.72	2.14
secondary_diagnosisDigestive	-0.18	0.06	-3.16	0.00	2.14
secondary_diagnosisGenitourinary	-0.14	0.04	-3.32	0.00	2.14
secondary_diagnosisInjury	-0.31	0.07	-4.55	0.00	2.14
secondary_diagnosisMusculoskeletal	-0.14	0.08	-1.85	0.06	2.14
secondary_diagnosisNeoplasms	0.16	0.07	2.16	0.03	2.14
secondary_diagnosisOther	-0.09	0.03	-3.20	0.00	2.14
secondary_diagnosisRespiratory	-0.10	0.04	-2.68	0.01	2.14
additional_diagnosisDiabetes	-0.06	0.03	-1.85	0.06	1.56
additional_diagnosisDigestive	0.03	0.05	0.64	0.52	1.56
additional_diagnosisGenitourinary	-0.03	0.05	-0.72	0.47	1.56
additional_diagnosisInjury	-0.34	0.08	-4.49	0.00	1.56
additional_diagnosisMusculoskeletal	-0.21	0.08	-2.75	0.01	1.56
additional_diagnosisNeoplasms	-0.14	0.08	-1.68	0.09	1.56
additional_diagnosisOther	-0.15	0.03	-5.66	0.00	1.56
additional_diagnosisRespiratory	-0.05	0.04	-1.12	0.26	1.56
Standard errors: MLE

logit_pred1 <- predict(logitMod1, TEST,type = "response")
pred1 <- ifelse(logit_pred1 >= 0.5, 1, 0)

cm1 <- confusionMatrix(factor(pred1), factor(TEST$readmitted),positive = "1")

draw_confusion_matrix(cm1)

AUC(logit_pred1,TEST$readmitted)

## [1] 0.6215771

58.9% Accuracy and 62.1% AUC

6.4.2. Model 2

Removed statistically Insignificant variables from the previous model to check if there is any improvement.

logitMod2 <- glm(readmitted ~ race + gender + age + admission_type +  time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + A1Cresult + rosiglitazone + insulin + change + diabetesMed + primary_diagnosis + secondary_diagnosis + additional_diagnosis,data = data_rose,family = binomial)
summ(logitMod2,vifs = TRUE)

Observations	42937
Dependent variable	readmitted
Type	Generalized linear model
Family	binomial
Link	logit

χ²(52)	1830.75
Pseudo-R² (Cragg-Uhler)	0.06
Pseudo-R² (McFadden)	0.03
AIC	57798.33
BIC	58257.71

	Est.	S.E.	z val.	p	VIF
(Intercept)	-1.99	0.37	-5.42	0.00	NA
raceAsian	-0.63	0.12	-5.08	0.00	1.08
raceCaucasian	0.01	0.03	0.49	0.63	1.08
raceHispanic	-0.17	0.07	-2.41	0.02	1.08
raceOther	-0.20	0.08	-2.50	0.01	1.08
genderMale	-0.06	0.02	-3.06	0.00	1.03
age	0.01	0.00	10.46	0.00	1.17
admission_typeEmergency	0.14	0.03	5.01	0.00	1.28
admission_typeNewborn	-0.51	0.93	-0.55	0.58	1.28
admission_typeOthers	0.36	0.04	9.40	0.00	1.28
time_in_hospital	0.02	0.00	6.01	0.00	1.29
num_lab_procedures	0.00	0.00	3.79	0.00	1.26
num_procedures	-0.03	0.01	-4.90	0.00	1.26
num_medications	0.00	0.00	3.05	0.00	1.50
number_outpatient	0.15	0.02	9.22	0.00	1.03
number_emergency	0.36	0.04	8.88	0.00	1.01
number_inpatient	0.52	0.03	15.73	0.00	1.01
number_diagnoses	0.07	0.01	12.63	0.00	1.28
A1Cresult>8	0.10	0.06	1.76	0.08	1.17
A1CresultNone	0.19	0.05	3.85	0.00	1.17
A1CresultNorm	-0.06	0.06	-0.93	0.35	1.17
rosiglitazoneNo	0.60	0.35	1.69	0.09	1.10
rosiglitazoneSteady	0.83	0.36	2.33	0.02	1.10
rosiglitazoneUp	0.66	0.42	1.58	0.11	1.10
insulinNo	-0.17	0.04	-4.18	0.00	2.19
insulinSteady	-0.20	0.04	-5.20	0.00	2.19
insulinUp	-0.08	0.04	-1.74	0.08	2.19
changeNo	0.04	0.03	1.56	0.12	1.93
diabetesMedYes	0.24	0.03	8.08	0.00	1.69
primary_diagnosisDiabetes	0.15	0.04	3.40	0.00	2.40
primary_diagnosisDigestive	-0.07	0.04	-1.69	0.09	2.40
primary_diagnosisGenitourinary	-0.15	0.05	-2.97	0.00	2.40
primary_diagnosisInjury	-0.14	0.04	-3.12	0.00	2.40
primary_diagnosisMusculoskeletal	-0.19	0.05	-3.92	0.00	2.40
primary_diagnosisNeoplasms	-0.36	0.06	-6.16	0.00	2.40
primary_diagnosisOther	-0.12	0.03	-3.60	0.00	2.40
primary_diagnosisRespiratory	-0.12	0.03	-3.54	0.00	2.40
secondary_diagnosisDiabetes	0.03	0.04	0.98	0.33	2.10
secondary_diagnosisDigestive	-0.17	0.06	-3.01	0.00	2.10
secondary_diagnosisGenitourinary	-0.13	0.04	-3.08	0.00	2.10
secondary_diagnosisInjury	-0.31	0.07	-4.57	0.00	2.10
secondary_diagnosisMusculoskeletal	-0.15	0.08	-2.00	0.05	2.10
secondary_diagnosisNeoplasms	0.17	0.07	2.25	0.02	2.10
secondary_diagnosisOther	-0.08	0.03	-2.89	0.00	2.10
secondary_diagnosisRespiratory	-0.10	0.04	-2.61	0.01	2.10
additional_diagnosisDiabetes	-0.05	0.03	-1.49	0.14	1.54
additional_diagnosisDigestive	0.04	0.05	0.74	0.46	1.54
additional_diagnosisGenitourinary	-0.02	0.05	-0.43	0.66	1.54
additional_diagnosisInjury	-0.33	0.07	-4.37	0.00	1.54
additional_diagnosisMusculoskeletal	-0.20	0.08	-2.59	0.01	1.54
additional_diagnosisNeoplasms	-0.13	0.08	-1.53	0.13	1.54
additional_diagnosisOther	-0.14	0.03	-5.46	0.00	1.54
additional_diagnosisRespiratory	-0.05	0.04	-1.19	0.23	1.54
Standard errors: MLE

logit_pred2 <- predict(logitMod2, TEST,type = "response")
pred2 <- ifelse(logit_pred2 >= 0.5, 1, 0)

cm2 <- confusionMatrix(factor(pred2), factor(TEST$readmitted),positive = "1")

draw_confusion_matrix(cm2)

AUC(logit_pred2,TEST$readmitted)

## [1] 0.6147122

58.3% Accuracy and 61.5% AUC

6.4.3. ROC Curve

plotdata.test <- data.frame(cbind(TEST$readmitted,
                                  logit_pred1,
                                  logit_pred2))
names(plotdata.test) <- c("Readmitted", "Model1", "Model2")

comparison.plot <- ggplot(plotdata.test, aes(m = Model1, d = Readmitted))+ 
  geom_roc(labels=FALSE, aes(color = 'Logit_Model1')) +
  labs(x = "False Positive Rate (FPR)", y = "True Positive Rate (TPR)") + geom_abline() +
  scale_color_discrete(name = 'Model') +
  geom_roc(data = plotdata.test, labels = FALSE, aes(m = Model2, d = Readmitted, color = 'Logit_Model2')) 
comparison.plot

7. Conclusion

This Model concludes that some key factors that drove readmission were time_in_hospital, num_lab_procedures , num_medications, number_diagnoses ,max_glu_serum ,A1Cresult, metformin, insulin, Medication_change and diagnosis.

We saw that when we removed some of the statistically insignificant variables from our first model, the Accuracy of model dropped from 58.9% to 58.3% and the AUC also got decreased from 62.1% to 61.5%. Therefore we consider our 1st model to be the best model.

This Model can provide doctors with updates on patients and predict which patient is more likely to readmit.

Hospital re-admissions are expensive and avoiding them is still a major challenge. It also creates a financial burden on patients and their families. Also, In US,the legislation penalises those hospitals with high readmission rate. These are the reasons why reducing readmission rate becomes necessity.