

Predicting Hospital Readmission
A hospital readmission occurs when a patient who has been discharged from the hospital is admitted again within a certain period of time. Hospital readmissions contribute to the high cost of healthcare in America, and the rate of hospital readmission is considered an indicator of hospital quality. To address this issue, the Centers for Medicare & Medicaid Services established the Hospital Readmissions Reduction Program (HRRP) with the goal of improving the quality of care for patients and reducing healthcare spending. The program's primary mechanism is applying payment penalties to hospitals that have higher-than-expected readmission rates for certain conditions. While diabetes has not yet been added to HRRP's list of conditions, it is the condition with the third-most all-cause, 30-day readmissions for Medicaid patients, and in 2011, American hospitals spent over $41 billion caring for diabetic patients who were readmitted within 30 days of discharge.
The ability to determine which risk factors lead to higher readmission in such patients, and being able to predict which patients will be readmitted, could help save hospitals billions of dollars while also improving quality of care. With this goal in mind, we set out to answer these two questions:
- How well can we predict 30-day hospital readmission of diabetes patients using electronic health record data?
- Which factors are the most important in predicting hospital readmission for a diabetic patient?
To see the code that was used in this analysis, check out our project's GitHub repository.
Screencast
Dataset
The dataset we will be working with is a subset of the Health Facts database representing 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the Health Facts database for encounters that satisfied the following criteria:
- It is an inpatient encounter (a hospital admission).
- It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
- The length of stay was at least 1 day and at most 14 days.
- Laboratory tests were performed during the encounter.
- Medications were administered during the encounter.
The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and number of outpatient, inpatient, and emergency visits in the year before the hospitalization. This dataset is made publicly available by the UCI Machine Learning Repository.
Numerous steps were taken to clean the dataset, including the addition and removal of certain features. The final dataset features are described below:

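A compact per-feature summary like this can be generated with the skimr package (listed under R Packages below). A minimal sketch, assuming the cleaned data frame is named `diabetes_clean` (a hypothetical name):

```r
library(skimr)

# compact summary of every feature: type, missing values, and
# distribution statistics (`diabetes_clean` is a hypothetical name)
skim(diabetes_clean)
```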
Exploratory Data Analysis
Data Cleaning
Our first step was to clean the data by removing features that were unnecessary, redundant, or missing too many values. We then created new variables, including Healthcare Utilization (the sum of inpatient, outpatient, and emergency encounters), number of diabetes medications, and diagnosis groupings based on the Strack et al. 2014 paper.
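A minimal sketch of this cleaning step, assuming the raw UCI file `diabetic_data.csv` (which marks missing values with "?") and using column names from that dataset; the exact set of dropped features is in the repository:

```r
library(tidyverse)

# load the raw data; the UCI file encodes missing values as "?"
diabetes <- read_csv("diabetic_data.csv", na = "?")

diabetes_clean <- diabetes %>%
  # drop identifiers and features with too many missing values
  # (weight and payer_code are mostly missing in this dataset)
  select(-encounter_id, -weight, -payer_code) %>%
  # combine prior utilization into a single new feature
  mutate(healthcare_utilization = number_outpatient +
           number_emergency + number_inpatient)
```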
Here is the correlation matrix for our clean dataset:

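A matrix like this can be drawn with the corrplot package (cited below); a minimal sketch over the numeric features, assuming the hypothetical `diabetes_clean` from the step above:

```r
library(corrplot)

# correlation matrix over the numeric features only
num_features <- Filter(is.numeric, diabetes_clean)
corrplot(cor(num_features, use = "pairwise.complete.obs"),
         method = "color", tl.cex = 0.6)
```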
PCA and t-SNE
Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) showed that our two readmission classes are not well separated, and even the most important principal components explain only a small proportion of the variance:



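For reference, a minimal sketch of both projections, assuming the hypothetical numeric feature matrix `num_features` from above:

```r
library(Rtsne)

# complete cases only: prcomp() and Rtsne() do not accept NAs
X <- scale(as.matrix(na.omit(num_features)))

# PCA on centered and scaled features
pca <- prcomp(X)
summary(pca)  # proportion of variance explained per component

# t-SNE embedding; Rtsne() also requires duplicate rows to be removed
tsne <- Rtsne(unique(X), dims = 2, perplexity = 30)
plot(tsne$Y, pch = 20, xlab = "t-SNE 1", ylab = "t-SNE 2")
```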
For many features that we might expect to be relevant to hospital readmission, there is in fact little to no separation between classes:
Machine Learning Analysis
XGBoost
XGBoost stands for eXtreme Gradient Boosting. As the name indicates, it is a boosting method that combines the output of many Classification And Regression Trees (CARTs). CARTs are like decision trees except that, in addition to producing a classification, they attach a score to each leaf. When several trees are combined, the scores a given observation receives are summed, and XGBoost's classification and regression outputs are based on that sum. Training consists of selecting, at each step, the new tree that best optimizes the loss function when added to the ensemble; hence the name gradient boosting: each added tree takes a step in the best direction for optimizing the objective function.
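A minimal sketch of such a fit with the xgboost package (cited below), assuming a numeric feature matrix `X_train` and a 0/1 label vector `y_train` (readmitted within 30 days); the hyperparameter values here are illustrative, not our tuned ones:

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = X_train, label = y_train)

# boosted trees with a logistic objective, evaluated by AUC
fit <- xgboost(data = dtrain,
               objective = "binary:logistic",
               eval_metric = "auc",
               nrounds = 100,
               max_depth = 6,
               eta = 0.1,
               verbose = 0)

# gain-based feature importance, as in the plot below
importance <- xgb.importance(model = fit)
xgb.plot.importance(importance, top_n = 10)
```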

Here are the most important features:

kNN
kNN (k-nearest neighbors) is an algorithm that takes in numerical features, calculates the distance between a given observation and every other observation in the dataset, and then predicts a label for that observation based on the majority label among its k nearest neighbors (the k points closest to the point of interest). For this reason it is important to choose k so as to avoid tie votes among the neighbors, for example by using an odd k for binary classification.
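A minimal sketch via the caret package, assuming a hypothetical data frame `train_df` with a factor outcome `readmitted`:

```r
library(caret)

knn_fit <- train(readmitted ~ ., data = train_df,
                 method = "knn",
                 # distances are only meaningful on a common scale
                 preProcess = c("center", "scale"),
                 # odd k values prevent tie votes between two classes
                 tuneGrid = data.frame(k = seq(3, 25, by = 2)),
                 trControl = trainControl(method = "cv", number = 10))
knn_fit$bestTune
```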

GLMnet
GLMnet fits a generalized linear model via penalized maximum likelihood. The algorithm computes the regularization path for the elastic-net penalty over a grid of values of the regularization parameter λ. The elastic-net penalty is controlled by α, which bridges the gap between the lasso (α = 1, the default) and ridge regression (α = 0), while λ controls the overall strength of the penalty. To reduce noise caused by data splitting, we tuned these parameters with 10-fold cross-validation using the caret package as described here. For our dataset, the best tuning parameters after 10-fold cross-validation were α = 0 and λ = 0.01, meaning we are effectively fitting a ridge regression.
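A sketch of this tuning step with caret, assuming the same hypothetical `train_df` as above; the grid below is illustrative:

```r
library(caret)
library(glmnet)

glm_fit <- train(readmitted ~ ., data = train_df,
                 method = "glmnet",
                 # 10-fold cross-validation over (alpha, lambda)
                 trControl = trainControl(method = "cv", number = 10),
                 tuneGrid = expand.grid(
                   alpha  = c(0, 0.5, 1),
                   lambda = 10^seq(-3, 0, length.out = 20)))
glm_fit$bestTune  # our selected values were alpha = 0, lambda = 0.01

# coefficient-based variable importance, as shown below
varImp(glm_fit)
```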


Here are the most important features selected by GLMnet:


Conclusions and Limitations
Conclusions
- Our best-performing model was XGBoost, though its performance was still poor (AUC = 0.59)
- While other analyses of this same dataset have produced significant results, we chose to include more features than those analyses in an attempt to better capture the whole picture of a patient’s hospital admission and health at the time of the encounter
- Our results highlight the complexity of predicting readmissions while taking into account all relevant features and confounders
Limitations
- Databases of clinical data can present difficulties related to missing values, incomplete or inconsistent records, high dimensionality, and complexity of features
- Analyzing external data is more challenging than analysis of data collected during a carefully designed study, as features that may be important may simply not be available in an external dataset
- Readmission is an important yet somewhat arbitrary measure which is influenced by a potentially infinite number of factors related to a patient’s health and care received during a hospital admission
References
- Strack B, DeShazo JP, Gennings C, Olmo JL, Ventura S, Cios KJ, Clore JN: Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. Biomed Res Int 2014, 2014:781670
- Centers for Disease Control and Prevention: National Diabetes Statistics Report, 2017. Atlanta, GA: US Department of Health and Human Services; 2017
- Leppin AL, Gionfriddo MR, Kessler M, Brito JP, Mair FS, Gallacher K, Wang Z, Erwin PJ, Sylvester T, Boehmer K et al: Preventing 30-day hospital readmissions: a systematic review and meta-analysis of randomized trials. JAMA Intern Med 2014, 174(7):1095-1107
- Rubin DJ, Handorf EA, Golden SH, Nelson DB, McDonnell ME, Zhao H: Development and Validation of a Novel Tool to Predict Hospital Readmission Risk among Patients with Diabetes. Endocr Pract 2016, 22(10):1204-1215
R Packages
- tidyverse. Hadley Wickham (2017). tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
- caret. Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2018). caret: Classification and Regression Training. R package version 6.0-81. https://CRAN.R-project.org/package=caret
- xgboost. Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng and Yutian Li (2018). xgboost: Extreme Gradient Boosting. R package version 0.71.2. https://CRAN.R-project.org/package=xgboost
- skimr. Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, Shannon Ellis and Michael Quinn (2018). skimr: Compact and Flexible Summaries of Data. R package version 1.0.3. https://CRAN.R-project.org/package=skimr
- data.table. Matt Dowle and Arun Srinivasan (2018). data.table: Extension of `data.frame`. R package version 1.11.8. https://CRAN.R-project.org/package=data.table
- mltools. Ben Gorman (2018). mltools: Machine Learning Tools. R package version 0.3.5. https://CRAN.R-project.org/package=mltools
- corrplot. Taiyun Wei and Viliam Simko (2017). R package "corrplot": Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot
- ROCR. Sing T, Sander O, Beerenwinkel N, Lengauer T (2005). “ROCR: visualizing classifier performance in R.” Bioinformatics, 21(20), 7881. URL: http://rocr.bioinf.mpi-sb.mpg.de.
- DMwR. Torgo, L. (2010). Data Mining with R, learning with case studies Chapman and Hall/CRC. URL: http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
- t-SNE. Jesse H. Krijthe (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation, URL: https://github.com/jkrijthe/Rtsne
- glmnet. Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL: http://www.jstatsoft.org/v33/i01/
- tableone. Kazuki Yoshida and Justin Bohn. (2018). tableone: Create 'Table 1' to Describe Baseline Characteristics. R package version 0.9.3. https://CRAN.R-project.org/package=tableone
- R. R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/