I am a biologist by training and new to machine learning (ML). As a member of the ScoreData team, I wanted to take the opportunity to analyze bioscience data on ScoreData’s ML platform, ScoreFast™. This project let me combine my skills as a biologist with my passion for learning new technologies and concepts. After searching for a public data set to experiment with, I decided to use the Diabetes 130-US Hospitals (1999-2008) data set. It is relatively large (about 100,000 instances, 55 features) and has a published paper associated with it.
ScoreFast™, the web-based model development and management platform from ScoreData, is built as an enterprise-grade machine learning platform. Its data and model management modules are dashboard-driven and intuitive to use. ScoreFast™ supports many algorithms, and its in-memory computation makes it quick to build and test models. As someone new to machine learning, I found the simple model-building interface an easy way to get started.
The goals of this project were to:
(1) Use ScoreFast™ platform to find correlations between data features and readmission of patients to hospitals within 30 days,
(2) Compare results from the ScoreFast™ platform with other platforms such as IBM’s Watson, Amazon’s ML platform and with the published paper, and
(3) Build a predictive model on ScoreFast™ platform.
First, I read the paper to understand a bit more about the data. In it, the authors hypothesized that measurement of HbA1c (hemoglobin A1c, a blood test that shows how well diabetes is being controlled) is associated with a reduction in readmission rates among individuals admitted to the hospital. The authors started from a large dataset of about 74 million unique encounters corresponding to 17 million unique patients and extracted it down to 101,766 encounters. They further cleaned the data to avoid bias: (1) they removed all patient encounters that resulted in either discharge to a hospice or patient death, and (2) they considered only the first encounter for each patient as the primary admission and determined whether or not the patient was readmitted within 30 days. The final data set consists of 69,984 encounters. This is the dataset we used for data analysis and to build a predictive model.
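ScoreFast™ handles this kind of data preparation through its UI, but for readers who want to reproduce the paper’s two cleaning criteria themselves, a minimal pandas sketch might look like the following. The toy frame and the set of excluded disposition codes are illustrative assumptions (the codes are my reading of the dataset’s discharge-disposition ID mapping for death and hospice), not the authors’ actual cleaning script:

```python
import pandas as pd

# Disposition IDs for expired patients and hospice discharges -- an
# assumption based on the dataset's IDs_mapping file, not from the paper.
EXCLUDED_DISPOSITIONS = {11, 13, 14, 19, 20, 21}

# Tiny toy frame with the relevant columns of the public dataset.
df = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5],
    "patient_nbr": [100, 100, 101, 102, 103],
    "discharge_disposition_id": [1, 1, 11, 1, 13],
    "readmitted": ["<30", "NO", "NO", ">30", "NO"],
})

# (1) Drop encounters that ended in death or discharge to hospice.
df = df[~df["discharge_disposition_id"].isin(EXCLUDED_DISPOSITIONS)]

# (2) Keep only each patient's first encounter as the primary admission.
df = df.sort_values("encounter_id").drop_duplicates("patient_nbr", keep="first")

# Binary target: was the patient readmitted within 30 days?
df["readmit_30d"] = (df["readmitted"] == "<30").astype(int)
print(len(df))  # 2 encounters survive in this toy example
```

Applied to the full 101,766 encounters, the same two steps yield the 69,984-encounter set used in the rest of this analysis.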
A few key details about the data set: (1) although all patients were diabetic, only 8.2% had diabetes as the primary diagnosis (Table 3 of the paper); (2) the HbA1c test was performed for only 18.4% of the patients during the hospital encounter (Table 3 of the paper).
We used the same criteria as the paper to pare the data down to 69,984 encounters. To avoid biasing my analysis, I deliberately did not note beforehand which parameters the paper had found to be significant. I wanted to see which parameters the ScoreFast™ machine learning platform identified as significant for the hospital readmission rate, and how well they compared with the analysis in the paper as well as with other ML platforms.
Data Analysis: Understanding the data
The data set had 55 features. After ignoring features such as Encounter_ID, Patient_No, Weight (97% of the values were missing), and Payer_Code, which were either irrelevant to the response (readmission to hospital) or of poor data quality, 46 relevant features remained. The first step was a data quality check on the 69,984 records (46 features). Of the 46 features, 23 were medicine-related; 22 of those were diabetic medications and one was cardiovascular. Seventeen features had constant values. The diagnosis features (diag_1, diag_2, diag_3) contained ICD-9 codes (the International Classification of Diseases, Ninth Revision; a standard list of alphanumeric codes describing diagnoses). These codes were transcribed to their corresponding group descriptions (Table 2 of the paper): diag_1 was renamed ic9Groupname, diag_2 ic9Groupname2, and diag_3 ic9Groupname3. This aggregation made it easier to see the correlation between any disease or procedure and hospital readmission.
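The ICD-9-to-group transcription described above can be sketched as a small helper function. The group names and code ranges below follow my reading of Table 2 of the paper; the function itself is my own illustrative sketch, not code from the platform:

```python
def icd9_group(code: str) -> str:
    """Map a raw ICD-9 code (as found in diag_1..diag_3) to a disease
    group, following the grouping in Table 2 of the paper."""
    if code in ("?", ""):
        return "Missing"
    if code.startswith(("E", "V")):  # supplementary codes
        return "Other"
    value = float(code)
    if 390 <= value <= 459 or int(value) == 785:
        return "Circulatory"
    if 460 <= value <= 519 or int(value) == 786:
        return "Respiratory"
    if 520 <= value <= 579 or int(value) == 787:
        return "Digestive"
    if int(value) == 250:            # all 250.xx codes
        return "Diabetes"
    if 800 <= value <= 999:
        return "Injury"
    if 710 <= value <= 739:
        return "Musculoskeletal"
    if 580 <= value <= 629 or int(value) == 788:
        return "Genitourinary"
    if 140 <= value <= 239:
        return "Neoplasms"
    return "Other"

print(icd9_group("250.83"))  # Diabetes
print(icd9_group("410"))     # Circulatory
```

Applying such a function to diag_1, diag_2, and diag_3 produces the ic9Groupname columns used in the analysis.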
When I used the ScoreFast™ platform to build the models, I got AUC (Area Under the Curve) values in the range of 0.61-0.65 (Table B, below), which indicated that the data was not of very good quality. This was confirmed by IBM Watson, which gave a data quality score of 54, and by the Amazon ML platform, which also reported an AUC of 0.63 (Table A).
Building a Predictive Model on the ScoreFast™ Platform
I found it easy to upload the data set to the ScoreFast™ platform, which splits the data into training and test sets. The platform offers four classes of models, among others: GBM (Gradient Boosting Machine), GLM (Generalized Linear Model), DL (Deep Learning), and DRF (Distributed Random Forest). ScoreData plans to add more algorithms in the near future. Without knowing the details of the algorithms, I was able to build all four models easily. The four models produced very similar results, with AUC ranging from 0.61 to 0.65.
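ScoreFast™ builds these models through its dashboard, but the same four model classes have open-source analogues. The sketch below uses scikit-learn on synthetic, imbalanced data (a stand-in for the real 69,984 encounters, which I cannot reproduce here) to show how one might train the four model types and compare their AUCs; it is illustrative, not the platform’s actual workflow:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: ~9% positives, mimicking the readmission imbalance.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.91], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Open-source analogues of the platform's four model classes.
models = {
    "GBM": GradientBoostingClassifier(random_state=0),
    "GLM": LogisticRegression(max_iter=1000),
    "DL":  MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
    "DRF": RandomForestClassifier(n_estimators=100, random_state=0),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of readmission
    aucs[name] = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

On the real data, all four models landed in a similarly narrow AUC band, which is itself a hint about data quality rather than model choice.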
Once the models were built, I could click the “Detail” links to learn more about each model: the MSE, ROC, thresholds, and key features. The top ten features with significant correlation to hospital readmission rates were similar for the GBM and DRF models (Table C, below). I was intrigued to see that these top ten features were different for the GLM and DL models (Table D, below). After discussions with my colleagues, I understood the sensitivity differences between the various algorithms, and realized that one can choose a model depending on what one is looking for and on the data available.
Comparison with the Amazon and IBM Machine Learning Platforms
The IBM Watson platform showed the following three parameters to have a significant correlation with the hospital readmission rate: discharge disposition (where the patient is discharged to), the number of inpatient visits, and the time spent at each hospital visit. Watson’s interaction graphs make these relationships easy to understand visually. When the data was analyzed on ScoreFast™, the three features flagged as significant by the IBM platform were also among the top ten significant features in the GBM and DRF models (Table C for the ScoreFast™ platform and Table E for IBM’s Watson).
The Amazon ML platform did not provide a way to visualize the top features of the model, but it does provide cross-validation results. The percentage of correct predictions was 91%, with an error rate of 9%. The true positive (TP) count was very low. This was again validated by the confusion matrix results on the ScoreFast™ platform, shown in Table F.
The top features, model accuracy, and results on the ScoreFast™ platform were very similar to those from IBM Watson and the Amazon ML platform.
Predicting readmission rates for new patients
With the push of a button, models can be deployed to a production environment on the ScoreFast™ platform. A hold-out set of 14,462 rows (20% of the original data) was used to test the models. Using the batch prediction interface on ScoreFast™, I ran the hospital readmission models; the results are shown below (Table G).
The model was able to correctly predict 92% of the population that did not require readmission (true negatives). Of the remaining 8%, the GBM model correctly predicted 9 out of 1,299 readmissions (TP), but the false negative count was high (1,290). Deep Learning had a slightly better true positive rate (19 out of 1,299), while the DRF model did not give good true positive predictions. The low TP rate was expected, as we had observed it while building the models.
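For readers unfamiliar with these terms, a tiny example shows how TP, TN, FP, and FN are read off a confusion matrix, and why overall accuracy can look good on imbalanced data even when TP is very low. The numbers here are invented for illustration, not taken from Table G:

```python
from sklearn.metrics import confusion_matrix

# 0 = not readmitted within 30 days, 1 = readmitted within 30 days.
# Toy labels with the same flavor of imbalance as the real data.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# scikit-learn orders the flattened 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# With mostly-negative data, accuracy is dominated by the TN count.
accuracy = (tn + tp) / len(y_true)
print(f"accuracy = {accuracy:.0%}")
```

This is the same effect seen above: a model that predicts “no readmission” almost everywhere still scores high accuracy while catching very few true readmissions.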
Table H (below) shows how the threshold impacts the accuracy of the GBM model. Maximum accuracy is obtained with a threshold of 0.644. If the threshold is tuned to increase TP, it starts to hurt TN. In this case, both TN (a patient who should not be readmitted is not readmitted) and TP (a patient who should be readmitted is readmitted) are important. In general, the threshold can be used to tune the model toward the desired result.
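The effect of moving the threshold can be sketched with synthetic scores (the beta-distributed scores below are made-up numbers, not the model’s actual outputs): as the cutoff rises, TP can only fall while TN can only rise, which is exactly the trade-off seen in Table H.

```python
import numpy as np

rng = np.random.default_rng(0)

# ~8% positives, mirroring the readmission rate in the hold-out set.
y_true = np.concatenate([np.zeros(920), np.ones(80)])
# Hypothetical model scores: positives skew only slightly higher,
# mimicking a weak model (AUC ~0.63).
scores = np.concatenate([rng.beta(2, 6, 920), rng.beta(3, 5, 80)])

results = {}
for threshold in (0.3, 0.5, 0.644):
    y_pred = (scores >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    results[threshold] = (tp, tn)
    print(f"threshold={threshold}: TP={tp} TN={tn}")
```

Sweeping the threshold like this, and plotting TP and TN against it, is one way to pick an operating point when both error types carry a real cost.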
To improve TP and/or reduce the false negatives (FN), we need more data samples as well as additional features. We had observed from the beginning that the data quality was not great and that the model had an AUC of only 0.63.
As mentioned earlier, the data quality was not good. The paper reported a significant correlation between the HbA1c test not being performed during the hospital stay and the readmission rate. The DL model on the ScoreFast™ platform also showed this correlation, and it picked up a circulatory diagnosis as a significant parameter, which the paper likewise identified as a significant feature. Due to the differences in the algorithms, the other models did not pick up these features. As a next step, I need to understand the algorithms better, both to see how the data and features affect the choice of model and to understand the impact of each feature.
The significant parameters with a high correlation to hospital readmission found on ScoreFast™ (GBM and DRF models) and on the Watson platform were comparable to Table 3 of the paper, which shows a correlation between the hospital readmission rate and (1) discharge disposition, i.e., where the patient is discharged to after the first visit (discharge_disposition_id), (2) the primary diagnosis, and (3) age (a person older than 60 is more likely to be readmitted).
I found ScoreFast™ easy to use. It was easy to load large data sets and analyze the data. The platform provides a choice of many different algorithms, which can be used either to confirm significant correlations between parameters or to perform different kinds of analysis. It also supports model versioning, which allows one to try different variations of a model while keeping track of each model’s performance.
I would like to understand the algorithms behind the different models more deeply, so I can figure out how to choose among them for different purposes. As a next step, I plan to study the data further and see how I can improve its quality; one idea is to consolidate the 23 medication features into 2. I also want to understand the feature correlations and variable importance. Finally, I plan to work with my data scientist colleagues to tweak the algorithm configurations and see if I can improve model accuracy.
Beyond this diabetic data analysis project, I plan to experiment with a few more datasets in order to gain more insight into how machine learning and ScoreFast™ can be used to get actionable insights from clinical or medical data.
I would like to thank Prasanta Behera for guiding me through this project.