1 Executive Summary

The California Teachers Study (CTS) is an observational cohort study that follows and records health data from female teachers, administrators, school nurses, and other members of the California State Teachers Retirement System (CalSTRS). Data from the CTS have been used for extensive research into breast cancer and have also given insight into other cancers and diseases. The main focus of this exploratory study is the OSHPD hospitalization records, which contain hospitalization data for CTS participants.

The primary objectives of this study are as follows:

Develop a prediction model for the short-term risk of death based on prior patient hospitalization.
Assess whether the time window after hospitalization and the participant’s co-morbidities are significant factors in predicting the risk of death.

Finding 1: Time Window and Age are key factors in predicting the risk of death

The age of the participant and the time window after hospitalization were found to be key features in predicting the short-term risk of death for participants with hospitalization records. The effect of the time window aligns with the exploratory data analysis, which showed a high rate of deaths per day in the first 30 days after participants were discharged from the hospital. Age, in turn, is expected to be a significant covariate in any model of death. The contribution of these features was so large that the other primary exposures of this study (co-morbidities and length of stay in the hospital) appeared comparatively unimportant. Although the logistic regression model containing age and time window is highly accurate, it is not very informative for assessing the effect of co-morbidities and other possible covariates on participant death after hospitalization.

Finding 2: Co-morbidities, diet, and physical activity are significant predictors of death

After the effects of age and time windows after discharge were removed, the diagnoses and procedures from the hospitalization data, the participants’ diet, and the participants’ physical activity emerged as variables of importance in the random forest model. More specifically, the primary diagnosis of the participant, a high-carbohydrate diet, the length of stay in the hospital, and the amount of physical activity were the most important variables. The summary of the logistic regression model indicates that these features have a significant effect on the estimated short-term probability of death. Although the model without age and time windows is not as accurate, it gives a clearer picture of the effect of co-morbidities and certain lifestyle behaviors on the probability of death after hospitalization.

2 Introducing the California Teachers Study (CTS)

The California Teachers Study (CTS) is an observational cohort study established in 1995 that follows female teachers, administrators, school nurses, and other members of the California State Teachers Retirement System (CalSTRS). Members of CalSTRS have provided information regarding their health and behaviors, and information about these participants has continued to be recorded and studied. The data have allowed for extensive research on breast cancer and have given insight into other cancers and diseases. This report focuses on the OSHPD hospitalization records, which contain hospitalization information for CTS participants.

3 Objective:

The primary objective of this report is to predict the short-term risk of death based on prior in-patient hospitalization. The hospitalization data (which include diagnoses, co-morbidities, length of stay, and discharge date) of California Teachers Study participants from 2000 to 2015 will be used for the analysis. Key characteristics and possible covariates such as age, ethnicity, height, and weight (taken from the California Teachers Study questionnaire) will also be incorporated.

Machine learning will be used to build the best-fitting model for the probability of death after prior in-patient hospitalization. In particular, the effect of certain time windows between hospital discharge and death, as well as the effect of co-morbidities and specific procedures, will be examined to determine their significance in predicting the short-term risk of death after in-patient hospitalization.

4 Exploratory Data Analysis:

4.1 Basic Data Exploration

## [1] 132538    164

The data set consists of 132,538 CTS participants and 164 variables. The primary outcome examined in this project is the death of CTS participants.

Deceased	Frequency
No	64099
Yes	68439

Of all the participants in the data set with prior in-patient hospitalization records, around 51.6% had died. The data set is therefore relatively balanced with regard to the target outcome, which helps reduce potential biases when building prediction models.

4.2 Variable Creation: Time Window Between Discharge and Death

One of the aims of this research project is to predict the probability of death within a certain time window after hospitalization and to assess whether the time window itself is a significant factor in assessing the risk of death. Therefore, a new variable will be created to count the number of days between discharge from the hospital and the date of death.

The summary of the newly created variable shows that there were some negative values, suggesting that patients had died before they were discharged.

##    date_of_death_dt discharge_dt days_after_discharge_death
## 51       2004-05-27   2004-05-28                         -1

Examining one observation with a negative time window confirms the problem: the recorded date of death was May 27th, 2004, while the date of discharge was May 28th, 2004, which does not make sense. There were only 169 instances out of the 6,739 individuals that died in which the time window until death was negative. Therefore, these instances were filtered out of the data set.
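
The derivation and filtering step described above can be sketched as follows. This is a minimal illustration in Python rather than the code used for the report. The first record reproduces the dates shown in the output above; the other two records and the helper name days_after_discharge are hypothetical.

```python
from datetime import date
from typing import Optional

def days_after_discharge(discharge: date, death: Optional[date]) -> Optional[int]:
    """Days from hospital discharge to death; None when no death is recorded."""
    if death is None:
        return None
    return (death - discharge).days

records = [
    # Dates taken from the example row shown in the report.
    {"discharge_dt": date(2004, 5, 28), "date_of_death_dt": date(2004, 5, 27)},
    # Hypothetical rows, for illustration only.
    {"discharge_dt": date(2010, 1, 1), "date_of_death_dt": date(2010, 1, 31)},
    {"discharge_dt": date(2012, 6, 15), "date_of_death_dt": None},
]

for r in records:
    r["days_after_discharge_death"] = days_after_discharge(
        r["discharge_dt"], r["date_of_death_dt"]
    )

# Keep living participants and deaths on or after the discharge date.
kept = [
    r for r in records
    if r["days_after_discharge_death"] is None
    or r["days_after_discharge_death"] >= 0
]
```

Keeping rows with no recorded death is an assumption here: the report states only that the negative values were removed.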

After filtering out negative values, the mean number of days between discharge and death was 1414 days, or around 3.87 years. A majority of the deaths occurred within one year after discharge from the hospital.

New variables will be added to indicate the time window between discharge and death. These time windows are: < 30 days, 30 - 180 days, 180 days - 1 year, and 1 - 3 years between discharge from the hospital and death.

Because the time windows differ in length, the number of deaths in each window was standardized by dividing by the number of days in that window. After this adjustment, it is clear that the death rate per day was highest within the first 30 days after hospital discharge.
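
The standardization described above amounts to dividing each window's death count by the window's length in days. The sketch below illustrates this under two assumptions that the report does not state: windows are half-open intervals [lower, upper), and one year is counted as 365 days. The input values are made up.

```python
# (label, lower bound in days, upper bound in days); bounds are assumptions.
WINDOWS = [
    ("<30 days", 0, 30),
    ("30-180 days", 30, 180),
    ("180 days-1 year", 180, 365),
    ("1-3 years", 365, 1095),
]

def window_label(days):
    """Return the label of the window containing `days`, or None if outside all."""
    for label, lo, hi in WINDOWS:
        if lo <= days < hi:
            return label
    return None

def deaths_per_day(days_list):
    """Count deaths per window, then divide by the window length in days."""
    counts = {label: 0 for label, _, _ in WINDOWS}
    for d in days_list:
        label = window_label(d)
        if label is not None:
            counts[label] += 1
    return {label: counts[label] / (hi - lo) for label, lo, hi in WINDOWS}

# Hypothetical days between discharge and death.
example_days = [3, 10, 29, 45, 170, 200, 400, 1000, 2000]
rates = deaths_per_day(example_days)
for label, rate in rates.items():
    print(f"{label}: {rate:.4f} deaths per day")
```

Without the division, longer windows would look more dangerous simply because they cover more days.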

4.3 Primary Exposures

4.3.1 Length of Stay In Hospital

One of the primary exposures in the study is the length of stay in the hospital. The average number of days a patient spent in the hospital was 4.71. The gap between the 3rd quartile and the maximum length of stay is enormous, at 1687 days, indicating a heavily right-skewed distribution in which most patients stay for about a week or less.

The boxplot shows the number of days in the hospital for individuals who did and did not die after being discharged. According to the boxplot, the typical length of stay for both groups is very short. However, there are more outliers among those who died, suggesting that, on average, those who died after hospitalization stayed in the hospital longer than those who did not.

Below, the average number of days in the hospital is calculated for study participants who died after hospitalization and for those who did not.

Deceased	Average Length of Stay
No	3.570773
Yes	5.778834

## 
##  Welch Two Sample t-test
## 
## data:  length_of_stay_day_cnt by deceased
## t = -34.151, df = 81133, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.334787 -2.081335
## sample estimates:
##  mean in group No mean in group Yes 
##          3.570773          5.778834

A t-test comparing the means of the two groups indicates that the difference in the average number of days in the hospital between those who died after hospitalization and those who did not is statistically significant. The length of stay is therefore likely to be a useful predictor of death.
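
For reference, the Welch statistic reported above is the difference in group means divided by the square root of the sum of the per-group variances of the mean. The sketch below implements that formula on made-up data, since the raw lengths of stay are not reproduced in this report. It then recovers the standard error implied by the reported means and t value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # sample variance of the mean, group a
    vb = variance(b) / len(b)  # sample variance of the mean, group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical lengths of stay (days) for two small groups.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Standard error implied by the reported output: (mean No - mean Yes) / t.
implied_se = (3.570773 - 5.778834) / -34.151
```

The implied standard error is about 0.065 days. Multiplying it by 1.96 gives roughly the half-width of the reported 95 percent confidence interval, so the printed values are internally consistent.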

4.3.2 Diagnosis CCS Code

The diagnosis CCS code contains information about the diagnoses that the CTS participant received during hospitalization. In the data set, there are five diagnosis CCS codes, one of which is the primary diagnosis. All five diagnosis CCS codes will be included in the prediction model, but the frequency of the different primary diagnoses is examined further here.

Below are the 10 most common primary diagnoses among CTS participants. The most common primary diagnosis was osteoarthritis; other common primary diagnoses include congestive heart failure and pneumonia.

Primary Diagnosis	Frequency
Osteoarthritis	12131
Septicemia	4347
Cardiac Dysrhythmias	4316
Rehabilitation Care	4228
Pneumonia	3834
Congestive Heart Failure	3448
Spondylosis	3315
Fracture of Neck of Femur (Hip)	3292
Acute Cerebrovascular Disease	3237
Undefined	3158

Above are the 10 most common primary diagnoses from the diagnosis CCS codes, grouped by whether the participant died after hospitalization. More participants with a primary diagnosis of congestive heart failure died after hospitalization than survived, suggesting that it could be a predictor of death after discharge. In contrast, more participants with a primary diagnosis of osteoarthritis were still alive after hospitalization.

4.3.3 Procedure CCS Code

The procedure CCS code functions similarly to the diagnosis CCS code described previously. It contains information about the procedures that the participant underwent during hospitalization. There are five procedure CCS codes in the data set, and all five will be included in the model. As with the diagnosis CCS codes, the different primary procedures are examined further here.

Primary Procedure	Frequency
Arthroplasty Knee	7823
Hip Replacement	5988
Undefined	3469
Blood Transfusion	3119
Hysterectomy	2753
Fracture Treatment	2409
Physical Therapy	2313
Upper Gastrointestinal Endoscopy	2168
Spinal Fusion	2061
Respiratory Intubation	1945

The table above lists the 10 most common primary procedures that CTS participants underwent. The most common are knee arthroplasty and hip replacement.

According to the above figure, more participants died after undergoing procedures such as respiratory intubation and blood transfusion. Procedures such as hip replacement and knee arthroplasty were followed by fewer deaths after hospitalization.

4.4 Age

Age is one of the baseline covariates with regard to patient health. Therefore, the age of each participant will be calculated. For participants who have died, the age at which they died will be calculated. For participants who are still alive, their age at the end of 2015 will be calculated, as this study only considers hospitalization data through 2015.
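
The age rule described above can be sketched as follows. The report does not show how birth dates are stored, so the birth date argument, the function names, and the December 31, 2015 reference date are assumptions made for illustration.

```python
from datetime import date
from typing import Optional

STUDY_END = date(2015, 12, 31)  # assumed reference date for living participants

def age_on(birth: date, ref: date) -> int:
    """Completed years between `birth` and `ref`."""
    years = ref.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the ref year.
    if (ref.month, ref.day) < (birth.month, birth.day):
        years -= 1
    return years

def participant_age(birth: date, death: Optional[date] = None) -> int:
    """Age at death if a death date is recorded, else age at the end of 2015."""
    return age_on(birth, death if death is not None else STUDY_END)

# Hypothetical participants.
print(participant_age(date(1940, 6, 15)))                    # still alive
print(participant_age(date(1940, 6, 15), date(2004, 5, 27)))  # deceased
```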

The histogram of the age of each participant indicates the expected pattern, as there were more older participants that had died compared to younger participants.

4.5 Ethnicity

Participant Race	Frequency
White	119194
Hispanic	3366
Black	3228
Asian	3012
Native American	1321
None	1156
Other/Mixed	1092

An overwhelming majority of the participants in the study were white: 119,194 of the 132,369 records in the table above, or about 90%.

5 Building Prediction Models

Machine learning and logistic regression will be used to build models that predict the short-term risk of death based on prior in-patient hospitalization. The main features included in this initial exploratory model are the five diagnosis and five procedure codes from the patient’s hospital stay and basic information such as age, ethnicity, BMI, alcohol use, and tobacco use. Additional hospitalization information such as length of stay and admission type will also be included. Hospitalizations of CTS participants from 2000 through 2015 will be used.

The missing values in the data were imputed because the machine learning methods used here cannot handle missing values. A missing value of a numeric variable was replaced by the mean of that feature. A missing value of a factor was replaced by the mode of that feature.

To evaluate the models on data they were not trained on, the data set was split into training and testing data, with 70% of the observations randomly selected for the training set and the remaining 30% designated as the test set.
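
The imputation rule described above (mean for numeric features, mode for factors) can be sketched as follows. This is an illustration in Python on made-up columns, with None standing in for a missing value; it is not the code used for the report.

```python
from statistics import mean, mode

def impute(values):
    """Replace None with the column mean (numeric) or the column mode (factor)."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns.
bmi = impute([22.0, None, 30.0])
race = impute(["White", None, "White", "Asian"])
```

A common refinement, not described in the report, is to compute the fill values from the training set only so that no information from the test set enters the model.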
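
A random 70/30 split of this kind can be sketched as follows. The seed value and the function name are arbitrary choices for illustration; the report does not state which seed or function was used.

```python
import random

def train_test_split(rows, train_frac=0.7, seed=566):
    """Shuffle row indices with a fixed seed and cut at `train_frac`."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

# Hypothetical data: 100 row identifiers.
train, test = train_test_split(list(range(100)))
```

Fixing the seed makes the split reproducible, so later models are compared on the same held-out observations.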

5.1 Balanced Random Forests

Random forests will be used to build ensembles of decision trees that predict the short-term risk of death in CTS participants with prior hospitalization records. In total, 44 features will be used to create the random forests. These include the outcome of interest (whether the participant died after being discharged) and hospitalization data such as diagnosis and procedure codes. Key demographic data such as age and race are also included, as is information on certain lifestyle behaviors, such as alcohol and tobacco use, physical exercise, and diet.

After running the random forest, the resulting AUC value is 1, indicating that the model perfectly separates participants who died from those who did not.
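
The report does not describe how the forest was balanced. One common approach, shown here only as an assumed illustration, is to train each tree on a bootstrap sample that contains the same number of observations from each outcome class. The function name and the example labels are hypothetical.

```python
import random

def balanced_bootstrap(labels, rng):
    """Draw, with replacement, equally many row indices from each class.

    The per-class sample size is the size of the smallest class.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n = min(len(ix) for ix in by_class.values())
    sample = []
    for ix in by_class.values():
        sample.extend(rng.choices(ix, k=n))
    return sample

# Hypothetical outcome labels: 8 deceased, 3 not deceased.
labels = ["Yes"] * 8 + ["No"] * 3
sample = balanced_bootstrap(labels, random.Random(0))
```

Each tree then sees both outcomes equally often, which keeps the majority class from dominating the splits.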

## Area under the curve: 1

Looking at the variable importance for the random forest, the features that contributed the most to the prediction model were those indicating the time that elapsed between hospital discharge and death, together with the age of the participant. The most influential feature was whether the participant had died within thirty days of being discharged.

However, there is a large difference in variable importance between the features that indicate the time window between discharge and death and other seemingly important predictors such as diagnoses and procedures. Because of this large disparity, and the suspiciously high AUC, a second random forest model will be built after filtering out the features that indicate the time window and the age of the participant.
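
To make the AUC values in this section concrete, the sketch below computes AUC as the probability that a randomly chosen deceased participant receives a higher score than a randomly chosen surviving participant. All data are made up. The any_window feature is an assumption about how the time-window indicators might behave: if they can only be nonzero for participants who died, they separate the two groups perfectly, which would be consistent with an AUC of 1.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical participants: 1 = deceased, 0 = not deceased.
deceased = [1, 1, 1, 0, 0, 0]
# Assumed feature: 1 if any discharge-to-death window is defined for the row.
any_window = [1, 1, 1, 0, 0, 0]
# A feature that is related to the outcome but does not encode it.
length_of_stay = [5, 2, 9, 3, 6, 1]

auc_window = auc(any_window, deceased)
auc_stay = auc(length_of_stay, deceased)
```

In this toy example the window indicator reaches an AUC of 1, while length of stay does not.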

5.2 Balanced Random Forest (without Age and Time Window Between Discharge and Death)

A second random forest model was constructed after excluding the age of the participant and the time windows between hospital discharge and death. The resulting AUC value for the test set is 0.8802, which is still high, although lower than that of the first random forest model.

## Area under the curve: 0.8802

Looking at the variable importance for the new random forest model, more features contribute to predicting death after hospitalization. The most important features are whether the participant had a diet high in carbohydrates and the primary diagnosis, although other features such as the length of stay in the hospital and the primary procedure also contribute.

Compared to the first random forest model, this second model is not as accurate. However, it draws on more of the primary exposures of interest, including the diagnoses and procedures of the participant, and so offers a more comprehensive look at the effect of diverse variables on the short-term risk of death after hospitalization.

5.3 Logistic Regression

A logistic regression model will be created using the 20 features that contributed the most to the random forest model. Fitting a logistic regression model also shows the statistical significance of each feature and the effect that each feature has on the short-term probability of death. Again, the features for age and time window were not included in the logistic regression.

The AUC for the logistic regression, 0.7664, is lower than that of the random forest model. The logistic regression model that included the features for time window and age (not shown) had an AUC of 1, further indicating that those two predictors dominate the calculation of the short-term probability of death after hospitalization.

## Area under the curve: 0.7664

## 95% CI: 0.762-0.7707 (DeLong)

5.3.1 Summary of Logistic Regression

Below is the summary of the logistic regression using the variables obtained from the random forest variable importance plot:

	Coefficient	Std.Error	Z Value	P Value
(Intercept)	-7.7449323	0.0821380	-94.291670	0.0000000
diag_ccs_code1	-0.0032707	0.0000733	-44.605739	0.0000000
diag_ccs_code2	-0.0004567	0.0000632	-7.224098	0.0000000
diag_ccs_code3	-0.0003309	0.0000571	-5.797611	0.0000000
diag_ccs_code4	-0.0001942	0.0000546	-3.555555	0.0003772
diag_ccs_code5	-0.0001600	0.0000544	-2.940358	0.0032783
proc_ccs_code1	0.0019221	0.0001311	14.663602	0.0000000
proc_ccs_code2	0.0012131	0.0001549	7.829589	0.0000000
diet_highcarb	-0.2584238	0.0083538	-30.934864	0.0000000
diet_plant	0.0766375	0.0067820	11.300096	0.0000000
diet_highprotfat	0.0962586	0.0068458	14.061007	0.0000000
diet_saladwine	0.0722528	0.0070817	10.202736	0.0000000
length_of_stay_day_cnt	0.1130971	0.0018883	59.892342	0.0000000
allex_life_hrs	-0.0192377	0.0016333	-11.778481	0.0000000
bmi_q1	0.0136421	0.0011940	11.425958	0.0000000
smoke_totyrs	0.0511368	0.0010834	47.199632	0.0000000
smoke_yrs_quit	0.0392726	0.0012845	30.575240	0.0000000
preg_total_q1	-0.0290019	0.0036186	-8.014593	0.0000000
menarche_age	0.0088437	0.0044014	2.009263	0.0445093
total_charges_amt	-0.0000033	0.0000001	-27.385908	0.0000000
age_at_death	0.0659668	0.0007016	94.018201	0.0000000

All of the features shown are statistically significant at the 0.05 level (P-values displayed as 0.0000000 are smaller than the precision shown). The z-values for the length of stay in the hospital, a diet that is high in carbohydrates, and the primary diagnosis code are among the largest in magnitude, indicating that the effects of these variables on the calculated probability of death are estimated with the most certainty. Among these, the high-carbohydrate diet indicator and the length of stay also have the largest coefficients, so a one-unit change in either shifts the predicted log-odds of death the most.
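
Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of death for a one-unit increase in that feature, holding the others fixed. The sketch below applies this to four coefficients copied from the table above. It is an interpretive aid, not part of the original analysis.

```python
import math

# Coefficients and standard errors copied from the summary table.
coefs = {
    "length_of_stay_day_cnt": (0.1130971, 0.0018883),
    "diet_highcarb": (-0.2584238, 0.0083538),
    "smoke_totyrs": (0.0511368, 0.0010834),
    "allex_life_hrs": (-0.0192377, 0.0016333),
}

# Odds ratio per one-unit increase: exp(coefficient).
odds_ratios = {name: math.exp(b) for name, (b, _) in coefs.items()}

# The z value is the coefficient divided by its standard error.
z_values = {name: b / se for name, (b, se) in coefs.items()}

for name in coefs:
    print(f"{name}: odds ratio {odds_ratios[name]:.3f}, z {z_values[name]:.1f}")
```

For example, the length-of-stay coefficient corresponds to an odds ratio of about 1.12, meaning each additional hospital day is associated with roughly 12% higher odds of death in this model. Odds ratios below 1, as for diet_highcarb, correspond to lower odds.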

6 Results:

6.1 Finding 1: Time Window and Age are key predictors

The age of the participant and the time window after hospitalization were found to be key features in predicting the short-term risk of death for participants with hospitalization records. The effect of the time window aligns with the exploratory data analysis, which showed a high rate of deaths per day in the first 30 days after participants were discharged from the hospital. Age, in turn, is expected to be a significant covariate in any model of death. The contribution of these features was so large that the other primary exposures of this study (co-morbidities and length of stay in the hospital) appeared comparatively unimportant. Although the logistic regression model containing age and time window is highly accurate, it is not very informative for assessing the effect of co-morbidities and other possible covariates on participant death after hospitalization.

6.2 Finding 2: Co-morbidities, diet, and physical activity are significant predictors

After the effects of age and time windows after discharge were removed, the diagnoses and procedures from the hospitalization data, the participants’ diet, and the participants’ physical activity emerged as variables of importance in the random forest model. More specifically, the primary diagnosis of the participant, a high-carbohydrate diet, the length of stay in the hospital, and the amount of physical activity were the most important variables. The summary of the logistic regression model indicates that these features have a significant effect on the estimated short-term probability of death. Although the model without age and time windows is not as accurate, it gives a clearer picture of the effect of co-morbidities and certain lifestyle behaviors on the probability of death after hospitalization.

PM 566 Practicum Final Report

Edward Kim