Risk Factors of Lung Cancer Patients: A Survival Analysis with R

Use your smartphone to scan this QR code and download this article ABSTRACT Introduction: This paper studies risk factors which can have effects on the survival time of lung cancer patients during the treatment. Methods: The Cox proportional-hazards model has been applied for investigating the association between the survival time of patients and the predictors such as age, gender, the weight of patients, meal, the ECOG, and Karnofsky scores. Results: In the study, we find that the ECOG score, the Karnofsky score evaluated by doctors and the gender are the top three factors that significantly affect the hazard rate. Also, we utilize the estimated model to predict survival probability for the patients. Conclusion: In this article, we intentionally present a complete and detailed guide on how to perform a R-based package in survival analysis step by step as well as how to interpret all output results.


INTRODUCTION
In scientific research, we sometimes encounter data related to assessing an event over time. For example, a "death" event that describes the time from diagnosis to death or a "relapse" event describes the time from when a given treatment is applied until the recurrence of the disease. Nevertheless, until the end of the study period, not all cases occur "events". For instance, the study of death caused by lung cancer, from the time of diagnosis, duration of the study was 36 months; not all patients died after the end of the study. Therefore, it can be seen that their time of death is unknown. Hence, this type of research data is characterized as rarely existing in a normal distribution, because some data are complete, some data are censored. Thus, the typical and common method of analysis in this form of data is survival analysis. Survival analysis is a crucial sector in Statistics to investigate the expected duration of time until one or more events occur. For example, death in biological organisms and failure in mechanical systems. This method is also called reliability theory or reliability analysis in engineering and duration analysis. About this regard, there are many scholars studied it; for instance, Laird and Olivier(1981) 1 present the using log-linear analysis techniques to covariance analysis of censored survival data. Hakulinen and Abeywickrama (1985) 2 introduced a package to survival analysis. Murray et al. (1993) 3 provided a survival analysis of joint replacements. Leggat et al. (1998) 4 presented the noncompliance in hemodialysis: predictors and survival analysis. Klein and Moeschberger (2006) 5 provided techniques for survival analysis with censored and truncated data. Survival analysis is a widespread analysis in several medical studies. It is the type of data analysis of an event that occurs at the recorded time, after a certain treatment intervention or after the time the pathology is diagnosed. For example, calculate the risk of death of patients after surgery to remove the liver cancer. Patients will be monitored for the procedure after 2-7 years, statistics of deaths to determine the death rate, the patient's ability to live at the following during surgery. In addition, survival analysis can be used to compare the probability of survival of patients after the intervention of two or more certain treatments. Also, it was found that the relationship between the probability of survival of the patient and other factors such as age at intervention, stage of the disease, chemotherapy dose by developing the Cox risk ratio model. Up to the present, the problems of survival analysis are still concerned and researched by several scientists. Readers may refer in Mahmood et al. (2007) 6  The main purposes of this article are to study the association between the survival time of lung patients and the predictors such as age, gender, the weight of patients, meal, the ECOG and Karnofsky scores. Then, we can train a model used for predicting hazard rate and the probability of survival time. Besides, we intentionally present a complete and detailed guide for survival analysis by using the statistical software R. As we know that, R is a very powerful statistic software, easy to install, and easy to use. Survival analysis plays a tremendously crucial role and immensely profound significant in life. Thus it is exceedingly meaningful to have a study about using the statistical software R in survival analysis.
The paper is organized as follows. The definitions and examples of the ubiquitous functions and the most widespread model in survival analysis are presented in Section 2. The methods used for estimating the ubiquitous functions in survival analysis are also offered in this section. The procedure of applying survival analysis and an application to the real data set is provided in Section 3. Some discussions are illustrated in the next Section 4. Conclusions are stated in the last section.

METHODOLOGY
In this section, we first review the most ubiquitous functions for survival analysis, such as survival and hazard function and examples of these functions. We now discuss the survival function in the next section.

Survival Function
As we have known, survival time is a non-negative random variable, so if we denote by T the time of the event, then T ≥ 0. The time considered in the study may be discrete (set of discrete values (0 < t 1 < t 2 < t 3 ...) or may be continuous T ∈ [0, ∞). Let S(t) be a survival function defined by (1)

Hazard Function
This function is described as the probability that the research object will happen at time t, usually denoted by λ (t) or h(t). The function h(t) is defined as follow It can be shown that In fact, we have Besides, by the definition, we have From this result, we can express This is equivalent to Thereafter, we get and .
In case of the discrete variables: The cumulative hazard function is defined by In order for readers to easily access these two functions, we provide two specific examples in the next sub-section.

Examples of Survival and Hazard Function
(a) Example 1: Exponential distribution The exponential density with mean parameter θ has the form The survival function is given by The hazard function is then The cumulative hazard is given by (b) Example 2: Weibull distribution The Weibull density with shape parameter λ and scale parameter θ is The survival function can be expressed as follows The hazard function is then The cumulative hazard is given by In most analyses, we often consider the appropriate regression model to study the relationship between covariates. It is the same with survival analysis. The Cox regression model is the most widespread regression model in this analysis. Thus we present this model in the next sub-section.

Cox Regression Model (Cox's proportional hazards model)
This model was first proposed by David Cox (1972) 13 . The probability of the endpoint is called the hazard such as death or any other event of interest, e.g., recurrence of the disease. The hazard is modeled as: where t represents the survival time, h(t) is the expected hazard at a time t, and (X 1 , X 2 , ...X n ) is a collection of predictor variables. The coefficient (α 1 , α 2 , ..., α n ) measures the impact of covariates, and h 0 (t) is the baseline hazard at a time t, represents the hazard when all of the predictors (or independent variables) (X 1 , X 2 , ...X n ) are equal to zero.
The quantities exp (α i ) are called hazard ratios (HR). If α i , or HR > 1, it indicates that as the value of the i th covariate increases, the hazard will also be increased. This value is called a bad prognostic factor. In this case, the length of survival decreases. By contrast, if α i < 0 it's called a good prognostic factor. If the hazard ratio equal to 1 corresponding α i = 0, the covariate makes no effect.

Estimation of the survival and hazard function
We present the two most widespread methods to estimate the survival and hazard function that is Kaplan-Meier estimator and Nelson-Aalen estimator. Let t i be a time when at least one event happened. Suppose that t i is arranged in order: the number of events (e.g., deaths) that happened at the time t i ,n i is the number of objects that occur at t i or later, and let r i be the number of individuals at risk (i.e., alive and not censored) just before to time t i .

Kaplan-Meier estimator
This estimator is a non-parametric maximum estimate for the survival function S t . This method is proposed by Kaplan and Meier (1958) 14 . It can be expressed as follows

Nelson-Aalen estimator
The Nelson-Aalen estimator is a non-parametric estimator that can be utilized to estimate the hazard function. Because there is no need for distribution assumptions, a crucial use of the estimator is to test the appropriateness of parameter models graphically, and this is exactly why this method was first introduced by Nelson (1969Nelson ( , 1972 15,16 . Independent from Nelson, Altshuler (1970) 17 also studied this issue. Later, Aalen (1978) 18 expanded its use beyond survival data and established competitive risk, and studied its small and large sample properties by martingale. Nowadays, this estimator is called the Nelson-Aalen estimator. This function can be expressed as follows

DATA AND ANALYSIS RESULTS
The procedure of using R in survival analysis In this section, we introduce the procedure of survival analysis using R language as well as the usage of some functions.
Step 1. Determining variables related to time factors such as lifetime, survival time, and failure time. Before using the R-system software for survival analysis, we first have to load the survival package into the working environment by using the command library(survival).
Step 2. Calculating descriptive statistics, plot some graphs for the data set, for instance, using plot, Survand survfit built in the survival package.
Step 3. Applying the Cox regression model to find out the risk factors which have effects to the hazard risk h(u). The command is coxph(). The simple usage of the function coxph() is given below: coxph(combined variables~independent variables, data) or it can be written as follows coxph(Surv(variable 1, variable 2)~independent variables, data) Step 4. We can use cox.zph() to test the hypothesis of risk factors: cox.zph(coxph(Surv(variable 1, variable 2)~independent variables, data)) In order to help readers that can easily use the statistical software R in survival analysis in practice, we introduce analysis with real data in the next section.

Application and analysis results
Firstly, we load the survival package to be developed by Terry Therneau et al. 19 into the working environment: > library(survival) In this section, we do analysis on the lung cancer data set, which is taken from the North Central Cancer Treatment Group. This data set is named "lung". The data set has 228 rows and 10 columns, with 10 variables described as follows Using the data above, we focus on how to estimate the probability of survival time for patients and to find out risk factors which can cause deaths to the patients based on the Cox regression model. Next, we can take a look on the first 6 rows and last 6 rows of the data set by using the command head(lung) and tail(lung). After executing these commands, the result is depicted in Figure 1.
Note that NA is missing values. We now can compute descriptive statistics by installing and loading some packages, as shown below: Cancer Treatment Group, with 228 rows and 10 columns. In this study, we find information on relative among variables with time (in days) of each status. > install.packages("fBasics") > install.packages("tseries") > library(fBasics) > library(tseries) Because our data set is named "lung", so we execute the command: > basicStats(lung) For simplicity, we can utilize the round(result, 2)to get 2 decimal places. The general of the roundfunction in R is round(result, n) to get n decimal places. > round(basicStats(lung),2) After performing the above command, the result of the descriptive statistics for the data set "lung" is depicted in Figure 2.
As can be seen in Figure 2, the average lifetime of the patients is more than 305 days.  1 Restricted in physically strenuous activity but ambulatory and able to carry out work of a light or sedentary nature, e.g., light housework.
2 Ambulatory and capable of all self-care but unable to carry out any work activities; up and about more than 50% of waking hours.

3
Capable of only limited self-care; confined to bed or chair more than 50% of waking hours.

4
Completely disabled; cannot carry on any self-care; totally confined to bed or chair.    The results show that the data has 228 objects, in which it occurred 165 death events with median of the lifetime is 310 days, and the 95% confidence intervals of the lifetime ranges from 284 to 361 days. In order to illustrate survival probability, we can represent it through a graph. We execute the following command. > plot(survfit(survival.time~1), xlab="Time",ylab="Cumulative survival probability", col=c("red", "black", "black")) After performing the above command, the result in R is provided as in Figure 4. After we have thoroughly investigated the data set, we want to study how the factors affect the hazard rate h(t). The Cox regression model helps us address this problem. We can estimate this model by the function coxph(formula, data).  , and two black lines are 95% confidence intervals. The 95% confidence interval of this example is relatively good, and the interval is changing from narrow and larger width, which can help us to realize the change of the survival probability of patients over time.

Dead
We first use the Cox regression model to check relationship between the compound variable (time and status) and all remaining independent variables by the following command: >coxph(Surv(time,status)~age+sex+ph.ecog +ph.karno+pat.karno+ +meal.cal+wt.loss, data=lung) After performing the above command, we have the results presented as follows. > pkh=coxph(Surv(time,status)~age+sex+ph.ecog +ph.karno+ +pat.karno+meal.cal+wt.loss, data=lung) > pkh Call: coxph(formula = Surv(time, status)~age + sex + ph.ecog +ph.karno + pat.karno + meal.cal + wt.loss, data = lung) coef exp(coef) se(coef) z p-value age 1.06e-02 1.01e+00 1.16e-02 0.92 0.3591 sex -5.51e-01 5.76e-01 2.01e-01 -2.74 0.0061 ph.ecog 7.34e-01 2.08e+00 2.23e-01 3.29 0.0010 ph.karno 2.25e-02 1.02e+00 1.12e-02 2.00 0.0457 pat.karno -1.24e-02 9.88e-01 8.05e-03 -1.54 0.1232 meal.cal 3.33e-05 1.00e+00 2.60e-04 0.13 0.8979 wt.loss -1.43e-02 9.86e-01 7.77e-03 -1.84 0.0652 Likelihood ratio test=28.3 on 7 df, p=0.000192 n= 168, number of events= 121 (60 observations deleted due to missingness) It can be seen that, this is the results based on the influence of age, sex, Ecog and Karnofsky scores of doctors and patients, calories per meal and weight lost over 6 months. Results show that the variables sex, ph.ecog, and ph.karno have significant influences on the survival probability because their p-value < 5%). The coefcolumn represents for estimates of the parameters in the model, while exp(coef) represents the influence on the risk score when the variable increases by 1 unit. Next, we can use summary(pkh) to summarize all results from the fitted Cox regression model as below: > summary(pkh) Call: coxph(formula = Surv(time, status == 2)~age + sex + ph.ecog + ph.karno + +pat.karno + meal.cal + wt.loss, data = lung) n= 168, number of events= 121 (60 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(>|z|) age 1. These tests are applied to evaluate the null hypothesis that all of coefficient α of model is zero. So in this case, the null hypothesis is rejected. We can see that the covariates sex, ph.ecog, and ph.karno are significant since p-values are less than 0.05. However, the covariates of age, pat.karno, meal.call, and wt.loss are failed to be significant. The p-value for sex is 0:00609, with a hazard ratio hf = exp(coef) = 0:5765 < 1, show that it has a strong relationship between the partient's sex and decreased risk of death. Note that the hazard ratios of covariates are interpretable as multiplicative effects on the hazard. Similarly, the p-value for ph.ecog is 0.00101, with hazard ratio is 2:0838 > 1, indicating a strong relationship between the ph.ecog value and increased risk of death. By contrast, the p-value for meal.cal is 0.89791. The hazard ratio is 1.000, with a 95% confidence interval of 0.9995 to 1.0005. Since the confidence interval for hazard ratio include 1, it indicates that age makes a smaller contribution to the difference in the hazard ratio after adjusting for ph.ecog value, pat.karno value and patient's sex, and the only trend toward significance. It seems not a significant contribution in this case. For the purpose of predicting the survival probability of an event, we split data into training data set and testing data set with the ratios 80% and 20% of the whole data, respectively. With the training data set, At the first row in test1 data set, we can state that the probability that a 57-year-old man is still alive after 210 days is 0.5725. Also, the model can't define probability for some events because of missing data.

DISCUSSIONS
The paper presents a completed and detailed guide survival analysis and its application by using the statistical software R. Survival analysis plays a tremendously crucial role an immensely profound significant in life. Thus, it is exceedingly meaningful to have a study about using the statistical software R in survival analysis.
This paper is also a valuable reference for faculty members as well as students in probability and statistics. Besides, it can also help those who are interested in this area. The article is very detailed and complete about survival analysis, so it is easy for people to access and understand it. In general, it can be seen that the functions in R used for survival analysis will work well if our data set does not contain missing values. In practice, however, we often encounter data sets that contain missing values. Therefore, we cannot immediately apply the available functions in R for survival analysis. Usually, the effective solution for this situation is to combine with methods used for solving the missing data problems. Regrading missing data problems, they have also been studied and mentioned by different researchers. For instance, Pho et al. (2019b) 22 reviewed three update methods to solve the issues with missing data. Besides, we refer to Little (1992) 23 24 , etc; for further details. This will also be a potential research direction if we can combine the methods used for dealing with missing values and the methods of survival analysis. In this study, we capture the relationship between survival time and risk factors of lung cancer patients by only the Cox regression. For future work, it will be a more powerful tool if we can utilize different machine learning models such as survival tree, random forest, etc. Also, we can think of applying cause-effect relationships inference in survival analysis. Such relationships can be modeled by the Directed Acyclic Graph (DAG), and the effect scores of the risk factors can be determined by using the Treenet models, see, e.g., Changpetch (2016) 25 . Time series analysis and survival analysis lead to future object changes, so understanding and working together will yield more reliable results. These research tools have an important practical application, such as the study of the Covit-19 disease. About related research topics can look in Maleki et al. 26,27 , Pho, K. H. 28 .

CONCLUSIONS
In this study, we have presented the most ubiquitous functions in survival analysis, consisting of survival and hazard function and examples of these functions. In addition, we also reviewed the Cox regression model which is one of the most widespread model in survival analysis. Moreover, we provided the approach to estimating the survival and hazard function as well as the procedure of using R in survival analysis. As an application, the data set of lung cancer has been used to analyze risk factors which can cause deaths of the patients. Furthermore, we have also introduced some extensive research directions such as doing survival analysis with missing data and the study of the cause-effect relationships in survival analysis.