Development and Validation of Deep Learning Model for Intermediate-Stage Hepatocellular Carcinoma Survival with Transarterial Chemoembolization (MC-hccAI 002): a Retrospective, Multicenter, Cohort Study

Background: Few effective models are available to predict overall survival (OS) in patients with intermediate-stage hepatocellular carcinoma (IM-HCC) treated with transarterial chemoembolization (TACE). The learning survival neural network (DeepSurv) has been shown to perform better than the Cox proportional hazards model in predicting OS. This study aimed to develop a deep learning-based prediction model to predict individual OS. Methods: This multicenter, retrospective cohort study examined data from the electronic medical record systems of four hospitals in China between January 1, 2007, and December 31, 2016. Patients were divided into a training set (n=1075) and a test set (n=269) at a ratio of 8:2 to develop a deep learning-based algorithm (deepHAP IV). The deepHAP IV model was externally validated on an independent cohort (n=414) from the other three centers. The concordance index (C-index), the area under the receiver operating characteristic curve (AUROC), and calibration curves were used to assess model performance. Results: The deepHAP IV model had a C-index of 0.74, and the AUROC for predicting 1-, 3-, and 5-year survival reached 0.80, 0.76, and 0.74, respectively, in the training set. Calibration plots showed good agreement between actual and predicted OS in the training set and the validation cohort. Compared with five other Cox proportional hazards models, the model developed in this study performed better. Patients were finally classified into three groups by X-tile plots according to predicted 3-year OS rate (low: ≤0.11; middle: >0.11 and ≤0.35; high: >0.35). Conclusion: The deepHAP IV model can effectively predict the OS of patients with IM-HCC and performs better than previous Cox proportional hazards models.


Ivyspring International Publisher

Background
Intermediate-stage hepatocellular carcinoma (IM-HCC) is highly heterogeneous in tumor burden and in residual liver function (Child-Pugh score 5 to 9), which contributes to highly variable survival. The combination of chemotherapy, programmed death 1/programmed death ligand 1 (PD-1/PD-L1) inhibitors, targeted drugs and transarterial chemoembolization (TACE) has become prominent in unresectable-HCC research. TACE is recommended as the frontline treatment for IM-HCC [1], but only a subgroup of patients benefits from this therapy. Therefore, a model is needed to make individualized predictions of survival for these patients and to identify survival differences between them. Several risk prediction models for HCC have been developed, usually based on Cox regression analysis, to predict individual clinical outcomes, including the up-to-seven criteria [2], six-and-twelve criteria [3], HAP score [4], mHAP-Ⅱ score [5], mHAP-Ⅲ [6], ALBI grade [7], and BCLC-B HCC sub-classification [8]. However, most of these models are linear prediction models, whereas the relationships between variables in real-world data are usually nonlinear [9,10]. Some variables may be removed when constructing such models, which partially distorts the model and affects the prognostic prediction.
Artificial intelligence (AI) can synthesize and analyze multimodal data with high precision and reliability. In recent years, the use of AI in multiple medical fields, including liver disease, has rapidly increased [11,12]. Katzman et al. developed a novel deep learning method for survival analysis that integrates a deep neural network with Cox proportional hazards modeling, referred to as the learning survival neural network (DeepSurv) [13]. Compared with the Cox proportional hazards model, the DeepSurv model has demonstrated superior performance in predicting prognosis and providing personalized treatment recommendations in multiple solid tumors [14-18]. However, few DeepSurv models have been developed to predict prognosis in patients with IM-HCC, a group with strong heterogeneity. In this study, we developed the deepHAP Ⅳ model, based on DeepSurv and using the five variables of the mHAP-Ⅲ model, to predict individual overall survival (OS) in IM-HCC.

Data sources and patient selection
Between January 1, 2007 and December 31, 2016, data on consecutive patients with unresectable HCC (BCLC stage B) treated with TACE as first-line therapy were retrospectively collected from the electronic medical record systems of four hospitals in Guangzhou, China. Details of this multicenter, retrospective cohort study have been published previously [19-21]. The research was carried out in accordance with the Declaration of Helsinki. The Clinical Research Department of SYSUCC approved the study protocol (2017-FXY-129). Informed consent was waived because this study was a secondary analysis, and patient data were anonymized.

Definitions of variables and outcomes
Only baseline data were included in the analysis: age, gender, alpha-fetoprotein (AFP), albumin (ALB), total bilirubin (TBLT), Child-Pugh class, major tumor size, location of lesions, number of intrahepatic lesions, and hepatitis B virus (HBV) infection. The distributions of AFP and TBLT were skewed towards the left and were transformed to the Log10 scale (Log AFP and Log TBLT) for analysis. Following the mHAP-II score [5] and mHAP-III model [6], continuous variables were converted into categorical variables: AFP (≤400, >400), major tumor size (≤7, >7), and number of intrahepatic lesions (2, 3, >3). The endpoint of interest was OS, defined as the time from the first diagnosis of HCC to death or last follow-up.
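The variable handling described above can be sketched as follows. This is a minimal illustration in Python (the study itself used R); the function name and the returned field names are hypothetical, the units (ng/ml for AFP, cm for tumor size) are assumed from the patient example reported in the Results, and lesion counts below 2 are rejected only because the text lists no category for them.

```python
import math

def preprocess(afp, tblt, tumor_size_cm, n_lesions):
    """Apply the Log10 transform and the cut-offs quoted in the text.

    Illustrative only: names and layout are not the authors' code.
    """
    if afp <= 0 or tblt <= 0:
        raise ValueError("Log10 transform needs positive AFP and TBLT values")
    if n_lesions < 2:
        # Assumption: the text defines only the categories 2, 3 and >3.
        raise ValueError("no lesion category below 2 is defined in the text")
    return {
        "log_afp": math.log10(afp),
        "log_tblt": math.log10(tblt),
        "afp_gt_400": afp > 400,          # AFP (<=400, >400)
        "size_gt_7": tumor_size_cm > 7,   # major tumor size (<=7, >7)
        "lesion_cat": str(n_lesions) if n_lesions <= 3 else ">3",
    }

# Baseline values of the 66-year-old patient described in the Results.
example = preprocess(afp=792.6, tblt=13.4, tumor_size_cm=4, n_lesions=2)
```

For this patient the AFP flag is set, the tumor size flag is not, and the lesion category is "2".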

Deep learning model design and statistical analysis
In this study, some continuous variables were transformed into categorical variables and are expressed as numbers and proportions. Continuous variables were included either in logarithmic form or summarized as mean and standard deviation. Differences were compared using the chi-square test (or Fisher's exact test) for categorical measures and the Kruskal-Wallis test for continuous measures. Univariate Cox regression analysis was performed to explore the relationship between baseline characteristics and OS. For consistency with the mHAP-III model, Log AFP, ALB, Log TBLT, major tumor size, and intrahepatic lesion number were entered into the DeepSurv neural network model [13] to build the deepHAP IV model.
A deep learning algorithm with a 2-layer neural network was used to establish a prediction model that gives more individualized predictions of OS in IM-HCC. Convolutional neural networks (CNN) are a class of feedforward neural networks with deep structure and convolutional computation. In this study, we built the deep learning model with a CNN. The deepHAP IV model contained a single output node that calculates each patient's survival risk hθ(xi) and was trained using the negative log-partial likelihood function (Figure 1).
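In the DeepSurv formulation [13], the negative log-partial likelihood averages, over patients with an observed death, the difference between the log of the summed exponentiated risks in the risk set and the patient's own risk score. The sketch below is a plain-Python illustration of that loss, not the authors' implementation: the function name and toy inputs are hypothetical, the regularization term is omitted, and tied event times are handled in the simplest way (every patient with a follow-up time at least as long is in the risk set).

```python
import math

def neg_log_partial_likelihood(risk, time, event):
    """Average negative log partial likelihood over observed events.

    risk  : predicted log-risk scores h_theta(x_i), one per patient
    time  : follow-up times
    event : 1 if death was observed, 0 if censored
    """
    n_events = sum(event)
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    total = 0.0
    for i in range(len(risk)):
        if event[i] != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set: patients still under observation at time[i].
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        m = max(at_risk)  # log-sum-exp shift for numerical stability
        log_sum = m + math.log(sum(math.exp(r - m) for r in at_risk))
        total += risk[i] - log_sum
    return -total / n_events

# Toy data: higher risk assigned to earlier deaths gives a smaller loss.
well_ordered = neg_log_partial_likelihood([2.0, 1.0, 0.0], [1, 2, 3], [1, 1, 1])
mis_ordered = neg_log_partial_likelihood([0.0, 1.0, 2.0], [1, 2, 3], [1, 1, 1])
```

Minimizing this quantity pushes the network to give higher risk scores to patients who die earlier.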
Model discrimination was evaluated using the C-index and the time-dependent area under the curve (t-AUC). Calibration plots were used to assess the agreement between predicted and observed probabilities. To assess the clinical practicability of our model, we compared its performance with that of existing scores and criteria: the HAP score, mHAP II score, up-to-seven, four-and-seven, and six-and-twelve.
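For readers unfamiliar with the C-index, the sketch below shows one common definition (Harrell's C) in plain Python. It is an illustration under stated assumptions rather than the routine used in the study: the function name and toy data are hypothetical, a pair is treated as comparable when the patient with the shorter follow-up had an observed death, and tied risk scores count as half-concordant.

```python
def concordance_index(risk, time, event):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(risk)
    for i in range(n):
        if event[i] != 1:
            continue  # a pair is anchored on a patient with an observed death
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier death had the higher risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: the third patient is censored at the longest follow-up.
perfect = concordance_index([3.0, 2.0, 1.0], [1, 2, 3], [1, 1, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported C-index of 0.74 should be read.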
X-tile software was used to distinguish patients at low and intermediate survival risk from those at high risk. Kaplan-Meier curves and log-rank tests were used to compare differences between groups. A 2-sided P<0.05 was considered statistically significant. Statistical analyses were mainly performed using R (version 4.2.2; Vienna, Austria; www.r-project.org) and X-tile software (Yale University School of Medicine, New Haven, CT, USA).
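Kaplan-Meier curves follow the product-limit estimator, in which the survival estimate is multiplied at each death time by the fraction of at-risk patients who survive it. The sketch below is a plain-Python illustration with toy data; the study used R, so the function name and inputs are hypothetical.

```python
def kaplan_meier(time, event):
    """Return [(t, S(t))] evaluated at each distinct death time."""
    curve = []
    surv = 1.0
    death_times = sorted({t for t, e in zip(time, event) if e == 1})
    for t in death_times:
        at_risk = sum(1 for ti in time if ti >= t)
        deaths = sum(1 for ti, ei in zip(time, event) if ti == t and ei == 1)
        surv *= 1.0 - deaths / at_risk  # product-limit update
        curve.append((t, surv))
    return curve

# Toy data: four patients, the second one censored at time 2.
km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

The censored patient leaves the risk set without lowering the curve, which is why the estimate drops from 0.75 to 0.375 at time 3 rather than to 0.5.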

Patient Characteristics
A total of 8848 and 1695 HCC cases were collected from SYSUCC and the multicenter cohort, respectively. After applying the inclusion and exclusion criteria, 1344 patients from the SYSUCC cohort were included. Of these, 1075 patients formed the training set used to establish the deepHAP IV model, and 269 patients formed the internal test set (8:2 ratio). In addition, 414 patients from the multicenter cohort were enrolled as the external validation cohort (Figure S1). Baseline characteristics are shown in Table 1. Most patients were Child-Pugh class A (1502/1758, 85.4%), and the median tumor size was 7.2 cm. Most patients (1662/1758, 94.5%) had a history of HBV infection.
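The 8:2 division of the 1344 SYSUCC patients can be reproduced arithmetically as in the sketch below. The text does not state how patients were allocated, so the random shuffle, the seed and the function name are assumptions made only for illustration.

```python
import random

def split_train_test(ids, train_fraction=0.8, seed=0):
    """Shuffle identifiers and cut them at the requested fraction."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # assumption: random allocation
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# 1344 patients give 1075 for training and 269 for testing.
train_ids, test_ids = split_train_test(range(1344))
```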

Development of deepHAP IV model and validation
Table 2 shows the results of the univariate Cox regression analysis of potential risk factors. In line with the mHAP-Ⅲ model, we selected five factors (Log AFP, ALB, Log TBLT, major tumor size and intrahepatic lesion number) to develop the deepHAP IV model. In the training set, the C-index was 0.74 (95% CI: 0.71, 0.77), and the 1-, 3-, and 5-year areas under the receiver operating characteristic curve (AUROC) were 0.80, 0.76, and 0.74, respectively. Table 3 lists five sample predictions of survival rates for IM-HCC patients treated with TACE. For example, for a 66-year-old man with a Child-Pugh score of 6, baseline AFP of 792.6 ng/ml, ALB of 34.7 g/L, TBLT of 13.4 μmol/L, a largest tumor size of 4 cm and 2 tumors, the model predicted 1-, 3-, and 5-year survival rates of 0.819, 0.519 and 0.386, respectively. His actual OS was 12.8 months, which indicates that our model has good predictive power (Table 3). The test set and the external validation cohort were used independently for internal and external validation. In the test set, the C-index was 0.69 (0.63, 0.76), and the 1-, 3-, and 5-year AUROCs were 0.74, 0.70, and 0.69. In the validation cohort, the C-index was 0.70 (0.65, 0.75), and the 1-, 3-, and 5-year AUROCs were 0.77, 0.73, and 0.70. In addition, time-dependent receiver operating characteristic (ROC) analysis showed stable predictive performance within 40 months in both the training set and the validation cohort. When the time extended to 40 months, performance remained stable in the SYSUCC cohort but was less satisfactory in the multicenter cohort (Figure 2). A comparison of our model with five others [2-5,22], namely the HAP and mHAP II scores and the up-to-seven, four-and-seven, and six-and-twelve scores, showed that ours had the highest C-index and 1-, 3-, and 5-year AUROC (Table 4).
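The text does not state how the predicted 1-, 3-, and 5-year survival rates were obtained from the network output. In Cox-type models, including DeepSurv, an individual survival probability is usually derived as S(t|x) = S0(t)^exp(h(x)), where S0(t) is the baseline survival function and h(x) the log-risk score. The sketch below assumes this relation; the baseline value and the log-risk scores are hypothetical numbers chosen for illustration, not results from the study.

```python
import math

def survival_probability(baseline_survival, log_risk):
    """S(t | x) = S0(t) ** exp(h(x)) under proportional hazards."""
    if not 0.0 <= baseline_survival <= 1.0:
        raise ValueError("baseline survival must lie in [0, 1]")
    return baseline_survival ** math.exp(log_risk)

# Hypothetical baseline 3-year survival and log-risk scores.
s0_3yr = 0.30
low_risk = survival_probability(s0_3yr, -0.5)
reference = survival_probability(s0_3yr, 0.0)
high_risk = survival_probability(s0_3yr, 0.5)
```

A log-risk of zero returns the baseline itself, and the predicted survival falls as the log-risk rises.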
Calibration curves for 3-year and 5-year OS in IM-HCC patients were plotted for the training set and the validation cohort. They showed good agreement between the predicted and actual survival rates (Figure 3).
Individual 3-year survival rates were calculated with the deepHAP IV model, and patients were divided using X-tile into a low survival rate group (3-year survival rate ≤0.11), a moderate survival rate group (0.11 < 3-year survival rate ≤0.35), and a high survival rate group (3-year survival rate >0.35). OS differed significantly among the three groups in both the training set and the validation cohort (p<0.0001) (Figure 4). The OS in the training set was 8.9, 16.3, and 56.7 months in the low, middle, and high groups, respectively.
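The three-group classification can be written as a simple rule on the predicted 3-year survival rate. The cut-offs 0.11 and 0.35 are those reported above; the function name and the group labels are illustrative.

```python
def survival_group(pred_3yr_os):
    """Map a predicted 3-year OS rate to the low, middle or high group."""
    if not 0.0 <= pred_3yr_os <= 1.0:
        raise ValueError("predicted survival rate must lie in [0, 1]")
    if pred_3yr_os <= 0.11:
        return "low"
    if pred_3yr_os <= 0.35:
        return "middle"
    return "high"

# The sample patient with a predicted 3-year survival rate of 0.519.
sample_group = survival_group(0.519)
```

With these cut-offs, the sample patient described earlier falls into the high survival rate group.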

Discussion
In this study, we constructed a deep learning-based model that provides individualized predictions of survival for patients with IM-HCC after TACE. Five parameters (Log AFP, ALB, Log TBLT, major tumor size, and intrahepatic lesion number) were assessed, identified as predictors and used in model construction. Comparison with several linear prediction models showed that the deepHAP IV model performed better than traditional linear prediction models. The model effectively assesses prognosis after TACE in patients with BCLC stage B HCC. In addition, patients were divided into three groups by predicted 3-year survival rate (low: ≤0.11; middle: >0.11 and ≤0.35; high: >0.35).
Several common linear models, such as the HAP score [4], mHAP-Ⅱ score [5], up-to-seven criteria [2], six-and-twelve criteria [3] and four-and-seven [22], are often used as comparators of model performance. L. Kadalayil et al. first developed a simple linear model, named HAP, to predict the OS of HCC patients treated with TACE and showed that it performed better than other prognostic models [4]. Applied to our data, it performed less satisfactorily, with a C-index of 0.63 and 1-, 3-, and 5-year AUROCs of 0.67, 0.63, and 0.58. Lin H also tested this model on their own data and reported even poorer performance than ours, with a C-index of 0.54 and 1- and 3-year AUROCs of 0.60 and 0.58 [23]. In the study by Park et al., adding the "tumor number" variable to the HAP model yielded mHAP-II, which achieved better predictive performance. However, when we validated both models in our cohorts, their performance was similar [5]. The up-to-seven criteria extend the Milan criteria and can better predict OS after liver transplantation in patients with HCC [2]. In our study, this model had 1-, 3-, and 5-year AUROCs of 0.62, 0.64 and 0.63 and a C-index of 0.60. Lin H also tested it on their data and obtained an AUROC of 0.62 at both 1 and 3 years and a C-index of 0.59 [23]. Like the models mentioned above, the six-and-twelve score and four-and-seven also performed poorly in both our study and that of Lin H. In short, we compared several linear models and found that a prediction model constructed with an ML method can perform better. Linear models may have limitations such as over-fitting or an inability to capture non-linear relationships between variables [10,24].
With the development of ML algorithms, many studies have built prognostic prediction models for HCC using ML methods. Lin H explored a machine learning-based model to predict the survival of patients with IM-HCC after TACE. Five variables were included in the model: tumor size, BCLC B sub-classification, AFP, ALB, and number of lesions. That model had a C-index of 0.69, and the AUROC for predicting survival in the first three years reached 0.72, 0.71, and 0.73 [23]. The performance of their model is less satisfactory. Hence, we developed the deepHAP IV model, which had a C-index of 0.74 and AUROCs for predicting 1-, 3-, and 5-year survival of 0.80, 0.76, and 0.74 in the training set. The calibration curve coincided well with the standard line. We further divided patients into high, middle and low 3-year survival rate groups. Deep learning has already been applied to multiple tumor types. Deep learning networks can learn the highly intricate linear and nonlinear associations between prognostic clinical characteristics and an individual's risk of death from HCC [25].
This study had several limitations. First, as a secondary study, it could not avoid the limitations of the raw data, such as selection bias, which is an intrinsic limitation of retrospective data. The data were derived from a Chinese population, and it remains to be verified whether the model applies to other ethnic populations. Second, other clinical indicators may also affect prognosis, but they were missing from the original data and were therefore not included in the model. More clinical parameters, genetic data, and imaging features need to be incorporated into modeling in the future. As the method and number of TACE sessions can influence the OS of IM-HCC, it is important to include this information. We hope to conduct more prospective clinical studies on IM-HCC in the future to confirm our conclusions and hypotheses. Third, current deep learning methods are still not readily available for clinical practice. Finally, it is difficult to understand how a deep learning network makes its decisions, because such networks function much like black boxes.
In conclusion, we established a deep learning-based model, which can effectively predict the OS of patients with IM-HCC, showing a better performance than previous Cox proportional hazards models.

Figure 1. Diagram of the Study Procedure.

Figure 2. The time-dependent area under the curve in the SYSUCC Cohort (A) and Multicenter Cohort (B).

Figure 3. Calibration Plots for Overall Survival for the deepHAP IV Model in the SYSUCC Cohort (A, B) and Multicenter Cohort (C, D).

Table 1. Baseline characteristics of training and validation cohort.

Table 2. Univariate Cox regression analysis of potential risk factors.

Table 3. Five samples of predicted survival rate for intermediate-stage HCC treated with TACE.

Largest tumor size (cm) Tumor number Predicted 1-yr survival rate Predicted 3-yr survival rate Predicted 5-yr survival rate
A 66-year-old man with a Child-Pugh score of 6; OS 12.8 months (died).

Table 4. Comparison of the deepHAP IV model versus other models for intermediate-stage HCC treated with TACE.