Diagnostic Value of Plasma MicroRNAs for Lung Cancer Using Support Vector Machine Model

Aim: MicroRNAs (miRNAs), small single-stranded non-coding RNAs, play an important role in carcinogenesis by degrading target mRNAs. However, the diagnostic value of miRNAs in lung cancer has not been fully explored. In this study, a support vector machine (SVM) model for the diagnosis of lung cancer was established based on plasma miRNA biomarkers, clinical symptoms and epidemiological data. Methods: The expression of plasma miRNAs was examined with SYBR Green-based quantitative real-time PCR. Results: The expression of 10 plasma miRNAs (miR-21, miR-20a, miR-210, miR-145, miR-126, miR-223, miR-197, miR-30a, miR-30d, miR-25), smoking status, fever, cough, chest pain or tightness, bloody phlegm and haemoptysis differed significantly between the lung cancer and control groups (P<0.05). The accuracies of the combined SVM, miRNAs SVM, symptom SVM, combined Fisher, miRNAs Fisher and symptom Fisher models were 96.34%, 80.49%, 84.15%, 84.15%, 75.61%, and 80.49%, respectively; the AUCs of these six models were 0.976, 0.841, 0.838, 0.865, 0.750, and 0.801, respectively. The accuracy and AUC of the combined SVM model were higher than those of the other 5 models (P<0.05). Conclusions: Our findings indicate that an SVM model based on plasma miRNA biomarkers may serve as a novel, accurate, noninvasive method for the auxiliary diagnosis of lung cancer.


Introduction
Lung cancer is currently the leading cause of cancer morbidity and mortality worldwide [1] and is classified into small cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC). NSCLC accounts for 85% of all lung cancers and is the main subgroup [2]. NSCLC can be further divided into three major histological subtypes: adenocarcinoma (AC), squamous cell carcinoma (SCC) and large cell carcinoma (LCLC) [3]. Because early-stage NSCLC patients have no obvious clinical symptoms, and sensitive biomarkers and effective tools for early diagnosis are lacking, more than 75% of NSCLC patients are still diagnosed at advanced stages with distant metastases [4]. Although novel therapies are improving the survival of lung cancer patients, the 5-year survival rate is still less than 15% for advanced NSCLC, whereas it is up to 80% for early-stage NSCLC [5]. Therefore, early diagnosis and early screening of lung cancer are particularly important.
Lung cancer is diagnosed by means of histological examination, diagnostic imaging, low-dose spiral computed tomography (LDSCT) and positron emission tomography (PET). Although these techniques have improved, they still have limitations, and the five-year survival rate of lung cancer remains low [6]. For example, histological examination is the gold standard for the diagnosis of lung cancer, but it is not suitable for early screening because it is invasive and technically demanding. Diagnostic imaging such as chest X-ray (CXR) and computed tomography (CT) has been used for diagnosing NSCLC at an early stage; however, it carries a radiation hazard and has a limited role in reducing lung cancer mortality [7]. Although LDSCT reduces lung cancer mortality by 20% in people at high risk of lung cancer, its false-positive rate is as high as 90% [8].
Although the sensitivity and specificity of PET are up to 90%, there is still a 10% false-positive rate and the cost is high [9]. Therefore, new biomarkers and therapeutic strategies urgently need to be developed for better management of lung cancer.
miRNAs are small single-stranded non-coding RNAs that play vital regulatory roles by targeting mRNAs for degradation or translational repression. They act as key regulators of cell proliferation, differentiation, apoptosis and other biological processes [10]. A number of studies suggest that miRNAs are involved in human diseases, including cancers. miRNA expression associated with lung cancer has been identified in a variety of normal and cancer tissues [11,12]. Moreover, it has been demonstrated that plasma miRNAs regulate numerous target genes and play a critical role in lung carcinogenesis, which indicates that miRNAs might be a potential diagnostic tool for lung cancer [13]. Published studies [14-16] have shown that 11 plasma miRNAs (miR-16, miR-21, miR-20a, miR-210, miR-145, miR-126, miR-223, miR-197, miR-30a, miR-30d, miR-25) are abnormally expressed in lung cancer patients. These results suggest that combining several miRNAs can improve the sensitivity and specificity of early diagnosis of lung cancer.
Data mining (DM), also called Knowledge Discovery in Databases (KDD), is the process of extracting potentially useful information and knowledge from abundant, incomplete, noisy, fuzzy and random practical application data [17]. DM has a unique advantage in solving multi-parameter problems.
Classification is one of the important functions of data mining and is often closely related to disease diagnosis. At present, data mining is used primarily in the field of auxiliary diagnosis of diseases [18]. DM techniques include SVM, artificial neural networks (ANN), decision trees (DT), genetic algorithms and so on. SVM is a pattern recognition method based on statistical learning theory (SLT) and structural risk minimization, with advantages such as strong generalization ability and the capacity to handle non-linear and high-dimensional problems in many areas [19].
Building on our previous research [20], this study explored the significance of an SVM model using plasma miRNA biomarker data for the auxiliary diagnosis of lung cancer.

Study population
The lung cancer patient group consisted of 148 cases (age range 29-87 years) with primary lung cancer from the First Affiliated Hospital of Zhengzhou University, Henan Cancer Hospital and Henan Provincial Chest Hospital, enrolled from Jun. 2016 to Feb. 2017. Patients were selected on the basis of the following inclusion criteria: (1) pathologically diagnosed primary lung cancer that met histological or cytological criteria; (2) no prior surgical resection, chemotherapy, or radiotherapy; (3) no previous tumors of other organs; (4) good compliance and availability of outcome data. Patients with major organ function failure and patients who were pregnant or lactating were excluded. Pathologic diagnosis was based on WHO criteria. Lung cancer staging for each patient was performed according to the AJCC Cancer Staging Manual, 7th edition.
Controls were recruited from people from one company who underwent physical examinations at the Qixian Center for Disease Control and Prevention. Controls were selected according to the following criteria: (1) no malignant tumors of the lung or other organs; (2) no major organ function failure; (3) not pregnant or lactating; (4) good compliance and availability of outcome data. A total of 148 gender- and age-frequency-matched (±3 years) controls were enrolled in this study. Consent was obtained from each participant. A questionnaire covering epidemiological information was completed for each participant by trained interviewers. Smokers were defined, according to WHO criteria, as people who had smoked for six months or more in their lifetime. Alcohol drinkers were defined as people who drank alcohol at least once a week with a consumption of pure alcohol above 20 g.

Main instruments and reagents
The instruments and reagents used in the study included a Labcycler PCR amplifier (SensoQuest Company, China), a 7500 Fast Real-time PCR system (ABI, USA), primers (Sangon Biotech), a miRcute miRNA extraction and separation kit (Tiangen, Beijing), a miRcute enhanced miRNA fluorescence quantitative detection kit (Tiangen, Beijing) and a ChemiDoc MP gel imaging analyzer (Bio-Rad, USA).

Statistical analysis and model evaluation
The Ct values of the samples were calculated with the software of the real-time PCR instrument. The fold change in miRNA expression in lung cancer patients relative to normal controls was calculated as $2^{-\Delta\Delta Ct}$, where $\Delta Ct = Ct_{\text{miR}} - Ct_{\text{external reference}}$ and $\Delta\Delta Ct = \Delta Ct_{\text{miR}} - \Delta Ct_{\text{average of normal controls}}$.
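The calculation can be sketched in a few lines of Python. This is only an illustration of the $2^{-\Delta\Delta Ct}$ formula stated above; the function names and all Ct values are hypothetical and are not taken from the study data.

```python
from statistics import mean


def delta_ct(ct_mirna, ct_reference):
    """Delta Ct of one sample: Ct(miRNA) minus Ct(external reference)."""
    return ct_mirna - ct_reference


def fold_change(ct_mirna, ct_reference, control_delta_cts):
    """Relative expression 2^(-ddCt) of one sample versus the control average."""
    dd_ct = delta_ct(ct_mirna, ct_reference) - mean(control_delta_cts)
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values for three normal controls (illustration only).
control_delta_cts = [
    delta_ct(28.0, 20.0),
    delta_ct(29.0, 21.0),
    delta_ct(27.5, 19.5),
]

# Hypothetical patient sample: its Delta Ct is 2 cycles below the control mean.
patient_fc = fold_change(ct_mirna=26.0, ct_reference=20.0,
                         control_delta_cts=control_delta_cts)
print(patient_fc)
```

A Delta Ct two cycles lower than the control average corresponds to a fourfold higher relative expression, because each PCR cycle is assumed to double the product.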
The data were analyzed using SPSS 21.0 software, and SPSS Clementine 21.0 software was used for data mining. Quantitative data were compared with the independent-samples t-test or the Mann-Whitney U test. Contingency tables were tested with the chi-square test. Binary logistic regression was conducted to analyse the factors influencing lung cancer. The significance level was set at 0.05.
Sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and area under the ROC curve (AUC) were used to evaluate the models.
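As a minimal sketch, the threshold-based indexes can be computed from a 2x2 confusion matrix as follows. The counts used here are an assumption: they are not reported in the text, and were chosen only because they match the validation set size (48 cancer cases, 34 controls) and agree, up to rounding, with the sensitivity, specificity and accuracy reported for the combined SVM model.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic indexes from confusion-matrix counts.

    tp/fn: cancer cases predicted as cancer / as control
    tn/fp: controls predicted as control / as cancer
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Assumed counts (inferred for illustration, not stated in the paper).
metrics = diagnostic_metrics(tp=47, fn=1, tn=32, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUC is not included because it depends on the continuous model scores rather than on a single confusion matrix.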

Data preprocessing
Data transformation: The relative expression of the 11 miRNAs did not follow a normal distribution, so a normalizing transformation was needed. The expression values of the 11 miRNAs were transformed using the common (base-10) logarithm.
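A minimal sketch of this transformation is shown below. The input values are hypothetical; the only assumption about real data is that relative expression values are strictly positive, which holds for any quantity of the form 2 raised to a real power.

```python
import math


def log10_transform(values):
    """Apply the common (base-10) logarithm to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is undefined for non-positive values")
    return [math.log10(v) for v in values]


# Hypothetical right-skewed relative expression values (illustration only).
relative_expression = [0.1, 1.0, 10.0, 100.0]
transformed = log10_transform(relative_expression)
print(transformed)
```

The transformation compresses large fold changes and spreads out small ones, which is why it is commonly used to bring right-skewed expression data closer to a normal distribution.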
Training and validation sets: Using the random sampling function of the Partition node with a ratio of 3:1, the transformed data of each group were randomly separated into a training set (114 controls, 100 cancer cases) and a validation set (34 controls, 48 cancer cases). The training set was used to develop the model, while the validation set was used to verify it.

Model derivation
The Data node was the source of data for the study; the variables were declared using the Type node; the samples were randomly divided into a training set and a validation set in a proportion of 3:1 using the Partition node, with a random number seed of 1111111.
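The partition step can be approximated outside the modeling software with a seeded shuffle. This is a sketch under stated assumptions, not a reproduction of the Partition node: the `partition` function is hypothetical, Python's random generator differs from the one in the software, and the sketch cuts at exactly 3:1, whereas the split reported in this study (214 training, 82 validation) is only approximately 3:1.

```python
import random


def partition(samples, train_fraction=0.75, seed=1111111):
    """Shuffle with a fixed seed and split into training and validation sets."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    shuffled = list(samples)    # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder sample identifiers: 148 cases and 148 controls.
samples = [("case", i) for i in range(148)] + [("control", i) for i in range(148)]
train, valid = partition(samples)
print(len(train), len(valid))
```

Fixing the seed means that repeated runs give the same training and validation sets, so model comparisons are made on identical data.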

Fisher discrimination model
Fisher discrimination is a widely used classification model among traditional statistical methods. Its basic idea is to project the data before discriminant analysis; projection is the core of Fisher discriminant analysis. After repeated training, the Fisher discrimination parameter settings were: Use partitioned data: no; Method: Enter; Mode: Expert; Prior probabilities: All groups equal; Use covariance matrix: Within-groups.
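The projection idea can be illustrated with a two-class, two-feature sketch in plain Python. It computes the textbook Fisher direction (inverse within-class scatter times the difference of class means) and classifies by comparing the projection with the midpoint of the projected class means, which corresponds to equal priors. The data points are hypothetical, and this is not the software implementation used in the study.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 scatter matrix of the rows around the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_fit(class0, class1):
    """Return projection vector w and threshold for two 2-D classes."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    threshold = 0.5 * (dot(w, m0) + dot(w, m1))  # midpoint: equal priors
    return w, threshold


def fisher_predict(x, w, threshold):
    return 1 if dot(w, x) > threshold else 0


# Hypothetical two-feature data (illustration only).
controls = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)]
cases = [(4.0, 5.0), (5.0, 4.0), (4.5, 5.5), (5.5, 4.5)]
w, threshold = fisher_fit(controls, cases)
print([fisher_predict(x, w, threshold) for x in controls + cases])
```

Because the decision reduces to a single threshold on a linear projection, the method works well when the classes are roughly linearly separable with similar covariance, and less well otherwise.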

SVM model
The basic principle of SVM is to transform the input space into a high-dimensional space by means of the nonlinear transformation defined by an inner product (kernel) function, and to find the optimal linear classification surface in that space.
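The following toy sketch illustrates this principle only; it does not train an SVM and is unrelated to the study data. It assumes a degree-2 polynomial kernel and an XOR-like data set, both chosen for illustration. The points cannot be separated by a line in the original two-dimensional space, but after the feature map implied by the kernel, a plane separates them.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit map phi such that phi(x) . phi(z) equals K(x, z)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


# XOR-like toy data: not linearly separable in the input space.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

# The kernel value equals the inner product in the 3-D feature space.
kernel_matches = all(
    math.isclose(poly_kernel(p, q), dot(feature_map(p), feature_map(q)),
                 abs_tol=1e-9)
    for p in points for q in points
)

# A linear surface in feature space (chosen by hand, not learned).
w, b = (0.0, 1.0, 0.0), 0.0
predictions = [1 if dot(w, feature_map(p)) + b > 0 else -1 for p in points]
print(kernel_matches, predictions)
```

In an actual SVM the separating surface is found by optimization and the feature map is never computed explicitly; only kernel values between pairs of samples are needed.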

Demographic characteristics of lung cancer patients and controls
A total of 148 lung cancer patients (mean age 60.97 ± 10.83 years) and 148 controls (mean age 60.14 ± 9.66 years) were enrolled. The age of the subjects was normally distributed, so the subjects were divided into two age groups according to the mean age (60 years). All subjects were divided into four groups according to smoking status (never smoking; light smoking: <10 cigarettes/day; moderate smoking: 10-20 cigarettes/day; heavy smoking: >20 cigarettes/day). As shown in Table 1, there were no significant differences in average age, sex or alcohol consumption between the two groups (P>0.05). However, the frequencies of smoking, fever, cough, chest pain or tightness, bloody phlegm and hemoptysis were significantly higher in the cancer group than in the control group (P<0.001).

Clinical pathologic characteristics of lung cancer patients
The clinical and pathological characteristics of the lung cancer patients in this study are shown in Table 2. The lung cancer group consisted of 36 SCC cases, 18 SCLC cases, 66 AC cases, 2 LCLC cases, and 26 cases of other histological types; 33 cases were clinical stage Ⅰ or Ⅱ and 101 cases were clinical stage Ⅲ or Ⅳ.

The evaluation of models
The evaluation indexes of the 6 models are presented in Table 5. The sensitivity of the combined SVM model reached 97.90%, and its specificity was 94.10%. Its PPV and NPV were likewise the highest, and its AUC was greater than 0.9. On the other hand, the AUCs of the miRNAs Fisher and symptom Fisher models were slightly smaller than those of the other models. The AUCs of the 6 models are shown in Table 6. The AUC of the combined SVM model was superior to those of the other 5 models, and the differences were statistically significant (P<0.05); the AUC of the combined Fisher model was higher than those of the miRNAs Fisher and symptom Fisher models (P<0.05). There were no statistically significant differences in AUC among the other 3 models (P>0.05).

Discussion
Early diagnosis and effective treatment of lung cancer are the key to improving the survival rate of patients. Therefore, early and non-invasive biomarkers for lung cancer diagnosis have been among the most popular research areas. It has been shown that circulating miRNAs are stable under actual experimental conditions and that abnormal expression of cancer-related miRNAs may appear earlier than clinical symptoms; therefore, circulating miRNAs may be used as tumor biomarkers [21]. A large body of studies has suggested that a series of circulating miRNAs have potential as diagnostic tools in malignancies [22]. It has been shown that four plasma miRNAs (miR-21, miR-126, miR-210, and miR-486) could differentiate NSCLC from controls with 86.22% sensitivity and 96.55% specificity, and could also distinguish stage Ⅰ NSCLC with 73.33% sensitivity and 96.55% specificity [23]. Another study selected from the literature 15 miRNAs associated with lung cancer tissues and found that, in the plasma of NSCLC patients, the expression of miR-155, miR-197, and miR-182 was significantly increased in stage Ⅰ [16]. The sensitivity and specificity for diagnosing NSCLC were 81.33% and 86.76%, respectively.
In this study, we compared the expression of 11 plasma miRNAs in lung cancer patients with that in controls. Single-factor analysis showed that the expression of 10 plasma miRNAs (miR-21, miR-20a, miR-210, miR-145, miR-126, miR-223, miR-197, miR-30a, miR-30d, miR-25) was significantly higher in the lung cancer group than in the controls; multiple-factor analysis revealed that elevated plasma miR-20a and miR-223 levels were risk factors for lung cancer.
The reported expression of miR-145 is not consistent across studies. It has been shown that miR-145 is down-regulated in various malignancies, including lung adenocarcinoma, in which it inhibits cell proliferation by targeting epidermal growth factor receptor (EGFR) and nucleoside diphosphate linked moiety X-type motif 1 (NUDT1) [42]. However, another study found increased expression of miR-145 in the plasma of lung cancer patients, which is consistent with our study [43]. miR-126 may inhibit the proliferation of lung cancer cells, and its expression in lung cancer tissue was lower than in normal tissue [44]. miR-30a can inhibit the invasion and migration of lung cancer cells by directly inhibiting the expression of Snail [45]. miR-30d could inhibit the proliferation and activity of NSCLC cells by directly regulating CCNE2 [46]. In this study, the relative expression of miR-126, miR-30a and miR-30d in the plasma of lung cancer patients was greater than in controls, which differs from the studies above.
Various data mining algorithms have been improved in recent years, such as cluster analysis, decision trees and rough sets, ANN and genetic algorithms, and SVM and fuzzy processing technology [47]. Each method has its advantages, limitations and scope of application. Fisher discriminant analysis is one of the most widely used methods in multivariate statistical pattern recognition, but it requires, among other conditions, that the input variables be independent, without interaction effects, and normally distributed [48]. Therefore, it has limitations in the analysis of nonlinear systems. To obtain the best generalization ability, SVM finds the best compromise between the complexity of the model and its learning ability, based on Vapnik-Chervonenkis (VC) statistical learning theory and the structural risk minimization principle [49]. SVM is a classical method in data mining and has several advantages. First, it offers structural risk minimization and good generalization ability, which are grounded in statistical learning theory [50]. Second, with different kernel functions, SVM can achieve results similar to those of ANN, depending on the selected model [51]. In general, SVM seeks the optimal solution given the available information, which compensates for the weaknesses of ANN in determining a reasonable structure and in avoiding local optima, and represents a significant improvement in learning methods. This study therefore carried out an in-depth analysis using the SVM algorithm, which is relatively mature in the medical field.
At present, most studies have focused on one or several biomarkers using traditional analysis methods. One study compared the diagnostic value of serum miR-22, miR-125b, and miR-15b with that of the commonly used tumor marker CEA, and indicated that the diagnostic significance of these three serum miRNAs (AUC=0.725, 0.704, and 0.619) for NSCLC was higher than that of serum CEA (AUC=0.594) [52]. Meanwhile, some studies have focused on genes and other biomarkers using ANN, decision tree or similar models. In ANN and decision tree models of lung cancer based on the genetic polymorphisms of CYP1A1, GSTM1, mEH and XRCC1, telomere length, and the methylation of the p16 and RASSF1A genes, the accuracy for the validation sets was 89.62% and 93.00%, respectively [53]. The accuracy and sensitivity were also improved by the above methods. In this study, the SVM and Fisher models were established based on miRNA tumor biomarkers and clinical symptom characteristics for the first time.
We established Fisher models with 10 miRNAs and 6 symptoms for lung cancer diagnosis, and the AUCs of three models are combined Fisher model ( [20]. This may be because miRNA biomarkers have better specificity than gene or other biomarkers. Our findings indicate that the changed expression levels could be used as potential biomarkers for the diagnosis of lung cancer. Another probable reason is the data preprocessing performed before the models were established: after the normalizing transformation, the expression levels of the miRNAs were approximately normally distributed and had no missing values. miRNAs play a critical role in lung carcinogenesis and have been studied widely as cancer biomarkers. Zhang et al. [54] established a screening method for early-stage NSCLC using four miRNAs (miR-145, miR-20a, miR-21, miR-223), and the AUC of the model was 0.897. To the best of our knowledge, there is no data mining model for lung cancer diagnosis based on miRNAs. The SVM model established for lung cancer diagnosis in our study, which combined 10 miRNAs and 6 symptoms, had a higher accuracy. In this study, the combined SVM model with miRNAs was superior in lung cancer diagnosis to the models with methylation and telomere biomarkers in our previous research [20]. The accuracy and AUC of the combined SVM model in our study were also better than the results of other studies on genes and other biomarkers using ANN, SVM and similar methods. For example, one study explored eighteen genes (including TTN, RHOH, RPS20, TRBC2) for six cancers (including lung cancer) using SVM, with an accuracy of 75.10% [55].
As to the three SVM models we established (10 miRNAs SVM, 6 symptoms SVM, and combined SVM), the accuracies were 80.49%, 84.15%, and 96.34%, respectively, and the AUCs were 0.841, 0.818, and 0.976, respectively. The AUC and accuracy of the combined SVM model were better than those of the miRNAs SVM and symptom SVM models. Overall, the SVM model based on miRNAs and clinical symptom characteristics has a higher accuracy and might be useful for the early diagnosis of lung cancer. It also has excellent predictive power: for example, all patients with stage Ⅰ and Ⅱ lung cancer in the validation set were correctly predicted to have lung cancer.
This study showed that the expression levels of 10 plasma miRNAs were associated with lung cancer, which provides a theoretical basis for further prospective studies or large-scale clinical trials. More importantly, the expression of plasma miRNAs is very stable under different harsh conditions, which indicates that plasma miRNAs have the potential to serve as biomarkers for the auxiliary diagnosis of lung cancer. Our findings indicate that an SVM model based on plasma miRNA biomarkers may serve as a novel, accurate, noninvasive method for the auxiliary diagnosis of lung cancer. However, there are some limitations in this study. Firstly, the selection of the 10 plasma miRNAs was based on published studies rather than on miRNA arrays or bioinformatics methods. More plasma miRNAs need to be analyzed before they can be used as specific biomarkers. Secondly, compared with a single study, large-sample, multicenter clinical trials will yield more reliable results. Moreover, further validation studies still need to consider other issues, including health policy, ethics and cost.

Conclusions
In summary, this study suggests that the 10 plasma miRNAs are associated with lung cancer, and that their changed expression levels could be used as potential biomarkers for the diagnosis of lung cancer. The SVM model based on miRNA tumor biomarkers and clinical symptom characteristics has superior diagnostic value for the auxiliary diagnosis of lung cancer.