Computational analysis of microarray gene expression profiles of lung cancer

S. A. Babichev, A. I. Kornelyuk, V. I. Lytvynenko, V. V. Osypenko © 2016 S. A. Babichev et al.; Published by the Institute of Molecular Biology and Genetics, NAS of Ukraine on behalf of Biopolymers and Cell. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited UDC 651.3:678


Introduction
The DNA microarray technology is one of the modern areas of molecular biological research that allows us to identify and carry out quantitative analysis of thousands of genes simultaneously [1][2][3]. Qualitatively made an analysis of gene expression of the studied biological organism contributes to the determination of mechanism of disease at an early stage, and the nature of gene expression change allows to predict the nature of the appropriate type of disease further development.
There are many international computer databases of biological research objects of various nature (Array Express etc.). Singularity of the DNA microarray data is a large dimension of the feature space (~ 100000), that describes the object and a high level and diversity of the noise component. The emergence of the noise component is conditioned by the influence of technological factors on the process of microchip manufacturing, reading information from microarray and its subsequent processing. The efficient processing of DNA microarray data is possible due to the improvement and the development of new methods (background correction, normalization, filtration, summarization), an optimal technology of the feature space dimension reduction and the development of new clustering and classification technol- ISSN 1993- ogies of biological objects based on an integrated use of modern technologies and methods of system analysis and data mining. The issues of DNA microarrays processing are presented in [4][5][6]. The authors consider in detail the stages of DNA microarrays creation and the peculiarities of their processing. In [7] the possibility of using the neuro-fuzzy modeling to process the results of microarray experiments has been considered for the purpose of cancer diagnostics. In [8] for the analysis of the gene expression level to create the objects classification Bayesian network was used, that allows taking into account the probabilistic character of the objects distribution in multidimensional space. Notably, despite some progress in this area, it is still actual to achieve the desired precision of classification or to get the unambiguous interpretation of the results of DNA microarray obtained by clustering of investigated objects.
The aim of the paper is to research into the ways of optimizing methods of DNA microarray data processing, which is aimed at improving the quality of object clustering.

Materials and Methods
The structural flowchart of the processing of the light intensities matrix of corresponding genes of investigated objects is shown in Fig. 1. The need of background correction is caused by imperfection of the received image scan system. In this paper two methods of background correction were used: Affymetrix MicroArray Suite (MAS) method of MisMatch(MM) and Perfect Match(PM) tests proposed by company Affymetrix [9] and Robust Multi-Analysis (RMA) method [10]. In the first case MM and PM samples are used, herewith each chip is broken into 16 parts.
For each part used the least 2 % of intensity for the background correction of respective areas. The second method assumes that the light intensity at appropriate point consist of useful component and noise that have normal distribution. If we assume that α -average value exponentially-distributed signal, μ and σ 2 -the mathematical expectation and variance of the noise component respectively, the signal intensity correction in the corresponding point occurs with the formula: where a = I -μ -σ 2 α; b = σ; Φ and φ -distribution function and the density of the standard normal distribution respectively.
The process of data normalization leads to their single range that allows to carry out a comparative analysis of the research objects for the purpose of their classification or clustering in future. In this paper the following normalization methods were used: • constant or scaling normalization, proposed by company Affymetrix [9]. When using this method all the arrays are scaled so that they have the same average value of light intensities; • loess normalization [11]. Method provides calculation of variables for all values of light intensities of each pair of microarray: where x ik and x in -intensity value of i-th test on k-th and n-th microarrays respectively. It is assumed that between vectors A and M exists the regressive dependence. Further the normalizing correction is calculated: where M i -values of the regression function that corresponds to i-th sample.
Normalizing values of intensities are calculated as follows: (5) x' in = 2 • contrast data normalization [12]. Calculation of vectors A and M and use of regression dependence between them are going to be investigated, but not all pairs of arrays are analyzed, on the first step the basic array is chosen and the whole set of calculations is performed in accordance to it; • invariant set normalization [13]. The method assumes the use of a basic subset of PM samples as possible low light intensity distribution within each sample. Next a nonlinear relationship between light intensity values in the basic subset of samples and in investigated samples is found. This dependence later [is] used to normalize the data; • qspline data normalization [14]. Used cubic splines and quantiles of light intensities of corresponding arrays. Spline interpolation between the respective quantiles of investigated data is realized in the normalization process; • quantiles data normalization [15]. To use this method the projection of all points of n-dimensional quantile space on diagonal that defined by the unit vector ( 1 ,…, 1 ) √n √n is calculated. In the case of direct diagonal, the line of light intensities values in all microarrays will be distributed equally. PM correction is performed to reduce the effect of nonspecific hybridization that contributes to the noise component of investigated data. In this paper we used "mas" and "substractum" PM correction that was offered by company Affymetrix [9] and "pmonly" PM correction.
During summarization performed the calculations of expression of corresponding gene depending on the light intensity at this point. In this paper following summarization methods have been used: • average difference method, which allows the calculation of the level of gene expression as an average of the light intensities of corresponding samples. The method proposed by company Affymetrix [9]; • liwong method [13] based on an assumption that between the intensity values in the array PM i-th sample and j-th one, or between difference of light intensities PM-MM probes and level of expression of the corresponding gene in the array and θ i there is the following relationship: where ∑ϕ i 2 = J -the number of simple pairs in the investigated set, e ij -random error. Estimation of expression of the corresponding gene was calculated as the weighted average value of the difference of PM-MM: • mas method was offered by the company Affymetrix.
The value expression is calculated as robust average using 1-step Tukey biweight on log 2 scale; • median polish method [10] takes into account the difference in the nature of the interaction of genes with samples. Method based on the following additive models: log 2 (y ij ) = α i + μ j + e ij (9) where α i -the coefficient of interactions of i-th sample with genes; μ j -concentration of the j-th gene, which is taken as the expression of the corresponding gene.
To assess the quality of information processing Shannon entropy criterion has been used: E = -∑θ i log 2 θ i (10) where θ -empirical parameter that was calculated depending on the method used.
The aim of the data transformation is to reduce the dimensionality of feature space, but this process is accompanied by loss of a certain amount of useful infor-mation that affects the quality of the problem solution.
In this paper to reduce the dimension of the informative features space the component analysis has been used with selection of all major important components.
The clustering of objects was made: by methods K-means and C-means using Euclidean distance as a measure of proximity between the objects and the related cluster; by neural network algorithms SOM (Self-organizing Map); by SOTA (Self-Organizing Tree Algorithm). As the measure of proximity between objects and corresponding clusters when using SOTA algorithm, Euclidean and cosine distances have been used.

Results and Discussion
Simulation of DNA microarrays data processing system was implemented by system KNIME using programmatic functions of environment WEKA and package Bioconductor of program R. As the experimental base for research we used a database of patients with lung cancer E-GEOD-68571 database Array Express [16], which includes the geneexpression profiles of 95 patients, ten of which are healthy (Norm), and 85 patients divided by the degree of the disease into three groups: 23 patients with good state (Well), 41 patients with moderate state (Moderate-Md), and 21 patients with poor state (Poor). Fig. 2a presents the initial image of one of the objects. Fig. 2b  and 2c show the results of background correction that was made by mas and rma methods respectively. Table 1 presents the values of Shannon entropy when using different methods for calculating the entropy for unprocessed and processed images using rma and mas methods of background correction.
In this case the comparison was made by quantiles normalization of all data with further PM mas correction and liwong summarization method. Analysis of the results shown in Table 1, allows the conclusion that mas background correction method in this case is inefficient. The images processed by rma background correction method have the smallest entropy. This fact indicates a high quality in terms of useful information availability. The same conclusion may be made by analyzing the image in Fig. 2. The scatter diagram of the light intensity magnitude of the objects without data preprocessing is shown in Fig. 3. Median values of various vectors of investigated objects indicate the necessity of data normalization.
The entropy of normalized data with different methods of normalization is presented in Table 2. The data analysis in Table 2 allows a conclusion that by the Shannon's entropy criterion the usage of quantum data normalization is optimal. Fig. 4 shows the scatter diagram of normalized data using quantiles method of data normalization. The analysis results (Fig. 4) indicate a high quality of data normalization because they have the same median and are distributed in the same range. Tables  3 and 4 show the values of Shannon's entropy when using different methods of PM correction and summarization respectively.
Data analysis of Tables 3 and 4 allows a conclusion that the data have minimum entropy when using the mas method RM correction and summarization. Thus, the optimal preprocessing stages of DNA microarray data are: rma background correction meth-    od, quantiles data normalization, mas PM correction and mas summarization method. The data transformation was performed by calculating the principal components, and the dimension of the feature space was reduced from 7129 to 93. The model of cluster analysis for obtained database is implemented in the system KNIME and shown in Fig. 5. Table 5 presents the results of the simulation clustering process using different methods of cluster analysis.
Based on the analysis of the data in Table 5 it can be concluded that all used methods of cluster analysis share the investigated objects on patients and not patients, but in the mid of patients cluster algorithms SOM, k-means and C-means have unsatisfactory sharing resolution, because clusters Md, Well and Poor have intersection with each other. However, within the cluster of patients the algorithms SOM, k-means and C-means have unsatisfactory resolution, because clusters Md, Well and Poor intersect with each other. Additionally, some patients with good facilities condition were assign as not patients that is also unacceptable. Another conclusion can be made by analyzing the results of the algorithm Sota.
Sota algorithm clearly separates the objects on patients and not patients. Furthermore, within the cluster of patients the subclusters Poor and Well do not intersect using Euclidean distance estimation and overlap of 2.3 % using the cosine distance. There are some clusters of the objects intersection with moderate condition along with good and moderate and poor. This is logical, because the moderate patient's condition is fairly conventional concept. The patient's condition may be moderately good and moderately poor, however, the cluster of objects with moderate condition cannot be defined uniquely and it should intersect with Well and Poor clusters. Fig. 6 shows the chart of objects distribution on clusters by algorithm Sota using Euclidean distance and assessment of the degree of objects remoteness. The chart analysis confirms a good sharing ability of the algorithm Sota.

Conclusion
The research on the choice of optimal methods of DNA microarray data preprocessing has been conducted for further transformation and clustering of the investigated objects. The data preprocessing was performed using program R and consisted in the following: the correction of light intensity background in the corresponding point of the microarray; the normalization of data correction; and PM summarization, due to which the gene expression was calculated for the investigated objects.
Evaluation of the quality of information processing has been conducted using the Shannon's entropy and such methods of calculation have been used: bias-corrected maximum likelihood (MM); method Dirichlet with a=1/2 (Jeffreys); method Dirichlet with a=1(Laplace); method Dirichlet with a=1/length(y) (SG); method Dirichlet with a = sqrt(sum(y)) /length(y) (minimax); Chaosen method (CS). The Studies have shown that the optimal methods of data preprocessing by the Shannon's criterion are: rma background correction method, quantiles data normalization, RM mascorrection and mas-summarization. Reducing dimension of feature space was performed by component analysis, and the primary matrix gene expressions (95×7129) were transformed to a matrix (95×93) through the selection of all significant principal components.
The modeling of cluster analysis was performed in the system KNIME using the functions of the system WEKA. The clustering algorithms Sota, SOM, kmeans and C-means were investigated using the database of patients with lung cancer, 10 of whom were not patients. As a result of modeling it was established that all algorithms share the objects into the patients and not patients, but only Sota algorithm gives a satisfactory clustering result of patients within the cluster.
The prospect for further research is to develop efficient methods for clustering and classification of biological objects in order to improve the dividing ability of the presented algorithm.