Objective clustering inductive technology of gene expression profiles based on SOTA clustering algorithm

S. A. Babichev, A. Gozhyj, A. I. Kornelyuk

© 2017 S. A. Babichev et al.; Published by the Institute of Molecular Biology and Genetics, NAS of Ukraine on behalf of Biopolymers and Cell. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

UDC 004.048


Bioinformatics. ISSN 1993-6842 (on-line); ISSN 0233-7657 (print). Biopolymers and Cell. 2017. Vol. 33. N 5. P 379-392. doi: http://dx.doi.org/10.7124/bc.000961

Introduction
Gene regulatory network reconstruction based on gene expression profiles is one of the current directions of modern bioinformatics. A gene regulatory network is a set of genes that interact with each other to control specific cell functions. A qualitatively reconstructed gene regulatory network allows us to study the influence of the corresponding group of genes, or of individual genes, on the abilities of biological objects. Gene expression profiles, which are obtained by DNA microarray experiments or by RNA sequencing technology, are the basis for reconstructing gene regulatory networks. High dimension of the feature space is one of the peculiarities of gene expression profiles: they contain tens of thousands of genes. Reconstruction of a gene regulatory network from the full dataset is a very difficult task, because the process requires large computing resources and the complexity of the obtained network complicates the interpretation of its results. Therefore, at the early stage of network reconstruction it is necessary to group the studied gene profiles according to their level of similarity.
Biclustering technology is a current approach to solving this problem. Implementation of this technology allows grouping objects and genes according to their mutual correlation. In [1], the authors review a large number of biclustering approaches existing in the literature and analyze their advantages and disadvantages.
In [2], the authors proposed and implemented a convex biclustering method using gene expression profiles of lung cancer patients, and showed the efficiency of the proposed method in simulation. However, one of the significant problems in the qualitative implementation of this technology is the selection of the biclustering level during the grouping of objects and genes. Qualitative validation of the obtained model is another task that currently has no solution. High dimension of the feature space leads to a large quantity of obtained biclusters, and limiting their quantity by removing small biclusters leads to the loss of some useful information. To solve this problem we propose a cluster-bicluster technology, the implementation of which involves two stages: clustering of gene expression profiles at the first step and biclustering of the obtained clusters at the second step. To decrease the reproducibility error of the clustering process, data clustering is performed within the framework of the objective clustering inductive technology. Its implementation involves the use of external information for correct verification of the obtained model, and the use of internal clustering quality criteria, an external criterion and a complex balance clustering quality criterion. High objectivity is achieved by using two equal power subsets during the clustering process. The term "equal power" means that these subsets contain the same quantity of pairwise similar objects.
The idea and conceptual basis of the objective clustering methods were proposed by Madala and Ivakhnenko [3] and further developed in [4,5]. The authors' research is based on the inductive method of self-organization of complex system models on the basis of the Group Method of Data Handling (GMDH), the idea and main principles of which are presented in [6,7]. Implementation of the proposed method involves enumeration of the models from simple to complex ones and selection of the best model based on qualitative criteria of the studied process estimation. However, the authors' research is focused mainly on low dimensional data processing. The objective clustering inductive technology of high dimensional data is presented in [8], where the authors developed the architecture of this technology and a step-by-step procedure of its implementation. A practical implementation of the objective clustering inductive technology based on an agglomerative hierarchical clustering algorithm is presented in [9]. However, in spite of the progress achieved, there are some unsolved issues in this field. They are connected with the practical implementation of the objective clustering inductive technology based on self-organizing hierarchical clustering algorithms and with the verification of the obtained models using different high dimensional data.
The unsolved parts of the general problem are:
• Absence of a complex criterial analysis of clustering results obtained concurrently on two equal power subsets, based on a complex balance clustering quality criterion that takes into account: the character of the distribution of objects relative to the mass centers of the clusters that contain them and the character of the distribution of the cluster mass centers in the feature space; the difference between the clustering results obtained on the two equal power subsets.
• Practical implementation of the objective clustering inductive technology based on an existing clustering algorithm using gene expression profiles, in order to select the best clustering algorithm for the studied data, to determine the optimal parameters of this algorithm's operation, and to implement it in practice within the framework of hybrid models of gene expression profiles grouping.
The aim of the paper is the development of the objective clustering inductive technology of gene expression profiles based on the self-organizing SOTA clustering algorithm.

Materials and methods
The architecture of the objective clustering inductive technology [8] is shown in Fig. 1. The initial dataset is presented as a matrix of size n × m, where n is the quantity of the studied objects and m is the quantity of the object features. The aim of the clustering is the partition of the objects into k non-empty, pairwise non-intersecting clusters in accordance with the clustering quality criteria, taking into account the properties of the studied objects, where k is the quantity of clusters. Objective clustering inductive technology is based on the inductive methods of complex systems analysis, which involve sequential enumeration of clusterings in order to select the best variants among them. Let W be the set of available clusterings for the equal power datasets A and B. A clustering is optimal if it attains the extremum of QC(K) over all clusterings K in W, where QC(K) is the clustering quality criterion for the clustering K.
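The three definitions above can be written compactly as follows. This is only a sketch consistent with the wording of the text: the symbols $X$ and $K_s$, and the use of "extr" to stand for either a minimum or a maximum depending on the criterion, are notational assumptions and not necessarily the authors' notation.

```latex
% Initial data: n objects described by m features (symbol X is assumed)
X = \{x_{ij}\}, \quad i = 1,\dots,n; \quad j = 1,\dots,m

% Partition of the n objects (rows of X) into k non-empty, pairwise disjoint clusters
K = \{K_s\}, \quad s = 1,\dots,k; \qquad
\bigcup_{s=1}^{k} K_s = \{x_1,\dots,x_n\}; \qquad
K_s \neq \varnothing; \qquad
K_i \cap K_j = \varnothing, \quad i \neq j

% Optimal clustering: extremum of the quality criterion over the admissible set W
K_{opt} = \arg \underset{K \in W}{\operatorname{extr}} \; QC(K)
```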
A clustering K_opt ⊆ W is objective if the difference between the distributions of objects and clusters in the clusterings obtained for the equal power subsets A and B is minimal. Implementation of the objective clustering inductive technology involves the following steps:
1. Studied data analysis and preprocessing. Formation of clustering aims.
2. Determination of the affinity function (level of similarity) between objects, between clusters, and between objects and clusters. Division of the initial dataset into two equal power subsets using the chosen affinity function.
3. Selection of the clustering algorithm. Setup of its initial parameters and of the intervals and steps of change of these parameters during the algorithm operation.
4. Data clustering on the equal power subsets A and B concurrently within the given range of variation of the algorithm's parameters. Cluster formation at each stage of the clustering process.
5. Calculation of the internal, external and complex balance clustering quality criteria at each stage of the clustering algorithm operation.
6. Analysis of the obtained results. If the quantities of clusters differ, or if the extremum values of the clustering quality criteria exceed the admissible values, another clustering algorithm is chosen and set up for the studied data. Otherwise, the objective clustering corresponding to the extremum value of the complex balance criterion is fixed.
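Step 2 of the procedure above relies on dividing the data into two equal power subsets, that is, subsets containing the same quantity of pairwise similar objects. The text does not spell out the division procedure, so the following is only one plausible sketch: it greedily takes the closest still-unassigned pair of objects and sends one member to A and the other to B. The function name `split_equal_power`, the greedy strategy and the default Euclidean distance are assumptions made for illustration.

```python
import numpy as np

def split_equal_power(X, dist=None):
    """Greedy split into two subsets of pairwise similar objects (illustrative assumption).

    Returns two lists of row indices of equal length. With an odd number of
    objects, one object stays unassigned.
    """
    X = np.asarray(X, dtype=float)
    if dist is None:
        # Default affinity: Euclidean distance (assumed; any metric can be passed in).
        dist = lambda u, v: float(np.linalg.norm(u - v))
    n = len(X)
    # All pairwise distances, sorted from the most similar pair to the least similar.
    pairs = sorted((dist(X[i], X[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    used = set()
    idx_a, idx_b = [], []
    for _, i, j in pairs:
        # Take a pair only if neither object has been assigned yet.
        if i not in used and j not in used:
            used.update((i, j))
            idx_a.append(i)
            idx_b.append(j)
    return idx_a, idx_b
```

Because each selected pair contributes exactly one object to each subset, A and B always have the same size, and every object in A has a close counterpart in B. The pairwise enumeration is quadratic in the number of objects, so for several thousand profiles a more economical pairing strategy would be needed.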
A comparative analysis of different clustering quality criteria within the framework of the objective clustering inductive technology was carried out in [10]. Analysis of the obtained results allowed us to determine a complex multiplicative criterion based on the Calinski-Harabasz criterion [11] and the WB-index [12]. This criterion was used as the internal clustering quality criterion. It depends on K and N, the quantities of clusters and studied objects respectively, and on the components QCW and QCB, which estimate the character of the distribution of objects within the clusters and the character of the distribution of the clusters in the feature space. The first component, QCW, is calculated as the average distance from the objects to the mass centers of the clusters that contain them. The second component, QCB, is calculated as the average distance between the mass centers of the clusters. Here N_S is the quantity of objects in cluster S; x_i^S is the i-th object in cluster S; C_i, C_j and C_S are the mass centers of clusters i, j and S respectively; d() is the similarity metric used to estimate the proximity level of the studied vectors. Correlation distance was used as the similarity metric in the analysis of high dimensional gene expression profiles:

$$d(x_s, x_p) = 1 - \frac{\sum_{i=1}^{m}(x_{si} - \bar{x}_s)(x_{pi} - \bar{x}_p)}{\sqrt{\sum_{i=1}^{m}(x_{si} - \bar{x}_s)^2 \cdot \sum_{i=1}^{m}(x_{pi} - \bar{x}_p)^2}} \qquad (7)$$

where m is the quantity of features of the studied vectors; $\bar{x}_s$ and $\bar{x}_p$ are the average values of the vectors s and p respectively. In the case of low dimensional data, correlation distance is not effective, and Euclidean distance was used as the similarity metric:

$$d(x_s, x_p) = \sqrt{\sum_{i=1}^{m}(x_{si} - x_{pi})^2} \qquad (8)$$

The external clustering quality criterion was calculated as the normalized difference of the internal clustering quality criteria for the equal power subsets A and B. The objective clustering corresponds to the minimum values of the internal and external clustering quality criteria. However, the extrema of these criteria may correspond to different clusterings. Thus, it is necessary to determine a complex balance clustering quality criterion that takes into account both the character of the distribution of objects and clusters in the various clusterings and the difference between the clustering results obtained on the equal power subsets A and B. To calculate the complex balance clustering quality criterion, the Harrington desirability function [13] was used. Implementation of this function involves a linear transformation of the scales of the internal and external criteria into a reaction scale, the values of which range from -2 to 5. The coefficients a and b of this transformation are determined empirically. The private desirabilities of the appropriate criteria are then calculated from the values on this scale, and the general desirability value is calculated as the geometric average of the private desirabilities. The largest value of the general Harrington desirability function corresponds to the best parameters of the clustering algorithm operation.
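The criteria described above can be sketched in a few functions. This is a hedged illustration rather than the authors' implementation. The correlation and Euclidean distances and the Harrington form d = exp(-exp(-Y)) are standard. Everything else is an assumption made here: the normalizations of QCW (by the total quantity of objects) and QCB (by the quantity of center pairs), the form |QC_A - QC_B| / (QC_A + QC_B) for the external criterion, the sign convention Y = a - b·QC, the helper `fit_scale` that maps the smallest criterion value to Y = 5 and the largest to Y = -2, and all function names. The multiplicative internal criterion itself is not reproduced, since its exact form is not given in the text.

```python
import numpy as np

def correlation_distance(x, y):
    """Correlation distance: 1 minus the Pearson correlation of two vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - float(xc @ yc) / float(np.sqrt((xc @ xc) * (yc @ yc)))

def euclidean_distance(x, y):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(x - y))

def qcw(X, labels, dist):
    """Average distance from objects to the mass center of their own cluster
    (normalization by the total quantity of objects is an assumption)."""
    total = 0.0
    for s in np.unique(labels):
        members = X[labels == s]
        center = members.mean(axis=0)
        total += sum(dist(x, center) for x in members)
    return total / len(X)

def qcb(X, labels, dist):
    """Average distance between cluster mass centers (requires at least two clusters)."""
    centers = [X[labels == s].mean(axis=0) for s in np.unique(labels)]
    k = len(centers)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(dist(centers[i], centers[j]) for i, j in pairs) / len(pairs)

def external_criterion(qc_a, qc_b):
    """Normalized difference of the internal criteria on subsets A and B (assumed form)."""
    return abs(qc_a - qc_b) / (qc_a + qc_b)

def fit_scale(qc_min, qc_max):
    """Coefficients a, b with Y = a - b*qc: qc_min -> Y = 5, qc_max -> Y = -2 (assumed).
    Requires qc_max > qc_min."""
    b = 7.0 / (qc_max - qc_min)
    a = 5.0 + b * qc_min
    return a, b

def desirability(qc, a, b):
    """Private Harrington desirability d = exp(-exp(-Y)) with Y = a - b*qc."""
    return float(np.exp(-np.exp(-(a - b * qc))))

def general_desirability(ds):
    """General desirability: geometric mean of the private desirabilities."""
    ds = np.asarray(ds, dtype=float)
    return float(ds.prod() ** (1.0 / len(ds)))
```

With this sign convention a smaller criterion value gives a larger reaction-scale value and therefore a desirability closer to 1, which matches the statement that the best clustering corresponds to the minimum of the internal and external criteria and to the maximum of the general desirability.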
The SOTA clustering algorithm (Self-Organizing Tree Algorithm) [14], a type of self-organizing neural network based on Kohonen maps and on the Fritzke algorithm of growing spatial cell structures [15], was used within the framework of the objective clustering inductive technology. As opposed to Kohonen maps, which map a set of high dimensional input data onto the elements of a two-dimensional array of small size, the SOTA algorithm generates a binary topological tree. The Fritzke algorithm performs self-organization of the output nodes of the network in such a way that the quantity of nodes increases in regions with a higher density of objects and decreases in regions with a lower density. The effectiveness of the SOTA clustering algorithm is determined by two parameters: the weight coefficient of the sister cell (scell) and the maximum divergence coefficient (E). The weight coefficients of the parent and winner cells are calculated automatically. To calculate the optimal parameters of the algorithm operation we propose to use the objective clustering inductive technology.
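To make the role of the three weight coefficients concrete, the sketch below shows a single adaptation step in the usual self-organizing form, in which a cell vector is moved towards the presented object by a fraction equal to its weight coefficient. It is a simplified illustration under stated assumptions: the function name `sota_adapt` and its signature are hypothetical, the numeric values in the usage note are illustrative only, and details of the full algorithm [14] such as tree growth, stopping by divergence and the conditions under which the sister cell is updated are omitted.

```python
import numpy as np

def sota_adapt(winner, parent, sister, x, scell, pcell, wcell):
    """One simplified adaptation step (illustrative sketch).

    Each cell vector moves towards the input x by its own weight coefficient:
    the winner by wcell, its parent by pcell and its sister by scell.
    New arrays are returned; the inputs are not modified.
    """
    new_winner = winner + wcell * (x - winner)
    new_parent = parent + pcell * (x - parent)
    new_sister = sister + scell * (x - sister)
    return new_winner, new_parent, new_sister
```

For example, with all three cells at the origin, x = (1, 1), scell = 0.001, pcell = 0.005 and wcell = 0.01, the winner moves to (0.01, 0.01), the parent to (0.005, 0.005) and the sister to (0.001, 0.001). A larger scell therefore pulls neighbouring cells together more strongly, which is why this parameter influences the resulting partition.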
The block-scheme of the inductive algorithm of objective clustering based on the SOTA clustering algorithm is shown in Fig. 2. Implementation of this model involves the following steps:
Step 1. Formation of the initial set Ω of the objects. Data preprocessing (filtration and normalization). Presentation of the data as an n × m matrix, where n is the quantity of the studied objects (rows) and m is the quantity of the features characterizing the objects (columns).
Step 2. Determination of the similarity metric depending on the type of the studied vectors by formulas (7) or (8). Division of the initial dataset into two equal power subsets.
Step 3. Setup of the SOTA clustering algorithm. Setting of the E and b parameters and of the initial value, interval and step of change of the scell weight parameter. The pcell and wcell parameters are changed automatically by the formulas: pcell = scell · 5; wcell = pcell · 2.
Step 4. Data clustering on the equal power subsets A and B concurrently. Cluster formation and calculation of the internal clustering quality criteria by formulas (4)-(6) within the range of change of the algorithm's parameter.
Step 5. Calculation of the external and complex balance clustering quality criteria.
Step 6. Fixation of the optimal scell parameter corresponding to the maximum value of the balance criterion.
Step 7. Setting of the initial value, interval and step of change of the maximum divergence parameter (E). Repetition of steps 4-5 of this algorithm. Fixation of the optimal E parameter.
Step 8. Data clustering by SOTA clustering algorithm using the optimal parameters of the algorithm operation.
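The parameter search in the steps above amounts to a one-dimensional sweep that is run first over scell and then over E. A minimal sketch of such a sweep is given below. The clustering routine, the internal criterion and the balance criterion are passed in as callables, so no particular SOTA implementation is assumed. The function name `sweep_parameter`, its signature and the form of the external criterion are illustrative assumptions.

```python
import numpy as np

def sweep_parameter(values, cluster_fn, internal_fn, balance_fn, subset_a, subset_b):
    """Sweep one algorithm parameter and keep the value with the largest balance criterion.

    cluster_fn(data, value) -> labels        (hypothetical clustering routine)
    internal_fn(data, labels) -> float       (internal quality criterion)
    balance_fn(qc_a, qc_b, qc_ext) -> float  (complex balance criterion, larger is better)
    """
    best_value, best_score, history = None, -np.inf, []
    for v in values:
        # Cluster both equal power subsets with the same parameter value.
        labels_a = cluster_fn(subset_a, v)
        labels_b = cluster_fn(subset_b, v)
        # Internal criteria on each subset.
        qc_a = internal_fn(subset_a, labels_a)
        qc_b = internal_fn(subset_b, labels_b)
        # External criterion: normalized difference of the internal criteria (assumed form).
        denom = qc_a + qc_b
        qc_ext = abs(qc_a - qc_b) / denom if denom else 0.0
        score = balance_fn(qc_a, qc_b, qc_ext)
        history.append((v, qc_a, qc_b, qc_ext, score))
        if score > best_score:
            best_value, best_score = v, score
    return best_value, history
```

Running the sweep once over the scell grid with E fixed, and then over the E grid with the selected scell fixed, mirrors steps 3 to 7. The returned history holds the per-value criteria, which is the kind of data plotted against the parameter in the result figures.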

Results and Discussion
The proposed technology was implemented using three well-known databases: gene expression profiles of lung cancer patients, obtained by DNA microchip experiments [16]; the Seeds data [17], in which the examined group comprises kernels belonging to three different varieties of wheat (Kama, Rosa and Canadian), each represented by 70 elements randomly selected for the experiment; and Fisher's Iris data [18], which consist of three species of Iris (setosa, virginica and versicolor), each represented by 50 vectors. The correlation metric was used to estimate the proximity level of the gene expression profiles. To determine the distance between the studied objects in the case of the Seeds and Iris data we used the Euclidean metric, since the studied vectors in these cases have a low-dimensional feature space. The length of the vectors in the case of gene expression profiles was 96 (it equals the quantity of studied objects). The preprocessing steps applied to these data in order to increase the informativity of the gene expression profiles are described in [19]. The aim of the clustering in this case is to group the gene expression profiles in order to decrease the dimension of the feature space. The vectors of the Seeds and Iris data consist of 7 and 4 features respectively.
The scell parameter in the case of the gene expression profiles dataset was changed within the range from 0,001 to 0,2 with the step 0,001. The internal criteria for the equal power subsets A and B, the external criterion and the complex balance clustering quality criterion versus the weight parameter of the sister cell are presented in Fig. 3. The maximum divergence value in this case was taken as E = 0,001. As can be seen from Fig. 3, the internal clustering quality criteria CX_1 and CX_2, calculated on the equal power subsets A and B, do not allow us to determine the optimal scell value corresponding to the objective clustering of the studied data. The external clustering quality criterion CQE has several local minima corresponding to successful groupings of the studied vectors. However, the analysis of the general Harrington desirability values, which take into account both the internal and external criteria, allows us to conclude that the best clustering corresponds to scell = 0,001. In this case 6659 profiles were divided into two clusters: the first cluster contained 4276 profiles and the second 2383. Variation of the maximum divergence value in the range from 0,001 to 1 did not change the obtained results. Fig. 4 presents the same charts for the Seeds data.
The scell value in this case was changed within the range from 0,001 to 0,05 with the step 0,002. The analysis of the charts shows that the largest value of the balance criterion is achieved for scell = 0,013. This value also corresponds to the least value of the external clustering quality criterion and to the least difference of the clustering results for the equal power subsets A and B (minimum difference between the values of the internal clustering quality criteria). Fig. 5 presents the charts of the internal criteria, external criterion and complex balance criterion versus the maximum divergence value, which was changed within the range from 0,05 to 1 with the step 0,05.
Analysis of the charts shows that the optimal and most objective clustering corresponds to the maximum divergence value 0,7. During clustering with the use of the full dataset, the obtained results have shown that in case of scell = 0,013 and E = 0,7 values using […] the third and the fourth clusters contained 20 and 21 versicolor vectors. In the fifth cluster, there were both virginica and versicolor vectors. It should be noted that the virginica and versicolor data have some intersection a priori.
In conclusion, the obtained results for the Seeds and Iris data are not perfect. The self-organizing SOTA clustering algorithm is focused mainly on high dimensional gene expression profiles, and better results for the Seeds and Iris data can be obtained using other clustering algorithms. However, the effectiveness of the objective clustering inductive technology based on the SOTA clustering algorithm was shown during the simulation process. Implementation of this technology allows us to select objectively the optimal parameters of the SOTA algorithm operation, which correspond to the maximum value of the general Harrington desirability index.

Conclusions
The problem of grouping gene expression profiles at the early stage of gene regulatory network reconstruction is one of the current problems of modern bioinformatics. Qualitatively performed grouping of profiles determines the high quality of the gene regulatory network implementation. The paper presents an inductive technology of complex high dimensional data grouping, the high objectivity of which is determined by the use of equal power subsets during the clustering algorithm operation. Implementation of the proposed technology involves estimation of the clustering results for different clusterings within a given range of variation of the clustering algorithm parameters, using internal and external clustering quality criteria. The final decision about the character of the grouping of the studied vectors is taken on the basis of the complex balance criterion, which takes into account both the character of the distribution of objects and clusters in the various clusterings and the difference of the clustering results on the two equal power subsets. The Harrington desirability function was used to calculate the complex balance criterion.
Simulation of the clustering process was carried out with the self-organizing SOTA clustering algorithm using three well-known databases: gene expression profiles of lung cancer patients, the Seeds dataset and Fisher's Iris dataset. The results of the simulation have shown high effectiveness of the proposed technology. The use of the objective clustering inductive technology has allowed us to determine the optimal parameters of the SOTA clustering algorithm operation, which correspond to high objectivity of the grouping of the studied data.
In the case of the lung cancer gene expression profiles, the maximum value of the general Harrington desirability index corresponded to the weight coefficient of the sister cell 0,001. The weight coefficients of the parent (mother) cell and the winner cell were 0,005 and 0,01 respectively. The maximum divergence value was taken as 0,001. The 6659 gene expression profiles were divided into two clusters: 4276 profiles were in the first cluster and 2383 profiles were in the second one. Variation of the maximum divergence value within the range from 0,001 to 1 did not change the character of the partition of the objects.
Three clusters were obtained in the case of the Seeds data. The weight coefficients of the sister cell, parent cell and winner cell were determined as 0,013, 0,052 and 0,104 respectively. The maximum divergence value was changed within the range from 0,05 to 1 with the step 0,05. The maximum of the Harrington desirability function corresponds to the maximum divergence value E = 0,7. The percentage of correctly distributed objects in this case was 85,5 %. A small change of the determined parameters decreased the percentage of correctly distributed objects. Thus, it can be concluded that the obtained combination of the parameters is optimal in terms of clustering objectivity.
Interesting results were obtained in the case of Fisher's Iris data. The studied data were divided into five clusters. Fifty setosa objects were in one cluster. Virginica and versicolor objects were divided into four clusters. The second cluster contained 27 of the 50 virginica objects. The third and the fourth clusters contained only versicolor vectors, 20 and 21 respectively. The fifth cluster contained both virginica and versicolor vectors. This is logical, because the virginica and the versicolor data have some intersection a priori. The optimal parameters in the case of the Iris data were the following: the weight coefficients of the sister cell, parent cell and winner cell were 0,029, 0,116 and 0,232 respectively. The maximum divergence value was taken as 0,5. Similarly to the Seeds data, a small change of the determined parameters made the obtained clustering results worse.
As the next step of our research we plan to create a hybrid technology of gene expression profiles grouping based on the complex use of the objective clustering inductive technology at the first step of the data processing and of biclustering technology on the obtained clusters at the final stage of data grouping.

Fig. 2. Block-scheme of the inductive algorithm of the objective clustering based on SOTA clustering algorithm

Fig. 3. Charts of internal, external and balance criteria versus weight coefficient of the sister's cell values for lung cancer data