In silico approach to study and functionally analyze interferon regulated genes

Finding genes which have biologically meaningful ISRE (interferon-stimulated response element) is important for better understanding of the Jak-STAT activated cellular IFN response. We used transcription factor binding site (TFBS) search with gene orthology filtering to find putative ISREs in the promoters of protein-coding genes of Rattus norvegicus, and used Gene Ontology (GO) analysis to check the validity of ISRE search results in terms of biological meaning. A total of 23286 promoters of rat genes were analyzed. ISRE search with 80 % threshold produced 5214 sites in 4571 promoters. 850 ISREs in 768 promoters passed orthology filtering. Distribution of ISREs along the promoter in 768-gene set reveals 3 regions of ISRE localization: 0 to –250, –250 to –550, and above –550 relative to TSS (transcription start site). It is not yet known whether ISRE localization has any functional implications. Using BayGO, a total of 84 GO terms were found to be enriched at P < 0.05 in the 768-gene set. Among these categories some are directly related to known IFN actions (positive regulation of B cell differentiation, humoral immune response, response to virus, cell differentiation etc.). 768 gene set was compared to the 4571 gene set using GO Tree Machine. Such categories as cell differentiation, cell cycle, regulation of cell cycle, viral life cycle and some others were found to be enriched, and belong to the well-known domains of interferon actions. Their relative enrichment is an indirect indication that the applied orthology filtering does increase the quality of results. Gene orthology-based filtering of the initial TFBS search results was shown to produce viable and expected results. Genes identified in this research as containing ISRE in promoters will be used to seed the construction of the IFN-a-induced gene regulatory network.


Introduction.
Transcription factors play an important role in the regulation of gene expression. Prediction of putative transcription factor binding sites (TFBS) has become an important resource to explore genome organization and predict transcriptional regulation [1].
Computational TFBS prediction methods are also necessary for the efficient annotation of transcriptional regulatory networks [2].
The TFBS consensus sequence motifs are usually represented using either IUPAC (International Union of Pure and Applied Chemistry) nomenclature consensus string, or matrices, the two most common being PFM (position frequency matrix, also known as position count matrix) and PWM (position weight matrix, or nucleotide weight matrix) [3][4][5][6]. PFM is a matrix consisting of nucleotide counts per each position of the identified binding site. PFMs were first used to characterize DNA-binding site specificity in 1982-1986 [7,8]. Later, quantitative discrimination of sites with calculated site scores using position weight matrices was introduced [5,[9][10][11]. A weight matrix pattern definition is superior to a simple IUPAC consensus sequence, as it represents the complete nucleotide occurrence probabilities for each position. It also allows the quantification of the similarity between the weight matrix and a potential TFBS detected in the target sequence. PWM is an estimate of the binding energy of the transcription factor to its specific binding site [5,6].
High throughput analyses using SELEX (Systematic Evolution of Ligands by Exponential Enrichment) and CHIP-Chip (Chromatin Immunoprecipitation-Microarray), along with computational sampling methods, have generated thousands of PFMs. These data together with the related transcription factor information are curated in online databases -Transfac [12], JASPAR [3], and others.
Online applications, such as MatInspector [19], MATCH [20] and ConSite [21], have been built to predict TFBS embedded in promoter sequences. However, TFBS search only identifies sites where the transcription factor could bind, but not necessarily will bind.
Taking into account the length of matrices used for TFBS prediction (usually less than 15 nucleotides), no wonder that simple search for IUPAC-denoted binding sites yields numerous false-positive results, occurring by chance. When applying PWM-based methods, matrix-site similarity score threshold can be used to increase specificity (get less false-positives) at the cost of sensitivity (find less true-positives).
Thus, using only TFBS search is not sufficient, and additional processing is required to refine the results. To avoid the loss of sensitivity, and reduce the number of false-positive binding site predictions, additional analysis can be applied: looking for paired TFBS, TFBS motifs, using gene orthology information, microarray-derived gene co-expression data, applying learning algorithms trained on known transcription factor target genes, etc.
Interferons are inducible cytokines with strong antiviral properties and a wide range of gene targets in the cell [22][23][24][25]. Type I interferons share common cellular surface receptor, which consists of two subunits (IFNAR-1 and IFNAR-2). Specific interferon binding to the receptor induces hetero-dimerization of IFNAR-1 and IFNAR-2, with the following activation of receptor-bound kinases Jak1 and Tyk2 by phosphorylation, and the Jak-STAT signal transduction pathway initiation [26,27]. The ISRE, a conserved regulatory element of interferon-stimulated genes (ISGs), is the target for transcriptional activation by the IFN-stimulated gene factor-3 (ISGF3), which consists of two STAT molecules (STAT1 and STAT2), and IRF-9 (interferon-regulatory factor 9, or p48) [23,28]. Genes containing ISRE can be considered primary interferon-response genes, with the cautionary note on the ability of not only ISGF3, but also of IRFs to bind to ISRE. Finding genes which have biologically meaningful ISRE sites will provide better understanding the Jak-STAT branch of cellular primary interferon response.
In this study, we searched for putative interferonstimulated response elements (ISREs) in the promoters of all protein-coding genes of Rattus norvegicus, as defined and annotated in the Ensembl database [29], and looked at the validity of using TFBS search in terms of biological meaning.
Meth ods. A to tal of 23286 pro tein-cod ing rat genes were an no tated and avail able in the Ensembl rat ge nome da ta base re lease 40. For this set of genes, pro moter and exon se quences were ex tracted and stored in a lo cal da ta base for faster ac cess by sequence pro cess ing al go rithms. A to tal of 70844 promot ers and 544585 exons (with type = «cod ing») were ob tained for 3 or gan isms (Ta ble 1). For the purposes of this study, gene pro moter was de fined as a se quence which starts at the -1000 po si tion rel a tive to the TSS (tran scrip tion start site) and ends at po sition 0. Pro moter and exon ex trac tion was car ried out us ing an au to mated pipe line, built us ing MySQL 4.1.16 da ta base for data stor age and PHP 5.1.6 program ming lan guage for que ry ing the Ensembl da tabase. Exons shorter than ISRE ma trix (15 nu cleotides), and both exons and pro mot ers with out the con tig u ous 15-nu cle o tide long subsequence, were dis carded and not counted. Con tig u ous 15-nu cle otide long subsequence was al lowed to have no more than a sin gle un known (N) nu cle o tide.
ISRE PFM matrix was taken from the database of transcription factors TRANSFAC 7.0 Public [30], accession number M00258. ISRE PFM, compiled from 13 sequences, is shown in Table 2.
To compute matrix-site similarity scores, PFM was converted into PWM. To obtain ISRE PWM, for each PFM matrix element we applied the following logarithm transformation [31]: Due to the relatively small number of sequences which are used to compile PFM, the resulting PWM might be inaccurate. For example, in Table 2 nucleotide T has 13 out of 13 possible counts in column 4; when converting this PFM to PWM without the pseudocounts, all other nucleotides at position 4 would have got -¥ as their value, and then any sequence, which does not have «T» in that position, would automatically fail, even if all the other nucleotides gave the highest similarity score possible. Introduction of pseudocounts into the formula saves from obtaining restrictive base-score values, and thus allows mismatches even in strongly conserved positions. However, mismatch in a conserved position still gets the lowest score.
Most commonly p(b) is taken equal to 0.25. Another approach is to take p(b) based on AT/GC content for the whole studied genome. In this study, we compared using p(b) = 0.25 and calculating p(b) from the background individually for each 1000-nucleotide long promoter before searching for ISRE with generated PWM (Fig. 1). For each target sequence (exon or promoter), full-length comparison of both strands with ISRE matrix was done, and the highest similarity score saved. Then the count of the best matrix-site matches was plotted against the 1-100 % range of similarity scores.
It can be seen that calculating target sequencespecific background nucleotide frequencies leads to slightly higher matrix-site similarity scores, especially when comparing matrix to promoter sites. However, performing F-test shows no difference between using p(b) = 0.25 and sequence-specific p(b): for exons F-value = 0.04, P = 0.997, for promoters F-value = 0.05, P = 0.995. The reason for promoters to exhibit more noticeable shift towards higher similarity scores can be explained by the differences in promoter and exons lengths: all the promoters were uniformly 1000 nucleotides long (with some exceptions containing long «not know» subsequences), while exons were much shorter, representing the range from 15 (for this study, ISRE-limited) to 19456 nucleotides long, with mean 165.3, and median 123. The two choices of p(b) were not further compared, but the sequence-specific   In order to define the threshold to make a presence/absence call for each matrix-site similarity score, we obtained means and standard deviations of the maximal similarity score distributions (Fig. 1). For exons, mean = 66.7 %, SD = 6.8; for promoters, mean = = 76.7 %, SD = 4.3. Assuming that all matches of ISRE matrix in exons were false-positives, and assuming close to normal distribution of the similarity scores, threshold was chosen at the similarity score level of 80 %, which includes no more than 2.5 % of high-scoring ISRE matches in exons, and includes 16 % of potential true positives. Such a stringent threshold was chosen because only one biologically-significant TFBS selection filter (namely, comparison of the promoters of the orthologous rat and mouse genes) was applied. Used in literature threshold level of 75 % might be more appropriate in case of more filters applied, as it initially includes over 50 % of potential true-positives. Running TFBS search on both DNA strands with 80 % threshold in the promoters of all the protein-coding rat genes produced 5214 binding sites in 4571 promoters.
In order to filter away biologically insignificant binding sites, we compared promoters of the orthologous rat and mouse genes. The assumption behind this procedure was that if the binding site has no biological meaning, then it is more likely not to be preserved in the process of evolution, and vice versa for meaningful binding sites. The procedure of identifying potential ISRE binding sites in the promoters of 24260 mouse protein-coding genes was identical to that used for rat genes, with the same threshold of 80 %. The list of orthologous genes was obtained from Ensembl. This database provides similarity percentage for each pair of genes, and orthology type flag, which can be one of: «ortholog_one2one» (reciprocal best hit genes), «ortholog_one2many», «ortholog_many2many» and «apparent_ortholog_one2one» (when no similarities were found, or annotation system error occurred). The following criteria were used to choose genes considered orthologous for the purposes of this study: 1) for «ortholog_one2one» entries, identity of over 70 % was required; 2) for «ortholog_one2many», 75 %; 3) for «ortholog_many2many» and «apparent_ortholog_one2one», 80 % of identity were required.
Two orthology-based approaches were tried: looking for occurrence of ISRE both in rat and mouse orthologous genes with no other constraints, and looking for ISRE occurrence in orthologs with an additional constraint for the start positions of found sites to be no more than 25 bp apart (distances measured relative to the TSS in the genes of each organism). Simple co-occurrence in orthologous genes decreased the numbers to 1722 TFBS in 1419 promoters, and position constraint brought that number down to 850 TFBS in 768 promoters. This means that out of 4571 genes 768 (16.8 %) have between-species position-conserved ISRE sites in promoters. The set of 768 genes was used for further analysis.
Gene Ontology (GO) categories enrichment analysis was performed using BayGO [32], with 100 iterations and at least a single gene within a category to consider it during analysis. Categories with P > 0.05 were considered insignificant.
Results and Discussion. In the final 768-gene set, 697 genes had single putative ISRE, 62 genes had two putative ISREs, and 9 genes had three or more ISREs in their promoters.
Graph of the distribution of found ISRE start sites along the length of the promoter (Fig. 2) reveals three characteristic regions of ISRE localization: 0 to -250, -250 to -550, and above -550 nucleotides relative to the TSS. It is not yet known whether ISRE localization has any functional implications.
GO categories enrichment analysis was conducted for the 768 gene set against all the rat protein-coding genes (see Table 3 for GO Slims representation). GO has three main categories: biological process, molecular function, and cellular component. We analyzed GO terms enrichment within the biological process category.
A total of 84 GO terms in the biological_process category were found to be enriched at P < 0.05. GO Slims representation includes GO terms which are hierarchically linked, e. g. cellular process, physiological process, and cellular physiological process (Fig. 3). Thus in Table 3 GO Slims terms are overlapping, and do not add up to the total of 84 enriched categories.
To identify the effects of orthology-based filtering, 768 gene set was compared to the 4571 gene set using GO Tree Machine (Fig. 3). Categories in bold and underlined in Fig. 3 are enriched in 768 set as compared to 4571 set with P < 0.01 (no multiple-testing correction was applied). Such categories as cell differentiation, cell cycle, regulation of cell cycle, viral life cycle and some others belong to the well-known domains of interferon actions. Their relative enrichment is an indirect indication that the applied orthology filtering does increase the quality of results.
Among the 84 enriched categories, a number of categories are directly related to known IFN actions (e. g. positive regulation of B cell differentiation, humoral immune response, response to virus, cell differentiation, positive regulation of transcription factor activity etc.). Some of the GO terms are found both in the set of 84 categories (in 768 versus all genes enrichment test), and in the 20 categories, shown in bold and underlined in Fig. 3 (in 768 versus 4571   It was interesting to observe the «nervous system development» category to be found enriched both in 768 gene set versus all genes (with P < 0.01 and gamma = 0.4, using BayGO), and in 768 versus 4571 (P = 0.0099, using GOTM). This observation needs further investigation, especially in the light of known interferon side effects, which include headache, increased irritability and some other.

Conclusions.
We identified 768 rat genes which contain ISRE in their promoters, and are the potential targets of transcriptional regulation by type I interferons. Functional analysis of these genes, conducted using Gene Ontology, had shown the enrichment of the GO categories related to already known IFN effects.
Additional step, based on the comparison of promoters of the orthologous mouse and rat genes, was applied after the TFBS search. GO analysis of the selected genes revealed 20 categories, with more than 8 As the GO analysis shows, a simple yet effective additional step in TFBS search data post-processing would be to concentrate on the genes, known to be expressed in the tissue/cells of interest. This would allow focusing on the tissue-specific factor effects, and seeing a clearer picture of GO terms enrichment.
Another improvement would be to refine the TFBS matrix before use. In the case of interferon-regulated genes, one could use public microarray data from the interferon-stimulation experiments to build the list of early interferon-response genes, and then build a new matrix, taking into account actual base frequencies in the promoters of the up/down-regulated genes. With this approach it is also possible to take into account paired and dependent changes of the matrix nucleotides, using hidden Markov models or learning algorithms, such as support vector machine.
Genes identified in this research as ISRE-containing will be used to seed the construction of the IFN-a-induced gene regulatory network.