Extreme evolutionary stability of conserved non-protein coding element of baculovirus genome

V. E. Makarenko, I. M. Kikhno, V. I. Kashuba

© 2016 V. E. Makarenko et al.; Published by the Institute of Molecular Biology and Genetics, NAS of Ukraine on behalf of Biopolymers and Cell. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

UDC 577


Introduction
Recent advances in genome-wide studies have revealed an abundance of extended non-coding regions that are conserved at the sequence level across eukaryotic genomes and are believed to be related to basic processes in the eukaryotic cell [1]. Some of these regions, known by several different names (conserved non-coding element, CNE, is one of them), have been shown to encode ncRNA genes or the 5' and 3' untranslated regions (UTRs) of mRNAs. Others appear to be true non-coding sequences whose functional significance is still poorly understood. Nevertheless, at least some of them have been shown to be involved in transcriptional regulation of key developmental genes in metazoan genomes [2].
The susceptibility of RNA-coding CNEs to negative selection, reflected in their sequence conservation, can be explained by the need for a strictly defined structure of their RNA products to ensure functional efficiency and/or molecule stability. In contrast, none of the available hypotheses explaining why true non-coding CNEs are subject to selective pressure is fully satisfactory [3]. The point is that the vast majority of known non-coding elements of the genome (transcription and translation regulators, origins of replication, sites of recombination, scaffold attachment sites, etc.), recognized as platforms for the protein trans-factors that carry out particular DNA transactions, have been found to evolve too fast to be detected on the basis of their overall sequence conservation in the genomes of evolutionarily distant organisms; only short nucleotide stretches (5-10 bp), the targets of the protein trans-factors, may be expected to constitute their conserved parts. Moreover, regardless of the function that can be attributed to CNEs, the length of these sequences (up to several hundred bp) is much greater than necessary to specify the binding of a single protein trans-factor or even of several cooperatively acting proteins.
Pairwise comparison of eukaryotic genomes identifies a variety of CNEs characterized by vastly different levels of sequence conservation. While the conservation of some of these CNEs appears to be a result of negative selection, the conservation of others may simply reflect the short evolutionary distance or the slow rate of neutral divergence among the species compared [4]. Therefore, only sequences showing an extreme level of conservation in comparison with other genome elements can be reliably identified as CNEs that are likely to be under evolutionary constraint. Some of the most conserved CNEs may exhibit up to 100 % sequence identity (for instance, ultraconserved CNEs in mammals [5]), and some of those may be shared by lineages as distantly related as invertebrates and vertebrates [6].
A non-protein-coding element essential for productive infection has recently been revealed in the genomes of insect pathogens, the large dsDNA viruses of the family Baculoviridae [7]. The newly discovered element was shown to exhibit extended homology (152-156 bp) across the representatives of the genus Alphabaculovirus and was accordingly called CNE. The alphabaculovirus CNE is composed of 7 blocks of absolutely conserved nucleotides interspersed with relatively variable nucleotide stretches of conserved length. The CNE of Autographa californica nucleopolyhedrovirus (AcMNPV) was found to overlap with a short ORF and three ncRNA genes [8]; it includes three previously described protein-binding motifs [9] and a sequence that activates the minimal promoter of the adjacent gene ie-2 [10]. Transfection-infection assays using a CNE knockout AcMNPV demonstrated that an additional, not yet defined essential function is also attributable to CNE [7].
The present research was aimed at estimating the degree of CNE sequence conservation, measured as the percentage of pairwise nucleotide identity (%ID), and comparing it with that of other functional elements of the alphabaculovirus genome. The average pairwise percent ID (av %ID) of CNE across 1225 alphabaculovirus pairs was found to be 73 %. The fact that this value greatly exceeds the conservation levels of the vast majority of baculovirus genomic functional elements, both coding and non-coding, strongly suggests that CNE is an essential genome element whose overall sequence appears to be under strong negative selection.

Materials and methods
The nucleotide sequences of CNE, polh and pif2 were extracted from 50 alphabaculovirus genomes available in the GenBank database using NCBI-BLAST.
Members of each orthologous group were aligned using the ClustalW software [11].

CNE has evolved very slowly over evolutionary time
Phylogenetic studies suggest that the representatives of the genus Alphabaculovirus are quite diverse [12]. It is of interest whether the level of virus diversification is high enough to consider their common element, CNE, a sequence of extreme conservation.
Thus, based upon the data above, it can be suggested that CNE evolved very slowly from the early stages of the alphabaculovirus evolution.

CNE conservation level is comparable with those of the most conserved coding regions
To quantify the relative conservation of CNE, we compared the degree of CNE conservation, calculated as the av %ID over the 1225-member sample resulting from pairwise comparison of 50 CNEs, with those of the coding regions of polh (the most conserved baculovirus protein) and pif2 (the most conserved among the 34 core proteins shared by all baculoviruses) [25]. In the present work we neither checked for nor removed recombinants, assuming that the av %ID values cannot be significantly affected by the presence of some recombinants in the 1225-member sample.
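The sample size follows from the number of unordered pairs among 50 sequences, 50 × 49 / 2 = 1225. A minimal Python sketch of a pairwise %ID calculation is given below. It assumes that the sequences are rows of one multiple alignment, that columns gapped in both rows are skipped, and that a gap opposite a nucleotide counts as a mismatch. These conventions, the function names and the toy alignment are illustrative assumptions, not necessarily the procedure used in this study.

```python
from itertools import combinations
from math import comb


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two rows taken from the same alignment.

    Assumption: columns gapped in both rows are ignored; a gap opposite
    a nucleotide is counted as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def pairwise_ids(alignment: dict) -> dict:
    """%ID for every unordered pair of rows in the alignment."""
    return {
        (p, q): percent_identity(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


def average_id(ids: dict) -> float:
    """Average pairwise %ID (av %ID) over all pairs."""
    return sum(ids.values()) / len(ids)


# Hypothetical toy alignment; not real CNE sequences.
toy = {"virusA": "ATGC-TAAG", "virusB": "ATGCATAAG", "virusC": "TTGC-TATG"}
ids = pairwise_ids(toy)
print(comb(50, 2))            # number of pairs among 50 genomes
print(round(average_id(ids), 1))
```

With real data, `alignment` would hold the 50 aligned sequences of one orthologous group, giving 1225 values per group.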
CNE, polh and pif2 sequences were extracted from each of the 50 alphabaculovirus genomes available in the GenBank database and subsequently aligned.
The 1225 %ID values displayed in the %ID matrix table were subjected to statistical analysis. Box and whisker plots were used to visualize the distribution ranges of %IDs for each orthologous group (Fig. 2). The dataset of each group of orthologs was found to follow a nearly normal distribution (p<0.01). The pairwise %ID values calculated for 1225 virus pairs ranged from 62 % to 100 % (av %ID 73 %) for CNE, from 79 % to 100 % (av %ID 88 %) for polh, and from 43 % to 100 % (av %ID 65 %) for pif2.
Comparison of the av %ID of CNE with those of the reference genes showed that CNE tends to be more conserved than pif2 and less conserved than polh. Consequently, CNE can be defined as an exceedingly conserved genomic element whose degree of conservation is comparable with that of the coding regions of the most conserved baculovirus protein genes.
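The quantities drawn in a box and whisker plot can be sketched as follows. The quartile convention (the "inclusive" method of Python's `statistics` module), the choice of whiskers at the minimum and maximum, the helper name and the toy values are assumptions made for illustration; they are not the data or the software behind Fig. 2.

```python
from statistics import mean, quantiles


def box_summary(values):
    """Numbers shown by a simple box and whisker plot, plus the mean.

    Assumption: whiskers extend to the minimum and maximum, and
    quartiles use the 'inclusive' interpolation method.
    """
    if len(values) < 2:
        raise ValueError("at least two values are required")
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),  # corresponds to av %ID
    }


# Hypothetical %ID values for one orthologous group.
toy_ids = [62, 70, 73, 80, 100]
summary = box_summary(toy_ids)
print(summary)
```

Applied to each orthologous group in turn, the same summary gives the range and av %ID figures of the kind reported above.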

CNE is quite unique among baculovirus nonprotein coding elements in respect of its evolutionary stability
Besides CNE, the set of known baculovirus non-protein-coding functional elements includes promoters of different temporal classes, hrs (homologous repeat regions, which act as enhancers and origins of replication), the so-called non-hr origins of replication (reviewed in [26]), 5' and 3' UTRs of mRNAs, and ncRNA genes (among them ncRNAs of unknown function [8] and microRNA precursor genes [27]). It is well established that the known alphabaculovirus non-hr origins are not alignable [28]; likewise, hrs originating from the genomes of relatively diverse alphabaculoviruses do not exhibit extended homology, although they contain a short conserved palindromic segment, a binding site for ie1 [29]. The nucleotide sequences of baculovirus miRNA precursor genes were found not to be conserved even among closely related baculoviruses [30].
These data raise the question of whether CNE is unique among the alphabaculovirus non-protein-coding elements with respect to extended sequence conservation. To clarify this issue, we searched for functional non-protein-coding sequences of extended length shared by the alphabaculovirus genomes.
The promoter regions of 44 late and very late genes shared by all alphabaculoviruses were chosen as the class of elements to be analyzed. The baculovirus late and very late genes are distinguished by the conserved motif (A/T/G)TAAG, which serves as a transcriptional initiator [8]. The 12-18 bp region encompassing the TAAG has been shown to be the minimal promoter determinant for basal transcription from late [31] and very late promoters [32]. AcMNPV was used as the reference virus to define the exact extent of the promoter region (the 5'UTR together with the overlapping minimal promoter) of each gene, because the genuine transcription start sites of all AcMNPV genes have been determined [8]. The regions extracted for the analysis from the AcMNPV genome were bounded by the nucleotide immediately upstream of the translation start codon and by the 30-bp nucleotide stretch upstream of the transcription start site TAAG.
Promoter regions shorter than 40 bp were excluded, and finally 37 of the 44 regions were chosen for the analysis, including the regions of polh, lef-9, ODV-e18, Ac81, lef-5, vlf-1, 38K, gp41, p33, alk-exo, p6.9, lef-2, Ac68, vp91, vp59, p143, pp34, ODV-e25, Ac106, Ac1313, pp31, Ac53, 25K, Ac75, Ac76, p18, ODV-e28, Ac102, F-protein, ODV-c42, me53, p78/83, Ac19, calyx, pkip, Ac60.
A recently published phylogenetic tree was used as a guide to infer the evolutionary distances within alphabaculoviruses [33], and 10 viruses belonging to diverse evolutionary lineages were chosen for the first round of analysis: AcMNPV, EppoNPV, LeseNPV, AgMNPV, OrleNPV, LdMNPV, HearSNPV, TnSNPV, MacoNPV-B, SeMNPV. The promoter regions of each of the 37 genes were extracted from each of these 10 genomic sequences. The lengths of the extracted sequences were similar to those of their AcMNPV orthologs if their TAAG was found at a position close to that of the TAAG site in the AcMNPV promoter region. If the TAAG was absent from that position, the 5' end of the extracted sequence was extended up to the first upstream TAAG.
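The extraction rule described above can be illustrated with a short Python sketch. It assumes a forward-strand sequence with 0-based coordinates, takes the TAAG and start codon positions as given, and applies the 30-bp upstream extension and the 40-bp minimum length stated in the text. The function names and the toy sequence are hypothetical, and the sketch does not reproduce the comparison with the AcMNPV ortholog positions.

```python
import re

# (A/T/G)TAAG initiator motif; the lookahead reports overlapping hits.
LATE_INITIATOR = re.compile(r"(?=[ATG](TAAG))")


def find_taag_sites(seq: str) -> list:
    """0-based positions of the T of TAAG in every (A/T/G)TAAG motif."""
    return [m.start(1) for m in LATE_INITIATOR.finditer(seq.upper())]


def promoter_region(seq: str, taag_pos: int, start_codon_pos: int,
                    upstream: int = 30) -> str:
    """Region from `upstream` bp before TAAG to the last nucleotide
    before the translation start codon."""
    if not 0 <= taag_pos < start_codon_pos <= len(seq):
        raise ValueError("TAAG must lie upstream of the start codon")
    begin = max(0, taag_pos - upstream)
    return seq[begin:start_codon_pos]


def long_enough(region: str, min_len: int = 40) -> bool:
    """Apply the minimum-length filter (regions under 40 bp are dropped)."""
    return len(region) >= min_len


# Hypothetical toy sequence: 35 nt, ATAAG, a 20-nt 5'UTR, then ATG.
toy = "C" * 35 + "ATAAG" + "C" * 20 + "ATGAAA"
sites = find_taag_sites(toy)
region = promoter_region(toy, sites[0], toy.index("ATG"))
print(sites, len(region), long_enough(region))
```

In this toy case the region spans 30 nt upstream of TAAG, the TAAG itself and the 20-nt leader, so it passes the length filter.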
Multiple sequence alignment was used to assess the conservation levels of the 37 promoter regions. The consensus sequences derived from the alignments served as the main criterion of conservation. A strong tendency towards extended sequence conservation was clearly displayed by the alignments of 9 promoter regions (polh, ODV-e18, pp34, ODV-e25, pp31, 25K, p18, Ac102, calyx): long sequence stretches (from 40 to 140 bp) enriched in absolutely conserved nucleotides were observed. The most conserved of them, belonging to the polh promoter region, and the longest of them, belonging to the p18 promoter region, were chosen for subsequent alignment using the sample of 50 viruses. The multiple alignments demonstrated that the conserved nucleotide stretch of the polh promoter region is represented by the full 5'UTR, whereas the conserved nucleotide stretch of the p18 promoter region includes not only the full 5'UTR but also a dozen conserved nucleotides located just upstream of the TAAG.
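One simple way to read absolutely conserved positions off an alignment is sketched below in Python. The mask notation (a nucleotide where all rows agree, a dot elsewhere), the function names and the toy rows are assumptions for illustration only; the consensus sequences used in this study may have been derived differently.

```python
def conserved_mask(rows: list) -> str:
    """Mark alignment columns that are identical in every row.

    Returns the shared nucleotide for absolutely conserved columns and
    '.' for variable or gapped columns.
    """
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    mask = []
    for column in zip(*(r.upper() for r in rows)):
        first = column[0]
        if first != "-" and all(c == first for c in column):
            mask.append(first)
        else:
            mask.append(".")
    return "".join(mask)


def conserved_fraction(mask: str) -> float:
    """Share of columns that are absolutely conserved."""
    if not mask:
        raise ValueError("empty mask")
    return sum(ch != "." for ch in mask) / len(mask)


# Hypothetical aligned promoter fragments from three viruses.
rows = ["ATAAGTAC", "ATAAGCAC", "TTAAGTAC"]
mask = conserved_mask(rows)
print(mask, conserved_fraction(mask))
```

A stretch "enriched in absolutely conserved nucleotides" would show up in such a mask as a long run with few dots.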
The %ID datasets of the promoter regions of both genes were found to be distributed nearly normally (p<0.01). The %ID values of the polh promoter region varied from 49 % to 100 % (av %ID 74 %) over sequence lengths ranging from 39 to 42 bp. The %ID values of the p18 promoter region varied from 35 % to 100 % (av %ID 65 %) over sequence lengths ranging from 129 to 145 bp (Fig. 2). Comparison of the av %ID values suggests that the av %ID of CNE is nearly the same as that of the polh promoter region and higher than that of the p18 promoter region.

Discussion
In this study, the relationship between the degree of CNE conservation and the degree of virus pair divergence has been analyzed. This preliminary analysis revealed a tendency towards strong evolutionary stability of CNE. To investigate this tendency further, the CNE conservation level was compared with those of the nucleotide sequences of genes for structural proteins under strong evolutionary constraint (polh and pif2) and with those of the most conserved promoter regions of the late genes (polh and p18). The CNE sequences, gene coding sequences and gene promoter region sequences were extracted from 50 alphabaculovirus genomes available in the NCBI database, aligned and subjected to pairwise comparison. The resulting data allowed us to recognize the alphabaculovirus CNE as an element of extreme conservation across its 152-156 bases, more highly conserved than the vast majority of both coding and non-coding sequences in the alphabaculovirus genomes. CNE is unique among the previously studied baculovirus non-protein-coding elements (hrs, non-hr oris, genes of miRNA precursors) because the latter do not exhibit a tendency towards sequence conservation between distantly related alphabaculoviruses (see references in Results, section 3).
Of particular interest are the results of the promoter region analysis conducted within the framework of this study: nine baculovirus late genes (polh, ODV-e18, pp34, ODV-e25, pp31, 25K, p18, Ac102, calyx) were found to be associated with promoter regions of relatively high sequence conservation. The %ID values of the two most conserved of them (the polh promoter region, including the 5'UTR, and the p18 promoter region, including the 5'UTR and a dozen nucleotides upstream of the transcription start site) were found to be comparable with the CNE %ID value. The extreme conservation of some baculovirus 5'UTRs is readily explicable: 5'UTRs may represent RNA regions that adopt a selectively constrained spatial structure providing molecule stability and/or interaction with proteins.
In contrast, it is difficult to predict whether the extreme conservation of the alphabaculovirus CNE reflects its similarity to some eukaryotic CNEs or indicates the uniqueness of a viral element providing alphabaculovirus-specific function(s).
Three alternative (but not mutually exclusive) hypotheses may be suggested to explain the above phenomenon.
1. "The overlap hypothesis". The CNE represents a region where several non-coding elements of extended length overlap. Overlapping elements may evolve slowly because mutations affecting more than one element have an increased deleterious effect. This assumption can explain the overall high degree of CNE conservation, but the following question remains unanswered: why is the length of the quite variable CNE blocks kept unaltered (i.e. why are insertions or deletions within the variable regions not allowed)?
2. "The secondary structure hypothesis". The conservation of the CNE nucleotide sequence is dictated by the significance of the secondary structure of this element at either the DNA or the RNA level. This assumption is supported by the enrichment of CNE in dyad symmetry elements, which characterizes CNE as a sequence potentially capable of DNA conformational changes [34], and by the fact that, as shown by the AcMNPV transcriptome analysis, three ncRNAs are likely to overlap CNE in the AcMNPV genome [8].
It should be noted, however, that the conserved parts of the overlapping ncRNAs represented by CNE were shown not to be essential for the AcMNPV life cycle [7], which calls into question the necessity of their extreme conservation. In addition, the hypothesis can be considered strictly valid only if CNE-overlapping ncRNAs are identified in all other alphabaculovirus genomes.
3. "The module hypothesis". CNE appears to be a module consisting of several conserved motifs that bind cooperatively acting proteins. Such modules, within which particular distances between protein-binding sites allow the bound proteins to interact, were predicted by bioinformatic approaches in eukaryotic genomes [35]. However, such organization has been observed only in very few cases of transcriptional regulatory elements [36]; the most studied among them is the enhancer of the interferon-β gene, which rather represents a particular case, as it consists of numerous non-spaced overlapping motifs [37]. While the c4-c5 CNE conserved block was shown to be the only one that binds proteins at 40 hours post infection [9], the c2, c3 and c6 blocks remain candidate binding motifs for cooperatively acting proteins at some other time point. It also seems plausible to speculate that the two similar terminal blocks c1 and c7, which bind proteins in uninfected cells as well as in early infection [9], represent such recognition motifs for cooperatively acting proteins, and that the strictly predetermined distance between them is a prerequisite for their cooperation and thus defines the fixed CNE length.