Protein splicing and its evolution in eukaryotes

Inteins, or protein introns, are segments of protein sequences that are excised post-translationally, with their flanking regions (exteins) being spliced together. This process is called protein splicing. Inteins were originally found in prokaryotes and unicellular eukaryotes. However, the general principles of post-translational protein rearrangement have evolved, yielding different post-translational modifications of proteins in multicellular organisms. For clarity, these non-intein-mediated events are called either protein rearrangements or protein editing. The most intriguing example of protein editing is proteasome-mediated splicing of antigens in vertebrates, which may play an important role in antigen presentation. Other examples of protein rearrangements are the maturation of Hh proteins (critical receptors in embryogenesis) and the maturation of several metabolic enzymes. Despite the lack of experimental data, we attempt to analyze some intriguing examples of the evolution of protein splicing.


Introduction.
Splicing is well known as a processing mechanism that excises an internal region from a precursor molecule and subsequently ligates the flanking sequences. Until recently, splicing was known only for nucleic acids. However, in 1990, Stevens et al. found that one yeast protein, VMA1, undergoes an unusual post-translational modification via a process similar to DNA and RNA splicing [1,2]. By analogy, they named this process «protein splicing». Below we briefly describe the main concepts of this phenomenon.
Protein splicing is an autocatalytic process that requires no cofactor or enzyme, which distinguishes it from conventional post-translational processing (Fig. 1). The term intein (internal protein) was proposed for the central protein region that is excised, and the terms N- and C-extein (external protein) for the corresponding flanking sequences [2,3].
Sequence analysis has identified the most important sites and motifs within inteins (Fig. 2) [4,11-13]. Position 1 at the N-terminus of an intein is almost always occupied by Cys or Ser or, in extremely rare cases (in Ala-inteins), by Ala. The second most important motif is a short C-terminal sequence of 8 residues, seven of which belong to the intein, while the eighth is the N-terminal residue of the C-extein. The last two residues of this motif play an important role in the hydrolysis of the peptide bond between the intein and the C-extein [4,13], which is necessary for the final step of splicing.
A more detailed description of the functional motifs and the chemical mechanism of splicing is omitted here and can be found in previous works (e.g. [8,14]).
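The intein/extein terminology and the position-1 rule described above can be illustrated with a minimal Python sketch. It is not a model of the chemical mechanism: the function name `splice`, the toy sequence, and the intein boundaries are all hypothetical, and the only rule checked is that the first intein residue is Cys, Ser, or Ala.

```python
from typing import Tuple

# Residues allowed at intein position 1 according to the text: Cys, Ser or (rarely) Ala.
ALLOWED_FIRST_RESIDUES = {"C", "S", "A"}


def splice(precursor: str, intein_start: int, intein_end: int) -> Tuple[str, str]:
    """Excise precursor[intein_start:intein_end] and join the flanking exteins.

    Indices are 0-based and end-exclusive. Returns (spliced_protein, intein).
    """
    if not 0 < intein_start < intein_end < len(precursor):
        raise ValueError("the intein must be internal, with non-empty exteins on both sides")
    n_extein = precursor[:intein_start]
    intein = precursor[intein_start:intein_end]
    c_extein = precursor[intein_end:]
    if intein[0] not in ALLOWED_FIRST_RESIDUES:
        raise ValueError(f"unexpected residue {intein[0]!r} at intein position 1")
    return n_extein + c_extein, intein


# Hypothetical toy precursor: N-extein "MKV", intein "CAAAHN", C-extein "SGG".
spliced, intein = splice("MKVCAAAHNSGG", 3, 9)
print(spliced, intein)
```

In this toy case the two exteins are joined into one product and the intein is returned separately, mirroring the two outcomes of a splicing event.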
Evolution of inteins in prokaryotes and primitive eukaryotes. Inteins apparently appeared early in evolution [5,15]. This is indirectly supported by the fact that they are spread across all three domains of life. Intein host proteins are very diverse and include DNA and RNA polymerases, ATPases, proteases, metabolic enzymes, and transcription and translation factors [15]. However, their distribution is highly nonuniform. For instance, intein-coding sequences were found in 19 genes of Methanococcus jannaschii alone, whereas no intein-coding sequences were found in the genomes of more than 30 species (prokaryotes and eukaryotes), including Arabidopsis thaliana, Agrobacterium tumefaciens, Escherichia coli K12, Drosophila melanogaster, and Homo sapiens [4].
Horizontal transfer is an important mechanism for the spread of intein sequences. This is evident from the fact that inteins of homologous proteins from different species often share many features [8]. Some data also point to the possibility of horizontal transfer of inteins between prokaryotes and eukaryotes. For instance, the inteins within the dnaB genes of two different organisms (R. marinus and Synechocystis sp. PCC6803) display far higher homology than the exteins [22,24]. Inteins can also be transferred between non-allelic genes, possibly through endonucleolytic cleavage of a similar, though not identical, site [15]. This hypothesis is supported by the fact that the M. leprae and M. tuberculosis genomes have three homologous inteins in  [4].
Inter-specific horizontal transfer of inteins is mediated first of all by viruses. Intein-coding sequences were found in the genomes of several prokaryotic and eukaryotic viruses [19,25-27]. The presence of intein sequences in virus genomes (e.g. the intein APMVPol in polB of the Mimivirus [19,27]) provides a clear illustration of a mechanism that enables horizontal transfer of inteins.
As mentioned above, horizontal transfer is an important mechanism for spreading intein sequences and results in the similarity of inteins of homologous proteins from different species [8]. A clear example is provided by the eukaryotic chitin synthase intein PanCHS2. This intein was found in P. anserina, but its analogs were not detected in the chitin synthase genes of other species. Unexpectedly, PanCHS2 is highly homologous to the glutamate synthase intein PanGLT1 [15]. This finding suggests horizontal transfer of intein-coding sequences between non-allelic genes and different hosts.
Evolution of the protein splicing principle in higher eukaryotes. It is reasonable to presume that the principle of protein splicing has been inherited and further developed in eukaryotic organisms. Several papers have described non-intein-mediated protein rearrangements in eukaryotes. The mechanism of these rearrangements is not autocatalytic: the modifications are catalyzed either by a specific protease [20,28] or by the proteasome [16-19,29]. For clarity, it was suggested to call these non-intein-mediated events either protein rearrangements or protein editing. Here we consider some interesting cases.
Post-translational rearrangement by reverse proteolysis in plants.
One example of post-translational modification, namely protein splicing by reverse proteolysis, was discovered in plants. It was observed for the lectin concanavalin A (ConA) [29], a protein that binds specifically to certain structures found in various sugars, glycoproteins, and glycolipids, mainly internal and nonreducing terminal alpha-mannosyl groups.
Some time ago it was shown that maturation of this protein occurs in an atypical way. Although the exact mechanism of this splicing in plants is still not clear, the fact that all cleavage occurs before Asn suggests the participation of an endopeptidase. Indeed, an in vitro study showed that a specific enzyme, asparaginyl endopeptidase (AEP), could cleave ConA and then re-ligate the fragments through its reverse proteolytic activity [20,30].
The initial precursor of ConA (glyco-pro-ConA) is first activated by deglycosylation to pro-ConA (Fig. 3). Pro-ConA is then cleaved to produce two distinct polypeptides that are transposed and re-ligated to form mature ConA. The biological role of this rearrangement is not clear.
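The cleave, transpose, and re-ligate sequence described above can be sketched in a few lines of Python. This is a deliberate simplification: the function name, the toy sequence, and the cleavage position are hypothetical, and any further trimming of the fragments during maturation is ignored.

```python
def transpose_and_religate(pro_protein: str, cleavage_site: int) -> str:
    """Cut pro_protein at cleavage_site (0-based) and join the two fragments in swapped order."""
    if not 0 < cleavage_site < len(pro_protein):
        raise ValueError("the cleavage site must lie inside the sequence")
    first_fragment = pro_protein[:cleavage_site]
    second_fragment = pro_protein[cleavage_site:]
    # Transposition: the original C-terminal fragment now precedes the N-terminal one.
    return second_fragment + first_fragment


# Hypothetical toy precursor, cut after its fifth residue.
print(transpose_and_religate("AAAANBBBB", 5))
```

The result is a circular permutation of the precursor: the residues are the same, but the former C-terminal fragment now leads.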
Proteasome-mediated protein splicing. Recently it has been shown that proteasomes are sites of not only protein degradation but also post-translational protein modification [16,17]. In particular, the production of an antigenic peptide was shown to be performed by the proteasome through a protein splicing mechanism (peptide excision and re-ligation) (Fig. 4). The authors found that CTLs that kill cancer cells overexpressing fibroblast growth factor-5 (FGF-5) recognize a nine-residue FGF-5 peptide that is generated by protein splicing and presented by MHC class I molecules. Using a proteasome inhibitor, the authors demonstrated that the FGF-5 protein splicing observed in their system is proteasome-mediated. This splicing was carried out by the 20S proteasome in vitro and in vivo. They showed that, in contrast to intein splicing, which usually excises a polypeptide of at least 134 aa, the excised polypeptides in their system were only 18-40 aa long [17]. Thus, proteasome-mediated splicing overcomes the limitation associated with the length of intervening sequences and can be considered the next step in the evolution of protein splicing. The detailed mechanism of proteasome-mediated splicing is described in [17]. Briefly, it was shown that the proteasome β-subunits directly catalyze the protein rearrangement in a way very similar to protein trans-splicing.
This process allows the immune system to monitor non-contiguous peptide sequences generated post-translationally. This capability represents an enormous increase in the ability of immune cells to recognize intrinsic and foreign proteins.
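To give a rough feeling for how much splicing can enlarge the set of peptides available for presentation, the following Python sketch enumerates contiguous peptides and peptides formed by excising an intervening segment and joining the two flanking fragments in their original order. It is only a counting exercise under stated assumptions: the function names, the toy sequence, and the gap limits are hypothetical, and proteasome cleavage specificity and MHC binding are ignored.

```python
from typing import Set


def contiguous_peptides(protein: str, length: int = 9) -> Set[str]:
    """All contiguous peptides of the given length."""
    return {protein[i:i + length] for i in range(len(protein) - length + 1)}


def spliced_peptides(protein: str, length: int = 9,
                     min_gap: int = 1, max_gap: int = 40) -> Set[str]:
    """Peptides made by excising min_gap..max_gap residues and joining the flanks in order."""
    peptides = set()
    n = len(protein)
    for left_len in range(1, length):          # size of the N-terminal fragment
        right_len = length - left_len          # size of the C-terminal fragment
        for left_start in range(n - left_len + 1):
            left_end = left_start + left_len
            for gap in range(min_gap, max_gap + 1):
                right_start = left_end + gap
                if right_start + right_len > n:
                    break                      # longer gaps would run past the end
                peptides.add(protein[left_start:left_end]
                             + protein[right_start:right_start + right_len])
    return peptides


# Hypothetical toy sequence with short peptides so the sets stay small.
toy = "ABCDEFGHIJKL"
print(len(contiguous_peptides(toy, length=3)))
print(len(spliced_peptides(toy, length=3, max_gap=2)))
```

Even with a short toy sequence and a small maximum gap, the spliced set is several times larger than the contiguous one, and it grows further as the allowed gap range widens.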
Intein-like processing of embryonic signal proteins of the Hedgehog (Hh) family. Signal proteins of the Hh family were found in humans, mice, insects, and other multicellular organisms. The Hh family proteins play a crucial role in embryonic development [5,8,24,31-33]. In the context of this review, we consider the autocatalytic maturation of Hh proteins. The Hh proteins are synthesized as inactive precursors consisting of two domains, an N-terminal signal domain (Hh-N) and a C-terminal catalytic domain (Hh-C) (Fig. 5, B). Hh-C forms an intein-like domain containing conserved motifs similar to those of inteins. The structural similarity suggests a common origin of inteins and Hh proteins. Yet their host organisms differ: inteins occur mostly in unicellular organisms, whereas Hh proteins were found only in multicellular organisms [8,32]. In contrast to inteins, the Hh proteins cleave themselves into two parts during maturation. The reaction requires an endogenous inducer, cholesterol. As a result, the peptide bond between the Hh-N and Hh-C parts is broken and cholesterol is covalently linked to Hh-N. Modified with cholesterol, Hh-N migrates toward the plasma membrane as a receptor or is exported from the cell to perform its function [30,32].
Autocatalytic modification of enzymes. Similar to protein splicing, reactions with subsequent hydrolysis of the peptide bond are involved in the maturation of several enzymes, such as N-terminal nucleophilic hydrolases, N-terminal aminotransferases, and pyruvate-dependent enzymes (Fig. 5, C, D) [22,31,33,34]. Interestingly, their autoprocessing is catalyzed by domains that differ completely from inteins in sequence and three-dimensional structure, yet the chemical principles are the same.
Catalytic activity of enzymes from the first two families depends on a highly reactive Ser, Thr, or Cys located at the N-terminus. The mechanism of autoprocessing can be illustrated by the example of Flavobacterium meningosepticum glycosyl asparaginase, since a similar mechanism is involved in the maturation of the glycosyl asparaginases of other organisms, including H. sapiens. The mature enzyme is a heterodimer encoded by one gene. An inactive protein precursor is converted into the active form by autocatalytic cleavage of the peptide bond before Thr152, yielding two separate subunits, α and β. The reaction is triggered by an N-O shift that generates an ester bond, which is then hydrolyzed (Fig. 5, C) [22,31]. As a result, Thr152 becomes the N-terminal residue of the β-subunit. Substitution of Thr152 with an amino acid residue of a different type (other than Ser or Cys) totally abolishes cleavage.
Biological functions of protein splicing in cells. The function of inteins was initially explained in terms of «selfish genes». An intein uses its endonuclease domain to spread within the genome and its Hint domain to cut and paste the host protein, thereby preserving the protein's biological activity (i.e., the viability of the host cell). Located within a protein sequence, an intein exploits the cell's protein expression machinery, including the regulatory sequences of the host gene and its mRNA, which testifies to the selfishness of the intein sequence.
However, this point of view is rather naive. Intein sequences can indeed be classified as mobile elements according to some of their features, first and foremost the process of homing [8,15]. However, while retrotransposons occur in many copies in the genome, intein sequences are restricted to certain sites of particular genes, and their genomic distribution directly correlates with the copy number of the intein-coding gene. Moreover, because inteins are integrated into highly conserved regions, the functions of the host protein before and after excision of an intein should differ.
It seems more probable that free inteins (after splicing) perform as yet unknown regulatory and enzymatic functions in the cell [35]. For instance, inteins may regulate the activity of host enzymes by controlling their maturation via cis- or trans-splicing. The clearest example is the assembly of an active enzyme (DnaE in Synechocystis sp. PCC6803) from two different proteins through the trans-splicing mechanism [14,15,36]. The N- and C-terminal halves of DnaE (the catalytic subunit of DNA polymerase III) are encoded by two separate genes, dnaE-N and dnaE-C, respectively.
The dnaE-N product consists of an N-extein sequence followed by a 123-aa intein sequence, whereas the dnaE-C product consists of a 36-aa intein sequence followed by a C-extein sequence. Together, the two intein sequences reconstitute a split mini-intein, which mediates reconstitution of the full, active DnaE. This means that the intein can act here as an integrated regulator of polymerase activity.
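The layout of the two gene products can be made concrete with a short Python sketch of trans-splicing. Only the 123-aa and 36-aa intein lengths come from the text. The function name and the placeholder sequences are hypothetical, and the sketch performs string bookkeeping only, not any modelling of the reaction.

```python
INTEIN_N_LENGTH = 123  # intein part at the C-terminal end of the dnaE-N product
INTEIN_C_LENGTH = 36   # intein part at the N-terminal end of the dnaE-C product


def trans_splice(dnae_n_product: str, dnae_c_product: str) -> str:
    """Join the N-extein of one product to the C-extein of the other, dropping both intein parts."""
    if len(dnae_n_product) <= INTEIN_N_LENGTH or len(dnae_c_product) <= INTEIN_C_LENGTH:
        raise ValueError("each product must contain an extein in addition to its intein part")
    n_extein = dnae_n_product[:-INTEIN_N_LENGTH]
    c_extein = dnae_c_product[INTEIN_C_LENGTH:]
    return n_extein + c_extein


# Placeholder sequences: upper-case letters stand for exteins, lower-case for intein parts.
n_product = "NEXT" + "x" * INTEIN_N_LENGTH
c_product = "y" * INTEIN_C_LENGTH + "CEXT"
print(trans_splice(n_product, c_product))
```

Neither product alone yields the joined exteins, which is the sense in which the split intein can gate formation of the full-length enzyme.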
In addition, the excised inteins may catalyze certain reactions in the cell, using their activated N-terminal nucleophilic group and acting similarly to activated glycosyl asparaginase, ConA, or pyruvoyl enzymes [22].
The occurrence of protein splicing in vertebrates has important implications for the complexity of the vertebrate proteome and for the immune recognition of self and foreign peptides. The principle of protein splicing is fully exploited by the immune systems of mammals and plants. Its use allows different parts of antigens to be combined, providing a combinatorial increase in the number of possible combinations.
In eukaryotes, DNA recombination and RNA splicing were already known to increase the number of different proteins produced by each mammalian gene. The discovery of protein splicing adds a new tool to this kit. In a broader context, the existence of protein splicing in vertebrates greatly increases the cell's options for converting genetic information at the post-translational level. Taken together, splicing (of DNA, RNA, and protein) is a powerful mechanism that realizes the main principle of multi-functionality: the transformation of original compact genetic information into a wide diversity of protein structures and functions after its expression.