Identification of the hierarchy of dynamic domains in proteins: comparison of the HDWA and HCCP techniques

Aim. Several techniques exist for identifying the hierarchy of dynamic domains in proteins. The goal of this work is to compare systematically two recently developed techniques, HCCP and HDWA, on a set of proteins from diverse structural classes. Methods. The HDWA and HCCP techniques are used. HDWA is designed to identify hierarchically organized dynamic domains in proteins from Molecular Dynamics (MD) trajectories, while HCCP utilizes the normal modes of simplified elastic network models. Results. The dynamic domains found by HDWA are shown to be consistent with the domains identified by HCCP and other techniques. At the same time, HDWA correctly identifies flexible mobile loops of proteins, which is hard to achieve with other model-based domain identification techniques. Conclusion. HDWA is shown to be a powerful method for the analysis of MD trajectories, which can be used in various areas of protein science.

Introduction. The method of Hierarchical Clustering of Correlation Patterns (HCCP) was developed for identifying dynamic domains in proteins [1]. HCCP is the only existing technique that identifies the hierarchy of dynamic domains. Each dynamic domain can be divided into smaller, relatively independent subdomains of the next hierarchical level, and so on. The HCCP technique was successful in revealing the statistics of dynamic domains in the PDB [2], in finding candidate proteins for biosensor design [3] and in simulating domain closure in hinge-bending proteins [4]. Despite these successful applications, the HCCP technique has a serious limitation. It depends on matrices of residue-residue correlations of motion, which have to be computed by other techniques. It was shown that the Gaussian Network Model (GNM) [5][6][7][8][9] is an optimal choice for constructing such matrices when only a single crystal structure of a protein is available. However, the use of GNM (or any other technique based on normal mode calculations) restricts the sampled protein motions to small-amplitude harmonic displacements around a reference structure [10,11]. As a result, only a tiny part of the protein conformational space can be described. The dynamic domains computed from the correlations of such restricted motions may not correspond to the pattern of large-amplitude anharmonic dynamics of real proteins. Certain techniques, such as DynDom [12], utilize the differences between two alternative structures of a protein or between several frames from Molecular Dynamics (MD) trajectories, which makes it possible to take large conformational displacements into account. However, these techniques do not reveal the hierarchical arrangement of dynamic domains.
Recently, the Hierarchical Domain-Wise Alignment technique (HDWA) has been developed. It is conceptually similar to the HCCP technique [1], but uses different input data. HDWA exploits the hierarchical character of protein motions recorded in MD trajectories, while HCCP utilizes the patterns in the matrices of residue-residue correlations of motion, which are computed using GNM. HDWA identifies a hierarchy of dynamic domains from MD trajectories or any other sets of atomic coordinates and allows the stability and interdependence of domains to be estimated.
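For orientation, the kind of residue-residue correlation matrix that HCCP takes as input can be sketched with a textbook GNM calculation: a Kirchhoff (connectivity) matrix is built from C-alpha contacts and its pseudo-inverse gives the covariances of residue fluctuations. This is a minimal sketch, not the implementation used in this work; the function name, the 7 angstrom cutoff and the toy coordinates are assumptions made for illustration.

```python
import numpy as np

def gnm_correlations(ca_coords, cutoff=7.0):
    """Normalized GNM cross-correlations from C-alpha coordinates (angstroms).

    The cutoff value is an illustrative assumption, not taken from the paper.
    """
    coords = np.asarray(ca_coords, dtype=float)
    # Pairwise distances between all residues
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Contacts: pairs closer than the cutoff, excluding self-pairs
    contact = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)
    # Kirchhoff matrix: -1 for each contact, number of contacts on the diagonal
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))
    # The pseudo-inverse skips the zero eigenvalue of the connected network
    cov = np.linalg.pinv(kirchhoff)
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

# Toy input (hypothetical): six residues on a straight line, 3.8 angstroms apart
toy = [[3.8 * i, 0.0, 0.0] for i in range(6)]
corr = gnm_correlations(toy)
print(np.round(corr, 2))
```

In this toy chain the two ends come out anti-correlated, which is the kind of pattern a correlation-based clustering would use to separate domains. The harmonic nature of the model is also visible here: the result depends only on the contact topology of a single structure.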
In the current work we systematically compare the HDWA and HCCP techniques using a set of four test proteins of different structural classes. A comparison with the widely used DynDom technique is also performed.
Molecular dynamics simulations. All MD simulations were performed using the GROMACS 4.0 suite of programs [17]. All four test proteins were simulated under NPT conditions at a temperature of 300 K and a pressure of 1 bar, maintained by the Berendsen thermostat and the Berendsen barostat, respectively [18]. The GROMOS G43a2 force field for the proteins [19] and the SPC model for water [12] were used. The bond lengths in the proteins were constrained using the LINCS algorithm [20]. The water molecules were constrained using SETTLE [21]. The fourth-order PME algorithm [22] with a cut-off of 1 nm was used to compute electrostatic interactions. A time step of 2 fs was used in all cases except human serum albumin, which was simulated with a time step of 4 fs after increasing the masses of hydrogen atoms to 4 a.u. and decreasing the masses of the corresponding heavy atoms [23]. The number of water molecules, the length of the trajectories and the number of frames used in HDWA for all studied proteins are summarized in Table 1. The frames used in HDWA were extracted from the equilibrated parts of the trajectories at equal intervals. The quality of equilibration was controlled by monitoring the backbone RMSD and the secondary structure content of the proteins.
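The extraction of frames at equal intervals from the equilibrated part of a trajectory can be illustrated as follows. This is only a sketch: the function name and the frame counts in the usage example are hypothetical and do not correspond to the values in Table 1.

```python
def equally_spaced_frames(n_total, n_equil, n_select):
    """Indices of n_select frames spread evenly over the equilibrated part.

    n_total  : total number of frames in the trajectory
    n_equil  : number of initial frames discarded as equilibration
    n_select : number of frames to pass on to the analysis
    """
    if not 0 <= n_equil < n_total:
        raise ValueError("n_equil must lie inside the trajectory")
    n_avail = n_total - n_equil
    if not 1 <= n_select <= n_avail:
        raise ValueError("n_select must be between 1 and the number of available frames")
    step = n_avail / n_select  # stride between selected frames (may be fractional)
    return [n_equil + int(i * step) for i in range(n_select)]

# Hypothetical numbers: 1000 saved frames, first 200 discarded, 10 frames kept
frames = equally_spaced_frames(1000, 200, 10)
print(frames)  # [200, 280, 360, 440, 520, 600, 680, 760, 840, 920]
```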
Choice of the reference structure. If the molecular system subjected to MD simulation is well equilibrated, it samples an ensemble of states that are all equally suitable as a reference structure for domain-wise alignment. Choosing any single frame as the reference means that HDWA will attempt to transform all frames of the trajectory to this selected structure, which inevitably introduces a bias. Indeed, in this case the motions of domains that describe the transitions between other trajectory frames are not taken into account. To avoid such bias, the structure averaged over the whole trajectory is used as the reference. The common argument against the use of average structures is their «unphysical» nature. Indeed, the average structure may contain steric clashes of atoms, unusually long bonds, etc. This may constitute a significant problem for methods that rely on the correctness of the protein topology. However, HDWA does not suffer from this problem because it uses only the geometrical positions of atoms, regardless of any «unphysical» contacts or bonds.
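The averaging step described above can be sketched in a few lines. The sketch assumes that the frames have already been superimposed so that overall rotation and translation are removed; the source does not state how this is handled in HDWA, so this is an assumption, as are the function names and the toy coordinates.

```python
import numpy as np

def average_structure(frames):
    """Mean position of every atom over a set of frames.

    frames: array-like of shape (n_frames, n_atoms, 3), assumed to be
    already superimposed (an assumption of this sketch).
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 3 or frames.shape[2] != 3:
        raise ValueError("frames must have shape (n_frames, n_atoms, 3)")
    return frames.mean(axis=0)

def rmsd_to_reference(frames, reference):
    """RMSD of each frame from the reference, without any further fitting."""
    diff = np.asarray(frames, dtype=float) - np.asarray(reference, dtype=float)
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))

# Toy trajectory (hypothetical): two atoms, two frames shifted along y
traj = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 2.0, 0.0], [1.0, 2.0, 0.0]],
]
ref = average_structure(traj)
print(ref)                           # atoms at y = 1, halfway between the frames
print(rmsd_to_reference(traj, ref))  # both frames are equally far from the average
```

The toy case shows the point made in the text: the average lies at the same distance from both frames, whereas choosing either frame as the reference would put that frame at zero distance and the other at the full displacement.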
Results and discussion. Top-level domains. The boundaries of top-level domains identified by the HDWA, HCCP and DynDom techniques were compared. In the case of DynDom, which needs two struc- In the case of calmodulin, all techniques place the boundary between the domains between residues 69 and 75. The discrepancy among the techniques within this range is easily explained by the fact that the long helix, which connects the two domains, is rather featureless in terms of structure and dynamics.
LAOBP is a classical hinge-bending protein, which exhibits a large displacement of domains around a well-defined hinge. The domain boundary in LAOBP is very well defined, so it is not surprising that all techniques find it correctly, with a difference of 1-2 residues.
Human serum albumin is the most interesting of the studied proteins in terms of its domain organization. This protein is quite large and exhibits complex multicomponent dynamics. It also contains many flexible unstructured loops, which are important for its function. All three techniques find two top-level domains in serum albumin; however, their boundaries differ significantly. HCCP and HDWA produce similar results, with three continuous segments in each domain. The boundaries of these segments are shifted by up to 8 residues, but the overall arrangement is the same. DynDom identifies only two segments in each domain. It should be noted that the DynDom domain assignment for serum albumin is rather unreliable. It depends significantly on the choice of the two alternative structures used for domain identification (data not shown). This may be explained by the high flexibility of serum albumin.
Subdomains. Both the HDWA and HCCP techniques are able to identify subdomains of several hierarchical levels. However, it is impossible to compare these techniques on a level-by-level basis because their algorithms of domain identification are different. A particular subdomain identified by HDWA at, say, level 3 may appear in HCCP at level 7 or may not appear at all. Thus the following comparison procedure was used. HDWA was run with 6 hierarchical levels for all the proteins studied. Each subdomain found by HDWA at each level was matched against all the subdomains identified by HCCP for the same protein at levels 1 to 50. Matching was performed in terms of the Hamming distance between the binary vectors that represent the domains. After this procedure, the mean mismatches for each HDWA hierarchical level were computed (Table 3).
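The matching procedure can be sketched as follows. Each domain is represented as a binary vector over residues (1 if the residue belongs to the domain), and the mismatch of an HDWA subdomain is taken here as the smallest Hamming distance to any HCCP subdomain. Taking the minimum as the "best match" is an assumption of this sketch, as are the function names and the toy 8-residue data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("domain vectors must cover the same residues")
    return sum(x != y for x, y in zip(a, b))

def best_mismatch(domain, candidates):
    """Smallest Hamming distance between one domain and a pool of candidates."""
    return min(hamming(domain, c) for c in candidates)

def mean_mismatch_per_level(hdwa_levels, hccp_domains):
    """Mean best mismatch for every HDWA level.

    hdwa_levels  : dict mapping level number -> list of binary domain vectors
    hccp_domains : flat list of binary vectors pooled over all HCCP levels
    """
    return {
        level: sum(best_mismatch(d, hccp_domains) for d in domains) / len(domains)
        for level, domains in hdwa_levels.items()
    }

# Toy data (hypothetical) for an 8-residue protein
hdwa = {
    1: [[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]],
    2: [[1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0]],
}
hccp = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0],
]
mismatch = mean_mismatch_per_level(hdwa, hccp)
print(mismatch)  # {1: 1.0, 2: 1.5}
```

Pooling the HCCP subdomains over all levels reflects the point made above: a given HDWA subdomain is allowed to find its counterpart at any HCCP level.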
The mismatches for different hierarchical levels differ substantially between the test proteins. In LAOBP and calmodulin the mismatch of the first-level domains is very small, while the domains of levels 2-5 differ significantly between HCCP and HDWA. The mismatch decreases again for level 6. The same trend is observed for serum albumin. The mismatch of the first-level domains looks large (20 residues). However, this difference is actually not so dramatic, because the protein is large and each of the first-level domains consists of three pieces in terms of the sequence. The reason for this intriguing trend becomes evident after visual inspection of the subdomains identified by HCCP and HDWA. In HCCP, small regions around the hinge residues are typically cut off from the largest domains at the second level of the hierarchy. The bodies of the domains start to fragment into several subdomains at higher levels of the hierarchy. These subdomains rarely correlate with the flexible loops and other highly mobile regions of the protein because of the limitations of the underlying elastic network model. In contrast, HDWA subdivides the first-level domains according to the mobility of their structural elements in the course of MD. Flexible fluctuating loops are assigned to one subdomain of the second level, while the relatively rigid body of the domain is assigned to another subdomain (Figure). The same is true for subsequent levels of the hierarchy until the subdomains become small enough to cover a single element of secondary structure or an individual loop. Such basic structural elements are identified by both HCCP and HDWA (although at different hierarchical levels). Thus the mismatch decreases for high levels of the hierarchy.
Ribonuclease A is an exception among the studied proteins because it does not contain pronounced first-level domains. Thus the mismatch is largest for the first-level domains and decreases for higher hierarchical levels. In the case of HDWA, the globule is subdivided into flexible loops and a rigid core at the first level of the hierarchy. In the case of HCCP, the mobility of the loops is not detected and the first-level domains do not correlate with the domains identified by HDWA.
The HDWA technique has some limitations. It is slow in comparison to other techniques because of the computationally expensive exhaustive search performed for each domain subdivision. Typically, the run time for the test proteins used in this work is between 5 and 30 min on fast office workstations for ~10-20 trajectory frames. This time increases rapidly with the number of frames.
However, the MD simulations themselves are typically 3-4 orders of magnitude slower, so the performance of HDWA is not critical. Another disadvantage is the character of the domain subdivision. Each domain is subdivided into exactly two subdomains, which is not always the case in reality. However, as explained above, this is the only unbiased way of division (division into a larger number of subdomains raises the problem of «overfitting»). The post-processing of the domain tree partially eliminates this problem by ensuring that the flexibility of domains increases with the hierarchical level. After the post-processing some domains may possess more than two subdomains. HDWA can also be viewed as a powerful method for the analysis of MD simulations, which extracts information about the hierarchy of protein dynamics from the «mess of trajectories» of individual atoms. Our technique can be used in concert with essential dynamics and other well-established analysis techniques when information about the hierarchy of domain motions is required. Our method is expected to be especially useful for large complex proteins. The dynamics of such proteins is unlikely to be described adequately at a single level of hierarchy. HDWA reveals the whole hierarchy of motions present in the MD trajectories of such proteins.
Conclusion. The HDWA and HCCP methods of domain identification were tested on four proteins from different structural classes. It is shown that the number and the boundaries of large dynamic domains are consistent between the two techniques and correspond well to the results of the widely used DynDom technique. The hierarchy of dynamic domains in HDWA accounts for the presence of flexible loops and rigid regions, which is hard to achieve with other existing domain identification techniques. The domains found by HDWA may be considered the most realistic units of protein dynamics because they are identified using the data of atomistic MD simulations.

HDWA domains in LAOBP for hierarchical levels 1-3. The subdomains are colored black and white on each level. The parts of the protein that do not belong to the current domain are shown transparent. The domain indexes and the values of flexibility R are shown.

Table 2
Comparison of the domain boundaries obtained in HDWA, HCCP and DynDom techniques