Geometric filters for protein–ligand complexes based on phenomenological molecular models

Molecular docking is a widely used method of computer-aided drug design capable of accurate prediction of protein-ligand complex conformations. However, scoring functions used to estimate free energy of binding still lack accuracy. Aim. Development of computationally simple and rapid algorithms for ranking ligands based on docking results. Methods. Computational filters utilizing geometry of protein-ligand complex were designed. Efficiency of the filters was verified in a cross-docking study with QXP/Flo software using crystal structures of human serine proteases thrombin (F2) and factor Xa (F10) and two corresponding sets of known selective inhibitors. Results. Evaluation of filtering results in terms of ROC curves with varying filter threshold value has shown their efficiency. However, none of the filters outperformed QXP/Flo built-in scoring function Pi . Nevertheless, usage of the filters with optimized set of thresholds in combination with P i achieved significant improvement in performance of ligand selection when compared to usage of P i alone. Conclusions. The proposed geometric filters can be used as a complementary to traditional scoring functions in order to optimize ligand search performance and decrease usage of computational and human resources.

Introduction.Nowadays, computer-aided drug design is a widely used technique.It is mostly based on molecular docking and scoring approach [1].Docking is the procedure of protein (target) and small molecule (ligand) complexes geometry optimization aimed at finding the global energy minimum of the system.According to thermodynamics, the most likely configuration of the complex corresponds to the Gibbs free energy minimum.Energy is usually estimated using certain force field model and its minimization is performed by various methods.Typically, a large collection of small molecules is docked against the protein active site.For each optimized complex, different characteristics (scores) are calculated to estimate binding free energy.Ligands with the highest scores are filtered for further testing in biophysical, biochemical and/or cell-based screening assays.
Although there are many publicly available and commercial tools for molecular docking and filters for scoring, some problems still exist.While docking usually can provide adequate results for optimized geometry prediction, scoring is a tricky thing and requires human intrusion like visual inspection of three-dimensional molecular complex structures by a drug discovery expert [2].
As the mechanisms of intermolecular interaction in a protein-ligand complex exceed the classical mechanics limits, accurate prediction of binding energy needs quantum mechanical calculations, which boost requirements for memory size and floating point calculations speed by orders of magnitude.Furthermore, flexibility of both protein and ligand molecules causes an increase of freedom degrees of a typical system up to hundreds or thousands for simulations with implicit solvent models and even tens of thousands in case of explicit solvent.
As a result of these complications, precise virtual screening of large collections of small molecules becomes practically impossible, which forced us to use simplified models with empirical scoring algorithms.
In this paper, we introduce geometric filters, which are designed to select protein-ligand complexes from the database of molecular docking results.The filters use the molecular geometry of protein-ligand complex as a main filtering characteristic as opposed to approximated potentials of inter-atomic interaction or other loosely defined and computationally expensive functions.
The main idea behind this approach is based on the fact that molecular docking can predict the molecular geometry stationary point rather accurately, which has been proved by numerous X-ray structural analysis experiments [3].All mentioned above makes the proposed filters robust and quick for interactive usage.
Materials and methods.In this study, four types of geometry-based filters were introduced: nearest atom filter (NA), center of mass filter (COM), out coefficient filter (OUT) and hydrogen bond filter (HB).Description of the filters is provided below.
Nearest atom filter finds atom of the ligand that is the nearest to the given atom of protein in the current complex.Ligand passes the filter if this distance is less than the specified value: min , where l is ligand atom index, p is the given protein atom index, r n ® is position of the n-th atom, R min is the specified minimal distance.This filter has complexity of O(n) and can select ligands that are partially located close to the given atom of the protein active site and evidently screens it from solvent.Filtering results may be modified by considering only ligand atoms of certain type.
Center of mass filter finds the distance from ligand center of mass to the given protein atom.Ligand passes the filter if this distance is less than the specified value: where m n is mass of the n-th atom.This filter can select ligands that are located close to the specified atom of the protein active site and evidently screen it.
Out coefficient filter calculates numerical characteristic which approximates the probability of destroying the given protein-ligand complex.The following model is used.The complex will be destroyed if ligand binds to some external molecule and bonds with protein are destroyed.The probability of this event, P, may be approximated as where N e is average number of ligand atoms that may bind to the external molecule and N l is total number of ligand atoms.The less P, the more stable the complex .
Number N e is estimated as where k is ligand atom index and p k e is the probability that k-th ligand atom will bind to the external molecule.Probability p k e depends on the number of protein atoms that bind to the k-th ligand atom and shield it from outside.Probability of shielding may be described by the Markov field model with the Gibbs distribution: This filter requires O(n 2 ) operations.To simplify the case, R kp is defined to be the same for all atom pairs and equal to 1.1 C. In this case, the exact value influences only the value of filtering function but not its behavior.
Hydrogen bonds filter calculates estimated number of hydrogen bonds between ligand and protein atoms.Each hydrogen bond is characterized by strength coefficient that may be estimated as To optimize the filters parameters and verify their effectiveness, we conducted a cross-docking study.Human serine proteases thrombin (gene F2) and factor Xa (gene F10) were selected as targets.X-ray crystal structures of the protein catalytic sites were retrieved from RCSB Protein Data Bank [4], entries 1oyt and 1f0s respectively.Two sets of selective small molecule inhibitors containing 244 compounds for thrombin and 331 compounds for factor Xa were retrieved from MDDR database [5].After generation of stereoisomers and ionization using LigPrep software [6], compounds were docked into the three-dimensional protein active site structure using QXP/Flo software [7] with 100 steps of SDOCK+ routine.10 lowest energy complex structures were selected for each compound structure, which resulted in a total of 10,580 and 7,410 complexes for thrombin and factor Xa inhibitor sets respectively.Filters were applied to the complexes, and their perfor-mance was evaluated in terms of receiver operating characteristics (ROC).
Description of all filters is provided in Table 1.Arbitrary atoms of thrombin active site which were selected for the nearest atom filters and center of mass filter are shown in Fig 1.
Results and discussion.To evaluate efficiency of the filters, we conducted a virtual screening study where two sets of small-molecule inhibitors of two serine proteases were docked against two sets of their selective inhibitors.This resulted in a set of protein-ligand complexes with both «native» and «wrong» inhibitors.For each filter, a receiver operating characteristic (ROC) was built.For each of the two proteins, the compounds which have at least one protein-ligand complex passed through a filter were considered positives.Out of positives, essentially, the compounds from one protein's inhibitor set were considered true positives (TP), while compounds from another protein's inhibitor set were considered false positives (FP).Additionally, a ROC was built for the docking software QXP/Flo+ built-in scoring function P i (Fig. 2).For the two protein crystal structures, we compare only filters which are independent of arbitrary protein atom selection: OUT, HB and P i .
It is clear from the ROCs, that the filters can be used efficiently for selection of inhibitors, except the hydrogen bond filter in case of factor Xa.However, the QXP/ Flo built-in scoring function outperforms any of the proposed filters.
Next, we focused on selection of thrombin inhibitors with introduction of atom-specific filters NA and COM.ROCs for all filters for this case are provided in Fig. 3.As mentioned in Materials and methods, the number of compounds in the two sets is different.Furthermore, as 10 complexes were generated for every stereoisomer and every possible ionization state, different compounds also have different number of complexes in the docking output.As a result, the number of thrombin inhibitor complexes (true positives) is about 25 % higher than that of factor Xa inhibitors (false positives).
To investigate an impact of this inequality, in addition to obvious compound-based random guess ROC (TPR = FPR), a complex-based random guess ROC curve was built (Fig. 3).At this point, all complexes had equ- al probability to pass a «random guess filter», and this probability was considered the filter parameter, varying along the ROC curve.
As one can see from the Fig. 3, all ROC curves are located above the compound-based random guess line, which proves the filters efficiency.Furthermore, all of them are located above complex-based random guess curve, except those for the nearest atom filters NA182 and NA63, which are only effective in certain ranges.However, none of the filters outperformed the built-in scoring function P i .DesPi te that, use of Pi alone would not be a proper choice.Really, let us consider that in a tyPi cal docking setup, we screen a set of about 50,000 compounds to obtain a docking library of no more than 5,000 compounds, which are going to be tested experimentally.As we do not expect more than a few percent of true binders in the initial set, size of the docking library can be estimated as size of initial set multiplied by false positive ratio (FPR), which in this case should be 10 % at maximum.As one can see from the ROCs, even the best-performing at this FPR filters, P i and HB, give only about 40 % of true positives, which is generally not acceptable as it means loss of more than half of potentially active compounds at the very first stage of drug development.To address this issue, we carried out multiple filtering, in which all protein-ligand complexes were sequentially conducted through all 7 filters, including the built-in scoring, thus applying logical conjunction to the filter conditions.In this computation, both TPR and FPR are the functions of 7 variables, which are filter cut-off values.The values of TPR and FPR were sampled in a broad range of filter parameters to optimize filtering performance.The resulting ROC data points are plotted in Fig. 3, and filter cut-off values for them are provided in Table 2.

Conclusions.
The proposed geometric filters for protein-ligand complexes have shown their efficiency for selection of specific inhibitors in a cross-docking study for serine proteases thrombin and factor Xa.However, their efficiency in terms of receiver operating cha-racteristics is lower than that of QXP/Flo+ native sco-ring function.Nevertheless, the filters can significantly improve virtual screening performance when used in combination with the scoring function.When compa-red to usage of the scoring function alone, target-spe-cific tuning of filtering parameters achieved an incre-ase of TPR from 40 % to 80 % at 10 % FPR, and from 30 % to 65 % at 5 % FPR.
Acknowledgements.Software development and computations were performed using Ukrainian National Grid Infrastructure and computing cluster of Taras Shevchenko National University of Kyiv [8,9].
is average number of protein atoms which shield the k-th ligand atom.Number n k may be estimated in the same way: k is the probability that p-th protein atom binds to the k-th ligand atom.Probability b p k in turn also may be described by Markov field model: is characteristic length of chemical bond between k-th ligand atom and p-th protein atom.The final expression for P is thus

1 ® 2 ®
is position of the 1-st acceptor, r A is position of the second acceptor, r H ® is position of hydrogen atom, R H is characteristic length of hydrogen bond and j is the bond angle.