Archive for the ‘Tutorial’ Category

Statistical molecular design

Sunday, March 2nd, 2008

The use of an external independent validation set, which has been collected independently of the training set, is widely regarded as the best way to assess the quality of a QSAR/qSAR model (Wold et al. 1995). However, it is usually difficult to find additional sources of data to construct an independent validation set and thus the typical method is to split the original dataset into two different sets, a training set for developing the QSAR/qSAR model and a validation set for evaluating the model performance (Gramatica et al. 2004). The training set should contain compounds of diverse structures that can adequately represent all of the compounds that possess a particular activity  (Rajer-Kanduc et al. 2003; Schultz et al. 2003). The validation set also needs to be sufficiently diverse and representative of the compounds studied in order to accurately assess the accuracies of the QSAR/qSAR models (Rajer-Kanduc et al. 2003; Schultz et al. 2003).

There are a number of approaches for creating diverse training sets and representative validation sets from the datasets, which are given in Table 1. These include random selection, cluster-based methods, dissimilarity-based methods, cell-based methods, stochastic techniques, statistical experimental designs and neural networks (Daszykowski et al. 2002; Leach et al. 2003). Studies have shown that dissimilarity-based methods, such as Kennard and Stone algorithm and removal-until-done algorithm, are more effective than other algorithms in selecting diverse training sets and representative validation sets for developing and validating QSAR/qSAR models (Snarey et al. 1997; Rajer-Kanduc et al. 2003).

Table 1: Methods for selecting training and validation sets

Cluster-based methods
Hierarchical Non-hierarchical
Single linkage (Leach et al. 2003)
Complete linkage (Leach et al. 2003)
Group average (Leach et al. 2003)
Wards method (Leach et al. 2003)
Centroid method (Leach et al. 2003)
Median method (Leach et al. 2003)
K-means (Forgy 1965)
Jarvis-Patrick clustering (Jarvis et al. 1973)
DBSCAN (Ester et al. 1996)
OPTICS (Ankrest et al. 1999)
DENCLUE (Han et al. 2001)
Dissimilarity-based methods
MaxSum (Snarey et al. 1997)
Kennard and Stone algorithm (Kennard et al. 1969)
Removal-until-done (Hobohm et al. 1992)
Sphere exclusion (Hudson et al. 1996)
OptiSim (Clark 1997)
IcePick (Mount et al. 1999)
Minimum spanning tree error function (Waldman et al. 2000)
Cell-based methods
Cummins algorithm (Cummins et al. 1996)
Menard algorithm (Menard et al. 1998)
Uniform cell coverage (Lam et al. 2002)
Stochastic techniques
Techniques using Monte Carlo sampling (Agrafiotis 1996; Hassan et al. 1996)
Techniques using genetic algorithms (Sheridan et al. 2000; Gillet et al. 2002)
Statistical experimental designs
D-optimal design (Mitchell 1974)
Factorial design (Box et al. 1978)
Others
Random selection
Kohonen’s self-organizing map
Informative design (Miller et al. 2002)

References

  • Agrafiotis DK (1996). Stochastic algorithms for maximizing molecular diversity. 3rd Electronic Computational Chemistry Conference.
  • Ankrest M, Breunig M, Kriegel H and Sander J (1999). OPTICS: Ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD International Conference on Management of Data: 49-60.
  • Box GEP, Hunter WG and Hunter JS (1978). Statistics for experimenters: An introduction to design, data analysis, and model building. New York, Wiley.
  • Clark RD (1997). OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. Journal of Chemical Information and Computer Sciences 37(6): 1181-1188.
  • Cummins DJ, Andrews CW, Bentley JA and Cory M (1996). Molecular diversity in chemical databases: Comparison of medicinal chemistry knowledge bases and databases of commerically available compounds. Journal of Chemical Information and Computer Sciences 36(4): 750-763.
  • Daszykowski M, Walczak B and Massart DL (2002). Representative subset selection. Analytica Chimica Acta 468(1): 91-103.
  • Ester M, Kriegel HP, Sander J and Xu X (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining: 226-231.
  • Forgy E (1965). Cluster analysis of multivariate data: Efficiency vs interpretability of classifications. Biometrics 21: 768-780.
  • Gillet VJ, Willett P, Fleming PJ and Green DVS (2002). Designing focused libraries using MoSELECT. Journal of Molecular Graphics and Modelling 20(6): 491-498.
  • Gramatica P, Pilutti P and Papa E (2004). Validated QSAR prediction of OH tropospheric degradation of VOCs: Splitting into training-test sets and consensus modeling. Journal of Chemical Information and Computer Sciences 44(5): 1794-1802.
  • Han JW and Kamber M (2001). Data mining : concepts and techniques. San Francisco, Morgan Kaufmann Publishers.
  • Hassan M, Bielawski JP, Hempel JC and Waldman M (1996). Optimization and visualization of molecular diversity of combinatorial libraries. Molecular Diversity 2(1-2): 64-74.
  • Hobohm U, Scharf M, Schneider R and Sander C (1992). Selection of representative protein data sets. Protein Science 1(3): 409-417.
  • Hudson BD, Hyde RM, Rahr E, Wood J and Osman J (1996). Parameter based methods for compound selection from chemical databases. Quantitative Structure-Activity Relationships 15: 285-289.
  • Jarvis RA and Patrick EA (1973). Clustering using a similarity measure based on shared near neighbours. IEEE Transactions in Computers C-22: 1025-1034.
  • Kennard RW and Stone L (1969). Computer aided design of experiments. Technometrics 11: 137-148.
  • Lam RLH, Welch WJ and Young SS (2002). Uniform coverage designs for molecule selection. Technometrics 44(2): 99-109.
  • Leach AR and Gillet VJ (2003). Selecting diverse sets of compounds. An introduction to chemoinformatics. Boston, Kluwer Academic Publisher: 123-145.
  • Menard PR, Mason JS, Morize I and Bauerschmidt S (1998). Chemical space metrics in diversity analysis, library design, and compound selection. Journal of Chemical Information and Computer Sciences 38(6): 1204-1213.
  • Miller JL, Bradley EK and Teig SL (2002). Luddite: An information-theoretic library design tool. Journal of Chemical Information and Computer Sciences 43(1): 47-54.
  • Mitchell TJ (1974). An algorithm for the construction of “D-optimal” experimental designs. Technometrics 16: 203-210.
  • Mount J, Ruppert J, Welch W and Jain AN (1999). IcePick: flexible surface-based system for molecular diversity. Journal of Medicinal Chemistry 42(1): 60-66.
  • Rajer-Kanduc K and Zupan JM, N. (2003). Separation of data on the training and test set for modelling: a case study for modelling of five colour properties of a white pigment. Chemometrics and Intelligent Laboratory Systems 65(2): 221-229.
  • Schultz TW, Netzeva TI and Cronin MTD (2003). Selection of data sets for QSARs: analyses of Tetrahymena toxicity from aromatic compounds. SAR and QSAR in Environmental Research 14(1): 59-81.
  • Sheridan RP, SanFeliciano SG and Kearsley SK (2000). Designing targeted libraries with genetic algorithms. Journal of Molecular Graphics and Modelling 18(4-5): 320-334.
  • Snarey M, Terrett NK, Willett P and Wilton DJ (1997). Comparison of algorithms for dissimilarity-based compound selection. Journal of Molecular Graphics and Modelling 15(6): 372-385.
  • Waldman M, Li H and Hassan M (2000). Novel algorithms for the optimization of molecular diversity of combinatorial libraries. Journal of Molecular Graphics and Modelling 18(4-5): 412-426.
  • Wold S and Eriksson L (1995). Statistical validation of QSAR results. Chemometric methods in molecular design. van de Waterbeemd H. Weinheim; New York; Basel; Cambridge; Tokyo, VCH: 309-318.

Molecular descriptors - Selection

Friday, February 29th, 2008

The purpose of descriptor selection is to remove descriptors irrelevant or negligible to the activity of the compounds, so as to improve computation speed, performance and interpretability of predictive models. Irrelevant and redundant descriptors are removed either by using a filter or a wrapper approach or a combination of these approaches. The filter approach is independent of the in silico method and is frequently used to remove redundant descriptors or descriptors of low information content. Descriptors are chosen or removed based on one or more of the following considerations: prior knowledge of factors affecting a particular activity, the properties of the descriptors (e.g. variance), the correlation between different descriptors, and the distribution of the descriptor values in different data classes. In the wrapper approach, a descriptor selection algorithm is incorporated into an in silico classification method (Guyon et al. 2003).

In many cases, it is difficult to uniquely select an optimum set of descriptors due to the high redundancy and overlapping of many descriptors (Gramatica et al. 2004). Separate sets of descriptors containing different members of redundant descriptor classes have been found to give similar prediction accuracies (Izrailev et al. 2004). The interpretation of the prediction results in these cases should be more appropriately conducted at the descriptor class level where redundant and overlapping descriptors are grouped into one class. Table 1 gives a list of the common descriptor selection methods used in QSAR/qSAR studies.

Table 1: Common descriptor selection methods used in QSAR studies

Filter methods Wrapper methods
Remove descriptors with low variance
Remove highly correlated descriptors
CORCHOP (Livingstone et al. 1989)
Decision tree (Cardie 1993)
FOCUS (Almuallim et al. 1994)
LVF (Brassard et al. 1996)
RELIEF (Kononenko 1994)
Discrimination scores (Guyon et al. 2002)
Information gain (Liu 2004)
Mutual information (Liu 2004)
chi-square-test (Liu 2004)
Odds ratio (Liu 2004)
GSS coefficient (Liu 2004)
Forward selection(Xu et al. 2001)
Backward elimination (Xu et al. 2001)
Stepwise regression (Xu et al. 2001)
Branch and bound (Narendra et al. 1977)
Floating search (Pudil et al. 1994)
Adaptive floating search (Somol et al. 1999)
Oscillating search (Somol et al. 2000)
Tabu search (Glover 1989)
Simulated annealing (Sutter et al. 1993)
Genetic algorithm (Siedlecki et al. 1989)
Recursive feature elimination (Guyon et al. 2002)

References

  • Almuallim H and Dietterich TG (1994). Learning Boolean concepts in the presence of many irrelevant features. Artificial Intelligence 69: 279-306.
  • Brassard G and al. e (1996). Fundamentals of algorithms. New Jersey, Prentice Hall.
  • Cardie C (1993). Using decision trees to improve case-based learning. Proceedings 10th International Conference on Machine Learning. Los Altos, Morgan Kaufmann: 25-32.
  • Glover F (1989). Tabu search - Part I. ORSA Journal on Computing 1: 190-206.
  • Guyon I, Weston J, Barnhill S and Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3): 389-422.
  • Guyon I and Elisseeff A (2003). An introduction to variable and feature selection. Journal of Machine Learning Research 3: 1157-1182.
  • Gramatica P, Pilutti P and Papa E (2004). Validated QSAR prediction of OH tropospheric degradation of VOCs: Splitting into training-test sets and consensus modeling. Journal of Chemical Information and Computer Sciences 44(5): 1794-1802.
  • Izrailev S and Agrafiotis DK (2004). A method for quantifying and visualizing the diversity of QSAR models. Journal of Molecular Graphics and Modelling 22(4): 275-284.
  • Kononenko I (1994). Estimating attributes: analysis and extensions of RELIEF. Machine Learning: ECML-94. European Conference on Machine Learning. Proceedings.
  • Liu Y (2004). A comparative study on feature selection methods for drug discovery. Journal of Chemical Information and Computer Sciences 44(5): 1823-1828.
  • Livingstone DJ and Rahr E (1989). Corchop - An interactive routine for the dimension reduction of large QSAR data sets. Quantitative Structure-Activity Relationships 8: 103-108.
  • Narendra PM and Fukunaga K (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers 26: 917-922.
  • Pudil P, Novoviová J and Kittler J (1994). Floating search methods in feature selection. Pattern Recognition Letters 15(11): 1119-1125.
  • Siedlecki W and Sklansky J (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters 10: 335-347.
  • Somol P, Pudila P, Novoviová J and Paclí P (1999). Adaptive floating search methods in feature selection. Pattern Recognition Letters 20(11-13): 1157-1163.
  • Somol P and Pudil P (2000). Oscillating search algorithms for feature selection. Proceedings of the 15th International Conference on Pattern Recognition. Barcelona. 2: 406-409.
  • Sutter JM and H. KJ (1993). Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection. Microchemical Journal 47(1-2): 60-66.
  • Xu L and Zhang WJ (2001). Comparison of different methods for variable selection. Analytica Chimica Acta 446: 475-481.

Molecular descriptors - Scaling

Tuesday, February 26th, 2008

Molecular descriptors are usually scaled before they are used for QSAR/qSAR modeling. This is to ensure that all descriptors have equal potential to affect the QSAR/qSAR model (Livingstone 1995). There are four main types of descriptor scaling: autoscaling (Livingstone 1995), range scaling (Livingstone 1995), feature weighting (Livingstone 1995) and Pareto scaling (Eriksson et al. 2001). Autoscaling and range scaling are the two most common types of descriptor scaling methods used in QSAR/qSAR modeling.

Autoscaling
In autoscaling, the mean is subtracted from the descriptor values and the resultant values are divided by the standard deviation:
autoscale.jpg
where X’ij is the new scaled value for descriptor j of compound i and Xj and sigmaj are the mean and standard deviation of descriptor j respectively. The autoscaled descriptors have a mean of zero and a standard deviation of one. The advantage of autoscaling is that it is less susceptible to effects of compounds with extreme values because they are mean centred. In addition, variance of one is useful in variance-related methods since they each contribute one unit of variance to the overall variance of a dataset.

Range scaling (Normalization)
In range scaling, the minimum value of the descriptor is subtracted from the descriptor values and the resultant values are divided by the range:
rangescale.jpg
where Xj,min and Xj,max are the minimum and maximum value of descriptor j respectively. The range-scaled descriptors have a minimum and maximum value of -1 and 1 respectively. Range scaling can be carried out over any preferred range by multiplication of the range-scaled values by a factor. The disadvantage of range scaling is that it is dependent on the minimum and maximum values of the descriptors, which makes it very sensitive to outliers.

References

Molecular descriptors - Introduction

Sunday, February 24th, 2008

A molecular descriptor is “the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a compound into an useful number or the result of some standardized experiment” (Todeschini et al. 2000). There are currently over 3,700 types of descriptors, which are classified into three broad categories: 1-, 2- and 3-D descriptors that encode chemical composition, topology, and 3D shape and functionality respectively (Todeschini et al. 2000; Farnum et al. 2003). A descriptor can be simple, like molecular volume, which encode only one feature of a compound, or can be complex, like 3D-MoRSE, which encode multiple physicochemical and structural properties of a compound. Several computer programs have been developed for deriving molecular descriptors of a compound. Examples of the most popularly used and internet accessible programs are DRAGON (Todeschini et al. 2005), Molconn-Z (Hall et al. 2005), JOELib (Wegner 2005), and MODEL (Li et al. 2007).

References

  • Farnum M, DesJarlais R and Agrafiotis DK (2003). Molecular diversity. Handbook of chemoinformatics : From data to knowledge. Gasteiger J. Chichester, Wiley-VCH. 4: 1640-1686
  • Hall LH, Kellogg GE and Haney DN (2002). Molconn-Z. eduSoft, LC.
  • Li ZR, Han LY, Xue Y, Yap CW, Li H, Jiang L and Chen YZ* (2007). MODEL - Molecular Descriptor Lab: A web-based server for computing structural and physicochemical features of compounds. Biotechnology and Bioengineering 97 (2): 389-396.
  • Todeschini R and Consonni V (2000). Handbook of molecular descriptors. Weinheim, Wiley-VCH.
  • Todeschini R, Consonni V, Mauri A and Pavan M (2005). DRAGON. Talete SRL: Milan.
  • Wegner JK (2005). JOELib/JOELib2.

Quantitative Structure Activity Relationship (QSAR)

Thursday, February 21st, 2008

A QSAR/qSAR model is a mathematical model which approximates the relationship between the pharmacological activity of a compound and its structure-derived physicochemical and structural features. The two main objectives of QSAR/qSAR modeling are to allow prediction of the pharmacological activity of a not yet biologically tested, but chemically characterized compound and to extract clues as to which molecular characteristics of a compound are important for the pharmacological activity. The term “QSAR model” is used to refer to quantitative models (regression problems), and the term “qSAR model” will be used to refer to qualitative models (classification problems). There are variants of QSAR like QSPR (for chemical properties), QSPkR (for pharmacokinetic properties), QSTR (for toxicological properties).

The initial step is the collection of relevant pharmacological activity data and the elimination of low quality data that are likely to affect the quality of the model. The next step is the selection of representative compounds into a training set and a validation set to calibrate and evaluate the QSAR/qSAR model respectively. Molecular descriptors are then computed for representing the physicochemical and structural properties of the compounds studied and those that are redundant or contained little information are removed prior to the modeling process. A machine learning method or statistical method is then used to develop a model that relates the pharmacological activity to the physicochemical and structural properties of the compounds. During a modeling process, optimization of the essential parameters of the machine learning methods or statistical methods and the selection of relevant descriptor subsets are conducted simultaneously. The optimum set of parameters and descriptor subset are used to construct a final QSAR/qSAR model, which is subsequently subjected to various validation methods to ensure that it is valid and useful.

Docking practical/tutorial

Saturday, February 16th, 2008

Other than the CoMFA practical, I have also prepared a docking practical. All the files needed for the practical (except Sybyl 7.2) can be downloaded from here.

In this practical, we will be looking at two different methods for ligand-protein docking: DOCK and FlexiDock.

Protein

  1. Load a protein (HIV-1 protease).
    • File >>> Read.
    • Select protein.pdband press OK.
    • Select YES in the Center the molecule dialog and press OK.
  2. Name the protein.
    • Build/Edit >>> Name Molecule.
    • Type HIV_protease for the new molecule name (press OK).
  3. Add hydrogens to the protein.
    • Biopolymer >>> Prepare Structure >>> Add Hydrogens.
    • Press OK.
  4. Compute charges for this protein.
    • Biopolymer >>> Prepare Structure >>> Load Charges.
    • Press OK.
    • Press Yes.
  5. Save the protein
    • File >>> Save As.
    • Type hiv_protease for the new molecule name (press Save).
  6. Zoom out to view the entire protein.
  7. Find secondary structures in protein.
    • Biopolymer >>> Conformation >>> Find Secondary Structure.
    • Use the Kabsch-Sander method.
    • Select Render Conformations and press Find.
    • Press Close.
  8. Construct a three-dimensional shaded ribbon for the protein.
    • View >>> Biopolymer Display >>> Shaded Ribbon.
    • Press All and press OK.
    • Select WHITE as the color and press OK
  9. Highlight catalytic triad ASP25, THR26 and GLY27 by spacefill representation.
    • View >>> Mixed Rendering.
    • Press Substructures and select ASP25, THR26 and GLY27 from both chain A and B and press OK.
    • Press OK.
    • Select SpaceFill for Representation and press OK.
  10. Rotate, translate, zoom in and out to have a good overview of the secondary and tertiary structure of the protein.
    • Question 1: Where is the active site of the protein?
  11. Delete the protein.
    • Build/Edit >>> Zap (Delete) Molecule.

 

Ligand

  1. Load a drug (A77800).
    • File >>> Read.
    • Select ligand.pdb and press OK.
    • Select YES in the Center the molecule dialog and press OK.
  2. Name the ligand.
    • Build/Edit >>> Name Molecule.
    • Type A77800 for the new molecule name (press OK).
  3. Add hydrogens to this drug.
    • Build/Edit>>> Add >>> Hydrogens.
  4. Compute charges for this drug.
    • Compute >>> Charges >>> Gasteiger-Huckel.
    • Press No when asked if you want to change formal charges before computing charges.
  5. Conduct molecular mechanics computation to find lowest potential energy conformation for this drug.
    • Specify the force field and type of charges to use.
      • Compute >>> Minimize.
      • In the Minimize dialog, press Modify.
      • In the Energy dialog, select Tripos from the Force Field option menu.
      • Select Gasteiger-Huckel from the Charges option menu.
      • Press OK.
    • Set the minimization parameters.
      • In the Minimize dialog, press Minimize Details.
      • In the Minimize Details dialog, increase the maximum number of iterations (Max.Iterations) to 10000.
      • Decrease the Gradient to 0.005.
        The Gradient value is a termination criterion. If the gradient difference between two consecutive iterations goes below this value, the calculation stops.
      • Set the color option to Force.
        This color codes atoms according to the total force on each atom, as the minimization proceeds.
      • Press OK.
    • Submit the job to run interactively.
      • In the Minimize dialog, type A77800 as the job name.
      • Press OK.
        As the minimization proceeds, changes to the structure are interactively displayed. Information about each iteration is also printed to the textport.
      • Question 2: What is the total energy of the minimized structure?
  6. Save the drug
    • File >>> Save As.
    • Type A77800 for the new molecule name (press Save).
  7. Create molecular surface of this drug.
    • Create a molecular surface.
      • View >>> MOLCAD Surfaces >>> Molecular Surfaces
      • Press Create in the MOLCAD Surfaces dialog.
    • Select type of molecular surface.
      • In the Create MOLCAD Surface dialog, select Connolly from the Type menu.
      • Press OK.
      • Press Done.
  8. Rotate, translate, zoom in and out to have a good overview of the conformation of the ligand.
  9. Delete the drug.
    • Build/Edit >>> Zap (Delete) Molecule.

 

DOCK

SYBYL’s docking functionality provides a real time approximation of the intermolecular energy of interaction between a pair of molecules (in kcals/mol), a useful tool for interactively identifying possible binding conformations. One molecule (stationary, by convention) is called the site, and the other is the ligand. Interactive output includes the total energy, the magnitude and direction of the overall force (for ligand atoms strongly interacting with the site), and the one site atom which is interacting most strongly.

  1. Load protein (hiv_protease.mol2) and drug (A77800.mol2). It is recommended that protein be loaded first.
  2. Position the ligand into the active site of the protein (you may wish to turn on shaded ribbon for the protein and spacefill for ligand).
  3. Prepare DOCK parameters.
    • Options >>> Tailor.
    • In the Tailor dialog, select DOCK from the Subject menu.
    • Increase the maximum number of lattice points (MAX_LATT_PTS) to 800000.
    • Press Apply and then press Close.
  4. Begin docking.
    • Compute >>> Dock.
    • Select M2:A77800 as the ligand molecule area (press OK).
    • Select M1:HIV_protease as the site molecule area (press OK).
      Three new items appear on the screen:
      1. An Energy Gadget dialog, showing the total energy of interaction, plus its steric and electrostatic components, for the currently displayed configuration using the Tripos force field.
      2. A box surrounding the ligand molecule. As you move the ligand, the energy of interaction is computed only for atoms within that box.
      3. A prompt at the command line for the next action SYBYL is to take.
        • press ? to show other actions that can be taken from within the docking process.
        • If you wish to stop DOCK at any time, press to exit the DOCK command.
  5. Move the ligand.
    • Slowly move the ligand a few Å in any direction.
      As an atom of the ligand gets too close to a site atom, the energy of interaction suddenly increases, and new yellow and red lines are displayed.
      • Yellow lines connect ligand atoms to the nearest bumping site atom,
      • Red lines show the direction that a ligand atom should be moved to reduce the repulsive energy.
  6. Minimize docking energy
    • With the textport active, type REINITIALIZE (press the return key).
    • Type MINIMIZE_DOCK (press the return key).
    • Select SITE as the molecule to have fixed geometries (press OK).
    • Wait for minimization to finish running.
    • Question 3: What is your final docking energy?
  7. End the docking operation.
    • When finished docking, exit back to the main menu by pressing in the textport.
  8. Warning: If you exit docking mode, then re-enter it, you may get an error message regarding a lack of memory. This is due to a memory allocation problem that cannot be avoided. The only work-around is to exit SYBYL and restart. Freeze the current view and save it to a database if you want to be able to restart exactly where you left off.

 

FlexiDock

Genetic algorithm-based Flexible Docking provides a means of docking ligands into protein active sites. FlexiDock works in torsional space, keeping bond lengths and angles constant. As large vdW interactions can only relax via bond rotation(s), optimization cannot alter chiral centers and bond stereochemistry. FlexiDock works on a protein/ligand pair. The protein backbone atoms are fixed in space, but the ligand is mobile (rotation/translation can be applied). Both the protein (sidechains only) and the ligand can contain a number of flexible bonds. However, to speed up calculations, FlexiDock considers only non-ring single and amide bonds as rotatable.

FlexiDock uses a genetic algorithm to determine the optimum ligand geometry. Genetic algorithms are relatively robust global optimizers, with performance requirements which scale well with increasing system size. The fitness function uses a subset of the Tripos force field: the van der Waals, electrostatic, torsional and constraint energy terms, and calculates the energy of the important atoms in the supermolecule.

  1. Set Up FlexiDock Structures
    • With the ligand docked inside the protein’s active site,
      • Compute >>> FlexiDock >>> Create an Input File.
      • In the Molecule Area dialog, select M1:HIV_protease, then press OK.
        The Set Up FlexiDock Structures, Protein Display Options and No Pocket Defined dialogs are then posted.
      • At the No Pocket Defined dialog, press OK.
    • Protein Display Options dialog.
      • Define and visualize the protein binding pocket.
      • Press Define Pocket
      • Press Substructures and select ASP25, THR26 and GLY27 from both chain A and B and press OK.
      • Increase the radius (Radius to Show Around Pocket) to 10.
    • Set Up FlexiDock Structures dialog.
      • The following seven items in this dialog correspond to the seven steps generally necessary to prepare a protein file from a structural database for submission to FlexiDock, and should be executed in the order suggested. To enforce this the check boxes on the left are organized as a cascade:
        • All are grayed out initially.
        • The first check box becomes accessible only after you have specified the location of the protein and ligand molecules, and both are in separate work areas.
        • Toggling on the first check box activates the next line, and so on.
        • Toggling off any check box turns all the lines below it off and disables them.
      • Five of the check boxes have associated action buttons to Remove the water molecules, Add the hydrogens, Add the atomic charges, Choose the rotatable bonds, and Choose H-bond sites. Each Choose button launches a subsidiary dialog for carrying out the corresponding operation.
        • To perform the desired operation, simply press the button. When the operation is completed, the check box is automatically toggled on and the next line becomes active.
        • If an operation is not necessary, simply toggle on the corresponding check box so you can proceed to the next operation.
      • Two of the check boxes have associated Hint buttons to help you perform the corresponding operations manually: checking the atom types and positioning the ligand in the cavity.
        • The two Hint buttons differ from the four action buttons. Pressing them does not toggle on the associated check box, nor does it activate the next lines.
      • For Water has been removed, press Remove to delete all the water molecules.
      • Check box for Atom types have been checked.
      • For Hydrogens have been added, press Add to fill all valences with hydrogens.
      • For Charges are in place, press Add to compute all atomic charges.
      • For Rotatable bonds are set, press Choose.
        • In the Pick Rotatable Bonds dialog, press All and press OK.
      • For H-bond sites are marked, press Choose.
        • In the Defining H-Bond Sites dialog, press Add, then All and OK for every item.
      • Check box for Ligand is pre-positioned in cavity.
      • Type hiv for file name and press Write Out FlexiDock Input File.
  2. Run FlexiDock.
    • Press FlexiDock It.
      In three successive dialogs you will be prompted to specify:
      • The template parameter file to use. By default, this file is $TA_MOLTABLES/FlexiDock.par.
      • The random number seed to start from.
      • The maximum number for generations to allow. The default is 3000.
    • Accept all default values for the three dialogs.
    • Wait for FlexiDock to finish running.
    • Close Flexidock window by pressing Quit.
    • Delete both protein and drug.
  3. View docking results.
    • Open spreadsheet.
      • File >>> Molecular Spreadsheet >>> New.
      • Select Database and press OK.
      • Select hiv.mdb and press Open.
    • Remove unnecessary information.
      • Select the first row.
      • MSS: Edit >>> Delete.
    • Import list of docking energies.
      • MSS: File >>> Import.
      • Select Custom in the Format option menu of the Import dialog.
      • Press the [] button to its right to bring up the Custom Format dialog.
      • Select Space in the Delimiter option menu and toggle both Labels check boxes on. Enter * and NEW, respectively, in the row and column fields. This means that when the data file is imported its rows will match the existing rows and new columns will be created.
      • Press OK to exit the Custom Format dialog.
      • Enter hiv.mdb/hiv.txt in the File field.
      • Press Import in the Import dialog.
  4. View docked structure.
    • File >>> Database >>> Open.
    • Select hiv.mdb and press Open.
    • Select READONLY as the access mode (press OK).
    • File >>> Database >>> Get Molecule.
    • Select HIV0001 (or select the molecule with the lowest energy).
    • Press OK.
    • Question 4: Is there any difference between the docked conformation and the initial conformation?
    • Question 5: Is there any difference between this docked conformation and the docked conformation from DOCK?
    • Question 6: Is it valid to compare the energies of the docked conformations from DOCK and FlexiDock? Why?
  5. Delete all molecules.
    • Build/Edit >>> Zap (Delete) Molecule.
    • Press All and press OK.

 

Answer to Questions

Question 1: Where is the active site of the protein?
Answer:
protein_active_site.jpg

Question 2: What is the total energy of the minimized structure?
Answer: 36.161 kcals/mol

Question 3: What is your final docking energy?
Answer: A good docked conformation should have negative docking energy.
dock1.jpg

Question 4: Is there any difference between the docked conformation and the initial conformation?
Answer: Before you can answer this question, you must first find a method to detect differences (if any) between the conformations. One method is to measure the distances between several key atoms in the protein and ligand and compare these distances. For example, you might measure the distance between a hydrogen bond donor on the protein and a hydrogen bond acceptor on the ligand and check to see how this distance has changed between the initial conformation and docked conformation.

Question 5: Is there any difference between this docked conformation and the docked conformation from DOCK?
Answer: The approach is the same as the answer given in question 4.

Question 6: Is it valid to compare the energies of the docked conformations from DOCK and FlexiDock? Why?
Answer: No, it is not valid to compare the energies given by the two algorithms. There are two reasons.

  1. The two algorithms may be using different force fields to compute the interaction energy between the protein and ligand.
  2. The atoms in the protein and ligand used to compute the interaction energy is different in the two algorithms. For the DOCK algorithm, only atoms in the box is considered. For FlexiDock, only atoms inside the defined pocket (which is a sphere) is considered.

CoMFA practical/tutorial

Thursday, February 14th, 2008

When I was a teaching assistant at the National University of Singapore, I prepared a practical on the use of Sybyl 7.2 to perform CoMFA for one of the modules that I was teaching. The following is the practical. All the files needed for the practical (except Sybyl 7.2) can be downloaded from here.

CoMFA Theory

The idea underlying a Comparative Molecular Field Analysis (CoMFA) is that differences in a target property are often related to differences in the shapes of the non-covalent fields surrounding the tested molecules. To put the shape of a molecular field into a QSAR table, the magnitudes of its steric (Lennard-Jones) and electrostatic (Coulombic) fields are sampled at regular intervals throughout a defined region. While there are many adjustable parameters in CoMFA, certainly the most important is the relative alignment of the individual molecules when their fields are computed. Properly aligned molecules have a comparable conformation and a similar orientation in Cartesian space. The QSAR is then generated by a PLS analysis of the data contained in the MSS. The value of the resulting QSAR can be determined through the value of the crossvalidated r2 (from now on referred to as q2) reported by the PLS. If acceptable, the CoMFA QSAR, re-derived in final, non-crossvalidated form, can most easily be manipulated using various graphics techniques. Otherwise the alignment of one or more molecules can be changed, or other parameters altered, and the analysis repeated. Once an acceptable QSAR has been derived, prediction of the target property value for a new molecule is particularly straightforward.

From a user’s point of view, the four major phases of a CoMFA are:

  1. Preparing for CoMFA
  2. Building Spreadsheet for CoMFA
  3. CoMFA PLS Runs
  4. Examination and Use of CoMFA Results

In this practical, we will be constructing a QSAR for a set of ryanoid molecules. The inhibitory activity values are chicken KDs (nM) for displacement of 7 nM [3H]ryanodine from skeletal muscle. (W. Welch, S. Ahmad, K. Gerzon, R.A. Humerickhouse, H.R. Besch, Jr., L. Ruest, P. Deslongchamps & J.L. Sutko, Biochemistry, 1994, 33: 6074-6085)

 

Preparing for CoMFA

  1. All the ryanoids include a tetrahydropyran which can be used to align them. Retrieve tetrahydropyran from the Fragment Library.
    • Build/Edit >>> Get Fragment.
    • Select TETRAHYDROPYRAN and press OK.
  2. Remove all the hydrogens.
    • Build/Edit >>> Delete >>> Atom.
    • Press Atom Types in the Atom Expression dialog.
    • Check the H check box in the Atom Types dialog and press OK.
    • Press OK in the Atom Expression dialog.
  3. Save the core structure as a mol2 file so you can retrieve it later. This is a precautionary measure because the contents of this work area will be overwritten during step 4.
    • File >>> Save As.
    • Enter THP_core in the File field and click on Save.
  4. Align all the molecules in the database onto one of them using the molecule on the screen as the basis for the alignment.
    • File >>> Align Database.
    • The Database Align Dialog has three fields: one for the database containing the molecules to align, one for the template molecule, and one to indicate the molecule area which contains the substructure common to all molecules. Since only the default work area is currently occupied, that default area is already entered in the discussion, it will be assumed that this is M1.
    • Type ryanoids in the Database to Align field. If there is already an entry here, you can select the whole path by double-clicking on it and replace it. Alternatively, you can delete the current entry and leave it empty, or even insert nonsense. Whenever SYBYL cannot interpret the name you supply, it will launch the Database Selection dialog.
    • Type ryanodine_1 in the Template Molecule field.
    • Make sure the New Database and All Molecules radio buttons have been checked and press Go.
    • Press OK in the information box warning you that the contents of the molecule areas will be overwritten during the alignment process.
    • Note About Template Alignment:
      Setting the Grid Orientation to Inertial in the Database Align dialog causes the molecule upon which the database is being aligned to be oriented along the coordinate lattice, then frozen. This makes use of the information embedded in a CoMFA field maximally efficient, will generally improve q2, and will make your results reproducible by other workers. It is recommended that the template molecule always be oriented in this way before undertaking CoMFA.
  5. Watch the alignment as it proceeds: each molecule in turn is retrieved from the database, brought into a new molecule area and aligned on the template. Once the process is complete, you are prompted for the name of a new database to store the 18 ryanoids in their new orientations.
    • Type my_ryanoids at the prompt for the name of the new database and press OK.
  6. The new alignments have all been saved, so you can now clear the screen. This will reduce the time SYBYL spends refreshing the screen between commands.
    • Build/Edit >>> Zap (Delete) Molecule.
    • Press All and press OK.

 

Building Spreadsheet for CoMFA

  1. Create a spreadsheet from the new database and save it.
    • Tools >>> QSAR >>> New Spreadsheet .
    • Select Database as the Data Source and press OK.
    • In the Database Selection dialog select my_ryanoids.mdb as the database and press Open.
      The 18 molecules in the database are read in and entered as rows in the spreadsheet.
  2. Read in biological data
    • In the end, the purpose of CoMFA is to relate structural information to biological data. Add the relevant biological data to your molecular spreadsheet now in preparation for subsequent manipulations.
    • MSS: File >>> Import.
    • Select Custom in the Format option menu of the Import dialog.
    • Press the [] button to its right to bring up the Custom Format dialog.
    • Select Comma in the Delimiter option menu and toggle both Labels check boxes on. Enter * and NEW, respectively, in the row and column fields. This means that when the data file is imported its rows will match the existing rows and new columns will be created.
    • Press OK to exit the Custom Format dialog.
    • Enter ryanoids.txt in the File field.
    • Press Import in the Import dialog.
  3. In QSAR, the logarithms of concentrations are used. So the values we will be using are -log(concentration*10-9) since the concentration is expressed in nM.
    • Highlight column 2 and press Autofill.
    • Select FUNCTIONAL_DATA as the new column type and press OK.
    • Type PKD as the name for the new column and press OK.
    • Type 9-LOG(”KD”) for the functional specification and press OK.
  4. Add CoMFA Columns
    • The process of adding CoMFA fields involves scanning all the molecules to establish an encompassing region and computing somewhat more than 33,000 energies.
    • Highlight column 3 and press Autofill.
    • Select COMFA as the new column type and press OK.
    • Select Tripos Standard in the CoMFA Field Class option menu.
    • Select Both as the Field Types.
    • Make sure that the steric and electrostatic cutoffs are both set to the default 30 kcal/mol, and that the Create Automatically radio button is on in the Region section of the dialog.
    • Press Add Column.
    • Press OK to accept COMFA3 as the column name.
      The Tripos standard CoMFA fields are calculated for each molecule in the spreadsheet and the number of lattice points at or above the cutoffs are displayed for each molecule. Because thousands of actual columns of numbers are usually produced by evaluation of a single CoMFA column, for each molecule this column shows the number of lattice intersections located “inside” that molecule, a very crude volume estimate.
    • Question 1: What is the range of values for the CoMFA column?

             
    AddNewCoMFAColumn

  5. Save the Table
    • MSS: File >>> Save.
      The file my_ryanoids.tbl is saved with the CoMFA columns.

 

CoMFA PLS Runs

  1. With no rows or columns selected, any operations are performed on the entire spreadsheet. Perform a crossvalidated run with two objectives:
    • To find out whether the CoMFA model is predictively useful.
    • If useful, decide how many components to use for the best model. The number of components describes the degree of complexity of the model; at some point adding more detail corresponds to fitting the data to noise, and the predictive ability begins to diminish.
    • MSS: QSAR >>> Partial Least Squares.
      The Partial Least Squares Analysis dialog appears.
      • Column to use: PKD, COMFA3; the only columns in the table.
      • Dependent Column: PKD; the column whose values you wish to predict from the resulting PLS model.
      • Validation: Leave-One-Out; performs a PLS analysis with crossvalidation where the number of groups is equal to the number of rows selected (here all rows).
      • Use SAMPLS: off; This box is automatically checked on at start-up, but toggle it off for now. Once you are familiar with the crossvalidation procedure, you will probably want to use the SAMPLS option instead, because it is much faster.
      • Components: 5.
      • Scaling: CoMFA Standard; the best option when CoMFA columns are present.
      • Column Filtering: on, 2.0; to omit from the analysis those columns (lattice points) whose energy variance is less than 2.0 kcal/mol.
      • Type one as the Analysis Name.
      • Press Do PLS.
        A lot of details scroll in the text window. The important summary is presented at the end: the r2 value and the optimal number of components.
        Question 2: What is the crossvalidated r2? Is the model good enough? What is the number of components for maximum r2?
      • Select End when prompted to save the analysis.
  2. Having the crossvalidation to confirm the predictive ability, derive the best predictive model for use in graphics and in numerical prediction.
    • The Partial Least Squares Analysis dialog is still on the screen. Set the options as follows:
      • Validation: No Validation; performs a PLS analysis without any validation. This is typically done at the end of PLS analysis.
      • Components: 2.
      • Scaling: CoMFA Standard; the best option when CoMFA columns are present.
      • Column Filtering: on; to omit from the analysis those columns (lattice points) whose energy variance is less than 2.0 kcal/mol.
      • Type two as the Analysis Name.
      • Press Do PLS.
        Question 3: What is the r2? What is the percentage of contribution to the model’s information from the electrostatic fields? What is the percentage of contribution to the model’s information from the steric fields?
      • Press OK when asked to save the analysis.
        The analysis is saved in the file two.pls.
      • Press End to close the Partial Least Squares Analysis dialog.

 

Examination and Use of CoMFA Results

  1. Investigate the CoMFA results.
    • MSS: QSAR >>> View QSAR >>> CoMFA
      The View CoMFA dialog is displayed.
      • Accept the default values for all options.
      • Press Show & Quit to exit the View CoMFA dialog.
      • Read the information in the textport window.
    • The large red areas are the regions where negative potential is favorable for the inhibitory activity, blue areas are unfavorable. Scattered green areas are regions where bulky substituents are desirable, yellow areas are undesirable.
    • The graph in D1 represents the predicted (Y-axis) versus Actual (X-axis) value for the dependent column (inhibitory activity). Each point represents a compound. Compounds in the upper right corner are the most active.
  2. Take a molecule that is a good candidate for improvement, make some structural changes and predict its activity.
    • Select the point corresponding to the molecule with the highest actual value of activity.
      Question 4: Which molecule has the highest inhibitory activity? What is its actual and predicted activities?
    • Press End Select to terminate point picking.
  3. Label the atoms by ID numbers.
    • Press theimg src=’http://voyagememoirs.com/wordpress-voyphar/wp-content/uploads/2008/02/icons_display_area_options.gif’ alt=’IconsDisplayAreaOptions’ /> icon and turn on Id Atom Labels for the molecule in D2.
      Hydrogen 25 is near the region which prefers negative potential (red) and near the sterically unfavorable area (yellow). One approach to try to achieve both these objectives is to replace the hydrogen with a fluorine.
  4. You can easily model this new compound.
    • Build/Edit >>> Modify >>> Atom.
    • Select TYPE and press OK.
    • Click on hydrogen 25.
    • Press OK in thei>Atom Expression dialog.
    • Select atom type F and press OK.
  5. Calculate the charges and predict the activity of this new compound. Note that the changes you have made to the DehydroR_2 molecule are minor. This means that minimization and realignment are not necessary.
    • Compute >>> Charges >>> Gasteiger-Huckel.
    • Select NO changing of formal charges.
    • Build/Edit >>> Name Molecule.
    • Type DehydroR_2F and press OK.
    • MSS: QSAR >>> Predict Property.
    • Select M2:DEHYDROR_2F in the molecule list and press OK.
      Question 5: What is the predicted activity of the modified compound?
  6. Close the spreadsheet.
    • MSS File >>> Close

 

Answer to Questions

Question 1: What is the range of values for the CoMFA column?
Answer: The values range from 154 to 236.

Question 2: What is the crossvalidated r2? Is the model good enough? What is the number of components for maximum r2?
Answer: The crossvalidated r2 is about 0.522, so one would expect to use such a model to account for 52% of the actual variance in activity among additional similar ryanoids. This is good enough to at least rank the activity of proposed new compounds rather accurately. The maximum r2 occurs at 2 components. Thus we know that the model is useful and that we should use 2 components for the final analysis.

Question 3: What is the r2? What is the percentage of contribution to the model’s information from the electrostatic fields? What is the percentage of contribution to the model’s information from the steric fields?
Answer: Information in the text window reports that the r2 measure of fit is 0.801. The electrostatic fields contribute 49.8% of the model’s information, while the steric fields represent the other 50.2%.

Question 4: Which molecule has the highest inhibitory activity? What is its actual and predicted activities?
Answer: The molecule in row 5: DEHYDRO_R2 has the highest inhibitory activity. The actual value for the inhibitory activity is 8.22, while the predicted value is 7.85.

Question 5: What is the predicted activity of the modified compound?
Answer: The predicted value of DehydroR_2F, the modified compound, is 8.09. The predicted activity of the compound before modification is 7.85, as can be seen higher in the textport window (when you picked the compound in row 5). So, replacing the hydrogen with a fluorine is expected to have a small, positive effect on activity.


Close
E-mail It