Archive for the ‘Pharmacy’ Category

PaDEL-ADV

Monday, November 9th, 2009

Introducing another new piece of software, PaDEL-ADV. It performs virtual screening using AutoDock Vina.

PaDEL-ADV reads a directory containing ligand files. For each ligand, the structure file is converted into a pdb file, if necessary, using The Chemistry Development Kit. The pdb file is then converted to pdbqt using the prepare_ligand4.py script provided by AutoDockTools. AutoDock Vina is then used to dock the ligand with the receptor. Individual binding modes are extracted from the output pdbqt file using vina_split. The pdbqt files are then converted to pdb files using the pdbqt_to_pdb.py script provided by AutoDockTools. Results for each binding mode are extracted from the log file and placed into the results CSV file. The log file and all the related pdb and pdbqt files are then compressed into a zip file.
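To make the "results are extracted from the log file" step more concrete, here is a minimal Python sketch of how the binding mode table in a Vina log could be turned into CSV rows. This is not the code used inside PaDEL-ADV. The function names, the CSV column names and the numbers in EXAMPLE_LOG are made up for illustration, and the table layout is assumed to be the one printed by AutoDock Vina 1.1.x.

```python
import csv
import io
import re

# One result row of the table: mode, affinity, rmsd l.b., rmsd u.b.
# (layout assumed from AutoDock Vina 1.1.x logs)
MODE_ROW = re.compile(r"^\s*(\d+)\s+(-?\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$")


def parse_vina_log(log_text):
    """Return one dict per binding mode found in the log text."""
    modes = []
    for line in log_text.splitlines():
        match = MODE_ROW.match(line)
        if match:
            modes.append({
                "mode": int(match.group(1)),
                "affinity_kcal_mol": float(match.group(2)),
                "rmsd_lb": float(match.group(3)),
                "rmsd_ub": float(match.group(4)),
            })
    return modes


def modes_to_csv(ligand_name, modes):
    """Render the binding modes of one ligand as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names are hypothetical, not the ones PaDEL-ADV writes.
    writer.writerow(["ligand", "mode", "affinity_kcal_mol", "rmsd_lb", "rmsd_ub"])
    for m in modes:
        writer.writerow([ligand_name, m["mode"], m["affinity_kcal_mol"],
                         m["rmsd_lb"], m["rmsd_ub"]])
    return buffer.getvalue()


# Illustrative log fragment with invented values.
EXAMPLE_LOG = """\
mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1         -7.5      0.000      0.000
   2         -7.1      1.852      2.417
Writing output ... done.
"""

if __name__ == "__main__":
    print(modes_to_csv("ligand1", parse_vina_log(EXAMPLE_LOG)))
```

Header and separator lines do not match the pattern, so only the numeric rows end up in the CSV.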

Modern QSAR - Validation

Tuesday, April 21st, 2009

modern-qsar-validation.jpg

Modern QSAR - Modeling methods

Monday, March 23rd, 2009

modern-qsar-modelingmethods.jpg

Modern QSAR - Descriptor

Friday, February 6th, 2009

modern-qsar-descriptors.jpg

Modern QSAR - Dataset

Wednesday, January 28th, 2009

32-modern-qsar-dataset.jpg

OECD Principles For The Validation, For Regulatory Purposes, Of (Quantitative) Structure-Activity Relationship Models

Tuesday, December 16th, 2008

In 2004, the OECD came up with 5 principles for QSAR models. They are:

  1. a defined endpoint
  2. an unambiguous algorithm
  3. a defined domain of applicability
  4. appropriate measures of goodness-of-fit, robustness and predictivity
  5. a mechanistic interpretation, if possible

If you are working on QSAR models, it will be good for you to know these principles and apply them in your work.

For more information on these principles, you can go to the OECD website.

PaDEL-Descriptor

Saturday, August 2nd, 2008

Introducing the first product from my laboratory, PaDEL-Descriptor. It is software to calculate molecular descriptors and fingerprints. The software currently calculates 393 descriptors (290 1D and 2D descriptors and 103 3D descriptors) and 5 types of fingerprints. The descriptors and fingerprints are calculated using The Chemistry Development Kit with some in-house additions for electrotopological descriptors. All the different types of descriptors are calculated in parallel to take full advantage of the multi-core CPUs that are commonly found nowadays. The usage instructions can be found on the website itself. This software is free for all (e.g. personal, academic, non-profit, non-commercial, government, commercial, etc.) to use.

The software is Java Web Start ready. What this means is that if you have Java JRE installed on your computer (which most people should have by now), you can just click on a link on the website to launch the software directly. A copy of the software will automatically be downloaded, stored on your computer and run. You can create a shortcut to this software on your desktop. When you click on this shortcut and if you are online, Java Web Start will automatically check if there is a new version of the software available. If there is, it will download it before running the software. If you are offline, Java Web Start will just run your local copy. The main advantage of Java Web Start is that it will always ensure that you are running the latest version of the software (if you are online). If I have the time, I will give a short writeup on how to make your own Java software Java Web Start ready. It is really very easy if you are using NetBeans.

Different requirements for predictive models at different stages of drug design cycle

Friday, May 30th, 2008

In a typical qSAR, it is usually assumed that the sensitivity and specificity of the predictive model are equally important. However, in a drug discovery project, these two accuracies may differ in importance at different stages of the design cycle. For example, in the initial target and hit identification phase, it may be more important not to miss potential leads. Thus, it is more important to have a predictive model which has very high sensitivity (small number of false negatives) and reasonably good specificity. At later stages, it becomes increasingly important to focus on a manageable number of candidates. Thus a predictive model with very high specificity (small number of false positives) and reasonably good sensitivity may become more important. Hence it is important to be able to modify the modeling method or predictive model so as to meet these two types of requirements.

For SVM classification systems, there are two possible approaches for modifications to suit these different needs. The first approach uses different training error penalties for compounds in positive and negative classes. For example, a higher training error penalty for compounds in positive class and lower training error penalty for compounds in negative class can be used to increase the sensitivity of the SVM classification systems. The second approach adds a correction factor to the SVM decision function. A positive or negative correction factor will improve the sensitivity or specificity of the SVM classification system respectively.
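The second approach is easy to try out by hand. The small Python sketch below assumes that the SVM has already been trained and that we only have its decision values f(x) for some compounds. The decision values, class labels and function names are invented for illustration. The first approach (different training error penalties) is not shown because it requires retraining the SVM.

```python
def classify(decision_values, correction=0.0):
    """Label a compound positive (+1) when f(x) + correction >= 0, else -1."""
    return [1 if d + correction >= 0 else -1 for d in decision_values]


def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for labels coded as +1 / -1."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up decision values and true classes for eight compounds.
DECISION_VALUES = [1.2, 0.4, -0.3, -0.6, 0.5, -0.2, -0.9, -1.5]
TRUE_LABELS = [1, 1, 1, 1, -1, -1, -1, -1]

if __name__ == "__main__":
    for correction in (0.0, 0.7, -0.6):
        predicted = classify(DECISION_VALUES, correction)
        sens, spec = sensitivity_specificity(TRUE_LABELS, predicted)
        print(f"correction={correction:+.1f} "
              f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

With these made-up numbers, the uncorrected model has a sensitivity of 0.50 and a specificity of 0.75. A correction of +0.7 raises the sensitivity to 1.00 at the cost of specificity (0.50), while a correction of -0.6 raises the specificity to 1.00 and lowers the sensitivity to 0.25.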

Overfitting

Tuesday, March 11th, 2008

It is not sufficient for a QSAR/qSAR model to have good predictive capability. A second requirement for a good quality QSAR/qSAR model is that it must not suffer from overfitting. There are two main types of overfitting: (1) using a model that is more flexible than it needs to be and (2) using a model that includes irrelevant descriptors (Hawkins 2004). There are various methods that can be used to prevent or to check for these two types of overfitting.

A number of different QSAR/qSAR models can be developed using machine learning methods of varying complexities. The QSAR/qSAR model with the best balance between complexity of the machine learning method used and its predictive capability is the one that is most suitable for predicting the activity of a compound. This method prevents the use of a QSAR/qSAR model that is more flexible than is necessary.

A frequently used method for checking whether a QSAR/qSAR model is overfitted is to compare its prediction capability determined by using cross-validation methods with that determined by using independent validation sets (Hawkins 2004). Even though cross-validation methods tend to give a pessimistic estimate of the predictive capability of a QSAR/qSAR model, a model that is not overfitted should not show large differences between the estimates of its predictive capability from cross-validation methods and from independent validation sets.

Y-randomization is commonly used to determine the probability of chance correlation during descriptor selection (Manly 1997; Leardi and González 1998). In classification problems, a portion of the compounds in the training set belonging to the positive data class (D+) is randomly exchanged with compounds in the training set belonging to the negative data class (D-), creating new training sets with false D+ and D- compounds. For regression problems, the activities of all the compounds in the training set are randomly rearranged. The machine learning method is trained using this scrambled training set. The randomization is repeated a number of times and the prediction capabilities of the new scrambled QSAR/qSAR model from each run are compared to that of the original QSAR/qSAR model. If the scrambled training sets give significantly lower prediction capabilities than the original training set, it can be concluded that the original QSAR/qSAR model is relevant and unlikely to have arisen as a result of chance correlation.
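For the regression case, the procedure can be sketched in a few lines of Python. This is only a toy illustration under several assumptions: the "model" is a least-squares line on a single descriptor, the prediction capability is simply the r2 on the training set, no descriptor selection is involved, and the descriptor and activity values are invented.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. r2 of a one-descriptor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def y_randomization(x, y, runs=200, seed=0):
    """Return r2 of the original model and the r2 of each scrambled model."""
    rng = random.Random(seed)
    original = r_squared(x, y)
    scrambled = []
    for _ in range(runs):
        shuffled = list(y)
        rng.shuffle(shuffled)  # randomly rearrange the activities
        scrambled.append(r_squared(x, shuffled))
    return original, scrambled


# Made-up training set: one descriptor and one activity value per compound.
DESCRIPTOR = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ACTIVITY = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1, 9.0, 9.9]

original_r2, scrambled_r2 = y_randomization(DESCRIPTOR, ACTIVITY)
# Fraction of scrambled models that do at least as well as the original one.
fraction_as_good = sum(r >= original_r2 for r in scrambled_r2) / len(scrambled_r2)

if __name__ == "__main__":
    print(f"original r2 = {original_r2:.3f}")
    print(f"mean scrambled r2 = {sum(scrambled_r2) / len(scrambled_r2):.3f}")
    print(f"fraction of scrambled runs as good as original = {fraction_as_good:.3f}")
```

A small fraction_as_good suggests that the original correlation is unlikely to be due to chance. In a real study the whole modeling procedure, including descriptor selection, would be repeated on each scrambled training set.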

In order to determine whether the selected descriptors of the original QSAR/qSAR model include those irrelevant for the prediction of the activity of a compound, different groups of QSAR/qSAR models, each containing a different number of descriptors, can be generated by using the descriptor selection method. Each group contains a fixed number of QSAR/qSAR models having the same number of descriptors. The prediction capabilities of the QSAR/qSAR models in each group are determined and the average prediction capabilities of all the groups are compared and used to determine the optimal number of descriptors for prediction of the activity of a compound. If the optimal number of descriptors coincides with the number of descriptors in the original QSAR/qSAR model, the original model is unlikely to contain irrelevant descriptors.

References

  • Hawkins DM (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences 44(1): 1-12.
  • Leardi R and González AL (1998). Genetic algorithms applied to feature selection in PLS regression: How and when to use them. Chemometrics and Intelligent Laboratory Systems 41(2): 195-207.
  • Manly BFJ (1997). Randomization, bootstrap and Monte Carlo methods in biology. London, Chapman and Hall.

Methods for measuring predictive capability of QSAR models

Sunday, March 9th, 2008

The following statistics are commonly calculated to determine the predictive capability of a QSAR model.

[Equation images: r2, mean squared error (MSE), mean absolute error (MAE), fold error, average fold error]

The r2 value measures the proportion of the variance in the actual activity values that is explained by the predicted values. The fold error of a compound measures the degree of overprediction or underprediction for that compound and is useful for identifying chemical structures which are not well represented by the QSAR model. The average fold error prevents poor overpredictions from being cancelled by equally poor underpredictions. A QSAR model that predicts every activity value perfectly gives an average fold error of 1, and a model with an average fold error of less than 2 is considered to be a successful one (Obach et al. 1997).
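For readers who want to compute these statistics themselves, here is a minimal Python sketch. The exact formulas in the original equation images may differ, so the forms below are assumptions: r2 is taken as the coefficient of determination (1 - SS_res / SS_tot), the fold error as the ratio predicted / actual, and the average fold error as 10 raised to the mean absolute log10 of that ratio. The fold error functions assume strictly positive activity values, and the example numbers are invented.

```python
import math


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed form)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mse(actual, predicted):
    """Mean squared error."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def fold_errors(actual, predicted):
    """Per-compound ratio predicted / actual (> 1 overprediction, < 1 underprediction)."""
    return [p / a for a, p in zip(actual, predicted)]


def average_fold_error(actual, predicted):
    """10 ** mean(|log10(predicted / actual)|); equals 1 for perfect predictions."""
    logs = [abs(math.log10(p / a)) for a, p in zip(actual, predicted)]
    return 10 ** (sum(logs) / len(logs))


# Made-up activity values: one twofold overprediction, one twofold
# underprediction and one perfect prediction.
ACTUAL = [1.0, 10.0, 100.0]
PREDICTED = [2.0, 5.0, 100.0]

if __name__ == "__main__":
    print("fold errors:", fold_errors(ACTUAL, PREDICTED))
    print("average fold error:", average_fold_error(ACTUAL, PREDICTED))
    print("MSE:", mse(ACTUAL, PREDICTED), "MAE:", mae(ACTUAL, PREDICTED))
    print("r2:", r2(ACTUAL, PREDICTED))
```

Because the absolute value is taken before averaging, the twofold overprediction and the twofold underprediction in the example do not cancel out: the average fold error is about 1.59 rather than 1.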

References

  • Obach RS, Baxter JG, Liston TE, Silber BM, Jones BC, Macintyre F, Rance DJ and Wastall P (1997). The prediction of human pharmacokinetic parameters from preclinical and in vitro metabolism data. Journal of Pharmacology and Experimental Therapeutics 283(1): 46-58.
