Archive for August, 2008

VisuMap - Part 2

Friday, August 29th, 2008

I had previously tried out the mapping algorithms in the software VisuMap using my own dataset of 171 compounds, which can be separated into three congeneric groups: penicillins, cephalosporins and fluoroquinolones.

A recent comment by the author of VisuMap suggests that the performance of the mapping algorithms could be improved by carefully selecting an appropriate distance metric. Since my dataset uses fingerprints (1025 binary features), he suggested using the Jaccard or Dice distance metric.
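For readers unfamiliar with the two metrics, here is a minimal sketch of how they are usually defined for binary fingerprints. This is my own illustration, not VisuMap's code; the function names and the toy 8-bit fingerprints are made up, and VisuMap's exact definitions may differ.

```python
def _counts(a, b):
    """Return (bits set in both, bits set in a, bits set in b)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    return both, sum(1 for x in a if x), sum(1 for y in b if y)


def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|; two empty fingerprints count as identical."""
    both, n_a, n_b = _counts(a, b)
    union = n_a + n_b - both
    return 0.0 if union == 0 else 1.0 - both / union


def dice_distance(a, b):
    """1 - 2|A and B| / (|A| + |B|); two empty fingerprints count as identical."""
    both, n_a, n_b = _counts(a, b)
    total = n_a + n_b
    return 0.0 if total == 0 else 1.0 - 2.0 * both / total


# Toy 8-bit fingerprints (hypothetical, not taken from my dataset).
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1, 0, 1, 0]

d_jaccard = jaccard_distance(fp1, fp2)
d_dice = dice_distance(fp1, fp2)
print(d_jaccard, d_dice)
```

With these definitions the two distances are monotonically related (Jaccard = 2 x Dice / (1 + Dice)), so they rank pairs of compounds in the same order and differ only in the magnitudes of the distances.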

So I did some more experiments.

sammon-jaccard.jpg
Results from Sammon mapping using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

cca-jaccard.jpg
Results from curvilinear component analysis (CCA) using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

rpm-jaccard.jpg
Results from relational perspective map (RPM) using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

mds-jaccard.jpg
Results from SMACOF MDS using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

sammon-dice.jpg
Results from Sammon mapping using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

cca-dice.jpg
Results from curvilinear component analysis (CCA) using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

rpm-dice.jpg
Results from relational perspective map (RPM) using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

mds-dice.jpg
Results from SMACOF MDS using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, and blue triangles are fluoroquinolones.

From the figures, it can be seen that the Jaccard distance metric didn’t really improve the results (i.e. Sammon and MDS are still good, and CCA and RPM are still poor at separating the three groups). However, the Dice distance metric gave rather interesting results. CCA now shows very good separation of the three groups, whereas RPM failed rather terribly. Sammon and MDS still separate the fluoroquinolones well from the penicillins and cephalosporins, but now have difficulty cleanly separating the penicillins from the cephalosporins. (It is important for me to reiterate here that the colours and shapes of the different groups were added manually to enhance the visual effect. Bear in mind that when you process a dataset with unknown groupings, every point will look the same. Thus the only way to differentiate groups is if there is an obvious separation band.)

Thus the obvious conclusion is that both the mapping algorithm and the distance metric are important for separating different groups (nothing new here). The question then boils down to how to select the appropriate mapping algorithm and distance metric for a dataset. Is there some rule of thumb for selection, or do we have to try different combinations manually? One clue from the author of VisuMap is that “Sammon map and PCA emphasize on the global inter-cluster structure, whereas other mapping algorithms (like the RPM and CCA) emphasize more on the details within clusters.” So does that mean that we should use Sammon map or PCA to get a broad overview, then extract (or highlight) each cluster and use RPM or CCA to examine its structure? Also, are there any references which state what type of distance metric is appropriate for what type of features? These are some of the questions on my mind now, and I guess it is time to do some literature searches. Does anyone have any comments or answers, or can anyone provide some useful references?

Knowledge Discovery and Data Mining Process Model

Tuesday, August 12th, 2008

I recently read an article (Kurgan and Musilek 2006) reviewing several commonly used process models for knowledge discovery and data mining. The number of steps in these models ranges from 5 to 9, but the actual process is pretty similar across the models. Well, I guess you can’t deviate too much if you wish to do knowledge discovery properly.

Among the various models presented, I particularly like the Generic model, which pools and summarizes the important points from the reviewed models. The Generic model borrows heavily from a model proposed by Cios et al. in 2000. The steps in the Generic model are:

  1. Application domain understanding
  2. Data understanding
  3. Data preparation and identification of data mining technology
  4. Data mining
  5. Evaluation
  6. Knowledge consolidation and deployment

I am sure these steps are nothing new and will be familiar to those involved in knowledge discovery and data mining. However, if you have not been following any particular model, this might serve as a good reference to show that the methods you have been using are already validated by others.

References

  • Cios KJ, Teresinska A, Konieczna S, Potocka J, Sharma S. A knowledge discovery approach to diagnosing myocardial perfusion. Engineering in Medicine and Biology Magazine, IEEE. 2000;19(4):17-25.
  • Kurgan LA, Musilek P. A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. 2006;21(1):1-24.

PaDEL-Descriptor

Saturday, August 2nd, 2008

Introducing the first product from my laboratory, PaDEL-Descriptor. It is software for calculating molecular descriptors and fingerprints. The software currently calculates 393 descriptors (290 1D and 2D descriptors, and 103 3D descriptors) and 5 types of fingerprints. The descriptors and fingerprints are calculated using The Chemistry Development Kit, with some in-house additions for electrotopological descriptors. All the different types of descriptors are calculated in parallel to take full advantage of the multi-core CPUs that are common nowadays. The usage instructions can be found on the website itself. This software is free for all (e.g. personal, academic, non-profit, non-commercial, government, commercial, etc.) to use.
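To give a rough idea of what calculating different descriptor types in parallel means, here is a small sketch in Python. It is not PaDEL-Descriptor's actual code (the software itself is written in Java); the three toy descriptor functions and the list-of-element-symbols molecule are invented purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for descriptor types. Each takes a molecule
# (here just a list of element symbols) and returns a dict of values.
def count_heavy_atoms(atoms):
    return {"heavy_atoms": sum(1 for a in atoms if a != "H")}


def count_elements(atoms):
    counts = {}
    for a in atoms:
        counts["n_" + a] = counts.get("n_" + a, 0) + 1
    return counts


def hydrogen_fraction(atoms):
    return {"h_fraction": atoms.count("H") / len(atoms) if atoms else 0.0}


DESCRIPTOR_GROUPS = [count_heavy_atoms, count_elements, hydrogen_fraction]


def calculate_descriptors(atoms, max_workers=None):
    """Run every descriptor group on the same molecule concurrently and merge the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each descriptor group is submitted as an independent task.
        for partial in pool.map(lambda group: group(atoms), DESCRIPTOR_GROUPS):
            results.update(partial)
    return results


# Ethanol (C2H6O) as a flat list of atoms.
ethanol = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
descriptors = calculate_descriptors(ethanol)
print(descriptors)
```

Note that in CPython, threads will not speed up CPU-bound pure Python code like this because of the global interpreter lock; a process pool would be needed for a real speed-up. The sketch only shows the structure: independent descriptor types are handed to a pool of workers and their results are merged.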

The software is Java Web Start ready. What this means is that if you have the Java Runtime Environment (JRE) installed on your computer (which most people should have by now), you can just click on a link on the website to launch the software directly. A copy of the software will automatically be downloaded, stored on your computer and run. You can create a shortcut to this software on your desktop. When you click on this shortcut while you are online, Java Web Start will automatically check whether a new version of the software is available. If there is, it will download it before running the software. If you are offline, Java Web Start will just run your local copy. The main advantage of Java Web Start is that it ensures that you are always running the latest version of the software (if you are online). If I have the time, I will give a short writeup on how to make your own Java software Java Web Start ready. It is really very easy if you are using NetBeans.

