VisuMap - Part 2

I had previously tried out the mapping algorithms in the software VisuMap using my own dataset of 171 compounds, which can be separated into three congeneric groups of compounds: penicillins, cephalosporins and fluoroquinolones.

A recent comment by the author of VisuMap suggests that the performance of the mapping algorithms could be improved by carefully selecting the appropriate distance metric. Since my dataset was using fingerprints (1025 binary features), he suggested using the Jaccard or Dice distance metric.
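
As a reminder of what these two metrics compute on binary fingerprints, here is a minimal Python sketch. It is not VisuMap's implementation; the function names and the two toy 8-bit fingerprints are made up for illustration, and I assume the usual set-based definitions (shared "on" bits versus all "on" bits).

```python
def jaccard_distance(x, y):
    """1 - (bits on in both) / (bits on in either) for two binary vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    if either == 0:  # assumption: two empty fingerprints count as identical
        return 0.0
    return 1.0 - both / either


def dice_distance(x, y):
    """1 - 2 * (bits on in both) / (bits on in x + bits on in y)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    total = sum(1 for a in x if a) + sum(1 for b in y if b)
    if total == 0:  # assumption: two empty fingerprints count as identical
        return 0.0
    return 1.0 - 2.0 * both / total


# Two hypothetical 8-bit fingerprints: 3 shared bits, 5 bits on in either one.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]

d_j = jaccard_distance(fp_a, fp_b)  # 1 - 3/5
d_d = dice_distance(fp_a, fp_b)     # 1 - 6/8

print("Jaccard distance:", d_j)
print("Dice distance:   ", d_d)
# With these definitions the two are tied by d_dice = d_jaccard / (2 - d_jaccard).
print("Jaccard / (2 - Jaccard):", d_j / (2.0 - d_j))
```

Under these definitions the Dice distance is a monotonic function of the Jaccard distance, so both rank every pair of compounds in the same order; only the spacing between the distances differs.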

So I did some more experiments.

sammon-jaccard.jpg
Results from Sammon mapping using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

cca-jaccard.jpg
Results from curvilinear component analysis (CCA) using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

rpm-jaccard.jpg
Results from relational perspective map (RPM) using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

mds-jaccard.jpg
Results from SMACOF MDS using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

sammon-dice.jpg
Results from Sammon mapping using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

cca-dice.jpg
Results from curvilinear component analysis (CCA) using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

rpm-dice.jpg
Results from relational perspective map (RPM) using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

mds-dice.jpg
Results from SMACOF MDS using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

From the figures, it can be seen that the Jaccard distance metric didn’t really improve the results (i.e. Sammon and MDS are still good, and CCA and RPM are still poor at separating the three groups). However, the Dice distance metric gives rather interesting results. CCA now shows very good separation of the three groups, whereas RPM failed rather terribly. Sammon and MDS still separate the penicillins/cephalosporins from the fluoroquinolones well, but now have difficulty cleanly separating the penicillins from the cephalosporins. (It is important for me to reiterate here that the colours and shapes of the different groups were added manually to enhance the visual effect. Bear in mind that when you process a dataset with unknown groupings, every point will look the same. Thus the only way to differentiate groups is if there is an obvious separation band.)

Thus the obvious conclusion is that both the mapping algorithm and the distance metric are important for separating different groups (nothing new here). The question then boils down to how to select the appropriate mapping algorithm and distance metric for a dataset. Is there some rule of thumb for selection, or do we have to try different combinations manually? One clue from the author of VisuMap is that “Sammon map and PCA emphasize on the global inter-cluster structure, whereas other mapping algorithms (like the RPM and CCA) emphasize more on the details within clusters.” So does that mean that we should use Sammon map or PCA to get a broad overview, then extract (or highlight) each cluster and use RPM or CCA to examine its structure? Also, are there any references that state which type of distance metric is appropriate for which type of features? These are some of the questions on my mind now, and I guess it is time to do some literature searches. Does anyone have comments, answers or useful references?


3 Responses to “VisuMap - Part 2”

  1. James X. Li Says:

    I am a little surprised by the disappointing result of the Dice metric with RPM, as this metric is quite similar to the Jaccard metric. Would it be possible to release your dataset so that we can take a closer look? You can replace the labels and names of the data points with anonymous strings (with the table editor) if you need to keep the data confidential.

    If the cluster structure is your main interest, you should also try the t-SNE mapping method. This method is the latest addition to VisuMap and preserves cluster structure very well.

    In general, when I explore a new dataset I first try the PCA method. PCA is the safest mapping algorithm; it does not actually do any data processing except rotating and shifting the coordinate system, then projecting onto the 3 axes with the largest variance. PCA is less powerful than the non-linear mapping algorithms, since it does not do any unfolding, twisting, segmentation etc. But PCA often provides good results in practice.

    When applying the PCA method, you should check the eigenvalues of the principal components (via PCA Projection>PCA Analyzer>PCA Details). Larger eigenvalues mean more variance (and more information). If you see more than 3 relatively large eigenvalues, you should be careful with the results of the PCA map: since it only displays the projection onto 3 components, some relevant information may be invisible in the map.
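
The eigenvalue check described above can be reproduced outside VisuMap. The following is a minimal NumPy sketch, not VisuMap's code: the function name `pca_project` and the synthetic 6-feature dataset are assumptions made for illustration. It centres the data, diagonalises the covariance matrix, and reports how much of the total variance the first three components carry.

```python
import numpy as np


def pca_project(data, n_components=3):
    """Centre the data, rotate onto principal axes, keep the top components."""
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)            # shift the origin to the centroid
    cov = np.cov(centered, rowvar=False)     # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coords = centered @ eigvecs[:, :n_components]  # rotation plus projection
    explained = eigvals / eigvals.sum()      # fraction of variance per component
    return coords, eigvals, explained


# Synthetic stand-in data: 100 points, 6 features with very unequal spread.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

coords, eigvals, explained = pca_project(data)
print("variance fraction per component:", np.round(explained, 3))
print("fraction visible in a 3-component map:", round(float(explained[:3].sum()), 3))
```

If the printed fractions show a fourth or fifth component that is comparable to the first three, the 3-component map is hiding a meaningful part of the structure.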

    Also notice that with VisuMap you can select a cluster of data points then apply PCA on the selected data (via the context menu “Show PCA View”). Thus, in addition to global structure, you can easily explore the detailed relationships within clusters.

    If PCA does not deliver satisfactory results (e.g. the map has no clusters, no clear geometrical shapes or density gradients, does not fit expectations, or simply looks like a random cloud), I would then try the Sammon or SMACOF method, and after that more powerful methods like CCA, t-SNE or RPM. The Shepard diagram tool (view>Shepard Diagram) helps to assess how well a map reflects the original distance information.
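
For readers without VisuMap at hand, the idea behind a Shepard diagram can be sketched in a few lines of Python. This assumes the usual definition (one point per pair of data items, original distance against map distance) and may differ from what VisuMap's tool displays; the five 3-D points and the "map" obtained by dropping the third coordinate are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical original data (3-D) and a 2-D "map" that simply drops z.
original = [(0, 0, 0.0), (1, 0, 0.1), (0, 2, 0.2), (3, 1, 0.0), (2, 2, 0.3)]
mapped = [(x, y) for x, y, _ in original]

d_orig = pairwise_distances(original)
d_map = pairwise_distances(mapped)

# Each (original distance, map distance) pair is one point of the diagram.
shepard_pairs = list(zip(d_orig, d_map))
r = pearson(d_orig, d_map)
print("number of pairs:", len(shepard_pairs))
print("correlation between original and map distances:", round(r, 4))
```

Points lying close to a straight (or at least monotonic) curve indicate that the map preserves the original distances well; a wide scatter indicates distortion.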

    The selection of the distance metric and, more generally, the preparation of the data (filtering, transformation, cleansing etc.) are very domain-specific tasks. Knowing the different families of distance (dissimilarity) metrics could offer some guidance when searching the literature.

  2. Yap Chun Wei Says:

    Thank you so much for the useful information. I will experiment more and read more literature.

    I have added the link to the dataset to the post (in the first paragraph). Ids 1 to 72 are cephalosporins, 73 to 111 are fluoroquinolones and 112 to 170 are penicillins.

    I have experimented with t-SNE. The results are pretty similar to those of the Sammon and SMACOF methods for the three distance metrics.

  3. James X. Li Says:

    Thanks for posting the dataset. After some examination we have found a defect in the implementation of the Dice dissimilarity metric that has been made available as a free plugin module for VisuMap.

    The defect has been corrected in the meantime, and the new version is ready to download from our web site. Using the corrected Dice metric, the RPM method produces maps similar to those obtained with the other metrics.

    It is not too surprising that the differences between those binary metrics are not apparent to the human eye. But those fine differences could be significant for automated search algorithms.

    The similarity metric as a whole depends on how you calculate the fingerprints, and that in turn depends on how you select the fragments used to produce the fingerprints. The process of finding an appropriate similarity (or dissimilarity) metric is domain and problem specific. VisuMap can also offer some help in this regard. Please see my blog (http://jamesxli.blogspot.com/2008/09/on-similarity-metrics-for-chemical.html) for more comments.
