I had previously tried out the mapping algorithms in the software VisuMap using my own dataset of 171 compounds, which can be separated into three congeneric groups: penicillins, cephalosporins and fluoroquinolones.
A recent comment by the author of VisuMap suggests that the performance of the mapping algorithms could be improved by carefully selecting the distance metric. Since my dataset uses fingerprints (1025 binary features), he suggested using the Jaccard or Dice distance metric.
So I did some more experiments.
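For readers unfamiliar with the two metrics, here is a minimal sketch of how Jaccard and Dice distances are computed on binary fingerprints. The function names and the two toy 5-bit fingerprints are made up for illustration; this is not how VisuMap implements them internally, which I have not checked.

```python
def on_bits(fp):
    """Return the set of positions where a binary fingerprint is 1."""
    return {i for i, bit in enumerate(fp) if bit}


def jaccard_distance(fp_a, fp_b):
    """1 - |A and B| / |A or B|, computed on the sets of on-bits."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here: two all-zero fingerprints are identical
    return 1.0 - len(a & b) / union


def dice_distance(fp_a, fp_b):
    """1 - 2|A and B| / (|A| + |B|), computed on the sets of on-bits."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # same convention as above
    return 1.0 - 2.0 * len(a & b) / total


# Hypothetical toy fingerprints (the real ones have 1025 bits).
fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]

d_j = jaccard_distance(fp_a, fp_b)
d_d = dice_distance(fp_a, fp_b)
print(d_j, d_d)
```

One thing worth noting is that the two distances are tied together by d_Dice = d_Jaccard / (2 - d_Jaccard), which is a monotonic relationship. Both metrics therefore rank the compound pairs in the same order and differ only in the magnitudes of the distances.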

Results from Sammon mapping using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from curvilinear component analysis (CCA) using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from relational perspective map (RPM) using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from SMACOF MDS using the Jaccard distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from Sammon mapping using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from curvilinear component analysis (CCA) using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from relational perspective map (RPM) using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.

Results from SMACOF MDS using the Dice distance metric. Yellow squares are cephalosporins, red circles are penicillins, blue triangles are fluoroquinolones.
From the figures, it can be seen that the Jaccard distance metric didn’t really improve the results (i.e. Sammon and MDS are still good, and CCA and RPM are still poor at separating the three groups). However, the Dice distance metric gives rather interesting results. CCA now shows very good separation of the three groups, whereas RPM failed rather terribly. Sammon and MDS still separate the fluoroquinolones well from the penicillins and cephalosporins, but now have difficulty cleanly separating the penicillins from the cephalosporins.
(It is important for me to reiterate here that the colours and shapes of the different groups were added manually to enhance the visual effect. Bear in mind that when you process a dataset with unknown groupings, every point will look the same. Thus the only way to differentiate groups is if there is an obvious separation band.)
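Since the group labels are known for this particular dataset, the visual impression from the figures could also be backed up with a number. Below is a rough sketch of one possible score: the mean distance between points of different groups divided by the mean distance between points of the same group, computed on the 2D map coordinates. The function name `separation_ratio` and the toy coordinates are my own invention, and this is only one of many ways to quantify separation (and of no help when the groupings are unknown).

```python
import math
from itertools import combinations


def separation_ratio(points, labels):
    """Mean between-group distance divided by mean within-group distance.

    Values well above 1 suggest the labelled groups sit apart on the map;
    values near or below 1 suggest they overlap.
    """
    within, between = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance on the map (Python 3.8+)
        if label_p == label_q:
            within.append(d)
        else:
            between.append(d)
    if not within or not between:
        raise ValueError("need at least two groups and one same-group pair")
    mean_within = sum(within) / len(within)
    if mean_within == 0:
        raise ValueError("all same-group points coincide")
    return (sum(between) / len(between)) / mean_within


# Hypothetical map coordinates, not taken from the figures above.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = ["penicillin", "penicillin", "fluoroquinolone", "fluoroquinolone"]
mixed_up = ["penicillin", "fluoroquinolone", "penicillin", "fluoroquinolone"]

ratio_good = separation_ratio(points, well_separated)
ratio_bad = separation_ratio(points, mixed_up)
print(ratio_good, ratio_bad)
```

Running each algorithm and metric combination, exporting the coordinates and comparing this ratio would at least make the "good" versus "poor" judgement less subjective.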
Thus the obvious conclusion is that both the mapping algorithm and the distance metric are important for separating different groups (nothing new here). The question then boils down to how to select the appropriate mapping algorithm and distance metric for a dataset. Is there some rule of thumb for the selection, or do we have to try different combinations manually? One clue from the author of VisuMap is that “Sammon map and PCA emphasize on the global inter-cluster structure, whereas other mapping algorithms (like the RPM and CCA) emphasize more on the details within clusters.” So does that mean we should use Sammon map or PCA to get a broad overview, then extract (or highlight) each cluster and use RPM or CCA to examine its internal structure? Also, are there any references which state what type of distance metric is appropriate for what type of features? These are some of the questions going through my mind now, and I guess it is time to do some literature searches. Does anyone have any comments or answers, or can anyone provide some useful references?