Archive for the ‘Data mining’ Category
Modern QSAR - Modeling methods
Monday, March 23rd, 2009
Modern QSAR - Descriptor
Friday, February 6th, 2009
Modern QSAR - Dataset
Wednesday, January 28th, 2009
OECD Principles For The Validation, For Regulatory Purposes, Of (Quantitative) Structure-Activity Relationship Models
Tuesday, December 16th, 2008
In 2004, OECD came up with 5 principles for QSAR models. They are:
- a defined endpoint
- an unambiguous algorithm
- a defined domain of applicability
- appropriate measures of goodness-of-fit, robustness and predictivity
- a mechanistic interpretation, if possible
If you are working on QSAR models, it will be good for you to know these principles and apply them in your work.
For more information on these principles, you can go to the OECD website.
Health Discovery Corporation holds the patents to SVM and RFE
Thursday, November 20th, 2008
While doing some literature search and reading, I discovered that SVM and RFE are actually patented technologies. I am not sure what the implications of this are for researchers, but I don’t like the sound of it. Maybe it is time to look into other new machine learning technologies and hold off using SVM for the next 20 years.
Y-randomization in KNIME
Monday, November 10th, 2008
Previously, I wrote about how to perform y-randomization in Rapidminer. You can use the same basic concepts to do y-randomization in KNIME. Unlike the previous post, where I detailed the steps for an entire y-randomization experiment, in this post I will only show how to perform a single y-randomization on a dataset. Below is the basic workflow.
“Column Filter” is used to remove all variables except the label. This is then passed to “Shuffle” to randomize the labels. An increasing row id number is then added to this randomized label dataset and the original dataset using “Math Formula”.
“Row ID” is then used to replace the original row ids in both original and randomized label dataset with the newly created row id.
Finally, “Joiner” is used to merge the two datasets together, creating a randomized dataset.
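Outside of KNIME, the same single y-randomization can be sketched in a few lines of Python. This is only an illustration of the idea (keep the descriptors in place, shuffle the label column, put the shuffled labels back row by row), not the KNIME workflow itself. The function name `y_randomize`, the column names and the four-row toy dataset are all made up for the example.

```python
import random


def y_randomize(rows, label_key, seed=None):
    """Return a copy of rows in which only the label column is randomly permuted.

    rows is a list of dicts (one dict per example); the descriptor values of
    each row are left untouched and the input list is not modified.
    """
    rng = random.Random(seed)
    labels = [row[label_key] for row in rows]  # like "Column Filter": keep the label only
    rng.shuffle(labels)                        # like "Shuffle": randomize the labels
    # Like "Joiner": pair row i of the original with shuffled label i.
    return [{**row, label_key: new_label} for row, new_label in zip(rows, labels)]


# Hypothetical toy dataset, for illustration only.
dataset = [
    {"x1": 0.1, "x2": 1.0, "label": "active"},
    {"x1": 0.4, "x2": 0.8, "label": "inactive"},
    {"x1": 0.9, "x2": 0.2, "label": "active"},
    {"x1": 0.7, "x2": 0.5, "label": "inactive"},
]
randomized = y_randomize(dataset, "label", seed=42)
for row in randomized:
    print(row)
```

Because only the label column is permuted, the class distribution of the randomized dataset is identical to that of the original; only the link between descriptors and labels is broken.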
Y-randomization in Rapidminer
Tuesday, November 4th, 2008
I had mentioned Y-randomization as one of the methods for checking whether a prediction model is overfitted. Recently, someone asked me about the Y-randomization implemented in my software, PHAKISO. PHAKISO was created during my PhD studies and, unfortunately, it does not have an automated way to repeat the Y-randomization experiment n times. I had always used the associated library, YMLL, to write a simple program to do the job and thus did not implement such a feature in PHAKISO.
Since I am using Rapidminer for my research now, I thought it would be easy to create a Y-randomization process in it. Unfortunately, Rapidminer did not have a Y-randomization operator. However, through the solutions provided by the helpful moderators in the Rapidminer forum, I finally worked out how to do it in Rapidminer and, in the process, learnt more about Rapidminer.
The basic process is shown in the following figure. 
The basic idea in Y-randomization is to randomize the labels of the dataset. In Rapidminer, you load a dataset, create a copy of it and remove all attributes except the label from the copy. Then randomly permute the examples in the copy and tag each example with a unique id. Next, select the original dataset, tag its examples with a unique id in the same way, and join the original dataset with its copy, using the id as the join key. If this is done correctly, the labels in the original dataset are skipped during the join and the permuted labels from the copy are used in the joined dataset.
To perform the entire Y-randomization experiment automatically, enclose the Y-randomization portion in the IteratingPerformanceAverage operator chain and add a validation procedure after the Y-randomization portion, as shown in the figure.
The complete XML process is as follows:
<operator name="Root" class="Process" expanded="yes">
  <parameter key="random_seed" value="-1"/>
  <operator name="CSVExampleSource" class="CSVExampleSource">
  </operator>
  <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
    <parameter key="iterations" value="100"/>
    <operator name="IOMultiplier" class="IOMultiplier">
      <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
      <parameter key="attribute_name_regex" value="label"/>
      <parameter key="condition_class" value="attribute_name_filter"/>
      <parameter key="keep_subset_only" value="true"/>
      <operator name="Permutation" class="Permutation">
      </operator>
      <operator name="IdTagging" class="IdTagging">
      </operator>
    </operator>
    <operator name="IOSelector" class="IOSelector">
      <parameter key="io_object" value="ExampleSet"/>
      <parameter key="select_which" value="2"/>
    </operator>
    <operator name="IdTagging (2)" class="IdTagging">
    </operator>
    <operator name="ExampleSetJoin" class="ExampleSetJoin">
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
      <parameter key="leave_one_out" value="true"/>
      <operator name="NearestNeighbors" class="NearestNeighbors">
        <parameter key="k" value="3"/>
      </operator>
      <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="ModelApplier" class="ModelApplier">
          <list key="application_parameters">
          </list>
        </operator>
        <operator name="ClassificationPerformance" class="ClassificationPerformance">
          <parameter key="accuracy" value="true"/>
          <list key="class_weights">
          </list>
        </operator>
      </operator>
    </operator>
  </operator>
</operator>
Just enter your dataset file in CSVExampleSource and change the method in XValidation from NearestNeighbors to your desired modeling method.
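For readers who do not use Rapidminer, the logic of the whole experiment can be sketched in plain Python. The sketch below assumes the same setup as the process above (100 iterations, leave-one-out validation, a 3-nearest-neighbour classifier, accuracy as the performance measure), but it is a hand-written illustration and not a translation of the Rapidminer operators. The function names and the toy data are invented for the example.

```python
import random


def knn_predict(train, query, k=3):
    """Majority vote among the k training examples nearest to query.

    train is a list of (features, label) pairs; distance is squared Euclidean.
    """
    nearest = sorted(
        train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query))
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy of the k-NN classifier on (X, y)."""
    data = list(zip(X, y))
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every example except the i-th
        if knn_predict(train, features, k) == label:
            correct += 1
    return correct / len(data)


def y_randomization(X, y, iterations=100, k=3, seed=0):
    """Average leave-one-out accuracy over models built on shuffled labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(y)
        rng.shuffle(shuffled)  # break the descriptor-label relationship
        scores.append(loo_accuracy(X, shuffled, k))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of four points each.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]
y = ["a"] * 4 + ["b"] * 4

true_accuracy = loo_accuracy(X, y)
randomized_accuracy = y_randomization(X, y)
print("accuracy with true labels:    ", true_accuracy)
print("mean accuracy, shuffled labels:", randomized_accuracy)
```

If the model built on the true labels performs clearly better than the average over the shuffled-label models, that suggests the model has learnt a real relationship rather than fitting noise. If the two numbers are close, the original model is suspect.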
VisuMap - Part 2
Friday, August 29th, 2008
I had previously tried out the mapping algorithms in the software VisuMap using my own dataset of 171 compounds, which can be separated into three congeneric groups of compounds: penicillins, cephalosporins and fluoroquinolones.
A recent comment by the author of VisuMap suggests that the performance of the mapping algorithms could be improved by carefully selecting the appropriate distance metric. Since my dataset was using fingerprints (1025 binary features), he suggested using the Jaccard or Dice distance metric.
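As a reminder of what these two metrics compute on binary fingerprints, here is a small Python sketch. It uses the standard textbook definitions (Jaccard distance = 1 - |A∩B| / |A∪B|, Dice distance = 1 - 2|A∩B| / (|A| + |B|)). I am assuming VisuMap uses the same definitions, which I have not checked. The two 8-bit fingerprints are invented for the example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    return 0.0 if either == 0 else 1.0 - both / either


def dice_distance(a, b):
    """Dice distance between two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)                           # bits set in a plus bits set in b
    return 0.0 if total == 0 else 1.0 - 2.0 * both / total


# Hypothetical 8-bit fingerprints: 3 bits in common, 5 bits set in at least one.
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1, 0, 1, 0]

d_jaccard = jaccard_distance(fp1, fp2)  # 1 - 3/5
d_dice = dice_distance(fp1, fp2)        # 1 - 6/8
print(round(d_jaccard, 3), round(d_dice, 3))
```

For this pair the Jaccard distance is 0.4 and the Dice distance is 0.25. With these definitions the two are tied together by d_Dice = d_Jaccard / (2 - d_Jaccard), so they always rank pairs of compounds in the same order and differ only in the magnitudes of the distances.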
So I did some more experiments.

Results from Sammon mapping using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from Curvilinear component analysis (CCA) using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from Relational perspective map (RPM) using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from SMACOF MDS using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from Sammon mapping using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from Curvilinear component analysis (CCA) using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from Relational perspective map (RPM) using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

Results from SMACOF MDS using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones
From the figures, it can be seen that the Jaccard distance metric didn’t really improve the results (i.e. Sammon and MDS are still good, and CCA and RPM are still poor at separating the three groups). However, using the Dice distance metric gives rather interesting results. CCA now shows very good separation of the three groups, whereas RPM failed rather terribly. Sammon and MDS have good separation between the penicillins/cephalosporins and the fluoroquinolones but now have difficulty in cleanly separating the penicillins from the cephalosporins. (It is important for me to reiterate here that the colours and shapes of the different groups were added manually to enhance the visual effect. Bear in mind that when you process a dataset with unknown groupings, every point will look the same. Thus the only way to differentiate groups is if there is an obvious separation band.)
Thus the obvious conclusion is that both the mapping algorithm and the distance metric are important for separating different groups (nothing new here). The question then boils down to how to select the appropriate mapping algorithm and distance metric for a dataset. Is there some rule of thumb for the selection, or do we have to try different combinations manually? One clue from the author of VisuMap is that “Sammon map and PCA emphasize on the global inter-cluster structure, whereas other mapping algorithms (like the RPM and CCA) emphasize more on the details within clusters.” So does that mean that we should use Sammon map or PCA to get a broad overview, then extract (or highlight) each cluster and use RPM or CCA to examine the structure within it? Also, are there any references which state what type of distance metric is appropriate for what type of features? These are some of the questions on my mind now, and I guess it is time to do some literature searches. Does anyone have any comments or answers, or can anyone provide some useful references?
Knowledge Discovery and Data Mining Process Model
Tuesday, August 12th, 2008
I recently read an article (Kurgan et al. 2006) reviewing several commonly used process models for knowledge discovery and data mining. The number of steps in these models ranges from 5 to 9, but the actual process is pretty similar among the models. Well, I guess you can’t deviate too much if you wish to do knowledge discovery properly.
Among the various models presented, I particularly like the Generic model, which pools and summarizes the important points from the reviewed models. The Generic model borrows heavily from a model proposed by Cios et al. in 2000. The steps in the Generic model are:
- Application domain understanding
- Data understanding
- Data preparation and identification of data mining technology
- Data mining
- Evaluation
- Knowledge consolidation and deployment
I am sure these steps are nothing new and will be familiar to those involved in knowledge discovery and data mining. However, if you have not been following any particular model, this might serve as a good reference to show that the methods you have been using have already been validated by others.
References
- Cios KJ, Teresinska A, Konieczna S, Potocka J, Sharma S. A knowledge discovery approach to diagnosing myocardial perfusion. Engineering in Medicine and Biology Magazine, IEEE. 2000;19(4):17-25.
- Kurgan LA, Musilek P. A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. 2006;21(01):1-24.







