Archive for March, 2008

RapidMiner - Part III: Descriptor scaling

Monday, March 31st, 2008

Scale the training set

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add Normalization operator (Preprocessing) to Root.
    • Configure it by checking the return_preprocessing_model checkbox.
    • Uncheck the z_transform checkbox.
  3. Add ModelWriter operator (IO->Models) to Root and configure it to save the model to a file.
  4. Run.

Scale the testing set

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a testing set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add ModelLoader operator (IO->Models) to Root and configure it to load the model that is saved during the scaling of the training set.
  3. Add ModelApplier operator to Root.
  4. Run.

RapidMiner scores full marks for the ease with which it scales descriptors.
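Outside any particular tool, the two procedures above amount to learning the scaling parameters from the training set only, saving them, and reusing them on the testing set. Here is a minimal Python sketch of that idea. It assumes range (min-max) scaling to 0..1, and it uses a JSON file as a stand-in for the saved preprocessing model; the function names and the file name are my own and have nothing to do with RapidMiner's internals.

```python
import json
import os
import tempfile


def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, params):
            span = hi - lo
            # A constant column has no range; map it to 0.0 instead of dividing by zero.
            new_row.append((value - lo) / span if span else 0.0)
        scaled.append(new_row)
    return scaled


# Toy descriptor values (illustrative only).
train = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
test = [[2.5, 40.0]]

# "Scale the training set": learn the parameters and save them to a file.
params = fit_min_max(train)
model_path = os.path.join(tempfile.gettempdir(), "scaling_model.json")
with open(model_path, "w") as handle:
    json.dump(params, handle)

# "Scale the testing set": load the saved parameters and apply them.
with open(model_path) as handle:
    loaded = [tuple(pair) for pair in json.load(handle)]

scaled_train = apply_min_max(train, loaded)
scaled_test = apply_min_max(test, loaded)
```

Note that a testing set value can fall outside 0..1 when it lies outside the range seen in the training set. That is expected, because the parameters are never re-fitted on the testing set.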

RapidMiner - Part II: Partitioning of dataset into training and testing sets

Sunday, March 30th, 2008

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a dataset from a file.
    • Set the value of the label_attribute to the class column.
  2. Add SimpleValidation operator (Validation) to Root and configure it by setting the split_ratio value to 0.8.
    • Add OperatorChain operator to SimpleValidation operator.
      • Add ArffExampleSetWriter operator (IO->Examples) to the current OperatorChain operator and configure it to save the training set to a file.
      • Add NearestNeighbors operator (Learner->Supervised->Lazy) to the current OperatorChain operator.
    • Add another OperatorChain operator to SimpleValidation operator.
      • Add ArffExampleSetWriter operator (IO->Examples) to the current OperatorChain operator and configure it to save the testing set to a file.
      • Add ModelApplier operator to the current OperatorChain operator.
      • Add Performance operator (Validation) to the current OperatorChain operator.
  3. Run.

In RapidMiner, partitioning a dataset must be accompanied by learning a model from the newly created training set and evaluating that model on the newly created testing set. There are no operators that simply partition a dataset into a training set and a testing set without model building and evaluation. Although RapidMiner has a lot of operators, its selection of operators for partitioning datasets seems rather limited. For example, it does not offer other partitioning algorithms, such as the Kennard and Stone algorithm.
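For readers who have not come across the Kennard and Stone algorithm, here is a minimal Python sketch of how I understand it: start with the two samples that are farthest apart, then keep adding the sample whose nearest already-selected neighbour is farthest away. The selected samples form the training set and the rest form the testing set. The sketch assumes Euclidean distance on descriptors that have already been scaled, and the function name and toy data are my own.

```python
import math


def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")

    # Full pairwise distance matrix (fine for small datasets).
    dist = [[math.dist(a, b) for b in points] for a in points]

    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Repeatedly add the point whose nearest selected neighbour is farthest away.
    while len(selected) < n_train:
        best = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy descriptor matrix (illustrative only).
descriptors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.5, 0.5], [0.9, 1.0]]
train_idx = kennard_stone(descriptors, 3)
test_idx = [i for i in range(len(descriptors)) if i not in train_idx]
```

On this toy data the two extreme points and the central point end up in the training set, while the two near-duplicates go to the testing set.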

RapidMiner - Part I: Overview

Thursday, March 27th, 2008

RapidMiner (version Free 4.1beta2, licensed under GPL version 3)

According to its official website, RapidMiner “is the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks”. RapidMiner “is available in different flavours: the open-source version licensed under the GPL which can be used by everyone for free, another free version including an improved graphical user interface, and a proprietary version which can be used by commercial developers where the open-source license does not suit their needs”.

If you install the current version of RapidMiner, with all its optional plugins, you will have a total of 436 (+127 Weka’s) operators, with the following distribution:

  • Core: 9
  • IO: 75
  • Learner: 65 (+119 Weka’s)
  • Meta: 21
  • OLAP: 3
  • Other: 11
  • Postprocessing: 6
  • Preprocessing: 205 (+8 Weka’s)
  • Validation: 31
  • Visualization: 10

This is an enormous collection of operators. However, since I am interested in using it for QSAR experiments, I will only examine those operators that are relevant. RapidMiner can read data from quite a number of sources, e.g. ARFF, csv, Excel, sparse format, SPSS, and databases, so most users should not have any problems opening their existing data files in RapidMiner.

RapidMiner has a few descriptor selection operators. However, it does not seem to have some common ones like stepwise regression. Also, the filter and wrapper descriptor selection methods seem to be mixed together. I will explore this in more detail when I start the testing proper.

Currently, RapidMiner contains 11 (+21 Weka’s) algorithms for developing regression models and 30 (+69 Weka’s) algorithms for constructing classification models.

RapidMiner contains a few validation methods like cross-validation, boosting and bagging.

Overall, my first impression of RapidMiner is that it has a very comprehensive set of tools for a full QSAR experiment. However, the large number of tools can make it difficult for an inexperienced data miner to decide which to use. Also, the learning curve for the software seems to be quite steep, as it is not easy to visualize the experimental workflow using the current graphical user interface. Fortunately, RapidMiner has a useful wizard that automatically constructs experimental workflows for a few common data mining scenarios. In addition, it has a comprehensive set of sample files that can help users learn how to construct the experimental workflow for different parts of the data mining process.

KNIME - Part VI: Model validation using cross-validation and/or independent validation set

Tuesday, March 25th, 2008

The previous post already provides the steps for model validation using cross-validation. So how do we validate a model using an independent validation set?

Develop a model

  1. Put File Reader node (IO->Read) to workbench and configure it to load a training set from a file.
  2. Put SVM Learner node (Mining->SVM) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by selecting the correct column for the Class column listbox.
    • Set the kernel and parameter values to the optimum values that have been determined by the parameter optimization procedure.
  3. Put Model Writer node (IO->Write) to workbench and connect the model port of the SVM Learner node to its input port. Configure it to save the model to a file.
  4. Execute all nodes.

Validate model on an independent validation set

  1. Put File Reader node (IO->Read) to workbench and configure it to load an independent validation set from a file.
  2. Put SVM Predictor node (Mining->SVM) to workbench and connect the output port from the File Reader node to its test data input port.
  3. Put Model Reader node (IO->Read) to workbench and connect its output port to the model port of the SVM Predictor node. Configure it to load the model that is saved in the model development phase.
  4. Put Cross validation node (Meta) to workbench and open the Cross validation node Meta-workflow editor.
    • Copy and paste the Aggregator node to the workbench.
    • Exit the Meta-workflow editor and delete the Cross validation node from the workbench.
    • Connect the output port from the SVM Predictor node to the input port of the Aggregator node.
  5. Execute all nodes.

It can be seen that KNIME is able to validate a model using either cross-validation or an independent validation set. However, the number of available error measures is rather limited. For example, for classification problems, it does not report sensitivity or specificity, and for regression problems, it does not report r² or mean squared error.
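If you export the predictions, the missing measures are easy to compute yourself. Here is a minimal Python sketch. It assumes a two-class problem for sensitivity and specificity, and it takes r² to be the coefficient of determination (1 minus the ratio of residual to total sum of squares), which is only one of several definitions in use. The function names and the toy values are my own.

```python
def sensitivity_specificity(actual, predicted, positive=1):
    """Return (sensitivity, specificity) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    # Assumes both classes occur in `actual`; otherwise a denominator is zero.
    return tp / (tp + fn), tn / (tn + fp)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy classification predictions (illustrative only).
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])

# Toy regression predictions (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```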

KNIME - Part V: Parameter optimization of machine learning/statistical methods

Sunday, March 23rd, 2008

  1. Put File Reader node (IO->Read) to workbench and configure it to load a training set from a file.
  2. Put Cross validation node (Meta) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by setting Number of validations to 10.
    • Ensure the Random sampling box is checked.
    • Select the correct column for the Column with class labels listbox.
  3. Open the Cross validation node Meta-workflow editor.
    • Put K Nearest Neighbor node (Mining->Misc Classifiers) on the editor and connect the training data and test data output ports of the X-Partitioner node to the training data and test data input ports of the K Nearest Neighbor node, respectively.
      • Configure it by setting the Number of neighbours to consider (k) to 3.
  4. Exit the Meta-workflow editor and put Statistics View node (Statistics) to workbench and connect the error rates port from the Cross validation node to its input port.
  5. Execute all nodes.

The above procedure shows how KNIME can be used to train a model and assess its performance. However, KNIME cannot automatically determine the optimum parameter value (e.g. Number of neighbours to consider (k) in the above procedure) for a machine learning/statistical method; according to their forum, this feature may be available in version 2.0. To determine the optimum parameter value, you have to do it manually: set a parameter value, run all the nodes, record the mean error rate given in the Statistics View node, set another parameter value, run all the nodes again, record the mean error rate, and so on, until you have evaluated all the parameter values that you are interested in. The parameter value that gives the lowest mean error rate is then the optimum parameter value of the machine learning/statistical method for the training set.
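The manual loop described above is essentially a grid search. Here is a minimal Python sketch of what an automated version could look like for k nearest neighbours with 10-fold cross-validation. Everything in it (the function names, the toy data, the candidate values of k) is my own illustration and not something KNIME provides.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def cv_error_rate(xs, ys, k, n_folds=10, seed=0):
    """Mean error rate of k-NN over n_folds randomly assigned folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    rates = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        errors = sum(knn_predict(train_x, train_y, xs[i], k) != ys[i] for i in fold)
        rates.append(errors / len(fold))
    return sum(rates) / len(rates)


# Toy data: two classes drawn around different centres (illustrative only).
rng = random.Random(42)
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
xs += [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(30)]
ys = [0] * 30 + [1] * 30

# Evaluate every candidate k and keep the one with the lowest mean error rate.
results = {k: cv_error_rate(xs, ys, k) for k in (1, 3, 5, 7)}
best_k = min(results, key=results.get)
```

The same seed is used for every value of k, so all candidates are compared on identical folds, which is what you get in the manual procedure only if the random sampling happens to be repeatable.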

KNIME - Part IV: Descriptor selection

Friday, March 21st, 2008

KNIME does not seem to have any of the common descriptor selection methods, such as forward selection, backward elimination, stepwise regression, or genetic algorithms.
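To make clear what kind of functionality is missing, here is a minimal Python sketch of forward selection used as a wrapper method: starting from an empty set, it repeatedly adds the descriptor that most improves a score and stops when nothing improves it. The score used here (leave-one-out accuracy of a 1-nearest-neighbour classifier) and all the names and toy data are my own assumptions, chosen only to keep the example self-contained.

```python
import math


def forward_selection(n_features, score):
    """Greedily add the feature that most improves score(feature_subset)."""
    selected = []
    best_score = float("-inf")
    candidates = list(range(n_features))
    while candidates:
        trial_scores = {f: score(selected + [f]) for f in candidates}
        best_feature = max(candidates, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # no remaining feature improves the score
        selected.append(best_feature)
        best_score = trial_scores[best_feature]
        candidates.remove(best_feature)
    return selected, best_score


def loo_accuracy(xs, ys, features):
    """Leave-one-out accuracy of 1-NN using only the given feature columns."""
    def project(row):
        return [row[f] for f in features]

    correct = 0
    for i, query in enumerate(xs):
        others = [j for j in range(len(xs)) if j != i]
        nearest = min(others, key=lambda j: math.dist(project(xs[j]), project(query)))
        correct += ys[nearest] == ys[i]
    return correct / len(xs)


# Toy data: column 0 separates the classes, column 1 is misleading noise.
xs = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 0.9], [1.2, 9.1]]
ys = [0, 0, 0, 1, 1, 1]

selected, best = forward_selection(2, lambda feats: loo_accuracy(xs, ys, feats))
```

On this toy data only the first descriptor is kept, because adding the second one lowers the leave-one-out accuracy.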

KNIME - Part III: Descriptor scaling

Wednesday, March 19th, 2008

Scale the training set

  1. Put File Reader node (IO->Read) to workbench and configure it to load a training set from a file.
  2. Put Normalizer node (Data Manipulation->Column) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by choosing Min-Max Normalization with the Min set as 0.0 and Max set as 1.0.
    • Select the columns to normalize.
  3. Put Model Writer node (IO->Write) to workbench and connect the model port of the Normalizer node to its input port. Configure it to save the model to a file.
  4. Execute all nodes.

Scale the testing set

  1. Put File Reader node (IO->Read) to workbench and configure it to load a testing set from a file.
  2. Put Normalizer (Apply) node (Data Manipulation->Column) to workbench and connect the output port from the File Reader node to its input port.
  3. Put Model Reader node (IO->Read) to workbench and connect its output port to the model port of the Normalizer (Apply) node. Configure it to load the model that is saved during the scaling of the training set.
  4. Execute all nodes.

KNIME scores full marks for the ease with which it scales descriptors.

KNIME - Part II: Partitioning of dataset into training and testing sets

Monday, March 17th, 2008

  1. Put File Reader node (IO->Read) to workbench and configure it to load a dataset from a file.
  2. Put Partitioning node (Data Manipulation->Row) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by choosing Relative and setting it at 80%.
    • Ensure the Draw randomly box is checked.
  3. Put two CSV Writer nodes (IO->Write) on the workbench. Connect the first output port from the Partitioning node to the input port of the first CSV Writer and configure it to save the first set (which is the training set) to a file. Connect the second output port from the Partitioning node to the input port of the second CSV Writer and configure it to save the second set (which is the testing set) to a file.
  4. Execute all nodes.

As can be seen from the above procedure, it is very easy to partition a dataset randomly into a training set and a testing set. However, KNIME does not seem to offer other partitioning algorithms, such as the Kennard and Stone algorithm.
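For comparison, the random 80% partition performed by the procedure above can be sketched in a few lines of Python. This is only an illustration of the idea; the function name, the seed, and the toy dataset are my own, and I do not know how KNIME draws its random sample internally.

```python
import random


def random_partition(rows, train_fraction=0.8, seed=None):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy dataset of ten rows (illustrative only).
dataset = [[i, i * 2] for i in range(10)]
training, testing = random_partition(dataset, 0.8, seed=1)
```

Fixing the seed makes the partition repeatable, which is useful when the same training and testing sets have to be regenerated later.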

KNIME - Part I: Overview

Saturday, March 15th, 2008

KNIME - Konstanz Information Miner (version 1.3.3)

From their official website, “KNIME is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models”. KNIME uses a non-profit open source license which “allows KNIME to be downloaded, distributed, and used freely as long as the software or its use is not distributed per profit”.

If you install the current version of KNIME, with all its optional plugins, you will have a total of 189 nodes, with the following distribution:

  • IO: 11
  • Database: 2
  • Data manipulation: 36
  • Data views: 21
  • Statistics: 4
  • Machines: 28
  • Chemistry: 22
  • Meta: 7
  • Misc: 3
  • Weka: 47
  • Python: 3
  • R: 4
  • Reporting: 2

However, since I am interested in using it for QSAR experiments, I will only examine those nodes that are relevant. KNIME can only read data from three sources: ARFF files (which are Weka files), text-delimited files (which include csv files), and databases. There are no nodes for reading SVMlight, LIBSVM, or Microsoft Excel files. The lack of support for Microsoft Excel files is no big deal since you can easily convert them to csv format using Microsoft Excel. However, the lack of support for SVMlight and LIBSVM files will inconvenience users who are already using these two popular support vector machine packages.
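One workaround for the missing SVMlight/LIBSVM support is to convert such files to csv before loading them. Here is a minimal Python sketch of such a converter. It assumes the plain sparse layout "label index:value index:value ..." with 1-based indices, treats absent features as zero, and does not handle SVMlight's optional qid field. The function name, the column names, and the sample lines are my own.

```python
import csv
import io


def libsvm_to_rows(lines, n_features=None):
    """Parse sparse "label index:value ..." lines into dense rows."""
    parsed = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        label, *pairs = line.split()
        values = {}
        for pair in pairs:
            index, value = pair.split(":", 1)
            values[int(index)] = float(value)
        parsed.append((label, values))

    # Use the largest index seen unless the caller fixes the width.
    width = n_features or max((max(v) for _, v in parsed if v), default=0)
    # Indices are 1-based; features absent from a line are written as 0.0.
    return [
        [label] + [values.get(i, 0.0) for i in range(1, width + 1)]
        for label, values in parsed
    ]


# Two sample lines in sparse format (illustrative only).
sample = ["+1 1:0.5 3:2.0", "-1 2:1.5"]
rows = libsvm_to_rows(sample)

# Write the dense rows as csv text; a real script would write to a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["class", "x1", "x2", "x3"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```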

At first sight, KNIME does not seem to have any descriptor selection capability. I will explore this in more detail when I start the testing proper.

Currently, KNIME contains 3 algorithms for developing regression models and 8 algorithms for constructing classification models. I did not count those algorithms that are under the Weka branch because those algorithms are just wrappers over algorithms that are present in Weka and do not have the ability to load and save developed models.

KNIME contains a Cross validation meta-node. Though the website states that it also has boosting and bagging nodes, they were not present in the downloadable version.

Overall, my first impression of KNIME is that it has a very good graphical user interface and seems easy to use. However, it may not contain sufficient tools for a full QSAR experiment.

Data mining tools comparison methodology

Thursday, March 13th, 2008

As mentioned in my previous post, I will explore Weka, RapidMiner, and KNIME in more detail. A reader has suggested that I also look at TANAGRA, so I will try to give a comparison between these four tools. However, I will not be doing the usual comparison (i.e. side by side comparison) and I will not be going into all the features of these tools. Instead, I will gauge the ease with which each tool can be used for QSAR experiments. I will evaluate the tools using a few procedures that are widely used in QSAR experiments. These procedures have been described in my previous posts and are:

  1. Partitioning of dataset into training and testing sets.
  2. Descriptor scaling.
  3. Descriptor selection.
  4. Parameter optimization of machine learning/statistical methods.
  5. Model validation using cross-validation and/or independent validation set.

As I am not very familiar with these tools, my comments on these tools will be highly subjective.

