Archive for the ‘Review’ Category

RapidMiner - Part VI: Model validation using cross-validation and/or independent validation set

Sunday, April 6th, 2008

Model validation using cross-validation

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add XValidation operator (Validation) to Root and configure it by setting the number_of_validations to 10.
    • Add LibSVMLearner operator (Learner->Supervised->Functions) to XValidation operator.
    • Add OperatorChain operator to XValidation operator.
      • Add ModelApplier operator to OperatorChain operator.
      • Add Performance operator (Validation) to OperatorChain operator.
  3. Run.
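
For readers who want to see what the XValidation operator is doing conceptually, here is a minimal sketch of 10-fold cross-validation in plain Python. It is not RapidMiner's implementation: the function names are my own, and the majority-class learner is a hypothetical stand-in for LibSVMLearner. It assumes there are at least as many examples as folds.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle example indices and deal them into k folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(train, predict, X, y, k=10, seed=0):
    """Return the mean accuracy over k train/test rounds."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Learn on k-1 folds, then apply the model to the held-out fold.
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Hypothetical stand-in learner: always predicts the majority training class.
def train_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[i] for i in range(20)]
y = ["a"] * 15 + ["b"] * 5
mean_accuracy = cross_validate(train_majority, predict_majority, X, y, k=10)
```

The learner and the model applier are passed in as functions, which mirrors how the learner and the ModelApplier/Performance chain are nested inside the XValidation operator.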

Develop a model

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add LibSVMLearner operator (Learner->Supervised->Functions) to Root.
  3. Add ModelWriter operator (IO->Models) to Root and configure it to save the model to a file.
  4. Run.

Model validation using independent validation set

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load an independent validation set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add ModelLoader operator (IO->Models) to Root and configure it to load the model that is saved in the model development phase.
  3. Add ModelApplier operator to Root.
  4. Add Performance operator (Validation) to Root.
  5. Run.
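
The two phases above (write the model to a file, then load it and score it on separate data) can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the nearest-centroid learner, the toy data and the file name model.pkl are hypothetical, and pickle stands in for ModelWriter/ModelLoader.

```python
import os
import pickle
import tempfile

def train_nearest_centroid(X, y):
    """Hypothetical stand-in learner: store the mean vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist(model[label]))

def accuracy(model, X, y):
    return sum(predict(model, row) == label for row, label in zip(X, y)) / len(y)

# Development phase: train on the training set and write the model to a file.
X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
y_train = ["inactive", "inactive", "active", "active"]
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as handle:
    pickle.dump(train_nearest_centroid(X_train, y_train), handle)

# Validation phase: load the saved model and score it on independent data.
X_valid = [[0.1, 0.2], [0.9, 0.8], [0.7, 0.9]]
y_valid = ["inactive", "active", "active"]
with open(model_path, "rb") as handle:
    loaded_model = pickle.load(handle)
validation_accuracy = accuracy(loaded_model, X_valid, y_valid)
```

The important point is that the validation data never touches the training step; only the saved model crosses between the two phases.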

RapidMiner scores full marks again for its ease of model validation.

RapidMiner - Part V: Parameter optimization of machine learning/statistical methods

Friday, April 4th, 2008

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add GridParameterOptimization operator (Meta->Parameter) to Root.
    • Add XValidation operator (Validation) to GridParameterOptimization operator.
      • Add LibSVMLearner operator (Learner->Supervised->Functions) to XValidation operator.
      • Add OperatorChain operator to XValidation operator.
        • Add ModelApplier operator to OperatorChain operator.
        • Add Performance operator (Validation) to OperatorChain operator.
    • Add ProcessLog operator to GridParameterOptimization operator.
      • Configure the log by editing the Edit List.
        • Click the Add button and enter C for the log portion. Choose LibSVMLearner, parameter, C for the column_name.
        • Click the Add button and enter gamma for the log portion. Choose LibSVMLearner, parameter, gamma for the column_name.
        • Click the Add button and enter performance for the log portion. Choose XValidation, value, performance for the column_name.
    • Configure the GridParameterOptimization operator by editing the Edit List.
      • Select LibSVMLearner, C for parameters and enter 50,100,150,200,250 for values.
      • Select LibSVMLearner, gamma for parameters and enter 0.0001,0.001,0.01,0.1 for values.
  3. Add ParameterSetWriter operator (IO->Other) to Root and configure it to save the parameters to a file.
  4. Add GnuplotWriter operator (IO->Other) to Root.
    • Configure it to save the plot to a file.
    • Set name as Log.
    • Set x_axis as C.
    • Set y_axis as gamma.
    • Set values as performance.
  5. Run.
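
Conceptually, the grid optimization above evaluates every combination of C and gamma (5 x 4 = 20 runs) and keeps the best one. Here is a rough sketch in plain Python. The function names are my own, and toy_score is a hypothetical placeholder for "train the learner with these parameters and return its cross-validated performance".

```python
from itertools import product

def grid_search(evaluate, grid):
    """Evaluate every parameter combination.

    Returns (best_params, best_score, log), where log holds one
    (params, score) entry per combination, like the ProcessLog table.
    """
    names = list(grid)
    log = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        log.append((params, evaluate(**params)))
    best_params, best_score = max(log, key=lambda entry: entry[1])
    return best_params, best_score, log

# The grid from the steps above.
grid = {
    "C": [50, 100, 150, 200, 250],
    "gamma": [0.0001, 0.001, 0.01, 0.1],
}

# Hypothetical placeholder score that peaks at C=150, gamma=0.01.
def toy_score(C, gamma):
    return -abs(C - 150) / 100 - abs(gamma - 0.01)

best_params, best_score, log = grid_search(toy_score, grid)
```

The cost grows multiplicatively with each extra parameter, which is worth keeping in mind before adding a third parameter to the grid.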

It is easy for RapidMiner to perform parameter optimization. Full marks for this part.

RapidMiner - Part IV: Descriptor selection

Wednesday, April 2nd, 2008

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add FeatureSelection operator (Preprocessing->Attributes->Selection) to Root.
    • Add XValidation operator (Validation) to FeatureSelection operator.
      • Add NearestNeighbors operator (Learner->Supervised->Lazy) to XValidation operator.
      • Add OperatorChain operator to XValidation operator.
        • Add ModelApplier operator to OperatorChain operator.
        • Add Performance operator (Validation) to OperatorChain operator.
    • Add ProcessLog operator to FeatureSelection operator.
      • Configure the log by editing the Edit List.
        • Click the Add button and enter generation for the log portion. Choose FeatureSelection, value, generation for the column_name.
        • Click the Add button and enter performance for the log portion. Choose FeatureSelection, value, performance for the column_name.
  3. Run.
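
To make the wrapper idea concrete, here is a minimal sketch of greedy forward selection in plain Python, assuming the FeatureSelection operator is run in its forward-selection mode. It is not RapidMiner's code: the function names are mine, and toy_score is a hypothetical placeholder for the cross-validated performance of a learner trained on the chosen descriptors.

```python
def forward_selection(n_features, evaluate):
    """Greedily add the feature that most improves evaluate(subset).

    Returns (selected feature indices, best score, log of (generation, score)).
    """
    selected, best_score, log = [], float("-inf"), []
    remaining = list(range(n_features))
    generation = 0
    while remaining:
        # Score every subset formed by adding one unused feature.
        score, feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break  # no single addition improves performance, so stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
        generation += 1
        log.append((generation, best_score))
    return selected, best_score, log

# Hypothetical placeholder score: features 0 and 2 are informative, and
# every extra feature costs a small penalty.
def toy_score(subset):
    useful = {0: 0.5, 2: 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, best_score, log = forward_selection(4, toy_score)
```

The log of (generation, performance) pairs corresponds to the two columns recorded by the ProcessLog operator in the steps above.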

It is a simple matter for RapidMiner to perform descriptor selection and there are a number of descriptor selection methods available.

RapidMiner - Part III: Descriptor scaling

Monday, March 31st, 2008

Scale the training set

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add Normalization operator (Preprocessing) to Root.
    • Configure it by checking the return_preprocessing_model checkbox.
    • Uncheck the z_transform checkbox.
  3. Add ModelWriter operator (IO->Models) to Root and configure it to save the model to a file.
  4. Run.

Scale the testing set

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a testing set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add ModelLoader operator (IO->Models) to Root and configure it to load the model that is saved during the scaling of the training set.
  3. Add ModelApplier operator to Root.
  4. Run.
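
The key idea in the two procedures above is that the scaling parameters are learned from the training set only and then reused on the testing set. A minimal sketch in plain Python, assuming that unchecking z_transform gives min-max (range) scaling to [0, 1]; the function names and toy data are my own:

```python
def fit_min_max(rows):
    """Record the minimum and maximum of every column of the training set."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, ranges):
    """Scale each column with the stored ranges (constant columns map to 0)."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, (low, high) in zip(row, ranges)
        ])
    return scaled

train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test = [[2.0, 40.0]]

ranges = fit_min_max(train)                 # plays the role of the saved model
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)   # values may fall outside [0, 1]
```

Because the testing set is scaled with the training ranges, a test value beyond the training maximum ends up above 1, which is expected rather than an error.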

RapidMiner scores full marks for its ease in scaling descriptors.

RapidMiner - Part II: Partitioning of dataset into training and testing sets

Sunday, March 30th, 2008

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a dataset from a file.
    • Set the value of the label_attribute to the class column.
  2. Add SimpleValidation operator (Validation) to Root and configure it by setting the split_ratio value to 0.8.
    • Add OperatorChain operator to SimpleValidation operator.
      • Add ArffExampleSetWriter operator (IO->Examples) to the current OperatorChain operator and configure it to save the training set to a file.
      • Add NearestNeighbors operator (Learner->Supervised->Lazy) to the current OperatorChain operator.
    • Add another OperatorChain operator to SimpleValidation operator.
      • Add ArffExampleSetWriter operator (IO->Examples) to the current OperatorChain operator and configure it to save the testing set to a file.
      • Add ModelApplier operator to the current OperatorChain operator.
      • Add Performance operator (Validation) to the current OperatorChain operator.
  3. Run.
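
For comparison, the split itself is conceptually very small. The sketch below is plain Python under my own assumptions (a shuffled split with a fixed seed, and a hypothetical function name); it does not claim to reproduce how SimpleValidation samples the data.

```python
import random

def split_dataset(rows, split_ratio=0.8, seed=0):
    """Shuffle the rows and return (training_set, testing_set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

dataset = [(i, "active" if i % 2 else "inactive") for i in range(10)]
training_set, testing_set = split_dataset(dataset, split_ratio=0.8)
```

Fixing the seed keeps the partition reproducible, so the same training and testing files can be regenerated later.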

In RapidMiner, partitioning a dataset must be accompanied by learning a model from the newly created training set and evaluating the model on the newly created testing set. There are no operators which can just partition the dataset into a training set and a testing set without model building and evaluation. Although RapidMiner has a lot of operators, its selection of operators for partitioning datasets seems to be rather limited. For example, it does not contain other algorithms, like the Kennard and Stone algorithm, for partitioning datasets.
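
For readers unfamiliar with the Kennard and Stone algorithm, here is a minimal sketch in plain Python of its usual formulation: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The function name and toy data are my own, Euclidean distance is an assumption, and the brute-force distance search is only suitable for small datasets.

```python
def kennard_stone(rows, n_select):
    """Return indices of n_select rows chosen by the Kennard-Stone procedure."""
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) ** 0.5

    n = len(rows)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of rows")
    # Start with the two rows that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(*pair),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the row whose nearest selected neighbour is farthest away.
        next_row = max(remaining, key=lambda i: min(dist(i, s) for s in selected))
        selected.append(next_row)
        remaining.remove(next_row)
    return selected

rows = [[0.0], [1.0], [2.0], [5.0], [10.0]]
training_idx = kennard_stone(rows, 3)
testing_idx = [i for i in range(len(rows)) if i not in training_idx]
```

Unlike a random split, this selection depends only on the descriptor values, so the training set is spread evenly over the descriptor space.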

RapidMiner - Part I: Overview

Thursday, March 27th, 2008

According to its official website, RapidMiner (version Free 4.1beta2, licensed under GPL version 3) “is the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks”. RapidMiner “is available in different flavours: the open-source version licensed under the GPL which can be used by everyone for free, another free version including an improved graphical user interface, and a proprietary version which can be used by commercial developers where the open-source license does not suit their needs”.

If you install the current version of RapidMiner with all its optional plugins, you will have a total of 436 (+127 Weka’s) operators, distributed across the following categories:

  • Core: 9
  • IO: 75
  • Learner: 65 (+119 Weka’s)
  • Meta: 21
  • OLAP: 3
  • Other: 11
  • Postprocessing: 6
  • Preprocessing: 205 (+8 Weka’s)
  • Validation: 31
  • Visualization: 10

This is an enormous collection of operators. However, since I am interested in using it for QSAR experiments, I will only examine those operators that are relevant. Basically, RapidMiner can read data from quite a number of sources, e.g. ARFF, csv, Excel, sparse format, SPSS, database, etc., so most users should not have any problems opening their existing data files in RapidMiner.

RapidMiner has a few descriptor selection operators. However, it does not seem to have some common ones like stepwise regression. Also, the filter and wrapper descriptor selection methods seem to be mixed together. I will explore this in more detail when I start the testing proper.

Currently, RapidMiner contains 11 (+21 Weka’s) algorithms for developing regression models and 30 (+69 Weka’s) algorithms for constructing classification models.

RapidMiner contains a few validation methods like cross-validation, boosting and bagging.

Overall, my first impression of RapidMiner is that it has a very comprehensive set of tools for a full QSAR experiment. However, the large number of tools can make it difficult for an inexperienced data miner to decide which to use. Also, the learning curve for the software seems to be quite steep, as it is not easy to visualize the experimental workflow using the current graphical user interface. Fortunately, RapidMiner has a useful wizard that automatically constructs the experimental workflow for a few common data mining scenarios. In addition, it has a comprehensive set of sample files which can help users learn how to construct the experimental workflow for different parts of the data mining process.

KNIME - Part VI: Model validation using cross-validation and/or independent validation set

Tuesday, March 25th, 2008

The previous post already provides the steps for model validation using cross-validation. So how do we validate a model using an independent validation set?

Develop a model

  1. Put File Reader node (IO->Read) to workbench and configure it to load a training set from a file.
  2. Put SVM Learner node (Mining->SVM) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by selecting the correct column for the Class column listbox.
    • Set the kernel and parameter values to the optimum values that have been determined by the parameter optimization procedure.
  3. Put Model Writer node (IO->Write) to workbench and connect the model port of the SVM Learner node to its input port. Configure it to save the model to a file.
  4. Execute all nodes.

Validate model on an independent validation set

  1. Put File Reader node (IO->Read) to workbench and configure it to load an independent validation set from a file.
  2. Put SVM Predictor node (Mining->SVM) to workbench and connect the output port from the File Reader node to its test data input port.
  3. Put Model Reader node (IO->Read) to workbench and connect its output port to the model port of the SVM Predictor node. Configure it to load the model that is saved in the model development phase.
  4. Put Cross validation node (Meta) to workbench and open the Cross validation node Meta-workflow editor.
    • Copy and paste the Aggregator node to the workbench.
    • Exit the Meta-workflow editor and delete the Cross validation node from the workbench.
    • Connect the output port from the SVM Predictor node to the input port of the Aggregator node.
  5. Execute all nodes.

It can be seen that KNIME is able to validate a model using either cross-validation or an independent validation set. However, it is rather limited in the number of available error measures. For example, for classification problems, it does not report sensitivity or specificity, and for regression problems, it does not report r2 or mean square error.
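
The missing measures are straightforward to compute from a table of actual and predicted values, should you export one. Below is a minimal sketch in plain Python. The function names are my own, and I am assuming that r2 here means the coefficient of determination (1 minus the residual sum of squares over the total sum of squares) rather than the squared correlation coefficient.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy classification result: 3 actives (2 found), 2 inactives (1 found).
sens, spec = sensitivity_specificity(
    ["active", "active", "active", "inactive", "inactive"],
    ["active", "active", "inactive", "inactive", "active"],
    positive="active",
)

# Toy regression result.
mse = mean_squared_error([1, 2, 3, 4], [1, 2, 3, 6])
r2 = r_squared([1, 2, 3, 4], [1, 2, 3, 6])
```

Both classification measures divide by a class count, so the sketch assumes each class appears at least once in the actual values.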

KNIME - Part V: Parameter optimization of machine learning/statistical methods

Sunday, March 23rd, 2008

  1. Put File Reader node (IO->Read) to workbench and configure it to load a training set from a file.
  2. Put Cross validation node (Meta) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by setting Number of validations to 10.
    • Ensure the Random sampling box is checked.
    • Select the correct column for the Column with class labels listbox.
  3. Open the Cross validation node Meta-workflow editor.
    • Put K Nearest Neighbor node (Mining->Misc Classifiers) on the editor and connect the training data and test data output ports of the X-Partitioner node to the training data and test data input ports of the K Nearest Neighbor node, respectively.
      • Configure it by setting the Number of neighbours to consider (k) to 3.
  4. Exit the Meta-workflow editor and put Statistics View node (Statistics) to workbench and connect the error rates port from the Cross validation node to its input port.
  5. Execute all nodes.

The above procedure shows how KNIME can be used to train a model and assess its performance. However, KNIME cannot automatically determine the optimum parameter value (e.g. Number of neighbours to consider (k) in the above procedure) for a machine learning/statistical method; according to their forum, this feature may be available in version 2.0. To determine the optimum value manually, set a parameter value, execute all the nodes, and record the mean error rate given in the Statistics View node. Repeat this for every parameter value that you are interested in. The parameter value which gives the lowest mean error rate is then the optimum parameter value of the machine learning/statistical method for the training set.
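
The manual loop described above is easy to express in code. The following is a minimal sketch in plain Python, not a KNIME feature: the k-nearest-neighbour classifier, the function names and the synthetic data (two clusters with one deliberately mislabelled point) are all my own assumptions. It assumes there are at least as many examples as folds.

```python
import random
from collections import Counter

def knn_predict(train_rows, train_labels, row, k):
    """Majority vote among the k training rows closest to `row`."""
    order = sorted(
        range(len(train_rows)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], row)),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def mean_error_rate(rows, labels, k, n_folds=10, seed=0):
    """Mean misclassification rate of k-NN over n_folds cross-validation folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    rates = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        errors = sum(
            knn_predict(train_rows, train_labels, rows[i], k) != labels[i]
            for i in fold
        )
        rates.append(errors / len(fold))
    return sum(rates) / len(rates)

# Hypothetical data: two separated clusters, one mislabelled point at 0.49.
cluster_1 = [0.0, 0.1, 0.2, 0.3, 0.48, 0.49, 0.5, 0.7, 0.8, 0.9]
rows = [[x] for x in cluster_1] + [[2 + x / 10] for x in range(10)]
labels = ["inactive"] * 10 + ["active"] * 10
labels[5] = "active"  # label noise that hurts k=1 more than larger k

# Try each candidate k, record the mean error rate, and keep the lowest.
error_by_k = {k: mean_error_rate(rows, labels, k) for k in (1, 3, 5)}
best_k = min(error_by_k, key=error_by_k.get)
```

When several values of k tie for the lowest error rate, this sketch keeps the smallest one, which is a design choice of mine rather than a rule.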

KNIME - Part IV: Descriptor selection

Friday, March 21st, 2008

KNIME does not seem to have any common descriptor selection methods like forward selection, backward elimination, stepwise regression, genetic algorithm, etc.

KNIME - Part III: Descriptor scaling

Wednesday, March 19th, 2008

Scale the training set

  1. Put File Reader node (IO->Read) to workbench and configure it to load a training set from a file.
  2. Put Normalizer node (Data Manipulation->Column) to workbench and connect the output port from the File Reader node to its input port.
    • Configure it by choosing Min-Max Normalization with the Min set as 0.0 and Max set as 1.0.
    • Select the columns to normalize.
  3. Put Model Writer node (IO->Write) to workbench and connect the model port of the Normalizer node to its input port. Configure it to save the model to a file.
  4. Execute all nodes.

Scale the testing set

  1. Put File Reader node (IO->Read) to workbench and configure it to load a testing set from a file.
  2. Put Normalizer (Apply) node (Data Manipulation->Column) to workbench and connect the output port from the File Reader node to its input port.
  3. Put Model Reader node (IO->Read) to workbench and connect its output port to the model port of the Normalizer (Apply) node. Configure it to load the model that is saved during the scaling of the training set.
  4. Execute all nodes.

KNIME scores full marks for its ease in scaling descriptors.

