Archive for the ‘Data mining’ Category

Orange - Part VI: Model validation using cross-validation and/or independent validation set

Friday, May 16th, 2008

The previous post already provides the steps for model validation using cross-validation. So how do we validate a model using an independent validation set?

Validate model on an independent validation set

  1. Put File widget (Data) to canvas and configure it to load a training set from a file.
  2. Put Select Attributes widget (Data) to canvas and connect the output port from the File widget to its input port.
    • Specify the attributes and class for the training set.
    • Click on the Apply button.
  3. Put K Nearest Neighbours widget (Classify) to canvas and connect the output port from the Select Attributes widget to its input port.
    • Configure it by setting Number of neighbours to 3.
    • Click on the Apply button.
  4. Put File widget (Data) to canvas and configure it to load an independent validation set from a file.
  5. Put Select Attributes widget (Data) to canvas and connect the output port from the second File widget to its input port.
    • Specify the attributes and class for the independent validation set.
    • Click on the Apply button.
  6. Put Test Learners widget (Evaluate) to canvas.
    • Connect the output port from K Nearest Neighbours widget to its Learner input port.
    • Connect the output port from the first Select Attributes widget to its Data input port.
    • Connect the output port from the second Select Attributes widget to its Separate Test Data input port.
    • Configure it by choosing Test on test data.

It can be seen that Orange is able to validate a model using either cross-validation or an independent validation set. However, it seems that Orange is unable to save a model, and thus the model has to be rebuilt each time it is used to predict an independent validation set.
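For readers who prefer to see the idea in script form, here is a minimal sketch of the same workflow in plain Python: train a 3-nearest-neighbour classifier on a training set and measure its error rate on a separate validation set. It does not use Orange's scripting API; the function names and the toy data are my own assumptions, chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority class among the k training rows nearest to query."""
    # Rank training rows by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def error_rate(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test rows that k-NN, built on the training set, misclassifies."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(test_X, test_y))
    return wrong / len(test_y)

# Hypothetical training set: two descriptors per compound, two classes.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

# Hypothetical independent validation set, never used for training.
valid_X = [(0.05, 0.1), (1.0, 0.95), (0.15, 0.15), (0.95, 1.05)]
valid_y = ["inactive", "active", "inactive", "active"]

validation_error = error_rate(train_X, train_y, valid_X, valid_y, k=3)
print("Validation error rate:", validation_error)
```

Since k-NN simply stores the training rows, "rebuilding the model" here amounts to passing the training set again.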

Orange - Part V: Parameter optimization of machine learning/statistical methods

Wednesday, May 14th, 2008

  1. Put File widget (Data) to canvas and configure it to load a training set from a file.
  2. Put Select Attributes widget (Data) to canvas and connect the output port from the File widget to its input port.
    • Specify the attributes and class for the training set.
    • Click on the Apply button.
  3. Put K Nearest Neighbours widget (Classify) to canvas and connect the output port from the Select Attributes widget to its input port.
    • Configure it by setting Number of neighbours to 3.
    • Click on the Apply button.
  4. Put Test Learners widget (Evaluate) to canvas.
    • Connect the output port from K Nearest Neighbours widget to its Learner input port.
    • Connect the output port from the Select Attributes widget to its Data input port.
    • Configure it by choosing Cross-validation and setting the Number of folds to 10.

The above procedure shows how Orange can be used to train a model and assess its performance. However, it is not possible to automatically determine the optimum parameter value (e.g. the Number of neighbours to consider (k) in the above procedure) for a machine learning/statistical method. You have to do it manually: set a parameter value, execute, record the overall error rate, set another parameter value, execute again, record the overall error rate, and so on, until you have evaluated all the parameter values that you are interested in. The parameter value that gives the lowest overall error rate is then the optimum parameter value of the machine learning/statistical method for the training set.
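The manual loop described above is easy to automate in a script. The following is a rough sketch in plain Python, not Orange's own API: it runs 10-fold cross-validation for several values of k and keeps the one with the lowest overall error rate. The candidate values of k, the helper names and the randomly generated toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k (features, label) rows nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_error(rows, k, folds=10, seed=0):
    """Overall error rate of k-NN under `folds`-fold cross-validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # same seed, so same folds for every k
    wrong = 0
    for f in range(folds):
        test = shuffled[f::folds]  # every folds-th row forms one fold
        train = [r for i, r in enumerate(shuffled) if i % folds != f]
        wrong += sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(rows)

# Hypothetical two-class toy data: 30 rows per class around two centres.
rng = random.Random(42)
rows = [((rng.gauss(c, 0.5), rng.gauss(c, 0.5)), label)
        for c, label in ((0.0, "inactive"), (1.5, "active"))
        for _ in range(30)]

# Evaluate each candidate k and keep the one with the lowest error rate.
errors = {k: cv_error(rows, k) for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)
print("Error rate per k:", errors)
print("Best k:", best_k)
```

Note that the error rate of the selected k is an optimistic estimate, because the same cross-validation folds were used to choose it.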

Orange - Part IV: Descriptor selection

Monday, May 12th, 2008

Orange does not have any wrapper descriptor selection methods.

Orange - Part III: Descriptor scaling

Saturday, May 10th, 2008

Orange does not have any capability for scaling descriptors. Zero marks for this one.

Orange - Part II: Partitioning of dataset into training and testing sets

Thursday, May 8th, 2008

  1. Put File widget (Data) to canvas and configure it to load a dataset from a file.
  2. Put Data Sampler widget (Data) to canvas and connect the output port from the File widget to its input port.
    • Configure it by choosing Random sampling and setting the Sample size to 80%.
    • Click on the Sample Data button.
  3. Put Save widget (Data) on the canvas. Connect the Examples output port from the Data Sampler node to the input node of the Save widget and configure it to save the training set to a file. Then click on the Save current data button.
  4. Put Save widget (Data) on the canvas. Connect the Remaining Examples output port from the Data Sampler node to the input node of the Save and configure it to save the testing set to a file. Then click on the Save current data button.

As can be seen from the above procedure, it is very easy to partition a dataset randomly into a training set and a testing set. However, Orange does not seem to contain other algorithms, like the Kennard and Stone algorithm, for partitioning datasets.
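To make the difference between the two partitioning approaches concrete, here is a small sketch in plain Python (not Orange code). The first function does a random 80/20 split like the Data Sampler step above. The second is my reading of the Kennard and Stone idea: start with the two most distant samples, then repeatedly add the sample that is farthest from its nearest already selected sample. The function names and toy data are assumptions.

```python
import math
import random

def random_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and cut them into (training set, testing set)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def kennard_stone_split(points, n_train):
    """Return (train_indices, test_indices); assumes 2 <= n_train <= len(points)."""
    n = len(points)
    # Seed the training set with the two points that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda p: math.dist(points[p[0]], points[p[1]]))
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    while len(selected) < n_train:
        # Add the remaining point whose nearest selected point is farthest away.
        nxt = max(remaining,
                  key=lambda i: min(math.dist(points[i], points[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Hypothetical dataset of ten samples for the random split.
rows = list(range(10))
train_rows, test_rows = random_split(rows, train_fraction=0.8)

# Hypothetical one-descriptor dataset for the Kennard and Stone split.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
train_idx, test_idx = kennard_stone_split(points, n_train=3)
print(train_idx, test_idx)
```

The pairwise search for the starting points is quadratic in the number of samples, which is acceptable for a sketch but slow for large datasets.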

Orange - Part I: Overview

Tuesday, May 6th, 2008

Orange (Snapshot 11 April 2008)

From their official website, “Orange is a component-based data mining software. It includes a range of preprocessing, modelling and data exploration techniques. It is based on C++ components, that are accessed either directly (not very common), through Python scripts (easier and better), or through GUI objects called Orange Widgets”. Orange is distributed under GPL.

If you install the current version of Orange, you will have a total of 77 widgets, distributed across the following categories:

  • Data: 15
  • Classify: 14
  • Evaluate: 6
  • Visualize: 13
  • Associate: 13
  • Prototypes: 13
  • Regression: 3

However, since I am interested in using it for QSAR experiments, I will only examine those widgets that are relevant. Basically, Orange can read data from five sources: text-delimited files (which include csv files), C4.5 files, and three other formats which I am not familiar with. Orange cannot read data from SVMlight files, LIBSVM files or Microsoft Excel files. The lack of support for Microsoft Excel files is no big deal since you can easily convert them to csv format using Microsoft Excel. However, the lack of support for SVMlight and LIBSVM files will inconvenience users who are already using these two popular support vector machine packages.

Orange has a few filter descriptor selection methods such as ReliefF, Information gain, Gain ratio and Gini gain.

Currently, Orange contains one algorithm for developing regression models and 10 algorithms for constructing classification models. It seems strange that Orange does not have a multiple linear regression algorithm, which is the most basic of regression algorithms.

Orange has a Data Sampler widget that provides validation methods like cross-validation and leave-one-out.

Overall, my first impression of Orange is that it has a nice graphical user interface but it seems quite inadequate for QSAR experiments.

TANAGRA - Part VI: Model validation using cross-validation and/or independent validation set

Friday, May 2nd, 2008

The previous post already provides the steps for model validation using cross-validation. TANAGRA does not provide any functionality for loading another dataset into the same diagram, or for saving and loading a model. Thus TANAGRA is unable to validate a model on an independent validation set (TANAGRA is able to validate on a testing set only if the testing set is derived using its Sampling operator).

TANAGRA - Part V: Parameter optimization of machine learning/statistical methods

Wednesday, April 30th, 2008

  1. Create a new diagram and configure it to load a training set from a file. This will put a Dataset operator on the diagram.
  2. Put Define status operator (Feature selection) to diagram under the Dataset operator and configure it to set the correct attributes as Input and Target.
  3. Put K-NN operator (Spv learning) to diagram under Define status operator and configure it.
  4. Put Cross validation operator (Spv learning assessment) to diagram under K-NN operator and configure it.
  5. Execute.

The above procedure shows how TANAGRA can be used to train a model and assess its performance. However, it is not possible to automatically determine the optimum parameter value (e.g. the Number of neighbours to consider (k) in the above procedure) for a machine learning/statistical method. You have to do it manually: set a parameter value, execute, record the overall error rate, set another parameter value, execute again, record the overall error rate, and so on, until you have evaluated all the parameter values that you are interested in. The parameter value that gives the lowest overall error rate is then the optimum parameter value of the machine learning/statistical method for the training set.

TANAGRA - Part IV: Descriptor selection

Monday, April 28th, 2008

TANAGRA has a few wrapper descriptor selection methods, like forward selection and backward elimination, but these methods are limited to using logistic regression as the statistical learning method. Other supervised learning methods cannot be used during the descriptor selection process. Hence, for practical purposes, TANAGRA can be viewed as not having any general wrapper descriptor selection methods.
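To illustrate what a wrapper method that is not tied to one learner could look like, here is a minimal sketch of forward selection in plain Python. It is not TANAGRA code. The scoring function is passed in as a parameter, so any supervised learning method could be plugged in; here I assume a leave-one-out error rate of 3-nearest neighbours. The names and toy data are hypothetical.

```python
import math
from collections import Counter

def loo_error(X, y, columns, k=3):
    """Leave-one-out error rate of k-NN using only the given descriptor columns."""
    def project(row):
        return [row[c] for c in columns]
    wrong = 0
    for i in range(len(X)):
        # Rank all other rows by distance to row i in the reduced descriptor space.
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: math.dist(project(X[j]), project(X[i])))
        vote = Counter(y[j] for j in others[:k]).most_common(1)[0][0]
        wrong += vote != y[i]
    return wrong / len(X)

def forward_selection(X, y, score=loo_error):
    """Greedily add the descriptor that lowers the error most; stop when none helps."""
    selected, best = [], float("inf")
    candidates = list(range(len(X[0])))
    while candidates:
        # Score every candidate descriptor together with those already selected.
        trial = {c: score(X, y, selected + [c]) for c in candidates}
        c = min(trial, key=trial.get)
        if trial[c] >= best:
            break  # no candidate improves on the current subset
        selected.append(c)
        candidates.remove(c)
        best = trial[c]
    return selected, best

# Hypothetical data: descriptor 0 separates the classes, descriptor 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [0.3, 3.0],
     [1.0, 4.0], [1.1, 8.0], [1.2, 2.0], [1.3, 6.0]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

selected, best = forward_selection(X, y)
print("Selected descriptors:", selected, "error rate:", best)
```

Backward elimination would follow the same pattern, starting from all descriptors and removing one at a time.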

TANAGRA - Part III: Descriptor scaling

Thursday, April 24th, 2008

Scale the training set

  1. Create a new diagram and configure it to load a training set from a file. This will put a Dataset operator on the diagram.
  2. Put Define status operator (Feature selection) to diagram under the Dataset operator and configure it to set the correct attributes as Input and Target.
  3. Put Standardize operator (Feature construction) to diagram under the Dataset operator and configure it to use the formula (x-x_min)/(x_max-x_min).
  4. Execute.

It is easy to use TANAGRA to scale descriptors in the training set. However, it seems that there is no option to save the parameters used to scale the descriptors in the training set and then apply them to a testing set. This would make it difficult to assess the performance of a model on an independent validation set.
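To show what "saving the scaling parameters" means in practice, here is a small sketch in plain Python (not TANAGRA code). It computes x_min and x_max for each descriptor from the training set only, then applies the same (x-x_min)/(x_max-x_min) formula to both sets. The function names and the numbers are assumptions for illustration.

```python
def fit_min_max(rows):
    """Return one (x_min, x_max) pair per descriptor column of the training set."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(rows, params):
    """Scale rows with (x - x_min) / (x_max - x_min), using the stored parameters."""
    scaled = []
    for row in rows:
        scaled.append([
            # A constant descriptor has x_max == x_min; map it to 0.0 instead of dividing by zero.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, params)
        ])
    return scaled

# Hypothetical training and testing sets with two descriptors each.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.5, 35.0]]

params = fit_min_max(train)               # learned from the training set only
train_scaled = apply_min_max(train, params)
test_scaled = apply_min_max(test, params)  # reuses the training-set parameters
print(params)
print(train_scaled)
print(test_scaled)
```

Scaled testing values can fall outside the range 0 to 1 when a testing sample lies outside the range seen in the training set, which is expected with this approach.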

