Archive for the ‘Review’ Category

TANAGRA - Part IV: Descriptor selection

Monday, April 28th, 2008

TANAGRA has a few wrapper descriptor selection methods, like forward selection and backward elimination, but these methods are limited to using logistic regression as the statistical learning method. Other supervised learning methods cannot be used during the descriptor selection process. Hence, for practical purposes, TANAGRA can be viewed as not having any general wrapper descriptor selection methods.
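To show what a wrapper method that is not tied to one learner could look like, here is a minimal Python sketch of greedy forward selection. It is not TANAGRA code. The `score` callback is an assumption standing in for "train any supervised learner on this descriptor subset and return its validation score", and the descriptor names and weights in the toy example are hypothetical.

```python
def forward_selection(features, score, max_features=None):
    """Greedy forward selection.

    `score` maps a list of feature names to a number (higher is better).
    Any learner can sit behind it, which is the point of a wrapper method.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:
        # Score every candidate subset that adds exactly one more feature.
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical stand-in for a real train-and-validate run: each descriptor
# has a made-up usefulness, and every added descriptor costs a small penalty.
weights = {"logP": 0.5, "MW": 0.3, "TPSA": 0.05, "noise": 0.0}


def toy_score(subset):
    return sum(weights[f] for f in subset) - 0.1 * len(subset)


chosen, chosen_score = forward_selection(list(weights), toy_score)
print(chosen, chosen_score)
```

In a real experiment, `toy_score` would be replaced by a function that builds a model on the chosen descriptors and returns, for example, its cross-validated accuracy.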

TANAGRA - Part III: Descriptor scaling

Thursday, April 24th, 2008

Scale the training set

  1. Create a new diagram and configure it to load a training set from a file. This will put a Dataset operator on the diagram.
  2. Put a Define status operator (Feature selection) on the diagram under the Dataset operator and configure it to set the correct attributes as Input and Target.
  3. Put a Standardize operator (Feature construction) on the diagram under the Dataset operator and configure it to use the formula (x-x_min)/(x_max-x_min).
  4. Execute.

It is easy to use TANAGRA to scale descriptors in the training set. However, it seems that there is no option to save the parameters used to scale the descriptors in the training set and then apply them to a testing set. This would make it difficult to assess the performance of a model on an independent validation set.
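For reference, this is the behaviour I was looking for, written as a minimal Python sketch rather than anything TANAGRA provides. It uses the same formula, (x-x_min)/(x_max-x_min), learns x_min and x_max from the training set only, and then reuses them on the testing set. The function names and the small example data are my own assumptions.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, params):
            span = hi - lo
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Made-up data: two descriptors, three training compounds, one test compound.
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test = [[2.0, 40.0]]

params = fit_min_max(train)                # learned from the training set only
train_scaled = apply_min_max(train, params)
test_scaled = apply_min_max(test, params)  # same parameters reused on the test set
print(params, test_scaled)
```

Note that a scaled test value can fall outside the 0 to 1 range when the test compound lies outside the range seen in the training set. That is expected and is exactly why the training parameters must be reused rather than recomputed.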

TANAGRA - Part II: Partitioning of dataset into training and testing sets

Tuesday, April 22nd, 2008

  1. Create a new diagram and configure it to load a dataset from a file. This will put a Dataset operator on the diagram.
  2. Put a Sampling operator (Instance selection) on the diagram under the Dataset operator and configure the proportion size setting to 80%.
  3. Put an Export dataset operator (Data visualization) on the diagram under the Sampling operator.
    • Configure it by setting the Examples selection to selected examples.
    • Set the filename to save the training set to.
  4. Put a Recover examples operator on the diagram under the Sampling operator and set the Examples to recover option to unselected.
  5. Put an Export dataset operator (Data visualization) on the diagram under the Recover examples operator.
    • Configure it by setting the Examples selection to selected examples.
    • Set the filename to save the testing set to.
  6. Execute.

As can be seen from the above procedure, it is very easy to partition a dataset randomly into a training set and a testing set. However, TANAGRA does not seem to contain other algorithms, like the Kennard and Stone algorithm, for partitioning datasets.
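The random 80/20 partition that the diagram performs can be sketched in a few lines of Python. This is only an illustration of the idea, not how TANAGRA implements its Sampling operator. The function name, the fixed seed, and the toy dataset are assumptions of mine.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=0):
    """Randomly partition rows into a training set and a testing set."""
    indices = list(range(len(rows)))
    # A seeded generator makes the partition reproducible between runs.
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


dataset = list(range(10))  # stand-in for ten compounds
training_set, testing_set = split_dataset(dataset)
print(len(training_set), len(testing_set))
```

Every row ends up in exactly one of the two sets, which corresponds to exporting the selected examples and then the unselected ones in the steps above.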

TANAGRA - Part I: Overview

Sunday, April 20th, 2008

TANAGRA (version 1.4.21)

From their official website, “TANAGRA is the successor of SIPINA which implements various supervised learning algorithms, especially an interactive and visual construction of decision trees. TANAGRA is more powerful, it contains some supervised learning but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rule, feature selection and construction algorithms”. TANAGRA “is an “open source project” as every researcher can access to the source code, and add his own algorithms, as far as he agrees and conforms to the software distribution license”. According to the English translation of the license (the original is in French), the software is free for use but if you have used it for your research, you have to cite it in your publications.

If you install the current version of TANAGRA, you will have a total of 137 operators, distributed across the categories as follows:

  • Data visualization: 6
  • Statistics: 17
  • Nonparametric statistics: 20
  • Instance selection: 6
  • Feature construction: 12
  • Feature selection: 12
  • Regression: 6
  • Factorial analysis: 6
  • PLS: 4
  • Clustering: 12
  • Spv learning: 17
  • Meta-spv learning: 4
  • Spv learning assessment: 6
  • Scoring: 3
  • Association: 6

However, since I am interested in using it for QSAR experiments, I will only examine those operators that are relevant. Basically, TANAGRA can only read data from three sources: text-delimited files (which include csv files), ARFF files (which are Weka files), and Microsoft Excel files. There are no operators for reading from SVMlight files or LIBSVM files. The lack of support for SVMlight and LIBSVM files will inconvenience users who are already using these two popular support vector machine packages.

TANAGRA has a few descriptor selection operators. However, it does not seem to have some common ones, like genetic algorithms. Also, the filter and wrapper descriptor selection methods seem to be mixed together. I will explore this in more detail when I start the testing proper.

Currently, TANAGRA contains 6 algorithms for developing regression models and 17 algorithms for constructing classification models.

TANAGRA contains validation methods like cross-validation, as well as ensemble methods like boosting and bagging.

Overall, my first impression of TANAGRA is that although it may not have a very pretty graphical user interface or a lot of operators, it seems to be quite suitable for QSAR experiments.

Weka (KnowledgeFlow) - Part VI: Model validation using cross-validation and/or independent validation set

Saturday, April 19th, 2008

The previous post already provides the steps for model validation using cross-validation. So how do we validate a model using an independent validation set?

Validate model on an independent validation set

  1. Put an ArffLoader component (DataSources) on the layout area and configure it to load a training set from a file.
  2. Put a ClassAssigner component (Filters) on the layout area and connect the dataSet connection from the ArffLoader component to it.
    • Configure it by setting the classIndex to the class column.
  3. Put a TrainingSetMaker component (Evaluation) on the layout area and connect the dataSet connection from the ClassAssigner component to it.
  4. Put a second ArffLoader component (DataSources) on the layout area and configure it to load an independent validation set from a file.
  5. Put a second ClassAssigner component (Filters) on the layout area and connect the dataSet connection from the second ArffLoader component to it.
    • Configure it by setting the classIndex to the class column.
  6. Put a TestSetMaker component (Evaluation) on the layout area and connect the dataSet connection from the second ClassAssigner component to it.
  7. Put an SMO component (Classifiers) on the layout area and connect the trainingSet connection from the TrainingSetMaker component and the testSet connection from the TestSetMaker component to it.
    • Configure it by choosing RBFKernel and setting the gamma value for the kernel to 0.01.
  8. Put a ClassifierPerformanceEvaluator component (Evaluation) on the layout area and connect the batchClassifier connection from the SMO component to it.
  9. Put a TextViewer component (Visualization) on the layout area and connect the text connection from the ClassifierPerformanceEvaluator component to it.
  10. Run.

It can be seen that KnowledgeFlow is able to validate a model using either cross-validation or an independent validation set. However, it seems that KnowledgeFlow is unable to save a model, and thus the model has to be rebuilt each time it is to be validated on an independent validation set.

Weka (KnowledgeFlow) - Part V: Parameter optimization of machine learning/statistical methods

Wednesday, April 16th, 2008

  1. Put an ArffLoader component (DataSources) on the layout area and configure it to load a training set from a file.
  2. Put a ClassAssigner component (Filters) on the layout area and connect the dataSet connection from the ArffLoader component to it.
    • Configure it by setting the classIndex to the class column.
  3. Put a CrossValidationSplitMaker component (Evaluation) on the layout area and connect the dataSet connection from the ClassAssigner component to it.
    • Configure it by setting the folds to 10.
  4. Put an SMO component (Classifiers) on the layout area and connect the trainingSet and testSet connections from the CrossValidationSplitMaker component to it.
    • Configure it by choosing RBFKernel and setting the gamma value for the kernel to 0.01.
  5. Put a ClassifierPerformanceEvaluator component (Evaluation) on the layout area and connect the batchClassifier connection from the SMO component to it.
  6. Put a TextViewer component (Visualization) on the layout area and connect the text connection from the ClassifierPerformanceEvaluator component to it.
  7. Run.

The above procedure shows how KnowledgeFlow can be used to train a model and assess its performance. However, it is not possible to automatically determine the optimum parameter values (e.g. the C value and the gamma value for the kernel) for a machine learning/statistical method. There is a GridSearch component (Classifiers), but I could not get it to work: I kept getting the error “Can’t have more folds than instances” even though I was not using cross-validation. To determine the optimum parameter values, you have to do it manually: set a parameter value, run, record the mean absolute error given in the TextViewer component, set another parameter value, run again, record the mean absolute error, and so on, until you have evaluated all the parameter values that you are interested in. The parameter values that give the lowest mean absolute error are then the optimum parameter values of the machine learning/statistical method for the training set.
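The manual procedure is just an exhaustive loop over the parameter values, so it can be sketched in Python as follows. This is not Weka's GridSearch and it does not call Weka at all. The `evaluate` callback is an assumption standing in for "configure the flow with these values, run it, and read the mean absolute error from the TextViewer", and `toy_error` is a made-up function used only so that the sketch runs.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every parameter combination and return (best_params, best_error)."""
    names = list(grid)
    best_params, best_error = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(**params)  # one "set values, run, record error" cycle
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_error(c, gamma):
    # Hypothetical stand-in for one training/evaluation run; not a real model.
    return abs(c - 10) / 10 + abs(gamma - 0.01) * 10


# The candidate values are arbitrary examples, not recommendations.
grid = {"c": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_error = grid_search(toy_error, grid)
print(best_params, best_error)
```

With three values for each of the two parameters, the loop performs nine runs, which is the same number of runs you would have to carry out by hand in KnowledgeFlow.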

Weka (KnowledgeFlow) - Part IV: Descriptor selection

Monday, April 14th, 2008

KnowledgeFlow does not seem to have any common descriptor selection methods like forward selection, backward elimination, stepwise regression, genetic algorithm, etc. However, the Explorer application in Weka does have quite a number of descriptor selection methods.

Weka (KnowledgeFlow) - Part III: Descriptor scaling

Saturday, April 12th, 2008

Scale the training set

  1. Put an ArffLoader component (DataSources) on the layout area and configure it to load a training set from a file.
  2. Put a ClassAssigner component (Filters) on the layout area and connect the dataSet connection from the ArffLoader component to it.
    • Configure it by setting the classIndex to the class column.
  3. Put a Normalize component (Filters) on the layout area and connect the dataSet connection from the ClassAssigner component to it.
  4. Run.

It is easy to use KnowledgeFlow to scale descriptors in the training set. However, it seems that there is no option to save the parameters used to scale the descriptors in the training set and then apply them to a testing set. This would make it difficult to assess the performance of a model on an independent validation set.

Weka (KnowledgeFlow) - Part II: Partitioning of dataset into training and testing sets

Thursday, April 10th, 2008

  1. Put an ArffLoader component (DataSources) on the layout area and configure it to load a dataset from a file.
  2. Put a TrainTestSplitMaker component (Evaluation) on the layout area and connect the dataSet connection from the ArffLoader component to it.
    • Configure it by setting the trainPercent to 80.
  3. Put two ArffSaver components (DataSinks) on the layout area. Connect the trainingSet connection from the TrainTestSplitMaker component to the first ArffSaver component and configure it to save the training set to a file. Connect the testSet connection from the TrainTestSplitMaker component to the second ArffSaver component and configure it to save the testing set to a file.
  4. Run.

As can be seen from the above procedure, it is very easy to partition a dataset randomly into a training set and a testing set. However, KnowledgeFlow does not seem to contain other algorithms, like the Kennard and Stone algorithm, for partitioning datasets.
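For readers who have not met the Kennard and Stone algorithm, here is a minimal Python sketch of the usual formulation: start with the two compounds that are farthest apart in descriptor space, then repeatedly add the compound whose nearest already-selected neighbour is farthest away. It is my own illustration, not part of Weka. It assumes Euclidean distance on already-scaled descriptors, and the one-descriptor example data are made up.

```python
import math


def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by the Kennard-Stone scheme."""
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")
    # Full pairwise Euclidean distance matrix (fine for small datasets).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train:
        # Pick the candidate whose nearest selected neighbour is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


points = [[0.0], [1.0], [2.0], [9.0], [10.0]]  # five compounds, one descriptor
train_idx = kennard_stone(points, 3)
test_idx = [i for i in range(len(points)) if i not in train_idx]
print(train_idx, test_idx)
```

The selected indices form the training set and the remaining ones form the testing set, so the training set spans the descriptor space instead of depending on a random draw.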

Weka - Part I: Overview

Tuesday, April 8th, 2008

Weka (version 3.5.7)

From their official website, “Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License”.

Weka consists of four different applications: Explorer, Experimenter, KnowledgeFlow and SimpleCLI. For this review, I will concentrate mainly on KnowledgeFlow since it is similar to both RapidMiner and KNIME in terms of user interface.

If you install the current developer version of Weka, you will have a total of 225 components in KnowledgeFlow, distributed across the categories as follows:

  • DataSources: 8
  • DataSinks: 7
  • Filters: 69
  • Classifiers: 110
  • Clusterers: 9
  • Associations: 5
  • Evaluation: 10
  • Visualization: 7

As mentioned before, I am interested in using it for QSAR experiments, so I will only examine those components that are relevant. Basically, KnowledgeFlow can read data from quite a number of sources, e.g. ARFF, csv, LIBSVM files, databases, etc., so most users should not have any problems opening their existing data files in KnowledgeFlow. However, it does not have a component for reading from Microsoft Excel files, which is not a big deal since you can easily convert them to csv format using Microsoft Excel.

At first sight, KnowledgeFlow does not seem to have any descriptor selection capability. I will explore this in more detail when I start the testing proper.

Currently, KnowledgeFlow contains 22 algorithms for developing regression models and 66 algorithms for constructing classification models.

KnowledgeFlow contains validation methods like cross-validation, as well as ensemble methods like bagging.

Overall, my first impression of KnowledgeFlow is that its graphical user interface seems easy to use. However, the layout of the components may make it difficult to find those components that you require in an experiment.

