Archive for April, 2008

Weka - Part I: Overview

Tuesday, April 8th, 2008

Weka (version 3.5.7)

According to the official website, “Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License”.

Weka consists of four different applications: Explorer, Experimenter, KnowledgeFlow and SimpleCLI. For this review, I will concentrate mainly on KnowledgeFlow since it is similar to both RapidMiner and KNIME in terms of user interface.

If you install the current developer version of Weka, you will have a total of 225 components in KnowledgeFlow, distributed as follows:

  • DataSources: 8
  • DataSinks: 7
  • Filters: 69
  • Classifiers: 110
  • Clusterers: 9
  • Associations: 5
  • Evaluation: 10
  • Visualization: 7

As mentioned before, I am interested in using it for QSAR experiments, so I will only examine the components that are relevant. KnowledgeFlow can read data from quite a number of sources, e.g. ARFF, csv and LIBSVM files, databases, etc., so most users should not have any problems opening their existing data files in KnowledgeFlow. It does not have a component for reading Microsoft Excel files, but that is not a big deal since you can easily convert them to csv format using Microsoft Excel.
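
For data that is already in csv, a conversion to ARFF (the format loaded in the RapidMiner posts further down this page) can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function name csv_to_arff, the relation name and the example columns are made up for illustration, every column except the last is assumed to be numeric, and the last column is assumed to hold a nominal class label. Attribute names containing spaces would need quoting, which is not handled here.

```python
import csv
import io

def csv_to_arff(csv_text, relation="qsar"):
    """Convert csv text (header row, numeric descriptors, class label last) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # The class attribute is nominal, so list its distinct labels in sorted order.
    labels = sorted({row[-1] for row in data})
    nominal = "{" + ",".join(labels) + "}"
    lines = ["@relation " + relation, ""]
    for name in header[:-1]:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute " + header[-1] + " " + nominal)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

# Hypothetical two-descriptor data set, for illustration only.
example = "logP,mw,class\n1.2,180.1,active\n3.4,250.3,inactive\n"
arff = csv_to_arff(example)
print(arff)
```

The output can be saved with an .arff extension and opened with any ARFF reader.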

At first sight, KnowledgeFlow does not seem to have any descriptor selection capability. I will explore this in more detail when I start the testing proper.

Currently, KnowledgeFlow contains 22 algorithms for developing regression models and 66 algorithms for constructing classification models.

KnowledgeFlow contains validation methods like cross-validation and bagging.

Overall, my first impression of KnowledgeFlow is that its graphical user interface seems easy to use. However, the layout of the components may make it difficult to find those components that you require in an experiment.

RapidMiner - Part VI: Model validation using cross-validation and/or independent validation set

Sunday, April 6th, 2008

Model validation using cross-validation

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add XValidation operator (Validation) to Root and configure it by setting the number_of_validations to 10.
    • Add LibSVMLearner operator (Learner->Supervised->Functions) to XValidation operator.
    • Add OperatorChain operator to XValidation operator.
      • Add ModelApplier operator to OperatorChain operator.
      • Add Performance operator (Validation) to OperatorChain operator.
  3. Run.
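
What the XValidation operator does with its two inner operators can be sketched in plain Python. This is a rough illustration and not RapidMiner's implementation: a nearest-centroid classifier stands in for LibSVMLearner so that the example needs no external library, accuracy stands in for the Performance operator, and all function names and the toy data are my own.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_centroids(X, y):
    """Stand-in learner: the model is the mean descriptor vector of each class."""
    sums, counts = {}, {}
    for xs, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(xs))
        for j, v in enumerate(xs):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(model, xs):
    """Apply the model: pick the class whose centroid is closest to xs."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(xs, model[c])))

def cross_validate(X, y, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the accuracy."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        model = train_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Toy data: two well-separated classes of 20 rows each.
X = [[i * 0.1, i * 0.1] for i in range(20)] + [[10 + i * 0.1, 10 + i * 0.1] for i in range(20)]
y = ["inactive"] * 20 + ["active"] * 20
print(cross_validate(X, y, k=10))
```

Every row is used for testing exactly once, and the model that predicts a row has never seen that row during training.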

Develop a model

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add LibSVMLearner operator (Learner->Supervised->Functions) to Root.
  3. Add ModelWriter operator (IO->Models) to Root and configure it to save the model to a file.
  4. Run.

Model validation using independent validation set

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load an independent validation set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add ModelLoader operator (IO->Models) to Root and configure it to load the model that is saved in the model development phase.
  3. Add ModelApplier operator to Root.
  4. Add Performance operator (Validation) to Root.
  5. Run.
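
The two phases above (develop and save a model, then load it and score it on an independent validation set) can be sketched as follows. This is only an illustration with assumptions of my own: a 1-nearest-neighbour classifier stands in for LibSVMLearner, Python's pickle module stands in for ModelWriter and ModelLoader, and the data and function names are made up.

```python
import os
import pickle
import tempfile

def train_1nn(X, y):
    """Stand-in learner: a 1-nearest-neighbour 'model' is just the stored training set."""
    return list(zip(X, y))

def apply_model(model, X):
    """Predict each row with the label of its closest training example."""
    def nearest_label(xs):
        closest = min(model, key=lambda ex: sum((a - b) ** 2 for a, b in zip(xs, ex[0])))
        return closest[1]
    return [nearest_label(xs) for xs in X]

def accuracy(predicted, actual):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Model development phase: train on the training set and write the model to a file.
train_X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
train_y = ["inactive", "inactive", "active", "active"]
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(train_1nn(train_X, train_y), fh)

# Validation phase: load the saved model and score it on rows it has never seen.
valid_X = [[0.1, 0.2], [4.8, 5.0], [2.0, 2.0]]
valid_y = ["inactive", "active", "active"]
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)
predictions = apply_model(loaded, valid_X)
score = accuracy(predictions, valid_y)
print(predictions, score)
```

The important point is that the validation rows play no part in training; they are only touched after the model has been written to disk.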

RapidMiner scores full marks again for its ease of model validation.

RapidMiner - Part V: Parameter optimization of machine learning/statistical methods

Friday, April 4th, 2008

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add GridParameterOptimization operator (Meta->Parameter) to Root.
    • Add XValidation operator (Validation) to GridParameterOptimization operator.
      • Add LibSVMLearner operator (Learner->Supervised->Functions) to XValidation operator.
      • Add OperatorChain operator to XValidation operator.
        • Add ModelApplier operator to OperatorChain operator.
        • Add Performance operator (Validation) to OperatorChain operator.
    • Add ProcessLog operator to GridParameterOptimization operator.
      • Configure the log by editing the Edit List.
        • Click the Add button and enter C for the log portion. Choose LibSVMLearner, parameter, C for the column_name.
        • Click the Add button and enter gamma for the log portion. Choose LibSVMLearner, parameter, gamma for the column_name.
        • Click the Add button and enter performance for the log portion. Choose XValidation, value, performance for the column_name.
    • Configure the GridParameterOptimization operator by editing the Edit List.
      • Select LibSVMLearner, C for parameters and enter 50,100,150,200,250 for values.
      • Select LibSVMLearner, gamma for parameters and enter 0.0001,0.001,0.01,0.1 for values.
  3. Add ParameterSetWriter operator (IO->Other) to Root and configure it to save the parameters to a file.
  4. Add GnuplotWriter operator (IO->Other) to Root.
    • Configure it to save the plot to a file.
    • Set name as Log.
    • Set x_axis as C.
    • Set y_axis as gamma.
    • Set values as performance.
  5. Run.
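
The idea behind the grid search can be sketched in Python: every combination of C and gamma is evaluated, each result is logged, and the best combination is kept. The C and gamma values are the ones entered above. Everything else is an assumption for illustration: grid_search is my own helper, and toy_score is a placeholder that merely returns a number peaking at one grid point. In a real run the score would be the cross-validated performance of the learner.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Evaluate every parameter combination; return (best_params, best_score, log)."""
    names = list(grid)
    log = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # One log row per combination, similar in spirit to the ProcessLog columns.
        log.append({**params, "performance": evaluate(**params)})
    best = max(log, key=lambda row: row["performance"])
    best_params = {name: best[name] for name in names}
    return best_params, best["performance"], log

grid = {
    "C": [50, 100, 150, 200, 250],
    "gamma": [0.0001, 0.001, 0.01, 0.1],
}

def toy_score(C, gamma):
    # Placeholder score, NOT a real SVM result: it simply peaks at C=150, gamma=0.01.
    return 1.0 - abs(C - 150) / 1000 - abs(gamma - 0.01)

best_params, best_score, log = grid_search(toy_score, grid)
print(len(log), best_params, best_score)
```

With 5 values for C and 4 for gamma, the grid contains 20 combinations, so the inner cross-validation runs 20 times. This is worth keeping in mind when choosing how fine to make the grid.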

Parameter optimization is easy to set up in RapidMiner. Full marks for this part.

RapidMiner - Part IV: Descriptor selection

Wednesday, April 2nd, 2008

  1. Add ArffExampleSource operator (IO->Examples) to Root.
    • Configure it to load a training set from a file.
    • Set the value of the label_attribute to the class column.
  2. Add FeatureSelection operator (Preprocessing->Attributes->Selection) to Root.
    • Add XValidation operator (Validation) to FeatureSelection operator.
      • Add NearestNeighbors operator (Learner->Supervised->Lazy) to XValidation operator.
      • Add OperatorChain operator to XValidation operator.
        • Add ModelApplier operator to OperatorChain operator.
        • Add Performance operator (Validation) to OperatorChain operator.
    • Add ProcessLog operator to FeatureSelection operator.
      • Configure the log by editing the Edit List.
        • Click the Add button and enter generation for the log portion. Choose FeatureSelection, value, generation for the column_name.
        • Click the Add button and enter performance for the log portion. Choose FeatureSelection, value, performance for the column_name.
  3. Run.
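
The nesting above (a selection operator wrapped around a cross-validated nearest-neighbour learner) is a wrapper approach, which can be sketched in Python as below. This is a sketch under my own assumptions and not RapidMiner's code: it uses greedy forward selection, which is only one possible search strategy, a 1-nearest-neighbour learner, 5 interleaved folds, and made-up toy data in which column 0 is noise and column 1 separates the classes.

```python
def knn_predict(train_X, train_y, xs, k=1):
    """Majority label among the k closest training rows (squared Euclidean distance)."""
    ranked = sorted(range(len(train_X)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(xs, train_X[i])))
    votes = [train_y[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

def cv_accuracy(X, y, features, folds=5):
    """Cross-validated k-NN accuracy using only the given descriptor columns."""
    proj = [[row[f] for f in features] for row in X]
    correct = 0
    for fold in range(folds):
        test = range(fold, len(X), folds)  # interleaved folds: rows fold, fold+folds, ...
        train = [i for i in range(len(X)) if i % folds != fold]
        t_X, t_y = [proj[i] for i in train], [y[i] for i in train]
        correct += sum(knn_predict(t_X, t_y, proj[i]) == y[i] for i in test)
    return correct / len(X)

def forward_selection(X, y, folds=5):
    """Greedily add the descriptor that most improves accuracy; stop when none does."""
    selected, best, log = [], 0.0, []
    remaining = list(range(len(X[0])))
    generation = 0
    while remaining:
        scored = [(cv_accuracy(X, y, selected + [f], folds), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
        generation += 1
        log.append((generation, best))  # mirrors the generation/performance log columns
    return selected, best, log

# Toy data: column 0 is noise, column 1 separates the two classes cleanly.
X = [[(i * 7) % 20, i if i < 10 else i + 10] for i in range(20)]
y = ["inactive"] * 10 + ["active"] * 10
selected, best, log = forward_selection(X, y)
print(selected, best, log)
```

On this toy data the search keeps only column 1 and stops after one generation, because adding the noise column cannot improve on a perfect score.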

It is a simple matter for RapidMiner to perform descriptor selection and there are a number of descriptor selection methods available.

