3.3 Implementation of Models in WEKA

3.3.3 Extending WEKA

One of WEKA’s major strengths is that it is easily extended with customized or new classifiers, filters, clusterers, attribute selection methods, and other components. To add a new classifier, all that is needed is a class that derives from the Classifier class and implements the buildClassifier method for learning and the classifyInstance method for testing, i.e., predicting the value for a data point. A new filter is added analogously, by deriving from the weka.filters.Filter class and overriding the methods described in Section 3.3.3.1.

Any new class is picked up by the Graphical User Interfaces (GUI) through Java introspection: no further coding is needed to deploy it from WEKA’s GUIs. This makes it easy to evaluate how new algorithms perform compared to any of the existing ones, which explains WEKA’s popularity among machine learning researchers.

3.3.3.1 Writing a new Filter

The enhanced schemes were added to the WEKA filter library. Filters perform many tasks, from resampling data to deleting and standardizing attributes.

UNIVERSITY OF IBADAN LIBRARY

The following methods are important for implementing a filter. All of them are declared in the weka.filters.Filter class:

a. getCapabilities()
b. setInputFormat(Instances)
c. getInputFormat()
d. setOutputFormat(Instances)
e. getOutputFormat()
f. input(Instance)
g. bufferInput(Instance)
h. push(Instance)
i. output()
j. batchFinished()
k. flushInput()
l. getRevision()

To include the enhanced data sampling scheme in WEKA, only the following methods were modified in this study:

i. getCapabilities()
ii. setInputFormat(Instances)
iii. input(Instance)
iv. batchFinished()
v. getRevision()

setInputFormat(Instances)

With this call, the user tells the filter what structure, i.e., which attributes, the input data has. The method also tests whether the filter can actually process this data, according to the capabilities specified in the getCapabilities() method. If the output format of the filter, i.e., the new Instances header, can be determined from this information alone, the method should set the output format via setOutputFormat(Instances) and return true; otherwise it has to return false.

getInputFormat()

This method returns an Instances object containing all currently buffered Instance objects from the input queue.

setOutputFormat(Instances)

This method defines the new Instances header for the output data. For filters that work on a row-basis, there should not be any changes between the input and output format. But filters that work on attributes, e.g. removing, adding, modifying, will affect this format.

This method must be called with the appropriate Instances object as its parameter, since all Instance objects being processed rely on the output format (they use it as the dataset they belong to).

getOutputFormat()

This method returns the currently set Instances object that defines the output format. In case setOutputFormat(Instances) has not been called yet, this method will return null.

input(Instance)

This method returns true if the given Instance can be processed straight away and can be collected immediately via the output() method (after adding it to the output queue via push(Instance), of course). This is also the case if the first batch of data has been processed and the Instance belongs to the second batch. Via isFirstBatchDone() one can query whether this Instance is still part of the first batch or of the second.

If the Instance cannot be processed immediately, e.g., the filter needs to collect all the data first before doing some calculations, then it needs to be buffered with bufferInput(Instance) until batchFinished() is called. In this case, the method needs to return false.

bufferInput(Instance)

In case an Instance cannot be processed immediately, one can use this method to buffer it in the input queue. All buffered Instance objects are available via the getInputFormat() method.

push(Instance)

This method adds the given Instance to the output queue.

output()

This method returns the next Instance object from the output queue and removes it from there. In case there is no Instance available this method returns null.

batchFinished()

This method signals the end of a dataset being pushed through the filter. For a filter that could not process the data of the first batch immediately, this is the place to determine the output format (and set it via setOutputFormat(Instances)) and finally process the input data. The currently available data can be retrieved with the getInputFormat() method.

After processing the data, one needs to call flushInput() to remove all the pending input data.

flushInput()

This method removes all buffered Instance objects from the input queue. This method must be called after all the Instance objects have been processed in the batchFinished() method.
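
The interplay of input, bufferInput, push, output, batchFinished and flushInput described above can be illustrated with a small stand-alone sketch. The class below is not WEKA code: the name ToyBatchFilter, the use of double[] rows in place of Instance objects, and the chosen transformation (subtracting the mean of the first column) are all assumptions made for illustration. It only mimics the queue behaviour of a filter that must see the whole first batch before it can produce output.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-alone model of the filter queues; not part of WEKA.
class ToyBatchFilter {
    private final List<double[]> inputBuffer = new ArrayList<>();   // stands in for the input queue
    private final Queue<double[]> outputQueue = new ArrayDeque<>(); // stands in for the output queue
    private boolean firstBatchDone = false;
    private double mean = 0.0; // statistic learned from the first batch

    // Mirrors input(Instance): true only if the row can be collected immediately.
    boolean input(double[] row) {
        if (firstBatchDone) {
            push(convert(row)); // second batch: the mean is already known
            return true;
        }
        bufferInput(row);       // first batch: wait for batchFinished()
        return false;
    }

    void bufferInput(double[] row) { inputBuffer.add(row.clone()); }

    void push(double[] row) { outputQueue.add(row); }

    // Returns and removes the next row, or null if the output queue is empty.
    double[] output() { return outputQueue.poll(); }

    int numPendingOutput() { return outputQueue.size(); }

    void flushInput() { inputBuffer.clear(); }

    // Mirrors batchFinished(): learn the statistic, process the buffer, flush it.
    boolean batchFinished() {
        if (!firstBatchDone && !inputBuffer.isEmpty()) {
            double sum = 0.0;
            for (double[] row : inputBuffer) {
                sum += row[0];
            }
            mean = sum / inputBuffer.size();
        }
        for (double[] row : inputBuffer) {
            push(convert(row));
        }
        flushInput();
        firstBatchDone = true;
        return numPendingOutput() != 0;
    }

    // Subtracts the learned mean from the first column of a copy of the row.
    private double[] convert(double[] row) {
        double[] out = row.clone();
        out[0] -= mean;
        return out;
    }

    public static void main(String[] args) {
        ToyBatchFilter filter = new ToyBatchFilter();
        for (double v : new double[] {1.0, 2.0, 3.0}) {
            filter.input(new double[] {v}); // returns false: the row is buffered
        }
        filter.batchFinished(); // the mean is now known; buffered rows are pushed
        double[] row;
        while ((row = filter.output()) != null) {
            System.out.println(row[0]);
        }
        // A row of the second batch is converted and collectable at once.
        if (filter.input(new double[] {5.0})) {
            System.out.println(filter.output()[0]);
        }
    }
}
```

In this sketch the first three rows are only released after batchFinished(), whereas the later row is available from output() directly after input() returns true, which is the distinction the method descriptions above draw between the first and the second batch.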

Option handling

If the filter should be able to handle command-line options, then the interface weka.core.OptionHandler needs to be implemented. In addition to that, the following code should be added at the end of the setOptions(String[]) method:

if (getInputFormat() != null) {
  setInputFormat(getInputFormat());
}

This will inform the filter about changes in the options and therefore reset it.

The following examples, covering batch and stream filters, illustrate the filter framework and how to use it. Unseeded random number generators like Math.random() should never be used since they will produce different results in each run and repeatable experiments are essential in machine learning.
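
The point about seeding can be checked with plain Java. The sketch below is not taken from WEKA: the class SeededSampling and the helper sampleIndices are hypothetical names, and drawing row indices with replacement is only an assumed example of a sampling step. It shows that two generators created with the same seed produce the same sample, a guarantee that Math.random() does not offer.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper illustrating reproducible sampling; not part of WEKA.
class SeededSampling {

    // Draws `count` indices in [0, numRows) with replacement, using an explicit seed.
    static int[] sampleIndices(int numRows, int count, long seed) {
        Random rng = new Random(seed); // same seed, same sequence
        int[] picked = new int[count];
        for (int i = 0; i < count; i++) {
            picked[i] = rng.nextInt(numRows);
        }
        return picked;
    }

    public static void main(String[] args) {
        int[] first = sampleIndices(100, 5, 42L);
        int[] second = sampleIndices(100, 5, 42L);
        // Both runs used seed 42, so the two samples are identical.
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```

Exposing the seed as a parameter, rather than hard-coding it, lets an experiment be repeated exactly or varied deliberately across runs.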

BatchFilter

This simple batch filter adds a new attribute called blah at the end of the dataset. The values of this attribute contain only the row’s index in the data. Since the batch filter does not have to see all the data before creating the output format, the setInputFormat(Instances) method sets the output format and returns true (indicating that the output format can be queried immediately). The batchFinished() method performs the processing of all the data.

import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;

public class BatchFilter extends Filter {

  public String globalInfo() {
    return "A batch filter that adds an additional attribute 'blah' at the end "
      + "containing the index of the processed instance. The output format "
      + "can be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS); // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    // copy the header only (0 rows) and append the new attribute
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("blah"), outFormat.numAttributes());
    setOutputFormat(outFormat);
    return true; // output format is immediately available
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    Instances inst = getInputFormat();
    Instances outFormat = getOutputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      double[] newValues = new double[outFormat.numAttributes()];
      double[] oldValues = inst.instance(i).toDoubleArray();
      System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
      newValues[newValues.length - 1] = i; // row index goes into 'blah'
      // Instance is a concrete class up to WEKA 3.6; from 3.7 on, use DenseInstance
      push(new Instance(1.0, newValues));
    }
    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    return (numPendingOutput() != 0);
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter(), args);
  }
}
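
The per-row transformation carried out in batchFinished() can also be tried in isolation, without WEKA on the classpath. The following sketch is an assumption-laden simplification: the class AppendRowIndex and the method appendIndex are hypothetical names, and plain double[][] rows stand in for the Instances object. It reproduces only the array handling of the listing above, in which each row is copied into a longer array and the row index is written into the last slot.

```java
import java.util.Arrays;

// Hypothetical stand-alone version of the row transformation; not part of WEKA.
class AppendRowIndex {

    // Returns new rows, each with one extra trailing value holding the row's index.
    static double[][] appendIndex(double[][] rows) {
        double[][] result = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            double[] newValues = new double[rows[i].length + 1];
            System.arraycopy(rows[i], 0, newValues, 0, rows[i].length);
            newValues[newValues.length - 1] = i; // value of the added attribute
            result[i] = newValues;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = { {5.1, 3.5}, {4.9, 3.0} };
        for (double[] row : appendIndex(data)) {
            System.out.println(Arrays.toString(row));
        }
        // prints "[5.1, 3.5, 0.0]" and then "[4.9, 3.0, 1.0]"
    }
}
```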