Report


Code Description




Code Reference

      The code used can be found through the link(s) below.

      The file and data handling code used in this project is from the Weka project (www.cs.waikato.ac.nz/ml/weka/).

      Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

      The Weka code used in this project handles files and various data manipulation tasks. It was also used to collect statistics on the performance of the Naive Bayes classifier implemented. The instance based nearest k neighbor classifier was adapted from the Weka package's instance based nearest neighbor (IB1) algorithm.


Code Features

Instance Representation
Each instance or example is represented as a set of binary attributes. If a data set is not in this form, it is run through a discretization filter and then a binary filter, creating a new attribute for each value an original attribute could have.
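As an illustration, the sketch below shows one way such a binarized instance might be stored in Java. The class and its names are hypothetical, not the project's actual code; it simply packs one bit per binary attribute, with the class attribute as the last bit.

    import java.util.BitSet;

    // Hypothetical sketch: one bit per binarized attribute,
    // with the class attribute stored as the last bit.
    public class BitInstance {
        private final BitSet bits;   // attribute bits plus class bit
        private final int numBits;   // total bits, class bit included

        public BitInstance(boolean[] attributeValues, boolean classValue) {
            numBits = attributeValues.length + 1;
            bits = new BitSet(numBits);
            for (int i = 0; i < attributeValues.length; i++) {
                if (attributeValues[i]) bits.set(i);
            }
            if (classValue) bits.set(numBits - 1);
        }

        public BitSet getBits() { return (BitSet) bits.clone(); }
        public int size() { return numBits; }
        public boolean getClassBit() { return bits.get(numBits - 1); }
    }
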
Missing Values
Missing values were replaced with the most frequently occurring value for that attribute.
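A minimal sketch of this replacement policy, assuming nominal values are encoded as integer indices with -1 marking a missing entry (the encoding and names are illustrative only):

    // Hypothetical sketch: replace missing values with the attribute's mode.
    public class MissingValues {
        // columnValues holds one attribute's value index per example;
        // -1 marks a missing value (encoding assumed for illustration).
        public static int[] fillMissing(int[] columnValues, int numValues) {
            int[] counts = new int[numValues];
            for (int v : columnValues) {
                if (v >= 0) counts[v]++;              // skip missing entries
            }
            int mode = 0;
            for (int v = 1; v < numValues; v++) {
                if (counts[v] > counts[mode]) mode = v;
            }
            int[] filled = columnValues.clone();
            for (int i = 0; i < filled.length; i++) {
                if (filled[i] < 0) filled[i] = mode;  // impute the mode
            }
            return filled;
        }
    }
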
Hypothesis Representation
Each hypothesis is represented by a bit array. Each bit corresponds to an attribute, including the class attribute.
Hypothesis Logic
A single hypothesis is applied to an instance by performing a logical operation between the bit representation of the instance and the hypothesis. The logical operations supported are AND, OR and XOR. The number of set bits (bits equal to 1) in the result is then counted. This count determines whether the hypothesis classifies the instance. If it does, the classification specified by the last bit in the hypothesis is used. Otherwise, a neutral answer is returned.
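The core of this step might look as follows. This is a sketch only, assuming the BitSet representation above; the class and method names are illustrative.

    import java.util.BitSet;

    // Hypothetical sketch of applying a hypothesis to an instance.
    public class HypothesisLogic {
        // Returns the number of set bits in the result of the chosen
        // operation, taken over the attribute bits (class bit excluded).
        public static int applyOperation(BitSet instance, BitSet hypothesis,
                                         String op, int numAttributeBits) {
            BitSet a = instance.get(0, numAttributeBits);
            BitSet h = hypothesis.get(0, numAttributeBits);
            if ("AND".equals(op))      a.and(h);
            else if ("OR".equals(op))  a.or(h);
            else if ("XOR".equals(op)) a.xor(h);
            else throw new IllegalArgumentException("unknown op: " + op);
            return a.cardinality();    // count of bits equal to 1
        }
    }

If the resulting count passes the operator's threshold test (described per operator below), the class bit of the hypothesis supplies the answer; otherwise the neutral answer is returned.
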
AND
AND has the effect of demanding that an attribute be set. The ratio of these bits that are set determines the likelihood that this hypothesis' classification is correct.
If the ratio of result bits set to hypothesis bits set is above a certain threshold, the classification specified by the hypothesis is used.
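A sketch of this acceptance test (the threshold value and names are assumed for illustration):

    // Hypothetical sketch of the AND acceptance test:
    // ratio of result bits set to hypothesis bits set.
    public class AndTest {
        public static boolean accept(int resultBitsSet, int hypothesisBitsSet,
                                     double threshold) {
            if (hypothesisBitsSet == 0) return false;  // nothing demanded
            double ratio = (double) resultBitsSet / hypothesisBitsSet;
            return ratio > threshold;
        }
    }
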
OR
OR can only identify attributes that have no significance. This means the more bits set in the instance beyond those set in the hypothesis, the more likely the hypothesis' classification is correct.
So we really want to compare the number of bits set in the result but not in the hypothesis against the number that were set in the hypothesis.
If the ratio of hypothesis bits set to result bits set is above a threshold, this hypothesis' classification is used.
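A sketch of the OR acceptance test as stated above (the names are illustrative, and the rule follows the description rather than any verified source):

    // Hypothetical sketch of the OR acceptance test:
    // ratio of hypothesis bits set to result bits set.
    public class OrTest {
        public static boolean accept(int resultBitsSet, int hypothesisBitsSet,
                                     double threshold) {
            if (resultBitsSet == 0) return false;      // no bits set anywhere
            double ratio = (double) hypothesisBitsSet / resultBitsSet;
            return ratio > threshold;
        }
    }
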
XOR
XOR specifies the value each bit should have. The ratio of bits set in the result represents the likelihood that this hypothesis' classification will be used.
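A sketch following the description literally (names assumed; note that with XOR a set result bit marks a disagreement between instance and hypothesis, so the ratio is taken over all attribute bits):

    // Hypothetical sketch of the XOR acceptance test: the fraction of
    // result bits set over all attribute bits.
    public class XorTest {
        public static boolean accept(int resultBitsSet, int numAttributeBits,
                                     double threshold) {
            double ratio = (double) resultBitsSet / numAttributeBits;
            return ratio > threshold;
        }
    }
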
Multiple Logical Operators
It is possible to use any combination of the three supported logical operations. This is done by splitting the population into three camps of equal size and training each camp separately with one of the chosen logical operations. When testing overall accuracy, each population is asked to classify each instance, and the separate results are averaged to form an answer.
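The averaging step might look like the sketch below. The Population type and its classify() method are assumed helpers for illustration, returning +1 or -1 for the two classifications and 0 for the neutral answer.

    import java.util.BitSet;

    interface Population {
        int classify(BitSet instance);  // +1, -1, or 0 (neutral)
    }

    // Hypothetical sketch: three sub-populations, each trained with one
    // operation, classify the instance; their answers are averaged.
    public class Ensemble {
        public static int classify(Population andPop, Population orPop,
                                   Population xorPop, BitSet instance) {
            double avg = (andPop.classify(instance) + orPop.classify(instance)
                          + xorPop.classify(instance)) / 3.0;
            return avg > 0 ? 1 : (avg < 0 ? -1 : 0);
        }
    }
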
Fitness
Fitness for each hypothesis is determined by the number of training examples it classifies correctly. This count is taken over the examples whose classification matches the hypothesis' own.
Fitness also considers the number of examples a hypothesis classifies incorrectly, effectively reducing the count of correctly classified examples by one for each. This includes examples of both classifications.
This was done so that hypotheses that were good at classifying a small set of instances could be promoted. Otherwise, hypotheses that classified according to the most frequently seen classification would most likely dominate the population. So, you could say a "Be correct or be quiet" policy was used.
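The "be correct or be quiet" scoring might be sketched as follows, reusing the BitInstance sketch from above; the Hypothesis interface here is assumed for illustration, with null standing in for the neutral answer.

    import java.util.List;

    // Hypothetical sketch of the "be correct or be quiet" fitness.
    public class Fitness {
        interface Hypothesis {
            Boolean classify(BitInstance example);  // true/false, null = neutral
        }

        public static int score(Hypothesis h, List<BitInstance> trainingSet) {
            int score = 0;
            for (BitInstance example : trainingSet) {
                Boolean answer = h.classify(example);
                if (answer == null) continue;                 // neutral: no effect
                if (answer == example.getClassBit()) score++; // correct
                else score--;     // incorrect, either classification
            }
            return score;
        }
    }
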
Mutation
With some probability, one or more random bits in a hypothesis may be changed. If chosen for mutation, at least one bit in the hypothesis is changed.
The mutation rate decays over time in proportion to the number of populations generated so far. The decay is linear, starting at the specified rate and ending near zero. The number of bits changed is also governed by this decaying rate.
This was done because a large fluctuation in overall fitness (and average fitness) could be seen from one generation to the next. This is fine for the early populations, but tends to make results unpredictable for the later populations.
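A sketch of mutation with a linearly decaying rate (names and the exact decay schedule are assumptions for illustration):

    import java.util.BitSet;
    import java.util.Random;

    // Hypothetical sketch of mutation with a linearly decaying rate.
    public class Mutation {
        public static void mutate(BitSet hypothesis, int numBits,
                                  int generation, int maxGenerations,
                                  double initialRate, Random rng) {
            // Rate decays linearly from initialRate toward zero.
            double rate = initialRate
                        * (1.0 - (double) generation / maxGenerations);
            if (rng.nextDouble() >= rate) return;   // not chosen for mutation
            // The decaying rate also sets how many bits to change.
            int bitsToFlip = Math.max(1, (int) Math.round(rate * numBits));
            for (int i = 0; i < bitsToFlip; i++) {
                hypothesis.flip(rng.nextInt(numBits));  // flip a random position
            }
        }
    }
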
Cross Over
Pairs of hypotheses are selected from the current population in the same manner as those chosen to continue into the next population. If the two hypotheses are equivalent, a new one is chosen until the pair differs. A random bit position is then selected as the cross over point.
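A sketch of the single-point cross over itself (names assumed for illustration): everything past the chosen point is swapped between the two parents.

    import java.util.BitSet;
    import java.util.Random;

    // Hypothetical sketch of single-point cross over between two parents.
    public class CrossOver {
        public static BitSet[] cross(BitSet a, BitSet b, int numBits, Random rng) {
            int point = rng.nextInt(numBits);         // random cross over point
            BitSet childA = a.get(0, numBits);
            BitSet childB = b.get(0, numBits);
            for (int i = point; i < numBits; i++) {   // swap tails past the point
                boolean tmp = childA.get(i);
                childA.set(i, childB.get(i));
                childB.set(i, tmp);
            }
            return new BitSet[] { childA, childB };
        }
    }
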
Options
  • Number of hypotheses in the population
  • Number of populations to produce
  • Carry over rate
  • Mutation rate
  • Random generator seed
  • Fitness threshold at which to stop producing populations (overrides the number of populations)
  • Logical operations to use
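These options might be gathered into a single settings object, as in the sketch below. The field names and default values are placeholders for illustration, not the values used in the experiments.

    // Hypothetical sketch: the options above gathered into one place.
    public class GaOptions {
        public int numHypotheses = 100;       // hypotheses per population
        public int numPopulations = 50;       // populations to produce
        public double carryOverRate = 0.1;    // fraction carried over unchanged
        public double mutationRate = 0.05;    // initial mutation rate (decays)
        public long randomSeed = 1L;          // random generator seed
        public double fitnessThreshold = 1.0; // stop early when reached
        public String[] operations = { "AND", "OR", "XOR" };  // logic ops to use
    }
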

     


 

by: Keith A. Pray
Last Modified: July 4, 2004 9:00 AM
© 2004 - 1975 Keith A. Pray.
All rights reserved.