|
Report
Code Description
Click to jump to a particular section of this page.
Code Reference
The the code used can be found through the link(s) below.
The file and data handling code used in this project is from the
Weka project
(www.cs.waikato.ac.nz/ml/weka/).
Weka is a collection of machine learning algorithms for solving
real-world data mining problems. It is written in Java and runs on
almost any platform. The algorithms can either be applied directly
to a dataset or called from your own Java code. Weka is also
well-suited for developing new machine learning schemes. Weka is
open source software issued under the GNU public license.
The Weka code used in this project was that which
handles files and various data manipulation tasks. It was
also used to collect statistics on the performance of the
Naive Bayes Classifier algorithm implemented. The version of
the Instance Based nearest k neighbor classifier was adapted
from the Weka package Instance Based nearest neighbor (IB1)
algorithm.
Back to Top
Code Features
- Instance Representation
-
Each instance or example is represented as a set of binary attributes.
If a data set is not in this form, it is run through a discretization
filter and then a binary filter, creating a new attribute for each
value an original attribute could have.
- Missing Values
-
Missing Values were replaced with the most frequently occurring
value.
- Hypothesis Representation
-
Each hypothesis is represented by a bit array. Each bit corresponds
to an attribute, including the class attribute.
- Hypothesis Logic
-
A single hypothesis is applied to an instance by performing a
logical operation with the bit representation of an instance and
the hypothesis. The logical operations supported include AND, OR and
XOR. The number of set bits (bit = 1) in the result is then counted.
This sum determines if the hypothesis classifies the instance. If
this is the case, then the classification specified by the last bit
in the hypothesis is used. Otherwise, a neutral answer is returned.
- AND
-
AND has the effect of demanding that an attribute is set.
The ratio of these bits set determines the likely hood that this
hypothesis' classification is correct.
If the ratio of the result bits set to hypothesis
bits set if above a certain threshold, the classification
specified by the hypothesis is used.
- OR
-
OR can only identify attributes that have no significance.
This means the more bits set in the instance over the hypothesis,
the more likely the hypothesis' classification is correct.
So, we really want to compare the difference of bits NOT set in the
hypothesis to bits set in the result over the number that were
set in the hypothesis.
If the ratio of hypothesis bits set to result bits set is above a
threshold, use this hypothesis' classification.
- XOR
-
XOR specifies the value each bit should be. The ratio of bits set
in the result represents the likelihood that this hypothesis'
classification will be used..
- Multiple Logical Operators
-
It is possible to use any combination of the three logical operations
supported. This is done by splitting the population into three camps
of equal size and training them separately using one of the chosen
logical operations. When testing the overall accuracy, each population
is asked to classify each instance. The separate results are averaged
to form an answer.
- Fitness
-
Fitness for each hypothesis is determined by the number of training
examples that it classified correctly. This count is taken over
the examples who's classification matches that of the hypothesis'.
Fitness also considers the number of examples a hypothesis
classifies incorrectly, effectively reducing the count of
correctly classified examples by one for each. This includes
examples of both classifications.
This was done so that hypotheses that were good at classifying
a small set of instances could be promoted.
Otherwise, hypotheses that classified
according to the most frequently seen classification would most likely
dominate the population.
So, you could say a "Be correct or be quiet" policy was used.
- Mutation
-
With some probability, one or more random bits in a hypothesis
may be changed. If chosen for mutation, at least
one bit in the hypothesis is changed.
The mutation rate decays over time proportional to the number
of populations generated so far. It is linear, starting at the
specified rate and ending near zero. The number of bits changed
is also specified by this decaying rate.
This was done because
a large fluctuation in overall fitness (and average fitness) could
be seen from one generation to the next. This is fine for the
beginning populations, but tends to make results unpredictable
for the last populations.
- Cross Over
-
Pairs of hypotheses are selected in the same manner from the
current population as they to continue into the next population.
If the pair are equivalent, a new hypothesis is chosen until
otherwise. A random bit position is selected as the cross over
point.
- Options
-
- Number of Hypotheses in the population
- Number of Populations to produce
- Carry Over rate
- Mutation Rate
- Random generator Seed
- Fitness Threshold at which to stop producing populations
(Overrides Number of Populations)
- Logical Operations to use
Back to Top
|
|