Project 2 Report

Code Description


Code Reference

      The code used in this project is from the Weka project (www.cs.waikato.ac.nz/ml/weka/).

Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

      Weka provides some basic learning algorithms for educational purposes (among other reasons). One of these is ID3, which forms the basis of C4.5, the algorithm used in this project. Source code is provided for both algorithms, but for illustrative purposes ID3 will be examined here. Weka's implementation of C4.5 (actually C4.5 Version 8) addresses some of ID3's shortcomings. Weka's ID3 implements three major methods: buildClassifier(), classifyInstance(), and distributionForInstance(). The buildClassifier() method is the main focus of this project.
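
      As a concrete illustration, the sketch below shows how these three methods might be called from your own Java code. This is a minimal sketch, not code from this project: the file name is illustrative, and the Id3 package path is an assumption that varies between Weka versions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.Id3; // weka.classifiers.Id3 in older Weka versions
    import weka.core.Instances;

    public class Id3Demo {
        public static void main(String[] args) throws Exception {
            // Load an ARFF file (file name is illustrative).
            Instances data = new Instances(
                new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

            Id3 tree = new Id3();
            tree.buildClassifier(data); // learn the tree from the training data

            // Predict the first instance: either a single class index...
            double label = tree.classifyInstance(data.instance(0));
            // ...or the full class probability distribution.
            double[] dist = tree.distributionForInstance(data.instance(0));
            System.out.println("Predicted: " + data.classAttribute().value((int) label));
        }
    }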


Code Adaptation

      The only adaptation needed was to specify that pruning not be used, which was possible through command line arguments. The Weka package expects data in ARFF format, so the original data sets provided were modified to include the header information the ARFF format requires. This involved adding information from the census-income.names file found in the data set directory specified in the assignment. One puzzling thing about the data in the census-income.test file was a "." at the end of each data line. These were removed from the file used in the actual testing, which seemed reasonable since the extra characters did not appear in the training data set.
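
      For reference, J48's -U option requests an unpruned tree, so an invocation along the lines of java weka.classifiers.trees.J48 -U -t train.arff -T test.arff does this from the command line (the class path varies by Weka version, and the file names here are illustrative, not the exact command used). The sketch below shows the programmatic equivalent.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class UnprunedRun {
        public static void main(String[] args) throws Exception {
            // Load the ARFF-converted training data (file name is illustrative).
            Instances train = new Instances(
                new BufferedReader(new FileReader("census-income.arff")));
            train.setClassIndex(train.numAttributes() - 1);

            J48 tree = new J48();
            tree.setOptions(new String[] { "-U" }); // -U: build an unpruned tree
            tree.buildClassifier(train);
            System.out.println(tree); // print the resulting tree
        }
    }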


Code Algorithm

      The buildClassifier() method constructs a decision tree from a set of training data. Weka assumes the data is in ARFF format, so some translation of the original data set was required. This implementation checks the training data for missing values and non-nominal attributes. Since the pure ID3 algorithm does not handle missing values, instances containing them are removed from the training data; the C4.5 algorithm removes only instances where the class attribute is not assigned a value. With ID3, if the class attribute is non-nominal, an exception is thrown; the C4.5 algorithm does not throw this exception.
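
      The sketch below condenses these entry checks, loosely following Weka's Id3.buildClassifier(); the exception type and surrounding details are simplified for illustration and do not reproduce the Weka source exactly.

    import weka.core.Instances;

    // Simplified sketch of the checks described above (illustrative, not the Weka source).
    public void buildClassifier(Instances data) throws Exception {
        // ID3 requires a nominal class attribute.
        if (!data.classAttribute().isNominal()) {
            throw new Exception("Id3: the class attribute must be nominal.");
        }
        // Work on a copy and drop every instance with any missing value.
        // (C4.5 would drop only instances whose class value is missing:
        //  data.deleteWithMissingClass();)
        data = new Instances(data);
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);
        }
        makeTree(data); // build the tree recursively (see the steps below)
    }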

      It is now time to construct the tree. The steps are listed below; a condensed code sketch follows the list.

  1. If the data set is empty:
    1. create a leaf
    2. set the class value to missing
    3. initialize the estimated probability of each of the data set's classes to 0
    4. return from the method
    Otherwise...
  2. Find the attribute that yields the greatest information gain:
    1. create an enumeration of the attributes, excluding the class attribute
    2. store the information gain for each attribute (computed by computeInfoGain())
    3. use the attribute with the greatest information gain for the current node (if more than one attribute ties for the greatest information gain, the first one found is chosen)
  3. If the information gain of the chosen attribute is 0:
    1. create a leaf
    2. compute both the distribution of class probabilities and the class with the greatest probability
    3. normalize the distribution
    4. return from the method
    Otherwise...
  4. Split the data set according to the chosen attribute's possible values.
  5. Recursively build a subtree for each of the new data sets.
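
      The sketch below condenses the five steps into code, modeled on Weka's Id3 implementation; field and helper names are simplified for illustration and may not match the Weka source exactly.

    import java.util.Enumeration;
    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.Utils;

    // Condensed sketch of the recursion above, modeled on Weka's Id3.
    public class Id3Sketch {
        private Attribute m_Attribute;     // attribute this node splits on (null at a leaf)
        private double m_ClassValue;       // class value at a leaf
        private double[] m_Distribution;   // class probability distribution at a leaf
        private Id3Sketch[] m_Successors;  // one subtree per attribute value

        private void makeTree(Instances data) throws Exception {
            // Step 1: empty data set -> leaf with a missing class value.
            if (data.numInstances() == 0) {
                m_Attribute = null;
                m_ClassValue = Instance.missingValue(); // Utils.missingValue() in newer Weka
                m_Distribution = new double[data.numClasses()]; // all zeros
                return;
            }
            // Step 2: pick the attribute with the greatest information gain.
            double[] infoGains = new double[data.numAttributes()];
            Enumeration attEnum = data.enumerateAttributes(); // excludes the class attribute
            while (attEnum.hasMoreElements()) {
                Attribute att = (Attribute) attEnum.nextElement();
                infoGains[att.index()] = computeInfoGain(data, att);
            }
            m_Attribute = data.attribute(Utils.maxIndex(infoGains)); // first maximum wins ties
            // Step 3: zero gain -> leaf holding the normalized class distribution.
            if (Utils.eq(infoGains[m_Attribute.index()], 0)) {
                m_Attribute = null;
                m_Distribution = new double[data.numClasses()];
                Enumeration instEnum = data.enumerateInstances();
                while (instEnum.hasMoreElements()) {
                    Instance inst = (Instance) instEnum.nextElement();
                    m_Distribution[(int) inst.classValue()]++;
                }
                Utils.normalize(m_Distribution);
                m_ClassValue = Utils.maxIndex(m_Distribution);
                return;
            }
            // Steps 4 and 5: split on the chosen attribute and recurse on each subset.
            Instances[] splitData = splitData(data, m_Attribute);
            m_Successors = new Id3Sketch[m_Attribute.numValues()];
            for (int j = 0; j < m_Attribute.numValues(); j++) {
                m_Successors[j] = new Id3Sketch();
                m_Successors[j].makeTree(splitData[j]);
            }
        }
        // computeInfoGain() and splitData() are sketched in the next section.
    }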

computeInfoGain()

This is a straightforward implementation of the formula

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · Entropy(S_v)

It calculates the entropy of the entire data set, splits the data set using the same method as buildClassifier(), and then computes the entropy of each resulting subset.
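
      A sketch of this computation, continuing the simplified class above, is shown below along with the splitData() helper shared with the tree builder; again it is modeled on Weka's Id3 and not a verbatim copy.

    // Continues the Id3Sketch class above.
    private double computeInfoGain(Instances data, Attribute att) throws Exception {
        double infoGain = computeEntropy(data); // Entropy(S)
        Instances[] splitData = splitData(data, att);
        for (int j = 0; j < att.numValues(); j++) {
            if (splitData[j].numInstances() > 0) {
                // Subtract ( |S_v| / |S| ) * Entropy(S_v) for each value v of A.
                infoGain -= ((double) splitData[j].numInstances()
                             / (double) data.numInstances())
                            * computeEntropy(splitData[j]);
            }
        }
        return infoGain;
    }

    private double computeEntropy(Instances data) throws Exception {
        double[] classCounts = new double[data.numClasses()];
        Enumeration instEnum = data.enumerateInstances();
        while (instEnum.hasMoreElements()) {
            classCounts[(int) ((Instance) instEnum.nextElement()).classValue()]++;
        }
        // Entropy(S) = log2(|S|) - (1/|S|) * sum_j( count_j * log2(count_j) )
        double entropy = 0;
        for (int j = 0; j < data.numClasses(); j++) {
            if (classCounts[j] > 0) {
                entropy -= classCounts[j] * Utils.log2(classCounts[j]);
            }
        }
        entropy /= (double) data.numInstances();
        return entropy + Utils.log2(data.numInstances());
    }

    private Instances[] splitData(Instances data, Attribute att) {
        Instances[] subsets = new Instances[att.numValues()];
        for (int j = 0; j < att.numValues(); j++) {
            subsets[j] = new Instances(data, data.numInstances()); // empty, same header
        }
        Enumeration instEnum = data.enumerateInstances();
        while (instEnum.hasMoreElements()) {
            Instance inst = (Instance) instEnum.nextElement();
            subsets[(int) inst.value(att)].add(inst); // route by the value of att
        }
        return subsets;
    }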

      The main difference between this ID3 algorithm and the one used in Weka's C4.5 lies in the data model used to represent the training instances. The model determines exactly how each attribute is treated by the information gain function and how continuous-valued attributes are split.



by: Keith A. Pray
Last Modified: July 4, 2004 8:58 AM
