Keith A. Pray : Home : Academic : Machine Learning : Concept Learning, Ordering : 2.5

Project 1: Concept Learning, Ordering

[ Intro ] [ 2.1 ] [ 2.2 ] [ 2.4 ] [ 2.5 ] [ 2.7 ] [ 2.8 ] [ 2.9 ]

Exercise 2.5

Consider the following sequence of positive and negative training examples describing the concept "pairs of people who live in the same house." Each training example describes an ordered pair of people, with each person described by their sex, hair color (black, brown, or blonde), height (tall, medium, or short), and nationality (US, French, German, Irish, Indian, Japanese, or Portuguese).

Training Examples
1	+	< < male brown tall US > < female black short US > >
2	+	< < male brown short French > < female black short US > >
3	-	< < female brown tall German > < female black short Indian > >
4	+	< < male brown tall Irish > < female brown short Irish > >

Consider a hypothesis space defined over these instances, in which each hypothesis is represented by a pair of 4-tuples, and where each attribute constraint may be a specific value, "?," or "Ø," just as in the EnjoySport hypothesis representation. For example, the hypothesis

< < male ? tall ? > < female ? ? Japanese > >

represents the set of all pairs of people where the first is a tall male (of any nationality and hair color), and the second is a Japanese female (of any hair color and height).

Provide a hand trace of the Candidate-Elimination algorithm learning from the above training examples and hypothesis language. In particular, show the specific and general boundaries of the version space after it has processed the first training example, then the second training example, etc.

Initial State
S₀:

{ < < Ø Ø Ø Ø > < Ø Ø Ø Ø > > }

G₀:

{ < < ? ? ? ? > < ? ? ? ? > > }

Training Example #1
S₁:

{ < < male brown tall US > < female black short US > > }

G₁:

{ < < ? ? ? ? > < ? ? ? ? > > }

Training Example #2
S₂:

{ < < male brown ? ? > < female black short US > > }

G₂:

{ < < ? ? ? ? > < ? ? ? ? > > }

Training Example #3
S₃:

{ < < male brown ? ? > < female black short US > > }

G₃:

{ < < male ? ? ? > < ? ? ? ? > >
< < ? ? ? ? > < ? ? ? US > > }

Training Example #4
S₄:

{ < < male brown ? ? > < female ? short ? > > }

G₄:

{ < < male ? ? ? > < ? ? ? ? > > }

How many distinct hypotheses from the given hypothesis space are consistent with the following single positive training example?

+ << male black short Portuguese > < female blonde tall Indian >>
Each hypothesis consistent with the above example can have either the specified value seen above or "?" for each attribute. That makes 2 per attribute. There are 8 attributes.

2 ⁸ = 256

Assume the learner has encountered only the positive example from part (b), and that it is now allowed to query the trainer by generating any instance and asking the trainer to classify it.
1. Give a specific sequence of queries that assures the learner will converge to the single correct hypothesis, whatever it may be (assuming that the target concept is describable within the given hypothesis language). Give the shortest sequence of queries you can find.
  
  Let's concentrate on S. S is currently:
  S ₁: < < male black short Portuguese > < female blonde tall Indian > >
  Query 2 (considering the first example as # 1)
  < < female brown medium US > < male black medium US > >
  S ₂: < < (? or male) (? or black) (? or short) (? or Portuguese) > < (? or female) (? or blonde) (? or tall) (? or Indian) > >
  Please note that the sex attribute in the hypothesis is determined. There are no more values it could assume to force S to generalize that attribute, if it hasn't already.
  Query 3
  < < done blonde tall French > < done brown short French > >
  S ₃: < < done (? or black) (? or short) (? or Portuguese) > < done (? or blonde) (? or tall) (? or Indian) > >
  Please note that the hair color and height attributes in the hypothesis are determined. There are no more values they could assume to force S to generalize those attributes, if they haven't already.
  Query 4
  < < done done done German > < done done done German > >
  
  This goes on until all the nationalities for each person has been tried (3 more queries). This makes for total of 7 queries.
2. How does the length of this sequence relate to your answer to question (b)?
  
  For each query the hypothesis space was reduced by half. That is except for the second query where both the hair color attribute and the height attribute were guaranteed convergence. This is consistent with the 2 ⁸ calculation for the number of total hypotheses consistent with the original training example, or any positive training example for that matter.
  Please note that while G was not examined closely, since the attributes were generalized to "?" and converged with G or took on the value from the first training example and the inconsistent hypotheses in G were removed and replaced by those more specific. This eventually leads to S and G converging.

Note that this hypothesis language cannot express all concepts that can be defined over the instances (i.e., we can define sets of positive and negative examples for which there is no corresponding describable hypothesis). If we were to enrich the language so that it could express all concepts that can be defined over the instance language, then how would you answer to (c) change?

Instead of checking just 2 possible values for attributes in the hypotheses we would have to check for every combination of values over all the possible values in the instance space. There would be a lot more work, that's for sure...

Note: I should really put exactly how many here.

by: Keith A. Pray
Last Modified: July 4, 2004 8:58 AM