Apriori Sets And Sequences - Keith's MS Thesis
Apriori Sets And Sequences
About
Code
Performance Data Collection
·
Data Sets
Documents
Results
·
                                          
Printer Friendly Version
Previous Duplicate Hash

[ Intro ]

Up performance-improvements ]

Using the previous duplicate hash effects the time it takes to
generate candidate itemsets. 

How does it effect the time?

It is used to check if an itemset could
possibly be a duplicate of a previously created itemset or the
original itemset that future candidates may duplicate. We can
definitely say if an itemset does not belong to either of these
groups. By not looking up which itemset another is a duplicate of we
can save time. We can also save time not adding an itemset to the list
of those that will have duplicate itemsets generated in the future and
making more itemsets to look through.

History of this feature:

To support mining for frequent itemsets containing more than one of a
single type of event item, duplicate items of event items had to be
created with different integer representations. This allows itemsets
to contain more than one item that is conceptually the same event.
This also creates more work to do during mining. More itemsets must be
tested and joined, more candidate itemsets must have their support
counted. More memory must be used to represent the duplicate
itemsets. 

We can take advantage of the the fact that there are
conceptual duplicates. Duplicate itemsets will have the same
support. Therefor only one need be counted. To take advantage of this
the duplicate itemsets must be identified. The candidates are stored
in a simple list. Originally, this list had to be searched for each
candidate that was generated as they were generated. As more and more
candidates were generated, the time to search this list became
prohibitive. 

The previous duplicate hash feature took advantage of the way
candidate itemsets are created. Candidates are the result of
joining two frequent itemsets from the previous level. If either of
these parent itemsets are a duplicate itemset then its child itemset
could also be a duplicate. If neither parent is a duplicate there is
no need to search the list of candidates for a duplicate.

This technique was applied further to identifying the first of a set
of duplicates. Since the duplicate itemsets from the previous level
are mapped to the first of a set of duplicates, the one that gets its
support counted, those itemsets can be easily identified. A separate
list to search for duplicate candidate itemsets is created. If a
parent itemset was the first of a set of duplicates its child is added
to the list to be searched. This list is often much shorter than the
list of all candidates to be counted.

Using these two applications of the previous level's record of
duplicate itemsets saves many comparisons of itemsets.

Testing this feature:

Need to test this feature using a dataset and parameters that result
in many candidates being generated but still completes the mining in a
reasonable amount of time.

Tasks:

1. Choose main testing dataset.

   - vary the number of duplicate event names and hence the number of
   duplicate itemsets that get generated

- ad hoc testing showed

     perfdata-I-D-N4-discretized-no-sequences.arff 

  to be a good data set
  to use. This data set includes increase, decrease, and sustain event
  types. Each event has a minimum length of 4. Numeric attributes are
  discretized and the original sequences have been removed.

- we could use a smaller data set, say with one less event type. Even
  though at first, the time it takes to generate candidates is
  negligible we can increase the number of duplicates to make up for
  it. This might exaggerate the effect using the previous duplicate
  hash will have on candidate generation but may be practical for many
  domains where repeated events are numerous and rules containing them
  are valuable.

2. Run Experiments

   - first with no duplicates, then increase number of duplicates
     until impractical to run

   - with many event type it will not take long to reach an
     impractical number

   - with less event type I don't know how long we can increase the
     duplicate number.

3. Put results into spreadsheet
   
   - important metrics

     - time to generate candidates for each level

     - total time to generate all candidates

     - overall mining time

     - number of candidates generated at each level

     - percentage of duplicate itemsets generated

     - percentage of unique itemsets generated

     - percentage of itemsets that we count support for

4. Answer original question. :)

   - how much wood could a woodpecker peck if a wood pecker could peck
     wood.

by: Keith A. Pray
Last Modified: July 4, 2004 8:03 AM
© 2004 - 1975 Keith A. Pray.
All rights reserved.

Current Theme: