Apriori Sets And Sequences - Keith's MS Thesis
Apriori Sets And Sequences
About
Code
Performance Data Collection
·
Data Sets
Documents
Results
·
                                          
Printer Friendly Version
performance-improvements

Intro ] [ check-subsets ] [ post-candidate-generation-garbage-collection ] [ prune-candidate-generation ] [ remove-candidates ] [ template ]

Itemset Prefix Tree ] [ Previous Duplicate Hash ] [ Up: Results ]

Pruning candidate generation detects when trying to combine itemset i
with all the other frequent itemsets left in the list to produce the
candidate for the next level will always fail. With the list of
frequent itemsets ordered lexigraphically it is only necessary to
detect a single itemset with which the itemset i cannot be
combined. The time it would take to test if the rest of the itemsets
can be combined with i is saved.

History of this feature:

ARMiner originally implemented this feature. Since the frequent
itemsets are ordered lexigraphically as soon as an item number in the
i itemsets is not found in the next frequent itemset in the list to be
tested (need a good way of saying tested if it can be combined with)
then it is guaranteed that that item will not be found in any other
itemsets further along in the list.

When event items were introduced this feature had to be
disabled. Event items, as current implemented, are not guaranteed to
have duplicates numbered contiguously. An event item numbered 67 could
be conceptually the same as one numbered 93. Because if this it is not
guaranteed that itemset i could not be combined with another itemset
further along in the frequent itemset list.

After event items were successfully introduced and Apriori Sets and
Sequences was producing valid results a revised implementation of this
prune was constructed. In the new implementation if an item was an
event items it was ignored when determining if the current branch, the
branch combining itemset i with the other frequent itemsets, should be
pruned. this is only effective if there are regular items present in
the data set on a macro scale. On a micro scale the prune is only
effective if itemset i contains a regular item.

This prune method can be modified to use event items if event items in
the frequent itemsets can be guaranteed to be numbered contiguously. 
Before the prune could take place we would have to go beyond the
contiguous range of item numbers without finding the current item in
itemset i. Even with this consideration a significant savings in time
could be realized.

Testing this feature:

Counting the number of comparisons to test if an itemset can be
combined with another that are saved by using the prune method should
do. It is certain to be a time saver if the dataset contains regular
items. The time saved depends mostly on the data. It will suffice to
show that it takes less time for the data used than without. It must
not be forgotten that the reason this and the other performance
improvements were implemented was to make the time it takes to mine
associations from this data feasiable.

Tasks:

1. Choose main testing dataset.

2. Run Experiments

3. Put results into spreadsheet

4. Answer original question, how much time is saved?

by: Keith A. Pray
Last Modified: July 4, 2004 8:03 AM
© 2004 - 1975 Keith A. Pray.
All rights reserved.

Current Theme: