performance-improvements

Intro ] [ check-subsets ] [ post-candidate-generation-garbage-collection ] [ prune-candidate-generation ] [ remove-candidates ] [ template ]

Itemset Prefix Tree ] [ Previous Duplicate Hash ] [ Up: Results ]

This implements the Apriori prune. Every subset of a frequent itemset
must be frequent. If a candidate itemset is frequent all its subsets
must be. In the previous level all the frequent itemsets were
found. If a candidate itemset's subsets are not all found in the
previous list of frequent itemsets it can be removed from the
candidate list.

History of this feature:

The prune as implemented in ARMiner does not work for itemsets
containing event items. The main problem is due to the use of a prefix
hash tree in ARMiner to index the candidate itemsets. In the interest
of verifying the new system for event items worked the prune was
simply removed. Once Apriori Sets and Sequences was producing valid
results it was clear performance of the system had to be
addressed. The first obvious step of this was to put back in place the
Apriori prune.

The prune has thus been rewritten to accommodate event items. There is
still room for improvement in the CheckSubsets method. Right now the
list of frequent itemsets from the previous level is searched
sequentially for the subset itemsets from the newly generated
candidate. This could be done more quickly using a prefix hash tree
object or an itemset prefix tree to index the frequent itemsets.

Testing:

How should CheckSubsets and ARMiner's implementation be compared?

How do they differ in the time it takes to generate candidates.

It would be nice to compare how many candidates get pruned but it
seems too related to the characteristics of the data rather than the
algorithm used.

I can run both over regular data I guess and prove they prune the
same itemsets anyway. This would involve some code changes. I don't
know how much right now.

 

by: Keith A. Pray
Last Modified: July 4, 2004 8:03 AM
© 2004 - 1975 Keith A. Pray.
All rights reserved.