wpi.associations.arminerSequence
Class ARMinerApriori

java.lang.Object
  extended bywpi.associations.arminerSequence.ARMinerApriori
All Implemented Interfaces:
FrequentItemsetsFinder

public class ARMinerApriori
extends java.lang.Object
implements FrequentItemsetsFinder

Implements the AprioriSetsAndSequences version of the Apriori algorithm for finding frequent itemsets from datasets containing time sequence attributes.

Author:
Keith A. Pray (kap@wpi.edu)

Field Summary
private  java.util.Vector allCandidates
          The list of all candidates including those with duplicate events items.
private  DBCacheWriter cache_writer
          Writes itemsets to a cache file
private  java.util.Vector candidates
          The list of candidate itemsets to count support for.
private  DBReader db_reader
          Reads rows from a database/dataset file
private static int debug
          Specifies debug info level 0: no debug info 1: input to methods 2: and out from method 3: and all sorts of stuff
private  java.util.Hashtable duplicateHash
          The map between duplicate itemsets (containing duplicate event items) and the first itemset.
private  int haveEvents
          Indicates if events are present in the data set.
private  HashTree ht_candidates
          The hashtree of candidate itemsets
private  HashTree ht_k_frequent
          The hashtree of frequent itemsets of size k
private static int INITIAL_CAPACITY
          Size to initialize data structures holding itemsets
private  java.util.Vector k_frequent
          The list of frequent itemsets of size k
protected  Logger logger
          The object responsible for logging.
private  int maxEvents
          Maximum number of events from the same attribute allowed in a rule
private  long min_weight
          Minimum number of rows/instances needed for an itemset to be considered frequent
private static int NO
          Don't use a feature
private  long num_rows
          Number of rows in the dataset
private  int numPruned
          Tracks how many itemsets are pruned using checkSubsets or hash tree's count subsets
private  int pass_num
          This remembers the number of passes and also indicates the current cardinality of the candidates.
private  java.util.Vector possibleDuplicates
          Itemsets that could possibly have duplicates later on.
private  ItemsetPrefixTree possibleDuplicatesTree
          Itemsets that could possibly have duplicates later on.
private  java.util.Hashtable previousDuplicateHash
          The map between duplicate itemsets (containing duplicate event items) and the first itemset that were found to be frequent.
private static int useItemsetPrefixTree
          Specify use of itemset prefix tree instead of vector for looking up possible duplicate itemsets during generation.
private static int usePostGenerationGC
          Specify post candidate generation garbage collection.
private static int usePreviousDuplicateHash
          Specify use of previous duplicate hash for testing possibility of an itemset being a duplicate or first of a set of duplicate itemsets.
private static int usePruneCandidateGeneration
          Specify use of pruneCandidateGeneration to save calls to getCandidate
private static int YES
          Use a feature
 
Constructor Summary
ARMinerApriori()
           
 
Method Summary
private  boolean checkSubsets(ARMinerItemset itemset)
          Checks to see if all the subsets of the specified itemset are frequent.
private  void evaluateCandidates()
          This procedure checks to see which itemsets are frequent
 int findFrequentItemsets(DBReader dbReader, DBCacheWriter cacheWriter, float minSupport)
          Find the frequent itemsets in a database
private  void generateCandidates()
          Generates new candidates out of itemsets that are frequent.
private  boolean getCandidate(ARMinerItemset is_i, ARMinerItemset is_j)
          This procedure tries to combine itemsets i and j and returns true if succesful, false if it can't combine them.
private  boolean getCandidate(int i, int j)
          This procedure tries to combine itemsets i and j and returns true if succesful, false if it can't combine them.
 int getMaxEvents()
          Returns the number of event items of the same kind allowed in an itemset.
private  boolean pruneCandidateGeneration(ARMinerItemset isi, ARMinerItemset isj)
          This method is used to prune candidate generation when the dataset contains events.
 boolean setLogger(Logger log)
          Sets the logging mechanism.
 void setMaxEvents(int num)
          Sets the number of event items of the same kind allowed in an itemset.
private  void weighCandidates()
          This procedure scans the database and computes the weight of each candidate.
private  void weighItemset(ARMinerItemset itemset)
          This procedure scans the database and computes the weight of the itemset.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INITIAL_CAPACITY

private static final int INITIAL_CAPACITY
Size to initialize data structures holding itemsets

See Also:
Constant Field Values

YES

private static final int YES
Use a feature

See Also:
Constant Field Values

NO

private static final int NO
Don't use a feature

See Also:
Constant Field Values

usePruneCandidateGeneration

private static int usePruneCandidateGeneration
Specify use of pruneCandidateGeneration to save calls to getCandidate


useItemsetPrefixTree

private static int useItemsetPrefixTree
Specify use of itemset prefix tree instead of vector for looking up possible duplicate itemsets during generation.


usePostGenerationGC

private static int usePostGenerationGC
Specify post candidate generation garbage collection. Default should be NO


usePreviousDuplicateHash

private static int usePreviousDuplicateHash
Specify use of previous duplicate hash for testing possibility of an itemset being a duplicate or first of a set of duplicate itemsets. This is useful when a large percentage of unique generated itemsets are not duplicates. This can happen when there are few sequence event attributes. If set to NO useItemsetPrefixTree must be set to YES.


allCandidates

private java.util.Vector allCandidates
The list of all candidates including those with duplicate events items.


candidates

private java.util.Vector candidates
The list of candidate itemsets to count support for.


k_frequent

private java.util.Vector k_frequent
The list of frequent itemsets of size k


ht_candidates

private HashTree ht_candidates
The hashtree of candidate itemsets


ht_k_frequent

private HashTree ht_k_frequent
The hashtree of frequent itemsets of size k


duplicateHash

private java.util.Hashtable duplicateHash
The map between duplicate itemsets (containing duplicate event items) and the first itemset. This is used to update the support of these itemsets after the support is counted for the others.


previousDuplicateHash

private java.util.Hashtable previousDuplicateHash
The map between duplicate itemsets (containing duplicate event items) and the first itemset that were found to be frequent. This is used for helping identify new generated itemsets as duplicates.


possibleDuplicates

private java.util.Vector possibleDuplicates
Itemsets that could possibly have duplicates later on. Hopefully this list will be shorter than the candidate list and hence save time when searching for the first of a set of duplicate item sets.


possibleDuplicatesTree

private ItemsetPrefixTree possibleDuplicatesTree
Itemsets that could possibly have duplicates later on. Hopefully this list will be shorter than the candidate list and hence save memory when searching for the first of a set of duplicate item sets. For larger lists of itemsets this should be much faster than using the Vector possibleDuplicates.


pass_num

private int pass_num
This remembers the number of passes and also indicates the current cardinality of the candidates.


db_reader

private DBReader db_reader
Reads rows from a database/dataset file


cache_writer

private DBCacheWriter cache_writer
Writes itemsets to a cache file


num_rows

private long num_rows
Number of rows in the dataset


min_weight

private long min_weight
Minimum number of rows/instances needed for an itemset to be considered frequent


logger

protected Logger logger
The object responsible for logging.


maxEvents

private int maxEvents
Maximum number of events from the same attribute allowed in a rule


haveEvents

private int haveEvents
Indicates if events are present in the data set.


numPruned

private int numPruned
Tracks how many itemsets are pruned using checkSubsets or hash tree's count subsets


debug

private static final int debug
Specifies debug info level 0: no debug info 1: input to methods 2: and out from method 3: and all sorts of stuff

See Also:
Constant Field Values
Constructor Detail

ARMinerApriori

public ARMinerApriori()
Method Detail

findFrequentItemsets

public int findFrequentItemsets(DBReader dbReader,
                                DBCacheWriter cacheWriter,
                                float minSupport)
Find the frequent itemsets in a database

Specified by:
findFrequentItemsets in interface FrequentItemsetsFinder
Parameters:
dbReader - the object used to read from the database
cacheWriter - the object used to write to the cache if this is null, then nothing will be saved, this is useful for benchmarking
minSupport - the minimum support
Returns:
the number of passes executed over the database

weighCandidates

private void weighCandidates()
This procedure scans the database and computes the weight of each candidate.


weighItemset

private void weighItemset(ARMinerItemset itemset)
This procedure scans the database and computes the weight of the itemset.


evaluateCandidates

private void evaluateCandidates()
This procedure checks to see which itemsets are frequent


generateCandidates

private void generateCandidates()
Generates new candidates out of itemsets that are frequent.


getCandidate

private boolean getCandidate(int i,
                             int j)
This procedure tries to combine itemsets i and j and returns true if succesful, false if it can't combine them.


getCandidate

private boolean getCandidate(ARMinerItemset is_i,
                             ARMinerItemset is_j)
This procedure tries to combine itemsets i and j and returns true if succesful, false if it can't combine them.


checkSubsets

private boolean checkSubsets(ARMinerItemset itemset)
Checks to see if all the subsets of the specified itemset are frequent.

Parameters:
itemset - the itemset to check the subsets of
Returns:
true if all the subsets are frequent, false otherwise

pruneCandidateGeneration

private boolean pruneCandidateGeneration(ARMinerItemset isi,
                                         ARMinerItemset isj)
This method is used to prune candidate generation when the dataset contains events. It is assumed that the itemsets are of the same length. Compares only the regular items in the specified itemsets. Returns true if the generation of candidates should be stopped with the current isi itemset.

Parameters:
isi - the first itemset to compare
isj - the itemset to compare to the first
Returns:
true if it is ok to prune the itemsets that would be generated with isi and larger isj, otherwise false

getMaxEvents

public int getMaxEvents()
Returns the number of event items of the same kind allowed in an itemset.

Returns:
the maximum number of event items

setMaxEvents

public void setMaxEvents(int num)
Sets the number of event items of the same kind allowed in an itemset.

Parameters:
num - the maximum number of event items

setLogger

public boolean setLogger(Logger log)
Sets the logging mechanism. Specifying null turns logging off.

Parameters:
log - the logging mechanism, null turns logging off
Returns:
true if logging is on, false if logging off