next up previous contents
Next: Level 1 Counting Support Up: Apriori Sets And Sequences Previous: Input   Contents

Level 1 Candidate Generation

In regular Apriori the first level of candidate itemsets is generated by simply creating an itemset of size one for each of the items appearing in the data set. These itemsets are added to a list of candidate itemsets. The number of instances in the data set that contain the itemsets in this list will be counted. This count is used to calculate the support of each itemset. Only those itemsets that have a support equal to or higher than the minimum support specified by the user will be kept to generate the next level of candidates.

This basic process is used in Apriori Sets And Sequences with some important differences. Instead of one list of candidate itemsets there are two. The candidate list holds the candidates that will be counted in the data set. The all candidate list holds itemsets that do not have to be counted. These itemset are conceptual duplicates of those in the candidate list.

The items that are placed in the first level of candidates are the items that will eventually be used in the association rules found during mining. There is a distinction made between event items in the data set and the event items that exist in rules. In the data set an event item refers to a particular occurrence of an event in a specific time sequence attribute during a specific time interval. There can only be one event of that type occurring in that sequence during that time interval. This event item could appear in more than one instance but it is more likely that very few instance will have the exact same time interval. This make event items very numerous. Furthermore the time interval is too specific. Trying to mine rules directly from these would result in very few rules.

A new event item is created and added to the list of items for each event type in the data set. It might be that the user wishes to find associations that contains more than one occurrence of an event type. In this case the user will have specified the maximum number of event items of the same type allowed in a rule. As Apriori Sets And Sequences creates the candidate itemsets of size 1 it scans for event items. When it finds an event item, a count of new event items of that type that have been created is made. If this number is below the maximum a new event items of that type is created with begin time 0 and end time 1. The total number of new event items that get created and used during mining is equal to or less than the maximum number times the number of of event types. This usually greatly reduces the number of total items and hence reduces the time to mine.

Since we have event items that are of the same event type and have the same begin and end times it is only necessary to count the support of one of the itemsets containing each. An itemset is considered a duplicate of another when both contain the same regular items and each event item has a corresponding event item of the same type with the same begin and end times. It is trivial to identify these duplicate itemsets when generating candidates of size one. When a duplicate itemset is generated it is not added to the list of candidates that have their support counted. It gets added to a master list of all candidates, the all candidates list. A hash table stores the map of duplicate itemsets in the all candidates list to itemsets in the candidate list.


next up previous contents
Next: Level 1 Counting Support Up: Apriori Sets And Sequences Previous: Input   Contents
Keith A. Pray 2003-06-17