Level 3 (and up) Candidate Generation

Next: Calculating Confidence Up: Apriori Sets And Sequences Previous: Level 2 (and up) Contents

Level 3 (and up) Candidate Generation

At the third level of candidate itemset generation, generating possible frequent itemsets of size 3, there are ways to eliminate some of the candidates before counting support. This saves time. Regular Apriori uses a join rule. The items in an itemset are sorted in ascending order according to their item number. To join two itemsets to form a new itemset one size larger there are a list of conditions that must be met. They must have the same number of items. The list of items in both itemsets must be the same except for the last item which must be different. Furthermore, the last item in the second itemset must have a higher item number than the last item in the first itemset. If two itemsets do not meet these criteria they are not combined to generate a candidate itemset. The candidate itemset formed by combining two itemsets simply has the superset of the items in each. This is easily done by adding the last item of the second itemset to the first itemset.

Combining itemsets has been modified for Apriori Sets And Sequences. All of the conditions still hold for regular items. The event items have to be handled as a special case. Since there is the possibility of there being more than one temporal relationship between the same set of event items the item number alone cannot be used to decide if two itemsets should not be joined. Rather, a mapping must exist from all the event items in the first itemset to all the event items in the second itemset. This excludes event items that are the last item in an itemset. Once this mapping is found the itemsets can be combined.

As for level two candidate generation, it is possible for more than one candidate itemset to be generated. The algorithm for generating these possibilities is more involved than the straight forward iteration of thirteen possible temporal relationships. When more than two event items are present the number of temporal combinations grows. This is limited during candidate generation by using the relative temporal information contained in each itemset.

**Figure 5:** Combine Itemsets
$\begin{figure}\begin {center} A: \{ Disk Increase (0:2), CPU Increase (1:3) \} ... ...B: \{ Disk Increase (1:2), Memory Sustain (0:3) \} \end {center} \end{figure}$

Consider combining the itemsets A and B shown in Figure 5. These can be combined since they have the same number of items, a mapping exists between the Disk Increase event in itemset A and the Disk Increase event in itemset B, and the event items listed last in A and B are different. The temporal relationship between the CPU Increase event in A and the Memory Sustain event in B is not known. Some of the possible relationships can be eliminated by inferring information from the fact that the Disk Increase event in both A and B is the same event. Figure 6 illustrates this inference.

**Figure 6:** Selective Candidate Generation
$\includegraphics[width=0.5\textwidth, bb=0 0 720 382, clip=true]{Slide19}$

Combining the two itemsets is done by adding the last item from the first itemset to the second itemset. In this example the event item Memory Sustain will be added to itemset A. First, the begin time of Memory Sustain event to be added to itemset A must be determined. In itemset B the Memory Sustain event began before the Disk Increase event began. This temporal relationship must hold for the itemsets being generated. In the itemset A there is nothing before the begin time of Disk Increase. Therefore the only relative time Memory Sustain can begin is one time line value before time 0.

Next the end time of the Memory Sustain event must be determined. From itemset B it can be seen that the Disk Increase event ended before the Memory Sustain event ended. This means that each relative time in the itemset A that occurs after the end of the Disk Increase event is potentially when the end of the Memory Sustain event occurs. This includes the times noted in the time line for A and each time in between those marked times. Figure 4 illustrates how each event can be related in time.

The possible begin and end time pairs determined are in relation to itemset A's existing relative time line. A candidate itemset is generated for each pair. In these candidate itemsets the relative time line will be renumbered starting from 0 to include the Memory Sustain event as shown in Figure 6.

By inferring temporal relationships from the event items in common between the itemsets that are being combined many candidate itemsets representing different temporal relationships can be eliminated and hence save time during mining. An important observation to make is that this reasoning is only necessary when each itemset has more than one event item. If each itemset contains exactly one event item each the generation of candidates follows the procedure described in Level 2 Candidate Generation.

Next: Calculating Confidence Up: Apriori Sets And Sequences Previous: Level 2 (and up) Contents

Keith A. Pray 2003-06-17