Temporal
[ Intro ]
[ Adding Temporal Semantics to Association Rules-Notes ]
[ An Approach To Discovering Temporal Association Rules ]
[ An Approach to Discovering Temporal Association Rules-Notes ]
[ Up: References ]
Copyright ACM, 2000
AN APPROACH TO DISCOVERING TEMPORAL ASSOCIATION RULES�
� Keywords��� Data Mining, Association Rules, Temporal Data Mining, Temporal Rules. ABSTRACT��� The goal of discovering association rules is to discover all possible associations that accomplish certain restrictions (minimum support and confidence and interesting). However, it is possible to find interesting associations with a high confidence level but with little support. This problem is caused by the way support is calculated, as the denominator represents the total number of transactions in a time period when the involved items may have not existed. If, on the other hand, we limit the total transactions to the ones belonging to the items’ lifetime, those associations would be now discovered, as they would count on enough support. Another difficulty is the large number of rules that could be generated, for which many solutions have been proposed. Using age as an obsolescence factor for rules helps reduce the number of rules to be presented to the user. In this paper we expand the notion of association rules incorporating time to the frequent itemsets discovered. The concept of temporal support is introduced and, as an example, the known algorithm A priori is modified to incorporate the temporal notions. 1. INTRODUCTION��� The problem of the discovery of association rules comes from the need to discover patterns in transaction data in a supermarket. But transaction data are temporal. For example, when gathering data about products purchased in a supermarket, the time of the purchase is registered in the transaction. This is called transaction time, in temporal databases jargon , which matches the valid time, corresponding to the time of the business transaction confirmation at the register. In [12], the author expresses: "The time dimension is the one dimension virtually guaranteed to be present in every data warehouse, because virtually every data warehouse is a time series". ��� In large data volumes, as used for data mining purposes, we may find information related to products that did not necessarily exist throughout the data gathering period. So we can find some products that, at the moment of performing that mining, have already been discontinued. There may be also new products that were introduced after the beginning of the gathering. Some of these new products should participate in the associations, but may not be included in any rule because of support restrictions. For example, if the total number of transactions is 30,000,000 and we fix as minimum support 0.5 %, then a particular product must appear in, at least, 150,000 transactions to be considered frequent. Moreover, suppose that these transactions were recorded during the last 30 months, at 1,000,000 per month. Now, take a product that has been sold during the 30 months and has just the minimum support: it appears on average in 5,000 transactions per month. Consider now another product that was incorporated in the last 6 months and that appears in 20,000 transactions per month. The total number of transactions in which it occurs is 120,000; for that reason, it is not frequent, even though it is four times as popular as the first. However, if we consider just the transactions generated since the product appeared in the market, its support might be above the stipulated minimum. In our example, the support for the new product would be 2%, relative to its lifetime, since in 6 months the total of transactions would be about 6,000,000 and this product appears in 120,000 of them. Therefore, these new products would appear in interesting and potentially useful association rules. ��� One way to solve this problem is by incorporating time in the model of discovery of association rules. We will call these rules Temporal Association Rules. ��� One subproduct of this idea is the possibility of eliminating outdated rules, according to the user criteria. Moreover, it is possible to delete obsolete itemsets as a function of their lifetime, reducing the amount of work to be done in the determination of the frequent itemsets and, hence, in the determination of the rules. ��� The temporal association rules introduced in this paper are an extension of the nontemporal model. The basic idea is to limit the search for frequent sets of items or itemsets to the lifetime of the itemset’s members. For that reason, the concept of temporal support is introduced. Thus, each rule has an associated time frame, corresponding to the lifetime of the items participating in the rule . If the extent of a rule’s lifetime exceeds a minimum stipulated by the user, we analyze if the rule is frequent in that period. This concept allows us to find rules that, with the traditional frequency viewpoint, it would not be possible to discover. ��� The remainder of this paper is organized as follows. Related work on discovery of association rules, temporal data mining in general, and discovery of temporal rules in temporal databases, in particular is given in Section 2. The temporal model is introduced in Section 3. In Section 4 we discuss the discovery of temporal rules, adapting the A priori method as an example. Also in section 4 we analyze changes to an existing algorithm for the rules’ generation. Finally, in section 5 we conclude and briefly discuss future work. 2. RELATED WORK��� The problem of discovering associations from data was introduced by Agrawal et al. in [1]. It was followed by successive refinements, generalizations and improvements ([5], [3], [9], [15], [17], [18]). Among these we can find improved algorithms for the discovery of frequent itemsets, generalized and quantitative association rules, and new measures for other types of data, different from the market basket. ��� Previous work about data mining that includes temporal aspects is usually related to the sequence of events’ analysis ([2], [7], [8], [13]). The usual objective is to discover regularities in the occurrence of certain events and temporal relationships between the different events. In particular, in [13], the problem of recognizing frequent episodes in an event sequence is discussed; an episode is defined there as a collection of events that occur during time intervals of a specific size. Meanwhile [6] reviews the problem of discovering sequential patterns in transactional data bases. The solution consists in creating a sequence for every client and to look for frequent patterns into each sequence. ��� In [7] and [8] more complex patterns than in the cases mentioned above are considered. In these cases temporal distances with multiple granularities are treated. ��� Now we will analyze how the present work is related to others, specifically in mining temporal association rules. All of them have the same goals as ours: the discovery of association rules and their periods or interval time of validity. Our proposal was formulated independently of the others but shares with them some similarities. In [14] they study the problem of association rules that exist in certain time intervals and thus display regular cyclic variations over time. They present algorithms for efficiently discover what they called "cyclic association rules". It is assumed that time intervals are specified for the user. ��� In [16] the authors study how the association rules vary over time, generalizing the work in [14]. They introduce the notion of calendar algebra to describe temporal phenomena of interest to the users and present algorithms for discovering "calendric association rules", that is, association rules that follow the temporal patterns set forth in the user supplied calendar expressions. ��� The third study([11]) also suggests calendar time expressions to represent temporal rules. They present only the basic ideas of the algorithms for discovering the temporal rules. ��� Our approach is based on taking into account the items’ period of life or lifespan, this being the period between the first and the last time the item appears in transactions in the database. We compute the support of an itemset in the interval defined by its lifespan and define temporal support as the minimum interval width. Our approach differs from the others in that it is not necessary to define interval or calendars since the lifespan is intrinsic to the data. In addition, we describe in detail the temporal extension to the Apriori algorithm. 3. THE TEMPORAL MODEL��� Let T = { . . ., t0, t1, t2, . . . } be a set of times, countably infinite, over which a linear order <T is defined, where ti <T tj means ti occurs before or is earlier than tj([21]). We will assume that T is isomorphic to N ( natural numbers) and restrict our attention to closed intervals [ti, tj].
X1 = {A}, lX1 =[1,6], since we find A
conf(X� Y,[t1, t2], d) = s(X � Y, lX� Y, d) / s(X, lX� Y, d),
4. TEMPORAL RULES DISCOVERY��� The discovery of all the association rules in a transaction set d can be made in two phases [AIS93]: ��� Phase 1. Find every set of items(itemsets) X � R that is frequent, i.e. their frequency exceeds the established minimum support s . ��� Phase 2. Use the frequent itemsets X to find the rules: test for every Y � X, with Y � � , if the rule X \ Y � Y satisfies with enough confidence, i.e. it exceeds the established minimum confidence q . ��� In the following paragraph, we introduce suitable modifications to support temporal association rules discovery: ��� Phase 1T. Find every itemset X � R such that X is frequent in its lifespan lX, i.e. s(X,lX, d) � s and |lX| � t . ��� Phase 2T. Use the frequent itemsets X to find the rules: verify for every Y � X, with Y � � , if the rule X \ Y � Y [t1, t2] is satisfied with enough confidence, in other words, exceeds the minimum confidence q established in the interval [t1, t2]. 4.1 Generating Frequent Itemsets��� Any of the proposed algorithms in the literature ([5], [9], [15]) for association rule discovery may be conveniently modified for its application for temporal association rules. ��� Let’s see, for example, the Apriori algorithm of [5] into which we will introduce some slight changes to generate association rules taking time into consideration. As in the original notation, Lk represents the set of frequent k-itemsets. Each member of this set will have the following fields: i) itemset, ii) lower and upper limits of the life time of the item: t1 and t2, iii)count of support (Fr) of the itemset in [t1, t2] and iv) total number of transactions (FTr) found in the interval [t1, t2]. Ck is the set of candidate k-itemsets; in other words, potentially frequent itemsets that have associated the same information as the members of Lk.
��� L1 is obtained in the first pass, in which the items’ occurrences are counted to determine the 1-frequent itemsets. For each itemset we store its lifespan [t1, t2]. Besides counting the absolute frequency for each itemset, Fr, we count the total number of transactions between t1 and t2, FTr. Then if Fr / FTr � minimum support s and if t2 – t1 � minimum temporal support t , we will say that the itemset X has minimum support in [t1, t2]. ��� Some items could be deleted from L1 because they were obsolete, i.e., they have interval lifespans [t1, t2] and t2 < to. After deleting the obsolete items, the following lemma assures us that it is not necessary to check for obsolete k-itemsets, with k > 1, anymore.
��� Proof: based on the definition of k-itemset lifespan, with k > 1. ��� Every following pass k consists of two phases: in the first are obtained the candidate itemsets Ck of size k, based on the frequent itemsets Lk-1 of size k-1 obtained in the k-1 pass, by means of the function apriori-gen. The lifespan of a k-itemset with k > 1 is obtained in the following way: if the k-itemset u is obtained putting together k-1-itemsets v and w, then the lifespan of u is [u.t1,u. t2], with u.t1 = max {v.t1, w.t1} and u.t2 = min {v.t2, w.t2}. In the second phase we read the transactions’ database to compute the support of the candidate itemsets of Ck, for which the function subset is used; it determines if each c member of Ck is contained in the transaction s. The timestamp t of s must satisfy t � lc. The algorithm does as many passes over the database as the maximal cardinality of a candidate itemset. Generating a priori candidates��� The candidates’ generation is made by means of the function apriori-gen. This function takes as argument Lk-1, the set of all frequent (k-1)-itemsets, and returns a superset of the frequent k-itemsets, each one with their associated lifespan represented generically by the interval [t1, t2]. This function has been organized into a Join Step anda Pruning Step.
���� In the next step (pruning) all the candidate itemsets c � Ck such that any subset of c with k-1 items is not in Lk-1 are deleted. In the same way the itemsets c such that |lc| < t are deleted too.
4.2 Generating Rules��� To generate the rules, it is necessary to find all the subsets for every frequent itemset. Then, given a frequent itemset Z we must find, for each proper subset X of Z, the rules X � (Z – X) [t1, t2] such that s(Z,lZ,d) / s(X,lZ,d) � q . ��� One of the problems we find in computing confidence, in accordance with the definition 8, conf(X� Y,[t1, t2], d) = s(X � Y, lX� Y, d) / s(X, lX� Y, d), ��� where lX� Y = {[t1, t2]}, is the determination of s(X, lX� Y, d). Evidently, s(X, lX� Y, d) may not be equal to s(X, lX, d), since lX� Y � lX. But in Phase1T we have calculated s(X, lX, d), and not s(X, lX� Y, d). Since, if XY is a frequent itemset of size k we will have 2k possible subsets, we should recalculate the�frequency for 2k-2 itemsets in lX� Y, and repeat that in each k-th pass, with k > 1. A way to avoid this, is to use an estimation in its place. In the simplest case, if we consider that all itemsets X have a uniform temporal distribution, then the chance of appearance in any subset of lX, in particular in lX� Y, will be the same. Then we will be able to estimate s(X, lX� Y, d) as s(X, lX, d). Then, modifying an algorithm as [ASr94a] to obtain all possible rules, given a frequent itemset Z, is immediate. 5. CONCLUSIONS AND FUTURE WORK��� In this paper we have introduced time in the problem of association rules discovery, given place to what we call Temporal Association Rules. Each item, itemset and rule has now an associated lifespan, which comes from the explicitly defined time in database transactions. We have also introduced the concept of temporal support. This gives way to the discovery of new rules that, due to the lack of necessary support, were not discovered with the traditional viewpoint. Now, with the concept of time, we consider the rules that have enough support in their lifespan, as long as they also have temporal support. ��� One of the problems related to the discovery of association rules that is often mentioned, is the great number of rules that can be generated. A solution is that the user may say which dates are old enough, so the rules with lifespan previous to those dates would be considered obsolete and not presented to the user. Furthermore, if the algorithm used to generate the frequent itemsets finds old items or itemsets, it may eliminate them directly, which it would be an additional pruning. ��� To show the incidence of time in the amount and quality of the obtained rules we extend, as an example, the A priori algorithm that generates the frequent itemsets. ��� We are currently implementing our algorithm for temporal association rule discovery. Besides, we will analyze the problem of the maintenance of temporal association In addition to what was considered in [10], we will investigate the concept of temporal border; the temporal border includes itemsets that do not have enough temporal support, but such that the upper limit of their lifespan corresponds, at less than a Dt, with the temporal limit of the original database.
�
�
�
Figure 1: Example 3 � 6. REFERENCES
�
� Copyright 2000 ACMPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and or fee.SAC 2000 March 19-21 Como, Italy (c) 2000 ACM 1-58113-239-5/00/003>...>$5.00
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|