next up previous contents
Next: Background Up: Master's Thesis: Mining Association Previous: Master's Thesis: Mining Association   Contents

Introduction

Mining association rules has been the focus of much study since its application to market basket data was introduced in [AIS93]. The appeal of association rules lies in their comprehensibility. It is much easier for people to read association rules and understand the knowledge those rules represent than say the weights of a trained neural network. Association rules can be used in rule based expert systems, filtered to form classification rules, and can provide explanations for reoccurring values in data.

To date, available systems and tools that mine association rules using Apriori [AS94], the standard algorithm for mining association rules, rely on relational data. Data naturally occurring in a non-relational format has to be transformed to be used in these systems. Transformations are frequently used when data occurs as a sequence of values. The closing price of a stock may be used to predict tomorrow's closing price instead of the prices occurring throughout the day. The average value of a engine sensor may be used to predict a component failure instead of the sensor readings gathered the entire time the engine is running. By not using the entire sequence, individual values that may be useful are lost and many temporal relationships are not represented.

We define a time sequence attribute as having a value that is an ordered sequence. The time sequence attributes in a single instance all share the same time line. Each value is associated with a specific time relative to that instance. Values can be numeric or symbolic. Time sequence attributes are in addition to regular attributes having a single value in an instance. A small data sample can be seen in Figure [*].

Figure 1: Data Sample - A Single Stock Market Instance
\includegraphics[width=0.5\textwidth, bb=0 0 720 379,
clip=true]{Slide33}

The motivation for mining rules from time sequence attributes comes from previous experience working in a software engineering performance group. The group's focus was a hardware and software backup/restore product. The time it takes to complete a backup can be of great importance. One of the most labor consuming tasks was to predict this time. A rule based expert system was built as a way to disseminate the performance knowledge used to predict backup time. This expert system quickly grew to include over six hundred manually written rules and the need for automating the discovery of new rules was apparent.

To create a rule to handle a new feature or new technology an estimation is made about how performance will be affected. Tremendous amounts of information is collected to verify that the estimation is accurate. For a small scenario including a backup server, a single client, the Ethernet network that connects them, and a single SCSI connected tape drive, an average of 2 MB of data can be collected every ten minutes a test is run. This includes CPU and memory usage for all processes running on both server and client, I/O metrics for storage sub-systems and the tape drive and the connections between them and the client and server machines as well as network traffic. With a short test run of an hour, analyzing this amount of data manually is very impractical. The reasoning behind the initial estimation serves to focus analysis efforts but was usually inadequate. It is often the case that the estimation is not sufficiently accurate and more analysis must be done to identify the factors behind the observed performance. This process can take weeks or months and often involves running new tests and collecting even more data to be analyzed.

If there was a way to mine for association rules from these data, the resulting rules could be used to focus analysis efforts and possibly explain the observed behaviour. An initial investigation of machine learning techniques and data mining showed there were no systems or tools that accepted this type of data as input let alone mining association rules from it. Our efforts have since been focused on enhancing Apriori to mine association rules from data sets containing time sequence attributes and expanding a data mining system to accept this type of data. Not only would this be a substantial contribution to the computer system performance domain but it would also expand the general use of association rules to complex temporal relationships essential to many other domains such as stock market analysis, complex system diagnostics, medical monitoring, and clinical studies.

Our algorithm, Apriori Sets And Sequences, takes as input a complex temporal data set as described, a set of predefined event types, and the regular Apriori support and confidence parameters. The event types are actually filters that detect patterns or sub-sequences in the time sequence attributes. These events can be found using any pattern matching algorithm or system the user wishes. The only requirement is that each event be identified by a begin and end time.

Once events are found Apriori Sets And Sequences uses the begin and end times to count support for all the possible temporal relationships between event types. A list of frequent itemsets are found and association rules are produced. The rules produced may contain more than one event. These events are related in time as expressed by their relative begin and end times. This produces an easy to read description of the behavior observed in the data set.

The rest of this document discusses data representation, events, the event detection methods used, changes to the Apriori algorithm to create Apriori Sets And Sequences, implementation challenges, initial results in the computer system performance domain, and other domains and projects to which Apriori Sets And Sequences has been applied.


next up previous contents
Next: Background Up: Master's Thesis: Mining Association Previous: Master's Thesis: Mining Association   Contents
Keith A. Pray 2003-06-17