Next: Simple Event Detection Up: Master's Thesis: Mining Association Previous: Related Work   Contents

Data Representation

Apriori Sets And Sequences deals with two data types not commonly found in data mining systems: time sequences and events. A time sequence is an ordered list of values, each associated with a time of occurrence along a time line specific to the data instance to which the sequence belongs. An event is an instance of an event type, marked by a begin time and an end time along a time line specific to the data instance to which the event belongs. All the events of a single type that occur in a single instance are stored as a set attribute.

Weka, the machine learning and data mining system from the University of Waikato, New Zealand, uses the ARFF data file format. Neither ARFF nor Weka makes any provision for time sequence attributes, event attributes, or set attributes. Work done at WPI [Sho01] incorporated set attributes into the ARFF file format by overloading the string data type. Weka ignores the set-specific notation and treats the value as a string with no other meaning. The same approach was used here to store time sequence attributes and events in ARFF.

A time sequence is stored as a list of values delimited by colons ' : '. Each value corresponds to the time value at the same position in the time line sequence. This detail matters only to event detectors that consider the absolute time at which certain values occur when detecting events. For many purposes the relative order of the time sequence values is enough. The format also accommodates more complicated representations, such as a nonuniform time line whose increments in time are not constant.
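As an illustration only, the colon-delimited format could be read with a few lines of Python. The function names and the sample readings are hypothetical, and the sketch assumes the time line is itself available as a colon-delimited string of the same length; the thesis does not specify how the time line is supplied.

```python
def parse_sequence(text):
    """Split a colon-delimited sequence string into a list of floats."""
    if not text.strip():
        return []  # an empty string is treated as an empty sequence
    # float() tolerates surrounding whitespace, so "1 : 2" parses as well.
    return [float(value) for value in text.split(":")]


def pair_with_time_line(values_text, time_line_text):
    """Pair each sequence value with the time at the same position."""
    values = parse_sequence(values_text)
    times = parse_sequence(time_line_text)
    if len(values) != len(times):
        raise ValueError("sequence and time line must have the same length")
    return list(zip(times, values))


# Hypothetical CPU usage readings on a nonuniform time line.
pairs = pair_with_time_line("10:35:80:40", "0:1:5:6")
```

Here `pairs` holds (time, value) tuples, so a detector that needs absolute time can use the first element, while one that needs only relative order can ignore it.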

An event uses the same delimiter as a sequence, but the sequence has only two values: the first is the begin time of the event and the second is the end time. In this work the convention of using a caret ' ^ ' has been adopted to delimit values in a set attribute. An event attribute therefore holds zero or more events, each represented as a sequence of two values, with the events separated from one another by carets.

Figure 3: Event Attribute
CPU-Increase = { 3:8 ^ 13:18 ^ 23:28 }

Figure 3 shows an example of an event attribute for the CPU time sequence attribute. The CPU time sequence attribute is the percentage of CPU usage in the overall computer system. This event attribute specifies when the CPU usage increases. Increases occur from time 3 to time 8, 13 to 18, and 23 to 28.
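The event attribute notation of Figure 3 could be parsed roughly as follows. This is a minimal sketch rather than the thesis implementation: the function name is hypothetical, and it assumes the stored value may or may not carry the surrounding braces shown in the figure.

```python
def parse_event_attribute(text):
    """Parse a caret-delimited set of begin:end events into (begin, end) tuples."""
    body = text.strip()
    # Assumption: braces are optional in the stored string.
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    events = []
    for item in body.split("^"):
        item = item.strip()
        if not item:
            continue  # an empty set yields no events
        begin_text, end_text = item.split(":")  # exactly two values per event
        begin, end = float(begin_text), float(end_text)
        if end < begin:
            raise ValueError("event ends before it begins: " + item)
        events.append((begin, end))
    return events


cpu_increase = parse_event_attribute("{ 3:8 ^ 13:18 ^ 23:28 }")
```

For the Figure 3 value, `cpu_increase` contains the three intervals from 3 to 8, 13 to 18, and 23 to 28.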

Apriori Sets And Sequences mines temporal associations directly from events. This keeps intact the temporal information represented in the data set while eliminating much of the work involved in scanning the actual sequences. The events are akin to indexes into the sequences.

A time sequence generally has a real time line associated with it. The values in the real time line can be in any units and the distance between each point on the time line does not need to be uniform. This time line represents the actual time at which a value in a time sequence was observed. The events that are identified in a time sequence use the values along the real time line for their begin and end times.

In the itemsets of Apriori Sets And Sequences a second time line, the relative time line, is used in addition to the real time line. The relative time line always begins at time 0 and implies no units. The distance between consecutive points on the relative time line is always 1. There is a one-to-one mapping between times on the real time line and times on the relative time line: the first value on the real time line corresponds to relative time 0, the second to relative time 1, and so forth.
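The mapping between the two time lines can be sketched in Python. The function name and the sample time line are hypothetical, and the sketch assumes the real time line is strictly increasing so that the mapping is one-to-one.

```python
def relative_time_line(real_times):
    """Map each real time to its position, starting at relative time 0."""
    for earlier, later in zip(real_times, real_times[1:]):
        if later <= earlier:
            raise ValueError("real time line must be strictly increasing")
    return {real: position for position, real in enumerate(real_times)}


# Hypothetical nonuniform real time line, in arbitrary units.
real_times = [0.5, 2.0, 2.25, 10.0]
relative = relative_time_line(real_times)
```

With this input, real time 0.5 maps to relative time 0 and real time 10.0 maps to relative time 3, regardless of the uneven spacing between them.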

All the event items in an itemset have a begin and end time. The individual begin and end times of all the event items are sorted together in ascending order, and the relative time of each is simply its position in the sorted list. If the user requires absolute time to be used in mining the temporal associations, events can be detected that are specific to the length of events or to the time between events.

Adding an event item to an itemset that already contains event items is a simple procedure. The begin and end times of the new event item are inserted into the sorted list of begin and end times according to the real time line. The begin and end times of every event now in the itemset are then renumbered with relative times according to their order in the list.
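The insertion and renumbering procedure might look like the following sketch. The class, its method names, and the two event types other than CPU-Increase are hypothetical. It also assumes each event type appears at most once in an itemset and that equal real times receive distinct positions in sort order; the text does not say how ties are handled.

```python
import bisect


class EventItemset:
    """Keeps begin and end times of event items sorted by real time."""

    def __init__(self):
        # Sorted list of (real_time, event_name, "begin" or "end") entries.
        self._endpoints = []

    def add_event(self, name, begin, end):
        """Insert both endpoints of a new event item in real-time order."""
        bisect.insort(self._endpoints, (begin, name, "begin"))
        bisect.insort(self._endpoints, (end, name, "end"))

    def relative_times(self):
        """Renumber every endpoint with its position in the sorted list."""
        result = {}
        for position, (_, name, kind) in enumerate(self._endpoints):
            result.setdefault(name, {})[kind] = position
        return result


itemset = EventItemset()
itemset.add_event("CPU-Increase", 3, 8)
itemset.add_event("Memory-Increase", 5, 12)
before = itemset.relative_times()
itemset.add_event("Disk-Increase", 1, 4)
after = itemset.relative_times()
```

Before the third insertion, CPU-Increase spans relative times 0 to 2; afterwards it spans 1 to 4, because the endpoints of Disk-Increase fall before and between its own on the real time line.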


Keith A. Pray 2003-06-17