The computer system performance domain will serve as the example used throughout this document. The original motivation for this work was to assist a software engineering performance group with very complicated estimation and investigation tasks. The example used here is considerably simpler. Several machine learning algorithms were run on the same base data set while performance metrics were captured on the single computer system that hosted the experiments. All the machine learning algorithms followed the same basic process of building a model, training the model over the training data set, and testing the accuracy of the model over the test data set.
The computer system performance domain is used because it exhibits many basic, easy-to-understand behaviors that are fairly easy to measure. Relationships such as processor utilization increasing as memory utilization increases are common, well understood, and easy to reproduce. By mining associations of this sort, Apriori Sets And Sequences can be shown to produce valid, expected results.
Another advantage of working with the computer system performance domain is that it is a readily available source of data. Preliminary investigations showed that little prior work deals with data of the sort Apriori Sets And Sequences takes as input. No preexisting data sets of this kind were found, so original data sets had to be compiled. Computer performance was an obvious choice.
The machine learning algorithms used include Neural Networks with Back Propagation, Instance Based classification, Naive Bayes, and C4.5 Decision Trees. The training data set was the census income data available at the UCI Machine Learning Repository's website, http://www.ics.uci.edu/~mlearn/MLRepository.html. The data was used in several forms, including normalized numeric values, discretized numeric values, and a filtered version with all instances containing missing attribute values removed. Multiple experiments using different parameters with the same algorithm were run.
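The three forms of the data mentioned above can be illustrated with a minimal sketch. The text does not say which normalization or discretization method was used, so min-max scaling and equal-width binning are assumptions made here for illustration only, as is the use of "?" as the missing-value marker. All function names are hypothetical.

```python
# Illustrative sketch only: the specific techniques below are assumptions,
# not the procedure actually used in the experiments.

MISSING = "?"  # assumed marker for a missing attribute value


def drop_incomplete(rows):
    """Filter out every instance that has at least one missing attribute value."""
    return [row for row in rows if MISSING not in row]


def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Replace each numeric value by a bin index in 0..n_bins-1 (assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    rows = [["39", "State-gov"], ["50", "?"], ["38", "Private"]]
    complete = drop_incomplete(rows)
    ages = [float(row[0]) for row in complete]
    print(complete)
    print(min_max_normalize(ages))
    print(equal_width_bins(ages, 2))
```

Each form yields a different version of the same base data set, which is what allows one algorithm to be run several times under different input conditions.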
The experiments with machine learning algorithms were run on a Windows 2000 machine. Metrics were captured at the system level and at the process level. The same metrics were captured for each process running on the machine. Following is a list of the Windows 2000 performance metrics captured during each experiment. The descriptions shown were retrieved from Microsoft's MSDN web site, http://msdn.microsoft.com, and are also available from the Performance Monitor utility provided with Windows 2000.
This does not appear to be the metric I meant to capture. The data collection experiments will have to be re-run to get better disk information.
Figure 2 shows a partial sample of the data collected for a single experiment. This data, along with all the other time sequences not shown, comprises a single instance of the data set. The amount of data in a single instance is large even when compared to other data sets containing sequences or time series. This makes handling a data set with many instances very demanding in terms of both computing resources and time.
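To make the notion of an instance concrete, the following is a minimal sketch of one possible in-memory representation, in which a single instance holds every time sequence captured during one experiment. The class, its fields, and the series naming scheme are hypothetical and are not the representation used by Apriori Sets And Sequences.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentInstance:
    """One data set instance: all time sequences from a single experiment (hypothetical layout)."""

    label: str
    # Maps a series name (for example "system/processor" or "process:java/processor";
    # the naming scheme is an assumption) to its ordered list of sampled values.
    sequences: Dict[str, List[float]] = field(default_factory=dict)

    def add_sample(self, series: str, value: float) -> None:
        """Append one sampled value to the named time sequence."""
        self.sequences.setdefault(series, []).append(value)

    def total_values(self) -> int:
        """Count every sampled value stored in this instance."""
        return sum(len(samples) for samples in self.sequences.values())


if __name__ == "__main__":
    instance = ExperimentInstance(label="example-run")
    instance.add_sample("system/processor", 12.5)
    instance.add_sample("system/processor", 40.0)
    instance.add_sample("process:java/processor", 38.0)
    print(len(instance.sequences), instance.total_values())
```

Because the number of stored values grows with the number of metrics, the number of processes, and the length of the experiment, even one instance of this shape can become large.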
Much of the focus of Apriori Sets And Sequences is on implementing a very efficient system. Many of the efficiency improvements also benefit data sets that do not contain temporal information. Apriori Sets And Sequences was used in this capacity by the Promoter MQP project team at WPI, who mined genetic motifs for gene expression and cell type rules. They report a great improvement in performance over the Apriori implementation that Apriori Sets And Sequences is built upon.
Work has since been done applying Apriori Sets And Sequences to other domains. Results from mining complex temporal associations from the stock market domain and a clinical study of sleep disorders will be presented.