next up previous contents
Next: Related Work Up: Background Previous: Pattern Matching   Contents

Example Domain: Computer System Performance

The computer system performance domain will serve as the example used throughout this document. The original motivation for this work was to assist a software engineering performance group with very complicated estimation and investigation tasks. The example paradigm is not as complicated. Machine learning algorithms have been run with the same base data set while metrics were captured on the single computer system the experiments were run on. All the machine learning algorithms had the same basic process of building a model, training the model over the training data set, and testing the accuracy of the model over the test data set.

The computer system performance domain is used because there exists many basic and easy to understand behaviors that are fairly easy to measure. Such relationships as processor utilization increasing as memory utilization increases are common, understood and easy to reproduce. By mining associations of this sort Apriori Sets And Sequences can be shown to produce valid, expected results.

Another advantage of working with the computer system performance domain is the readily available source of data. Preliminary investigations showed there was little work dealing with data of the sort Apriori Sets And Sequences takes as input. No preexisting data sets of this kind was found. This required the compilation of original data sets. Computer performance was an obvious choice.

The machine learning algorithms that were used include Neural Networks with Back Propagation, Instance Based classification, Naive Bayes, and C4.5 Decision Trees. The training data set used was the census income data available at the UCI Machine Learning Repository's website, http://www.ics.uci.edu/ mlearn/MLRepository.html. The data used took several forms including normalized numeric values, discretized numeric values, and filtering out all instances that included missing value for attributes. Multiple experiments using different parameters with the same algorithm were run.

The experiments with machine learning algorithms were run on a Windows 2000 machine. Metrics were captured at the system level and at the process level. The same metrics were captured for each process running on the machine. Following is a list of the Windows 2000 performance metrics captured during each experiment. The descriptions shown were retrieved from the Microsoft's MSDN web site, http://msdn.microsoft.com and are also available from the Performance Monitor Utility provided with Windows 2000.

Memory % Committed Bytes In Use
shows the ratio of Memory Committed Bytes to the Memory Commit Limit. Committed memory is physical memory in use for which space has been reserved in the paging file so that it can be written to disk. The commit limit is determined by the size of the paging file. If the paging file is enlarged, the commit limit increases, and the ratio is reduced.

Memory Available Bytes
Shows the amount of physical memory, in bytes, available to processes running on the computer. It is calculated by summing adding the amount of space on the zeroed, free, and standby memory lists. Free memory is ready for use; zeroed memory consists of pages of memory filled with zeros to prevent later processes from seeing data used by a previous process; standby memory is memory that has been removed from a process's working set (its physical memory) en route to disk but is still available to be recalled.

PhysicalDisk (*) Current Disk Queue Length
Shows the number of requests that are outstanding on the disk at the time that the performance data is collected. It includes requests in service at the time of the collection. This is a snapshot, not an average over the time interval. It includes requests in service at the time of the collection. Multispindle disk devices can have multiple requests active at one time, but other concurrent requests are awaiting service. This counter might reflect a transitory high or low queue length, but if there is a sustained load on the disk drive, it is likely that this is consistently high. Requests experience delays proportional to the length of this queue minus the number of spindles on the disks. This difference should average less than two.

This seems to be not the metric I meant to capture. The data collection experiments will have to be re-run to get better disk information.

Process(*) % Processor Time
Shows the percentage of elapsed time that all of the threads of this process used the processor to execute instructions. An instruction is the basic unit of execution in a computer; a thread is the object that executes instructions; and a process is the object created when a program is run. Code executed to handle some hardware interrupts and trap conditions are included in this count.

Process (*) Working Set
Shows the current number of bytes in the working set of this process. The working set is the set of memory pages touched recently by the threads in the process. If free memory in the computer is above a certain threshold, pages are left in the working set of a process even if they are not in use. When free memory falls below a certain threshold, pages are trimmed from working sets. If they are needed, they are then soft-faulted back into the working set before they leave main memory.

Processor Total % Processor Time
Shows the percentage of time that the processor is executing application or operating system processes other than Idle. This counter is a primary indicator of processor activity. It is calculated by measuring the time that the processor spends executing the thread of the Idle process in each sample interval, and subtracting that value from 100%. Each processor has an Idle thread which consumes cycles when no other threads are ready to run.

System Processes
Total number of processes currently running.

System Threads
Total number of threads currently running.

Figure 2 shows a partial sample of the data collected for a single experiment. This data along with all the other time sequences not shown comprise a single instance of the data set. The amount of data in a single instance is large even when compared to other data sets containing sequences or time series. This makes handling a data set with many instances very difficult computing resource and time wise.

Figure 2: Partial Sample of A Single Performance Instance
\includegraphics[width=0.5\textwidth, bb=0 0 720 379,
clip=true]{Slide3}

Much of the focus of Apriori Sets And Sequences is implementing a very efficient system. Many of the improvements to efficiency benefit data sets that do not contain temporal information. Apriori Sets And Sequences was used in this capacity by the Promoter MQP project team at WPI mining genetic motifs for gene expression and cell type rules. They report a great improvement in performance over the Apriori implementation Apriori Sets And Sequences is built upon.

Work has since been done applying Apriori Sets And Sequences to other domains. Results from mining complex temporal associations from the stock market domain and a clinical study of sleep disorders will be presented.


next up previous contents
Next: Related Work Up: Background Previous: Pattern Matching   Contents
Keith A. Pray 2003-06-17