Knowledge Farming Log #1

Data Mining
CS 510 (DM)
Winter, 2004

Why all the scripting?

Data mining is extracting pearls from historical data.

When data is absent, we don't mine. Instead, we farm.

Knowledge farming is a three stage process:

Plant the seed
Model some domain, validate the model.

Grow the data
Perform Monte Carlo simulation on the model to sample the space of possibilities in the model.

Our case study will be software process design. Using our method, we will develop a minimal set of recommendations that are highly specialized to a specific software project.

Harvest
Summarize the simulation using data miners to find the important features.

That is, after recording what is known in a model, that model should be validated, explored using simulations, then summarized to find the key factors that most improve model behavior.



Case study: When to Inspect

Many researchers have predicated their work on the truism that catching errors early in the software lifecycle is very important. For example, I have said:

Is this general truism true in specific cases? Well...



1. Plant the seed

Modelling

The model for this case study was originally developed in 1995 by Dr. David Raffo (PSU) and subsequently tailored to a specific large-scale development project at a leading software development firm with the following properties:

A high-level block diagram of this model is shown below. The model is far more complex than suggested by this figure since each block references many variables shared by all other blocks.

The model is really two models:

The discrete event model contained 30+ process steps with two levels of hierarchy. The main performance measures of development cost, product quality and project schedule were computed by the model. These performance measures could also be recorded for any individual process step as desired. Some of the inputs to the simulation model included productivity rates for various processes; the volume of work (i.e. KSLOC); defect detection and injection rates for all phases; effort allocation percentages across all phases of the project; rework costs across all phases; parameters for process overlap; the amount / effect of training provided; and resource constraints.

Actual data were used for model parameters where possible. For example, inspection data were collected from individual inspection forms for the two past releases of the product. Distributions for defect detection rates and inspection effectiveness were developed from these individual inspection reports. Also, effort and schedule data were collected from the corporate project management tracking system. Lastly, senior developers and project managers were surveyed and interviewed to obtain values for other project parameters when hard data were not available.

Models were developed from this data using multiple regression to predict defect rates and task effort. The result was a model that predicted the three main performance measures of cost, quality, and schedule. The process modifications supported by the model are too numerous to list here. Suffice it to say that small to medium scope process changes could be easily incorporated and tested using the model.

For each phase of the modelled process, there was a choice of one of four inspection types:

N=none
Do no inspection

F=full Fagan
A full Fagan inspection is a well-researched manual inspection method. Such inspections are precisely defined, including a seven-step process plus pre-determined roles for inspection participants. Fagan inspections can find many errors in a software product. For example, for the company studied here, the defect detection capability of their full Fagan inspections was TR(0.35, 0.50, 0.65):
Defect detection capability
The percentage of defects that are latent in the artifact that is being inspected that are detected.

TR(a,b,c)
Denotes a triangular distribution with minimum, mode, and maximum of a, b, c respectively. In this study, a full Fagan inspection used between 4 and 6 staff, plus the author of the artifact being inspected.

B=baseline
A baseline inspection was a continuation of current practice at the company under study. The baseline inspection at this company was essentially a poorly performed Fagan inspection. The distinction between a proper Fagan inspection and the baseline is that, for a proper Fagan inspection, staff would receive new training, checklists and support in order to significantly improve the effectiveness of the inspections. The data showed that baseline inspections had defect detection capabilities ranging from a minimum of 0.13 to a maximum of 0.30, with an average of 0.21 (these figures were obtained from actual inspection records).

W=walk through
Walk through inspections were conducted by the author of the artifact being inspected in a relatively informal atmosphere. Process experts estimated the amount of time and defect detection capability for this type of inspection. Those estimates were TR(0.07, 0.15, 0.23).
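
The triangular distributions quoted above can be sampled directly. The following is a minimal sketch, not the original simulation: the names DETECTION and sample_detection are hypothetical, and only the two inspection types given in TR(a,b,c) form are included. The baseline inspection is left out because the text reports its average, not its mode.

```python
import random

# (minimum, mode, maximum) of defect detection capability, as quoted in the text.
DETECTION = {
    "F": (0.35, 0.50, 0.65),  # full Fagan
    "W": (0.07, 0.15, 0.23),  # walk through
}

def sample_detection(kind, rng=random):
    """Draw one defect detection capability for an inspection type.

    Hypothetical helper: "N" (no inspection) is assumed to detect nothing.
    """
    if kind == "N":
        return 0.0
    low, mode, high = DETECTION[kind]
    # Note the argument order of the standard library call:
    # random.triangular(low, high, mode).
    return rng.triangular(low, high, mode)
```

Passing a seeded random.Random instance as rng makes a batch of draws repeatable, which helps when comparing runs.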

Summarizing Model Output

The outputs of the model are assessed via a multi-attribute utility function:

 utility =  40*(14 - quality) +
           320*(70 - expense) +
           640*(24 - duration)

where quality, expense and duration are defined as follows. Quality is the number of major defects (i.e. severity 1 and 2) estimated to remain in the product when released to customers. Expense is the number of person-months of effort used to perform the work on the project and to implement the changes to the process that were studied. Duration is the number of calendar months for the project from the beginning of functional specification until the product was released to customers.

This function was created after extensive debriefing of the business users who funded the development of this model.
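
As a quick check on how the three terms trade off, the utility equation above can be written as a small function. This is only a transcription of the formula; the function name and the sample inputs are made up for illustration.

```python
def utility(quality, expense, duration):
    """Multi-attribute utility from the text.

    quality:  major defects remaining at release
    expense:  person-months of effort
    duration: calendar months
    """
    return 40 * (14 - quality) + 320 * (70 - expense) + 640 * (24 - duration)

# Hypothetical project: 10 remaining defects, 60 person-months, 20 months.
print(utility(10, 60, 20))  # 40*4 + 320*10 + 640*4 = 5920
```

Note the weights: one calendar month saved is worth as much as two person-months or sixteen major defects.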

Validation

The baseline simulation model of the lifecycle development process was validated in a number of ways, the most important of which were as follows:

Face validity
Process diagrams, model inputs, model parameters and outputs were reviewed by members of the software engineering process group as well as senior developers and managers for their fidelity to the actual process.

Output validity
The model was used to accurately predict the performance of several past releases of the project.

Special case
The model was used to predict unanticipated special cases. Specifically, when predicting the impact of developing overly complex functionality, the model predicted that development would take approximately double the normal development schedule. Management did not accept this result initially because the predicted schedule seemed too long. However, upon further investigation, it was found that the model predictions corresponded quite accurately with this company's actual experience.



2. Grow the data

The simulation model described above contains four phases of development and four inspection types at each phase. The four phases were functional specification (FS); high level design (HLD); low level design (LLD); and coding (CODE). The four inspection types were full Fagan, baseline, walk through, and none.

This results in 4^4=256 different configurations: NNNN, NNNF, NNNB, NNNW, NNFN, NNFF, ..., etc. Each configuration was executed 50 times, resulting in 50*4^4=12800 runs. Each run was summarized using the utility equation.
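
The configuration space is small enough to enumerate. The sketch below reproduces the counts quoted above; it assumes that the four letters of a configuration name map to FS, HLD, LLD and CODE in that order, which the text does not state, and it only lists the runs (it does not execute the simulator).

```python
from itertools import product

INSPECTIONS = "NFBW"                  # none, full Fagan, baseline, walk through
PHASES = ("FS", "HLD", "LLD", "CODE")  # assumed letter order within a name
REPEATS = 50

# One string per configuration, e.g. "NNFB".
configs = ["".join(choice) for choice in product(INSPECTIONS, repeat=len(PHASES))]

# One (configuration, repeat number) pair per simulation run.
runs = [(config, repeat) for config in configs for repeat in range(REPEATS)]

print(len(configs), len(runs))  # 256 12800
```

With the inspection letters in the order N, F, B, W, the enumeration starts NNNN, NNNF, NNNB, NNNW, NNFN, the same order as the list in the text.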



3. Harvest

Pre-process

Before we can run data miners, we have to pre-process the data.

The data came in an EXCEL spreadsheet with some lines and columns added in for formatting purposes. After saving the sheet as a comma-separated file, our first task is to get rid of all the blank lines and columns.
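
One way to do that clean-up is sketched below in Python, using only the standard csv module. This is an illustration, not the script used in the course; the function names are hypothetical, and "blank" is assumed to mean that every cell in the row or column is empty or whitespace.

```python
import csv
import io

def drop_blank_rows_and_columns(rows):
    """Remove rows and columns whose cells are all empty or whitespace."""
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Pad short rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Keep only the columns that hold at least one non-blank cell.
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    return [[row[i] for i in keep] for row in padded]

def clean_csv(text):
    """Parse comma-separated text, drop blank rows/columns, return CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(drop_blank_rows_and_columns(rows))
    return out.getvalue()
```

Rows are filtered before columns so that a formatting row of empty cells cannot keep an otherwise blank column alive.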



Related work

Many general tools and methodologies exist for modelling such as:

Note that for software process programming, elaborate new modelling paradigms may not be required. For example, the Little-JIL process programming language just uses standard programming constructs such as pre-conditions, post-conditions, exception handlers, and a top-down decomposition tree showing sub-tasks inside tasks.

Simulations can be based on nominal or off-nominal values. Nominal simulations draw their inputs from known operational profiles of system inputs. Off-nominal Monte Carlo (also called stochastic) simulations, where inputs are selected at random, can check for unanticipated situations. Stochastic simulation has been extensively applied to models of software process.

In a sensitivity analysis, the key factors that most influence a model are isolated, and recommended settings for those key factors are generated. We take care to distinguish sensitivity analysis from traditional optimization methods. In our experience, the real systems we deal with are so complex that they do not always fit into (e.g.) a linear optimization framework. Studying data grown from simulators lets us investigate complex, non-linear systems using a variety of data-driven distributions. Simulation models can capture complex feedback and rework loops that traditional optimization methods cannot. They can look at processes in detail as well as at the high level of abstraction where the more analytic models must reside. Finally, simulation models can capture multiple performance measures that cannot be explored using these optimization formulations. This is not to say that traditional optimization models are not useful: for certain questions, they provide the best fit. However, for many questions, the standard optimization models are not the best choice and something like the learners discussed below may be more useful.
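
To make the idea of isolating key factors concrete, here is a deliberately crude sketch. It is a stand-in for illustration only, not one of the learners used in this course: it groups simulation runs by the setting chosen at each phase, averages the utility in each group, and reports how much those averages spread within a phase. The input format (pairs of configuration string and utility) and the phase order are assumptions.

```python
from collections import defaultdict
from statistics import mean

PHASES = ("FS", "HLD", "LLD", "CODE")  # assumed letter order within a name

def mean_utility_by_setting(runs):
    """Average utility for each (phase, inspection type) pair.

    runs: iterable of (configuration string, utility), e.g. ("NNFB", 5920).
    """
    groups = defaultdict(list)
    for config, utility in runs:
        for phase, choice in zip(PHASES, config):
            groups[(phase, choice)].append(utility)
    return {key: mean(values) for key, values in groups.items()}

def influence_by_phase(runs):
    """Spread (max minus min) of the per-setting means within each phase.

    A large spread suggests the choice made at that phase matters a lot.
    """
    means = mean_utility_by_setting(runs)
    spread = {}
    for phase in PHASES:
        values = [m for (p, _), m in means.items() if p == phase]
        spread[phase] = max(values) - min(values)
    return spread
```

The setting with the highest mean in the most influential phase is then a candidate recommendation. Averaging over all other phases ignores interactions between phases, which is one reason a proper data miner is preferable to this summary.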



Credits

Author

Tim Menzies , tim@menzies.us, http://menzies.us

Software

This page generated by Site: see http://www.cs.pdx.edu/~timm/dm/site.html

Acknowledgements

This site is built using PerlPod.

Style sheet switching method taken from Eddie Traversa's excellent and simple-to-apply tutorial: http://dhtmlnirvana.com/content/styleswitch/styleswitch1.html.

Search engine powered by ATOMZ http://www.atomz.com/search/. Note, the indexes to this site are only updated weekly (heh, it's a free service- what more ja want?).

Icons on this site come from http://www.sql-news.de/rubriken/olap.asp and http://www.ifnet.it/webif/centrodi/eng/toolbar.htm.

The JAVA machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA) http://www.cs.waikato.ac.nz/ml/weka/



Legal

Copyright

Copyright (C) Tim Menzies 2004

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer

The content from or through this web page are provided 'as is' and the author makes no warranties or representations regarding the accuracy or completeness of the information. Your use of this web page and information is at your own risk. You assume full responsibility and risk of loss resulting from the use of this web page or information. If your use of materials from this page results in the need for servicing, repair or correction of equipment, you assume any costs thereof. Follow all external links at your own risk and liability.
