Data mining is extracting pearls from historical data.
When data is absent, we don't mine. Instead, we farm.
Knowledge farming is a three-stage process:
Our case study will be software process design. Using our method, we will develop a minimal set of recommendations that are highly specialized to a specific software project.
That is, after recording what is known in a model, that model should be validated, explored using simulations, then summarized to find the key factors that most improve model behavior.
Many researchers have predicated their work on the truism that catching errors early in the software lifecycle is very important. For example, I have said:
Is this general truism true in specific cases? Well...
The model for this case study was originally developed in 1995 by Dr. David Raffo (PSU) and was subsequently tailored to a specific large-scale development project at a leading software development firm with the following properties:
A high-level block diagram of this model is shown below. The model is far more complex than suggested by this figure, since each block references many variables shared by all other blocks.

The model is really two models:
The discrete event model contained 30+ process steps with two levels of hierarchy. The main performance measures of development cost, product quality and project schedule were computed by the model. These performance measures could also be recorded for any individual process step as desired. Some of the inputs to the simulation model included productivity rates for various processes; the volume of work (i.e. KSLOC); defect detection and injection rates for all phases; effort allocation percentages across all phases of the project; rework costs across all phases; parameters for process overlap; the amount / effect of training provided; and resource constraints.
Actual data were used for model parameters where possible. For example, inspection data were collected from individual inspection forms for the two past releases of the product. Distributions for defect detection rates and inspection effectiveness were developed from these individual inspection reports. Also, effort and schedule data were collected from the corporate project management tracking system. Lastly, senior developers and project managers were surveyed and interviewed to obtain values for other project parameters when hard data were not available.
Models were developed from these data using multiple regression to predict defect rates and task effort. The result was a model that predicted the three main performance measures of cost, quality, and schedule. The process modifications supported by the model are too numerous to list here. Suffice it to say that small to medium scope process changes could be easily incorporated and tested using the model.
For each phase of the modelled process, there was a choice of one of four inspection types.
TR(a,b,c)
The outputs of the model are assessed via a multi-attribute utility function:
utility = 40*(14 - quality) +
320*(70 - expense) +
640*(24 - duration)
where quality, expense and duration are defined as follows. Quality is the number of major defects (i.e. severity 1 and 2) estimated to remain in the product when released to customers. Expense is the number of person-months of effort used to perform the work on the project and to implement the changes to the process that were studied. Duration is the number of calendar months for the project from the beginning of functional specification until the product was released to customers.
This function was created after extensive debriefing of the business users who funded the development of this model.
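The utility equation above can be written directly as a small function. This is a minimal sketch: the weights and offsets come from the equation, but the function name and the sample values passed to it are hypothetical and are not taken from the study's data.

```python
def utility(quality: float, expense: float, duration: float) -> float:
    """Multi-attribute utility from the text; larger is better.

    quality  : major defects (severity 1 and 2) remaining at release
    expense  : person-months of effort
    duration : calendar months from functional specification to release
    """
    return 40 * (14 - quality) + 320 * (70 - expense) + 640 * (24 - duration)


# Hypothetical project outcome, for illustration only.
print(utility(quality=10, expense=60, duration=20))  # 5920
```

Note that a project with 14 remaining defects, 70 person-months and 24 calendar months scores zero, so the utility measures improvement relative to those three reference values.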
The baseline simulation model of the lifecycle development process was validated in a number of ways. The most important were as follows:
The simulation model described above contains four phases of development and four inspection types at each phase. The four phases were functional specification (FS); high level design (HLD); low level design (LLD); and coding (CODE). The four inspection types were full Fagan, baseline, walk through, and none.
This results in 4^4 = 256 different configurations: NNNN, NNNF, NNNB, NNNW, NNFN, NNFF, etc. Each configuration was executed 50 times, resulting in 50*4^4 = 12800 runs. Each run was summarized using the utility equation.
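The arithmetic above can be checked by enumerating the configurations. This is a sketch, not the original experiment script. It assumes that each letter position corresponds to one phase in the order FS, HLD, LLD, CODE (the text does not state this ordering), and the names `PHASES`, `INSPECTIONS` and `REPEATS` are ours.

```python
from itertools import product

PHASES = ("FS", "HLD", "LLD", "CODE")  # assumed order of letter positions
INSPECTIONS = "NFBW"  # none, full Fagan, baseline, walk through
REPEATS = 50  # executions per configuration

# One letter per phase, e.g. "NNFB".
configs = ["".join(c) for c in product(INSPECTIONS, repeat=len(PHASES))]
total_runs = len(configs) * REPEATS

print(len(configs), total_runs)  # 256 12800
print(configs[:6])
```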
Before we can run data miners, we have to pre-process the data.
The data came in an EXCEL spreadsheet with some lines and columns added in for formatting purposes. After saving the sheet as a comma-separated file, our first task is to get rid of all the blank lines and columns.
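One way to do this clean-up is sketched below using only Python's standard `csv` module. This is an illustration under assumptions, not the script used for the case study: the function name is ours, and the sample text stands in for the real spreadsheet export, whose layout is not shown here.

```python
import csv
import io


def drop_blank_rows_and_columns(rows):
    """Remove rows and columns whose cells are all empty or whitespace."""
    rows = [[cell.strip() for cell in row] for row in rows]
    rows = [row for row in rows if any(row)]  # drop blank lines
    if not rows:
        return []
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    keep = [i for i in range(width) if any(row[i] for row in rows)]
    return [[row[i] for i in keep] for row in rows]


# Hypothetical sample with one blank column and two blank lines.
raw = "config,,utility\n,,\nNNNN,,100\n\nNNNF,,120\n"
cleaned = drop_blank_rows_and_columns(csv.reader(io.StringIO(raw)))
for row in cleaned:
    print(",".join(row))
```

For a real file, the `io.StringIO(raw)` argument would be replaced by an open file handle.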
Many general tools and methodologies exist for modelling such as:
Note that for software process programming, elaborate new modelling paradigms may not be required. For example, the Little-JIL process programming language just uses standard programming constructs such as pre-conditions, post-conditions, exception handlers, and a top-down decomposition tree showing sub-tasks inside tasks.
Simulations can be based on nominal or off-nominal values. Nominal simulations draw their inputs from known operational profiles of system inputs. Off-nominal Monte Carlo (also called stochastic) simulations, where inputs are selected at random, can check for unanticipated situations. Stochastic simulation has been extensively applied to models of software process.
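The difference between the two kinds of simulation can be sketched as follows. Everything here is hypothetical: `toy_model` is a stand-in and not the process model of the case study, and the input distributions and ranges are invented for illustration.

```python
import random


def toy_model(defect_rate: float, productivity: float) -> float:
    """Hypothetical stand-in for a process model; returns remaining defects."""
    return 100 * defect_rate / productivity


def nominal_inputs(rng):
    # Assumed operational profile: values close to those seen in practice.
    return rng.gauss(0.10, 0.01), rng.gauss(1.0, 0.05)


def off_nominal_inputs(rng):
    # Inputs selected at random from the whole assumed legal range.
    return rng.uniform(0.0, 0.5), rng.uniform(0.5, 2.0)


def simulate(sampler, runs, seed=1):
    """Run the toy model `runs` times on inputs drawn by `sampler`."""
    rng = random.Random(seed)
    return [toy_model(*sampler(rng)) for _ in range(runs)]


nominal = simulate(nominal_inputs, 1000)
off_nominal = simulate(off_nominal_inputs, 1000)
print(min(nominal), max(nominal))
print(min(off_nominal), max(off_nominal))
```

The off-nominal runs cover a much wider range of outputs, which is what lets them expose situations that the operational profile never exercises.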
In a sensitivity analysis, the key factors that most influence a model are isolated, and recommended settings for those key factors are generated. We take care to distinguish sensitivity analysis from traditional optimization methods. In our experience, the real systems we deal with are so complex that they do not always fit into (e.g.) a linear optimization framework. Studying data grown from simulators lets us investigate complex, non-linear systems using a variety of data-driven distributions. Simulation models can capture complex feedback and rework loops that traditional optimization methods cannot represent. They can also examine processes in detail as well as at the high level of abstraction where the more analytic models must reside. Finally, simulation models can capture multiple performance measures that cannot be explored using these optimization formulations.
This is not to say that traditional optimization models are not useful. For certain questions, traditional optimization formulations provide the best fit. However, for many questions, the standard optimization models are not the best choice and something like the learners discussed below may be more useful.
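To make the idea of isolating key factors concrete, here is a deliberately simple sketch. It is not one of the learners used in this tutorial. It only groups runs by the setting at each position of a configuration string and ranks the positions by how far the mean score moves across settings. The function name and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def key_factors(runs):
    """Rank positions by the spread of mean score across their settings.

    runs: iterable of (config, score) pairs, e.g. ("NNFB", 5120.0).
    Returns a list of (position, best_setting, spread), widest spread first.
    """
    by_setting = defaultdict(list)
    for config, score in runs:
        for position, setting in enumerate(config):
            by_setting[(position, setting)].append(score)
    means = {key: mean(scores) for key, scores in by_setting.items()}

    ranked = []
    for p in sorted({position for position, _ in means}):
        options = {s: m for (q, s), m in means.items() if q == p}
        best = max(options, key=options.get)
        spread = max(options.values()) - min(options.values())
        ranked.append((p, best, spread))
    return sorted(ranked, key=lambda r: r[2], reverse=True)


# Hypothetical scores for a two-position, two-setting example.
runs = [("NN", 10.0), ("NF", 12.0), ("FN", 30.0), ("FF", 32.0)]
print(key_factors(runs))
```

In this toy data the first position matters far more than the second, and the recommended setting for it is `F`. Averaging one factor at a time ignores interactions between factors, which is one reason to prefer a proper learner on real simulation output.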
Tim Menzies ,
tim@menzies.us,
http://menzies.us
The JAVA machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA) http://www.cs.waikato.ac.nz/ml/weka/
Copyright (C) Tim Menzies 2004
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.