Project: CS510 Special Topics (4 credits) Data Mining

Data Mining
CS 510 (DM)
Winter,2004
home | news | site map
review | project | subject | group
weka | mining | gawk | bash
modeling | reference | pods
Display: big | small

Why all the scripting?

Goals

Highest level goals: To show students empirical methods for assessing different learners.

Lower level goal: To generate a comment on the following hypothesis: Bayesian learners are just as effective, and simpler to build, than entrophy-based decision-tree learners

Low-level goal: Conduct a 10-by-10-cross validation study to generate win-loss tables comparing:

[TOP]


Organization

Students to work in groups of three or four.

The project has many parts and the total marks for these parts adds up to 150 points. That is, many parts of the assignment are optional. Further, if you do badly in one part, you can make up for it by doing extra work elsewhere.

Parts are either print outs or reports:

print out
Print outs are short ascii print outs with your group number, date, and part number written on the top. These you show me in class and which get a mark 100% or 0%.

report
Reports are longer documents be submitted as html files. The easiest way I know to generate web pages is the PerlPod tool. For more on PerlPod, see perlpod page.

If your submission contains multiple files, submit as a zip file. Zip files are to be labeled as follows:

 gGpP.zip

where P is which part you are answering and G is your group number. Html files are to be labeled as follows:

 gGpP.html

Reports are to be submitted to tim@menzies.us with the subject line:

 part P group G CS510 submission

If you get the subject line right, then you will get back an automatic report saying ``submission to CS510 received''. Your assignment is NOT submitted unless your get that notice back.

[TOP]


Marks

 week  marks
 ----+------
   2     5
   3     1 
   4    10 
   5     1
   6     1
   7    15
   8     1 
   9     1
  10    25
 ----+-----
 total  60

[TOP]


Parts

Week 2: REPORT

Submission: Email me a report with photos and names of your group members.

Week 3: print out

Submission: A print out of http://groups.yahoo.com/group/dm04pdx1/members with the emails of your group members highlighted (this means you have to join yahoo first).

Submission: Give me a print out of 12 WEKA outputs, numbered OneR1,OneR2,OneR3, NB1,NB2,NB3, J1,J2,J3 (all stapled together).

To generate this print out, run 3 data sets listed in http://www.cs.pdx.edu/~timm/dm/data/somediscretedatasets.dat through three learners: OneR, NaiveBayes, and J48.

For this exercise, run the learners through manually. Later on, we'll automate this process.

Week 4: REPORT

Learn what you can using OneR, NaiveBayes, J48 from the first ten data sets in the table http://www.cs.pdx.edu/~timm/dm/data/somediscretedatasets.dat .

To build the learners, modify my J48 example to generate shell scripts for OneR and NaiveBayes.

Submission: Give me a report with the following sections

About OneR
In your own words, describe the implementation of OneR (see [Ho95 and section 4.1 of WF02]). Include in your description of the learners how they handle missing and numeric values.

About NB
In your own words, describe the implementation of simple NaiveBayes (see section 4.2 of WF02). Include in your description of the learners how they handle missing and numeric values.

About the Data
Describe the data sets your are using:
                                        attributes
                             +-------------+-----------+-------+
 dataset          #instances | #continuous | #discrete | total | #classes
 ---------------+------------+-------------+-----------+-------+---------
 breast cancer  |        230 |           0 |         9  |    9 |      2
 contact-lenses |        119 |           0 |         4 |     4 |      3      
 ...                     ...           ...       ...    ...      ...

learners.sh
Include and fully comment the files http://www.cs.pdx.edu/~timm/dm/learners.sh explaining the following bash features:
j48correct.awk
Include and fully comment the file http://www.cs.pdx.edu/~timm/dm/j48correct.awk explaining the following gawk features:
reportav.awk
Write, include, and fully comment, an awk script to input the ``learners.out'' file and automatically generate the results table shown in the next section.

Results
Show the average classification accuracies on your learners on different datasets. For example:
  dataset           one1    nb      j48
  ----------------+-------+-------+------
  breast-cancer   | 34.21 | 11.67 | 23.5
  contact-lenses  | 81.90 | 22.60 | 11.90
  ...
  ----------------+-------+-------+------
  average           72.34   17.89   36.23

Don't forget the ``average'' row at the end.

Win-loss
Manually read the win-loss table and create a win-loss table for the different learners showing how often the classification accuracy of one learner was better than another. For example:
  learner  wins-losses  wins  losses
  -------+------------+-----+-----
  oneR   |    5       |  10 |  5
  nb     |    3       |  7  |  4
  j48    |    2       |  6  |  4

We will define a win/loss as follows: greater/less than 1 percent difference in the accuracy of the three learners.

Note that a win-loss table is always sorted by the column wins-losses. Also, for three learners and ten data sets, each learner can be compared to two others, repeated for ten data sets. Hence the maximum number of wins or losses in a win-loss table is 20. The minimum number is 0 since if a learner ties with all other learners, that is neither a win nor a loss.

Conclusion: COCOMO
After considering the win-loss numbers, and taking into consideration the average accuracies seen in the results section, comment on which seems to be the better learner(s).

Notes:

Stuff not to do:

Week 5: print out

Using my PerlPod stuff, document a modified version of bars that accepts a -a flag on the command line. When -a is set, bars assumes it is reading an ARFF file and:

Run this on soybean

  bars -a -r 0 -1 30 data/soybean.arff

and you should see something like this:

                             2-4-d-injury|  16| *******
                      alternarialeaf-spot|  91| ****************************************
                              anthracnose|  44| *******************
                         bacterial-blight|  20| *********
                        bacterial-pustule|  20| *********
                               brown-spot|  92| ****************************************
                           brown-stem-rot|  44| *******************
                             charcoal-rot|  20| *********
                            cyst-nematode|  14| ******
              diaporthe-pod-&-stem-blight|  15| *******
                    diaporthe-stem-canker|  20| *********
                             downy-mildew|  20| *********
                       frog-eye-leaf-spot|  91| ****************************************
                         herbicide-injury|   8| ***
                   phyllosticta-leaf-spot|  20| *********
                         phytophthora-rot|  88| **************************************
                           powdery-mildew|  20| *********
                        purple-seed-stain|  20| *********
                     rhizoctonia-root-rot|  20| *********

Your document should include the output as an example.

For more on bars, see http://www.cs.pdx.edu/~timm/dm/bars.html

Week 6: print out

Week 7: REPORT

Two tasks: understand the COCOMO model and improve our definition of win/loss from the previous assignment.

Regarding COCOMO:

As to the win/loss table, our previous definition was inadequate. In statistics, two samples are assessed to be different by looking at their means and their standard deviations. This is called a t-test and is parameterized with some confidence alpha. In the following code, mean and sd are computed from as follows. Let the accuracies of two learners l1, l2 over k data sets be acc[l1,1],acc[l1,2],..,acc[l1,k] and acc[l2,1],acc[l2,2],...acc[l2,k]. Let di=acc[l1,i]-acc[l2,i]. mean and sd are the average and standard deviation of di. Then the compare function returns some symbol indicating if the di numbers are truly differen.

 function compare(mean,sd,k,alpha) {
        if ( same10(mean,sd,k,alpha) ) { return "0" }
        if ( mean < 0 ) return "-"
        if ( mean > 0 ) return "+"
 }

Assuming ten degrees of freedom, then same10 does the t-test.

 function same10(mean,sd,k,alpha,      t) {
        t=abs(tval(mean,sd,k)),
        if ( alpha==0.05 ) return t < 2.22814;
        if ( alpha==0.01 ) return t < 3.169277;
        print "unknown alpha level " alpha; 
        exit;
 }
 function tval(mean,sd,k)  {return mean/(sqrt(sd/k) }
 function abs(x)           {if (x<0) {return -1*x} else {return x}}

To assess two learners, a standard method is a 10x10 way study:

 10 times do:
        a) randomize order of data in dataset
        b) for fold=1 to 10 times do:
                generate test and train set for "fold"
                for learner in learners do:
                        acc[learner,fold] = learn(train,test)

This will generate 100 entries in acc for each learner for each dataset. The win-loss table is then generated by taking each dataset and comparing all pairs of learners using the above t-test.

Note that that is not a fast process: 10*10*Learners*Data. Good thing it is all automated, heh?

The statistically literate amongst you will be surprised that at the following. Using 10 degrees of freedom, not 99, for the t-test (i.e. the above same10 code. A recent study at a data mining conference made a convincing argument that this was better for technical reasons.

Some bash techniques will be useful.

 REPEATS=10
 FOLDS=10
 for DATUM in $DATA
 do
    OUTER=1
    while [ $OUTER -le $REPEATS ]; do
         INNER=1
         #use randomarff to randomize order of data set
         while [  $INNER -le $FOLDS ]; do
              echo "nway=$OUTER:$INNER">"/dev/tty"
              #use traintest to extract fold "INNER" from $DATUM
              for LEARNER in $LEARNERS
              do
                  #
                  #a) Call weka using the -t, -T -v flags described 
                  #   on p280 of your text. Ignore the -x and -s flags
                  #b) Collect the accuracies via a script
                  #
                  echo "$DATUM, $LEARNER, $ACC"
                  echo "$DATUM, $LEARNER, $ACC">"/dev/tty"
              done
              let INNER=$INNER+1 
         done
         let OUTER=$OUTER+1
     done 
 done > report.txt

(See also randomarff.html and traintest.html are in http://www.cs.pdx.edu/~timm/dm ).

The above code stores its output in report.txt. You MUST use that file to automatically build the win-loss tables (bring on the scripting).

Re-compute your win-loss tables from naiveBayes (with kernel estimation), J48, OneR running on the data sets shown in http://www.cs.pdx.edu/~timm/dm/data/somediscretedatasets.dat

Your report should include all code, well-commented in my PerlPod format, and the new win-loss tables at the alpha=5% and alpha=1% levels.

MOST IMPORTANT: Your report should also comment on patterns in the results. If you just report what you have done that will get you average marks for this report. For good marks, you have to report on what you have learnt. Eg. are their any features of a data set that make it more/less suitable for (e.g.) J48?

Week 8: print out

Run TAR3. Use ``stable'' to seek stable treatments within http://www.cs.pdx.edu/~timm/dm/stable.html. Hand in the ``stable'' results.

Run model trees and LSR (these are setting inside M5 prime) on one data set with continuous classes. Hand in the learnt theories.

Week 9: print out

Do nothing. Gain one mark for relaxing and preparing your mind for the final push.

Week 10: REPORT

Task: 1) log of working data miners with COCOMO; 2) win-loss tables for numeric attributes.

Don't panic at the size of this assignment spec. Read it carefully and you'll see you already have 90% of the machinery you need for all the following.

Also, take care in planning your workload. Task 2 is in three parts:

STRONGLY suggest you START with do the first and second parts of task 2 and then, while the second part is running, turn to task1.

Overview:

Task1:
You are to applying machine learning to the output of COCOMO simulations to find project management options that reduce development time AND risk.

Do two case studies: KC1 and FB3 (see page 2 of http://menzies.us/pdf/00ase.pdf). Modify restraints.dat to reflect the union of nowi and changesi (these are columns labeled in figure 2 of that paper). Assuming newKsloc=100.

Task2:
Auto-generate win-loss tables from 10x10-way cross-val comparing the performance of the following learners after applying the following discretization methods:
 discretization method               NB NK+ke J48
 ---------------------               -- ----- ---
 1. none:                             x   x    x
 2. equal interval: 
      2a. kMM:  
             5MM (max-min)/5          x        x
            10MM (max-min)/10         x        x
            15MM (max-min)/15         x        x
            20MM (max-min)/20         x        x
       2b. binlogI:  
            (max-min)/max(1,2*log(U)) x        x
 3. equal frequency
      3a. binRoot:
             max(30,sqrt(T))          x        x

Notes (on task2)

Equal interval, equal frequency
Defined on page 240 of [WF02].

NB
Naive bayes, no kernel estimation

NB+ke
Naive bayes, plus kernel estimation

U
number of UNIQUE entries in each column

T
TOTAL number of entries.

Be careful:
Ensure when you discretize that the numbers are symbols; e.g. output ``_1'' not ``1''. Use ranges (http://www.cs.pdx.edu/~timm/dm/ranges.html) to generate and arff from your discretized datasets and check that the generated arff files are NOT reals/integers.

Picking your datasets:

To hand in (for the COCOMO experiments):

Observed Improvement
The START and END of the work; i.e. distribution of Person-months and total risk level BEFORE and AFTER you tried machine learning. I.e. did you do any good?

Design of Utility Function
Notes on how you selected a utility function to combine RISK and person-months into one score, then how you discretized that score to generate target classes for discrete learners.

Learner Output
Output from several different learners applied to COCOMO output. Use TAR3, NB (with kernel estimation), J4.8 on discretized classes. Use LSR and model trees on raw numeric classes.

Analysis
Notes on what you saw in TAR3, NB, J4.8, LSR, model trees and how you used it to find a better cocomorestraints.dat file.

Conclusion
A conclusion commenting on (a)the merits of how well general truisms of software development apply to specific projects (i.e. did you see the same things helping FB3 and KC1); and (b)the merits of different learners for fining key control factors in a domain.

To hand in (for the discretization experiments):

Describe the J48 (C45) learner.
Using pseudo-code and examples where appropriate. Include in your description how J48 handles (a)missing and (b)numeric variables (reference: [WF02],p159 to 166]); J48's (c)pre-pruning and (d)post-pruning over fitting bias avoidance operators.

Describe the TAR3 learner.
Using pseudo-code and examples where appropriate. Include in your description how TAR3 handles (a)numeric variables and (b)over-fitting bias avoidance.

Describe kernel estimation
Using using pseudo code and examples where appropriate. Take your description from JL95. Include in your description how the mean and standard deviation of each estimator is calculated.

Data set selection
A log of how you selected/rejected data sets to run.

Code
Hand in all bash/perl/gawk/whatever scripts.

Mean classification scores
An auto-generated table showing all the learners L1..Ln in columns and all the datasets D1..Dm in rows. Table cells are the mean accuracy of Li on data set Di PLUS characters denoting significant difference from naive bayes(with the 10MM discretization):

Last row of table should be mean of each column.

Win-Loss table
Auto-generated, of-course.

Conclusion: cross-vals
Comment on how your results relate to that of Dougherty et.al. Do95]. Comment on the drawbacks or merits of relatively simple learners like NB.

[TOP]


References

Bo03
Choosing between two learning algorithms based on calibrated tests (2003) by Remco Bouckaert ICML 03 http://www.cs.pdx.edu/~timm/dm/10x10way.pdf

Do95
Supervised and Unsupervised Discretization of Continuous Features (1995) by James Dougherty and Ron Kohavi and Mehran Sahami, ICML, pages 194-202, 1995, http://www.cs.pdx.edu/~timm/dm/dougherty95supervised.pdf

Hall03
Benchmarking Attribute Selection Techniques for Discrete Class Data Mining (2003) by M.A. Hall and G. Holmes; IEEE Transactions On Knowledge And Data Engineering (to appear); to appear. http://www.cs.pdx.edu/~timm/dm/hall-benchmark.pdf

Ho95
Very Simple Classification Rules Perform Well on Most Commonly Used Datasets (1993) by Robert C. Holte; Machine Learning; vol. 3, pp. 63-91; http://www.cs.pdx.edu/~timm/dm/simple_rules.pdf

JL95
Estimating Continuous Distributions in Bayesian Classifiers (1995) by George H. John and Pat Langley; Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann; http://citeseer.nj.nec.com/john95estimating.html

WF02
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (1999) by Ian H. Witten, Eibe Frank; Morgan Kaufmann; 1st edition; BSBN: 1558605525; http://www.cs.waikato.ac.nz/~ml/weka/book.html

[TOP]


Credits

Author

Tim Menzies , tim@menzies.us, http://menzies.us

Software

This page generated by Site: see http://www.cs.pdx.edu/~timm/dm/site.html

Acknowledgements

This site is built using PerlPod.

Style sheet switching method taken from Eddie Traversa's excellent and simple-to-apply tutorial: http://dhtmlnirvana.com/content/styleswitch/styleswitch1.html.

Search engine powered by ATOMZ http://www.atomz.com/search/. Note, the indexes to this site are only updated weekly (heh, its a free service- what more ja want?).

Icons on this site come from http://www.sql-news.de/rubriken/olap.asp and http://www.ifnet.it/webif/centrodi/eng/toolbar.htm.

The JAVA machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA) http://www.cs.waikato.ac.nz/ml/weka/

[TOP]


Legal

Copyright

Copyright (C) Tim Menzies 2004

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer

The content from or through this web page are provided 'as is' and the author makes no warranties or representations regarding the accuracy or completeness of the information. Your use of this web page and information is at your own risk. You assume full responsibility and risk of loss resulting from the use of this web page or information. If your use of materials from this page results in the need for servicing, repair or correction of equipment, you assume any costs thereof. Follow all external links at your own risk and liability.

[TOP]