Highest level goals: To show students empirical methods for assessing different learners.
Lower level goal: To generate a comment on the following hypothesis: Bayesian learners are just as effective as, and simpler to build than, entropy-based decision-tree learners.
Low-level goal: Conduct a 10-by-10 cross-validation study to generate win-loss tables comparing:
Students to work in groups of three or four.
The project has many parts and the total marks for these parts add up to 150 points. That is, many parts of the assignment are optional. Further, if you do badly in one part, you can make up for it by doing extra work elsewhere.
Parts are either print outs or reports:
If your submission contains multiple files, submit as a zip file. Zip files are to be labeled as follows:
gGpP.zip
where P is which part you are answering and
G is your group number. Html files are to be labeled as follows:
gGpP.html
Reports are to be submitted to tim@menzies.us with the subject line:
part P group G CS510 submission
If you get the subject line right, then you will get back an automatic report saying ``submission to CS510 received''. Your assignment is NOT submitted unless you get that notice back.
week | marks
-----+------
   2 |     5
   3 |     1
   4 |    10
   5 |     1
   6 |     1
   7 |    15
   8 |     1
   9 |     1
  10 |    25
-----+------
total|    60
Submission: Email me a report with photos and names of your group members.
Submission: A print out of http://groups.yahoo.com/group/dm04pdx1/members with the emails of your group members highlighted (this means you have to join yahoo first).
Submission: Give me a print out of 9 WEKA outputs, numbered OneR1,OneR2,OneR3, NB1,NB2,NB3, J1,J2,J3 (all stapled together).
To generate this print out, run 3 data sets listed in http://www.cs.pdx.edu/~timm/dm/data/somediscretedatasets.dat through three learners: OneR, NaiveBayes, and J48.
For this exercise, run the learners through manually. Later on, we'll automate this process.
Learn what you can using OneR, NaiveBayes, J48 from the first ten data sets in the table http://www.cs.pdx.edu/~timm/dm/data/somediscretedatasets.dat .
To build the learners, modify my J48 example to generate shell scripts for OneR and NaiveBayes.
Submission: Give me a report with the following sections
                                        attributes
                            +-------------+-----------+-------+
dataset        | #instances | #continuous | #discrete | total | #classes
---------------+------------+-------------+-----------+-------+---------
breast cancer  |        230 |           0 |         9 |     9 |        2
contact-lenses |        119 |           0 |         4 |     4 |        3
...            |        ... |         ... |       ... |   ... |      ...
dataset         | oneR  | nb    | j48
----------------+-------+-------+------
breast-cancer   | 34.21 | 11.67 | 23.5
contact-lenses  | 81.90 | 22.60 | 11.90
...
----------------+-------+-------+------
average           72.34   17.89   36.23
Don't forget the ``average'' row at the end.
learner | wins-losses | wins | losses
--------+-------------+------+-------
oneR    |           5 |   10 |      5
nb      |           3 |    7 |      4
j48     |           2 |    6 |      4
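We will define a win/loss as follows: on a given data set, a learner wins against another learner if its accuracy is more than 1 percent higher, and loses if its accuracy is more than 1 percent lower. A difference of 1 percent or less is a tie.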
Note that a win-loss table is always sorted by the column wins-losses.
Also, for three learners and ten data sets, each learner can be
compared to two others, repeated for ten data sets.
Hence the maximum number of wins or losses in a win-loss table is 20.
The minimum number is 0 since if a learner ties with
all other learners, that is neither a win nor a loss.
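The win/loss rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course scripts: the function name win_loss_table, the dictionary layout, and the accuracy numbers are all invented for the example.

```python
def win_loss_table(acc, threshold=1.0):
    """acc maps learner -> list of accuracies (percent), one per data set.
    Returns rows (learner, wins - losses, wins, losses), sorted by wins - losses."""
    rows = []
    for a in acc:
        wins = losses = 0
        for b in acc:
            if a == b:
                continue
            # compare learner a with learner b on every data set
            for x, y in zip(acc[a], acc[b]):
                if x - y > threshold:
                    wins += 1
                elif y - x > threshold:
                    losses += 1
        rows.append((a, wins - losses, wins, losses))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows

# Hypothetical accuracies on three data sets (illustrative numbers only).
acc = {"oneR": [70.0, 80.0, 60.0],
       "nb":   [75.0, 80.5, 58.0],
       "j48":  [74.5, 85.0, 60.5]}
for row in win_loss_table(acc):
    print("%-5s %3d %3d %3d" % row)
```

With three learners and three data sets, no learner can score more than 6 wins or 6 losses here, which mirrors the 20-win ceiling for three learners and ten data sets.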
Notes:
http://www.cs.pdx.edu/~timm/dm/j48correct.awk :
#j48correct.awk
function round(x) {return int(x+0.5)}
/Correctly/ {correct=round($5*100)/100}
END {print correct}
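If you prefer to think in Python, the same extraction looks roughly like this. It is a sketch that assumes WEKA's summary line has the shape "Correctly Classified Instances <count> <percent> %"; the function name correct_percent and the sample text are made up for the example.

```python
def correct_percent(weka_output):
    """Return the percentage on the last 'Correctly Classified Instances' line,
    rounded to two decimals, or None if there is no such line."""
    found = None
    for line in weka_output.splitlines():
        if "Correctly" in line:          # case-sensitive, so 'Incorrectly' is skipped
            fields = line.split()
            found = round(float(fields[4]), 2)   # same field as $5 in the AWK script
    return found

# Invented sample: a training-set summary followed by a test-set summary.
sample = """
Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
"""
print(correct_percent(sample))
```

As in the AWK version, the last matching line wins, so when WEKA prints several summaries only the final one is reported.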
http://www.cs.pdx.edu/~timm/dm/learners.sh :
#!/usr/bin/bash
## example usage:
# chmod +x learners.sh config
# ./learners.sh
. config
sep=","
outfile="learners.out"
learners="j48 j48"
data="audiology weather"
function accuracy() {
    ./$1 $2 | $gawk -f j48correct.awk
}
function header() {
    echo -n "dataset"
    for learner in $learners
    do
        echo -n "$sep$learner"
    done
    echo ""
}
function learners() {
    for datum in $data
    do  echo -n "$datum"
        for learner in $learners
        do  echo -n "$learner $datum = ">"/dev/tty"
            acc=`accuracy $learner $datum`
            echo -n "$sep$acc"
            echo "$acc%">"/dev/tty"
        done
        echo ""
    done
}
main() {
    header
    learners
}
main > "$outfile"
Note that for this to work, the files j48correct.awk and
config have to be in the same
directory as learners.sh. See http://www.cs.pdx.edu/~timm/dm/config
This script reports the maximum seen in every column.
http://www.cs.pdx.edu/~timm/dm/max.awk
## example usage
# . config ; $gawk -f max.awk skip=1 learners.out
BEGIN {FS=",";
       Inf=2**32;
       skip=1;
}
## initializations
FNR==1 {for(i=1;i<=NF;i++) {
            if (i==skip) continue
            max[i]= -1*Inf};
        cols=NF;
        next;
}
## functions
function bigger(i,j) {if (i>j) {return i} else {return j}}
function a2s(a,n,sep0,ignore,   str,i,sep) {
    sep="";
    str="";
    for(i=1;i<=n;i++) {
        if (i==ignore) continue;
        str=str sep a[i];
        sep=sep0};
    return str
}
## process data
{for(i=1;i<=cols;i++) {
     if (i==skip) continue;
     max[i]=bigger($i,max[i])}
}
## the row label names the statistic this script computes
END {print "max," a2s(max,cols,",",skip)}
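The ``average'' row asked for earlier needs column means rather than column maxima. Here is a minimal Python sketch of that variation, assuming the same comma-separated layout as learners.out (a header line, then one row per data set). The name column_means and the sample numbers are invented.

```python
import csv
import io

def column_means(text, skip=1):
    """Mean of every column except column `skip` (1-based) in comma-separated
    text whose first line is a header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    means = []
    for i in range(len(header)):
        if i + 1 == skip:
            continue                      # e.g. the data set name column
        vals = [float(r[i]) for r in body]
        means.append(sum(vals) / len(vals))
    return means

# Invented sample in the shape of learners.out.
sample = "dataset,oneR,nb,j48\nweather,40.0,60.0,50.0\naudiology,50.0,70.0,80.0\n"
print("average," + ",".join("%.2f" % m for m in column_means(sample)))
```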
Stuff not to do:
Using my PerlPod stuff,
document
a modified version of bars that accepts a -a
flag on the command line. When -a is set, bars assumes it is reading an ARFF
file and:
@data.
Run this on soybean
bars -a -r 0 -1 30 data/soybean.arff
and you should see something like this:
2-4-d-injury| 16| *******
alternarialeaf-spot| 91| ****************************************
anthracnose| 44| *******************
bacterial-blight| 20| *********
bacterial-pustule| 20| *********
brown-spot| 92| ****************************************
brown-stem-rot| 44| *******************
charcoal-rot| 20| *********
cyst-nematode| 14| ******
diaporthe-pod-&-stem-blight| 15| *******
diaporthe-stem-canker| 20| *********
downy-mildew| 20| *********
frog-eye-leaf-spot| 91| ****************************************
herbicide-injury| 8| ***
phyllosticta-leaf-spot| 20| *********
phytophthora-rot| 88| **************************************
powdery-mildew| 20| *********
purple-seed-stain| 20| *********
rhizoctonia-root-rot| 20| *********
Your document should include the output as an example.
For more on bars, see http://www.cs.pdx.edu/~timm/dm/bars.html
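To make the idea concrete, here is a rough Python sketch of a class-count histogram over an ARFF file. It is NOT the bars code: the assumptions are that the class is the last attribute, that the data rows follow the @data line, and that the longest bar is scaled to 40 stars (a guess read off the sample output above). The names class_counts and bars, and the toy ARFF text, are made up.

```python
from collections import Counter

def class_counts(arff_text):
    """Count the values of the last attribute in the @data section of ARFF text."""
    counts, in_data = Counter(), False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # blank lines and ARFF comments
            continue
        if in_data:
            counts[line.split(",")[-1].strip()] += 1
        elif line.lower().startswith("@data"):
            in_data = True
    return counts

def bars(counts, width=40):
    """One 'label| count| stars' line per value; the biggest count gets `width` stars."""
    biggest = max(counts.values())
    lines = []
    for label in sorted(counts):
        stars = int(counts[label] * width / biggest + 0.5)
        lines.append("%28s| %3d| %s" % (label, counts[label], "*" * stars))
    return lines

# Toy ARFF text (invented).
sample = """@relation toy
@attribute outlook {sunny,rainy}
@attribute play {yes,no}
@data
sunny,no
sunny,yes
rainy,yes
rainy,yes
"""
print("\n".join(bars(class_counts(sample))))
```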
cocomo. Print out what comes back when you execute the following
from your own directory:
cocomo -t cocomotypes1.dat -r cocomorestraints1.dat -n 1 | gawk -F, '{print $(NF-1)}'
bars and hand in the resulting histogram.
An accumulate function that accepts (a)
an array (in AWK) or a hash (in PERL)
and (b) a number x,
and updates the fields of that array as follows:
sum=sum+x, sumSquared=sumSquared+x*x, and n=n+1.
Also hand in the output from this code executing.
An update function that takes the same array/hash as above and
adds a mean and sd field to the array/hash as follows:
function mean(sum,n) {return sum/n}
function sd(sumSq,sum,n) {return sqrt((sumSq-((sum*sum)/n))/(n-1))}
Also hand-in the output from this code executing.
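For comparison, here is how the two functions might look in Python, with a dict standing in for the AWK array or PERL hash. This is a sketch, not a model answer: the field names follow the text above, and the sample numbers are invented.

```python
import math

def accumulate(stats, x):
    """Update the running totals held in the dict `stats` with the number x."""
    stats["sum"] = stats.get("sum", 0.0) + x
    stats["sumSquared"] = stats.get("sumSquared", 0.0) + x * x
    stats["n"] = stats.get("n", 0) + 1
    return stats

def update(stats):
    """Add 'mean' and 'sd' fields using the same formulas as the AWK functions."""
    n, total, sq = stats["n"], stats["sum"], stats["sumSquared"]
    stats["mean"] = total / n
    stats["sd"] = math.sqrt((sq - total * total / n) / (n - 1))   # needs n >= 2
    return stats

stats = {}
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    accumulate(stats, x)
update(stats)
print(stats["mean"], round(stats["sd"], 4))
```

Keeping only sum, sumSquared and n means the mean and sd can be refreshed at any time without storing the individual numbers.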
Two tasks: understand the COCOMO model and improve our definition of win/loss from the previous assignment.
Regarding COCOMO:
aam mean?),
with COCOMO expert coded up and seen running on a few examples. Include a section
listing all the fixes you made to my COCOMO model.
As to the win/loss table, our previous definition was inadequate.
In statistics, two samples are assessed to be different by looking at
their means and their standard deviations. This is called a t-test
and is parameterized with some confidence alpha. In the following
code, mean and sd are computed as follows. Let the
accuracies of two learners l1, l2 over k data sets be
acc[l1,1],acc[l1,2],...,acc[l1,k] and
acc[l2,1],acc[l2,2],...,acc[l2,k]. Let di=acc[l1,i]-acc[l2,i].
Then mean and sd are the average and standard deviation of di.
The compare function returns a symbol indicating whether the
di numbers are truly different.
function compare(mean,sd,k,alpha) {
if ( same10(mean,sd,k,alpha) ) { return "0" }
if ( mean < 0 ) return "-"
if ( mean > 0 ) return "+"
}
Assuming ten degrees of freedom, then same10 does the t-test.
function same10(mean,sd,k,alpha,   t) {
  t=abs(tval(mean,sd,k));
  if ( alpha==0.05 ) return t < 2.22814;
  if ( alpha==0.01 ) return t < 3.169277;
  print "unknown alpha level " alpha;
  exit;
}
# t statistic for the differences di: their mean divided by its standard error
function tval(mean,sd,k) {return mean/(sd/sqrt(k))}
function abs(x) {if (x<0) {return -1*x} else {return x}}
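Here is the same test as a small Python sketch that starts from the two lists of accuracies rather than from a precomputed mean and sd. It assumes the usual paired t statistic, mean/(sd/sqrt(k)), and reuses the two critical values for 10 degrees of freedom from the code above. The accuracy lists are invented.

```python
import math

CRITICAL_10DF = {0.05: 2.22814, 0.01: 3.169277}   # two-tailed, 10 degrees of freedom

def compare(acc1, acc2, alpha=0.05):
    """Return '+', '-' or '0' for learner 1 versus learner 2 (paired accuracies)."""
    d = [a - b for a, b in zip(acc1, acc2)]
    k = len(d)
    mean = sum(d) / k
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (k - 1))
    if sd == 0:
        # all differences identical: either no difference at all, or a constant gap
        t = 0.0 if mean == 0 else math.inf
    else:
        t = abs(mean / (sd / math.sqrt(k)))
    if t < CRITICAL_10DF[alpha]:
        return "0"
    return "+" if mean > 0 else "-"

# Invented accuracies for three learners on ten data sets.
l1 = [75, 78, 75, 74, 76, 78, 72, 77, 75, 76]
l2 = [70, 72, 71, 69, 70, 73, 68, 71, 70, 72]
l3 = [71, 71, 72, 68, 71, 72, 69, 70, 71, 71]
print(compare(l1, l2), compare(l2, l1), compare(l2, l3))   # + - 0
```

Note that l2 and l3 differ on every data set, yet the differences cancel out, so the test calls it a tie.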
To assess two learners, a standard method is a 10x10 way study:
10 times do:
   a) randomize order of data in dataset
   b) for fold=1 to 10 do:
         generate test and train set for "fold"
         for learner in learners do:
            acc[learner,fold] = learn(train,test)
This will generate 100 entries in acc for each learner for each dataset.
The win-loss table is then generated by taking each dataset
and comparing all pairs of learners using the above t-test.
Note that this is not a fast process: it takes 10*10*Learners*Data runs of a learner. Good thing it is all automated, heh?
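The 10x10 loop above can be sketched in Python as a generator of train/test index lists. This is an illustration only: it does no stratification, the fold assignment (every tenth row of the shuffled order) is just one simple choice, and the name ten_by_ten is invented.

```python
import random

def ten_by_ten(n_instances, repeats=10, folds=10, seed=1):
    """Yield (repeat, fold, train_indices, test_indices) for a repeats-by-folds study."""
    rng = random.Random(seed)
    for r in range(1, repeats + 1):
        order = list(range(n_instances))
        rng.shuffle(order)                      # a) randomize order of data
        for f in range(folds):                  # b) one pass per fold
            test = order[f::folds]              # every folds-th row, starting at f
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield r, f + 1, train, test

splits = list(ten_by_ten(50))
print(len(splits))   # 100 train/test pairs per data set
```

Each learner is then trained and tested once per pair, which is where the 100 entries per learner per data set come from.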
The statistically literate amongst you will be surprised at the following:
we use 10 degrees of freedom, not 99, for the t-test (i.e. the above
same10 code).
A recent study at a data mining conference made a convincing argument
that this was better for technical reasons.
Some bash techniques will be useful.
REPEATS=10
FOLDS=10
for DATUM in $DATA
do
    OUTER=1
    while [ $OUTER -le $REPEATS ]; do
        INNER=1
        #use randomarff to randomize order of data set
        while [ $INNER -le $FOLDS ]; do
            echo "nway=$OUTER:$INNER">"/dev/tty"
            #use traintest to extract fold "INNER" from $DATUM
            for LEARNER in $LEARNERS
            do
                #
                #a) Call weka using the -t, -T -v flags described
                #   on p280 of your text. Ignore the -x and -s flags
                #b) Collect the accuracies via a script
                #
                echo "$DATUM, $LEARNER, $ACC"
                echo "$DATUM, $LEARNER, $ACC">"/dev/tty"
            done
            let INNER=$INNER+1
        done
        let OUTER=$OUTER+1
    done
done > report.txt
(See also randomarff.html
and traintest.html, which are in http://www.cs.pdx.edu/~timm/dm ).
The above code stores its output in report.txt.
You MUST use that file to
automatically build the win-loss tables (bring on the scripting).
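As a starting point, here is a Python sketch of reading lines in the "dataset, learner, accuracy" shape written to report.txt and tallying wins and losses per learner. The names read_report, tally and by_mean are invented, and by_mean is only a stand-in for whatever statistical test you plug in; the sample lines are made up.

```python
from collections import defaultdict
from itertools import combinations

def read_report(lines):
    """Group 'dataset, learner, accuracy' lines into acc[dataset][learner] = [floats]."""
    acc = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue                      # skip blank or malformed lines
        dataset, learner, value = parts
        acc[dataset][learner].append(float(value))
    return acc

def tally(acc, compare):
    """compare(xs, ys) must return '+', '-' or '0'. Returns learner -> [wins, losses]."""
    score = {}
    for dataset, by_learner in acc.items():
        for learner in by_learner:
            score.setdefault(learner, [0, 0])
        for a, b in combinations(sorted(by_learner), 2):
            verdict = compare(by_learner[a], by_learner[b])
            if verdict == "+":
                score[a][0] += 1
                score[b][1] += 1
            elif verdict == "-":
                score[a][1] += 1
                score[b][0] += 1
    return score

def by_mean(xs, ys):
    """Placeholder comparison (NOT a t-test): mean accuracies more than 1 apart."""
    diff = sum(xs) / len(xs) - sum(ys) / len(ys)
    return "+" if diff > 1 else "-" if diff < -1 else "0"

# Invented lines in the shape written by the bash loop.
report = ["weather, j48, 70", "weather, nb, 80", "weather, j48, 72",
          "weather, nb, 82", "audiology, j48, 90", "audiology, nb, 90.5"]
score = tally(read_report(report), by_mean)
for learner, (w, l) in sorted(score.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True):
    print(learner, w - l, w, l)
```

To read the real file, pass open("report.txt") instead of the sample list.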
Re-compute your win-loss tables from naiveBayes (with kernel estimation), J48, OneR running on the data sets shown in http://www.cs.pdx.edu/~timm/dm/data/somediscretedatasets.dat
Your report should include all code, well-commented in my PerlPod format, and the new win-loss tables at the alpha=5% and alpha=1% levels.
MOST IMPORTANT: Your report should also comment on patterns in the results. If you just report what you have done, that will get you average marks for this report. For good marks, you have to report on what you have learnt. For example, are there any features of a data set that make it more/less suitable for (e.g.) J48?
Run TAR3. Use ``stable'' to seek stable treatments within http://www.cs.pdx.edu/~timm/dm/stable.html. Hand in the ``stable'' results.
Run model trees and LSR (these are settings inside M5 prime) on one data set with continuous classes. Hand in the learnt theories.
Do nothing. Gain one mark for relaxing and preparing your mind for the final push.
Task: 1) log of working data miners with COCOMO; 2) win-loss tables for numeric attributes.
Don't panic at the size of this assignment spec. Read it carefully and you'll see you already have 90% of the machinery you need for all the following.
Also, take care in planning your workload. Task 2 is in three parts:
I STRONGLY suggest you START with the first and second parts of task 2 and then, while the second part is running, turn to task 1.
Overview:
Do two case studies: KC1 and FB3 (see page 2 of http://menzies.us/pdf/00ase.pdf). Modify restraints.dat to reflect the union of nowi and changesi (these are columns labeled in figure 2 of that paper). Assume newKsloc=100.
discretization method NB NB+ke J48
--------------------- -- ----- ---
1. none: x x x
2. equal interval:
2a. kMM:
5MM (max-min)/5 x x
10MM (max-min)/10 x x
15MM (max-min)/15 x x
20MM (max-min)/20 x x
2b. binlogI:
(max-min)/max(1,2*log(U)) x x
3. equal frequency
3a. binRoot:
max(30,sqrt(T)) x x
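To make the equal interval rows concrete, here is a Python sketch of kMM discretization, where each bin is (max-min)/k wide. Only the kMM rows are covered, since the table does not say what U, T, or the base of the log are. The name equal_interval and the sample numbers are invented.

```python
def equal_interval(values, k):
    """Map each number to a bin index 0..k-1 using bins of width (max-min)/k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]        # all values identical: one bin
    # the maximum value would land in bin k, so clamp it into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_interval([0, 1, 2, 5, 9, 10], 5))   # [0, 0, 1, 2, 4, 4]
```

For the 5MM, 10MM, 15MM and 20MM rows, call it with k set to 5, 10, 15 and 20.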
Notes (on task2)
To hand in (for the COCOMO experiments):
To hand in (for the discretization experiments):
Last row of table should be mean of each column.
Tim Menzies ,
tim@menzies.us,
http://menzies.us
The JAVA machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA) http://www.cs.waikato.ac.nz/ml/weka/
Copyright (C) Tim Menzies 2004
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.