Kinds of Learning

Data Mining
CS 510 (DM)
Winter, 2004


Values can be discrete or continuous.

Attributes are discrete or continuous depending on the kinds of values they can take. For example, here are some discrete attributes from a car domain (in ARFF format):

 @attribute Manufacturer        { Acura, Audi, BMW, Buick, ...}
 @attribute Air_Bags_standard   { 0, 2, 1}
 @attribute Drive_train_type    { 1, 0, 2}

And here are some numeric (continuous) attributes from the same car domain (in ARFF format):

 @attribute City_MPG            real
 @attribute Highway_MPG         real
 @attribute Number_of_cylinders real
 @attribute Engine_size         real
 @attribute Horsepower          real
 ...

As we shall see, different kinds of learners work for different kinds of attributes.
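To make the discrete/continuous distinction concrete, here is a minimal Python sketch that reads one `@attribute` line and reports which kind it declares. The function name `attribute_kind` is hypothetical, and the sketch assumes only the two forms shown above (a brace-enclosed value list, or a numeric type keyword such as `real`); it is not a full ARFF parser.

```python
import re

def attribute_kind(line):
    """Classify one ARFF @attribute line as 'discrete' or 'continuous'.

    Returns (name, kind, values); values is None for continuous attributes.
    """
    match = re.match(r"\s*@attribute\s+(\S+)\s+(.*)", line, re.IGNORECASE)
    if match is None:
        raise ValueError("not an @attribute line: " + line)
    name, spec = match.group(1), match.group(2).strip()
    if spec.startswith("{"):
        # A brace-enclosed list of symbols declares a discrete attribute.
        values = [v.strip().strip("'") for v in spec.strip("{}").split(",")]
        return name, "discrete", values
    if spec.lower() in ("real", "numeric", "integer"):
        # ARFF treats all three keywords as numeric.
        return name, "continuous", None
    raise ValueError("unsupported attribute type: " + spec)

print(attribute_kind(" @attribute Air_Bags_standard   { 0, 2, 1}"))
print(attribute_kind(" @attribute City_MPG            real"))
```

A learner that only handles discrete attributes could use such a check to reject (or discretize) the continuous ones before it starts.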



Learning for discrete classes

Often, one discrete attribute is called the class attribute. For example, our cars could be good or bad.

 @attribute 'class' {'bad','good'}

J48part

Classifiers try to find combinations of non-class attributes that predict for the class value.

One classifier is J48part. For example:

 if   wage-increase-first-year > 2.5 AND
      longterm-disability-assistance = yes AND
      statutory-holidays > 10
 then class=good
 if   wage-increase-first-year <= 4 AND
      working-hours > 36
 then class=bad
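The two rules above can be applied mechanically. The following Python sketch stores them as data and tries them in order, returning the class of the first rule whose tests all hold. Treating the rules as an ordered list where the first match wins is an assumption made here for illustration, as are the names `RULES`, `OPS` and `classify`.

```python
# Each rule is (tests, class); each test is (attribute, operator, value).
RULES = [
    ([("wage-increase-first-year", ">", 2.5),
      ("longterm-disability-assistance", "=", "yes"),
      ("statutory-holidays", ">", 10)], "good"),
    ([("wage-increase-first-year", "<=", 4),
      ("working-hours", ">", 36)], "bad"),
]

OPS = {
    ">": lambda a, b: a > b,
    "<=": lambda a, b: a <= b,
    "=": lambda a, b: a == b,
}

def classify(example, rules=RULES, default=None):
    """Return the class of the first rule whose tests all hold."""
    for tests, label in rules:
        if all(attr in example and OPS[op](example[attr], value)
               for attr, op, value in tests):
            return label
    return default  # no rule fired

contract = {"wage-increase-first-year": 3.0,
            "longterm-disability-assistance": "yes",
            "statutory-holidays": 11,
            "working-hours": 40}
print(classify(contract))
```

Note that this example contract satisfies both rules, so the order in which the rules are tried decides the answer.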

J48

Another classifier is the J48 decision tree where each branch is a conjunction and alternate branches are disjunctions:

 wage-increase-first-year <= 2.5: class=bad 
 wage-increase-first-year > 2.5
 |   statutory-holidays <= 10: class=bad 
 |   statutory-holidays > 10: class=good
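Read as code, the tree above is just a nest of tests. The Python sketch below is a hand translation of that tree (the function name `j48_predict` is hypothetical; it is not WEKA output). Each `return` corresponds to one branch, i.e. one conjunction of tests, and the function as a whole is the disjunction of those branches.

```python
def j48_predict(example):
    """Walk the decision tree shown above and return the predicted class."""
    if example["wage-increase-first-year"] <= 2.5:
        return "bad"      # branch 1: wage <= 2.5
    if example["statutory-holidays"] <= 10:
        return "bad"      # branch 2: wage > 2.5 AND holidays <= 10
    return "good"         # branch 3: wage > 2.5 AND holidays > 10

print(j48_predict({"wage-increase-first-year": 3.0, "statutory-holidays": 12}))
```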

APRIORI

Association rule learners grant no special meaning to any class attribute. These learners try to find sets of attribute values that occur together.

APRIORI is such a learner. It only executes on discrete attributes. Here's what that learner generates on a data set with 19 classes. Note that no class attribute appears in the output and more than one attribute value can appear after the ``==>'' symbol (e.g. see association 5).

  1. int-discolor=none              ==> sclerotia=absent 
  2. mycelium=absent int-discolor=none ==> sclerotia=absent 
  3. leaves=abnorm sclerotia=absent ==> mycelium=absent 
  4. sclerotia=absent               ==> mycelium=absent 
  5. int-discolor=none              ==> mycelium=absent sclerotia=absent 
  6. int-discolor=none sclerotia=absent ==> mycelium=absent 
  7. int-discolor=none              ==> mycelium=absent
  8. leaf-malf=absent               ==> mycelium=absent
  9. mycelium=absent                ==> sclerotia=absent
 10. leaves=abnorm mycelium=absent  ==> sclerotia=absent

Association rule learners can run slower than classifiers since the latter have a very small target concept (the single class attribute) while learners like APRIORI struggle to find associations that predict for any number of attributes.
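How might we judge whether an association such as those listed above is worth reporting? The text does not say, but the standard measures are support (how often the attribute values occur together) and confidence (how often the right-hand side holds when the left-hand side does). The Python sketch below computes both; the four rows are invented for illustration and are not taken from the data set used above.

```python
def support(rows, items):
    """Fraction of rows that contain every attribute=value pair in items."""
    hits = sum(1 for row in rows
               if all(row.get(attr) == value for attr, value in items.items()))
    return hits / len(rows)

def confidence(rows, lhs, rhs):
    """Support of lhs and rhs together, divided by the support of lhs alone."""
    both = dict(lhs)
    both.update(rhs)
    lhs_support = support(rows, lhs)
    return support(rows, both) / lhs_support if lhs_support else 0.0

# Hypothetical toy data (not the real data set).
ROWS = [
    {"int-discolor": "none",  "sclerotia": "absent",  "mycelium": "absent"},
    {"int-discolor": "none",  "sclerotia": "absent",  "mycelium": "absent"},
    {"int-discolor": "brown", "sclerotia": "present", "mycelium": "absent"},
    {"int-discolor": "none",  "sclerotia": "absent",  "mycelium": "present"},
]

lhs = {"int-discolor": "none"}
print(confidence(ROWS, lhs, {"sclerotia": "absent"}))
print(confidence(ROWS, lhs, {"mycelium": "absent", "sclerotia": "absent"}))
```

Since any subset of attribute values may appear on either side of a rule, the number of candidate rules grows very quickly, which is one way to see why such learners can be slow.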



Learning for Weighted Discrete Classes

TAR3

Lift learners seek combinations of non-class attributes that most improve (or ``lift'') the weighted distribution of the classes over some baseline distribution. Calculating such a weighted distribution requires users to attach a weight to each class; e.g. this class is better than that class.

TAR3 is a lift learner that seeks the smallest number of attributes that most lift the baseline. For example, the housing set contains 506 examples of great houses (29%), good houses (29%), poor houses (21%) and bad houses (21%). If we assume that

 great > good > poor > bad

then TAR3 learns the rule:

   6.7 <= rooms < 9.8 AND
   12.6 <= parent teacher ratio < 15.9

This rule, when applied as a constraint to the housing data, removes all but 39 houses which contain 97% great houses and 3% good houses, and no poor or bad houses. This is called the control rule since it offers a policy of what to do to make things better.

If we assume that

  bad > poor > good > great

then TAR3 learns the rule:

 0.6 <= nitrous oxide level < 1.9 AND
 17.16 <= living standard < 39.0

This rule, when applied as a constraint to the housing data, removes all but 81 houses which contain 98% bad houses and 1% poor houses and 1% good houses and no great houses. This is called the monitor rule since it describes what to watch for to detect the worst situation.
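The notion of lift can be sketched with a little arithmetic. The Python below scores a class distribution as a weighted sum and reports lift as the ratio of the score after a rule is applied to the baseline score. The class percentages come from the text above; the weights 4, 3, 2, 1 and the weighted-sum scoring are assumptions chosen for illustration, and TAR3's own scoring function may differ.

```python
def weighted_score(distribution, weights):
    """Sum of (fraction of examples in a class) * (weight of that class)."""
    return sum(fraction * weights[cls] for cls, fraction in distribution.items())

def lift(baseline, treated, weights):
    """How much a rule improves the weighted score over the baseline."""
    return weighted_score(treated, weights) / weighted_score(baseline, weights)

# Class distributions reported in the text.
BASELINE = {"great": 0.29, "good": 0.29, "poor": 0.21, "bad": 0.21}
CONTROL  = {"great": 0.97, "good": 0.03, "poor": 0.00, "bad": 0.00}
MONITOR  = {"great": 0.00, "good": 0.01, "poor": 0.01, "bad": 0.98}

# Hypothetical weights: one set per preference ordering.
PREFER_GREAT = {"great": 4, "good": 3, "poor": 2, "bad": 1}
PREFER_BAD   = {"bad": 4, "poor": 3, "good": 2, "great": 1}

print(round(lift(BASELINE, CONTROL, PREFER_GREAT), 2))
print(round(lift(BASELINE, MONITOR, PREFER_BAD), 2))
```

Under these assumed weights both rules score well above 1 (roughly 1.49 for the control rule and 1.70 for the monitor rule); a rule that left the distribution unchanged would score exactly 1.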

[Figure: the effects of TAR3.]



Continuous Classes

(Note: WEKA's M5' learner generated all the following.)

M5': simple linear regression

When the class is continuous, simple linear regression can be used to learn a linear model; i.e. the equation of one line that falls closest to all the examples. For example, from the autoPrice data set, a linear regression tool learns:

 price  = -59400 + 79.8*symboling + 7.14*normalized-losses + 198*wheel-base
        - 92.5*length + 767*width + 38.9*height + 5.09*curb-weight + 49.9*engine-size
        - 1810*bore - 1840*stroke + 104*compression-ratio + 26.1*horsepower
        + 0.753*peak-rpm + 18.9*city-mpg - 13.5*highway-mpg
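Using such a model is a matter of multiplying and adding. As a minimal sketch, the Python below stores the learned coefficients in a dictionary and evaluates the equation for one car. The coefficients are copied from the equation above; the names `INTERCEPT`, `COEFFICIENTS` and `predict_price` are hypothetical, and the caller must supply a value for every attribute.

```python
INTERCEPT = -59400

# One coefficient per attribute, copied from the learned equation.
COEFFICIENTS = {
    "symboling": 79.8, "normalized-losses": 7.14, "wheel-base": 198,
    "length": -92.5, "width": 767, "height": 38.9, "curb-weight": 5.09,
    "engine-size": 49.9, "bore": -1810, "stroke": -1840,
    "compression-ratio": 104, "horsepower": 26.1, "peak-rpm": 0.753,
    "city-mpg": 18.9, "highway-mpg": -13.5,
}

def predict_price(car):
    """Evaluate the linear model: intercept + sum(coefficient * value)."""
    return INTERCEPT + sum(coef * car[name] for name, coef in COEFFICIENTS.items())

# With every attribute at zero, only the intercept remains.
zero_car = {name: 0 for name in COEFFICIENTS}
print(predict_price(zero_car))
```

The large negative intercept is a reminder that a single line fitted to all the data need not give sensible answers far from the examples it was learned from.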

M5': regression trees

A regression tree is a decision tree in which each leaf holds the average class value of the examples that fall down that branch. The internal nodes of such a tree test non-class attributes. For example, from the autoPrice data set, a regression tree tool learns:

 curb-weight <= 2660 : 
 |   curb-weight <= 2290 : 
 |   |   curb-weight <= 2090 : 
 |   |   |   length <= 161 : price=6220
 |   |   |   length >  161 : price=7150
 |   |   curb-weight >  2090 : price=8010
 |   curb-weight >  2290 : 
 |   |   length <= 176 : price=9680
 |   |   length >  176 : 
 |   |   |   normalized-losses <= 157 : price=10200
 |   |   |   normalized-losses >  157 : price=15800
 curb-weight >  2660 : 
 |   width <= 68.9 : price=16100
 |   width >  68.9 : price=25500
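The regression tree above can be hand-translated into nested tests, as in the Python sketch below (the function name `regression_tree_price` is hypothetical). Each leaf returns a constant, so the tree can only ever predict one of eight prices.

```python
def regression_tree_price(car):
    """Walk the regression tree shown above; each leaf is a constant price."""
    if car["curb-weight"] <= 2660:
        if car["curb-weight"] <= 2290:
            if car["curb-weight"] <= 2090:
                return 6220 if car["length"] <= 161 else 7150
            return 8010
        if car["length"] <= 176:
            return 9680
        return 10200 if car["normalized-losses"] <= 157 else 15800
    return 16100 if car["width"] <= 68.9 else 25500

print(regression_tree_price({"curb-weight": 3000, "width": 70.0}))
```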

M5': model trees

A model tree is a decision tree whose internal nodes test non-class attributes and whose leaves are linear models. The branches of the tree select which model to apply. For example, from the autoPrice data set, a model tree tool learns:

 curb-weight <= 2660 : 
 |   curb-weight <= 2290 : LM1 
 |   curb-weight >  2290 : 
 |   |   length <= 176 : LM2 
 |   |   length >  176 : LM3 
 curb-weight >  2660 : 
 |   width <= 68.9 : LM4
 |   width >  68.9 : LM5
 LM1:  price = -5280 + 6.68*normalized-losses + 4.44*curb-weight
                   + 22.1*horsepower - 85.8*city-mpg + 98.6*highway-mpg
 LM2:  price = 9680
 LM3:  price = -1100 + 91*normalized-losses
 LM4:  price = 9940 + 47.5*horsepower
 LM5:  price = -19000 + 13.2*curb-weight
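A model tree therefore works in two steps: walk the tree to pick a linear model, then evaluate that model. The Python sketch below hand-translates the tree and the five models above; the names `LINEAR_MODELS`, `select_model` and `model_tree_price` are hypothetical.

```python
# Each linear model is (intercept, {attribute: coefficient}).
LINEAR_MODELS = {
    "LM1": (-5280, {"normalized-losses": 6.68, "curb-weight": 4.44,
                    "horsepower": 22.1, "city-mpg": -85.8, "highway-mpg": 98.6}),
    "LM2": (9680, {}),
    "LM3": (-1100, {"normalized-losses": 91}),
    "LM4": (9940, {"horsepower": 47.5}),
    "LM5": (-19000, {"curb-weight": 13.2}),
}

def select_model(car):
    """Step 1: the branches of the tree choose which linear model applies."""
    if car["curb-weight"] <= 2660:
        if car["curb-weight"] <= 2290:
            return "LM1"
        return "LM2" if car["length"] <= 176 else "LM3"
    return "LM4" if car["width"] <= 68.9 else "LM5"

def model_tree_price(car):
    """Step 2: evaluate the chosen linear model on the car."""
    intercept, coefficients = LINEAR_MODELS[select_model(car)]
    return intercept + sum(coef * car[name] for name, coef in coefficients.items())

heavy_wide_car = {"curb-weight": 3000, "width": 70.0}
print(select_model(heavy_wide_car), model_tree_price(heavy_wide_car))
```

Compared with the regression tree, the leaves here can vary their prediction with the car's attributes, which is why the model tree needs fewer branches.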



Credits

Author

Tim Menzies, tim@menzies.us, http://menzies.us

Software

This page generated by Site: see http://www.cs.pdx.edu/~timm/dm/site.html

Acknowledgements

This site is built using PerlPod.

Style sheet switching method taken from Eddie Traversa's excellent and simple-to-apply tutorial: http://dhtmlnirvana.com/content/styleswitch/styleswitch1.html.

Search engine powered by ATOMZ http://www.atomz.com/search/. Note, the indexes to this site are only updated weekly (heh, it's a free service; what more ja want?).

Icons on this site come from http://www.sql-news.de/rubriken/olap.asp and http://www.ifnet.it/webif/centrodi/eng/toolbar.htm.

The JAVA machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA) http://www.cs.waikato.ac.nz/ml/weka/



Legal

Copyright

Copyright (C) Tim Menzies 2004

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer

The content from or through this web page is provided 'as is' and the author makes no warranties or representations regarding the accuracy or completeness of the information. Your use of this web page and information is at your own risk. You assume full responsibility and risk of loss resulting from the use of this web page or information. If your use of materials from this page results in the need for servicing, repair or correction of equipment, you assume any costs thereof. Follow all external links at your own risk and liability.
