Why all the Scripting?

Data Mining
CS 510 (DM)
Winter, 2004

This site is meant to be about data mining. So why is there so much stuff here about scripting languages like gawk and bash?

Well:

So, what's left? Well, firstly, there is the research frontier, and what to say about that depends on which part of that frontier is your personal obsession. I can't say much more on that. As to the rest...

Need to Script

In my experience, the single feature that predicts a successful data miner, or data mining grad student, is their ability to script. See, most of the time I spend data mining is really spent on data pre-processing. Often what is required is some domain-specific translator between data in different formats. After the learning, the learnt theory then has to be used in some, again, domain-specific way. All that means doing some scripting.

This site illustrates that pre/post processing using examples based on the simplest scripting systems I could find: i.e. gawk and the bash shell. For more on scripting for data mining, see http://www.cs.pdx.edu/~timm/dm/wekatools.html. For more on gawk, see http://www.cs.pdx.edu/~timm/dm/gawk.html. For more on bash, see http://www.cs.pdx.edu/~timm/dm/bash.html.
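
As a rough sketch of the kind of format translator described above, the following turns comma-separated text into the ARFF format read by WEKA. The site's own examples use gawk and bash; Python is used here only for illustration. The function name `csv_to_arff` and the type-guessing rule (a column is numeric if every cell parses as a number, nominal otherwise) are assumptions of this sketch, not the author's code. It also assumes a header row, no missing values, and attribute names without spaces or quotes.

```python
import csv
import io


def csv_to_arff(csv_text, relation="data"):
    """Translate comma-separated text (with a header row) into ARFF text."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]

    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            kind = "numeric"
        else:
            # Nominal attribute: list the distinct symbols seen in the column.
            kind = "{" + ",".join(sorted(set(values))) + "}"
        lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"] + [",".join(row) for row in data]
    return "\n".join(lines) + "\n"
```

For instance, calling `csv_to_arff` on a three-column table with the header `outlook,temp,play` would declare `outlook` and `play` as nominal attributes and `temp` as numeric, followed by the unchanged rows under `@data`.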

Scripting is good

This site also reflects my endorsement of the software design philosophy that I learnt from UNIX gurus such as Eric Raymond, which means I prefer lots of tiny scripts to giant Smalltalk/C++/Java/whatever applications. For more on that philosophy, see http://www.faqs.org/docs/artu/.

When data is missing, run a model!

When a domain lacks data but has a model representing what is believed about that domain, then data mining can proceed as follows. The model is quickly coded up and exercised in a Monte Carlo fashion. The results are a random sample of what is believed about that domain. This sample can become input to a data miner.

Scripting languages, such as gawk and bash, are good tools for quickly writing domain-specific languages for building such models. For techniques on organizing such Monte Carlo methods, see http://www.cs.pdx.edu/~timm/dm/assume.html. For a worked example (based on software project planning), see http://www.cs.pdx.edu/~timm/dm/cocomo.html. For more on domain-specific languages, see http://www.cs.pdx.edu/~timm/dm/dsl.html.
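
A minimal sketch of the steps above (code up a model, exercise it at random, keep the results as data) might look like the following. Here the basic COCOMO effort equation stands in for "the model"; it is not necessarily the model used in the worked example linked above. The sampling choices (project size uniform between 2 and 100 KLOC, every development mode equally likely) and all function names are hypothetical, chosen only for illustration. Python is used here rather than the gawk and bash found elsewhere on this site.

```python
import random

# Basic COCOMO coefficients (a, b) per development mode: effort = a * KLOC**b.
MODES = {
    "organic": (2.4, 1.05),
    "semidetached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def effort(kloc, mode):
    """Estimated effort in person-months under the basic COCOMO equation."""
    a, b = MODES[mode]
    return a * kloc ** b


def monte_carlo(n, seed=1, kloc_range=(2, 100)):
    """Sample n random projects and return rows of (kloc, mode, effort)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    rows = []
    for _ in range(n):
        kloc = rng.uniform(*kloc_range)   # assumed belief: size is uniform in this range
        mode = rng.choice(sorted(MODES))  # assumed belief: every mode equally likely
        rows.append((round(kloc, 1), mode, round(effort(kloc, mode), 1)))
    return rows


if __name__ == "__main__":
    # Emit comma-separated rows that a data miner could take as input.
    print("kloc,mode,effort")
    for row in monte_carlo(5):
        print(",".join(str(cell) for cell in row))
```

Each row is one random draw from what the model believes about the domain. Changing the assumed ranges changes the sample, which is one way to explore how sensitive the learnt theories are to those beliefs.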

The quest for simpler data miners

I can't prove this to you yet, but my thesis is as follows: data mining is simple. My aim here is to write the simplest scripts to generate simple data miners that work as well as more complicated tools.

Watch this space.

Credits

Author

Tim Menzies, tim@menzies.us, http://menzies.us

Software

This page was generated by Site: see http://www.cs.pdx.edu/~timm/dm/site.html

Acknowledgements

This site is built using PerlPod.

Style sheet switching method taken from Eddie Traversa's excellent and simple-to-apply tutorial: http://dhtmlnirvana.com/content/styleswitch/styleswitch1.html.

Search engine powered by ATOMZ http://www.atomz.com/search/. Note, the indexes to this site are only updated weekly (heh, it's a free service, what more d'ya want?).

Icons on this site come from http://www.sql-news.de/rubriken/olap.asp and http://www.ifnet.it/webif/centrodi/eng/toolbar.htm.

The Java machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA): http://www.cs.waikato.ac.nz/ml/weka/

Legal

Copyright

Copyright (C) Tim Menzies 2004

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer

The content from or through this web page is provided 'as is' and the author makes no warranties or representations regarding the accuracy or completeness of the information. Your use of this web page and information is at your own risk. You assume full responsibility and risk of loss resulting from the use of this web page or information. If your use of materials from this page results in the need for servicing, repair or correction of equipment, you assume any costs thereof. Follow all external links at your own risk and liability.
