BARS: simple histograms

Data Mining
CS 510 (DM)
Winter, 2004

 copyleft() {
        cat<<-EOF
        bars: print histogram of a column of data
        Copyright (C) 2004 Tim Menzies
        This program is free software; you can redistribute it and/or
        modify it under the terms of the GNU General Public License
        as published by the Free Software Foundation, version 2.
        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
        GNU General Public License for more details.
        You should have received a copy of the GNU General Public License
        along with this program; if not, write to the Free Software
        Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
        EOF
  }

Motivation

Before learning, it's good to have a peek at data distributions.



Usage

 usage() {
        cat<<-EOF
        Usage: bars  [FLAGS]... FILE
        Print histogram of a column of data
        Flags: 
         -1 NUM         space reserved for the histogram first column
                        (shows the bin's keys); default:  $Q$one0$Q
         -2 NUM         space reserved for the histogram second column
                        (shows the bin's counter); default: $Q$two0$Q
         -3 NUM         space reserved for the histogram's third column
                        (shows the histogram bars); default: $Q$three0$Q
         -c NUM         column to process; default: $Q$c0$Q
         -d CHAR        column delimiter; default:$Q$sep0$Q
         -h             print this help text
         -l             copyright notice
         -m CHAR        mark for drawing histogram bars; default:$Q$mark0$Q
         -r NUM         rounding factor; default:$Q$round0$Q;
                        increase to decrease number of bins; 
                        set to zero to disable rounding 
         -u             don't sort output; default:$Q$unsort0$Q
         -x             run an example
        EOF
 }

Examples

How big are the files in my home directory right now? I could check using freqx ( http://www.cs.pdx.edu/~timm/dm/freqx.html ):

 bash-2.05$ ls -sa $HOME | gawk 'NR>1{print $1}' | freqx
   36 2
    5 4
    2 22
    2 10
    2 0
    1 6
    1 32
    1 24
    1 140
    1 110
    1 1

Can't read that. I want a histogram:

 bash-2.05$ ls -sa $HOME | gawk 'NR>1{print $1}' | bars
    0| 44| ********************
   10|  3| **
   20|  3| **
   30|  1| *
  110|  1| *
  140|  1| *

Hmm, I want more detail on those smaller files, so I'll decrease the rounding factor.

 bash-2.05$  ls -sa $HOME | gawk 'NR>1{print $1}' | bars -r4
    0|   3| ***
    4|  41| ****************************************
    8|   1| *
   12|   2| **
   24|   3| ***
   32|   1| *
  112|   1| *
  140|   1| *
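To see where the counts in the two histograms come from, the freqx output above can be re-binned by hand. The following is only a sketch: `freqx_counts` and `rebin` are hypothetical helpers (not part of bars), plain `awk` stands in for gawk, and the arithmetic assumes the same `int(x/Round + 0.5) * Round` rounding that bars.awk uses.

```shell
# freqx_counts: the "count value" pairs printed by freqx above
freqx_counts() {
  printf '%s\n' "36 2" "5 4" "2 22" "2 10" "2 0" "1 6" \
                "1 32" "1 24" "1 140" "1 110" "1 1"
}

# rebin ROUND: read "count value" pairs on stdin and print "bin count"
# pairs, smallest bin first
rebin() {
  awk -v r="$1" '
    { bin = int($2 / r + 0.5) * r    # nearest multiple of the rounding factor
      total[bin] += $1 }             # add this value count to its bin
    END { for (b in total) print b, total[b] }' | sort -n
}

freqx_counts | rebin 10   # first line: 0 44
freqx_counts | rebin 4    # second line: 4 41
```

With a rounding factor of 10, the sizes 0, 1, 2 and 4 all land in bin 0 (36+5+2+1 = 44 files). With a factor of 4, sizes 2 and 4 move to bin 4, which is why that bin holds 41 files.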

Installation

Copy the following files to your own directory: http://www.cs.pdx.edu/~timm/dm/bars and http://www.cs.pdx.edu/~timm/dm/bars.awk.

Make bars executable:

 chmod +x bars

Edit your paths (see the section Settings). The first line of bars should point to your local bash shell and the gawk variable (below) should point to your local version of gawk.
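One way to find those local paths is to ask the shell. This is a small sketch; `find_tool` is a hypothetical helper name, and it assumes the tools are somewhere on your PATH.

```shell
# find_tool NAME: print the full path of NAME, or a note if it is missing
find_tool() {
  command -v "$1" || echo "$1: not found on PATH"
}

find_tool bash    # use this path on the first line of bars
find_tool gawk    # use this path for the gawk variable in Settings
```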

Check that it all works:

 bars -x

If the installation worked, then you should see a histogram of the sizes of the files in your home directory. For me, that looks like:

    0| 44| **********************
   10|  3| **
   20|  3| **
   30|  1| *
  110|  1| *
  140|  1| *



Source code

Settings

Defaults:

 c0="NF"
 sep0=","
 mark0="*"
 round0=10
 unsort0=0
 one0=4
 two0=4
 three0=40

Paths:

 gawk="/pkgs/gnu/bin/gawk"
 junk="/tmp/bars$$"

Minor details:

 Q="\""

Demo code

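 barsDemo() { 
        # list block sizes of the files in directory $1, skipping the "total" line
        ls -as "$1" | $gawk 'NR>1 {print $1}' > "${junk}.data"
        main "${junk}.data"
        rm -f "${junk}.data"    # remove the temporary file
        }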

Main

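 main() {
    # FS is passed as an assignment operand (not with -F) so that it is applied
    # after the BEGIN block of bars.awk, which would otherwise reset it to ","
    $gawk -f bars.awk FS="$sep" Collect="$c" Mark="$mark" NoSort="$unsort" \
                             Col1="$one" Col2="$two" Col3="$three"\
                             Round="$round" \
                             $1
 }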
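The `Name="value"` operands passed to gawk in main rely on a standard awk rule: assignments on the command line are applied after the BEGIN block has run, just before the following input is read. So a default set in BEGIN survives only if no operand replaces it. A minimal sketch, where `override_demo` is a hypothetical helper and plain `awk` stands in for gawk:

```shell
# override_demo [ASSIGNMENT]: show which value of Round the main rules see
override_demo() {
  # BEGIN sets a default; a Round=... operand, if given, is applied later,
  # before the first line of standard input ("-") is read
  echo "ignored" | awk 'BEGIN { Round = 10 } { print Round }' "$@" -
}

override_demo            # prints 10
override_demo Round=4    # prints 4
```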

Bars.awk: The Worker

 BEGIN {

Command-line options

   FS=",";       #column separator
   Round=10;     #size of bins
   Collect=2;    #column to process
   Mark="*";     #mark used to draw histogram bars
   NoSort=0;     #disable sorting of keys
   Col1=4;       #width of histogram key column
   Col2=3;       #width of histogram count column
   Col3=20;      #width of histogram bar display column

Internal globals

   Num;          #array where we store the numbers
   Inf= 2**32;   #the largest number we can process
   Max = -1*Inf; #max count seen in any bucket
                 #initialized to the smallest number
   Here;         #the key of the current bucket
 }

Convert the symbol "NF" to the number of the last field.

 NR==1 {if (Collect=="NF") Collect=NF;}
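The effect of this rule can be tried in isolation. In the sketch below, `last_field` is a hypothetical helper, the comma separator and the sample rows are made up for illustration, and plain `awk` stands in for gawk.

```shell
# last_field COLUMN: print the chosen column of comma-separated stdin;
# COLUMN is either a number or the symbol NF (meaning the last column)
last_field() {
  awk -F, 'NR==1 { if (Collect == "NF") Collect = NF }  # resolve NF once
           { print $Collect }' Collect="$1" -
}

printf '1,2,30\n4,5,60\n' | last_field NF   # prints 30 then 60
printf '1,2,30\n4,5,60\n' | last_field 2    # prints 2 then 5
```

Because the symbol is resolved on the first record only, the column count of the first line decides which column is read for the whole file.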

Collect the number, rounded.

 { if (Round) {
     Here=round($Collect/Round)*Round}
   else {Here=$Collect};
   Num[Here]++;
   if (Num[Here]>Max) Max=Num[Here];
 }

Report

 END {
   if (NoSort) {
     histogram(Num) }
   else {
     sortedgram(Num)}
 }

Generate histogram bars for the entire histogram

  function histogram(a, i) { for(i in a) print bar(a,i) }

Pre-sort the histogram and generate the bars in sorted order

 function sortedgram(a, add,i,j,keys,n) {
   for(i in a) {
     if (Round) { 
       keys[j++]=i+0}   #ensures numeric, not string, sort
     else keys[j++]=i}   
   n=asort(keys);
   for(i=1;i<=n;i++) 
     print bar(a,keys[i]);
 }
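The `i+0` in sortedgram matters because array subscripts in awk are strings, and strings compare character by character. A small sketch of the difference, with `key_order` as a hypothetical helper and plain `awk` standing in for gawk:

```shell
# key_order: compare the keys 110 and 20 as strings and as numbers
key_order() {
  awk 'BEGIN {
    a = "110"; b = "20"     # strings, like the keys produced by "for (i in a)"
    s = (a < b)     ? "110 first" : "20 first"   # string comparison
    n = (a+0 < b+0) ? "110 first" : "20 first"   # +0 forces numeric comparison
    print "string order: " s
    print "numeric order: " n
  }'
}

key_order
# string order: 110 first
# numeric order: 20 first
```

Without the conversion, a bin such as 110 would be listed ahead of bins 20 and 30.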

Generate a single histogram bar of Marks, resized according to the column 3 width.

 function bar(a,i,     scale) {
   # shrink the bars only when the biggest bin would overflow column 3
   if (Max < Col3) { 
     scale=1 }
   else {scale=Col3/Max};
   if (Round) {   # rounded keys are numbers
     return  sprintf(" %" Col1 ".0f|%" Col2 "d| %s",\
                     i, a[i],string(round(a[i]*scale),Mark))}
   else {         # unrounded keys may be any string
     return  sprintf(" %" Col1 "s|%" Col2 "d| %s",\
                     i, a[i],string(round(a[i]*scale),Mark))}
 }
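The scaling rule can be checked against the `bars -r4` example, where the largest bin holds 41 files and the bar column is 40 wide. This is a sketch only: `bar_len` is a hypothetical helper that repeats the arithmetic of bar() outside of bars.awk, using plain `awk`.

```shell
# bar_len COUNT MAX WIDTH: number of marks drawn for a bin holding COUNT
# items, when the fullest bin holds MAX and the bar column is WIDTH wide
bar_len() {
  awk -v n="$1" -v max="$2" -v width="$3" 'BEGIN {
    scale = (max+0 < width+0) ? 1 : width / max   # same test as in bar()
    print int(n * scale + 0.5)                    # same rounding as round()
  }'
}

bar_len 41 41 40   # prints 40
bar_len 3 41 40    # prints 3
bar_len 5 10 40    # prints 5 (no scaling: the fullest bin fits)
```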

Round a number

 function round(x) { return int(x+0.5) }
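One caveat: `int()` truncates toward zero, so this round() is only a round-to-nearest for non-negative input. That is fine for file sizes and counts, but it shifts negative values up by one. The sketch below compares it with a sign-aware variant; `round_demo` and the variant are illustrations, not part of bars.

```shell
# round_demo X: print int(X+0.5) followed by a sign-aware rounding of X
round_demo() {
  awk -v x="$1" 'BEGIN {
    simple = int(x + 0.5)                                   # as in bars.awk
    signed = (x+0 < 0) ? -int(-x + 0.5) : int(x + 0.5)      # hypothetical variant
    print simple, signed
  }'
}

round_demo 2.6     # prints 3 3
round_demo -2.6    # prints -2 -3
```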

Generate a string n long of characters c.

 function string(n,c,  s) { while(n-- > 0) {s=s c}; return s}

Command line processing

 demo=""
 while getopts "1:2:3:c:d:hlm:r:ux" flag
 do case "$flag" in
        1) one=$OPTARG;;
        2) two=$OPTARG;;
        3) three=$OPTARG;;
        c) c=$OPTARG;;
        d) sep=$OPTARG;;
        h) usage;exit;;
        l) copyleft; exit;;
        m) mark=$OPTARG;;
        r) round=$OPTARG;;
        u) unsort=1;;
        x) demo="barsDemo $HOME";;
    esac
 done
 shift $(($OPTIND - 1))
 one=${one:=$one0}
 two=${two:=$two0}
 three=${three:=$three0}
 c=${c:=$c0}  
 sep=${sep:=$sep0}  
 mark=${mark:=$mark0}  
 round=${round:=$round0}  
 unsort=${unsort:=$unsort0}  
 [ -n "$demo" ] && $demo && exit;
 main $*
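The pattern above (read flags with getopts, then fall back to the defaults from Settings) can be tried on a single flag. A minimal sketch, assuming only the `-r` flag; `parse_round` is a hypothetical helper, not part of bars.

```shell
# parse_round [FLAGS]: print the rounding factor after flag processing
parse_round() {
  round0=10            # default, as in the Settings section
  round=""             # forget any value left over from an earlier call
  OPTIND=1             # restart getopts for this call
  while getopts "r:" flag; do
    case "$flag" in
      r) round=$OPTARG;;
    esac
  done
  round=${round:=$round0}   # := supplies the default when round is unset or empty
  echo "$round"
}

parse_round          # prints 10
parse_round -r 4     # prints 4
parse_round -r 0     # prints 0
```

Since `:=` only fires on an unset or empty value, `-r 0` is kept as 0, which is what lets a zero rounding factor disable rounding.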



Credits

Author

Tim Menzies, tim@menzies.us, http://menzies.us

Software

This page generated by Site: see http://www.cs.pdx.edu/~timm/dm/site.html




Legal

Copyright

Copyright (C) Tim Menzies 2004

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer

The content from or through this web page is provided 'as is', and the author makes no warranties or representations regarding the accuracy or completeness of the information. Your use of this web page and information is at your own risk. You assume full responsibility and risk of loss resulting from the use of this web page or information. If your use of materials from this page results in the need for servicing, repair or correction of equipment, you assume any costs thereof. Follow all external links at your own risk and liability.
