PSU CS199 — Introduction to Computer Science

Homework 4: Linear Regression

Due Monday 10 August 2009

Language Level: Intermediate Student
Teachpacks: image.ss

This homework will let you gain experience with functions that manipulate lists. You will want to use local definitions (and thus the Intermediate student language). If you want to use higher-order functions, which simplify the assignment, you will need the Intermediate Student language.

The homework involves linear regression. The problem is well-suited to being decomposed into a set of related functions, as you will discover for yourself.


Preliminaries

Work with a partner, or alone: your choice. If you work with a partner, be sure to Sit Together.

Read Sections 19 thru 21.3 from HTDP. The crib sheet in Figure 57 will be useful. (Note that the figure in the printed version of HTDP has some typos.)

Download this file, and save to your Scheme directory under as new file name (the name of you and your partner). We have discovered (the hard way) that on Windows Vista, it is dangerous to save a file to your downloads directory and then run DrScheme on it. The file contains some data on which you will run your functions, and parts of some data definitions and templates to get you started.

Your Assignment

Part I: Datasets and Regression Analysis

You will need three data definitions in order to do this homework: datum, dataset and linear-eqn (for "linear equation"). A datum will consist simply of an x and a y. A dataset will consist of a nonempty list of datums. A linear-eqn will consist of a slope and a y-intercept (the coefficients m and b from the familiar linear equation form y = mx + b).

A linear regression analysis is applied to a set of data; its result is the line that mathematically best fits the data as it would appear on a two-dimensional plot. Operationally, with linear regression analysis, you put a dataset in and get a linear equation out. Your task is to provide a function linear-eqn-fitting whose contract is

    ;; linear-eqn-fitting: dataset -> linear-eqn

What is Linear Regression?

In the following discussion, assume that n is the number of datums in the given dataset.

The best-fit slope for the linear model of a set of data is given by

The y-intercept for the linear model of a set of data is given by

These two formulae can be represented as Scheme functions with the following contracts:

 ;; slope: dataset -> num
 ;; intercept: dataset -> num

As you develop these functions, you will find that you will want to use local definitions and helper functions. The helper functions you will need to write in order to get linear-eqn-fitting to work will mostly be list-processing functions that operate on datasets. It is up to you whether you write structurally recursive functions that follow the How to Design Programs template, or whether you use Scheme's built-in abstract functions for list processing. Either way, be sure you surround all your functions by contracts, purposes, and tests, as always.

Part II: Graphics

The second part of this homework is a visualization of linear regression. You will write a function

  ;; linreg-image: dataset -> image
that, given a dataset, will produce a plot of the data overlaid with the best-fit line as determined by the linear regression analysis from Part 1, as well as the text of the linear equation.

Your visual representation of linear regression analysis must include, at a minimum,

You may break this problem into smaller pieces as you see fit. Here is a skeleton of my own implementation:

  • Part III: Finishing up

    Pull all the pieces together. You should have been writing tests of the pieces as you developed them; now make sure that there are tests for the whole program. There is no good way to test the graphical part, but do make sure that when I click run in the definitions window, the code runs and produces a diagram similar to this one:

    If you didn't manage to do all the parts, hand in what you did complete. For example, a set of axes with a regression line but no points is still good for partial credit. Parts I and II are independent: if you never managed to get linear-eqn-fitting working correctly, fake it! Dummy up a function that returns

       (make-linear-eqn 0.3 38)

    regardless of its input, and carry on with Part II.

    If you have time on your hands, you can get really fancy, as shown here:

    This is more than I'm expecting, but it should give you some ideas. You may not have time to do anything this elaborate, but you should see that the detail is a "mere matter of programming" and that you could get there if you had the time and inclination. R2 is a measure of how closely the line fits the data:

    R =

    where the overbar represents the arithmetic mean of the term beneath the bar, and σ the standard deviation (as defined in HW2) of the subscript term.

    Hand in your work.

    Put your names as a comment at the top of the definitions window. Save your file from DrScheme, attach it to an email message. In addition, put the names of both partners (or say that you din't use a partner) in the body of the email and in the name of the file. Submit your email to CS199Homework.

    Acknowledgements

    This homework is based on a lab used in the University of Chicago's CS105 Class, originally designed by Adam Shaw. The data were obtained from http://exploringdata.cqu.edu.au/oly_gold.htm