CS386 Introduction to Databases
Spring 2006 Quarter


Assignment 6 ¨C Indexes and Query Evaluation
Due: Thursday, 1 June 2006 at the beginning of class

You may do this assignment individually or you may work with one partner.  That is, this assignment is to be completed by individuals or by teams of two students.  If you work with a partner, then you should turn in one assignment paper, with both of your names on the paper. You should only talk to the instructor, the TA and your partner about this assignment. You may also post questions to the course mailing list, cs386@cs.pdx.edu.

Please turn in your completed assignments on paper. Put your last name, first name, the assignment number in that order in the first line of your assignment.  List last name and first name for your partner, if you have one, on the second line of your assignment. (If you are working with a partner, turn in one assignment paper.)

Some of this assignment is based on the Spy relational database. Information about this database and information how to access it is on the Database Info Page. Note that the database you want is called ¡®introdb_spy¡¯.
Question 1: Relational Algebra Equivalences

(5 points) How many different (but equivalent) join orders are there for the join of three tables R, S and T? Two orders are considered different if they have parentheses in different places. That is (R JOIN S) JOIN T is different from R JOIN (S JOIN T).

 

Question 2: Statistics

a. (5 points) Determine the min and max salary values in the agent table, and the number of rows in that table.
b. (5 points) Give an estimate for the number of rows in agent with salary < 65000, assuming a uniform distribution between min and max salary. Explain how you derived your estimate.
c. (5 points) Find the 25th, 50th and 75th percentile values for salaries in the agent table. (The 50th percentile value, for instance, is the smallest number s such that 50% of the rows have salary value less than or equal to s.)
d. (5 points) Give an estimate of the number of rows in agent with salary < 65000, assuming in a uniform distribution in each quartile determined in c. Explain how you derived your estimate.
e. (5 points) How many rows in agent have salary < 65000?

Question 3: Selectivity Estimation

a. (5 points) How many distinct cities in the agent table?
b. (5 points) Give an estimate for the number of rows in agent that satisfy the predicate city = C (assuming C is actually the name of one of the cities).
c. (5 points) How many distinct countries in the agent table?
d. (5 points) Give an estimate for the number of rows in agent that satisfy the predicate country = N (assuming N is a country name).
e. (5 points) If city and country are uncorrelated, then one estimate for the number of rows that satisfy the predicate city = C AND country= N is #rows/((# distinct cities)*(# distinct countries)). Calculate this value.
f. (10 points) How many different city-country pairs actually appear in agent? What does that number suggest about the independence of city and country? Propose a better formula to estimate the number of rows that satisfy city = C AND country = N.

Question 4: Query Plans

For each SQL statement below, draw the query plan the Postgres poduces (which you can obtain with the EXPLAIN command). For each plan, suggest a reason that the particular join algorithm(s) is being chosen.
a. (10 points)
SELECT last, sc_level FROM agent, securityclearance WHERE agent_id = sc_id   (I know, the query doesn't make much sense.)
b. (10 points)
SELECT last, language FROM agent NATURAL JOIN languagerel NATURAL JOIN language WHERE language = 'Russian' and salary > 62000.
c. (10 points) SELECT last, language FROM agent NATURAL JOIN languagerel NATURAL JOIN language WHERE language <> 'Russian' and salary > 62000.

Note: You can find info on the EXPLAIN command at http://www.postgresql.org/docs/8.1/interactive/performance-tips.html

Question 5: Index Selection

(5 points) Assume the agent table had 600,000 tuples rather than around 600. Suggest an index structure that would benefit query 4b above. Show a plan for the query that makes use of the index.