CS386 Introduction to Databases
Spring 2006 Quarter


Assignment 6 C Indexes and Query Evaluation
Due: Thursday, 1 June 2006 at the beginning of class

You may do this assignment individually or you may work with one partner.  That is, this assignment is to be completed by individuals or by teams of two students.  If you work with a partner, then you should turn in one assignment paper, with both of your names on the paper. You should only talk to the instructor, the TA and your partner about this assignment. You may also post questions to the course mailing list, cs386@cs.pdx.edu.

Please turn in your completed assignments on paper. Put your last name, first name, the assignment number in that order in the first line of your assignment.  List last name and first name for your partner, if you have one, on the second line of your assignment. (If you are working with a partner, turn in one assignment paper.)

Some of this assignment is based on the Spy relational database. Information about this database and information how to access it is on the Database Info Page. Note that the database you want is called ‘introdb_spy.
Question 1: Relational Algebra Equivalences

(5 points) How many different (but equivalent) join orders are there for the join of three tables R, S and T? Two orders are considered different if they have parentheses in different places. That is (R JOIN S) JOIN T is different from R JOIN (S JOIN T).

 There are just three different join orders for the three tables:

  1. R JOIN (S JOIN T)

  2. S JOIN (R JOIN T)

  3. T JOIN (R JOIN S)

Each of these represents an equivalence class under commutativity, and can be expressed as four distinct expressions. For example, R JOIN (S JOIN T) = (S JOIN T) JOIN R = (T JOIN S) JOIN R = R JOIN (T JOIN S). None of these expressions parenthesizes the joins differently, so all are the same join order. However, since all twelve expressions are equivalent and name the tables in different relative positions to the parentheses and operators, twelve was also accepted as an answer.

Question 2: Statistics

a. (5 points) Determine the min and max salary values in the agent table, and the number of rows in that table.

SELECT MIN(salary), MAX(salary), COUNT(salary) FROM agent;

minimum salary: 50,008

maximum salary: 366,962

rows: 662


b. (5 points) Give an estimate for the number of rows in agent with salary < 65000, assuming a uniform distribution between min and max salary. Explain how you derived your estimate.


The total salary range is 366,962 – 50,008 = 316,954. The threshold value is 65,000 – 50,008 = 14,992 above the minimum, and so 14,992 / 316,954 = 0.0473 = 4.73% of the salaries should lie below this value under the uniformity assumption. Therefore we estimate (0.0473)(662) = 31 rows.


c. (5 points) Find the 25th, 50th and 75th percentile values for salaries in the agent table. (The 50th percentile value, for instance, is the smallest number s such that 50% of the rows have salary value less than or equal to s.)


By definition, these values are the least S such that the query “SELECT COUNT(*) <= (P * 662) FROM agent WHERE salary < S” returns TRUE, where P is 0.25, 0.5, or 0.75 as appropriate. One way to find these values is to order the agent table by salary and observe the value at each quartile boundary row.


25th Percentile: 54,802

50th Percentile: 58,430

75th Percentile: 89,643


d. (5 points) Give an estimate of the number of rows in agent with salary < 65000, assuming in a uniform distribution in each quartile determined in c. Explain how you derived your estimate.


The results above reveal that 65,000 is in the third quartile. By uniformity within this quartile, (65,000 – 58,430) / (89,643 – 58,430) = 6,570 / 31,213 = 21% of the 165.5 rows there have salary values less than 65,000. Of course all of the 331 rows in the first two quartiles also have salary values less than 65,000. Therefore with the new information we now estimate a total of 331 + 35 = 366 rows.


e. (5 points) How many rows in agent have salary < 65000?



SELECT COUNT(*) FROM agent WHERE salary < 65000;

339 rows.

Note how substantially the estimate improved with a few statistics that allowed our uniformity assumption to be restricted to a much smaller range.

Question 3: Selectivity Estimation

a. (5 points) How many distinct cities in the agent table?

SELECT COUNT(DISTINCT city) FROM agent;

46 cities.


b. (5 points) Give an estimate for the number of rows in agent that satisfy the predicate city = C (assuming C is actually the name of one of the cities).


Without additional information, the best assumption is uniformity. We estimate 1/46 of the agents will be from each city, and therefore that 662 / 46 = 14 rows will satisfy such a predicate.


c. (5 points) How many distinct countries in the agent table?


SELECT COUNT(DISTINCT country) FROM agent;


22 countries.


d. (5 points) Give an estimate for the number of rows in agent that satisfy the predicate country = N (assuming N is a country name).


Again uniformity is an appropriate assumption, so we estimate 662 / 22 = 30 rows.


e. (5 points) If city and country are uncorrelated, then one estimate for the number of rows that satisfy the predicate city = C AND country= N is #rows/((# distinct cities)*(# distinct countries)). Calculate this value.


There are 662 rows, 46 distinct cities, and 22 distinct countries, so the value is 662 / ((46)(22)) = 662 / 1012 = 0.654. Assuming uniform distribution among cities and countries and that these values are independent thus leads to quite a low estimate.


f. (10 points) How many different city-country pairs actually appear in agent? What does that number suggest about the independence of city and country? Propose a better formula to estimate the number of rows that satisfy city = C AND country = N.


SELECT DISTINCT city, country FROM agent;


Just 47 pairs actually show up in the table. Since there are 46 distinct city names, we know that there must be exactly one city name that appears with two different country names. Instead of being independent, it seems that city very nearly determines country in the agent table. Therefore it would be better to assume that the conjunction will be satisfied as the city predicate alone, so that we would use the formula # rows / # distinct cities for our estimate.

Question 4: Query Plans

For each SQL statement below, draw the query plan the Postgres poduces (which you can obtain with the EXPLAIN command). For each plan, suggest a reason that the particular join algorithm(s) is being chosen.
a. (10 points) SELECT last, sc_level FROM agent, securityclearance WHERE agent_id = sc_id   (I know, the query doesn't make much sense.)






Since there is a clustered index on agent_id, the index scan on agent returns entries in sorted order. It is cheap to sort the 7 row SecurityClearance table. So with sorted inputs, merge-join is possible.


b. (10 points) SELECT last, language FROM agent NATURAL JOIN languagerel NATURAL JOIN language WHERE language = 'Russian' and salary > 62000.






The selection language = 'Russian' is pushed down to act directly on the Language tree, resulting in a single row returned from that sub-query. The inner loop of the nested-loops join then amounts to a single scan of LanguageRel on just the range with the right lang_id (since it is an index scan).


c. (10 points) SELECT last, language FROM agent NATURAL JOIN languagerel NATURAL JOIN language WHERE language <> 'Russian' and salary > 62000.






This is a much different query from the previous one, mostly because language <> 'Russian' is a much less selective constraint than language = 'Russian'. Since we have such low selectivity and sorting may be expensive, we resort to hash joins.



Note: You can find info on the EXPLAIN command at http://www.postgresql.org/docs/8.1/interactive/performance-tips.html

Question 5: Index Selection

(5 points) Assume the agent table had 600,000 tuples rather than around 600. Suggest an index structure that would benefit query 4b above. Show a plan for the query that makes use of the index.

An additional index for Agent on salary would be a big win because we could then avoid scanning the entire table.