CS386
Introduction to Databases
Spring 2006
Quarter
Assignment 6
ᄄC
Indexes and Query Evaluation
Due:
Thursday, 1 June 2006 at the beginning of class
You may do this assignment individually or you may work with one partner. That is, this assignment is to be completed by individuals or by teams of two students. If you work with a partner, then you should turn in one assignment paper, with both of your names on the paper. You should only talk to the instructor, the TA and your partner about this assignment. You may also post questions to the course mailing list, cs386@cs.pdx.edu.
Please turn in your completed assignments on paper. Put your last name, first name, the assignment number in that order in the first line of your assignment. List last name and first name for your partner, if you have one, on the second line of your assignment. (If you are working with a partner, turn in one assignment paper.)
(5 points) How many different (but equivalent) join orders are there for the join of three tables R, S and T? Two orders are considered different if they have parentheses in different places. That is (R JOIN S) JOIN T is different from R JOIN (S JOIN T).
There are just three different join orders for the three tables:
R JOIN (S JOIN T)
S JOIN (R JOIN T)
T JOIN (R JOIN S)
Each of these represents an equivalence class under commutativity, and can be expressed as four distinct expressions. For example, R JOIN (S JOIN T) = (S JOIN T) JOIN R = (T JOIN S) JOIN R = R JOIN (T JOIN S). None of these expressions parenthesizes the joins differently, so all are the same join order. However, since all twelve expressions are equivalent and name the tables in different relative positions to the parentheses and operators, twelve was also accepted as an answer.
a. (5 points) Determine the min and max salary values in the agent table, and the number of rows in that table.
SELECT MIN(salary), MAX(salary), COUNT(salary) FROM agent;
minimum salary: 50,008
maximum salary: 366,962
rows: 662
b. (5
points) Give an estimate for the number of rows in agent with
salary < 65000, assuming a uniform distribution between min and
max salary. Explain how you derived your estimate.
The total salary range is 366,962 – 50,008 = 316,954. The threshold value is 65,000 – 50,008 = 14,992 above the minimum, and so 14,992 / 316,954 = 0.0473 = 4.73% of the salaries should lie below this value under the uniformity assumption. Therefore we estimate (0.0473)(662) = 31 rows.
c. (5
points) Find the 25th, 50th and 75th percentile values for salaries
in the agent table. (The 50th percentile value, for instance, is the
smallest number s such that 50% of the rows have salary value less
than or equal to s.)
By definition, these values are the least S such that the query “SELECT COUNT(*) <= (P * 662) FROM agent WHERE salary < S” returns TRUE, where P is 0.25, 0.5, or 0.75 as appropriate. One way to find these values is to order the agent table by salary and observe the value at each quartile boundary row.
25th Percentile: 54,802
50th Percentile: 58,430
75th Percentile: 89,643
d. (5
points) Give an estimate of the number of rows in agent with
salary < 65000, assuming in a uniform distribution in each
quartile determined in c. Explain how you derived your estimate.
The results above reveal that 65,000 is in the third quartile. By uniformity within this quartile, (65,000 – 58,430) / (89,643 – 58,430) = 6,570 / 31,213 = 21% of the 165.5 rows there have salary values less than 65,000. Of course all of the 331 rows in the first two quartiles also have salary values less than 65,000. Therefore with the new information we now estimate a total of 331 + 35 = 366 rows.
e. (5
points) How many rows in agent have salary < 65000?
SELECT COUNT(*) FROM agent WHERE salary < 65000;
339 rows.
Note how substantially the estimate improved with a few statistics that allowed our uniformity assumption to be restricted to a much smaller range.
a. (5 points) How many distinct cities in the agent table?
SELECT COUNT(DISTINCT city) FROM agent;
46 cities.
b. (5
points) Give an estimate for the number of rows in agent that satisfy
the predicate city = C (assuming C is actually the name of one of the
cities).
Without additional information, the best assumption is uniformity. We estimate 1/46 of the agents will be from each city, and therefore that 662 / 46 = 14 rows will satisfy such a predicate.
c. (5
points) How many distinct countries in the agent table?
SELECT COUNT(DISTINCT country) FROM agent;
22 countries.
d. (5
points) Give an estimate for the number of rows in agent that satisfy
the predicate country = N (assuming N is a country name).
Again uniformity is an appropriate assumption, so we estimate 662 / 22 = 30 rows.
e. (5
points) If city and country are uncorrelated, then one estimate for
the number of rows that satisfy the predicate city = C AND country= N
is #rows/((# distinct cities)*(# distinct countries)). Calculate this
value.
There are 662 rows, 46 distinct cities, and 22 distinct countries, so the value is 662 / ((46)(22)) = 662 / 1012 = 0.654. Assuming uniform distribution among cities and countries and that these values are independent thus leads to quite a low estimate.
f. (10
points) How many different city-country pairs actually appear in
agent? What does that number suggest about the independence of city
and country? Propose a better formula to estimate the number of rows
that satisfy city = C AND country = N.
SELECT DISTINCT city, country FROM agent;
Just 47 pairs actually show up in the table. Since there are 46 distinct city names, we know that there must be exactly one city name that appears with two different country names. Instead of being independent, it seems that city very nearly determines country in the agent table. Therefore it would be better to assume that the conjunction will be satisfied as the city predicate alone, so that we would use the formula # rows / # distinct cities for our estimate.
For each SQL statement
below, draw the query plan the Postgres poduces (which you can obtain
with the EXPLAIN command). For each plan, suggest a reason that the
particular join algorithm(s) is being chosen.
a. (10 points)
SELECT last, sc_level FROM agent,
securityclearance WHERE agent_id = sc_id (I know,
the query doesn't make much sense.)

Since there is a clustered index on agent_id, the index scan on agent returns entries in sorted order. It is cheap to sort the 7 row SecurityClearance table. So with sorted inputs, merge-join is possible.
b. (10 points) SELECT
last, language FROM agent NATURAL JOIN languagerel NATURAL JOIN
language WHERE language = 'Russian' and salary > 62000.

The selection language = 'Russian' is pushed down to act directly on the Language tree, resulting in a single row returned from that sub-query. The inner loop of the nested-loops join then amounts to a single scan of LanguageRel on just the range with the right lang_id (since it is an index scan).
c. (10 points) SELECT
last, language FROM agent NATURAL JOIN languagerel NATURAL JOIN
language WHERE language <> 'Russian' and salary > 62000.

This is a much different query from the previous one, mostly because language <> 'Russian' is a much less selective constraint than language = 'Russian'. Since we have such low selectivity and sorting may be expensive, we resort to hash joins.
Note: You can find info on the EXPLAIN command at http://www.postgresql.org/docs/8.1/interactive/performance-tips.html
An additional index for Agent on salary would be a big win because we could then avoid scanning the entire table.
