This query
will return
a table of
customer names
and the total quantity of items
they have ordered.
(To understand
the exact meaning
of this query,
refer to
a description of SQL
such as
[Date 93].)
From this point on,
I will abbreviate the attribute names
C_CUSTKEY to CCK,
O_CUSTKEY to OCK,
O_ORDERKEY to OOK and
L_ORDERKEY to LOK, etc.
After the user
submits the SQL query,
the database system would
produce a corresponding physical plan.
The first step
in this process
is to translate
the logical query
from SQL
into a query tree
in logical algebra.
This step is the parser.
The next step
is to translate
the query tree
in logical algebra
into a
physical plan.
There are generally
a large number of plans that implement the query tree;
the process of finding the best one is called
query optimization.
That is,
for some query execution performance measure
(e.g. execution time),
we want to find the plan
with the
best execution performance.
The goal is
that the plan
be optimal
or near optimal
within the search space
of the optimizer.
The optimizer starts by
copying the
relational algebra
query tree
into its
search space.
The optimizer
then expands the search space
and finds the best plan.
At this level of generality,
the optimizer
can be viewed as the
code generation2
part
of a
query compiler
for the SQL language,
that produces code
to be interpreted
by the query execution engine,
except that the
optimizer's emphasis
is on producing
"very efficient" code.
For example,
the optimizer
uses the database system's catalog
to get information
(e.g. number of tuples)
about the stored relations
referenced by the query,
something traditional
programming language
compilers normally do not do.
Finally,
the optimizer
copies the optimal3
physical plan
out of its
memo structure
and sends it to
the query execution engine.
The query execution engine
executes the plan
using the relations
in the stored database
as input,
and produces
the table
of customer names and item quantities,
described above,
as output.
2. Code generation may include various post-optimization phases where
actual machine code is generated.
This thesis is only concerned with the generation of a plan.
3. In theory,
optimality is the goal.
However optimality is always
with respect to
a particular search space of possible plans
-- defined
in Cascades
primarily
by rules,
and with respect
to a particular cost model
and catalog information
-- estimates subject to
potentially substantial inaccuracies.
In practice
(for reasons discussed in section 6)
the search space may not be
unambiguously defined.
Also,
for various reasons,
most notably
the huge size of the search space,
the optimizer may not even attempt to search
the entire search space.
So in practice,
we look for
a "good" plan,
near
the optimal plan.