Note that the cardinality reported here
is the final cardinality
of the table
that results
from running the query.
Equally or even more important in the cost model
are intermediate cardinalities
-- the cardinalities
of intermediate results.
These intermediate cardinalities
have a major effect,
for example,
on the cost
of the physical join operators.
Many of the errors in cardinality estimates
are the result of the independence assumption
made in Model D.
For example,
in query 3 (see Appendix Q.3),
Model D assumes
that L.L_SHIPDATE is independent of
O.O_ORDERDATE -- clearly a suspect assumption.
In the actual TPC-D relations
the two dates are dependent,
and so
the selection
of tuples
where
the order date is before a certain date
and the ship date is after this same date
is much more selective
(fewer tuples satisfy the criteria)
in TPC-D
than it is
(under the independence assumption)
in Model D.
Hence the actual final cardinality (11541) is much
smaller than the optimizer's estimate (957558).
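
The effect of the independence assumption
can be illustrated with a small simulation.
The following Python sketch is not Model D
and does not reproduce the figures quoted above;
it uses synthetic data
in which the ship date is assumed to fall
1 to 121 days after the order date,
and the function name, date range, and cutoff
are hypothetical choices made for illustration.

```python
import random


def compare_estimates(n=100_000, cutoff=1000, seed=0):
    """Compare an independence-based estimate with the actual count.

    Dates are integers (days). The dependency ship_day = order_day + delay
    is an assumption made for this illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        order_day = rng.randint(0, 2400)
        ship_day = order_day + rng.randint(1, 121)  # ship date depends on order date
        rows.append((order_day, ship_day))

    # Selectivity of each predicate, measured separately.
    sel_order = sum(o < cutoff for o, _ in rows) / n
    sel_ship = sum(s > cutoff for _, s in rows) / n

    # Independence assumption: multiply the individual selectivities.
    independent = sel_order * sel_ship * n

    # Actual number of rows satisfying both predicates.
    actual = sum(1 for o, s in rows if o < cutoff and s > cutoff)
    return independent, actual


independent, actual = compare_estimates()
print(f"independence estimate: {independent:.0f}, actual: {actual}")
```

Because only orders placed shortly before the cutoff
can ship after it,
the actual count is a small fraction
of the estimate obtained by multiplying the two selectivities.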
In the case of query 2,
the use of a prior aggregation in a later join predicate
causes the Model D cardinality estimate to be off
by a large factor.
Some of the cardinality estimates
might also be improved by using
histograms to represent the distribution
of values for an attribute instead of
relying on the uniform distribution assumption.
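
As a rough illustration of this point,
the sketch below compares a uniform-distribution estimate
with an equi-width histogram estimate
for a range predicate on a skewed attribute.
The data, the bucket count, and all function names
are assumptions chosen for the example;
they are not taken from Model D or from TPC-D.

```python
def uniform_estimate(n_rows, lo, hi, value):
    """Estimated rows with attr <= value, assuming values are uniform on [lo, hi]."""
    frac = (value - lo) / (hi - lo)
    return n_rows * min(max(frac, 0.0), 1.0)


def build_histogram(values, lo, hi, buckets):
    """Build an equi-width histogram; returns (counts, bucket_width)."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts, width


def histogram_estimate(counts, width, lo, value):
    """Estimated rows with attr <= value, assuming uniformity only within a bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        if value >= b_hi:
            total += count                           # bucket fully covered
        elif value > b_lo:
            total += count * (value - b_lo) / width  # bucket partially covered
    return total


# Skewed synthetic attribute: 900 values in [0, 10), 100 values in [10, 100).
values = [float(i % 10) for i in range(900)] + [10 + 0.9 * i for i in range(100)]
lo, hi, probe = 0.0, 100.0, 10.0

actual = sum(v <= probe for v in values)
uniform = uniform_estimate(len(values), lo, hi, probe)
counts, width = build_histogram(values, lo, hi, buckets=10)
hist = histogram_estimate(counts, width, lo, probe)
print(f"actual={actual}, uniform={uniform:.0f}, histogram={hist:.0f}")
```

On this data the uniform assumption underestimates the result
by roughly a factor of nine,
while the histogram estimate is close to the actual count.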
While
the accuracy Model D achieves
in some queries
appears to be dismal,
the impact
on the plans
the optimizer produces
may not be catastrophic.
If the logical property estimates
for all groups
and the physical property estimates
for all plans
are affected similarly,
then the
impact
on the optimality relationship
between plans
might not be as significant
as the magnitude
of the inaccuracies
in the logical model would suggest.
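
A toy example makes this argument concrete.
In the sketch below the plan names, costs, and error factors
are all hypothetical;
it only shows that an error applied equally to every plan
leaves the cost ordering unchanged,
whereas an uneven error can reorder the plans.

```python
def rank_plans(costs):
    """Return plan names ordered from cheapest to most expensive."""
    return sorted(costs, key=costs.get)


# Hypothetical "true" costs of three alternative plans.
true_costs = {"plan_a": 120.0, "plan_b": 200.0, "plan_c": 450.0}

# Every plan's cost is inflated by the same factor.
uniform_error = {name: cost * 80.0 for name, cost in true_costs.items()}

# Each plan's cost is distorted by a different factor.
uneven_factors = {"plan_a": 80.0, "plan_b": 2.0, "plan_c": 1.0}
uneven_error = {name: true_costs[name] * uneven_factors[name] for name in true_costs}

true_ranking = rank_plans(true_costs)
uniform_ranking = rank_plans(uniform_error)
uneven_ranking = rank_plans(uneven_error)

print("true:   ", true_ranking)
print("uniform:", uniform_ranking)
print("uneven: ", uneven_ranking)
```

Whether the errors in a real optimizer behave more like the uniform case
or the uneven case
is an empirical question
that this sketch does not answer.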