Logical Operators


           Here we explain the figures for logical operators in this appendix.

           For each logical operator, we show a diagram of the operator's inputs and parameters, an example use of an operator, a description of how the logical properties of the output are computed and the values of the output's logical properties for the given example.

           The figure in the top right is a diagram showing a logical operator, its parameters and its inputs. A logical operator can take 0 or more inputs, have 0 or more parameters and has exactly one output.

           In the top right corner is an alternate representation of the logical operator, either in relational algebra if the operator is in the relational model, or as an SQL query, if the operator represents an SQL extension to the relational model.

           Below these two figures is an example, shown as a query subtree on the left, and in the relational algebra or SQL notation on the right.

           Below the example figures is a section on the "update" of logical properties. For every logical property in our model, this describes the output logical property based on the logical properties of the input(s) -- that is, how each logical operator calculates the logical properties of the output table. This calculation is based on the logical properties of the input(s) and the logical operator itself. (In the case of the GET logical operator, which has no inputs, the output logical properties are determined by looking in the catalog, based on the collection parameter of the GET operator.)

           When there are no inputs, the logical properties of the output table are determined by looking them up in the catalog. For example, In the case of the GET logical operator, which has no inputs, the output logical properties are determined by looking up the required data in the catalog, based on the 'collection' parameter of the GET operator. When there is only one input, and the operator does not affect a logical property, the notation, "same" indicates that the value of the logical property for the input is copied to this logical property for the output.

           As with other parts of the model, the computation of output logical properties is an approximation: there is no attempt to be comprehensive in computing output logical properties. In general, we try to be conservative, saying, for example, that an output table is a set, only if we are sure it contains no duplicates, and a functional dependency applies only if it is guaranteed to apply. Clearly there are instances where it can be shown that the output table is a set and our model classifies it as a bag, and clearly there are more functional dependencies on the output tables than we track. Our emphasis is on refining the model as required to enhance the quality of the output plan and, secondarily, the accuracy of the cost estimates computed by the optimizer.

           Here is a brief description of the logical properties used in model D:
Schema - A list of columns (or attributes) comprising the table. In Cascades, a schema is represented as a set of equivalence classes. Each equivalence class consists of a set of equivalent attributes, where two attributes are considered equivalent if they are made equal by the equality conditions of an EQJOIN. In the example schemas for the logical operators figures, when the equivalence class consists of a single attribute, as a shorthand, we simply list the attribute. When the equivalence class contains more than one attribute, we use the set notation. For example, in the EQJOIN logical operator figure, the schema includes the singleton equivalence class {OD}, written as simply OD, while the equivalence class created by the EQJOIN condition CCK = OCK is written as {CCK, OCK}. In fact, when the distinction is not important, we say that the schema contains an attribute when, more precisely, the schema contains an equivalence class which contains the attribute.

Candidate Key - In general, a table could have many candidate keys. In our model, however, only one candidate key is tracked. The TPC-D Benchmark Specification lists a primary key for each of the stored relations. A primary key is the combination of the logical concept of candidate key and the physical concept of ordering. Here we are concerned only with logical properties, so for stored relations (i.e. those listed in the catalog), we use the primary key in the TPC-D Benchmark specification for the candidate key logical property. This in turn will be the candidate key for the output table from the GET logical operator for that stored relation. Note that the candidate key could also be considered a functional dependency of the form:
candidate key schema
which says that, for a given table, the attributes of the candidate key determine all the attributes of a tuple.

Tracked Functional Dependencies - These are functional dependencies (other than the candidate key, described above) that we track for each table. No functional dependencies for the stored relations are listed in the TPC_D Benchmark Specification, so the functional dependencies list will be empty after the initial GET of a stored TPC_D relation. In fact, the primary source of the functional dependencies tracked by the optimizer are those computed when updating logical properties for an EQJOIN.

           When the functional dependencies of a table need to be accessed, they are referred to as funcd_list. Each individual functional dependency is referred to as funcd. Each funcd has a left hand side and a right hand side, which are called defining and dependent, respectively. Both defining and dependent are sets of attributes, and the functional dependency can be written as:
defining dependent
That is, the attributes of the left hand side functionally determine the attributes on the right hand side.

Cardinality - Number of tuples in the output.

Unique Cardinality - Number of distinct tuples in the output.

Unique - Indicates whether the output table is a true set (guaranteed to have no duplicate tuples), designated by 'yes', or a bag, designated by 'no'.
Column Unique Cardinality - For every attribute in the output table's schema, the number of distinct values that occur for this attribute in the output table. (Note that a column's cardinality is trivially equal to the cardinality of the output table.) The unique cardinality of an attribute in the schema is written as attr.cucard. The dot between attr and cucard could be thought of as representing a reference to an element of a structure in the C programming language which represents the attribute. Most of the Cascades data structures (e.g. the data structure for the type, attribute) are C++ classes, and the access is through a member function. So from this, and the discussion of schema above, it is easy to understand how attr.cucard could actually appear in Cascades' C++ code as eqclass.get_cucard().



Example Output Logical Properties


           After listing how the output logical properties are computed, the values of the output logical properties for the example figure are shown. The output logical properties for the example are based on runs of the optimizer on the TPC-D queries with a scale factor (SF) of 1. For the logical properties corresponding to the initial TPC-D stored relations, see Appendix C, which lists the catalog information we used for the TPC-D queries. This includes a table of attribute abbreviations used in these figures. Also refer to the figure for the GET logical operator, which shows the logical properties of the output table resulting from the GET of the stored relation, ORDER.


More Notation


           In the context of these logical property tables, yes is synonymous with TRUE, no with FALSE. For notational convenience, we make use of the ?: operator from the C and C++ languages, where:
			bool ? value1 : value2

means value1 when bool evaluates TRUE, value2 otherwise. In many cases we nest this operator to form the equivalent of a case statement:
			bool1 ? value1 : 
			bool2 ? value2 :
			bool3 ? value3 :
			value4


           In addition, ordered sets, used for join equality conditions, and the ordered property o_attrs, also have the [i] operator, where i must be between 1 and the size of the list, which returns the ith element of the list.

           The word same is used for those logical operators with a single input, and applies when a particular logical property of the input is simply to be copied to the logical property of the output.

           Finally, we list four tables for the types, the parameters, inputs and output of logical operators and the variables used in calculating logical properties.

Here are the types which are used for parameters of logical operators, or for variables used in calculating the logical properties of a logical operator's output table.




Types
Type Context Meaning
schema logical property of a table Set of attributes comprising the table. Actually representation is a set of attribute equivalence classes -- see text above for discussion
attr or attribute parameter of a operator Column of a table: Includes the table range_variable and the attribute range variable to uniquely identify any column in any table.
func_d funcd_list logical property of a table Two sets of attributes defined on the table. The first set of attributes is the defining set and the second set of attributes is the determined set.
agg_op parameter of AGG_LIST logical operator A new attribute computed as an aggregate function of zero or more of the existing attributes of a table.


Below are the different parameters that occur in the model's logical operators.














Parameters
Parameter Type Context, Meaning
p_attrs {attr} For PROJECT, the project list (set)
f_attrs {attr} For FUNC_OP, the list (set) of attributes used to compute the new attribute.
agg_ops {agg_op} For AGG_LIST, the set of aggregate operators each aggregate operator consisting of a list of attributes used in an aggregate function and an attribute range_var
attribute range_var STRING In agg_op or FUNC_OP, the name that will be used to reference the attribute created by the agg_op or FUNC_OP. This corresponds to the name after the "as" keyword in the SQL SELECT clause. This is the name of a column within a table.
gby {attr} For AGG_LIST, the group_by list
lattrs = rattrs 2 attr[]'s For EQJOIN, the equality conditions
collection STRING For GET, a name identifying a stored relation
table range_var STRING For GET, the name of the table. Corresponds to the as value, specified in the SQL from clause This is a table range variable. It will be used to form the name of an attribute in the table output by the GET, as well as tables formed by applying other logical operators to this table.
o_attrs attr[] For ORDER_BY, an ordered list of attributes
order_func (asc|desc)[] For ORDER_BY, an ordered list corresponding to the o_attrs list, which indicates, for each attribute in o_attrs, whether to order in ascending or descending order.


Here are the names used to describe the inputs and the output of a logical operator.




Inputs:
Input Type Context
input table or input stream of tuples when a single bulk input
left table or left stream of tuples when two bulk inputs, the outer
right table or right stream of tuples when two bulk inputs, the inner
predicate item operator the predicate

Output:
Output Type Context
output table or output stream of tuples Every logical operator

Here are the variables that we use to determine the logical properties of the output table of a logical operator.



Variables