CS301 W'99 Lecture Notes: Lecture 13
PSU CS301 W'99 Lecture 13 © Andrew Tolmach 1992-99

Values and Types

Values are the entities or objects manipulated by programs.

We divide the universe of values according to types. We characterize types by:
o a set of values.
o a set of operations defined on those values; and/or
o a set of valid contexts for those values. (In particular, values that can be anonymously constructed, used in expressions, passed to and from procedures, and assigned into variables are called first-class values.)
o How values are represented and operations are implemented.
o How literal values are described.

Examples:
Integers with the usual arithmetic operations.
Booleans with operators and, or, not, and valid as arguments to conditional operations.
Arrays with operations like fetch and store.
Sets with operations like membership testing, union, intersection, etc.

Hardware Types

Machine language doesn't distinguish types; all values are just bit patterns until used. As such they can be loaded, stored, moved, etc.

But certain operations are supported directly by hardware; the operands are thus implicitly typed.

Typical hardware types:
o Integers of various sizes, signedness, etc. with standard arithmetic operations.
o Booleans with boolean and conditional operations. (Usually just a special view of integers.)
o Floating point numbers of various sizes, with standard arithmetic operations.
o Characters with i/o operations.
o Pointers to values stored in memory.
o Instructions, i.e., code, which can be executed.
o Many others are possible, e.g., binary coded decimal.

Details of behavior (e.g., numeric range) are machine-dependent, though often subject to standards (e.g., IEEE floating point, ASCII characters).
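
The point that values are "just bit patterns until used" can be illustrated with a small C sketch. The function names are hypothetical, and the sketch assumes that float is a 32-bit IEEE single-precision value, which is common but machine-dependent.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit pattern as a floating point number.
   Assumption: float is 32 bits wide and uses IEEE single precision. */
float bits_as_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);   /* copy the bits unchanged */
    return f;
}

/* Read the same 32-bit pattern as a signed (two's complement) integer. */
int32_t bits_as_int(uint32_t bits) {
    int32_t i;
    memcpy(&i, &bits, sizeof i);   /* copy the bits unchanged */
    return i;
}
```

Under that assumption, the pattern 0x3F800000 is 1.0 when used as a float and 1065353216 when used as an integer; the bits are identical, and only the operation applied to them gives them a type.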

Primitive (Atomic, Basic) Values and Types

Primitive values cannot be further broken down by user-defined code; they can be managed only via operators built into the language.

Typical primitive types include integers, floats, characters, booleans, enumerations, etc. Usually closely allied to hardware types.

Example: booleans. Note that in most languages (except C/C++), this is a different type from integers, even though boolean values may be represented internally by integers.

Numeric types only approximate the behavior of true numbers. Also, they often inherit machine-dependent aspects of machine types, causing serious portability problems.
Example: Integer arithmetic in most languages.
Partial counter-example: Numerics in Lisp.

Composite Values

Composite values are constructed from more primitive values, which can usually later be selected back from the composite, and perhaps selectively updated.

Example: Records (C syntax)

    struct emp {
      char *name;
      int age;
    };
    struct emp e = {"Andrew",99};
    if (strcmp(e.name,"Fred")) ...;
    e.age = 88;

In statically typed languages, it is generally necessary to declare new composite types (e.g., struct emp) before defining composite values (e.g., e).

The type definition indicates how the type is constructed from more primitive types, using one of a few predefined type constructors (e.g., struct).

Static and Dynamic Typing

HLLs differ from machine language in that explicit types appear and type violations are ordinarily caught at some point.

Static typing is most common.
o Types are associated with identifiers (esp. variables, parameters, functions).
o Can be statically checked, if language and compiler allow.
o Compiler can optimize representations of values used at runtime.

Dynamic typing occurs in Lisp, Scheme, Smalltalk, VB, JavaScript, etc.
o Types are attached to values (usually implicitly).
o The type associated with identifiers can vary.
o Correctness of operations can't generally be checked until runtime.
o Optimized representation is hard.

Static typing offers the great advantage of catching errors early, and generally supports more efficient execution.

Flexibility of Dynamic Typing

Why ever settle for dynamic typing?
o Simplicity. For short or simple programs, it's nice to avoid the need for declaring the types of identifiers.
o Flexibility. Dynamic typing allows container types, like lists or arrays, to contain mixtures of values of arbitrary types.

Note: Some statically-typed languages (e.g., Standard ML) offer alternative ways to achieve these aims, via type inference and polymorphic typing.

Dynamic Typing Example

Consider a function that reads and returns an integer or string literal.

    function readliteral();
    begin
      read a string of nonblanks;
      if (string constitutes an integer literal) then
        return numeric-value-of-string;
      else
        return string;
    end;

    (* read dates: year month day *)
    y = read();
    m = readliteral();
    d = read();
    if (m >= 1) and (m <= 12) then
      (* do nothing *)
    else if m = "JAN" then m = 1
    else if m = "FEB" then m = 2
    else ...

Type Constructors

Programmers usually define composite types in order to implement data structures appropriate to an application and/or algorithm.

Abstractly, such data structures can be seen as mathematical operators on underlying sets of simpler values.

A small number of type operators suffices to describe most useful data structures:
o Cartesian product (S1 x S2)
o Disjoint union (S1 + S2)
o Mapping (by explicit enumeration or by formula) (S1 -> S2)
o Set (P(S), the set of subsets of S)
o Sequence (S*)
o Recursive structures (lists, trees, etc.)
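
The result of readliteral is itself a disjoint union: an integer or a string. In a statically typed language this can be written out explicitly. The following C sketch is only an illustration: the names literal, classify, and month_number are hypothetical, and classify works on a token that has already been read rather than doing the input itself.

```c
#include <stdlib.h>
#include <string.h>

/* Tag saying which alternative a literal currently holds. */
enum lit_tag { LIT_INT, LIT_STR };

/* A literal is an integer or a string: the disjoint union int + string. */
struct literal {
    enum lit_tag tag;
    union {
        long num;
        const char *str;
    } u;
};

/* Hypothetical stand-in for readliteral, applied to an already-read token.
   The token counts as an integer literal if strtol consumes all of it. */
struct literal classify(const char *token) {
    struct literal r;
    char *end;
    long n = strtol(token, &end, 10);
    if (end != token && *end == '\0') {
        r.tag = LIT_INT;
        r.u.num = n;
    } else {
        r.tag = LIT_STR;
        r.u.str = token;
    }
    return r;
}

/* Month as a number 1..12, or 0 if the token is not a recognizable month. */
int month_number(const char *token) {
    static const char *names[12] = {
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    };
    struct literal m = classify(token);
    if (m.tag == LIT_INT)
        return (m.u.num >= 1 && m.u.num <= 12) ? (int) m.u.num : 0;
    for (int i = 0; i < 12; i++)
        if (strcmp(m.u.str, names[i]) == 0)
            return i + 1;
    return 0;
}
```

The tag that a dynamically typed language attaches to every value implicitly has to be declared and tested by hand here, which is the extra work the dynamically typed version avoids.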

Concretely, each language defines the internal representation of values of the composite type, based on the type constructor and the types used in the construction.

Example: The fields of a record might occupy successive memory addresses (perhaps with some alignment restrictions). The total size of the record is (roughly) the sum of the field sizes.

Often a range of representations is possible, from highly packed to highly indirected. There's often a tradeoff between space and access time.

Representation of Data Structures

Historically, most languages provide direct representations only for a few data structures, usually those whose values can be represented efficiently on a conventional computer.

Often, they are restricted so that all values will be of fixed size.

For conventional languages, this is the short list:
o Records.
o Unions.
o Arrays.

Many languages also support manipulation of pointers to values of these types, in order to allow moving data "by reference" and to support recursive structures; more later.

Records = Cartesian Products

Records, tuples, "structures", etc. Nearly every language has them.

"Take a bunch of existing types and choose one value from each."

Examples (Ada syntax):

    type EMP is
      record
        NAME : STRING;
        AGE : INTEGER;
      end record;
    E: EMP := (NAME => "ANDREW", AGE => 99);

(ML syntax):

    type emp = string * int                  (* unlabeled fields *)
    val e : emp = ("ANDREW",99);
    type emp = {name: string, age: int}      (* labeled fields *)
    val e : emp = {name="ANDREW",age=99};

ML also permits record values to be written without first declaring an explicit named type.

Records (continued)

Standard operations: construction, selection, selective update.

Representation: Usually as described above. Because records may be large, they are often manipulated by reference, i.e., represented by a pointer.
The fields within a record may also be represented this way.

Allowed contexts: In many languages, treated like primitive values, e.g., can be assigned as a unit, passed to or returned by functions, etc. But since they may be large, some languages add restrictions.

Literals: Most languages allow a literal record to be specified by specifying each component, either by position or by name. (But C doesn't permit literals except as initializers.) Some languages require components to be initialized after creation.

Disjoint Unions

Variant records, discriminated records, unions, etc.

"Take a bunch of existing types and choose one value from one type."

Pascal Example:

    type RESULT = record
                    case found : Boolean of
                      true: (value:integer);
                      false: (error:STRING)
                  end;
    function search (...) : RESULT;
    ...

Generally behave like records, with the tag as an additional field.

Represented by the variant's representation, usually plus a tag (thus forming a record). Size typically equals the size of the largest variant plus the tag size.

Variant Insecurities

Pascal variant records are insecure because it is possible to manipulate the tag independently from the variant contents.

    tr.value := 101;
    write(tr.error);      (* reads the other variant; tag never consulted *)

    if (tr.found) then
    begin
      ...
      tr := tr1;          (* tag may have changed since the test *)
      x := tr.value
    end

These problems were fixed in Ada by requiring tag and variant contents to be set simultaneously, and inserting a runtime check on the tag before any read of the variant contents.
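
The discipline described for Ada (set tag and contents together, check the tag before every read) can be imitated by hand in a language without built-in checks. The following C sketch is a hypothetical illustration, not how Ada implements it; all names are invented here, and the check uses assert, which aborts the program on a violation.

```c
#include <assert.h>

/* A result is either a found value or an error message. */
struct result {
    int found;                  /* tag: nonzero means 'value' is valid */
    union {
        int value;
        const char *error;
    } u;
};

/* Constructors set the tag and the variant contents together. */
struct result make_found(int value) {
    struct result r;
    r.found = 1;
    r.u.value = value;
    return r;
}

struct result make_not_found(const char *error) {
    struct result r;
    r.found = 0;
    r.u.error = error;
    return r;
}

/* Accessors check the tag at run time before reading the contents. */
int result_value(const struct result *r) {
    assert(r->found);
    return r->u.value;
}

const char *result_error(const struct result *r) {
    assert(!r->found);
    return r->u.error;
}
```

This is only as safe as the programmer's willingness to go through the constructors and accessors; nothing stops other code from touching the fields directly, which is exactly the weakness a language-enforced check removes.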

Non-discriminated unions

C unions don't even have a tag mechanism: the programmer must provide the tag separately:

    union resunion {
      int value;
      char *error;
    };
    struct result {
      int found;   /* boolean tag */
      union resunion u;
    };
    struct result search (...);

The tag need not be tightly associated with the union:

    int search (union resunion *r,...);
    union resunion res;
    if (search(&res,...)) {
      ...res.value...
    } else {
      ...res.error...
    }

This might permit more efficient code in some cases.

Disjoint Unions Done Properly

ML has a very clean approach to building and inspecting disjoint unions:

    datatype result = FOUND of int | NOTFOUND of string

    fun search (...) : result =
      if ... then FOUND 10 else NOTFOUND "problem"

    val r = search (...)
    case r of
      FOUND x => print ("Found it : " ^ (Int.toString x))
    | NOTFOUND s => print ("Couldn't find it : " ^ s)

Here the FOUND and NOTFOUND tags are not ordinary fields. Case combines inspection of the tag and extraction of values into one operation.

Object-oriented languages like Java don't support disjoint unions directly, but subclasses provide a (somewhat awkward) way to achieve the same effect. (More later.)

Arrays and Mappings

Basic implementation idea: a table laid out in adjacent memory locations permitting indexed access to any element.

Mathematically: A finite mapping from an index set to a component set.

Index set is nearly always a set of integers 0..n, where n is small enough to allow space for the entire array, or some other small discrete set isomorphic to such a set.

Pascal Example:

    type day = (Sunday, Monday, ..., Saturday);
    var workday : array[day] of boolean;
    workday[Saturday] := false;

More general index sets are seldom supported directly by languages because of the lack of a single, uniform, good implementation.
Arrays with arbitrary index sets are sometimes called "associative arrays".

Awk Example:

    workday["Saturday"] = workday["Sunday"] = false;

How might this be implemented?

Array Size

Many languages require the index set (and hence size) of arrays to be specified as part of each array type declaration, e.g., in Fortran:

    DIMENSION Q(100)

Others permit the size to be chosen independently for each array value, when the array is first created
o as a local variable, e.g., in Ada:

    procedure Fred (Size : Integer) is
      Bill : array (0 .. Size) of Float;
    begin
      ...
    end Fred;

o or on the heap, e.g., in Java:

    int[] bill = new int[size];

Arrays are often large, and hence manipulated by reference.

The major security issue for arrays is bounds checking of index values. In general, it's not possible to check all bounds at compile time (though often possible in particular cases). Runtime checks are always possible, but may be costly. But they are a good idea!

Functions and Mappings

Mathematical mappings can also be represented by an algorithmic formula.

A function gives a "recipe" for computing a result value from an argument value.

A program function can describe an infinite mapping. But it differs from a mathematical function in that:
o it must be specified by an explicit algorithm
o executing the function may have side-effects on variables.

It can be very handy to manipulate functions as first-class values. But most languages put severe limitations on what can be done with functions.

How does one represent a function as a first-class value? In some languages, one can just use a code pointer. In others, the representation must include values of free variables, so it can get expensive. More on this later.

Sequences

What about data structures of essentially unbounded size, such as sequences (or lists)?

"Take an arbitrary number of values of some type."
Such data structures require special treatment: they are typically represented by small segments of data linked by pointers, and dynamic storage allocation (and deallocation) is required.

The basic operations on a sequence include
o concatenation (especially concatenating a single element onto the head or tail of an existing sequence); and
o extraction of elements (especially the head).

An important example is the (unbounded) string, a sequence of characters.

The best representation depends heavily on the nature and frequency of the various operations. It is hard to give a single, uniformly efficient implementation, so many older languages don't support sequences directly. But they are so useful that newer languages increasingly do (esp. strings).

Defining Sequences

Unless the programming language supports sequences directly, the programmer must define them using a recursive definition. For example, a list of integers is either
o empty, or
o has a head which is an integer and a tail which is itself a list of integers.

ML has particularly clean mechanisms for describing recursive types.

    datatype intlist = EMPTY | CELL of int * intlist

Internally, the non-empty case can be represented by a two-element heap-allocated record, containing an integer and a pointer to another list. (Obviously, the tail list itself cannot be embedded in the record, since its size is unknown.) The empty case is conveniently represented by a null pointer.
This corresponds directly to a C representation:

    typedef struct intlist *Intlist;
    struct intlist {
      int val;
      Intlist next;
    };

Processing Sequences

Note that an iterative or recursive loop is required to process the data in a sequence, e.g.,

    /* Iterative version */
    int inlist(Intlist list, int i) {
      while (list) {
        if (list->val == i)
          return 1;
        else
          list = list->next;
      }
      return 0;
    }

    /* Recursive version */
    int inlist(Intlist list, int i) {
      if (list) {
        if (list->val == i)
          return 1;
        else
          return inlist(list->next,i);
      } else
        return 0;
    }

Recursive Types

Recursion can be used to define and operate on more complex types, in which the type being defined appears more than once in the definition.

ML Example: binary trees with labels (only) at the leaves.

    datatype 'a tree = INTERNAL of {left:'a tree,right:'a tree}
                     | LEAF of {contents:'a}

Now we must use recursion (not iteration) to process the full tree, here one with integer labels:

    fun sum(tree: int tree) =
      case tree of
        INTERNAL{left,right} => sum(left) + sum(right)
      | LEAF{contents} => contents

Reference Semantics

Recursive structures naturally grow without fixed bound, so it is common practice to allocate them on the heap.

Many modern languages, such as Java and ML, implicitly allocate records (and disjoint unions) on the heap, and represent record values by references (pointers) into the heap.

As a natural result, both languages use shallow copy semantics for assignment and argument passing. Java Example:

    class emp {
      String name;
      int age;
    }
    emp e1 = new emp();
    e1.age = 91;
    emp e2 = e1;
    e1.age = 18;
    System.out.print(e2.age);   // prints 18

Neither language allows user programs to manipulate the internal pointers directly.
And neither supports explicit deallocation of records (or objects) either; both provide automatic garbage collection of unreachable heap values, thus avoiding both dangling pointer and memory leak bugs.

Explicit Pointers

Many earlier languages had pointer types to enable programmers to construct recursive data structures, e.g., in C:

    typedef struct intlist *Intlist;
    struct intlist {
      int head;
      Intlist tail;
    };
    Intlist mylist = (Intlist) malloc(sizeof(struct intlist));
    ...free(mylist)...

Note that programmers must make explicit malloc (C++: new) and free calls to manage heap values, and must explicitly manipulate pointers. Lots of opportunity for dangling pointer bugs and memory leaks!

In most such languages, pointers are restricted to addresses returned by allocation operations, but C/C++ allows the address of anything to be taken and later dereferenced, and supports pointer arithmetic. While this feature can support very efficient code, it also destroys the safety of the type system.

Type Equivalence

When do two identifiers have the "same" type, or "compatible" types?

I.e., if a has type t1, b has type t2 and f has type t2 -> t3, how must t1 and t2 be related for these to make sense?

    a := b
    f(a)

To maintain whatever security type-checking of primitive types gives us, we must insist at a minimum that t1 and t2 are structurally equivalent.

Structural equivalence is defined inductively:
o Primitive types are equivalent iff they are the same type.
o Cartesian product types are equivalent if their corresponding component types are equivalent. (Record field names are typically ignored.)
o Disjoint union types are equivalent if their corresponding component types are equivalent.
o Mapping types (arrays and functions) are equivalent if their domain and range types are equivalent.
  (Sometimes the cardinality of the index type of an array is ignored.)

Equivalence (continued)

Another way to say this: two types are equal if they have the same set of values.

Recursive types are a problem. Are these two types structurally equivalent?

    type t1 = record a:int, b: POINTER TO t1 end;
    type t2 = record a:int, b: POINTER TO t2 end;

Intuitively yes, but it's (a little) tricky for a type-checking algorithm to determine this!

Type Names

The question becomes more interesting because of type names. We name types for two possible reasons:

o As a convenient shorthand to avoid giving the full type each time. E.g.,

    fun f(x:int * bool * real) : int * bool * real = ...

    type t = int * bool * real
    fun f(x:t) : t = ...

o As a way of improving program correctness by subdividing values into types according to their meaning within the program.

    type polar = record r:real, a:real end;
    type rect = record x:real, y:real end;
    function polar_add(x:polar,y:polar) : polar ...
    function rect_add(x:rect,y:rect) : rect ...
    var a:polar; c:rect;
    a := (150.0,30.0)    (* ok *)
    polar_add(a,a)       (* ok *)
    c := a               (* type error *)
    rect_add(a,c)        (* type error *)

For this to be useful, some structurally equivalent types must be treated as inequivalent.

Name Equivalence

Basic idea: Two types are equivalent iff they have the same name. Supports the polar/rect distinction. But pure name equivalence is very restrictive, e.g.:

    type ftemp = real
    type ctemp = real
    var x:ftemp, y:ftemp, z: ctemp;
    x := y;                 (* ok *)
    x := 10.0;              (* probably ok *)
    x := z;                 (* type error *)
    x := 1.8 * z + 32.0;    (* probably type error *)

Different types now seem too distinct; can't even convert from one form of real to another.

Also: what about unnamed type expressions?

    type t = int * int
    procedure f(x: int * int) = ...
    procedure g(x: t) = ...
    var a:t = (3,4)
    g(a);    (* ok *)
    f(a);    (* ok or not ?? *)

Most languages use mixed solutions.

C Type Equivalence

C uses structural equivalence for array and function types, but name equivalence for struct, union, and enum types. For example:

    char a[100];
    void f(char b[]);
    f(a);      /* ok */

    struct polar {float x; float y;};
    struct rect {float x; float y;};
    struct polar a;
    struct rect b;
    a = b;     /* type error */

A type defined by a typedef declaration is actually just an abbreviation for an existing type.

Note this policy makes it easy to check equivalence of recursive types, which can only be built using structs.

    struct fred {int x; struct fred *y;} a;
    struct bill {int x; struct fred *y;} b;
    a = b;     /* type error */

Pascal Type Equivalence

The original Pascal definition is vague. The standard implementation method is a variant of name equivalence, with the following wrinkles:

o Each type declaration defines a new type, unless the right hand side is a simple type name. E.g., T and U are different types, but T and V are the same type.

    type T = RECORD a:INTEGER; b: REAL END;
    type U = RECORD a:INTEGER; b: REAL END;
    type V = T    (* just an abbreviation for T *)

o Each anonymous type expression defines a new type. E.g., the types of x, y, z are all different, but z and w have the same type.

    type T = RECORD a:INTEGER; b: REAL END;
    var x: T;
    var y: RECORD a:INTEGER; b: REAL END;
    var z,w: RECORD a:INTEGER; b: REAL END;

Another way to describe the type system is that each application of a type constructor (i.e., RECORD, ARRAY, POINTER, etc.) creates a new type. In this case, we must not think of built-in types like INTEGER and REAL as constructors.

ML Type Equivalence

ML uses structural equivalence, except that each datatype declaration creates a new type unlike all others.

    datatype polar = POLAR of real * real
    datatype rect = RECT of real * real
    val a = POLAR(1.0,2.0) and b = RECT(1.0,2.0)
    if (a = b) ...    (* type error *)

Note that the mandatory use of constructors makes it possible to uniquely identify the types of literals.

Note that a datatype need not declare a record:

    datatype fahrenheit = F of real
    datatype celsius = C of real
    val a = F 150.0
    val b = C 150.0
    if (a = b) ...    (* type error *)
    fun convert(C x) = F(1.8 * x + 32.0)    (* ok *)

For type abbreviation, ML offers the type declaration, which simply gives a new name for an existing type.

    type centigrade = celsius
    fun g(x:centigrade) = if x = b ...    (* ok *)
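
Because C uses name equivalence for struct types, the same trick of wrapping a single real number in a distinct type can be sketched in C. This is only an illustration; the struct names and the function to_fahrenheit are hypothetical.

```c
/* Two structurally identical struct types. Under C's name equivalence
   for structs they are different, incompatible types. */
struct fahrenheit { double degrees; };
struct celsius    { double degrees; };

/* Conversion has to be written explicitly: unwrap one type,
   compute, and wrap the result in the other type. */
struct fahrenheit to_fahrenheit(struct celsius c) {
    struct fahrenheit f = { 1.8 * c.degrees + 32.0 };
    return f;
}
```

Passing a struct fahrenheit where a struct celsius is expected, or assigning one to the other, is rejected at compile time, much as the ML comparison of an F value with a C value is. A typedef would not give this protection, since it only abbreviates an existing type.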