CS301 W'99 Lecture Notes
Lecture 4

Lexical Analysis

Convert source file characters into token stream.

Remove content-free characters (comments, whitespace, ...)

Detect lexical errors (badly-formed literals, illegal characters, ...)

Output of lexical analysis is input to syntax analysis.

Could just do lexical analysis as part of syntax analysis.

But choose to handle separately for better modularity, portability.

Idea: Look for patterns in input character sequence, convert to tokens with attributes, and pass them to parser in stream.

Lexical Analysis Example

Pattern	Token	Attribute
`if`	IF
`else`	ELSE
`print`	PRINT
`then`	THEN
`:=`	ASSIGN
`=` or `<` or `>`	RELOP	enum
letter followed by letters or digits	ID	symbol
digits	NUM	int
chars between double quotes	STRING	string

Source code:
`if x>17 then count:= 2 else (* oops !*) print "bad!"`

Lexeme	Token	Attribute
`if`	IF
`x`	ID	`"x"`
`>`	RELOP	`GT`
`17`	NUM	`17`
`then`	THEN
`count`	ID	`"count"`
`:=`	ASSIGN
`2`	NUM	`2`
`else`	ELSE
`print`	PRINT
`"bad!"`	STRING	`"bad!"`
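
The example above can be reproduced with a small pattern-driven tokenizer. What follows is a minimal sketch in Python (used here for brevity; it is not the notation of these notes). The names `TOKEN_SPEC`, `MASTER`, `RELOPS` and `tokenize` are illustrative assumptions, and the attribute names `EQ` and `LT` are assumed by analogy with the `GT` shown in the table.

```python
import re

# (token name, regular expression) pairs. Order matters: keywords must be
# tried before the more general ID pattern.
TOKEN_SPEC = [
    ("COMMENT", r"\(\*.*?\*\)"),            # (* ... *) comments are discarded
    ("WS",      r"\s+"),                    # whitespace is discarded
    ("IF",      r"if\b"),
    ("ELSE",    r"else\b"),
    ("PRINT",   r"print\b"),
    ("THEN",    r"then\b"),
    ("ASSIGN",  r":="),
    ("RELOP",   r"[=<>]"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),
    ("NUM",     r"[0-9]+"),
    ("STRING",  r'"[^"]*"'),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC), re.DOTALL
)
RELOPS = {"=": "EQ", "<": "LT", ">": "GT"}  # EQ and LT are assumed names


def tokenize(text):
    """Yield (token, attribute) pairs; attribute is None when not needed."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:  # lexical error: nothing matches at this position
            raise SyntaxError(f"illegal character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind in ("COMMENT", "WS"):
            continue                        # content-free input just disappears
        if kind == "RELOP":
            attr = RELOPS[lexeme]
        elif kind == "ID":
            attr = lexeme
        elif kind == "NUM":
            attr = int(lexeme)
        elif kind == "STRING":
            attr = lexeme[1:-1]             # strip the surrounding quotes
        else:
            attr = None                     # IF, ELSE, ... carry no attribute
        yield kind, attr


source = 'if x>17 then count:= 2 else (* oops !*) print "bad!"'
for token, attr in tokenize(source):
    print(token, attr)
```

Run on the source line above, this yields the same eleven token/attribute pairs as the table, with the comment and the whitespace dropped.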

More Details

A token describes a class of character strings with some distinguished meaning in the language.

$\bullet$ May describe unique string (e.g., IF, ASSIGN)

$\bullet$ or set of possible strings, in which case an attribute is needed to indicate which.

(Tokens are represented as elements of an enumeration.)

A lexeme is the string in the input that actually matched the pattern for some token.

Attributes represent lexemes converted to a more useful form, e.g.:

$\bullet$ strings
$\bullet$ symbols (like strings, but perhaps handled separately)
$\bullet$ numbers (integers, reals, ...)
$\bullet$ enumerations

Whitespace (spaces, tabs, new lines, ...) and comments just disappear!
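
One possible representation of tokens and attributes, sketched in Python. The type names `Tok`, `RelOp` and `Token` and the helper `make_token` are hypothetical, introduced only for illustration. Interning identifiers with `sys.intern` is one assumed way of treating symbols separately from ordinary strings.

```python
import sys
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class Tok(Enum):
    """Tokens as elements of an enumeration."""
    IF = auto()
    ELSE = auto()
    PRINT = auto()
    THEN = auto()
    ASSIGN = auto()
    RELOP = auto()
    ID = auto()
    NUM = auto()
    STRING = auto()


class RelOp(Enum):
    """Enumeration attribute for RELOP (EQ and LT are assumed names)."""
    EQ = auto()
    LT = auto()
    GT = auto()


@dataclass(frozen=True)
class Token:
    kind: Tok
    attr: Union[None, str, int, RelOp] = None  # None for tokens like IF


def make_token(kind: Tok, lexeme: str) -> Token:
    """Convert a matched lexeme into a token with a more useful attribute."""
    if kind is Tok.NUM:
        return Token(kind, int(lexeme))          # number
    if kind is Tok.STRING:
        return Token(kind, lexeme[1:-1])         # string without its quotes
    if kind is Tok.ID:
        return Token(kind, sys.intern(lexeme))   # symbol: one shared copy
    if kind is Tok.RELOP:
        return Token(kind, {"=": RelOp.EQ, "<": RelOp.LT, ">": RelOp.GT}[lexeme])
    return Token(kind)                           # unique string, no attribute
```

For example, `make_token(Tok.NUM, "17")` carries the integer 17 rather than the two-character lexeme, and `make_token(Tok.IF, "if")` carries no attribute at all.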

Stream Interface

Could convert entire input file to list of tokens/attributes.

But parser needs only one token at a time, so use stream instead:

[Figure: stream interface (l4streams.eps)]
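
The difference between a token list and a token stream can be sketched with a Python generator. Everything here is an assumption made for illustration: the stand-in scanner `scan` simply splits on whitespace, `produced` is a log showing when scanning actually happens, and `TokenStream` with its `peek`/`get` methods is a hypothetical parser-facing interface.

```python
import re

produced = []  # log of lexemes the scanner has actually produced so far


def scan(text):
    """Stand-in scanner: lazily yields whitespace-separated lexemes."""
    for m in re.finditer(r"\S+", text):
        produced.append(m.group())
        yield m.group()


class TokenStream:
    """Hands the parser one token at a time, with one token of lookahead."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self._lookahead = None

    def peek(self):
        """Return the next token without consuming it."""
        if self._lookahead is None:
            self._lookahead = next(self._tokens, "EOF")
        return self._lookahead

    def get(self):
        """Return the next token and advance the stream."""
        tok = self.peek()
        self._lookahead = None
        return tok


stream = TokenStream(scan("if x then y"))
print(produced)        # nothing has been scanned yet
print(stream.get())    # scanning happens only on demand
print(produced)
```

The scanner runs only as far as the parser has asked, so the full token list never has to be held in memory.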

Hand-coded Scanner (in Pseudo-C)

get_token() { while (true) { c = getchar(); if (c is whitespace... ...ot an alphanumeric); ungetchar(); return (ID,S); } else ... } }
Efficient! Easy to get wrong!

Note intermixed code for input, output, patterns, conversion.
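
A runnable sketch in the same spirit, written in Python. It does not reconstruct the elided parts of the pseudo-code above. It handles only whitespace, identifiers and numbers, and the class name `Scanner` and the `EOF` token are assumptions. Note how reading input, matching patterns and converting attributes are all interleaved in one function.

```python
class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def getchar(self):
        """Return the next character, or '' at end of input."""
        c = self.text[self.pos:self.pos + 1]
        self.pos += 1
        return c

    def ungetchar(self):
        """Push the most recently read character back."""
        self.pos -= 1

    def get_token(self):
        while True:
            c = self.getchar()
            if c == "":
                return ("EOF", None)
            if c.isspace():
                continue                      # skip whitespace
            if c.isascii() and c.isalpha():   # letter: start of an identifier
                s = c
                c = self.getchar()
                while c.isascii() and c.isalnum():
                    s += c
                    c = self.getchar()
                self.ungetchar()              # read one character too far
                return ("ID", s)
            if c.isascii() and c.isdigit():   # digit: start of a number
                n = 0
                while c.isascii() and c.isdigit():
                    n = n * 10 + int(c)       # convert while scanning
                    c = self.getchar()
                self.ungetchar()
                return ("NUM", n)
            raise SyntaxError(f"illegal character {c!r}")


scanner = Scanner("count 17 x9")
print(scanner.get_token())
print(scanner.get_token())
print(scanner.get_token())
```

Forgetting either `ungetchar()` call would silently swallow the character that follows each identifier or number, which is the kind of mistake that makes hand-coded scanners easy to get wrong.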

Hard to specify! (esp. patterns).

Example

How can we formalize this pattern description?

"An identifier is a letter followed by any number of letters or digits."

$\bullet$ Exactly what is a letter?
$\mathrm{LETTER} \rightarrow \mathtt{a} \mid \ldots \mid \mathtt{X} \mid \mathtt{Y} \mid \mathtt{Z}$
$\bullet$ Exactly what is a digit?
$\mathrm{DIGIT} \rightarrow \mathtt{0} \mid \ldots \mid \mathtt{7} \mid \mathtt{8} \mid \mathtt{9}$
$\bullet$ How can we express "letters or digits"?
$\mathrm{LORD} \rightarrow \mathrm{LETTER} \mid \mathrm{DIGIT}$
$\bullet$ How can we express "any number of"?
$\mathrm{LORDS} \rightarrow \mathrm{LORD}^*$
$\bullet$ How can we express "followed by"?
$\mathrm{IDENT} \rightarrow \mathrm{LETTER}\ \mathrm{LORDS}$
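
The five rules above translate almost directly into executable checks. This is a sketch in Python. It assumes that LETTER means the ASCII letters a to z and A to Z, since the rule as printed is abbreviated. The function names `lords` and `ident` are illustrative.

```python
import string

LETTER = set(string.ascii_letters)   # assumed: a..z and A..Z
DIGIT = set(string.digits)           # 0..9
LORD = LETTER | DIGIT                # LORD  -> LETTER | DIGIT  (alternation)


def lords(s):
    """LORDS -> LORD*  (zero or more repetitions)."""
    return all(c in LORD for c in s)


def ident(s):
    """IDENT -> LETTER LORDS  (one thing followed by another)."""
    return s[:1] in LETTER and lords(s[1:])


for candidate in ["count", "x9", "9x", ""]:
    print(repr(candidate), ident(candidate))
```

Each rule uses exactly one of three building blocks: choice between alternatives, repetition, and sequencing.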
Regular Expressions

A regular expression (R.E.) is a concise formal characterization of a regular language (or regular set).

Example: The regular language containing all IDENTs is described by the regular expression
$\mathit{letter}\ (\mathit{letter} \mid \mathit{digit})^*$

where `` $\mid$
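
The same regular expression can be tried out with Python's `re` module. This is only a sketch: the character classes `[A-Za-z]` and `[0-9]` are assumed spellings of *letter* and *digit*, and the name `IDENT_RE` is illustrative.

```python
import re

# letter (letter | digit)*
IDENT_RE = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")

# fullmatch asks whether the whole string belongs to the regular language.
print(bool(IDENT_RE.fullmatch("count")))
print(bool(IDENT_RE.fullmatch("9lives")))

# match finds the longest identifier at the start of the input, which is
# what a scanner needs when the identifier is followed by other tokens.
print(IDENT_RE.match("count:= 2").group())
```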

Andrew P. Tolmach
1999-01-13
