CS301 W'99 Lecture Notes
Lecture 4
Lexical Analysis

Convert source file characters into token stream.

Remove content-free characters (comments, whitespace, ...)

Detect lexical errors (badly-formed literals, illegal characters, ...)

Output of lexical analysis is input to syntax analysis.

Could just do lexical analysis as part of syntax analysis.

But choose to handle separately for better modularity, portability.

Idea: Look for patterns in input character sequence, convert to tokens with attributes, and pass them to parser in stream.

Lexical Analysis Example

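\begin{tabular}{lll}
Pattern & Token & Attribute \\
\hline
if & IF & \\
else & ELSE & \\
print & PRINT & \\
then & THEN & \\
:= & ASSIGN & \\
= or < or > & RELOP & enum \\
letter followed by letters or digits & ID & symbol \\
digits & NUM & int \\
chars between double quotes & STRING & string \\
\end{tabular}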

Source code:
\begin{code}if x>17 then count:= 2
else (* oops !*) print "bad!"\end{code}
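\begin{tabular}{lll}
Lexeme & Token & Attribute \\
\hline
if & IF & \\
x & ID & "x" \\
> & RELOP & GT \\
17 & NUM & 17 \\
then & THEN & \\
count & ID & "count" \\
:= & ASSIGN & \\
2 & NUM & 2 \\
else & ELSE & \\
print & PRINT & \\
"bad!" & STRING & "bad!" \\
\end{tabular}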

More Details

A token describes a class of character strings with some distinguished meaning in language.

$\bullet$ May describe unique string (e.g., IF, ASSIGN)

$\bullet$ or set of possible strings, in which case an attribute is needed to indicate which.

(Tokens are represented as elements of an enumeration.)

A lexeme is the string in the input that actually matched the pattern for some token.

Attributes represent lexemes converted to a more useful form, e.g.:

$\bullet$ strings
$\bullet$ symbols (like strings, but perhaps handled separately)
$\bullet$ numbers (integers, reals, ...)
$\bullet$ enumerations
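
One possible C representation of tokens and attributes is sketched here. It is an illustration only: the names `TokenKind`, `RelOp`, `Token`, and the `make_*` helpers are assumptions, not part of the course code, and symbols are stored as plain fixed-size strings rather than in a separate symbol table.

```c
#include <string.h>

/* Token classes from the example table; the C names are illustrative. */
typedef enum { TOK_IF, TOK_ELSE, TOK_PRINT, TOK_THEN, TOK_ASSIGN,
               TOK_RELOP, TOK_ID, TOK_NUM, TOK_STRING, TOK_EOF } TokenKind;

/* The attribute of a RELOP token is itself an enumeration. */
typedef enum { REL_EQ, REL_LT, REL_GT } RelOp;

/* A token paired with its attribute. Only the union member that matches
   `kind` is meaningful; IF, ELSE, PRINT, THEN, ASSIGN carry no attribute. */
typedef struct {
    TokenKind kind;
    union {
        RelOp relop;    /* TOK_RELOP */
        int num;        /* TOK_NUM */
        char text[32];  /* TOK_ID, TOK_STRING (fixed size for brevity) */
    } attr;
} Token;

Token make_num(int value) {
    Token t;
    memset(&t, 0, sizeof t);
    t.kind = TOK_NUM;
    t.attr.num = value;
    return t;
}

Token make_relop(RelOp op) {
    Token t;
    memset(&t, 0, sizeof t);
    t.kind = TOK_RELOP;
    t.attr.relop = op;
    return t;
}

Token make_id(const char *name) {
    Token t;
    memset(&t, 0, sizeof t);
    t.kind = TOK_ID;
    /* Copy at most 31 characters; the memset above keeps the result
       NUL-terminated even when the name is truncated. */
    strncpy(t.attr.text, name, sizeof t.attr.text - 1);
    return t;
}
```

With these helpers, the lexeme `17` would become `make_num(17)` and the lexeme `>` would become `make_relop(REL_GT)`.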

Whitespace (spaces, tabs, newlines, ...) and comments just disappear!

Stream Interface

Could convert entire input file to list of tokens/attributes.

But parser needs only one token at a time, so use stream instead:

\psfig{figure=l4streams.eps,height=5.5in,width=7.5in}
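
A minimal sketch of such a stream interface in C follows. All names here (`TokenSource`, `TokenStream`, `stream_peek`, `stream_next`, `ArraySource`) are hypothetical, and the one-token lookahead is an added assumption about what a parser typically wants, not something these notes prescribe. The point is that tokens are produced only when the consumer asks for them.

```c
#include <stddef.h>

/* Hypothetical token kinds, just enough for the demonstration. */
typedef enum { TOK_ID, TOK_NUM, TOK_EOF } TokenKind;

/* A token source is any function that yields the next token on demand. */
typedef TokenKind (*TokenSource)(void *state);

/* A stream wraps a source and buffers at most one token of lookahead. */
typedef struct {
    TokenSource source;
    void *state;
    int has_peeked;
    TokenKind peeked;
} TokenStream;

void stream_init(TokenStream *s, TokenSource source, void *state) {
    s->source = source;
    s->state = state;
    s->has_peeked = 0;
    s->peeked = TOK_EOF;
}

/* Look at the next token without consuming it. */
TokenKind stream_peek(TokenStream *s) {
    if (!s->has_peeked) {
        s->peeked = s->source(s->state);
        s->has_peeked = 1;
    }
    return s->peeked;
}

/* Consume and return the next token. */
TokenKind stream_next(TokenStream *s) {
    if (s->has_peeked) {
        s->has_peeked = 0;
        return s->peeked;
    }
    return s->source(s->state);
}

/* Demo source: replays a fixed array, then yields TOK_EOF forever. */
typedef struct {
    const TokenKind *items;
    size_t count;
    size_t pos;
} ArraySource;

TokenKind array_source_next(void *state) {
    ArraySource *a = (ArraySource *)state;
    return a->pos < a->count ? a->items[a->pos++] : TOK_EOF;
}
```

A real scanner would take the place of `array_source_next`, reading characters from the source file each time the parser calls `stream_next`.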

Hand-coded Scanner (in Pseudo-C)


\begin{code}get_token() \{
while (true) \{
c = getchar();
if (c is whitespace...
...ot an alphanumeric);
ungetchar();
return (ID,S);
\} else ...
\}
\}\end{code}
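
For comparison, here is a runnable C sketch of a hand-coded scanner for the token set in the earlier example. It is not the scanner from the pseudo-code above. It reads from a string instead of standard input, and the names `Scanner`, `Token`, `copy_lexeme`, `keyword_or_id`, and `TOK_ERROR` are all assumptions made for illustration. It also assumes that comments do not nest and that numbers fit in an `int`.

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_IF, TOK_ELSE, TOK_PRINT, TOK_THEN, TOK_ASSIGN, TOK_RELOP,
               TOK_ID, TOK_NUM, TOK_STRING, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    char text[64];  /* ID name, STRING contents, or RELOP character */
    int num;        /* NUM value */
} Token;

typedef struct { const char *p; } Scanner;  /* cursor into NUL-terminated input */

/* Copy a lexeme into the token, truncating if it does not fit. */
static void copy_lexeme(Token *t, const char *start, size_t len) {
    if (len >= sizeof t->text) len = sizeof t->text - 1;
    memcpy(t->text, start, len);
    t->text[len] = '\0';
}

/* Map a scanned word to its keyword token, or TOK_ID if it is no keyword. */
static TokenKind keyword_or_id(const char *word) {
    if (strcmp(word, "if") == 0) return TOK_IF;
    if (strcmp(word, "else") == 0) return TOK_ELSE;
    if (strcmp(word, "print") == 0) return TOK_PRINT;
    if (strcmp(word, "then") == 0) return TOK_THEN;
    return TOK_ID;
}

Token get_token(Scanner *s) {
    Token t = { TOK_EOF, "", 0 };
    /* Skip whitespace and (* ... *) comments. */
    for (;;) {
        if (isspace((unsigned char)*s->p)) {
            s->p++;
        } else if (s->p[0] == '(' && s->p[1] == '*') {
            const char *end = strstr(s->p + 2, "*)");
            if (end == NULL) {           /* unterminated comment */
                s->p += strlen(s->p);
                t.kind = TOK_ERROR;
                return t;
            }
            s->p = end + 2;
        } else {
            break;
        }
    }
    const char *start = s->p;
    unsigned char c = (unsigned char)*start;
    if (c == '\0') {
        /* End of input: t is already TOK_EOF. */
    } else if (isalpha(c)) {             /* letter followed by letters or digits */
        while (isalnum((unsigned char)*s->p)) s->p++;
        copy_lexeme(&t, start, (size_t)(s->p - start));
        t.kind = keyword_or_id(t.text);
    } else if (isdigit(c)) {             /* digits (no overflow check) */
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)*s->p))
            t.num = t.num * 10 + (*s->p++ - '0');
    } else if (c == ':' && start[1] == '=') {
        t.kind = TOK_ASSIGN;
        s->p += 2;
    } else if (c == '=' || c == '<' || c == '>') {
        t.kind = TOK_RELOP;
        copy_lexeme(&t, start, 1);
        s->p++;
    } else if (c == '"') {               /* chars between double quotes */
        const char *end = strchr(start + 1, '"');
        if (end == NULL) {               /* badly-formed string literal */
            t.kind = TOK_ERROR;
            s->p += strlen(s->p);
        } else {
            t.kind = TOK_STRING;
            copy_lexeme(&t, start + 1, (size_t)(end - start - 1));
            s->p = end + 1;
        }
    } else {                             /* illegal character */
        t.kind = TOK_ERROR;
        s->p++;
    }
    return t;
}
```

Calling `get_token` repeatedly on the example source text yields the tokens of the earlier lexeme table in order, followed by `TOK_EOF`.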
Efficient! Easy to get wrong!

Note intermixed code for input, output, patterns, conversion.

Hard to specify! (esp. patterns).

Example

How can we formalize this pattern description?

``An identifier is a letter followed by any number of letters or digits.''

$\bullet$ Exactly what is a letter?
\begin{code}\cdmath
{\rm LETTER} \mbox{$\rightarrow$ }{\tt {a}} \mbox{$\mid$ }{\...
...mbox{$\mid$ }{\tt {X}} \mbox{$\mid$ }{\tt {Y}} \mbox{$\mid$ }{\tt {Z}}\end{code}
$\bullet$ Exactly what is a digit?
\begin{code}\cdmath
{\rm DIGIT} \mbox{$\rightarrow$ }{\tt {0}} \mbox{$\mid$ }{\t...
...mbox{$\mid$ }{\tt {7}} \mbox{$\mid$ }{\tt {8}} \mbox{$\mid$ }{\tt {9}}\end{code}
$\bullet$ How can we express ``letters or digits'' ?
\begin{code}{\rm LORD} \mbox{$\rightarrow$ }LETTER \mbox{$\mid$ }DIGIT\end{code}
$\bullet$ How can we express ``any number of'' ?
\begin{code}\cdmath
{\rm LORDS} \mbox{$\rightarrow$ }LORD$^*$\end{code}
$\bullet$ How can we express ``followed by'' ?
\begin{code}IDENT \mbox{$\rightarrow$ }LETTER LORDS\end{code}
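
The five definitions above translate almost one-for-one into C. This is a sketch under two assumptions: that LETTER means the ASCII letters a-z and A-Z, and that the character set stores each of those ranges contiguously (true of ASCII). The function names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* LETTER: assumed to be the ASCII letters a..z and A..Z. */
static bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* DIGIT -> 0 | ... | 9 */
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* LORD -> LETTER | DIGIT */
static bool is_lord(char c) {
    return is_letter(c) || is_digit(c);
}

/* LORDS -> LORD*: length of the longest prefix of s made only of LORDs.
   Zero is a valid answer, since "any number of" includes none. */
static size_t match_lords(const char *s) {
    size_t n = 0;
    while (is_lord(s[n])) n++;
    return n;
}

/* IDENT -> LETTER LORDS: length of the longest identifier at the start
   of s, or 0 if s does not start with an identifier. */
size_t match_ident(const char *s) {
    if (!is_letter(s[0])) return 0;
    return 1 + match_lords(s + 1);
}

/* True if the whole string s is exactly one identifier. */
bool is_ident(const char *s) {
    size_t n = match_ident(s);
    return n > 0 && s[n] == '\0';
}
```

Each function mirrors one definition: alternation becomes `||`, "any number of" becomes a loop, and "followed by" becomes matching one piece and then continuing from where it ended.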
Regular Expressions

A regular expression (R.E.) is a concise formal characterization of a regular language (or regular set).

Example: The regular language containing all IDENTs is described by the regular expression
\begin{code}\cdmath
letter (letter \mbox{$\mid$ }digit)$^*$\end{code}

where ``$\mid$


Andrew P. Tolmach
1999-01-13