Convert source file characters into token stream.
Remove content-free characters (comments, whitespace, ...)
Detect lexical errors (badly-formed literals, illegal characters, ...)
Output of lexical analysis is input to syntax analysis.
Could just do lexical analysis as part of syntax analysis.
But handled separately for better modularity and portability.
Idea: Look for patterns in input character sequence, convert to tokens with attributes, and pass them to parser in stream.
Lexical Analysis Example
| Pattern                               | Token  | Attribute |
|---------------------------------------|--------|-----------|
| = or < or >                           | RELOP  | enum      |
| letter followed by letters or digits  | ID     | symbol    |
| chars between double quotes           | STRING | string    |
A token describes a class of character strings with some distinguished meaning in the language.
It may describe a unique string (e.g., IF, ASSIGN)
or a set of possible strings, in which case an attribute is needed to indicate which one matched.
(Tokens are represented as elements of an enumeration.)
A lexeme is the string in the input that actually matched the pattern for some token.
Attributes represent lexemes converted to a more useful form, e.g.:
symbols (like strings, but perhaps handled separately)
numbers (integers, reals, ...)
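One common C representation, sketched here as an assumption (names like `TokenKind` and `make_id` are illustrative, not from the original): the token class is an enumeration value, and the attribute is a tagged-union payload alongside it.

```c
#include <string.h>

/* Illustrative token kinds; a real compiler would list many more. */
typedef enum { TOK_IF, TOK_ASSIGN, TOK_RELOP, TOK_ID, TOK_NUM, TOK_EOF } TokenKind;

/* RELOP needs an attribute saying which operator matched. */
typedef enum { REL_LT, REL_EQ, REL_GT } RelOp;

typedef struct {
    TokenKind kind;
    union {               /* attribute: the lexeme converted to a useful form */
        RelOp relop;      /* for TOK_RELOP */
        char  name[32];   /* for TOK_ID: the symbol (here, just the string) */
        long  value;      /* for TOK_NUM */
    } attr;
} Token;

/* Build an ID token from its lexeme. */
static Token make_id(const char *lexeme) {
    Token t = { .kind = TOK_ID };
    strncpy(t.attr.name, lexeme, sizeof t.attr.name - 1);
    t.attr.name[sizeof t.attr.name - 1] = '\0';
    return t;
}
```

Unique-string tokens like IF and ASSIGN simply leave the union unused.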
Whitespace (spaces, tabs, new lines, ...) and comments just disappear!
Could convert entire input file to list of tokens/attributes.
But parser needs only one token at a time, so use stream instead:
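A minimal sketch of that stream interface, assuming a function the parser calls on demand (the names `Lexer` and `next_token` are illustrative):

```c
#include <ctype.h>

typedef enum { TOK_ID, TOK_NUM, TOK_ERROR, TOK_EOF } TokenKind;

typedef struct {
    const char *src;   /* remaining input characters */
} Lexer;

/* Return the kind of the next token, advancing past its lexeme.
   The parser calls this once per token, so the whole file never
   needs to be converted to a token list up front. */
TokenKind next_token(Lexer *lx) {
    while (isspace((unsigned char)*lx->src))   /* whitespace just disappears */
        lx->src++;
    if (*lx->src == '\0')
        return TOK_EOF;
    if (isalpha((unsigned char)*lx->src)) {    /* letter, then letters/digits */
        while (isalnum((unsigned char)*lx->src))
            lx->src++;
        return TOK_ID;
    }
    if (isdigit((unsigned char)*lx->src)) {
        while (isdigit((unsigned char)*lx->src))
            lx->src++;
        return TOK_NUM;
    }
    lx->src++;                                 /* illegal character: lexical error */
    return TOK_ERROR;
}
```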
Hand-coded Scanner (in Pseudo-C)
Efficient! Easy to get wrong!
Note the intermixed code for input, output, patterns, and conversion.
Hard to specify (especially the patterns)!
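As an illustrative sketch (not the original example), here is the flavor of hand-coding the RELOP patterns < <= <> = >= >. Pattern logic, one-character lookahead, and conversion to the enum attribute are all tangled together:

```c
/* Relational-operator attribute values (assumed names). */
typedef enum { REL_LT, REL_LE, REL_NE, REL_EQ, REL_GE, REL_GT, REL_NONE } RelOp;

/* Scan a relational operator at the start of s.
   Sets *len to the number of characters consumed (0 if no match).
   Note how the pattern ("< may be followed by = or >"), the input
   handling (lookahead at s[1]), and the conversion to an attribute
   are intermixed -- efficient, but easy to get wrong. */
RelOp scan_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return REL_LE; }
        if (s[1] == '>') { *len = 2; return REL_NE; }
        *len = 1; return REL_LT;   /* lookahead char left for next token */
    case '=':
        *len = 1; return REL_EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return REL_GE; }
        *len = 1; return REL_GT;
    default:
        *len = 0; return REL_NONE; /* not a relational operator */
    }
}
```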
How can we formalize this pattern description?
``An identifier is a letter followed by any number of letters or digits.''
Exactly what is a letter?
Exactly what is a digit?
How can we express ``letters or digits'' ?
How can we express ``any number of'' ?
How can we express ``followed by'' ?
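These questions have concrete answers in any regular-expression notation. As an aside (POSIX notation, not the formal R.E. notation introduced next), a character class pins down "letter" and "digit", `*` expresses "any number of", and juxtaposition expresses "followed by":

```c
#include <regex.h>
#include <stddef.h>

/* "A letter followed by any number of letters or digits",
   in POSIX extended regex notation:
     [A-Za-z]      -- exactly what a "letter" is
     [A-Za-z0-9]   -- "letters or digits"
     *             -- "any number of"
     juxtaposition -- "followed by"
   The anchors ^...$ force the whole string to match. */
int is_identifier(const char *s) {
    regex_t re;
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```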
A regular expression (R.E.) is a concise formal characterization of a regular language (or regular set).
Example: The regular language containing all IDENTs is described by the regular expression