Department of
Computer Science

 

CS 420/520: Object-Oriented Programming

 

Assignment 2: Implement a Lexical Analyzer as a FSA




Set: Monday 18th April 2016

Code due: Wednesday 27th April 2016 at or before 13:00

 

Summary

Implement a lexical analyzer that takes as input a text string and produces as output a list of tokens.
The input represents programs in a simple language of identifiers, assignments and method requests.

The goals of the assignments are
  1. Use a FSA for lexing, with objects representing the states.  Delegate behavior to the states using the state pattern
  2. Limit dependencies between the state objects.
  3. Make your code easy to read and as modular as possible, because it is likely that the language you are lexing will change.
  4. To practice test-driven design and development.

Lexical Structure

The languages comprises the following lexical elements:

  • identifiers, which start with a letter and contain letters, digits, underscore _, and dash '.
  • numerals, which are sequences of digits and contain at most one decimal point
  • strings, which are sequences of characters enclosed by " and ".  Inside a string, \" represents a quote, \n a newline, and \t a tab.  It's illegal to have a newline in a string.
  • uninterpreted strings, which are sequences of arbitrary characters enclosed by """ and """.  Any characters, including " and newline, can appear in an uninterpreted string.
  • dots, which are the separators in method requests
  • assignment symbols, :=
  • definition symbols, =
  • operators, which are sequences of the operator characters -&|:$#%^@?*\/+!=><~
  • parentheses ( and )
  • brackets [ and ]
  • braces { and }
  • newlines, which can be represented by the Unicode line feed (LF) character, by the Unicode carriage return (CR) character, or by the Unicode line separator (U+2028) character; a line feed that immediately follows a carriage return is ignored.
Notice that some of these categories seem to overlap.  So, while := should be lexed as an assignment symbol, == and :=: are operator symbols.  Similarly, newlines are treated as ordinary characters while inside an uninterpreted string, but in other contexts are

Inputs that don't end with a newline should be treated as if they end with a newline.

Input and Output

The input to your lex method should be a String.  The output should be a list of Tokens; the tokens are defined in token.grace.

Pairing

If you want to work on this assignment as a pair (with another student from the class), that's OK.  In fact, it's good!  You will learn more, provided that you actually pair program.
Conversely, it's not ok partner with someone else by dividing up the modules between you so that each of you only works on half of the code.

Tests

A tiny start at a testing framework is in lexer_test.grace.  Expand this into a full set of tests.

Hints

  • Don't try and build a FSA for all of the above categories at once. Start with a test for some of the basic functionality, like lexing identifiers and operator symbols.  Once you have one test passing, add another.
  • You may want to write unit tests for behavior of some of the internals of the lexer, like the behavior of a particular state, or for the pattern of state transitions.  This will involve testing methods that would normally be internal to the lexer.  You may need to make them visible just so that you can test them. 
  • Don't write long methods. Use the composed method pattern from Beck, both because I'm grading on it, and because it will keep your code clean. If you write more than 10 lines in a method, think about decomposing it.
  • As you learn more about the problem space, you will realize that some of your early decisions were wrong. Don't be afraid to go back and change them. Software is soft!  Embrace Change!  I already changed my mind twice about what the arguments of the token creation methods should be.  You may decide to change them once again. That's OK; but explain why you make any changes that you find necessary.
  • Once a test is working, go back and refactor your code to keep it clean.  Saying "I'll clean up later" is a great way to get into Technical Debt.
  • Save your work frequently.  Put your code in version control (git, subversion, mercurial ... )  

Submitting your work

Submit your work as a .zip file in the d2l dropbox.  You will need to tell me about the composition of your group so that I can create the dropbox.  Your zip file should contain:

  1. a module A2Lexer.grace that contains your implementation
  2. a module A2Lexer_test.grace that contains your tests
  3. a file A2report.pdf (or A2report.txt) that contains your report.  Use this to tell me what you learned, problems you solved, issues you ran into and couldn't solve, features of Grace that made this easier or harder than you expected, etc.   The report can be quite short; half a page is probably ample.  Write more here only if you need to.  The code itself and the intention-revealing names you use in it should tell me the details.   If you feel the need to explain your code in the report, stop!  Go back and clarify the code.

Resources

 





Most recently modified sometime in the past


Andrew P. Black