Department of
Computer Science

 

CS 420/520: Object-Oriented Programming

 

Assignment 3: Design a Data Structure for the Unicode Code Tables




Set: Friday 1st May 2015

Code due: Monday 11th May 2015 at or before 17:00

 

Summary

We talked a bit about the Unicode standard in class.  Unicode is character set that contains more than 100,000 "characters" (called codepoints).  As far as this homework is concerned,
each "character" has

  1. a name, which is a sequenec of ASCII characters, such as LATIN SMALL LETTER M or GREEK SMALL LETTER ETA WITH TONOS.  A few characters have alternative names called aliases, which are mostly abbreviations and corrections.
  2. a codepoint, a number in the range 0hex to 10FFFFhex.   (66 non-character codepoints are not used for encoding characters.)
  3. a category.  This is made up of a Major Category, such as "L" for letter, "N" for number, "P", for punctuation, "S" for symbol, and a minor category, which is used to subdivide the Major category.  Examples of minor categories are "Lu" for Letter, uppercase; "Ll" for Letter, lowercase; "Nd" for Number, decimal digit; "Sm" for Symbol, mathematical; and "Sc" for Symbol, currency.

This assignment involves defining an object structure that will support efficient implementations of the following methods (Codepoint is an alias for Number)

  • name(char:Codepoint) -> String — map a codepoint to the corresponding name.
  • character(name:String) -> Codepoint — map a string representing a name, or an alias, to a copdepoint
  • majorCategory(char:Codepoint) -> String — map a codepoint to a single-character string representing its major category
  • category(char:Codepoint) -> String — map a codepoint to a string of length 2 that represents its category.

By efficient, I mean that none of these functions should take time in O(n), where n is the size of the Unicode charctaer set.   They should all be in O(1) or O(log n).  Neither should they use an unreasonable amount of space; in particular, the table of names, which is large, should not be duplicated.

In addition to working code, you should turn in your tests, and make a presentation (with pictures!) to me and to the class on the structure that you used.

Groups

Please do this assignment in groups of 3.   Having the design discussion within your group is invaluable. 

Where do I find out more about Unicode?

Unicode is fascinating, and more complicated than you need know about for this assignment.   You will need the unicode code tables, though.  These are in the files UnicodeData.txt and NameAliases.txt, which can be found at http://www.unicode.org/Public/UCD/latest/ucd/.  The format of UnicodeData.txt is defined here.

What Data Structure to Use?

The point of this assignment is to practice your design skills.  I'm asking you to do the design as a group, because that will help you to debug the design before implementing it.  When you make your presentation to me, I will expect all three group members to defend your design choice.  Thus, the design must be a group effort.   It's OK to have your design discussion interactively in Cyberspace, e.g., using a Google Hangout.

I'm also requiring you to do Test-Driven-Development.  If you want to split up the implemetation task betrween you, that's OK.  For example, one member might write tests, while another pre-processes the unicode tables into initialization code for your data structure, and the third writes methods that use the data structure.   If you want to program all of the code as pairs or as a group of three, that's even better, but not required.

What are the constraints?

Because this is the real world, there are some constraints.

  1. As mentioned above, your methods need to be time efficeient: O(1) or O(log n).  So this rules out linear search.
  2. They also need to be space-effieicent.  The table of names and aliases is large, so you should not duplicate it.  This rules out the naοve solution of a pair of dictionaries, one mapping from names to codepoints, and the other mapping from codepoints to names.
  3. The time to initialize the data structures can be ignored.  This is reasonable because the data is constant, so initialization could in principle be done once in a pre-processing step.  However, it also meansthat you should not be reading text files at runtime; the data you need should be embedded in your program.
  4. The implementation can be done in Grace, or in JavaScript, or in Ruby.  In each case, I expect you to exercise the design skills that we have been learning during the quarter.
  5. For testing, you can use either gUnit, or minitest for Grace, and whatever Javascript or Ruby testing framework you like.
  6. If you use Grace, and it turns out that Grace can't handle data sets of the given size, then use a small portion of the data to test your implementation.

Hints

  • You are welcome to use the basic collections objects of Grace as building blocks, or similar functionality in Javascript — indexable lists, sets, and dictionaries.
  • There is a Python script that pre-processes the Unicode code tables at http://github.com/apblack/minigrace/tools/unicodegen.py.  That particualr script outputs C code, which won't be of interest to you, but the code that reads the unicode tables may be useful.  If it’s not useful, ignore it.

Submitting your work

Submit your work as a .zip file in the d2l dropbox.  You will need to tell me about the composition of your group so that I can create the dropbox.  Your zip file should contain:

  1. a file unicode.grace (or unicode.js) that contains your implementation
  2. a file unicodeTests.grace (or .js) that contains your tests
  3. whatever auxilliary files you need containing your data, the programs you used to pre-process the data, etc.
  4. a file report.pdf (or report.txt) that contains your report.  Use this to tell me what data structure you used, what you learned, problems you solved, issues you ran into and couldn't solve, features of your implementation language that made this easier or harder than you expected, etc.   The report can be quite short; 2 pages, including diagrams, is probably ample.  Write more here only if you need to.  The code itself and the intention-revealing names you use in it should tell me the details.   If you feel the need to explain your code in the report, stop!  Go back and clarify the code. 
  5. a file presentation.pdf that contains the graphics that you will use to present your design to me and to the class.
  6. you will also need to sign up for a presentation slot with me during the week of 11–15 May.  All three group memebers should be in attendance.  Sign up for this on Piazza.
  7. In class on 12th May, your group (or some subset of your group) should be ready to give a 5-minute description of your solution.  You should use the graphics in presentation.pdf
  8. (which need not be fancy; a scan of some hand-drawn diagrams might work fine) or paper that we can put on the document camera. The purpose of these descriptions is not  to assess your presentation skills, but to expose the whole class to the range of solutions.

Resources

 





Most recently modified sometime in the past


Andrew P. Black