|
Set: Friday 1st May 2015
Code due: Monday 11th May
2015 at or before 17:00
|
|
Summary
We talked a bit about the Unicode standard in
class. Unicode is character set that contains more
than 100,000 "characters" (called codepoints).
As far as this homework is concerned,
each "character" has
- a name, which is a sequenec of ASCII characters, such
as LATIN SMALL LETTER M or GREEK SMALL LETTER ETA WITH
TONOS. A few characters have alternative names
called aliases, which are mostly abbreviations
and corrections.
- a codepoint, a number in the range 0hex
to 10FFFFhex. (66 non-character
codepoints are not used for encoding characters.)
- a category. This is made up of a Major
Category, such as "L" for letter, "N" for number, "P",
for punctuation, "S" for symbol, and a minor category,
which is used to subdivide the Major category.
Examples of minor categories are "Lu" for Letter,
uppercase; "Ll" for Letter, lowercase; "Nd" for Number,
decimal digit; "Sm" for Symbol, mathematical; and "Sc"
for Symbol, currency.
This assignment involves defining an object structure
that will support efficient implementations of the
following methods (Codepoint is an alias for
Number)
name(char:Codepoint) -> String map a
codepoint to the corresponding name.
character(name:String) -> Codepoint
map a string representing a name, or an alias, to a
copdepoint
majorCategory(char:Codepoint)
-> String map a codepoint to a
single-character string representing its major category
category(char:Codepoint)
-> String map a codepoint to a string of
length 2 that represents its category.
By efficient, I mean that none of these functions
should take time in O(n), where n is the
size of the Unicode charctaer set. They should
all be in O(1) or O(log n). Neither should
they use an unreasonable amount of space; in particular,
the table of names, which is large, should not be
duplicated.
In addition to working code, you should turn in your
tests, and make a presentation (with pictures!) to me and
to the class on the structure that you used.
Groups
Please do this assignment in groups of 3.
Having the design discussion within your group is
invaluable.
Where do I find out more about Unicode?
Unicode is fascinating, and more complicated than you
need know about for this assignment. You will
need the unicode code tables, though. These are in
the files UnicodeData.txt and NameAliases.txt, which can
be found at http://www.unicode.org/Public/UCD/latest/ucd/.
The format of UnicodeData.txt is defined here.
What Data Structure to Use?
The point of this assignment is to practice your design
skills. I'm asking you to do the design as a group,
because that will help you to debug the design before
implementing it. When you make your presentation to
me, I will expect all three group members to defend your
design choice. Thus, the design must be a
group effort. It's OK to have your design
discussion interactively in Cyberspace, e.g., using a
Google Hangout.
I'm also requiring you to do
Test-Driven-Development. If you want to split up the
implemetation task betrween you, that's OK. For
example, one member might write tests, while another
pre-processes the unicode tables into initialization code
for your data structure, and the third writes methods that
use the data structure. If you want to program
all of the code as pairs or as a group of three, that's
even better, but not required.
What are the constraints?
Because this is the real world, there are some
constraints.
- As mentioned above, your methods need to be time
efficeient: O(1) or O(log n). So this
rules out linear search.
- They also need to be space-effieicent. The table
of names and aliases is large, so you should not
duplicate it. This rules out the naοve solution of
a pair of dictionaries, one mapping from names to
codepoints, and the other mapping from codepoints to
names.
- The time to initialize the data structures can be
ignored. This is reasonable because the data is
constant, so initialization could in principle be done
once in a pre-processing step. However, it also
meansthat you should not be reading text files at
runtime; the data you need should be embedded in your
program.
- The implementation can be done in Grace, or in
JavaScript, or in Ruby. In each case, I expect you
to exercise the design skills that we have been learning
during the quarter.
- For testing, you can use either gUnit, or minitest
for Grace, and whatever Javascript or Ruby testing
framework you like.
- If you use Grace, and it turns out that Grace can't
handle data sets of the given size, then use a small
portion of the data to test your implementation.
Hints
- You are welcome to use the basic collections objects
of Grace as building blocks, or similar functionality in
Javascript indexable lists, sets, and dictionaries.
- There is a Python script that pre-processes the
Unicode code tables at http://github.com/apblack/minigrace/tools/unicodegen.py.
That particualr script outputs C code, which won't be of
interest to you, but the code that reads the unicode
tables may be useful. If its not useful, ignore
it.
Submitting your work
Submit your work as a .zip file in the d2l
dropbox. You will need to tell me about the
composition of your group so that I can create the
dropbox. Your zip file should contain:
- a file unicode.grace (or unicode.js)
that contains your implementation
- a file unicodeTests.grace (or .js)
that contains your tests
- whatever auxilliary files you need containing your
data, the programs you used to pre-process the data,
etc.
- a file report.pdf (or report.txt)
that contains your report. Use this to tell me
what data structure you used, what you learned, problems
you solved, issues you ran into and couldn't solve,
features of your implementation language that made this
easier or harder than you expected, etc. The
report can be quite short; 2 pages, including diagrams,
is probably ample. Write more here only if you
need to. The code itself and the
intention-revealing names you use in it should tell me
the details. If you feel the need to explain
your code in the report, stop! Go back and
clarify the code.
- a file presentation.pdf that contains the
graphics that you will use to present your design to me
and to the class.
- you will also need to sign up for a presentation slot
with me during the week of 1115 May. All three
group memebers should be in attendance. Sign up
for this on Piazza.
- In class on 12th May, your group (or some subset of
your group) should be ready to give a 5-minute
description of your solution. You should use the
graphics in presentation.pdf
(which need not be fancy; a scan of some hand-drawn
diagrams might work fine) or paper that we can put on the
document camera. The purpose of these descriptions is
not to assess your presentation skills, but to
expose the whole class to the range of solutions.
Resources
|