Flat sequences of characters have few advantages as a data structure for character strings other than simplicity. Tree or graph structures are generally more flexible and more efficient. This example of the use of AOP is an (incomplete) implementation of "stronger strings" called cords in Java, in an object-oriented style.
One of the most common operations on strings is concatenation. For concatenation to be both time and space efficient in a world where strings are immutable, it is necessary for the result to share as much of the data structure with its arguments as possible.
The natural data structure represents a concatenation of two
strings as a Concat
object that
contains references to its arguments. Since the arguments may also be
Concat
objects, we quickly arrive
at a tree structure in which the internal nodes represent
concatenation and the leaves are
Flat
strings that store the
actual characters. The basic string operations like
charAt
,
equals
, etc. propagate down the
tree as far as necessary.
In fact, the structure is not a tree but a directed graph, since it is quite possible to concatenate a string to itself. Because strings are immutable this sharing causes no problems and locking is never necessary, so will will continue to refer to trees.
A substring of such a structured string can be represented by a
Substring
object that stores a
reference to the base string of which it is a substring, and the
bounding indexes.
Given this tree representation, fetching the ith
character of a string s
will take
time proportional to the depth of the tree. If the tree is balanced,
this will be log2
s.length()
. However, since
strings are usually built by concatenation, maintaining balance at
all times requires that the concatenation operation rebalance the
tree, which means that concatenation will also take logarithmic
time.
Because the typical usage pattern of a string is a period in which it is built by concatenation followed by a period in which it is manipulated, this is probably a bad tradeoff. Instead we provide a client-callable balance method, which achieves approximate balance, and only force a rebalance if the tree becomes extremely lopsided.
The cord data structure is very amenable to the addition of other
kinds of leaf string in addition to flat strings. For example, a
FileString
might store most of
its characters in a file, except for a small buffer. New kinds of
interior nodes can also be added. For example, the
map
message currently traverses
every character in a string applying a client-provided translation
function, and produces a new string containing the translated
characters. This operation could also be implemented lazily, by
adding a Map
class that stores
the base string and the translation function, and which applies the
function to the characters when (and if) they are demanded.
To make these extensions possible,
cord.String
is an interface, not
a class. cord.String
is similar
to java.lang.String
, but contains
some additional methods such as
map
and
perform
that do string traversal.
This is because such traversals can be performed in linear time,
whereas traversing a string by repeated calls to
charAt
would take time n
log n for a string of length n.
These extensions do not seem to require aspects; class structure provides appropriate locality.
The basic cord classes were first implemented, using the interface
of cord.String
. These are
Flat
,
Substring
,
Concat
and
Empty
. All of these classes
inherit from AbstractString
,
which provides no representation but does provide abstract
implementation of many of the utility methods. No account was taken
of balancing. These classes were tested and debugged without
balancing, and as expected gave rise to excessively deep trees.
Balancing was then added as an aspect. To know when a string is
balanced, and to avoid rebalancing already balanced sub-trees, every
implementation of cord.String
must be able to answer its depth efficiently. The depth of a
Flat
string is zero and the depth
of a Concat
string is one greater
than the maximum depth of its children. So the first thing that the
Balance
aspect does is introduce
an additional (protected
) field
int
depth
into Substring
and
Concat
objects, an additional
method int
depth()
into interface
String
, and four implementations
of the depth method:
Empty.depth()
and
Flat.depth()
simply return
0
, while
Substring.depth()
and
Concat.depth()
return the value
of the depth field.
It is also necessary to initialize the depth field when the
Concat
and
Substring
objects are created.
This is accomplished by before advise weaves on the
constructors. In addition to calculating the value of depth, these
weaves take the opportunity to explicitly rebalance the argument
Strings if they are excessively deep.
A String s is balanced if
s.length()
>
Fibs.depth()+2
, where Fibn
is the nth Fibonacci number. To test this
condition, we use an array
minLength[]
, initialized
to the Fibonacci sequence. This array is part of the aspect and is
within the lexical scopes of all of the methods defined in the
aspect. Since the length field in this implementation is an
int
, the length of a
cord.String
cannot exceed
231, and thus the depth of a balanced tree must be 44 or
less, since Fib47 > 231.
The balance
method itself is
introduced into the interface
cord.String
. This does two
things: it introduces an abstract version of the method into the
String
interface, and a concrete
version of the method into every class that directly implements the
String
interface. In this case,
this means AbstractString
:
although Flat
,
Empty
,
Concat
and
Substring
also implement the
String
interface, they also
inherit from AbstractString
, so
it is not necessary for AspectJ to introduce the method into these
classes.
The method balance
is written
in terms of several helper methods. These are also defined in the
Balance aspect. In some cases variants of the method can be defined
for the various classes. For example, if an
Empty
string object is asked to
add itself to the balancing structure, the code is particularly
efficient!
Although balancing is the only aspect that has been implemented to date, other possibilities have been identified.
At present, the code does not optimize the case of concatenating a
null string to the right of this
.
(The class Empty
does optimize
the case of concatenating a string to the null string.) Inserting all
of the appropriate tests for the null string might possibly be done
as an aspect. Indeed, it is possible that Empty should itself have
been an aspect, although it seemed at the time that dealing with null
strings was part of the base functionality.
When the strings involved are sufficiently short, it does not make
sense to build these elaborate tree structures: it is more efficient
simply to copy the characters. In the Cedar rope
implementation (see below) "sufficiently short" means 24 characters
or less. The changes to
Flat.Concat
necessary to simply
build a longer Flat
string object
rather than a Concat
string
object could be added as an aspect.
As has been mentioned above, repeatedly calling
str.charAt(i) for i=0 to
str.length()
is not an efficient way of programming a
traversal of str
, because of the
time necessary to dive down into the appropriate sub-tree for each
character. The perform
method is
provided as an efficient alternative. However, old programming habits
die hard, and it might therefore be desirable to speed up
charAt
by using a cache.
At any level of the tree, a
Concat
or
Substring
node could keep a cache
of one or more piece records containing the following
information:
p
somewhere in the tree below
this
startIndex
and
endIndex
of the substring of
this
that is represented by
p
adjustment
(a signed
integer) that must be added to an index
i
of
this
, so thatstartIndex
<
i
<
endIndex
, then
this.charAt(i) =
p.charAt(i+adjustment)
. These records would enable
charAt
to skip all of the
intermediate levels in the tree in the event of a cache hit. An
aspect would allow the cache to be added to the representation of the
objects, and also provide a way of introducing the code that builds
and takes advantage of the cache.
The Xerox Cedar system provided an industrial strength implementation of strings in the Mesa language called ropes. A version of the Cedar system is available as a CD (Using Threads in Interactive Systems: A Case Study, by Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch, and Mark Weiser, December 1993. Xerox PARC Technical Report CSL-93-16). A lighter-weight implementation of the same concept, in the C language called cords was undertaken by Hans Boehm and is available as part of a portable garbage collector (download one of the tar files and look at the cord subdirectory).
More information about the data structure is available online (page originally prepared by Hans Boehm) and in:
Boehm, Atkinson, and Plass, "Ropes: An Alternative to Strings", SoftwarePractice and Experience 25, 12 (Dec 1995), pp. 1315 - 1330.