Cords

Flat sequences of characters have few advantages as a data structure for character strings other than simplicity. Tree or graph structures are generally more flexible and more efficient. This example of the use of AOP is an (incomplete) implementation of "stronger strings" called cords in Java, in an object-oriented style.

The Basic Idea

One of the most common operations on strings is concatenation. For concatenation to be both time and space efficient in a world where strings are immutable, it is necessary for the result to share as much of the data structure with its arguments as possible.

The natural data structure represents a concatenation of two strings as a Concat object that contains references to its arguments. Since the arguments may also be Concat objects, we quickly arrive at a tree structure in which the internal nodes represent concatenation and the leaves are Flat strings that store the actual characters. The basic string operations like charAt, equals, etc. propagate down the tree as far as necessary.

In fact, the structure is not a tree but a directed graph, since it is quite possible to concatenate a string to itself. Because strings are immutable this sharing causes no problems and locking is never necessary, so will will continue to refer to trees.

A substring of such a structured string can be represented by a Substring object that stores a reference to the base string of which it is a substring, and the bounding indexes.

Engineering Tradeoffs

Given this tree representation, fetching the ith character of a string s will take time proportional to the depth of the tree. If the tree is balanced, this will be log2 s.length(). However, since strings are usually built by concatenation, maintaining balance at all times requires that the concatenation operation rebalance the tree, which means that concatenation will also take logarithmic time.

Because the typical usage pattern of a string is a period in which it is built by concatenation followed by a period in which it is manipulated, this is probably a bad tradeoff. Instead we provide a client-callable balance method, which achieves approximate balance, and only force a rebalance if the tree becomes extremely lopsided.

User-defined Cords

The cord data structure is very amenable to the addition of other kinds of leaf string in addition to flat strings. For example, a FileString might store most of its characters in a file, except for a small buffer. New kinds of interior nodes can also be added. For example, the map message currently traverses every character in a string applying a client-provided translation function, and produces a new string containing the translated characters. This operation could also be implemented lazily, by adding a Map class that stores the base string and the translation function, and which applies the function to the characters when (and if) they are demanded.

To make these extensions possible, cord.String is an interface, not a class. cord.String is similar to java.lang.String, but contains some additional methods such as map and perform that do string traversal. This is because such traversals can be performed in linear time, whereas traversing a string by repeated calls to charAt would take time n log n for a string of length n.

These extensions do not seem to require aspects; class structure provides appropriate locality.

Using Aspects

The basic cord classes were first implemented, using the interface of cord.String. These are Flat, Substring, Concat and Empty. All of these classes inherit from AbstractString, which provides no representation but does provide abstract implementation of many of the utility methods. No account was taken of balancing. These classes were tested and debugged without balancing, and as expected gave rise to excessively deep trees.

Balancing was then added as an aspect. To know when a string is balanced, and to avoid rebalancing already balanced sub-trees, every implementation of cord.String must be able to answer its depth efficiently. The depth of a Flat string is zero and the depth of a Concat string is one greater than the maximum depth of its children. So the first thing that the Balance aspect does is introduce an additional (protected) field int depth into Substring and Concat objects, an additional method int depth() into interface String, and four implementations of the depth method: Empty.depth() and Flat.depth() simply return 0, while Substring.depth() and Concat.depth() return the value of the depth field.

It is also necessary to initialize the depth field when the Concat and Substring objects are created. This is accomplished by before advise weaves on the constructors. In addition to calculating the value of depth, these weaves take the opportunity to explicitly rebalance the argument Strings if they are excessively deep.

The Balance Method

A String s is balanced if s.length() > Fibs.depth()+2, where Fibn is the nth Fibonacci number. To test this condition, we use an array minLength[], initialized to the Fibonacci sequence. This array is part of the aspect and is within the lexical scopes of all of the methods defined in the aspect. Since the length field in this implementation is an int, the length of a cord.String cannot exceed 231, and thus the depth of a balanced tree must be 44 or less, since Fib47 > 231.

The balance method itself is introduced into the interface cord.String. This does two things: it introduces an abstract version of the method into the String interface, and a concrete version of the method into every class that directly implements the String interface. In this case, this means AbstractString: although Flat, Empty, Concat and Substring also implement the String interface, they also inherit from AbstractString, so it is not necessary for AspectJ to introduce the method into these classes.

The method balance is written in terms of several helper methods. These are also defined in the Balance aspect. In some cases variants of the method can be defined for the various classes. For example, if an Empty string object is asked to add itself to the balancing structure, the code is particularly efficient!

Other possibilities for Aspects

Although balancing is the only aspect that has been implemented to date, other possibilities have been identified.

Null-string optimizations

At present, the code does not optimize the case of concatenating a null string to the right of this. (The class Empty does optimize the case of concatenating a string to the null string.) Inserting all of the appropriate tests for the null string might possibly be done as an aspect. Indeed, it is possible that Empty should itself have been an aspect, although it seemed at the time that dealing with null strings was part of the base functionality.

Short String optimizations

When the strings involved are sufficiently short, it does not make sense to build these elaborate tree structures: it is more efficient simply to copy the characters. In the Cedar rope implementation (see below) "sufficiently short" means 24 characters or less. The changes to Flat.Concat necessary to simply build a longer Flat string object rather than a Concat string object could be added as an aspect.

A cache for charAt()

As has been mentioned above, repeatedly calling str.charAt(i) for i=0 to str.length() is not an efficient way of programming a traversal of str, because of the time necessary to dive down into the appropriate sub-tree for each character. The perform method is provided as an efficient alternative. However, old programming habits die hard, and it might therefore be desirable to speed up charAt by using a cache.

At any level of the tree, a Concat or Substring node could keep a cache of one or more piece records containing the following information:

  1. A reference to a string p somewhere in the tree below this
  2. The startIndex and endIndex of the substring of this that is represented by p
  3. The adjustment (a signed integer) that must be added to an index i of this, so that
    if startIndex < i < endIndex, then this.charAt(i) = p.charAt(i+adjustment)

These records would enable charAt to skip all of the intermediate levels in the tree in the event of a cache hit. An aspect would allow the cache to be added to the representation of the objects, and also provide a way of introducing the code that builds and takes advantage of the cache.

Some Background on Cords

The Xerox Cedar system provided an industrial strength implementation of strings in the Mesa language called ropes. A version of the Cedar system is available as a CD (Using Threads in Interactive Systems: A Case Study, by Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch, and Mark Weiser, December 1993. Xerox PARC Technical Report CSL-93-16). A lighter-weight implementation of the same concept, in the C language called cords was undertaken by Hans Boehm and is available as part of a portable garbage collector (download one of the tar files and look at the cord subdirectory).

More information about the data structure is available online (page originally prepared by Hans Boehm) and in:

Boehm, Atkinson, and Plass, "Ropes: An Alternative to Strings", Software—Practice and Experience 25, 12 (Dec 1995), pp. 1315 - 1330.

The Source Code

The Woven Code


Andrew P Black
Last updated: 9.10.2003 at 15:45