Data Structures

Topic #10

Today’s Agenda

Continue Discussing Trees

Examine more advanced trees

2-3 (evaluate what we learned)

B-Trees

AVL

2-3-4

red-black trees

Discuss 2-3 Trees

A 2-3 tree is always balanced

Therefore, you can always search it with the logarithmic efficiency of a binary search

You might be concerned about the extra work in the insertion/deletion algorithms to split and merge the nodes...

Discuss 2-3 Trees

But, rigorous mathematical analysis has proved that this extra work to maintain structure is not significant

It is sufficient to consider only the time required to locate an item (or a position to insert)

Discuss 2-3 Trees

So, if 2-3 trees are so good, why not have nodes that can have more data items and more than 3 children?

Well, remember why 2-3 trees are great?

because they are balanced and that balanced structure is pretty easy to maintain

Discuss 2-3 Trees

The advantage is not that the tree is shorter than a balanced binary search tree

the reduction in height is actually offset by the extra comparisons that have to be made to find out which branch to take

actually a binary search tree that is balanced minimizes the amount of work required to support ADT table operations

Discuss 2-3 Trees

But, with binary search trees balance is hard to maintain

A 2-3 tree is really a compromise

Searching may not be quite as efficient as a binary tree of minimum height

but, it is relatively simple to maintain

Discuss 2-3 Trees

Allowing nodes to have more than 3 children would require more comparisons and would therefore be counterproductive

unless you are working with external storage, where each node requires a disk access; then we use B-trees, which keep the height as small as possible

Discuss B-Trees

Tables stored externally can be searched with B-Trees.

B-Trees generalize the 2-3 Tree

With externally stored tables, we want to keep the search tree as short as possible; it is much faster to do extra comparisons at a particular node than try to find the next node.

Discuss B-Trees

Every time we want to get another node,

we have to access the external file and read in the appropriate information.

It takes far less time to operate on a particular node (i.e., doing comparisons) once it has been read in.

This means that for externally stored tables we should try to reduce the height of the tree...even if it means doing more comparisons at every node.

Discuss B-Trees

Therefore, with an external search tree,

we allow each node to have as many children as possible.

If a node is to have m children, then you must be able to allocate enough memory for that node to contain its data and m pointers to its children.

The data in such a node must be m-1 key values.
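
As a rough illustration of that memory requirement, one node could be laid out as below. This is only a sketch: the names `BTreeNode`, `M`, `count`, `keys`, `child`, and `keyAt` are assumptions made for the example, and in a real external file the child "pointers" would more likely be block numbers than memory addresses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout for one node of a search tree with up to M children.
// All names here are illustrative; none of them come from the slides.
template <std::size_t M>
struct BTreeNode {
    static_assert(M >= 3, "a node needs room for at least 3 children");

    int        count = 0;         // number of keys currently stored (at most M-1)
    int        keys[M - 1] = {};  // up to M-1 key values, kept in sorted order
    BTreeNode* child[M] = {};     // up to M child pointers (all null in a leaf)

    bool isLeaf() const { return child[0] == nullptr; }
    bool isFull() const { return count == static_cast<int>(M) - 1; }

    // Read the i-th stored key; i must refer to a key that is in use.
    int keyAt(int i) const {
        assert(0 <= i && i < count);
        return keys[i];
    }
};
```

With `M` set to 5, each node has room for 4 keys and 5 child pointers, which is the size used in the degree-5 example in this lecture.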

Discuss B-Trees

Remember in a binary search tree,

if a node has 2 children then it contains one data value (i.e., one key).

You can think of the data value at a node as separating the data values in the two child subtrees.

All keys to the left are less than the node's data value and all key values to the right are greater than or equal.

The value of the data at a particular node tells you which branch to take.

Discuss B-Trees

In a 2-3 tree,

if a node has 3 children then it must contain two key values.

These two values separate the key values in the node's three child subtrees.

All of the key values in the left subtree are less than the node's smaller key value;

all of the key values in the middle subtree are between the node's two key values;

all of the key values in the right subtree are greater than or equal to the node's larger key
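
The three rules above amount to a small decision function. Here is a minimal sketch of it, assuming integer keys; the name `branchFor` and the 0/1/2 numbering of the subtrees are made up for the example. A real search would also compare the key against the node's two stored values before descending.

```cpp
#include <cassert>

// Which child subtree of a 3-node should a search for `key` follow?
// Returns 0 (left), 1 (middle), or 2 (right). A key equal to the larger
// stored value goes right, matching the "greater than or equal" rule.
int branchFor(int small, int large, int key) {
    assert(small <= large);      // the node keeps its two keys in order
    if (key < small) return 0;   // left subtree: keys less than the smaller key
    if (key < large) return 1;   // middle subtree: keys between the two keys
    return 2;                    // right subtree: keys >= the larger key
}
```

For a node holding 30 and 60, a search for 45 takes the middle branch at the cost of two comparisons, which is the extra work per node mentioned earlier.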

Discuss B-Trees

Ideally, you should structure these types of trees such that every internal node has m children and all leaves are at the same level.

For example, if m is 5 -- then every node should have 5 children and 4 data values.

But, this is too difficult to maintain when you are doing a variety of insertions and deletions.

Discuss B-Trees

So, we can require that B-trees be balanced (as we saw with 2-3 trees)...

but the number of children for any internal node may be anywhere between (m div 2)+1 and m.

We call this a B-Tree of degree m

This requires that all leaves be at the same level (balanced).

Discuss B-Trees

Each node contains between (m div 2) and m-1 values.

Each internal node has one more child than it has values.

There is one exception;

the root of the tree can contain as few as 1 value and can have as few as two children (or none -- if the tree consists of only a root!).
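
To make these limits concrete, here is a small sketch that just evaluates the (m div 2) formulas for a given degree. The names `Bounds` and `nodeBounds` are hypothetical, C++ integer division stands in for div, and the root exception is deliberately left out.

```cpp
#include <cassert>

struct Bounds {
    int minChildren, maxChildren;  // limits for a non-root internal node
    int minValues,   maxValues;    // limits for any non-root node
};

// Evaluate the limits for a B-tree of degree m using the formulas above.
Bounds nodeBounds(int m) {
    assert(m >= 3);
    return Bounds{ m / 2 + 1, m, m / 2, m - 1 };
}
```

For m = 5 this gives 3 to 5 children and 2 to 4 values; for m = 3 it gives 2 to 3 children and 1 to 2 values.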

Discuss B-Trees

Notice, a 2-3 tree is a B-tree of degree 3.

Data can be inserted into a B-tree using the same strategy of splitting and merging nodes that we discussed

Here is a B-tree of degree 5:

Discuss B-Trees

Then, insert 55.

The first step is to locate the leaf of the tree in which this index belongs by determining where the search for 55 would terminate.

We would find that we would want to insert 55 in the node containing 50, 56, 57, 58.

But, that would cause this node to contain 5 records (50, 55, 56, 57, 58). Since a node can contain only 4 records, you must split this node into two: the new left node gets the two smaller values (50, 55) and the new right node gets the two larger values (57, 58).
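
The split of that one node can be sketched in a few lines. This is an illustration only, assuming integer keys held in a `std::vector`; `SplitResult` and `insertAndSplit` are names invented for the example, and the parent update is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct SplitResult {
    std::vector<int> left;    // the smaller values stay in the left node
    int              middle;  // the middle value is promoted to the parent
    std::vector<int> right;   // the larger values go to the new right node
};

// Add `key` to a full node's sorted keys, then split around the middle key.
SplitResult insertAndSplit(std::vector<int> keys, int key) {
    assert(std::is_sorted(keys.begin(), keys.end()));
    keys.insert(std::upper_bound(keys.begin(), keys.end(), key), key);
    assert(keys.size() % 2 == 1);  // e.g. 4 old keys + 1 new key = 5
    const std::size_t mid = keys.size() / 2;
    return SplitResult{
        std::vector<int>(keys.begin(), keys.begin() + mid),
        keys[mid],
        std::vector<int>(keys.begin() + mid + 1, keys.end())
    };
}
```

Calling `insertAndSplit({50, 56, 57, 58}, 55)` yields a left node of 50, 55, a middle value of 56, and a right node of 57, 58.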

Discuss B-Trees

The record with the middle key value (56) is moved up to the parent:

Discuss B-Trees

This causes two problems,

the parent now has six children and five records!!

So, we must split the parent into two nodes and move the middle data value up to its parent.

Remember, when we split an internal node, we also need to move that node's children

Since the root only has 2 data items, we can simply add 56 there.

The solution is on the next slide...

Discuss B-Trees

Discuss B-Trees

Notice that if the root had needed to be split,

the new root would contain only one value and have only 2 children (that is why we have the exception to the B-Tree definition stated earlier).

To traverse a B-Tree in sorted order, all we need to do is visit the search keys in sorted order by using an inorder traversal of the B-Tree.
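
A minimal sketch of that inorder traversal follows, assuming an in-memory node that stores its keys and children in vectors. The `Node` type and the `inorder` signature are assumptions for the example; an external B-tree would read each child from the file instead of following a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory node: a leaf has no children, an internal node
// has exactly one more child than it has keys.
struct Node {
    std::vector<int>   keys;
    std::vector<Node*> children;
};

// Visit child 0, key 0, child 1, key 1, ..., and finally the last child.
void inorder(const Node* node, std::vector<int>& out) {
    if (node == nullptr) return;
    const bool leaf = node->children.empty();
    assert(leaf || node->children.size() == node->keys.size() + 1);
    for (std::size_t i = 0; i < node->keys.size(); ++i) {
        if (!leaf) inorder(node->children[i], out);
        out.push_back(node->keys[i]);
    }
    if (!leaf) inorder(node->children.back(), out);
}
```

The only difference from the binary case is the loop: each key is visited between the two subtrees it separates.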

Balancing Algorithms

But, are there other alternatives?

Remember the advantage of trees is that they are well suited for problems that are hierarchical in nature and they are much faster than linked lists

but, this is not valid if the tree is not balanced

luckily, there are a number of techniques to balance a binary tree

Balancing Algorithms

Some of the balancing techniques require constant restructuring of the tree as data is inserted

the AVL algorithm uses this approach

Some algorithms consist of building an unbalanced tree and then reordering the data once the tree is generated

this can be simple but depending on the frequency of data being inserted, it may not be realistic

Balancing Algorithms

The “brute force” technique is to create an array of pointers to your data by traversing an unbalanced BST using “inorder” traversal

then re-build the tree by splitting the array in the middle for each subarray (much like what we have seen with the binary search algorithm used with arrays)

the middle data item should be the root, as it splits what is less than it, and what is greater!

Balancing Algorithms

The algorithm for the “brute force” approach is:

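void balance(data_type data[], int first, int last)
{
    if (first <= last) {
        int middle = (first + last) / 2;    // middle item becomes the root of this subtree
        insert(data[middle]);
        balance(data, first, middle - 1);   // items smaller than the middle
        balance(data, middle + 1, last);    // items larger than the middle
    }
}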
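
For anyone who wants to run the idea, here is a self-contained sketch of the same recursion. It is not the course's ADT: `TreeNode`, the plain BST `insert`, `buildBalanced`, and `height` are stand-ins written for this example, the data is assumed to be integers already in sorted order, and the nodes are never freed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct TreeNode {
    int       data;
    TreeNode* left  = nullptr;
    TreeNode* right = nullptr;
    explicit TreeNode(int d) : data(d) {}
};

// Plain (non-balancing) BST insert, standing in for the slides' insert().
TreeNode* insert(TreeNode* root, int value) {
    if (root == nullptr) return new TreeNode(value);
    if (value < root->data) root->left  = insert(root->left, value);
    else                    root->right = insert(root->right, value);
    return root;
}

// Insert the middle item first, then recurse on each half of the array.
TreeNode* balance(const std::vector<int>& data, int first, int last, TreeNode* root) {
    if (first <= last) {
        int middle = first + (last - first) / 2;
        root = insert(root, data[middle]);
        root = balance(data, first, middle - 1, root);
        root = balance(data, middle + 1, last, root);
    }
    return root;
}

// Entry point: the array must already be sorted (e.g. by an inorder traversal).
TreeNode* buildBalanced(const std::vector<int>& sorted) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    return balance(sorted, 0, static_cast<int>(sorted.size()) - 1, nullptr);
}

int height(const TreeNode* node) {
    return node == nullptr ? 0 : 1 + std::max(height(node->left), height(node->right));
}
```

Given 10, 20, 30, 40, 50, 60, 70, the root is 40 with children 20 and 60, and the tree has the minimum height of 3.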

Balancing Algorithms

The “brute force” technique has a serious drawback

all of the data must be put in an array before a balanced tree can be created

what would happen if you weren’t using pointers to the data but instances of the data?

if an unbalanced tree is not used (i.e., the data is directly inserted into the array from the client), then a sorting algorithm must be used and fixed size issues arise

AVL Trees

The AVL tree is a classical method proposed by Adelson-Velskii and Landis

creates an “admissible tree” (its original name!)

focuses on rebalancing the tree locally to the portion of the tree affected by insertion and deletions

it allows the height of the left and right subtrees of every node to differ by at most one

AVL Trees

With AVL trees

each node must now keep track of its “balance factor”, which records the difference between the heights of its left and right subtrees

the balance factor is the height of the right subtree minus the height of the left subtree

all balance factors must be +1, 0, or -1

notice, this does meet the definition we learned about for a balanced tree
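
The definition can be checked directly with a short sketch. Note the assumption: this version recomputes heights from scratch for clarity, whereas a real AVL tree would store the balance factor (or the height) in each node. `AvlNode`, `height`, `balanceFactor`, and `isAvl` are names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>

struct AvlNode {
    int      key;
    AvlNode* left;
    AvlNode* right;
};

// Height of an empty subtree is 0; a single node has height 1.
int height(const AvlNode* n) {
    return n == nullptr ? 0 : 1 + std::max(height(n->left), height(n->right));
}

// Balance factor as defined above: right height minus left height.
int balanceFactor(const AvlNode* n) {
    assert(n != nullptr);
    return height(n->right) - height(n->left);
}

// True when every node's balance factor is +1, 0, or -1.
bool isAvl(const AvlNode* n) {
    if (n == nullptr) return true;
    const int bf = balanceFactor(n);
    return bf >= -1 && bf <= 1 && isAvl(n->left) && isAvl(n->right);
}
```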

AVL Trees

However, the concept of AVL trees always implicitly includes the techniques for balancing the tree

and it does not guarantee that the resulting tree is perfectly balanced (unlike all of the other techniques we have seen so far)

but, an AVL tree can be searched almost as efficiently as a minimum height binary search tree

but insert and removal are not as efficient

AVL Trees

AVL trees actually maintain the height close to minimum by monitoring the shape of the tree as you insert and delete

After you insert/delete

the tree is checked to see if the subtrees of any node differ by more than 1 in height

if it does, you rearrange the nodes to restore balance

But, as you can guess, we can’t arbitrarily rearrange nodes....we must keep proper order

AVL Trees

What we do is rotate the tree to make it balanced

Rotations are not necessary after every insertion & deletion (they are only needed when the heights differ by more than 1)

experiments indicate that 78% of deletions require no rebalancing

but only 53% of insertions leave the tree in balance

AVL Trees

Single rotation is one type of rotation:

In the following, the tree was fine after inserting 20, 10, 40, 30, 50...but when 60 is inserted...

AVL Trees

Start at the node inserted...move up the tree (recursively return)

examining the balancing factor

stop when it is not +1, 0, or -1 and rotate from the “heavy” side to the “light”
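
For the 20, 10, 40, 30, 50, 60 example, the heavy side is the right, so the fix is a single left rotation at 20. A minimal sketch of that rotation is below; the `Node` struct and the name `rotateLeft` are assumptions, and updating stored balance factors is left out.

```cpp
#include <cassert>

struct Node {
    int   key;
    Node* left;
    Node* right;
};

// Single left rotation for a right-heavy subtree: the right child becomes
// the new subtree root and the old root becomes its left child.
Node* rotateLeft(Node* root) {
    assert(root != nullptr && root->right != nullptr);
    Node* pivot = root->right;
    root->right = pivot->left;  // pivot's old left subtree moves under the old root
    pivot->left = root;
    return pivot;               // the caller attaches this where the old root was
}
```

After the rotation 40 is the root, 20 keeps 10 on its left and picks up 30 on its right, and 50 (with 60 below it) stays on the right, so the ordering of the keys is preserved.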

AVL Trees

If a single rotation does not create a balanced tree

then a double rotation is required

first rotate the subtree rooted at the heavy child of the node where the problem occurred

and then rotate at that node itself

there is, however, one special case:

AVL Trees

In class, walk through a few examples on your own (and then on the board) building AVL trees

so you can understand the process of rotations

insert: 50,60,30,70,55,20,52,65,40

or, insert: 10, 20, 30, 40, 50, 60, 70, 80

what would the corresponding BST and 2-3 tree look like?
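
If you want to check your hand-built trees, the sketch below is one possible AVL insertion. It is not the only way to write it: it stores a height in each node rather than a balance factor, sends duplicate keys to the right, never frees nodes, and every name in it (`Node`, `update`, `rotateLeft`, `rotateRight`, `insert`, `build`) is an assumption made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>

struct Node {
    int   key;
    int   height = 1;       // height of the subtree rooted at this node
    Node* left   = nullptr;
    Node* right  = nullptr;
    explicit Node(int k) : key(k) {}
};

int height(const Node* n) { return n ? n->height : 0; }
void update(Node* n) { n->height = 1 + std::max(height(n->left), height(n->right)); }
// Right height minus left height, as defined earlier in the lecture.
int balanceFactor(const Node* n) { return height(n->right) - height(n->left); }

Node* rotateLeft(Node* x) {
    assert(x && x->right);
    Node* y = x->right;
    x->right = y->left;
    y->left = x;
    update(x);   // x is now below y, so refresh it first
    update(y);
    return y;
}

Node* rotateRight(Node* y) {
    assert(y && y->left);
    Node* x = y->left;
    y->left = x->right;
    x->right = y;
    update(y);   // y is now below x, so refresh it first
    update(x);
    return x;
}

// Insert as in a BST, then rebalance each node while returning up the recursion.
Node* insert(Node* n, int key) {
    if (n == nullptr) return new Node(key);
    if (key < n->key) n->left  = insert(n->left, key);
    else              n->right = insert(n->right, key);
    update(n);
    const int bf = balanceFactor(n);
    if (bf > 1) {                           // right-heavy
        if (balanceFactor(n->right) < 0)    // right-left shape: double rotation
            n->right = rotateRight(n->right);
        return rotateLeft(n);
    }
    if (bf < -1) {                          // left-heavy
        if (balanceFactor(n->left) > 0)     // left-right shape: double rotation
            n->left = rotateLeft(n->left);
        return rotateRight(n);
    }
    return n;
}

Node* build(std::initializer_list<int> keys) {
    Node* root = nullptr;
    for (int k : keys) root = insert(root, k);
    return root;
}
```

Building from 10, 20, 30, 40, 50, 60, 70, 80 ends with 40 at the root and a height of 4, where a plain BST would degenerate into a chain of height 8.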

AVL Trees

The main question you should be facing with an AVL tree is

whether or not such restructuring is always necessary

binary search trees are used to insert, retrieve, and delete elements quickly, and the speed of performing these operations is the issue, not the shape of the tree

performance can be improved by balancing the tree but luckily this is not the only method available

2-3-4 and red-black Trees

Now let’s go back and rethink how we organize our nodes

maybe instead of trying to rebalance the tree we keep the tree balanced at all times (perfectly balanced)

but the 2-3 tree had a flaw in that there may be situations where each node is “full” requiring a rippling effect of nodes being split as you recursively return back to the root

2-3-4 and red-black Trees

A 2-3-4 tree solves this problem

which allows 4-nodes, which are nodes that have 3 pieces of data and 4 children

each insertion and deletion can have fewer steps than are required by a 2-3 tree (when looking at the insertions/deletions in isolation)

but does this by using more memory

essentially, each node can have 1,2, or 3 pieces of data, and 4 child pointers!!!!!

2-3-4 and red-black Trees

In a 2-3-4 tree,

a node can either be a leaf or,

if it has 1 data item there are 2 children,

2 data items has 3 children, and

3 data items has 4 children

A 2-3-4 tree remains perfectly balanced

but its insertion algorithm splits the nodes as it traverses down the tree toward a leaf, rather than upon the return to the root

2-3-4 and red-black Trees

As you travel down the tree to insert a data item,

if you encounter a node with 3 pieces of data you immediately split the node at that time (just as we did with a 2-3 tree...but now we don’t use the new data we are trying to insert...because we haven’t inserted it yet!)

then, you continue traveling towards a leaf to insert the data
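
The split performed on the way down can be sketched as follows. This covers only a full child whose parent has room, which is exactly the situation the top-down rule guarantees; splitting a full root (which adds a level) is not shown. `Node234` and `splitChild` are hypothetical names, and the new node is never freed.

```cpp
#include <cassert>

// Hypothetical 2-3-4 node: up to 3 items and up to 4 child pointers.
struct Node234 {
    int      count = 0;       // number of items in use (1 to 3)
    int      items[3] = {};
    Node234* child[4] = {};
};

// Split the full node at parent->child[i]. The parent must not be full,
// because it would already have been split on the way down.
void splitChild(Node234* parent, int i) {
    Node234* full = parent->child[i];
    assert(parent->count < 3 && full != nullptr && full->count == 3);

    Node234* right = new Node234;        // receives the largest item
    right->count = 1;
    right->items[0] = full->items[2];
    right->child[0] = full->child[2];
    right->child[1] = full->child[3];

    // Shift the parent's items and children right to open up slot i.
    for (int j = parent->count; j > i; --j) {
        parent->items[j] = parent->items[j - 1];
        parent->child[j + 1] = parent->child[j];
    }
    parent->items[i] = full->items[1];   // the middle item moves up
    parent->child[i + 1] = right;
    ++parent->count;

    full->count = 1;                     // keeps only the smallest item
    full->child[2] = full->child[3] = nullptr;
}
```

Notice that nothing here depends on the value being inserted: the split uses only the three items already in the node.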

2-3-4 and red-black Trees

What this means is that you never descend from a node that still holds 3 pieces of data; such a node is split before you move on.

In fact, on insert, once you insert data at a leaf it is guaranteed that the leaf’s parent will not have 3 pieces of data...

because if it did, it would have split on the way to find the leaf!

2-3-4 and red-black Trees

The advantage of both the 2-3 and 2-3-4 trees

is that their balance is easy to maintain (not that their height is shorter, since the extra comparisons offset that)

where the 2-3-4 tree has an advantage is that the insertion/deletion algorithms require only one pass through the tree so they are simpler than those for a 2-3 tree

this decrease in effort makes them attractive

2-3-4 and red-black Trees

On the other hand, 2-3-4 trees require more storage than a binary search tree

and more storage (and less efficiently used storage) than a 2-3 tree

But, a binary search tree may be inappropriate

because it may not be balanced

so we use a red-black tree which is a special binary search tree

2-3-4 and red-black Trees

A red-black tree is a BST representation of a 2-3-4 tree with 2 extra fields in each node to record whether each child pointer connects items within the same 2-3-4 node or leads to a child node

it retains the advantages of a 2-3-4 tree without the storage overhead!

with all of the benefits of a binary search tree and none of the drawbacks!

2-3-4 and red-black Trees

The idea is to represent a node with 2 pieces of data and 3 children as a binary search tree with red and black child pointers

2-3-4 and red-black Trees

And, we represent a node with 3 pieces of data and 4 children as a binary search tree with red and black child pointers
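
Both representations can be sketched with one colour per child pointer. The sketch assumes the usual textbook convention that a red pointer joins items from the same 2-3-4 node and a black pointer leads to a real child; it builds only the items of a single node (the black pointers to its children are left null), and `RBNode`, `Color`, `fromThreeNode`, and `fromFourNode` are names invented for the example.

```cpp
#include <cassert>

enum Color { RED, BLACK };

// Binary node with one colour per child pointer (the 2 extra fields).
struct RBNode {
    int     key;
    RBNode* left       = nullptr;
    RBNode* right      = nullptr;
    Color   leftColor  = BLACK;
    Color   rightColor = BLACK;
    explicit RBNode(int k) : key(k) {}
};

// Node with 2 items (a, b): one of the two possible shapes, with a on top
// and a red right pointer to b. The mirror image (b on top) is equally valid.
RBNode* fromThreeNode(int a, int b) {
    assert(a <= b);
    RBNode* top = new RBNode(a);
    top->right = new RBNode(b);
    top->rightColor = RED;
    return top;
}

// Node with 3 items (a, b, c): the middle item on top, red pointers to both sides.
RBNode* fromFourNode(int a, int b, int c) {
    assert(a <= b && b <= c);
    RBNode* top = new RBNode(b);
    top->left = new RBNode(a);
    top->right = new RBNode(c);
    top->leftColor = top->rightColor = RED;
    return top;
}
```

Because each result is an ordinary binary search tree, the usual BST search works on it unchanged; the colours matter only to insertion and deletion.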

2-3-4 and red-black Trees

In class, walk through examples of

2-3

2-3-4

AVL

BST

and see how you can take a 2-3-4 tree and turn it into a red-black tree (make sure to read the chapter on advanced trees!!!)

2-3-4 and red-black Trees

For next time,

practice creating each of these trees on your own so that you understand the insertion algorithms

think about what would be needed to remove nodes from these trees

try deleting a leaf and an internal node from your 2-3, AVL, and 2-3-4 trees

