A Strong Formal Model of Linux-Kernel Memory Ordering[LWN.net]

October 29, 2016

This article was contributed by Jade Alglave, Paul E. McKenney, Alan Stern, Luc Maranget, Andrea Parri and TBD

Introduction

This article is organized as follows, with the intended audience for each section in parentheses:

Introduction to the Linux-Kernel Memory Models (people interested in understanding the memory model).
Design of the Strong Model (people interested in the ideas behind the formal memory model).
Strong-Model Bell File (masochists and other people interested in a deep understanding of the Linux-kernel memory model).
Strong-Model Cat File (masochists and other people interested in a deep understanding of the Linux-kernel memory model).
(More TBD.)

This is followed by the inevitable answers to the quick quizzes.

Introduction to the Linux-Kernel Memory Models

This section is mostly concerned with the strong memory model. The weak (but still fairly strong as these things go) model is derived from the strong one by relaxing several of the less-important constraints.

The strong Linux-kernel memory model started out as an operational model, based on the PPCMEM model for PowerPC as presented in two papers (“Understanding POWER Multiprocessors” [pdf] and “Synchronising C/C++ and POWER” [pdf]) by Susmit Sarkar, Peter Sewell, and others. Our model was a modified version of theirs, changed to take into account the requirements of the kernel. herd-style Bell and Cat files were developed as a formal axiomatization of this model. The model then was modified to handle the peculiarities of the DEC Alpha, and the Cat file was modified accordingly.

Some time later we incorporated ideas from the Flowing and POP models for ARM, as presented in “Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA” [pdf] by Shaked Flur, Peter Sewell, and others (together with the supplementary material [pdf]). This design proved to be so different from the PowerPC-oriented operational model that there was no reasonable way to unify the two. Instead, we abandoned our operational model and concentrated on the formal herd model, weakening it so that it would accept litmus tests allowed by the ARM architecture as defined by the Flowing model. Nevertheless, the original operational model offers a very good basis for understanding our formal model and so we present it here, along with a discussion of the changes needed to adapt it to Alpha and the issues raised by ARM.

The operational model divides a computer system into two parts: the processors (or CPUs), which execute instructions, and the memory subsystem, which propagates information about writes and barriers among the CPUs and is also responsible for determining the coherence order. When a CPU executes a write or certain kinds of barriers, it tells the memory subsytem. And when a CPU needs to load a value from memory or cache to execute a read, it asks the memory subsystem to provide the value.

The Processor Subsystem

Although the underlying operations involved in executing an instruction on a modern CPU can be quite complicated, nevertheless there always comes a point where the CPU has finished evaluating all of the instruction's inputs and outputs and commits itself irrevocably to using those values. Conceptually, each instruction that gets executed is committed at a single, precise moment in time. (Instructions that don't get executed, such as those started speculatively in what turns out to be the unused arm of a conditional branch, are not committed.)

Instructions may commit in any order and at any rate, subject to certain constraints. For example, an instruction controlled by a conditional branch can't be committed before the branch is, because until that time the CPU hasn't decided for certain whether the branch will be taken. The full set of constraints on the order of instruction execution (which is almost but not quite the same as comittal, differing only for read instructions) is listed below. For instructions involving only quantities that are local to the CPU, such as those computing register-to-register arithmetic, that's all there is to it: The CPU carries out the operations required by the instruction when the inputs are available, eventually commits to the result, and moves on. But some instructions need more. In particular, some require the CPU to communicate with the memory subsystem.

Writes and memory barriers are the simplest case. When a CPU commits a write instruction, it tells the memory subsystem the target address of the write and the value to be stored there. It can't do this before the write commits, because once the information has been sent to the memory subsystem there's no way to take it back. Similarly, when a CPU commits one of the barriers that affect write-propagation order, it informs the memory subsystem, which then uses that information to control the way writes get propagated.

Quick Quiz 1: But couldn't a CPU designer create a memory subsystem that did allow writes to be taken back?
Answer

Reads are more complicated. When a CPU starts to execute a read instruction, it first has to calculate the target address, which may involve adding index or base register values to a constant offset. It then checks to see if the most recent write (in program order) to that target address is still uncommitted; if it is then the CPU takes the value to be stored by that write and uses it as the value for the read. This is called store forwarding, and it is a form of out-of-order execution (the read can be committed before the program-earlier write). But if there was no prior write to that address or the most recent one has already been committed, then the CPU has to ask the memory subsystem to retrieve the value at the target address. Either way, we say that the read is satisfied, and this also takes place at a precise moment in time. A read instruction cannot commit until it has been satisfied.

There's more to it than that, however. The act of satisfying a read is not irrevocable. It may turn out, for example, that the values used in calculating the target address were themselves not yet committed and hence are still subject to change. If that happens, the read instruction will need to be restarted: The target address must be recalculated and the read must be satisfied again. This can happen several times before the read is committed. In fact, it can even happen several times without the read ever being committed, if the read was started speculatively and then abandoned. Of course, once a read has been committed then it can no longer be restarted.

Thus, a CPU carries out a read instruction by satisfying it (perhaps more than once) and eventually committing it. For most other instruction types, execution only involves committing the instruction, but there is one exception. A strong memory barrier (such as smp_mb()) is not finished when it commits. Instead, the CPU has to wait for the strong barrier to be acknowledged by the memory subsystem. This doesn't happen until the memory subsystem has propagated the barrier to all the other CPUs in the system, and the CPU is not allowed to begin executing any instructions that come after the strong barrier in program order until then. This is what makes these barriers so strong (and so slow!).

Quick Quiz 2: Why can't CPU designers use speculation to hide the slowness of strong barriers?
Answer

The Memory Subsystem

The memory subsystem accepts write, read, and barrier requests from the CPUs. It propagates the write and barrier requests to all the CPUs in the system and provides responses to read requests. It also determines the coherence order of all writes to each variable and provides a mechanism for making certain operations atomic.

Handling read requests is quite simple. When a CPU submits a read request for a specified target address, the memory subsystem finds the latest write (in the target address's coherence order) that has propagated to the CPU and returns the value stored by that write. This means, among other things, that a CPU cannot read from a write until the write has propagated to that CPU, as you would expect. It's important that the write be the coherence-latest; otherwise the system could violate the read-read coherence rule if a po- (program order-) earlier read had already read from the coherence-latest write.

Quick Quiz 3: Given that Itanium can reorder reads to the same variable, how can the Linux-kernel memory model reasonably prohibit such reordering?
Answer

Accepting a write request from a CPU is a little more complicated. To begin with, the memory subsystem has to decide where the write will fit into the coherence order for the target address. Specifically, it must ensure that the write is assigned to a position in the coherence order that is after any other writes to the same address which have already propagated to that CPU. This is necessary because the CPU might already have executed a po-earlier instruction which read from one of those other writes; if the new write were to be come before that other write in the coherence order then it would violate the read-write coherence rule.

In addition, the memory subsystem has to propagate the write to all the other CPUs (a write or barrier is considered to “propagate” to its own CPU at the time it is committed) and to the coherence point. The coherence point is a notional place in the system where writes and barriers get sent, in much the same way that they are propagated to CPUs. If you like, you can think of the coherence point as being the place where writes finally pass out of all the internal caches and buffers, down into memory for storage. The key aspect of the coherence point is that different writes to the same address arrive at the coherence point in their coherence order. In effect, the order of their arrival at the coherence point defines the coherence order. Whether this is because the memory subsystem first decides on a coherence order and then sends writes to the coherence point in that order, or because it sends writes willy-nilly to the coherence point and then uses the order of their arrival as the coherence order, doesn't matter. What does matter is that once a write has reached the coherence point, its position in the coherence order is fixed; it is impossible for any future writes to be assigned an earlier position.

Quick Quiz 4: Isn't this single coherence point a huge bottleneck on large systems?
Answer

This fact is crucial for atomic operations. The memory model represents an atomic read-modify-write (RMW) operation as two events: a read R followed by a write W (or conditionally followed by a write, for operations like cmpxchg()). What makes the operation atomic is that no other writes to that address, from any CPU, are allowed to intervene between R and W. In other words, the memory subsystem guarantees that the write immediately preceding W in the coherence order is the write which R reads from. The operational model specifies that it does this, in part, by arranging for W to reach the coherence point at the time when it commits (as opposed to some arbitrarily later time, like an ordinary write). As a result, no future write will be able to sneak in before W in the coherence order, and avoiding such “sneak writes” is what enforces the atomic property.

Quick Quiz 5: But how could the system possibly prevent some other write on some other CPU from taking place between the time the RMW's read and write execute? Is there some Big System Lock implemented in hardware that will totally destroy scalability???
Answer

Other than these requirements and the constraints imposed by memory barriers, the order in which writes are propagated to CPUs and reach the coherence point is unrestricted. In particular, these orders don't have to bear any resemblance to the order in which the write requests were originally sent to the memory subsystem. It's entirely possible for two CPUs to write to the same address at different times and have the second write come before the first in the coherence order or be propagated before the first to a third CPU.

Memory barriers

The kernel memory model has two broad categories of memory barriers: those whose effects are entirely local to a single CPU and those that interact with the memory subsystem. Barriers in the first category can only constrain the order in which the CPU executes instructions, whereas those in the second group can also affect the order of propagation of writes.

Associated with each memory barrier are two sets of instructions, called the barrier's pre-set and its post-set. These sets vary according to the type of barrier, but all barriers other than the three “read-dependency” barriers share these features:

A barrier cannot commit until every instruction in its pre-set has committed, and
An instruction in the barrier's post-set cannot commit or be satisfied until the barrier has committed.

The various barriers included in this memory model, and their types and pre- and post-sets, are listed in the following table in order of increasing strength. (Contrary to what you might expect, smp_load_acquire(), rcu_dereference(), and lockless_dereference() are represented in the memory subsystem as a read request followed by a separate barrier request. Similarly, smp_store_release() and rcu_assign_pointer() are represented as a barrier request followed by a separate write request. Thus, each requires the CPU to issue two requests to the memory subsystem.)

Barrier	Type	Pre-set	Post-set
rcu_dereference(), lockless_dereference()	Read-dependency	itself	all po-later reads with a dependency from this read
smp_read_barrier_depends()	Read-dependency	all po-earlier reads	all po-later reads with a dependency from a po-earlier read
`smp_load_acquire()`	Execution-order	itself	all po-later memory accesses
`smp_rmb()`	Execution-order	all po-earlier reads	all po-later reads
`smp_wmb()`	B-cumulative	all po-earlier writes	all po-later writes (*)
smp_store_release(), rcu_assign_pointer()	A-cumulative (**)	all po-earlier memory accesses (*)	itself and members of its release sequence
smp_mb(), synchronize_rcu()	Strong (A- and B-cumulative)	all po-earlier memory accesses (*)	all po-later memory accesses (*)

(*) as modified by the cumulativity requirements described below.
(**) also B-cumulative when read by a load-acquire, as described here.

Quick Quiz 6: The terms “A-cumulativity” and “B-cumulativity” aren't particularly mnemonic, are they?
Answer

The read-dependency and execution-order barriers are purely local to their own CPU. (In fact, in this model the read-dependency barriers have no effect at all. We will see later that they matter only when the model is altered to work with the DEC Alpha.) However, when the CPU commits one of the others, collectively referred to as “propagation-order” barriers, it informs the memory subsystem about the barrier, and the memory subsystem propagates the barrier to all the other CPUs. This is where the barrier's propagation ordering effects come into play:

The memory subsystem will not propagate a barrier to a CPU until all the writes in the barrier's pre-set have been propagated to that CPU.
The memory subsystem will not propagate a write in a barrier's post-set to a CPU until the barrier has been propagated to that CPU.

The same is true for the order in which writes and barriers reach the coherence point; in this respect barriers treat the coherence point much like another CPU. In addition, the memory subsystem does not acknowledge a strong barrier until the barrier has been propagated to every CPU and has reached the coherence point (and as mentioned above, the CPU will not satisfy or commit any instructions po-after a strong barrier until the barrier has been acknowledged).

The propagation-order barriers enjoy varying degrees of cumulativity. This means that the barriers affect the order of propagations, not just of writes issued by the barrier's own CPU, but also of writes issued by other CPUs. In effect, the barriers' pre- and post-sets are enlarged:

The pre-set of an A-cumulative barrier includes all writes that have propagated to the barrier's CPU before the barrier is committed.

The post-set of a B-cumulative barrier includes all memory accesses on another CPU that execute after the barrier has propagated to that CPU.

Quick Quiz 7: By symmetry, shouldn't a B-cumulative barrier's post-set include all writes that propagate to the barrier's CPU after the barrier is committed?
Answer

The memory model adopts the idea of release sequences (slightly altered) from C11. For any store-release instruction (such as smp_store_release(), rcu_assign_pointer(), or xchg_release()), the release sequence headed by that instruction includes the instruction itself as well as all po-later writes to the same address. The release sequence also includes, recursively, any atomic RMW instruction accessing the same address, on any CPU, that reads from a write in the release sequence. Every write in the release sequence belongs to the associated barrier's post-set.

Some examples of cumulativity and release sequences are presented below.

To see how this works out in practice, consider this litmus test (an example of the “Store-Buffering” pattern):

Strong Model Litmus Test #1

  1 C C-SB+o-mb-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   int r1;
  9
 10   WRITE_ONCE(*x, 1);
 11   smp_mb();
 12   r1 = READ_ONCE(*y);
 13 }
 14
 15 P1(int *x, int *y)
 16 {
 17   int r2;
 18
 19   WRITE_ONCE(*y, 1);
 20   smp_mb();
 21   r2 = READ_ONCE(*x);
 22 }
 23
 24 exists
 25 (0:r1=0 /\ 1:r2=0)

Using the features of the memory model already discussed, we can show that this test's “exists” condition will never be satisfied.

When the test is executed, one of the two memory barriers must be acknowledged before the other (they are both strong barriers). Suppose the barrier in P0 gets acknowledged first. Then the following events have to occur in the order listed, for the reasons shown:

P0's write to x propagates to P1 before P0's memory barrier does, because the write is in the barrier's pre-set.
P0's memory barrier propagates to P1 before it is acknowledged, because a strong memory barrier is not acknowledged until it has propagated to every CPU.
P0's memory barrier is acknowledged before P1's barrier, by assumption.
P1's read of x is satisfied after P1's barrier is acknowledged, because the read comes after the barrier in program order.

Hence the write to x propagates to P1 before P1's read is satisfied. Since that write is the last one in the coherence order for x, it is the one that will be used to satisfy the read. Therefore r2 will end up equal to 1, not 0. The opposite case (where P1's barrier is acknowledged first) is symmetrical. In neither case is it possible for r1 and r2 both to be equal to 0.

The strong model agrees:

Outcome for Strong Model Litmus Test #1

 1 Test C-SB+o-mb-o+o-mb-o Allowed
 2 States 3
 3 0:r1=0; 1:r2=1;
 4 0:r1=1; 1:r2=0;
 5 0:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (0:r1=0 /\ 1:r2=0)
10 Observation C-SB+o-mb-o+o-mb-o Never 0 3
11 Hash=a61f698662bb72c2ed1755812580d385

Processor-Local Ordering Requirements

While executing instructions, a CPU observes various ordering requirements. Some of these are obvious (an instruction can't be executed before the CPU knows how or whether to execute it, as mentioned earlier). Others are less obvious but are necessary to avoid violating the four coherence rules.

The first and simplest requirement is that a CPU will not commit an instruction that is po-after a conditional branch until the branch itself is committed. At the hardware level this is true even for trivial conditionals; CPUs do not recognize that expressions like “x == x” must always hold. The situation in higher-level languages is not as simple, because optimizing compilers do recognize such things. They will happily eliminate the unnecessary test and conditional jump from the object code entirely, leaving no ordering requirement for the CPU to obey at runtime.

Quick Quiz 8: Given the steadily increasing number of transistors, why couldn't a CPU analyze code to detect at least some classes of “x == x” comparisons?
Answer

Compilers also realize that code following the end of an “if (...) then {...} else {...}” statement will be executed regardless of which branch is taken, and they are free to move such code up before the start of the “if” in the absence of any reason not to. Therefore our memory model applies this ordering requirement only to the instructions that are within the “then {...} else {...}” branches of the conditional, i.e., those that are directly under its control. These instructions cannot be committed before the instructions that compute the “if” condition and decide whether to take the branch.

The memory model has a weakness in this area. If both branches of an “if” statement write the same value to the same variable, an optimizing compiler may replace both writes with a single write that is executed before the “if” statement branches. This resulting single write then would not be subject to the ordering requirement at runtime, even though the model says it should be.

(We must emphasize that this requirement applies only to committing instructions. Reads following a conditional branch can be satisfied before the branch is committed, and they often are. Even if a read belongs to the arm of the branch that is ultimately not taken, the CPU may speculatively start executing the read before it knows which way the branch will go. Although all architectures have barriers that can prevent speculative reads, such as the isb and isync instructions on ARM and PowerPC, respectively, the Linux kernel does not include any facility specifically meant for this purpose. If you really want to prevent a read in a conditional branch from being satisfied speculatively, you can always use smp_rmb().)

The preceding requirement was about whether a CPU should execute an instruction. The next ordering requirement is about how an instruction should be executed. If an instruction has a dependency from a po-earlier read, then the instruction cannot commit until the read does. This is simply because the address or data the instruction will use isn't irrevocable until the earlier read commits.

The next pair of requirements affect only writes. A write cannot commit until all sources of address dependencies to a memory-access instruction po-before the write have committed. This is necessary because the po-earlier memory access might generate an invalid-address exception, in which case the write should not be executed. However, the CPU can't know whether the earlier access's target address will turn out to be invalid until the target address is fully determined, which means that all sources of an address dependency have to be committed.

Furthermore, a write cannot commit until all po-earlier instructions accessing the same address have committed. This requirement is necessary to enforce the coherence rules. If two writes to the same address committed and were sent to the memory subsystem out of order, the memory subsystem would put the po-earlier write later in the address's coherence order, which would violate write-write coherence. And if a write committed before a po-earlier read of the same address was satisfied, the memory subsystem would use the value stored by the write (or something even later in the coherence order) to satisfy the read request, which would violate read-write coherence. But since the po-earlier read may be restarted (and thus satisfied again) at any time up until it commits, this means the write must not commit until the read is committed.

The next three requirements are concerned with restarting reads. A read instruction R must be restarted after:

A po-earlier read R' is satisfied, where R' is the source of an address dependency to R, or the source of an address or data dependency to a write that was forwarded to R.
A po-earlier read R' of the same address is satisfied, unless R read from the same write as R' or was forwarded from a write that is po-after R'.
A po-earlier write W to the same address is committed, unless R was forwarded from W or from a write that is po-after W.

It follows that R cannot commit until each of the R' or W accesses mentioned here has committed (because until then, R' might restart and thus be satisfied again or W might commit, requiring R to restart). The reason for the first case is pretty obvious; the other two are more obscure.

For the second case, suppose R' is po-before R and they read from different writes, W' and W respectively. Assuming W was not forwarded to R, this means that W was the most recent write in the coherence order to have propagated to the CPU as of the time when R was satisfied. Similarly, either R' was forwarded from W' or else W' was the most recent write in the coherence order to have propagated to the CPU as of the time when R' was satisfied. But if R' was satisfied after R, then W' must come after W in the coherence order—if it came before then R' would have read from W instead of W'. (We can discount the possibility that W' was forwarded to R'; if it had been then it or a write po-after R' would been forwarded to R.) But W' being coherence-later than W would be a violation of read-read coherence. Thus R' must be satisfied before R, and the only way to guarantee this is to require that R be restarted after R' is satisfied.

For the third case, suppose W was the last write before R (in program order) to access the same address, and R was satisfied before W committed but was not forwarded from it. (This could happen if W's target address had not yet been been determined at the time R was satisfied.) Then R would have either read from some other write W' which had already propagated to the CPU, or else been forwarded from some other write W' that was po-earlier than W. Either way, when W did commit later on, it would be assigned a position in the coherence order after W'. This would violate write-read coherence. Thus R must be satisfied at some time after W commits, and the only way to guarantee this is to require that R be restarted when W commits.

Finally, the memory model assumes that the resources required to carry out an atomic RMW operation (bus locks, reservation registers, or whatever) are limited, and consequently a CPU cannot execute two such operations concurrently or out of program order. Thus, the read part of an RMW instruction cannot be satisfied until all po-earlier RMW instructions have committed. Likewise, the write part cannot commit until the read part has committed.

When we express these ordering requirements in the memory model, it turns out that when a read instruction commits is relatively unimportant; what really matters is when the read is satisfied for the last time. Therefore we will say that a read executes when it is last satisfied, whereas all other instructions execute when they commit. In these terms, the ordering requirements take the following form.

Let A and B be instructions with A before B in program order. Then A must execute before B if any of the following hold:

A is a conditional branch and B is a write instruction controlled by A.

There is a dependency from A to B.

There is a dependency from A to a write that is forwarded to B.

B is a write and A is the source of an address dependency to a memory access instruction between them.

B is a write and A accesses the same address as B.

A and B are reads of the same address, and B does not read from the same write as A and is not forwarded from a write that is between them.

A is a write and B is a read of the same address, and B is not forwarded from A or from a write that is between them.

B is a barrier and A is in its pre-set.

A is a barrier and B is in its post-set.

A and B are both atomic RMW instructions.

(Taken together, requirements 8 and 9 say that A must execute before B whenever they are separated in program order by a suitable barrier, such that A is in the barrier's pre-set and B is in the barrier's post-set.)

Collectively these are called Preserved Program Order (PPO) requirements, because they force the CPU to execute certain instructions in program order. The PPO requirements were expressed almost verbatim in the original version of the strong kernel memory model. Together with the description of how the memory subsystem works, they suffice to guarantee that the four coherence rules will always be obeyed. The resulting model applied quite nicely to x86, Sparc, and PowerPC; however, it was not accurate for some other architectures.

Quick Quiz 9: Given all these constraints, how can weak-memory CPUs possibly expect to attain any benefits of any sort compared to strong-memory CPUs?
Answer

Adjustments for the DEC Alpha

Alpha deviates from the operational model described above in one very significant way: It uses a software-visible split data cache. In terms of the memory model, this means that writes which propagate to a CPU in one order might be perceived by that CPU in the opposite order. For example, suppose that P0 writes to both x and y, and the writes propagate in that order to P1. If x and y happen to be located in different cache lines and the two cache lines are handled by different parts of P1's data cache, it may happen that the part of the cache responsible for handling the write to x is busy while another part of the cache is able to handle the write to y right away. P1's CPU would then see y's new value before seeing x's, a result that would not be allowed by the memory model as described above.

Here's an example litmus test to illustrate the point.

Strong Model Litmus Test #2

  1 C alpha-split-cache-example1
  2 {
  3 int u = 0;
  4 int v = 0;
  5 int *p = &u;
  6 }
  7
  8 P0(int **p, int *v)
  9 {
 10   WRITE_ONCE(*v, 1);
 11   smp_mb();
 12   WRITE_ONCE(*p, v);
 13 }
 14
 15 P1(int **p)
 16 {
 17   int *r1;
 18   int r2;
 19
 20   r1 = READ_ONCE(*p);
 21   r2 = READ_ONCE(*r1);
 22 }
 23
 24 exists (1:r1=v /\ 1:r2=0);

The smp_mb() in P0 forces the write to v to propagate to P1 before the write to p, and the address dependency from P1's read of p to its read of *r1 forces these reads to be executed in program order. Nevertheless, the split-cache arrangement may cause P1 to see the new value of p (i.e., &v) and then the old value of v (i.e., 0). This odd behavior can be observed on real Alpha hardware, and it shows up when we run the litmus test through the model:

Outcome for Strong Model Litmus Test #2

 1 Test alpha-split-cache-example1 Allowed
 2 States 3
 3 1:r1=u; 1:r2=0;
 4 1:r1=v; 1:r2=0;
 5 1:r1=v; 1:r2=1;
 6 Ok
 7 Witnesses
 8 Positive: 1 Negative: 2
 9 Condition exists (1:r1=v /\ 1:r2=0)
10 Observation alpha-split-cache-example1 Sometimes 1 2
11 Hash=b73c984509551a6a5ffe49d86c9a2d04

Inserting a call to smp_read_barrier_depends() (see line 21):

Strong Model Litmus Test #3

  1 C alpha-split-cache-example2
  2 {
  3 int u = 0;
  4 int v = 0;
  5 int *p = &u;
  6 }
  7
  8 P0(int **p, int *v)
  9 {
 10   WRITE_ONCE(*v, 1);
 11   smp_mb();
 12   WRITE_ONCE(*p, v);
 13 }
 14
 15 P1(int **p)
 16 {
 17   int *r1;
 18   int r2;
 19
 20   r1 = READ_ONCE(*p);
 21   smp_read_barrier_depends();
 22   r2 = READ_ONCE(*r1);
 23 }
 24
 25 exists (1:r1=v /\ 1:r2=0);

prevents the unwanted result:

Outcome for Strong Model Litmus Test #3

 1 Test alpha-split-cache-example2 Allowed
 2 States 2
 3 1:r1=u; 1:r2=0;
 4 1:r1=v; 1:r2=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (1:r1=v /\ 1:r2=0)
 9 Observation alpha-split-cache-example2 Never 0 2
10 Hash=9dbdccdc417b3ece717775094d822a49

The smp_read_barrier_depends() call forces P1 to wait until all writes that have already propagated to its cache have been fully handled and are available for reading. Thus, the new value of v is always visible to P1 whenever it sees &v in p.

In order to accomodate Alpha's unique behavior, we modified the memory model to include a delay between the time when a write propagates to a CPU and the time when the memory subsystem can use that write to satisfy a read request. Furthermore, instead of assuming a simple split, the model allows the cache to be completely fragmented, with an independent segment for each memory address.

The model introduces the notion of a horizon time. For each processor Pn, any memory address A, and any time t, the horizon time horiz(Pn,A,t) is the time h (at or before t) such that if Pn were to submit a read request for address A to the memory subsystem at time t, the response would be the coherence-latest write that executed on Pn or that propagated to Pn before time h. Writes to A that propagate to Pn after time h are considered still to be “below the horizon” at time t, and so are not visible and cannot be used for satisfying reads. (This restriction does not apply to writes executed by Pn itself; each processor can see its own writes at any time.)

The memory model requires that for each Pn and each address A, the value of horiz(Pn,A,t) must increase with t. This means a write cannot fall back below the horizon after it has become visible; otherwise we could have a violation of the read-read coherence rule. The model also requires that every memory barrier other than smp_wmb() behave like smp_read_barrier_depends(), in that it forces the CPU to wait for all writes that propagated to the CPU before the barrier committed to become visible. In other words, if Pn executes a memory barrier at time t then it will not execute any read instructions in the barrier's post-set until time t', where horiz(Pn,A,t') > t for all addresses A. If the barrier is a strong one, the CPU is required to wait for all writes that propagated to the CPU before the barrier was acknowledged to become visible.

For any read instruction R, we naturally go on to define horiz(R) (the horizon time for R) to be horiz(Pn,A,exec(R)) where Pn is the processor that executes R, A is the address that R accesses, and exec(R) is the time when R executes (is last satisfied). Thus, R will read from the coherence-latest write that has been executed by Pn or has propagated to Pn before horiz(R) (and of course, it is always true that horiz(R) ≤ exec(R)).

Adding the concept of horizon times complicates the ordering of reads. When we say that Pn orders read A before read B, we could now mean any of four things:

exec(A) ≤ horiz(B) (ordering requirements 8-10 above);
horiz(A) ≤ horiz(B) (requirement 6);
exec(A) ≤ exec(B) ((requirements 2, 3, and 6); or
horiz(A) ≤ exec(B).

For example, ordering requirement 6 is the case where A and B read from different writes to the same address, and B is not forwarded from a write that is between them. In this situation, the ordering requirement states that the CPU must execute B after it executes A, so exec(A) ≤ exec(B). But the read-read coherence rule says that the write which B reads from must come later in the coherence order than the write which A reads from. If B's write had propagated to the CPU before horiz(A) then A would have read from it, since reads take their value from the coherence-latest write available. Hence the write must have propagated to the CPU at a time after horiz(A) but before horiz(B) (otherwise B would not have read from it). This means it must also be true that horiz(A) ≤ horiz(B), thus explaining why ordering requirement 6 is listed twice.

The first ordering relation above is the strongest; it implies each of the others. Under most circumstances a read's horizon time is more useful than its execution time, so we will take the second alternative to be the standard meaning for ordering of reads. Nevertheless, the third alternative has its uses. Most notably, it is the ordering imposed by the CPU when there is an address dependency from one read to another but no memory barrier separating them. The fourth alternative is not used in the memory model.

The extended memory model reduces to the original if we assume the horizon times for different memory addresses are always equal, that is, horiz(Pn,A,t) is independent of A. Under this assumption the extended model will allow the same set of behaviors as the original, which is reassuring; it means the model still applies as before to architectures other than Alpha.

Quick Quiz 10: Given how important split caches are for attaining full performance on superscalar CPUs, why don't any non-Alpha architectures have split caches?
Answer

Adjustments for ARM

Unforunately, the memory model as developed above is not a very good fit for the ARM architecture. The published memory models for ARMv8 differ in a number of important respects from the model we have described so far. Most of the differences involve the memory subsystem, and most of the differences that affect the processor subsystem are concerned with how it interacts with the memory subsystem.

The difference with perhaps the most widespread ramifications involves how the memory subsystem responds to read requests. Earlier we said that the response would be the value stored by the coherence-latest write that has propagated to the CPU making the request, because otherwise the read might observe a coherence-earlier value than a po-earlier read of the same address did, violating the read-read coherence rule. But this requirement is stronger than necessary; all we really need is that the response to a read request should be coherence-later than (or the same as) any po-earlier read responses or committed writes for the same address. It doesn't have to be the very latest write available, and on ARM it often isn't.

Furthermore, the ARM memory model does not include any feature analogous to acknowledging a strong memory barrier. Instructions following such a barrier can be executed as soon as the barrier has been committed. These two facts have some rather subtle effects on the ordering properties of memory accesses. For example, either one of them invalidates the reasoning we used when analyzing Strong Model Litmus Test #1 above. The test remains forbidden even on ARM, but not for the reasons we gave.

Instead, the ARM memory model guarantees that if a write W reaches the coherence point before a strong (smp_mb()) barrier, then the response to any read that is po-after the barrier and accesses the same address as W will be the value stored by W or a coherence-later write. (No similar guarantee is made by the PowerPC-based memory model presented above, which is an indication of how much ARM differs from PowerPC.) This is enough to show that the “exists” clause in Strong Model Litmus Test #1 will never be satisfied.

When the test is executed, one of the two memory barriers must reach the coherence point before the other. Suppose the barrier in P0 gets there first. Since P0's write to x is in the barrier's pre-set, it will reach the coherence point before the barrier does and hence before P1's barrier does. Thus P1's read of x, which is po-after the barrier, is guaranteed to see P0's write (there aren't any coherence-later writes to x in the test program), and so r2 will end up equal to 1, not 0. As before, the opposite case is symmetrical.

Another difference concerns the way CPUs execute writes. On ARM, two writes to the same address are permitted to commit out of program order. Earlier we said that if this happened, it would cause the memory subsystem to put the po-later write earlier in the coherence order, thereby violating the write-write coherence rule. ARM gets around this problem in a very straightforward way: When a write W commits after a po-later write W' to the same address, the CPU simply skips sending W to the memory subsystem! As a result, W never gets assigned an explicit location in the coherence order (effectively, it ends up ordered immediately before the next write, in program order, to the same address), it never reaches the coherence point, and it never gets propagated to any other CPUs. We say that W has been obscured by W', or more colloquially, erased. Its effects don't disappear entirely, because the value stored by W can still be forwarded to reads that lie between W and W'. But this is the next best thing; from a system-wide standpoint, the end result is practically the same as if W' had followed so closely on the heels of W that W was overwritten before any other CPU had a chance to read from it.

There are certain circumstances in which a write W cannot be obscured. For example, if the CPU encounters a memory barrier that orders W before some other write W' to the same address, then W' cannot commit until after the barrier does, and the barrier cannot commit until after W does, so the writes cannot commit out of program order and W' will not obscure W. Also, the writes associated with store-release instructions (smp_store_release() or rcu_assign_pointer()) or atomic RMW instructions are not allowed to be obscured. More trivially, W will not be obscured if there are no po-later writes to the same address in its process to obscure it. Taking all of this into account will complicate the final memory model, as you might imagine.

There are some other, less important, differences in the operation of the CPU subsystem in the ARM model:

When a store-release instruction is committed, the CPU does not issue a barrier request followed by a write request; instead it issues a write request that is specially marked as being a store-release.
Similarly, a load-acquire instruction gives rise to a single read request that is specially marked as being a load-acquire.
A read instruction that is po-after a load-acquire is not obliged to wait until the load-acquire instruction has been committed; the CPU is allowed to issue the read's request any time after the load-acquire request has been issued, and the two of them may be satisfied out of order.
An smp_rmb() instruction does not act entirely within the CPU; it causes the CPU to issue a barrier request to the memory subsystem.
The rules for restarting a read instruction after a po-earlier instruction has accessed the same address are slightly looser; the read does not need to be restarted if it was issued after the po-earlier instruction was.
A write instruction that accesses the same address as a po-earlier read may be committed before the read is committed, provided the read has already been issued and the CPU knows that it will not be restarted.

Quick Quiz 11: If a later read instruction can be satisfied before a prior load-acquire instruction, how can the load-acquire possibly order the two instructions?
Answer

To better understand the ARM model, we must examine how the memory subsystem works in some detail. It is more highly structured than the memory subsystem in the PowerPC model, consisting of a hierarchical arrangement of buffers lying between the CPUs at the top and the memory at the bottom. There is a buffer immediately below each CPU; these feed down into some buffers below them, and then buffers below those, and so on, down to the lowermost buffer, which feeds into memory. The coherence point is the place at the bottom of the lowest buffer.

For example, a four-processor system would have four buffers at the topmost layer, then at the next layer there might be a buffer below CPUs 0 and 1 and another below CPUs 2 and 3, and a single buffer below all the CPUs in the final layer, as shown in this figure (panel A):

Other arrangements of buffers (such as that in panel B) are possible, provided they follow the general hierarchical arrangement: buffers always feed down, never up; a buffer can receive input from multiple buffers above it but can provide output only to a single buffer below; and there is a single lowermost buffer which all the others eventually lead to. The memory models do not specify the buffer sizes or topology, and in practice you cannot even rely on the topology remaining unchanged over time, because the scheduler can migrate a process from one CPU to another, effectively altering the arrangement of the buffers below that process.

(The essential difference between the Flowing and POP models is that the Flowing model assumes a fixed buffer topology, whereas the POP model does not keep explicit track of the buffers and thus is compatible with any arrangement. The POP model is more general, but the Flowing model is easier to reason about.)

A key point of this design is that the memory subsystem does not provide the response to a read request immediately. Issuing a read request and receiving the response (which is then used to satisfy the read) are two separate events, and the CPU is free to work on other instructions in between.

When a CPU issues a write, barrier, or read request, the request enters the CPU's buffer at the top, flows through the buffer and then down into the top of the buffer below, and so on, eventually passing out the bottom of the lowermost buffer, to memory. Thus, the coherence order is simply the order in which write requests flow down to memory. When a write request reaches memory, the value in the request gets stored at the write's target address. When a barrier request reaches memory, its job is finished and it disappears. And when a read request reaches memory, a response is generated using the value held in memory at the read's target address.

However, the response to a read request may be generated before the request reaches memory, while it is still flowing through a buffer. If the request immediately below the read in the buffer is a write to the same address, the memory subsystem can respond to the read using the value stored by that write. When this happens, the memory subsystem deletes the read request, but it keeps track of the fact that the write request was used to satisfy the read. (Exception: a load-acquire request is not allowed to be satisfied by a store-release request while still in a buffer. The only way for a load-acquire instruction to read the value written by a store-release instruction is for the load-acquire request to flow all the way down to memory.) Either way, the read request is gone when the response is generated; the memory subsystem never generates more than one response to a request.

The flow of requests down through a buffer is not always First-In-First-Out. Subject to certain restrictions, a request is allowed to exchange places with the request immediately below it (it passes the lower request). The complete list of reorder restrictions is rather elaborate; among the most important ones are:

A read or write request may not pass another read or write request with the same target address.
No barrier request may pass another barrier request.
No request may pass an smp_mb() barrier or vice versa.
An smp_wmb() barrier request may not pass a write request from its own CPU, and it may not be passed by any write request.
An smp_rmb() barrier request may not pass a read request from its own CPU, and it may not be passed by a write or read request from its own CPU.
A store-release write request may not pass any write or read request.
A load-acquire read request may not be passed by a write or read request from its own CPU.

In addition, if a read request R is not allowed to pass some other request A then neither is the write request (if any) that was used to satisfy R, and if A is not allowed to pass R then it's not allowed to pass the write request (if any) that was used to satisfy R.

A write is said to propagate to a CPU when its request flows into a buffer below that CPU, because before that time there is no way for the CPU to read from the write, whereas afterward it is possible for a read request issued by the CPU to be satisfied by the write request, whether in a buffer or in memory. This picture explains why a read might not be satisfied by the coherence-latest write to have propagated to the read's CPU. A read request R for the value of x, for example, might not be satisfied until it reaches memory and obtains an old value, even though a write request W containing a new value for x may already have flowed down to a buffer below R's CPU. Provided that W lies higher than R in the chain of buffers leading from the CPU to memory, R will be unable to read the value stored by W: The write request can't pass the read request because they have the same target address, so it can never reach a buffer position immediately below R and it cannot reach memory before R does. Thus R will end up being satisfied by an earlier value of x even though the coherence-later value in W had already propagated to R's CPU by that time.

Now we can understand how the ARM memory model enforces the strong-barrier guarantee mentioned above. Suppose F is a strong fence and R is a read that is po-after F. Suppose also that W is a write to the same address as R, and W reaches the coherence point before F does. Then W must flow down to memory before F, and since R cannot pass F, it cannot reach memory before W. Thus, if R is satisfied from memory then it must read the value stored by W or a coherence-later write.

But what if R is satisfied while it is still in a buffer? Let W' be the write request that satisfies R. Since it is immediately below R in the buffer at the time that R is satisfied, it must also be above F. And since W' cannot pass F, it must reach memory after W, which means it must be coherence-later than W. The case where R is not issued at all, but is forwarded from a po-earlier write, is left to the reader. Regardless, no matter how things work out, in the end R will read from W or a coherence-later write, as guaranteed.

One other aspect of the storage subsystem needs mentioning. In the PowerPC-based operational model, the write part of an atomic RMW instruction reaches the coherence point at the time it commits. In the ARM memory model this is not true; RMW writes flow down through the buffers just like any other. Instead, the architecture relies on an intricate system of interlocks to prevent other write requests to the same address from flowing down into the wrong buffer at the wrong time and thereby “sneaking” in ahead of the RMW write, violating the atomic property.

Like we saw with the Alpha, the fact that reads are issued and satisfied at different times leads to an ambiguity when we want to order read instructions. If we say that instruction A is ordered before B, where one or both is a read, we could mean:

exec(A) ≤ issue(B) (ordering requirements 2, 7, and 9-10 with one exception);
issue(A) ≤ issue(B) (requirements 6 and 9 where A is load-acquire and B is read);
exec(A) ≤ exec(B); (requirements 3-4 and 8); or
issue(A) ≤ exec(B) (requirement 5).

where issue(A) is the time when A's read request is issued to the memory subsystem, and exec(A) is the time when A is executed (which is the time when the read response is received, if A is a non-forwarded read).

Adjustments for other architectures

Currently there are none. This may change in the future as we become aware of the individual requirements of other CPU families.

Quick Quiz 12: Why weren't adjustments needed for PowerPC, given that it has a weak memory model?
Answer

Quick Quiz 13: But what if some new CPU had an even weaker memory model than Alpha, ARM, and PowerPC? Mightn't that invalidate a lot of Linux-kernel code?
Answer

Examples of cumulativity and release sequences

The following litmus tests illustrate the ideas behind A- and B-cumulativity and release sequences.

Strong Model Litmus Test #4

  1 C C-wmb-is-not-A-cumulative.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x)
  7 {
  8   WRITE_ONCE(*x, 1);
  9 }
 10
 11 P1(int *x, int *y)
 12 {
 13   r1 = READ_ONCE(*x);
 14   smp_wmb();
 15   WRITE_ONCE(*y, 1);
 16 }
 17
 18 P2(int *x, int *y)
 19 {
 20   r2 = READ_ONCE(*y);
 21   smp_rmb();
 22   r3 = READ_ONCE(*x);
 23 }
 24
 25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

This test's exists clause can be satisfied. Even though P0's write to x propagates to P1 before the smp_wmb() barrier commits, as proved by the fact that r1=1 at the end, the write to x is not in the barrier's pre-set, because smp_wmb() is not A-cumulative. As a result, the barrier and the write to y are allowed to propagate to P2 before the write to x does. In the end, this is just a fancy way of saying that smp_wmb() doesn't order two writes if the first write was executed on a different CPU from that which executed the smp_wmb().

Strong Model Litmus Test #5

  1 C C-release-is-A-cumulative.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x)
  7 {
  8   WRITE_ONCE(*x, 1);
  9 }
 10
 11 P1(int *x, int *y)
 12 {
 13   r1 = READ_ONCE(*x);
 14   smp_store_release(y, 1);
 15 }
 16
 17 P2(int *x, int *y)
 18 {
 19   r2 = READ_ONCE(*y);
 20   smp_rmb();
 21   r3 = READ_ONCE(*x);
 22 }
 23
 24 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

By contrast, this test's exists clause cannot be satisfied. Because smp_store_release() is A-cumulative and because P0's write to x propagates to P1 before the smp_store_release() commits, the write is in the release barrier's pre-set. Consequently the new value of x must propagate to P2 before the store to y can. Since P2 is forced to read x after reading y (by the smp_rmb()), and since it sees the new value of y, it must also see the new value of x.

Strong Model Litmus Test #6

  1 C C-wmb-is-B-cumulative.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   WRITE_ONCE(*x, 1);
  9   smp_wmb();
 10   WRITE_ONCE(*y, 1);
 11 }
 12
 13 P1(int *y, int *z)
 14 {
 15   r1 = READ_ONCE(*y);
 16   WRITE_ONCE(*z, r1);
 17 }
 18
 19 P2(int *x, int *z)
 20 {
 21   r2 = READ_ONCE(*z);
 22   smp_rmb();
 23   r3 = READ_ONCE(*x);
 24 }
 25
 26 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

B-cumulativity refers to writes that occur after the barrier. Even though we don't tend to think of smp_wmb() as ordering writes carried out by other CPUs, in every Linux-supported architecture for which we know the details, it does, courtesy of the fact that smp_wmb() is B-cumulative.

In this example, the write to x, the smp_wmb() barrier, and the write to y must propagate from P0 to P1 in order. The data dependency from P1's read of y to its write of z forces the write to occur after the new value of y has been seen, and hence after the smp_wmb() barrier has propagated to P1. As a result, the write to z is in the barrier's post-set, so the barrier must propagate to P2 before the write can. That's what being B-cumulative means. Before P2 can read the new value of z, the barrier and hence the new value of x must have propagated there. Therefore P2's read of x must see the new value, and so the exists clause cannot be satisfied.

Strong Model Litmus Test #7

  1 C C-release-is-not-B-cumulative.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   WRITE_ONCE(*x, 1);
  9   smp_store_release(y, 1);
 10 }
 11
 12 P1(int *y, int *z)
 13 {
 14   r1 = READ_ONCE(*y);
 15   WRITE_ONCE(*z, r1);
 16 }
 17
 18 P2(int *x, int *z)
 19 {
 20   r2 = READ_ONCE(*z);
 21   smp_rmb();
 22   r3 = READ_ONCE(*x);
 23 }
 24
 25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

As before, the write to x, the smp_store_release()'s barrier, and the write to y must propagate from P0 to P1 in order, and so the write to z must occur after the barrier has reached P1. Nevertheless, because smp_store_release() is not B-cumulative, the write to z isn't in the barrier's post-set, and so the new value of z is allowed to propagate to P2 before either the barrier or the new value of x. Consequently it is possible for P2 to read the new value of z followed by the old value of x.

Despite what the previous example shows, in this memory model smp_store_release() is B-cumulative along pathways where it is read by smp_load_acquire(). The following example illustrates this point.

Strong Model Litmus Test #8

  1 C C-release-acquire-is-B-cumulative.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   WRITE_ONCE(*x, 1);
  9   smp_store_release(y, 1);
 10 }
 11
 12 P1(int *y, int *z)
 13 {
 14   r1 = smp_load_acquire(y);
 15   WRITE_ONCE(*z, 1);
 16 }
 17
 18 P2(int *x, int *z)
 19 {
 20   r2 = READ_ONCE(*z);
 21   smp_rmb();
 22   r3 = READ_ONCE(*x);
 23 }
 24
 25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

In this litmus test, P0's smp_store_release() is read by P1's smp_load_acquire(). As a result, the release barrier acts B-cumulatively and so P1's write of z cannot propagate to P2 until the barrier has. Hence it is not possible for P2 to read the new value of z followed by the old value of x.

Note that this applies only along the pathway of the smp_load_acquire(). This example:

Strong Model Litmus Test #9

  1 C C-release-B-cumulative-only-on-acquire-path.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   WRITE_ONCE(*x, 1);
  9   smp_store_release(y, 1);
 10 }
 11
 12 P1(int *y, int *z)
 13 {
 14   r1 = READ_ONCE(*y);
 15   WRITE_ONCE(*z, r1);
 16 }
 17
 18 P2(int *x, int *z)
 19 {
 20   r2 = READ_ONCE(*z);
 21   smp_rmb();
 22   r3 = READ_ONCE(*x);
 23 }
 24
 25 P3(int *y)
 26 {
 27   r4 = smp_load_acquire(y);
 28 }
 29
 30 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0 /\ 3:r4=1)

is the same as the Strong Model Litmus Test #7 example above, except that it has a fourth thread P3 which uses smp_load_acquire() to read the value of y stored by P0. However, this interaction does not cause the release barrier's effect on P1 to be B-cumulative; the exists clause is still allowed to succeed. (The barrier's effect on P3 is B-cumulative, but the example does not probe this fact.)

The following litmus test shows a non-trivial use of a release sequence.

Strong Model Litmus Test #10

  1 C C-relseq.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   WRITE_ONCE(*x, 1);
  9   smp_store_release(y, 1);
 10   WRITE_ONCE(*y, 2);
 11 }
 12
 13 P1(int *y)
 14 {
 15   r1 = xchg_relaxed(y, 3);
 16 }
 17
 18 P2(int *x, int *y)
 19 {
 20   r2 = READ_ONCE(*y);
 21   smp_rmb();
 22   r3 = READ_ONCE(*x);
 23 }
 24
 25 exists (1:r1=2 /\ 2:r2=3 /\ 2:r3=0)

The release sequence headed by P0's smp_store_release() to y includes WRITE_ONCE(*y, 2), because that write is po-after the store-release. It also includes the atomic xchg_relaxed() operation in P1, because that operation accesses y and reads from a write in the release sequence (the WRITE_ONCE()). Consequently P1's atomic write belongs to the release barrier's post-set, and it cannot propagate to P2 before the barrier and the write to x do.

Note that smp_store_release() barriers do not become B-cumulative along paths where smp_load_acquire() reads from an arbitrary member of the release sequence, but only when the load-acquire reads directly from the store-release itself. This is illustrated by the following example.

Strong Model Litmus Test #11

  1 C C-relseq.litmus
  2
  3 {
  4 }
  5
  6 P0(int *x, int *y)
  7 {
  8   WRITE_ONCE(*x, 1);
  9   smp_store_release(y, 1);
 10   WRITE_ONCE(*y, 2);
 11 }
 12
 13 P1(int *y)
 14 {
 15   r1 = xchg_relaxed(y, 3);
 16 }
 17
 18 P2(int *y, int *z)
 19 {
 20   r2 = smp_load_acquire(y);
 21   WRITE_ONCE(*z, 1);
 22 }
 23
 24 P3(int *x, int *z)
 25 {
 26   r3 = READ_ONCE(*z);
 27   smp_rmb();
 28   r4 = READ_ONCE(*x);
 29 }
 30
 31 exists (1:r1=2 /\ 2:r2=3 /\ 3:r3=1 /\ 3:r4=0)

Here the smp_load_acquire() in P2 reads from the xchg_relaxed() in P1, which is part of the smp_store_release()'s release sequence as before. But because it does not read directly from the smp_store_release() instruction, it does not cause the barrier to act B-cumulatively. Hence P2's write to z is allowed to propagate to P3 before P0's release barrier or write to x.

Design of the Strong Model

The strong kernel model essentially consists of checks for six types of forbidden cycles, related to: cache coherence, atomic operations, instruction execution times, write and barrier propagation, arrival at the coherence point, and the RCU guarantee. The last one is concerned more with how RCU is impemented in the kernel than with hardware instruction processing, so we will not discuss it here.

The model treats cache coherence in exactly the same way as in the Coherent-RMO model did, by defining the coherence-order relation and requiring that the union of po-loc and coherence-order have no cycles.

The treatment of atomic RMW operations is even simpler. What makes these operations atomic is that no write from another CPU is allowed to intervene (in the coherence order) between the write that the RMW operation reads from and the RMW's own write. The model defines a relation that links the read event of each RMW instruction to the instruction's write event by way of such an intervening write, and requires that this relation must be empty.

Instruction execution ordering comes in several different forms. For now we'll consider the easiest version, as defined for the original PowerPC-oriented operational memory model. Because we are designing a memory model, we are especially concerned with ordering of memory accesses, and we will ignore other kinds of instructions. The conditions which require one memory access to execute after another can be grouped in four general categories (listed along with the relations the model uses to express them):

A read that reads from a write on another CPU must execute after that write (the rfe relation).
An access in a non-read-dependency memory barrier's post-set must execute after an access in the barrier's pre-set (the ordering-fence relation).
Any of the Preserved Program Order requirements listed earlier can force a CPU to execute one access after another (the ppo relation).
A barrier's write-propagation properties can justify an argument by contradiction showing that a first access could not have executed after a second; when the two accesses are on the same CPU it follows that the second must have executed after the first (the obs relation).

The hb relation incorporates these requirements. If there is a sequence of hb links from A to B, it means that A must execute before B. As mentioned earlier, any cycle in the times of execution of instructions is prohibited; accordingly, the memory model contains a check (named “happens-before”) that requires the hb relation not to have any cycles.

It may seem a little odd to group the ordering-fence relation separately from the ppo relation, since they are essentially the same sort of thing; doing it this way seems to be historical practice. Grouping the rfe relation separately from the others makes much more sense, because it is the only way to order two accesses on different CPUS. (Although a barrier's pre- and post-sets may be considered to contain accesses from other CPUs, thanks to cumulativity, the ordering-fence relation only affects accesses on the barrier's CPU. Cumulativity is handled separately.)

The ppo relation requires an in-depth discussion. It is where the differences between the PowerPC, Alpha, and ARM architectures show up most strongly. We'll begin with an approach tailored to the PowerPC-oriented model and then later discuss the changes required for Alpha and ARM.

In a nutshell, the memory model defines a relation for each of the types of PPO links listed earlier, except the ones involving barriers:

The builtin ctrl relation links a read access to any memory access in a conditional branch that depends on the value of the read. The memory model computes (ctrl & (R*W)) to limit the targets of these control dependencies to write accesses.
The builtin addr and data relations link a read access to any later accesses with an address or data dependency from the read. The dep relation is defined as the union of these two.
The model defines the dep-rfi relation as the concatenation of a dep link and an rfi link. rfi links writes to reads in the same thread that read from them. These reads may have been satisfied by forwarding the write or by going through the memory subsystem, but either way the effect is the same: The read at the end of the rfi link must execute after the read at the start of the dep link.
The addrpo relation is defined as the concatenation of an addr link and a po link. The memory model computes (addrpo & (R*W)), which thus connects reads carrying an address dependency to writes that are po-after the target of the dependency.
The builtin po-loc relation links a memory access to any po-later access of the same address. The memory model computes (po-loc & (M*W)) to limit the targets of these links to writes.
The rdw (“read different writes”) relation is defined as (po-loc & (fre ; rfe)). It thus links two reads of the same address in the same thread, if they read from different writes and the po-later read takes its value from a write in a different thread. Thanks to the coherence rules, this is equivalent to saying that the later read was not forwarded from a write between the two reads.
The detour relation is defined as (po-loc & (coe ; rfe)). It thus links a write to a po-later read (of the same address in the same thread) that reads from a write in a different thread. This certainly implies that the read was not forwarded from the write or from another write in between. The case where the read does read from the write, through the memory subsystem instead of by forwarding, is left as an exercise for the reader.
The atomicpo relation is defined as ((RMW*RMW) & po); it thus links atomic read-modify-write accesses in program order.

The ppo relation in the PowerPC-based model is the union of these relations.

The rmb, wmb, mb, sync, acq-po, and po-relass relations are defined to link accesses in each of the barriers' pre-sets to accesses in the corresponding post-sets. (po-relass handles both smp_store_release() and rcu_assign_pointer(); the kernel implements the latter by means of the former.) The ordering-fence relation is the union of these. It thus incorporates all but the read-dependency barriers.

Finally, the hb relation is taken to be the union of ppo, ordering-fence, rfe, and one other relation, called obs (short for “observation”). In order to explain obs, it is necessary to discuss first the effects of the propagation-order barriers: synchronize_rcu(), smp_mb(), smp_wmb(), and the various store-release instructions, including smp_store_release() and rcu_assign_pointer().

This is where cumulativity comes into play. An A-cumulative barrier's pre-set is closed under rfe; that is, if R is a read in the pre-set of an A-cumulative barrier and W ⟶ R is a link in rfe (where W is a write), then W must also be in the barrier's pre-set. The reason is simple. Because it is in the pre-set, R must precede the barrier in program order and must execute before the barrier does. The fact that W is on another CPU and R reads from W means that W must have propagated to the barrier's CPU before R executed, hence before the barrier was committed. By the definition of A-cumulativity, W must therefore belong in the pre-set.

Similarly, a B-cumulative barrier's post-set is closed under hb. That is, if X is in the post-set of a B-cumulative barrier and X ⟶ Y is a link in hb, then Y must also be in the barrier's post-set. The fact that X is in the post-set means that the barrier propagates to X's CPU before X executes. If the hb link from X to Y is anything other than rfe, it means that Y executes on the same CPU as X and after X does, hence after the barrier propagates to that CPU. If the link is an instance of rfe, it means that X is a write and Y reads from X on a different CPU. Since X is in the barrier's post-set, the propagation-order property says that the barrier must propagate to Y's CPU before X does, hence before Y executes. Either way, by the definition of B-cumulativity, Y must belong in the post-set.

Putting these ideas together we can define the propbase relation, which links accesses in a propagation-order barrier's pre-set to accesses in its post-set, taking cumulativity into account. Ignoring release sequences and the effect of load-acquire reading from store-release for simplicity, propbase would look like this:

(rfe? ; strong-fence ; hb* ) |
(rfe? ; po-relass) |
(wmb ; hb* )

Here strong-fence is the union of mb and sync (the two types of strong barriers), and po-relass covers all the store-release instructions. This formula expresses the fact that strong barriers are both A- and B-cumulative, store-release barriers are A-cumulative, and smp_wmb() is B-cumulative. (The extra spaces between the ‘*’ and ‘)’ characters are necessary because herd, following OCaml, reserves “*)” for marking the end of a comment.)

Whenever there is a propbase link from a memory access X to another access Y, the operational model says that the following somewhat unwieldy ordering property (which we can call the “Propbase” property) holds:

If X is a write, let W = X; otherwise let W be any write that propagates to X's CPU before X executes. Then W propagates to Y's thread before Y executes. In addition, if Y is a write then W propagates to any CPU before Y propagates to that CPU (and likewise for the coherence point).

Detailed justification of this property is left to the reader; it depends on the fact that if a barrier's pre-set can contain a read then the barrier is A-cumulative.

Details aside, a notable aspect of the Propbase property is that it is transitive. In other words, if X and Y obey the Propbase property and Y and Z do so too, then X and Z also obey the property. We will use this later.

TO BE CONTINUED...

Strong-Model Bell File

The full Bell file for Alan Stern's strong model is as follows:

strong-kernel.bell

  1 "Linux kernel strong memory model"
  2
  3 (*
  4  * Copyright (C) 2016 Alan Stern <stern@rowland.harvard.edu>,
  5  *          Andrea Parri <parri.andrea@gmail.com>
  6  *
  7  * This program is free software; you can redistribute it and/or modify
  8  * it under the terms of the GNU General Public License as published by
  9  * the Free Software Foundation; either version 2 of the License, or
 10  * (at your option) any later version.
 11  *
 12  * This program is distributed in the hope that it will be useful,
 13  * but WITHOUT ANY WARRANTY; without even the implied warranty of
 14  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 15  * GNU General Public License for more details.
 16  *
 17  * You should have received a copy of the GNU General Public License
 18  * along with this program; if not, you can access it online at
 19  * http://www.gnu.org/licenses/gpl-2.0.html.
 20  *)
 21
 22 enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) ||
 23     'release (*smp_store_release*) ||
 24     'acquire (*smp_load_acquire*) ||
 25     'assign (*rcu_assign_pointer*) ||
 26     'deref (*rcu_dereference*) ||
 27     'lderef (*lockless_dereference*)
 28 instructions R[{'once,'acquire,'deref,'lderef}]
 29 instructions W[{'once,'release,'assign}]
 30 instructions RMW[{'once,'acquire,'release}]
 31
 32 enum Barriers = 'wmb (*smp_wmb*) ||
 33     'rmb (*smp_rmb*) ||
 34     'mb (*smp_mb*) ||
 35     'rb_dep (*smp_read_barrier_depends*) ||
 36     'rcu_read_lock (*rcu_read_lock*)  ||
 37     'rcu_read_unlock (*rcu_read_unlock*) ||
 38     'sync (*synchronize_rcu*)
 39 instructions F[Barriers]
 40
 41 (* Treat 'release and 'assign identically; same for 'deref and 'lderef *)
 42 let ReleaseAssign = Release | Assign
 43 let XDeref = Deref | Lderef
 44
 45 let rmb = fencerel(Rmb) & (R*R)
 46 let wmb = fencerel(Wmb) & (W*W)
 47 let mb = fencerel(Mb)
 48 let sync = (po & (_ * Sync)) ; (po?)
 49
 50 let rb-dep = fencerel(Rb_dep) & (R*R)
 51 let acq-po = po & (Acquire*_)
 52 let xderef-po = po & (XDeref*M)
 53 let po-relass = po & (_*ReleaseAssign)
 54
 55 let rd-dep-fence = rb-dep | xderef-po
 56 let strong-fence = mb | sync
 57
 58 (* Compute matching pairs of nested Rcu_read_lock and Rcu_read_unlock *)
 59 let matched = let rec
 60       unmatched-locks = Rcu_read_lock \ domain(matched)
 61   and unmatched-unlocks = Rcu_read_unlock \ range(matched)
 62   and unmatched = unmatched-locks | unmatched-unlocks
 63   and unmatched-po = (unmatched * unmatched) & po
 64   and unmatched-locks-to-unlocks = (unmatched-locks *
 65       unmatched-unlocks) & po
 66   and matched = matched | (unmatched-locks-to-unlocks \
 67     (unmatched-po ; unmatched-po))
 68   in matched
 69
 70 (* Validate nesting *)
 71 flag ~empty Rcu_read_lock \ domain(matched) as unbalanced-rcu-locking
 72 flag ~empty Rcu_read_unlock \ range(matched) as unbalanced-rcu-locking
 73
 74 (* Outermost level of nesting only *)
 75 let crit = matched \ (po^-1 ; matched ; po^-1)

Taking this one piece at a time:

Bell File: Memory Accesses.
Bell File: Barriers.
Bell File: Relating Barriers and Memory Accesses.
Bell File: Relating One-Sided Barriers and Memory Accesses.
Bell File: Classes of Fences.
Bell File: RCU Read-Side Critical Sections.

Bell File: Memory Accesses

The “"Linux kernel strong memory model"” is a name that has no effect on the model's meaning.

The following portion of the Bell file defines the types of memory accesses, which correspond to the Linux kernel's READ_ONCE(), WRITE_ONCE(), ACCESS_ONCE(), smp_store_release(), smp_load_acquire(), rcu_assign_pointer(), rcu_dereference(), and lockless_dereference() primitives:

22 let RMW = domain(rmw) | range(rmw)
23
24 enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) ||
25     'release (*smp_store_release*) ||
26     'acquire (*smp_load_acquire*) ||
27     'assign (*rcu_assign_pointer*) ||
28     'deref (*rcu_dereference*) ||
29     'lderef (*lockless_dereference*)
30 instructions R[{'once,'acquire,'deref,'lderef}]
31 instructions W[{'once,'release,'assign}]
32 instructions RMW[{'once,'acquire,'release}]

The “enum Accesses” statement defines the types of memory references, corresponding to the C functions listed in the comments. These correspondences are defined in herd's linux.def macro file. The “instructions R” identifies which of the above types of memory references may be associated with a read instruction, the “instructions W” identifies which may be associated with a write instruction, and the “instructions RMW” identifies which may be associated with a read-modify-write instruction. For example, the association of 'acquire with the R set of instructions corresponds to the Linux kernel's smp_load_acquire() primitive.

The predefined “rmw” relation links the read and the write component of each read-modify-write (RMW) operation. The “let RMW” statement thus creates the set of all read and write events corresponding to an RMW instruction in the litmus-test program. (This shouldn't be necessary, because the RMW set is supposed to be built-in to herd. However at the time this memory model was created, the implementation of the RMW set wasn't working, so it was necessary to define the set manually.)

Note well that the above code simply defines names for the Linux-kernel memory-access primitives. The herd tool also uses this code to check the instruction annotations in the linux.def file, for example, __load{acquire) is legal but __load(release) is not. Later code in both the Bell and Cat files will define their effect on memory ordering.

Bell File: Barriers

The next portion of the Bell file defines the types of barrier-like constructs, namely smp_wmb(), smp_rmb(), smp_mb(), smp_read_barrier_depends(), rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu().

34 enum Barriers = 'wmb (*smp_wmb*) ||
35     'rmb (*smp_rmb*) ||
36     'mb (*smp_mb*) ||
37     'rb_dep (*smp_read_barrier_depends*) ||
38     'rcu_read_lock (*rcu_read_lock*)  ||
39     'rcu_read_unlock (*rcu_read_unlock*) ||
40     'sync (*synchronize_rcu*)
41 instructions F[Barriers]
42
43 (* Treat 'release and 'assign identically; same for 'deref and 'lderef *)
44 let ReleaseAssign = Release | Assign
45 let XDeref = Deref | Lderef

The “enum Barriers” defines the types of barriers corresponding to the C functions listed in the comments (as set up in the linux.def macro file). The “instructions F[Barriers]” says that these types may be used in various sorts of barrier instructions.

Quick Quiz 14: Given that this is about memory barriers, why “instructions F[Barriers]” instead of perhaps “instructions B[Barriers]”?
Answer

The definition of ReleaseAssign allows the model to treat the smp_store_release(), rcu_assign_pointer(), and the write portions of xchg() and xchg_release() identically from a memory-ordering perspective, and the definition of XDeref does likewise for lockless_dereference() and rcu_dereference().

As with the memory accesses, the above code only defines names. These barrier-like constructs' ordering properties will be defined by later code in the Bell and Cat files.

Bell File: Relating Barriers and Memory Accesses

The next portion of the Bell file defines the relation between a given barrier-like construct and its process's surrounding memory accesses:

47 let rmb = fencerel(Rmb) & (R*R)
48 let wmb = fencerel(Wmb) & (W*W)
49 let mb = fencerel(Mb)
50 let sync = (po & (_ * Sync)) ; (po?)

The fencerel(S) function in herd's standard library returns a relation containing all pairs of events in which the first event precedes (in program order) an event in the S set (for example, an “Rmb” event in the case of the line 26 above) and the second follows it, with all three events being in the same thread. As an example, the following snippet:

r1 = READ_ONCE(*x);
smp_rmb();
r2 = READ_ONCE(*y);
smp_mb();
WRITE_ONCE(*z, r3);

would produce an “rmb” relation containing only one link:

r1 = READ_ONCE(*x) ⟶ r2 = READ_ONCE(*y)

and an “mb” relation containing three links:

r1 = READ_ONCE(*x) ⟶ WRITE_ONCE(*z, r3),
smp_rmb() ⟶ WRITE_ONCE(*z, r3), and
r2 = READ_ONCE(*y) ⟶ WRITE_ONCE(*z, r3).

The “rmb” relation doesn't include the other possible links because of the “& (R*R)” clause in its definition, which intersects the full fencerel(Rmb) relation with the relation containing all pairs of reads (R*R). This is appropriate because smp_rmb() orders only reads, not writes.

(For database programming fans, the “&” operator can be thought of as doing a database full equijoin operation, so that the result is only those elements that appear in both operands. Similarly, the “*” operator can be thought of as a database unconstrained join operation, in this case providing all combinations of pairs of read events. Later on, we will encounter operations that cannot be easily represented by SQL, so we will shift to the notation used for mathematical sets.)

The “let wmb = fencerel(Wmb) & (W*W)” definition acts similarly, but it extracts pairs of writes rather than reads, as required for smp_wmb(). The “let mb = fencerel(Mb)” definition keeps all events in the fencerel(Mb) relation, as required for smp_mb(). (It even keeps events that don't correspond to memory accesses, such as the smp_rmb() event in the above example, although they are irrelevant here.)

Finally, the “let sync = (po & (_ * Sync)) ; (po?)” definition uses a modified formula in place of “fencerel(Sync)”. It is different from the others in that it also includes pairs where the second event is the synchronize_rcu() call rather than something following it. Otherwise it is like the definition of the mb relation.

Quick Quiz 15: Why wouldn't “let sync = fencerel(Sync)” work just as well as the modified definition?
Answer

This portion of the Bell file relates smp_rmb(), smp_wmb(), smp_mb(), and synchronize_rcu() to the surrounding code within a given process, but says nothing about cross-process ordering properties, which will be defined in later Bell and Cat code.

Bell File: Relating One-Sided Barriers and Memory Accesses

The next portion of the Bell file defines some relations involving “one-sided” barriers (smp_load_acquire(), smp_store_release(), rcu_assign_pointer(), smp_read_barrier_depends(), rcu_dereference(), and lockless_dereference()) and their surrounding instructions:

52 let rb-dep = fencerel(Rb_dep) & (R*R)
53 let acq-po = po & (Acquire*_)
54 let xderef-po = po & (XDeref*M)
55 let po-relass = po & (_*ReleaseAssign)

The “acq-po” line defines the relation appropriate for acquire operations, including smp_load_acquire() and the read portions of xchg() and xchg_acquire(). This is the intersection of the program order (po) relation with the set of all pairs of events in which the first is an Acquire and the second can be anything (the “_” wildcard). The “po-relass” definition works quite similarly, but with prior memory accesses rather than subsequent ones and with releases (smp_store_release(), rcu_assign_pointer(), and the write portions of xchg() and xchg_release()) rather than acquires.

Consider the following example containing code fragments running on two threads, where x, y, and z are all initially zero:

Thread 0                              Thread 1
--------                              --------
WRITE_ONCE(*x, 1);                     r2 = smp_load_acquire(y);
r1 = READ_ONCE(*z);                    r3 = READ_ONCE(*x);
smp_store_release(y, 1);               WRITE_ONCE(*z, 1);

This results in the following po links:

WRITE_ONCE(*x, 1) ⟶ r1 = READ_ONCE(*z),
WRITE_ONCE(*x, 1) ⟶ smp_store_release(y, 1),
r1 = READ_ONCE(*z) ⟶ smp_store_release(y, 1),
r2 = smp_load_acquire(y) ⟶ r3 = READ_ONCE(x);,
r2 = smp_load_acquire(y) ⟶ WRITE_ONCE(z, 1),
r3 = READ_ONCE(*x); ⟶ WRITE_ONCE(*z, 1).

The first three links relate events in Thread 0 and the last three relate events in Thread 1. (The number of links in po is clearly quadratic in the number of statements in a given thread, but that is OK because several other things are exponential! Knowing this, you can understand why this sort of verification technique is unlikely to handle all 20 million lines of the Linux kernel at one go. Instead, these techniques should be applied to small but critical segments of code.)

In this example, there is only one Acquire event: “r2 = smp_load_acquire(y)”. Intersecting po with the set of all pairs of events in which the first is an Acquire gives the acq-po relation:

r2 = smp_load_acquire(y) ⟶ r3 = READ_ONCE(x);,
r2 = smp_load_acquire(y) ⟶ WRITE_ONCE(z, 1),

This naturally lists all pairs of instructions whose execution order is constrained by Thread 1's smp_load_acquire().

The “rb-dep” definition is the same as that of “rmb” earlier, except that it applies to smp_read_barrier_depends() instead of smp_rmb(). The “xderef-po” definition is the same as that of “acq-po”, but for lockless_dereference() and rcu_dereference() instead of smp_load_acquire(). Note that these three relations do not correspond exactly to ordering constraints, because smp_read_barrier_depends(), rcu_dereference(), and lockless_dereference() only order pairs of accesses where the second is “dependent” on the first (more precisely, where there is an address dependency between them); this restriction is described in more detail later on.

Note also that this portion of the Bell file defined only the relationships between these one-sided barriers and the surrounding code within a given process. Cross-process ordering properties are defined by later Bell and Cat code.

Bell File: Classes of Fences

The next portion of the Bell file forms two groups of fences by strength:

57 let rd-dep-fence = rb-dep | xderef-po
58 let strong-fence = mb | sync

The members of the rd-dep-fence group (smp_read_barrier_depends(), rcu_dereference(), and lockless_dereference()) cannot provide any ordering at all unless a dependency is also present.

In contrast, the members of the strong-fence group can enforce full globally visible transitive ordering. Pervasive use of the members of the strong-fence family will result in agreement on the order even of completely unrelated memory references. In fact, as noted earlier, placing one of these strong fences between each pair of memory references in each process will forbid all but SC executions. On the other hand, stronger fences often incur larger performance penalties.

[ @@@ I suspect that we need example litmus tests for the above. Thoughts? ]

These groups will be used in the Cat file to organize the various ordering requirements.

Bell File: RCU Read-Side Critical Sections

The final section of the Bell file is the most complex, due to the fact that rcu_read_lock() and rcu_read_unlock() must come in matching pairs within a given process and can be nested. Therefore, the purpose of the following code is to find the outermost pair of rcu_read_lock() and rcu_rcu_unlock() invocations in a single nested set, and to differentiate correctly between any unrelated nested sets in a given process.

60 (* Compute matching pairs of nested Rcu_read_lock and Rcu_read_unlock *)
61 let matched = let rec
62       unmatched-locks = Rcu_read_lock \ domain(matched)
63   and unmatched-unlocks = Rcu_read_unlock \ range(matched)
64   and unmatched = unmatched-locks | unmatched-unlocks
65   and unmatched-po = (unmatched * unmatched) & po
66   and unmatched-locks-to-unlocks = (unmatched-locks *
67       unmatched-unlocks) & po
68   and matched = matched | (unmatched-locks-to-unlocks \
69     (unmatched-po ; unmatched-po))
70   in matched
71
72 (* Validate nesting *)
73 flag ~empty Rcu_read_lock \ domain(matched) as unbalanced-rcu-locking
74 flag ~empty Rcu_read_unlock \ range(matched) as unbalanced-rcu-locking
75
76 (* Outermost level of nesting only *)
77 let crit = matched \ (po^-1 ; matched ; po^-1)

The “matched” relation is defined by the mutually recursive set of definitions on lines 61-70. The idea behind this code is to associate an unmatched Rcu_read_lock event with a later unmatched Rcu_read_unlock event whenever no unmatched events lie between them, and to repeat this operation recursively until nothing more can be matched.

To that end, lines 61-64 form the sets of not-yet-matched Rcu_read_lock and Rcu_read_unlock events and their union. Line 65 then forms the relation of all pairs of these unmatched events that occur in the same thread, in program order. Lines 66-67 similarly form the relation of all such pairs where the first member of the pair is a Rcu_read_lock event and the second is an Rcu_read_unlock.

The interesting part is lines 68-69, which take pairs of unmatched Rcu_read_lock and Rcu_read_unlock events and add them to the “matched” relation, but only if there are no unmatched events in between. They do this by applying the “\” (backslash) subtraction operator to remove from the unmatched-locks-to-unlocks relation any pairs having an intervening unmatched Rcu_read_lock or Rcu_read_unlock. The “;” operator sequences relations (if relation x contains a⟶b and relation y contains b⟶c then (x ; y) will contain a⟶c). In this case, you can see that “unmatched-po ; unmatched-po” contains all pairs a⟶c of unmatched events for which a third unmatched event b lies between them in program order.

The only purpose of line 70 is to prevent the unmatched-locks, unmatched-unlocks, unmatched, unmatched-po, and unmatched-locks-to-unlocks definitions from leaking out to the surrounding context. (Grammatically speaking, the construction used here is a let rec expression inside a let statement. In fact, let or let rec expressions are very much like GCC's statement expressions; the statement in lines 50-59 is syntactically analogous to “x = ({int x = u; if (x < v) x = v; x;})”.)

Line 73 then checks whether there are any unmatched Rcu_read_lock events, and line 74 does the same for unmatched Rcu_read_unlock events. The “flag ~empty” statement flags the litmus test as containing a semantic error if the specified set isn't empty, and the “as ...” clause merely provides a name to identify the particular failure mode.

Lastly, line 77 computes those matching pairs which lie at the outermost level of nesting. They are the important ones, because they delimit RCU read-side critical sections. It does this by subtracting from “matched” all pairs which lie entirely between another matched pair. The “^-1” inversion operator computes the converse of a given relation; that is, it computes the collection of all links a⟶b such that b⟶a is in the given relation. Thus, po^-1 contains all pairs of events in reverse program order. To see how “(po^-1 ; matched ; po^-1)” selects inner matched pairs, consider the following example:

 1 rcu_read_lock();
 2 rcu_read_lock();
 3 rcu_read_unlock();
 4 rcu_read_unlock();

Starting at line 2, a “po^-1” step takes us back to line 1, a “matched” step takes us to line 4, and a second “po^-1” takes us back to line 3. Thus, this expression correctly identifies line 2 ⟶ line 3 as an inner matched pair. You can easily see that this mechanism will remove from the matched relation any matched pairs that are nested within another matched pair.

We are now ready to proceed to the Cat file.

Strong-Model Cat File

The full Cat file for Alan Stern's strong model is as follows:

strong-kernel.cat

  1 "Linux kernel strong memory model"
  2
  3 (*
  4  * Copyright (C) 2016 Alan Stern <stern@rowland.harvard.edu>,
  5  *          Andrea Parri <parri.andrea@gmail.com>
  6  *
  7  * This program is free software; you can redistribute it and/or modify
  8  * it under the terms of the GNU General Public License as published by
  9  * the Free Software Foundation; either version 2 of the License, or
 10  * (at your option) any later version.
 11  *
 12  * This program is distributed in the hope that it will be useful,
 13  * but WITHOUT ANY WARRANTY; without even the implied warranty of
 14  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 15  * GNU General Public License for more details.
 16  *
 17  * You should have received a copy of the GNU General Public License
 18  * along with this program; if not, you can access it online at
 19  * http://www.gnu.org/licenses/gpl-2.0.html.
 20  *)
 21
 22 include "cos.cat"
 23
 24 let com = rf | co | fr
 25 let coherence-order = po-loc | com
 26 acyclic coherence-order as coherence
 27
 28 empty rmw & (fre;coe) as atomic
 29
 30
 31 let exec-order-fence = rmb | acq-po
 32 let propagation-fence = strong-fence | wmb | po-relass
 33 let ordering-fence = propagation-fence | exec-order-fence
 34
 35 (* Determine the release sequences *)
 36 let rel-seq = [ReleaseAssign] ; coi? ; (rf ; rmw)*
 37 let po-rel-seq = po ; rel-seq
 38
 39 (* On Alpha, rd-dep-fence makes addr, dep-rfi, and rdw strong *)
 40 let dep = addr | data
 41 let dep-rfi = dep ; rfi
 42 let rd-addr-dep-rfi = (addr | dep-rfi)+ & rd-dep-fence
 43 let rdw = po-loc & (fre ; rfe)
 44 let rd-rdw = rdw & rd-dep-fence
 45 let po-loc-ww = po-loc & (W*W)
 46 let detour = (po-loc & (coe ; rfe)) \ (po-loc-ww ; po-loc)
 47 let atomicpo = (RMW*RMW) & po
 48 let addrpo = addr ; po
 49
 50 (* The set of writes that are bounded by the end of the thread
 51    or by a fence before the next write to the same address *)
 52 let BOUNDED-W = W \ domain(po-loc-ww \ ordering-fence)
 53 (* The set of "non-obscurable" writes on ARM *)
 54 let NOW = domain(rfe) | range(rmw) | ReleaseAssign |
 55     BOUNDED-W | domain(detour)
 56 (* The set of "obscurable" writes *)
 57 let OW = W \ NOW
 58
 59 let nco = co & (NOW*W)
 60 let ncoe = nco & ext
 61
 62 let strong-ppo = rd-addr-dep-rfi | ordering-fence | atomicpo |
 63     ((dep | ctrl | addrpo) & (R*W))
 64 let Alpha-strong-ppo = strong-ppo | rd-rdw | detour |
 65     (po-loc & ((M\OW)*W))
 66 let ARM-strong-ppo = strong-ppo | addr | dep-rfi
 67 let ppo = Alpha-strong-ppo | ARM-strong-ppo | rdw
 68
 69 let rfe-ppo = strong-ppo | (ARM-strong-ppo ; ppo* ; Alpha-strong-ppo)
 70 let po-relass-acq-hb = po ; (rfe & (ReleaseAssign*Acquire)) ; rfe-ppo
 71
 72 (* Release paired with Acquire is both A- and B-cumulative *)
 73 let AB-cum-hb = strong-fence | po-relass-acq-hb
 74 let A-cum-hb = AB-cum-hb | po-relass | po-rel-seq
 75 let B-cum-hb = AB-cum-hb | wmb
 76
 77 let hb0 = (ppo* ; Alpha-strong-ppo) | (rfe ; rfe-ppo)
 78 let propbase0 = propagation-fence | (rfe? ; A-cum-hb)
 79
 80 let rec B-cum-propbase = (B-cum-hb ; hb* ) |
 81     (rfe? ; AB-cum-hb ; hb* )
 82     and propbase = propbase0 | B-cum-propbase
 83     and short-obs = ((ncoe|fre) ; propbase+ ; rfe) & int
 84     and obs = short-obs |
 85     ((hb* ; (ncoe|fre) ; propbase* ; B-cum-propbase ; rfe) & int)
 86     and hb = hb0 | (obs ; rfe-ppo)
 87
 88 acyclic hb as happens-before
 89 irreflexive (short-obs ; Alpha-strong-ppo) as observation
 90
 91
 92 let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb*
 93 let prop = (propbase & (W*W)) | strong-prop
 94 let cpord = nco | prop
 95
 96 acyclic cpord as propagation
 97
 98
 99 (* Propagation between strong fences *)
100 let rcu-order = hb* ; cpord* ; fre? ; propbase* ; rfe?
101
102 (* Chains that can prevent the RCU grace-period guarantee *)
103 let gp-link = sync ; rcu-order
104 let cs-link = po? ; crit^-1 ; po? ; rcu-order
105 let rcu-path0 = gp-link |
106   (gp-link ; cs-link) |
107   (cs-link ; gp-link)
108 let rec rcu-path = rcu-path0 |
109   (rcu-path ; rcu-path) |
110   (gp-link ; rcu-path ; cs-link) |
111   (cs-link ; rcu-path ; gp-link)
112
113 irreflexive rcu-path as rcu

Quick Quiz 16: This strong model is insanely complex!!! How can anyone be expected to understand it???
Answer

First, the “Linux kernel strong memory model” is the title, and the “include "cos.cat"” pulls in some common definitions, similar to the C language's “#include <stdio.h>”.

Again, taking the remainder of the file one piece at a time:

Cat File: SC Per Location and Atomics.
Cat File: More Classes of Fences.
Cat File: Release Sequences.
Cat File: Intra-Thread Ordering.
Cat File: Obscured Writes.
Cat File: Preserved Program Order.
Cat File: Cumulativity.
Cat File: Happens-Before.
Cat File: Coherence Points.
Cat File: RCU.

Cat File: SC Per Location and Atomics

The first section of the litmus.cat file defines SC per location, which again means that all CPUs agree on the order of reads and writes to any given single location. Therefore, any situation where CPUs disagree on the order of reads and writes must involve more than one variable. This section also provides ordering constraints for RMW atomic operations.

24 let com = rf | co | fr
25 let coherence-order = po-loc | com
26 acyclic coherence-order as coherence
27
28 empty rmw & (fre;coe) as atomic

The “com” relation shown on line 1 is the union of:

Coherence (co), which connects all writes to any given variable, in the order that those writes were executed.
Reads-from (rf), which connects each read with the write that produced the value read. Note that initial values are considered to be “before the beginning of time” writes, where time is measured by the co ordering.
From-reads (fr), which connects each read with the writes to that same variable that follow the write (in co order) producing the value read.

The resulting “com” relation tracks the communication of data, hence the name.

The predefined “po-loc” relation intersects the program-order relation “po” with the per-location “loc” relation. This results in “po-loc” being an union of relations, one per variable, connecting all per-thread accesses to any given variable, in program order.

The “coherence-order” relation on line 2 takes the union of “po-loc” and “com”, which combines the communication of data with the order in which each location is accessed by each thread, but maintaining all relations on a per-location basis. The “acyclic” constraint on line 3 prohibits cycles in the resulting “coherence” relation, in other words, that everyone agrees on the order of accesses to each location.

Line 5 enforces the atomicity RMW operations on a given variable: More specifically no write to the given variable can intervene between the read and the write of the RMW operation. Recall that the “rmw” relationship connects a given RMW operation's read to its write. Note also that “fre;coe” connects any read to a given variable to some later write to that same variable, where at least one of the intervening writes was executed by some other thread. If the initial read was a given RMW operation's read and the final write was this same RMW operation's write, atomicity has been violated: Some other thread's write appeared after the RMW's read but before its write. Therefore, line 5 requires that the intersection of “rmw” and “fre;coe” be the empty set, thus prohibiting violations of atomicity.

Cat File: More Classes of Fences

The next portion of the file defines classes of fences in addition to those in the Bell file.

Quick Quiz 17: Why aren't these additional classes of fences in the Bell file where the other classes live?
Answer

31 (* On ARM, (read-acquire to read) is weaker than (rmb to read) *)
32 let exec-order-fence = rmb | (acq-po \ (R*R))
33 let propagation-fence = strong-fence | wmb | po-relass
34 let ordering-fence = propagation-fence | exec-order-fence

The members of the exec-order-fence group (smp_rmb() and smp_load_acquire()) can be thought of as providing ordering by restricting execution, for example, waiting for previous reads to complete before executing subsequent instructions. (In practice, hardware architects have all sorts of optimizations at their disposal that provide the needed ordering without necessarily actually waiting.) The ordering properties of any member of the rd-dep-fence and exec-order-fence groups do not propagate outside of that member's process. Such fences cannot provide global ordering except in situations involving only causal reads-from (rf) links; any non-causal coherence or from-read links (co or fr, respectively) require a stronger type of barrier. In addition, smp_load_acquire() only orders trailing loads unless it is paired with smp_store_release().

The members of the propagation-fence group include strong-fence (smp_mb() and synchronize_rcu()) in addition to smp_wmb(), smp_store_release(), and rcu_assign_pointer(). These barriers all provide some form of cumulativity, and when the smp_store_release() and rcu_assign_pointer() are paired with smp_load_acquire(), all of them provide B-cumulativity. Either way, as the name suggests, the effects of these barriers propagate to other processes.

Finally, ordering-fence is simply the union of exec-order-fence and propagation-fence, that is, the set of fence-like things that do not rely on dependencies.

Cat File: Release Sequences

The next portion of the file defines C11-style release sequences:

36 (* Determine the release sequences *)
37 let rel-seq = let
38       co-successor = co \ (co ; co)
39   and rel-seq-candidate = W * RMW
40   and rel-seq0 = toid(ReleaseAssign) ; coi?
41   in let rec
42     rel-seq = rel-seq0 |
43       (rel-seq-candidate & (rel-seq ; co-successor))
44     in rel-seq
45 let po-rel-seq = po ; rel-seq

The co-successor relation is derived from the co relation, but links from each write only to the immediately following write in co, rather than to all subsequent writes as in co itself. The rel-seq-candidate relation links all writes with all read-modify-write atomic operations. The rel-seq0 relation links any release operation with itself and with any subsequent modification of that same variable within the same process.

Quick Quiz 18: Shouldn't rel-seq-candidate intersect with co in order to avoid spurious release sequences involving multiple variables?
Answer

Then the rel-seq relation is recursively defined as being either the rel-seq0 store-after-release base case or the maximal C11 release sequence, moving down the co relation through any read-modify-write operation.

Quick Quiz 19: Shouldn't rel-seq depend on itself as well? After all, C11 allows memory_order_acquire, memory_order_acq_rel, memory_order_consume, and memory_order_seq_cst loads to take a value from any member of a release sequence, and still have the release operation heading that release sequence synchronize with the load.
Answer

Finally, the po-rel-seq relation links any memory reference prior to a release operation to that release operation's release sequence.

Cat File: Intra-Thread Ordering

The next portion of the file defines intra-thread ordering relationships. Here “intra-thread” means that the ordered accesses are within the same thread. Some of the relationships will reference other threads.

47 (* On Alpha, addr, dep-rfi, and rdw without rd-dep-fence are weaker *)
48 let rd-addr = addr & rd-dep-fence
49 let dep = addr | data
50 let dep-rfi = dep ; rfi
51 let rd-dep-rfi = dep-rfi & rd-dep-fence
52 let rdw = po-loc & (fre ; rfe)
53 let rd-rdw = rdw & rd-dep-fence
54 let po-loc-ww = po-loc & (W*W)
55 let detour = (po-loc & (coe ; rfe)) \ (po-loc-ww ; po-loc)
56 let atomicpo = (RMW*RMW) & po
57 let addrpo = addr ; po

The addr and data relations define address and data dependencies respectively. An address dependency occurs when a previously loaded value is used to form the address of a subsequent load or store within the same thread. A data dependency occurs when a previously loaded value is used to form the value stored by a subsequent store within the same thread. However, the Linux kernel respects neither address nor data dependencies unless: (1) The dependency is headed by rcu_dereference() or lockless_dereference() or (2) There is an smp_read_barrier_depends() between the load heading the dependency chain and the dependent memory reference. This requirement for a special operation helps to document the intent, and also allows architectures to include any special instructions required to enforce dependency ordering, for example, DEC Alpha requires a memory barrier if the dependent access is a read.

The rd-addr relation contains only those address dependencies that are headed by lockless_dereference() or rcu_dereference() or that span an smp_read_barrier_depends(). This relation excludes any read-to-read dependencies that DEC Alpha won't provide ordering for.

The dep relation is simply the union of address and data dependencies. the dep-rfi relation contains dependencies leading to a store, but where that store is read by a later load in that same thread. The rd-dep-rfi relation contains relations from dep-rfi that are headed by lockless_dereference() or rcu_dereference() or that span an smp_read_barrier_depends(). the “rdw” relation contains load-store pairs within a given thread, where the load and store are to the same variable, but where at least one store to this same variable from some other thread intervened between this thread's load and store. The rd-rdw relation contains relations from rdw that are headed by lockless_dereference() or rcu_dereference() or that span an smp_read_barrier_depends(). The po-loc-ww relation links pairs of writes by the same thread to the same variable. This relation helps to define the detour relation, which links writes that are overwritten by some other process, whose value is returned by a subsequent read in the first process, but without an intervening write to that same variable within the first process. The atomicpo relation accumulates pairs of RMW operations where both members of each pair are on the same thread, and where the first member of the pair precedes the second member in program order. Such pairs are ordered because a given CPU can execute only one RMW operation at a time. Finally, the addrpo relation relates operations heading address dependencies with any operations following the dependent operation in program order.

Quick Quiz 20: Why would an operation following an address dependency get any special treatment? After all, there does not appear to be any particular ordering relationship in the general case.
Answer

Cat File: Obscured Writes

As described in the ARM section, writes can be obscured under certain conditions. The following section of the Cat file accounts for this:

59 (* The set of writes that are bounded by the end of the thread
60    or by a fence before the next write to the same address *)
61 let BOUNDED-W = W \ domain(po-loc-ww \ ordering-fence)
62 (* The set of "non-obscurable" writes on ARM *)
63 let NOW = domain(rfe) | range(rmw) | ReleaseAssign |
64     BOUNDED-W | domain(detour)
65 (* The set of "obscurable" writes *)
66 let OW = W \ NOW
67
68 let nco = co & (NOW*W)
69 let ncoe = nco & ext

There are a number of reasons why a given write might not be “obscurable”, and the first two relations form the full set of non-obscurable writes. The BOUNDED-W set contains those writes that cannot be obscured either due to them being the last write in the thread (hence the subtraction of the domain of po-loc-ww) or due to them being ordered against some later access (hence the subtraction of the domain of ordering-fence). The NOW set then adds writes that are read by some other process (domain(rfe)), writes that are part of an atomic read-modify-write operation (range(rmw)), writes having release semantics (ReleaseAssign), and writes that were overwritten by a write in some other process, with that other process's write being read by a later read in the first process, but without another intervening write executed by the first process (domain(detour)).

The OW set then forms the set of obscurable writes by subtracting the set of non-obscurable writes from the set of all writes. The nco relation is that subset of the co relation that links only from non-obscurable writes. However, note that nco might contain links to obscurable writes. Finally, the ncoe relation is that subset of the co relation that links from a non-obscurable write to some (possibly obscurable) write to that same variable by some other process.

Cat File: Preserved Program Order

The next section of the file defines several flavors of “preserved program order”, which is abbreviated as ppo:

71 let strong-ppo = rd-addr | rd-dep-rfi | ordering-fence | atomicpo |
72     ((dep | ctrl | addrpo) & (R*W))
73 let Alpha-strong-ppo = strong-ppo | rd-rdw | detour | (acq-po & (R*R)) |
74     (po-loc & ((M\OW)*W))
75 let ARM-strong-ppo = strong-ppo | addr | dep-rfi
76 let ppo = Alpha-strong-ppo | ARM-strong-ppo | rdw
77
78 let rfe-ppo = strong-ppo | (ARM-strong-ppo ; ppo* ; Alpha-strong-ppo)
79 let po-relass-acq-hb = po ; (rfe & (ReleaseAssign*Acquire)) ; rfe-ppo

The basic point of the ppo relations is to arrive at the set of relations providing ordering within the context of a single process. The strong-ppo relation includes a number of components:

Dependencies enforced by lockless_dereference(), rcu_dereference(), or smp_read_barrier_depends() (rd-addr);
Dependencies leading to a store, but where that store is read by a load within that same process, and which are also enforced by lockless_dereference(), rcu_dereference(), or smp_read_barrier_depends() (rd-dep-rfi);
Accesses separated by a sufficiently strong ordering fence (ordering-fence);
Pairs of atomic read-modify-write operations within the same process (atomicpo); and
Dependencies leading to writes (((dep | ctrl | addrpo) & (R*W))).

Next are the aspects of ppo limited by DEC Alpha in the Alpha-strong-ppo relation, all of which are added to strong-ppo:

Read-to-read dependencies involving only one variable whose ordering is enforced by lockless_dereference(), rcu_dereference(), or smp_read_barrier_depends(), and which also have an intervening write to this variable from some other process (rd-rdw);
A write followed by a read from the same variable within the same process, with no intervening write to that same variable within that same process, but where the read returns the value written by some other process (detour);
An acquire followed by another read by that same process ((acq-po & (R*R))); and
A pair of accesses to the same variable by the same process, but excluding pairs of writes where the first write is obscurable ((po-loc & ((M\OW)*W))).

Quick Quiz 21: You said that these were ppo relations limited by DEC Alpha, but obscurable writes are an ARM limitation. What gives?
Answer

Next are the aspects of ppo limited by ARM in the ARM-strong-ppo relation, all of which are again added to strong-ppo:

Address dependencies (addr) and
Read-to-write dependencies of any sort where a later read within that same thread returns the value written (dep-rfi).

This leads to ppo itself, which combines Alpha-strong-ppo, ARM-strong-ppo, and rdw. Recall that rdw links pairs of reads from the same variable within the same process, where there was an intervening write to that variable by some other process.

Quick Quiz 22: But ARM-strong-ppo (and thus ppo) includes addr, which links reads to later dependent reads, which DEC Alpha does not respect. How is this supposed to work?
Answer

Next, rfe-ppo allows ppo relations to be chained together (within the same process, of course).

Finally, the po-relass-acq-hb relation links pairs of accesses ordered by a single link of a release-acquire chain.

Cat File: Cumulativity

The following relations define cumulativity, which is loosely related to the concept of transitivity:

81 (* Release paired with Acquire is both A- and B-cumulative *)
82 let AB-cum-hb = strong-fence | po-relass-acq-hb
83 let A-cum-hb = AB-cum-hb | po-relass | po-rel-seq
84 let B-cum-hb = AB-cum-hb | wmb

The AB-cum-hb relation includes strong fences and release-acquire pairs, both of which come closest to providing full transitivity. The A-cum-hb relation includes AB-cum-bh, and also links memory accesses to later release operations in the same thread, and furthermore takes release sequences into account. Both of these release-related relations provide ordering with certain prior accesses by other threads, hence A-cumulativity. Finally, the B-cum-hb relation also includes AB-cum-bh, and adds writes ordered by smp_wmb(), which provides ordering with certain later accesses by other threads, hence B-cumulativity.

Cat File: Happens-Before

The next portion of the file combines the effects of dependencies, barriers and grace periods to arrive at a causally ordered happens-before (hb) relationship.

86 let hb0 = (ppo* ; Alpha-strong-ppo) | (rfe ; rfe-ppo)
87 let propbase0 = propagation-fence | (rfe? ; A-cum-hb)
88
89 let rec B-cum-propbase = (B-cum-hb ; hb* ) |
90     (rfe? ; AB-cum-hb ; hb* )
91     and propbase = propbase0 | B-cum-propbase
92     and short-obs = ((ncoe|fre) ; propbase+ ; rfe) & int
93     and obs = short-obs |
94     ((hb* ; (ncoe|fre) ; propbase* ; B-cum-propbase ; rfe) & int)
95     and hb = hb0 | (obs ; rfe-ppo)
96
97 acyclic hb as happens-before
98 irreflexive (short-obs ; Alpha-strong-ppo) as observation

The hb relation has two base cases for its mutually assured recursive definition, hb0 and propbase0.

The hb0 relation is either a sequence of intrathread ppo relations followed by an Alpha-strong-ppo or a read that returns a value from some other process's write followed by an rfe-ppo relation that links to accesses that will remain ordered with respect to the initial rfe.

The propbase0 relation is either a fence having B-cumulativity or an optional read from some other process followed by one A-cumulative step.

The general idea behind the hb relation is to build a series of steps that provide ordering, with B-cumulative steps at the beginning of the series and A-cumulative steps at the end. In some cases, it is possible to concatenate instances of such series on either side of a sufficiently strong ordering construct. Of course AB-cumulative steps can act either as B-cumulative steps or as A-cumulative steps.

Taking the components of the mutually assured recursion for hb in turn:

The B-cum-propbase relation forms the B-cumulative start of an hb sequence, beginning with either (1) a strong fence, (2) an smp_wmb(), or (3) a read satisfied by some other process's write and followed by a strong fence or a release-acquire pair. Any of these possibilities may be followed by an hb relation.
The propbase relationship combines propbase0 and recurses on B-cum-propbase.
The short-obs relation starts with a write or a read on one process, links via ncoe or fre (respectively) to another process, goes through at least one propbase step, then comes back to the original process via an fre link. As the name implies, a read executed a given process observes a chain of events headed by either a prior read or a prior write executed by that same process.
The obs relation builds on short-obs, adding a sequence similar to short-obs, but with an additional sequence of zero or more hb at the beginning and an additional B-cum-propbase before the final rfe.
Finally, the hb relation starts with hb0 relation and recurses on obs followed by rfe-ppo. The rfe-ppo allows hb to connect to another instance of itself.

The hb relation is then declared to be acyclic and the concatenation of short-obs and Alpha-strong-ppo is declared irreflexive.

Quick Quiz 23: The hb relation seems to have a lot of moving parts, and the choices seem a bit on the arbitrary side. What gives?
Answer

Cat File: Coherence Points

Even in weakly ordered systems, ordering extends somewhat beyond strict causality, for example, it includes the notion of coherence points. The corresponding relations are described below.

101 let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb*
102 let prop = (propbase & (W*W)) | strong-prop
103 let cpord = nco | prop | (strong-prop ; rfe ; acq-po)
104
105 acyclic cpord as propagation

The strong-prop relation extends the ordering of a strong fence both forwards (hb*) and backwards (fre? ; propbase* ; rfe?). The prop relation adds 2+2W-style ordering to strong-prop, and cpord combines non-obscurable writes (nco), the aforementioned prop, and a strong-prop followed by ordering induced by an acquire operation. Finally, cpord is marked acyclic.

The happens-before and coherence-points machinery can be complex, but fortunately, many common use cases take simple paths through this machinery, for example:

Strong Model Litmus Test #12

  1 C C-ISA2+o-rel+acq-rel+acq-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   smp_store_release(b, 1);
 10 }
 11
 12 P1(int *b, int *c)
 13 {
 14   int r1;
 15
 16   r1 = smp_load_acquire(b);
 17   smp_store_release(c, 1);
 18 }
 19
 20 P2(int *c, int *a)
 21 {
 22   int r2;
 23   int r3;
 24
 25   r2 = smp_load_acquire(c);
 26   r3 = READ_ONCE(*a);
 27 }
 28
 29 exists
 30 (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

After all three threads have completed, is the outcome shown on line 30 possible?

Referring to the Bell and Cat files, we see that line 8⟶9 is a member of po, line 9⟶16 is a member of rfe and furthermore is an example of (rfe & (ReleaseAssign*Acquire)), and line 16⟶17 is a member of acq-po. Because line 17 is a write, this last pair is also a member of (acq-po \ (R*R)), and thus a member of exec-order-fence, of ordering-fence, of strong-ppo, and finally of rfe-ppo. But any member of the sequence po;(rfe & (ReleaseAssign*Acquire));rfe-ppo (that is, line 8⟶9⟶16⟶17, which is line 8⟶17) is also a member of po-relass-acq-hb and thus of AB-cum-hb, and also of B-cum-hb, and in turn of B-cum-propbase and finally of propbase.

Line 26⟶8 is a member of fre, as noted in the paragraph above, line 8⟶17 is a member of propbase, and line 17⟶25 is a member of rfe. Therefore, line 26⟶25 is a member of short-obs.

Line 26⟶26 is a member of acq-po, and because both are reads, also a member of (acq-po & (R*R)), and thus of Alpha-strong-ppo.

Finally, because line 26⟶25 is a member of short-obs and because line 25⟶26 is a member of Alpha-strong-ppo, we have a cycle linking line⟶26 to itself. This cycle is prohibited by an irreflexive statement, so the exists clause of Strong Model Litmus Test #12 cannot be satisfied.

This is confirmed by running the command:

herd7 -conf strong.cfg C-ISA2+o-rel+acq-rel+acq-o.litmus

Which produces the following output:

Outcome for Strong Model Litmus Test #12

 1 Test C-ISA2+o-rel+acq-rel+acq-o Allowed
 2 States 7
 3 1:r1=0; 2:r2=0; 2:r3=0;
 4 1:r1=0; 2:r2=0; 2:r3=1;
 5 1:r1=0; 2:r2=1; 2:r3=0;
 6 1:r1=0; 2:r2=1; 2:r3=1;
 7 1:r1=1; 2:r2=0; 2:r3=0;
 8 1:r1=1; 2:r2=0; 2:r3=1;
 9 1:r1=1; 2:r2=1; 2:r3=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
14 Observation C-ISA2+o-rel+acq-rel+acq-o Never 0 7
15 Hash=9762857b08e4db85dbbf52a7b43068e9

The “Never 0 7” should be reassuring, given that this cycle is analogous a series of lock releases and acquires, which had jolly well better be fully ordered!

Let's now look at a roughly similar example:

Strong Model Litmus Test #13

  1 C C-W+WRC+o-rel+acq-o+o-mb-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   WRITE_ONCE(*a, 1);
  9   smp_store_release(b, 1);
 10 }
 11
 12 P1(int *b, int *c)
 13 {
 14   int r1;
 15   int r2;
 16
 17   r1 = smp_load_acquire(b);
 18   r2 = READ_ONCE(*c);
 19 }
 20
 21 P2(int *c, int *a)
 22 {
 23   int r3;
 24
 25   WRITE_ONCE(*c, 1);
 26   smp_mb();
 27   r3 = READ_ONCE(*a);
 28 }
 29
 30 exists
 31 (1:r1=1 /\ 1:r2=0 /\ 2:r3=0)

After all three threads have completed, is the result shown on line 31 possible?

A key point is that line 18⟶25 and line 18⟶8 are both members of fre. Furthermore, the only member of rfe is line 9⟶17. This means that there cannot be a cycle in hb, which requires at least one rfe for each fre in the cycle. The same observation applies to (short-obs ; Alpha-strong-ppo). In addition, although cpord does not require an rfe for each fre, it does require a strong-fence for each fre (via strong-prop, and there is only one smp_mb(), which can cover only one of the two fre relations.

Can we somehow combine hb (for example, via propbase) and cpord (via strong-prop) to pair an rfe with one of the fre relations and a strong-fence with the other? The answer is “no” because the only relations feeding into propbase that include fre must start and end within the same process, and cannot have more fre relations than rfe relations on that path.

This should not be too surprising, given the non-causal nature of fre relations, and can be confirmed by running the following command line:

herd7 -conf strong.cfg C-W+WRC+o-rel+acq-o+o-mb-o.litmus

Which results in the following output:

Outcome for Strong Model Litmus Test #13

 1 Test C-W+WRC+o-rel+acq-o+o-mb-o Allowed
 2 States 8
 3 1:r1=0; 1:r2=0; 2:r3=0;
 4 1:r1=0; 1:r2=0; 2:r3=1;
 5 1:r1=0; 1:r2=1; 2:r3=0;
 6 1:r1=0; 1:r2=1; 2:r3=1;
 7 1:r1=1; 1:r2=0; 2:r3=0;
 8 1:r1=1; 1:r2=0; 2:r3=1;
 9 1:r1=1; 1:r2=1; 2:r3=0;
10 1:r1=1; 1:r2=1; 2:r3=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
15 Observation C-W+WRC+o-rel+acq-o+o-mb-o Sometimes 1 7
16 Hash=8e3c5d7d5d36f2b1484ff237e8d22f91

However, full barriers (smp_mb()) can be used to force the Linux kernel to respect full non-causal ordering, and this is the main job of the “cpord” relationship. To see this, consider the following store-buffering litmus test shown in Strong Model Litmus Test #1.

Can the cyclic outcome “!r1&&!r2” called out in line 25 really happen?

Line 10⟶12 and 19⟶21 are members of mb and thus of strong-fence, while line 12⟶19 and 21⟶10 are members of fre. This means that line 21⟶10⟶12 and line 12⟶19⟶21 are members of strong-prop, and thus of prop and cpord. This pair of cpord relations forms a cycle, and cpord is constrained to be acyclic. Therefore, the exists clause cannot be satisfied.

This is confirmed by the following command:

herd7 -conf strong.cfg C-SB+o-mb-o+o-mb-o.litmus

This command produces the following output:

Outcome for Strong Model Litmus Test #1

 1 Test C-SB+o-mb-o+o-mb-o Allowed
 2 States 3
 3 0:r1=0; 1:r2=1;
 4 0:r1=1; 1:r2=0;
 5 0:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (0:r1=0 /\ 1:r2=0)
10 Observation C-SB+o-mb-o+o-mb-o Never 0 3
11 Hash=a61f698662bb72c2ed1755812580d385

Quick Quiz 24: Is there an easy way to tell which definitions have effect for a given litmus test?
Answer

Quick Quiz 25: Why does “cpord” prohibit a cycle containing two “fre” relationships when “hb” does not? They are both acyclic, after all!
Answer

Cat File: RCU

The previous section showed how smp_mb() can restore sequential consistency. However, as Jade noted, synchronize_rcu() is even stronger still, and therefore requires even more Cat-file code. The final portion of the Cat file therefore covers RCU relationships.

Quick Quiz 26: Say what??? How can anything possibly be stronger than sequential consistency???
Answer

RCU's fragment of the Cat file is as follows:

108 (* Propagation between strong fences *)
109 let rcu-order = hb* ; (rfe ; acq-po)? ; cpord* ; fre? ; propbase* ; rfe?
110
111 (* Chains that can prevent the RCU grace-period guarantee *)
112 let gp-link = sync ; rcu-order
113 let cs-link = po? ; crit^-1 ; po? ; rcu-order
114 let rcu-path0 = gp-link |
115   (gp-link ; cs-link) |
116   (cs-link ; gp-link)
117 let rec rcu-path = rcu-path0 |
118   (rcu-path ; rcu-path) |
119   (gp-link ; rcu-path ; cs-link) |
120   (cs-link ; rcu-path ; gp-link)
121
122 irreflexive rcu-path as rcu

The rcu-order relation interfaces the RCU model to the rest of the memory model, and defines what can order RCU read-side critical sections and RCU grace periods from each other. This relation can be roughly thought of as an arbitrarily long set of sequences of events providing B-cumulativity and then A-cumultivity, with intervening events providing strong ordering.

Quick Quiz 27: But rcu-order could be the empty relationship, so that it would directly connect what preceded it with what followed it. How can that be right?
Answer

The gp-link relation is an RCU grace period followed by some sequence of events that provide sufficient ordering, and the cs-link relation is an RCU read-side critical section followed by some sequence of events that provide sufficient ordering. Note that the cs-link relation allows any access preceding an RCU read-side critical section in that same thread to be used as evidence that an earlier grace period is ordered before the critical section, and vice versa. The importance of this is shown by the following litmus test:

Strong Model Litmus Test #14

  1 C C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   int r1;
  9
 10   r1 = READ_ONCE(*a);
 11   synchronize_rcu();
 12   WRITE_ONCE(*b, 1);
 13 }
 14
 15 P1(int *b, int *c)
 16 {
 17   int r2;
 18
 19   rcu_read_lock();
 20   r2 = READ_ONCE(*b);
 21   WRITE_ONCE(*c, 1);
 22   rcu_read_unlock();
 23 }
 24
 25 P2(int *c, int *d)
 26 {
 27   int r3;
 28
 29   r3 = READ_ONCE(*c);
 30   rcu_read_lock();
 31   // do_something_else();
 32   rcu_read_unlock();
 33   WRITE_ONCE(*d, 1);
 34 }
 35
 36 P3(int *d, int *a)
 37 {
 38   int r4;
 39
 40   r4 = READ_ONCE(*d);
 41   synchronize_rcu();
 42   WRITE_ONCE(*a, 1);
 43 }
 44
 45 exists
 46 (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1)

The normal usage of cs-link is illustrated by P1(). The cs-link definition could start at line 20, take a po step to the rcu_read_unlock() on line 13, step back to the rcu_read_lock() on line 19, and finally a po step to line 21. This implements the rule: “If any part of an RCU read-side critical section follows anything after a given RCU grace period, then the entirety of that critical section follows anything preceding that grace period”, where the preceding grace period is the one in P0().

The more expansive usage is illustrated by P2(). The cs-link definition could start at line 29, take a po step to the rcu_read_unlock() on line 32, then a crit^-1, step back to the rcu_read_lock() on line 30, and finally a po step to line 33. This allows cs-link (in conjunction with rcu-order) to link the access on line 21 of P1() with the access on line 40 of P3().

Without this more expansive definition of cs-link, the questionable outcome r1&&r2&&r3&&r4 is permitted, which it is not, as can be seen by running:

herd7 -conf strong.cfg C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o.litmus

This gives the reassuring output:

Outcome for Strong Model Litmus Test #14

 1 Test C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o Allowed
 2 States 15
 3 0:r1=0; 1:r2=0; 2:r3=0; 3:r4=0;
 4 0:r1=0; 1:r2=0; 2:r3=0; 3:r4=1;
 5 0:r1=0; 1:r2=0; 2:r3=1; 3:r4=0;
 6 0:r1=0; 1:r2=0; 2:r3=1; 3:r4=1;
 7 0:r1=0; 1:r2=1; 2:r3=0; 3:r4=0;
 8 0:r1=0; 1:r2=1; 2:r3=0; 3:r4=1;
 9 0:r1=0; 1:r2=1; 2:r3=1; 3:r4=0;
10 0:r1=0; 1:r2=1; 2:r3=1; 3:r4=1;
11 0:r1=1; 1:r2=0; 2:r3=0; 3:r4=0;
12 0:r1=1; 1:r2=0; 2:r3=0; 3:r4=1;
13 0:r1=1; 1:r2=0; 2:r3=1; 3:r4=0;
14 0:r1=1; 1:r2=0; 2:r3=1; 3:r4=1;
15 0:r1=1; 1:r2=1; 2:r3=0; 3:r4=0;
16 0:r1=1; 1:r2=1; 2:r3=0; 3:r4=1;
17 0:r1=1; 1:r2=1; 2:r3=1; 3:r4=0;
18 No
19 Witnesses
20 Positive: 0 Negative: 15
21 Condition exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1)
22 Observation C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o Never 0 15
23 Hash=c792a4c620a9d5244c0bee80da2a90fa

In short, if anything within or preceding a given RCU read-side critical section follows anything after a given RCU grace period, then it is probably best if that entire RCU read-side critical section follows anything preceding the grace period, and vice versa.

The rcu-path0 (RCU path base case) relation defines the three basic ways that RCU provides ordering:

A single synchronize_rcu() invocation, which in theory may be substituted for smp_mb(). (In practice, good luck with instances of smp_mb() in preempt-disabled regions of code, to say nothing of the disastrous degradation of performance.)
A synchronize_rcu() that is ordered before an RCU read-side critical section. This commonly used case guarantees that if some RCU read-side critical section extends beyond the end of a grace period, then all of that RCU read-side critical section happens after anything preceding that grace period. In other words, if any part of the critical section might happen after the kfree(), all of that critical section will happen after the corresponding list_del_rcu(). This case groups the RCU grace period in P0() and the RCU read-side critical section in P1() in the example above.
An RCU read-side critical section that is ordered before a synchronize_rcu(). This commonly used case guarantees that if some RCU read-side critical section extends before the beginning of a grace period, then all of that RCU read-side critical section happens before anything following that grace period. In other words, if any part of the critical section might happen before the list_del_rcu(), all of that critical section will happen before the corresponding the kfree(). This case groups the the RCU read-side critical section in P2() and RCU grace period in P3() in the example above.

The recursive definition of rcu-path builds on the rcu-path0 base case. Then (rcu-path;rcu-path) states that if any two sequences of RCU grace periods and read-side critical sections provide ordering, then the concatenation of those two sequences also provides ordering, and applies to the P0()-P1() and P2()-P3() groups in the example above, thus guaranteeing that the questionable outcome r1&&r2&&r3&&r4 is forbidden. On the other hand, the (gp-link;rcu-path;cs-link) states that if some sequence of RCU grace periods and read-side critical sections provides ordering, then ordering is still provided when that sequence is preceded by synchronize_rcu() and followed by an RCU read-side critical section. Finally, line 13's (cs-link;rcu-path;gp-link) states that if some sequence of RCU grace periods and read-side critical sections provides ordering, then ordering is still provided when that sequence is preceded by an RCU read-side critical section and followed by synchronize_rcu().

The irreflexive statement prohibits rcu-path from looping back on itself, in other words, this statement requires rcu-path to provide ordering.

Another way of thinking of rcu-path is of a counter and comparison, implemented recursively. If there are at least as many calls to synchronize_rcu() as there are RCU read-side critical sections in a given rcu-path, ordering is guaranteed, otherwise not.

Let's use this machinery to analyze the prototypical RCU-deferred-free scenario:

Strong Model Litmus Test #15

  1 C C-LB+rl-deref-o-rul+o-sync-o.litmus
  2
  3 {
  4   a=x;
  5 }
  6
  7 P0(int **a)
  8 {
  9   int *r1;
 10   int r2;
 11
 12   rcu_read_lock();
 13   r1 = rcu_dereference(*a);
 14   r2 = READ_ONCE(*r1);
 15   rcu_read_unlock();
 16 }
 17
 18 P1(int **a, int *x, int *y)
 19 {
 20   WRITE_ONCE(*a, y);
 21   synchronize_rcu();
 22   WRITE_ONCE(*x, 1);  /* Emulate kfree(). */
 23 }
 24
 25 exists
 26 (0:r1=x /\ 0:r2=1)

The variable a initially references the variable x, which is initially zero. The P1() function sets variable a to reference the variable y (also initially zero), then sets the value of x to 1 to emulate the effects of kfree(). Any RCU reader accessing and dereferencing a should therefore see the value zero, so that the outcome r2 should be forbidden. In other words, we would expect the cycle 20⟶22⟶14⟶15⟶12⟶13⟶20 to be forbidden. Let's check!

Lines 12⟶15 is a crit relationship, while lines 20⟶22 is a sync relationship. If the cycle is allowed, Lines 13⟶20 form an fre relationship and lines 22⟶14 form an rfe relationship. This means that lines 13⟶20 and lines 22⟶14 are also rcu-order relationships. This means that the series 20⟶22⟶14 is a gp-link relationship.

Given that lines 14⟶15 and 12⟶13 are po relationships, the series 14⟶15⟶12⟶13⟶20 is a cs-link relationship. We therefore have an gp-link relationship followed by a cs-link (or vice versa), so that the series 20⟶22⟶14⟶15⟶12⟶13⟶20 is an rcu-path0 relationship, which means that this same series is also an rcu-path relationship. Because it ends where it starts, on line 20, it is reflexive, and thus forbidden. The following command confirms this:

herd7 -conf strong.cfg C-LB+rl-deref-o-rul+o-sync-o.litmus

This command produces the following output:

Outcome for Strong Model Litmus Test #15

 1 Test C-LB+rl-deref-o-rul+o-sync-o Allowed
 2 States 2
 3 0:r1=x; 0:r2=0;
 4 0:r1=y; 0:r2=0;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (0:r1=x /\ 0:r2=1)
 9 Observation C-LB+rl-deref-o-rul+o-sync-o Never 0 2
10 Hash=4cac9d9e7ffa84096d8869e1ab199f09

Therefore, the RCU read-side critical section in P0() cannot see the emulated kfree() following P1()'s grace period, which should be some comfort to users of RCU.

Quick Quiz 28: Why does Strong Model Litmus Test #15's exists clause specify 0:r1=x? Isn't the second clause (0:r2=1) forbidden in and of itself?
Answer

But suppose we add another RCU read-side critical section to the mix, in the following somewhat inane but hopefully instructive example?

Strong Model Litmus Test #16

  1 C C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu.litmus
  2
  3 {
  4   a=x;
  5 }
  6
  7 P0(int **a)
  8 {
  9   int *r1;
 10   int r2;
 11
 12   rcu_read_lock();
 13   r1 = rcu_dereference(*a);
 14   r2 = READ_ONCE(*r1);
 15   rcu_read_unlock();
 16 }
 17
 18 P1(int **a, int *y, int *z)
 19 {
 20   WRITE_ONCE(*a, y);
 21   synchronize_rcu();
 22   WRITE_ONCE(*z, 1);
 23 }
 24
 25 P2(int *x, int *z)
 26 {
 27   int r3;
 28
 29   rcu_read_lock();
 30   r3 = READ_ONCE(*z);
 31   WRITE_ONCE(*x, 1);  /* Emulate kfree(). */
 32   rcu_read_unlock();
 33 }
 34
 35 exists
 36 (0:r1=x /\ 0:r2=1 /\ 2:r3=1)

Can the outcome r2 happen now?

Lines 12⟶15 and 29⟶32 are crit relationships, while Lines 20⟶22 is a sync relationship. Lines 22⟶30 and 31⟶14 are rfe relationships and lines 13⟶20 are an fre, which means that all are also rcu-order relationships. This means that the series 20⟶22⟶30 is a gp-link relationship.

Given that lines 14⟶15 and 12⟶13 are po relationships, the series 14⟶15⟶12⟶13⟶20 is a cs-link relationship. Similarly, because lines 30⟶32 and 29⟶31 are po relationships, the series 30⟶32⟶29⟶31⟶14 is also a cs-link relationship.

We therefore have one cs-link relationship followed by a gp-link relationship, which in turn is followed by another cs-link relationship. The cs-link relationship 14⟶15⟶12⟶13⟶20 can combine with the gp-link relationship 20⟶22⟶30 to form the rcu-path0 relationship 14⟶15⟶12⟶13⟶20⟶22⟶30. However, there is no way to add the remaining cs-link relationship 30⟶32⟶29⟶31⟶14, so the cycle resulting in r2 can in fact happen. This is confirmed by the command:

herd7 -conf strong.cfg C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu.litmus

Which produces the output:

Outcome for Strong Model Litmus Test #16

 1 Test C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu Allowed
 2 States 6
 3 0:r1=x; 0:r2=0; 2:r3=0;
 4 0:r1=x; 0:r2=0; 2:r3=1;
 5 0:r1=x; 0:r2=1; 2:r3=0;
 6 0:r1=x; 0:r2=1; 2:r3=1;
 7 0:r1=y; 0:r2=0; 2:r3=0;
 8 0:r1=y; 0:r2=0; 2:r3=1;
 9 Ok
10 Witnesses
11 Positive: 1 Negative: 5
12 Condition exists (0:r1=x /\ 0:r2=1 /\ 2:r3=1)
13 Observation C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu Sometimes 1 5
14 Hash=b591d622245952a2fc8eaad233203817

This should be no surprise, given that we have more RCU read-side critical sections than we have grace periods. This situation underscores the need to avoid doing inane things with RCU. However, one nice thing about a memory model incorporating RCU is that such inanity can now be detected, at least when it is confined to relatively small code fragments.

Acknowledgments

We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew, Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their patient explanations of their respective systems' memory models. We are indebted to Peter Sewell, Sumit Sarkar, and their groups for their seminal work formalizing many of these same memory models. We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for their help making this human-readable. We are also grateful to Michelle Rankin and Jim Wasko for their support of this effort.

This work represents the views of the authors and does not necessarily represent the views of University College London, INRIA Paris, Scuola Superiore Sant'Anna, Harvard University, or IBM Corporation.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: But couldn't a CPU designer create a memory subsystem that did allow writes to be taken back?

Answer: Maybe someday they will. However, such a CPU would still need to provide ordering, and if it provided ordering similar to current CPU families, it is quite possible that our current models would simply consider the portion of the memory that allows prior writes to be taken back to be part of the CPU rather than part of the memory subsystem.

Back to Quick Quiz 1.

Quick Quiz 2: Why can't CPU designers use speculation to hide the slowness of strong barriers?

Answer: They can and they do. However, they must take care when doing so, because the hardware absolutely must respect the barriers' semantics. This requires complex circuitry to detect cases where speculation must be squashed, and can impose additional delays when rerunning the speculated instructions. So nothing comes for free, but speculation is indeed a powerful tool that is heavily used by high-performance CPUs.

Back to Quick Quiz 2.

Quick Quiz 3: Given that Itanium can reorder reads to the same variable, how can the Linux-kernel memory model reasonably prohibit such reordering?

Answer: The memory model assumes that READ_ONCE() is used for reads to shared variables. (Recall that if the developer chooses to use raw reads, it is the developer's responsibility to verify that neither the compiler nor the hardware can invalidate the developer's algorithm.) READ_ONCE() uses volatile reads, for which Itanium compilers emit ld,acq instructions. Itanium does maintain order for ld,acq instructions, so if you use READ_ONCE(), all will be well.

Back to Quick Quiz 3.

Quick Quiz 4: Isn't this single coherence point a huge bottleneck on large systems?

Answer: Not necessarily. Although the model is presented in terms of a single coherence point, the only real requirement is that there be a single coherence point for a given variable at a given point in time. This requirement is satisified by the cache line holding that variable, which can move around as needed. In addition, this means that the system can enjoy the scalability of a coherence point per cache line, which should suffice even for the very largest of systems.

But what if all the CPUs are writing to a single variable?

This does not sound like particularly enlightened software design, but a scalable hardware implementation simply relies on the store buffers. Given that there is one store buffer per CPU (or at least per core), this should not limit scalability, give or take constraints imposed by memory ordering.

Back to Quick Quiz 4.

Answer: It can't, but there is no need for a Big System Lock. After all, it is not time that matters, but rather coherence order. To see this, consider the following time-ordered sequence of events:

CPU 0 executes the read portion of its RMW operation to variable x.
CPU 1 writes to x.
CPU 0 executes the write portion of its RMW operation.

The system has the option of placing CPU 1's write to x after CPU 0's write to x in coherence order, which avoids the need for tight (and thus expensive) coordination between the CPUs, while still preserving the atomic nature of the RMW operation.

Another option is for the memory subsystem to cause CPU 0's write to fail in this circumstance. On architectures that do this, RMW operations have to be implemented as a loop, where the CPU goes back to the initial read if the write portion is not successful.

Back to Quick Quiz 5.

Quick Quiz 6: The terms “A-cumulativity” and “B-cumulativity” aren't particularly mnemonic, are they?

Answer: They are perfectly mnemonic. The ‘A’ stands for the French word avant, which translates to “before”, and it is the A-cumulative barriers' pre-sets that are modified by A-cumulativity. The ‘B’ stands for the Swahili word baada, which translates to “after”, and it is the B-cumulative barriers' post-sets that are modified by B-cumulativity.

Or you could just remember that ‘A’ precedes ‘B’ in the Latin alphabet.

Back to Quick Quiz 6.

Quick Quiz 7: By symmetry, shouldn't a B-cumulative barrier's post-set include all writes that propagate to the barrier's CPU after the barrier is committed?

Answer: Symmetry is in the eye of the beholder. Beholders who care about hardware performance and scalability (or, for that matter, clean semantics) will prefer the definition in the list over the definition in this Quick Quiz's question. Working out why is left as an exercise for the reader. [Hint: What should happen if a write on one CPU commits before a barrier on another CPU, but the write doesn't propagate to the other CPU until after the barrier has committed?]

Back to Quick Quiz 7.

Quick Quiz 8: Given the steadily increasing number of transistors, why couldn't a CPU analyze code to detect at least some classes of “x == x” comparisons?

Answer: Several existing CPU architectures guarantee that even trivial conditionals will provide ordering, so they have no reason to perform such an analysis. However, a new CPU family might well carry out such analyses. If this happened, this CPU family would probably need to provide an option to suppress trivial-conditional detection in order to correctly execute pre-existing code. But if this hardware optimization came into being and proved sufficiently valuable, it might be necessary to adjust accordingly.

Back to Quick Quiz 8.

Quick Quiz 9: Given all these constraints, how can weak-memory CPUs possibly expect to attain any benefits of any sort compared to strong-memory CPUs?

Answer: This argument between proponents of strong and weak memory ordering has been going on for some decades, so we do not expect to be able to settle it here.

Back to Quick Quiz 9.

Quick Quiz 10: Given how important split caches are for attaining full performance on superscalar CPUs, why don't any non-Alpha architectures have split caches?

Answer: Other architectures do have split caches. However, these other architectures also have additional circuitry that preserves dependencies among accesses to different banks of their split caches.

Back to Quick Quiz 10.

Quick Quiz 11: If a later read instruction can be satisfied before a prior load-acquire instruction, how can the load-acquire possibly order the two instructions?

Answer: Ordering of the two instructions is guaranteed by the fact that the memory subsystem will not allow the read request to pass the load-acquire request.

Back to Quick Quiz 11.

Quick Quiz 12: Why weren't adjustments needed for PowerPC, given that it has a weak memory model?

Answer: The reason that no adjustments were needed for PowerPC was that we started with PPCMEM's Power memory model. Alternatively, one could argue that all necessary adjustments for PowerPC were made at the very beginning of this effort. For those keeping score, the fact that PowerPC's release-acquire chains do not provide full ordering does cause significant heartburn in some circles.

Back to Quick Quiz 12.

Quick Quiz 13: But what if some new CPU had an even weaker memory model than Alpha, ARM, and PowerPC? Mightn't that invalidate a lot of Linux-kernel code?

Answer: In theory, it might. In practice, we expect that the Linux kernel community would be highly motivated to include memory-barrier instructions in the new CPU's arch-specific code so as to minimize (or perhaps even eliminate) core Linux-kernel modifications. We also hope that this prospect will encourage future CPU designers and architects to avoid the need for excessive numbers of memory-barrier instructions.

Back to Quick Quiz 13.

Quick Quiz 14: Given that this is about memory barriers, why “instructions F[Barriers]” instead of perhaps “instructions B[Barriers]”?

Answer: “Memory barriers” are also sometimes called “memory fences”. This can be confusing, but both terms are used so we might as well get used to it. Besides, the “B” instruction class was already reserved for Branches.

Back to Quick Quiz 14.

Quick Quiz 15: Why wouldn't “let sync = fencerel(Sync)” work just as well as the modified definition?

Answer: The modified definition is necessary because the model needs to recognize that code like:

WRITE_ONCE(*x, 1);
synchronize_rcu();
synchronize_rcu();
r2 = READ_ONCE(*y);

will insert two grace periods between the memory accesses, not just one. With the modified definition, there is a “sync” pair linking the WRITE_ONCE() to the first synchronize_rcu() as well as a pair linking that event to the READ_ONCE(), so it is possible to pass from the write to the read via two links. With the “let sync = fencerel(Sync)” definition, there would be no link from the WRITE_ONCE() to the first synchronize_rcu(). Consequently there would be a path from the write to the read involving one link, but no path involving two.

Back to Quick Quiz 15.

Quick Quiz 16: This strong model is insanely complex!!! How can anyone be expected to understand it???

Answer: Given that this model is set up to be as strong as reasonably possible given the rather wide variety of memory models that the Linux kernel runs on, it is actually surprisingly simple. Furthermore, this model has a tool that goes with it, which is more than can be said of memory-barriers.txt.

Nevertheless, it is quite possible that this model should be carefully weakened, if it turns out that doing so simplifies the model without invalidating any use cases. Simpler but weaker models can be found here and here.

Back to Quick Quiz 16.

Quick Quiz 17: Why aren't these additional classes of fences in the Bell file where the other classes live?

Answer: Because exec-order-fence, propagation-fence, and ordering-fence aren't needed by either the weak2 or weak3 model, both of which share the Bell file.

Back to Quick Quiz 17.

Quick Quiz 18: Shouldn't rel-seq-candidate intersect with co in order to avoid spurious release sequences involving multiple variables?

Answer: There is no need; it is implied by the subsequent intersection of rel-seq-candidate with rel-seq ; co-successor in the recursive definition of rel-seq.

Back to Quick Quiz 18.

Answer: @@@ No idea. Alan? @@@

Back to Quick Quiz 19.

Quick Quiz 20: Why would an operation following an address dependency get any special treatment? After all, there does not appear to be any particular ordering relationship in the general case.

Answer: It turns out that Power guarantees that writes following an address-dependency pair are guaranteed not to be reordered before the load heading up the dependency pair, as can be seen from this load-buffering litmus test and its output (note the “Never” on the last lint) and from this message-passing litmus test and its output.

Why would Power provide such ordering to an unrelated store? Because until the load completes, Power has no idea whether or not it is unrelated. If the load returns the same address that is used by the “unrelated” store, then the two stores are no longer unrelated, and the CPU must provide coherence ordering between them. But the CPU can't know what ordering requirements there might be until the load completes, so all later writes must wait until the load completes.

But what about loads? Don't they have the same coherency requirements?

Indeed they do, but the CPU can safely speculate such loads, squashing the speculation if it later learns that there was an unexpected address collision. For more information on this dependency/coherence corner case, please see section 10.5 of A Tutorial Introduction to the ARM and POWER Relaxed Memory Models. Other sections cover many other interesting corner cases.

There is also the possibility that the compiler might know all the values assigned to the variable loaded via rcu_dereference() or lockless_dereference().

Back to Quick Quiz 20.

Quick Quiz 21: You said that these were ppo relations limited by DEC Alpha, but obscurable writes are an ARM limitation. What gives?

Answer: @@@ Beats me. Alan? (My guess is that some aspects are not neatly separable between DEC Alpha and ARM, but I must defer to you.)

Back to Quick Quiz 21.

Quick Quiz 22: But ARM-strong-ppo (and thus ppo) includes addr, which links reads to later dependent reads, which DEC Alpha does not respect. How is this supposed to work?

Answer: @@@ Beats me. Alan?

Back to Quick Quiz 22.

Quick Quiz 23: The hb relation seems to have a lot of moving parts, and the choices seem a bit on the arbitrary side. What gives?

Answer: This is the strong model, which intentionally trades away simplicity to get added strength. The added strength implies added complexity because the strong model is necessarily limited by the various weaknesses of the hardware that the Linux kernel runs on.

Simpler but weaker models can be found here and here.

Back to Quick Quiz 23.

Quick Quiz 24: Is there an easy way to tell which definitions have effect for a given litmus test?

Answer: One very straightforward approach is to edit the .cat and .bell files to remove “acyclic” or “irreflexive” statements. For example, for the above store-buffering litmus test, removing the “acyclic cpord as propagation” allows the cyclic outcome.

Alternatively, you can pass the “-skipcheck propagation” argument-line argument to herd7. However, editing the .bell and .cat files to omit different elements can be an extremely educational activity.

Back to Quick Quiz 24.

Quick Quiz 25: Why does “cpord” prohibit a cycle containing two “fre” relationships when “hb” does not? They are both acyclic, after all!

Answer: The difference is that “hb” requires that any path including an “fre” relationship begin and end at the same thread. Therefore, no matter how you string “hb” relationships together, they cannot prohibit a cycle that goes through two “fre” relationship before returning to the original thread, and thus cannot prohibit the store-buffering litmus test. In contrast, the “strong-prop” relationship that leads up to the “cpord” relationship makes no same-thread restriction, which means that “cpord” can forbid a cycle containing more than one “fre” relationship.

Back to Quick Quiz 25.

Quick Quiz 26: Say what??? How can anything possibly be stronger than sequential consistency???

Answer: Easily.

To see this, recall the store-buffering example from the previous section, in which smp_mb() prevented any executions that were not simple interleavings, in other words, it prohibits the cyclic outcome “!r1&&!r2”. If we replace the first smp_mb() with synchronize_rcu(), replace the second smp_mb() with with an RCU read-side critical section, and reverse P1()'s memory references, we get the following:

Strong Model Litmus Test #17

  1 C C-LB+o-sync-o+rl-o-o-rul.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   int r1;
  9
 10   r1 = READ_ONCE(*a);
 11   synchronize_rcu();
 12   WRITE_ONCE(*b, 1);
 13 }
 14
 15 P1(int *b, int *a)
 16 {
 17   int r2;
 18
 19   rcu_read_lock();
 20   r2 = READ_ONCE(*b);
 21   WRITE_ONCE(*a, 1);
 22   rcu_read_unlock();
 23 }
 24
 25 exists
 26 (0:r1=1 /\ 1:r2=1)

It turns out that synchronize_rcu() is so strong that it is able to forbid the cyclic outcome “r1&&r2” even though P1() places no ordering constraints whatsoever on its two memory references.

Now that is strong ordering!

There is of course no free lunch. On systems having more than one CPU, the overhead of synchronize_rcu() is orders of magnitude greater than that of smp_mb(). You get what you pay for!

Back to Quick Quiz 26.

Quick Quiz 27: But rcu-order could be the empty relationship, so that it would directly connect what preceded it with what followed it. How can that be right?

Answer: It is not just right, but absolutely necessary. This permits a pair of consecutive grace periods to do the right thing. For example, consider the following litmus test, where, as usual, a, b, and c are initially all zero:

Strong Model Litmus Test #18

  1 C C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul.litmus
  2
  3 {
  4 }
  5
  6 P0(int *a, int *b)
  7 {
  8   int r1;
  9
 10   r1 = READ_ONCE(*a);
 11   synchronize_rcu();
 12   synchronize_rcu();
 13   WRITE_ONCE(*b, 1);
 14 }
 15
 16 P1(int *b, int *c)
 17 {
 18   int r2;
 19
 20   rcu_read_lock();
 21   r2 = READ_ONCE(*b);
 22   WRITE_ONCE(*c, 1);
 23   rcu_read_unlock();
 24 }
 25
 26 P2(int *c, int *a)
 27 {
 28   int r3;
 29
 30   rcu_read_lock();
 31   r3 = READ_ONCE(*c);
 32   WRITE_ONCE(*a, 1);
 33   rcu_read_unlock();
 34 }
 35
 36 exists
 37 (0:r1=1 /\ 1:r2=1 /\ 2:r3=1)

If rcu-order did not permit an empty relationship, the pair of synchronize_rcu() invocations on lines 4 and 5 would not be serialized, but would instead effectively merge into a single synchronize_rcu(). Thus, the possibility of an empty rcu-order is absolutely required to forbid the undesirable outcome r1&&r2&&r3:

Outcome for Strong Model Litmus Test #18

 1 Test C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul Allowed
 2 States 7
 3 0:r1=0; 1:r2=0; 2:r3=0;
 4 0:r1=0; 1:r2=0; 2:r3=1;
 5 0:r1=0; 1:r2=1; 2:r3=0;
 6 0:r1=0; 1:r2=1; 2:r3=1;
 7 0:r1=1; 1:r2=0; 2:r3=0;
 8 0:r1=1; 1:r2=0; 2:r3=1;
 9 0:r1=1; 1:r2=1; 2:r3=0;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1)
14 Observation C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul Never 0 7
15 Hash=44ee0f607659a74ea40149d1ca3d80f5

Back to Quick Quiz 27.

Quick Quiz 28: Why does Strong Model Litmus Test #15's exists clause specify 0:r1=x? Isn't the second clause (0:r2=1) forbidden in and of itself?

Answer: Try it and see what happens!

Back to Quick Quiz 28.