

# A Fault-Tolerant Alternative to Lockstep Triple Modular Redundancy

#### portuged state university

#### Andrew L. Baldwin, BS '09, MS '12 W. Robert Daasch, Professor Integrated Circuits Design and Test Laboratory







# **Problem Statement**

- In a fault tolerant system containing three redundant processing mitigate the effects of a single faulty PE.
- When more than one PE is faulty Triple Modular Redundancy can select a faulty output.
- Time distributed voting (TDV) proposes an alternative to TMR to extend fault coverage when multiple PE's are faulty.



# Outline

- Contributions
- Background
- Methods: Time Distributed Voting (TDV)
- Results
- Conclusion and Recommendations



### Contributions

- Time Distributed Voting (TDV) extends coverage in active fault tolerant systems
- CAM based Verilog HDL TDV prototype:
  - Finds voting opportunities by detecting data stream commonalities
  - Aligns PE result for voting execution
- Characterization of aliasing in the ISCAS '85 C6288 benchmark



# **Background - Faults**

- A fault is any upset that modifies a circuit to the point of failure.
- Faults may be caused by:
  - Random Defects Introduced during fabrication.
    May be detectable at test or latent
  - Soft errors occur online, may be recovered or reset
  - Hard errors occur online, may cause permanent damage to the circuit

IC Design & Test Laboratory Random Defects



# Defects failures may be detectable at test



- As minimum feature size gets smaller, the circuit becomes sensitive to smaller defects.
- Smaller defects occur at a greater frequency.
- A .25um defect will occur 8 times more frequently than a .5um defect.  $\frac{1}{2}$
- Relative frequency approximation.<sup>x<sup>3</sup></sup>

# May fail during the product's useful lifetime.



0.4

16 April 2012

Defect size x (µm)

0.5

0.6

Relative frequency

0.0

Xo 0.3

0.7



# **Online Soft and Hard Errors**

- Soft Errors
  - A change to the state of a device or transient
  - Caused by a heavy ionizing particle, cosmic ray, proton, etc.
  - No permanent damage, recovered by reset

- Hard Errors
  - Burnout
  - Latch up
  - Electro Migration
  - Permanently damages the device



## **Fault Mitigation**

- Manufacturers employ techniques to improve yield and reliability in the presence of faults.
- Fault tolerant designs may preserve the functionality of the system when a component fails.
- Passive fault tolerance: masking erroneous results while leaving the faulty circuit in the system
- Active fault tolerance: identifies faulty circuits and removes or replaces them in the system
- Triple modular redundancy (TMR) is a passive fault tolerant technique

# Triple Modular Redundancy (TMR)

- Three redundant processing elements operate same input
- PE results are evaluated in a voting algorithm
  - Majority result is the system output
- Any single erroneous result masked by the majority result





# **Shortcomings of TMR**

- Additional area and power is required for redundant PE's and voting logic
- TMR provides coverage for cases as they arise
- TMR only provides reliable coverage when, at most, only a single PE is faulty
  - What happens when two PE's are faulty?
  - Can we identify the fault free PE using random inputs?



### **Fault Cones and Aliasing**

- A fault's cone is all output bits affected by the fault
- Fault cones f2 and f3 can overlap
- Aliasing possible when input activates both faults
- Aliasing is faulty PEs agreeing to wrong result





# **Majority Voting**

- Majority voting systems with multiple faulty PE's generate:
  - Indeterminate outcomes no PE results match
  - Cases with no majority, or the majority is wrong.
- Faulty PE's correct results vote with healthy PEs
- When PEs contain different faults, majority voting may still favor the healthy (Golden) PEs
- TMR systems assume single fault
  - Eliminates potential for indeterminate (null) voting
  - Eliminates potential aliased voting



#### **Example Fault Cones and Aliasing**



| f1<br>activated | f2<br>activated | f3<br>activated | Bit-level voter         | Word-level Voter       | Comment              |
|-----------------|-----------------|-----------------|-------------------------|------------------------|----------------------|
| 0               | 0               | por             | no fault observed       | no fault observed      | 4                    |
| 0               | 0               | 1               | f3 observed f3 observed |                        |                      |
| 0               | 1               | 0               | f2 observed             | f2 observed            |                      |
| 0               | 1               | 1               | possible aliasing on o5 | possible word aliasing | f2 and f3 overlap    |
| 1               | 0               | 0               | f1 observed f1 observed |                        |                      |
| 1               | 0               | 1               | no bit level aliasing   | indeterminate          | f1 and f3 no overlap |
| 1               | 1               | 0               | no bit level aliasing   | indeterminate          | f1 and f2 no overlap |
| 1               | 1               | 1               | possible aliasing on o5 | indeterminate          |                      |



# **Time Distributed Voting TDV**

- A statistical opportunity exists for faulty PEs to help identify healthy PE's by accumulating voting results over time
- TDV identifies healthy and faulty PE's over time
  - Alternative to TMR masking erroneous results
  - If fault is not activated PE output is correct
  - When fault is activated not all PE output incorrect



# **TDV Prototype**

- Verilog HDL prototype was used to simulate TDV
- Features:
  - Three PE's operating on independent data streams
  - CAM-based FIFO's to detect voting opportunities
  - PE result alignment and vote execution

ססרדבקתל שדקדב טתועברשודע

♣



#### **TDV Block Diagram**







# **TDV Processing Element (C6288)**

- The ISCAS '85 C6288 Benchmark is used as the prototype processing element
- 15 by 16 array structure.
- 16-bit Multiplier (32 input bits, 32 output bits)
- C6288 contains 2448 nodes that may be modeled as SA0 or SA1 faults. Total 4896 fault nodes. (4879 observable)
- Minimum set size of 12 patterns to achieve 100% fault coverage (observable faults)
- 150 Pseudorandom patterns to achieve 100% fault coverage (observable faults)



#### C6288 PE





## C6288 PE (continued)



- C6288 symmetry high rate of aliasing
  - C6288 AND gates compute partial products
- C6288 half and full adders sum partial products



A0 *B*0

A0 P0



#### Adders used in C6288

- Top-row half adders lack the Ci input
- Single half adder in the bottom row lacks the B input







# Results: Coverage of Single Stuck-at Faults



- Simulated 1,200 pseudorandom test patterns for each of the 4,896 fault nodes
- Achieved 100% single stuck-at fault coverage with 150 pseudorandom input patterns.



# **Results: Aliasing Characterization**

| Random   | Activated | Unactivated | Non-Aliasing | Aliasing | Aliasing   |
|----------|-----------|-------------|--------------|----------|------------|
| Patterns | Faults    | Faults      | Faults       | Faults   | Faults (%) |
| 2        | 3,302     | 1,577       | 14           | 3,288    | 99.58%     |
| 3        | 3,974     | 905         | 28           | 3,946    | 99.30%     |
| 6        | 4,481     | 398         | 34           | 4,447    | 99.24%     |
| 12       | 4,768     | 111         | 34           | 4,734    | 99.29%     |
| 25       | 4,821     | 58          | 28           | 4,793    | 99.42%     |
| 50       | 4,874     | 5           | 28           | 4,846    | 99.43%     |
| 100      | 4,877     | 2           | 28           | 4,849    | 99.43%     |
| 200      | 4,879     | 0           | 28           | 4,851    | 99.43%     |
| 400      | 4,879     | 0           | 28           | 4,851    | 99.43%     |
| 800      | 4,879     | 0           | 28           | 4,851    | 99.43%     |
| 1,200    | 4,879     | 0           | 28           | 4,851    | 99.43%     |

- All fault pairs simulated for the N= 4,878 faults
- N(N-1)/2 = 11,982,960 fault pairs
- 99+% of faults in the array have aliasing fault pairs
  - 28 faults displayed no aliasing



# **Results: Aliasing Characterization**

 Aliasing was observed in about 3.2% of all activated fault combinations

| Random<br>Patterns | Activated Fault<br>Combinations | Aliasing Fault<br>Combinations | Aliasing Fault<br>Combinations(%) |
|--------------------|---------------------------------|--------------------------------|-----------------------------------|
| 2                  | 5449951                         | 96361                          | 1.77%                             |
| 3                  | 7894351                         | 145437                         | 1.84%                             |
| 6                  | 10037440                        | 205827                         | 2.05%                             |
| 12                 | 11364528                        | 276495                         | 2.43%                             |
| 25                 | 11618610                        | 317298                         | 2.73%                             |
| 50                 | 11875501                        | 353489                         | 2.98%                             |
| 100                | 11890126                        | 369431                         | 3.11%                             |
| 200                | 11899881                        | 377027                         | 3.17%                             |
| 400                | 11899881                        | 378817                         | 3.18%                             |
| 800                | 11899881                        | 379375                         | 3.19%                             |
| 1200               | 11899881                        | 379437                         | 3.19%                             |

IC Design & Test Laboratory



9.5857

9.4001

5.6071

3.9373

3.1128

2.1439

1.2781

0.5154

0.0466

0.0206

3.224832

1.637596

0.023512

3.270927

3.178738

4851

6.947

maximum

quartile

median

quartile

minimum

#### **Results: Aliasing Characterization**

| ╞ | - | $+\Box$ | 0 | ]— |  | • |  |       |   | <br>$\left\  \right\ $ |   |
|---|---|---------|---|----|--|---|--|-------|---|------------------------|---|
|   |   |         |   |    |  |   |  |       |   |                        | _ |
|   |   |         |   |    |  |   |  |       |   |                        |   |
|   |   |         |   |    |  |   |  |       |   | -                      |   |
|   |   |         |   |    |  |   |  |       |   |                        |   |
|   |   |         |   |    |  |   |  |       |   |                        | - |
|   |   |         |   |    |  |   |  |       |   | -                      | _ |
|   |   |         |   |    |  |   |  | <br>_ | _ |                        | _ |

Percentage of 4,878 fault combinations that alias (4851 singular faults)

- On average a singular fault aliases with ~3.2% of it's possible fault combinations
- At most, a singular fault aliases with ~9.6% of it's possible fault combinations

Ν

Quantiles

100.00%

99.50%

97.50%

90.00%

75.00%

50.00%

25.00%

10.00%

2.50%

0.50%

0.00%

**Moments** 

Std Dev

Std Err Mean

Upper 95% Mean

Lower 95% Mean

Mean



# **Results: Aliasing Characterization**

- Equivalent faults modify the circuit in the same way
  - Display identical symptoms
- Red faults are equivalent
- Equivalent faults ~15,000 of aliasing fault pairs
- Aliasing extends equivalent faults



IC Design & Test Laboratory



PO

# **Results: Aliasing Characterization**

Ρ9

- In the C6288, aliasing frequency is modulated by propagation paths and fault pair proximity.
- Propagation paths Faults that propagate through both adder outputs alias more than faults that only propagate through a single adder output.
- Proximity Aliasing is only observed for fault pairs in the same or adjacent column.
- Equivalence is not required.



- A singular fault (blue) may alias with faults in close columnar proximity (red)
- All aliasing fault pairs are in the same or adjacent column (pawn movement)



- TDV simulation assumes a single healthy (Golden) PE and two faulty (Faulty1 & Faulty2) PE's
- For each pseudorandom input pattern, the PE weights are updated as shown in the table below. (Minority PE weight gets decremented, Majority PE weights get incremented)
- TDV voting is non-biased, the outcomes are not skewed to favor the Golden PE
- The voter does not know what the correct answer is
- The object of is to identify the Golden PE using the accumulated TDV outcomes

| Input Pattern | Result[0] | Result[1] | Result[2] | Weight[0] | Weight[1] | Weight[2] |
|---------------|-----------|-----------|-----------|-----------|-----------|-----------|
| А             | Х         | Х         | Х         | +0        | +0        | +0        |
| А             | Y         | Х         | Х         | -1        | +1        | +1        |
| А             | Х         | Y         | Х         | +1        | -1        | +1        |
| А             | Х         | Х         | Y         | +1        | +1        | -1        |
| А             | Х         | Y         | Z         | +0        | +0        | +0        |

IC Design & Test Laboratory







- Adding more test patterns changes the snapshot of which aliasing faults get coverage.
- The table shows the TDV outcome incrementally as input pattern set size is increased to 1,200 (Green=Correct; Red=Incorrect)

|    | PE1    | PE2   | PE3   | 200<br>Patterns | 400<br>Patterns | 800<br>Patterns | 1200<br>Patterns |
|----|--------|-------|-------|-----------------|-----------------|-----------------|------------------|
|    | Golden | N997  | N2263 | 0               | 0               | 0               | 1                |
|    | Golden | N4297 | N4796 | 0               | 0               | 1               | 0                |
|    | Golden | N752  | N2514 | 0               | 0               | 1               | 1                |
| 1  | Golden | N411  | N3247 | 0               | 1               | 0               | 0                |
| Ľ, | Golden | N1000 | N2759 | 0               | 1               | 0               | 1                |
| 1  | Golden | N997  | N2008 | 0               | 1               | 1               | 0                |
|    | Golden | N872  | N1121 | 0               | 1               | 1               | 1                |
|    | Golden | N868  | N1121 | 1               | 0               | 0               | 0                |
|    | Golden | N3162 | N4676 | 1               | 0               | 0               | 1                |
|    | Golden | N1823 | N2577 | 1               | 0               | 1               | 0                |
|    | Golden | N3508 | N4513 | 1               | 0               | 1               | 1                |
|    | Golden | N843  | N3362 | 1               | 1               | 0               | 0                |
|    | Golden | N3745 | N4760 | 1               | 1               | 0               | 1                |
|    | Golden | N435  | N2694 | 1               | 1               | 1               | 0                |



| Random<br>Patterns | Total Fault<br>Pairs | % Correct<br>No-Aliasing<br>Pairs | % Aliasing<br>Pairs | % Correct<br>Aliasing Pairs | % Total<br>Correct Pairs | % Incorrect<br>Aliasing<br>Pairs |
|--------------------|----------------------|-----------------------------------|---------------------|-----------------------------|--------------------------|----------------------------------|
| 200                | 11,899,881           | 96.83%                            | 3.17%               | 1.79%                       | 98.62%                   | 1.38%                            |
| 400                | 11,899,881           | 96.82%                            | 3.18%               | 1.80%                       | 98.62%                   | 1.38%                            |
| 800                | 11,899,881           | 96.81%                            | 3.19%               | 1.82%                       | 98.63%                   | 1.37%                            |
| 1,200              | 11,899,881           | 96.81%                            | 3.19%               | 1.83%                       | 98.64%                   | 1.36%                            |

#### • For the C6288 array circuit

- TDV covers all observable single faulty PE cases covered by lockstep TMR.
- TDV extends fault coverage to 98.6% of multiple faulty PE's for which TMR provides no coverage.



### **Conclusions and Recommendations**

- Lockstep TMR can fail in the presence of multiple faulty PE's.
- Time distributed voting (TDV) is an alternative to lockstep TMR
- TDV extended coverage to 98.6% of multiple faulty PE's for.
- C6288 benchmark alias simulation TMR provides no coverage for 1.4% fault pairs



#### **Conclusions and Recommendations**

- Aliasing extends beyond equivalent faults
- Conventional fault collapsing does eliminate fault pair aliasing
- TDV does not ensure detection of faulty elements in all cases.
- TDV evicted the healthy PE 3.2% of the fault pairs
- TDV requires analysis of frequency of aliasing fault pairs
- TDV may require engineered test patterns to maximize coverage.



#### **Conclusions and Recommendations**

- TDV provides effective alternative to lockstep TMR
- TDV provides fault tolerant design of systems in the presence of multiple faults.
- Adding PE's reduces aliasing faulty PE's
- Aliasing reduced from PE's with different implementation for same function
- Engineering a minimal test pattern to identify alias fault pairs



#### Acknowledgement

- Andrew Baldwin was a part-time graduate student
- The work described in this talk was his MS thesis

