# **Semiconductor Reliability**

C. Glenn Shirley

Ack: Thanks to Scott C. Johnson for the prettiest slides!

#### **Outline**

- Reliability, Definitions, Bathtub Curve
- Reliability Measures, Goals
- Use Conditions
- Acceleration
- Mechanisms
  - Constant Failure Rate
  - Infant Mortality
  - Wearout

### What is "Reliability"?

**Definition 1:** (IPC-SM-785, Nov 1992)

The ability of a product to function under given conditions and for a specified period of time without exceeding an acceptable failure level.

**Definition 2:** (Most reliability textbooks)

The probability that an item will perform a required function without failure under stated conditions for a stated period of time.

- Fraction of Population failing in Use, or Failure Probability
- Time in Use until a given fraction has failed
  - "Use": 1) Who is the user, 2) What is the population of systems, and 3) How are they used ("Use Conditions").
- <u>Use Conditions</u>, of system, or components in system.
  - System shipping and storage is part of use.
  - System power-on, power-off. Duty cycle.
  - Conditions while "on". Constant or variable
    - » eg. Human usage patterns, software activity.





- Failure name depends on when it occurs:
  - Yield Loss: Product fails an internal test
  - Quality: Product meets specification at OEM
  - Reliability: Product functions correctly throughout use life

## **Component Failure**

|                | Yield Loss                             | Quality                           | Reliability                               |
|----------------|----------------------------------------|-----------------------------------|-------------------------------------------|
| Affects        | Producer                               | OEM mostly.                       | End User, OEM                             |
| Pass Criterion | Functions at test conditions.          | Functions per spec. (Data sheet.) | Functions in end use conditions.          |
| Impact         | Higher manufacturing cost at producer. | Higher OEM manufacturing cost.    | OEM warranty cost. Negative brand image.  |
| Measure        | Fraction (%)                           | Fraction failing (PPM)            | Fraction per unit time (%/kh, FITs)       |

### **The Reliability Problem**



- Quality fails can be handled by thorough testing
  - We test parts for any flaws
  - And we don't sell parts with flaws
- Reliability is harder because the fails come long after we've sold the product
  - How can we tell which parts are going to fail in the future?

### **Component Reliability**







- The stresses and fail mechanisms for semiconductor components are
  - Stresses: voltage, temperature, current, humidity, radiation, temperature cycling, mechanical stress
  - Mechanisms: transistors (degradation, oxide breakdown), interconnects (electromigration, cracking), package (metal migration, corrosion, fatigue)
- Let's explore another example that is more familiar...

### **Human Mortality Example**

- Data from the Census Bureau.
- For a specific population.
- Y-axis is the proportion of the population at year y-1 dying by year y.
- Contains all data needed to compute:
  - Life expectancy at a given age.
  - Probability of death at a given age.
  - Number of deaths between given ages.
  - Etc.

# **Human Mortality Data**

| Age | Mortality Rate | Age | Mortality Rate | Age | Mortality Rate | Age        | Mortality Rate |  |
|-----|----------------|-----|----------------|-----|----------------|------------|----------------|--|
| 1   | 0.00706        | 26  | 0.00095        | 51  | 0.00439        | 76         | 0.03824        |  |
| 2   | 0.00053        | 27  | 0.00095        | 52  | 0.00473        | 77         | 0.04145        |  |
| 3   | 0.00036        | 28  | 0.00096        | 53  | 0.00512        | 78         | 0.04502        |  |
| 4   | 0.00027        | 29  | 0.00098        | 54  | 0.00557        | 79         | 0.04914        |  |
| 5   | 0.00022        | 30  | 0.00102        | 55  | 0.0061         | 80         | 0.05395        |  |
| 6   | 0.0002         | 31  | 0.00106        | 56  | 0.00673        | 81         | 0.0595         |  |
| 7   | 0.00019        | 32  | 0.00111        | 57  | 0.00742        | 82         | 0.06578        |  |
| 8   | 0.00018        | 33  | 0.00117        | 58  | 0.00816        | 83         | 0.07287        |  |
| 9   | 0.00016        | 34  | 0.00124        | 59  | 0.00892        | 84         | 0.08066        |  |
| 10  | 0.00014        | 35  | 0.00133        | 60  | 0.00971        | 85         | 0.08913        |  |
| 11  | 0.00013        | 36  | 0.00142        | 61  | 0.01058        | 86         | 0.09777        |  |
| 12  | 0.00013        | 37  | 0.00151        | 62  | 0.01157        | 87         | 0.107          |  |
| 13  | 0.00017        | 38  | 0.00161        | 63  | 0.01265        | 88         | 0.11683        |  |
| 14  | 0.00026        | 39  | 0.00173        | 64  | 0.01383        | 89         | 0.12725        |  |
| 15  | 0.00038        | 40  | 0.00187        | 65  | 0.01509        | 90         |                |  |
| 16  | 0.00051        | 41  | 0.00201        | 66  | 0.01641        | 91         | 0.14989        |  |
| 17  | 0.00063        | 42  | 0.00217        | 67  | 0.01782        | 92         | 0.1621         |  |
| 18  | 0.00073        | 43  | 0.00234        | 68  | 0.01941        | 93         | 0.17489        |  |
| 19  | 0.00079        | 44  | 0.00253        | 69  | 0.02123        | 94         | 0.18824        |  |
| 20  | 0.00084        | 45  | 0.00274        | 70  | 0.02323        | 95         | 0.20212        |  |
| 21  | 0.00088        | 46  | 0.00299        | 71  | 0.02528        | 96         | 0.21651        |  |
| 22  | 0.00092        | 47  | 0.00325        | 72  | 0.02739        | 97         | 0.23138        |  |
| 23  | 0.00096        | 48  | 0.00353        | 73  | 0.0297         | 98         | 0.24668        |  |
| 24  | 0.00097        | 49  | 0.00381        | 74  | 0.03229        | 99         | 0.26237        |  |
| 25  | 0.00096        | 50  | 0.00409        | 75  | 0.03518        | 100        | 0.27839        |  |

## **Human Mortality Data, ct'd**

Figure: **United States Total Population Mortality Rate 1999** (y-axis: failure rate)



### **Human Mortality Data, ct'd**

**United States Total Population Mortality Rate 1999** 



### **Reliability Measure**

- Reliability is measured by a failure rate.
- A failure rate is the fraction of a population failing per unit time in a time interval at a given stress condition.

$$\text{Failure Rate} = \frac{\text{Number of failures in } \Delta t}{\text{Population size at beginning of } \Delta t} \times \frac{1}{\Delta t} = \frac{\text{Number of failures in } \Delta t}{\text{Device hours accumulated in } \Delta t}$$

- This is the average failure rate in the interval $t$ to $t + \Delta t$.
- E.g., 100 units are stressed for 1000 hours, and failures occur at 100 hours, 400 hours, and 700 hours. What is the average failure rate?

$$\lambda_{BE} = \frac{3}{100 + 400 + 700 + 97000} = 0.00003055 \text{ per hour} = 3.055\ \%/\text{Kh}$$
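The 100-unit example can be checked numerically. The following is a minimal sketch; the function name `average_failure_rate` and its argument names are illustrative and do not come from the slides.

```python
def average_failure_rate(fail_times_h, n_units, test_duration_h):
    """Average failure rate in failures per device-hour.

    Failed units contribute hours up to their failure time;
    surviving units contribute the full test duration.
    """
    n_fail = len(fail_times_h)
    survivors = n_units - n_fail
    device_hours = sum(fail_times_h) + survivors * test_duration_h
    return n_fail / device_hours


# 100 units stressed for 1000 hours, failures at 100, 400 and 700 hours.
rate = average_failure_rate([100, 400, 700], n_units=100, test_duration_h=1000)
print(f"{rate:.4e} failures per device-hour")
print(f"{rate * 1e5:.3f} %/Kh")
```

The three failed units contribute 100 + 400 + 700 device-hours and the 97 survivors contribute 97000, which is the denominator used in the worked example.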

#### **Failure Rate Units**

Equivalent Failure Rate Units

| Fail Fraction per Hour | % per 1000 hrs | FIT    |
|------------------------|----------------|--------|
| 0.00001                | 1.0            | 10,000 |
| 0.000001               | 0.1            | 1,000  |
| 0.0000001              | 0.01           | 100    |
| 0.00000001             | 0.001          | 10     |
| 0.000000001            | 0.0001         | 1      |

- Conversion Factors
  - Fail fraction per hour $\times 10^5$ = % per Khr
  - Fail fraction per hour $\times 10^9$ = FIT
  - % per Khr $\times 10^4$ = FIT
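The conversion factors can be written as three one-line helpers. This is only a sketch; the function names are made up for illustration.

```python
def per_hour_to_pct_per_kh(rate_per_hour):
    """Fail fraction per hour -> percent per 1000 hours."""
    return rate_per_hour * 1e5


def per_hour_to_fit(rate_per_hour):
    """Fail fraction per hour -> FIT (failures per 1e9 device-hours)."""
    return rate_per_hour * 1e9


def pct_per_kh_to_fit(pct_per_kh):
    """Percent per 1000 hours -> FIT."""
    return pct_per_kh * 1e4


# Walk through the rows of the equivalence table.
for rate in (1e-5, 1e-6, 1e-7, 1e-8, 1e-9):
    print(rate, per_hour_to_pct_per_kh(rate), per_hour_to_fit(rate))
```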

### **Human Mortality - Examples**

- Of 1000 people alive at 80...
  - How many are dead at 81?
    - $1000 \times 0.0595 \approx 60$
  - How many are dead at 82?
    - $59.5 + (1000 - 59.5) \times 0.06578 \approx 121$
- What is the "failure rate" of 80 year olds?
  - $10^{5} \times 0.05395/(24 \times 365) = 0.62$ %/kh = 6159 FITs
- What is the "failure rate" of 20 year olds?
  - $10^{9} \times 0.0008/(24 \times 365) = 91$ FITs
- Typical failure rates of ICs in use: < 1000 FITs
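These examples can be reproduced from the mortality table with a few lines of Python. This is a sketch: the dictionary holds only the table rows it needs, and the helper name `annual_rate_to_fit` is illustrative.

```python
HOURS_PER_YEAR = 24 * 365

# Annual mortality rates copied from the table (age -> rate).
mortality = {80: 0.05395, 81: 0.0595, 82: 0.06578}

alive_at_80 = 1000.0
dead_by_81 = alive_at_80 * mortality[81]
# Only the survivors at 81 are exposed to the age-82 rate.
dead_by_82 = dead_by_81 + (alive_at_80 - dead_by_81) * mortality[82]


def annual_rate_to_fit(annual_fraction):
    """Convert a fraction failing per year into FITs (failures per 1e9 hours)."""
    return annual_fraction / HOURS_PER_YEAR * 1e9


print(f"dead by 81: {dead_by_81:.1f}")
print(f"dead by 82: {dead_by_82:.1f}")
print(f"80 year olds: {annual_rate_to_fit(mortality[80]):.0f} FITs")
# The example rounds the age-20 rate (0.00084 in the table) down to 0.0008.
print(f"20 year olds: {annual_rate_to_fit(0.0008):.0f} FITs")
```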

Failure rate depends on age. It is not a constant, independent of age. This is true of human mortality, and of integrated circuits.

### **Bathtub Curve**



- Defects: Infant Mortality. Declining fail rate, early life.
- Radiation, Software (random): Constant fail rate.
- Materials, Design: Wearout. Increasing fail rate, late life.
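One way to picture the bathtub curve is as the sum of a declining, a constant, and an increasing failure rate. The sketch below assumes Weibull hazards for the infant mortality and wearout terms; that model choice and every parameter value are hypothetical, picked only to produce a bathtub shape.

```python
def weibull_hazard(t_h, beta, eta_h):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1).

    beta < 1 gives a declining rate, beta > 1 an increasing rate.
    """
    return (beta / eta_h) * (t_h / eta_h) ** (beta - 1)


def bathtub_hazard(t_h):
    """Total failure rate per hour as the sum of three mechanisms."""
    infant = weibull_hazard(t_h, beta=0.5, eta_h=1e9)    # defects (hypothetical)
    constant = 100e-9                                    # random events, 100 FITs (hypothetical)
    wearout = weibull_hazard(t_h, beta=4.0, eta_h=2e5)   # materials, design (hypothetical)
    return infant + constant + wearout


for t in (10, 1_000, 20_000, 100_000):
    print(f"t = {t:>7} h: {bathtub_hazard(t) * 1e9:8.1f} FITs")
```

With these parameters the total rate falls through early life, flattens near the constant term, and rises again as the wearout term takes over.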



#### **Customer-Perceived Bathtub Curve**

 Use Infant Mortality Control (eg. Burn In) to reshape the bathtub fail rate curve as perceived by customers.



Typical Fallout with IMC: 100-1000 DPM (0-30 d); 200-400 FITs (0-1 y)

#### **Outline**

- Reliability, Definitions, Bathtub Curve
- Reliability Measures, Goals
- Use Conditions
- Acceleration
- Mechanisms
  - Constant Failure Rate
  - Infant Mortality
  - Wearout

### **Reliability Goals**

 How are Goals used? Results of experiments or models (Figures of Merit) are compared to Goals to make Pass/Fail decisions.

#### **FOM** ⇔ **Goal** ⇒ **Pass/Fail**

- Reliability goals involve fraction fail and time. (eg. FITs)
- Goals are always stated in relation to some stress condition, or usage model.

#### **Examples:**

- Goal for 85/85 is < 1% failing at 1000 hours.
- Product in use: < 1% fails after 7 years of power on provided the product does not exceed data sheet limits.
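A goal stated as fraction failing over a time can be converted into an average failure rate and compared with a figure of merit. The sketch below assumes continuous power-on (24 hours a day) for the 7-year example; the function names and the figure-of-merit value are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def goal_to_average_fit(max_fraction_failing, years_power_on):
    """Average failure rate in FITs implied by a 'fraction failing by time t' goal."""
    hours = years_power_on * HOURS_PER_YEAR
    return max_fraction_failing / hours * 1e9


def passes_goal(fom_fraction_failing, goal_fraction_failing):
    """Pass/fail decision: the figure of merit must be below the goal."""
    return fom_fraction_failing < goal_fraction_failing


goal_fit = goal_to_average_fit(0.01, 7)   # < 1% failing after 7 years of power on
print(f"average rate implied by the goal: {goal_fit:.0f} FITs")

# Hypothetical figure of merit: 0.4% predicted to fail at 1000 h of 85/85.
print("85/85 result passes:", passes_goal(0.004, 0.01))
```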

### Reliability Goals (ITRS, 2007)

<i>Table PIDS7a</i>: Reliability Technology Requirements, Near-term Years

| Year of Production                                                         | 2007      | 2008      | 2009      | 2010      | 2011      | 2012      | 2013 | 2014    | 2015      |
|----------------------------------------------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|------|---------|-----------|
| DRAM ½ Pitch (nm) (contacted)                                              | 65        | 57        | 50        | 45        | 40        | 36        | 32   | 28      | 25        |
| MPU/ASIC Metal 1 (M1) ½ Pitch (nm) (contacted)                             | 68        | 59        | 52        | 45        | 40        | 36        | 32   | 28      | 25        |
| MPU Physical Gate Length (nm)                                              | 25        | 22        | 20        | 18        | 16        | 14        | 13   | 11      | 10        |
| Early failures (ppm) (First 4000 operating hours) [1]                      | 50-2000   | 50-2000   | 50-2000   | 50-2000   | 50-2000   | 50-2000   |      | 50-2000 | 50-2000   |
| Long term reliability (FITS = failures in 1E9 hours) [2]                   | 50-2000   | 50-2000   | 50-2000   | 50-2000   | 50-2000   | 50-2000   |      |         | 50-2000   |
| SRAM Soft error rate (FIT/MBit)                                            | 1000-2000 | 1000-2000 | 1000-2000 | 1000-2000 | 1000-2000 | 1000-2000 |      |         | 1000-2000 |
| Relative failure rate per transistor (normalized to 2007 value) [3]        | 1.00      | 0.83      | 0.71      | 0.66      | 0.57      | 0.51      | 0.46 | 0.40    | 0.37      |
| Relative failure rate per m of interconnect (normalized to 2007 value) [4] | 1.00      | 0.50      | 0.50      | 0.50      | 0.25      | 0.25      | 0.25 | 0.12    | 0.12      |

Slide annotations over the 2013-2014 columns label three rows: early failures as IM (infant mortality), long term reliability as wearout, and SRAM soft error rate as constant failure rate.

ERROR!

Manufacturable solutions exist, and are being optimized

Manufacturable solutions are known

Interim solutions are known

Manufacturable solutions are NOT known



- Given a range, the upper limit is the requirement. ☹
- FITs is NOT "failures in 1E9 hours"; one FIT is one failure per 1E9 device-hours. ☹
- Early fails are given in "operating hours", but long term could be "calendar hours"! ☹

#### **Outline**

- Reliability, Definitions, Bathtub Curve
- Reliability Measures, Goals
- Use Conditions
- Acceleration
- Mechanisms
  - Constant Failure Rate
  - Infant Mortality
  - Wearout

### **Use Conditions: Life of an IC**















Assembly → Shipping → Storage → Shipping → OEM assembly → Shipping → End user













### **Use Conditions Depend on Application**









Figure panels: temperature cycle ($\Delta T$), bend, reflow, handling, temperature and relative humidity (Temp, RH), keypad press.

### **End-Use Conditions Vary Widely...**

• CPU usage idle during non-business hours



• CPU usage busy during non-business hours



CPU usage

### **Use Conditions – Temperature**

- Die temperature is determined by
  - The effectiveness of the cooling system (heat sink and fans)
  - The ambient temperature (Tamb)



#### **Use Condition Data and Model Sources**

- Platform (eg PC "box")
  - Lab electrical and thermal measurements on instrumented systems.
- End Use
  - Population and marketing statistics vs location.
  - Ambient vs location (eg. NOAA).
  - Industry standards (ASHRAE)
  - Human activity monitoring.
    - » Software activity, in-situ data logging.
  - Surveys of end users (poor source)



Figure: Extech RHT10 humidity and temperature USB datalogger (www.extech.com)

#### **Outline**

- Reliability, Definitions, Bathtub Curve
- Reliability Measures, Goals
- Use Conditions
- Acceleration
- Mechanisms
  - Constant Failure Rate
  - Infant Mortality
  - Wearout

### **Stress and Failure**

- How long is our product going to last?
  - We can't wait until it fails to see; that takes too long!
- We need to identify the stresses that cause it to fail
  - ...and then apply them harder to make our parts fail in a reasonable amount of time
- Our stresses include
  - Voltage
  - Temperature
  - Current
  - Humidity
  - Mechanical stress
  - ...and others





#### **Accelerated Test**



- The most powerful tool (and concept) in the reliability engineer's toolbox.
- Accelerated test increases one or more conditions (e.g., T, V, etc.) to reduce times to failure

Life Test (years) → Accelerated Test (hours)

 Intention is to accelerate a mechanism without inducing new mechanisms

#### **Acceleration Factor**

$$AF = \frac{t_{cold}}{t_{hot}} = \frac{1000 \text{hr}}{100 \text{hr}} = 10$$



- An acceleration factor describes how much a particular stress accelerates degradation or failure.
- An acceleration factor is a <u>ratio of times</u>.
  - NOT fail fractions.
- The "times" are times to have the "same effect".
  - Example of "same effect": The same fraction fails by the same mechanism.
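As a ratio of times, the acceleration factor can be used to translate time under stress into equivalent time at the milder condition. This is a minimal sketch with illustrative function names; the 168-hour stress duration is a hypothetical input.

```python
def acceleration_factor(t_cold_h, t_hot_h):
    """Ratio of the times needed to reach the same effect (same fraction
    failing by the same mechanism) at the cold and hot conditions."""
    return t_cold_h / t_hot_h


def equivalent_cold_time(t_hot_h, af):
    """Time at the cold condition that has the same effect as t_hot_h at the hot one."""
    return t_hot_h * af


af = acceleration_factor(t_cold_h=1000, t_hot_h=100)
print("AF =", af)
# Hypothetical: 168 hours at the hot condition, expressed as cold-condition hours.
print("equivalent cold hours:", equivalent_cold_time(168, af))
```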

### **Acceleration Concept**



$$AF = \frac{t_1}{t_2} = \frac{t_3}{t_4}$$

• Distributions at both conditions must match (same slope) for acceleration concept to make sense
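The same-slope requirement follows from treating acceleration as a pure scaling of time: multiplying every time to fail by AF shifts the distribution on a log-time axis without changing its spread. The sketch below checks this numerically on hypothetical times to fail, using an assumed AF of 10.

```python
import math
import statistics

# Hypothetical times to fail (hours) observed at the stress condition.
stress_ttf_h = [40.0, 65.0, 90.0, 130.0, 210.0]
af = 10.0  # assumed acceleration factor

# Pure time scaling: every time to fail at use is AF times longer.
use_ttf_h = [af * t for t in stress_ttf_h]


def log_stats(times_h):
    """Mean and standard deviation of ln(time to fail)."""
    logs = [math.log(t) for t in times_h]
    return statistics.mean(logs), statistics.stdev(logs)


mu_stress, sigma_stress = log_stats(stress_ttf_h)
mu_use, sigma_use = log_stats(use_ttf_h)

print("shift in ln(t):", mu_use - mu_stress, "vs ln(AF):", math.log(af))
print("spread at stress:", sigma_stress, "spread at use:", sigma_use)
```

If measured distributions at two conditions show different spreads, a single ratio of times cannot map one onto the other.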

## **Acceleration Example**



A temperature acceleration experiment showing the same distribution shape (slope) at each stress temp

### **Moisture and Temperature Fails**

Result is predicted TTF distribution at use condition

Distributions of Times To Fail for Various Conditions



### **Accelerated Stress Testing**

Special-purpose equipment accelerates various fail mechanisms



An LCBI burn-in system gives V and T stress to accelerate Si fail mechanisms



A HAST system gives pressure and humidity along with V and T to accelerate package fail mechanisms

### **Life Test Accelerates Use**



Lognormal with two-sided 90.0% confidence limits

#### **Outline**

- Reliability, Definitions, Bathtub Curve
- Reliability Measures, Goals
- Use Conditions
- Acceleration
- Mechanisms
  - Constant Failure Rate
  - Infant Mortality (decreasing failure rate)
  - Wearout (increasing failure rate)

#### **Mechanisms**

- Constant failure rate.
  - Controlled by fault tolerant design.
  - Eg. Cosmic rays charge upset uncorrelated to age of device.
- Infant Mortality
  - Controlled by yield improvement and by burn in.
  - Decreasing failure rate makes burn in possible.
  - Caused by defects.
- Wearout
  - Controlled by design rules.
  - Increasing failure rate limits the life of the IC.
  - Electromigration
  - Oxide Wearout
  - Transistor Degradation

### **Mechanisms**

- Constant failure rate.
  - Controlled by fault tolerant design.
  - Eg. Cosmic rays charge upset uncorrelated to age of device.

- **Infant Mortality**
  - Controlled by yield improvement and by burn in.
  - Decreasing failure rate makes burn in possible.
  - Caused by defects.
- Wearout
  - Controlled by design rules.
  - Increasing failure rate limits the life of the IC.
  - Electromigration
  - Oxide Wearout
  - Transistor Degradation

# **Defect Yield and Reliability**

- Defects are inescapable.
  - The same kinds of defects that cause yield loss perceived by the manufacturer, cause "infant mortality" perceived by end users.
- Yield is measured at Sort (initial wafer-level testing).
- Infant Mortality is measured by life-test, and controlled by burn in.
  - Life test is an extended burn in designed to acquire detailed reliability data.
  - Burn in is a stress preceding final test which activates latent reliability defects (LRDs) so that they may be screened out at final test (Class).
- Defect models of reliability describe only the left part of the bathtub curve; they don't describe wearout.



Portland State

Figure panels: burn-in times tbi = 0.01 h, 2 h, 4 h, 8 h, and 16 h.



# Killer vs Latent Reliability Defects



When defect is within  $\delta$  of line, failure is not immediate but will occur within the specified life of the device.

- Circuit design determines
  - Pattern pitch and space.
  - Different functional blocks have different characteristic pitch/spaces.
- Fab process determines
  - Spatial density of defects, D (defects/cm²)
  - Variation of spatial defect density.
  - Size distribution of defects.
- Circuit design plus the defect size distribution segregates defects into "killer" defects and latent reliability defects (LRDs).
  - OK, never a yield or reliability defect (1).
  - Sometimes a latent reliability defect (2), sometimes OK (3).
  - Sometimes a killer defect (4), sometimes a latent reliability defect (5), sometimes OK (6).
  - Always a killer defect (7).

# Killer vs Latent Reliability Defects

- Defects much smaller or larger than circuit geometry are not latent reliability defects (LRD).
- Some defects with size commensurate with circuit geometry are latent reliability defects.
- Typically ~ 1% of defects are latent reliability defects.

Stapper's model



Reliability

pp 461-475

# Killer vs Latent Reliability Defects

- Defects may be classified as "killer" defects which affect yield or LRD defects which affect reliability.
- Defects of either kind may be clustered. Described by defect density and defect density variance.
- Killer defects and LRDs are from the same source, so Yield and Reliability defect densities are proportional:
  - Drel/Dyield ≈ constant (typically ~ 1%).
- Dyield is MUCH easier to measure and monitor in manufacturing than Drel.
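The proportionality can be used to estimate the reliability defect density from the yield defect density. The sketch below uses the ~1% ratio quoted above; the die area, the yield defect density, and the Poisson formula for the fraction of dies carrying at least one LRD are assumptions made for illustration, and the Poisson form ignores the clustering mentioned earlier.

```python
import math


def reliability_defect_density(d_yield_per_cm2, ratio=0.01):
    """Estimate Drel from Dyield using Drel/Dyield ~ constant (typically ~1%)."""
    return ratio * d_yield_per_cm2


def fraction_with_lrd(die_area_cm2, d_rel_per_cm2):
    """Fraction of dies with at least one latent reliability defect.

    Assumes unclustered (Poisson-distributed) defects, a simplification.
    """
    return 1.0 - math.exp(-die_area_cm2 * d_rel_per_cm2)


d_yield = 0.5        # defects/cm^2, hypothetical value measured at Sort
die_area = 1.0       # cm^2, hypothetical
d_rel = reliability_defect_density(d_yield)
lrd_fraction = fraction_with_lrd(die_area, d_rel)

print(f"Drel = {d_rel:.4f} defects/cm^2")
print(f"dies carrying an LRD: {lrd_fraction * 1e6:.0f} DPM")
```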



Spring 2016 Reliability 46

# So much for pretty models... ...now for ugly reality...



From an ITRS report.



# **Activated LRDs, Mainly Shorts**



Figure panels: STI particle, salicide encroachment, salicide stringer, residual Ti, tungsten particle, copper extrusion.


# **Activated LRDs, Mainly Opens**



### **Mechanisms**

- Constant failure rate.
  - Controlled by fault tolerant design.
  - Eg. Cosmic rays charge upset uncorrelated to age of device.
- Infant Mortality
  - Controlled by yield improvement and by burn in.
  - Decreasing failure rate makes burn in possible.
  - Caused by defects.

- **Wearout**
  - Controlled by design rules.
  - Increasing failure rate limits the life of the IC.
  - Electromigration
  - Oxide Wearout
  - Transistor Degradation

# **Electromigration**



- "Electron wind" from conduction current gradually pushes ions "down wind" into new positions in the lattice
- Good heat sinking of thin film metal permits current densities high enough (~10<sup>6</sup> A/cm<sup>2</sup>) for the phenomenon to occur. Isolated wires would melt at this current density.

#### **Causes Voids and Extrusions**



 Electron wind pushes metal enough to cause voids on one end and extrusions on the other end

#### **Voids and Extrusions**

Example of a void



Example of extrusions



- Voids cause a rise in resistance
  - Dominant fail mode; reliable and easy to characterize
- Extrusions might cause shorts
  - Inconsistent and random

### **Test Structure to Characterize EM**





# **Measuring Voids**





- As voids form, resistance increases
- A threshold is chosen to define a fail
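The fail criterion can be applied to a resistance trace as follows. This is a sketch only: the 10% rise threshold and the resistance readings are hypothetical, not values from the slides.

```python
def time_to_fail(times_h, resistances_ohm, rise_threshold=0.10):
    """Return the first readout time at which the fractional resistance rise
    over the initial value reaches the threshold, or None if it never does."""
    r0 = resistances_ohm[0]
    for t, r in zip(times_h, resistances_ohm):
        if (r - r0) / r0 >= rise_threshold:
            return t
    return None


# Hypothetical readouts from one EM test structure under stress.
times_h = [0, 50, 100, 150, 200]
resistances_ohm = [100.0, 100.5, 102.0, 106.0, 115.0]

print("time to fail at 10% rise:", time_to_fail(times_h, resistances_ohm))
print("time to fail at 5% rise:", time_to_fail(times_h, resistances_ohm, 0.05))
```

A tighter threshold gives an earlier time to fail for the same structure, so the threshold must be fixed before results from different structures are compared.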

### **EM Model**

Results from many test structures



- Test structure data is used to calculate a max current (Imax) for which <0.1% fail at 7 yrs worst-case use
- This results in a design rule for Imax, which must be followed by all products using this technology

#### **Mechanisms**

- Constant failure rate.
  - Controlled by fault tolerant design.
  - Eg. Cosmic rays charge upset uncorrelated to age of device.
- Infant Mortality
  - Controlled by yield improvement and by burn in.
  - Caused by defects.
- Wearout
  - Controlled by design rules.
  - Electromigration
  - Oxide Wearout
  - Transistor Degradation

#### **Gate Oxide**



- Gate oxide is thin and critical
- Thinner oxide allows less charge on the gate to control the channel, and less charge means faster switching

# **Transistor Scaling Trends**



- Oxide thickness has leveled off around 20Å
  - Due to leakage and reliability
- Channel length continues to shrink

# **Oxide Degradation**



- Oxide degrades with time as
  - Impurities diffuse into it
  - Bonds change

## **Oxide Degradation**



• Electrons can tunnel ("hop") from defect to defect more easily than across the whole gate oxide layer

#### **Oxide Breakdown**

Thick oxide, more difficult to get percolation paths



 Oxide leakage will go up dramatically when a fully connected percolation path forms

#### **Soft Breakdown**





- The full percolation path makes a "soft" breakdown
  - Soft breakdown is considered a fail
- High current in the percolation path can change it to a "hard" breakdown

#### **Mechanisms**

- Constant failure rate.
  - Controlled by fault tolerant design.
  - Eg. Cosmic rays charge upset uncorrelated to age of device.
- Infant Mortality
  - Controlled by yield improvement and by burn in.
  - Caused by defects.
- Wearout
  - Controlled by design rules.
  - Electromigration
  - Oxide Wearout
  - Transistor Degradation

# **PMOS Bias Temperature Instability**



- PMOS negative bias / temperature instability
  - PBT or NBTI
  - Primarily affects PMOS transistors
  - Degrades device performance
- Primarily manifests in slower switching leading to Fmax degradation

### **Fmax**





- Fmax is limited by a speed path delay fault
  - Chip fails when some calculation is not ready in time
- Delays caused by
  - Transistor switching (higher V speeds them up)
  - Signal propagation



# **Fmax V and T Sensitivity**



- Low temperature, high voltage maximizes Fmax.
- But high voltage degrades reliability.

## **Fmax Degrades Over Time**





- Fmax decreases by 0 to ~5% over the life of a part
- Roughly follows a power law
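A power law in time means the fractional Fmax loss can be written as $A\,t^{n}$. The sketch below is illustrative only: the exponent of 0.2, the 7-year life, and the choice to pin the end-of-life loss at the 5% upper bound are all assumptions.

```python
HOURS_PER_YEAR = 24 * 365
LIFE_H = 7 * HOURS_PER_YEAR       # assumed use life
N_EXP = 0.2                       # hypothetical power-law exponent
END_OF_LIFE_LOSS = 0.05           # upper end of the 0 to ~5% range

# Choose the prefactor so that the loss at end of life equals END_OF_LIFE_LOSS.
A = END_OF_LIFE_LOSS / LIFE_H ** N_EXP


def fmax_loss(t_h):
    """Fractional Fmax degradation after t_h hours, modeled as A * t**n."""
    return A * t_h ** N_EXP


for years in (0.1, 1, 3, 7):
    loss = fmax_loss(years * HOURS_PER_YEAR)
    print(f"{years:>4} y: Fmax down {loss * 100:.2f} %")
```

With an exponent below 1, most of the degradation accumulates early in life and the curve flattens afterwards.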

# **Instructor Biography**

- C. Glenn Shirley
  - MSc in Physics (University of Melbourne, Australia)
  - PhD in Physics (Arizona State University)
  - 3 years post-doc at Carnegie-Mellon (Pittsburgh, PA)
  - 1 year at US Steel
  - 7 years at Motorola
  - 23 years at Intel, mostly in TD Q&R; retired in 2007.
    - » Package reliability, silicon reliability, Test Q&R
  - Joined PSU ECE in the IC Design and Test Lab in 2008 as a Research Prof.
- Contact information
  - cshirley@pdx.edu