# Modern FPGA Basics

# The Xilinx XC6200 chip, the software tools and the board development tools

## What is an FPGA?

- Field Programmable Gate Array
- Fully programmable alternative to a customized chip
- Used to implement functions in hardware
- Also called a <u>Reconfigurable</u> <u>Processing Unit</u> (**RPU**)

# **Reasons to use an FPGA**

- Hardwired logic is very fast
- Can interface to outside world
  - Custom hardware/peripherals
  - "Glue logic" to custom co/processors
- Can perform bit-level and systolic operations not suited for traditional CPU/MPU

# The Xilinx XC6200 RPU

- SRAM-based FPGA
  - Fast, unlimited **reconfiguration**
  - Dynamic and partially reconfigurable logic
- Microprocessor interface
- Symmetrical, hierarchical and regular structure

# XC6200 Family FPGAs

# Agenda

- XC6200 Architecture
- Design Flows
- Library Support
- Applications
- Reconfigurable Processing

## **Typical** Embedded Control Design



## 6 Problems Confronting Embedded Control Designers Today



#### XC6200 System Features Meet these requirements of Embedded Coprocessing



## **XC6200 Architectural Overview**

- Array of fine grain function cells, each with a register
  - high gate count for structured logic or regular arrays
- Abundant, hierarchical routing resources
  - global/local
  - use cell logic/use switches
- Flexible pin configuration
  - programmable as *in, out, bidirectional, tristate*
  - CMOS or TTL logic levels

# **XC6200 Architecture (cont)**

- High speed CPU interface for <u>configuration</u> and register I/O
  - Programmable bus width (8, 16, or 32 bits)
  - Direct processor read/write access to all user registers
  - All user registers and configuration SRAM mapped into processor address space

# **XC6200 Architecture**



# **XC6200 Functional Unit**

 Design based on the fact that any function of two Boolean variables can be computed by a 2:1 MUX.
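The claim can be checked exhaustively. The sketch below assumes the select input carries A and that each data input may be tied to 0, 1, B, or inverted B; that wiring choice is an illustration, not the exact XC6200 cell netlist. The names `mux`, `SOURCES`, and `realizable_functions` are hypothetical.

```python
from itertools import product


def mux(sel, in0, in1):
    """2:1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


# Assumed choices for each data input: a constant, B, or inverted B.
SOURCES = {
    "0": lambda b: 0,
    "1": lambda b: 1,
    "B": lambda b: b,
    "~B": lambda b: 1 - b,
}


def realizable_functions():
    """Map each truth table (f(0,0), f(0,1), f(1,0), f(1,1)) to a mux wiring.

    The wiring is (source on in0, source on in1) with A on the select input.
    """
    found = {}
    for (name0, src0), (name1, src1) in product(SOURCES.items(), repeat=2):
        table = tuple(
            mux(a, src0(b), src1(b)) for a, b in product((0, 1), repeat=2)
        )
        found[table] = (name0, name1)
    return found


functions = realizable_functions()
print(len(functions))            # number of distinct 2-input functions covered
print(functions[(0, 0, 0, 1)])   # wiring for A AND B
print(functions[(0, 1, 1, 0)])   # wiring for A XOR B
```

Each data input realizes one cofactor of the function (its value with A fixed at 0 or at 1), and a cofactor of a two-variable function can only be 0, 1, B, or inverted B, which is why four source choices per input are enough.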



#### XC6200 Unit Cell

- Each unit cell contains a computation unit:
  - D-type register
  - 2-input logic function
  - Nearest neighbor interconnection
  - Individually programmable from host interface (uP)





Figure 5. XC6200 Basic Cell

## Logical Organization: XC6200 Function Unit

- Function unit allows:
  - any function of 2 variables
  - any flavour of 2:1 mux
  - buffers, inverters, or constant 0s and 1s
  - any of the above in addition to a D-type register
- 3 inputs, each from any of 8 directions; output to up to 4 directions

Programmable input

Redirect output

#### Logical Organization: Function Unit.



#### Logical Organization: Function Unit. (cont)



Figure 7. Cell Logic Functions

#### Three basic modes

#### On any 2 of these arguments

| Function   | X1  | X2 | X3 | Y2        | Y3            | RP | CS | Q |
|------------|-----|----|----|-----------|---------------|----|----|---|
| 0          | A   | A  | A  | X2        | <del>X3</del> | X  | C  | X |
| 1          | A   | A  | A  | <u>X2</u> | X3            | X  | C  | X |
| BUF (Fast) | A   | X  | X  | Q         | Q             | Q  | C  | 0 |
| BUF        | X   | A  | A  | <u>X2</u> | X3            | X  | C  | X |
| INV (Fast) | A   | X  | X  | Q         | Q             | Q  | C  | 0 |
| INV        | X   | A  | A  | X2        | X3            | X  | C  | X |
| A.B (Fast) | A   | B  | X  | <u>X2</u> | Q             | Q  | C  | 0 |
| A.B        | A   | B  | A  | <u>X2</u> | X3            | X  | C  | X |
| A.B (Fast) | A   | X  | B  | Q         | <del>X3</del> | Q  | C  | 0 |
| Ā.B        | A   | A  | B  | X2        | <del>X3</del> | X  | C  | X |
| A.B (Fast) | A   | B  | X  | X2        | Q             | Q  | C  | 0 |
| A.B        | A   | B  | A  | X2        | X3            | X  | C  | X |
| A+B (Fast) | A   | X  | B  | Q         | <del>X3</del> | Q  | C  | 0 |
| A+B        | A   | A  | B  | <u>X2</u> | X3            | X  | C  | X |
| A+B (Fast) | A   | B  | X  | <u>X2</u> | Q             | Q  | C  | 0 |
| Ā+B        | A   | B  | A  | <u>X2</u> | X3            | X  | C  | X |
| A+B (Fast) | A   | X  | B  | Q         | X3            | Q  | C  | 0 |
| A+B        | A   | A  | B  | X2        | X3            | X  | C  | X |
| A⊕B        | A   | B  | B  | X2        | <u>X3</u>     | X  | C  | X |
| A⊕B        | A   | B  | B  | <u>X2</u> | X3            | X  | C  | X |
| M2_1       | SEL | A  | B  | <u>X2</u> | X3            | X  | C  | X |
| M2_1B1A    | SEL | A  | B  | X2        | <del>X3</del> | X  | C  | X |
| M2_1B1B    | SEL | A  | B  | <u>X2</u> | X3            | X  | C  | X |
| M2_1B2     | SEL | A  | B  | X2        | X3            | X  | C  | X |

Callouts: the Y2 and Y3 entries show what is selected by the corresponding mux; Q is 0 for the fast variants.

### Cell logic function table

multiplexers

## Logical Organization: Function Unit. (cont)

| Function   | X1 | X2 | X3 | Y2        | Y3            | RP | CS | Q |
|------------|----|----|----|-----------|---------------|----|----|---|
| 0          | A  | A  | A  | X2        | <del>X3</del> | X  | C  | X |
| 1          | A  | A  | A  | <u>X2</u> | X3            | X  | C  | X |
| BUF (Fast) | A  | X  | X  | Q         | Q             | Q  | C  | 0 |
| BUF        | X  | A  | A  | <u>X2</u> | X3            | X  | C  | X |
| INV (Fast) | A  | X  | X  | Q         | Q             | Q  | C  | 0 |
| INV        | X  | A  | A  | X2        | X3            | X  | C  | X |
| A.B (Fast) | A  | B  | X  | <u>X2</u> | Q             | Q  | C  | 0 |
| A.B        | A  | B  | A  | <u>X2</u> | <del>X3</del> | X  | C  | X |



#### Physical Organization: Cells, Blocks and Tiles



## **XC6200 Architecture**

Regular connections to nearest neighbors

- Large array of simple, configurable cells (sea of gates)
- Each cell:
  - D-Type register
  - Logic function
  - Nearest-neighbor interconnection
  - Grouped in 4x4 block



## **XC6200** Architecture

- 16 (4x4) neighbor-connected cells are grouped together to form a larger cellular array
- Communication "lanes" available between neighboring 4x4 cell blocks



### **Routing Resources Example**



#### Physical Organization: Cells, Blocks and Tiles (cont)





#### XC6200 Architecture

- A 4x4 array of the previously shown 4x4 blocks forms a 16x16 block
- Length-16 FastLANEs connect these larger arrays



#### XC6200 Architecture

- A 4x4 array of the 16x16 blocks forms the central 64x64 cell array
- Chip-Length FastLANEs connect the 16x16 blocks
- Central block surrounded by I/O pads
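The 4x4 / 16x16 / 64x64 nesting means a cell's position at each level can be read off its row and column as base-4 digits. The sketch below illustrates this for a 64x64 array; the function name `block_indices` and the choice of (0, 0) as the origin corner are assumptions made for illustration.

```python
def block_indices(row, col, size=64):
    """Locate a cell within the nested 4x4 hierarchy of a size x size array.

    Returns three (row, col) pairs:
      1. which 16x16 block the cell is in,
      2. which 4x4 block inside that 16x16 block,
      3. which cell inside that 4x4 block.
    """
    if not (0 <= row < size and 0 <= col < size):
        raise ValueError("cell coordinates outside the array")
    big = (row // 16, col // 16)                  # 16x16 block index
    mid = ((row % 16) // 4, (col % 16) // 4)      # 4x4 block within it
    cell = (row % 4, col % 4)                     # cell within the 4x4 block
    return big, mid, cell


print(block_indices(37, 10))
```

Two cells that share the first two pairs sit in the same 4x4 block and can use nearest-neighbor or length-4 routing; cells that differ in the first pair need the longer FastLANEs.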



# **XC6200 Routing**

- Each level of hierarchy has its own associated routing resources
  - Unit cells, 4x4, 16x16, 64x64 cell blocks
- Routing does not use a unit cell's resources
- Switches at the <u>edge of the blocks</u> provide for connections between the levels of interconnect

### Routing Switches:



Figure 8. Routing Switches at 4x4 Block Boundary

#### North and South Switches:



This slide shows what is connected to routing switches



### East and West Switches:





## **Clock Distribution:**

16x16 **Global Input** Low Skew 'H' Distribution Of Global Signals (XC6216)

Clock connections are fixed, for speed

## **Clear Distribution:**



"CL" = Chip-Length

Figure 14. Additional Switches at 16x16 Boundaries

#### **Input/Output Architecture:**



Figure 15. Input/Output Architecture

- Flexible pin configuration
  - programmable as <u>in, out, bidirectional, tristate</u>
  - CMOS or TTL logic levels

#### **Connections Between IOB's and Built-In XC6200 Control Logic:**

This is programmed for input only

Table 3 Connections Between IOB's And Built-In XC6200 Control Logic

| B Signal Type    | Example       | EnToPadB                       | DToPadB                                | DForPadB              | DFromPadB                                                   |
|------------------|---------------|--------------------------------|----------------------------------------|-----------------------|-------------------------------------------------------------|
| Input Only       | <del>CS</del> | 1                              | 0                                      | L16 Output From Array | Drives XC6200 Control Logic CS Input                        |
| Output Only      | SECE          | Driven By XC6200 Control Logic | SECE Out From XC6200 Control Logic     | L16 Output From Array | Not Connected                                               |
| Bidirectional    | Data Bus      | Driven By XC6200 Control Logic | Data-Bus Out From XC6200 Control Logic | L16 Output From Array | Drives XC6200 Internal FASTmap<sup>TM</sup> Data Bus Inputs |
| From Padless IOB | South IOB12   | Enable Output From W0 IOB      | DToPad Output From W0 IOB              | L16 Output From Array | Drives W0 IOB DFromPad Input                                |
| None             | North IOB30   | 1                              | 0                                      | L16 Output From Array | Not Connected                                               |

#### **Array Data Sources In West IOB's:**



Figure 17. Array Data Sources In West IOB's

### **XC6200 Device Organization**

Conceptual view



Logic symbol



RAM on board

Easy interface to uP

### FastMAP CPU Interface

- The industry's only *random access* configuration interface
  - allows for extremely fast full or partial device configuration - you only program the bits you need
- Allows direct CPU (random) access to user registers
  - supports *"coprocessing"* applications.

## FastMAP CPU Interface (cont)

- Easily interfaced to most microprocessors and microcontrollers
  - "memory mapped" architecture makes it just like designing with SRAM

## FastMAP (cont)

- Map Register allows mapping of <u>user</u> <u>registers</u> onto an 8-, 16-, or 32-bit data bus
- Allows unconstrained register placement
- Obviates need for complex shift and mask operations
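A rough software model of the idea, not the device's actual register encoding: a row mask picks which cell rows of a column take part in a transfer, and the selected bits are packed into (or unpacked from) contiguous bus bits. The function names and the "bit set means row selected" convention are assumptions for illustration only.

```python
def read_mapped(column_bits, row_mask):
    """Pack the register bits of the selected rows into a contiguous bus word.

    column_bits[r] is the register value in row r; the lowest selected row
    lands on bus bit 0, the next selected row on bus bit 1, and so on.
    """
    word = 0
    bus_bit = 0
    for row, bit in enumerate(column_bits):
        if (row_mask >> row) & 1:
            word |= (bit & 1) << bus_bit
            bus_bit += 1
    return word


def write_mapped(column_bits, row_mask, word):
    """Scatter a bus word onto the selected rows; other rows are unchanged."""
    result = list(column_bits)
    bus_bit = 0
    for row in range(len(result)):
        if (row_mask >> row) & 1:
            result[row] = (word >> bus_bit) & 1
            bus_bit += 1
    return result


# A 4-bit user register spread over rows 1, 5, 9 and 13 of a 16-row column.
mask = 0x2222
column = write_mapped([0] * 16, mask, 0b1011)
print(column)
print(read_mapped(column, mask))
```

With the mapping done in hardware, the host reads and writes the register as an ordinary small integer, whereas without it the host would have to shift and mask each bit into place itself.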



### FastMAP (cont)

- Wildcard Registers allow "don't cares" on address bits
  - same data can be written to <u>several locations</u> (SRAM and user registers) in one cycle
  - fast configuration of bit-slice type designs
  - broadcast of data to registers without tying up valuable routing resources.
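The effect of "don't care" address bits can be sketched as follows. This is a behavioral model under assumed conventions (a set bit in `wildcard` marks an address bit as don't-care); the function names and the dictionary standing in for configuration memory are hypothetical.

```python
def matching_addresses(address, wildcard, width):
    """List every address that agrees with `address` on all non-wildcard bits."""
    free_bits = [i for i in range(width) if (wildcard >> i) & 1]
    base = address & ~wildcard          # clear the don't-care positions
    matches = []
    for combo in range(1 << len(free_bits)):
        addr = base
        for k, bit in enumerate(free_bits):
            if (combo >> k) & 1:
                addr |= 1 << bit
        matches.append(addr)
    return sorted(matches)


def broadcast_write(memory, address, wildcard, width, value):
    """Model one write cycle that lands on every matching location."""
    for addr in matching_addresses(address, wildcard, width):
        memory[addr] = value
    return memory


memory = broadcast_write({}, 0b1010, 0b0101, 4, 0xA5)
print(sorted(memory))
```

With two wildcard bits, a single write reaches four locations, which is how a bit-slice design with identical slices can be configured in a fraction of the cycles.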

## Partial <u>Run-time</u> Reconfiguration

- Extend hardware to a *larger* (virtual) capacity through rapid reconfiguration
- Derive time-varying structures that are smaller and faster than the ASIC counterpart
- Make more transistors participate in a given computation

#### Partial Run-time Reconfiguration



*Time* = 0

#### Partial Run-time Reconfiguration



*Time = a short time later*

# Reconfiguration Speed vs Traditional Technologies



#### XC6200 Family Members

| Device               | XC6209 | XC6216 | XC6236 | XC6264  |
|----------------------|--------|--------|--------|---------|
| Approx. Gate Count   | 9k     | 16k    | 36k    | 64k     |
| Number of Cells      | 2304   | 4096   | 9216   | 16384   |
| Max No. of Registers | 2304   | 4096   | 9216   | 16384   |
| Number of IOBs       | 192    | 256    | 384    | 512     |
| Cell Rows x Columns  | 48x48  | 64x64  | 96x96  | 128x128 |

Notes :

1. Gate counts are estimated average cases, based on LSI Logic figures. Register-rich designs can have a much higher equivalent gate count than stated above.
2. Not all IOBs are connected directly to pads; some pads are shared between IOBs.

#### **Design Flows**



#### Library Support

- Primitive gates and functions (compatible with other Xilinx parts)
  - AND, OR, ADD, MULT, etc
- More complex macros also to be available
  - memory access
  - DSP functions (FIR, FFT, DCT)
  - JTAG, decoders, etc.

# Applications

- Can be used as "regular" FPGA
  - serial interface allows for booting from PROM
- Intended to act as hardware accelerator for microprocessors
  - FastMAP allows for
    - direct microprocessor access to "internal" logic
    - fast reconfiguration of all or part of device

# Applications (cont)

- "Context switching" and "virtual hardware" are realistic propositions
- Typical uses might include:
  - DSP,
  - image processing,
  - data paths,
  - etc.

# **Reconfigurable Processing**

- "Custom computing" concept, building on
  - fast configuration
  - virtual hardware
- PCI based development system to be made available
  - can be used as a custom computer in its own right, or
  - as an aid to system development for customers' designs

## **XC6000 Software:**

- XACT6000 Software From Xilinx.
- Trianus/Hades Design Entry Software for the XC6200
- Velab: Free VHDL Elaborator for the XC6200.
- XC6200 Inspector.



# A Multiplier for the XC6200


- Structure
- Math
- Building Lookup Tables
- Area Optimization
- Mapping into an XC6200
- Changing Coefficients
- Performance
- Summary



#### Math Class



Colors related to previous slide

#### Architecture of the Multiplier



# LUTs by Muxing

#### All logic realized as trees

- Lookup Table contains all pre-calculated partial products.
- Use a Truth Table to determine Mux inputs.

All possible products for multiplying by 0011 (3):

| A3 | A2 | A1 | A0 | P7 | P6 | P5 | P4 | P3 | P2 | P1 | P0 |
|----|----|----|----|----|----|----|----|----|----|----|----|
| 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
| 0  | 0  | 0  | 1  | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  |
| 0  | 0  | 1  | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  | 0  |
| 0  | 0  | 1  | 1  | 0  | 0  | 0  | 0  | 1  | 0  | 0  | 1  |
| 0  | 1  | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  | 0  | 0  |
| 0  | 1  | 0  | 1  | 0  | 0  | 0  | 0  | 1  | 1  | 1  | 1  |
| 0  | 1  | 1  | 0  | 0  | 0  | 0  | 1  | 0  | 0  | 1  | 0  |
| 0  | 1  | 1  | 1  | 0  | 0  | 0  | 1  | 0  | 1  | 0  | 1  |
| 1  | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  | 0  | 0  | 0  |
| 1  | 0  | 0  | 1  | 0  | 0  | 0  | 1  | 1  | 0  | 1  | 1  |
| 1  | 0  | 1  | 0  | 0  | 0  | 0  | 1  | 1  | 1  | 1  | 0  |
| 1  | 0  | 1  | 1  | 0  | 0  | 1  | 0  | 0  | 0  | 0  | 1  |
| 1  | 1  | 0  | 0  | 0  | 0  | 1  | 0  | 0  | 1  | 0  | 0  |
| 1  | 1  | 0  | 1  | 0  | 0  | 1  | 0  | 0  | 1  | 1  | 1  |
| 1  | 1  | 1  | 0  | 0  | 0  | 1  | 0  | 1  | 0  | 1  | 0  |
| 1  | 1  | 1  | 1  | 0  | 0  | 1  | 0  | 1  | 1  | 0  | 1  |
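A lookup table of this kind can be generated mechanically for any coefficient. The sketch below is one way to do it; the function name `constant_multiplier_lut` and its parameters are illustrative, not part of any XC6200 tool.

```python
def constant_multiplier_lut(coefficient, input_bits=4, output_bits=8):
    """Build the truth table for multiplying an input nibble by a constant.

    Returns one (input bits, product bits) pair per input value, with both
    tuples listed most significant bit first (A3..A0 and P7..P0).
    """
    rows = []
    for a in range(1 << input_bits):
        product = a * coefficient
        a_bits = tuple((a >> i) & 1 for i in reversed(range(input_bits)))
        p_bits = tuple((product >> i) & 1 for i in reversed(range(output_bits)))
        rows.append((a_bits, p_bits))
    return rows


lut = constant_multiplier_lut(3)
for a_bits, p_bits in lut:
    print(*a_bits, "|", *p_bits)
```

Each output column Pi of the result is the 16-entry truth table that one one-bit LUT has to implement.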



#### Each Pi realized as this

#### Optimizing the Lookup

- Two mux levels can be collapsed into a single gate.
- The function can be determined with a truth table.
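One way to picture the collapse: in a four-level mux tree, the two levels nearest the constants together compute a function of just two address bits, so each such pair can be replaced by a single two-input gate whose truth table is read straight from the LUT column. The sketch assumes A1 and A0 feed the collapsed gates while A3 and A2 drive the remaining muxes; the slides do not state which bits go where, and the function names are hypothetical.

```python
def mux(sel, in0, in1):
    """2:1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def collapse_lut_bit(column):
    """Split a 16-entry LUT column into four 2-input gate truth tables.

    column[a] is the output bit for input value a (bits A3 A2 A1 A0).
    Gate g covers the inputs with A3 A2 == g; its truth table is indexed
    by the two low bits A1 A0.
    """
    return [tuple(column[4 * g: 4 * g + 4]) for g in range(4)]


def evaluate(gates, a):
    """Evaluate the collapsed structure: 4 gates followed by 3 muxes."""
    a0, a1, a2, a3 = a & 1, (a >> 1) & 1, (a >> 2) & 1, (a >> 3) & 1
    leaf = [gate[2 * a1 + a0] for gate in gates]   # four 2-input gates
    low = mux(a2, leaf[0], leaf[1])                # A3 = 0 half
    high = mux(a2, leaf[2], leaf[3])               # A3 = 1 half
    return mux(a3, low, high)


# Column P1 of the multiply-by-3 table.
column_p1 = [((3 * a) >> 1) & 1 for a in range(16)]
gates_p1 = collapse_lut_bit(column_p1)
print(gates_p1)
print(all(evaluate(gates_p1, a) == column_p1[a] for a in range(16)))
```

In this arrangement only the four gate functions depend on the coefficient; the three muxes above them stay the same for every table.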



Logic synthesis done by the CAD tool

No optimization

#### **Multiplier Schematic**



- The corresponding view in the layout editor.
- The LUTs are offset to line up bits for adder.
- Pipeline registers are cheap.
  - XC6216 has 4096 Flip Flops



#### A Closer Look at a Lookup



- Each 12-bit LUT is built from 12 one-bit LUTs.
- LUTs get stacked vertically.



#### **Determining Coefficients**

- Schematic for a single 4-input LUT.
- Functions can be determined from the Truth Table.





#### **Changing Coefficients**

- Functionality of a cell is contained in one byte.
  - <u>32-bit access</u> can change <u>the function of 4</u> <u>cells</u> per write cycle.
- 96 cells need writing, or 24 write cycles. (worst case)
  - 1.45 µs assuming 33 MHz
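The arithmetic behind these figures can be reproduced as below. The slide gives the cell count, the bus width, and the clock rate; the number of clock periods per write is not stated, so it is left as a parameter. A value of 2 reproduces the quoted time, which is an inference from the numbers rather than a documented figure.

```python
def reconfiguration_time_us(cells, cells_per_write, clock_hz, clocks_per_write):
    """Time in microseconds to rewrite `cells` cell-function bytes."""
    writes = -(-cells // cells_per_write)      # ceiling division
    return writes * clocks_per_write / clock_hz * 1e6


CELLS = 96              # cells whose function byte changes (worst case)
CELLS_PER_WRITE = 4     # one byte per cell on a 32-bit bus
CLOCK_HZ = 33e6

write_cycles = -(-CELLS // CELLS_PER_WRITE)
print(write_cycles)
print(round(reconfiguration_time_us(CELLS, CELLS_PER_WRITE, CLOCK_HZ, 1), 2))
print(round(reconfiguration_time_us(CELLS, CELLS_PER_WRITE, CLOCK_HZ, 2), 2))
```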



4\*24=96

# Summary on Multiplier Design

- 8x8 constant coefficient multiplier
- Pipelined 75+ MHz performance
- Small grain architecture: high degree of LUT optimization
- Coefficients easily changed: fast reconfiguration times
- High Performance/Dollar



# Development Board

### EXAMPLE : H.O.T. Works

- Development system based on the Xilinx XC6200-series RPU
- Includes:
  - H.O.T. Works Configurable Computer Board
  - H.O.T. Works **Development System Software**



### H.O.T. Works Board











#### H.O.T. Works Software

- Xilinx XACTStep
  - Map, Place and Route for XC6200
- Velab
  - Structural VHDL elaborator
- WebScope
  - Java-based debug tool
- H.O.T. Works Development System
  - C++-based API for board interfacing

#### **Design Flow**

#### **Design Flow for VHDL Entry Method**



#### **Run-Time Programming**

- C++ support software is provided for low-level board interface and device configuration
- Digital design is downloaded to the board at execution time
- User-level routines must be written to conduct data input/output and control

#### Conclusions

- Xilinx XC6200 provides a fast and inexpensive method to obtain great speedups in certain classes of algorithms
- There exist tools that provide a useable development platform to go from structural VHDL to digital design, and a programmable run-time interface in C++.

# Sources

#### **Mark L. Chang**

#### **Stephen Churcher**

### **Ahmad Alsolaim**