File name:
Packing Techniques for Virtex-5 FPGAs 2009.pdf
Description: Packing is a key step in the FPGA tool flow that straddles the boundaries between synthesis, technology
mapping, and placement. Packing strongly influences circuit speed, density, and power. In this article, we consider
packing in the commercial FPGA context and examine the area and performance ...
Packing Techniques for Virtex-5 FPGAs
A preliminary version of this work appeared in Ahmed et al. [2008]. In this
extended journal version, we give a more thorough treatment of the proposed
techniques. More significantly, we have retuned our algorithms to reflect an
area and performance trade-off that we believe will be more desirable to customers.
As such, for this article, we have regenerated all of our experimental
results, and we also present new experimental work on the area/performance
trade-off space of the proposed packing techniques.
The balance of the article is organized as follows: Section 2 reviews related
work on packing and introduces the Virtex-5 FPGA architecture. Packing techniques
that target logic blocks are offered in Section 3. Packing techniques for
large IP blocks appear in Section 4. Conclusions and suggestions for future
work are given in Section 5.
2. BACKGROUND
2.1 Packing
Figure 1 depicts the classic FPGA logic block model used in the vast majority
of academic research. It consists of a cluster of LUTs and flip-flops, where
each flip-flop can be optionally bypassed for implementing combinational logic.
Static RAM cells, not shown in the figure, control the logic function of each
LUT, and also control the multiplexer and interconnect connectivity. Inputs
to the logic block come from the FPGA's general interconnect: horizontal and
vertical channels of FPGA routing. Local interconnect inside the logic block
is available for realizing fast paths within the logic block. Observe that each
LUT/FF pair drives both local interconnect and general interconnect.
Most prior packing work assumes the local interconnect to be a full crossbar
switch matrix: every input can be programmably connected to any output. In
this logic block model, connections within the logic block are fast, and connections
between logic blocks, routed through the general interconnect, are slow
in comparison. The packing step decides which LUTs to put together into a
single logic block, and therefore, packing has a significant impact on circuit
speed. The model of Figure 1 is representative of early Altera FLEX FPGAs
[Altera 2003]; however, it has become out of step with the logic blocks present
in modern Xilinx or Altera FPGAs. Modern logic blocks present a more complex
packing problem and new optimization opportunities and impose different
constraints, as we will illustrate in the sections below.
Much research has been published on packing for the logic block in Figure 1.
Perhaps the most cited work is that of Betz and Rose, who proposed an
area-driven packing algorithm and showed that, due to inherent locality in circuits,
the number of inputs to a logic block can be much smaller than the total number
of LUT inputs within a cluster [Betz and Rose 1997]. In particular, for a
logic block with N 4-input LUTs, Betz and Rose [1997] showed that only 2N + 2
inputs to the cluster are needed, a number much smaller than 4N. Marquardt
extended the work to perform timing-driven packing and demonstrated the impact
of packing on critical path delay [Marquardt et al. 1999].
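To make the 2N + 2 result concrete, the short Python sketch below compares the total number of LUT input pins in a cluster (4N) with the number of cluster inputs suggested by Betz and Rose. The function name `cluster_inputs_needed` is hypothetical and used only for illustration; the sketch simply evaluates the two expressions quoted above.

```python
from typing import Tuple


def cluster_inputs_needed(n_luts: int) -> Tuple[int, int]:
    """Return (total LUT input pins, cluster inputs under the 2N + 2 rule).

    Applies only to clusters of 4-input LUTs, the case quoted in the text.
    """
    if n_luts < 1:
        raise ValueError("a cluster needs at least one LUT")
    total_lut_pins = 4 * n_luts        # every LUT pin counted separately
    cluster_inputs = 2 * n_luts + 2    # inputs reported as sufficient
    return total_lut_pins, cluster_inputs


for n in (4, 8, 10):
    pins, inputs = cluster_inputs_needed(n)
    print(f"N={n}: {pins} LUT pins, {inputs} cluster inputs")
```

For N = 10, for instance, the cluster has 40 LUT input pins but needs only 22 cluster inputs.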
ACM Transactions on Reconfigurable Technology and Systems, Vol 2, No 3, Article 18, Pub date: September 2009
T. Ahmed et al.
Fig. 1. Classic FPGA logic block targeted by most academic research.
Packing affects power consumption, as intralogic block connections will have
lower capacitance than interlogic block connections. A natural approach, therefore,
is to attempt to keep nets with high switching activity contained within
logic blocks, as was proposed in Lamoureux and Wilton [2003]. An entirely
different approach for power-driven packing was shown in Singh and Marek-Sadowska
[2002], where Rent's rule was used to establish a preference for how
many logic block inputs should be used during packing, leading to lower overall
interconnect usage, capacitance, and power. Although not yet available commercially,
dual-VDD FPGAs have been proposed by academia, where the idea
is to programmably allow logic blocks to operate at reduced supply voltage
(slower but lower power). Researchers at UCLA developed a complete CAD
flow for a proposed dual-VDD FPGA, including new packing techniques [Chen
and Cong 2004]. The aim of packing in this context is to pack LUTs based
on their timing-criticality, placing noncritical LUTs together into logic blocks
that will be operated at low VDD. Hassan et al. [2005] dealt with packing for
a low-power FPGA having logic blocks that, when idle, can be placed into a
low-leakage sleep state.
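The first idea above, keeping high-activity nets inside logic blocks, can be illustrated with a small scoring function. This is a minimal sketch, not the algorithm of Lamoureux and Wilton: the net representation, the `cluster_of` mapping, and the function name are all assumptions made for illustration. It sums the switching activity of nets that cross logic block boundaries, a quantity a power-aware packer would try to keep low.

```python
from typing import Dict, Iterable, Sequence, Tuple

# A net is (terminals, switching_activity); both fields are hypothetical inputs.
Net = Tuple[Sequence[str], float]


def external_switching_activity(nets: Iterable[Net], cluster_of: Dict[str, int]) -> float:
    """Sum the activity of nets whose terminals span more than one logic block."""
    total = 0.0
    for terminals, activity in nets:
        clusters = {cluster_of[t] for t in terminals}
        if len(clusters) > 1:          # net leaves the logic block
            total += activity
    return total


nets = [(("a", "b"), 0.9), (("b", "c"), 0.1)]
packing_1 = {"a": 0, "b": 0, "c": 1}   # high-activity net a-b kept internal
packing_2 = {"a": 0, "b": 1, "c": 1}   # high-activity net a-b crosses blocks
print(external_switching_activity(nets, packing_1))
print(external_switching_activity(nets, packing_2))
```

Under this measure, the first packing is preferable because only the low-activity net is routed through the general interconnect.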
On the speed axis, more recent work includes Dehkordi and Brown [2002],
which also uses a Rent's rule-based algorithm and prevents loosely connected
LUTs from being packed together; that is, it prevents the packing of unrelated
LUTs. Other papers tie together packing with other phases of the FPGA CAD
flow. For example, Schabas and Brown [2003] looked at packing in the context
of logic replication for performance; a subset of LUTs is deliberately left
empty by the packer to accommodate later LUT replications during placement.
An interesting recent work by Lin et al. [2006] brought together packing and
technology mapping and showed that higher speed can be attained using a
unified algorithm for concurrent packing and technology mapping.
2.2 Virtex-5 FPGA Architecture
We now describe the Virtex-5 architecture, focusing on the blocks that are the
target of our packing techniques.
Fig. 2. Virtex-5 CLB and slice.
2.2.1 Logic Blocks. Logic blocks in Virtex-5 are called configurable logic
blocks (CLBs) and comprise two SLICEs and a switch matrix, as shown in
Figure 2. The switch matrix provides intra- and inter-CLB connectivity.
As shown in the figure, each SLICE contains four 6-input
LUTs and four flip-flops. SLICEs receive carry signals from, and produce carry
signals for, neighboring SLICEs for implementing fast ripple-carry addition.
Figure 3 zooms in on the details of a portion of a SLICE: the LUT/flip-flop
pair. Dashed lines in the figure represent connections whose details are omitted
for clarity, but will be described below. Traditional LUTs in FPGAs have
a single output and can implement a single logic function. However, LUTs in
Virtex-5 FPGAs have two outputs, O5 and O6, and can implement two logic
functions. Specifically, the LUT can implement a single logic function using up
to 6 inputs, with the output signal appearing on O6. The LUT can also implement
two functions that together use up to 5 inputs, in which case both the
O5 and O6 outputs are used, and the LUT input A6 must be tied to logic-1.
The architecture necessitates that when two LUT outputs are used, only one
of them can be registered.
An important property of the LUT/FF structure in Figure 3 is that O6 directly
drives a SLICE output, A, whereas O5 must pass through an additional
multiplexer before reaching SLICE output AMUX. This makes output O5
slower than output O6. Consequently, for high performance, O5 should not
be used in timing-critical combinational paths. The relative slowness of O5 is
something that must be considered when packing LUTs into Virtex-5 SLICEs
if high-speed performance is to be maintained.
Fig. 3. Dual-output 6-input LUT.
Fig. 4. Hardware implementation of dual-output LUT.
Figure 4 shows the dual-output LUT hardware implementation. LUTs in
FPGAs are implemented with trees of multiplexers, and the figure shows a
portion of a 6-input LUT. Values in SRAM cells (on the left of the figure) are
set to hold the truth table for the logic function(s) implemented in the LUT.
The fastest LUT input, A6, is held constant at logic-1 when two functions are
mapped to the LUT, and therefore the fast input is unavailable for either of the
two functions. Thus, both logic functions suffer from increased delay in dual-output
mode, relative to the case of implementing a single function in a LUT,
where A6 can be used. Observe that the O5 output is fed by an internal
node of the multiplexer tree.
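The behavior described above can be captured in a small functional model. The sketch below is illustrative only: it assumes the truth table is stored as a 64-bit integer `init`, that A1 is the least significant address bit, and that O5 reads the lower 32 entries while A6 selects between the lower and upper halves for O6. These conventions, and the names `lut6_dual` and `pack_two_functions`, are assumptions rather than details taken from the figure.

```python
from typing import Callable, Sequence, Tuple


def lut6_dual(init: int, a: Sequence[int]) -> Tuple[int, int]:
    """Evaluate a dual-output 6-LUT. `a` holds A1..A6 as 0/1; returns (O5, O6)."""
    idx5 = sum(bit << i for i, bit in enumerate(a[:5]))   # address from A1..A5
    lower = (init >> idx5) & 1          # internal node of the mux tree, drives O5
    upper = (init >> (32 + idx5)) & 1
    o6 = upper if a[5] else lower       # final mux stage, controlled by A6
    return lower, o6


def pack_two_functions(f_o5: Callable[..., int], f_o6: Callable[..., int]) -> int:
    """Build a 64-bit table holding two functions of (at most) 5 shared inputs."""
    init = 0
    for idx in range(32):
        bits = [(idx >> i) & 1 for i in range(5)]
        init |= (f_o5(*bits) & 1) << idx          # lower half: seen on O5
        init |= (f_o6(*bits) & 1) << (32 + idx)   # upper half: seen on O6 when A6 = 1
    return init


# Two small functions sharing the five inputs A1..A5.
init = pack_two_functions(
    lambda a1, a2, a3, a4, a5: a1 & a2,
    lambda a1, a2, a3, a4, a5: a3 ^ a4 ^ a5,
)
print(lut6_dual(init, [1, 1, 0, 0, 0, 1]))   # A6 tied to logic-1
```

With A6 tied to logic-1, O6 always reads the upper half of the table, which is why A6 is unavailable as a logic input in dual-output mode.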
Circuits implemented in an FPGA need not use all 6 LUT inputs, and in
fact, circuits frequently contain many small logic functions that need only 2
or 3 LUT inputs. Figure 5 shows a histogram (similar to that in Hutton et al.
[2004]) of LUT sizes from a circuit collected from a Xilinx customer. In this
case, 37% of LUTs use 2 or 3 inputs, and such small LUTs could be paired
together and packed into one dual-output 6-LUT. The frequency of small LUTs
in circuits bodes well for the usability of the second LUT output.
Section 3 describes a packing approach for using both LUT outputs to produce
higher logic density, while at the same time minimizing the performance
penalty of using the slower O5 output. The higher density is achieved as a result
of the pervasiveness of small logic functions in circuits that can be packed
together in a single dual-output 6-LUT.
2.2.2 Block RAMs and DSPs. In early FPGAs, any data used by the on-chip
application had to be stored in off-chip RAMs. Accessing memory was
slow and power hungry. Modern FPGAs such as Virtex-5 include blocks of
static RAM on-chip. Figure 6 shows a Virtex-5 block RAM, which can store
up to 36 Kbits of data. A block RAM can operate as a single 36 Kb RAM, or
as two independent 18 Kb RAMs. The RAMs' read and write operations are
synchronous. Multiple block RAMs are organized into columns on the Virtex-5
device, and two adjacent block RAMs in a column can be combined to implement
deeper memories.
Fig. 5. LUT input usage for sample design (horizontal axis: number of used LUT inputs).
Fig. 6. Virtex-5 block RAM tile (one 36 Kb RAM comprising two 18 Kb RAMs).
When an 18 Kb RAM is used in dual-port mode, there are two possibilities:
(1) True dual-port mode: the 18 Kb block RAM has two independent access
ports, A and B. Data can be written to either or both ports, and can be
read from either or both ports. Figure 7(a) illustrates this case. In the
figure, the two 18 Kb block RAMs in the block RAM tile are referred to
as the upper and lower block RAMs, and the two access ports on each 18 Kb
block RAM are called the A and B access ports. DI, DO, and ADDR refer to
data-in, data-out, and address ports, respectively.
Fig. 7. Dual-port block RAMs: true and simple. (a) Two 18 Kb true dual-port memories. (b) Two 18 Kb simple dual-port memories.
(2) Simple dual-port RAM mode: In this case, each port cannot be used for both
reads and writes; rather, the 18 Kb block RAM has one port for read operations
and an independent port for write operations. Figure 7(b) illustrates
this case. In the figure, WRADDR and RDADDR refer to the write-address
and read-address ports, respectively.
The maximum allowable word width of a RAM in simple dual-port mode is 36
bits, which is double that allowed for true dual-port mode. The RAMs have
configurable "aspect ratio," meaning that an 18 Kb RAM can be configured as
16K × 1, 8K × 2, 2K × 9, or 1K × 18 or, for simple dual-port RAMs, 512 × 36.
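As a quick arithmetic check on these aspect ratios, the following Python sketch tabulates the capacity and address width implied by each depth × width configuration listed above. The helper names are hypothetical; the only data used are the configurations from the text.

```python
# Configurations of one 18 Kb block RAM as listed in the text: (depth, width).
ASPECT_RATIOS = [(16 * 1024, 1), (8 * 1024, 2), (2 * 1024, 9), (1024, 18), (512, 36)]


def capacity_bits(depth: int, width: int) -> int:
    """Total number of stored bits for a depth x width memory."""
    return depth * width


def address_bits(depth: int) -> int:
    """Number of address lines needed to select one of `depth` words."""
    return (depth - 1).bit_length()


for depth, width in ASPECT_RATIOS:
    print(f"{depth:>6} x {width:<2}  "
          f"{capacity_bits(depth, width):>6} bits  "
          f"{address_bits(depth):>2} address bits")
```

The ×1 and ×2 shapes account for 16,384 bits, whereas the ×9, ×18, and ×36 shapes account for the full 18,432 bits; the difference is the one extra bit per byte available in the wider shapes, commonly used for parity.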
DSP circuits frequently contain multiply-accumulate operations, and in
early FPGAs, such functionality was implemented using the generic LUT
fabric. In modern FPGAs, however, such functionality can be realized with
improved speed and power using hard IP blocks. Figure 8 depicts a Virtex-5
DSP block, called the DSP48E. The DSP48E comprises a 25 × 18 multiplier,
which receives inputs A and B as the operands to the multiply block. The
product is fed into an ALU that also may receive input C. The ALU can implement
addition, subtraction, and varied logic functions. Multiple DSP48Es
can chain together to form more complex functions through the PCIN and
PCOUT ports shown in the figure. The structure in Figure 8 is not entirely
combinational and actually has programmable pipeline depth, via configurable
registers present at several places in the block. A comprehensive description
of the DSP48E can be found in Xilinx [2007].
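The multiply-accumulate behavior described above can be sketched with a simplified functional model. This is not a description of the actual DSP48E implementation: pipeline registers and logic functions are ignored, operands are assumed to be two's complement, and a single third operand `z` stands in for either input C or the cascade input PCIN. The function names are hypothetical.

```python
from typing import Sequence

MASK48 = (1 << 48) - 1   # the ALU result is assumed to be 48 bits wide


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def dsp48e(a: int, b: int, z: int = 0, subtract: bool = False) -> int:
    """Return the 48-bit result of z + a*b (or z - a*b) with a 25 x 18 multiply."""
    product = to_signed(a, 25) * to_signed(b, 18)
    z = to_signed(z, 48)
    result = z - product if subtract else z + product
    return result & MASK48


def dot_product(xs: Sequence[int], ys: Sequence[int]) -> int:
    """Chain blocks: each block's output feeds the next block's third operand."""
    p = 0
    for x, y in zip(xs, ys):
        p = dsp48e(x, y, z=p)
    return to_signed(p, 48)


print(dot_product([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6
```

The loop in `dot_product` mirrors the way chained blocks pass a running sum from one block's PCOUT to the next block's PCIN.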
The Virtex-5 layout scheme has columns of block RAMs and columns
of DSP48Es, integrated into the regular fabric of logic blocks and routing.
Section 4 describes packing techniques for block RAMs and DSPs that yield
improved performance for circuits containing commonly occurring patterns of
block RAM/DSP instances.
3. PACKING FOR LOGIC BLOCKS
In the traditional FPGA CAD flow, packing is done at the preplacement stage.
However, for the case of packing into the dual-output 6-LUTs in Virtex-5, one of
the two LUT outputs has slower speed. Packing into 6-LUTs must be done with
consideration of the circuit's critical path if high speed is to be maintained.
Fig. 8. Virtex-5 DSP48E block (ports shown include A (25-bit), B (18-bit), C (48-bit), ACIN, BCIN, ACOUT, BCOUT, and PCOUT).
Because roughly 50% of the path delay in FPGAs is interconnect delay, little
is known regarding path criticality at the preplacement stage. Therefore, we
choose to do dual-output LUT packing during placement, as we have access to
better estimates of connection delay.
3.1 Placer Flow
Figure 9 shows the flow of the Xilinx placer. The input to the flow is a netlist
composed of LUTs, flip-flops, large IP blocks, and some prepacked SLICEs,
which may or may not be able to accommodate additional LUTs and/or registers.
Core logic blocks are placed using analytical techniques, similar to those
published in Eisenmann and Johannes [1998] and Viswanathan et al. [2007].
After initial IO placement, the placement problem is represented mathematically
as a system of equations to be solved. An initial overlapped placement
of the design is computed by solving the equations. Based on the intermediate
placement, the mathematical formulation is modified to move logic blocks away
from highly congested regions. The solving and reformulation of the mathematical
system continues iteratively until, ultimately, a feasible overlap-free
placement is produced. Once analytical placement is finished, swap-based local
optimization is run on the placed design to further refine the placement.
The cost function used in the placer considers wirelength and timing:

$$\mathrm{Cost} = a \cdot W + b \cdot T$$

where W and T are the wirelength and timing (speed performance) costs,
respectively, and a and b are scalar weights. The values of a and b can be
set according to the relative priority of speed performance versus wirelength.
Since actual routes are not available during placement, the wirelength cost
depends on wirelength estimation. Likewise, the timing cost depends on estimates
of connection delays and user-supplied constraints.
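A weighted cost of this form can be sketched in a few lines of Python. The estimators below are assumptions chosen for illustration, not the ones used in the Xilinx placer: wirelength W is estimated with half-perimeter bounding boxes, and the timing cost T is the total amount by which a distance-proportional delay estimate exceeds each connection's constraint. All names and the example data are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Position = Tuple[float, float]
Constraint = Tuple[str, str, float]   # (source, sink, required delay)


def hpwl(net: Sequence[str], positions: Dict[str, Position]) -> float:
    """Half-perimeter wirelength: width plus height of the net's bounding box."""
    xs = [positions[cell][0] for cell in net]
    ys = [positions[cell][1] for cell in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def timing_violation(c: Constraint, positions: Dict[str, Position],
                     delay_per_unit: float) -> float:
    """Estimated delay beyond the constraint for one connection (0 if met)."""
    src, dst, required = c
    (x0, y0), (x1, y1) = positions[src], positions[dst]
    estimated = delay_per_unit * (abs(x0 - x1) + abs(y0 - y1))
    return max(0.0, estimated - required)


def placement_cost(nets, constraints, positions, a: float, b: float,
                   delay_per_unit: float = 1.0) -> float:
    """Cost = a * W + b * T with the assumed estimators above."""
    w = sum(hpwl(net, positions) for net in nets)
    t = sum(timing_violation(c, positions, delay_per_unit) for c in constraints)
    return a * w + b * t


positions = {"u": (0.0, 0.0), "v": (3.0, 4.0), "w": (1.0, 1.0)}
nets = [["u", "v", "w"]]
constraints = [("u", "v", 5.0)]
print(placement_cost(nets, constraints, positions, a=1.0, b=2.0))  # prints 11.0
```

Raising b relative to a makes the same placement more expensive whenever a constrained connection is stretched, which is how the weights express the priority of speed over wirelength.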
The Xilinx tools permit a user to manually pack the SLICEs in their design, thereby limiting the
potential for additional automatic packing into dual-output LUTs.
Fig. 9. Placer flow (analytical placement repeats until the placement is overlap-free, followed by swap-based local optimization).
3.2 Packing During Placement
3.2.1 Leveraging Two LUT Outputs. The first task in creating a dual-output
LUT packing algorithm was to "open up" the usage of the second LUT
output, O5, and permit two LUTs to be packed together. This involved implementing
legality checking on whether two LUTs can feasibly pack together.
Two principal legality checks are made. First, we check whether, combined, the
two LUTs use more than 5 distinct variables; if so, they are disqualified
from packing together. Second, we ensure that at most one of the two LUTs
drives a register. The reason for this second restriction is that there exists only
one register per dual-output 6-LUT. If two LUTs feeding registers were packed
into the 6-LUT, one of the LUTs would not be able to attain a fast connection to
its fanout register, damaging performance considerably. In fact, if two LUTs
needing registers are packed together into a single 6-LUT, one of the two registers
must be placed in a nonadjacent register slot, perhaps even in another
CLB. Other detailed packing restrictions that we found useful are described in
the following paragraphs.
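The two principal legality checks can be expressed compactly. The following Python sketch is a minimal illustration under assumed data structures: the `Lut` record, its fields, and the function name `can_pack_together` are hypothetical and do not come from the Xilinx tools. It covers only the two checks named above, not the additional restrictions mentioned in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Lut:
    """Hypothetical netlist record for one LUT."""
    name: str
    inputs: FrozenSet[str]          # names of the signals feeding the LUT
    drives_register: bool = False   # True if the LUT output feeds a flip-flop


def can_pack_together(x: Lut, y: Lut, max_distinct_inputs: int = 5) -> bool:
    """Apply the two principal legality checks for sharing one dual-output 6-LUT."""
    # Check 1: the combined set of distinct input variables must fit in 5 inputs.
    if len(x.inputs | y.inputs) > max_distinct_inputs:
        return False
    # Check 2: only one register is available, so at most one LUT may drive one.
    if x.drives_register and y.drives_register:
        return False
    return True


f = Lut("f", frozenset({"a", "b", "c"}))
g = Lut("g", frozenset({"c", "d", "e"}), drives_register=True)
h = Lut("h", frozenset({"x", "y", "z"}))
k = Lut("k", frozenset({"a", "b"}), drives_register=True)

print(can_pack_together(f, g))   # True: 5 distinct inputs, one register
print(can_pack_together(f, h))   # False: 6 distinct inputs
print(can_pack_together(g, k))   # False: both LUTs drive registers
```

Shared inputs count only once, so two 3-input LUTs with one signal in common (as with `f` and `g`) still fit within the 5-input limit.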
While simulated annealing-based placers keep design objects snapped onto
the FPGA grid at all times, analytical placement algorithms operate in the
floating-point domain. At the end of spreading, when the placement is close to
being feasible (overlap-free), the design objects are "fit" or "snapped" onto the
FPGA grid. Actually, opening up the second LUT slot in each 6-LUT permits a ...