Packing Techniques for Virtex-5 FPGAs 2009.pdf

File name: Packing Techniques for Virtex-5 FPGAs 2009.pdf

Category: Hardware development

Development tools:

File size: 426 KB

Downloads: 0

Upload date: 2019-08-17

Provider: fengji******


Description: Packing is a key step in the FPGA tool flow that straddles the boundaries between synthesis, technology mapping and placement. Packing strongly influences circuit speed, density, and power, and in this article, we consider packing in the commercial FPGA context and examine the area and performance t…

A preliminary version of this work appeared in Ahmed et al. [2008]. In this extended journal version, we give a more thorough treatment of the proposed techniques. More significantly, we have retuned our algorithms to reflect an area and performance trade-off that we believe will be more desirable to customers. As such, for this article, we have regenerated all of our experimental results, and we also present new experimental work on the area/performance trade-off space of the proposed packing techniques.

The balance of the article is organized as follows: Section 2 reviews related work on packing and introduces the Virtex-5 FPGA architecture. Packing techniques that target logic blocks are offered in Section 3. Packing techniques for large IP blocks appear in Section 4. Conclusions and suggestions for future work are given in Section 5.

2. BACKGROUND

2.1 Packing

Figure 1 depicts the classic FPGA logic block model used in the vast majority of academic research. It consists of a cluster of LUTs and flip-flops, where each flip-flop can be optionally bypassed for implementing combinational logic. Static RAM cells, not shown in the figure, control the logic function of each LUT, and also control the multiplexer and interconnect connectivity. Inputs to the logic block come from the FPGA's general interconnect: horizontal and vertical channels of FPGA routing. Local interconnect inside the logic block is available for realizing fast paths within the logic block. Observe that each LUT/FF pair drives both local interconnect and general interconnect.
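The classic logic block element described above can be sketched in a few lines of Python. This is only an illustrative model: the class name `BasicLogicElement` and its fields are hypothetical, and the address-bit ordering is an arbitrary choice made for the example. It shows the two things that SRAM configuration controls in the text: the LUT truth table and whether the flip-flop is bypassed.

```python
from dataclasses import dataclass


@dataclass
class BasicLogicElement:
    """One LUT/flip-flop pair of the classic logic block (hypothetical model)."""
    truth_table: list      # 2**k SRAM bits, one per LUT input combination
    use_ff: bool = False   # SRAM-controlled mux: False bypasses the flip-flop
    ff_state: int = 0      # current flip-flop contents

    def lut(self, inputs):
        # Read the inputs as a binary address; inputs[0] is the most significant bit.
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.truth_table[index]

    def output(self, inputs):
        # Value driven onto both local and general interconnect.
        return self.ff_state if self.use_ff else self.lut(inputs)

    def clock(self, inputs):
        # Rising clock edge: the flip-flop captures the LUT output.
        self.ff_state = self.lut(inputs)
```

With `use_ff=False` the element behaves as pure combinational logic; with `use_ff=True` the output changes only after `clock` is called.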
Most prior packing work assumes the local interconnect to be a full crossbar switch matrix: every input can be programmably connected to any output. In this logic block model, connections within the logic block are fast, and connections between logic blocks, routed through the general interconnect, are slow in comparison. The packing step decides which LUTs to put together into a single logic block, and therefore, packing has a significant impact on circuit speed.

The model of Figure 1 is representative of early Altera FLEX FPGAs [Altera 2003]; however, it has become out-of-step with the logic blocks present in modern Xilinx or Altera FPGAs. Modern logic blocks present a more complex packing problem and new optimization opportunities and impose different constraints, as we will illustrate in the sections below.

Much research has been published on packing for the logic block in Figure 1. Perhaps the most cited work is that of Betz and Rose, who proposed an area-driven packing algorithm and showed that, due to inherent locality in circuits, the number of inputs to a logic block can be much smaller than the total number of LUT inputs within a cluster [Betz and Rose 1997]. In particular, for a logic block with N 4-input LUTs, Betz and Rose [1997] showed that only 2N + 2 inputs to the cluster are needed, a number much smaller than 4N. Marquardt extended the work to perform timing-driven packing and demonstrated the impact of packing on critical path delay [Marquardt et al. 1999].

T. Ahmed et al. ACM Transactions on Reconfigurable Technology and Systems, Vol. 2, No. 3, Article 18, Pub. date: September 2009.

Fig. 1. Classic FPGA logic block targeted by most academic research.

Packing affects power consumption, as intra-logic-block connections will have lower capacitance than inter-logic-block connections.
A natural approach therefore is to attempt to keep nets with high switching activity contained within logic blocks, as was proposed in Lamoureux and Wilton [2003]. An entirely different approach for power-driven packing was shown in Singh and Marek-Sadowska [2002], where Rent's rule was used to establish a preference for how many logic block inputs should be used during packing, leading to lower overall interconnect usage, capacitance and power. Although not yet available commercially, dual-VDD FPGAs have been proposed by academia, where the idea is to programmably allow logic blocks to operate at reduced supply voltage (slower but lower power). Researchers at UCLA developed a complete CAD flow for a proposed dual-VDD FPGA, including new packing techniques [Chen and Cong 2004]. The aim of packing in this context is to pack LUTs based on their timing-criticality, placing noncritical LUTs together into logic blocks that will be operated at low VDD. Hassan et al. [2005] dealt with packing for a low-power FPGA having logic blocks that, when idle, can be placed into a low-leakage sleep state.

On the speed axis, more recent work includes Dehkordi and Brown [2002], which also uses a Rent's rule-based algorithm, and prevents loosely connected LUTs from being packed together; that is, it prevents the packing of unrelated LUTs. Other papers tie together packing with other phases of the FPGA CAD flow. For example, Schabas and Brown [2003] looked at packing in the context of logic replication for performance; a subset of LUTs are deliberately left empty by the packer to accommodate later LUT replications during placement. An interesting recent work by Lin et al.
[2006] brought together packing and technology mapping and showed that higher speed can be attained using a unified algorithm for concurrent packing and technology mapping.

2.2 Virtex-5 FPGA Architecture

We now describe the Virtex-5 architecture, focusing on the blocks that are the target of our packing techniques.

Fig. 2. Virtex-5 CLB and slice.

2.2.1 Logic Blocks. Logic blocks in Virtex-5 are called configurable logic blocks (CLBs) and comprise two SLICEs and a switch matrix, as shown in Figure 2. The switch matrix serves the purpose of providing intra- and inter-CLB connectivity. As shown in the figure, each SLICE contains four 6-input LUTs and four flip-flops. SLICEs receive and produce carry signals from neighboring SLICEs for implementing fast ripple-carry addition.

Figure 3 zooms in on the details of a portion of a SLICE: the LUT/flip-flop pair. Dashed lines in the figure represent connections whose details are omitted for clarity, but will be described below. Traditional LUTs in FPGAs have a single output and can implement a single logic function. However, LUTs in Virtex-5 FPGAs have two outputs, O5 and O6, and can implement two logic functions. Specifically, the LUT can implement a single logic function using up to 6 inputs, with the output signal appearing on O6. The LUT can also implement two functions that together use up to 5 inputs, in which case both the O5 and O6 outputs are used, and the LUT input A6 must be tied to logic-1. The architecture necessitates that when two LUT outputs are used, only one of them can be registered.

An important property of the LUT/FF structure in Figure 3 is that O6 directly drives a SLICE output, A, whereas O5 must pass through an additional multiplexer before reaching SLICE output, AMUX.
This makes output O5 slower than output O6. Consequently, for high performance, O5 should not be used in timing-critical combinational paths. The relative slowness of O5 is something that must be considered when packing LUTs into Virtex-5 SLICEs, if high-speed performance is to be maintained.

Figure 4 shows the dual-output LUT hardware implementation. LUTs in FPGAs are implemented with trees of multiplexers, and the figure shows a portion of a 6-input LUT. Values in SRAM cells (on the left of the figure) are set to hold the truth table for the logic function(s) implemented in the LUT. The fastest LUT input, A6, is held constant at logic-1 when two functions are mapped to the LUT, and therefore the fast input is unavailable for either of the two functions. Thus, both logic functions suffer from increased delay in dual-output mode, relative to the case of implementing a single function in a LUT, where A6 can be used. Observe that the O5 output is fed by an internal node of the multiplexer tree.

Fig. 3. Dual-output 6-input LUT.

Fig. 4. Hardware implementation of dual-output LUT.

Circuits implemented in an FPGA need not use all 6 LUT inputs, and in fact, circuits frequently contain many small logic functions that need only 2 or 3 LUT inputs. Figure 5 shows a histogram (similar to that in Hutton et al. [2004]) of LUT sizes from a circuit collected from a Xilinx customer. In this case, 37% of LUTs use 2 or 3 inputs, and such small LUTs could be paired together and packed into one dual-output 6-LUT. The frequency of small LUTs in circuits bodes well for the usability of the second LUT output.

Section 3 describes a packing approach for using both LUT outputs to produce higher logic density, while at the same time minimizing the performance hit of using the slower O5 output.
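To make the dual-output mode concrete, the following Python sketch models a 6-LUT as a 64-bit truth table. It is a behavioral illustration only, not a description of the actual circuit: the function names `lut6_2` and `pack_pair` are hypothetical, and the assignment of the O5 function to the lower 32 truth-table bits and the O6 function to the upper 32 bits is an assumption made for the example.

```python
def lut6_2(init, a):
    """Evaluate a dual-output 6-LUT (behavioral sketch).

    init: 64-bit truth table as an int; bit i is the output for input address i.
    a:    tuple (A1, A2, A3, A4, A5, A6) of 0/1 values.
    Returns (o5, o6).
    """
    a1, a2, a3, a4, a5, a6 = a
    addr5 = a1 | (a2 << 1) | (a3 << 2) | (a4 << 3) | (a5 << 4)
    lower = (init >> addr5) & 1          # 5-input function in the lower half
    upper = (init >> (32 + addr5)) & 1   # 5-input function in the upper half
    o5 = lower                           # tapped from an internal mux-tree node
    o6 = upper if a6 else lower          # final mux stage, selected by A6
    return o5, o6


def pack_pair(f_o5, f_o6):
    """Build one truth table holding two functions of the shared inputs A1..A5."""
    init = 0
    for addr in range(32):
        bits = [(addr >> i) & 1 for i in range(5)]
        init |= f_o5(*bits) << addr
        init |= f_o6(*bits) << (32 + addr)
    return init
```

For example, `pack_pair` can place a 2-input AND on O5 and a 3-input XOR on O6; evaluating `lut6_2` with A6 held at 1 then yields both results at once, which is why A6 is unavailable to either function in this mode.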
The higher density is achieved as a result of the pervasiveness of small logic functions in circuits that can be packed together in a single dual-output 6-LUT.

Fig. 5. LUT input usage for sample design (horizontal axis: number of used LUT inputs).

2.2.2 Block RAMs and DSPs. In early FPGAs, any data used by the on-chip application had to be stored in off-chip RAMs. Accessing memory was slow and power hungry. Modern FPGAs such as Virtex-5 include blocks of static RAM on-chip. Figure 6 shows a Virtex-5 block RAM, which can store up to 36 Kbits of data. A block RAM can operate as a single 36 Kb RAM, or as two independent 18 Kb RAMs. The RAMs' read and write operations are synchronous. Multiple block RAMs are organized into columns on the Virtex-5 device, and two adjacent block RAMs in a column can be combined to implement deeper memories.

Fig. 6. Virtex-5 block RAM tile.

When an 18 Kb RAM is used in dual-port mode, there are two possibilities:

(1) True dual-port mode: the 18 Kb block RAM has two independent access ports, A and B. Data can be written to either or both ports, and can be read from either or both ports. Figure 7(a) illustrates this case. In the figure, the two 18 Kb block RAMs in the block RAM tile are referred to as the upper and lower block RAMs, and the two access ports on each 18 Kb block RAM are called the A and B access ports.
DI, DO and ADDR refer to data-in, data-out, and address ports, respectively.

(2) Simple dual-port RAM mode: In this case, each port cannot be used for both reads and writes; rather, the 18 Kb block RAM has one port for read operations and an independent port for write operations. Figure 7(b) illustrates this case. In the figure, WRADDR and RDADDR refer to the write-address and read-address ports, respectively.

Fig. 7. Dual-port block RAMs: true and simple. (a) Two 18 Kb true dual-port memories. (b) Two 18 Kb simple dual-port memories.

The maximum allowable word width of a RAM in simple dual-port mode is 36 bits, which is double that allowed for true dual-port mode. The RAMs have a configurable "aspect ratio", meaning that an 18 Kb RAM can be configured as 16K×1, 8K×2, 2K×9, or 1K×18, or, for simple dual-port RAMs, 512×36.

DSP circuits frequently contain multiply-accumulate operations, and in early FPGAs, such functionality was implemented using the generic LUT fabric. In modern FPGAs, however, such functionality can be realized with improved speed and power using hard IP blocks. Figure 8 depicts a Virtex-5 DSP block, called the DSP48E. The DSP48E comprises a 25 × 18 multiplier which receives inputs A and B as the operands to the multiply block. The product is fed into an ALU that also may receive input C. The ALU can implement addition, subtraction, and varied logic functions. Multiple DSP48Es can chain together to form more complex functions through the PCIN and PCOUT ports shown in the figure.
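The multiply-then-ALU datapath and the PCIN/PCOUT chaining can be illustrated with a small behavioral sketch in Python. This is a simplification under stated assumptions, not a model of the real block: the function names are hypothetical, operands are treated as signed two's-complement values, the single `addend` argument stands in for either C or PCIN, and the ALU's logic functions and any internal registers are ignored.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dsp48e_step(a, b, addend=0, subtract=False):
    """One multiply-add: P = addend +/- A*B, wrapped to 48 bits.

    a is truncated to 25 bits and b to 18 bits, matching the 25 x 18 multiplier.
    addend models the second ALU operand (input C or the cascade input PCIN).
    """
    product = to_signed(a, 25) * to_signed(b, 18)
    base = to_signed(addend, 48)
    result = base - product if subtract else base + product
    return to_signed(result, 48)


def dot_product(xs, ys):
    """Chain blocks: each block's output (PCOUT) feeds the next block's PCIN."""
    p = 0
    for x, y in zip(xs, ys):
        p = dsp48e_step(x, y, addend=p)
    return p
```

Calling `dot_product([1, 2, 3], [4, 5, 6])` accumulates three products through the modeled cascade, the kind of multiply-accumulate pattern the text says DSP circuits frequently contain.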
The structure in Figure 8 is not entirely combinational and actually has programmable pipeline depth, via configurable registers present at several places in the block. A comprehensive description of the DSP48E can be found in Xilinx [2007].

Fig. 8. Virtex-5 DSP48E block.

The Virtex-5 layout scheme has columns of block RAMs and columns of DSP48Es, integrated into the regular fabric of logic blocks and routing. Section 4 describes packing techniques for block RAMs and DSPs that yield improved performance for circuits containing commonly occurring patterns of block RAM/DSP instances.

3. PACKING FOR LOGIC BLOCKS

In the traditional FPGA CAD flow, packing is done at the preplacement stage. However, for the case of packing into the dual-output 6-LUTs in Virtex-5, one of the two LUT outputs has slower speed. Packing into 6-LUTs must be done with consideration of the circuit's critical path, if high speed is to be maintained. Because roughly 50% of the path delay in FPGAs is interconnect delay, little is known regarding path criticality at the preplacement stage. Therefore, we choose to do dual-output LUT packing during placement, as we have access to better estimates of connection delay.

3.1 Placer Flow

Figure 9 shows the flow of the Xilinx placer. The input to the flow is a netlist composed of LUTs, flip-flops, large IP blocks, and some prepacked SLICEs, which may or may not be able to accommodate additional LUTs and/or registers. Core logic blocks are placed using analytical techniques, similar to those published in Eisenmann and Johannes [1998] and Viswanathan et al. [2007]. After initial IO placement, the placement problem is represented mathematically, as a system of equations to be solved. An initial overlapped placement of the design is computed by solving the equations.
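As a rough illustration of how solving a system of equations can yield an overlapped placement, the sketch below uses a one-dimensional quadratic wirelength objective. This is an assumption chosen for the example; the text does not state the exact formulation used by the Xilinx placer. The function name, the two-pin net model, and the Gauss-Seidel solver are all hypothetical choices. Minimizing the weighted sum of squared net lengths puts each movable block at the weighted average of its neighbors, and repeating that update converges to the solution of the linear system.

```python
def solve_quadratic_placement(movable, fixed, nets, sweeps=200):
    """Place movable blocks on a line by minimizing weighted squared wirelength.

    movable: list of block names to place.
    fixed:   dict mapping fixed object names (e.g. IO pads) to coordinates.
    nets:    list of two-pin connections (u, v, weight).
    Assumes every movable block has at least one connection.
    """
    pos = {name: 0.0 for name in movable}
    pos.update(fixed)
    neighbors = {name: [] for name in movable}
    for u, v, w in nets:
        if u in neighbors:
            neighbors[u].append((v, w))
        if v in neighbors:
            neighbors[v].append((u, w))
    # Gauss-Seidel sweeps over the optimality equations:
    # x_i = sum_j(w_ij * x_j) / sum_j(w_ij)
    for _ in range(sweeps):
        for name in movable:
            total = sum(w for _, w in neighbors[name])
            pos[name] = sum(w * pos[n] for n, w in neighbors[name]) / total
    return pos


# Two blocks tied to the same pair of pads land on the same spot (x = 2.0),
# which is the kind of overlap that later spreading steps must remove.
overlapped = solve_quadratic_placement(
    ["a", "b"],
    {"left": 0.0, "right": 4.0},
    [("left", "a", 1.0), ("a", "right", 1.0),
     ("left", "b", 1.0), ("b", "right", 1.0)],
)
```

The solution minimizes wirelength but ignores the fact that two blocks cannot share a site, which is why such a placement is only a starting point.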
Based on the intermediate placement, the mathematical formulation is modified to move logic blocks away from highly congested regions. The solving and reformulation of the mathematical system continues iteratively until ultimately, a feasible overlap-free placement is produced. Once analytical placement is finished, swap-based local optimization is run on the placed design to further refine the placement. The cost function used in the placer considers wirelength and timing:

Cost = a · W + b · T

where W and T are the wirelength and timing (speed performance) costs, respectively, and a and b are scalar weights. The values of a and b can be set according to the relative priority of speed performance versus wirelength. Since actual routes are not available during placement, the wirelength cost depends on wirelength estimation. Likewise, the timing cost depends on estimates of connection delays and user-supplied constraints.

The Xilinx tools permit a user to manually pack the SLICEs in their design, thereby limiting the potential for additional automatic packing into dual-output LUTs.

Fig. 9. Placer flow (analytical placement, overlap-free check, swap-based local optimization).

3.2 Packing During Placement

3.2.1 Leveraging Two LUT Outputs. The first task in creating a dual-output LUT packing algorithm was to "open up" the usage of the second LUT output, O5, and permit two LUTs to be packed together. This involved implementing legality checking on whether two LUTs can feasibly pack together. Two principal legality checks are made: First, we check whether, combined, the two LUTs use more than 5 distinct variables, and if so, they are disqualified from packing together. Second, we ensure that at most one of the two LUTs drives a register. The reason for this second restriction is that there exists only one register per dual-output 6-LUT.
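The two principal legality checks can be expressed as a short predicate. The Python sketch below is illustrative only: the `Lut` record, its fields, and the function name `can_pack_together` are hypothetical and are not taken from the Xilinx tools, and the additional restrictions mentioned in the text are not modeled.

```python
from dataclasses import dataclass

MAX_SHARED_INPUTS = 5  # A6 is tied to logic-1 in dual-output mode


@dataclass(frozen=True)
class Lut:
    """Minimal netlist view of a LUT (hypothetical record for this sketch)."""
    name: str
    inputs: frozenset            # names of the distinct input signals
    drives_register: bool = False


def can_pack_together(x, y):
    """Return True if x and y may share one dual-output 6-LUT."""
    # Check 1: the union of both input sets must fit on A1..A5.
    if len(x.inputs | y.inputs) > MAX_SHARED_INPUTS:
        return False
    # Check 2: only one register is available per dual-output 6-LUT.
    if x.drives_register and y.drives_register:
        return False
    return True
```

Because the first check counts distinct signals, two LUTs that share inputs are easier to pair than two with disjoint inputs: a 3-input and a 4-input LUT pass the check only if they have at least two signals in common.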
If two LUTs feeding registers were packed into the 6-LUT, one of the LUTs would not be able to attain a fast connection to its fanout register, damaging performance considerably. In fact, if two LUTs needing registers are packed together into a single 6-LUT, one of the two registers must be placed in a nonadjacent register slot, perhaps even in another CLB. Other detailed packing restrictions that we found useful are described in the following paragraphs.

While simulated annealing-based placers keep design objects snapped onto the FPGA grid at all times, analytical placement algorithms operate in the floating-point domain. At the end of spreading, when the placement is close to being feasible (overlap-free), the design objects are "fit" or "snapped" onto the FPGA grid. Actually, opening up the second LUT slot in each 6-LUT permits a

(Generated automatically by the system; the contents can be previewed before downloading.)