FPGA-Based Secret Key Distillation Engine Quantum

文件名称: FPGA-Based Secret Key Distillation Engine Quantum Key Distribution Systems

所属分类: 网络安全

开发工具:

文件大小: 2mb

下载次数: 0

上传时间: 2019-07-16

提供者: ahn****

下载 (2mb)

不能下载？报告错误

详细说明：An FPGA-Based 4 Mbps Secret Key Distillation Engine for Quantum Key Distribution Systems.pdf IDQ QKD 论文，可以参考。J Sign Process Syst(2017)86: 1-15 they contain on average less than one photon. The receiver of leaked information and the information leaked to eve (Bob) measures the arrival time of photons with a single through the quantum channel (in case of an eavesdropping photon detector Dbit in order to retrieve the information. attack) needs to be removed from the key bits to produce This encoding scheme based on time bins hence requires a the final secret key, in a step called privacy amplification very accurate clock synchronization between both parties. We employ Toeplitz hashing with a randomized matrix to In order to guarantee the secrecy of the transmission, an amplify the key bits privacy. The secret key bits are then for interferometer is added at Bob. Any attempt of an eaves- warded to a key manager, which uses some of the bits for dropper(Eve)to intercept the photons, will cause a decrease authentication of the classical channel and emits the rest as of the interference visibility, which can be detected with the secret key which is shared between Alice and bob the monitoring detector Dmon. For security reasons, decoy sequences(optical pulses in both time bins which do not encode a bit value) have to be introduced randomly, too. For 4 Secret Key Distillation System Architecture a detailed description of the QKd protocol see [2 The hardware architecture of the key distillation system is presented in Fig 3. The diagram show all blocks necessary 3 Secret Key Distillation Process to operate either as Alice or as bob. the front -end that is fed by the quantum channel consists of the sifting modules, The secret key distillation process enables reliable key which rely on the synchronization and alignment blocks agreement between Alice and Bob on top of the optical sig- that handle the clock recovery and exact synchronization nalling scheme(e. g, the introduced Cow) that is employed of the transmitter and receiver systems(cf. Section 4.1) by the Qkd system. The distillation process is detailed in On Alice's side the front end also includes a quantum state Fig.2 in the form of a sequence diagram. While the indi- selection module, which consumes random bits from an vidual processing units involved in the distillation and their external quantum random number generator, and generates interactions including the corresponding data flows are pre- the random qubits that are sent over the optical quantum sented in greater detail in Section 4, we summarize the channel to Bob. In fact, the required random numbers are process as follows provided on both Alice and Bob by an intermediate AES Alice sends random qubits over the quantum channel to randomness expansion block, which expands the available Bob, and the secret key distillation then begins after sin- throughput of random bits to the range of hundreds of mbps gle photon detection on Bob's side. For this detection to work, both alice and bob need to be synchronized which is achieved by a high precision clock recovery scheme Alice Quantum Bob attached to the classical communication channel which is a point to point connection between both sides in addition to I Channel Random bits -Qubits Single Photo the quantum channel. During the sifting step Bob sends its Detection successful detection timings (the bit index, not the detected Synchronization kl cLock/ Frames IN Synchronization bit values) back to Alice, who in turn reveals the decoys that Alignment Alignment were introduced by her, such that both sides agree after sift Detection ing on the same raw bits that will form part of the secret Decoys LDPC Syndrome Following the sifting phase, the core signal processing on Correction T-Syndromest LDPC Ero Calculation the raw key bits is performed In a first step the errors in the detected bits are corrected with a forward error correction Universal Hash UHF Rad Universal hash Function code. The resulting key bits are then verified and filtered Function accordingly using a randomized universal hash function Frame filter process. Note that most key distillation steps so far are leak- tT Filter Result-+ Frame Filter ing some amount of information to eve over the classical Pr Random Privac PA Matriⅸ communication channel, e. g, in form of redundancy added Amplification Amplification by the syndromes used for error correction. This quantity Key Manager Key Manager Authenticated Note that only a small fraction of the pulses can be detected, strongly Shared Shared Channel I depending on the optical fiber length between Alice and Bob. How Secret Key Secret Key ever. these missed detections do not result in bit errors. since these bits are omitted by Alice during the sifting process Figure 2 Scqucncc diagram of secret key distillation process 4 J Sign Process Syst(2017)86: 1-15 Figure 3 Systcm architccture f the secret key distillation Clock DDR2- engine. ron- Alice: LDPC DDR2-RAM Jos zation Manager LDPCOk Decoder Controlller 250MHz:125 RAM Bob: Syndrome syndrom Codi DIMM. RAM Port write Arbiter Classica Service Channel Channels rval& tags Hash ead Front- Authentication L Frame Filter End timin gs decoys matix Random Privacy Matrix and amplification Channel: det. Alignment Sifting (op tical) only on Bob) Back End K FPGA Manager Quantum State Random bitstream Config. Reg. Selection Expansion(AES) Monitor secret Alice Quantum Random Number Generator PCHExpress up from the limited output rate of the truly random quantum a memory access schedule with high memory interface source [8. utilization The back-end or key post-processing chain employed The key distillation system furthermore employs a pci to distil the secret key consists of two core modules, the Express link to a PC, used for configuration, monitoring forward error correction(FEC) and privacy amplification and internal debugging purposes. The link also provides a (PA) means to extract the final shared secret key and redistribute The FEC. module is based on a quasi-cyclic low den- it to appropriate consumers, e.g., OTP encryption systems sity parity check (LDPC)code, with syndrome encoding or high speed AES encryptors [15] performed at the quantum bit receiver side (Bob). The The full system architecture is targeted to be imple- transmitter side(alice) performs decoding of its own trans- mented on a single- FPGa (Xilinx Virtex-6 LX240T) mitted codeword bits to adapt(correct) them according to board that supports Small Form-factor Pluggable(SFP the syndrome information received from Bob. Please refer transceivers, a PCI-Express interface, and external DDR to Section 4. 4 for an in-depth description of the FEC and RAM its embedded frame filter mechanism based on randomized In the following we elaborate on the specific implemen universal hashin tation details of signal processing modules that are unique Aftererror correction, the key bits are hashed in a privacy to the context of quantum key distillation amplification module, which effectively compresses the raw key by an adjustable compression ratio(secret fraction) to 4.1 Synchronization and Clocking obtain the final secret key (cf. Section 4.5). The random bits required for the pa step are generated on the receiver The QKd system requires a synchronized and latency com- side(bob ) and transmitted over the service channel to Alice. pensated clocking scheme between Alice and Bob, to cor Since the Pa involves large block sizes of the order of mBit rectly detect the qubits (i.e, to identify the correct time bins to reduce the impact of finite size effects, the key and ran- at Bob. Figure 4 shows this clocking scheme. The clocks of dom bits have to be stored in off-chip memory. During Bob and alice are synchronized through the classical chan- processing, the raw key bits are then repeatedly read by the nel, which is using a SFP link at a frequency of 2.5 Ghz PA module from external memory, while new raw key bits Bob recovers the data and also the clock that is sent over for the next block are simultaneously produced by the fec this serial link from Alice with the 8B/10B protocol using and stored in a back buffer, also located in external memory. the clock-data recovery (CDR)function provided by the Consequently, the core architecture also comprises a module FPGa built-in gTX transceivers. However, the recovered dedicated to the arbitration of the single external DDR mem- clock on Bob's side is not sufficiently clean in terms of ji ory port, which is detailed in Section 4.6. This module ter, for the requirements of the photon detection circuitry allows to keep FPGA-internal buffer sizes low by enforcing on the quantum channel receiver side. Hence, the recovered ringer J Sign Process Syst(2017)86: 1-15 and outputs an authentication tag of n 127 bits PLL ref clock To decrease the key bits consumption from 383 bits per 125MH cleaner block down to almost only 128 bits per block, we exploit OSc the fact that the system has to authenticate several messages recovered 25 MHZ by using the variant proposed in [26]: instead of generat SFP ing a new strongly-universal hash function for each message Classical channel and hence consume 3n - I bits, one can generate a single strongly-universal hash function and keep it secret. Then NH Emitter → Detector Quantum every authentication tag is encrypted with a one-time-pad 125250 channel 125250 With this re-use scheme, authenticating t messages requires MHz Alice Bob MHZ 3n-1+t(n-1) bits instead of t(3n-1) bits when no key is reused. It was recently shown in [17] that, in Canetti's Figure 4 System clocking scheme with clock recovery universal composability (UC) framework [4], this authen tication protocol with key re-use is E-UC-secure, which clock is fed outside the FPGa to an external PLl with high means that the overall authentication error is upper-bounded quality jitter cleaner. The cleaned clock is then used inside by t.E the fpga to drive the alignment block for quantum chan- To simplify the service channel implementation at high nel detections, which correctly phase-aligns the triggering dala rates, we avoid an acknowledge- or retransmission of the photon detector with the arrival of the incoming pho- based communication protocol. However, since the trans tons on Bob. Furthermore, from this cleaned clock all other mission bit error rate of 10-12 for the serial link at 2.5 Gbps system clocks are derived. The clock manager outputs a is too high for our purposes(one bit error approximately nemory clock (250 MHz) for the DDr2 controller interface every 7 min), we perform error correction on the classi and a general system clock(125 MHz) for the signal pro- cal channel data with a Hamming( 15, 11)code to bring the cessing modules. Additionally, on Alice a separate LDPC error rate down to 10-18. The encoding and decoding of the clock(62. 5 MHz)is generated, since the LDPC decoder can Hamming code is implemented using lookup tables be implemented more efficiently at a reduced clock speed 4.3 Sifting 4.2 Service Channel The sifting module on Bobs side discloses the timings and The service channel provides a FIFO-based multi-channel types(base) of the photon detections to Alice over the ser- abstraction interface for the classical channel serial link vice channel. The encodings for this sifting information can hich connects Alice and Bob. All signal processing mod- be chosen with a precision of 6 or 14 bit for the differen- ules in the key distillation engine use their own private tial timestamp data which is always relative to the previous channel(accessible through a FIFO interface)to exchange detection. This flexibility in the number of bits allows the data between the two sides. The multiple data streams system to be better tuned to the detection rate which drops are interleaved by the service channel module, which also with increasing fiber-length. With the long encoding up to authenticates the communication between both parties as 250 km are feasible, while the service channel bandwidth required by the QKd protocol to exclude any tampering usage is better optimized for short distances (up to 20 km) by a third party with the key distillation process data. with shorter time stamps. Alice prunes its bit stream accord Information-theoretically secure authentication is provided ing to the timing information from Bob, and additionally by families of strongly-universal hash functions constructed discloses to Bob the decoy positions and monitor detection based on polynomial hashing as proposed in [5, 26]. There, outcomes, which Bob uses to also sift its own stream of raw a m/2 -almost universal family of hash functions is com- key bits and to estimate the vis posed with a 1/ 2/--almost strongly universal family of Note that the internal buffer of the sifting module on hash functions to give a(m +2)/2-almost strongly uni- Alices side needs to be of considerable size. The longer the versal family of hash functions, where n.(m+ 1)is the distance between Alice and bob, the larger is the required ength in bits of the input message. The advantages of this sifting buffer, since Alice needs to hold all transmitted bits method are that only a total of 3n- l secret key bits are con- in this buffer until the sifting information from bob arrives sumed per authentication to select the hash functions, which For a fiber length of 50 kin the round-trip time is about is constant in terms of the initial size of the message. A 500 us, which results in a typical buffer size of 625 kbit total of m t 2 multiplications over G F(2n) are performed, to store both the bit value and the base at a link rate of as well as the same number of additions (XOrs) over n-bit 625 Mbps. At 250 km of fiber length the required buffer values. The authenticator processes blocks of 995328 bits increases to almost 3 mbit to accommodate for the round pinger J Sign Process Syst(2017)86: 1-15 trip time increase to 2380 us. Hence, at larger link dis- being filtered out by the following check mechanism of the tances an external ram buffer is potentially needed to store FEC, due to an unrecoverable error the transmitted bits of alice efficiently before they can be The efficieny of our system is defined as the achieved sifted percentage of the ideal Shannon capacity of the channel based on the Qber. the capacity of the binary symmetric 4.4 Forward Error correction channel is defined as 4.4.Ⅰ LDPC Codes C(∈)=1-h() with h(e) being the entropy of a bit with error probability e LDPC codes have been chosen for the error correction which corresponds to our QBER. The achievable error free because of their excellent Shannon-capacity approaching throughput of information is calculated as error correction capabilities and the existence of low G=(1-FER)·R complexity decoding solutions [21]. a valid code word x of a classical LDPC code is characterized by satisfying the FER is the frame error rate after decoding and also a func- condition that tion of QBer. We can see from Fig. 6 that by switchin between rates we can achieve a high efficiency for differ- H. x ent QbeR regimes. especially at high rate codes there is a where H is the sparsely populated parity check matrix, significant gain compared to the results published in[27] which defines the specific instance of the code, and s The LDPC decoder operates at a clock frequency of is a known vector called syndrome. The aspect ratio of 62.5 MHz(halved system clock), with 10 iterations per h defines the coding rate(R). LDPC codewords can be decoded code word, providing a maximum throughput of decoded by believe propagation(BP)algorithms 235 Mbps. Hence, the processing speed of the lDPC For our implementation we have chosen a quasi-cyclic decoder is not the limiting factor for the overall maximum prototype matrix H as it used in various modern com- secret key rate of the key distillation engine munication standards. The set of parameters and the h are derived from the LDPC codes specified for the IEEe 4.4.2 Frame Filtering 802.1In standard ll The quasi-cyclic nature of H allows the use of a partial- After error correction of the sifted key bits, it is possible parallel decoder architecture based on a layered decoding that a codeword on Alice side still has a mismatch compared schedule [19]. Such an architecture offers a good trade-off to the bits on Bob(feC failure). Hence, the QKd system between efficient implementation and high reconfigurabil- requires an integrity check on the decoded key bits such that ity. Hence, we can support a wide set of different code rates no mismatch in the final agreed secret key stream is possible without additional complexity [21]. Adapting the coding (with an upper-bounded probability f<232). Verification rate to the channel conditions allows to utilize the quantum of the parity check equations in the ldpc decoder pro channel with high efficiency over many different scenarios. vides a first error detection mechanism. Unfortunately, this Results are provided for 4 different parity check matrices mechanism leaves many block errors undetected, since the corresponding to r=) a fixed codeword size of 1944 is sufficient as our application assumes a constant data tre. In conventional wireless applications the syndrome of th LDPC code is usually fixed to s =0. The parity informa- tion is encoded in the codeword x. For the quantum channel pplication we adopt the concept of syndrome coding as described in [13]. In our system the syndrome s is not a fixed 010" value but dynamically calculated by bob for each codeword based on his sifted data The forward error correction provided by the LDPC 10 This Work-R=1/2 decoder in the QKd system achieves a residual frame error This Work-R=2/3 rate(FER) of less then 0.5 at a quantum bit error rate This Work-R-3/4 This work -R=5/6 (QBER)of 6 on the quantum channel, for a code rate of 0.02 C.04 0.08 0.1 R= 1/2, as can be observed in Fig. 5. One frame corre QBER sponds to a codeword of size 1944 bits,i. e, for a QBER of figure 5 error correction performance of forward error correction 6% each frame has an independent probability of 0.5 of (LDPC decoder) ringer J Sign Process Syst(2017)86: 1-15 48-bit uniformly distributed random value to avoid any tar This Work -R=1/2 This Work -R=2/3 geted collision attacks, and produces a 48-bit hash tag as This Work -R=3/4 This Work -R=5/6 output. a new random value is used for each invocation of Cui et a.间6 the uhF, i.e., for each calculated hash tag. The maximum collision probability for a single tag is 242. 6. while for a sequence of 512 tags(one PA block) it increases to 2-33.5 The block diagram of the Uhf unit is shown in Fig. 7 旦40 The accumulator of the uhf module is loaded during ini- tilization with the first 48 bit of the key bits(code word to be hashed In the following 40 rounds the hash tag is sequen tially computed by performing a GF( 1)Plication of the accumulator with the defined random value for the cur rent tag. After the multiplication the result is XOr'ed with the next 48 bit of key bits, before being stored back in the QBER accumulator. Note that the last part of the code word(k40) is only 24 bit long and needs to be padded with zeros to fit Figure 6 Channel usage efficiency for different code rates the 48-bit format The complexity of the UHF Imodule lies in the required G F(2o)finite-field multiplier. Our design space explo decoder tries to produce a valid codeword and converges, ration showed satisfactory results for a fully combinatorial despite being in error. Hence, we impleinent an additional approach [12]. The multiplier generates all possible mul- ntegrity check, i.e., error detection after correction, by per- tiples (GF powers) of the first operand, using chained forming a hash tag calculation and comparison for each exponentiation blocks. Note that these blocks only con- LDPC code word of size 1944 bits after zero-padding it to tain a fixed barrel shifter (rotate by one bit), and three 1968 bits. The padding is necessary, since the hashed word parallel Xor gates at fixed bit positions(defined by the needs to have a length which is a multiple of 48 bit G F(2+0)polynomial), which reduces the average logic After calculation of the syndrome for the sifted key bits depth through the exponentiation chain down to only on Bob's side, a 48-bit hash tag is calculated for each code three XOR gates. All the multiples are then each com word and sent to Alice, together with a random value of the bined with the second input operand and the result vec same size, which is needed to randomize the hashing pro- tors are finally individually summed to produce the 48-bit cess(required for security reasons). Alice then calculates its result own hash tag for its decoded key bits, using the same ran- dom value, and then performs a comparison of the tags. If a tag mismatch occurs, the corresponding frame(codeword) ke data 48 is discarded by Alice, before the key bits are written to the external memory for further processing. All hash tag com random value(〔 parison results are furthermore forwarded to Bob, which also performs the corresponding filtering of frames on his finite-field key bits stream. Note that the internal Fifo buffer required GF(2 at bob to perform the frame filtering can grow to consider (fully combinatorial) able size, if the round-trip delay time on the service channel an1 2 48 for the hash check is large As mentioned above, the process of hash tag calcula- x← rand G(24) tion requires randomization, since it is necessary to pre for i= 1 …,40 vent an attacker from forging an attack vector on the ←a8x+k service-or quantum-channel that could force a collision of the employed hash function, resulting in a key agreement between Alice and Bob on differing key bits accumulator(a To prevent this, we employ an error detection method 48 named polynomial hashing, which is built on top of a uni hash alid after 41 rounds: 1944 bit codeword versal hash function (UHF)[5, 26]. The UhF is based on tag +24 bit 0-padding finite-field arithmetic over the galois Field G F(2+o)with Figure 7 Block diagram of the universal hash function module using the polynomial x8+x4+x2+x20+1. The method uses a randomized polynomial hashing over G F(248 pinger J Sign Process Syst(2017)86: 1-15 4.5 Privacy Amplification data is kept in off-chip memory(e.g, DDR2-RAM), and is processed in slices. The complete matrix multiplication he transmission of key bits from Alice to Bob is accom- is performed in multiple steps, by slicing the result vec- panied by leakage of information due to the necessary tor into m/ k slices of length k bits each. Every result slice communication on the service channel for sifting, error (y(s-1)k+1, y(s-1)k+2,.,ysk) for s m/k is calcu correction and detection, or due to eavesdropping attacks. lated completely, before continuing with the next part of the Hence, to guarantee a certain level of security, it is neces- result vector. sary to remove this quantity of leaked information through The architecture of the Pa performing the matrix mul- a privacy amplification step During privacy amplification tiplication is shown in Fig 8. The calculation of one slice a partially secure string is to be transformed into a highly is performed in multiple cycles, processing w raw key bits secret key by public discussion. It has been shown that this per cycle. The hashing architecture consists of k parallel can be realized by computing the output of a randomly but multiply-accumulator (MAC) units, which each perform per publicly chosen two-universal hash function applied to the cycle in parallel w single-bit multiplications over GF(2), input string, which results in a secret key that is secure combining random matrix bits with raw key bits. The according to a universally composable security definition results are then summed (Xor parity bit) together with the [18]. To this end, we employ Toeplitz hashing [1 1] to reduce current value of the accumulator register to produce the new the error corrected bit stream by an adjustable compression sum bit. The input raw key bits for the MAC units, i.e., the ratio, ensuring the security of the remaining secret key bits. currently needed part of x of size w, are the same for each Toeplitz hashing is performed by a single matrix-vector MAC unit (in a cycle), while the corresponding row of ran- multiplication over GF(2), according to dom bits is taken from a large shift register of size k t The first MAC unit takes the w random bits starting from the second flip-flop, while the second mac unit takes the 7n+171 w random bits starting from the third flip-flop, and so on, as illustrated in Fig. 8. Note that the first flip-flop is only used n+m-1n+m-2···Fn-m-1 for loading purposes (of full random words of size w),but not directly for calculation where the m x n matrix consists of random bits ri with As the example in the bottom right corner of Fig. 8 i=I.n+ m-l, and the corresponding vector is con- shows, every cycle the shift register is advanced by a step structed from the n raw key bits xj withj=1.n that size of w, shifting in a new word of random bits at the top need to be privacy amplified, i.e., hashed, resulting in m (e.g, random bits r5, r6, r? in cycle 2). By only storing each secret key bits y withl= 1..m. The employed matrix is a cycle the boundary of the currently required matrix area of diagonal-constant (Toeplitz) matrix, which means that it has dimension k x w, we achieve a very area-efficient imple a regular diagonal structure. This specific structure reduces mentation of the Toeplitz hashing algorithm. We exploit the the amount of random bits required for the compression of regular diagonal structure of the matrix, by a shifting mech a vector of length n with a secret fraction of m/n(using an anism that allows the boundary required for cycle i to be m x n matrix) from mn to m+ n-l, compared to a fully easily transformed into the boundary that is required for random matrix. This allows to reduce the number of random cycle i+ 1 bits that have to be generated, processed and sent over the Before the calculation of a slice can begin. initial random service channel to a reasonable amount matrix bits(the k bits that correspond to the slice of bits of Note that the random matrix is not kept secret, however it the first column of the full random matrix )are preloaded is not to be revealed over the public channel before all qubits into the lower part of the shift register, loading w bits per associated with the hashing matrix have been transmitted cycle. At the same time the result bits y of the previous and the raw key bits have been error corrected and filtered slice calculation are shifted out of the Mac units, and are on both sides forwarded to the key manager as secret key bits For correct handling of unwanted finite size effects, the During the calculation of one slice, the architecture con- block size n for the PA vector should be as large as pos- sumes w random bits as well as w key bits per cycle. the sible. We choose a block size of nN 10(specifically calculation for one slice is processed in n/w cycles, while 512x 1944 for our reference implementation), which the loading phase takes k/w cycles. Since one full PA block for a typical secret fraction of 10 leads to a multiplication consists of m/k slices, generation of m secret key bits uses with a Toeplitz matrix of dimensions 105 x 10 ((k/w)+(n/w))(m/k) cycles(including all load overhead Since the total amount of bits required to construct the between slices) random matrix and the raw key bit vector itself are ty picall The parameter w is chosen based on the width and too large to fit into FPGA on-chip memory(BRAMs, the throughput of the external memory interface, i. e how much ringer J Sign Process Syst(2017)86: 1-15 Figure8 Toeplitz hashing random bits key bits architecture employed by key bits privacy amplification w,KMAC sum I MAC bit I random bits L,K MAC hashed key bits MAC le 1 cycle 2 cycle 3 Z+MAC T1o r1 r2r3 ra rs rer,rs ri1 r10 r1 r3 rars r6 r, MAC r12 [11 r10 r1 r2 r3 r4 rs r6 rio r1 r2rar4 Is (k+w)-bit r14r13「1 r11 rio r1rar3 shift register k multiply (w steps/cycle) accumulate units example: k=6 W=3 n=9 data it can provide per clock cycle. The main parameter needed. Hence to optimize the resource utilization it is for adjusting the circuit area(FPGA resource)utilization as beneficial to choose the largest u such that the maximum well as the degree of parallelism, is the number of parallel key rate(bounded by the memory throughput limit) is not MAC units. k exceeded In our specific system a word width w of 128 The PA architecture supports flexible compression ratios, would not be beneficial(for any k), since the supported key providing support for the optimal secret fraction corre- rate would al ways exceed the maximum that the external sponding to the physical link properties of the underlying RAM allows QKD system installation and the desired finite block length Regarding the available resources in the V6LX240T security parameter. The secret fraction can be set in the full FPGA (150720 LUTs), a configuration of k= 512 and range of0 to 100 in increments of k/n. w=64 provides a good compromise between sup The design space exploration of the privacy amplifica- ported maximum key rate (4.1 Mbps) and the required tion unit is presented in Table l. The values relate to a device utilization (11 %). With a pa block size of QKD system using a Virtex-6 LX240T FPGA, where the n 10 this configuration enables a very fine step PA is clocked at 125 MHZ. The external memory used by size of 0.05 %o for the secret fraction PA compression the Pa is a DDR2 SDRAM providing a peak throughput parameter. of 25.6 Gbps. Processing of one full Pa block is associ- ated with a certain number of bits that need to be transferred 4.6 External Memory Arbitration (written and read) from the external memory, dependin on the number of available maC units(and the com- The QKD system architecture as shown in Fig 3 comprises pression ratio). The number of read bits mainly depends an external memory for storage of data blocks that are too on the number of slices m/k that need to be processed large for on-chip storage in FPGA BRAMS. Since the raw for one pa block. Since the secret key output size(per key block size (n)used as input for the privacy amplifica Pa block /matrix multiplication) scales directly with the tion step is in the order of MBit, and the pa algorithm needs compression ratio, the maximum key rate is practically access to the full block multiple times for a single com independent of the compression ratio and only depends on pression (i.e, simple streaming is not possible), the external the number of parallel Mac units k. This maximum key RAM has to be used for storage of these input blocks and the rate is always bounded by the available external memory random matrix bits that are additionally consumed by the throughput PA. It is hence necessary to arbitrate four different produc The actual achievable key rate for a given number of ers and consumers of data that need access to the external MAC units k then depends on how much hardware in form RAM of LUTs is expended by choosing an appropriate word width w for the mac units. In general, doubling of that word writing of (corrected and filtered)raw key bits by the width provides a two-fold increase in output key rate, while LDCP decoder(Alice)/the syndrome encoder(Bob) also requiring twice the amount of FPGA LUts. The num writing of random bits by the Toeplitz matrix generator ber of required flip-flops is about 2k. since each mac unit reading of raw key bits by the privacy amplification contains one flip-flop, and a shift-register of size k is also reading of random bits by the privacy amplification pinger 10 J Sign Process Syst(2017)86: 1-15 Table 1 Design space exploration of resource utilization for the privacy amplification unit on a vir parallel MAC units(k)and word widths(w) MACS Transt Max Kr =16 32 u=64 Mbit ja Mbps中 flops KR℃ LUTS KR℃ LUTS LUTS 128 1562 1.64 7 0.26 975 0.51 1950 1.03 3901 256 783 3.28 528 0.51 1950 1.03 3901 2.06 7802 6.55 1040 1.03 3901 2.06 15604 1024 197 13.02 2064 2.06 7802 4.11 15604 31208 2048 100 5.76 4112 4.11 15604 8.21 31208 1643 62415 a per block, for a secret fraction of 196/1944 N 10 % i.e. a compression from x 106 to 10 b Theoretical maximum keyrate, when limited by external memory throughput(25.6 Gbps), i. e external memory is exclusively used for PA Achievable keyrate in Mbps, when Pa is running at 125 MHz and input data buffers are never starved The external memory arbitration unit provides the nec- an alternating fashion, always keeping the two correspond- essary memory access coordination such that all four pro- ing PA input buffers /FIFOs filled. After one slice has ducers and consumers can operate at the same time, without been processed the arbiter writes new key bits followed by any of them suffering internal buffer overflow or starvation, new random bits into the corresponding back buffers, before respectively. The goal is furthermore to maximize utiliza- starting to process the next PA slice tion of the single external memory DDR2 SDRAM port, The required total amount of bits that need to be trans- which provides 400 MT/s(mega transfers per second). With ferred(written to and read from the external raM) per PA 64-hit transfers(read or write accesses) this results in a max- block of size n is 2n(k+ 1)+ 2m- 1 bits(where m/n is imum theoretical external memory throughput of 25.6 Gbps. the secret fraction, and k is the slice size, i.e., the number of This limited throughput has to be carefully partitioned in parallel MAC units) such a way that all modules operate at their highest pos- The round-robin schedule employs varying access win sible throughput, without being throttled due to inefficient dow sizes, which are freely configurable for different com memory access patterns, to allow for an overall high Pa pression ratios. The sizes are defined by the required access output rate, i.e., high final secret key rate bandwidth ratios, i.e, the amount of bits each port needs to The arbitration operates on a round-robin schedule based on the state diagram depicted in Fig 9. In general, the sys Write Five configuration registers tem operates on three buffers, which are used in a ring Read com mands K configuration, each buffer holding a full set of raw key and Buffer o Read chunks random bits for one pa block. During normal operation in Write Key commands the pa block loop, the oldest buffer is being processed by one Write Rand Commands the PA, while the two younger buffers are being written with Wr Kcy u Slice Count Write Cmd4↓ Write new key and random bits K Rand specifies PA At start-up the system first stores one full block of raw Buffer 1 MWr Rand Buffer O cmas compression rato key bits into external buffer 0. and then proceeds by fill ing a second back-buffer I with raw key bits, while at the Full block (secret fraction) i←0Done same time storing the random bits associated with the first block of key bits in its corresponding buffer 0. This delay Read PA slice loop Read of the random bits by one full back-buffer is necessary for Preload Read if (Slice Count) Ke cmds security reasons, since the random bits generated on Bob's Buffer slice side must not be revealed over the service channel, before Buffer Chunks Wr Rand the entire raw key block has already been processed by the cmds Read I only with key bits), the RAM arbiter enters its main PA t Rand key Write/: Ca ol system. After this initial phase of filling two buffers(buffer Write Rand Key Buffer block processing loop, which consists of five states. In the Buf. i+1 Buf. i+2 first state the Pa internal shift register of size k is preloaded Pa block loopy with random bits from memory. This is followed by Pa pro Figure 9 State diagram of configurable external memory arbitration cessing of one slice, where key and random bits are read in unit for privacy amplification using back-buffcrs ringer

(系统自动生成,下载前可以参看下载内容)