文件名称:
FPGA-Based Secret Key Distillation Engine Quantum Key Distribution Systems
开发工具:
文件大小: 2mb
下载次数: 0
上传时间: 2019-07-16
详细说明:An FPGA-Based 4 Mbps Secret Key Distillation Engine for Quantum Key Distribution Systems.pdf
IDQ QKD 论文,可以参考。J Sign Process Syst(2017)86: 1-15
they contain on average less than one photon. The receiver of leaked information and the information leaked to eve
(Bob) measures the arrival time of photons with a single through the quantum channel (in case of an eavesdropping
photon detector Dbit in order to retrieve the information. attack) needs to be removed from the key bits to produce
This encoding scheme based on time bins hence requires a the final secret key, in a step called privacy amplification
very accurate clock synchronization between both parties. We employ Toeplitz hashing with a randomized matrix to
In order to guarantee the secrecy of the transmission, an amplify the key bits privacy. The secret key bits are then for
interferometer is added at Bob. Any attempt of an eaves- warded to a key manager, which uses some of the bits for
dropper(Eve)to intercept the photons, will cause a decrease authentication of the classical channel and emits the rest as
of the interference visibility, which can be detected with the secret key which is shared between Alice and bob
the monitoring detector Dmon. For security reasons, decoy
sequences(optical pulses in both time bins which do not
encode a bit value) have to be introduced randomly, too. For 4 Secret Key Distillation System Architecture
a detailed description of the QKd protocol see [2
The hardware architecture of the key distillation system is
presented in Fig 3. The diagram show all blocks necessary
3 Secret Key Distillation Process
to operate either as Alice or as bob. the front -end that is
fed by the quantum channel consists of the sifting modules,
The secret key distillation process enables reliable key which rely on the synchronization and alignment blocks
agreement between Alice and Bob on top of the optical sig- that handle the clock recovery and exact synchronization
nalling scheme(e. g, the introduced Cow) that is employed of the transmitter and receiver systems(cf. Section 4.1)
by the Qkd system. The distillation process is detailed in On Alice's side the front end also includes a quantum state
Fig.2 in the form of a sequence diagram. While the indi- selection module, which consumes random bits from an
vidual processing units involved in the distillation and their external quantum random number generator, and generates
interactions including the corresponding data flows are pre- the random qubits that are sent over the optical quantum
sented in greater detail in Section 4, we summarize the channel to Bob. In fact, the required random numbers are
process as follows
provided on both Alice and Bob by an intermediate AES
Alice sends random qubits over the quantum channel to randomness expansion block, which expands the available
Bob, and the secret key distillation then begins after sin- throughput of random bits to the range of hundreds of mbps
gle photon detection on Bob's side. For this detection to
work, both alice and bob need to be synchronized which
is achieved by a high precision clock recovery scheme
Alice
Quantum
Bob
attached to the classical communication channel which is a
point to point connection between both sides in addition to
I Channel
Random bits
-Qubits
Single Photo
the quantum channel. During the sifting step Bob sends its
Detection
successful detection timings (the bit index, not the detected
Synchronization kl cLock/ Frames IN Synchronization
bit values) back to Alice, who in turn reveals the decoys that
Alignment
Alignment
were introduced by her, such that both sides agree after sift
Detection
ing on the same raw bits that will form part of the secret
Decoys
LDPC Syndrome
Following the sifting phase, the core signal processing on
Correction T-Syndromest
LDPC Ero
Calculation
the raw key bits is performed In a first step the errors in the
detected bits are corrected with a forward error correction
Universal Hash
UHF Rad
Universal hash
Function
code. The resulting key bits are then verified and filtered
Function
accordingly using a randomized universal hash function
Frame filter
process. Note that most key distillation steps so far are leak-
tT Filter Result-+
Frame Filter
ing some amount of information to eve over the classical
Pr
Random
Privac
PA Matriⅸ
communication channel, e. g, in form of redundancy added
Amplification
Amplification
by the syndromes used for error correction. This quantity
Key Manager
Key Manager
Authenticated
Note that only a small fraction of the pulses can be detected, strongly
Shared
Shared
Channel I
depending on the optical fiber length between Alice and Bob. How
Secret Key
Secret Key
ever. these missed detections do not result in bit errors. since these bits
are omitted by Alice during the sifting process
Figure 2 Scqucncc diagram of secret key distillation process
4
J Sign Process Syst(2017)86: 1-15
Figure 3 Systcm architccture
f the secret key distillation
Clock
DDR2-
engine.
ron-
Alice: LDPC
DDR2-RAM
Jos
zation
Manager
LDPCOk Decoder
Controlller
250MHz:125
RAM
Bob: Syndrome
syndrom
Codi
DIMM.
RAM Port
write Arbiter
Classica
Service
Channel
Channels
rval& tags Hash
ead
Front-
Authentication
L Frame Filter
End
timin
gs
decoys
matix Random
Privacy
Matrix
and amplification
Channel: det. Alignment
Sifting
(op tical)
only on Bob)
Back
End
K
FPGA
Manager
Quantum State
Random bitstream
Config. Reg.
Selection
Expansion(AES)
Monitor
secret
Alice
Quantum Random Number Generator
PCHExpress
up from the limited output rate of the truly random quantum a memory access schedule with high memory interface
source [8.
utilization
The back-end or key post-processing chain employed
The key distillation system furthermore employs a pci
to distil the secret key consists of two core modules, the Express link to a PC, used for configuration, monitoring
forward error correction(FEC) and privacy amplification and internal debugging purposes. The link also provides a
(PA)
means to extract the final shared secret key and redistribute
The FEC. module is based on a quasi-cyclic low den- it to appropriate consumers, e.g., OTP encryption systems
sity parity check (LDPC)code, with syndrome encoding or high speed AES encryptors [15]
performed at the quantum bit receiver side (Bob). The
The full system architecture is targeted to be imple-
transmitter side(alice) performs decoding of its own trans- mented on a single- FPGa (Xilinx Virtex-6 LX240T)
mitted codeword bits to adapt(correct) them according to board that supports Small Form-factor Pluggable(SFP
the syndrome information received from Bob. Please refer transceivers, a PCI-Express interface, and external DDR
to Section 4. 4 for an in-depth description of the FEC and RAM
its embedded frame filter mechanism based on randomized
In the following we elaborate on the specific implemen
universal hashin
tation details of signal processing modules that are unique
Aftererror correction, the key bits are hashed in a privacy to the context of quantum key distillation
amplification module, which effectively compresses the raw
key by an adjustable compression ratio(secret fraction) to 4.1 Synchronization and Clocking
obtain the final secret key (cf. Section 4.5). The random
bits required for the pa step are generated on the receiver The QKd system requires a synchronized and latency com-
side(bob ) and transmitted over the service channel to Alice. pensated clocking scheme between Alice and Bob, to cor
Since the Pa involves large block sizes of the order of mBit rectly detect the qubits (i.e, to identify the correct time bins
to reduce the impact of finite size effects, the key and ran- at Bob. Figure 4 shows this clocking scheme. The clocks of
dom bits have to be stored in off-chip memory. During Bob and alice are synchronized through the classical chan-
processing, the raw key bits are then repeatedly read by the nel, which is using a SFP link at a frequency of 2.5 Ghz
PA module from external memory, while new raw key bits Bob recovers the data and also the clock that is sent over
for the next block are simultaneously produced by the fec this serial link from Alice with the 8B/10B protocol using
and stored in a back buffer, also located in external memory. the clock-data recovery (CDR)function provided by the
Consequently, the core architecture also comprises a module FPGa built-in gTX transceivers. However, the recovered
dedicated to the arbitration of the single external DDR mem- clock on Bob's side is not sufficiently clean in terms of ji
ory port, which is detailed in Section 4.6. This module ter, for the requirements of the photon detection circuitry
allows to keep FPGA-internal buffer sizes low by enforcing on the quantum channel receiver side. Hence, the recovered
ringer
J Sign Process Syst(2017)86: 1-15
and outputs an authentication tag of n
127 bits
PLL
ref clock
To decrease the key bits consumption from 383 bits per
125MH
cleaner
block down to almost only 128 bits per block, we exploit
OSc
the fact that the system has to authenticate several messages
recovered
25 MHZ
by using the variant proposed in [26]: instead of generat
SFP
ing a new strongly-universal hash function for each message
Classical
channel
and hence consume 3n - I bits, one can generate a single
strongly-universal hash function and keep it secret. Then
NH Emitter
→ Detector
Quantum
every authentication tag is encrypted with a one-time-pad
125250
channel
125250
With this re-use scheme, authenticating t messages requires
MHz Alice
Bob
MHZ
3n-1+t(n-1) bits instead of t(3n-1) bits when no
key is reused. It was recently shown in [17] that, in Canetti's
Figure 4 System clocking scheme with clock recovery
universal composability (UC) framework [4], this authen
tication protocol with key re-use is E-UC-secure, which
clock is fed outside the FPGa to an external PLl with high means that the overall authentication error is upper-bounded
quality jitter cleaner. The cleaned clock is then used inside by t.E
the fpga to drive the alignment block for quantum chan-
To simplify the service channel implementation at high
nel detections, which correctly phase-aligns the triggering dala rates, we avoid an acknowledge- or retransmission
of the photon detector with the arrival of the incoming pho- based communication protocol. However, since the trans
tons on Bob. Furthermore, from this cleaned clock all other mission bit error rate of 10-12 for the serial link at 2.5 Gbps
system clocks are derived. The clock manager outputs a is too high for our purposes(one bit error approximately
nemory clock (250 MHz) for the DDr2 controller interface every 7 min), we perform error correction on the classi
and a general system clock(125 MHz) for the signal pro- cal channel data with a Hamming( 15, 11)code to bring the
cessing modules. Additionally, on Alice a separate LDPC error rate down to 10-18. The encoding and decoding of the
clock(62. 5 MHz)is generated, since the LDPC decoder can Hamming code is implemented using lookup tables
be implemented more efficiently at a reduced clock speed
4.3 Sifting
4.2 Service Channel
The sifting module on Bobs side discloses the timings and
The service channel provides a FIFO-based multi-channel types(base) of the photon detections to Alice over the ser-
abstraction interface for the classical channel serial link vice channel. The encodings for this sifting information can
hich connects Alice and Bob. All signal processing mod- be chosen with a precision of 6 or 14 bit for the differen-
ules in the key distillation engine use their own private tial timestamp data which is always relative to the previous
channel(accessible through a FIFO interface)to exchange detection. This flexibility in the number of bits allows the
data between the two sides. The multiple data streams system to be better tuned to the detection rate which drops
are interleaved by the service channel module, which also with increasing fiber-length. With the long encoding up to
authenticates the communication between both parties as 250 km are feasible, while the service channel bandwidth
required by the QKd protocol to exclude any tampering usage is better optimized for short distances (up to 20 km)
by a third party with the key distillation process data. with shorter time stamps. Alice prunes its bit stream accord
Information-theoretically secure authentication is provided ing to the timing information from Bob, and additionally
by families of strongly-universal hash functions constructed discloses to Bob the decoy positions and monitor detection
based on polynomial hashing as proposed in [5, 26]. There, outcomes, which Bob uses to also sift its own stream of raw
a m/2 -almost universal family of hash functions is com- key bits and to estimate the vis
posed with a 1/ 2/--almost strongly universal family of
Note that the internal buffer of the sifting module on
hash functions to give a(m +2)/2-almost strongly uni- Alices side needs to be of considerable size. The longer the
versal family of hash functions, where n.(m+ 1)is the distance between Alice and bob, the larger is the required
ength in bits of the input message. The advantages of this sifting buffer, since Alice needs to hold all transmitted bits
method are that only a total of 3n- l secret key bits are con- in this buffer until the sifting information from bob arrives
sumed per authentication to select the hash functions, which For a fiber length of 50 kin the round-trip time is about
is constant in terms of the initial size of the message. A 500 us, which results in a typical buffer size of 625 kbit
total of m t 2 multiplications over G F(2n) are performed, to store both the bit value and the base at a link rate of
as well as the same number of additions (XOrs) over n-bit 625 Mbps. At 250 km of fiber length the required buffer
values. The authenticator processes blocks of 995328 bits increases to almost 3 mbit to accommodate for the round
pinger
J Sign Process Syst(2017)86: 1-15
trip time increase to 2380 us. Hence, at larger link dis- being filtered out by the following check mechanism of the
tances an external ram buffer is potentially needed to store FEC, due to an unrecoverable error
the transmitted bits of alice efficiently before they can be
The efficieny of our system is defined as the achieved
sifted
percentage of the ideal Shannon capacity of the channel
based on the Qber. the capacity of the binary symmetric
4.4 Forward Error correction
channel is defined as
4.4.Ⅰ LDPC Codes
C(∈)=1-h()
with h(e) being the entropy of a bit with error probability e
LDPC codes have been chosen for the error correction which corresponds to our QBER. The achievable error free
because of their excellent Shannon-capacity approaching throughput of information is calculated as
error correction capabilities and the existence of low
G=(1-FER)·R
complexity decoding solutions [21]. a valid code word x
of a classical LDPC code is characterized by satisfying the FER is the frame error rate after decoding and also a func-
condition that
tion of QBer. We can see from Fig. 6 that by switchin
between rates we can achieve a high efficiency for differ-
H. x
ent QbeR regimes. especially at high rate codes there is a
where H is the sparsely populated parity check matrix, significant gain compared to the results published in[27]
which defines the specific instance of the code, and s
The LDPC decoder operates at a clock frequency of
is a known vector called syndrome. The aspect ratio of 62.5 MHz(halved system clock), with 10 iterations per
h defines the coding rate(R). LDPC codewords can be decoded code word, providing a maximum throughput of
decoded by believe propagation(BP)algorithms
235 Mbps. Hence, the processing speed of the lDPC
For our implementation we have chosen a quasi-cyclic decoder is not the limiting factor for the overall maximum
prototype matrix H
as it used in various modern com- secret key rate of the key distillation engine
munication standards. The set of parameters and the h
are derived from the LDPC codes specified for the IEEe 4.4.2 Frame Filtering
802.1In standard ll
The quasi-cyclic nature of H allows the use of a partial- After error correction of the sifted key bits, it is possible
parallel decoder architecture based on a layered decoding that a codeword on Alice side still has a mismatch compared
schedule [19]. Such an architecture offers a good trade-off to the bits on Bob(feC failure). Hence, the QKd system
between efficient implementation and high reconfigurabil- requires an integrity check on the decoded key bits such that
ity. Hence, we can support a wide set of different code rates no mismatch in the final agreed secret key stream is possible
without additional complexity [21]. Adapting the coding (with an upper-bounded probability f<232). Verification
rate to the channel conditions allows to utilize the quantum of the parity check equations in the ldpc decoder pro
channel with high efficiency over many different scenarios. vides a first error detection mechanism. Unfortunately, this
Results are provided for 4 different parity check matrices mechanism leaves many block errors undetected, since the
corresponding to r=)
a fixed codeword size of
1944 is sufficient as our application assumes a constant data
tre.
In conventional wireless applications the syndrome of th
LDPC code is usually fixed to s =0. The parity informa-
tion is encoded in the codeword x. For the quantum channel
pplication we adopt the concept of syndrome coding as
described in [13]. In our system the syndrome s is not a fixed 010"
value but dynamically calculated by bob for each codeword
based on his sifted data
The forward error correction provided by the LDPC
10
This Work-R=1/2
decoder in the QKd system achieves a residual frame error
This Work-R=2/3
rate(FER) of less then 0.5 at a quantum bit error rate
This Work-R-3/4
This work -R=5/6
(QBER)of 6 on the quantum channel, for a code rate of
0.02
C.04
0.08
0.1
R= 1/2, as can be observed in Fig. 5. One frame corre
QBER
sponds to a codeword of size 1944 bits,i. e, for a QBER of figure 5 error correction performance of forward error correction
6% each frame has an independent probability of
0.5 of (LDPC decoder)
ringer
J Sign Process Syst(2017)86: 1-15
48-bit uniformly distributed random value to avoid any tar
This Work -R=1/2
This Work -R=2/3
geted collision attacks, and produces a 48-bit hash tag as
This Work -R=3/4
This Work -R=5/6
output. a new random value is used for each invocation of
Cui et a.间6
the uhF, i.e., for each calculated hash tag. The maximum
collision probability for a single tag is 242. 6. while for a
sequence of 512 tags(one PA block) it increases to 2-33.5
The block diagram of the Uhf unit is shown in Fig. 7
旦40
The accumulator of the uhf module is loaded during ini-
tilization with the first 48 bit of the key bits(code word to
be hashed In the following 40 rounds the hash tag is sequen
tially computed by performing a GF( 1)Plication of
the accumulator with the defined random value for the cur
rent tag. After the multiplication the result is XOr'ed with
the next 48 bit of key bits, before being stored back in the
QBER
accumulator. Note that the last part of the code word(k40)
is only 24 bit long and needs to be padded with zeros to fit
Figure 6 Channel usage efficiency for different code rates
the 48-bit format
The complexity of the UHF Imodule lies in the required
G F(2o)finite-field multiplier. Our design space explo
decoder tries to produce a valid codeword and converges, ration showed satisfactory results for a fully combinatorial
despite being in error. Hence, we impleinent an additional approach [12]. The multiplier generates all possible mul-
ntegrity check, i.e., error detection after correction, by per- tiples (GF powers) of the first operand, using chained
forming a hash tag calculation and comparison for each exponentiation blocks. Note that these blocks only con-
LDPC code word of size 1944 bits after zero-padding it to tain a fixed barrel shifter (rotate by one bit), and three
1968 bits. The padding is necessary, since the hashed word parallel Xor gates at fixed bit positions(defined by the
needs to have a length which is a multiple of 48 bit
G F(2+0)polynomial), which reduces the average logic
After calculation of the syndrome for the sifted key bits depth through the exponentiation chain down to only
on Bob's side, a 48-bit hash tag is calculated for each code three XOR gates. All the multiples are then each com
word and sent to Alice, together with a random value of the bined with the second input operand and the result vec
same size, which is needed to randomize the hashing pro- tors are finally individually summed to produce the 48-bit
cess(required for security reasons). Alice then calculates its result
own hash tag for its decoded key bits, using the same ran-
dom value, and then performs a comparison of the tags. If a
tag mismatch occurs, the corresponding frame(codeword)
ke
data
48
is discarded by Alice, before the key bits are written to the
external memory for further processing. All hash tag com
random value(〔
parison results are furthermore forwarded to Bob, which
also performs the corresponding filtering of frames on his
finite-field
key bits stream. Note that the internal Fifo buffer required
GF(2
at bob to perform the frame filtering can grow to consider
(fully combinatorial)
able size, if the round-trip delay time on the service channel
an1 2
48
for the hash check is large
As mentioned above, the process of hash tag calcula-
x← rand G(24)
tion requires randomization, since it is necessary to pre
for i= 1
…,40
vent an attacker from forging an attack vector on the
←a8x+k
service-or quantum-channel that could force a collision of
the employed hash function, resulting in a key agreement
between Alice and Bob on differing key bits
accumulator(a
To prevent this, we employ an error detection method
48
named polynomial hashing, which is built on top of a uni
hash
alid after 41 rounds:
1944 bit codeword
versal hash function (UHF)[5, 26]. The UhF is based on
tag +24 bit 0-padding
finite-field arithmetic over the galois Field G F(2+o)with
Figure 7 Block diagram of the universal hash function module using
the polynomial x8+x4+x2+x20+1. The method uses a randomized polynomial hashing over G F(248
pinger
J Sign Process Syst(2017)86: 1-15
4.5 Privacy Amplification
data is kept in off-chip memory(e.g, DDR2-RAM), and
is processed in slices. The complete matrix multiplication
he transmission of key bits from Alice to Bob is accom- is performed in multiple steps, by slicing the result vec-
panied by leakage of information due to the necessary tor into m/ k slices of length k bits each. Every result slice
communication on the service channel for sifting, error (y(s-1)k+1, y(s-1)k+2,.,ysk) for s
m/k is calcu
correction and detection, or due to eavesdropping attacks. lated completely, before continuing with the next part of the
Hence, to guarantee a certain level of security, it is neces- result vector.
sary to remove this quantity of leaked information through
The architecture of the Pa performing the matrix mul-
a privacy amplification step During privacy amplification tiplication is shown in Fig 8. The calculation of one slice
a partially secure string is to be transformed into a highly is performed in multiple cycles, processing w raw key bits
secret key by public discussion. It has been shown that this per cycle. The hashing architecture consists of k parallel
can be realized by computing the output of a randomly but multiply-accumulator (MAC) units, which each perform per
publicly chosen two-universal hash function applied to the cycle in parallel w single-bit multiplications over GF(2),
input string, which results in a secret key that is secure combining random matrix bits with raw key bits. The
according to a universally composable security definition results are then summed (Xor parity bit) together with the
[18]. To this end, we employ Toeplitz hashing [1 1] to reduce current value of the accumulator register to produce the new
the error corrected bit stream by an adjustable compression sum bit. The input raw key bits for the MAC units, i.e., the
ratio, ensuring the security of the remaining secret key bits. currently needed part of x of size w, are the same for each
Toeplitz hashing is performed by a single matrix-vector MAC unit (in a cycle), while the corresponding row of ran-
multiplication over GF(2), according to
dom bits is taken from a large shift register of size k t
The first MAC unit takes the w random bits starting from
the second flip-flop, while the second mac unit takes the
7n+171
w random bits starting from the third flip-flop, and so on, as
illustrated in Fig. 8. Note that the first flip-flop is only used
n+m-1n+m-2···Fn-m-1
for loading purposes (of full random words of size w),but
not directly for calculation
where the m x n matrix consists of random bits ri with
As the example in the bottom right corner of Fig. 8
i=I.n+ m-l, and the corresponding vector is con- shows, every cycle the shift register is advanced by a step
structed from the n raw key bits xj withj=1.n that size of w, shifting in a new word of random bits at the top
need to be privacy amplified, i.e., hashed, resulting in m (e.g, random bits r5, r6, r? in cycle 2). By only storing each
secret key bits y withl= 1..m. The employed matrix is a cycle the boundary of the currently required matrix area of
diagonal-constant (Toeplitz) matrix, which means that it has dimension k x w, we achieve a very area-efficient imple
a regular diagonal structure. This specific structure reduces mentation of the Toeplitz hashing algorithm. We exploit the
the amount of random bits required for the compression of regular diagonal structure of the matrix, by a shifting mech
a vector of length n with a secret fraction of m/n(using an anism that allows the boundary required for cycle i to be
m x n matrix) from mn to m+ n-l, compared to a fully easily transformed into the boundary that is required for
random matrix. This allows to reduce the number of random cycle i+ 1
bits that have to be generated, processed and sent over the
Before the calculation of a slice can begin. initial random
service channel to a reasonable amount
matrix bits(the k bits that correspond to the slice of bits of
Note that the random matrix is not kept secret, however it the first column of the full random matrix )are preloaded
is not to be revealed over the public channel before all qubits into the lower part of the shift register, loading w bits per
associated with the hashing matrix have been transmitted cycle. At the same time the result bits y of the previous
and the raw key bits have been error corrected and filtered slice calculation are shifted out of the Mac units, and are
on both sides
forwarded to the key manager as secret key bits
For correct handling of unwanted finite size effects, the
During the calculation of one slice, the architecture con-
block size n for the PA vector should be as large as pos- sumes w random bits as well as w key bits per cycle. the
sible. We choose a block size of nN 10(specifically calculation for one slice is processed in n/w cycles, while
512x 1944 for our reference implementation), which the loading phase takes k/w cycles. Since one full PA block
for a typical secret fraction of 10 leads to a multiplication consists of m/k slices, generation of m secret key bits uses
with a Toeplitz matrix of dimensions 105 x 10
((k/w)+(n/w))(m/k) cycles(including all load overhead
Since the total amount of bits required to construct the between slices)
random matrix and the raw key bit vector itself are ty picall
The parameter w is chosen based on the width and
too large to fit into FPGA on-chip memory(BRAMs, the throughput of the external memory interface, i. e how much
ringer
J Sign Process Syst(2017)86: 1-15
Figure8 Toeplitz hashing
random bits key bits
architecture employed by
key bits
privacy amplification
w,KMAC
sum I
MAC
bit
I random bits
L,K MAC
hashed key bits
MAC
le 1 cycle 2 cycle 3
Z+MAC
T1o r1 r2r3 ra rs rer,rs
ri1 r10 r1 r3 rars r6 r,
MAC
r12 [11 r10 r1 r2 r3 r4 rs r6
rio r1 r2rar4 Is
(k+w)-bit
r14r13「1 r11 rio r1rar3
shift register
k multiply
(w steps/cycle)
accumulate units
example: k=6 W=3 n=9
data it can provide per clock cycle. The main parameter needed. Hence to optimize the resource utilization it is
for adjusting the circuit area(FPGA resource)utilization as beneficial to choose the largest u such that the maximum
well as the degree of parallelism, is the number of parallel key rate(bounded by the memory throughput limit) is not
MAC units. k
exceeded In our specific system a word width w of 128
The PA architecture supports flexible compression ratios, would not be beneficial(for any k), since the supported key
providing support for the optimal secret fraction corre- rate would al ways exceed the maximum that the external
sponding to the physical link properties of the underlying RAM allows
QKD system installation and the desired finite block length Regarding the available resources in the V6LX240T
security parameter. The secret fraction can be set in the full FPGA (150720 LUTs), a configuration of k= 512 and
range of0 to 100 in increments of k/n.
w=64 provides a good compromise between sup
The design space exploration of the privacy amplifica- ported maximum key rate (4.1 Mbps) and the required
tion unit is presented in Table l. The values relate to a device utilization (11 %). With a pa block size of
QKD system using a Virtex-6 LX240T FPGA, where the n 10 this configuration enables a very fine step
PA is clocked at 125 MHZ. The external memory used by size of 0.05 %o for the secret fraction PA compression
the Pa is a DDR2 SDRAM providing a peak throughput parameter.
of 25.6 Gbps. Processing of one full Pa block is associ-
ated with a certain number of bits that need to be transferred 4.6 External Memory Arbitration
(written and read) from the external memory, dependin
on the number of available maC units(and the com- The QKD system architecture as shown in Fig 3 comprises
pression ratio). The number of read bits mainly depends an external memory for storage of data blocks that are too
on the number of slices m/k that need to be processed large for on-chip storage in FPGA BRAMS. Since the raw
for one pa block. Since the secret key output size(per key block size (n)used as input for the privacy amplifica
Pa block /matrix multiplication) scales directly with the tion step is in the order of MBit, and the pa algorithm needs
compression ratio, the maximum key rate is practically access to the full block multiple times for a single com
independent of the compression ratio and only depends on pression (i.e, simple streaming is not possible), the external
the number of parallel Mac units k. This maximum key RAM has to be used for storage of these input blocks and the
rate is always bounded by the available external memory random matrix bits that are additionally consumed by the
throughput
PA. It is hence necessary to arbitrate four different produc
The actual achievable key rate for a given number of ers and consumers of data that need access to the external
MAC units k then depends on how much hardware in form RAM
of LUTs is expended by choosing an appropriate word width
w for the mac units. In general, doubling of that word
writing of (corrected and filtered)raw key bits by the
width provides a two-fold increase in output key rate, while
LDCP decoder(Alice)/the syndrome encoder(Bob)
also requiring twice the amount of FPGA LUts. The num
writing of random bits by the Toeplitz matrix generator
ber of required flip-flops is about 2k. since each mac unit
reading of raw key bits by the privacy amplification
contains one flip-flop, and a shift-register of size k is also
reading of random bits by the privacy amplification
pinger
10
J Sign Process Syst(2017)86: 1-15
Table 1 Design space exploration of resource utilization for the privacy amplification unit on a vir
parallel MAC units(k)and word widths(w)
MACS
Transt
Max Kr
=16
32
u=64
Mbit ja
Mbps中
flops
KR℃
LUTS
KR℃
LUTS
LUTS
128
1562
1.64
7
0.26
975
0.51
1950
1.03
3901
256
783
3.28
528
0.51
1950
1.03
3901
2.06
7802
6.55
1040
1.03
3901
2.06
15604
1024
197
13.02
2064
2.06
7802
4.11
15604
31208
2048
100
5.76
4112
4.11
15604
8.21
31208
1643
62415
a per block, for a secret fraction of 196/1944 N 10 % i.e. a compression from x 106 to 10
b Theoretical maximum keyrate, when limited by external memory throughput(25.6 Gbps), i. e external memory is exclusively used for PA
Achievable keyrate in Mbps, when Pa is running at 125 MHz and input data buffers are never starved
The external memory arbitration unit provides the nec- an alternating fashion, always keeping the two correspond-
essary memory access coordination such that all four pro- ing PA input buffers /FIFOs filled. After one slice has
ducers and consumers can operate at the same time, without been processed the arbiter writes new key bits followed by
any of them suffering internal buffer overflow or starvation, new random bits into the corresponding back buffers, before
respectively. The goal is furthermore to maximize utiliza- starting to process the next PA slice
tion of the single external memory DDR2 SDRAM port,
The required total amount of bits that need to be trans-
which provides 400 MT/s(mega transfers per second). With ferred(written to and read from the external raM) per PA
64-hit transfers(read or write accesses) this results in a max- block of size n is 2n(k+ 1)+ 2m- 1 bits(where m/n is
imum theoretical external memory throughput of 25.6 Gbps. the secret fraction, and k is the slice size, i.e., the number of
This limited throughput has to be carefully partitioned in parallel MAC units)
such a way that all modules operate at their highest pos-
The round-robin schedule employs varying access win
sible throughput, without being throttled due to inefficient dow sizes, which are freely configurable for different com
memory access patterns, to allow for an overall high Pa pression ratios. The sizes are defined by the required access
output rate, i.e., high final secret key rate
bandwidth ratios, i.e, the amount of bits each port needs to
The arbitration operates on a round-robin schedule based
on the state diagram depicted in Fig 9. In general, the sys
Write
Five configuration registers
tem operates on three buffers, which are used in a ring
Read com mands
K
configuration, each buffer holding a full set of raw key and
Buffer o
Read chunks
random bits for one pa block. During normal operation in
Write Key commands
the pa block loop, the oldest buffer is being processed by
one
Write Rand Commands
the PA, while the two younger buffers are being written with
Wr Kcy
u Slice Count
Write
Cmd4↓ Write
new key and random bits
K
Rand
specifies PA
At start-up the system first stores one full block of raw
Buffer 1 MWr Rand Buffer O
cmas
compression rato
key bits into external buffer 0. and then proceeds by fill
ing a second back-buffer I with raw key bits, while at the
Full block
(secret fraction)
i←0Done
same time storing the random bits associated with the first
block of key bits in its corresponding buffer 0. This delay
Read
PA slice loop
Read
of the random bits by one full back-buffer is necessary for
Preload
Read
if (Slice Count)
Ke
cmds
security reasons, since the random bits generated on Bob's
Buffer slice
side must not be revealed over the service channel, before
Buffer
Chunks
Wr Rand
the entire raw key block has already been processed by the
cmds
Read
I only with key bits), the RAM arbiter enters its main PA t Rand key Write/: Ca ol
system. After this initial phase of filling two buffers(buffer
Write
Rand
Key
Buffer
block processing loop, which consists of five states. In the
Buf. i+1
Buf. i+2
first state the Pa internal shift register of size k is preloaded
Pa block loopy
with random bits from memory. This is followed by Pa pro
Figure 9 State diagram of configurable external memory arbitration
cessing of one slice, where key and random bits are read in unit for privacy amplification using back-buffcrs
ringer
(系统自动生成,下载前可以参看下载内容)
下载文件列表
相关说明
- 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
- 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度。
- 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
- 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
- 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
- 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.