文件名称:
ZooKeeper’s atomic broadcast protocol: Theory and practice
开发工具:
文件大小: 382kb
下载次数: 0
上传时间: 2019-10-14
详细说明:zookeeper 系统中的主要分布式协调算法,有助于深入理解zkThe original Paxos protocol does not enable multiple outstanding transactions
Paxos does not require FIFO channels for communication, so it tolerates message loss
and reordering. If two outstanding transactions have an order dependency, then Pa
cannot have multiple outstanding transactions because FIFO order is not guaranteed
This problem could be solved by batching multiple transactions into a single proposal
and allowing at most one proposal at a time, but this has performance draw backs
The manipulation of the sequence of transactions to use during recovery from pri
mary crashes is claimed to not be efficient enough in Paxos 12. Zab improves this
aspect by employing a transaction identification scheme to totally order the transac
tions. Under the scheme, in order to update the application state of a new primary
process, it is sufficient to inspect the highest transaction identifier from each process
and to copy transactions only from the process that accepted the transaction with the
highest identifier. In Paxos, the same idea cannot be applied on sequence numbers
so a new primary has to execute Phase l of Paxos for all previous sequence numbers
for which the primary has not "learned a value(in Zab terminology, " committed a
transaction”)
Additional performance requirements [19 for ZooKeeper are: (i) low latency. (ii)
good throughput under bursty conditions, handling situations when write workloads
Increase ra
pidly during e g, massive system reconfiguration, and (iii)smooth failure
handlin, so that the service can stay up when some non-leader server crashes
2.2 Crash-recovery system model
ZooKeeper assumes the crash-recovery model as system model 12. The system
a set, of processes ii=p1, p2,...Px, also referred to as peers in this report, that
communicate by message passing, are each equipped with a stable storage device, and
may crash and recover indefinitely many times. a quorum of II is a subset Q c I such
that Q>N 2. Any two quorums have a non-empty intersection. Processes have two
states: up and down. a process is down from a crash time point to the beginning of
its recovery, and up from the beginning of a recovery until the next crash happens
There is a. bidirectional channel for every pair of processes in IL, which is expected
to satisfy the following properties: (i) integrity, asserting that process p, receives a
message m, from pi only if pi has sent m; and (ii) prefix, stating that if process p
receives a message m and there is a message m' that precedes m in the sequence of
messages pi sent to pi, then pi receives m before m. To achieve these properties
ZooKeeper uses TCP- therefore FIFO- channels for communication
2.3 Expected properties
To guarantee that processes are consistent, there are a couple of safety properties to be
satisfied by Zab. Additionally, for allowing multiple outstanding operations. we require
primary order properties. To state these properties we first need some definitions
In ZooKeepers crash-recovery model, if the primary process crashes, a new primary
process needs to be elected. Since broadcast messages are totally ordered, we require
at most one primary active at a time. So over time we get an unbounded sequence
of primary processes P1p2... Pe.. where pe E Il. and e is an integer called epoch
representing a period of time when Pe was the single primary in the ensemble. Process
Pe precedes pe/, denoted pe pp/, if e
then either pi delivers(v, 2l>or p, delivers (v, 2)
Prinary order properties 12 are given below
Local primary order: If a primary broadcasts(e, z) before it broadcasts(, 2>
then a process that delivers(v, 2) must have delivered (v, 2) before(v, 2>
Global primary order: Suppose a primary Pi broadcasts (v, 2), and a primary
Pi>pa broadcasts(,, a). If a, process delivers both(u, 2) and,, 2), then
must deliver (v, 2) before(, 2>
Primary integrity: If a primary Pe broadcasts (v, 2) and some process delivers
fU, a'> which was broadcast by Pel pe, then Pe must have delivered(v, a', before
broadcasting〈U,).
Local primary order corresponds to FIFO order. Primary integrity guarantees that
a primary has delivered transactions from previous epochs
3 Atomic broadcast protocol
In Zab, there are three possible (non-persistent )states a peer can assume: following,
leading or election. Whether a peer is a, follower or a leader, it executes three zab
hases:(1)discovery,(2) synchronization, and(3) broadcast, in this order. Previous
to Phase 1, a peer is in state election, when it executes a leader election algorithm to
look for a peer to vote for becoming the leader. At the beginning of Phase l, the peer
inspects its vote and decides whether it should become a follower or a leader. For this
reason. leader election is sometimes called phase 0
The leader peer coordinates the phases together with the followers, and there should
be at most one leader peer in Phase 3 at a time, which is also the primary process to
broadcast messages. In other words. the primary is always the leader. Phases l and 2
are important for bringing the ensemble to a mutually consistent state, specially when
recovering from crashes. They constitute the recovery part of the protocol and are criti
cal to guarantee order of transactions while allowing multiple outstanding transactions
If crashes do not occur, peers stay indefinitely in Phase 3 participating in broadcasts
similar to the two phase commit protocol 9. During Phases 1, 2, and 3, peers can
decide to go back to leader election if any failure or timeout occurs
ZooKeeper clients are applications that use ZooKeeper services by connecting to
at least one server in the ensemble. The client submits operations to the connected
server, and if this operation implies some state change, then the Zab layer will perform
broadcast. If the operation was submitted to a follower, it is forwarded to the leader
peer. If a leader receives the operation request, then it executes and propagates the
state change to its followers. Read requests from the client are directly served by any
ZooKeeper server. The client can choose to guarantee that the replica is up-to-date by
issuing a sync request to the connected ZooKeeper server
In Zab, transaction identifiers(zxid are crucial for implementing total order prop
erties. The zxid x of a transaction u, a) is a pair (e. c), where e is the epoch number
of the primary pe that generated the transaction( v, i), and c is an integer acting as
counter. The notation 2.epoch means e, and 2. counter =c. The counter c is incre
mented every time a new transaction is introduced by the primary When a new epoch
starts-a new leader becomes active- c is set to zero and e is incremented from what
was known to be the previous epoch. Since both e and c are increasing, transactions
can be ordered by their zxid. For two zxids(e, c) and(/, c'>, we write(e, c)2el, d)
if e< e or ife= el and c< c
There are four variables that constitute th
used during the recovery part of the protoco e persistent state of a peer, which are
history: a log of transaction proposals accepted
acceptedEpoch: the epoch number of the last NEWEPOCH message accepted
currentEpoch: the epoch number of the last NEWLEADER message accepted
lastZxid: zxid of the last proposal in the history
We assume some mechanism to determine whether a transaction proposal in the
history has been committed in the peer's ZooKeeper database. The
e variable names
above follow the terminology of 18, while in 12 they are different: history of a peer
f is hf, acceptedEpoch is f-p, currentEpoch is fa, and lastZxid is f2rid
3.1 Phases of the protocol
The four phases of the Zab protocol are described next
Phase 0: Leader election Peers are initialized in this phase, having state election
No specific leader election protocol needs to be employed, as long as the protocol
terminates, with high probability, choosing a peer that is up and that a quorum of
peers voted for. After termination of the leader election algorithm, a peer stores its
vote to local volatile memory. If peer p voted for peer p, then p' is called the prospective
leader for p. Only at the beginning of phase 3 does a prospective leader become an
established leader, when it will also be the primary process. If the peer has voted for
itself, it shifts to state leading, otherwise it changes to state following
Phase 1: Discovery In this phase, followers communicate with their prospective
leader, so that the leader gathers information about the most recent transactions that
its followers accepted. The purpose of this phase is to discover the most updated
sequence of accepted transactions among a quorum, and to establish a new epoch so
that previous leaders cannot commit new proposals. The complete description of this
phase is described in Algorithm1
1 Follower f.
2 Send the message FOLLOWERINFO(F. acceptedEpoch)to L
3 upon receiving NEWEPOCH(e from L do
if e>> F.acceptedEpoch then
F. acceptedEpoch e for all e received through FOLLOWERINFO(e)
Propose NEWEPOCH(c to all followers in Q
17 upon receiving ACKEPOCH from all followers in do
Find the follower f in Q such that for all f′∈Q\{八}:
either f. currentEpoch Fhistory such that 2<2 z do
hing(wait)
end
Commit(deliver)transaction e, 2)
28 end
Algorithm 3: Zab phase 3 Broadcast
algorithms L 2 and B] are apparently asynchronous and do not take into account
possible peer crashes. To detect failures, Zab employs periodic heartbeat messages
between followers and their leaders. If a leader does not receive heartbeats from a
quorun of followers within a given timeout, it abandons its leadership and shifts to
state election and Phase 0. a follower also goes to Leader Election Phase if it does not
receive heart beats from its leader within a timeout
3.2 Analytical results
We briefly mention some formal properties that Zab satisfies, and their correspond-
ing proofs were given in Junqueira et al. 1112. The invariants are simple to show
oy inspecting the three algorithms, while claims are carefully demonstrated using the
invariants
Invariant 1 22 In Broadcast Phase, a follower F accepts a proposal(e, ( u, z))only
if F. currentEpoch =e
Invariant 212 During the Broadcast Phase of epoch e, if a follower F has
F. currentEpoch =e, then F accepts proposals and delivers transactions according to
arid order
Invariant 3 12 During Phase 1, a follower F will not accept proposals from the leader
of any epoch e< F. acceptedEpoch
Invariant 4[12 In Phase 1, an ACKEPOCH(F. currentEpoch, Fhistory, F.lastzxid)
message does not alter, reorder, or lose transactions in Fhistory. In Phase 2, a
NEWLEADER(e!, Lhistory )message does not alter, reorder, or lose transactions in
Lhistory
Invariant 5 12 The sequence of transactions a follower f delivered while in phase
3 of epoch F.currentEpoch is contained in the sequence of transactions broadcast by
prinary PEe, where F e denotes the last epoch e such that f learned that e has been
committed
Claim 1 11y For every epoch number e, there is at most one process that calls ready(e
in broadcast phase
Claim 2 12 Zab satisfies the properties from Section 2. 3: broadcast integrity, agree
ment, total order, local primary order, global primary order, and primary integrity
Claim 3 12 Liveness property: Suppose thal a quorun Q of followers is up, the
Followers in q have l as their prospective leader, L is up, and messages between a
Follower in Q and are received in a linely fashion. IfL proposes a transaction
(e, U, a)), then(e, U, 2 is eventually committed
4 Implementation
Apache ZooKeeper is written in Java, and the version we have used for studying the
implementation was 3.3.3 8. Version 3. 3.4 is the latest stable version(to this date)
but this has very little differences in the Zab layer. Recent unstable versions have
significant changes, though
Most of the source code is dedicated to ZooKeepers storage functions and client
communication. Classes responsible for Zab are deep inside the implementation. As
mentioned in Section 2. 2 TCP connections are used to implement the bidirectional
channels between peers in the ensemble. The FIfO order that TCP communication
satisfies is crucial for the correctness of the broadcast protocol
The Java implementation of Zab roughly follows algorithms四囡and图 Several
optimizations were added to the source code, which make the actual implementation
look significantly different from what we have seen in the previous section. In par
ticular, the default leader election algorithm for Phase 0 is tightly coupled with the
implementation of Phase 1
Fast Leader Election(FLE) is the name of the default leader election algorithm in
the implementation. This algorithm employs an optimization: It attempts to elect as
leader the peer that has the most up-to-date history from a quorum of processes. When
such a leader is elected. in Phase 1 it will not need to communicate with followers to
discover the latest history. Even though other leader election algorithms are supported
by the implementation. in reality Phase 1 was modified to require that Phase 0 elects
a leader with the most up-to-date history
In practice, since FLE covers the discovery responsibility of Phase l, this phase has
been neglected in version 3.3.3(and also 3.3.4) of ZooKeeper. There is no clear distinc
tion between Phases I and 2 in the implementation, so we refer to the combination of
both as recovery phase. This phase comes after Phase 0, and assumes that the leader
has the latest history in a quorum. Algorithm 4 is an approximate pseudocode of the
Recovery Phase, and Figure I] compares the implemented phases to Zab's phases
Lab protocol
Phasc 0
Phasc 2
(Leader Flection)
(Discovery
(Synchronization)
(乃 broadcast)
Implemented protocol
Fast Leader election
Recovery phase
Broadcast phase
Figure 1: Comparison betweens the phases of Zab protocol and the implemented pro
tocol
(系统自动生成,下载前可以参看下载内容)
下载文件列表
相关说明
- 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
- 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度。
- 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
- 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
- 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
- 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.