文件名称:
Dynamic Density Based Clustering.pdf
开发工具:
文件大小: 480kb
下载次数: 0
上传时间: 2019-08-18
详细说明:Dynamic Density Based Clustering.pdf Dynamic Density Based Clustering.pdf Dynamic Density Based Clustering.pdf016Q17
Oo
O
O1
1p)∈
(a)dataset
(b)Core graph
(c)One possible p-approximate core graph
Figure 2: Illustration of dbSCan and p-approximate dBSCaN(p=0.5, MinPts =3)
Min pts, which can be regarded as a constant next we review how
haps more intuitive, and allows a simple extension to
e clusters are formed using graph terminology.
P-approximate DBSCan, as we will see later
Given a point p E P, we use B(p, r) to represent the ball that is
centered at p, and has radius r. The point is said to be a core point
Hardness of dbscan and USEC. It is easy to see that the db
if B(p, e) covers at least Min Pts points of P(including p itself
ScAn clusters on a set P of n points can be computed in O(n")
otherwise, it is a non-core point. To illustrate, consider the dataset
time, noticing that the core graph g has O(n)edges. However,
f 18 points in Figure 2a, where e is the radius of the inner solid
lever algorithms should produce the clusters without generating all
circle, and Min Pts=3. The core points have been colored black
the edges, and thus, avoid the quadratic trap Indeed, when d-2,
while the non-core points colored white. The dashed circle can be
the clusters can be computed in only O(n log n)time [11]
ignored for the time being
It would be highly desired to find an algorithm of o(m )time for
dBSCAN clusters are defined in Lwo sleps. The first one focuses
d 3, but recently Gian and Tao [10] have essentially dispelled
exclusively on the core points, and groups them into preliminary
the possibility. They proved an O(n)-time reduction from the unit-
clusters. The second step determines how the non-core points should
spherical emptines s checking (USEC)problem to dbsCan. in
be assigned to the clusters. Next, we explain the two steps in detail
other words, any T(n)-time DBSCAN algorithm implies that USEC
problem can be solved with O(T(m))time
Step 1 Clustering Core Points. It will be convenient to imagine an
In USEC, we are given a set Sred of red points and a set Blue
undirected core graph g on Pthis graph is conceptual and need
of blue points in I. All the points have distinct coordinates on
not be materialized. Specifically, each vertex of G corresponds
every dimension. The objective is to determine whether there exist
to a distinct core point in P. There is an edge between two core red point pred and a blue point pblue such that dist(pred, pblue)<1
points(a k.a. vertices)p1, p2 if and only if dist(p1, p2)5 in a
ure 2b shows the core graph for the dataset of Figure 2a
broad class of algorithms [6, 7]. For d=3 and 4, beating the
Each connected component(CC)of G constitutes a preliminary
has been a grand open problem in theoretical computer science, and
cluster In Figure 2b, there are 3 CCs(a k.a. preliminary clusters
is widely believed [o] to be impossible. By the reduction of [10],
Note that every core point belongs to exactly one preliminary cluster
no dbsCan algorithm can have running time o(n/o)in d25
in d=3 and 4, this is also true unless unlikely ground-breaking
Step 2 Non-Core Assignment. This step augments the preliminary
improvements could be mlade on 3D USEC
clusters with non-core points. For each non-core point p, DBSCAN
looks at every core point pcore E B(p, e), and assigns p to the(only)
Approximation and the Sandwich Guarantee. Gan and Tao [10]
preliminary cluster containing Pcore. Note that, in this manner, p
developed p-app roximate DBSCAN, which returns almost the sa
may be assigned to zero, one, or more than one preliminary cluster
clusters as exact dbscan by offering a strong sandwich guarantee
After all the non-core points have been assigned, the preliminary
that will be introduced shortly. In contrast to the high time complex
clusters become final clusters
ity of the latter, the approximate version takes only O(n)expected
time to compute for any constant p>0
It should be clear from the above that the dbscan clusters are
uniquely defined by the parameters e and MinPts. but they are
Besides the parameters E and Min Pts inherited from DBSCAN
not necessarily disjoint. A non-core point may belong to multiple
the approximate version accepts a third parameter p, which is a small
clusters, while a core point must exist only in a single cluster. It is
positive constant less than 1, and controls the clustering precision
possible that a non-core point is not in any cluster; such a point is
Its clusters can also be defined in the same two steps as in exact
called noise
DBSCAN, as explained below
In Figure 2a, there are two non-core points 013 and 018. Since Step 1: Clustering Core Points. It will also be convenient to follow
B(o13, c) covers 014, 013 is assigned to the preliminary cluster of a graph-based approach. Let us define an undirected e-approximate
014. B(018, c), however, covers no core points, indicating that 018 is
core graph Gp on the dataset P--again, this graph is conceptual
noise. The final dbSCan clusters are (01, 02, ,05,[
and need not be materialized. Each vertex of Gp corresponds to a
012},{o13,O14,…,O17}
distinct core point in P. Given two core points P1, P2, whether or
not go has an edge between their vertices is determined as
Remark. DBSCAN can also be defined under the notion of"density-
reachable"; see [9]. The above graph-based definition is equiv
The edge definitely exists if dist(p1, P2)(1+p)e
Let p be a set of points in I that is subject to updates, each of
e Dont care. otherwise
which inserts a new point to P, or deletes an existing point from P.
We are given a clustering description which specifies correct ways
Each preliminary cluster is still a CC, but of Gp. Unlike the core to cluster P. The description is what distinguishes DBSCAN from
graph G, Gp may not be unique. This flexibility is the key to the
e.g., p-approximate DBSCAN
vast improvement in time complexity [10]
Suppose that, by the clustering description, C(P)is a legal
To illustrate, consider the dataset of Figure 2a again with the e
of clusters on the current contents of P. Without loss of generality
shown and MinPts=3, but also with p=0.5(the radius of the
assume that C(P)=C1, C2,..., Cr]. where is the number of
dashed circle indicates the length of (1 +p)). Figure 2c illustrates
clusters,andC;(1≤讠≤x) is a subset of f. Note that the clusters
a possible p-approximate core graph. Attention should be paid do not need to be disjoint
to the edge (o4, 010. Note(from the circles in Figure 2c)that
e< dist(o4, 010)<(1+ p)ethis belongs to the"dont-care
Given an arbitrary subset Q2 of P, a cluster group-by (c-group-by
case meaning that there may or may not be an edge(04, 010). If
query must return for every C:∈C(P)(∈[1,x]):
he edge exists(as in Figure 2c), there are 2 CCs (
· Nothing at all,ifCz∩Q=0
clusters); otherwise. the p-approximate core graph is the same as in
Figure 2b, giving 3 preliminary clusters
C;∩Q, otherwise
This definition has several useful properties
Step 2: Non-Core Assignment. Each non-core point p may be as-
signed to zero, one, or multiple preliminary clusters. Specifically,
It breaks only the points of Q by how they should appear
let s be a CC of Ge. Whether p should be added to the preliminary
together in the clusters of P. Points in P\Q are not reported
cluster of s is determined as
at all, thus avoiding cheating algorithms " that use expensive
report time as an excuse for high processing cost
Yes, if s has a core point in B(p, e)
No, if s has no core point in B(p, (1+p)e)
When Q=P, the query result Q(P) is simply C(P).
Dont care. otherwise.
All the query results must be based on the same C(P). This
prevents another form of"cheating"when the clustering de
The preliminary clusters after all the assignment constitute the final
scription permits multiple legal possibilities ofC(P). Specifi
clusters
cally, the algorithm can no longer argue that the results Q1(P)
As mentioned, 013 and o18 are the only two non-core points in
and Q2(P)of two queries Q1 and Q2 should both becor-
Figure 2a. While o18 is still a noise point, the case of 013 is more
rect"because Q1 (P)is defined on one possible C(P), while
interesting. First, it must be assigned to the preliminary cluster of
Q2(P) is defined on another. Instead, they must be consis-
014, just like exact DBSCAN. Second, it may or may not be assigned
tently defined on the same C(P)-the one output by the query
ith
o the preliminary cluster of o12(also the cluster of o14). Either case
is regarded as a correct result
Our objective is to design an algorithm that is fast in processing
Sandwich guarantee. Recall that the clusters of exact DBSCAN
both updates and queries. We distinguish two scenarios: (i)semi
are uniquely delermined by the parameters E ind Miri Pts. Now
dynamic: where all the updates are insertions, and (ii) fully-dynamic,
where the updates can be arbitrary insertions and deletions
imagine we slightly increase e by an anount no Inore than pE Have
the clusters of dbsCan changed? If yes, it neans chat the origi
We consider that the dimensionality d is small such that(v d)is
nal choice of E is unstable--clusters are susceptible even lU a Liny
an acceptable hidden constant. All our theoretical results will carry
perturbation LO E. If no, then the sandwich guaranlee asserts that
this constant, and hence, are suitable only for low dimensionality
p-approximate dbscan returns precisely the saIne clusters as DB
Our experiments run up to d=7.
SCAN In Figure 2, the clusters changed because E was deliberately
set to be a large value of 0.5. In [10, the recommended value for
Dynamic Exact DBSCAN [8]. Dynamic maintenance of density-
practical data is actually 0.001
based clusters has been studied by Ester et al. [8] for exact DB
SCAN. Next. we review their method--named incremental dB
We refrain from elaborating on the formal description of the
SCAN (IncDBSCAN)assuming MinPts= 1 so that all the points
sandwich guarantee at this moment We will come back to this
of P are core points. This allows us to concentrate on the main ideas
in Section 6 where we prove that our new"double-approximate
without the relatively minor details of handling non-core points
DBSCAn offers just the same guarantee
Insertion. Recall that, for exact DBSCAN, the clusters C(P)are
Remark. Note, interestingly, that when p =0, there are no"dont-
uniquely determined by
y the input parameters e and Min Pts. Given
care" scenarios such that p-approximate dbscan degenerates into
a new point pnew, the insertion algorithm retrieves all the points in
exact DBSCAN. Hence, the former actually subsumes the latter as a B(pneu, e), and then merges the clusters of those points into one
special case
The correctness can be seen from the core graph G, where et-
3. PROBLEM DEFINITION AND STATE OF
fectively an edge is added between pmew and every other point in
emember
this view is conceptual, and G does not
THE ART
need to be materialized Figure 3a shows the G before the insertion,
The Problem of Dynamic Clustering. We now provide a formal
which has two CCs. To insert point o as in Figure 3b, the algorithm
ormulation of dynamic clustering, using the C-group-by query as
finds the points 011, 012, and 013 in B(pmew, E).The two clusters of
the key stepping stone. Our approach is to define the problem il
those points are merged-the newly added edges(0, 011),(0, 012),
a way that is orthogonal to the semantics of clusters, so that the
(o, 013) in Figure 3b connect the two CCs into one
problem remains valid regardless of whether we have dbscan or
In merging the clusters, Incdbscan does not modify the cluster
any of its approximate versions in mind
ids of the points in the affected clusters, which can be prohibitively
15
(1+p)
13
915017:6CCid真
CC id 2
O
Q14N016
O11
(a) core
nout o
(b)Core graph with o
O●s
Figure 3: Illustration of the IncdbsCan method
non-core cells
expensive because the number of such points can be exceedingly
(a Grid and cells
(b)a grid graph
high. Instead, it remembers the"merging history"of the cluster ids
Figure 4: Our grid-graph framework(MinPts =3)
Deletion. The deletion algorithm, in general, reverses the steps of
insertion, except for a breadth first search(BFS)that is needed to
The components of the framework will be instantiated differently in
judge whether (and how )a cluster has split into several
later sections for individual variants
Let pold be the point being deleted. IncDBSCAN retrieves all the
4.1 A Grid graph approach
puints in B(pold, 6). In(the conceptual)G, all the edges between
such points and old are removed. after which the cc of pold may
The key idea behind our framework is to turn dynamic clustering
or may not be broken into separate CCs(e. g, in Figure 3b, the
nto the problem of maintaining CCs(connected components )of a
CC is torn into two only if 013. 014, or o is deleted). To find out,
special graph
the deletion algorithm performs as many threads of bfs on g
Grid and Cclls. We impose an arbitrary grid D in the data space Rd,
as the number of points in B(pold, e). If two threads"meet up
where each cell is a d-dimensional square with side length e/vd on
they are combined into one because they must still be in the same
every dimension. This ensures that any two points in the same cell
CC. As soon as only one thread is left, the algorithm terminates,
are within distance at most e from each other
being sure that no cluster split has taken place. Otherwise, the
remaining threads continue until the end. in which case each thread
Given a cell c of D, we denote by P(c) the set of points in P that
has enumerated all the points in a new cluster that is spawned by the
are covered by c. We call c
deletion all those points can now be labeled with the same cluster
A non-empty cell if P(c) contains at least one point
id
For example, suppose that we delete o in Figure 3b, which starts
A core cell if P(c)contains at least one core point
three threads of BFS from ol1, 012 and 013, respectively. The re
· a dense cell if P(c)|≥ Min pts, and a sparse cell if 1≤
sulting core graph reverts back to Figure 3a. The threads of Oll
P(c)< MinPts
and o12 meet up into one, which eventually traverses the entire CC
containing o11 and o12. Similarly, the other thread traverses the
Given two cells c1, c2, we say that they are E-close if the smallest
distance between the boundary of ci and that of c is at most e
entire CC containing 013
When a thread of b fs needs the adjacent neighbors of a point p
Consider for instance the grid in Figure 4a, imposed on a set P of
in G, the algorithm finds all the other points p2 E B(pi, e) through
18 points. Again, the radii of the solid and dashed circles indicate
a range query [3, 12]. Every such p2 is an adjacent neighbor of pi
e and(1+ p)e, respectively; and MinPts=3. The only non-core
This essentially" the edge between p1 and p2
points are a13 and o18. The core cells are shaded in Figure 4b. The
non-core cells are(5, 4 )and(7, 2); note that the minimum distance
Query. The algorithm of [8] can easily answer a C-group-by query
between the two cells is e-hence they are E-close
Q by grouping the points of Q by their cluster ids(some ids need to
be obtained from the merging history)
Grid Graph. In a grid graphG=(V,E), V is the set of core
cells of D, while e is a set of edges satisfying the following CC
Drawbacks of incDBsCAN. Both insertion and deletion start with
reguirement
a range query to extract the points in b(p, e), which are called the
seed points [8]. The query is expensive when p falls in a dense
Let p1, p2 be two core points of P, and c1(or c2, resp. )be
region of P where there are many seed points. The issue is more
the core cell that contains pi (or p2, resp. ) Then, pi and p2
are in the same cluster if and only if C1 and c2 are in the same
he number of range queries is simply huge Queries are needed to
serious in a deletion, because multiple range
form bfs. the worst situation ha
CC of G
The above requirement is fulfilled by using the following rules to
decide if E should have an edge between two core cells c1, C2 E V
4. THE OVERALL FRAMEWORK
Yes, if there is a pair of core points(p1, p2)E P(cixP(c2)
satisfying dist(p1, p2)0, there is a semi-dynamic p-approximate DBSCAN algorithm p-approximate DBSCAN algorithm A as follows
that processes each insertion in o(1)amortized time, and answers
a C-group-by query Q in o(QD time
1. Initialize a p-approximate dbsCan instance with e= 1,
Min Pts= 3, and an arbitrary p >0. Let P be the input set,
The same insertion and query efficiency can also be achieved in
which is empty at this moment
2D space for exact DBSCAN
2. Use A to insert all the red points into P.
6. DYNAMIC HARDNESS AND DOUBLE
3. For every blue point p=(a1, 32,.,d)(hence, 1>a),
APPROXIMATION
carry out the
following steps
We now come to perhaps the most surprising section of the pa-
3. 1 Use A to insert p into P.
per. Recall that p-approximate dbsCan was proposed to address
the computational hardness of dbsCan on static datasets. In
3. 2 Use A to insert a dummy point p'=(1+1, 2,.,td
Section 6.1, we will show that the p-approximate version suffers
into P. That is, p has the same coordinates as p on all
from the same hardness on fully dynamic datasets. Interestingly,
dimensions i E [2, d], except for the first dimension
le culprit this time is the definition of core point. This motivates
where p has coordinate 1 l. See Figure 6b for an
the proposition of p-double-approximate dbsCan in Section 6.2
illustration(where p= pilae)
where we also prove that the new proposition has a sandwich guar-
3.3 Use A to answer a C-group-by query with Q-p,p]
antee as strong as the p-approximate version
If the query returns the same cluster id for p and p
terminate the algorithm, and return"yes" to the USEC
6.1 Hardness of Dynamic p-Approximation
USEC with Line Separation. Next, we introduce the UsEC with
3.4 Use A to delete p and p from P
line separation(UsEcC-LS) Problem, which has a subtle connection
with p-approximate DBSCaN, as shown later
4. Return“no” to the usec- LS proble
……
The running time of the algorithm is O(n.Tup(n)+ Tgry(n)) 01.03
CC ii 2
because we issue at most 2n insertions and 2n deletions as well as
16
n queries, in total. Next, we prove that the algorithm is correct
13
Consider Lines 3. 1-3.4. A crucial observation is that the dummy
point p must be a non-core point, because b(p, e) contains only
two points p, p. Therefore, p and p are placed into the same cluster
by p-approximate dbscan if and only if p is a core point. However,
p is a core point if and only if B (p, e) covers at least 3 points, which
must include p, p, and at least one point p" on the other side of
ered point p and blue point p are therefore within distance 1
(a) One possible P-double-approx
(b)Grid graph
It is now straightforward to verify that our algorithm always
core grap
returns the correct answer for USEC-LS.
Figure 7: Illustration of p-double approximation
THEOREM 2. For any p>0 and any dimensionality d> 3, any
p-approximate dbsCAN algorithm must incur Q(n/)amortized
Sandwich Guarantee. Recall that this is an attractive feature of p-
time either to process an update, or to answer a C-group-by query approximate dbscan. Next, we prove that p-double-approximate
(even if Q= 2), unless the USEC problem in R could be solved DbsCan provides just the same guarantee. Following the style
of [ 1o, we define
PROOF. Suppose that the algorithm were able to process an up
61 as the set of clusters ofexacl DBSCAN with parameters
date and a query both in o(n,)amortized time. By Lemma 2, we
(E, MiTPts)
would solve USEC-LS in o(n/)time which, by Lemma l,means
62 as the set of clusters of exact dbsCan with parameters
that we would solve USEC in o(n*/)time.
(∈(1+p), Minpts
As explained in Section 2, for USEC, a lower bound ofQ(n)
6 as a set of clusters that is a legal result of p-double-approximate
known[7] in d> 5, whereas beating the O(n,)bound in d=3, 4
DBSCAN with parameters 6, Min Pts, and p
is a major open problem in theoretical computational geometry, and
Then, the sandwich guarantee is
believed to be impossible [6]
THEOREM 3. The following statements are true:(i) For any
This is disappointing because dbscan succumbing to the hard
cluster c1∈61, there is a cluster c∈6 such that O1≤C,and
ness of USEC was what motivated p-approximate DBSCaN. Theo-(ii) for any cluster C E 6, there is a cluster C2 E 62 such that
rem 2 shows that the latter suffers from the same hardness when both
C C2
insertions and deletions are allowed! note that the theorem does not
apply to the seimi-dy namic update scheine because the deletions at
The proof can be found in the appendix. Note that the theorem is
Line 3.4 are essential. In facl, Theorem 1 already proved that effi-
purposely worded exactly the same as Theorem 3 of [10
cient semi-dynamic algorithms exist for p-approximate dbsCan
7. FULLY DYNAMIC ALGORITHMS
Finally, it is worth pointing out that Theorem 2 holds even for
P-0, i.e., it is applicable to exact dbSCAN as well
This section presents our algorithms for maintaining p-double
approximate dbscan clusters under both insertions and deletions
6.2 p-Double-Approximate dbSCan and
(again, exact DBSCAN is captured with p=0). We will achieve the
Sandwich guarantee
purpose by instantiating the general framework in Section 4. The
The New Proposition. To enable both(fully-dynamic) update
reader is reminded that the core-point definition has changed to the
nd query efficiency, we propose p-double-approximate DBSCAN
one in Section 6.2
which takes the same parameters e, MinPts, and p as p-approximate
7.1 Approximate Bichromatic Close Pair
DBSCAN. Whether a point p e P is a core point is now decided in
a relaxed manner
We now take a short break from clustering to discuss a computa
tional geometry problem which we name the approximate bichro
Definitely a core point if B(p, e)covers at least MinPts matic close pair(abCp)problem. In this problem, we have two
points of P.
disjoint axis- parallel squares C1, c2 in R. There is a set S(c1)of
Definitely not a core point if b(p, (1 +p)e) covers less than
points in C1, and a set S(c2) of points in C2. The two sets are subject
MinPts points of F
to insertions and deletions. Let c and p be positive real values. We
are asked to maintain a witness pair of (pi, p2)such that
e Dont care. otherwise
It may be an empty pair(i.e, pi and p2 are null)
A good example to illustrate this is point 013 in Figure 4a. Since
B(013, 6) covers 2< MinPts=3 points, 013 is not a core point.
If it is not empty, we must have dist(pi, p2)s(1+p)e
under exact or p-approximate DBSCAN. Under double approxima-
The pair must not be empty, if there exist a point PI E S(c1)
tion, however, it falls into the dont-care case for p= 0.5, because
and a point P2 E S(c2)such that dist (p1: P2)0, there is a fully-dynamic p-double-approximate DBSCAN
algorithm that processes each insertion and deletion in o(1)amor-
7.2 Edges in the grid graph and abcP
tized time, and answers a C-g roup-by query q in o(QD time
Returning to p-double-approximate dbscan clustering, let us
The same update and query efficiency can also be achieved in 2D
recall that in the grid graph G, if there is an edge between core space for exact dbSCAN.
cells ci and c2, then the two cells must be E-close. Such an edge
may disappear/re-appear as the core points of P(c1)and P(c
8. EXPERIMENTS
are deleted/inserted. maintaining this edge can be regarded as an
instance of the aBCP problem, where S(c1) is the set of core points
Section 8. 1 describes the setup of our empirical evaluation. Then,
in P(c1), and S(c2)is the set of core points in P(C2)-the edge
Sections 8.2 and 8.3 report the results on semi-dynamic and fully
exists if and only if the witness pair is not emply!
dynamic algorithms, respectivel
We run a thread of the aBCP algorithm of Lemma 3 on every pair
8.1S
Setup
of e-close core cells ci and c2. Those threads will be referred to as Workload. We evaluate a clustering algorithm by its efficiency
the abCP instances of c1 (or c2). Whenever the edge between Cr
in processing a workload, which is a mixed sequence of update
and c2(re-)appears, we call Edgelnsert(c1, c2)of the CC structure; and queries. Each update or query is collectively referred to as an
whenever it disappears, we call EdgeRemove(c1, C2)
operation. A workload is characterized by several parameters
7. 3 The Core-Status Structure
. N: the total number of updates
Given a point g, an approximate runge count query [1o]returns
ins: the percentage of insertions. In other words, the work
an integer k that falls between B(4, e)l and B(q, (1+pe)l.The
load has N. %ins insertions, and N(l-%ins )deletions. Thi
query can be answered in O(1) time by a structure that can be
parameter is fixed to l in semi-dynamic scenarios
updated in O(1) time per insertion and deletion [16]. Under the
fary: the query frequency, which is an integer controlling how
elaxed core-point definition of p-double-approximation, whether a
often a C-group-by query is issued
point p e P is a core point can be decided directly by issuing such
The production of a workload involves 3 steps, as explained below
a query with p. If the query returns k, we declare p a core point if
and only if k≥ Minpts
Step 1: Insertions. The sequence of insertions is obtained by first
Leveraging this lacl, next we describe how Lo explicitly updale
generating a static dataset "of I
N. ins points, and then,
the core-slalus of all points in P along with insertions and deletions
randomly permuting these points (i.e., if a point stands at the i-th
position, it is the i-th inserted). We generate static datasets whose
To insert a point prew in cell Cneu, we first check whether clusters are the outcome of a"random walk with restart", as was
Pnew itself is a core point. Remember that the insertion may the approach suggested in [10], and will be reviewed shortly. Note
urn some existing non-core points into core points
that the random permutation mentioned earlier allows the clusters
tify such points, we look at each of the O(1) E-close sparse
to form up even at an early stage of the workload
cells c of cnew. Simply check all the points p E P(c)to see
The data space is a d-dimensional square that has range 0, 10
if p is currently a core point
on each dimension, a static dataset is created using the seed
The deletion of a point pold from cell cold may turn some spreader technique in [10, which generates around 10 clusters
existing core points into non-core points. Following the same and 0.0001 I noise points as follows. First, place a spreader at a
dea in insertion, we look at every E-close sparse cell c of cold
random location p of the data space. At each time tick, the spreader
and check all the points p E P(c) for their current core statu
adds to the dataset a point that is uniformly distributed in b(p, 25)
Whenever the spreader has generated 100 points while stationed at
7.4 GUM
the same p. it is forced to move towards a random direction by a
When a point Pcore(say, in cell ccore) has turned into a core point,
distance of 50. Finally, with probability 10/(0.99991), the spreader
we check whether ccore is already in V
restarts"by jumping to another random location of the data space
Regardless of whether a restart happens, the current time tick fin-
If so, simply insert pcore into every abCP instance(Lemma 3) ishes, and the next one starts. The spreader works for 0. 99991 time
of ccore -as explained in Section 7. 2, this properly maintains ticks(thus producing 0.9999I points). After that, 0.0001. I random
the edges of c
points are added to the dataset as noise
Otherwise, it must hold that P(ccore)< Min Pts=O(1)
Step 2: Deletions. First, append to the insertion sequence D
We add ccore to V. Then, for every e-close core cell c of
N-I deletion tokens, where each token is simply a"place-holder
Ccore, decide w hether to create an edge between c and ccore
by using the algorithm of L emma 3 to find an initial witness
pair( thereby starting the aBCP instance on ccore and c)
parameter
value
2,3.5.7
Consider now the scenario where a core point p in cell ccore has
E
50d,100d,200d,400d,800d
turned into a non-core point. If ccore is still a core cell, we remove p
from all the ab CP instances of ccore. Otherwise, we simply remove
0.01N,0.02N,0.03N,D.1N
Ccore from V, and destroy all its aBCP instances
Table 2: Variable parameter values (defaults in bolds)
PeroZ
IncDBSCAN
average cost per operation(microsec)
●●●●●●●●●●●。。●0●●●●●●●●●●●命受●●●会合会自自●●●●●自●●命●●●●●●●●●●●●●●●●●●●●●●●●●●●●。●●●●●●●●会。合●。●●●4
10这个的的单的个的全的的的全全论的全全的的全全的的的全个个
0
(a) Average operation cost vs time
10ec口
numher of operations(million)
(b) Max update cost vs time
Figure 8: Performance of semi-dynamic algorithms in 2D
A Semi Approx angco st.- Semi-Appror maupdcost-C-IncDBSCAN augCo
IncDBSCAN mazupdcost
l0.COst (microsec
命命合命吉·空会会盒空空空女空空女空命命盒鱼堂金盒盒金金盒章鱼金金食金金鱼金金全金金食金金金金命命
02命鱼复空大A点会
△△△△△△△△△△△△△△△AAA人AA人AAA△∧人△人△人AA△△入△△△△△△人△A△△△△△△△△△△△△△△△△△△△△△△△△△△AA△△△
number of operations(million
(a)3D
cost(microsec)
10
10△△AA△△A△AA△△△A△△△△△△△△△△△△△△△A△A△△△ MAAAAAAAZ△A△A△AA△A△A△△A△A△△△△△
number of operations(million)
(b)5D
cost(Inicrosec
10
|态金的显京食盒鱼鱼合合自合d血卓食食鱼鱼食鱼金皇皇变文攻皇皇皇立皇鱼鱼鱼立堂立堂堂宜堂鱼度堂
10入C
crations(million)
(c)7D
Figure 9: Performance of semi-dynamic algorithms in d>3 dimensions
into which we will later fill a concrete point to delete. Then, ran- uniformly at random from 2, 100 Then, Q is populated by random
domly permute the resulting sequence(which has length N). Check sampling Q points from S without replacement
whether the permutation is bad, namely, if any of its prefixes has
more tokens than insertions. If so, we attempt another rando
DBSCAN Algorithms. Our experimentation examined
permutation until a good one is obtaine
IncdbsCaN [8 the state-of-the-art dynamic algorithm for
Now we have a good sequence of insertions and deletion tokens
exact dbscan. as reviewed in section 3
To fill in the tokens, scan down the sequence, and add each inserted
2d- Semi-Exact: our semi-dynamic algorithm in Theorem 1
point into S, until coming across the first token. Select a random
for exact DBSCAN in 2D space
point in S as the one deleted by the token, and then remove the poi
Semi-Appror: our semi-dynamic algorithm in Theorem 1 for
from s. The scan continues in this fashion until all the tokens have
been filled
P-approximate dbsCan in d-dimensional space with d 2 2
2d-Full-Exact: our fully-dynamic algorithm in Theorem 4 for
Step 3: Queries. We simply insert a C-group-by query after every
exact dbsCaN in 2D space
fary updates in the sequence. Recall that the query specifies a
Double-Approx: our fully-dynamic algorithm in Theorem 4
parameter Q, which is generated as follows. Let s be the set of
for p-double-approximate DbsCan in d-dimensional space
alive" points that have been inserted, but not yet deleted before
with d> 2
the query. We first decide the value of Q by choosing an integer All the algorithms were implemented in C++, and compiled with
c version 4.8. 4
(系统自动生成,下载前可以参看下载内容)
下载文件列表
相关说明
- 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
- 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度。
- 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
- 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
- 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
- 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.