Dynamic Density Based Clustering.pdf 016Q17 Oo O O

文件名称: Dynamic Density Based Clustering.pdf

所属分类: 其它

开发工具:

文件大小: 480kb

下载次数: 0

上传时间: 2019-08-18

提供者: anhuiboz********

下载 (480kb)

不能下载？报告错误

详细说明：Dynamic Density Based Clustering.pdf Dynamic Density Based Clustering.pdf Dynamic Density Based Clustering.pdf016Q17 Oo O O1 1p)∈ (a)dataset (b)Core graph (c)One possible p-approximate core graph Figure 2: Illustration of dbSCan and p-approximate dBSCaN(p=0.5, MinPts =3) Min pts, which can be regarded as a constant next we review how haps more intuitive, and allows a simple extension to e clusters are formed using graph terminology. P-approximate DBSCan, as we will see later Given a point p E P, we use B(p, r) to represent the ball that is centered at p, and has radius r. The point is said to be a core point Hardness of dbscan and USEC. It is easy to see that the db if B(p, e) covers at least Min Pts points of P(including p itself ScAn clusters on a set P of n points can be computed in O(n") otherwise, it is a non-core point. To illustrate, consider the dataset time, noticing that the core graph g has O(n)edges. However, f 18 points in Figure 2a, where e is the radius of the inner solid lever algorithms should produce the clusters without generating all circle, and Min Pts=3. The core points have been colored black the edges, and thus, avoid the quadratic trap Indeed, when d-2, while the non-core points colored white. The dashed circle can be the clusters can be computed in only O(n log n)time [11] ignored for the time being It would be highly desired to find an algorithm of o(m )time for dBSCAN clusters are defined in Lwo sleps. The first one focuses d 3, but recently Gian and Tao [10] have essentially dispelled exclusively on the core points, and groups them into preliminary the possibility. They proved an O(n)-time reduction from the unit- clusters. The second step determines how the non-core points should spherical emptines s checking (USEC)problem to dbsCan. in be assigned to the clusters. Next, we explain the two steps in detail other words, any T(n)-time DBSCAN algorithm implies that USEC problem can be solved with O(T(m))time Step 1 Clustering Core Points. It will be convenient to imagine an In USEC, we are given a set Sred of red points and a set Blue undirected core graph g on Pthis graph is conceptual and need of blue points in I. All the points have distinct coordinates on not be materialized. Specifically, each vertex of G corresponds every dimension. The objective is to determine whether there exist to a distinct core point in P. There is an edge between two core red point pred and a blue point pblue such that dist(pred, pblue)<1 points(a k.a. vertices)p1, p2 if and only if dist(p1, p2)5 in a ure 2b shows the core graph for the dataset of Figure 2a broad class of algorithms [6, 7]. For d=3 and 4, beating the Each connected component(CC)of G constitutes a preliminary has been a grand open problem in theoretical computer science, and cluster In Figure 2b, there are 3 CCs(a k.a. preliminary clusters is widely believed [o] to be impossible. By the reduction of [10], Note that every core point belongs to exactly one preliminary cluster no dbsCan algorithm can have running time o(n/o)in d25 in d=3 and 4, this is also true unless unlikely ground-breaking Step 2 Non-Core Assignment. This step augments the preliminary improvements could be mlade on 3D USEC clusters with non-core points. For each non-core point p, DBSCAN looks at every core point pcore E B(p, e), and assigns p to the(only) Approximation and the Sandwich Guarantee. Gan and Tao [10] preliminary cluster containing Pcore. Note that, in this manner, p developed p-app roximate DBSCAN, which returns almost the sa may be assigned to zero, one, or more than one preliminary cluster clusters as exact dbscan by offering a strong sandwich guarantee After all the non-core points have been assigned, the preliminary that will be introduced shortly. In contrast to the high time complex clusters become final clusters ity of the latter, the approximate version takes only O(n)expected time to compute for any constant p>0 It should be clear from the above that the dbscan clusters are uniquely defined by the parameters e and MinPts. but they are Besides the parameters E and Min Pts inherited from DBSCAN not necessarily disjoint. A non-core point may belong to multiple the approximate version accepts a third parameter p, which is a small clusters, while a core point must exist only in a single cluster. It is positive constant less than 1, and controls the clustering precision possible that a non-core point is not in any cluster; such a point is Its clusters can also be defined in the same two steps as in exact called noise DBSCAN, as explained below In Figure 2a, there are two non-core points 013 and 018. Since Step 1: Clustering Core Points. It will also be convenient to follow B(o13, c) covers 014, 013 is assigned to the preliminary cluster of a graph-based approach. Let us define an undirected e-approximate 014. B(018, c), however, covers no core points, indicating that 018 is core graph Gp on the dataset P--again, this graph is conceptual noise. The final dbSCan clusters are (01, 02, ,05,[ and need not be materialized. Each vertex of Gp corresponds to a 012},{o13,O14,…,O17} distinct core point in P. Given two core points P1, P2, whether or not go has an edge between their vertices is determined as Remark. DBSCAN can also be defined under the notion of"density- reachable"; see [9]. The above graph-based definition is equiv The edge definitely exists if dist(p1, P2)(1+p)e Let p be a set of points in I that is subject to updates, each of e Dont care. otherwise which inserts a new point to P, or deletes an existing point from P. We are given a clustering description which specifies correct ways Each preliminary cluster is still a CC, but of Gp. Unlike the core to cluster P. The description is what distinguishes DBSCAN from graph G, Gp may not be unique. This flexibility is the key to the e.g., p-approximate DBSCAN vast improvement in time complexity [10] Suppose that, by the clustering description, C(P)is a legal To illustrate, consider the dataset of Figure 2a again with the e of clusters on the current contents of P. Without loss of generality shown and MinPts=3, but also with p=0.5(the radius of the assume that C(P)=C1, C2,..., Cr]. where is the number of dashed circle indicates the length of (1 +p)). Figure 2c illustrates clusters,andC;(1≤讠≤x) is a subset of f. Note that the clusters a possible p-approximate core graph. Attention should be paid do not need to be disjoint to the edge (o4, 010. Note(from the circles in Figure 2c)that e< dist(o4, 010)<(1+ p)ethis belongs to the"dont-care Given an arbitrary subset Q2 of P, a cluster group-by (c-group-by case meaning that there may or may not be an edge(04, 010). If query must return for every C:∈C(P)(∈[1,x]): he edge exists(as in Figure 2c), there are 2 CCs ( · Nothing at all,ifCz∩Q=0 clusters); otherwise. the p-approximate core graph is the same as in Figure 2b, giving 3 preliminary clusters C;∩Q, otherwise This definition has several useful properties Step 2: Non-Core Assignment. Each non-core point p may be as- signed to zero, one, or multiple preliminary clusters. Specifically, It breaks only the points of Q by how they should appear let s be a CC of Ge. Whether p should be added to the preliminary together in the clusters of P. Points in P\Q are not reported cluster of s is determined as at all, thus avoiding cheating algorithms " that use expensive report time as an excuse for high processing cost Yes, if s has a core point in B(p, e) No, if s has no core point in B(p, (1+p)e) When Q=P, the query result Q(P) is simply C(P). Dont care. otherwise. All the query results must be based on the same C(P). This prevents another form of"cheating"when the clustering de The preliminary clusters after all the assignment constitute the final scription permits multiple legal possibilities ofC(P). Specifi clusters cally, the algorithm can no longer argue that the results Q1(P) As mentioned, 013 and o18 are the only two non-core points in and Q2(P)of two queries Q1 and Q2 should both becor- Figure 2a. While o18 is still a noise point, the case of 013 is more rect"because Q1 (P)is defined on one possible C(P), while interesting. First, it must be assigned to the preliminary cluster of Q2(P) is defined on another. Instead, they must be consis- 014, just like exact DBSCAN. Second, it may or may not be assigned tently defined on the same C(P)-the one output by the query ith o the preliminary cluster of o12(also the cluster of o14). Either case is regarded as a correct result Our objective is to design an algorithm that is fast in processing Sandwich guarantee. Recall that the clusters of exact DBSCAN both updates and queries. We distinguish two scenarios: (i)semi are uniquely delermined by the parameters E ind Miri Pts. Now dynamic: where all the updates are insertions, and (ii) fully-dynamic, where the updates can be arbitrary insertions and deletions imagine we slightly increase e by an anount no Inore than pE Have the clusters of dbsCan changed? If yes, it neans chat the origi We consider that the dimensionality d is small such that(v d)is nal choice of E is unstable--clusters are susceptible even lU a Liny an acceptable hidden constant. All our theoretical results will carry perturbation LO E. If no, then the sandwich guaranlee asserts that this constant, and hence, are suitable only for low dimensionality p-approximate dbscan returns precisely the saIne clusters as DB Our experiments run up to d=7. SCAN In Figure 2, the clusters changed because E was deliberately set to be a large value of 0.5. In [10, the recommended value for Dynamic Exact DBSCAN [8]. Dynamic maintenance of density- practical data is actually 0.001 based clusters has been studied by Ester et al. [8] for exact DB SCAN. Next. we review their method--named incremental dB We refrain from elaborating on the formal description of the SCAN (IncDBSCAN)assuming MinPts= 1 so that all the points sandwich guarantee at this moment We will come back to this of P are core points. This allows us to concentrate on the main ideas in Section 6 where we prove that our new"double-approximate without the relatively minor details of handling non-core points DBSCAn offers just the same guarantee Insertion. Recall that, for exact DBSCAN, the clusters C(P)are Remark. Note, interestingly, that when p =0, there are no"dont- uniquely determined by y the input parameters e and Min Pts. Given care" scenarios such that p-approximate dbscan degenerates into a new point pnew, the insertion algorithm retrieves all the points in exact DBSCAN. Hence, the former actually subsumes the latter as a B(pneu, e), and then merges the clusters of those points into one special case The correctness can be seen from the core graph G, where et- 3. PROBLEM DEFINITION AND STATE OF fectively an edge is added between pmew and every other point in emember this view is conceptual, and G does not THE ART need to be materialized Figure 3a shows the G before the insertion, The Problem of Dynamic Clustering. We now provide a formal which has two CCs. To insert point o as in Figure 3b, the algorithm ormulation of dynamic clustering, using the C-group-by query as finds the points 011, 012, and 013 in B(pmew, E).The two clusters of the key stepping stone. Our approach is to define the problem il those points are merged-the newly added edges(0, 011),(0, 012), a way that is orthogonal to the semantics of clusters, so that the (o, 013) in Figure 3b connect the two CCs into one problem remains valid regardless of whether we have dbscan or In merging the clusters, Incdbscan does not modify the cluster any of its approximate versions in mind ids of the points in the affected clusters, which can be prohibitively 15 (1+p) 13 915017:6CCid真 CC id 2 O Q14N016 O11 (a) core nout o (b)Core graph with o O●s Figure 3: Illustration of the IncdbsCan method non-core cells expensive because the number of such points can be exceedingly (a Grid and cells (b)a grid graph high. Instead, it remembers the"merging history"of the cluster ids Figure 4: Our grid-graph framework(MinPts =3) Deletion. The deletion algorithm, in general, reverses the steps of insertion, except for a breadth first search(BFS)that is needed to The components of the framework will be instantiated differently in judge whether (and how )a cluster has split into several later sections for individual variants Let pold be the point being deleted. IncDBSCAN retrieves all the 4.1 A Grid graph approach puints in B(pold, 6). In(the conceptual)G, all the edges between such points and old are removed. after which the cc of pold may The key idea behind our framework is to turn dynamic clustering or may not be broken into separate CCs(e. g, in Figure 3b, the nto the problem of maintaining CCs(connected components )of a CC is torn into two only if 013. 014, or o is deleted). To find out, special graph the deletion algorithm performs as many threads of bfs on g Grid and Cclls. We impose an arbitrary grid D in the data space Rd, as the number of points in B(pold, e). If two threads"meet up where each cell is a d-dimensional square with side length e/vd on they are combined into one because they must still be in the same every dimension. This ensures that any two points in the same cell CC. As soon as only one thread is left, the algorithm terminates, are within distance at most e from each other being sure that no cluster split has taken place. Otherwise, the remaining threads continue until the end. in which case each thread Given a cell c of D, we denote by P(c) the set of points in P that has enumerated all the points in a new cluster that is spawned by the are covered by c. We call c deletion all those points can now be labeled with the same cluster A non-empty cell if P(c) contains at least one point id For example, suppose that we delete o in Figure 3b, which starts A core cell if P(c)contains at least one core point three threads of BFS from ol1, 012 and 013, respectively. The re · a dense cell if P(c)|≥ Min pts, and a sparse cell if 1≤ sulting core graph reverts back to Figure 3a. The threads of Oll P(c)< MinPts and o12 meet up into one, which eventually traverses the entire CC containing o11 and o12. Similarly, the other thread traverses the Given two cells c1, c2, we say that they are E-close if the smallest distance between the boundary of ci and that of c is at most e entire CC containing 013 When a thread of b fs needs the adjacent neighbors of a point p Consider for instance the grid in Figure 4a, imposed on a set P of in G, the algorithm finds all the other points p2 E B(pi, e) through 18 points. Again, the radii of the solid and dashed circles indicate a range query [3, 12]. Every such p2 is an adjacent neighbor of pi e and(1+ p)e, respectively; and MinPts=3. The only non-core This essentially" the edge between p1 and p2 points are a13 and o18. The core cells are shaded in Figure 4b. The non-core cells are(5, 4 )and(7, 2); note that the minimum distance Query. The algorithm of [8] can easily answer a C-group-by query between the two cells is e-hence they are E-close Q by grouping the points of Q by their cluster ids(some ids need to be obtained from the merging history) Grid Graph. In a grid graphG=(V,E), V is the set of core cells of D, while e is a set of edges satisfying the following CC Drawbacks of incDBsCAN. Both insertion and deletion start with reguirement a range query to extract the points in b(p, e), which are called the seed points [8]. The query is expensive when p falls in a dense Let p1, p2 be two core points of P, and c1(or c2, resp. )be region of P where there are many seed points. The issue is more the core cell that contains pi (or p2, resp. ) Then, pi and p2 are in the same cluster if and only if C1 and c2 are in the same he number of range queries is simply huge Queries are needed to serious in a deletion, because multiple range form bfs. the worst situation ha CC of G The above requirement is fulfilled by using the following rules to decide if E should have an edge between two core cells c1, C2 E V 4. THE OVERALL FRAMEWORK Yes, if there is a pair of core points(p1, p2)E P(cixP(c2) satisfying dist(p1, p2)0, there is a semi-dynamic p-approximate DBSCAN algorithm p-approximate DBSCAN algorithm A as follows that processes each insertion in o(1)amortized time, and answers a C-group-by query Q in o(QD time 1. Initialize a p-approximate dbsCan instance with e= 1, Min Pts= 3, and an arbitrary p >0. Let P be the input set, The same insertion and query efficiency can also be achieved in which is empty at this moment 2D space for exact DBSCAN 2. Use A to insert all the red points into P. 6. DYNAMIC HARDNESS AND DOUBLE 3. For every blue point p=(a1, 32,.,d)(hence, 1>a), APPROXIMATION carry out the following steps We now come to perhaps the most surprising section of the pa- 3. 1 Use A to insert p into P. per. Recall that p-approximate dbsCan was proposed to address the computational hardness of dbsCan on static datasets. In 3. 2 Use A to insert a dummy point p'=(1+1, 2,.,td Section 6.1, we will show that the p-approximate version suffers into P. That is, p has the same coordinates as p on all from the same hardness on fully dynamic datasets. Interestingly, dimensions i E [2, d], except for the first dimension le culprit this time is the definition of core point. This motivates where p has coordinate 1 l. See Figure 6b for an the proposition of p-double-approximate dbsCan in Section 6.2 illustration(where p= pilae) where we also prove that the new proposition has a sandwich guar- 3.3 Use A to answer a C-group-by query with Q-p,p] antee as strong as the p-approximate version If the query returns the same cluster id for p and p terminate the algorithm, and return"yes" to the USEC 6.1 Hardness of Dynamic p-Approximation USEC with Line Separation. Next, we introduce the UsEC with 3.4 Use A to delete p and p from P line separation(UsEcC-LS) Problem, which has a subtle connection with p-approximate DBSCaN, as shown later 4. Return“no” to the usec- LS proble …… The running time of the algorithm is O(n.Tup(n)+ Tgry(n)) 01.03 CC ii 2 because we issue at most 2n insertions and 2n deletions as well as 16 n queries, in total. Next, we prove that the algorithm is correct 13 Consider Lines 3. 1-3.4. A crucial observation is that the dummy point p must be a non-core point, because b(p, e) contains only two points p, p. Therefore, p and p are placed into the same cluster by p-approximate dbscan if and only if p is a core point. However, p is a core point if and only if B (p, e) covers at least 3 points, which must include p, p, and at least one point p" on the other side of ered point p and blue point p are therefore within distance 1 (a) One possible P-double-approx (b)Grid graph It is now straightforward to verify that our algorithm always core grap returns the correct answer for USEC-LS. Figure 7: Illustration of p-double approximation THEOREM 2. For any p>0 and any dimensionality d> 3, any p-approximate dbsCAN algorithm must incur Q(n/)amortized Sandwich Guarantee. Recall that this is an attractive feature of p- time either to process an update, or to answer a C-group-by query approximate dbscan. Next, we prove that p-double-approximate (even if Q= 2), unless the USEC problem in R could be solved DbsCan provides just the same guarantee. Following the style of [ 1o, we define PROOF. Suppose that the algorithm were able to process an up 61 as the set of clusters ofexacl DBSCAN with parameters date and a query both in o(n,)amortized time. By Lemma 2, we (E, MiTPts) would solve USEC-LS in o(n/)time which, by Lemma l,means 62 as the set of clusters of exact dbsCan with parameters that we would solve USEC in o(n*/)time. (∈(1+p), Minpts As explained in Section 2, for USEC, a lower bound ofQ(n) 6 as a set of clusters that is a legal result of p-double-approximate known[7] in d> 5, whereas beating the O(n,)bound in d=3, 4 DBSCAN with parameters 6, Min Pts, and p is a major open problem in theoretical computational geometry, and Then, the sandwich guarantee is believed to be impossible [6] THEOREM 3. The following statements are true:(i) For any This is disappointing because dbscan succumbing to the hard cluster c1∈61, there is a cluster c∈6 such that O1≤C,and ness of USEC was what motivated p-approximate DBSCaN. Theo-(ii) for any cluster C E 6, there is a cluster C2 E 62 such that rem 2 shows that the latter suffers from the same hardness when both C C2 insertions and deletions are allowed! note that the theorem does not apply to the seimi-dy namic update scheine because the deletions at The proof can be found in the appendix. Note that the theorem is Line 3.4 are essential. In facl, Theorem 1 already proved that effi- purposely worded exactly the same as Theorem 3 of [10 cient semi-dynamic algorithms exist for p-approximate dbsCan 7. FULLY DYNAMIC ALGORITHMS Finally, it is worth pointing out that Theorem 2 holds even for P-0, i.e., it is applicable to exact dbSCAN as well This section presents our algorithms for maintaining p-double approximate dbscan clusters under both insertions and deletions 6.2 p-Double-Approximate dbSCan and (again, exact DBSCAN is captured with p=0). We will achieve the Sandwich guarantee purpose by instantiating the general framework in Section 4. The The New Proposition. To enable both(fully-dynamic) update reader is reminded that the core-point definition has changed to the nd query efficiency, we propose p-double-approximate DBSCAN one in Section 6.2 which takes the same parameters e, MinPts, and p as p-approximate 7.1 Approximate Bichromatic Close Pair DBSCAN. Whether a point p e P is a core point is now decided in a relaxed manner We now take a short break from clustering to discuss a computa tional geometry problem which we name the approximate bichro Definitely a core point if B(p, e)covers at least MinPts matic close pair(abCp)problem. In this problem, we have two points of P. disjoint axis- parallel squares C1, c2 in R. There is a set S(c1)of Definitely not a core point if b(p, (1 +p)e) covers less than points in C1, and a set S(c2) of points in C2. The two sets are subject MinPts points of F to insertions and deletions. Let c and p be positive real values. We are asked to maintain a witness pair of (pi, p2)such that e Dont care. otherwise It may be an empty pair(i.e, pi and p2 are null) A good example to illustrate this is point 013 in Figure 4a. Since B(013, 6) covers 2< MinPts=3 points, 013 is not a core point. If it is not empty, we must have dist(pi, p2)s(1+p)e under exact or p-approximate DBSCAN. Under double approxima- The pair must not be empty, if there exist a point PI E S(c1) tion, however, it falls into the dont-care case for p= 0.5, because and a point P2 E S(c2)such that dist (p1: P2)0, there is a fully-dynamic p-double-approximate DBSCAN algorithm that processes each insertion and deletion in o(1)amor- 7.2 Edges in the grid graph and abcP tized time, and answers a C-g roup-by query q in o(QD time Returning to p-double-approximate dbscan clustering, let us The same update and query efficiency can also be achieved in 2D recall that in the grid graph G, if there is an edge between core space for exact dbSCAN. cells ci and c2, then the two cells must be E-close. Such an edge may disappear/re-appear as the core points of P(c1)and P(c 8. EXPERIMENTS are deleted/inserted. maintaining this edge can be regarded as an instance of the aBCP problem, where S(c1) is the set of core points Section 8. 1 describes the setup of our empirical evaluation. Then, in P(c1), and S(c2)is the set of core points in P(C2)-the edge Sections 8.2 and 8.3 report the results on semi-dynamic and fully exists if and only if the witness pair is not emply! dynamic algorithms, respectivel We run a thread of the aBCP algorithm of Lemma 3 on every pair 8.1S Setup of e-close core cells ci and c2. Those threads will be referred to as Workload. We evaluate a clustering algorithm by its efficiency the abCP instances of c1 (or c2). Whenever the edge between Cr in processing a workload, which is a mixed sequence of update and c2(re-)appears, we call Edgelnsert(c1, c2)of the CC structure; and queries. Each update or query is collectively referred to as an whenever it disappears, we call EdgeRemove(c1, C2) operation. A workload is characterized by several parameters 7. 3 The Core-Status Structure . N: the total number of updates Given a point g, an approximate runge count query [1o]returns ins: the percentage of insertions. In other words, the work an integer k that falls between B(4, e)l and B(q, (1+pe)l.The load has N. %ins insertions, and N(l-%ins )deletions. Thi query can be answered in O(1) time by a structure that can be parameter is fixed to l in semi-dynamic scenarios updated in O(1) time per insertion and deletion [16]. Under the fary: the query frequency, which is an integer controlling how elaxed core-point definition of p-double-approximation, whether a often a C-group-by query is issued point p e P is a core point can be decided directly by issuing such The production of a workload involves 3 steps, as explained below a query with p. If the query returns k, we declare p a core point if and only if k≥ Minpts Step 1: Insertions. The sequence of insertions is obtained by first Leveraging this lacl, next we describe how Lo explicitly updale generating a static dataset "of I N. ins points, and then, the core-slalus of all points in P along with insertions and deletions randomly permuting these points (i.e., if a point stands at the i-th position, it is the i-th inserted). We generate static datasets whose To insert a point prew in cell Cneu, we first check whether clusters are the outcome of a"random walk with restart", as was Pnew itself is a core point. Remember that the insertion may the approach suggested in [10], and will be reviewed shortly. Note urn some existing non-core points into core points that the random permutation mentioned earlier allows the clusters tify such points, we look at each of the O(1) E-close sparse to form up even at an early stage of the workload cells c of cnew. Simply check all the points p E P(c)to see The data space is a d-dimensional square that has range 0, 10 if p is currently a core point on each dimension, a static dataset is created using the seed The deletion of a point pold from cell cold may turn some spreader technique in [10, which generates around 10 clusters existing core points into non-core points. Following the same and 0.0001 I noise points as follows. First, place a spreader at a dea in insertion, we look at every E-close sparse cell c of cold random location p of the data space. At each time tick, the spreader and check all the points p E P(c) for their current core statu adds to the dataset a point that is uniformly distributed in b(p, 25) Whenever the spreader has generated 100 points while stationed at 7.4 GUM the same p. it is forced to move towards a random direction by a When a point Pcore(say, in cell ccore) has turned into a core point, distance of 50. Finally, with probability 10/(0.99991), the spreader we check whether ccore is already in V restarts"by jumping to another random location of the data space Regardless of whether a restart happens, the current time tick fin- If so, simply insert pcore into every abCP instance(Lemma 3) ishes, and the next one starts. The spreader works for 0. 99991 time of ccore -as explained in Section 7. 2, this properly maintains ticks(thus producing 0.9999I points). After that, 0.0001. I random the edges of c points are added to the dataset as noise Otherwise, it must hold that P(ccore)< Min Pts=O(1) Step 2: Deletions. First, append to the insertion sequence D We add ccore to V. Then, for every e-close core cell c of N-I deletion tokens, where each token is simply a"place-holder Ccore, decide w hether to create an edge between c and ccore by using the algorithm of L emma 3 to find an initial witness pair( thereby starting the aBCP instance on ccore and c) parameter value 2,3.5.7 Consider now the scenario where a core point p in cell ccore has E 50d,100d,200d,400d,800d turned into a non-core point. If ccore is still a core cell, we remove p from all the ab CP instances of ccore. Otherwise, we simply remove 0.01N,0.02N,0.03N,D.1N Ccore from V, and destroy all its aBCP instances Table 2: Variable parameter values (defaults in bolds) PeroZ IncDBSCAN average cost per operation(microsec) ●●●●●●●●●●●。。●0●●●●●●●●●●●命受●●●会合会自自●●●●●自●●命●●●●●●●●●●●●●●●●●●●●●●●●●●●●。●●●●●●●●会。合●。●●●4 10这个的的单的个的全的的的全全论的全全的的全全的的的全个个 0 (a) Average operation cost vs time 10ec口 numher of operations(million) (b) Max update cost vs time Figure 8: Performance of semi-dynamic algorithms in 2D A Semi Approx angco st.- Semi-Appror maupdcost-C-IncDBSCAN augCo IncDBSCAN mazupdcost l0.COst (microsec 命命合命吉·空会会盒空空空女空空女空命命盒鱼堂金盒盒金金盒章鱼金金食金金鱼金金全金金食金金金金命命 02命鱼复空大A点会 △△△△△△△△△△△△△△△AAA人AA人AAA△∧人△人△人AA△△入△△△△△△人△A△△△△△△△△△△△△△△△△△△△△△△△△△△AA△△△ number of operations(million (a)3D cost(microsec) 10 10△△AA△△A△AA△△△A△△△△△△△△△△△△△△△A△A△△△ MAAAAAAAZ△A△A△AA△A△A△△A△A△△△△△ number of operations(million) (b)5D cost(Inicrosec 10 |态金的显京食盒鱼鱼合合自合d血卓食食鱼鱼食鱼金皇皇变文攻皇皇皇立皇鱼鱼鱼立堂立堂堂宜堂鱼度堂 10入C crations(million) (c)7D Figure 9: Performance of semi-dynamic algorithms in d>3 dimensions into which we will later fill a concrete point to delete. Then, ran- uniformly at random from 2, 100 Then, Q is populated by random domly permute the resulting sequence(which has length N). Check sampling Q points from S without replacement whether the permutation is bad, namely, if any of its prefixes has more tokens than insertions. If so, we attempt another rando DBSCAN Algorithms. Our experimentation examined permutation until a good one is obtaine IncdbsCaN [8 the state-of-the-art dynamic algorithm for Now we have a good sequence of insertions and deletion tokens exact dbscan. as reviewed in section 3 To fill in the tokens, scan down the sequence, and add each inserted 2d- Semi-Exact: our semi-dynamic algorithm in Theorem 1 point into S, until coming across the first token. Select a random for exact DBSCAN in 2D space point in S as the one deleted by the token, and then remove the poi Semi-Appror: our semi-dynamic algorithm in Theorem 1 for from s. The scan continues in this fashion until all the tokens have been filled P-approximate dbsCan in d-dimensional space with d 2 2 2d-Full-Exact: our fully-dynamic algorithm in Theorem 4 for Step 3: Queries. We simply insert a C-group-by query after every exact dbsCaN in 2D space fary updates in the sequence. Recall that the query specifies a Double-Approx: our fully-dynamic algorithm in Theorem 4 parameter Q, which is generated as follows. Let s be the set of for p-double-approximate DbsCan in d-dimensional space alive" points that have been inserted, but not yet deleted before with d> 2 the query. We first decide the value of Q by choosing an integer All the algorithms were implemented in C++, and compiled with c version 4.8. 4

(系统自动生成,下载前可以参看下载内容)