1905.06081.pdf数学书籍 ai智能数学书籍User profiles matching

文件名称: 1905.06081.pdf

所属分类: 机器学习

开发工具:

文件大小: 290kb

下载次数: 0

上传时间: 2019-10-12

提供者: leok*****

下载 (290kb)

不能下载？报告错误

详细说明：数学书籍 ai智能数学书籍User profiles matching in social networks Data Collecting. Our approach consists of several stages. At first, we must data from two social media using a crawling framework (profiles, photos from albums and posts)I. For the purposes of validation of our results, we collect a set of profiles from VKontakte, which have an explicit link to their secondary profile in Instagram- the only possible way to build the labelled dataset Face Detection and Embedding. We process photos using two algorithms 1. face detection -we apply mtonn- Multi-task Cascaded Convolutional Networks 11, which achieved efficiency superior to the closest competitors and is not affected by scaling of the faces 2. face embedding- to construct embeddings of extracted faces FaceNet neural network is applied阿 We apply mtcnn prc-traincd on the Wider face datasct and Facc ct prc Lr ained oll the VGGFaced?. Then this data is filtered Filtering. The extracted face embeddings are further filtered by their parameters according to several heuristics filtering by number of pixels(hereinafter, we will use the term quality of the image 2. filtering by anchors(child faces removing Face Net has limitations on the minimum required quality of images and we filter images of faces by the number of pixels of these faces. The accurate control of the above parameters allows to achieve an improved precision and recall of matching this is partly due to the behaviour of the selected method for embedding construction. In the experimental study in Sect. 4] we found an effect of the quality of facial images on the final matching efficiency- it improves the F1-scorc by 4% The other heuristics probably can be related to the dataset limitation of VGGFace2 with which FaceNet was trained. VGGFace2 contains young and mature faces of people but does not contain the faces of babies and small children This leads to a problem that embeddings of childs faces have a very small margin Detween each other. That is why we should remove their faces from the user's collection of photos to avoid mismatching of profiles. Figure l reveals that the distribution of distances between embeddings of children's faces has a bias from the distribution of distances between embeddings of random peoples faces Additional filtering of data is accomplished using so-called anchors. An an chor is a vector that represents some space of embedded faces. In our study, we use the anchor to represent the faces of children. We create it by following way. A set, of children faces was collected semi-aut omatic lv: we find kindergarten and photographers accounts using tags and specific usernames. For instance, tags under the photos with words children", n kindergarte en, etc. Then we build an anchor-element-wise mean of all vectors of childrens faces. All face embeddings which are close to this anchor are removed from the dataset Coderepositoryused-https://github.com/davidsandberg/facenet Sokhin et al Distribution of distances 2000 Distance between non-children faces Distance between children faces 91500 a 1000 500 0 0 0.5 Distance Fig. 1. Distribution of distances between random people faces and between children aces Owner identification. This is the main part of our approach that is performed separately for each profile in each social network. Embeddings of faces are formed in Euclidean space. We apply hierarchical clustering for each profile separately with the single linkage algorithm and distance threshold 0.8. This algorithm allows us to generate a non-fixed number of clusters based on the euclidean dis tance between face embeddings Each cluster of the profile should belong either lo a single person in Che real world. whose faces have slightly different but close embeddings or to persons who look very similar due to distortions introduced by hairstyle, put on glasses beards and other things which make them look similar. This is the main part of our approach that is performed separately for each profile in each social net work. Embeddings of faces are formed in Euclidean space. We apply hierarchical clustering for each profile separately with the single linkage algorithm and dis tance threshold 0. 8. This algorithm allows us to generate a non-fixed number of clusters based on the Euclidean distance between facial embeddings Each cluster of the profile should belong either to a single person in the real world. whose faces have slightly different but close embeddings or to persons who look very similar due to distortions introduced by hairstyle, put on glasses beards and other things which ma ke them look similar We assume that most, users publish photos with different people, but the number of their face occurrences is greater than others. Following this hypothesis in order to find the owners'faces, we must choose the largest cluster and combine them into one vector- the defining vector (Dv) of profile using faces from a hosen cluster. The dV is an element-wise mean of all generated embeddings with User profiles matching in social networks the same dimension (n where V-face embedding n-number of embeddings of the user) Dp-1、2 ∑ However, due to possible sharpness of the dv, it is worth to take into account the other largest clusters. Sometimes people publish many similar photos,even the same photos. In case that is shown in Fig. 2(a) the first cluster only consists of two unique images. We are not able to match this profile using this cluster But we can add the others(for instance, the second largest, that is shown in Fig 2(b)) and form a new DV using more than two unique face embeddings Our experiments in Sect. 4]show that this assumption and the proposed solution llow us lo achieve resulis Chal exceed the use of one cluster. Experimental results give us the optimal value -2 clusters. If after clustering there is only one cluster, we use all photos of the user, if there are all clusters with the same size (e.g. 1 element), we set this profile as "unable to set the owner"and mark as Fig. 2. Examples of cluster:(a)the first largest;(b) the second largest After that, the dv of cach profile in both social media rcprcscnts the uscr and will be used to matching If the size of the largest cluster is less than a given threshold, this user marked as profiles without pair, beca use it is not possible to detect the owner's face correctly. The tuning of the threshold is also provided in the experimental stud Profiles matching. The process of profiles matching is simple: defining vectors of users from two social media are compared with each other. We calculate the L2 norm between profiles in two social media, for each profile in one social media we find the profile from the other with the smallest distance and mark as a candidate for matching(2) argmin2(DV,DV1n)={ DIns VETS∈Dvm“: L2(DVi, DVkmst)>L2(DVA, DVnst )J I. Sokhin et al If the smallest distance is higher than the given threshold (threshold distance hereinafter), this means there is no pair in the other social media or we could not find it 4 Experinental study 4.1 Details of the experimental part Our experimental plan consists of three Imlainl steps: baseline evalualion using real names-based matching; evaluation for full profiles without any limitations evaluation with alignment rate reduction and photos number reduction Dataset description. We use our own dataset- Dataset4675, which consists of 4675 profiles from VKontakte and 3100 profiles from Instagram, which simulates working with partially aligned networks-only 3100 VKontakte users have a pair in other social media. Dataset4675 users have from 50 to 500 publicly available photos Metrics. We clarify dcfinitions of precision, rccall and Fl-scorc, that wC usc for this classification problem, which is not fully classical. Since we are working with VKontakte as our main social media and want to saturate its profiles with additional information, all metrics are calculated with respect to the number of VKontakte users With V as a number of all real pairs in our dataset (3193),K as a number of the correct predictions of the algorithm(correctly matched pairs of VKontakte and Instagram profiles) and K as a number of all predictions of the algorithm the precision is delined as follows(3) K And the recall is defined as follows(4) We need both the reca. lI and precision in order to evaluate our approach, FI-score shows the balance between them and is used to choose the best parameters 4.2 Baseline evaluation. Real names matching The real names of users from Dataset4675 are compared with Levenshtein dis tance metric and sensitivity is analyzed according to its threshold distance. For each user we are looking for the closest user from other social networks. if closest distance exceeds the threshold value, we remain this user without a pair User profiles matching in social networks The real names are processed in the following sequence: lower case translation non-alphabetic characters removing; transliteration The precision and recall are shown in Table l The highest Fl of 0.295 is achieved with P=0.765 and R=0.83 and the distance threshold of 4 permuta tions. With a small dataset in relation to the real number of users, this approach achieves a good precision, but il shiould be noticed that the precision decreases with the increasing number of users. This can be explained from the fact of a large number of homonyms in the real world. Also, we have a very low recall Table 1. real name based matching results Threshold precision recall fl-score 0.9760.1060.191 2 0.9720.1480.257 0.1690.286 4 0.7650.1830.295 5 0.5110.1920.279 6 0.352 0.1980.253 0.2690.2030.231 0.2350.2050.219 4.3 Evaluation for full profiles Cluster analysis. At first, we analyze the dependency on the clusters number in Table 2] with fixed parameter of threshold distance-065 and image quality 6400 Table 2. Cluster dependence anal ysis Number of largest clusters used Precision Recall F1-score 0.96170.78850.8665 09782078750.8726 0.97970.78390.8709 4 0.97930.78450.8712 0.98010.78420.8713 It can be seen as proof of the requirements of more than 1 clusters mentioned Sect. 3-the F1-score in this case is 0.855. The optimal value of the number of the cluster is 2 T. Sokhin et al Face-based matching. We also provide a sensitivity analysis of our approach in Fig 3 We use Dataset4675 for this part of the experimental study. One can a strong dependence between the threshold distance and efficiency. While high precision is achieved with a smallest threshold distance value, the reca.Il emains lower than 0.7, that can be seen in Table B The higher Fl is 0.0868 with image quality 80 and threshold distance 0.65 Face-based matching F1-score Threshold distance:0.35 0.8 hreshold distance: 0.45 L0.7 Threshold distance:0.55 Threshold distance: 0.65 0.6 Threshold distance: 0.75 Fig 3. Fl-score of face-based matching depending on the image quality and the thresh old distance Table 3. Face-based matching results Threshold distance Image quality 0.35 0.45 0.55 0.65 0.75 Precision 0.997 0.9890.976 0.951 0.999 0.997 0.984 0.933 1.0 0.995 0.947 1.0 1.0 1.0 0.994 0.946 100 1.0 10 1.0 0.992 0.948 150 1.0 10 1.0 0.992 0.948 Recall 0.4780.6060.6870.739 30 0.513 0.637 0.709 0.793 0.519 0.645 0.7210.77 0.8 80 0.515 0.6380.715 0.77 0.798 100 0.507 0.634 0.71 0.761 0.797 150 0.461 0.588 0.671 0.734 0.772 User profiles matching in social networks 4.4 Evaluation with the reduced alignment rate and the reduced number of photos Here we experiment with limited data and rate of alignment of users. If our approach requires as much data as possiblc, it is only applicable for govcrnmcnt and law enforcement with social media cooperation Avatars only matching. When working with facial images, using avatars can be the easiest way. This removes the need for the owner detection stage because the idea of an avatar is to present the owner. Here we use only users'avatars from Dataset4675 to evaluate this assumption in Fig. 4 Face-based matching F1-score Avatars only -+ Threshold distance: 0.35 0.4 k threshold distance:0. 15 Threshold distance.55 Threshold distance:0.65 Threshold distance:0.75 0 150 mage qualit Fig 4. Fl-score of face-based matching depending on the image quality and the thresh old distance. Avatars onl We faced the recall decrease in general and almost zero Fl-score with a gh value of the quality filter. We achieve 0.539 F1-score with the following parameters: threshold distance -0.75, quality-30 Reducing the number of images for each user. We reduce the number of available photos of each user from Dataset4675 in order to estimate our approach in the condition of greater uncertainty in Fig. 5 Thc proccdurc of sampling is as follows: for cach uscr, we solcct X%c of his/hor photos for 10 times. It is interesting that the precision rate remains almost the same even with 10% of da ta from each user profile of both socia. media. The reason for the low recall rate is the owner detection part: a small amount of randomly sampled data does not allow to find the owner's face and to form a good defining vector Reducing the rate of intersections. Partial alignment. In the final part of the experiments, we examine the partial alignment of social networks. As noted by 10 authors real social media are partially alignment- not all users from one I. Sokhin et al Face-based matching Photos sampling per user Precision 0.8 0.6 0.2 04 0.6 0.8 Fraction of photos Fig. 5. The dependence of the efficiency of the algorit hm on the proportion of user social media have accounts in anothcr onc. It is impossible to investigate the rcal rale of this intersection, but we call consider a nuimber of rale values and create a synt. hetically reduced intersection. The high variance of precision and reca.lI depicted in Fig. 6is explained by user properties: we match different users, due to random sampling. Some of these users could have more or fewer photos, good or bad(such as biased vector) defining vectors. The stability of recall shows Face- based matching. Intersection reducing 0.8 c0.6 G0.4 0.5 0.5 Intersection rate Fig. 6. The dependence of the efficiency of the algorithm on the proportion of use that our approach can be applied on low-alignment networks. The precision decreased on low-rate alignment because of many false-positive samples, this can potentially be improved by additional filtering

(系统自动生成,下载前可以参看下载内容)