开发工具:
文件大小: 290kb
下载次数: 0
上传时间: 2019-10-12
详细说明:数学书籍 ai智能数学书籍User profiles matching in social networks
Data Collecting. Our approach consists of several stages. At first, we must data
from two social media using a crawling framework (profiles, photos from albums
and posts)I. For the purposes of validation of our results, we collect a set of
profiles from VKontakte, which have an explicit link to their secondary profile
in Instagram- the only possible way to build the labelled dataset
Face Detection and Embedding. We process photos using two algorithms
1. face detection -we apply mtonn- Multi-task Cascaded Convolutional
Networks 11, which achieved efficiency superior to the closest competitors
and is not affected by scaling of the faces
2. face embedding- to construct embeddings of extracted faces FaceNet neural
network is applied阿
We apply mtcnn prc-traincd on the Wider face datasct and Facc ct prc
Lr ained oll the VGGFaced?. Then this data is filtered
Filtering. The extracted face embeddings are further filtered by their parameters
according to several heuristics
filtering by number of pixels(hereinafter, we will use the term quality of the
image
2. filtering by anchors(child faces removing
Face Net has limitations on the minimum required quality of images and
we filter images of faces by the number of pixels of these faces. The accurate
control of the above parameters allows to achieve an improved precision and
recall of matching this is partly due to the behaviour of the selected method for
embedding construction. In the experimental study in Sect. 4] we found an effect
of the quality of facial images on the final matching efficiency- it improves the
F1-scorc by 4%
The other heuristics probably can be related to the dataset limitation of
VGGFace2 with which FaceNet was trained. VGGFace2 contains young and
mature faces of people but does not contain the faces of babies and small children
This leads to a problem that embeddings of childs faces have a very small margin
Detween each other. That is why we should remove their faces from the user's
collection of photos to avoid mismatching of profiles. Figure l reveals that the
distribution of distances between embeddings of children's faces has a bias from
the distribution of distances between embeddings of random peoples faces
Additional filtering of data is accomplished using so-called anchors. An an
chor is a vector that represents some space of embedded faces. In our study, we
use the anchor to represent the faces of children. We create it by following way. A
set, of children faces was collected semi-aut omatic lv: we find kindergarten and
photographers accounts using tags and specific usernames. For instance, tags
under the photos with words children", n kindergarte
en, etc. Then we build an
anchor-element-wise mean of all vectors of childrens faces. All face embeddings
which are close to this anchor are removed from the dataset
Coderepositoryused-https://github.com/davidsandberg/facenet
Sokhin et al
Distribution of distances
2000
Distance between non-children faces
Distance between children faces
91500
a
1000
500
0
0
0.5
Distance
Fig. 1. Distribution of distances between random people faces and between children
aces
Owner identification. This is the main part of our approach that is performed
separately for each profile in each social network. Embeddings of faces are formed
in Euclidean space. We apply hierarchical clustering for each profile separately
with the single linkage algorithm and distance threshold 0.8. This algorithm
allows us to generate a non-fixed number of clusters based on the euclidean
dis tance between face embeddings
Each cluster of the profile should belong either lo a single person in Che real
world. whose faces have slightly different but close embeddings or to persons
who look very similar due to distortions introduced by hairstyle, put on glasses
beards and other things which make them look similar. This is the main part
of our approach that is performed separately for each profile in each social net
work. Embeddings of faces are formed in Euclidean space. We apply hierarchical
clustering for each profile separately with the single linkage algorithm and dis
tance threshold 0. 8. This algorithm allows us to generate a non-fixed number of
clusters based on the Euclidean distance between facial embeddings
Each cluster of the profile should belong either to a single person in the real
world. whose faces have slightly different but close embeddings or to persons
who look very similar due to distortions introduced by hairstyle, put on glasses
beards and other things which ma ke them look similar
We assume that most, users publish photos with different people, but the
number of their face occurrences is greater than others. Following this hypothesis
in order to find the owners'faces, we must choose the largest cluster and combine
them into one vector- the defining vector (Dv) of profile using faces from a
hosen cluster. The dV is an element-wise mean of all generated embeddings with
User profiles matching in social networks
the same dimension (n where V-face embedding n-number of embeddings of
the user)
Dp-1、2
∑
However, due to possible sharpness of the dv, it is worth to take into account
the other largest clusters. Sometimes people publish many similar photos,even
the same photos. In case that is shown in Fig. 2(a) the first cluster only consists
of two unique images. We are not able to match this profile using this cluster
But we can add the others(for instance, the second largest, that is shown in
Fig 2(b)) and form a new DV using more than two unique face embeddings
Our experiments in Sect. 4]show that this assumption and the proposed solution
llow us lo achieve resulis Chal exceed the use of one cluster. Experimental
results give us the optimal value -2 clusters. If after clustering there is only one
cluster, we use all photos of the user, if there are all clusters with the same size
(e.g. 1 element), we set this profile as "unable to set the owner"and mark as
Fig. 2. Examples of cluster:(a)the first largest;(b) the second largest
After that, the dv of cach profile in both social media rcprcscnts the uscr
and will be used to matching If the size of the largest cluster is less than a given
threshold, this user marked as profiles without pair, beca use it is not possible to
detect the owner's face correctly. The tuning of the threshold is also provided in
the experimental stud
Profiles matching. The process of profiles matching is simple: defining vectors of
users from two social media are compared with each other. We calculate the L2
norm between profiles in two social media, for each profile in one social media
we find the profile from the other with the smallest distance and mark as a
candidate for matching(2)
argmin2(DV,DV1n)={ DIns VETS∈Dvm“:
L2(DVi, DVkmst)>L2(DVA, DVnst )J
I. Sokhin et al
If the smallest distance is higher than the given threshold (threshold distance
hereinafter), this means there is no pair in the other social media or we could
not find it
4 Experinental study
4.1 Details of the experimental part
Our experimental plan consists of three Imlainl steps: baseline evalualion using
real names-based matching; evaluation for full profiles without any limitations
evaluation with alignment rate reduction and photos number reduction
Dataset description. We use our own dataset- Dataset4675, which consists of
4675 profiles from VKontakte and 3100 profiles from Instagram, which simulates
working with partially aligned networks-only 3100 VKontakte users have a pair
in other social media. Dataset4675 users have from 50 to 500 publicly available
photos
Metrics. We clarify dcfinitions of precision, rccall and Fl-scorc, that wC usc
for this classification problem, which is not fully classical. Since we are working
with VKontakte as our main social media and want to saturate its profiles with
additional information, all metrics are calculated with respect to the number of
VKontakte users
With V as a number of all real pairs in our dataset (3193),K as a number of
the correct predictions of the algorithm(correctly matched pairs of VKontakte
and Instagram profiles) and K as a number of all predictions of the algorithm
the precision is delined as follows(3)
K
And the recall is defined as follows(4)
We need both the reca. lI and precision in order to evaluate our approach, FI-score
shows the balance between them and is used to choose the best parameters
4.2 Baseline evaluation. Real names matching
The real names of users from Dataset4675 are compared with Levenshtein dis
tance metric and sensitivity is analyzed according to its threshold distance. For
each user we are looking for the closest user from other social networks. if
closest distance exceeds the threshold value, we remain this user without a pair
User profiles matching in social networks
The real names are processed in the following sequence: lower case translation
non-alphabetic characters removing; transliteration
The precision and recall are shown in Table l The highest Fl of 0.295 is
achieved with P=0.765 and R=0.83 and the distance threshold of 4 permuta
tions. With a small dataset in relation to the real number of users, this approach
achieves a good precision, but il shiould be noticed that the precision decreases
with the increasing number of users. This can be explained from the fact of a
large number of homonyms in the real world. Also, we have a very low recall
Table 1. real name based matching results
Threshold precision recall fl-score
0.9760.1060.191
2
0.9720.1480.257
0.1690.286
4
0.7650.1830.295
5
0.5110.1920.279
6
0.352
0.1980.253
0.2690.2030.231
0.2350.2050.219
4.3 Evaluation for full profiles
Cluster analysis. At first, we analyze the dependency on the clusters number
in Table 2] with fixed parameter of threshold distance-065 and image quality
6400
Table 2. Cluster dependence anal ysis
Number of largest clusters used Precision Recall F1-score
0.96170.78850.8665
09782078750.8726
0.97970.78390.8709
4
0.97930.78450.8712
0.98010.78420.8713
It can be seen as proof of the requirements of more than 1 clusters mentioned
Sect. 3-the F1-score in this case is 0.855. The optimal value of the number
of the cluster is 2
T. Sokhin et al
Face-based matching. We also provide a sensitivity analysis of our approach
in Fig 3 We use Dataset4675 for this part of the experimental study. One can
a strong dependence between the threshold distance and efficiency. While
high precision is achieved with a smallest threshold distance value, the reca.Il
emains lower than 0.7, that can be seen in Table B The higher Fl is 0.0868
with image quality 80 and threshold distance 0.65
Face-based matching F1-score
Threshold distance:0.35
0.8
hreshold distance: 0.45
L0.7
Threshold distance:0.55
Threshold distance: 0.65
0.6
Threshold distance: 0.75
Fig 3. Fl-score of face-based matching depending on the image quality and the thresh
old distance
Table 3. Face-based matching results
Threshold distance
Image quality
0.35
0.45
0.55
0.65
0.75
Precision
0.997
0.9890.976
0.951
0.999
0.997
0.984
0.933
1.0
0.995
0.947
1.0
1.0
1.0
0.994
0.946
100
1.0
10
1.0
0.992
0.948
150
1.0
10
1.0
0.992
0.948
Recall
0.4780.6060.6870.739
30
0.513
0.637
0.709
0.793
0.519
0.645
0.7210.77
0.8
80
0.515
0.6380.715
0.77
0.798
100
0.507
0.634
0.71
0.761
0.797
150
0.461
0.588
0.671
0.734
0.772
User profiles matching in social networks
4.4 Evaluation with the reduced alignment rate and the reduced
number of photos
Here we experiment with limited data and rate of alignment of users. If our
approach requires as much data as possiblc, it is only applicable for govcrnmcnt
and law enforcement with social media cooperation
Avatars only matching. When working with facial images, using avatars can
be the easiest way. This removes the need for the owner detection stage because
the idea of an avatar is to present the owner. Here we use only users'avatars
from Dataset4675 to evaluate this assumption in Fig. 4
Face-based matching F1-score Avatars only
-+ Threshold distance: 0.35
0.4
k threshold distance:0. 15
Threshold distance.55
Threshold distance:0.65
Threshold distance:0.75
0
150
mage qualit
Fig 4. Fl-score of face-based matching depending on the image quality and the thresh
old distance. Avatars onl
We faced the recall decrease in general and almost zero Fl-score with a
gh value of the quality filter. We achieve 0.539 F1-score with the following
parameters: threshold distance -0.75, quality-30
Reducing the number of images for each user. We reduce the number of
available photos of each user from Dataset4675 in order to estimate our approach
in the condition of greater uncertainty in Fig. 5
Thc proccdurc of sampling is as follows: for cach uscr, we solcct X%c of his/hor
photos for 10 times. It is interesting that the precision rate remains almost the
same even with 10% of da ta from each user profile of both socia. media. The
reason for the low recall rate is the owner detection part: a small amount of
randomly sampled data does not allow to find the owner's face and to form a
good defining vector
Reducing the rate of intersections. Partial alignment. In the final part of
the experiments, we examine the partial alignment of social networks. As noted
by 10 authors real social media are partially alignment- not all users from one
I. Sokhin et al
Face-based matching Photos sampling per user
Precision
0.8
0.6
0.2
04
0.6
0.8
Fraction of photos
Fig. 5. The dependence of the efficiency of the algorit hm on the proportion of user
social media have accounts in anothcr onc. It is impossible to investigate the rcal
rale of this intersection, but we call consider a nuimber of rale values and create
a synt. hetically reduced intersection. The high variance of precision and reca.lI
depicted in Fig. 6is explained by user properties: we match different users, due
to random sampling. Some of these users could have more or fewer photos, good
or bad(such as biased vector) defining vectors. The stability of recall shows
Face- based matching. Intersection reducing
0.8
c0.6
G0.4
0.5
0.5
Intersection rate
Fig. 6. The dependence of the efficiency of the algorithm on the proportion of use
that our approach can be applied on low-alignment networks. The precision
decreased on low-rate alignment because of many false-positive samples, this
can potentially be improved by additional filtering
(系统自动生成,下载前可以参看下载内容)
下载文件列表
相关说明
- 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
- 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度。
- 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
- 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
- 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
- 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.