Joint 3D Face Reconstruction and Dense Alignment
… 9.8 ms. All of these are achieved by the elaborate design of the 2D representation of 3D facial structure and the corresponding loss function. Specifically, we design a UV position map, which is a 2D image recording the 3D coordinates of a complete facial point cloud while keeping the semantic meaning at each UV place. We then train a simple encoder-decoder network with a weighted loss that focuses more on discriminative regions to regress the UV position map from a single 2D facial image. Figure 1 shows that our method is robust to poses, illuminations and occlusions.
In summary, our main contributions are:
- For the first time, we solve the problems of face alignment and 3D face reconstruction together in an end-to-end fashion, without the restriction of a low-dimensional solution space.
- To directly regress the 3D facial structure and dense alignment, we develop a novel representation called UV position map, which records the position information of the 3D face and provides dense correspondence to the semantic meaning of each point in UV space.
- For training, we propose a weight mask which assigns a different weight to each point on the position map and compute a weighted loss. We show that this design helps improve the performance of our network.
- We finally provide a light-weighted framework that runs at over 100 FPS to directly obtain the 3D face reconstruction and alignment result from a single 2D facial image.
- Comparison on the AFLW2000-3D and Florence datasets shows that our method achieves more than 25% relative improvement over other state-of-the-art methods on both tasks of 3D face reconstruction and dense face alignment.
2 Related Works
2.1 3D Face Reconstruction
Since Blanz and Vetter proposed the 3D Morphable Model (3DMM) in 1999 [6], methods based on 3DMM have been popular for the task of monocular 3D face reconstruction. Most earlier methods establish correspondences between special points in the input image and a 3D template, including landmarks [37, 68, 56, 27, 10, 29, 19] and local features [26, 49, 19], and then solve a non-linear optimization to regress the 3DMM coefficients. However, these methods heavily rely on the accuracy of the landmark or feature point detector. Thus, some methods [22, 63] first use CNNs to learn the dense correspondence between the input image and the 3D template, then calculate the 3DMM parameters with the predicted dense constraints. Recent works also explore the use of CNNs to predict 3DMM parameters directly. [32, 67, 17, 39, 48] use cascaded CNN structures to regress accurate 3DMM coefficients, which takes a lot of time due to iterations. [15, 57, 31, 36] propose end-to-end CNN architectures to directly estimate the 3DMM shape parameters. Unsupervised methods have also been researched recently: [55, 3] can regress the 3DMM coefficients without the help of training data, but perform badly on faces with large poses and strong occlusions. However, the main defect of all these methods is that they are model-based, resulting in a limited geometry constrained to the model space. Some other methods can reconstruct 3D faces without a 3D shape basis. [24, 33, 20, 53, 51] produce the 3D structure by warping the shape of a reference 3D model. [4] also reconstructs the 3D shape of faces by learning a 3D Thin Plate Spline (TPS) warping function via a deep network, which warps a generic 3D model to a subject-specific 3D shape. Obviously, the face geometry reconstructed by these methods is also restricted by the reference model, which means the structure differs when the template changes. Recently, [28] proposed to straightforwardly map the image pixels to the full 3D facial structure via volumetric CNN regression. This method is no longer restricted to a model space, but it needs a complex network structure and a lot of time to predict the voxel data. Different from the above methods, our framework is model-free and light-weighted; it can run in real time and directly obtains the full 3D facial geometry along with its correspondence information.
2.2 Face Alignment
In the field of computer vision, face alignment is a long-standing problem which attracts lots of attention. In the beginning, a number of 2D facial alignment approaches aimed at locating a set of fiducial 2D facial landmarks, such as the classic Active Appearance Model (AAM) [43, 52, 58] and Constrained Local Models (CLM) [34, 1]. Then cascaded regression [14, 60] and CNN-based methods [38, 46, 9] were widely used to achieve state-of-the-art performance in 2D landmark localization. However, 2D landmark localization only regresses visible points on faces, which is limited in describing the face shape when the pose is large. Recent works then study 3D facial alignment, which begins with fitting a 3DMM [44, 67, 18] or registering a 3D facial template [51, 5] with a 2D facial image. Obviously, model-based 3D reconstruction methods can easily complete the task of 3D face alignment. Actually, [67, 63, 31] are methods specially designed to achieve 3D face alignment by means of 3DMM fitting. Recently, [8, 9] use a deep network to directly predict heat maps to obtain the 3D facial landmarks and achieve state-of-the-art performance. Thus, as sparse face alignment is handled well by the aforementioned methods, the task of dense face alignment begins to develop. Notice that dense face alignment means the methods should offer the correspondence between two face images as well as between a 2D facial image and a 3D facial reference geometry. [40] uses multiple constraints to train a CNN which estimates the 3DMM parameters and then provides a very dense 3D alignment. [22, 63] directly learn the correspondence between a 2D input image and a 3D template via a deep network, but the predicted correspondence is not complete: only the visible face region is considered. Compared to prior works, our method can directly establish the dense correspondence of all regions once the position map is regressed. No intermediate parameters such as 3DMM coefficients and TPS warping parameters are needed, which means our network can run very fast.
3 Proposed Method
This section describes the framework and the details of our proposed method. First, we introduce the characteristics of the position map used as our representation. Then we elaborate the CNN architecture and the loss function designed specially for learning the mapping from an unconstrained RGB image to its 3D structure. The implementation details of our method are given in the last subsection.
3.1 3D Face Representation
Our goal is to regress the 3D facial geometry and its dense correspondence information from a single 2D image. Thus we need a proper representation which can be directly predicted via a deep network. One simple and commonly used idea is to concatenate the coordinates of all points of the 3D face into a vector and use a network to predict it. However, this projection from 3D space into a 1D vector discards the spatial adjacency information among points and so increases the difficulty of training deep neural networks. Spatially adjacent points could share weights when predicting their positions, which is easily achieved with convolutional layers, whereas coordinates stored as a 1D vector need a fully connected layer to predict each point with far more parameters, which increases the network size and makes it hard to train. [16] proposed a point set generation network to directly predict the point cloud of a 3D object as a vector from a single image. However, the maximum number of points is only 1024, far from enough to represent an accurate 3D face. So model-based methods [67, 15, 40] regress a few model parameters rather than the coordinates of points, which usually needs special care in training, such as using the Mahalanobis distance, and inevitably limits the estimated face geometry to their model space. [28] proposed a 3D binary volume as the representation of the 3D structure and uses a Volumetric Regression Network (VRN) to output a 192 × 192 × 200 volume as the discretized version of the point cloud. With this representation, VRN can be built with fully convolutional layers. However, discretization limits the resolution of the point cloud, and most of the network output corresponds to non-surface points which are of little use.
To address the problems of previous works, we propose the UV position map as the representation of the full 3D facial structure with alignment information. The UV position map, or position map for short, is a 2D image recording the 3D positions of all points in UV space. In the past years, UV space, or UV coordinates, which is a 2D image plane parameterized from the 3D surface, has been utilized as a way to express information including the texture of faces (texture map) [3, 13, 45, 61], 2.5D geometry (height map) [41, 42], 3D geometry (geometry image) [21, 54] and the correspondences between 3D facial meshes. Different from previous works, we use UV space to store the 3D positions of points from the 3D face model aligned with the corresponding 2D facial image. As shown in Figure 2, we assume the projection from the 3D model to the 2D image is a weak perspective projection and define the 3D facial position in a left-handed Cartesian coordinate system. The origin of the 3D space overlaps with the upper-left corner of the input image, with the positive x-axis pointing to the right of the image and the minimum z at the origin. The ground truth 3D facial shape exactly matches the face in the 2D image when projected onto the x-y plane. Thus the position map can be expressed as Pos(u_i, v_i) = (x_i, y_i, z_i), where (u_i, v_i) is the UV coordinate of the i-th point on the face surface and (x_i, y_i, z_i) is the corresponding 3D position of the facial structure, with (x_i, y_i) being the corresponding 2D position of the face in the input RGB image and z_i the depth of this point. Note that (u_i, v_i) and (x_i, y_i) refer to the same position on the face, so the alignment information is preserved. Our position map can be easily comprehended as replacing the r, g, b values in the texture map with the x, y, z coordinates.
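To make the representation concrete, the following sketch shows how both the reconstruction and the alignment results can be read off a regressed position map (hypothetical file names and index table; only the map layout itself follows the text):

```python
import numpy as np

# Hypothetical inputs: a regressed position map of shape (256, 256, 3) and a
# binary mask marking which UV pixels belong to the face surface.
pos_map = np.load("predicted_position_map.npy")    # (256, 256, 3): x, y in image pixels, z is depth
face_mask = np.load("uv_face_mask.npy") > 0        # (256, 256) boolean mask of valid UV pixels

# 3D reconstruction: the dense point cloud is simply the set of valid pixels.
point_cloud = pos_map[face_mask]                   # (N, 3) vertices of the full face

# Dense alignment: the x, y channels are already 2D image coordinates.
dense_2d = point_cloud[:, :2]

# Sparse alignment: the 68 keypoints sit at fixed UV locations, so they can be
# indexed directly ("uv_kpt_ind.txt" is a hypothetical (68, 2) table of UV indices).
uv_kpt = np.loadtxt("uv_kpt_ind.txt", dtype=int)
landmarks_3d = pos_map[uv_kpt[:, 1], uv_kpt[:, 0]]  # (68, 3); drop z for 2D alignment
```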
Fig. 2: The illustration of the UV position map. Left: 3D plot of the input image and its corresponding aligned 3D point cloud (as ground truth). Right: the first row is the input 2D image, the extracted UV texture map and the corresponding UV position map; the second row is the x, y, z channels of the UV position map.
Thus our position map records a dense set of points of the 3D face together with their semantic meaning, and we are able to simultaneously obtain the 3D facial structure and the dense alignment result by using a CNN to regress the position map directly from unconstrained 2D images. The network architecture in our method can be greatly simplified thanks to this convenience. Notice that the position map contains the information of the whole face, which makes it different from other 2D representations such as the Projected Normalized Coordinate Code (PNCC) [67, 48], an ordinary depth image [53] or quantized UV coordinates [22], which only preserve the information of the visible face region in the input image. Our proposed position map also infers the invisible parts of the face, thus our method can predict a complete 3D face.
Since we want to regress the full 3D structure directly from a 2D image, unconstrained 2D facial images and their corresponding 3D shapes are needed for end-to-end training. 300W-LP [67] is a large dataset that contains more than 60K unconstrained images with fitted 3DMM parameters, which makes it suitable for forming our training pairs. Besides, the 3DMM parameters of this dataset are based on the Basel Face Model (BFM) [6]. Thus, in order to make full use of this dataset, we construct the UV coordinates corresponding to BFM. To be specific, we use the parameterized UV coordinates from [3], which computes a Tutte embedding [17] with conformal Laplacian weights and then maps the mesh boundary to a square. Since the number of vertices in BFM is more than 50K, we choose 256 as the position map size, which gives a high-precision point cloud with negligible resampling error.
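The mapping from mesh vertices to position-map pixels can be pictured with a small sketch. This is a simplification under assumed inputs: the actual ground truth is produced by rendering the mesh into UV space, whereas this sketch only scatters each vertex to its nearest UV pixel.

```python
import numpy as np

def build_position_map(vertices, uv_coords, map_size=256):
    """Scatter image-aligned 3D vertex positions into a UV position map.

    vertices:  (N, 3) positions; x, y in input-image pixels, z is depth
    uv_coords: (N, 2) per-vertex UV coordinates in [0, 1], e.g. the Tutte
               embedding of the BFM mesh (assumed to be given)
    """
    pos_map = np.zeros((map_size, map_size, 3), dtype=np.float32)
    # Nearest-pixel scatter of each vertex into the UV plane; the real ground
    # truth rasterizes whole triangles instead, which fills the map densely.
    cols = np.clip(np.round(uv_coords[:, 0] * (map_size - 1)).astype(int), 0, map_size - 1)
    rows = np.clip(np.round(uv_coords[:, 1] * (map_size - 1)).astype(int), 0, map_size - 1)
    pos_map[rows, cols] = vertices
    return pos_map
```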
3.2 Network Architecture and Loss Function
Fig. 3: The architecture of PRN. The green rectangles represent the residual blocks, and the blue ones represent the transposed convolutional layers.
Since our network transfers the input RGB image into a position map image, we employ an encoder-decoder structure to learn the transfer function. The encoder part of our network begins with one convolution layer followed by 10 residual blocks [2], which reduce the 256 × 256 × 3 input image into 8 × 8 × 512 feature maps; the decoder part contains 17 transposed convolution layers to generate the predicted 256 × 256 × 3 position map. We use a kernel size of 4 for all convolution and transposed convolution layers, and use ReLU for activation. Given that the position map contains both the full 3D information and the dense alignment result, we do not need an extra network module for multi-task learning during training or inference. The architecture of our network is shown in Figure 3.
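A rough sketch of such an hourglass is given below (hypothetical PyTorch code, not the authors' implementation). The channel schedule, the residual block layout and the substitution of plain same-padding convolutions for the stride-1 decoder layers are all assumptions; only the 256×256×3 → 8×8×512 → 256×256×3 shape, the kernel size of 4 and the ReLU activations follow the text.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, stride):
    # kernel size 4 everywhere, as stated in the text
    return nn.Conv2d(c_in, c_out, 4, stride, padding=1 if stride == 2 else "same")

class ResBlock(nn.Module):
    # A generic residual block; the authors' exact block design is not specified here.
    def __init__(self, c_in, c_out, stride):
        super().__init__()
        self.conv1 = conv(c_in, c_out, stride)
        self.conv2 = conv(c_out, c_out, 1)
        self.skip = nn.Conv2d(c_in, c_out, 1, stride)   # 1x1 projection on the shortcut
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.skip(x))

def up(c_in, c_out):    # stride-2 transposed convolution: doubles the resolution
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, 2, 1), nn.ReLU(inplace=True))

def same(c_in, c_out):  # stride-1 refinement layer (stand-in for the paper's stride-1 layers)
    return nn.Sequential(conv(c_in, c_out, 1), nn.ReLU(inplace=True))

class PRNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [16, 32, 64, 128, 256, 512]              # assumed channel schedule
        enc = [conv(3, 16, 1)]                           # the single first convolution layer
        for c_in, c_out in zip(chans[:-1], chans[1:]):   # 10 residual blocks in total
            enc += [ResBlock(c_in, c_out, 2), ResBlock(c_out, c_out, 1)]
        self.encoder = nn.Sequential(*enc)               # 256x256x3 -> 8x8x512
        self.decoder = nn.Sequential(                    # 17 layers: 8x8x512 -> 256x256x3
            up(512, 256), same(256, 256), same(256, 256),
            up(256, 128), same(128, 128), same(128, 128),
            up(128, 64),  same(64, 64),   same(64, 64),
            up(64, 32),   same(32, 32),   same(32, 32),
            up(32, 16),   same(16, 16),
            same(16, 3),  same(3, 3),
            nn.Conv2d(3, 3, 4, 1, padding="same"),       # final regression layer, no activation
        )

    def forward(self, x):                                # x: (B, 3, 256, 256)
        return self.decoder(self.encoder(x))             # (B, 3, 256, 256) position map
```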
In order to learn the parameters of the network, we build a loss function to measure the difference between the ground truth position map and the network output. Mean square error (MSE) is a commonly used loss for such learning tasks, such as in [63, 12]. However, MSE treats all points equally, so it is not entirely appropriate for learning the position map. Since the central region of the face has more discriminative features than other regions, we employ a weight mask to form our loss function. As shown in Figure 4, the weight mask is a gray image recording the weight of each point on the position map. It has the same size as, and pixel-to-pixel correspondence with, the position map. According to our objective, we separate the points into four categories, each with its own weight in the loss function. The positions of the 68 facial keypoints have the highest weight, to ensure that the network learns accurate locations of these points. The neck region usually attracts less attention, and is often occluded by hair or clothes in unconstrained images. Since learning the 3D shape of the neck or clothes is beyond our interest, we assign 0 weight to points in the neck region to reduce disturbance during training.
Fig. 4: The illustration of the weight mask. From left to right: UV texture map, UV position map, colored texture map with segmentation information (blue for eye region, red for nose region, green for mouth region and purple for neck region), and the final weight mask.
Thus, we denote the predicted position map by Pos(u, v), with (u, v) ranging over the pixel coordinates. Given the ground truth position map \widetilde{Pos}(u, v) and the weight mask W(u, v), our loss function is defined as

Loss = \sum \left\| Pos(u, v) - \widetilde{Pos}(u, v) \right\| \cdot W(u, v)
Specifically, we use the following weight ratio in our experiments: subregion1 (68 facial landmarks) : subregion2 (eye, nose, mouth) : subregion3 (other face area) : subregion4 (neck) = 16 : 4 : 3 : 0. The final weight mask is shown in Figure 4.
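A minimal sketch of this weighted loss (plain NumPy, assuming a precomputed gray weight-mask image whose values already encode the 16 : 4 : 3 : 0 ratio):

```python
import numpy as np

def weighted_position_loss(pred_pos, gt_pos, weight_mask):
    """pred_pos, gt_pos: (256, 256, 3) position maps.
    weight_mask: (256, 256) gray image whose values follow the 16:4:3:0 ratio."""
    per_pixel = np.linalg.norm(pred_pos - gt_pos, axis=-1)   # ||Pos - ~Pos|| at every UV pixel
    return np.sum(per_pixel * weight_mask)
```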
3.3 Training Details
As described above, we choose 300W-LP [67] to form our training set, since it contains face images across different angles annotated with estimated 3DMM coefficients, from which the 3D point cloud can easily be generated. Specifically, we crop the images according to the ground truth bounding box and rescale them to size 256 × 256. We then utilize the annotated 3DMM parameters to generate the corresponding 3D positions and render them into UV space to obtain the ground truth position map. The map size in our training is also 256 × 256, which means regressing a point cloud with a precision of more than 45K points. Notice that although our training data is generated from 3DMM, our network's output, the position map, is not restricted to any face template or to the linear space of 3DMM.

We perturb the training set by randomly rotating and translating the target face in the 2D image plane. To be specific, the rotation ranges from -45 to 45 degrees, the translation is random within 10 percent of the input size, and the scale ranges from 0.9 to 1.2. Like [28], we also augment our training data by scaling the color channels with a scale range from 0.6 to 1.4. In order to handle images with occlusions, we synthesize occlusions by adding noise textures to raw images, similar to the work of [50, 63]. With all the above augmentation operations, our training data covers all the difficult cases. We use the network described in Section 3 to train our model. For optimization, we use the Adam optimizer with a learning rate that begins at 0.0001 and decays by half after every 5 epochs. The batch size is set to 16.
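The augmentation ranges and optimizer schedule above can be summarized in a small sketch (hypothetical helpers; the authors' actual sampling and cropping code is not given in the text):

```python
import numpy as np

def sample_augmentation(rng=np.random):
    """Draw one set of perturbation parameters matching the ranges in the text."""
    return {
        "rotation_deg": rng.uniform(-45, 45),       # in-plane rotation of the target face
        "translation": rng.uniform(-0.1, 0.1, 2),   # fraction of the 256-pixel input size
        "scale": rng.uniform(0.9, 1.2),
        "color_scale": rng.uniform(0.6, 1.4, 3),    # per-channel color scaling, like [28]
    }

def learning_rate(epoch, base_lr=1e-4):
    # Adam schedule from the text: start at 0.0001, halve every 5 epochs (batch size 16).
    return base_lr * (0.5 ** (epoch // 5))
```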
4 Experimental results
In this part, we evaluate the performance of our proposed method on the tasks of 3D face alignment and 3D face reconstruction. We first introduce the test datasets used in our experiments in Section 4.1. Then in Sections 4.2 and 4.3 we compare our results with other methods in both quantitative and qualitative ways. We then compare the runtime of our method with other methods in Section 4.4. Finally, an ablation study is conducted in Section 4.5 to evaluate the effect of the weight mask in our method.
4.1 Test Dataset
To evaluate our performance on the tasks of dense alignment and 3D face reconstruction, the multiple test datasets listed below are used in our experiments.
AFLW2000-3D is constructed by [67] to evaluate 3D face alignment on challenging unconstrained images. This database contains the first 2000 images from AFLW [35] and extends its annotations with fitted 3DMM parameters and 68 3D landmarks. We use this database to evaluate the performance of our method on both the face reconstruction and face alignment tasks.
AFLW-LFPA is another extension of the AFLW dataset constructed by [32]. By picking images from AFLW according to their poses, the authors construct this dataset, which contains 1299 test images with a balanced distribution of yaw angles. Besides, each image is annotated with 13 additional landmarks as an expansion of the 21 visible landmarks in AFLW. This database is evaluated on the task of 3D face alignment. We use the 34 visible landmarks as the ground truth to measure the accuracy of our results.
Florence is a 3D face dataset that contains 53 subjects with ground truth 3D meshes acquired from a structured-light scanning system [2]. In the experiments, each subject generates renderings with different poses, following [28]: a pitch of -15, 20 and 25 degrees and spaced rotations between -80 and 80 degrees. We compare the face reconstruction performance of our method against the very recent state-of-the-art methods VRN-Guided [28] and 3DDFA [67] on this dataset.
4.2 3D Face Alignment
To evaluate the face alignment performance, we employ the Normalized Mean Error (NME) as the evaluation metric, with the bounding box size used as the normalization factor. Firstly, we evaluate our method on a sparse set of 68 facial landmarks and compare our results with 3DDFA [67], DeFA [40] and 3D-FAN [9] on the AFLW2000-3D dataset. As shown in Figure 5, our result slightly outperforms the state-of-the-art method 3D-FAN when computing the per-point distance with 2D coordinates. When the depth value is also considered, the performance gap between our method and 3D-FAN increases. Notice that 3D-FAN needs another network to predict the z coordinate of the landmarks, while the depth value can be obtained directly in our method.
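For reference, a minimal NME computation might look as follows (a sketch assuming the normalization factor is the square root of the bounding-box area, a common convention for this benchmark that the text does not spell out):

```python
import numpy as np

def nme(pred_pts, gt_pts, bbox_w, bbox_h):
    """Normalized Mean Error over a set of landmarks.

    pred_pts, gt_pts: (68, 2) or (68, 3) landmark arrays
    bbox_w, bbox_h:   size of the ground-truth bounding box used for normalization
    """
    per_point = np.linalg.norm(pred_pts - gt_pts, axis=-1)
    return per_point.mean() / np.sqrt(bbox_w * bbox_h)   # assumed sqrt(w*h) normalization
```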
Fig. 5: Cumulative Errors Distribution (CED) curves on AFLW2000-3D. Evaluation is performed on 68 landmarks with both the 2D (left) and 3D (right) coordinates. All 2000 images from the AFLW2000-3D dataset are used here. The mean NME (%) of each method is also shown in the legend.
To further investigate the performance of our method across poses and datasets, we also report the NME for small, medium and large yaw angles on the AFLW2000-3D dataset and the mean NME on both the AFLW2000-3D and AFLW-LFPA datasets. Table 1 shows the results; note that the numerical values are taken from the published papers. Following [67], we also randomly select 696 faces from AFLW2000 to balance the distribution. The results show that our method is robust to changes of pose and dataset. Although all the state-of-the-art methods of 3D face alignment evaluate on the AFLW2000-3D dataset, its ground truth is still controversial [63, 9] due to its annotation pipeline, which is based on the Landmarks Marching method [68]. Thus, we visualize some results in Figure 6 that have an NME larger than 6.5%, and we find that our results are more accurate than the ground truth in some cases. We also compare our dense alignment results…
Table 1: Performance comparison on AFLW2000-3D (68 landmarks) and AFLW-LFPA (34 visible landmarks). The NME (%) for faces with different yaw angles is reported. The best result in each category is highlighted in bold; lower is better.

Method          | AFLW2000-3D 0 to 30 | 30 to 60 | 60 to 90 | Mean | AFLW-LFPA Mean
SDM [60]        | 3.67 | 4.94 | 9.76 | 6.12 | -
3DDFA [67]      | 3.78 | 4.54 | 7.93 | 5.42 | -
3DDFA+SDM [67]  | 3.43 | 4.24 | 7.17 | 4.94 | -
PAWF [32]       | -    | -    | -    | -    | 4.72
Yu et al. [63]  | 3.62 | 6.06 | 9.56 | -    | -
3DSTN [4]       | 3.15 | 4.33 | 5.98 | 4.49 | -
DeFA [40]       | -    | -    | -    | 4.50 | 3.86
PRN (ours)      | 2.75 | 3.51 | 4.61 | 3.62 | 2.93