File name: ALBERT_presentation.pdf
Development tool:
File size: 2 MB
Downloads: 0
Upload date: 2019-10-20
Detailed description: presentation slides (PPT) by the authors of "ALBERT: A Lite BERT for Language Understanding".

We are witnessing a big shift in the approach to natural language understanding over the last two years.
Full-network pre-training shares most of the parameters between pre-training and fine-tuning (a minimal sketch of this sharing follows the citations below).
[Figure: BERT's pre-training and fine-tuning procedure. Pre-training: Masked LM and NSP heads over an unlabeled sentence A and B pair ([CLS] Masked sentence A [SEP] Masked sentence B). Fine-tuning: the same network, initialized from the pre-trained parameters, applied to downstream tasks such as MNLI, NER, and SQuAD (question/paragraph input, start/end span output).]
Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-tuning for Text Classification." 2018.
Radford, Alec, et al. "Improving Language Understanding by Generative Pre-Training." 2018.
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." 2018.
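As a minimal sketch of that parameter sharing, assuming PyTorch: the same encoder parameters serve both the pre-training head and a later fine-tuning head. The names TinyEncoder, mlm_head, and task_head are illustrative stand-ins, not the actual BERT implementation.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the shared Transformer encoder (illustrative, not BERT itself)."""
    def __init__(self, vocab_size=30522, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)

    def forward(self, token_ids):
        return self.layer(self.embed(token_ids))

encoder = TinyEncoder()

# Pre-training: the encoder plus an MLM head are trained on unlabeled text.
mlm_head = nn.Linear(128, 30522)             # predicts the identity of masked tokens
# ... pre-training loop would go here ...

# Fine-tuning: the *same* encoder parameters, with a new task head (e.g. MNLI, 3 classes).
task_head = nn.Linear(128, 3)
tokens = torch.randint(0, 30522, (2, 16))    # dummy batch of token ids
logits = task_head(encoder(tokens)[:, 0])    # classify from the first ([CLS]) position
print(logits.shape)                          # torch.Size([2, 3])
```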
Full-network pre-training is the AlexNet of NLU.
[Figure: improvements on ImageNet over the years (2010-2015) next to improvements on RACE in the last two years: Gated AR (Oct 2017), GPT (Jun 2018), BERT (Nov 2018), XLNet (Jun 2019), RoBERTa (Jul 2019), with Turker and human performance shown as reference lines. RACE: English reading comprehension exam for Chinese middle/high school students.]
Can we improve full-network pre-training models, similar to what the computer vision community did for AlexNet?
BERT uses two self-supervised losses. Masked LM: recover the randomly masked tokens. NSP: predict whether the two input sentences are next to each other or not.
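As a hedged illustration of the Masked LM objective (a simplification: BERT's actual procedure also replaces some selected positions with random tokens or leaves them unchanged), here is a minimal Python sketch of masking roughly 15% of the tokens and keeping the originals as prediction targets:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens for a Masked-LM-style objective (simplified sketch)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)     # target: recover the original token
        else:
            masked.append(tok)
            labels.append(None)    # no loss at unmasked positions
    return masked, labels

print(mask_tokens("we are witnessing a big shift in natural language understanding".split()))
```

The NSP loss, by contrast, is a binary classifier on the [CLS] representation that predicts whether sentence B actually followed sentence A in the corpus.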
[Figure: the BERT pre-training architecture. Token embeddings for the unlabeled sentence A and B pair plus positional encodings feed a stack of L Transformer encoder blocks (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm); Masked LM and NSP heads sit on top of the output representations.]
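A minimal sketch of one such encoder block, assuming PyTorch and the post-norm layout drawn in the figure (the class name EncoderBlock and the 4x feed-forward width are assumptions for illustration):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: Multi-Head Attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, hidden=768, heads=12, ffn=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # self-attention over the sequence
        x = self.norm1(x + attn_out)         # Add & Norm
        x = self.norm2(x + self.ffn(x))      # Feed Forward, Add & Norm
        return x

x = torch.randn(2, 16, 768)                  # (batch, sequence, hidden)
print(EncoderBlock()(x).shape)               # torch.Size([2, 16, 768])
```

BERT-base stacks 12 of these blocks with hidden size 768; BERT-large stacks 24 with hidden size 1024.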
What happens after AlexNet?
Revolution of depth
[Figure: ImageNet classification top-5 error (%) by year and model depth: ILSVRC'10 (shallow) 28.2, ILSVRC'11 25.8, ILSVRC'12 AlexNet (8 layers) 16.4, ILSVRC'13 (8 layers) 11.7, ILSVRC'14 VGG (19 layers) 7.3, ILSVRC'14 GoogLeNet (22 layers) 6.7, ILSVRC'15 ResNet (152 layers) 3.57. Slide credit: Kaiming He.]
For BERT, increasing the depth (L) and width (H) of the network leads to better performance:
Hyperparameters (#L layers, #H hidden size, #A attention heads), masked-LM perplexity, and dev-set accuracy:

#L    #H    #A    LM (ppl)    MNLI-m    MRPC    SST-2
 3    768   12    5.84        77.9      79.8    88.4
 6    768    3    5.24        80.6      82.2    90.7
 6    768   12    4.68        81.9      84.8    91.3
12    768   12    3.99        84.4      86.7    92.9
12   1024   16    3.54        85.7      86.9    93.3
24   1024   16    3.23        86.6      87.8    93.7
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." 2018.
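To connect the #L/#H columns to model size, here is a back-of-the-envelope parameter count. This is an assumption-laden sketch: each layer is taken to contribute roughly 12*H^2 weights (4*H^2 for the attention projections, 8*H^2 for a feed-forward with 4*H intermediate size) plus a V*H token-embedding matrix; biases, LayerNorm, and the pooler are ignored.

```python
def approx_bert_params(num_layers, hidden, vocab=30522):
    """Rough parameter count: embeddings + ~12*H^2 per Transformer layer."""
    per_layer = 12 * hidden ** 2          # 4*H^2 attention + 8*H^2 feed-forward
    return vocab * hidden + num_layers * per_layer

for L, H in [(12, 768), (24, 1024)]:      # the last two rows of the table
    print(f"#L={L:2d} #H={H:4d} -> ~{approx_bert_params(L, H) / 1e6:.0f}M parameters")
# ~108M and ~333M, in line with the ~110M (BERT-base) and ~340M (BERT-large) usually quoted.
```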
Is having a better NLU model as easy as increasing the model size?