Siamese Recurrent Architectures for Learning Sente

File name: Siamese Recurrent Architectures for Learning Sentence Similarity.pdf

Category: Deep learning

Development tool:

File size: 1 MB

Downloads: 0

Upload date: 2019-10-14

Provider: woleg*****


Description: Achieves good results with the simplest model and the simplest feature engineering, aiming for the best possible cost-effectiveness. If needed, the model can be modified and further feature engineering added on top of it to improve performance.

[...]ture for face verification developed by Chopra, Hadsell, and LeCun (2005), which utilizes symmetric ConvNets where we use LSTMs. Siamese neural networks have been proposed for a number of metric learning tasks (Yih et al. 2011; Chen and Salman 2011), but to our knowledge, recurrent connections remain largely unexplored in this context.

Manhattan LSTM model

The proposed Manhattan LSTM (MaLSTM) model is outlined in Figure 1. There are two networks $\mathrm{LSTM}_a$ and $\mathrm{LSTM}_b$ which each process one of the sentences in a given pair, but we solely focus on siamese architectures with tied weights such that $\mathrm{LSTM}_a = \mathrm{LSTM}_b$ in this work. Nevertheless, the general untied version of this model may be more useful for applications with asymmetric domains such as information retrieval (where search queries are stylistically distinct from stored documents).

Figure 1: Our model uses an LSTM to read in word-vectors representing each input sentence and employs its final hidden state as a vector representation for each sentence. Subsequently, the similarity between these representations is used as a predictor of semantic similarity.

The LSTM learns a mapping from the space of variable-length sequences of $d_{in}$-dimensional vectors into $\mathbb{R}^{d_{rep}}$ ($d_{in} = 300$, $d_{rep} = 50$ in this work). More concretely, each sentence (represented as a sequence of word vectors $x_1, \ldots, x_T$) is passed to the LSTM, which updates its hidden state at each sequence-index via equations (2)-(7). The final representation of the sentence is encoded by $h_T \in \mathbb{R}^{d_{rep}}$, the last hidden state of the model. For a given pair of sentences, our approach applies a pre-defined similarity function $g: \mathbb{R}^{d_{rep}} \times \mathbb{R}^{d_{rep}} \to \mathbb{R}$ to their LSTM-representations. Similarities in the representation space are subsequently used to infer the sentences' underlying semantic similarity.

Note that unlike typical language modeling RNNs, which are used to predict the next word given the previous text, our LSTMs simply function like the encoder of Sutskever, Vinyals, and Le (2014). Thus, the sole error signal backpropagated during training stems from the similarity between sentence representations $h^{(a)}_{T_a}$, $h^{(b)}_{T_b}$, and how this predicted similarity deviates from the human annotated ground truth relatedness. We restrict ourselves to the simple similarity function

$g(h^{(a)}_{T_a}, h^{(b)}_{T_b}) = \exp(-\| h^{(a)}_{T_a} - h^{(b)}_{T_b} \|_1) \in [0, 1]$.

This forces the LSTM to entirely capture the semantic differences during training, rather than supplementing the RNN with a more complex learner that can help resolve shortcomings in the learned representations as done by Kiros et al. (2015) and Tai, Socher, and Manning (2015).

As Chopra, Hadsell, and LeCun (2005) point out, using an $\ell_2$ rather than $\ell_1$ norm in the similarity function can lead to undesirable plateaus in the overall objective function. This is because during early stages of training, an $\ell_2$-based model is unable to correct errors where it erroneously believes semantically different sentences to be nearly identical due to vanishing gradients of the Euclidean distance. Empirically, our results are fairly stable across various types of simple similarity function, but we find that $g$ utilizing the Manhattan distance slightly outperforms other reasonable alternatives such as cosine similarity (used in Yih et al. 2011).

Semantic relatedness scoring

The SICK data contains 9927 sentence pairs with a 5,000/4,927 training/test split (Marelli et al. 2014). Each pair is annotated with a relatedness label $\in [1, 5]$ corresponding to the average relatedness judged by 10 different individuals. Although their skip-thoughts RNN is trained on a vast corpus for two weeks, Kiros et al. (2015) point out that it is unable to distinguish between many of the test-set sentences shown in Table 1, highlighting the difficulty of this task.

Sentence pair                                             G     S     M
A little girl is looking at a woman in costume.
A young girl is looking at a woman in costume.            4.7   4.5   4.[...]
A person is performing tricks on a motorcycle.
The performer is tricking a person on a motorcycle.       2.6   4.4   2.9
Someone is pouring ingredients into a pot.
A man is removing vegetables from a pot.                  2.4   3.6   2.5
Nobody is pouring ingredients into a pot.
Someone is pouring ingredients into a pot.                3.5   4.2   3.7

Table 1: Example sentence pairs from the SICK test data. G denotes ground truth relatedness ($\in [1, 5]$), S = skip-thought predictions, and M = MaLSTM predictions.

To enable our model to generalize beyond the limited vocabulary present in the SICK training set, we provide the LSTM with inputs that reflect relationships between words beyond what can be inferred from the small number of training sentences. LSTMs typically require large datasets to achieve good generalization due to their vast numbers of parameters, and we thus augment our dataset with numerous additional training examples, a common practice in SemEval systems (Marelli et al. 2014) as well as high-performing neural networks. Like many top performing semantic similarity systems, our LSTM takes as input word-vectors which have been pre-trained on an external corpus. We use the 300-dimensional word2vec embeddings [1] which Mikolov et al. (2013) demonstrate can capture intricate inter-word relationships such as vec(king) - vec(man) + vec(woman) ≈ vec(queen). We encourage invariance to precise wording and expand our dataset by employing thesaurus-based augmentation in which 10,022 additional training examples are generated by replacing random words with one of their synonyms found in WordNet (Miller 1995). A similar strategy is also successfully adopted by Zhang, Zhao, and LeCun (2015). Unlike the SemEval 2014 submissions, our methods do not require extensive manual feature generation beyond the separately trained word2vec vectors.

[1] Publicly available at: code.google.com/p/word2vec
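The similarity function $g$ from the Manhattan LSTM model section can be sketched in a few lines of NumPy. This is only an illustration: the function name `malstm_similarity` and the toy vectors are hypothetical, and the LSTM that would produce the 50-dimensional final hidden states is not shown.

```python
import numpy as np

def malstm_similarity(h_a, h_b):
    """Return g(h_a, h_b) = exp(-||h_a - h_b||_1), a value in (0, 1]."""
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    # The l1 (Manhattan) distance is the sum of element-wise absolute differences.
    manhattan = np.abs(h_a - h_b).sum()
    return float(np.exp(-manhattan))

# Toy stand-ins for two final hidden states (hypothetical values).
h_a = np.array([0.2, -0.1, 0.4])
h_b = np.array([0.1, -0.1, 0.9])
score = malstm_similarity(h_a, h_b)
```

Identical representations give a score of exactly 1, and the score decays toward 0 as the representations move apart along any hidden unit.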
The MaLSTM predicts relatedness for a given pair of sentences via $g(h^{(a)}_{T_a}, h^{(b)}_{T_b})$, and we train the siamese network using backpropagation-through-time under the mean-squared-error (MSE) loss function (after rescaling the training-set relatedness labels to lie $\in [0, 1]$). SemEval evaluates predicted similarities against the given human annotated similarities on three metrics: Pearson correlation, Spearman correlation, and MSE. Due to the simple construction of our similarity function, the predictions of our model are constrained to follow the $\exp(-x)$ curve and are thus not suited for these evaluation metrics. After training our model, we apply an additional nonparametric regression step to obtain better-calibrated predictions (with respect to MSE). Over the training set, the given labels (under the original [1, 5] scale) are regressed against the univariate MaLSTM $g$-predicted relatedness as the sole covariate, and the fitted regression function is evaluated on the MaLSTM-predicted relatedness of the test pairs to produce adjusted final predictions. We use the classical local-linear estimator discussed in Fan and Gijbels (1992) with bandwidth selected using leave-one-out cross-validation. This calibration step serves as a minor correction for our restrictively simple similarity function (which is necessary to retain interpretability of the sentence representations).

Training details

Our LSTM uses 50-dimensional hidden representations $h_t$ and memory cells $c_t$. Optimization of the parameters is done using the Adadelta method of Zeiler (2012) along with gradient clipping (rescaling gradients whose norm exceeds a threshold) to avoid the exploding gradients problem (Pascanu, Mikolov, and Bengio 2013). We employ early stopping based on a validation set containing 30% of the training examples.

It is well-known that the success of LSTMs depends crucially on their initialization, and often parameters transferred from neural networks trained for a different task can serve as a strong starting point for the optimization (cf. Bengio 2012). We first initialize our LSTM weights with small random Gaussian entries (and a separate large value of 2.5 for the forget gate bias to facilitate modeling of long range dependence). Then, our MaLSTM is (pre)trained as previously described on separate sentence-pair data provided for the earlier SemEval 2013 Semantic Textual Similarity task (Agirre and Cer 2013). The weights resulting from this pre-training thus form our starting point for the SICK data, which is markedly superior to a random initialization.

Results

The MaLSTM is able to accurately score the Table 1 examples which Kiros et al. highlight as difficult for their skip-thoughts model. Despite being calibrated for MSE, our approach performs better than existing systems for the semantic relatedness task across all three evaluation metrics (see Table 2). Note that because all results shown in Table 2 rely on additional feature generation (e.g. dependency parses) or data augmentation schemes, this is only an evaluation of complete relatedness-scoring systems rather than a fair comparison of the different learning algorithms employed. Nonetheless, we perform ablation experiments to better understand our methods, finding that the Pearson correlation (the primary SemEval performance metric) of our approach worsens by: 0.01 without regression calibration, 0.02 without pre-training, and 0.04 without synonym augmentation. Due to the limited available training data, we do not realize performance gains by switching to multi-layer or bidirectional LSTMs.

Method                                                  r        ρ        MSE
Illinois-LH (Lai and Hockenmaier 2014)                  0.7993   0.7538   0.3692
UNAL-NLP (Jimenez et al. 2014)                          0.8070   0.7489   0.3550
Meaning Factory (Bjerva et al. 2014)                    0.8268   0.7721   0.3224
ECNU (Zhao, Zhu, and Lan 2014)                          0.8414   -        -

Skip-thought+COCO (Kiros et al. 2015)                   0.8655   0.7995   0.2561
Dependency Tree-LSTM (Tai, Socher, and Manning 2015)    0.8676   0.8083   0.2532
ConvNet (He, Gimpel, and Lin 2015)                      0.8686   0.8047   0.2606
MaLSTM                                                  0.8822   0.8345   0.2286

Table 2: Test set Pearson correlation ($r$), Spearman's $\rho$, and mean squared error for the SICK semantic textual similarity task. The first group of results are top SemEval 2014 submissions and the second group are recent neural network methods (best result from each paper shown).

In Table 3, Tai, Socher, and Manning show the most similar test-set examples found by their Tree-LSTM for three given sentences as well as its inferred similarity scores. We apply our model to these same examples, determining that while the sequential MaLSTM is slightly worse at identifying active-passive equivalence, our approach is better at distinguishing verbs and objects than the compositional Tree-LSTM, which often infers seemingly over-estimated relatedness scores in Table 3. For example, the ground truth labeling between "Tofu is being sliced by a woman" and "A woman is slicing butter" is only 2.7 in the SICK test set (and substituting "potatoes" for "butter" should not greatly increase relatedness between the two statements).
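The calibration step described under the MSE training paragraph (a local-linear regression of the [1, 5] labels on the predicted relatedness, with the bandwidth chosen by leave-one-out cross-validation) might look roughly like the sketch below. The Gaussian kernel, the function names, and the candidate bandwidths are assumptions made for illustration; the text only specifies the local-linear estimator of Fan and Gijbels (1992) and leave-one-out selection.

```python
import numpy as np

def local_linear_fit(x_train, y_train, x_query, bandwidth):
    """Local-linear regression estimate at each query point (Gaussian kernel assumed)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x0 in np.atleast_1d(np.asarray(x_query, dtype=float)):
        d = x_train - x0
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        s0, s1, s2 = w.sum(), (w * d).sum(), (w * d * d).sum()
        t0, t1 = (w * y_train).sum(), (w * d * y_train).sum()
        denom = s0 * s2 - s1 ** 2
        if denom > 1e-12:
            # Intercept of the weighted least-squares line centred at x0.
            preds.append((s2 * t0 - s1 * t1) / denom)
        elif s0 > 0:
            preds.append(t0 / s0)  # degenerate case: fall back to a weighted mean
        else:
            preds.append(y_train.mean())
    return np.array(preds)

def loo_bandwidth(x, y, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(x))
    best_h, best_err = None, np.inf
    for h in candidates:
        err = 0.0
        for i in idx:
            mask = idx != i
            pred = local_linear_fit(x[mask], y[mask], x[i], h)[0]
            err += (pred - y[i]) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h

# Toy data: predicted similarity in [0, 1] against labels on the [1, 5] scale.
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 4.0 * x
h = loo_bandwidth(x, y, [0.1, 0.3])
calibrated = local_linear_fit(x, y, [0.37], h)
```

In use, `x` would hold the MaLSTM-predicted relatedness of the training pairs and the fitted function would then be evaluated on the predictions for the test pairs.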
Ranking by Dependency Tree-LSTM model                                       Tree      M
a woman is slicing potatoes
  a woman is cutting potatoes                                               4.82      4.87
  potatoes are being sliced by a woman                                      4.70      4.38
  tofu is being sliced by a woman                                           4.39      3.51
a boy is waving at some young runners from the ocean
  a group of men is playing with a ball on the [...]                        3.79      3.13
  a young boy wearing a red swimsuit is jumping out of a blue kiddies pool  3.37      3.48
  the man is tossing a kid into the swimming pool that is near the ocean    3.19      2.26
two men are playing guitar
  the man is singing and playing the guitar                                 4.0[...]  [...]
  the man is opening the guitar for donations and plays with the case       4.01      2.30
  two men are dancing and singing in front of a crowd                       4.00      2.33

Table 3: Most similar sentences (from 1000-sentence subsample) in the SICK test data according to the Tree-LSTM. Tree / M denote relatedness (with the sentence preceding each group) predicted by the Tree-LSTM / MaLSTM.

Sentence representations

We now investigate the geometry of the sentence representation-space learned by the MaLSTM network. As the $\ell_1$ metric is the sum of element-wise differences, we hypothesize that by using specific hidden units (i.e. dimensions of the sentence representation) to encode particular characteristics of a sentence, the trained MaLSTM infers semantic similarity between sentences by simply aggregating their differences in various characteristics.

Some examples supporting this idea are shown in Figure 2, which depicts the values that particular sentences take along specific dimensions of $h_T$. It is evident that the hidden unit shown at the top has learned to detect negation, separating sentences containing words like "no" or "not" from the rest, regardless of the other content in the text. The hidden unit in the middle plot is particularly sensitive to categorization of the direct objects, separating sentences describing actions on balls, grass, cosmetics, and vegetables. The hidden unit depicted at the bottom of Figure 2 clearly separates sentences based on their subject, imposing an interesting ordering that reflects broader similarity between the subject categories: cats, general animals, dogs, boys, general people (someone), and men. Unlike the ConvNet of He, Gimpel, and Lin, which measures similarity explicitly across multiple scales and locations in the sentence, the delineation of these underlying characteristics emerges naturally in the MaLSTM representations, which are guided solely by the $\ell_1$ metric and overall semantic similarity labels.

Figure 2: MaLSTM representations of test set sentences depicted along three different dimensions of $h_T$ (indices 1, 2, [...]). Each number above the axis corresponds to a sentence representation and its location represents the value this particular hidden unit assigns to the sentence (shown below).

Top panel: 1 There is no man pointing at a car. 2 The woman is not playing the flute. 3 The man is not riding a horse. 4 A man is pointing at a silver sedan. 5 The woman is playing the flute. 6 A man is riding a horse.

Middle panel: 1 Two kids are bouncing on colorful balls. 2 Two children are bouncing on colorful balls. 3 The golden dog is running through a field of tall grass. 4 A brown dog is running through tall green grass. 5 A woman is putting on makeup carefully. 6 A woman is carefully removing her makeup. 7 A woman is applying cosmetics to her eyelid. 8 A woman is carefully applying cosmetics to her eyelid. 9 There is no woman cutting potatoes. 10 A woman is slicing carrots.

Bottom panel: 1 The cat is running across the grave. 2 A cat is playing a keyboard. 3 The brown animal is jumping in the air. 4 The animal with big eyes is eating. 5 A dog is bouncing on a trampoline. 6 A dog is running on the ground. 7 A dog is running on the road. 8 Several boys are jumping on a trampoline. 9 A little boy is running on the ground and playing with a little girl. 10 Someone is playing a piano. 11 A man is running on the road. 12 A man is playing an electronic keyboard.

Next, we shift our attention from local characteristics of different hidden units toward the global geometry of the sentence representation space. Due to our training criterion, this space is naturally endowed with the $\ell_1$ metric and avoids being highly warped. While analysis of neural network representations typically requires nonlinear dimensionality reduction like t-SNE (van der Maaten and Hinton 2008), we can simply use principal components analysis (PCA) for informative visualization of the MaLSTM representations due to their simple structure.

Figure 3: MaLSTM representations for all sentences from the SICK test set, projected onto two principal components. Legend: music, culinary themes, animals, water environments, violence.

Figure 3 depicts an overview of the SICK dataset from the perspective of the MaLSTM model (after PCA dimension reduction). For interpretability, we color many of the sentences based on distinct concepts/themes under which they fall. The geometric coherence of the sentences in the representation space exists across numerous categories: from sentences about animals (ranging from cats to lemurs), culinary themes (like slicing vegetables), music (like guitar playing), water environments (e.g. the ocean or swimming pools), etc. In fact, the sentence representations cluster along nearly all additional meaningful semantic categorizations we could come up with (not depicted due to coloring constraints).
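The PCA projection used for Figure 3 can be sketched with NumPy alone. The function name `pca_project` and the random stand-in for the matrix of 50-dimensional sentence representations are hypothetical; the actual representations come from the trained model, which is not reproduced here.

```python
import numpy as np

def pca_project(reps, n_components=2):
    """Project row vectors onto their top principal components via SVD."""
    reps = np.asarray(reps, dtype=float)
    centered = reps - reps.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Random stand-in for sentence representations (100 sentences, 50 dimensions).
rng = np.random.default_rng(0)
fake_reps = rng.normal(size=(100, 50))
coords = pca_project(fake_reps)
```

Each row of `coords` gives the two plotting coordinates of one sentence; coloring the points by theme would then mirror the kind of overview the figure describes.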
One peculiar aspect of this representation space is the low-density region that separates the culinary themed examples from the other sentences. Around this area, there are numerous violence and gun-related sentences in the representations, for example: "A man is fixing a silencer to a gun". We find that these violent texts are likely to receive much lower similarity scores when paired with more mundane sentences typically found in SICK (the average violent-nonviolent pair only has similarity 1.88 compared with an average of 3.41 for all test-set pairs). This explains why the MaLSTM representations have learned to become sparse in the vicinity of these violent examples (depicted in red in Figure 3). Thus, Figure 3 shows that human-determined semantic relatedness heavily depends on the occurrence of such themes.

These discoveries about the SICK dataset are enabled by the interpretability of the MaLSTM representations, unlike the other proposed neural networks which rely on complex operations over their learned representations. In addition to providing model insight, informative representations can provide a useful tool for exploratory data analysis.

Entailment classification

To evaluate the broader utility of our sentence representations, we leverage them for a different application: the SemEval 2014 textual entailment task (Marelli et al. 2014). In addition to the relatedness scores, each of the SICK sentence pairs has also been labeled as one of three classes: entailment, contradiction, or neutral, which are to be predicted for the test examples. For this task, we solely rely on the same representations learned for predicting semantic relatedness (fixed without additional fine-tuning), and simply apply standard learning methods to do the entailment classification.

Specifically, from the MaLSTM representations $h^{(a)}_{T_a}$, $h^{(b)}_{T_b}$ of each pair of sentences, we compute the following simple features (also successfully used by Tai, Socher, and Manning 2015): element-wise (absolute) differences $|h^{(a)}_{T_a} - h^{(b)}_{T_b}|$ and element-wise products $h^{(a)}_{T_a} \odot h^{(b)}_{T_b}$. Using only these features, we train a radial-basis-kernel SVM to classify the entailment labels. The one-versus-all approach to multi-class problems is employed with hyperparameters optimized in 5-fold cross-validation.

Table 4 shows that such an approach outperforms all other textual-entailment systems except for the Illinois-LH system of Lai and Hockenmaier (2014). Thus even though the features provided to the SVM are learned for the distinct goal of semantic relatedness scoring (with no supervised information regarding contradictions or the neutral threshold), they capture enough relevant characteristics of the sentences to be highly useful for entailment classification. In contrast to the MaLSTM representations, the Illinois-LH system employs many features specially constructed for this task such as hypernym counts and occurrences of "no" and "not". Interestingly, a useful feature like "no"-occurrence, which Lai and Hockenmaier manually selected, has been automatically learned by our model and is encoded by the first hidden unit shown in Figure 2.

Method                                               Accuracy
Illinois-LH (Lai and Hockenmaier 2014)               84.6
ECNU (Zhao, Zhu, and Lan 2014)                       83.6
UNAL-NLP (Jimenez et al. 2014)                       83.1
Meaning Factory (Bjerva et al. 2014)                 81.6

Reasoning-based n-best (Lien and Kouylekov 2015)     80.4
LangPro Hybrid-800 (Abzianidze 2015)                 81.4
SNLI-transfer 3-class LSTM (Bowman et al. 2015)      80.8
MaLSTM features + SVM                                84.2

Table 4: Test set accuracy for the SICK semantic entailment classification. The first group of results are top SemEval 2014 submissions and the second are more recently proposed methods.

Discussion

This work demonstrates that a simple LSTM is capable of modeling complex semantics if the representations are explicitly guided. Leveraging synonym augmentation and pretrained word-embeddings, we circumvent the size limitations of existing labeled datasets. Analysis of the learned model reveals that it utilizes diverse hidden units to encode different characteristics of each sentence. Admitting efficient test-time inference, our model can be deployed in real-time applications. Not only useful for scoring semantic relatedness/entailment, trained MaLSTM sentence representations can produce interesting insights in exploratory data analysis thanks to their interpretable structure.

Since our approach relies on pre-trained word-vectors as the LSTM inputs, it will benefit from improvements in word-embedding methods such as those of Li et al. (2015), especially as these word-vectors more comprehensively capture synonymity and entity-relationships. We also foresee significant gains as the amount of labeled semantic similarity data grows, both for statistical reasons and because sufficiently large sample sizes enable learning of de novo word-vectors tailored to this model.
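The pair features from the Entailment classification section (element-wise absolute differences and element-wise products) can be built as follows. This is a hedged sketch: `entailment_features` and `fit_entailment_svm` are hypothetical names, the hyperparameter grid is invented for illustration, and the classifier helper assumes scikit-learn is installed (it is imported lazily so the feature code runs without it).

```python
import numpy as np

def entailment_features(h_a, h_b):
    """Concatenate |h_a - h_b| and h_a * h_b for each sentence pair (row-wise)."""
    h_a = np.atleast_2d(np.asarray(h_a, dtype=float))
    h_b = np.atleast_2d(np.asarray(h_b, dtype=float))
    return np.hstack([np.abs(h_a - h_b), h_a * h_b])

def fit_entailment_svm(features, labels):
    """One-versus-all RBF-kernel SVM tuned by 5-fold cross-validation."""
    # Imported here so the feature construction above has no scikit-learn dependency.
    from sklearn.model_selection import GridSearchCV
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # Illustrative grid only; the text does not list the values that were searched.
    grid = {"estimator__C": [0.1, 1.0, 10.0], "estimator__gamma": [0.01, 0.1, 1.0]}
    search = GridSearchCV(OneVsRestClassifier(SVC(kernel="rbf")), grid, cv=5)
    return search.fit(features, labels)

# Toy pair of two-dimensional representations (hypothetical values).
feats = entailment_features([[1.0, -2.0]], [[3.0, 4.0]])
```

With 50-dimensional representations, each pair yields a 100-dimensional feature vector, and the sentence representations themselves stay fixed while only the SVM is trained.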
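References

Abzianidze, L. 2015. A Tableau Prover for Natural Logic and Language. EMNLP 2492-2502.
Agirre, E., and Cer, D. 2013. *SEM 2013 shared task: Semantic Textual Similarity. SemEval 2013.
Bengio, Y. 2012. Deep Learning of Representations for Unsupervised and Transfer Learning. JMLR W&CP: Proc. Unsupervised and Transfer Learning challenge and workshop 17-36.
Bjerva, J.; Bos, J.; van der Goot, R.; and Nissim, M. 2014. The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. SemEval 2014.
Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. EMNLP 632-642.
Chen, K., and Salman, A. 2011. Extracting Speaker-Specific Information with a Regularized Siamese Deep Network. NIPS 298-306.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 1724-1734.
Chopra, S.; Hadsell, R.; and LeCun, Y. 2005. Learning a similarity metric discriminatively, with application to face verification. Computer Vision and Pattern Recognition 1:539-546.
Fan, J., and Gijbels, I. 1992. Variable bandwidth and local linear regression smoothers. The Annals of Statistics 20:2008-2036.
Graves, A. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, Springer.
Greff, K.; Srivastava, R. K.; Koutnik, J.; Steunebrink, B. R.; and Schmidhuber, J. 2015. LSTM: A Search Space Odyssey. arXiv:1503.04069.
He, H.; Gimpel, K.; and Lin, J. 2015. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. EMNLP 1576-1586.
Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8):1735-1780.
Jimenez, S.; Duenas, G.; Baquero, J.; Gelbukh, A.; Batiz, A. J. D.; and Mendizabal, A. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. SemEval 2014.
Kiros, R.; Zhu, Y.; Salakhutdinov, R.; Zemel, R. S.; Torralba, A.; Urtasun, R.; and Fidler, S. 2015. Skip-Thought Vectors. NIPS, to appear.
Lai, A., and Hockenmaier, J. 2014. Illinois-LH: A denotational and distributional approach to semantics. SemEval 2014.
Le, Q., and Mikolov, T. 2014. Distributed Representations of Sentences and Documents. ICML 1188-1196.
Li, Y.; Xu, L.; Tian, F.; Jiang, L.; Zhong, X.; and Chen, E. 2015. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective. IJCAI.
Lien, E., and Kouylekov, M. 2015. Semantic Parsing for Textual Entailment. International Conference on Parsing Technologies 40-49.
Marelli, M.; Bentivogli, L.; Baroni, M.; Bernardi, R.; Menini, S.; and Zamparelli, R. 2014. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval 2014.
Mihalcea, R.; Corley, C.; and Strapparava, C. 2006. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. AAAI Conference on Artificial Intelligence.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. NIPS 3111-3119.
Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11):39-41.
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. ICML 1310-1318.
Siegelmann, H. T., and Sontag, E. D. 1995. On the Computational Power of Neural Nets. Journal of Computer and System Sciences 50:132-150.
Socher, R. 2014. Recursive Deep Learning for Natural Language Processing and Computer Vision. PhD thesis, Stanford University.
Sutskever, I.; Vinyals, O.; and Le, Q. 2014. Sequence to sequence learning with neural networks. NIPS 3104-3112.
Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. ACL 1556-1566.
van der Maaten, L., and Hinton, G. 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605.
Yih, W.; Toutanova, K.; Platt, J.; and Meek, C. 2011. Learning Discriminative Projections for Text Similarity Measures. Proceedings of the Fifteenth Conference on Computational Natural Language Learning 247-256.
Zeiler, M. D. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701.
Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level Convolutional Networks for Text Classification. arXiv:1509.01626.
Zhao, J.; Zhu, T. T.; and Lan, M. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. SemEval 2014.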

(Generated automatically by the system; the content can be previewed before downloading.)