Using Wikicorpus & NLTK to build a Spanish part-of-speech tagger | CLiPS
We want to learn general rules of the form "any proper noun followed by a verb" instead of "'Puerto Rico' followed by a verb". To do this, we replace each proper noun with an anonymous placeholder before training:
sentences = wikicorpus(words=1000000)

ANONYMOUS = "anonymous"
for s in sentences:
    for i, (w, tag) in enumerate(s):
        if tag == "NP": # NP = proper noun in the Parole tagset.
            s[i] = (ANONYMOUS, "NP")
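As a quick sanity check (a small sketch, not part of the original script), we can count how many tokens were anonymized:

print sum(1 for s in sentences for w, tag in s if w == ANONYMOUS) # number of replaced proper nouns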
We can then train NLTK's FastBrillTaggerTrainer. It is based on a unigram tagger, which is simply a lexicon of known words and their part-of-speech tag. It will then boost the accuracy with a set of contextual rules that change a word's part-of-speech tag depending on the surrounding words.
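To get a feel for what the unigram baseline does on its own, here is a minimal sketch (the tags shown in the comment are illustrative only):

from nltk.tag import UnigramTagger

# A unigram tagger is a lexicon lookup: each known word gets its most frequent
# tag from the training sentences; unknown words get None.
unigram = UnigramTagger(sentences)
print unigram.tag(["el", "gato", "xyzzy"]) # e.g. [('el', 'DA'), ('gato', 'NCS'), ('xyzzy', None)]

The full training script with the contextual rule templates follows: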
from nltk.tag import UnigramTagger
from nltk.tag import FastBrillTaggerTrainer
from nltk.tag.brill import SymmetricProximateTokensTemplate
from nltk.tag.brill import ProximateTokensTemplate
from nltk.tag.brill import ProximateTagsRule
from nltk.tag.brill import ProximateWordsRule

ctx = [ # Context = surrounding words and tags.
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 2)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 3)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (2, 2)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 2)),
    ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1, 1)),
]

tagger = UnigramTagger(sentences)
tagger = FastBrillTaggerTrainer(tagger, ctx, trace=0)
tagger = tagger.train(sentences, max_rules=100)

#print tagger.evaluate(wikicorpus(10000, start=1))
Brill's algorithm uses an iterative approach to learn contextual rules. In short, this means that it tries different combinations of interesting rules to find a subset that produces the best tagging accuracy. This process is time-consuming (minutes or hours), so we want to store the final subset for reuse.

Brill's algorithm in NLTK defines context using indices. For example, (1, 2) in the previous script means one or two words (or tags) after the current word. Brill's original implementation uses commands to describe context, e.g., NEXT1OR2WORD or NEXT1OR2TAG. Pattern also uses these commands, so we need to map NLTK's indices to the command set:
from codecs import BOM_UTF8

ctx = []
for rule in tagger.rules():
    a = rule.original_tag
    b = rule.replacement_tag
    c = rule.conditions
    x = c[0][2]
    r = c[0][:2]
    if len(c) != 1: # More complex rules are ignored in this script.
        continue
    if isinstance(rule, ProximateTagsRule):
        if r == (-1, -1): cmd = "PREVTAG"
        if r == (+1, +1): cmd = "NEXTTAG"
        if r == (-2, -1): cmd = "PREV1OR2TAG"
        if r == (+1, +2): cmd = "NEXT1OR2TAG"
        if r == (-3, -1): cmd = "PREV1OR2OR3TAG"
        if r == (+1, +3): cmd = "NEXT1OR2OR3TAG"
    if isinstance(rule, ProximateWordsRule):
        if r == (-1, -1): cmd = "PREVWD"
        if r == (+1, +1): cmd = "NEXTWD"
        if r == (-2, -1): cmd = "PREV1OR2WD"
        if r == (+1, +2): cmd = "NEXT1OR2WD"
    ctx.append("%s %s %s %s" % (a, b, cmd, x))

open("es-context.txt", "w").write(BOM_UTF8 + "\n".join(ctx).encode("utf-8"))
We end up with a file es-context txt with a 1 00 contextual rules in a format usable with Pattern
4. Rules for unknown words based on word suffixes
By default, unknown words (= not in the lexicon) will be tagged as nouns. We can improve this with morphological rules, in other words, rules based on word prefixes and suffixes. For example, English words ending in -ly are usually adverbs: really, extremely, and so on. Similarly, Spanish words that end in -mente are adverbs. Spanish words ending in -ando or -iendo are verbs in the present participle: hablando, escribiendo, and so on.
860,"Sp":8,"VMs":7}}
suffix defaultdict (lambda: defaultdict (int))
for senterce in wikicorpus(1000300)
f。xw, tag in contone:
X =w[-5:]# Tast 5 characters.
⊥en(x)<⊥en(W) and tag
suffix[x][tag]+= 1
for x, tacs in suffix i=ems():
tag Iax(tags, key=tags get) t RO
sum(tags. valucs())
t4860+8+7
f2= tags[tag] float (f1)+ 4860/4875
top append((fl, f2, tag)
tcp- sorted(top, reverse=True)
tcp filter(lambda (fl, f2, x, ag): fl >=10 and f
top)
tcp filter(lambda (fl, f2, x, -ag): tag !="NC
top)
p[:100
)王orf1,f
open("es-mcrphology txt","w"). write(BOM UTF8 \n"join(top). encode(utf
We end up with a file es-morphology.txt with 100 suffix rules in a format usable with Pattern.
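For illustration, here is a small sketch of how such a suffix rule could be applied to an unknown word, assuming the "NC suffix fhassuf length tag x" layout written above (apply_suffix_rule is our own helper, not part of Pattern):

def apply_suffix_rule(word, tag, rule):
    # A rule looks like: "NC mente fhassuf 5 RG x"
    old, suffix, cmd, length, new, x = rule.split()
    if cmd == "fhassuf" and tag == old and word.endswith(suffix):
        return new
    return tag

print apply_suffix_rule(u"probablemente", "NC", "NC mente fhassuf 5 RG x") # RG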
To clarify this, examine the table below. We read 1 million words from Wikicorpus, of which 4,875 words end in -mente. 98% of those are tagged as RG (Parole tag for adverb, RB in the Penn tagset). It was also tagged SP (preposition) 8 times and VMS (verb) 7 times.

The above script has two constraints for rule selection: f1 >= 10 will discard rules that match fewer than 10 words, and f2 > 0.8 will discard rules for which the most frequent tag falls below 80%. This means that unknown words that end in -mente will be tagged as adverbs, since we consider the other cases negligible. We can experiment with different settings to see if the accuracy of the tagger improves.
FREQUENCY   SUFFIX    PARTS-OF-SPEECH            EXAMPLE
5986        -acion    99% NCS + 1% SP            derivación
4875        -mente    98% RG + 1% SP + 1% VMS    correctamente
3276        -iones    99% NCP + 1% VMS           dimensiones
1824        -bien     100% RG                    también
1247        -embre    99% W + 1% NCS             septiembre
1134        -dades    99% NCP + 1% SP            posibilidades
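As a quick numerical check of the selection constraints on the -mente row, using the counts from the script's comments:

tags = {"RG": 4860, "SP": 8, "VMS": 7}
f1 = sum(tags.values())      # 4875 words end in -mente.
f2 = tags["RG"] / float(f1)  # 4860 / 4875 = 0.997
print f1 >= 10 and f2 > 0.8  # True: the rule "-mente => RG" is kept.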
5. Subclassing the pattern.text Parser class
In summary, we constructed an es-lexicon.txt with the part-of-speech tags of known words (steps 1-2), together with an es-context.txt (step 3) and an es-morphology.txt (step 4). We can use these to create a parser for Spanish by subclassing the base Parser in the pattern.text module. The pattern.text module has base classes for Parser, Lexicon, Morphology, etc. Take a moment to review the source code, and the source code of other parsers in Pattern. You'll notice that all parsers follow the same simple steps. A template for new parsers is included in pattern.text.xx.
The Parser base class has the following methods with default behavior:

- Parser.find_tokens() finds sentence markers (.?!) and splits punctuation marks from words,
- Parser.find_tags() finds word part-of-speech tags,
- Parser.find_chunks() finds words that belong together (e.g., the black cats),
- Parser.find_labels() finds word roles in the sentence (e.g., subject and object),
- Parser.find_lemmata() finds word base forms (cats -> cat),
- Parser.parse() executes the above steps on a given string.
We can create an instance of the SpanishParser and feed it our data. We will need to redefine find_tags() to map Parole tags to Penn Treebank tags (which all other parsers in Pattern use as well):
from pattern.text import Parser

PAROLE = {
     "CC": "CC",
    "NCS": "NN",
    "VMN": "VB",
     "RG": "RB",
    # ... (the full map covers the remaining Parole tags) ...
}

def parole2penntreebank(token, tag):
    return token, PAROLE.get(tag, tag)

class SpanishParser(Parser):

    def find_tags(self, tokens, **kwargs):
        # Parser.find_tags() can take an optional map(token, tag) function,
        # which returns an updated (token, tag)-tuple.
        kwargs.setdefault("map", parole2penntreebank)
        return Parser.find_tags(self, tokens, **kwargs)
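A quick check of the mapping function, using only the entries in the abbreviated PAROLE map above:

print parole2penntreebank("gato", "NCS") # ('gato', 'NN')
print parole2penntreebank("gato", "XYZ") # Unknown Parole tags pass through unchanged: ('gato', 'XYZ')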
Load the lexicon and the rules in an instance of SpanishParser:
from pattern.text import Lexicon

lexicon = Lexicon(
        path = "es-lexicon.txt",
  morphology = "es-morphology.txt",
     context = "es-context.txt")

parser = SpanishParser(lexicon=lexicon)

def parse(s, *args, **kwargs):
    return parser.parse(s, *args, **kwargs)
It is still missing features (notably lemmatization), but our Spanish parser is essentially ready for use:
print parse(u"El gato se sentó en la alfombra")
El         DT
gato       NN
se         PRP
sentó      VB
en         IN
la         DT
alfombra   NN
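To work with the output programmatically instead of reading the tagged string, we can split it into sentences of [word, tag, ...] tokens; the exact number of fields per token depends on the options passed to parse(). A short sketch:

for sentence in parse(u"El gato se sentó en la alfombra").split():
    for token in sentence:
        print token[0], token[1] # word, Penn Treebank tag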
6. Testing the accuracy of the parser
The following script can be used to test the accuracy of the parser against Wikicorpus. We used 1.5 million words with 300 contextual and 100 morphological rules for an accuracy of about 91%. So we lost 9%, but the parser is also fast and compact: the data files are about 1 MB in size. Note how we pass map=None to the parse() command. This parameter is in turn passed to SpanishParser.find_tags() so that the original Parole tags are returned, which we can compare to the tags in Wikicorpus:
i = 0
n = 0
for s1 in wikicorpus(100000, start=1):
    s2 = " ".join(w for w, tag in s1)
    s2 = parse(s2, tags=True, chunks=False, map=None).split()[0]
    for (w1, tag1), (w2, tag2) in zip(s1, s2):
        if tag1 == tag2:
            i += 1
        n += 1

print float(i) / n