stanford chinese segmentor Tokenization of raw tex

文件名称: stanford chinese segmentor

所属分类: 企业管理

开发工具:

文件大小: 1mb

下载次数: 0

上传时间: 2013-04-10

提供者: bbkin*****

下载 (1mb)

不能下载？报告错误

详细说明： Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation. The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications. The system requires Java 1.6+ to be installed. We recommend at least 1G of memo ry for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), decrease the memory requirement by changing the option java -mx1g in the run scripts. Arabic Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis. The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in: Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL. Chinese Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in: Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing. Two models with two different segmentation standards are included: Chinese Penn Treebank standard and Peking University standard. On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves higher F measure when we train and test on the bakeoff data. This version is close to the CRF-Lex segmenter described in: Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT. The older version (2006-05-11) without using external lexicon features will still be available for download, but we do recommend using the latest version. Another new feature of the latest release is that the segmenter can now output k-best segmentations. An example of how to train the segmenter is now also available. ...展开收缩

(系统自动生成,下载前可以参看下载内容)