进行分词 (perform word segmentation)
- word segmentation (Web)
-
本文不进行分词,直接按照字符的匹配进行抽取。
This paper does not perform word segmentation ; extraction is done directly by character matching .
-
首先建立了一个约五十万字的封闭语料库,然后对语料进行分词和词频统计。
After setting up a closed corpus of about five hundred thousand characters , we carry out word segmentation and word frequency statistics .
-
语言处理会根据附著词进行分词。
The linguistic processing splits the term according to the clitic .
-
本文提出了一种隐式分词连写输入法,该方法在中文文本输入的同时进行分词,并将文本以词串的形式在机内保存。
A Chinese input method with implicit word segmentation is presented in this paper ; it segments the text as it is entered and stores the text internally as word strings .
-
分析中文的语义,首先要对句子进行分词。
To analyse Chinese semantics , one must first segment the sentences into words .
-
汉语自动分词在面向大规模真实文本进行分词时仍然存在很多困难。
The automatic word segmentation of Chinese sentences is difficult when the processing mechanism faces large scale real texts .
-
使用分词系统对用户输入的查询语句进行分词和词性标注;②根据检索特征提取关键词集合。
Use a word segmentation system to segment and POS-tag the query sentences entered by users ; ② extract the keyword set according to the retrieval features .
-
经过对分词技术的研究,本文采用字符串匹配算法中的最大匹配法进行分词处理。
After studying word segmentation techniques , this paper adopts the maximum matching method , a string matching algorithm , for word segmentation .
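The maximum matching idea mentioned above can be sketched in a few lines; the toy dictionary and maximum word length below are illustrative assumptions, not the paper's actual lexicon:

```python
# A minimal sketch of forward maximum matching (FMM) segmentation.
# The dictionary and max word length here are illustrative assumptions.
def fmm_segment(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"中文", "分词", "字符串", "匹配", "算法"}
print(fmm_segment("中文分词算法", vocab))  # ['中文', '分词', '算法']
```

Unknown characters fall through to single-character words, which is the usual fallback in dictionary-based segmentation.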
-
它是自然语言处理过程中首要的技术环节,其重要性不言而喻。目前的中文分词技术主要针对中文文本进行分词。
It is the first key technical step in natural language processing , and its importance is self-evident . Current Chinese word segmentation technology mainly targets Chinese text .
-
自然语言处理的主要工作是对文本进行分词标注、句法分析、三元组提取等工作。文本结构本体的建立是选取人物描述类作为本体建模的样本,能起到代表性意义。
The main NLP tasks here are word segmentation and tagging , syntactic parsing , triple extraction , and so on . The text structure ontology is built by taking person-description texts as representative samples for ontology modeling .
-
由于不需要进行分词和特征提取,该表示方法与具体语种无关。
Since it requires neither word segmentation nor feature extraction , this representation method is independent of the specific language .
-
本文将基于统计的二元分词方法应用于中文网页分类,实现了在事先没有词表的情况下通过统计构造二字词词表,从而根据网页中的文本进行分词,进而进行网页的分类。
This paper applies a statistics-based bigram word segmentation method to Chinese web page classification : a two-character word list is constructed statistically without a pre-existing lexicon , the text of the web pages is segmented accordingly , and the pages are then classified .
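Building a two-character word list purely from statistics, as described above, can be sketched as counting adjacent character pairs and keeping the frequent ones; the frequency threshold here is an illustrative assumption:

```python
from collections import Counter

# Sketch: construct a two-character word list without a prior lexicon
# by counting adjacent character pairs; the threshold is illustrative.
def bigram_wordlist(corpus_texts, min_count=2):
    counts = Counter()
    for text in corpus_texts:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return {bigram for bigram, c in counts.items() if c >= min_count}

corpus = ["中文分词", "中文网页", "网页分类"]
print(sorted(bigram_wordlist(corpus)))  # ['中文', '网页']
```

A real system would use a statistical association measure (e.g. mutual information) rather than a raw count, but the raw-count version shows the shape of the idea.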
-
分词是中文信息处理的基础,在汉语文本分类、文献标引、智能检索、自然语言理解与处理等应用中,首先都要对中文文本进行分词处理。
Word segmentation is the basis of Chinese information processing ; in applications such as text classification , literature indexing , intelligent retrieval , and natural language understanding and processing , the Chinese text must first be segmented .
-
系统首先提出基于字信息的汉语词法分析方法,对汉语网页中文本进行分词处理,然后利用基于组成字结构信息的方法发现新词。
The system first employs character-based Chinese morphological analysis to segment the text of Chinese web pages into words , and then discovers new words with a method based on the structural information of their constituent characters .
-
然后使用一种适合目录信息的结构和存储格式的分词方法,对目录文本进行分词处理并对目录信息中的特征项进行标注。
According to the structure and storage format of the directory information , we design and apply a word segmentation algorithm to the directory text and label the feature items in it .
-
系统在索引时对原始文本进行分词处理后以词为单元生成倒排索引,检索部分则采用了经典的向量空间模型。
The index function component segments the original documents into words and then generates an inverted index with word units . The retrieval component applies the classical vector space model .
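The indexing step described above (segmented documents to a word-level inverted index) can be sketched as a mapping from each word to the sorted list of documents containing it; the document identifiers are illustrative:

```python
from collections import defaultdict

# Sketch: build an inverted index over pre-segmented documents,
# mapping each word to the sorted list of document ids containing it.
def build_inverted_index(segmented_docs):
    index = defaultdict(set)
    for doc_id, words in enumerate(segmented_docs):
        for w in words:
            index[w].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

docs = [["中文", "分词"], ["分词", "检索"]]
print(build_inverted_index(docs))  # {'中文': [0], '分词': [0, 1], '检索': [1]}
```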
-
词法分析是汉语处理的必经步骤,本文采用了双向最大匹配法进行分词处理,并在一定程度上消除了歧义。
Lexical analysis is a necessary step in Chinese processing . This paper adopts the bidirectional maximum matching method for word segmentation , which eliminates ambiguity to some extent .
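Bidirectional maximum matching runs a forward and a backward pass and keeps the better segmentation; a common disambiguation heuristic, used in this sketch with an assumed toy dictionary, prefers fewer words and then fewer single-character words:

```python
# Sketch of bidirectional maximum matching with a toy dictionary.
def fmm(text, vocab, max_len=4):
    out, i = [], 0
    while i < len(text):  # forward pass: longest match from the left
        for n in range(min(max_len, len(text) - i), 0, -1):
            w = text[i:i + n]
            if n == 1 or w in vocab:
                out.append(w)
                i += n
                break
    return out

def bmm(text, vocab, max_len=4):
    out, j = [], len(text)
    while j > 0:  # backward pass: longest match from the right
        for n in range(min(max_len, j), 0, -1):
            w = text[j - n:j]
            if n == 1 or w in vocab:
                out.append(w)
                j -= n
                break
    return out[::-1]

def bidirectional(text, vocab, max_len=4):
    f, b = fmm(text, vocab, max_len), bmm(text, vocab, max_len)
    if len(f) != len(b):          # prefer the segmentation with fewer words
        return min(f, b, key=len)
    singles = lambda seg: sum(len(w) == 1 for w in seg)
    return f if singles(f) <= singles(b) else b  # then fewer single chars

vocab = {"研究生", "研究", "生命", "起源"}
print(bidirectional("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

On this classic example, the forward pass produces the wrong cut 研究生/命/起源, while the backward pass recovers 研究/生命/起源; the single-character heuristic picks the latter.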
-
对用户提交的以自然语言表述的问题进行分词处理,去除相关辅助词,最后提取出核心词进行查询。
When a user submits a question in natural language , the system segments it into words , removes the auxiliary words , and finally extracts the core words for the query .
-
自动分词和词性标注直接影响命名实体的识别,本文采用了海量智能分词系统对文本进行分词和标注。
Automatic segmentation and POS ( part-of-speech ) tagging directly affect named entity recognition . This paper uses the Massive Intelligent Segmentation system to segment and tag the text .
-
而对明代汉语语料进行分词及词频统计的研究,可以更全面地了解这个时期的词汇使用概貌。
So research on word segmentation and word frequency statistics for a Ming Dynasty corpus can give a more complete picture of word usage in that period .
-
在这部分实验中因为所用语料短小、领域性非常强、口语化比较严重,通用的分词软件不能很好的进行分词。
In this part of the experiment , because the corpus is short , highly domain-specific , and rather colloquial , general-purpose segmentation software cannot segment it well .
-
在这种模型之上,可以利用文本当中任意匹配的短语来定义文本之间的近似程度,避免了对中文文本进行分词以及处理高维向量等问题。
Based on this model , the similarity between documents can be defined by the phrases matched between them , which avoids segmenting the Chinese text and handling high-dimensional vectors .
-
价格比对的部分,是通过对商品描述文本进行分词,然后使用文档相似度对比的方法做到的。
For price comparison , the product description texts are first segmented , and then a document similarity method is used to compare the products .
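A common way to compare two segmented descriptions, sketched here as an assumption since the source does not name the exact measure, is cosine similarity over term-frequency vectors:

```python
import math
from collections import Counter

# Sketch: cosine similarity between two pre-segmented word lists,
# using raw term-frequency vectors (an illustrative choice).
def cosine_similarity(words_a, words_b):
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity(["苹果", "手机", "64G"], ["苹果", "手机", "128G"]))  # ≈ 0.667
```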
-
首先我们需要得到评测用的平衡语料,随后进行分词,然后把语料标注拼音从而得到平台的输入。
First , we need to obtain a balanced corpus for evaluation , then segment it into words , and finally annotate the corpus with pinyin to obtain the input of the platform .
-
本文以向量空间模型的方式描述单句,对单句进行分词并去除停用词之后就可以得到该单句的文本特征向量。
This paper describes single sentences with the vector space model ; after segmenting a sentence into words and removing the stop words , its text feature vector is obtained .
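The segment-then-filter step above can be sketched as building a term-frequency vector after stop-word removal; the stop-word list here is an illustrative assumption:

```python
from collections import Counter

# Sketch: turn a segmented sentence into a term-frequency feature
# vector after removing stop words; the stop-word list is illustrative.
STOP_WORDS = {"的", "了", "是", "在"}

def feature_vector(segmented_sentence):
    return Counter(w for w in segmented_sentence if w not in STOP_WORDS)

print(feature_vector(["中文", "的", "分词", "是", "基础"]))
# Counter({'中文': 1, '分词': 1, '基础': 1})
```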
-
其中汉语的处理任务大都需要先进行分词,经过多年的深入研究,汉语自动分词技术在应对传统文本上已经取得了不错的成绩。
The majority of Chinese processing tasks are performed on the basis of word segmentation . After years of in-depth study , Chinese automatic word segmentation technology has achieved desirable results in terms of the traditional text .
-
本文分析已有分词方法的优劣,并采用基于统计与基于规则相结合的分词方法进行分词,取各方法之精髓,弥补各分词方法力所不及之处。
This paper analyzes the pros and cons of existing segmentation methods and adopts a segmentation method combining statistics-based and rule-based approaches , taking the strengths of each method to compensate for the weaknesses of the others .
-
该方法首先在基于长度优先的基础上同时结合词频优先进行分词,对未匹配字串再应用改进的正向最大匹配法和逆向最大匹配法结合熵率进行分词。
The method first segments on the basis of length priority combined with word frequency priority ; for any unmatched strings it then applies improved forward and backward maximum matching combined with the entropy rate .
-
设计并实现一个网页分类系统,采用相同的特征权值计算方法,特征选择算法以及分类算法,进行基于分词的网页分类系统和基于N-Gram的网页分类系统的对比实验,分析两者的分类效果。
A web page classification system is designed and implemented ; using the same feature weighting , feature selection , and classification algorithms , comparative experiments are conducted between a word-segmentation-based system and an N-gram-based system , and their classification results are analyzed .
-
本文使用了基于简单贝叶斯模型的过滤算法,同时使用N-gram对中文文本进行自动分词,并且组合多个N-gram来加快分类的收敛速度,这样分类是一种切实可行的垃圾邮件过滤方法。
This paper uses a filtering algorithm based on the naive Bayesian model , applies N-grams to segment Chinese text automatically , and combines multiple N-grams to speed up the convergence of classification , making this a practical spam filtering method .
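The N-gram "segmentation" used for such filters simply slides a fixed-size window over the characters, sidestepping a dictionary entirely; a minimal sketch (bigrams by default):

```python
# Sketch: overlapping character n-grams (here bigrams) as
# language-independent "words" for a Naive Bayes spam filter.
def char_ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("垃圾邮件过滤"))  # ['垃圾', '圾邮', '邮件', '件过', '过滤']
```

These n-gram features would then feed the Bayesian model's word counts in place of dictionary-segmented words.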