NLTK | Mr kuai

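Understanding corpora:
A corpus holds language material that actually occurred in real language use. It is a basic resource that carries linguistic knowledge in electronic, computer-readable form. Raw material has to be processed (analyzed and annotated) before it becomes a useful resource.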

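import nltk
from nltk.book import *
text1.concordance('monstrous')   # concordance: show every occurrence of the word with its surrounding context
text1.similar('monstrous')  # words that appear in contexts similar to those of 'monstrous'
text1.common_contexts(['monstrous','very'])  # contexts shared by two or more words
text1.count('for')  # number of times a word occurs
list(bigrams(['more', 'is', 'said', 'than', 'done']))  # bigrams (adjacent word pairs); bigrams() returns a generator
nltk.corpus.gutenberg.fileids()   # file identifiers of all texts in the Project Gutenberg corpus
nltk.corpus.gutenberg.words('austen-emma.txt')  # all words of the Project Gutenberg text of Austen's Emma
from nltk.corpus import brown  # Brown corpus: 500 texts from different sources
brown.categories()  # list all categories of the Brown corpus
# words of the Brown corpus texts in the 'news' category
brown.words(categories='news')
brown.words(fileids=['ca01'])  # fileids takes file names such as 'ca01' (see brown.fileids()), not category names
brown.sents(categories=['news','editorial','fiction'])  # sentences from several categories
# load your own corpus
from nltk.corpus import PlaintextCorpusReader
# process HTML ===> strip the markup, then tokenize the text
# nltk.clean_html() was removed in NLTK 3 and raises NotImplementedError; BeautifulSoup is used here instead
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()  # html: a string of HTML markup
tokens = nltk.word_tokenize(raw)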
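
The notes import PlaintextCorpusReader but never call it. A minimal sketch of loading your own plain-text files could look like the following. The directory, file names and file contents are made up for illustration, and only fileids(), words() and raw() are used because they should not need any extra NLTK data downloads.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a throwaway corpus directory (hypothetical files, for illustration only).
corpus_root = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello world. This is my corpus.",
    "b.txt": "Another small file.",
}
for name, content in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as fh:
        fh.write(content)

# The second argument is a regular expression selecting the files to include.
my_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = my_corpus.fileids()             # names of the matching files
words_a = list(my_corpus.words("a.txt"))   # tokens split on word/punctuation boundaries
raw_b = my_corpus.raw("b.txt")             # file content as a single string

print(file_ids)
print(words_a)
```

In real use, corpus_root would simply point at an existing folder of .txt files.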

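1. nltk.corpus  standardized interfaces to corpora and lexicons (corpus access and processing)
2. nltk.tokenize, nltk.stem  tokenization, sentence splitting, stemming (string processing)
3. nltk.collocations  t-test, chi-squared (collocation discovery)
4. nltk.tag  part-of-speech tagging
5. nltk.classify, nltk.cluster  decision tree, maximum entropy, naive Bayes, EM, etc. (classification and clustering)
6. nltk.chunk  regular-expression chunking
7. nltk.parse  chart, feature-based, unification, probabilistic (parsing)
8. nltk.metrics  precision, recall, etc. (evaluation metrics)
9. nltk.probability  probability distributions, smoothed probability distributions
10. nltk.app, nltk.chat  graphical concordancer, chatbots (applications)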

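Corpora provided with NLTK (nltk.corpus)

gutenberg  classic literature from Project Gutenberg
webtext  text collected from the web (includes online ads)
nps_chat  chat-message corpus
brown  a million-word corpus of English
reuters  Reuters corpus of news documents
inaugural  corpus of inaugural addresses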

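Word frequency statistics with NLTK

freq = nltk.FreqDist(data)  # data: an iterable of tokens
B()  returns the number of distinct samples (the size of the vocabulary)
plot(n, title='...', cumulative=False)  # plots the n most frequent samples; with cumulative=True, plots the cumulative frequency distribution
tabulate()  prints the frequency distribution as a table
most_common()  returns the most frequent words together with their counts
hapaxes()  returns the words that occur only once

freq.most_common(5)  # the five most frequent words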
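
A small sketch of the methods listed above, run on a made-up token list so that no corpus download is needed. N() and freq() are not in the list above; they are added here as the usual companions of B().

```python
from nltk import FreqDist

# Toy input, for illustration only.
tokens = "the cat sat on the mat the end".split()
freq = FreqDist(tokens)

n_distinct = freq.B()            # number of distinct tokens
n_total = freq.N()               # total number of tokens counted
top = freq.most_common(1)        # [(token, count)] for the most frequent token
once = freq.hapaxes()            # tokens that occur exactly once
the_rel_freq = freq.freq("the")  # relative frequency: count / N

print(n_distinct, n_total, top, once, the_rel_freq)
```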

Removing stop words with NLTK (stopwords)

In natural language processing, very common words that carry little information are called stop words
from nltk.corpus import stopwords
stopwords.words('english')

# Example: remove stop words from a text
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = 'this is a sample sentence,showing off the stop  words filtration'
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
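
One thing to watch: NLTK's English stop-word list is all lowercase, so a capitalized token such as 'This' passes a plain membership test. A possible sketch that lowercases before comparing and also drops punctuation is shown below. remove_stopwords and demo_stop_words are hypothetical names. The tiny stop list is only a stand-in so the snippet runs without downloads, and wordpunct_tokenize is swapped in for word_tokenize for the same reason.

```python
from nltk.tokenize import wordpunct_tokenize


def remove_stopwords(text, stop_words):
    """Return alphabetic tokens of `text` that are not in `stop_words` (case-insensitive)."""
    tokens = wordpunct_tokenize(text)
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


# Stand-in stop list for illustration; not the real NLTK list.
demo_stop_words = {"this", "is", "a", "off", "the"}

filtered = remove_stopwords(
    "This is a sample sentence, showing off the stop words filtration",
    demo_stop_words,
)
print(filtered)
```

With the stopwords corpus downloaded, passing set(stopwords.words('english')) as the second argument should work the same way.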

Tokenization with NLTK (tokenize)

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
# sentence splitting
sent_tokenize(mytext)
sent_tokenize(mytext, 'french')  # split text written in a language other than English
# word tokenization
word_tokenize(mytext)
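
sent_tokenize and word_tokenize depend on the downloadable Punkt sentence model. For comparison, here is a sketch of three rule-based tokenizers from nltk.tokenize that, as far as I know, need no data files. The sample sentence is made up, and the point is only to show how differently a contraction gets split.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer, wordpunct_tokenize

sentence = "I can't go."

# Penn Treebank conventions: contractions are split into "ca" + "n't".
treebank = TreebankWordTokenizer().tokenize(sentence)

# Splits into runs of word characters and runs of punctuation.
punct = wordpunct_tokenize(sentence)

# Keeps only runs of word characters, so punctuation disappears.
words_only = RegexpTokenizer(r"\w+").tokenize(sentence)

print(treebank)
print(punct)
print(words_only)
```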

Stemming with NLTK (may produce strings that are not real words)

***************************1********************************
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('working')  # -> 'work'
****************************2*******************************
from nltk.stem import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem('working')  # -> 'work'
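
To see the "non-existent words" effect, the stemmers can be run side by side on a few words. This is only a sketch; the word list is arbitrary, and SnowballStemmer is included because the original import mentioned it. The Porter stemmer, for instance, turns 'studies' into 'studi'.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
words = ["working", "studies", "happily"]  # arbitrary sample words

# Map each stemmer name to the stems it produces, in the order of `words`.
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, result in stems.items():
    print(name, result)
```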

Lemmatization with NLTK (avoids the non-words that stemming can produce)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('increases')  # -> 'increase'
lemmatizer.lemmatize('playing', pos='v')  # -> 'play'; words are treated as nouns by default, pos='v' lemmatizes as a verb

Part-of-speech tagging with NLTK (POS tag)

import nltk
text = nltk.word_tokenize('what does the fox say')
nltk.pos_tag(text)
# returns (exact tags may depend on the tagger model version)
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

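WordNet in NLTK

# get the definition and example sentences of a word
from nltk.corpus import wordnet
syn = wordnet.synsets('pain')  # synsets (synonym sets) for 'pain'
syn[0].definition()  # definition of the first sense of 'pain'
syn[0].examples()  # example sentences for that sense
# collect synonyms
synonyms = []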
for syn in wordnet.synsets('pain'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
# collect antonyms
antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())