For compactness (not recommended):
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sent = "John write His name as Ishmael"
>>> [pos_tag(word_tokenize(i)) for i in sent_tokenize(sent)]
[[('John', 'NNP'), ('write', 'VBD'), ('His', 'NNP'), ('name', 'NN'), ('as', 'IN'), ('Ishmael', 'NNP')]]
>>> tagged_sent = [pos_tag(word_tokenize(i)) for i in sent_tokenize(sent)]
>>> [[(word, "CAPITALIZED" if word[0].isupper() else None, "noun" if pos[0] == "N" else "non-noun") for word, pos in sentence] for sentence in tagged_sent]
[[('John', 'CAPITALIZED', 'noun'), ('write', None, 'non-noun'), ('His', 'CAPITALIZED', 'noun'), ('name', None, 'noun'), ('as', None, 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]]
More readable code:
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sent = "John write His name as Ishmael"
>>> tagged_sents = [pos_tag(word_tokenize(i)) for i in sent_tokenize(sent)]
>>> added_annotation_sents = []
>>> for sentence in tagged_sents:
...     each_sent = []
...     for word, pos in sentence:
...         caps = "CAPITALIZED" if word[0].isupper() else None
...         isnoun = "noun" if pos[0] == "N" else "non-noun"
...         each_sent.append((word, caps, isnoun))
...     added_annotation_sents.append(each_sent)
...
>>> added_annotation_sents
[[('John', 'CAPITALIZED', 'noun'), ('write', None, 'non-noun'), ('His', 'CAPITALIZED', 'noun'), ('name', None, 'noun'), ('as', None, 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]]
If you insist on removing the None element when a word is not capitalized:
>>> [[tuple([ann for ann in word if ann is not None]) for word in sent] for sent in added_annotation_sents]
[[('John', 'CAPITALIZED', 'noun'), ('write', 'non-noun'), ('His', 'CAPITALIZED', 'noun'), ('name', 'noun'), ('as', 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]]
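The annotation and None-filtering steps can also be factored into one helper function. A minimal sketch (the `annotate` name is my own, and the tagger call is omitted so the snippet runs on an already-tagged sentence without NLTK installed):

```python
def annotate(tagged_sentence):
    """Annotate each (word, pos) pair with capitalization and noun-ness.

    Returns a list of tuples; the CAPITALIZED marker is dropped entirely
    for uncapitalized words, so those come back as 2-tuples.
    """
    annotated = []
    for word, pos in tagged_sentence:
        caps = "CAPITALIZED" if word[0].isupper() else None
        isnoun = "noun" if pos.startswith("N") else "non-noun"
        # Filter out None instead of storing it.
        annotated.append(tuple(ann for ann in (word, caps, isnoun)
                               if ann is not None))
    return annotated

# The output of pos_tag(word_tokenize(...)) from above, hard-coded here:
tagged = [('John', 'NNP'), ('write', 'VBD'), ('His', 'NNP'),
          ('name', 'NN'), ('as', 'IN'), ('Ishmael', 'NNP')]
print(annotate(tagged))
# [('John', 'CAPITALIZED', 'noun'), ('write', 'non-noun'),
#  ('His', 'CAPITALIZED', 'noun'), ('name', 'noun'),
#  ('as', 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]
```

With the logic in a named function, you can map it over the tagged sentences directly: `[annotate(s) for s in tagged_sents]`.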
alvas, 04.03.2014