In short:
First, let's add some headers to your csv so that the columns are easier to access:
>>> import pandas as pd
>>> df = pd.read_csv('myfile.csv', delimiter=';')
>>> df.columns = ['sent', 'tag']
>>> df['sent']
0 Preliminary Discourse to the Encyclopedia of D...
1 d'Alembert claims that it would be ignorant to...
2 However, as the overemphasis on parental influ...
3 this can also be summarized as a distinguish b...
Name: sent, dtype: object
>>> df['tag']
0 certain
1 certain
2 uncertain
3 uncertain
Now let's create a tok_and_tag function that runs word_tokenize followed by pos_tag:
>>> from nltk import word_tokenize, pos_tag
>>> from collections import Counter
>>> from itertools import chain
>>> tok_and_tag = lambda x: pos_tag(word_tokenize(x))
>>> df['sent'][0]
'Preliminary Discourse to the Encyclopedia of Diderot'
>>> tok_and_tag(df['sent'][0])
[('Preliminary', 'JJ'), ('Discourse', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('Encyclopedia', 'NNP'), ('of', 'IN'), ('Diderot', 'NNP')]
Then we can use df.apply to tokenize and tag the sentence column of the dataframe:
>>> df['sent'].apply(tok_and_tag)
0 [(Preliminary, JJ), (Discourse, NNP), (to, TO)...
1 [(d'Alembert, NN), (claims, NNS), (that, IN), ...
2 [(However, RB), (,, ,), (as, IN), (the, DT), (...
3 [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: sent, dtype: object
If you want the sentences lowercased:
>>> df['sent'].apply(str.lower)
0 preliminary discourse to the encyclopedia of d...
1 d'alembert claims that it would be ignorant to...
2 however, as the overemphasis on parental influ...
3 this can also be summarized as a distinguish b...
Name: sent, dtype: object
>>> df['lower_sent'] = df['sent'].apply(str.lower)
>>> df['lower_sent'].apply(tok_and_tag)
0 [(preliminary, JJ), (discourse, NN), (to, TO),...
1 [(d'alembert, NN), (claims, NNS), (that, IN), ...
2 [(however, RB), (,, ,), (as, IN), (the, DT), (...
3 [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: lower_sent, dtype: object
Additionally, we need some way to build the POS vocabulary; we can use collections.Counter and itertools.chain to flatten the list of lists:
>>> df['lower_sent']
0 preliminary discourse to the encyclopedia of d...
1 d'alembert claims that it would be ignorant to...
2 however, as the overemphasis on parental influ...
3 this can also be summarized as a distinguish b...
Name: lower_sent, dtype: object
>>> df['lower_sent'].apply(tok_and_tag)
0 [(preliminary, JJ), (discourse, NN), (to, TO),...
1 [(d'alembert, NN), (claims, NNS), (that, IN), ...
2 [(however, RB), (,, ,), (as, IN), (the, DT), (...
3 [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: lower_sent, dtype: object
>>> df['tagged_sent'] = df['lower_sent'].apply(tok_and_tag)
>>> tokens, tags = zip(*chain(*df['tagged_sent'].tolist()))
>>> tags
('JJ', 'NN', 'TO', 'DT', 'NN', 'IN', 'NN', 'NN', 'NNS', 'IN', 'PRP', 'MD', 'VB', 'JJ', 'TO', 'VB', 'IN', 'NN', 'MD', 'VB', 'VBN', 'IN', 'DT', 'JJ', 'NN', '.', 'RB', ',', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NN', 'IN', 'NNS', 'NN', 'VBZ', 'VBN', 'RB', 'VBN', 'IN', 'DT', 'JJ', 'NN', ',', 'JJ', 'NNS', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'RBR', 'JJ', 'NN', 'IN', 'NN', 'NN', 'IN', 'VBG', 'JJ', 'NN', 'NNS', '(', 'NN', 'CC', 'NN', ',', 'CD', ')', '.', 'DT', 'MD', 'RB', 'VB', 'VBN', 'IN', 'DT', 'JJ', 'NN', 'IN', 'DT', 'NNS', 'NN')
>>> set(tags)
{'CC', 'VB', ')', 'NNS', ',', 'JJ', 'VBZ', 'DT', 'NN', 'PRP', 'RBR', 'TO', 'VBD', '(', 'VBN', '.', 'MD', 'IN', 'RB', 'VBG', 'CD'}
>>> possible_tags = sorted(set(tags))
>>> possible_tags
['(', ')', ',', '.', 'CC', 'CD', 'DT', 'IN', 'JJ', 'MD', 'NN', 'NNS', 'PRP', 'RB', 'RBR', 'TO', 'VB', 'VBD', 'VBG', 'VBN', 'VBZ']
>>> possible_tags_counter = Counter({p:0 for p in possible_tags})
>>> possible_tags_counter
Counter({'NNS': 0, 'VBZ': 0, 'DT': 0, '(': 0, 'JJ': 0, 'VBD': 0, ')': 0, 'RB': 0, 'VBG': 0, 'RBR': 0, 'VB': 0, 'IN': 0, 'CC': 0, ',': 0, 'PRP': 0, 'CD': 0, 'VBN': 0, '.': 0, 'MD': 0, 'NN': 0, 'TO': 0})
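As a standalone sketch of the flattening step above (the tagged sentences here are made up, standing in for df['tagged_sent']):

```python
from collections import Counter
from itertools import chain

# Hypothetical tagged sentences, mimicking the structure of df['tagged_sent'].
tagged_sents = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('a', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
]

# chain(*...) flattens the list of lists into one stream of (token, tag)
# pairs; zip(*...) then splits the pairs into a tokens tuple and a tags tuple.
tokens, tags = zip(*chain(*tagged_sents))
print(tags)           # ('DT', 'NN', 'VBD', 'DT', 'NN', 'VBZ')
print(Counter(tags))  # Counter({'DT': 2, 'NN': 2, 'VBD': 1, 'VBZ': 1})
```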
To iterate through each tagged sentence and get the POS counts:
>>> df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
0 {'NN': 3, 'IN': 1, 'TO': 1, 'DT': 1, 'JJ': 1}
1 {'NN': 3, 'VB': 3, 'PRP': 1, 'TO': 1, 'DT': 1,...
2 {')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, 'NNS': 3,...
3 {'DT': 3, 'VB': 1, 'NN': 2, 'VBN': 1, 'NNS': 1...
Name: tagged_sent, dtype: object
>>> df['pos_counts'] = df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
>>> df['pos_counts']
0 {'NN': 3, 'IN': 1, 'TO': 1, 'DT': 1, 'JJ': 1}
1 {'NN': 3, 'VB': 3, 'PRP': 1, 'TO': 1, 'DT': 1,...
2 {')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, 'NNS': 3,...
3 {'DT': 3, 'VB': 1, 'NN': 2, 'VBN': 1, 'NNS': 1...
Name: pos_counts, dtype: object
# Now we can add in the POS tags that don't appear in the sentence with 0 counts:
>>> def add_pos_with_zero_counts(counter, keys_to_add):
... for k in keys_to_add:
... counter[k] = counter.get(k, 0)
... return counter
...
>>> df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
0 {'VB': 0, 'IN': 1, 'PRP': 0, 'DT': 1, 'CC': 0,...
1 {'VB': 3, ')': 0, 'DT': 1, 'CC': 0, 'RB': 0, '...
2 {'VB': 0, ')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, ...
3 {'VB': 1, 'IN': 2, 'PRP': 0, 'NN': 2, 'CC': 0,...
Name: pos_counts, dtype: object
>>> df['pos_counts_with_zero'] = df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
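Note that add_pos_with_zero_counts mutates its argument in place and returns it, so after this step df['pos_counts'] and df['pos_counts_with_zero'] hold the same Counter objects. A small sketch of that behavior, using a made-up counter:

```python
from collections import Counter

def add_pos_with_zero_counts(counter, keys_to_add):
    # Sets each missing key to 0; existing counts are left untouched.
    for k in keys_to_add:
        counter[k] = counter.get(k, 0)
    return counter

c = Counter({'NN': 2})
filled = add_pos_with_zero_counts(c, ['NN', 'VB', 'DT'])

# The counter is mutated in place: `filled` and `c` are the same object.
assert filled is c
assert c['NN'] == 2 and c['VB'] == 0 and c['DT'] == 0
```

Calling it a second time with the same keys is harmless, which is why applying it again when assigning df['pos_counts_with_zero'] gives the same result.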
Now combine the values into a list:
>>> df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
0 [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 3, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 0, 0, 1, 3, 2, 2, 3, 1, 1, 0, 0, ...
2 [1, 1, 3, 1, 1, 1, 3, 7, 6, 0, 11, 3, 0, 2, 1,...
3 [0, 0, 0, 0, 0, 0, 3, 2, 1, 1, 2, 1, 0, 1, 0, ...
Name: pos_counts_with_zero, dtype: object
>>> df['sent_vector'] = df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
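The trick here is that sorted(x.most_common()) orders the (tag, count) pairs alphabetically by tag, so every row's counts line up with sorted(possible_tags). A sketch with a made-up counter, including an equivalent form that skips the zero-filling step entirely (a Counter returns 0 for missing keys):

```python
from collections import Counter

possible_tags = ['DT', 'NN', 'VB']
x = Counter({'NN': 3, 'DT': 1, 'VB': 0})

# sorted(x.most_common()) sorts the (tag, count) pairs alphabetically by tag.
vec = [count for tag, count in sorted(x.most_common())]
assert vec == [1, 3, 0]  # counts for DT, NN, VB in that order

# Equivalent: index the Counter directly with the sorted tag list.
# Missing tags come back as 0, so no zero-filling is needed.
assert [x[t] for t in sorted(possible_tags)] == vec
```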
Now you need to create a new matrix to store the BoW:
>>> df2 = pd.DataFrame(df['sent_vector'].tolist())
>>> df2.columns = sorted(possible_tags)
And voilà:
>>> df2
( ) , . CC CD DT IN JJ MD ... NNS PRP RB RBR TO VB VBD \
0 0 0 0 0 0 0 1 1 1 0 ... 0 0 0 0 1 0 0
1 0 0 0 1 0 0 1 3 2 2 ... 1 1 0 0 1 3 0
2 1 1 3 1 1 1 3 7 6 0 ... 3 0 2 1 0 0 1
3 0 0 0 0 0 0 3 2 1 1 ... 1 0 1 0 0 1 0
VBG VBN VBZ
0 0 0 0
1 0 1 0
2 1 2 1
3 0 1 0
[4 rows x 21 columns]
In summary:
from collections import Counter
from itertools import chain
import pandas as pd
from nltk import word_tokenize, pos_tag
df = pd.read_csv('myfile.csv', delimiter=';')
df.columns = ['sent', 'tag']
tok_and_tag = lambda x: pos_tag(word_tokenize(x))
df['lower_sent'] = df['sent'].apply(str.lower)
df['tagged_sent'] = df['lower_sent'].apply(tok_and_tag)
possible_tags = sorted(set(list(zip(*chain(*df['tagged_sent'])))[1]))
def add_pos_with_zero_counts(counter, keys_to_add):
for k in keys_to_add:
counter[k] = counter.get(k, 0)
return counter
# Detailed steps.
df['pos_counts'] = df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
df['pos_counts_with_zero'] = df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
df['sent_vector'] = df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
# All in one.
df['sent_vector'] = df['tagged_sent'].apply(lambda x:
[count for tag, count in sorted(
add_pos_with_zero_counts(
Counter(list(zip(*x))[1]),
possible_tags).most_common()
)
]
)
df2 = pd.DataFrame(df['sent_vector'].tolist())
df2.columns = possible_tags
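As an alternative, pandas can do the zero-filling itself: constructing a DataFrame from a list of Counters aligns the keys into columns and leaves NaN where a tag is absent. A sketch with made-up data standing in for df['tagged_sent']:

```python
from collections import Counter
import pandas as pd

# Hypothetical tagged sentences standing in for df['tagged_sent'].
tagged = [
    [('the', 'DT'), ('cat', 'NN')],
    [('run', 'VB')],
]

counts = [Counter(tag for _, tag in sent) for sent in tagged]

# pd.DataFrame aligns the Counter keys into columns and leaves NaN where a
# tag is absent; fillna(0) supplies the zero counts.
bow = pd.DataFrame(counts).fillna(0).astype(int)
bow = bow.reindex(sorted(bow.columns), axis=1)  # sort columns by tag name

assert list(bow.columns) == ['DT', 'NN', 'VB']
```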
— alvas, 20.05.2017

Comments:
- alvas (20.05.2017): 'myfile.csv'? Could you print data.head()?
- ZverArt (20.05.2017): In data.head() I see 2 columns: a sentence and a tag. The sentence column stores the sentences, and the tag holds one of two values, yes or no.
- alvas (20.05.2017): pos_tag, right? Since for the desired output you just need a bag-of-words vector.