In short:
First, let's add some headers to your csv so that the columns are easier to access:
>>> import pandas as pd
>>> df = pd.read_csv('myfile.csv', delimiter=';')
>>> df.columns = ['sent', 'tag']
>>> df['sent']
0 Preliminary Discourse to the Encyclopedia of D...
1 d'Alembert claims that it would be ignorant to...
2 However, as the overemphasis on parental influ...
3 this can also be summarized as a distinguish b...
Name: sent, dtype: object
>>> df['tag']
0 certain
1 certain
2 uncertain
3 uncertain
Now let's create a tok_and_tag function that runs word_tokenize followed by pos_tag:
>>> from nltk import word_tokenize, pos_tag
>>> from collections import Counter
>>> from itertools import chain
>>> tok_and_tag = lambda x: pos_tag(word_tokenize(x))
>>> df['sent'][0]
'Preliminary Discourse to the Encyclopedia of Diderot'
>>> tok_and_tag(df['sent'][0])
[('Preliminary', 'JJ'), ('Discourse', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('Encyclopedia', 'NNP'), ('of', 'IN'), ('Diderot', 'NNP')]
Then we can use df.apply to tokenize and tag the sentence column of the dataframe:
>>> df['sent'].apply(tok_and_tag)
0 [(Preliminary, JJ), (Discourse, NNP), (to, TO)...
1 [(d'Alembert, NN), (claims, NNS), (that, IN), ...
2 [(However, RB), (,, ,), (as, IN), (the, DT), (...
3 [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: sent, dtype: object
If you want the sentences lowercased:
>>> df['sent'].apply(str.lower)
0 preliminary discourse to the encyclopedia of d...
1 d'alembert claims that it would be ignorant to...
2 however, as the overemphasis on parental influ...
3 this can also be summarized as a distinguish b...
Name: sent, dtype: object
>>> df['lower_sent'] = df['sent'].apply(str.lower)
>>> df['lower_sent'].apply(tok_and_tag)
0 [(preliminary, JJ), (discourse, NN), (to, TO),...
1 [(d'alembert, NN), (claims, NNS), (that, IN), ...
2 [(however, RB), (,, ,), (as, IN), (the, DT), (...
3 [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: lower_sent, dtype: object
Additionally, we need some way to build the POS vocabulary; we can use collections.Counter and itertools.chain to flatten the list of lists:
>>> df['lower_sent']
0 preliminary discourse to the encyclopedia of d...
1 d'alembert claims that it would be ignorant to...
2 however, as the overemphasis on parental influ...
3 this can also be summarized as a distinguish b...
Name: lower_sent, dtype: object
>>> df['lower_sent'].apply(tok_and_tag)
0 [(preliminary, JJ), (discourse, NN), (to, TO),...
1 [(d'alembert, NN), (claims, NNS), (that, IN), ...
2 [(however, RB), (,, ,), (as, IN), (the, DT), (...
3 [(this, DT), (can, MD), (also, RB), (be, VB), ...
Name: lower_sent, dtype: object
>>> df['tagged_sent'] = df['lower_sent'].apply(tok_and_tag)
>>> tokens, tags = zip(*chain(*df['tagged_sent'].tolist()))
>>> tags
('JJ', 'NN', 'TO', 'DT', 'NN', 'IN', 'NN', 'NN', 'NNS', 'IN', 'PRP', 'MD', 'VB', 'JJ', 'TO', 'VB', 'IN', 'NN', 'MD', 'VB', 'VBN', 'IN', 'DT', 'JJ', 'NN', '.', 'RB', ',', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NN', 'IN', 'NNS', 'NN', 'VBZ', 'VBN', 'RB', 'VBN', 'IN', 'DT', 'JJ', 'NN', ',', 'JJ', 'NNS', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'RBR', 'JJ', 'NN', 'IN', 'NN', 'NN', 'IN', 'VBG', 'JJ', 'NN', 'NNS', '(', 'NN', 'CC', 'NN', ',', 'CD', ')', '.', 'DT', 'MD', 'RB', 'VB', 'VBN', 'IN', 'DT', 'JJ', 'NN', 'IN', 'DT', 'NNS', 'NN')
>>> set(tags)
{'CC', 'VB', ')', 'NNS', ',', 'JJ', 'VBZ', 'DT', 'NN', 'PRP', 'RBR', 'TO', 'VBD', '(', 'VBN', '.', 'MD', 'IN', 'RB', 'VBG', 'CD'}
>>> possible_tags = sorted(set(tags))
>>> possible_tags
['(', ')', ',', '.', 'CC', 'CD', 'DT', 'IN', 'JJ', 'MD', 'NN', 'NNS', 'PRP', 'RB', 'RBR', 'TO', 'VB', 'VBD', 'VBG', 'VBN', 'VBZ']
>>> possible_tags_counter = Counter({p:0 for p in possible_tags})
>>> possible_tags_counter
Counter({'NNS': 0, 'VBZ': 0, 'DT': 0, '(': 0, 'JJ': 0, 'VBD': 0, ')': 0, 'RB': 0, 'VBG': 0, 'RBR': 0, 'VB': 0, 'IN': 0, 'CC': 0, ',': 0, 'PRP': 0, 'CD': 0, 'VBN': 0, '.': 0, 'MD': 0, 'NN': 0, 'TO': 0})
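As a standalone sketch of the flattening step above (the tagged sentences here are made up, standing in for df['tagged_sent']):

```python
from collections import Counter
from itertools import chain

# Hypothetical tagged sentences, mimicking the structure of df['tagged_sent'].
tagged_sents = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('a', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
]

# chain(*...) flattens the list of lists into one stream of (token, tag)
# pairs; zip(*...) then splits the pairs into a tokens tuple and a tags tuple.
tokens, tags = zip(*chain(*tagged_sents))
print(tags)           # ('DT', 'NN', 'VBD', 'DT', 'NN', 'VBZ')
print(Counter(tags))  # Counter({'DT': 2, 'NN': 2, 'VBD': 1, 'VBZ': 1})
```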
To iterate through each tagged sentence and get the POS counts:
>>> df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
0 {'NN': 3, 'IN': 1, 'TO': 1, 'DT': 1, 'JJ': 1}
1 {'NN': 3, 'VB': 3, 'PRP': 1, 'TO': 1, 'DT': 1,...
2 {')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, 'NNS': 3,...
3 {'DT': 3, 'VB': 1, 'NN': 2, 'VBN': 1, 'NNS': 1...
Name: tagged_sent, dtype: object
>>> df['pos_counts'] = df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
>>> df['pos_counts']
0 {'NN': 3, 'IN': 1, 'TO': 1, 'DT': 1, 'JJ': 1}
1 {'NN': 3, 'VB': 3, 'PRP': 1, 'TO': 1, 'DT': 1,...
2 {')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, 'NNS': 3,...
3 {'DT': 3, 'VB': 1, 'NN': 2, 'VBN': 1, 'NNS': 1...
Name: pos_counts, dtype: object
# Now we can add in the POS tags that don't appear in the sentence with 0 counts:
>>> def add_pos_with_zero_counts(counter, keys_to_add):
... for k in keys_to_add:
... counter[k] = counter.get(k, 0)
... return counter
...
>>> df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
0 {'VB': 0, 'IN': 1, 'PRP': 0, 'DT': 1, 'CC': 0,...
1 {'VB': 3, ')': 0, 'DT': 1, 'CC': 0, 'RB': 0, '...
2 {'VB': 0, ')': 1, 'JJ': 6, 'NN': 11, 'CC': 1, ...
3 {'VB': 1, 'IN': 2, 'PRP': 0, 'NN': 2, 'CC': 0,...
Name: pos_counts, dtype: object
>>> df['pos_counts_with_zero'] = df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
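Note that add_pos_with_zero_counts mutates its argument in place and returns it, so after this step df['pos_counts'] and df['pos_counts_with_zero'] hold the same Counter objects. A small sketch of that behavior, using a made-up counter:

```python
from collections import Counter

def add_pos_with_zero_counts(counter, keys_to_add):
    # Sets each missing key to 0; existing counts are left untouched.
    for k in keys_to_add:
        counter[k] = counter.get(k, 0)
    return counter

c = Counter({'NN': 2})
filled = add_pos_with_zero_counts(c, ['NN', 'VB', 'DT'])

# The counter is mutated in place: `filled` and `c` are the same object.
assert filled is c
assert c['NN'] == 2 and c['VB'] == 0 and c['DT'] == 0
```

Calling it a second time with the same keys is harmless, which is why applying it again when assigning df['pos_counts_with_zero'] gives the same result.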
Now combine the values into a list:
>>> df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
0 [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 3, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 0, 0, 1, 3, 2, 2, 3, 1, 1, 0, 0, ...
2 [1, 1, 3, 1, 1, 1, 3, 7, 6, 0, 11, 3, 0, 2, 1,...
3 [0, 0, 0, 0, 0, 0, 3, 2, 1, 1, 2, 1, 0, 1, 0, ...
Name: pos_counts_with_zero, dtype: object
>>> df['sent_vector'] = df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
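The trick here is that sorted(x.most_common()) orders the (tag, count) pairs alphabetically by tag, so every row's counts line up with sorted(possible_tags). A sketch with a made-up counter, including an equivalent form that skips the zero-filling step entirely (a Counter returns 0 for missing keys):

```python
from collections import Counter

possible_tags = ['DT', 'NN', 'VB']
x = Counter({'NN': 3, 'DT': 1, 'VB': 0})

# sorted(x.most_common()) sorts the (tag, count) pairs alphabetically by tag.
vec = [count for tag, count in sorted(x.most_common())]
assert vec == [1, 3, 0]  # counts for DT, NN, VB in that order

# Equivalent: index the Counter directly with the sorted tag list.
# Missing tags come back as 0, so no zero-filling is needed.
assert [x[t] for t in sorted(possible_tags)] == vec
```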
Now you need to create a new matrix to store the BoW:
>>> df2 = pd.DataFrame(df['sent_vector'].tolist())
>>> df2.columns = sorted(possible_tags)
And voilà:
>>> df2
( ) , . CC CD DT IN JJ MD ... NNS PRP RB RBR TO VB VBD \
0 0 0 0 0 0 0 1 1 1 0 ... 0 0 0 0 1 0 0
1 0 0 0 1 0 0 1 3 2 2 ... 1 1 0 0 1 3 0
2 1 1 3 1 1 1 3 7 6 0 ... 3 0 2 1 0 0 1
3 0 0 0 0 0 0 3 2 1 1 ... 1 0 1 0 0 1 0
VBG VBN VBZ
0 0 0 0
1 0 1 0
2 1 2 1
3 0 1 0
[4 rows x 21 columns]
In summary:
from collections import Counter
from itertools import chain
import pandas as pd
from nltk import word_tokenize, pos_tag
df = pd.read_csv('myfile.csv', delimiter=';')
df.columns = ['sent', 'tag']
tok_and_tag = lambda x: pos_tag(word_tokenize(x))
df['lower_sent'] = df['sent'].apply(str.lower)
df['tagged_sent'] = df['lower_sent'].apply(tok_and_tag)
possible_tags = sorted(set(list(zip(*chain(*df['tagged_sent'])))[1]))
def add_pos_with_zero_counts(counter, keys_to_add):
for k in keys_to_add:
counter[k] = counter.get(k, 0)
return counter
# Detailed steps.
df['pos_counts'] = df['tagged_sent'].apply(lambda x: Counter(list(zip(*x))[1]))
df['pos_counts_with_zero'] = df['pos_counts'].apply(lambda x: add_pos_with_zero_counts(x, possible_tags))
df['sent_vector'] = df['pos_counts_with_zero'].apply(lambda x: [count for tag, count in sorted(x.most_common())])
# All in one.
df['sent_vector'] = df['tagged_sent'].apply(lambda x:
[count for tag, count in sorted(
add_pos_with_zero_counts(
Counter(list(zip(*x))[1]),
possible_tags).most_common()
)
]
)
df2 = pd.DataFrame(df['sent_vector'].tolist())
df2.columns = possible_tags
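As an alternative, pandas can do the zero-filling itself: constructing a DataFrame from a list of Counters aligns the keys into columns and leaves NaN where a tag is absent. A sketch with made-up data standing in for df['tagged_sent']:

```python
from collections import Counter
import pandas as pd

# Hypothetical tagged sentences standing in for df['tagged_sent'].
tagged = [
    [('the', 'DT'), ('cat', 'NN')],
    [('run', 'VB')],
]

counts = [Counter(tag for _, tag in sent) for sent in tagged]

# pd.DataFrame aligns the Counter keys into columns and leaves NaN where a
# tag is absent; fillna(0) supplies the zero counts.
bow = pd.DataFrame(counts).fillna(0).astype(int)
bow = bow.reindex(sorted(bow.columns), axis=1)  # sort columns by tag name

assert list(bow.columns) == ['DT', 'NN', 'VB']
```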
— alvas, 20.05.2017

Comments:
- alvas (20.05.2017): 'myfile.csv'? Could you print data.head()?
- ZverArt (20.05.2017): In data.head() I see 2 columns: a sentence and a tag. The sentence column stores the sentences, and the tag holds one of two values, yes or no.
- alvas (20.05.2017): pos_tag, right? Since for the desired output you just need a bag-of-words vector.