How I Fine-Tuned RoBERTa to Classify Disaster Tweets: Step by Step

In this article I will walk you through the steps I took to fine-tune a RoBERTa model for this classification task, including data preprocessing, model training, and evaluation.

I hope that after reading it you will better understand how to fine-tune a language model for an NLP task and feel inspired to try it yourself. So, without further ado, let's get started!

Understanding the Problem

The problem comes from this Kaggle competition.

In this competition, participants are asked to build a machine learning model that classifies tweets as being about real disasters or not. The model is trained on a dataset of 10,000 hand-labeled tweets. The goal is to predict accurately, from the text alone, whether a tweet is about a real disaster. This is a natural language processing (NLP) problem, which means the model has to understand and interpret the meaning of the text in the tweets.

Here is an example of what the dataset looks like:

Choosing the Right Model

The first thing I look for is a "transformer" model. This architecture can achieve impressive results on NLP tasks. ChatGPT is a good example of a model built on the transformer architecture.

There are many good options. For this task, however, I chose RoBERTa.

RoBERTa (short for "Robustly Optimized BERT Pretraining Approach") is a language model developed by Facebook AI that builds on BERT. It is designed to improve on BERT by training on a larger dataset, using longer training sequences, and applying several other techniques that improve the model's ability to generalize to new tasks.

How Does RoBERTa Work?

1. The input text is first tokenized, that is, split into subword pieces. Each piece is then mapped to a unique integer ID, and the resulting sequence of integers is passed to the model as input.

2. The input sequence is processed by the model's encoder, which consists of a stack of transformer blocks. Each block takes a sequence of subword embeddings (vectors representing the subword pieces) and produces a new sequence of contextualized subword embeddings.

3. RoBERTa is an encoder-only model, so there is no separate decoder. Instead, the contextualized embeddings from the last encoder block are fed into a small task-specific head placed on top of the encoder.

4. The output of that head is the final prediction. During pretraining (masked language modeling), the prediction is a probability distribution over the vocabulary for each masked token. For a classification task such as ours, the prediction is the class label.
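
To make step 1 more concrete, here is a toy sketch of turning text into subword pieces and then into integer IDs. The vocabulary and the greedy longest-match rule are assumptions made purely for illustration; the real RoBERTa tokenizer uses a learned byte-level BPE vocabulary and does not work exactly like this.

```python
# Toy illustration only: TOY_VOCAB is hypothetical, and real RoBERTa uses a
# learned byte-level BPE vocabulary rather than this greedy matcher.
TOY_VOCAB = {"<unk>": 0, "earth": 1, "quake": 2, "flood": 3, "ing": 4, "in": 5, "town": 6}


def split_into_subwords(word, vocab):
    """Greedily take the longest vocabulary entry that prefixes the remaining text."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches the remaining text: mark the word as unknown
            return ["<unk>"]
    return pieces


def encode(text, vocab):
    """Lower-case, split on whitespace, then map every subword piece to its integer ID."""
    pieces = [p for word in text.lower().split() for p in split_into_subwords(word, vocab)]
    return pieces, [vocab[p] for p in pieces]


pieces, ids = encode("Earthquake flooding in town", TOY_VOCAB)
print(pieces)  # → ['earth', 'quake', 'flood', 'ing', 'in', 'town']
print(ids)     # → [1, 2, 3, 4, 5, 6]
```

The list of integers is what the encoder actually receives; everything after that operates on embeddings looked up from those IDs.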

Libraries

Some important libraries for our project:

# Importing the libraries needed
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import torch
import seaborn as sns
import transformers
import json
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaModel, RobertaTokenizer, get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
import logging


logging.basicConfig(level=logging.ERROR)

# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

Data Preparation

The first step in any machine learning project is obtaining and processing the data. Conveniently, Kaggle provides both a training and a test dataset. Both can easily be loaded into DataFrames.

import pandas as pd
data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
target_data = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
sample_submission = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

print(f'data shape => {data.shape}')
print(f'target shape => {target_data.shape}')
print(f'submission shape => {sample_submission.shape}')

data.columns

The code above gives us two main DataFrames: data and target_data. data will be used for training as well as for validation.

target_data will be used for the Kaggle competition submission.

Tweets can be very messy. They may contain nothing but links or @ mentions, which could lead our model to learn the wrong associations. To deal with this, we will clean the tweets.

# Based on the work of @borisdayma
#https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG-

def clean_tweet(tweet, allow_new_lines=False):
    """
    Clean a tweet by removing URLs, extra white space, and new lines.
    
    Parameters:
        tweet (str): The tweet to clean.
        allow_new_lines (bool, optional): Whether to allow new lines in the tweet. 
            Defaults to False.
            
    Returns:
        str: The cleaned tweet.
    """
    # Remove URLs that start with 'http:' or 'https:'
    bad_start = ['http:', 'https:']
    for w in bad_start:
        # Remove white space before the URL
        tweet = re.sub(f" {w}\\S+", "", tweet)
        # In case a tweet starts with a URL
        tweet = re.sub(f"{w}\\S+ ", "", tweet)
        # In case the URL is on a new line
        tweet = re.sub(f"\n{w}\\S+ ", "", tweet)
        # In case the URL is alone on a new line
        tweet = re.sub(f"\n{w}\\S+", "", tweet)
        # Any other case?
        tweet = re.sub(f"{w}\\S+", "", tweet)
    # Replace multiple spaces with a single space
    tweet = re.sub(r' +', ' ', tweet)
    # Collapse new lines into spaces unless new lines are allowed
    if not allow_new_lines:
        tweet = ' '.join(tweet.split())
    # Strip leading and trailing white space
    return tweet.strip()



def boring_tweet(tweet):
    """
    Check if a tweet is boring by checking if it contains fewer than 3 words
    that do not contain 'http', '@', or '#'.
    
    Parameters:
        tweet (str): The tweet to check.
        
    Returns:
        bool: True if the tweet is boring, False otherwise.
    """
    # Words that indicate a tweet is likely to be boring
    boring_stuff = ['http', '@', '#']
    # Count the number of words in the tweet that do not contain boring_stuff
    not_boring_words = len([None for w in tweet.split() if all(bs not in w.lower() for bs in boring_stuff)])
    # Return True if the tweet is boring, False otherwise
    return not_boring_words < 3
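
As a quick sanity check of the two ideas above, here is a condensed, stand-alone sketch. The helper names strip_urls and is_boring and the sample tweets are made up for this illustration; they are simplified stand-ins for clean_tweet and boring_tweet, not replacements for them.

```python
import re


def strip_urls(tweet: str) -> str:
    """Condensed stand-in for clean_tweet: drop http(s) URLs and collapse white space."""
    tweet = re.sub(r"https?:\S+", "", tweet)
    return " ".join(tweet.split())


def is_boring(tweet: str) -> bool:
    """Same rule as boring_tweet: fewer than 3 words free of 'http', '@', or '#'."""
    markers = ("http", "@", "#")
    informative = [w for w in tweet.split() if not any(m in w.lower() for m in markers)]
    return len(informative) < 3


# Hypothetical sample tweets
samples = [
    "Forest fire near La Ronge Sask. Canada http://t.co/abc",
    "@user #wow https://t.co/xyz",
]
for s in samples:
    cleaned = strip_urls(s)
    print(repr(cleaned), is_boring(cleaned))
# → 'Forest fire near La Ronge Sask. Canada' False
# → '@user #wow' True
```

The second tweet is dropped because, once the link is gone, nothing but a mention and a hashtag remains.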

Cleaning the DataFrame

data["text"] = data['text'].apply(clean_tweet)
target_data["text"] = target_data['text'].apply(clean_tweet)

# Add a 'is_boring' column to the data DataFrame
data['is_boring'] = data['text'].apply(boring_tweet)

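# Remove the boring tweets, drop the helper column, and renumber the rows
# (filtering leaves gaps in the index, which would break position-based lookups later)
data = data[~data['is_boring']].drop(columns=['is_boring']).reset_index(drop=True)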

print(f'data shape => {data.shape}')
data.head()

Creating the Dataset

Now we will create the tweetData class, a convenient way to represent and manipulate a collection of tweets for machine learning tasks. It is a subclass of PyTorch's Dataset class and provides methods for retrieving and preprocessing tweets from a DataFrame.

class tweetData(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=256):
        # Store the DataFrame, text of tweets, targets, tokenizer, and maximum length as instance variables
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = self.data.target
        self.max_len = max_len
    
    def __len__(self):
        # Return the number of tweets in the DataFrame
        return len(self.text)

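    def __getitem__(self, index):
        # Retrieve the text of the tweet at the given position
        text = str(self.text.iloc[index])
        # Remove extra white space from the tweet text
        text = " ".join(text.split())

        # Tokenize the tweet text using the stored tokenizer
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,  # Add special tokens to the tweet
            max_length=self.max_len,  # Maximum sequence length
            truncation=True,  # Truncate the tweet if it is longer than the maximum length
            padding='max_length',  # Pad the tweet if it is shorter than the maximum length
            return_token_type_ids=True  # Return token type IDs
        )
        # The original listing is cut off at this point. The rest mirrors the
        # tweetDataTarget class used for inference, plus the 'targets' entry
        # that the training loop reads.
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets.iloc[index], dtype=torch.long)
        }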
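
To show what max_length and padding='max_length' amount to, here is a minimal pure-Python sketch of padding a token sequence and building the matching attention mask. The function name pad_and_mask and the token IDs are hypothetical, and pad_id=1 is an assumption about the padding ID used by 'roberta-base'.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad token_ids to max_len and build the matching attention mask.

    Note: a real tokenizer keeps the closing special token when it truncates;
    this sketch simply cuts the list.
    """
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Hypothetical token IDs for a short tweet; 0 and 2 stand for the start and end markers
ids, mask = pad_and_mask([0, 713, 16, 10, 1296, 2], max_len=8)
print(ids)   # → [0, 713, 16, 10, 1296, 2, 1, 1]
print(mask)  # → [1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions are real tokens, so the padding does not influence the prediction.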

Key Variables

Before the training step we need to define a few key variables.

# Set the batch sizes and number of epochs for training and validation
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 10

# Set the learning rate
LEARNING_RATE = 1e-5

# Initialize the Roberta tokenizer with the 'roberta-base' pretrained model
# Set the truncation and lowercase options to True
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', truncation=True, do_lower_case=True)

DataLoaders

Next, we initialize the training and testing DataLoaders with the training and testing sets and their parameters. The DataLoader class provides an iterator over a dataset, so the tweets can be loaded and processed in batches during training and evaluation. This makes large datasets easier to handle and lets us feed the tweets to PyTorch models.

# Set the parameters for the training and testing DataLoaders
train_params = {'batch_size': TRAIN_BATCH_SIZE,  # Set the batch size for training
                'shuffle': True,  # Shuffle the training data at each epoch
                'num_workers': 0  # Use 0 workers to load the data
                }

test_params = {'batch_size': VALID_BATCH_SIZE,  # Set the batch size for testing
                'shuffle': True,  # Shuffle the testing data at each epoch
                'num_workers': 0  # Use 0 workers to load the data
                }

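# NOTE: the listing does not show how training_set and testing_set are built.
# The 80/20 split below is an assumption, not necessarily the original setup.
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
training_set = tweetData(train_df.reset_index(drop=True), tokenizer)
testing_set = tweetData(test_df.reset_index(drop=True), tokenizer)

# Initialize the training and testing DataLoaders with the training and testing sets and the corresponding parameters
training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)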
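
Conceptually, the batching part of a DataLoader can be pictured with the pure-Python sketch below. The function iterate_batches is a hypothetical helper written for this illustration; the real DataLoader additionally stacks the per-item tensors into batch tensors and can load data in worker processes.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of items from dataset, batch_size at a time."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Visit the items in a random order, as shuffle=True does for each epoch
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


tweets = [f"tweet {i}" for i in range(10)]
batches = list(iterate_batches(tweets, batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Note that the last batch can be smaller than batch_size when the dataset size is not a multiple of it.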

Neural Network for Fine-Tuning

Let's define a custom PyTorch module called RobertaClass that extends torch.nn.Module. In the constructor we load the pretrained "roberta-base" model and store it in self.l1.

We also initialize two linear layers, self.pre_classifier and self.classifier, with a dropout layer, self.dropout, between them.

The output of the model is a tensor of shape (batch_size, 2), where each row holds the raw scores (logits) for class 0 and class 1. Applying a softmax to a row turns the scores into class probabilities.

class RobertaClass(torch.nn.Module):
    def __init__(self):
        # Call the base class constructor
        super(RobertaClass, self).__init__()
        
        # Load the pre-trained 'roberta-base' model and store it in self.l1
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        
        # Initialize a linear layer with 768 input and output units
        self.pre_classifier = torch.nn.Linear(768, 768)
        
        # Initialize a dropout layer with a dropout rate of 0.35
        self.dropout = torch.nn.Dropout(0.35)
        
        # Initialize a linear layer with 768 input units and 2 output units
        self.classifier = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Run the input through the pre-trained model and get the hidden state
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        
        # Get the pooled output from the hidden state
        pooler = hidden_state[:, 0]
        
        # Pass the pooled output through the linear layer, ReLU activation, and dropout layers
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        
        # Pass the output of the dropout layer through the final linear layer
        output = self.classifier(pooler)
        
        # Return the output
        return output
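
Because forward returns the output of the final linear layer without a softmax, the two numbers per tweet are raw scores (logits). The sketch below shows, in plain Python, how such scores can be turned into probabilities and a predicted class; the logits are made-up values.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # hypothetical model output for one tweet: [class 0, class 1]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(probs, predicted)
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same predicted class.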

Fine-Tuning

Let's set up the loss function, the optimizer, and the scheduler for training the model.

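# These lines need a RobertaClass instance named `model`, so run them after the model has been created
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
# The scheduler is stepped once per batch, so the total number of steps is batches per epoch times epochs
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=len(training_loader) * EPOCHS)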

loss_function is defined as the cross-entropy loss using torch.nn.CrossEntropyLoss(). This loss is commonly used for classification tasks; it computes the cross-entropy between the predicted class scores (logits) and the ground-truth class labels.

optimizer is defined as an Adam optimizer using torch.optim.Adam().

scheduler is defined using the get_cosine_schedule_with_warmup() function from the transformers library. This function returns a learning rate scheduler that first ramps the learning rate up linearly during a warmup period and then decays it along a cosine curve.
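
To get a feel for the shape of this schedule, here is a small pure-Python sketch of the learning rate multiplier as I understand the formula: linear warmup followed by half a cosine wave. The function name cosine_warmup_factor and the total of 1000 steps are assumptions for illustration; this is not the library's own code.

```python
import math


def cosine_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup training that has been completed
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half a cosine wave: 1 at the end of warmup, 0 at the last step
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 1e-5
for step in (0, 5, 10, 505, 1000):
    print(step, base_lr * cosine_warmup_factor(step, num_warmup_steps=10, num_training_steps=1000))
```

The learning rate starts at zero, reaches its full value after the 10 warmup steps, is about half of that midway through, and falls back to zero at the final step.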

Now we will define a Trainer class that is responsible for training the model. It takes as input the model to train, the number of epochs to train it for, and the scheduler and optimizer to use during training.

class Trainer:
    def __init__(self, model, epochs, scheduler, optimizer):
        
        self.model = model
        self.epochs = epochs
        
        self.scheduler = scheduler
        self.optimizer = optimizer
        
        self.device = device
        self.model.to(self.device)
        
        self.historyLoss = []
        self.historyF1Score = []
        self.bestScore = 0


    def plotHistory(self):
        # Create a figure with 2 subplots
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

        # Plot the loss on the first subplot
        ax1 = sns.lineplot(x=range(len(self.historyLoss)), y=self.historyLoss, ax=ax1)
        ax1.set_title("Loss History")
        ax1.set_xlabel("Epoch")
        ax1.set_ylabel("Loss Value")

        # Plot the F1 score on the second subplot
        ax2 = sns.lineplot(x=range(len(self.historyF1Score)),
                           y=self.historyF1Score,
                           palette=sns.color_palette('Spectral', as_cmap = True), 
                           ax=ax2)
        ax2.set_title("F1 Score History")
        ax2.set_xlabel("Epoch")
        ax2.set_ylabel("F1 Score Value")

        plt.show()

    
    def saveModel(self,tokenizer,tokenizerPath="./",name="RoBERTa.bin"):
        torch.save(self.model, name)
        tokenizer.save_vocabulary(tokenizerPath)
    
        
    def train(self,training_loader,lossFunc=torch.nn.CrossEntropyLoss()):
        print("=> Starting Training")
        print(f"Learning Rate: {LEARNING_RATE}")
        print(f"Batch Size: {TRAIN_BATCH_SIZE}")
        
        
        for epoch in range(self.epochs):
            self.model.train()
            
            losses = []
            predictions = []
            targets = []
            
            for data in tqdm(training_loader, total=len(training_loader)):  

                ids = data['ids'].to(self.device, dtype=torch.long)
                mask = data['mask'].to(self.device, dtype=torch.long)
                token_type_ids = data['token_type_ids'].to(self.device, dtype=torch.long)
                target = data['targets'].to(self.device, dtype=torch.long)

                self.optimizer.zero_grad()
                
                # Use the model stored on the trainer rather than a global variable
                outputs = self.model(ids, mask, token_type_ids)
                loss = lossFunc(outputs, target)
                
                # Backpropagate, clip the gradients, then update the weights and the learning rate
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                self.optimizer.step()
                self.scheduler.step()
                
                # Record the loss, the true labels, and the predicted classes for this batch
                losses.append(loss.item())
                targets.extend(target.detach().cpu().numpy().tolist())
                predictions.extend(np.argmax(outputs.detach().cpu().numpy(), axis=1).tolist())
        
            trainLoss = np.mean(losses)
            # f1_score expects the true labels first, then the predictions
            trainScore = f1_score(targets, predictions)
            
            #Overfitting break
            if len(self.historyF1Score) > 1:
                if max(self.historyF1Score) > trainScore:
                    print("BREAK")
                    break
            
            self.historyLoss.append(trainLoss)
            self.historyF1Score.append(trainScore)
            
            self.bestScore = max(self.historyF1Score)
            
            print(f"=> {epoch + 1} <= epoch")
            print(f"Train Loss: {trainLoss}, Score: {trainScore}")
            
            
            
    
    def valid(self,testing_loader,lossFunc=torch.nn.CrossEntropyLoss()):
        # Switch the trainer's model to evaluation mode (disables dropout)
        self.model.eval()
        
        losses = []
        predictions = []
        targets = []
        
        # No gradients are needed during evaluation
        with torch.no_grad():
            for data in tqdm(testing_loader, total=len(testing_loader)):
                ids = data['ids'].to(self.device, dtype=torch.long)
                mask = data['mask'].to(self.device, dtype=torch.long)
                token_type_ids = data['token_type_ids'].to(self.device, dtype=torch.long)
                target = data['targets'].to(self.device, dtype=torch.long)
                
                outputs = self.model(ids, mask, token_type_ids)
                loss = lossFunc(outputs, target)
                
                losses.append(loss.item())
                targets.extend(target.cpu().numpy().tolist())
                predictions.extend(np.argmax(outputs.cpu().numpy(), axis=1).tolist())
                
        validLoss = np.mean(losses)
        # f1_score expects the true labels first, then the predictions
        validScore = f1_score(targets, predictions)
        
        print(f"Validation Loss: {validLoss}")
        print(f"Validation Score: {validScore}")
        
        return validLoss, validScore

The class has several methods:

plotHistory: This method plots the history of the training loss and the F1 score across epochs.
saveModel: This method saves the model and the tokenizer to disk.
train: This method trains the model for the given number of epochs. In each epoch it iterates over the training data and computes the loss for each batch, then updates the model weights using the optimizer and the scheduler. It also tracks the training loss and the F1 score for each epoch and stores them in historyLoss and historyF1Score, respectively.
valid: This method evaluates the model on the validation data. It computes the loss and the F1 score for the validation set and returns them.
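
Since the F1 score is the metric tracked throughout, here is a minimal pure-Python sketch of how it is computed for a binary problem. The function binary_f1 and the example labels are made up for illustration; the article itself relies on sklearn's f1_score.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
print(binary_f1(y_true, y_pred))  # 2 true positives, 1 false positive, 1 false negative
```

Here precision and recall are both 2/3, so the F1 score is 2/3 as well.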

Training and plotting:

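model = RobertaClass()
# Build the optimizer and scheduler from this model's parameters so that they update the model being trained
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=len(training_loader) * EPOCHS)
trainer = Trainer(model, EPOCHS, scheduler, optimizer)
trainer.train(training_loader)
trainer.plotHistory()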

Validating the Model

The next step is to validate the model on the test data. We will use the valid method defined in the Trainer class.

#FROM TRAINER CLASS
def valid(self,testing_loader,lossFunc=torch.nn.CrossEntropyLoss()):
        # Switch the trainer's model to evaluation mode (disables dropout)
        self.model.eval()
        
        losses = []
        predictions = []
        targets = []
        
        # No gradients are needed during evaluation
        with torch.no_grad():
            for data in tqdm(testing_loader, total=len(testing_loader)):
                ids = data['ids'].to(self.device, dtype=torch.long)
                mask = data['mask'].to(self.device, dtype=torch.long)
                token_type_ids = data['token_type_ids'].to(self.device, dtype=torch.long)
                target = data['targets'].to(self.device, dtype=torch.long)
                
                outputs = self.model(ids, mask, token_type_ids)
                loss = lossFunc(outputs, target)
                
                losses.append(loss.item())
                targets.extend(target.cpu().numpy().tolist())
                predictions.extend(np.argmax(outputs.cpu().numpy(), axis=1).tolist())
                
        validLoss = np.mean(losses)
        # f1_score expects the true labels first, then the predictions
        validScore = f1_score(targets, predictions)
        
        print(f"Validation Loss: {validLoss}")
        print(f"Validation Score: {validScore}")
        
        return validLoss, validScore

It takes two arguments: a PyTorch DataLoader object that provides access to the dataset, and an optional loss function.

To begin, the function puts the model in evaluation mode and initializes three lists to store the loss values, the model's predictions, and the true targets for each batch of data. It then enters a context block in which gradient computation is disabled, which saves memory and speeds up evaluation.

Next, the function iterates over the validation or test dataset and processes the batches one at a time.

Once the loop has processed all batches, the function computes the mean loss and the F1 score for the whole dataset and returns them as a tuple.

trainer.valid(testing_loader)

Inference

To predict on new data, we will create a class similar to tweetData. The only difference is that this class carries no target information, since it is used for inference only.

class tweetDataTarget(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=256):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.max_len = max_len
    
    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,  # Cut off tweets longer than max_len
            padding='max_length',
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
        }

As the next step, I created an inference function.

def inference(model, target_loader):
    # Set the model to evaluation mode
    model.eval()
    
    # Initialize an empty list to store the predictions
    y = []
    
    # Disable gradient computation
    with torch.no_grad():
        # Iterate over the target dataset
        for data in tqdm(target_loader):
            # Extract the input features and move them to the device
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
           
            # Pass the input features to the model and store the output predictions
            output = model(ids, mask, token_type_ids)
            
            # Convert the predictions to a NumPy array
            class_labels = output.detach().cpu().numpy()
            
            # Append the predictions to the list
            y.extend(np.argmax(class_labels, axis=1))

    # Return the list of predictions
    return y

Finally, let's make the final predictions.

# Define a dictionary of parameters for the target DataLoader
target_params = {
    'batch_size': VALID_BATCH_SIZE,
    'shuffle': False,
    'num_workers': 0
}

# Create a tweetDataTarget dataset from the target data and tokenizer
target_set = tweetDataTarget(target_data, tokenizer)

# Create a DataLoader for the target dataset using the defined parameters
target_loader = DataLoader(target_set, **target_params)

# Load the sample submission file
submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

# Use the inference function to get the model's predictions on the target dataset
submission['target'] = inference(model, target_loader)

# Save the predictions to a CSV file
submission.to_csv('../working/submission.csv', index=False)

Final score = 0.83

I hope you enjoyed this article. Feel free to follow me on Twitter, where I sometimes post about what I am working on or studying in machine learning and iOS development.

Further Improvements

This was my approach to the problem. It is by no means the only or the best one.

Use different models and combine them
Compare RoBERTa's performance with other models
Use RoBERTa XL
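
As one possible way to combine several models, the sketch below averages their class probabilities and picks the most likely class. The function name average_probabilities and all the numbers are hypothetical; this is only one simple ensembling scheme and not something that was tried in this article.

```python
def average_probabilities(per_model_probs):
    """Average class probabilities across models for each example."""
    n_models = len(per_model_probs)
    n_examples = len(per_model_probs[0])
    averaged = []
    for i in range(n_examples):
        # Collect the probability rows that every model produced for example i
        rows = [model_probs[i] for model_probs in per_model_probs]
        averaged.append([sum(col) / n_models for col in zip(*rows)])
    return averaged


# Hypothetical [P(class 0), P(class 1)] outputs of three models for two tweets
model_a = [[0.30, 0.70], [0.80, 0.20]]
model_b = [[0.45, 0.55], [0.40, 0.60]]
model_c = [[0.60, 0.40], [0.70, 0.30]]

avg = average_probabilities([model_a, model_b, model_c])
labels = [max(range(len(p)), key=p.__getitem__) for p in avg]
print(labels)  # → [1, 0]
```

Averaging probabilities lets a confident model outweigh an uncertain one, which a plain majority vote over labels would not do.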