[AI in Practice] A Multi-Label Text Classification Model Based on the bert-base-chinese Pretrained Model, Using BCEWithLogitsLoss to Handle Class Imbalance

Posted by: shili8 | Published: 2024-11-07 07:42

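**A Multi-Label Text Classification Model Based on the BERT-Base-Chinese Pretrained Model**

In practice we often need to assign several labels to a single piece of text, for example in sentiment analysis of movie reviews or in recommendation based on product reviews. These scenarios call for a classifier that can handle multiple categories at once.

In this article we build a multi-label text classifier on top of the BERT-Base-Chinese pretrained model and train it with BCEWithLogitsLoss. This loss scores every label independently, and its optional `pos_weight` argument can up-weight rare positive labels to counter class imbalance.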

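**Problem Description**

Suppose we have a movie review dataset with two labels: "positive" and "negative". Our task is to predict, for each review, which of these labels apply.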
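A multi-label target is usually encoded as a multi-hot vector with one slot per label, so that each slot can be predicted independently. The following is a minimal sketch of that encoding; the label order `LABELS` and the helper `to_multi_hot` are hypothetical names introduced for illustration and do not come from the original.

```python
# Hypothetical label order; the original article does not specify one.
LABELS = ["positive", "negative"]

def to_multi_hot(tags, labels=LABELS):
    """Return one float per label: 1.0 if the label applies, else 0.0."""
    unknown = set(tags) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if name in tags else 0.0 for name in labels]

print(to_multi_hot(["positive"]))              # [1.0, 0.0]
print(to_multi_hot(["positive", "negative"]))  # [1.0, 1.0]
print(to_multi_hot([]))                        # [0.0, 0.0]
```

Floats are used because BCEWithLogitsLoss expects floating-point targets with the same shape as the logits.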

**Data Preparation**

First we need to prepare the dataset. Suppose we have a CSV file named `movie_comments.csv` in which each row is one movie review with two columns: `text` and `label`.

    import pandas as pd

    # Load the dataset
    df = pd.read_csv('movie_comments.csv')

    # Inspect the first few rows
    print(df.head())
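Since the goal is to deal with class imbalance, it may help to measure the imbalance before training. The sketch below computes, for each label, the ratio of negative to positive examples, which is a common choice for the `pos_weight` argument of BCEWithLogitsLoss. The helper `pos_weight_per_label` and the toy rows are assumptions made for illustration, not part of the original.

```python
def pos_weight_per_label(multi_hot_rows):
    """For each label, return (#rows without the label) / (#rows with it)."""
    num_rows = len(multi_hot_rows)
    num_labels = len(multi_hot_rows[0])
    weights = []
    for j in range(num_labels):
        positives = sum(1 for row in multi_hot_rows if row[j] == 1)
        negatives = num_rows - positives
        # If a label never occurs, fall back to a neutral weight of 1.0
        weights.append(negatives / positives if positives else 1.0)
    return weights

# Made-up toy data: 4 reviews, columns = [positive, negative]
rows = [[1, 0], [1, 0], [1, 0], [0, 1]]
print(pos_weight_per_label(rows))
```

A weight above 1 makes mistakes on the rare positive examples of that label cost more; a weight below 1 does the opposite.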

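**Model Design**

We use the BERT-Base-Chinese pretrained model as the encoder, add a linear head with one output per label, and train with BCEWithLogitsLoss to address class imbalance.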

    import torch
    from transformers import BertTokenizer, BertModel

    # Load the pretrained tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

    # Multi-label classifier: BERT encoder + dropout + linear head
    class MultiLabelClassifier(torch.nn.Module):
        def __init__(self, num_labels=2):
            super().__init__()
            # Keep BERT as a submodule so its weights are fine-tuned and so
            # forward() does not depend on a global variable named `model`
            self.bert = BertModel.from_pretrained('bert-base-chinese')
            self.dropout = torch.nn.Dropout(0.1)
            self.fc = torch.nn.Linear(self.bert.config.hidden_size, num_labels)  # 2 labels

        def forward(self, input_ids, attention_mask):
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            pooled_output = self.dropout(outputs.pooler_output)
            return self.fc(pooled_output)  # raw logits, one per label

    # Instantiate the model
    model = MultiLabelClassifier()

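**Training**

Next we split the dataset into a training set and a test set, wrap both in data loaders, and then train and evaluate the model.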

    import torch
    from sklearn.model_selection import train_test_split
    from torch.utils.data import DataLoader, TensorDataset

    # Split into training and test sets
    train_text, test_text, train_labels, test_labels = train_test_split(
        df['text'], df['label'], random_state=42, test_size=0.2)

    # The tokenizer and classifier defined earlier are reused here;
    # reloading BertModel into `model` would overwrite the classifier.

    def build_dataset(texts, labels):
        # Tokenize in one call; pad and truncate so the tensors can be batched
        inputs = tokenizer(list(texts), add_special_tokens=True, max_length=512,
                           padding=True, truncation=True,
                           return_attention_mask=True, return_tensors='pt')
        # Assumption: `label` holds an integer class id (0 or 1).
        # BCEWithLogitsLoss needs float targets shaped like the logits,
        # so the ids are one-hot encoded into shape (N, 2).
        ids = torch.tensor(list(labels), dtype=torch.long)
        targets = torch.nn.functional.one_hot(ids, num_classes=2).float()
        return TensorDataset(inputs['input_ids'], inputs['attention_mask'], targets)

    train_dataset = build_dataset(train_text, train_labels)
    test_dataset = build_dataset(test_text, test_labels)

    # Batch size
    batch_size = 32

    # Data loaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
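Reviews differ in length, and a data loader can only stack sequences of equal length, so each batch has to be padded. The sketch below shows the idea in plain Python: pad every sequence to the longest one in the batch and build the matching attention mask. The helper `pad_batch` is hypothetical, the token ids are made up, and `pad_id=0` is an assumption about the `[PAD]` id of the vocabulary. In practice the Hugging Face tokenizer can do this itself when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one; return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

# Made-up token ids, for illustration only
ids, mask = pad_batch([[101, 2769, 102], [101, 2769, 1599, 3614, 102]])
print(ids)
print(mask)
```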

    # Loss function and optimizer
    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    # Train the model
    for epoch in range(10):
        for batch in train_loader:
            input_ids, attention_mask, labels = batch
            optimizer.zero_grad()
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
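With its default arguments, BCEWithLogitsLoss weights every label equally. One way it can be made sensitive to imbalance is the `pos_weight` argument, which multiplies the loss of positive targets for each label. The sketch below compares the weighted and unweighted loss on dummy tensors; the weight values are hypothetical and would normally be derived from the label counts of the training set.

```python
import torch

# Hypothetical per-label weights (negatives / positives); not from the original
pos_weight = torch.tensor([0.5, 3.0])
weighted_criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
plain_criterion = torch.nn.BCEWithLogitsLoss()

# Dummy logits and float multi-hot targets, both of shape (batch, num_labels)
logits = torch.zeros(2, 2)
targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

weighted = weighted_criterion(logits, targets)
plain = plain_criterion(logits, targets)
print(weighted.item(), plain.item())
```

Here the positive target of the second label counts three times as much as a negative one, so the weighted loss is larger than the plain loss even though the logits are identical.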

    # Evaluate the model
    model.eval()
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = batch
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = torch.sigmoid(outputs)  # per-label probabilities
            print(predictions)
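The sigmoid outputs are per-label probabilities, so they still have to be turned into label decisions and scored. The following sketch thresholds the probabilities at 0.5 and computes a micro-averaged F1 score in plain Python. The helpers `decode` and `micro_f1`, the label names, the threshold, and the sample numbers are all assumptions chosen for illustration.

```python
def decode(probabilities, label_names, threshold=0.5):
    """Return the names of labels whose probability reaches the threshold."""
    return [name for name, p in zip(label_names, probabilities) if p >= threshold]

def micro_f1(predicted, actual):
    """Micro-averaged F1 over rows of 0/1 label indicators."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, actual):
        for p, t in zip(pred_row, true_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up probabilities and ground truth for two reviews
probs = [[0.9, 0.2], [0.4, 0.7]]
truth = [[1, 0], [1, 0]]
binary = [[1 if p >= 0.5 else 0 for p in row] for row in probs]

print(decode(probs[0], ["positive", "negative"]))
print(micro_f1(binary, truth))
```

On imbalanced data a fixed threshold of 0.5 is not always the best choice; tuning it per label on a validation set is a common refinement.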

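**Conclusion**

In this article we designed and trained a multi-label text classification model based on the BERT-Base-Chinese pretrained model, using BCEWithLogitsLoss to address class imbalance. The model scores each label independently, which lets it handle classification tasks with multiple labels; its actual performance should be confirmed with metrics on the held-out test set.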

