Counting High-Frequency Words in a Chinese Article with Python

Posted by: shili8 | Published: 2025-02-07 12:18

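**Chinese Text Analysis in Python: High-Frequency Word Counting**

In natural language processing, high-frequency words are the words that occur most often in a text. They usually reflect the text's topic, content, or characteristic style. This article uses Python to count the high-frequency words in a Chinese article.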

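### **Dependencies**

We need the following libraries:

* `jieba`: Chinese word segmentation and part-of-speech tagging
* `collections`: word-frequency counting (part of the Python standard library, so no installation is needed)
* `matplotlib`: visualizing the results

The two third-party libraries can be installed with pip:

```bash
pip install jieba matplotlib
```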

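### **Data Preparation**

Suppose the text of the article is stored in a file named `article.txt`. We first read it into Python.

```python
# Read the article text
with open('article.txt', 'r', encoding='utf-8') as f:
    article = f.read()

# Convert to lowercase (this only affects Latin letters embedded in the text;
# Chinese characters have no case)
article = article.lower()
```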

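### **Word Segmentation**

Use the `jieba` library to segment the Chinese text into words:

```python
import jieba

# Segment the text; jieba.cut returns a generator of tokens
words = jieba.cut(article)
```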
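`jieba.cut` also yields punctuation, whitespace, and any non-Chinese tokens, so these would be counted alongside real words. Below is a minimal sketch of one way to keep only tokens made up entirely of Chinese characters. The helper name `keep_han_tokens` and the sample token list are hypothetical, and the regular expression covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), an assumption that should suit most modern text.

```python
import re

# Matches tokens made up only of characters in the basic CJK Unified
# Ideographs block. Rare characters in extension blocks are not covered.
HAN_ONLY = re.compile(r'[\u4e00-\u9fff]+')


def keep_han_tokens(tokens):
    """Return only the tokens that consist entirely of Chinese characters."""
    return [tok for tok in tokens if HAN_ONLY.fullmatch(tok)]


# Hand-written sample standing in for the output of jieba.cut(article)
sample_tokens = ['我们', '使用', ' ', 'python', '统计', '高频词', '，', '2025', '。']
han_tokens = keep_han_tokens(sample_tokens)
print(han_tokens)
```

In the pipeline described here, something like `words = keep_han_tokens(jieba.cut(article))` could apply this filter before counting.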

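### **Counting Word Frequencies**

Use `Counter` from the `collections` module to count how often each word occurs:

```python
from collections import Counter

# Count word frequencies
word_freq = Counter(words)
```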
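`Counter` also offers `most_common(n)`, which returns the `n` most frequent items as `(word, count)` pairs sorted by count in descending order. A small sketch with a hand-written token list (hypothetical data standing in for real segmentation output):

```python
from collections import Counter

# Hypothetical tokens standing in for the output of jieba.cut(article)
tokens = ['文本', '分析', '文本', '词频', '文本', '分析']

word_freq = Counter(tokens)

# The two most frequent words, highest count first
top_two = word_freq.most_common(2)
print(top_two)  # [('文本', 3), ('分析', 2)]
```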

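### **Filtering Out Low-Frequency Words**

We are usually interested only in the more frequent words, which can be selected by setting a threshold:

```python
# Keep only words that occur more than 10 times
high_freq_words = {word: freq for word, freq in word_freq.items() if freq > 10}
```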
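A fixed threshold of 10 is likely to suit a fairly long article, while a short one may have no word above it. The sketch below wraps the filter in a function so the threshold can be adjusted, and returns the result sorted by frequency. The function name `filter_by_count` and its `min_count` parameter are illustrative and not part of the original code.

```python
from collections import Counter


def filter_by_count(word_freq, min_count=10):
    """Keep words occurring more than min_count times, most frequent first."""
    kept = {w: c for w, c in word_freq.items() if c > min_count}
    # Sort by count (descending); dicts preserve insertion order in Python 3.7+
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Hypothetical counts for illustration
demo_freq = Counter({'数据': 4, '模型': 12, '的': 30, '训练': 11})
result = filter_by_count(demo_freq, min_count=10)
print(result)
```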

### **Visualizing the Results**

Use `matplotlib` to draw a bar chart of the word frequencies:

```python
import matplotlib.pyplot as plt

# Draw the word-frequency bar chart
plt.figure(figsize=(10, 6))
plt.bar(list(high_freq_words.keys()), list(high_freq_words.values()))
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('High Frequency Words Distribution')
plt.show()
```
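Two practical points about this chart: the default matplotlib fonts often lack Chinese glyphs, so the x-axis labels may render as empty boxes, and the bars appear in dictionary order rather than by frequency. The sketch below sorts the data first and sets a CJK-capable font. The font names are assumptions that depend on what is installed on the system, and `sorted_for_plot` and `plot_word_freq` are hypothetical helper names.

```python
def sorted_for_plot(freq, top_n=20):
    """Return (labels, counts) for the top_n words, most frequent first."""
    items = sorted(freq.items(), key=lambda item: item[1], reverse=True)[:top_n]
    labels = [word for word, _ in items]
    counts = [count for _, count in items]
    return labels, counts


def plot_word_freq(freq, top_n=20):
    """Draw a bar chart of the top_n words (requires matplotlib)."""
    # Imported here so that sorted_for_plot can be used without matplotlib
    import matplotlib.pyplot as plt

    # Assumed font names; replace with a CJK font installed on your system
    plt.rcParams['font.sans-serif'] = ['SimHei', 'Noto Sans CJK SC', 'Arial Unicode MS']
    plt.rcParams['axes.unicode_minus'] = False  # keep minus signs rendering correctly

    labels, counts = sorted_for_plot(freq, top_n)
    plt.figure(figsize=(10, 6))
    plt.bar(labels, counts)
    plt.xlabel('Word')
    plt.ylabel('Frequency')
    plt.title('High Frequency Words Distribution')
    plt.tight_layout()
    plt.show()


# Hypothetical counts for illustration
labels, counts = sorted_for_plot({'模型': 12, '的': 30, '训练': 11}, top_n=2)
print(labels, counts)
```

Calling `plot_word_freq(high_freq_words)` with the filtered dictionary would then draw the sorted chart.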

### **Complete Code**

Putting all the steps together in one function:

```python
from collections import Counter

import jieba
import matplotlib.pyplot as plt


def high_freq_words(article_path):
    # Read the article text
    with open(article_path, 'r', encoding='utf-8') as f:
        article = f.read()

    # Convert to lowercase (affects Latin letters only)
    article = article.lower()

    # Segment the text into words
    words = jieba.cut(article)

    # Count word frequencies
    word_freq = Counter(words)

    # Keep only words that occur more than 10 times
    high_freq = {word: freq for word, freq in word_freq.items() if freq > 10}

    # Draw the word-frequency bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(list(high_freq.keys()), list(high_freq.values()))
    plt.xlabel('Word')
    plt.ylabel('Frequency')
    plt.title('High Frequency Words Distribution')
    plt.show()

    # Return the counts so the caller can reuse them
    return high_freq


# Usage example
high_freq_words('article.txt')
```

This article showed how to use Python to count the high-frequency words in a Chinese article and provided a complete code example.
