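Problem Description
Read a text file, count how often each word occurs (ignoring case and punctuation), and output the top N words in descending order of frequency.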
Example
Input: file contents: "Hello world! Hello Python. Hello everyone."
Output: hello: 3, world: 1, python: 1, everyone: 1
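Hints
Read the file with open(), split the text into words with split(), and count frequencies with collections.Counter. Remember to strip punctuation before splitting.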
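To see why the cleanup step matters, here is a minimal sketch that compares a naive split() with one that lowercases the text and removes punctuation first. It uses the string from the example; the variable names (sample, naive, cleaned, table) are illustrative only, and str.translate is one possible way to remove punctuation, not the only one.

```python
from collections import Counter
import string

sample = "Hello world! Hello Python. Hello everyone."

# Naive: split on whitespace only, so case and punctuation stay in the tokens.
naive = Counter(sample.split())

# Cleaned: lowercase, then map every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = Counter(sample.lower().translate(table).split())

print(naive.most_common(2))
print(cleaned.most_common(2))
```

In the naive count, "world!" and "Python." are stored with their punctuation attached, so they would not match "world" or "python" elsewhere in a longer text.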
Reference Solution
from collections import Counter
import string


def word_frequency(filename, top_n=10):
    """Return the top_n most frequent words in a text file."""
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    # Lowercase so that counting is case-insensitive
    text = text.lower()
    # Replace each punctuation character with a space
    for char in string.punctuation:
        text = text.replace(char, ' ')
    # Split on whitespace and count
    words = text.split()
    counter = Counter(words)
    return counter.most_common(top_n)
# Example: create a test file (written with the same encoding it is read with)
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write("Hello world! Hello Python. Hello everyone. Python is great!")
# Test
result = word_frequency('test.txt')
for word, count in result:
    print(f"{word}: {count}")
# Prints one pair per line: hello: 3, python: 2, world: 1, everyone: 1, is: 1, great: 1
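One possible variation, sketched under a few assumptions: string.punctuation covers only ASCII punctuation, and f.read() loads the whole file into memory at once. The hypothetical function word_frequency_regex below (not part of the reference solution) reads the file line by line, extracts words with a regular expression, and breaks ties alphabetically so the output order does not depend on where a word first appears.

```python
import re
from collections import Counter


def word_frequency_regex(filename, top_n=10):
    """Count words line by line; ties are broken alphabetically."""
    counter = Counter()
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            # [^\W_]+ matches runs of letters and digits (Unicode-aware),
            # so punctuation of any kind acts as a separator.
            counter.update(re.findall(r"[^\W_]+", line.lower()))
    # Sort by descending count, then by word, for a deterministic order.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Like the reference solution, this splits contractions such as "don't" into two tokens; whether that is acceptable depends on how the exercise defines a word.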