# I. jieba

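# 1. What is jieba word segmentation?

Python offers several popular tokenizers for processing natural-language text. Commonly used ones include the following.

jieba: jieba is a very popular word-segmentation tool for Chinese text. It uses a prefix-dictionary-based algorithm to split Chinese text into words efficiently. Install it with pip: `pip install jieba`. The example below segments a Chinese sentence with jieba: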

```python
import jieba

text = "我喜欢Python编程语言"
seg_list = jieba.cut(text, cut_all=False)  # precise mode (cut_all=False is the default)
result = " ".join(seg_list)
print(result)
```

Output:

我 喜欢 Python 编程 语言
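To give a feel for what "prefix-dictionary-based" means, here is a toy sketch in plain Python. It is loosely modeled on the approach the jieba README describes (scan the sentence against a prefix dictionary to build a word graph, then pick the most probable path with dynamic programming), but it is not jieba's code. The tiny `FREQ` table and its counts are made up for illustration, and unknown-word handling is left out entirely.

```python
import math

# Toy word-frequency table (made-up counts, for illustration only).
FREQ = {"计算机": 40, "网络": 40, "计算机网络": 10, "定义": 10}


def build_prefix_dict(freq):
    """Return every word plus all of its proper prefixes (prefixes get count 0)."""
    prefix = dict(freq)
    for word in freq:
        for i in range(1, len(word)):
            prefix.setdefault(word[:i], 0)
    return prefix


def build_dag(sentence, prefix):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        frag = sentence[start]
        # Keep extending the fragment while it is still a known prefix.
        while end < n and frag in prefix:
            if prefix[frag] > 0:
                ends.append(end)
            end += 1
            frag = sentence[start:end + 1]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def segment(sentence, freq=FREQ):
    """Pick the path through the word graph with the highest log-probability."""
    prefix = build_prefix_dict(freq)
    dag = build_dag(sentence, prefix)
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 0) or 1) - log_total + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("计算机网络定义"))
```

With these made-up counts the sketch prefers 计算机 / 网络 / 定义. Raising the count of 计算机网络 enough makes the longer word win instead, which is the same trade-off a custom dictionary lets you influence.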

# 2. Extracting nouns only

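```python
import jieba.posseg as pseg

text = "我 喜欢 Python 编程语言"
words = pseg.cut(text)  # yields (word, POS tag) pairs
nouns = []
for word, flag in words:
    # POS tags that start with 'n' mark nouns; skip whitespace-only tokens
    if flag.startswith('n') and word.strip():
        nouns.append(word)

print(nouns)
```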
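The filter above can be pulled into a small reusable helper. The sketch below is only an illustration: `extract_nouns` is a hypothetical name, and the `sample` pairs are hand-written stand-ins for tagger output, not real jieba results. The tag prefixes follow the usual convention where `n` is a common noun and `nr`, `ns`, `nt`, `nz` are kinds of proper nouns.

```python
from collections import Counter


def extract_nouns(tagged_words):
    """Keep words whose POS tag starts with 'n' (n, nr, ns, nt, nz, ...)."""
    return [
        word
        for word, flag in tagged_words
        if flag.startswith("n") and word.strip()
    ]


# Hand-written (word, tag) pairs standing in for tagger output (assumed values).
sample = [
    ("我", "r"),
    ("喜欢", "v"),
    ("北京", "ns"),
    ("的", "uj"),
    ("编程语言", "n"),
    (" ", "x"),
    ("社区", "n"),
]

nouns = extract_nouns(sample)
noun_counts = Counter(nouns)  # handy when building keyword statistics
print(nouns)
```

Because the helper only needs an iterable of `(word, flag)` pairs, it should also accept the result of `pseg.cut(text)` directly.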

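# 3. Precise mode

Precise mode tries to cut the sentence into the most accurate sequence of words, which makes it a good fit for text analysis.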

```python
import jieba

text = "什么是计算机网络定义?"

# Precise mode
seg_list = jieba.cut(text, cut_all=False)
result = " ".join(seg_list)
print("Precise mode:", result)
```

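# 4. Full mode

Full mode scans out every dictionary word it can find in the sentence. It is fast, but it does not resolve ambiguity, so the resulting words overlap.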

```python
import jieba

text = "什么是计算机网络定义?"

# Full mode
seg_list = jieba.cut(text, cut_all=True)
result = " ".join(seg_list)
print("Full mode:", result)
```
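Why full-mode output contains overlapping words is easier to see with a toy version. The sketch below simply lists every substring that appears in a small, made-up dictionary (`TOY_DICT` is an assumption for illustration). It is a simplified stand-in for the idea, not jieba's implementation, and it ignores characters that no dictionary word covers.

```python
# Made-up dictionary for illustration; a real dictionary is far larger.
TOY_DICT = {"计算", "计算机", "算机", "网络", "计算机网络", "定义"}


def full_mode(sentence, dictionary=TOY_DICT):
    """Return every dictionary word found in the sentence, ordered by start position."""
    max_len = max(map(len, dictionary))
    found = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = sentence[start:end]
            if piece in dictionary:
                found.append(piece)
    return found


print(" ".join(full_mode("计算机网络定义")))
```

Every candidate is reported, so 计算, 计算机 and 计算机网络 all show up even though they cover the same characters.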

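# 5. Search engine mode

Search engine mode starts from the precise-mode result and splits long words again, which improves recall when building a search index.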

```python
import jieba

text = "什么是计算机网络定义?"

# Search engine mode
seg_list = jieba.cut_for_search(text)
result = " ".join(seg_list)
print("Search engine mode:", result)
```
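The "split long words again" step can be sketched as a post-processing pass over an existing segmentation. In the hedged example below, `search_mode` is a hypothetical helper and the dictionary is made up. For every word longer than two characters it first emits the two- and three-character pieces that are themselves dictionary words, then the word itself.

```python
def search_mode(words, dictionary):
    """Expand long words into shorter dictionary words, then keep the original word."""
    out = []
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in dictionary:
                        out.append(gram)
        out.append(word)
    return out


# Made-up dictionary and a hand-written precise-mode result (assumptions).
toy_dict = {"计算", "计算机", "算机", "网络", "计算机网络", "定义"}
precise = ["计算机网络", "定义"]

print(" ".join(search_mode(precise, toy_dict)))
```

A query for 网络 or 计算机 can now hit a document that was indexed under the long word 计算机网络, which is the recall gain this mode is after.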

# 6. Custom dictionary

Custom dictionary file: custom_dict.txt

```text
计算机网络拓扑的分类与特点
基带传输技术
10 Gigabit Ethernet
高速 Ethernet 的研究与发展
Internet 的网络结构
计算机网络定义
```

Segmentation code:

```python
import jieba

# Load the custom dictionary file
jieba.load_userdict('custom_dict.txt')

text = "什么是计算机网络定义?"
seg_list = jieba.cut(text, cut_all=False)
result = " ".join(seg_list)
print(result)
```

Result:

什么是计算机网络定义 ?
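A user dictionary holds one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces and in that order. The sketch below parses lines in that shape so the format is easier to picture. `parse_userdict_line` and the `LINE_RE` pattern are written for this illustration and should not be taken as jieba's own parser.

```python
import re

# word, optional " <digits>" frequency, optional " <lowercase letters>" POS tag
LINE_RE = re.compile(r"^(.+?)( [0-9]+)?( [a-z]+)?$")


def parse_userdict_line(line):
    """Return (word, freq or None, tag or None), or None for a blank line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    word, freq, tag = match.groups()
    return word, int(freq) if freq else None, tag.strip() if tag else None


for entry in ["计算机网络定义 10 n", "基带传输技术", "10 Gigabit Ethernet"]:
    print(parse_userdict_line(entry))
```

In this sketch an entry that contains spaces is kept whole only when its last token does not look like a frequency or a lowercase tag. For example, `高速 network` would be read as the word 高速 with the tag `network`, so entries with embedded spaces deserve a quick check after loading.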

# 7. Adding custom words

```python
import jieba
import json

with open('/Users/qinyingjie/Documents/python-workspace/web/chapter06-kg/05-分词/white_data.json', 'r',
          encoding='utf-8') as file:
    data = json.load(file)

# 'data' now holds the contents of the JSON file and can be processed as needed.
# For example, print it to inspect the contents.
print(data)

# Register each custom word with jieba.add_word
for word in data:
    jieba.add_word(word)

text = "什么是高速 Ethernet 的研究与发展?"
seg_list = jieba.cut(text, cut_all=False)
result = "/".join(seg_list)
print(result)
```
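The loading step can be made a little more defensive. The sketch below assumes the JSON file is a flat array of strings (the post does not show the contents of `white_data.json`, so that is a guess) and takes the registration function as a parameter. `register_words` is a hypothetical helper, and the demo uses a temporary file and a plain list in place of the real file and of `jieba.add_word`.

```python
import json
import os
import tempfile


def register_words(json_path, add_word):
    """Load a JSON array of words and pass each unique, non-empty one to add_word."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of strings")
    seen = set()
    for item in data:
        if not isinstance(item, str):
            continue  # ignore numbers, nulls, nested objects
        word = item.strip()
        if word and word not in seen:
            seen.add(word)
            add_word(word)
    return len(seen)


# Demo with a throwaway file; the list's append method stands in for jieba.add_word.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "white_data.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(["计算机网络定义", " 基带传输技术 ", "计算机网络定义", 42], fh, ensure_ascii=False)
    registered = []
    count = register_words(path, registered.append)

print(count, registered)
```

With jieba installed, the same helper could be called as `register_words(path, jieba.add_word)`.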

# II. NLTK

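NLTK: The Natural Language Toolkit (NLTK) is a widely used natural language processing library that provides a range of text-processing features, including tokenization. Its built-in tokenizers are aimed mainly at English and other space-delimited languages; for Chinese, a dedicated segmenter such as jieba is usually the more practical choice. Install it with pip: `pip install nltk`. The example below tokenizes an English sentence with NLTK: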

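```python
import nltk

# word_tokenize relies on the Punkt tokenizer data. If it is missing, download it once
# with nltk.download('punkt') (recent NLTK releases name the resource 'punkt_tab').
text = "I love natural language processing"
tokens = nltk.word_tokenize(text)
print(tokens)
```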

Output:

['I', 'love', 'natural', 'language', 'processing']
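For contrast, here is a deliberately naive regex tokenizer in plain Python. It is a simplified stand-in, not what `nltk.word_tokenize` does internally (NLTK also handles contractions, abbreviations and similar cases). `simple_tokenize` is a hypothetical name used only for this sketch.

```python
import re

# A run of word characters, or a single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word runs and standalone punctuation marks."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("I love natural language processing")
chinese = simple_tokenize("我喜欢Python编程语言")
print(tokens)
print(chinese)
```

The English sentence splits cleanly on whitespace, while the Chinese sentence comes back as a single token because nothing separates its words. That gap is what a dictionary-based segmenter such as jieba fills.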
