首页 / 算法 / 【算法】基于树形结构分词
【算法】基于树形结构分词
内容导读
互联网集市收集整理的这篇技术教程文章主要介绍了【算法】基于树形结构分词,小编现在分享给大家,供广大互联网技能从业者学习和参考。文章包含4224字,纯文字阅读大概需要7分钟。
内容图文
![【算法】基于树形结构分词](/upload/InfoBanner/zyjiaocheng/1321/d2fb68819bc8495cb510db780d812a6a.jpg)
1
#
!/usr/bin/env python
2
#
encoding=gbk
3
import
os
4
import
sys
5
6 G_ENCODING="gbk" 7""" 8===============================
9中文分词
101. 机械分词
112. 统计分词
123. 理解分词
13===============================
14基于树形结构分词策略(结合机械分词,统计分词)
15例:笔记本电脑
16 dict = {"笔":0.8,"记":0.8,"本":0.8,"电":0.8,"脑":0.8,"笔记":0.9,"笔记本":0.9,"电脑":0.9,"笔记本电脑":0.9}
17 -------------------------------
18 | <s> |
19 -------------------------------
20 / / \ 21 [笔] [笔记] [笔记本] [笔记本电脑]
22 / / / 23 [记] [本] [电] [电脑]
24 / / 25 [本] [电] [电脑]
26 / \ /
27 [电] [电脑] [脑]
28 /
29[脑]
30-------------------------------
31path: 笔 记 本 电 脑 -- score: [0.32768]
32path: 笔 记 本 电脑 -- score: [0.4608]
33path: 笔记 本 电 脑 -- score: [0.4608]
34path: 笔记 本 电脑 -- score: [0.648]
35path: 笔记本 电 脑 -- score: [0.576]
36path: 笔记本 电脑 -- score: [0.81]
37path: 笔记本电脑 -- score: [0.9]
38 39best path: 笔记本电脑 -- score: [0.9]
40 41-------------------------------
421、路径加权(通过搜索引擎获取词语的词频,获得词语的权重)
432、最少切分、OOV、最少单字等策略
44==获取最佳分词路径
45-------------------------------
46Q1、如果句子过长,树非常大,遍历费时(需优化)
47Q2、字典加载(需优化)
48以下给出该思想的简单实现[python]:
49""" 50 51class Stack():
52def__init__(self, volume = 0):
53 self.list = [] if volume == 0 else [0 for i in range(0,volume)]
54 self.top = 0
55 56def push(self, element):
57if self.list != None:
58 self.top += 1
59 self.list[self.top] = element
60 61def pop(self):
62if self.list != None and self.top >= 0:
63 ele = self.list[self.top]
64 self.list[self.top] = None
65 self.top -= 1
66return ele
67return None
68def empty(self):
69return self.top == 0
70 71class Node():
72def__init__(self, data, next = None, prev = None, depth = 0, wlen = 0, weight = 0.0):
73 self.data = data
74 self.next = next if next != None else []
75 self.prev = prev
76 self.depth = depth
77 self.wlen = wlen
78 self.weight = weight
79 80def isLeaf(self):
81return self.next == None or self.next == []
82 83class Tree():
84def__init__(self, root = None):
85 self.root = root
86"""append a child node to child""" 87def append(self, node, cnode):
88if node != None and cnode != None:
89 node.next.append(cnode)
90 cnode.prev = node
91 cnode.depth = node.depth + 1
92return 0
93return -1
94 95"""depth first search(binary preorder)""" 96def depth_first_search(self, node):
97 list = []
98if node != None:
99 stack = Stack(30)
100 stack.push(node)
101whilenot stack.empty():
102 tmp = stack.pop()
103 list.append(tmp)
104for i in range(len(tmp.next) - 1, -1, -1):
105 stack.push(tmp.next[i])
106return list
107108class Tokenizer():
109"""init the tree"""110def load(self, tree, pnode, cache, dict):
111 clen = len(cache)
112for node in tree.depth_first_search(pnode):
113if node.isLeaf():
114 i = node.wlen
115 j = i
116while j < clen:
117 j += 1
118 tmp = cache[i:j].encode(G_ENCODING)
119if dict.has_key(tmp) or len(tmp) == 1:
120 tnode = Node(tmp, wlen = j, weight = dict.get(tmp))
121 tree.append(node, tnode)
122 self.load(tree, tnode, cache, dict)
123return 0
124"""backtrance"""125def backtrance(self, node, list):
126if node.prev != None and node.prev.data != "<s>":
127 list.append(node.prev)
128 self.backtrance(node.prev, list)
129return 0
130131def bestpath(self, tree):
132 highestScore = 0
133 bestpath = ""134for node in tree.depth_first_search(tree.root):
135"""find the leaf node and backtrance to find the bese path"""136if node.isLeaf():
137 list = [node]
138 self.backtrance(node, list)
139 list.reverse()
140"""141 1、路径加权(通过搜索引擎获取词语的词频,获得词语的权重)
142 2、最少切分、OOV、最少单字等策略
143 这里只是简单给出路径权重的乘积得分
144145"""146 sc = 1.0
147 tp = ""148for xn in list:
149 sc *= xn.weight if xn.weight > 0 else 1
150 tp += xn.data + ""151if sc > highestScore:
152 highestScore = sc
153 bestpath = tp.strip()
154print"path: %s -- score: [%s]"%(tp.strip(), sc)
155print"\nbest path: %s -- score: [%s]"%(bestpath, highestScore)
156return bestpath
157def example():
158 sent = "笔记本电脑"159 dict = {"笔":0.8,"记":0.8,"本":0.8,"电":0.8,"脑":0.8,"笔记":0.9,"笔记本":0.9,"电脑":0.9,"笔记本电脑":0.9}
160 cache = unicode(sent, G_ENCODING)
161 tokenizer = Tokenizer()
162 tree = Tree(Node("<s>"))
163"""init tree"""164 tokenizer.load(tree, tree.root, cache, dict)
165"""backtrance and find the best path"""166 tokenizer.bestpath(tree)
167 example()
原文:http://www.cnblogs.com/ariesblogs/p/4063106.html
内容总结
以上是互联网集市为您收集整理的【算法】基于树形结构分词全部内容,希望文章能够帮你解决【算法】基于树形结构分词所遇到的程序开发问题。 如果觉得互联网集市技术教程内容还不错,欢迎将互联网集市网站推荐给程序员好友。
内容备注
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 gblab@vip.qq.com 举报,一经查实,本站将立刻删除。
内容手机端
扫描二维码推送至手机访问。