This repository was archived by the owner on Jan 10, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 129
This repository was archived by the owner on Jan 10, 2023. It is now read-only.
Iterator over internal node structure (feature request) #20
Copy link
Copy link
Open
Description
It would be nice if there was a builtin way to iterate over the internal nodes in the trie like this:
def _iternodes(self):
"""
Generates all nodes in the trie
"""
stack = deque([[self._root]])
while stack:
for node in stack.pop():
yield node
stack.append(node.children.values())That would make it much easier to write the Shortest Unique Prefix algorithm:
def shortest_unique_prefixes(items):
"""
The shortest unique prefix algorithm.
Args:
items (list of str): returned prefixes will be unique wrt this set
Returns:
list of str: a prefix for each item that uniquely identifies it
wrt to the original items.
References:
http://www.geeksforgeeks.org/find-all-shortest-unique-prefixes-to-represent-each-word-in-a-given-list/
https://github.com/Briaares/InterviewBit/blob/master/Level6/Shortest%20Unique%20Prefix.cpp
Requires:
pip install pygtrie
Doctest:
>>> from pysseg.fnameproc import *
>>> items = ["zebra", "dog", "duck", "dove"]
>>> shortest_unique_prefixes(items)
['z', 'dog', 'du', 'dov']
Timeing:
>>> # make numbers larger to stress test
>>> # L = max length of a string, N = number of strings,
>>> # C = smallest gaurenteed common length
>>> # (the setting N=10000, L=100, C=20 is feasible we are good)
>>> import random
>>> def make_data(N, L, C):
>>> rng = random.Random(0)
>>> return [''.join(['a' if i < C else chr(rng.randint(97, 122))
>>> for i in range(L)]) for _ in range(N)]
>>> items = make_data(N=1000, L=10, C=0)
>>> %timeit shortest_unique_prefixes(items)
17.5 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> items = make_data(N=1000, L=100, C=0)
>>> %timeit shortest_unique_prefixes(items)
141 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> items = make_data(N=1000, L=100, C=70)
>>> %timeit shortest_unique_prefixes(items)
141 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> items = make_data(N=10000, L=250, C=20)
>>> %timeit shortest_unique_prefixes(items)
3.55 s ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
import pygtrie
from collections import deque
if len(set(items)) != len(items):
raise ValueError('inputs must be unique')
# construct tree
tree = pygtrie.CharTrie.fromkeys(items, value=0)
# Hack into the internal structure and insert frequencies at each node
def _iternodes(self):
"""
Generates all nodes in the trie
"""
stack = deque([[self._root]])
while stack:
for node in stack.pop():
yield node
stack.append(node.children.values())
# Set the value (frequency) of all (except the root) nodes to zero.
for node in _iternodes(tree):
if node is not tree._root:
node.value = 0
# For each item trace its path and increment frequencies
for item in items:
final_node, trace = tree._get_node(item)
for key, node in trace[1:]:
node.value += 1
# Query for the first prefix with frequency 1 for each item.
# This is the shortest unique prefix over all items.
unique = []
for item in items:
freq = None
for prefix, freq in tree.prefixes(item):
if freq == 1:
break
assert freq == 1, 'item={} has no unique prefix'.format(item)
unique.append(prefix)
return uniqueMetadata
Metadata
Assignees
Labels
No labels