Skip to content
This repository was archived by the owner on Jan 10, 2023. It is now read-only.
This repository was archived by the owner on Jan 10, 2023. It is now read-only.

Iterator over internal node structure (feature request) #20

@Erotemic

Description

@Erotemic

It would be nice if there was a builtin way to iterate over the internal nodes in the trie like this:

            def _iternodes(self):
                """
                Generates all nodes in the trie
                """
                stack = deque([[self._root]])
                while stack:
                    for node in stack.pop():
                        yield node
                        stack.append(node.children.values())

That would make it much easier to write the Shortest Unique Prefix algorithm:

def shortest_unique_prefixes(items):
    """
    The shortest unique prefix algorithm.

    Args:
        items (list of str): returned prefixes will be unique wrt this set

    Returns:
        list of str: a prefix for each item that uniquely identifies it
           wrt to the original items.

    References:
        http://www.geeksforgeeks.org/find-all-shortest-unique-prefixes-to-represent-each-word-in-a-given-list/
        https://github.com/Briaares/InterviewBit/blob/master/Level6/Shortest%20Unique%20Prefix.cpp

    Requires:
        pip install pygtrie

    Doctest:
        >>> from pysseg.fnameproc import *
        >>> items = ["zebra", "dog", "duck", "dove"]
        >>> shortest_unique_prefixes(items)
        ['z', 'dog', 'du', 'dov']

    Timeing:
        >>> # make numbers larger to stress test
        >>> # L = max length of a string, N = number of strings,
        >>> # C = smallest gaurenteed common length
        >>> # (the setting N=10000, L=100, C=20 is feasible we are good)
        >>> import random
        >>> def make_data(N, L, C):
        >>>     rng = random.Random(0)
        >>>     return [''.join(['a' if i < C else chr(rng.randint(97, 122))
        >>>                      for i in range(L)]) for _ in range(N)]
        >>> items = make_data(N=1000, L=10, C=0)
        >>> %timeit shortest_unique_prefixes(items)
        17.5 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
        >>> items = make_data(N=1000, L=100, C=0)
        >>> %timeit shortest_unique_prefixes(items)
        141 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
        >>> items = make_data(N=1000, L=100, C=70)
        >>> %timeit shortest_unique_prefixes(items)
        141 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
        >>> items = make_data(N=10000, L=250, C=20)
        >>> %timeit shortest_unique_prefixes(items)
        3.55 s ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    """
    import pygtrie
    from collections import deque
    if len(set(items)) != len(items):
        raise ValueError('inputs must be unique')

    # construct tree
    tree = pygtrie.CharTrie.fromkeys(items, value=0)

    # Hack into the internal structure and insert frequencies at each node
    def _iternodes(self):
        """
        Generates all nodes in the trie
        """
        stack = deque([[self._root]])
        while stack:
            for node in stack.pop():
                yield node
                stack.append(node.children.values())

    # Set the value (frequency) of all (except the root) nodes to zero.
    for node in _iternodes(tree):
        if node is not tree._root:
            node.value = 0

    # For each item trace its path and increment frequencies
    for item in items:
        final_node, trace = tree._get_node(item)
        for key, node in trace[1:]:
            node.value += 1

    # Query for the first prefix with frequency 1 for each item.
    # This is the shortest unique prefix over all items.
    unique = []
    for item in items:
        freq = None
        for prefix, freq in tree.prefixes(item):
            if freq == 1:
                break
        assert freq == 1, 'item={} has no unique prefix'.format(item)
        unique.append(prefix)
    return unique

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions