Speeding up “suggest” with python-Levenshtein (DEPRECATED)

sphobjinv uses fuzzywuzzy for fuzzy-match searching of object names/domains/roles as part of the Inventory.suggest() functionality, also implemented as the CLI suggest subcommand.

fuzzywuzzy uses difflib.SequenceMatcher from the Python standard library for its fuzzy searching. While earlier versions of sphobjinv were able to make use of fuzzywuzzy‘s optional link to python-Levenshtein, a Python C extension providing similar functionality, due to a licensing conflict this is no longer possible. sphobjinv now uses a vendored copy of fuzzywuzzy from an era when it was released under the MIT License.

Formally:

Removed in version 2.2: Acceleration of the sphobjinv “suggest” mode via python-Levenshtein has been deprecated and is no longer available.

The discussion of performance benchmarks and variations in matching behavior is kept below for historical interest.

Performance Benchmark

The chart below presents one dataset illustrating the performance enhancement that can be obtained by installing python-Levenshtein. The timings plotted here are from execution of timeit.repeat() around a suggest() call, searching for the term “function”, for a number of objects.inv files from different projects (see here).

The timings were collected using the following code:

import sphobjinv as soi

durations = {}
obj_counts = {}

for fn in os.listdir():
    if fn.endswith('.inv'):
        inv = soi.Inventory(fn)

        # Execute the 'suggest' operation 20 times for each file
        timings = timeit.repeat("inv.suggest('function')", repeat=20, number=1, globals=globals())

        # Store the average timing for each file
        durations.update({fn: sum(timings) / len(timings)})

        # Store the number of objects
        obj_counts.update({fn: inv.count})

As can be seen, the fifty-two objects.inv files in this dataset contain widely varying numbers of objects \(n\):

_images/suggest_timing.png

Unsurprisingly, larger inventories require more time to search. Also relatively unsurprisingly, the time required appears to be roughly \(O(n)\), since the fuzzy search must be performed once on the as_rst representation of each object.

For this specific search, using python-Levenshtein instead of difflib decreases the time required from 0.90 seconds per thousand objects down to 0.15 seconds per thousand objects, representing a performance improvement of almost exactly six-fold. Other searches will likely exhibit somewhat better or worse improvement from the use of python-Levenshtein, depending on the average length of the reST-like representations of the objects in an objects.inv and the length of the search term.

Variations in Matching Behavior

Note that the matching scores calculated by difflib and python-Levenshtein can often differ appreciably. (This is illustrated in issue #128 of the fuzzywuzzy GitHub repo.) This difference in behavior doesn’t have much practical significance, save for the potential of causing some confusion between users with/without python-Levenshtein installed.

As an example, the following shows an excerpt of the results of a representative CLI suggest call without python-Levenshtein:

$ sphobjinv suggest objects_scipy.inv surface -asit 40

  Name                                                   Score    Index
------------------------------------------------------  -------  -------
:py:function:`scipy.misc.face`                            64      1018
:py:function:`scipy.misc.source`                          64      1032
:std:doc:`generated/scipy.misc.face`                      64      4042
:std:doc:`generated/scipy.misc.source`                    64      4056
:py:data:`scipy.stats.rice`                               56      2688
:std:label:`continuous-rice`                              56      2896
:py:method:`scipy.integrate.complex_ode.successful`       51       156
:py:method:`scipy.integrate.ode.successful`               51       171
:py:function:`scipy.linalg.lu_factor`                     51       967

... more with score 51 ...

:py:attribute:`scipy.LowLevelCallable.signature`          50        5
:py:function:`scipy.constants.convert_temperature`        50       53
:py:function:`scipy.integrate.quadrature`                 50       176

... more with score 50 and below ...

This is a similar excerpt with python-Levenshtein:

  Name                                                   Score    Index
------------------------------------------------------  -------  -------
:py:function:`scipy.misc.face`                            64      1018
:py:function:`scipy.misc.source`                          64      1032
:std:doc:`generated/scipy.misc.face`                      64      4042
:std:doc:`generated/scipy.misc.source`                    64      4056
:py:method:`scipy.integrate.ode.successful`               51       171
:py:function:`scipy.linalg.lu_factor`                     51       967
:py:function:`scipy.linalg.subspace_angles`               51      1003

... more with score 51 ...

:py:function:`scipy.cluster.hierarchy.fcluster`           49       23
:py:function:`scipy.cluster.hierarchy.fclusterdata`       49       24
:py:method:`scipy.integrate.complex_ode.successful`       49       156

... more with score 49 and below ...