# Speeding up “suggest” with python-Levenshtein¶

sphobjinv uses fuzzywuzzy for fuzzy-match searching of object names/domains/roles as part of the Inventory.suggest() functionality, also implemented as the CLI suggest subcommand.

By default, fuzzywuzzy uses difflib.SequenceMatcher from the Python standard library for its fuzzy searching. However, it also can use python-Levenshtein, a Python C extension providing similar functionality. Thus, sphobjinv has been implemented with python-Levenshtein as an optional dependency: it will be used if it is available.

## Installation¶

On Linux, simple installation via pip install python-Levenshtein should work properly in the vast majority of cases. This may work relatively smoothly for MacOS as well, but first-hand confirmation was not (yet) available as of this writing.

On Windows, again as of this writing, no binary wheel is available on PyPI. So, the easiest way to install is to download a wheel from Christoph Gohlke’s repository and run pip install C:\path\to\wheel. Just make sure to match Python version and CPU type (win32 vs win_amd64) when you download.

## Performance Benchmark¶

The chart below presents one dataset illustrating the performance enhancement that can be obtained by installing python-Levenshtein. The timings plotted here are from execution of timeit.repeat() in the Python standard library around a suggest() call, searching for the term “function”, for a number of objects.inv files from different projects (see here).

The timings were collected using the following code:

import sphobjinv as soi

durations = {}
obj_counts = {}

for fn in os.listdir():
if fn.endswith('.inv'):
inv = soi.Inventory(fn)

# Execute the 'suggest' operation 20 times for each file
timings = timeit.repeat("inv.suggest('function')", repeat=20, number=1, globals=globals())

# Store the average timing for each file
durations.update({fn: sum(timings) / len(timings)})

# Store the number of objects
obj_counts.update({fn: inv.count})


As can be seen, the fifty-two objects.inv files in this dataset contain widely varying numbers of objects $$n$$:

Unsurprisingly, larger inventories require more time to search. Also relatively unsurprisingly, the time required appears to be roughly $$O(n)$$, since the fuzzy search must be performed once on the as_rst representation of each object.

For this specific search, using python-Levenshtein instead of difflib decreases the time required from 0.90 seconds per thousand objects down to 0.15 seconds per thousand objects, representing a performance improvement of almost exactly six-fold. Other searches will likely exhibit somewhat better or worse improvement from the use of python-Levenshtein, depending on the average length of the reST-like representations of the objects in an objects.inv and the length of the search term.

## Variations in Matching Behavior¶

Note that the matching scores calculated by difflib and python-Levenshtein can often differ appreciably. (This is illustrated in issue #128 of the fuzzywuzzy GitHub repo.) This difference in behavior doesn’t have much practical significance, save for the potential of causing some confusion between users with/without python-Levenshtein installed.

As an example, the following shows an excerpt of the results of a representative CLI suggest call without python-Levenshtein:

\$ sphobjinv suggest objects_scipy.inv surface -asit 40

Name                                                   Score    Index
------------------------------------------------------  -------  -------
:py:function:scipy.misc.face                            64      1018
:py:function:scipy.misc.source                          64      1032
:std:doc:generated/scipy.misc.face                      64      4042
:std:doc:generated/scipy.misc.source                    64      4056
:py:data:scipy.stats.rice                               56      2688
:std:label:continuous-rice                              56      2896
:py:method:scipy.integrate.complex_ode.successful       51       156
:py:method:scipy.integrate.ode.successful               51       171
:py:function:scipy.linalg.lu_factor                     51       967

... more with score 51 ...

:py:attribute:scipy.LowLevelCallable.signature          50        5
:py:function:scipy.constants.convert_temperature        50       53
:py:function:scipy.integrate.quadrature                 50       176

... more with score 50 and below ...


This is a similar excerpt with python-Levenshtein:

  Name                                                   Score    Index
------------------------------------------------------  -------  -------
:py:function:scipy.misc.face                            64      1018
:py:function:scipy.misc.source                          64      1032
:std:doc:generated/scipy.misc.face                      64      4042
:std:doc:generated/scipy.misc.source                    64      4056
:py:method:scipy.integrate.ode.successful               51       171
:py:function:scipy.linalg.lu_factor                     51       967
:py:function:scipy.linalg.subspace_angles               51      1003

... more with score 51 ...

:py:function:scipy.cluster.hierarchy.fcluster           49       23
:py:function:scipy.cluster.hierarchy.fclusterdata       49       24
:py:method:scipy.integrate.complex_ode.successful       49       156

... more with score 49 and below ...