(Corrected at bottom.)
Python's `in` operator tests for membership in a sequence:
```
In [1]: a = [1, 2, 3]

In [2]: 1 in a
Out[2]: True
```
But it does not work directly on nested sequences, such as a list of lists or a two-dimensional matrix:
```
In [3]: a = [[1, 2, 3], [4, 5, 6]]

In [4]: 1 in a
Out[4]: False
```
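What `in` actually compares here are the rows themselves, not their elements. A quick illustration (the `any()` form is one common way to get element-level membership):

```python
a = [[1, 2, 3], [4, 5, 6]]

# 'in' on a list of lists compares the needle against each row object.
print(1 in a)                      # False: 1 never equals a whole row
print([1, 2, 3] in a)              # True: this row is a member
print(any(1 in row for row in a))  # True: element-level membership
```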
What is the most efficient way to do that?
I thought of three approaches:
- Iteration through the first-level sequences composing the matrix, using a list comprehension
- Flattening the matrix first
- Creating the matrix in flat form to begin with
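The three approaches might be sketched like this (`itertools.chain` is one way to flatten, my choice here; the exact code I actually timed appears below):

```python
import itertools

matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
needle = 5.0

# 1. Iterate through the first-level sequences via a list comprehension.
found_comprehension = needle in [item for row in matrix for item in row]

# 2. Flatten the matrix first, then test membership on the result.
flat = list(itertools.chain.from_iterable(matrix))
found_flattened = needle in flat

# 3. Create the matrix in flat form to begin with.
flat_matrix = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
found_flat_form = needle in flat_matrix

print(found_comprehension, found_flattened, found_flat_form)  # True True True
```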
My prediction was that creating the matrix in flat form to begin with would be the fastest and that flattening the matrix first would be the slowest. Of course, flattening a matrix is likely to increase the running time.
I wrote five tests. For matrices with very small dimensions, my predictions are correct. But as the dimensions become large, iterating through the first-level sequences turned out to be significantly faster than all other methods. [Wrong! There was an error in my code. I've corrected it after the end of the original post.] Below I discuss only the two most interesting ones.
Results
I populated the matrix with random floating point numbers and searched for an additional random floating point number in it; that should give worst-case performance since it is unlikely that any 16-digit random numbers would appear twice in such relatively small samples.
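For reference, the same measurements can also be driven from Python itself with the `timeit` module; a minimal sketch, using a small size so it runs quickly:

```python
import timeit

# Same setup as the command-line runs below, but at i = j = 100.
setup = (
    "from random import random; "
    "i = j = 100; "
    "matrix = [random() for item in range(i * j)]"
)
elapsed = timeit.timeit("random() in matrix", setup=setup, number=1000)
print(elapsed)  # total seconds for 1000 membership tests
```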
The difference in speed for the plain use of the `in` operator looks like it must involve a difference in time complexity, but I doubt that there is such a difference; I'd be happy to hear from anyone who can explain this to me.
- Creating the matrix in flat form to begin with

```
python -m timeit -s '\
from random import random;\
i = j = 3;\
matrix = [random() for item in range(i*j)]' '\
random() in matrix'
```
Results:
```
# i = j = 10000; 10 loops, best of 3: 1.61 sec per loop
# i = j = 1000; 100 loops, best of 3: 15.9 msec per loop
# i = j = 100; 10000 loops, best of 3: 158 usec per loop
# i = j = 10; 1000000 loops, best of 3: 1.63 usec per loop
# i = j = 3; 10000000 loops, best of 3: 0.194 usec per loop
```
Comments:
- `in` list: O(n) (following the Python wiki page on time complexity)
- total complexity: O(n)

The observed running time increases by about a factor of 100 for every factor-of-100 increase in n; in other words, O(n).
- With use of the `in` operator and a list comprehension

```
python -m timeit -s '\
from random import random;\
i = j = 3;\
matrix = [[random() for row in range(j)] for column in range(i)]' '\
random() in [row for row in matrix]'
```
Results:
```
# i = j = 10000; 1000 loops, best of 3: 516 usec per loop
# i = j = 1000; 10000 loops, best of 3: 54.2 usec per loop
# i = j = 100; 100000 loops, best of 3: 6.52 usec per loop
# i = j = 10; 1000000 loops, best of 3: 1.2 usec per loop
# i = j = 3; 1000000 loops, best of 3: 0.457 usec per loop
```
Comments:
- one-dimensional list comprehension for square matrix: O(sqrt(n))
- `in` list: O(n)
- total complexity: O(sqrt(n)) + O(n) ∈ O(n)
Yet the observed running time increases by only about a factor of 10 for every factor-of-100 increase in n. That looks like O(sqrt(n)), rather than O(n). All I can think is that `in` must be implicitly optimized for use with list comprehensions. But a search of the documentation turns up nothing to that effect. (I haven't yet looked at the source code.)
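In hindsight, no special optimization is needed to explain this. As the correction below spells out, `[row for row in matrix]` only builds a list of the i row references, so the whole test does O(sqrt(n)) work, never touching individual elements:

```python
matrix = [[1.0, 2.0], [3.0, 4.0]]

# The comprehension copies row references; it does not flatten anything.
rows = [row for row in matrix]
print(rows == matrix)        # True: same rows in the same order
print(rows[0] is matrix[0])  # True: literally the same row objects

# The membership test then compares a float against i lists, so it
# can never match an element.
print(1.0 in rows)           # False
```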
Correction (same day)
Darius Bacon wrote: "Your test #2 looks wrong -- `random() in [row for row in matrix]` will check if `random()` is equal to one of the rows in `matrix`, not if it's equal to an element in any of the rows."
I checked it more carefully, and he's right. I rewrote the test as:
```
python -m timeit -s '\
from random import random;\
i = j = 3;\
matrix = [[random() for row in range(j)] for column in range(i)]' '\
random() in [item for row in matrix for item in row]'
```
Results:
```
# i = j = 10000; 10 loops, best of 3: 4.3 sec per loop
# i = j = 1000; 10 loops, best of 3: 44 msec per loop
# i = j = 100; 1000 loops, best of 3: 393 usec per loop
# i = j = 10; 100000 loops, best of 3: 5.83 usec per loop
# i = j = 3; 1000000 loops, best of 3: 1.1 usec per loop
```
Now we see the expected proportion: the running time changes by a factor of about 100 for every factor-of-100 change in n, just as the order-of-growth analysis predicts.
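A related note, not timed above: the corrected test still builds the entire flattened list before `in` ever runs, even when the needle sits in the first row. A generator with `any()` can short-circuit instead; a sketch:

```python
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

def contains(needle, matrix):
    # Scan row by row; any() stops at the first row containing the needle,
    # without materializing a flattened copy of the matrix.
    return any(needle in row for row in matrix)

print(contains(4.0, matrix))  # True
print(contains(7.0, matrix))  # False
```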
And, as originally anticipated, using a linear array is, after all, much faster than a list of lists.