(Corrected at bottom.)
Python's `in` operator tests for membership in a sequence:
```
In [1]: a = [1, 2, 3]

In [2]: 1 in a
Out[2]: True
```
But it does not work directly on nested sequences, such as a list of lists or a two-dimensional matrix:
```
In [3]: a = [[1, 2, 3], [4, 5, 6]]

In [4]: 1 in a
Out[4]: False
```
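What `in` actually compares here are the rows themselves, not their elements. A quick illustration (the `any()` form is one common way to get element-level membership):

```python
a = [[1, 2, 3], [4, 5, 6]]

# 'in' on a list of lists compares the needle against each row object.
print(1 in a)                      # False: 1 never equals a whole row
print([1, 2, 3] in a)              # True: this row is a member
print(any(1 in row for row in a))  # True: element-level membership
```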
What is the most efficient way to do that?
I thought of three approaches:
- Iteration through the first-level sequences composing the matrix, using a list comprehension
- Flattening the matrix first
- Creating the matrix in flat form to begin with
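The three approaches might be sketched like this (`itertools.chain` is one way to flatten, my choice here; the exact code I actually timed appears below):

```python
import itertools

matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
needle = 5.0

# 1. Iterate through the first-level sequences via a list comprehension.
found_comprehension = needle in [item for row in matrix for item in row]

# 2. Flatten the matrix first, then test membership on the result.
flat = list(itertools.chain.from_iterable(matrix))
found_flattened = needle in flat

# 3. Create the matrix in flat form to begin with.
flat_matrix = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
found_flat_form = needle in flat_matrix

print(found_comprehension, found_flattened, found_flat_form)  # True True True
```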
My prediction was that creating the matrix in flat form to begin with would be the fastest and that flattening the matrix first would be the slowest. Of course, flattening a matrix is likely to increase the running time.
I wrote five tests. For matrices with very small dimensions, my predictions are correct. But as the dimensions become large, iterating through the first-level sequences turned out to be significantly faster than all other methods. [Wrong! There was an error in my code. I've corrected it after the end of the original post.] Below I discuss only the two most interesting ones.
Results
I populated the matrix with random floating point numbers and searched for an additional random floating point number in it; that should give worst-case performance since it is unlikely that any 16-digit random numbers would appear twice in such relatively small samples.
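For reference, the same measurements can also be driven from Python itself with the `timeit` module; a minimal sketch, using a small size so it runs quickly:

```python
import timeit

# Same setup as the command-line runs below, but at i = j = 100.
setup = (
    "from random import random; "
    "i = j = 100; "
    "matrix = [random() for item in range(i * j)]"
)
elapsed = timeit.timeit("random() in matrix", setup=setup, number=1000)
print(elapsed)  # total seconds for 1000 membership tests
```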
The difference in speed for the plain use of the `in` operator looks like it must involve a difference in time complexity, but I doubt that there is such a difference; I'd be happy to hear from anyone who can explain this to me.
- Creating the matrix in flat form to begin with

```
python -m timeit -s '\
from random import random;\
i = j = 3;\
matrix = [random() for item in range(i*j)]' '\
random() in matrix'
```
Results:
```
# i = j = 10000; 10 loops, best of 3: 1.61 sec per loop
# i = j = 1000; 100 loops, best of 3: 15.9 msec per loop
# i = j = 100; 10000 loops, best of 3: 158 usec per loop
# i = j = 10; 1000000 loops, best of 3: 1.63 usec per loop
# i = j = 3; 10000000 loops, best of 3: 0.194 usec per loop
```
Comments:
- `in` list: O(n) (following the Python wiki page on time complexity)
- total complexity: O(n)

The observed running time increases by about a factor of 100 for every factor-of-100 increase in n; in other words, O(n).
- With use of the `in` operator and a list comprehension

```
python -m timeit -s '\
from random import random;\
i = j = 3;\
matrix = [[random() for row in range(j)] for column in range(i)]' '\
random() in [row for row in matrix]'
```
Results:
```
# i = j = 10000; 1000 loops, best of 3: 516 usec per loop
# i = j = 1000; 10000 loops, best of 3: 54.2 usec per loop
# i = j = 100; 100000 loops, best of 3: 6.52 usec per loop
# i = j = 10; 1000000 loops, best of 3: 1.2 usec per loop
# i = j = 3; 1000000 loops, best of 3: 0.457 usec per loop
```
Comments:
- one-dimensional list comprehension for square matrix: O(sqrt(n))
- `in` list: O(n)
- total complexity: O(sqrt(n)) + O(n) ∈ O(n)
Yet the observed running time increases by only about a factor of 10 for every factor-of-100 increase in n. That looks like O(sqrt(n)), rather than O(n). All I can think is that `in` must be implicitly optimized for use with list comprehensions. But a search of the documentation turns up nothing to that effect. (I haven't yet looked at the source code.)
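In hindsight, no special optimization is needed to explain this. As the correction below spells out, `[row for row in matrix]` only builds a list of the i row references, so the whole test does O(sqrt(n)) work, never touching individual elements:

```python
matrix = [[1.0, 2.0], [3.0, 4.0]]

# The comprehension copies row references; it does not flatten anything.
rows = [row for row in matrix]
print(rows == matrix)        # True: same rows in the same order
print(rows[0] is matrix[0])  # True: literally the same row objects

# The membership test then compares a float against i lists, so it
# can never match an element.
print(1.0 in rows)           # False
```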
Correction (same day)
Darius Bacon wrote: "Your test #2 looks wrong -- `random() in [row for row in matrix]` will check if `random()` is equal to one of the rows in `matrix`, not if it's equal to an element in any of the rows."
I checked it more carefully, and he's right. I rewrote the test as:
```
python -m timeit -s '\
from random import random;\
i = j = 3;\
matrix = [[random() for row in range(j)] for column in range(i)]' '\
random() in [item for row in matrix for item in row]'
```
Results:
```
# i = j = 10000; 10 loops, best of 3: 4.3 sec per loop
# i = j = 1000; 10 loops, best of 3: 44 msec per loop
# i = j = 100; 1000 loops, best of 3: 393 usec per loop
# i = j = 10; 100000 loops, best of 3: 5.83 usec per loop
# i = j = 3; 1000000 loops, best of 3: 1.1 usec per loop
```
Now we see the expected proportion: the running time changes by a factor of about 100 for every factor-of-100 change in n, just as the order-of-growth analysis predicts.
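A related note, not timed above: the corrected test still builds the entire flattened list before `in` ever runs, even when the needle sits in the first row. A generator with `any()` can short-circuit instead; a sketch:

```python
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

def contains(needle, matrix):
    # Scan row by row; any() stops at the first row containing the needle,
    # without materializing a flattened copy of the matrix.
    return any(needle in row for row in matrix)

print(contains(4.0, matrix))  # True
print(contains(7.0, matrix))  # False
```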
And, as originally anticipated, using a linear array is, after all, much faster than a list of lists.