In Python 3.4 I used this method to find out if a static file on a remote site had changed:
- Request the file using urllib.request.
- Get the contents of the file by converting the HTTPResponse object to a bytes object using read().
- Find the hash() value of the contents and compare it to the hash value found the last time the same file was requested. The previous hash value had been saved to a local file for this purpose.
To my surprise, the file contents had a different hash every day that I ran this function. But when I compared the actual contents of the files received each day, I could find no differences between them.
It turns out that hash randomization has been enabled by default since Python 3.3. (See https://docs.python.org/3/whatsnew/3.3.html#builtin-functions-and-types) Unless PYTHONHASHSEED is set to a fixed value, each Python process picks a random seed at startup, so the built-in hash function returns a different value for the same str or bytes input from one run to the next. Here is an illustration at the command line:
$ python -V
Python 2.7.6
$ python -c 'print(hash("a"))'
12416037344
$ python -c 'print(hash("a"))'
12416037344
$ python -c 'print(hash("a"))'
12416037344
$ python -c 'print(hash("a"))'
12416037344
$ python3 -V
Python 3.4.0
$ python3 -c 'print(hash("a"))'
-2245055084068035824
$ python3 -c 'print(hash("a"))'
104473749347316555
$ python3 -c 'print(hash("a"))'
-8144659526577460286
$ python3 -c 'print(hash("a"))'
6778650719047758351
$ python3 -c 'print(hash("a"))'
-3871396616392495287
$
This doesn't matter if a hash value is going to be used only within a single run of Python, but where it is going to be saved and used again across different runs, the problem I described above will occur.
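The same effect can be checked from inside a script. The following is a minimal sketch, assuming Python 3.7 or later (for capture_output); the helper name hash_in_new_process is made up for illustration. It starts a fresh interpreter with PYTHONHASHSEED set explicitly and reports the value hash() returns there.

```python
import os
import subprocess
import sys


def hash_in_new_process(text: str, seed: str) -> int:
    """Return hash(text) as computed by a fresh interpreter started with the given seed.

    Hypothetical helper for illustration; seed is passed through PYTHONHASHSEED.
    """
    # Copy the current environment and override only the hash seed.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # A fixed seed gives the same value in every new process.
    print(hash_in_new_process("a", "0"))
    print(hash_in_new_process("a", "0"))
    # "random" restores the default behaviour, so these two will almost always differ.
    print(hash_in_new_process("a", "random"))
    print(hash_in_new_process("a", "random"))
```

With a fixed seed the first two values match; with "random" each new process draws its own seed, which is what the python3 session above shows.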
Rather than going to the trouble of setting PYTHONHASHSEED to 0 each time I run this code, I replaced hash(content) with hashlib.md5(content).hexdigest(). This standard hash function gives identical output on identical input.
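Putting the pieces together, the change check might look roughly like the sketch below. It is not my exact code: the function names, the URL and the digest file name are placeholders chosen for illustration, and the first run always reports a change because no digest has been saved yet.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch(url: str) -> bytes:
    """Download the file and return its contents as a bytes object."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def content_digest(content: bytes) -> str:
    """Return a digest that is the same in every Python run, unlike hash()."""
    return hashlib.md5(content).hexdigest()


def has_changed(content: bytes, digest_path: Path) -> bool:
    """Compare content with the digest saved by the previous run, then save the new digest."""
    new_digest = content_digest(content)
    old_digest = digest_path.read_text().strip() if digest_path.exists() else None
    digest_path.write_text(new_digest)
    return new_digest != old_digest


# Example call (placeholder URL and file name, not executed here):
# changed = has_changed(fetch("https://example.com/static/file.txt"), Path("last_digest.txt"))
```

MD5 is used here only to detect changes between downloads; any other hashlib algorithm, such as hashlib.sha256, would work the same way.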