A few days ago I spent some hours working on my scraper for collecting the curious Chinese synonym data hidden on certain websites. There were several annoying problems, but the biggest headache was, as usual, an encoding issue.
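A URI may contain only ASCII characters. Unicode is still fully supported, but it has to be encoded as ASCII first: the text is turned into bytes (normally UTF-8), and every byte outside the allowed set is written as “%” followed by two hexadecimal digits. Python calls this “quoting”; I think “escaping” is what is meant, and the usual name for it is “percent-encoding”. It is what you apply to the “path” portion of a URI.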
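To make the mechanism concrete, here is a minimal sketch of percent-encoding done by hand. The function name percent_encode is hypothetical, and the sketch assumes UTF-8 and leaves only the "unreserved" characters (letters, digits, and - . _ ~) untouched; it is an illustration of the idea, not a replacement for a library routine.

```python
# Characters that may appear in a URI without being escaped (assumed set).
UNRESERVED = (
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(text: str) -> str:
    """Hypothetical helper: escape every byte that is not unreserved."""
    pieces = []
    for byte in text.encode("utf-8"):  # step 1: Unicode text -> UTF-8 bytes
        if byte in UNRESERVED:
            pieces.append(chr(byte))  # plain ASCII, keep as-is
        else:
            pieces.append(f"%{byte:02X}")  # step 2: byte -> %XX (uppercase hex)
    return "".join(pieces)


print(percent_encode("寨桑"))  # prints %E5%AF%A8%E6%A1%91
```

Each of the two characters becomes three UTF-8 bytes, hence six escaped bytes in the result.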
I was doing the escaping and unescaping manually, with a series of functions, for the longest time, but it suddenly dawned on me that there must be a standard way of handling this. Indeed there is:
In [1]: import urllib.parse as P
In [2]: a = '寨桑'
In [3]: P.quote(a)
Out[3]: '%E5%AF%A8%E6%A1%91'
In [4]: P.unquote(P.quote(a)) == a
Out[4]: True
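In a scraper you usually have a whole URL rather than a bare string, and only the path should be quoted. A minimal sketch, assuming a made-up address on example.com and a hypothetical helper named quote_url_path:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def quote_url_path(url: str) -> str:
    """Hypothetical helper: percent-encode only the path part of a URL."""
    parts = urlsplit(url)
    # quote() leaves "/" alone by default, so the path separators survive.
    return urlunsplit(parts._replace(path=quote(parts.path)))


url = "http://example.com/syn/寨桑"  # made-up URL, for illustration only
quoted = quote_url_path(url)
print(quoted)                  # prints http://example.com/syn/%E5%AF%A8%E6%A1%91
print(unquote(quoted) == url)  # prints True

# Passing safe="" escapes "/" as well, which matters when the text is a
# single path segment rather than a full path.
print(quote("a/b", safe=""))   # prints a%2Fb
```

This sketch deliberately ignores the query string and fragment; those have their own escaping rules.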
That’s all there is to it. As with everything else, you just have to know the library and methods to use.