URL encoding with urllib.quote

Posted on 2017-08-11, updated on 2024-01-15, filed under Crawlers

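Recently, while scraping product listings, I needed to store the Chinese category names found in the URLs in a database, but they all turned out to be percent-encoded. I then came across an explanation on Zhihu of what urllib.quote does. The session below is Python 2, where s is a UTF-8 byte string: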

In [1]: import urllib

In [2]: s = '历史上那些牛人们.pdf'

In [3]: print  urllib.quote(s.decode('utf-8').encode('gbk'))
%C0%FA%CA%B7%C9%CF%C4%C7%D0%A9%C5%A3%C8%CB%C3%C7.pdf

In [4]: print  urllib.quote(s)
%E5%8E%86%E5%8F%B2%E4%B8%8A%E9%82%A3%E4%BA%9B%E7%89%9B%E4%BA%BA%E4%BB%AC.pdf

In [5]: print  urllib.unquote(urllib.quote(s.decode('utf-8').encode('gbk'))).decode('gbk')
历史上那些牛人们.pdf

In [6]: print  urllib.unquote(urllib.quote(s)).decode('utf-8')
历史上那些牛人们.pdf

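So you first need to determine which encoding the page uses. Then call urllib.unquote() to undo the percent-encoding, and decode the resulting bytes with that encoding (GBK or UTF-8 in the examples above).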
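
The session above is Python 2. As a rough sketch of the same idea in Python 3, assuming the functions have moved to urllib.parse (quote, unquote_to_bytes) and that the page is either UTF-8 or GBK: the helper name unquote_with_fallback and the candidate list are hypothetical, made up for illustration, not part of any library.

```python
from urllib.parse import quote, unquote_to_bytes


def unquote_with_fallback(quoted, encodings=("utf-8", "gbk")):
    """Undo percent-encoding, then decode with the first encoding that fits.

    Hypothetical helper: the candidate encodings are an assumption.
    UTF-8 is tried first because it is strict; GBK accepts many byte
    sequences and may silently produce garbage if tried first.
    """
    raw = unquote_to_bytes(quoted)  # step 1: %XX escapes -> raw bytes
    for enc in encodings:
        try:
            return raw.decode(enc)  # step 2: bytes -> text
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched: %r" % (encodings,))


s = "历史上那些牛人们.pdf"

# In Python 3, quote() takes the target encoding directly.
gbk_quoted = quote(s, encoding="gbk")
utf8_quoted = quote(s)  # UTF-8 is the default

print(gbk_quoted)
print(utf8_quoted)
print(unquote_with_fallback(gbk_quoted))
print(unquote_with_fallback(utf8_quoted))
```

If the page's encoding is already known (for example from the response headers or a meta tag), it is probably simpler to skip the guessing and call urllib.parse.unquote(quoted, encoding=...) with that encoding directly.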