Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

Posted by benhoyt on Stack Overflow
Published on 2009-04-29T21:21:25Z Indexed on 2010/05/22 3:10 UTC

Filed under: python, url, unicode, utf-8

What's the best way (ideally a simple one using only the standard library) to convert a URL with Unicode characters in the domain name and path to the equivalent ASCII URL, with the domain encoded as IDNA and the path %-encoded, as per RFC 3986?

I get a URL from the user in UTF-8. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
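As a quick check of that expected output, the two halves can be encoded separately. This is only a minimal sketch and assumes Python 3, where `urllib.quote()` lives in `urllib.parse`; the variable names are illustrative.

```python
from urllib.parse import quote

# Assumption: Python 3, so these are text strings rather than UTF-8 byte strings.
host = '\u27a1.ws'   # the arrow domain, U+27A1
path = '/\u2665'     # the heart, U+2665

# The built-in 'idna' codec converts each non-ASCII label to its xn-- form.
ascii_host = host.encode('idna').decode('ascii')

# quote() percent-encodes the UTF-8 bytes of the path and leaves '/' alone.
ascii_path = quote(path)

print('http://' + ascii_host + ascii_path)
```

On Python 3 this prints http://xn--hgi.ws/%E2%99%A5, matching the target above.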

What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different urllib.quote() calls.

# Python 2 excerpt from inside a function; needs these imports
import re
import urllib

# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?
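For comparison, the same three steps (split, IDNA-encode the host, percent-encode the rest) can be sketched with the standard library's URL parser instead of a regex. This is a rough sketch rather than a definitive answer: it assumes Python 3, `url_to_ascii` is a hypothetical helper name, and it ignores user:password@ prefixes and bracketed IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def url_to_ascii(url):
    """Hypothetical helper: return an ASCII-only form of a Unicode URL.

    Assumptions: Python 3 text input; no user:password@ part; no IPv6 host.
    """
    parts = urlsplit(url)

    # .hostname is the lower-cased host without the port.
    host = parts.hostname or ''
    netloc = host.encode('idna').decode('ascii')
    if parts.port is not None:
        netloc += ':%d' % parts.port

    # '%' is kept safe so escapes already present are not encoded twice.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%')
    fragment = quote(parts.fragment, safe='%')

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(url_to_ascii('http://\u27a1.ws/\u2665'))
```

With the example URL from above this prints http://xn--hgi.ws/%E2%99%A5. Letting `urlsplit()` find the host and port avoids the regex's limits on scheme and TLD length, at the cost of the up-front validation the regex provides.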
