I wish to store URLs in a database (MySQL in this case) and process them in Python, though the database and programming language are probably not that relevant to my question.
In my setup I receive unicode strings when querying a text field in the database. But is a URL actually text? Is encoding from and decoding to unicode an operation that should be done to a URL? Or is it better to make the column in the database a binary blob?
So, how do you handle this problem?
Clarification: This question is not about urlencoding non-ASCII characters with the percent notation. It’s about the distinction between unicode strings, which represent text, and byte strings, which represent one way of encoding that text as a sequence of bytes. In Python (prior to 3.0) this distinction is between the unicode and the str types. In MySQL it is TEXT versus BLOB. So the concepts seem to correspond between programming language and database. But what is the best way to handle URLs in this scheme?
The relevant answer is found in RFC 2396, section 2.1 (URI and non-ASCII characters):
The relationship between URI and characters has been a source of confusion for characters that are not part of US-ASCII. To describe the relationship, it is useful to distinguish between a ‘character’ (as a distinguishable semantic entity) and an ‘octet’ (an 8-bit byte). There are two mappings, one from URI characters to octets, and a second from octets to original characters:
URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence of octets. That is because URI might be ‘transported’ by means that are not through a computer network, e.g., printed on paper, read over the radio, etc.
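To make the two mappings from the quoted passage concrete, here is a minimal sketch in Python 3 (where `str` is text and `bytes` is the octet sequence). The URL is a made-up example, and the choice of UTF-8 for the second mapping is an assumption: the URI itself does not say which charset its octets are in.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical example URL, held as text (as it would come out of a TEXT column).
url = "http://example.com/caf%C3%A9"

# The URI is a sequence of characters, and all of them are US-ASCII,
# so this encode step cannot fail for a well-formed URI.
url.encode("ascii")

# Mapping 1: URI characters -> octets (percent-escapes become raw bytes).
last_segment = url.rsplit("/", 1)[1]
octets = unquote_to_bytes(last_segment)

# Mapping 2: octets -> original characters.
# Assumption: the octets are UTF-8; this is not stated by the URI itself.
original = octets.decode("utf-8")

# Going back: original characters -> octets -> URI characters.
round_trip = quote(original.encode("utf-8"))
```

Read this way, the URL is text and fits a text column, while the byte-level question only comes up once you decode the percent-escapes.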