Unicode Character Database at Your Fingertips

Python’s aptly named unicodedata module gives you access to the Unicode Character Database, and with it the properties of every character.

Look up a character by name with lookup:

>>> import unicodedata
>>> unicodedata.lookup('RIGHT SQUARE BRACKET')
']'
>>> three_wise_monkeys = ["SEE-NO-EVIL MONKEY",
...                       "HEAR-NO-EVIL MONKEY",
...                       "SPEAK-NO-EVIL MONKEY"]
>>> ''.join(map(unicodedata.lookup, three_wise_monkeys))
'🙈🙉🙊'
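
lookup raises a KeyError when no character has the given name, so code that takes names from user input may want a fallback. A minimal sketch, where safe_lookup is a hypothetical helper name and not part of unicodedata:

```python
import unicodedata

def safe_lookup(name, default=None):
    """Return the character called `name`, or `default` if no such name exists.

    safe_lookup is an illustrative helper, not a unicodedata function.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # unicodedata.lookup signals an unknown name with KeyError
        return default

print(safe_lookup("GREEK SMALL LETTER ALPHA"))  # α
print(safe_lookup("NO SUCH CHARACTER NAME"))    # None
```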

Get a character’s name with name:

>>> unicodedata.name(u'~')
'TILDE'
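
Not every code point has a name: control characters such as the newline do not, and name raises a ValueError for them unless you pass a default as the second argument. A small sketch, where describe and the "<unnamed>" placeholder are choices made for this example:

```python
import unicodedata

def describe(text):
    """Pair each character with its Unicode name.

    The second argument to unicodedata.name is returned instead of
    raising ValueError when a character has no name.
    """
    return [(ch, unicodedata.name(ch, "<unnamed>")) for ch in text]

for ch, name in describe("a~\n"):
    print(repr(ch), name)
```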

Get the category of a character:

>>> unicodedata.category(u'X')
'Lu'
# L = letter, u = uppercase
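
Because the first letter of the category is the major class (L for letters, N for numbers, P for punctuation and so on), it works well as a filter. One possible use, assuming you want to drop every punctuation character from a string; strip_punctuation is a made-up name for this sketch:

```python
import unicodedata

def strip_punctuation(text):
    """Remove every character whose category starts with 'P' (punctuation)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Hello, world!"))  # Hello world
```

Unlike a check against string.punctuation, this also catches non-ASCII punctuation such as ¿.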

The unicodedata module also makes it easy to normalize Unicode strings, for example to strip accents:

>>> import unicodedata
>>> data = u'ïnvéntìvé'
>>> normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
>>> print(normal)
b'inventive'

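NFKD stands for Normalization Form Compatibility Decomposition: each character is decomposed by compatibility equivalence, and multiple combining characters are arranged in a specific order. After decomposition, an accented letter such as é becomes a plain e followed by a combining accent. Encoding to ASCII with 'ignore' then drops the non-ASCII combining marks and leaves only the base letters.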
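
Keep in mind that encoding to ASCII with 'ignore' throws away every non-ASCII character, not just the accents. If you want to keep the rest of the text intact, one alternative (a different technique from the snippet above, sketched here with a hypothetical strip_accents helper) is to decompose, drop only the combining marks, and recompose:

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks but keep all other characters.

    A sketch: unicodedata.combining returns 0 for characters that are
    not combining marks, so marks with combining class 0 are not removed.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

print(strip_accents("ïnvéntìvé"))  # inventive
print(strip_accents("Straße"))     # Straße

# For comparison, the ASCII approach loses the ß entirely:
print(unicodedata.normalize("NFKD", "Straße").encode("ASCII", "ignore"))  # b'Strae'
```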

To check which version of the Unicode Character Database is in use (the value depends on your Python version):

>>> unicodedata.unidata_version
'8.0.0'

Read about the module's other functions in the Python documentation.
