Python’s aptly named unicodedata module gives access to the Unicode Character Database and, through it, to the properties of every character.
Look up a character by name with lookup:
>>> import unicodedata
>>> unicodedata.lookup('RIGHT SQUARE BRACKET')
']'
>>> three_wise_monkeys = ["SEE-NO-EVIL MONKEY",
...                       "HEAR-NO-EVIL MONKEY",
...                       "SPEAK-NO-EVIL MONKEY"]
>>> ''.join(map(unicodedata.lookup, three_wise_monkeys))
'🙈🙉🙊'
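lookup raises KeyError when it does not know the name, so a small wrapper can be handy when names come from user input. The helper name safe_lookup below is hypothetical and not part of the module; this is only a minimal sketch:

```python
import unicodedata

def safe_lookup(name, default=None):
    """Return the character called `name`, or `default` if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() raises KeyError for names missing from the database
        return default

print(safe_lookup("RIGHT SQUARE BRACKET"))    # ]
print(safe_lookup("NO SUCH CHARACTER NAME"))  # None
```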
Get a character’s name with name:
>>> unicodedata.name(u'~')
'TILDE'
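Not every code point has a name: control characters such as the newline make name raise ValueError unless a default is passed as the second argument. A short sketch, where the placeholder string "<unnamed>" is an arbitrary choice:

```python
import unicodedata

# The second argument is returned instead of raising ValueError
for ch in ["~", "\n"]:
    print(repr(ch), unicodedata.name(ch, "<unnamed>"))

# For named characters, name() and lookup() are inverses of each other
round_trip = unicodedata.lookup(unicodedata.name("~"))
print(round_trip == "~")  # True
```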
Get the category of a character:
>>> unicodedata.category(u'X')
'Lu'
# L = letter, u = uppercase
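Because category returns a short two-letter code, it combines well with collections.Counter to summarize what a string contains. The function name category_counts is made up for this illustration:

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how many characters of `text` fall into each general category."""
    return Counter(unicodedata.category(ch) for ch in text)

counts = category_counts("Hello, World 42!")
# Lu = uppercase letter, Ll = lowercase letter, Nd = decimal digit,
# Po = other punctuation, Zs = space separator
print(counts["Lu"], counts["Ll"], counts["Nd"], counts["Po"], counts["Zs"])  # 2 8 2 2 2
```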
The unicodedata module also makes it easy to normalize Unicode strings, for example to remove accents:
>>> import unicodedata
>>> data = u'ïnvéntìvé'
>>> normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
>>> print(normal)
b'inventive'
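NFKD stands for Normalization Form Compatibility Decomposition: characters are decomposed by compatibility equivalence, and multiple combining characters are arranged in a specific order. In the example above, each accented letter becomes a base letter followed by a combining mark, and encoding to ASCII with 'ignore' then drops the marks.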
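Encoding to ASCII with 'ignore' also discards every other non-ASCII character and returns bytes. If the goal is only to drop accents and keep a str, one alternative is to filter out combining characters after decomposition. The helper strip_accents below is a hypothetical name, and the sketch also contrasts canonical and compatibility forms:

```python
import unicodedata

def strip_accents(text):
    """Decompose `text` with NFKD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns 0 for characters that are not combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("\u00efnv\u00e9nt\u00ecv\u00e9"))  # inventive

# Canonical forms: NFC composes, NFD decomposes (e + combining acute accent)
e_acute = "\u00e9"
print(len(unicodedata.normalize("NFC", e_acute)))  # 1
print(len(unicodedata.normalize("NFD", e_acute)))  # 2

# Compatibility forms also expand characters such as the "fi" ligature
ligature = "\ufb01"
print(unicodedata.normalize("NFD", ligature) == ligature)  # True
print(unicodedata.normalize("NFKD", ligature))             # fi
```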
To get the version of the Unicode Database used by the running interpreter (the value depends on your Python version):
>>> unicodedata.unidata_version
'8.0.0'
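Since unidata_version is a plain string, comparing it directly can mislead ('10.0.0' sorts before '8.0.0' as text). A small sketch that parses it into a tuple of integers first; the threshold (8, 0, 0) is only an example value:

```python
import unicodedata

# Turn a string such as '8.0.0' into a tuple like (8, 0, 0) for safe comparison
version = tuple(int(part) for part in unicodedata.unidata_version.split("."))

if version >= (8, 0, 0):
    print("Unicode 8.0 or newer:", unicodedata.unidata_version)
else:
    print("Older database:", unicodedata.unidata_version)
```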
The Python documentation describes the other functions the module provides.