Python, Unicode and Ancient Greek
Coded Character Sets versus Character Encodings
A coded character set assigns an integer to each character.
A character encoding represents a sequence of those integers as bytes.
For example, the Greek lowercase lambda is assigned the number 955 in Unicode. This is more commonly expressed in hex and written U+03BB or, in Python, "\u03bb".
UTF-8 is just one way of encoding Unicode characters. In UTF-8, the Greek lowercase lambda is the byte sequence CE BB (or, in Python 3: b"\xce\xbb").
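The round trip between the code point and the UTF-8 bytes can be checked directly in Python 3 using the built-in chr and the standard encode/decode methods:

```python
# U+03BB is code point 955; chr() turns the integer into the character
assert chr(955) == "\u03bb" == "λ"

# encoding as UTF-8 yields the two-byte sequence CE BB
assert "λ".encode("utf-8") == b"\xce\xbb"

# decoding those bytes yields the character back
assert b"\xce\xbb".decode("utf-8") == "λ"
```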
Unicode Strings versus Byte Strings in Python 2 and 3
In Python 2, a literal string is a string of bytes in a character encoding:
>>> word = "λόγος"
>>> word
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
>>> len(word)
10
>>> type(word)
<type 'str'>
You can prefix a literal string with u to have it be treated as a string of Unicode characters:
>>> word = u"λόγος"
>>> word
u'\u03bb\u03cc\u03b3\u03bf\u03c2'
>>> len(word)
5
>>> type(word)
<type 'unicode'>
You can convert between bytes and unicode with the .decode and .encode methods:
>>> word_bytes = "λόγος"
>>> word_bytes
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
>>> word_unicode = word_bytes.decode("utf-8")
>>> word_unicode
u'\u03bb\u03cc\u03b3\u03bf\u03c2'
>>> word_unicode.encode("utf-8")
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
Note you can’t encode a Unicode string with a character encoding that doesn’t support the characters:
>>> word_unicode.encode("ASCII")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
In Python 2, source files need to be explicitly marked as UTF-8 with coding: utf-8 in a comment in the first couple of lines.
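For example, per PEP 263 either a bare coding: utf-8 comment or the Emacs-style form below works; here is a minimal Python 2 file header (the declaration line is harmless in Python 3, which assumes UTF-8 anyway):

```python
# -*- coding: utf-8 -*-

# without the coding declaration, Python 2 rejects this non-ASCII literal
word = u"λόγος"
```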
When you read a string from a file, you need to .decode it to convert it from bytes to Unicode characters and, when you write a string to a file, you need to .encode it to convert it from Unicode characters to bytes.
In Python 3, a literal string is assumed to be a string of Unicode characters:
>>> word = "λόγος"
>>> word
'λόγος'
>>> len(word)
5
>>> type(word)
<class 'str'>
Source files in Python 3 are assumed to be UTF-8. Reading and writing of files, unless marked as binary, will do the decode/encode for you (and assume a UTF-8 encoding by default).
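A sketch of this in Python 3 (the file name word.txt is just for illustration): text mode does the UTF-8 encode/decode for you, while binary mode hands you the raw bytes. Passing encoding="utf-8" explicitly is a good habit, since the default can depend on the platform:

```python
# write in text mode: the str is encoded to UTF-8 bytes for us
with open("word.txt", "w", encoding="utf-8") as f:
    f.write("λόγος")

# read in text mode: bytes are decoded back to a str
with open("word.txt", encoding="utf-8") as f:
    assert f.read() == "λόγος"

# read in binary mode: we get the raw UTF-8 bytes instead
with open("word.txt", "rb") as f:
    assert f.read() == "λόγος".encode("utf-8")
```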
Greek versus Greek Extended
(0370–03FF vs 1F00–1FFF)
The relevant Combining Diacritical Marks in the 0300–036F range are:
|Code Point||Glyph||Name||Category||Combining Class||Decomposition|
|U+0300||◦̀||COMBINING GRAVE ACCENT||Mn||230|
|U+0301||◦́||COMBINING ACUTE ACCENT||Mn||230|
|U+0313||◦̓||COMBINING COMMA ABOVE||Mn||230|
|U+0314||◦̔||COMBINING REVERSED COMMA ABOVE||Mn||230|
|U+0342||◦͂||COMBINING GREEK PERISPOMENI||Mn||230|
|U+0343||◦̓||COMBINING GREEK KORONIS||Mn||230||0313|
|U+0344||◦̈́||COMBINING GREEK DIALYTIKA TONOS||Mn||230||0308 0301|
|U+0345||◦ͅ||COMBINING GREEK YPOGEGRAMMENI||Mn||240|
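The names and combining classes in the table above can be confirmed from Python's unicodedata module; note that YPOGEGRAMMENI is the one mark with class 240 rather than 230:

```python
import unicodedata

# print the code point, name and canonical combining class of each mark
for mark in ["\u0300", "\u0301", "\u0313", "\u0314", "\u0342", "\u0345"]:
    print("U+%04X" % ord(mark), unicodedata.name(mark),
          unicodedata.combining(mark))
```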
Note that there is a COMBINING CIRCUMFLEX ACCENT at U+0302 but that’s not what we think of as a circumflex in Greek; for Greek we use U+0342, the COMBINING GREEK PERISPOMENI.
Precomposed versus Decomposed Characters
An alpha with smooth breathing (or psili) is available precomposed at U+1F00 but it can also be expressed as U+03B1 followed by U+0313.
|Code Point||Glyph||Name||Category||Combining Class||Decomposition|
|U+1F00||ἀ||GREEK SMALL LETTER ALPHA WITH PSILI||Ll||0||03B1 0313|
|U+03B1||α||GREEK SMALL LETTER ALPHA||Ll||0|
|U+0313||◦̓||COMBINING COMMA ABOVE||Mn||230|
U+1F00 is referred to as precomposed and its decomposition is U+03B1 U+0313.
The unicodedata library in Python will tell you the decomposition of any precomposed character:
>>> import unicodedata
>>> unicodedata.decomposition("\u1F00")
'03B1 0313'
Visually U+1F00 (ἀ) looks the same as U+03B1 U+0313 (ἀ) (or at least should, if the font properly supports polytonic Greek). But how do we deal with this variation in Python?
>>> precomposed = "\u1F00"
>>> decomposed = "\u03B1\u0313"
>>> print(precomposed, decomposed, precomposed == decomposed)
ἀ ἀ False
Note above that they fail an equality test.
However, with the normalize function, we can normalize before comparison:
>>> from unicodedata import normalize
>>> normalize("NFC", precomposed) == normalize("NFC", decomposed)
True
>>> normalize("NFD", precomposed) == normalize("NFD", decomposed)
True
The first parameter to normalize is the normalization form. This takes one of four values: NFC, NFD, NFKC or NFKD, which refer to the Unicode Normalization Forms C, D, KC and KD respectively.
|Name||Code||Description|
|Normalization Form D||NFD||do a canonical decomposition|
|Normalization Form C||NFC||do a canonical decomposition followed by a canonical composition|
|Normalization Form KD||NFKD||do a compatibility decomposition|
|Normalization Form KC||NFKC||do a compatibility decomposition followed by a canonical composition|
NFD can be used to decompose a precomposed character into its components and NFC can be used to recompose them together.
>>> from unicodedata import name, normalize
>>> precomposed = "\u1F06"
>>> precomposed
'ἆ'
>>> len(precomposed)
1
>>> name(precomposed)
'GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI'
>>> decomposed = normalize("NFD", precomposed)
>>> decomposed
'ἆ'
>>> len(decomposed)
3
>>> for component in decomposed:
...     print(hex(ord(component)), name(component))
...
0x3b1 GREEK SMALL LETTER ALPHA
0x313 COMBINING COMMA ABOVE
0x342 COMBINING GREEK PERISPOMENI
Using the decomposition function in unicodedata, we can see that, in fact, U+1F06 first decomposes into U+1F00 U+0342 and then U+1F00 further decomposes:
>>> from unicodedata import decomposition
>>> decomposition("\u1F06")
'1F00 0342'
>>> name("\u1F00")
'GREEK SMALL LETTER ALPHA WITH PSILI'
>>> decomposition("\u1F00")
'03B1 0313'
You may be wondering what would happen if we had a Unicode string consisting of U+03B1 U+0342 U+0313, i.e. with the smooth breathing (psili) and circumflex (perispomeni) combining characters swapped. Clearly they won’t equate under direct string comparison, nor under Normalization Form D. But they won’t under Normalization Form C either: canonical ordering only reorders combining characters with different combining classes, and PSILI and PERISPOMENI are both class 230, so normalization preserves their relative order. So even with normalization, it is important to get the relative ordering of combining characters correct. If you try displaying U+03B1 U+0342 U+0313, you’ll probably get the smooth breathing above the circumflex, so in this case it will visually look wrong too.
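A quick check confirms that no normalization form makes the swapped ordering equal to the correct one:

```python
from unicodedata import normalize

correct = "\u03b1\u0313\u0342"   # alpha, psili, perispomeni
swapped = "\u03b1\u0342\u0313"   # alpha, perispomeni, psili

# both marks have combining class 230, so canonical ordering
# leaves them alone and neither normalization form equates the two
assert normalize("NFC", correct) != normalize("NFC", swapped)
assert normalize("NFD", correct) != normalize("NFD", swapped)
```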
Canonical Composition won’t necessarily reverse Canonical Decomposition. In other words, normalize("NFC", normalize("NFD", s)) won’t necessarily give you back s. NFC normalization always decomposes first, so that code is actually just the same as normalize("NFC", s).
Here are some examples where an NFC normalization changes the character.
- GREEK NUMERAL SIGN (U+0374) goes to MODIFIER LETTER PRIME (U+02B9)
- GREEK QUESTION MARK (U+037E) goes to SEMICOLON (U+003B)
- GREEK ANO TELEIA (U+0387) goes to MIDDLE DOT (U+00B7)
- GREEK VARIA (U+1FEF) goes to GRAVE ACCENT (U+0060)
- GREEK OXIA (U+1FFD) goes to ACUTE ACCENT (U+00B4)
- GREEK PROSGEGRAMMENI (U+1FBE) goes to GREEK SMALL LETTER IOTA (U+03B9)
- GREEK DIALYTIKA AND OXIA (U+1FEE) goes to GREEK DIALYTIKA TONOS (U+0385)
(note these are not the combining versions, but standalone accents not on any letters).
The normalization of OXIA to TONOS means there are also 16 other normalizations like:
- GREEK SMALL LETTER ALPHA WITH OXIA (U+1F71) goes to GREEK SMALL LETTER ALPHA WITH TONOS (U+03AC)
This normalization means that, for better or worse, polytonic Greek should use the TONOS-based rather than OXIA-based precomposed characters (where they exist) and use a plain SEMICOLON for question marks rather than U+037E. This allows one to use standard Unicode Normalization. If, instead, you wanted to use GREEK QUESTION MARK, then any code to normalize in that direction would have to be custom. I don’t necessarily like all the decisions made by the Unicode Consortium in this regard but at least it’s a convention you can easily point people to.
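Both kinds of replacement are easy to verify with the code points listed above:

```python
from unicodedata import normalize

# GREEK QUESTION MARK normalizes to an ordinary SEMICOLON
assert normalize("NFC", "\u037e") == ";"

# GREEK SMALL LETTER ALPHA WITH OXIA normalizes to ALPHA WITH TONOS
assert normalize("NFC", "\u1f71") == "\u03ac"

# and both decompose identically: alpha plus COMBINING ACUTE ACCENT
assert normalize("NFD", "\u1f71") == normalize("NFD", "\u03ac") == "\u03b1\u0301"
```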
Okay, so that’s canonical decomposition—but what is the compatibility decomposition used by the “K” normalization forms?
The normalization via Compatibility Decomposition has a looser notion of whether two things are equal. There are strings that would not be considered equal under Form C (NFC) that are under Form KC (NFKC).
Mostly this is used to normalize over visual variants such as the ffi ligature or ℝ to R. Greek doesn’t have much of this so the only real cases of compatibility fall into one of two categories:
- various “symbol” variants like ϑ GREEK THETA SYMBOL (U+03D1) versus θ GREEK SMALL LETTER THETA (U+03B8) or ϕ GREEK PHI SYMBOL (U+03D5) versus φ GREEK SMALL LETTER PHI (U+03C6).
- breaking more standalone accents into space + combining character variants.
In my experience you don’t really lose anything by using the compatibility forms and you may gain normalization of the occasional stray symbol, etc.
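For example, the theta symbol survives canonical normalization but collapses to the ordinary theta under a compatibility form:

```python
from unicodedata import normalize

theta_symbol = "\u03d1"   # ϑ GREEK THETA SYMBOL
theta = "\u03b8"          # θ GREEK SMALL LETTER THETA

# canonically distinct...
assert normalize("NFC", theta_symbol) != normalize("NFC", theta)

# ...but compatibility-equivalent
assert normalize("NFKC", theta_symbol) == normalize("NFKC", theta) == theta
```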
TODO: sorting (sorted doesn’t work and you need something more than the built-in code point ordering)

TODO: which characters to use for punctuation in Greek
Stripping and Adding Diacritics
Let’s first of all consider stripping out specific diacritics.
Say you want to remove acute, grave and circumflex but leave diaeresis and breathing. You could have some mapping file that maps, say, GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA to GREEK SMALL LETTER ALPHA WITH PSILI and so on for all the combinations of precomposed characters. But there’s an easier way: just decompose the characters, filter out the diacritics you don’t want, and then recompose.
Here is an example in Python of removing acutes, graves and circumflexes:
import unicodedata

ACUTE = "\u0301"
GRAVE = "\u0300"
CIRCUMFLEX = "\u0342"

# text is the string to process
unicodedata.normalize("NFC", "".join(
    ch for ch in unicodedata.normalize("NFD", text)
    if ch not in [ACUTE, GRAVE, CIRCUMFLEX]))
(or you could pip install greek-accentuation and use characters.strip_accents, which does exactly the above)
If you just wanted to test the existence of a particular diacritic, say a diaeresis, you can do something like this:
import unicodedata

DIAERESIS = "\u0308"

if DIAERESIS in unicodedata.normalize("NFD", word):
    print("has a diaeresis!")
If you don’t care about a specific subset of diacritics and you just want to test for or remove any diacritics, you don’t need to test with in and a big long list of diacritics. You can instead use unicodedata.category and see if the character’s category is Mn (Mark, Nonspacing).
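A sketch of that approach (strip_marks is a name introduced here for illustration, not from any library): decompose, drop every character in category Mn, then recompose:

```python
import unicodedata

def strip_marks(s):
    """Remove all combining marks (category Mn) from s."""
    return unicodedata.normalize("NFC", "".join(
        ch for ch in unicodedata.normalize("NFD", s)
        if unicodedata.category(ch) != "Mn"))

print(strip_marks("λόγος"))  # strips the acute, leaving the bare letters
```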
TODO: adding diacritics