Python, Unicode and Ancient Greek

This is a work in progress. Feedback welcome.

Coded Character Sets versus Character Encodings

A coded character set assigns an integer to each character.

A character encoding represents a sequence of those integers as bytes.

For example, the Greek lowercase lambda is assigned the number 955 in Unicode. This is more commonly expressed in hex and written U+03BB or, in Python, "\u03bb".

UTF-8 is just one way of encoding Unicode characters. In UTF-8, the Greek lowercase lambda is the byte sequence CE BB (or, in Python: "\xce\xbb").
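As a quick check of this distinction, here is a minimal Python 3 sketch that looks at the same character first as a code point and then as UTF-8 bytes. The variable names are only for illustration.

```python
# Python 3: a str holds code points, a bytes object holds encoded bytes.
lam = "\u03bb"  # Greek lowercase lambda

code_point = ord(lam)             # integer assigned by the coded character set
utf8_bytes = lam.encode("utf-8")  # bytes produced by the character encoding

print(code_point, hex(code_point))        # 955 0x3bb
print(utf8_bytes)                         # b'\xce\xbb'
print(utf8_bytes.decode("utf-8") == lam)  # True
```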

Unicode Strings versus Byte Strings in Python 2 and 3

In Python 2, a literal string is a string of bytes in a character encoding:

>>> word = "λόγος"
>>> word
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
>>> len(word)
10
>>> type(word)
<type 'str'>

You can prefix a literal string with u to have it be treated as a string of Unicode characters:

>>> word = u"λόγος"
>>> word
u'\u03bb\u03cc\u03b3\u03bf\u03c2'
>>> len(word)
5
>>> type(word)
<type 'unicode'>

You can convert between bytes and unicode with .decode and .encode:

>>> word_bytes = "λόγος"
>>> word_bytes
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
>>> word_unicode = word_bytes.decode("utf-8")
>>> word_unicode
u'\u03bb\u03cc\u03b3\u03bf\u03c2'
>>> word_unicode.encode("utf-8")
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'

Note you can’t encode a Unicode string with a character encoding that doesn’t support the characters:

>>> word_unicode.encode("ASCII")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

In Python 2, source files need to be explicitly marked as UTF-8 with coding: utf-8 in a comment in the first couple of lines.

When you read a string from a file, you need to .decode it to convert it from bytes to Unicode characters and when you write a string to a file, you need to .encode it to convert it from Unicode characters to bytes.

In Python 3, a literal string is assumed to be a string of Unicode characters:

>>> word = "λόγος"
>>> word
'λόγος'
>>> len(word)
5
>>> type(word)
<class 'str'>

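Source files in Python 3 are assumed to be UTF-8. Reading and writing of files, unless marked as binary, will do the decode/encode for you. Note, however, that in most Python 3 versions the default encoding used by open depends on the platform's locale settings and is not guaranteed to be UTF-8, so it is safest to pass encoding="utf-8" explicitly.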
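A small sketch of text-mode versus binary-mode file access in Python 3 follows. The temporary file name is hypothetical, and the encoding is passed explicitly so the behaviour does not depend on whatever default applies on a given machine.

```python
import os
import tempfile

word = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # λόγος

# A throwaway path for the demonstration (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "word.txt")

# Text mode: Python encodes on write and decodes on read.
with open(path, "w", encoding="utf-8") as f:
    f.write(word)
with open(path, encoding="utf-8") as f:
    text_read = f.read()

# Binary mode: you get the raw bytes and decode them yourself.
with open(path, "rb") as f:
    raw = f.read()

print(text_read == word)            # True
print(len(raw), len(text_read))     # 10 5
print(raw.decode("utf-8") == word)  # True
```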

Greek versus Greek Extended

(0370–03FF vs 1F00–1FFF)
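As a rough illustration of the two ranges, a helper like the following could report which block a character falls in. The function name greek_block is made up for this sketch; note that Unicode officially names the first block "Greek and Coptic".

```python
def greek_block(ch):
    """Return which of the two Greek blocks a single character is in, or None."""
    cp = ord(ch)
    if 0x0370 <= cp <= 0x03FF:
        return "Greek"           # officially the "Greek and Coptic" block
    if 0x1F00 <= cp <= 0x1FFF:
        return "Greek Extended"
    return None

print(greek_block("\u03b1"))  # Greek
print(greek_block("\u1f00"))  # Greek Extended
print(greek_block("a"))       # None
```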

Combining Characters

The relevant Combining Diacritical Marks in the 0300–036F range are:

code    char  name                             category  combining  decomposition
U+0300  ◦̀     COMBINING GRAVE ACCENT           Mn        230
U+0301  ◦́     COMBINING ACUTE ACCENT           Mn        230
U+0304  ◦̄     COMBINING MACRON                 Mn        230
U+0306  ◦̆     COMBINING BREVE                  Mn        230
U+0308  ◦̈     COMBINING DIAERESIS              Mn        230
U+0313  ◦̓     COMBINING COMMA ABOVE            Mn        230
U+0314  ◦̔     COMBINING REVERSED COMMA ABOVE   Mn        230
U+0342  ◦͂     COMBINING GREEK PERISPOMENI      Mn        230
U+0343  ◦̓     COMBINING GREEK KORONIS          Mn        230        0313
U+0344  ◦̈́     COMBINING GREEK DIALYTIKA TONOS  Mn        230        0308 0301
U+0345  ◦ͅ     COMBINING GREEK YPOGEGRAMMENI    Mn        240
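The columns in this table can be reproduced with the unicodedata module. The sketch below assumes Python 3; the CODE_POINTS list and the rows variable are just names chosen here.

```python
import unicodedata

# The combining marks listed in the table above.
CODE_POINTS = [
    0x0300, 0x0301, 0x0304, 0x0306, 0x0308, 0x0313,
    0x0314, 0x0342, 0x0343, 0x0344, 0x0345,
]

rows = []
for cp in CODE_POINTS:
    ch = chr(cp)
    rows.append((
        "U+%04X" % cp,
        unicodedata.name(ch),
        unicodedata.category(ch),       # general category, e.g. "Mn"
        unicodedata.combining(ch),      # canonical combining class
        unicodedata.decomposition(ch),  # "" if there is no decomposition
    ))

for row in rows:
    print(*row, sep="\t")
```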

Note that there is a COMBINING CIRCUMFLEX ACCENT at U+0302, but that’s not what we think of as a circumflex in Greek. For Greek we use U+0342, the COMBINING GREEK PERISPOMENI.

Precomposed versus Decomposed Characters

An alpha with smooth breathing (or psili) is available at U+1F00 but it’s also possible with U+03B1 and U+0313.

code    char  name                                 category  combining  decomposition
U+1F00  ἀ     GREEK SMALL LETTER ALPHA WITH PSILI  Ll        0          03B1 0313
U+03B1  α     GREEK SMALL LETTER ALPHA             Ll        0
U+0313  ◦̓     COMBINING COMMA ABOVE                Mn        230

U+1F00 is referred to as precomposed and its decomposition is U+03B1 U+0313.

The unicodedata library in Python will tell you the decomposition of any precomposed character:

>>> import unicodedata
>>> precomposed = "\u1F00"
>>> unicodedata.decomposition(precomposed)
'03B1 0313'

Normalization

Visually U+1F00 (ἀ) looks the same as U+03B1 U+0313 (ἀ) (or at least should, if the font properly supports polytonic Greek). But how do we deal with this variation in Python?

>>> precomposed = "\u1F00"
>>> decomposed = "\u03B1\u0313"
>>> print(precomposed, decomposed, precomposed == decomposed)
ἀ ἀ False

Note above that they fail an equality test.

However, with the unicodedata module’s normalize function, we can normalize before comparison:

>>> from unicodedata import normalize
>>> normalize("NFC", precomposed) == normalize("NFC", decomposed)
True
>>> normalize("NFD", precomposed) == normalize("NFD", decomposed)
True
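If you need this comparison in several places, it can be wrapped in a small helper. This is only a sketch; the name canonically_equal and the default of NFC are choices made here, not something from the standard library.

```python
from unicodedata import normalize

def canonically_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, a) == normalize(form, b)

precomposed = "\u1F00"
decomposed = "\u03B1\u0313"

print(precomposed == decomposed)                          # False
print(canonically_equal(precomposed, decomposed))         # True
print(canonically_equal(precomposed, decomposed, "NFD"))  # True
```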

The first parameter to normalize is the normalization form. It takes one of four values: NFD, NFC, NFKD and NFKC, which refer to the Unicode Normalization Forms D, C, KD and KC respectively.

Normalization Form D   NFD   do a canonical decomposition
Normalization Form C   NFC   do a canonical decomposition followed by a canonical composition
Normalization Form KD  NFKD  do a compatibility decomposition
Normalization Form KC  NFKC  do a compatibility decomposition followed by a canonical composition

In short, NFD can be used to decompose a precomposed character into its components and NFC can be used to recompose them together.

>>> from unicodedata import name, normalize
>>> precomposed = "\u1F06"
>>> precomposed
'ἆ'
>>> len(precomposed)
1
>>> name(precomposed)
'GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI'
>>> decomposed = normalize("NFD", precomposed)
>>> decomposed
'ἆ'
>>> len(decomposed)
3
>>> for component in decomposed:
...     print(hex(ord(component)), name(component))
... 
0x3b1 GREEK SMALL LETTER ALPHA
0x313 COMBINING COMMA ABOVE
0x342 COMBINING GREEK PERISPOMENI

Using the decomposition function in unicodedata, we can see that U+1F06 in fact first decomposes into U+1F00 U+0342, and then U+1F00 further decomposes:

>>> from unicodedata import decomposition
>>> decomposition("\u1F06")
'1F00 0342'
>>> name("\u1F00")
'GREEK SMALL LETTER ALPHA WITH PSILI'
>>> decomposition("\u1F00")
'03B1 0313'
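The step-by-step decomposition above can be followed mechanically by applying decomposition recursively. The function below is a simplified sketch (the name expand is made up here): it skips compatibility mappings and does not do the canonical reordering of combining characters that a real NFD normalization also performs.

```python
from unicodedata import decomposition, normalize

def expand(ch):
    """Recursively apply canonical decomposition mappings to one character."""
    mapping = decomposition(ch)
    # An empty string means there is no decomposition; a leading "<" marks a
    # compatibility mapping, which canonical decomposition leaves alone.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(expand(chr(int(cp, 16))) for cp in mapping.split())

expanded = expand("\u1F06")
print(["U+%04X" % ord(c) for c in expanded])   # ['U+03B1', 'U+0313', 'U+0342']
print(expanded == normalize("NFD", "\u1F06"))  # True
```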

You may be wondering what would happen if we had a Unicode string consisting of U+03B1 U+0342 U+0313, i.e. with the smooth breathing (psili) and circumflex (perispomeni) combining characters swapped. Clearly they won’t equate under direct string comparison nor under Normalization Form D. But they won’t under Normalization Form C either. So even with normalization, it is important to get the relative ordering of combining characters correct. If you try displaying U+03B1 U+0342 U+0313, you’ll probably get the smooth breathing above the circumflex so in this case it will visually look wrong.
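The effect of swapping the two combining characters can be checked directly. This is a small Python 3 sketch; the variable names are only illustrative.

```python
from unicodedata import normalize

correct = "\u03B1\u0313\u0342"  # alpha, psili, perispomeni
swapped = "\u03B1\u0342\u0313"  # alpha, perispomeni, psili

print(correct == swapped)                                      # False
print(normalize("NFD", correct) == normalize("NFD", swapped))  # False
print(normalize("NFC", correct) == normalize("NFC", swapped))  # False

# The correct order composes to the single character U+1F06 ...
nfc_correct = normalize("NFC", correct)
# ... whereas the swapped order does not collapse to one character.
nfc_swapped = normalize("NFC", swapped)
print(len(nfc_correct), len(nfc_swapped))  # 1 2
```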

Canonical Composition won’t necessarily reverse Canonical Decomposition. In other words normalize("NFC", normalize("NFD", s)) won’t necessarily give you back s. An NFC normalization always decomposes first so that code is actually just the same as normalize("NFC", s).

Here are some examples where an NFC normalization changes the character.

  • GREEK NUMERAL SIGN (U+0374) goes to MODIFIER LETTER PRIME (U+02B9)
  • GREEK QUESTION MARK (U+037E) goes to SEMICOLON (U+003B)
  • GREEK ANO TELEIA (U+0387) goes to MIDDLE DOT (U+00B7)

And also:

  • GREEK VARIA (U+1FEF) goes to GRAVE ACCENT (U+0060)
  • GREEK OXIA (U+1FFD) goes to ACUTE ACCENT (U+00B4)
  • GREEK PROSGEGRAMMENI (U+1FBE) goes to GREEK SMALL LETTER IOTA (U+03B9)
  • GREEK DIALYTIKA AND OXIA (U+1FEE) goes to GREEK DIALYTIKA TONOS (U+0385)

(note these are not the combining versions, but standalone accents not on any letters).

The normalization of OXIA to TONOS means there are also 16 other normalizations like:

  • GREEK SMALL LETTER ALPHA WITH OXIA (U+1F71) goes to GREEK SMALL LETTER ALPHA WITH TONOS (U+03AC)
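The mappings listed above can be checked with a short script. This is a sketch assuming Python 3; the EXPECTED table simply restates the lists above, and the final filter is one possible way of finding the OXIA letters that NFC replaces.

```python
from unicodedata import name, normalize

# Characters listed above, paired with what NFC is expected to give.
EXPECTED = {
    "\u0374": "\u02B9",  # GREEK NUMERAL SIGN -> MODIFIER LETTER PRIME
    "\u037E": "\u003B",  # GREEK QUESTION MARK -> SEMICOLON
    "\u0387": "\u00B7",  # GREEK ANO TELEIA -> MIDDLE DOT
    "\u1FEF": "\u0060",  # GREEK VARIA -> GRAVE ACCENT
    "\u1FFD": "\u00B4",  # GREEK OXIA -> ACUTE ACCENT
    "\u1FBE": "\u03B9",  # GREEK PROSGEGRAMMENI -> GREEK SMALL LETTER IOTA
    "\u1FEE": "\u0385",  # GREEK DIALYTIKA AND OXIA -> GREEK DIALYTIKA TONOS
    "\u1F71": "\u03AC",  # ALPHA WITH OXIA -> ALPHA WITH TONOS
}

for original, expected in EXPECTED.items():
    result = normalize("NFC", original)
    print("U+%04X -> U+%04X %s" % (ord(original), ord(result), name(result)))
    assert result == expected
    # Composing after decomposing gives the same result as NFC on its own,
    # which here is not the original character.
    assert normalize("NFC", normalize("NFD", original)) == result

# Greek Extended letters whose name mentions OXIA and that NFC replaces.
oxia_letters = [
    chr(cp)
    for cp in range(0x1F00, 0x2000)
    if "LETTER" in name(chr(cp), "")
    and "OXIA" in name(chr(cp), "")
    and normalize("NFC", chr(cp)) != chr(cp)
]
print(len(oxia_letters))
```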

This normalization means that, for better or worse, polytonic Greek should use the TONOS-based rather than OXIA-based precomposed characters (where they exist) and use a plain SEMICOLON for question marks rather than U+037E. This allows one to use standard Unicode Normalization. If, instead, you wanted to use GREEK QUESTION MARK, then any code to normalize in that direction would have to be custom. I don’t necessarily like all the decisions made by the Unicode Consortium in this regard but at least it’s a convention you can easily point people to.

Compatibility Forms

Okay, so that’s canonical decomposition—but what is the compatibility decomposition used by the “K” normalization forms?

The normalization via Compatibility Decomposition has a looser notion of whether two things are equal. There are strings that would not be considered equal under Form C (NFC) that are under Form KC (NFKC).

Mostly this is used to normalize over visual variants such as the ffi ligature or ℝ to R. Greek doesn’t have much of this so the only real cases of compatibility fall into one of two categories:

  • various “symbol” variants like ϑ GREEK THETA SYMBOL (U+03D1) versus θ GREEK SMALL LETTER THETA (U+03B8) or ϕ GREEK PHI SYMBOL (U+03D5) versus φ GREEK SMALL LETTER PHI (U+03C6).
  • breaking more standalone accents into space + combining character variants.

In my experience you don’t really lose anything by using the compatibility forms and you may gain normalization of the occasional stray symbol, etc.
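To make the difference between the canonical and compatibility forms concrete, the sketch below normalizes two of the "symbol" variants mentioned above and one standalone accent (GREEK PSILI, U+1FBF, picked here as an example).

```python
from unicodedata import normalize

theta_symbol = "\u03D1"  # GREEK THETA SYMBOL
phi_symbol = "\u03D5"    # GREEK PHI SYMBOL
psili = "\u1FBF"         # GREEK PSILI, a standalone (non-combining) accent

# The canonical form leaves the symbol variants alone ...
print(normalize("NFC", theta_symbol) == theta_symbol)  # True
# ... while the compatibility form maps them to the ordinary letters.
print(normalize("NFKC", theta_symbol) == "\u03B8")     # True
print(normalize("NFKC", phi_symbol) == "\u03C6")       # True

# A standalone accent becomes a space followed by the combining character.
print(normalize("NFKD", psili) == " \u0313")           # True
```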

Collation

(broadly why sorted doesn’t work and you need something like pyuca)

Punctuation

(which characters to use for punctuation in Greek)

Character Classes

Stripping and Adding Diacritics

Let’s first of all consider stripping out specific diacritics.

Say you want to remove acute, grave and circumflex but leave diaeresis and breathing. You could have some mapping file that maps, say, GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA to GREEK SMALL LETTER ALPHA WITH PSILI and so on for all the combinations of precomposed characters. But there’s an easier way: just decompose the characters, filter out the diacritics you don’t want, and then recompose.

Here is an example in Python of removing acutes, graves and circumflexes:

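import unicodedata

ACUTE = "\u0301"
GRAVE = "\u0300"
CIRCUMFLEX = "\u0342"  # COMBINING GREEK PERISPOMENI, not U+0302


def strip_accents(text):
    # decompose, drop the unwanted combining characters, then recompose
    return unicodedata.normalize("NFC", "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if ch not in [ACUTE, GRAVE, CIRCUMFLEX]
    ))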

(or you could pip install greek-accentuation and use characters.strip_accents which does exactly the above)

If you just wanted to test the existence of a particular diacritic, say a diaeresis, you can do something like this:

import unicodedata

DIAERESIS = "\u0308"

# word is assumed to hold the string being tested
if DIAERESIS in unicodedata.normalize("NFD", word):
    print("has a diaeresis!")

If you don’t care about a specific subset of diacritics and you just want to test or remove any diacritics, you don’t need to test with in and a big long list of diacritics. You can instead use unicodedata.category and see if the character’s category is "Mn".
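A minimal sketch of that approach follows. The function names are invented for this example; note that testing for "Mn" removes every nonspacing mark, including breathings, diaeresis and iota subscript.

```python
import unicodedata

def strip_all_diacritics(text):
    """Decompose, drop every nonspacing mark (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def has_any_diacritic(text):
    """True if the decomposed text contains at least one nonspacing mark."""
    return any(
        unicodedata.category(ch) == "Mn"
        for ch in unicodedata.normalize("NFD", text)
    )

print(strip_all_diacritics("\u1F06") == "\u03B1")           # True
print(has_any_diacritic("\u03BB\u03CC\u03B3\u03BF\u03C2"))  # True  (λόγος)
print(has_any_diacritic("\u03BB\u03BF\u03B3\u03BF\u03C2"))  # False (λογος)
```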

TODO: adding diacritics