Python, Unicode and Ancient Greek
This is a work in progress. Feedback welcome.
Now also see Character Encoding of Classical Languages which you can download as a PDF for free.
Coded Character Sets versus Character Encodings
A coded character set assigns an integer to each character.
A character encoding represents a sequence of those integers as bytes.
For example, the Greek lowercase lambda is assigned the number 955 in Unicode. This is more commonly expressed in hex and written U+03BB or, in Python, "\u03bb".
UTF-8 is just one way of encoding Unicode characters. In UTF-8, the Greek lowercase lambda is the byte sequence CE BB (or, in Python: "\xce\xbb").
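Both of these facts can be checked from a Python 3 REPL:
>>> ord("λ")
955
>>> hex(ord("λ"))
'0x3bb'
>>> "λ".encode("utf-8")
b'\xce\xbb'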
Unicode Strings versus Byte Strings in Python 2 and 3
In Python 2, a literal string is a string of bytes in a character encoding:
>>> word = "λόγος"
>>> word
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
>>> len(word)
10
>>> type(word)
<type 'str'>
You can prefix a literal string with u to have it be treated as a string of Unicode characters:
>>> word = u"λόγος"
>>> word
u'\u03bb\u03cc\u03b3\u03bf\u03c2'
>>> len(word)
5
>>> type(word)
<type 'unicode'>
You can convert between bytes and unicode with .decode and .encode:
>>> word_bytes = "λόγος"
>>> word_bytes
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
>>> word_unicode = word_bytes.decode("utf-8")
>>> word_unicode
u'\u03bb\u03cc\u03b3\u03bf\u03c2'
>>> word_unicode.encode("utf-8")
'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
Note you can’t encode a Unicode string with a character encoding that doesn’t support the characters:
>>> word_unicode.encode("ASCII")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
In Python 2, source files need to be explicitly marked as UTF-8 with coding: utf-8 in a comment in the first couple of lines.
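For example, putting the following as the first or second line of the file works:
# -*- coding: utf-8 -*-
(the shorter # coding: utf-8 also matches the pattern Python looks for).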
When you read a string from a file, you need to .decode it to convert it from bytes to Unicode characters, and when you write a string to a file, you need to .encode it to convert it from Unicode characters to bytes.
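Here is a minimal Python 2 sketch of that round trip (the filenames are just for illustration):
# Python 2 file I/O deals in bytes: decode on the way in, encode on the way out
with open("in.txt") as f:
    text = f.read().decode("utf-8")
with open("out.txt", "w") as f:
    f.write(text.encode("utf-8"))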
In Python 3, a literal string is assumed to be a string of Unicode characters:
>>> word = "λόγος"
>>> word
'λόγος'
>>> len(word)
5
>>> type(word)
<class 'str'>
Source files in Python 3 are assumed to be UTF-8. Reading and writing of files, unless opened in binary mode, will do the decode/encode for you. Note, though, that the default encoding for text files is platform-dependent rather than always UTF-8, so it is safest to pass an explicit encoding to open.
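A minimal Python 3 sketch (again, the filename is just for illustration):
# Python 3 text-mode I/O decodes/encodes for you; pass the encoding
# explicitly rather than relying on the platform default
with open("logos.txt", "w", encoding="utf-8") as f:
    f.write("λόγος")
with open("logos.txt", encoding="utf-8") as f:
    print(f.read())  # λόγος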
Greek versus Greek Extended
(0370–03FF vs 1F00–1FFF)
Combining Characters
The relevant Combining Diacritical Marks in the 0300–036F range are:
code | char | name | category | combining class | decomposition |
---|---|---|---|---|---|
U+0300 | ◦̀ | COMBINING GRAVE ACCENT | Mn | 230 | |
U+0301 | ◦́ | COMBINING ACUTE ACCENT | Mn | 230 | |
U+0304 | ◦̄ | COMBINING MACRON | Mn | 230 | |
U+0306 | ◦̆ | COMBINING BREVE | Mn | 230 | |
U+0308 | ◦̈ | COMBINING DIAERESIS | Mn | 230 | |
U+0313 | ◦̓ | COMBINING COMMA ABOVE | Mn | 230 | |
U+0314 | ◦̔ | COMBINING REVERSED COMMA ABOVE | Mn | 230 | |
U+0342 | ◦͂ | COMBINING GREEK PERISPOMENI | Mn | 230 | |
U+0343 | ◦̓ | COMBINING GREEK KORONIS | Mn | 230 | 0313 |
U+0344 | ◦̈́ | COMBINING GREEK DIALYTIKA TONOS | Mn | 230 | 0308 0301 |
U+0345 | ◦ͅ | COMBINING GREEK YPOGEGRAMMENI | Mn | 240 | |
Note that there is a COMBINING CIRCUMFLEX ACCENT at U+0302 but that's not what we think of as a circumflex. In Greek we use U+0342, the COMBINING GREEK PERISPOMENI.
Precomposed versus Decomposed Characters
An alpha with smooth breathing (or psili) is available precomposed at U+1F00 but it can also be written as U+03B1 followed by U+0313.
code | char | name | category | combining class | decomposition |
---|---|---|---|---|---|
U+1F00 | ἀ | GREEK SMALL LETTER ALPHA WITH PSILI | Ll | 0 | 03B1 0313 |
U+03B1 | α | GREEK SMALL LETTER ALPHA | Ll | 0 | |
U+0313 | ◦̓ | COMBINING COMMA ABOVE | Mn | 230 | |
U+1F00 is referred to as precomposed and its decomposition is U+03B1 U+0313.
The unicodedata library in Python will tell you the decomposition of any precomposed character:
>>> import unicodedata
>>> precomposed = "\u1F00"
>>> unicodedata.decomposition(precomposed)
'03B1 0313'
Normalization
Visually U+1F00 (ἀ) looks the same as U+03B1 U+0313 (ἀ) (or at least should, if the font properly supports polytonic Greek). But how do we deal with this variation in Python?
>>> precomposed = "\u1F00"
>>> decomposed = "\u03B1\u0313"
>>> print(precomposed, decomposed, precomposed == decomposed)
ἀ ἀ False
Note above that they fail an equality test.
However, with the unicodedata module’s normalize function, we can normalize before comparison:
>>> from unicodedata import normalize
>>> normalize("NFC", precomposed) == normalize("NFC", decomposed)
True
>>> normalize("NFD", precomposed) == normalize("NFD", decomposed)
True
The first parameter to normalize is the normalization form. This takes one of four values, NFD, NFC, NFKC and NFKD, which refer to the Unicode Normalization Forms D, C, KC and KD respectively.
form | abbreviation | meaning |
---|---|---|
Normalization Form D | NFD | do a canonical decomposition |
Normalization Form C | NFC | do a canonical decomposition followed by a canonical composition |
Normalization Form KD | NFKD | do a compatibility decomposition |
Normalization Form KC | NFKC | do a compatibility decomposition followed by a canonical composition |
In short, NFD can be used to decompose a precomposed character into its components and NFC can be used to recompose them.
>>> from unicodedata import name, normalize
>>> precomposed = "\u1F06"
>>> precomposed
'ἆ'
>>> len(precomposed)
1
>>> name(precomposed)
'GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI'
>>> decomposed = normalize("NFD", precomposed)
>>> decomposed
'ἆ'
>>> len(decomposed)
3
>>> for component in decomposed:
...     print(hex(ord(component)), name(component))
...
0x3b1 GREEK SMALL LETTER ALPHA
0x313 COMBINING COMMA ABOVE
0x342 COMBINING GREEK PERISPOMENI
Using the decomposition function in unicodedata, we can see that U+1F06 in fact first decomposes into U+1F00 U+0342 and then U+1F00 further decomposes:
>>> from unicodedata import decomposition
>>> decomposition("\u1F06")
'1F00 0342'
>>> name("\u1F00")
'GREEK SMALL LETTER ALPHA WITH PSILI'
>>> decomposition("\u1F00")
'03B1 0313'
You may be wondering what would happen if we had a Unicode string consisting of U+03B1 U+0342 U+0313, i.e. with the smooth breathing (psili) and circumflex (perispomeni) combining characters swapped. Clearly the two orderings won’t be equal under direct string comparison, but they won’t be equal under Normalization Form D or Form C either: U+0313 and U+0342 both have combining class 230, so the canonical ordering step leaves their relative order untouched. Even with normalization, then, it is important to get the relative ordering of combining characters correct. If you try displaying U+03B1 U+0342 U+0313, you’ll probably get the smooth breathing above the circumflex, so in this case it will also visually look wrong.
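You can confirm this in the REPL:
>>> wrong_order = "\u03B1\u0342\u0313"
>>> right_order = "\u03B1\u0313\u0342"
>>> normalize("NFD", wrong_order) == normalize("NFD", right_order)
False
>>> normalize("NFC", wrong_order) == normalize("NFC", right_order)
False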
Canonical Composition won’t necessarily reverse Canonical Decomposition. In other words, normalize("NFC", normalize("NFD", s)) won’t necessarily give you back s. An NFC normalization always decomposes first, so that code is actually just the same as normalize("NFC", s).
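For example, with the GREEK NUMERAL SIGN (the first example in the list below):
>>> s = "\u0374"  # GREEK NUMERAL SIGN
>>> normalize("NFC", normalize("NFD", s)) == s
False
>>> normalize("NFC", normalize("NFD", s)) == normalize("NFC", s)
True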
Here are some examples where an NFC normalization changes the character:
- GREEK NUMERAL SIGN (U+0374) goes to MODIFIER LETTER PRIME (U+02B9)
- GREEK QUESTION MARK (U+037E) goes to SEMICOLON (U+003B)
- GREEK ANO TELEIA (U+0387) goes to MIDDLE DOT (U+00B7)
And also:
- GREEK VARIA (U+1FEF) goes to GRAVE ACCENT (U+0060)
- GREEK OXIA (U+1FFD) goes to ACUTE ACCENT (U+00B4)
- GREEK PROSGEGRAMMENI (U+1FBE) goes to GREEK SMALL LETTER IOTA (U+03B9)
- GREEK DIALYTIKA AND OXIA (U+1FEE) goes to GREEK DIALYTIKA TONOS (U+0385)
(note these are not the combining versions, but standalone accents not on any letters).
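These are easy to confirm in the REPL:
>>> normalize("NFC", "\u037E")
';'
>>> normalize("NFC", "\u1FEF")
'`'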
The normalization of OXIA to TONOS means there are also 16 other normalizations like:
- GREEK SMALL LETTER ALPHA WITH OXIA (U+1F71) goes to GREEK SMALL LETTER ALPHA WITH TONOS (U+03AC)
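For example:
>>> name(normalize("NFC", "\u1F71"))
'GREEK SMALL LETTER ALPHA WITH TONOS'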
This normalization means that, for better or worse, polytonic Greek should use the TONOS-based rather than OXIA-based pre-composed characters (where they exist) and use a plain SEMICOLON for question marks rather than U+037E. This allows one to use standard Unicode Normalization. If instead, you wanted to use GREEK QUESTION MARK, then any code to normalize in that direction would have to be custom. I don’t necessarily like all the decisions made by the Unicode Consortium in this regard but at least it’s a convention you can easily point people to.
Compatibility Forms
Okay, so that’s canonical decomposition—but what is the compatibility decomposition used by the “K” normalization forms?
Normalization via Compatibility Decomposition has a looser notion of whether two things are equal: there are strings that would not be considered equal under Form C (NFC) but that are under Form KC (NFKC).
Mostly this is used to normalize over visual variants such as the ffi ligature or ℝ to R. Greek doesn't have much of this, so the only real cases of compatibility fall into one of two categories:
- various “symbol” variants like ϑ GREEK THETA SYMBOL (U+03D1) versus θ GREEK SMALL LETTER THETA (U+03B8) or ϕ GREEK PHI SYMBOL (U+03D5) versus φ GREEK SMALL LETTER PHI (U+03C6).
- breaking more standalone accents into space + combining character variants.
While you may gain normalization of the occasional stray symbol with compatibility forms, you also lose some distinctions which may be important, such as superscript numbers (which get normalized to their regular variety).
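Both effects are easy to see in the REPL:
>>> normalize("NFKC", "\u03D1")  # GREEK THETA SYMBOL becomes plain theta
'θ'
>>> normalize("NFC", "\u03D1")  # but is untouched by the canonical forms
'ϑ'
>>> normalize("NFKC", "\u00B2")  # SUPERSCRIPT TWO loses its superscripting
'2'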
Collation
(broadly why sorted doesn’t work and you need something like pyuca)
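Until this section is written up, here is a minimal sketch of the pyuca approach (assuming pip install pyuca; the word list is just for illustration):
from pyuca import Collator

words = ["ψυχή", "λόγος", "ἀρχή", "θεός"]
# plain sorted falls back to code point order, which puts ἀρχή
# (starting at U+1F00 in Greek Extended) after everything else
print(sorted(words))
# Unicode Collation Algorithm sort keys give a sensible alphabetical order
c = Collator()
print(sorted(words, key=c.sort_key))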
Punctuation
(which characters to use for punctuation in Greek) (see PDF linked at top)
Character Classes
Stripping and Adding Diacritics
Let’s first of all consider stripping out specific diacritics.
Say you want to remove acute, grave and circumflex but leave diaeresis and breathing. You could have some mapping file that maps, say, GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA to GREEK SMALL LETTER ALPHA WITH PSILI and so on for all the combinations of precomposed characters. But there’s an easier way: just decompose the characters, filter out the diacritics you don’t want, and then recompose.
Here is an example in Python of removing acutes, graves and circumflexes:
import unicodedata

ACUTE = "\u0301"
GRAVE = "\u0300"
CIRCUMFLEX = "\u0342"

text = "ἄνθρωπος"  # any polytonic Greek string

# decompose, filter out the unwanted combining characters, then recompose
stripped = unicodedata.normalize("NFC", "".join(
    ch
    for ch in unicodedata.normalize("NFD", text)
    if ch not in [ACUTE, GRAVE, CIRCUMFLEX]
))
(or you could pip install greek-accentuation and use characters.strip_accents which does exactly the above)
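A usage sketch (the import path is my assumption based on the package's characters module):
from greek_accentuation.characters import strip_accents

print(strip_accents("ἄνθρωπος"))  # accent removed, breathing kept: ἀνθρωπος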
If you just wanted to test the existence of a particular diacritic, say a diaeresis, you can do something like this:
DIAERESIS = "\u0308"
if DIAERESIS in unicodedata.normalize("NFD", word):
    print("has a diaeresis!")
If you don’t care about a specific subset of diacritics and you just want to test for or remove any diacritic, you don’t need to test with in and a big long list of diacritics. You can instead use unicodedata.category and see if the character’s category is "Mn".
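For example, a minimal sketch (the helper name strip_all_diacritics is just for illustration):
import unicodedata

def strip_all_diacritics(text):
    # drop every combining character (category "Mn") after decomposing
    return unicodedata.normalize("NFC", "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    ))

print(strip_all_diacritics("ᾄσματα"))  # ασματα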
TODO: adding diacritics