Disentangling an encoding mess

I inherited a MySQL database that claims to be encoded as UTF-8. It contains some fields with HTML data created by a CMS, and that HTML is inserted verbatim into the web page. Almost all pages render correctly, but some of them show weird characters, like â€, ’, “ and even worse: control characters that break the output. What follows is a reconstruction of what happened and how I solved it.

The original data: what is really stored in the DB?

Getting the data in a terminal gives you this:

> SELECT title FROM articles WHERE id=1;
Some text’: more explanations “allegedlyâ€1 row in set (0.024 sec)

After the last character, the €, there's an invisible control character that prevents the REPL from showing the rest of the text, and any remaining rows. We can get the text from a script as binary data, and it looks like this:

b'Some text\xc3\xa2\xe2\x82\xac\xe2\x84\xa2: more explanations \xc3\xa2\xe2\x82\xac\xc5\x93allegedly\xc3\xa2\xe2\x82\xac\xc2\x9d and ends here'

The last two bytes, \xc2\x9d, decode to U+009D, an invisible control character that makes the terminal swallow everything that follows, so the MariaDB REPL doesn't show any more chars after it.
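As a quick aside, here is a minimal sketch of how those raw bytes can be fetched from a Python script. The driver and connection details are placeholders (I'm assuming pymysql, but any MySQL/MariaDB driver will do); the key part is CAST(... AS BINARY), which returns the column exactly as stored, with no decoding at all:

import pymysql

# Hypothetical connection settings, adjust to your own server
conn = pymysql.connect(host="localhost", user="user", password="secret", database="cms")
with conn.cursor() as cur:
    # CAST(... AS BINARY) hands back the stored bytes untouched
    cur.execute("SELECT CAST(title AS BINARY) FROM articles WHERE id=1")
    (raw,) = cur.fetchone()

print(raw)  # b'Some text\xc3\xa2\xe2\x82\xac\xe2\x84\xa2: ...'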

What happened (probably)?

At some point, the intended text looked like this: Some text’: more explanations “allegedly” and ends here. The problematic chars are the curly single quote and the left and right double quotation marks. For example, the single quote was stored in the database as UTF-8 \xe2\x80\x99, but at some point it was read back through CP1252, in which \xe2 is â, \x80 is € and \x99 is ™. This was then re-stored as UTF-8: â as \xc3\xa2, € as \xe2\x82\xac and ™ as \xe2\x84\xa2, so from then on, every time the text is recovered from the database you see the weird symbols.
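The suspected round trip is easy to reproduce in Python (a sketch of the hypothesis, not the actual code that corrupted the data):

original = "’"                      # U+2019, the intended curly quote
stored = original.encode("utf-8")   # b'\xe2\x80\x99', correctly stored as UTF-8
misread = stored.decode("cp1252")   # 'â€™', wrongly read back as CP1252
restored = misread.encode("utf-8")  # re-encoded and stored again as UTF-8

assert restored == b'\xc3\xa2\xe2\x82\xac\xe2\x84\xa2'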

Solution

Working out the solution has the extra difficulty that you cannot use "print" debugging, because you won't know what you are looking at: when € is displayed, the underlying bytes can be either \x80 (CP1252) or \xe2\x82\xac (UTF-8). Even worse: trying to print \x9d can hang your terminal or your REPL. Let's work through recovering the closing double quote character (\xe2\x80\x9d in UTF-8, \x94 in CP1252), stored as b'\xc3\xa2\xe2\x82\xac\xc2\x9d' in the database:

raw = b'\xc3\xa2\xe2\x82\xac\xc2\x9d'
syms = raw.decode("utf-8")  # syms contains something like 'â€\x9d'

raw_utf = []
for c in syms:
    try:
        # Characters that exist in CP1252 map back to the single byte that was misread
        raw_utf.append(c.encode("cp1252"))
    except UnicodeEncodeError:
        # Characters missing from CP1252 (like U+009D) keep their code point as a raw byte
        raw_utf.append(ord(c).to_bytes(1, byteorder="little"))

assert raw_utf == [b'\xe2', b'\x80', b'\x9d']

In the first two lines we decode the seven bytes as UTF-8, getting a string that the Python REPL represents as 'â€\x9d'. Note that the last character, U+009D, has no printable glyph, so it shows up as the raw escape \x9d.
Next we build a list with all the bytes we want in our final UTF-8 string (raw_utf), one char at a time. The first â is encoded as \xe2 in the CP1252 table, € is encoded as \x80, and \x9d cannot be encoded, so it raises a UnicodeEncodeError. Those non-encodable characters are stored as they come, but we must first transform the one-character string into a byte, which is what ord() and to_bytes() do.
Finally we join the list and decode it as Utf-8:

new_string = b"".join(raw_utf).decode("utf-8")

assert new_string == '”'
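Wrapped into a small helper, the same idea can repair a whole field. This is my own generalization of the steps above, and it assumes the text only went through this single UTF-8 to CP1252 to UTF-8 round trip:

def fix_field(text: str) -> str:
    repaired = []
    for c in text:
        try:
            # Undo the CP1252 misreading: map each character back to its original byte
            repaired.append(c.encode("cp1252"))
        except UnicodeEncodeError:
            # Characters missing from CP1252 keep their code point as a raw byte
            repaired.append(ord(c).to_bytes(1, byteorder="little"))
    return b"".join(repaired).decode("utf-8")

assert fix_field('â€\x9d') == '”'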

Takeaway

Hope this is useful, thanks for reading!

But wait, there is more...

[Added 2023-17-26]

You should avoid all this mess and use ftfy instead. It couldn't be easier to use, all it takes is:

import ftfy

print(ftfy.fix_text("The messed up text"))
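Applied to the bytes from the beginning of this article, it should give back the intended text. A hedged example: the expected output below is my reading of how ftfy handles this kind of mojibake, not something verified against this database:

import ftfy

broken = b'Some text\xc3\xa2\xe2\x82\xac\xe2\x84\xa2: more explanations \xc3\xa2\xe2\x82\xac\xc5\x93allegedly\xc3\xa2\xe2\x82\xac\xc2\x9d and ends here'
print(ftfy.fix_text(broken.decode("utf-8")))
# Expected: Some text’: more explanations “allegedly” and ends here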