Python Helpers for String/Unicode Encoding, Decoding and Printing
String encoding and decoding as well as encoding detection can be a headache, more so in Python 2 than in Python 3. Here are two little helpers which are used in PDFx, the PDF metadata and reference extractor:
make_compat_str - decode any kind of bytes/str into a unicode object
print_to_console - print (unicode) strings to any kind of console (even Windows with cp437, etc.)
All of this code is in the public domain via The Unlicense.
print_to_console
print_to_console
detects the output locale and tries to encode the
given (unicode) string accordingly. Using this you can safely print to any kind of terminal,
whether it supports UTF-8 or some other encoding (e.g. Windows with cp437). Characters the terminal cannot represent are escaped with backslashreplace:
import sys

def print_to_console(text):
    # Prints a (unicode) string to the console, encoded depending on the stdout
    # encoding (e.g. cp437 on Windows). Works with Python 2 and 3.
    try:
        sys.stdout.write(text)
    except UnicodeEncodeError:
        bytes_string = text.encode(sys.stdout.encoding, 'backslashreplace')
        if hasattr(sys.stdout, 'buffer'):
            sys.stdout.buffer.write(bytes_string)
        else:
            text = bytes_string.decode(sys.stdout.encoding, 'strict')
            sys.stdout.write(text)
    sys.stdout.write("\n")
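To see what the backslashreplace fallback produces without an actual cp437 terminal, you can run the encode/decode round-trip directly. This is a minimal sketch (the sample string is arbitrary), simulating a console that only accepts ASCII:

```python
# Simulate a console that cannot represent non-ASCII characters:
# backslashreplace turns each offending character into a \xNN / \uNNNN
# escape instead of raising UnicodeEncodeError.
text = "Grüße ☃"
safe = text.encode("ascii", "backslashreplace").decode("ascii")
print(safe)  # Gr\xfc\xdfe \u2603
```

Every character stays visible in some form, which is usually preferable to a crash when printing user-supplied text.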
make_compat_str
make_compat_str
detects the encoding of a string or bytes object using
chardet, and returns a standard unicode object. Just throw any kind of bytes or str object at it!
import sys

import chardet

IS_PY2 = sys.version_info < (3, 0)

if not IS_PY2:
    # Helper for Python 2 and 3 compatibility
    unicode = str

def make_compat_str(in_str):
    """
    Tries to guess the encoding of [str/bytes] and decode it into
    a unicode object.
    """
    assert isinstance(in_str, (bytes, str, unicode))
    if not in_str:
        return unicode()

    # Chardet in Py2 works on str + bytes objects
    if IS_PY2 and isinstance(in_str, unicode):
        return in_str

    # Chardet in Py3 works on bytes objects
    if not IS_PY2 and not isinstance(in_str, bytes):
        return in_str

    # Detect the encoding now
    enc = chardet.detect(in_str)

    # Decode the object into a unicode object
    out_str = in_str.decode(enc['encoding'])

    # Cleanup: sometimes UTF-16 strings include the BOM
    if enc['encoding'] == "UTF-16BE":
        # Remove the byte order mark (BOM)
        if out_str.startswith('\ufeff'):
            out_str = out_str[1:]

    # Return the decoded string
    return out_str
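The BOM cleanup at the end is easy to reproduce with the standard library alone. In this sketch (the input bytes are fabricated for illustration), decoding with the endianness-specific codec keeps the BOM in the text as U+FEFF, which is exactly the character make_compat_str strips:

```python
# A big-endian UTF-16 byte string that starts with a BOM. The generic
# "utf-16" codec would consume the BOM while decoding, but the
# endianness-specific codec ("utf-16-be", the one chardet's "UTF-16BE"
# result points to) leaves it in the decoded string.
raw = "\ufeffhello".encode("utf-16-be")
out = raw.decode("utf-16-be")
assert out.startswith("\ufeff")   # BOM survived the decode
out = out[1:]                     # strip it, as make_compat_str does
print(out)  # hello
```

Without this step, the invisible U+FEFF would linger at the start of the string and break naive comparisons like `out == "hello"`.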
If you have suggestions, feedback or ideas, please reach out to @metachris.