Monday, March 18, 2013

QT006: Understanding Character Encodings in CIOS

Quick Tip #006: Understanding Character Encodings in CIOS

Background

A character encoding scheme is a means of digitally encoding characters for electronic interchange and storage.  A character encoding translates the semantic meaning of a character system into a digital format that is independent of how the characters are displayed.  A font, on the other hand, translates characters to glyphs that can be rendered on screen or paper.  A number of standard encodings have been developed over the years to increase interoperability between systems.  However, there is still no universally accepted character encoding scheme, so tools like Cast Iron support multiple encodings and provide the ability to translate between them.  Cast Iron supports a number of modern encoding standards as well as a few legacy encodings that are still occasionally in use.

ASCII

In the early days of computing, processors were designed to work with numeric data in 8-bit bytes.  A byte can encode 256 different values, which was more than enough to support commonly used US characters.  One of the first standardized encoding schemes, the American Standard Code for Information Interchange (ASCII), therefore uses only 7 of the 8 bits to encode 128 different character values: 26 uppercase and 26 lowercase letters, 10 digits, 33 punctuation and symbol characters (counting the space), and 33 control characters.
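As a quick check of the ASCII breakdown above, here is a minimal Python sketch.  It is only an illustration and has nothing to do with how CIOS works internally; the variable names are made up for the example.

```python
import string

# The 128 ASCII code points are 0 through 127, so each one fits in 7 bits.
all_ascii = [chr(code) for code in range(128)]

uppercase = [c for c in all_ascii if c in string.ascii_uppercase]
lowercase = [c for c in all_ascii if c in string.ascii_lowercase]
digits = [c for c in all_ascii if c in string.digits]
# Control characters are code points 0-31 plus 127 (DEL).
control = [c for c in all_ascii if ord(c) < 32 or ord(c) == 127]
# Everything left over is punctuation or a symbol; the space lands here.
symbols = [c for c in all_ascii
           if c not in uppercase + lowercase + digits + control]

counts = (len(uppercase), len(lowercase), len(digits),
          len(symbols), len(control))
print(counts)  # (26, 26, 10, 33, 33)

# ASCII text is one byte per character, and the high bit is never set.
encoded = "Cast Iron".encode("ascii")
print(len(encoded), all(byte < 0x80 for byte in encoded))  # 9 True
```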

Other Single Byte Encodings

Although 26 lowercase and uppercase letters are sufficient for US English, there are actually other languages out there that use more and different characters.  Vendors have created many proprietary encodings, such as windows-1252 from Microsoft and EBCDIC from IBM.  The International Organization for Standardization (ISO) also publishes several single byte character encodings for various character sets.  ISO-8859 is an extension of ASCII that uses the bit left unused by ASCII to add printable characters in the upper half of the byte range.  ISO-8859 defines 15 different mappings (numbered ISO-8859-1 through ISO-8859-16, with part 12 never published) that are useful for various languages.  ISO-8859-1, for example, is a single byte encoding for common characters in Western European languages and is popular because it is backwards compatible with ASCII.
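To make the differences between single byte encodings concrete, here is a small Python sketch.  It uses Python's own codec names (cp1252, iso-8859-1, and cp037 as one US EBCDIC code page), which are assumptions for this example and not necessarily the names Cast Iron uses.

```python
# The same byte means different things under different single byte encodings.
raw = bytes([0x80])
as_cp1252 = raw.decode("cp1252")      # windows-1252: the euro sign, U+20AC
as_latin1 = raw.decode("iso-8859-1")  # ISO-8859-1: control character U+0080
print(hex(ord(as_cp1252)), hex(ord(as_latin1)))  # 0x20ac 0x80

# Plain ASCII text is byte-for-byte identical in ASCII and ISO-8859-1.
same_bytes = "Hello".encode("iso-8859-1") == "Hello".encode("ascii")
print(same_bytes)  # True

# EBCDIC is not ASCII compatible: the letter A is a different byte value.
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")
print(ascii_a.hex(), ebcdic_a.hex())  # 41 c1
```

This is why the encoding always has to travel with the data: the bytes alone do not tell you which mapping produced them.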

Multi Byte Encodings

Single byte encodings are sufficient for languages with fewer than 256 common characters.  Some languages have thousands of characters, so limiting characters to a single byte is not sufficient and a multi byte encoding system is essential.  To provide a broader standard for encoding characters, the Unicode standard was developed to encompass most of the known characters used in writing systems around the world.  Unicode defines a space of over 1,000,000 code points to describe characters, and these can be encoded in various Unicode Transformation Formats using up to 4 bytes.  There are two main formats in use today for Unicode characters, UTF-8 and UTF-16.  Both seek to reduce the overhead of using a 4 byte code for every character by encoding the most commonly used characters in one or two bytes and expanding to up to 4 bytes for other characters.  UTF-8 encodes the 128 ASCII characters as the same single bytes that ASCII and ISO-8859-1 use, and adds additional bytes to represent all other Unicode characters; the accented characters in the upper half of ISO-8859-1 therefore take two bytes in UTF-8.  UTF-16 uses two bytes by default to represent the most commonly used characters in modern languages, and is better suited for languages such as Chinese that would frequently be forced to use 3 bytes in UTF-8 due to the number of characters in common use.
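The size trade-off between UTF-8 and UTF-16 is easy to see by counting bytes.  The following Python sketch is illustrative only; the helper name encoded_lengths is made up for the example, and utf-16-be is used so that no byte order mark is added to the count.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))

samples = [
    ("Latin letter A", "A"),
    ("e with acute accent", "\u00e9"),
    ("Chinese character", "\u4e2d"),
    ("emoji outside the 16-bit range", "\U0001F600"),
]

for label, ch in samples:
    print(label, encoded_lengths(ch))
# Latin letter A (1, 2)
# e with acute accent (2, 2)
# Chinese character (3, 2)
# emoji outside the 16-bit range (4, 4)

# ASCII text produces exactly the same bytes in UTF-8 as in ASCII.
ascii_compatible = "Cast Iron".encode("utf-8") == "Cast Iron".encode("ascii")
print(ascii_compatible)  # True
```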

Encodings in CIOS

Translating Encodings at the Endpoints

Because CIOS is a Java based platform, the native encoding is UTF-16 and all operations are performed in this encoding scheme.  It is therefore necessary to translate data to this encoding when CIOS loads it from an endpoint.  For most endpoints you have the option of deferring this translation and loading the data in binary format, in which case it will be Base64 encoded and processed in the system as a Base64 encoded string.  Cast Iron supports translation to and from the following encodings: UTF-8, US-ASCII, SHIFT_JIS, EBCDIC-XML-US, ISO-8859-1, EUC-JP, and Cp1252.



You can even dynamically set the encoding in some endpoint activities.  This allows you to parameterize the input and output encodings by reading them from a flat file, database, or configuration property.
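Conceptually, the endpoint behavior described above looks something like the following Python sketch.  This is not Cast Iron code: the functions load_from_endpoint and write_to_endpoint and the config dictionary are hypothetical names invented for illustration, and Python's str stands in for the platform's native string type.

```python
import base64
from typing import Optional

def load_from_endpoint(raw: bytes, encoding: Optional[str] = None) -> str:
    """Hypothetical stand-in for an inbound endpoint activity."""
    if encoding is None:
        # Translation deferred: carry the bytes as a Base64 encoded string.
        return base64.b64encode(raw).decode("ascii")
    # Translation at the endpoint: decode into the native string type.
    return raw.decode(encoding)

def write_to_endpoint(text: str, encoding: str) -> bytes:
    """Hypothetical stand-in for an outbound endpoint activity."""
    return text.encode(encoding)

# Encoding names read from configuration instead of being hard coded.
config = {"input_encoding": "SHIFT_JIS", "output_encoding": "UTF-8"}

# Three Japanese characters as they might arrive from a Shift_JIS system.
incoming = "\u65e5\u672c\u8a9e".encode("shift_jis")

text = load_from_endpoint(incoming, config["input_encoding"])
outgoing = write_to_endpoint(text, config["output_encoding"])
print(len(incoming), len(outgoing))  # 6 9

# Deferred translation keeps the original bytes intact for a later step.
deferred = load_from_endpoint(incoming)
print(base64.b64decode(deferred) == incoming)  # True
```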



Translating the Encoding in Transformation Activities

Most of the Transformation Activities, such as Read/Write Flat File, Read/Write XML, and Read/Write JSON, allow you to specify the encoding in the Activity.  This functionality allows you to pass the Read activity a Base64 encoded binary message and specify the encoding in the configure step, so that the encoding is translated and the data is transformed in a single step.  This can be helpful in cases where the encoding cannot be translated in the endpoint, such as data that is read from a BLOB in a database, or in cases where you need to support multiple encodings.



Again, the encoding can be set dynamically in the activity by showing the optional parameters and mapping an encoding parameter to the Encoding input.
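The idea of decoding and parsing in one step can be sketched in Python as follows.  The function read_flat_file is a hypothetical stand-in for a Read Flat File activity, not the real thing, and a simple comma separated layout is assumed for the data.

```python
import base64
import csv
import io

def read_flat_file(base64_message: str, encoding: str = "UTF-8"):
    """Decode a Base64 encoded binary message and parse it as delimited text."""
    raw = base64.b64decode(base64_message)   # back to the original bytes
    text = raw.decode(encoding)              # translate the encoding
    return list(csv.reader(io.StringIO(text)))  # transform into rows

# Simulate a BLOB column holding windows-1252 text, carried as Base64.
blob = base64.b64encode("id,name\n1,Jos\u00e9\n".encode("cp1252")).decode("ascii")

# The encoding is passed in as a parameter, so it can be chosen at run time.
rows = read_flat_file(blob, "Cp1252")
print(rows == [["id", "name"], ["1", "Jos\u00e9"]])  # True
```

Passing the wrong encoding name here would either raise an error or silently produce the wrong characters, which is why it pays to drive the parameter from something that actually knows the source encoding.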


MIME Messages

Initially, many Internet specifications required text to be encoded with ASCII characters.  The Multipurpose Internet Mail Extensions (MIME) standard was developed to allow other encodings and binary types to be sent over protocols designed with ASCII in mind.  The Read and Write MIME activities can be used in conjunction with the Email, HTTP, FTP, or really any other connector to properly format and parse multi part MIME messages.  The most common scenario for using multi part MIME messages is handling emails with attachments.  It is in these cases that the dynamic controls for encoding in the various other activities can be very useful.  Two headers are important for understanding the encoding parameters of MIME messages: the Content-Type header and the Content-Transfer-Encoding header.  The charset parameter in the Content-Type header tells you how the text within each part of the message is encoded, while the Content-Transfer-Encoding header tells you how the bytes of that part are encoded for transport.  In most scenarios, the Content-Transfer-Encoding will be 7bit for ASCII text and Base64 otherwise.  However, it is possible to have ASCII data that is sent with a base64 Content-Transfer-Encoding or, in rare circumstances, an 8bit or binary Content-Transfer-Encoding (most Internet protocols are designed with 7bit printable characters in mind and do not allow raw binary data to be transferred).
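To show how the two headers work together, here is a sketch using Python's standard email package.  It is only an analogy for what the Read and Write MIME activities do; the addresses, file name, and message text are invented for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a multi part message: an ASCII body plus a text attachment that is
# ISO-8859-1 encoded and sent with a base64 Content-Transfer-Encoding.
msg = EmailMessage()
msg["Subject"] = "Report"
msg["From"] = "sender@example.com"
msg["To"] = "receiver@example.com"
msg.set_content("Monthly report attached.")
msg.add_attachment("Caf\u00e9 totals for March.", charset="iso-8859-1",
                   cte="base64", filename="notes.txt")

# Serialize, then parse it back the way a receiving system would.
parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)

results = []
for part in parsed.walk():
    if part.is_multipart():
        continue  # skip the multipart container itself
    charset = part.get_content_charset()              # from Content-Type
    cte = str(part["Content-Transfer-Encoding"])      # transport encoding
    raw = part.get_payload(decode=True)               # undo the transfer encoding
    text = raw.decode(charset)                        # then apply the charset
    results.append((charset, cte, text.strip()))

for row in results:
    print(row)
```

Note the order of operations: the Content-Transfer-Encoding is undone first to get back the original bytes, and only then is the charset used to turn those bytes into characters.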
