Monday, March 25, 2013

QT007: FTP Polling Considerations

Quick Tip #007: FTP Polling Considerations

FTP has been around since the early 1970s and remains a popular protocol for integration.  The protocol is simple and widely adopted.  There are free, open source implementations of both client and server for many platforms, and the protocol is supported by major integration platforms such as IBM Cast Iron, Dell Boomi, Informatica, and many others.

Active vs Passive FTP

There are two modes in which FTP establishes connections, Active and Passive.  In the original protocol, now called Active mode, the client establishes a control connection to the server and uses the PORT command to tell the server which port to use when the server opens a data connection back to the client to transfer files.  This requires the client to be directly addressable by the server and therefore causes problems if the client is behind a firewall.  There are ways to use Active FTP from behind a firewall, but they require some care.  If the client has a public IP address, or one that is otherwise addressable by the server, you simply need to open a port on the client side for the data connection and have the client pass that port when issuing the PORT command.  If the client has a private IP and your firewall uses Network Address Translation (NAT), the firewall may have a feature that enables Active FTP by proxying the PORT command and the data connection.  If it does not, then the client and server need to support Passive mode.  In Passive mode, instead of issuing a PORT command, the client issues a PASV command; the server replies with an address and port, and the client initiates the data connection itself, so no inbound connection to the client is required.
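
As an illustration, here is a minimal sketch using Python's standard ftplib; the host name and credentials are placeholders.  ftplib defaults to Passive mode, and set_pasv(False) switches to Active mode:

    from ftplib import FTP

    ftp = FTP("ftp.example.com")             # placeholder host
    ftp.login(user="demo", passwd="secret")  # placeholder credentials

    # Passive mode (the ftplib default): the client issues PASV and
    # opens the data connection itself, so it works from behind NAT.
    ftp.set_pasv(True)
    ftp.retrlines("LIST")

    # Active mode: the client advertises a port via PORT and the server
    # connects back -- this only works if the server can reach the client.
    ftp.set_pasv(False)
    ftp.retrlines("LIST")

    ftp.quit()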

Security Concerns

Basic FTP does not use any form of encryption, not even for passwords, and is therefore not suitable when sensitive information is transferred over public networks.  There are a couple of protocols that address this problem; both are in wide use today.

FTPS

FTPS is the secure implementation of the File Transfer Protocol.  It is an implementation of the entire set of FTP commands over a secure connection.  Again there are two modes, Implicit and Explicit.  Implicit FTPS is now deprecated; as the name implies, all traffic is sent over a secure SSL/TLS connection, which is negotiated with the client before any commands can be executed.  Explicit FTPS is the currently supported standard and allows one server to provide both FTP and FTPS: by issuing the AUTH SSL or AUTH TLS command the client can request a secure connection, or it can simply proceed without AUTH for an unencrypted session.  Naturally, users with access to sensitive data should be required to issue AUTH SSL or AUTH TLS and should be rejected if they do not.
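
A sketch of an Explicit FTPS session with Python's standard ftplib.FTP_TLS (host and credentials are placeholders); prot_p() additionally encrypts the data channel:

    from ftplib import FTP_TLS

    ftps = FTP_TLS("ftp.example.com")          # placeholder host
    ftps.login(user="demo", passwd="secret")   # AUTH TLS is issued before login
    ftps.prot_p()                              # encrypt the data channel too
    ftps.retrlines("LIST")
    ftps.quit()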

sFTP

sFTP is the SSH File Transfer Protocol.  It is not strictly related to FTP, but it implements a very similar command set, and for most purposes it looks very similar from the user's perspective.  The protocol uses the same security mechanisms as SSH and is widely available on Unix platforms because most SSH servers implement it.
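
For completeness, a minimal SFTP sketch; this assumes the third-party paramiko library (not part of the Python standard library), plus a placeholder host and credentials:

    import paramiko

    client = paramiko.SSHClient()
    # Demo only: auto-accepting host keys; verify known hosts in production
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("sftp.example.com", username="demo", password="secret")

    sftp = client.open_sftp()
    print(sftp.listdir("."))                  # roughly equivalent to FTP LIST
    sftp.get("remote-file.csv", "local-file.csv")
    sftp.close()
    client.close()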

Avoiding Partial File Transfers

In FTP there is no standard way to lock a file to indicate that a transfer is in progress.  Many clients have unobtrusive ways of avoiding transferring a partial file.  IBM Cast Iron, for example, checks the file size before and after the transfer to see whether it has changed.  If the size changed, the client knows the file was still being uploaded while it was being downloaded and it may not have received the entire file, in which case it restarts the transfer and repeats the process until it receives the entire file without the size changing.  This works, but it is not foolproof: there is no implicit way to know that the uploader is done with the file before it is downloaded.  There are, however, several easy ways to avoid the problem by having the uploader take a specific action to indicate that the file is ready for download.  The first is to rename the file after transfer.  If you are loading a file called my-file.csv, upload it as my-file.tmp and rename it to my-file.csv once it has loaded completely.  This ensures the entire file is present before anyone tries to download it.  Another solution is a control file: a separate file that tells the client which files are ready to be downloaded, and which may include processing instructions such as what encoding was used for the file.  A third option is a checksum file: by uploading a cryptographic checksum alongside the file to be transferred, you ensure not only that the file transferred completely but also that it was not corrupted.
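
As a sketch of the rename and checksum techniques, using Python's standard ftplib and hashlib (host, credentials, and file names are placeholders):

    import hashlib
    import io
    from ftplib import FTP

    def upload_safely(ftp, local_path, remote_name):
        # Upload under a temporary name so pollers never see a partial file,
        # then rename -- the rename signals that the file is complete.
        with open(local_path, "rb") as f:
            ftp.storbinary("STOR " + remote_name + ".tmp", f)
        ftp.rename(remote_name + ".tmp", remote_name)

        # Publish a SHA-256 checksum alongside the file so downloaders can
        # verify both completeness and integrity.
        with open(local_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        ftp.storbinary("STOR " + remote_name + ".sha256",
                       io.BytesIO(digest.encode("ascii")))

    ftp = FTP("ftp.example.com")             # placeholder host
    ftp.login(user="demo", passwd="secret")  # placeholder credentials
    upload_safely(ftp, "my-file.csv", "my-file.csv")
    ftp.quit()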

Monday, March 18, 2013

QT006: Understanding Character Encodings in CIOS

Quick Tip #006: Understanding Character Encodings in CIOS

Background

A character encoding scheme is a means of digitally encoding characters for electronic interchange and storage.  A character encoding translates the semantic meaning of a writing system into a digital format, independent of how the characters are displayed.  A font, on the other hand, translates characters to glyphs that can be rendered on screen or paper.  A number of standard encodings have been developed over the years to increase interoperability between systems; however, there is still no universally accepted character encoding scheme, and therefore tools like Cast Iron support multiple encodings and provide the ability to translate between them.  Cast Iron supports a number of modern encoding standards as well as a few legacy encoding systems that are still in occasional use.

ASCII

In the early days of computing, processors were designed to work with numeric data in 8-bit bytes.  A byte can encode 256 different values, which was plenty to support commonly used US characters.  One of the first standardized encoding schemes, the American Standard Code for Information Interchange (ASCII), was born to encode 128 different character values: 26 uppercase letters, 26 lowercase letters, 10 digits, 33 punctuation and symbol characters, and 33 control characters.
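
A quick illustration in Python of the 128-value ASCII range:

    # Each ASCII character maps to a code point between 0 and 127
    for ch in "Az9!":
        print(ch, "->", ord(ch))

    # Encoding to ASCII yields one byte per character...
    print("Hello".encode("ascii"))      # b'Hello'

    # ...and characters outside the 128-value range are rejected
    try:
        "café".encode("ascii")
    except UnicodeEncodeError as e:
        print("not representable in ASCII:", e)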

Other Single Byte Encodings

Although 26 lowercase and 26 uppercase letters are sufficient for US English, other languages use more, and different, characters.  There have been many vendor-specific standards, such as windows-1252 from Microsoft and EBCDIC from IBM.  There are also several encoding schemes from the International Organization for Standardization (ISO) that provide single-byte encodings for various character sets.  ISO-8859 is an extension of ASCII that uses the eighth bit left unused by ASCII and replaces some of the control characters with printable characters.  ISO-8859 defines a family of mappings (ISO-8859-1 through ISO-8859-16) for various languages; ISO-8859-1, for example, is a single-byte encoding for popular characters in Western European languages and is widely used because it is backwards compatible with ASCII.
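
For illustration, a Python snippet showing ASCII compatibility and how the same high byte decodes differently across single-byte code pages:

    # ISO-8859-1 matches ASCII for the first 128 values
    assert "abc".encode("iso-8859-1") == "abc".encode("ascii")

    # The same high byte means different things in different parts/pages
    b = bytes([0xE9])
    print(b.decode("iso-8859-1"))   # é (Latin-1, Western European)
    print(b.decode("iso-8859-7"))   # ι (Greek)
    print(b.decode("cp1252"))       # é (windows-1252 matches Latin-1 here)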

Multi Byte Encodings

Single-byte encodings are sufficient for languages with fewer than 256 common characters, but some languages have thousands of characters, so limiting characters to a single byte is not sufficient and a multi-byte encoding system is essential.  To provide a broader standard for encoding characters, Unicode was developed to encompass most of the known characters used in writing systems around the world.  Unicode defines over 1,000,000 code points, which can be encoded in various Unicode Transformation Formats using up to 4 bytes per character.  There are two main formats in wide use today, UTF-8 and UTF-16.  Both seek to reduce the overhead of spending 4 bytes on every character by encoding the most commonly used characters with one or two bytes and expanding up to 4 bytes for the rest.  UTF-8 uses the same encoding as ASCII for the first 128 characters and adds additional bytes to represent everything else (note that it is not byte-compatible with ISO-8859-1 beyond ASCII; Latin-1 characters above 127 take two bytes in UTF-8).  UTF-16 uses two bytes by default to represent the most commonly used characters in modern languages, and is better suited for languages such as Chinese, whose characters would frequently take 3 bytes in the UTF-8 scheme due to the number of characters in common use.
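
A quick Python check of how many bytes each format spends per character (utf-16-le is used so the byte-order mark does not inflate the count):

    # Byte counts for the same characters in UTF-8 vs UTF-16
    for ch in ["A", "é", "中"]:
        print(ch,
              len(ch.encode("utf-8")),      # 1, 2, 3 bytes respectively
              len(ch.encode("utf-16-le")))  # 2, 2, 2 bytes respectively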

Encodings in CIOS

Translating Encodings at the Endpoints

Because CIOS is a Java-based platform, its native encoding is UTF-16 and all operations are performed in this encoding scheme.  It is therefore necessary to translate data to this encoding when CIOS loads it from an endpoint.  For most endpoints you have the option of deferring this translation and loading the data in binary format, in which case it will be Base64 encoded and processed in the system as a Base64-encoded string.  Cast Iron supports translation to and from the following encodings: UTF-8, US-ASCII, SHIFT_JIS, EBCDIC-XML-US, ISO-8859-1, EUC-JP, and Cp1252.
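
CIOS performs this translation internally; purely as an illustration, the equivalent operation in Python (the Shift_JIS sample bytes are fabricated for the example):

    # Bytes arrive from the endpoint in the source encoding (Shift_JIS here)
    source_bytes = "こんにちは".encode("shift_jis")

    # Decode into the platform's native text representation...
    text = source_bytes.decode("shift_jis")

    # ...then re-encode for whatever the target endpoint expects
    target_bytes = text.encode("utf-8")
    print(text, target_bytes)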



You can even dynamically set the encoding in some endpoint activities.  This allows you to parameterize the input and output encodings by reading them from a flat file, database, or configuration property.



Translating the Encoding in Transformation Activities

Most of the Transformation Activities, such as Read/Write Flat File, Read/Write XML, and Read/Write JSON, allow you to specify the encoding in the Activity.  This functionality allows you to pass the Read activity a Base64-encoded binary message and specify the encoding in the Configure step, translating the encoding and transforming the data in a single step.  This can be helpful in cases where the encoding cannot be translated in the endpoint, such as data read from a BLOB in a database, or in cases where you need to support multiple encodings.
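
The Base64-plus-Encoding behavior is equivalent to this Python sketch; the payload is fabricated, and the encoding value stands in for a configured or dynamically mapped parameter:

    import base64

    # A binary payload as it might arrive from a BLOB column, Base64 wrapped
    payload_b64 = base64.b64encode("Grüße;42".encode("cp1252"))

    encoding = "cp1252"              # could be mapped from a config property
    raw = base64.b64decode(payload_b64)
    text = raw.decode(encoding)      # now ready for flat-file parsing
    print(text)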



Again, the encoding can be set dynamically in the activity by showing the optional parameters and mapping an encoding parameter to the Encoding input.


MIME Messages

Initially, many Internet specifications required text to be encoded with ASCII characters; the Multipurpose Internet Mail Extensions (MIME) standard was developed to allow other encodings and binary content to be sent over protocols designed with ASCII in mind.  The Read MIME and Write MIME activities can be used in conjunction with the Email, HTTP, FTP, or really any other connector to properly format and parse multipart MIME messages.  The most common scenario for multipart MIME messages is handling emails with attachments, and it is in these cases that the dynamic encoding controls in the various other activities can be very useful.  Two headers are important for understanding the encoding of a MIME message: the Content-Type header and the Content-Transfer-Encoding header.  The charset parameter in the Content-Type header tells you how text within each part of the message is encoded, while the Content-Transfer-Encoding tells you how the part's bytes are encoded for transport.  In most scenarios the Content-Transfer-Encoding will be 7bit for ASCII text and base64 otherwise; however, it is possible for ASCII data to be sent with a base64 Content-Transfer-Encoding or, in rare circumstances, an 8bit or binary Content-Transfer-Encoding (most Internet protocols are designed with 7-bit printable characters in mind and do not allow raw binary data to be transferred).
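
As an illustration of reading those two headers, here is a sketch using Python's standard email package; the multipart message below is fabricated for the example:

    from email import message_from_bytes

    raw = (b"MIME-Version: 1.0\r\n"
           b"Content-Type: multipart/mixed; boundary=XYZ\r\n\r\n"
           b"--XYZ\r\n"
           b"Content-Type: text/plain; charset=iso-8859-1\r\n"
           b"Content-Transfer-Encoding: base64\r\n\r\n"
           b"R3L832U=\r\n"          # base64 of 'Grüße' in ISO-8859-1
           b"--XYZ--\r\n")

    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.is_multipart():
            continue                               # skip the container
        charset = part.get_content_charset()       # from Content-Type
        cte = part.get("Content-Transfer-Encoding")
        body = part.get_payload(decode=True)       # undoes base64/quoted-printable
        print(charset, cte, body.decode(charset))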

Monday, March 11, 2013

QT005: Allocating More Memory for Cast Iron Studio

Quick Tip #005:  Allocating More Memory for Cast Iron Studio

By default, the maximum amount of memory available to Cast Iron Studio is 512MB.  Most of the time, that is more than adequate.  However, if you find yourself working with a large XML Schema in a map, testing a complicated XSLT, or doing other memory-intensive work, you may need additional memory.

Background

Cast Iron Studio is a Java application, and like all Java applications, the amount of memory available to it is bounded by the Java Virtual Machine (JVM).  There are several JVM parameters related to memory that can be set at JVM startup; typically the most important are those related to the JVM heap.  The heap is the long-term, global memory used by the application, and there are two parameters that matter here: the minimum and maximum size of the heap.  The minimum, or initial, size is the memory allocated when the JVM starts, and the maximum size is the upper bound of the heap.  If an application tries to allocate more memory than the maximum heap size, a java.lang.OutOfMemoryError is thrown.  For our purposes the minimum heap size can be left alone: the heap grows automatically until the maximum is reached, and the default value is typically fine for use with Studio.

How do I set the Maximum Heap Size for Cast Iron Studio?

First, locate the CastIronStudio.exe executable; it should be in the main folder where you installed Studio.  Right-click the executable and select Create Shortcut:



This will create a new file called Shortcut to CastIronStudio.exe.  You will now need to edit the shortcut and append the JVM parameter that increases the maximum heap size: -J-XmxSSSSm, where SSSS is the new heap size in megabytes, e.g. -J-Xmx1024m.  See the screenshot below:



Next you will probably want to switch to the General tab and change the name to something more meaningful and indicative of the parameter that you set, such as "Cast Iron Studio 6.3.0.1 - 1024m".  That way you will be able to distinguish the modified version from the original and will know whether or not you are using the larger heap size.

That's it: just double-click the shortcut to run Studio with the larger heap size.  Note: we demonstrated this setting with the latest version, 6.3.0.1, on Windows XP; however, the process is the same for any install4j-based version of Studio.  The process on other versions of Windows such as Windows Server or Windows 7 is almost identical (the only difference on Windows 7 is the name Windows generates for the shortcut).

Monday, March 4, 2013

FR002: Job Keys

Feature Review #002: Job Keys

What are Job Keys?

Job keys are a useful utility feature in CIOS that allows you to tag each job that runs with searchable values.  Tagging a job with a particular key allows you to search for that value in the WMC to find the job.  The primary key is also displayed in job list views in the WMC.

Using Job Keys

Creating and using Job Keys in CI Studio is a simple two-step process:

  1. Managing Available Job Keys: Open your orchestration and click the green starter dot (see the screenshot below).  This brings up the orchestration pane; in the first section you will see the list of job keys.  To add a job key, click Add.  There is a checkbox to make a particular key the primary key; note that only one key can be primary, so you must uncheck it before selecting a different key.  To remove a key, select it and click the Remove button.
  2. Creating Job Keys: To create a job key, simply use the Create Job Keys activity and map a value to the key that you want to create.  Note that the activity is named Create Job Keys rather than Set Job Keys: each time you use this activity, a new job key value is created.  If you call it twice for a single key, you will see two values for that key in the WMC after the job runs.
Click the green starter dot to manage available job keys.

Use the Create Job Keys activity to set your job keys.

Design Patterns

Job keys make your jobs searchable in the WMC, so they are very useful for storing cross-reference information.  For example, if you are writing an orchestration to sync accounts between SAP and salesforce, it may be useful to store the account ID as a job key so that you can quickly see when an account was last synced.  In the same example, it may also be useful to store the IDoc number in order to trace an IDoc through the system.

The primary job key is what is displayed in the WMC job list views, so setting a meaningful primary key helps you distinguish one job from the next.  It is also a good place to display status information about completed jobs.  This is especially true for batch jobs, where it is very useful to have the primary key indicate how many items within the batch were processed successfully, had warnings, or had errors.  To accomplish this, add a job key called status and make it primary, then calculate the number of successes, warnings, and errors and map them in the Create Job Keys activity using a Concatenate function to form a string like: Batch Job Complete. success: 5, warn: 2, error: 1.
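
A quick sketch in Python of the string that Concatenate mapping would produce, using sample counts (in CIOS the counts would be computed in the mapper):

    # Aggregate batch results (sample values for illustration)
    success, warn, error = 5, 2, 1

    status = "Batch Job Complete. success: {}, warn: {}, error: {}".format(
        success, warn, error)
    print(status)   # the value mapped to the primary 'status' job key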

Avoid overuse of this feature: as noted above, once the Create Job Keys activity is called, the job key is logged permanently.  If you are processing hundreds or thousands of items within a batch job, it might seem like a good idea to log a key for each item; however, this can create a dramatic drag on the performance of your orchestration that may not be immediately evident (logging is asynchronous, and indexing job keys is an expensive operation for the logging system).  Instead, use the status summary design pattern to track aggregate status, or log to an external database if you need that level of detail for batch processes.