Thursday, May 23, 2013

AP001: High Availability for IBM Cast Iron

Architectural Patterns #001: High Availability for IBM Cast Iron

Some folks out there have been asking about High Availability (HA).  With CIOS there are two general-purpose High Availability options: for physical appliances you can use an HA Pair setup to provide high availability, and in a HyperVisor environment you have several levels of HA built into VMWare.  Every situation is different, but we typically recommend VMWare as an HA mechanism because it offers more flexibility and many of our customers already have VMWare infrastructure and expertise.

First, a Bit of Background on High Availability and Fault Tolerance

When designing a system you inevitably spend a lot of time thinking about what happens when something goes wrong.  Error handling logic is often the most time-consuming part of system design, and it must inevitably extend beyond your orchestrations to the platform itself.  System availability, measured as percentage uptime, is a common metric used when defining a Service Level Agreement (SLA).  For example, a system with 99% uptime can be down for roughly 1.7 hours a week.  A system with 99.999% uptime (five nines is a common idiom when it comes to availability) can be down for only about 5 minutes every year.  Typically, when we talk about System Availability, we are concerned with maximizing the amount of time that the system is available to process transactions.  There are two main reasons a system can be unavailable: maintenance or a system failure.  Eliminating maintenance windows entirely, in what is typically referred to as a "Zero Downtime" architecture, is extremely difficult.  We've never attempted this type of scenario with Cast Iron because, when it comes down to it, not many users can justify the expense of that kind of SLA; most simply schedule downtime when the system is not heavily used, and fallbacks such as allowing transactions to queue can be used while system maintenance occurs.
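As a quick sanity check on those numbers, here is a minimal sketch (plain Python, nothing Cast Iron specific) that converts an availability percentage into the downtime it allows per week and per year:

    # Convert an availability percentage into the downtime it allows.
    HOURS_PER_WEEK = 24 * 7            # 168 hours
    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

    def allowed_downtime(availability_pct):
        unavailable = 1 - availability_pct / 100.0
        return unavailable * HOURS_PER_WEEK, unavailable * MINUTES_PER_YEAR

    for pct in (99.0, 99.9, 99.999):
        week_hours, year_minutes = allowed_downtime(pct)
        print(f"{pct}% uptime -> {week_hours:g} h/week, {year_minutes:g} min/year")

    # 99.0%   -> 1.68 h/week,    5256 min/year
    # 99.9%   -> 0.168 h/week,   525.6 min/year
    # 99.999% -> 0.00168 h/week, 5.256 min/year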

Avoiding downtime due to system failures is referred to as "High Availability" or "Fault Tolerance."  How a system failure will affect system availability depends on how long it takes to recover from an outage.  Like anything, there is a spectrum of system availability options.  A typical set of options is as follows:
Zero Redundancy: You have no backup unit; you need to call and order a replacement part and wait for it to be delivered and installed before you can power your unit back up.
Cold Spare: You have a backup unit, but it's sitting in its box in your data center.  You need to unrack the old unit, rack the new unit, and plug in the network and power cables.  Then boot the unit, patch it, load your projects, configure them, and finally start your orchestrations.
Hot Spare: Your spare unit is already racked and patched, with orchestrations loaded; you just need to switch the IP addresses and start your orchestrations.
High Availability: With a high availability solution, the process is now fully automated.  You have reserved capacity to accommodate failover and an automated process that recovers in under 10 minutes.
Fault Tolerant: With a Fault Tolerant system, the process is fully automated and resources are not only reserved, they are already allocated.  Failover is seamless to external systems and recovery time is under 10 seconds, ideally instantaneous.

What Is a Physical HA Pair and How Does It Work?

With a physical HA Pair you actually have two physical appliances that are tied together in a Master / Slave setup.  The appliances have special hardware and dedicated network connections between them so they can replicate data and detect failure scenarios.  One of the appliances is designated the "Active" appliance and the other runs in "Passive" mode.  The Active appliance carries out all the work that a standalone appliance would do; however, it also commits all changes to the Work In Progress (WIP) memory area to the Passive appliance.  The WIP is the persistent store that the appliance uses to record the state of all of your variables before any connector activity.  With the WIP replicated to the Passive appliance, should anything happen to the Active appliance, the Passive appliance is ready to take over as soon as it detects a failure.  When the Passive appliance takes over, it assumes the MAC addresses of the former Active appliance, so to external systems there is no change.  On the System Availability spectrum this solution sits somewhere between HA and FT: recovery is automatic and close to instantaneous; however, because the system recovers at the last state of the WIP, network connections to endpoints need to be reestablished and you need to understand the nature of the interactions with your endpoints.  Some endpoints support Exactly Once semantics, where the connector will guarantee that an operation is only performed once.  For example, the database connector does this by using control tables to synchronize a key between the appliance and the database.  The connector inserts a key into the control table, and the presence of that key is checked before repeating an operation; if the key is present, the operation has already been completed.  We generally recommend that you design all processes to be idempotent, so it won't matter if a single interaction with an endpoint is repeated.  This is the easiest way to recover from errors, but it often requires careful design to achieve.
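To make the exactly-once idea concrete, here is a minimal sketch of the control-table pattern in Python; the sqlite3 database and the table and column names (ci_control, job_key, orders) are assumptions for illustration only, not the actual schema or logic used by the Cast Iron database connector:

    import sqlite3

    # Illustrative control table used to detect whether a job's write has
    # already been applied, so a replay after failover does not repeat it.
    conn = sqlite3.connect("endpoint.db")
    conn.execute("CREATE TABLE IF NOT EXISTS ci_control (job_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, total REAL)")

    def apply_once(job_key, order_id, total):
        """Insert the order only if this job key has not been seen before."""
        seen = conn.execute(
            "SELECT 1 FROM ci_control WHERE job_key = ?", (job_key,)
        ).fetchone()
        if seen:
            return "skipped (already applied before failover)"
        # Apply the business operation and record the key in one transaction,
        # so a replayed job finds the key and does nothing.
        with conn:
            conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
            conn.execute("INSERT INTO ci_control (job_key) VALUES (?)", (job_key,))
        return "applied"

    print(apply_once("job-42", "ORD-1001", 99.50))   # applied
    print(apply_once("job-42", "ORD-1001", 99.50))   # skipped

The same check-before-write discipline is what makes an orchestration idempotent: repeating the interaction after a failover changes nothing.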

What Options Do I Have with VMWare?

VMWare actually gives you several levels of High Availability to choose from, depending on the resources that you want to allocate to HA.  The simplest option is VMWare High Availability, which requires a VMWare cluster with VMotion and VMWare HA configured.  In this mode VMWare will detect a server or appliance fault and automatically bring up the VM on a new server in the cluster.  There is a potential for some downtime while the appliance is started on the new server; however, the appliance will recover where it left off, using the last state of the WIP before the crash.  The advantage of this setup is that resources do not have to be allocated to a redundant appliance until a failure occurs.  Essentially, the resources required to recover from a failure are reserved, not allocated, and therefore can be pooled.  VMWare also offers a higher level of high availability called VMWare Fault Tolerance.  With VMWare Fault Tolerance, failover resources are preallocated and VMWare is actively replicating the state of your virtual machine to another server.  This method provides instantaneous recovery in the event of a failure, and unlike a physical appliance the replication goes beyond the WIP; therefore, in Fault Tolerance mode failover can occur transparently even in the middle of an interaction with an external resource.  The disadvantage of this approach is that you need additional dedicated network resources for FT and you need to preallocate the memory and CPU resources for FT.  FT therefore effectively requires more than double the resources of HA, due to the extra network requirements and the load required to replicate the state.  See this post for more details on setting up CIOS HyperVisor Edition.

Active/Active and Load Balancing Scenarios

The active/passive model works well when you want High Availability and your load does not exceed the capacity of a single appliance.  It is a simple but elegant design that provides transparent recovery in the event of failure.  This ease of use is perfectly aligned with what a typical customer expects from Cast Iron.  Further, in our experience a single appliance, when orchestrations are designed properly, is more than adequate for most Cast Iron users.

That being said, there are other options out there for load balancing and high availability using multiple appliances; however, most are dependent on the use case and the endpoints involved.  If you are using CIOS to host web services over HTTP, you can use an HTTP load balancer to distribute load across multiple appliances; most HTTP load balancers have some means of detecting failed nodes and redirecting traffic.  For database sources you can use multiple buffer tables and write triggering logic to balance the load, as sketched below.  Other source systems such as SAP and JMS are also easily set up for load balancing across multiple appliances.
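As an illustration of the buffer-table approach, here is a minimal sketch in Python that spreads incoming rows across two buffer tables, one polled by each appliance; the table names and the modulo-on-id routing rule are assumptions for the example, not a prescribed design.  In a real deployment this routing would typically live in a database trigger or stored procedure on the source system rather than in application code:

    import sqlite3

    # Two hypothetical buffer tables, each polled by a different appliance.
    conn = sqlite3.connect("source.db")
    for name in ("buffer_appliance_a", "buffer_appliance_b"):
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} (id INTEGER PRIMARY KEY, payload TEXT)"
        )

    def route_event(event_id, payload):
        """Pick a buffer table based on the event id (simple even/odd split)."""
        table = "buffer_appliance_a" if event_id % 2 == 0 else "buffer_appliance_b"
        with conn:
            conn.execute(f"INSERT INTO {table} (id, payload) VALUES (?, ?)", (event_id, payload))
        return table

    for i in range(4):
        print(i, "->", route_event(i, f"order-{i}"))
    # 0 -> buffer_appliance_a, 1 -> buffer_appliance_b, 2 -> buffer_appliance_a, ...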

In the past we have also used a dispatcher model to distribute load; this is particularly effective when the load is generated by use cases with complex logic, which leads to longer-running jobs.  With a dispatcher model, however, eliminating the dispatcher as a single point of failure can prove to be difficult and is use case dependent.

What About Disaster Recovery?

Disaster Recovery (DR) is a question of how to deal with a catastrophic failure, such as a hurricane destroying your datacenter.  Again, how quickly you can recover and what level of service you can provide in such an event will depend on your architecture and will impact your budget.  The lowest cost DR solutions are usually manual workarounds that allow business to continue when a catastrophic failure occurs.  Truly seamless DR requires a remote datacenter with hardware replicating the main datacenter and automated recovery.  In most DR plans the recovery requires some manual processes, and in most there is ongoing maintenance that needs to occur to keep project versions and patch levels in sync.  Most DR plans call for DR appliances to be racked, mounted, and powered up at all times, but that too is a consideration and a cost.  Most customers who opt for a hardware solution will have an HA Pair for their main appliance and a single node in a remote data center for DR.  Typically, the DR node is racked, mounted, and powered on, with all the orchestrations loaded and configured but undeployed.  When it comes time to activate the DR appliance, it is theoretically just a matter of starting up the projects on that appliance.  Virtual appliance users typically have a DR plan for their virtual infrastructure, and Cast Iron falls in line with that plan.  However, planning for DR is typically application specific and requires thinking about the problem from end to end.  You need to understand the DR plan for any endpoints that you are integrating with and also understand where the state of your integrations is stored.  In the end there is a serious cost-benefit analysis that must be considered when planning for HA / FT and DR.  The business must decide where the proper balance is between SLA and budget.

Monday, May 20, 2013

QT011: Copy And Paste Between Projects in Cast Iron Studio


Quick Tip #011: Copy And Paste Between Projects in Cast Iron Studio

Most Cast Iron Studio users already know that you can copy and paste activities.  Cast Iron has good support for this feature and will identify when it needs to create new variables etc.  What a lot of users don't know is that you can actually copy and paste activities between projects.

How Do I Open Two Projects at Once?

The key obstacle to cutting and pasting between projects is the fact that Cast Iron Studio only allows you to open one project at a time.  The answer to the question of how to have two projects open at once is actually quite simple: install a second copy of studio.  All you need to do is run the installer a second time and tell it you want to install studio in a new location.  Make sure that you also create a new start menu folder for your second copy of studio, and that is all there is to it.  You can now have two projects open at once.

Once you have the source and target projects open, you can copy activities from the source project as you normally would and paste them into the target project in your second copy of Studio.  Whether you right-click and choose copy/paste or use the CTRL-C / CTRL-V shortcuts, it's just that simple.

What Can I Copy?

There is a catch to using this undocumented feature: certain things cannot be copied and may become corrupted when pasted.  Only Activities and their associated variables can be copied and pasted between projects.  Flat File Definitions, XML Schemas, WSDLs, and Stylesheets will have to be imported from the source to the target through the Project Tab's Add Document dialog (remember that for flat file schemas you will have to choose "All Files" to find them in the source project).  Endpoints will need to be recreated, and you will likely need to repair some linkages by going to the pick endpoint step of your activities.  It's up to you to decide whether the extra work of relinking endpoints and other repair steps is worth it, but if you have a complex map that you do not want to rebuild in a new project, it may be worth using this simple trick.