Thursday, May 23, 2013

AP001: High Availability for IBM Cast Iron

Architectural Patterns #001: High Availability for IBM Cast Iron

Some folks out there have been asking about High Availability (HA).  With CIOS there are two general purpose High Availability options: For physical appliances you can use an HA Pair setup to provide high availability, in a HyperVisor environment you have several levels of HA build into VMWare.  Every situation is different but we typically recommend VMWare as an HA mechanism because it offers more flexibility and many of our customers already have VMWare infrastructure and expertise.

First a bit of Background on High Availability and Fault Tolerance 

When designing a system you inevitably spend a lot of time thinking about what happens when something goes wrong.  Error handling logic is often the most time consuming part of system design, this must inevitably extend outside of your orchestrations to the platform itself.  System availability as measured by percentage uptime is a common metric used when defining an Service Level Agreement (SLA).  For example a system with 99% uptime can be down for roughly 1.5 hours a week.  A system with 99.999% (five-nines is a common idiom when it comes to availability), the system can be down for about 5 minutes every year.  Typically, when we talk about System Availability, we are concerned with maximizing the amount of time that this system is available to process transactions.  There are two main reasons a system can be unavailable: maintenance or a system failure.  Having no maintenance windows in what is typically referred to as a "Zero Downtime" architecture is extremely difficult, we've never attempted this type of scenario with Cast Iron because when it comes down to it not many users can justify the expense of that kind of SLA and simply schedule downtimes when the system is not heavily used and fall back such as allowing transactions to queue can be used to allow system maintenance to occur.

Avoiding downtime due to system failures is referred to as "High Availability" or "Fault Tolerance."  How a system failure will affect system availability depends on how long it will take to recover from an outtage.  Like anything, there is a spectrum of system availability options.  A typical set of options are as follows:
Zero Redundancy: You have no backup unit, you need to call and order a replacement part and wait for it to be delivered and installed before you can power your unit back up.
Cold Spare: You have a backup unit, its sitting in the box in your data center.  You need to unrack the old unit, rack the new unit, plug in the network and power cables.  Boot the unit, patch it, load your projects, configure them, and finally start your orchestrations.
Hot Spare: your spare unit is already racked and patched, with orchestrations loaded, you just need to switch the IP addresses and start your orchestrations.
High Availability: With a high availability solution, the process is now fully automated.  You have reserved capacity to accommodate failover an automated process to recover in under 10 minutes.
Fault Tolerant: With a Fault Tolerant system, the process is fully automated and resources are not only reserved they are already allocated.  Failover is seamless to external systems recovery time is under 10 seconds, ideally instantaneous.

What is a Physical HA Pair and How does it work?

With a physical HA Pair you actually have two physical appliances that are tied together in a Master / Slave setup.  The appliances have special hardware and dedicated network connections between them so they can replicate and detect failure scenarios.  One of the appliances is determined to be the "Active" appliance and the other runs in "Passive" mode.  The Active appliance carries out all the work that a standalone appliance would do however it commits all changes to the Work In Progress (WIP) memory area to the passive appliance.  The WIP is the persistent store that the appliance uses to store the state of all of your variables before any connector activity.  With the WIP replicated to the passive appliance, should anything happen to the Active appliance, the Passive appliance is ready to take over as soon as it detects a failure.  When the Passive appliance takes over, it will take over the MAC addresses of the former Active appliance, therefore to external systems there is no change.  In the System Availability spectrum this solution is somewhere between HA and FT, recovery is automatic and close to instantaneous, however, because the system recovers at that last state of the WIP network connections need to be reestablished to endpoints and you need to understand the nature of the interactions with your endpoints.  Some endpoints support Exactly Once semantics where the connector will guarantee that an operation is only performed once.  For example, the database connector does this by using control tables to synchronize a key between the appliance and the database.  They insert a key into the control table and the presence of that key is checked before repeating an operation, if the key is present, the operation has already been completed.  We generally recommend that you design all processes to be idempotent, so it won't matter if a single interaction with an endpoint is repeated.  This is the easiest way to recover from errors, but often requires careful design to achieve.

What Options Do I have with VMWare?

VMWare actually gives you several levels of High Availability to choose from depending on the resources that you want to allocate to HA.  The simplest option within VMWare is VMWare High Availability.  VMWare High Availability requires a VMWare cluster with VMotion and VMWare HA configured.  In this mode VMWare will detect a server or appliance fault and automatically bring up the VM on a new server in the cluster.  In this mode, there is a potential for some downtime while the appliance is started on the new server, however, the appliance will recover whether it left off using the last state of the WIP before the crash.  The advantage of this setup is that resources do not have to be allocated to a redundant appliance until a failure occurs.  Essentially, the resources required to recover from a failure are reserved not allocated and therefore can be pooled.  VMWare offers a higher level of high availability called VMWare Fault Tolerance.  With VMWare Fault Tolerance, failover resources are preallocated and VMWare is actively replicating the state of your virtual machine to another server. This method provides instantaneous recovery in the event of a failure and unlike a physical appliance the replication goes beyond the WIP, therefore, in Fault Tolerance mode the failover can occur in the middle of an interaction with an external resource transparently.  The disadvantage of this approach is that you need additional dedicated network resources for FT and you need to preallocate the memory and CPU resources for FT.  Therefore, FT effectively requires more than double the resources as HA due to the extra network requirements and load required to replicate the state.  See this post for more details on setting up CIOS HyperVisor Edition.

Active/Active and Load Balancing Scenarios

The active/passive model works well when you want High Availability and your load does not exceed the capacity of a single appliance.  It is a simple but elegant design that provides transparent recovery in the event of failure.  This ease of use is perfectly aligned with what a typical customer expects from Cast Iron.  Further, in our experience a single appliance, when orchestrations are designed properly, is more than adequate for most Cast Iron users.

That being said, there are other options out there for load balancing and high availability using multiple appliances, however, most are dependent on the use case and the endpoints involved.  If you are using CIOS to host web services over HTTP you can use an HTTP load balancer to distribute load across multiple appliances, most HTTP load balancers have some means of detecting failed nodes and redirecting traffic.  For database sources you can use multiple buffer tables and write triggering logic to balance the load.  Other source systems such as SAP and JMS are also easily setup for load balancing across multiple appliances.

In the past we have also used a dispatcher model to distribute load, this is particularly effective when the load is generated by use cases with complex logic which leads to longer running jobs.  With a dispatcher model, however, eliminating the dispatcher as a signal point of failure can prove to be difficult and is use case dependent.

What About Disaster Recovery?

Disaster Recovery (DR) is a question of how do deal with catastrophic failure such as when a hurricane destroys your datacenter.  Again, how quickly you can recover and what level of service you can provide in such an event will depend on architecture and impact budget.  The lowest cost DR solutions are usually manual workarounds to allow business to continue when a catastrophic failure occurs.  True seamless DR requires a remote datacenter with hardware replicating the main data center and automated recovery.  In most DR plans the recovery requires some manual processes, and in most there is ongoing maintenance that needs to occur to keep project versions and patch levels in sync.  Most DR plans call for DR appliances to be racked and mounted and powered up at all times, but that too is a consideration and a cost.  Most customers who opt for a hardware solution will have an HA pair for there main appliance and a single node in a remote data center for DR.  Typically, the DR node is racked, mounted and powered on with all the orchestrations loaded and configured but undeployed.  When it comes time to activate the DR appliance at that point it is theoretically just a matter of starting up the projects on the DR appliance.  Virtual appliance users typically have a DR plan for their virtual infrastructure and Cast Iron falls in line with that plan.  However, planning for DR is typically application specific and requires thinking about the problem from end to end.  You need to understand the DR plan for any endpoints that you are integrating with and also understand where the state of your integrations is stored.  In the end there is a serious cost benefit analysis that must be considered when planning for HA / FT and DR.  The business must decide where the proper balance is between SLA and budget. 


  1. Cloud is one of the tremendous technology that any company in this world would rely on(Salesforce Training institutes in Chennai). Using this technology many tough tasks can be accomplished easily in no time. Your content are also explaining the same(Salesforce developer training in chennai). Thanks for sharing this in here. You are running a great blog, keep up this good work(big data training).

  2. High availability is the one very important thing in any database technologies.... here u presented very nicely... great...

    Sales force is the leader CRM plays a vital role in the successes of any business..... huge openings on sales force ... you can check out here ::::
    Salesforce Online Training In Hyderabad