George Lawton wrote a good summary of my JavaOne talk in his article titled "Google App Engine plus Amazon AWS: Best of both worlds":

Google App Engine (GAE) is focused on making development easy, but limits your options. Amazon Web Services is focused on making development flexible, but complicates the development process. Real enterprise applications require both of these paradigms to achieve success… What we really want is the flexibility and performance of AWS and the simplicity and ease of use of GAE.

This is exactly what we have been working on for the past year, leading us to the launch of our new cloud platform. With this platform we leverage GigaSpaces XAP as the high performance scale-out application server and Amazon as the robust and flexible IaaS. Together they form an alternative Platform as a Service geared for enterprise grade applications. This allows the cloud environment to inherit the high performance, low latency and scalability of the XAP platform, which in turn lets you reach your performance and scaling targets with fewer machines, and therefore at a lower cost.

Real-life case study: Primatics Financial – Risk analysis as a service

Francis de la Cruz and Argyn Kuketayev from Primatics Financial joined me for the presentation. In their part of the session they described their experience in developing a SaaS application for real-time analytics.

Kuketayev described how Primatics used this approach to create a new automatically scaling cloud version of an existing banking application. Primatics initially developed a mortgage securities application that allows banks to estimate the value of a basket of hundreds of thousands of loans. The value of these loans fluctuates as economic conditions change and some portion of home owners cannot afford to make payments on their loans. Banks normally only need to assess the value of these loans at the end of each month, making them an ideal candidate for cloud services like AWS.

From a scalability perspective, the challenge is to provide a highly multi-tenant application that needs to serve many firms, with many users in each firm, each running many jobs at the same time. Implementing such a model can be fairly complex, as you need to manage the life cycle of each job and each user independently and in isolation from one another.

Trying to build such a service directly on Amazon is going to be fairly complex, as you can learn from George’s summary below:

Primatics wrote the first version of EVOLV:Risk as a hosted web application for a regional bank. The application needed to be fault tolerant so that if one node crashed, they did not have to restart the application over again from the beginning. Kuketayev said that it is not just about the loss of four hours, but the office is trying to close out the month and needs to access data to end the monthly cycle so they can go home. Using GigaSpaces' toolset they rewrote the entire application infrastructure in about four months to run on top of AWS. Now they can kick off as many instances as required for different banking customers, and each instance runs significantly faster than before. Kuketayev said that it is important for banks that none of their applications run on the same infrastructure as another bank.

The diagram below shows the specific architecture that Primatics ended up using. Those who are familiar with Space Based Architecture will find it fairly straightforward:

The application is built out of a set of processing units. Each processing unit contains the compute agents in the form of a polling container. The compute agents get a reference to a remote Data Grid that is shared by all processing units. Each agent gets the job injected into it by the polling container, along with a reference to the data it requires to process the job. Once the job is completed, the result is stored back in the space. The results are then flushed back to a database through a mirror service.
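To make the flow above a bit more concrete, here is a minimal sketch of the pattern in plain Java. It is not the GigaSpaces XAP API and not Primatics' code: the names PollingAgentSketch, SharedStore, Job, runAgent and mirror are all hypothetical, and ordinary JDK collections stand in for the shared Data Grid and the database. It only illustrates the shape of the idea: agents take jobs from a shared store, look up the data the job refers to, write the result back, and a separate step copies results out to a database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative stand-in only: plain JDK collections play the role of the
// shared data grid. None of these names come from the GigaSpaces XAP API.
final class PollingAgentSketch {

    // A job carries the key of the data it needs from the shared store.
    static final class Job {
        final String jobId;
        final String dataKey;

        Job(String jobId, String dataKey) {
            this.jobId = jobId;
            this.dataKey = dataKey;
        }
    }

    // Hypothetical shared store: pending jobs, input data, and results.
    static final class SharedStore {
        final Queue<Job> pendingJobs = new ConcurrentLinkedQueue<>();
        final Map<String, double[]> inputData = new ConcurrentHashMap<>();
        final Map<String, Double> results = new ConcurrentHashMap<>();
    }

    // One compute agent: take jobs until none are left, read the input each
    // job refers to, and write the result back to the shared store.
    // Assumes every job's dataKey is present in inputData.
    static int runAgent(SharedStore store) {
        int processed = 0;
        Job job;
        while ((job = store.pendingJobs.poll()) != null) {
            double total = 0.0;
            for (double loanValue : store.inputData.get(job.dataKey)) {
                total += loanValue;
            }
            store.results.put(job.jobId, total);
            processed++;
        }
        return processed;
    }

    // Stand-in for the mirror service: copy results out to a "database" map.
    static void mirror(SharedStore store, Map<String, Double> database) {
        database.putAll(store.results);
    }

    public static void main(String[] args) {
        SharedStore store = new SharedStore();
        store.inputData.put("pool-1", new double[] {100.0, 250.0});
        store.pendingJobs.add(new Job("job-1", "pool-1"));

        int processed = runAgent(store);
        Map<String, Double> database = new HashMap<>();
        mirror(store, database);
        System.out.println(processed + " job(s) processed, mirrored: " + database);
    }
}
```

In the real platform the polling container, the remote space and the mirror service take care of these steps; the sketch only shows why an agent can stay stateless when both its input and its output live in the shared store.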

In case of a failure, other compute agents are able to pick up the job from the exact point of failure and continue processing as if nothing happened. This is because the state of the job is kept safe in the data grid and not in the agent’s memory.

Kuketayev from Primatics nicely summarized the lessons he learned from trying to build it on his own vs. using GigaSpaces:
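The recovery behavior can be sketched in a few lines of plain Java. Again, this is an illustration under my own assumptions rather than the XAP API: FailoverSketch, JobState and process are hypothetical names, a ConcurrentHashMap stands in for the data grid, and a step limit simulates an agent that crashes part way through a job. The point is that progress is checkpointed in the shared map after every item, so a second agent can resume from where the first one stopped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a map stands in for the data grid that holds job state.
final class FailoverSketch {

    // Job progress lives in the shared store, not in the agent's memory.
    static final class JobState {
        final int nextIndex;      // index of the next loan to process
        final double partialSum;  // result accumulated so far

        JobState(int nextIndex, double partialSum) {
            this.nextIndex = nextIndex;
            this.partialSum = partialSum;
        }
    }

    // Resumes the job from its last checkpoint. Processing stops after
    // maxSteps items to simulate an agent that crashes mid-job.
    // Returns true once every loan has been processed.
    static boolean process(Map<String, JobState> grid, String jobId,
                           double[] loans, int maxSteps) {
        JobState state = grid.getOrDefault(jobId, new JobState(0, 0.0));
        int steps = 0;
        while (state.nextIndex < loans.length && steps < maxSteps) {
            state = new JobState(state.nextIndex + 1,
                                 state.partialSum + loans[state.nextIndex]);
            grid.put(jobId, state); // checkpoint after every item
            steps++;
        }
        return state.nextIndex == loans.length;
    }

    public static void main(String[] args) {
        Map<String, JobState> grid = new ConcurrentHashMap<>();
        double[] loans = {10.0, 20.0, 30.0, 40.0};

        // Agent A "crashes" after two items.
        boolean doneA = process(grid, "job-1", loans, 2);
        // Agent B picks up from the checkpoint instead of starting over.
        boolean doneB = process(grid, "job-1", loans, Integer.MAX_VALUE);

        System.out.println("A finished: " + doneA + ", B finished: " + doneB
                + ", total: " + grid.get("job-1").partialSum);
    }
}
```

Checkpointing after every item is the simplest choice for a sketch; a real system would pick a granularity that balances the cost of writing state against the amount of work lost on a failure.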

Kuketayev said that one of the biggest lessons is that you need to have your infrastructure do the provisioning for you automatically, or otherwise you end up spending a lot of time just turning things on and off. He said they are now using configuration APIs to automate this process, whereas before they were using scripts. This allows for automatic throttling and failover recovery without human intervention. Kuketayev advised "You need to make sure you use the right tools … You don't want to have to worry about provisioning and reliability. Make sure you have provisioning, failover, monitoring and SLA out of the box."

The full JavaOne presentation is available here:

Final words

For solution providers the size of Primatics, building a risk analysis application as a service wouldn’t be possible without cloud computing. Cloud enabled them to offer their solution as a service without the major investment of building a data center to support it.

Primatics’ experience is not special. One of the benefits of building Software as a Service is that you have one shared environment for all your customers. At the same time, one of the challenges is that in a shared environment, failure becomes more public and will impact ALL your clients. If the system doesn’t scale well, you’re going to be hit twice as hard as in a standalone application.

Building a robust and scalable SaaS application can be fairly complex. A good cloud infrastructure will get you a first class data center, but it won’t solve your application requirements. What’s interesting with cloud computing is that it forces you to think about the cost and efficiency of your application more than ever before. In the Primatics example, a simulation that runs on 100 nodes for 3 hours is very likely to hit a failure at some point. Without recovery, a failure near the end of such a run can cost you up to 300 machine-hours, not to mention the fact that you might lose the simulation window for the day and the reputation challenge you’ll be facing with your customers. In addition, putting the data in-memory and making the application run 3-5 times faster means that you would need as little as 1/5 of the machine power, which saves up to 80% of the cost of running the application.
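The arithmetic behind these figures is simple enough to sketch. The class and method names below are mine, and the model is deliberately naive: it assumes cost scales linearly with machine-hours and that a speedup of s lets you run the same workload with 1/s of the machine power.

```java
// Back-of-the-envelope model; assumes cost is proportional to machine-hours.
final class CloudCostSketch {

    // Machine-hours consumed by a run (and lost if it must restart from scratch
    // just before finishing).
    static double machineHours(int nodes, double hours) {
        return nodes * hours;
    }

    // Fraction of cost saved when the application runs `speedup` times faster
    // and therefore needs 1/speedup of the machine power.
    static double savingsFraction(double speedup) {
        return 1.0 - 1.0 / speedup;
    }

    public static void main(String[] args) {
        System.out.printf("Machine-hours at risk: %.0f%n", machineHours(100, 3.0));
        System.out.printf("Savings at 3x speedup: %.1f%%%n", 100 * savingsFraction(3.0));
        System.out.printf("Savings at 5x speedup: %.1f%%%n", 100 * savingsFraction(5.0));
    }
}
```

Under these assumptions a 5x speedup gives the 80% saving, while a 3x speedup gives roughly two thirds, so the 80% figure is the upper end of the 3-5x range.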

I believe that the challenges imposed by cloud computing force us to focus on what we do best and avoid investing in areas which are not core to our business. Because the pay-per-use model significantly lowers the cost barrier, going down the path of writing your own infrastructure, as many have tried to do before, will be much more expensive and risky than ever before.

References: