A few weeks ago I had a fascinating discussion with a guy about uses of cloud computing. Over the course of an hour, we covered quite a lot of very interesting options.

The traditional drive for using cloud computing is the desire to avoid having to manage your own infrastructure, and the problems that one has to face when expanding that infrastructure. Being able to hand all of that to a 3rd party is a very attractive offer, especially if you feel that this 3rd party is someone you can rely on to do a better job of managing infrastructure than your own developers.

In general, developers are not IT professionals, and it shows. An additional benefit of having a 3rd party manage your infrastructure is that they tend to place limitations on the things that you can do. These limitations are usually a good thing, since they force you to design in a scalable manner.

Amazon's EC2 doesn't give you persistent storage other than S3 and SimpleDB. This means that bringing up a new VM instance is extremely easy, and once you do, you have just increased your capacity. The same goes for Google's App Engine and the limitations it has.

Of course, that is just the common motive. The competitive price of cloud computing vs. the high cost of building your own data center (or hosting your servers in someone else's) is also a reason to move in that direction.

There are other, equally interesting, scenarios for using cloud computing, however.

Consider the case of a spike in traffic: you are running an online shop and it is a major holiday. You are likely to see a huge surge in traffic. Using your own data center would force you to build to the maximum expected capacity, which is often an order of magnitude higher than your normal capacity. Using cloud computing, you can simply turn on additional instances on an as-needed basis.

But even that benefit has been beaten to death. What about this scenario?

You need to test your application, and as usual, it is hard to find a test environment that has 30 servers that you can try your latest version on. Setting up something like that on EC2 for a week will set you back by less than $400.
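As a back-of-the-envelope check on that figure, here is a small Python sketch. It only does the arithmetic: it assumes (my assumption, not something stated above) that all 30 servers run around the clock for the full seven days, and derives the per-instance hourly rate that a $400 budget would imply. The function name is made up for illustration, and real pricing depends on the instance type you pick.

```python
def implied_hourly_rate(budget_usd, servers, days, hours_per_day=24):
    """Return the per-instance hourly rate that would exactly use up the budget."""
    instance_hours = servers * days * hours_per_day
    return budget_usd / instance_hours


# Assumption: 30 servers, 7 days, running 24 hours a day.
instance_hours = 30 * 7 * 24
rate = implied_hourly_rate(400, servers=30, days=7)

print(f"{instance_hours} instance-hours, about ${rate:.3f} per instance-hour")
# prints: 5040 instance-hours, about $0.079 per instance-hour
```

If the test environment only needs to be up during working hours, pass a smaller hours_per_day and the same budget stretches a lot further.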

Periodic batch processing is another case where you might consider using cloud computing. I know of several places where heavy-duty processing (diamond image processing for cutting machines, for example) takes quite a bit of computing power, but it is something that you can batch and do on an "as needed" basis, reducing your infrastructure cost significantly. Payroll is also a fairly intensive process that happens only periodically.

Another use case is to set up cloud machines as clients in a distributed load test. You can set it up so that a large number of clients (a thousand machines, which you probably will not be able to do on your own) run against your application.

The ability to easily perform a rolling update is also attractive, not to mention that

The picture is not so rosy when you consider that it also has some not insignificant minuses.

Broadly, there are three types of cloud services.

Google App Engine - you upload the application, and that is it. You don't even have the concept of a machine in this system. Scalability and distribution are solely App Engine's concern, not your code's. This is currently limited to Python only, and the environment is modified to ensure that you can only do things the Right Way. A lot of the concerns that I intend to list are not relevant for this scenario, because you don't have the concept of a machine. Broadly, I think that this is the way to go, and given the chance to build my own cloud computing application, I would go with a very similar concept.

Amazon EC2 - you upload a VM image, and you can start creating instances of it. The VM image can contain whatever you want, but you have no persistent storage. That means that you can't actually save data to the local disk or run an RDBMS server. Persistent data is handled using Amazon's web services (S3, SQS, SDB). More on that later.

GoGrid - you take an existing VM template from their site, customize it, and start running it. This is the closest that you can get to a data center in the cloud, because GoGrid's systems behave just like real machines. That is, you don't have to worry about persistent storage and the like. On the one hand, it is very convenient. On the other hand, I am not sure that I like this.

The main fault that I find with Google App Engine at the moment is the limitation to Python. From all other perspectives, it is as close to the model of the ideal cloud service as you can get. All the infrastructure concerns have been stripped away; you only have to deal with the application concerns.

EC2 and GoGrid both allow me to set up a VM and start running it. EC2's no-persistence model means that it is much easier to scale by creating new instances and using the Amazon services to handle storage. GoGrid's model means that I have a lot more flexibility, but with it comes the chance of major issues. In particular, it seems that it is more complex to clone a machine a hundred times than it is on EC2.

EC2 worries me somewhat, because I am not sure how it handles such things as configuration changes (remember, no persistence; on reboot, all changes to the system are wiped), and how it handles patches and updates to the system itself. Perhaps it is because I work on Windows so often that I worry about how to deal with Patch Tuesday, but even on the Linux images that are common on EC2, there would be a need to perform such an update. In the EC2 model, that would require getting the image, making the change locally, and uploading the image again.

GoGrid, however, will allow me to perform the update in place, but by the same token, I would need to perform this update on all machines in my application.

The EC2 model means that it is very easy to bring up new instances. The GoGrid model means that it is likely to be harder, because instances are not frozen images; they are actual servers, which have state.

Other issues that would concern me in such a scenario would be load balancing, including auto discovery of failed or new instances. From a cursory check, both Amazon and GoGrid have at least rudimentary support for this, but that leaves some things to be desired. In particular, the requirements of load balancing are:

Distribute loads among a cluster of application servers.

Handle failover of an application server gracefully.

Ensure the cluster of servers appears as a single server to the end user.
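To make those three requirements concrete, here is a minimal in-process sketch in Python. It is not how Amazon or GoGrid implement load balancing; the class name, the backend names, and the send callback are all hypothetical, and a real balancer would sit on the network rather than inside your program. It rotates requests across the servers, drops a server from rotation when a call to it fails, and exposes a single handle() entry point so the caller never sees the individual servers.

```python
import itertools


class RoundRobinBalancer:
    """Toy balancer: one public entry point, many application servers behind it."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        """Take a failed server out of rotation."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Put a known server back into rotation."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")

    def handle(self, request, send):
        """Forward the request, failing over to the next server on ConnectionError."""
        for _ in range(len(self.backends)):
            backend = self.pick()
            try:
                return send(backend, request)
            except ConnectionError:
                self.mark_down(backend)
        raise RuntimeError("all backends failed")


def fake_send(backend, request):
    """Stand-in for a network call; pretends that app-2 has crashed."""
    if backend == "app-2":
        raise ConnectionError("app-2 is down")
    return f"{backend} handled {request}"


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
results = [lb.handle(f"req-{i}", fake_send) for i in range(4)]
print(results)
```

Note that the balancer object itself is a single point of failure here, which is exactly the problem with running it on one instance.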

A short search in the Amazon forums has raised several issues regarding load balancing in the EC2 system. It seems like the common method is to have an EC2 instance running HAProxy, or to use round robin DNS, to deal with this. Nothing is said about what happens if that instance goes down, but I think that I can guess...

This is partially why I think that the Google App Engine model is preferable. This model explicitly and unabashedly tells you that you have no business managing the infrastructure. That is left for someone else to manage. In this case, Google. But even if you decide to build such a system on your own, dealing with infrastructure concerns should be kept way out of the code.

In other words, you would need to create a provisioning system that looks at the load, creates / destroys instances, updates routing information, etc. Not hard, I think, but most certainly tiresome.
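Here is a rough sketch of what the core of such a provisioning system might look like, in Python. Everything in it is an assumption for illustration: FakeCloud is an in-memory stand-in and not a real provider SDK, the load is just a number handed to the function, and the routing table is a plain list. The point is only the shape of the loop: measure load, work out how many instances you want, create or destroy to match, then refresh the routing information.

```python
import math


def desired_instances(current_load, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """How many instances the current load calls for, clamped to sane bounds."""
    needed = math.ceil(current_load / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


class FakeCloud:
    """In-memory stand-in for a provider API (hypothetical, not a real SDK)."""

    def __init__(self):
        self.instances = []
        self._next_id = 1

    def create_instance(self):
        name = f"vm-{self._next_id}"
        self._next_id += 1
        self.instances.append(name)
        return name

    def destroy_instance(self, name):
        self.instances.remove(name)


def reconcile(cloud, routing_table, current_load, capacity_per_instance):
    """One pass of the loop: scale to the target count, then update routing."""
    target = desired_instances(current_load, capacity_per_instance)
    while len(cloud.instances) < target:
        cloud.create_instance()
    while len(cloud.instances) > target:
        cloud.destroy_instance(cloud.instances[-1])
    routing_table[:] = cloud.instances  # replace routes with the live instances
    return target


cloud = FakeCloud()
routes = []

# Holiday spike: 450 requests/sec at an assumed 100 requests/sec per instance.
reconcile(cloud, routes, current_load=450, capacity_per_instance=100)
peak_routes = list(routes)

# Traffic drops back to normal.
reconcile(cloud, routes, current_load=120, capacity_per_instance=100)
print(peak_routes, routes)
```

In a real system you would run reconcile on a timer and add some damping so a brief blip does not cause instances to be created and destroyed over and over, which is where the tiresome part comes in.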

Thoughts?