Java PaaS shootout

A technical comparison of Google App Engine, Amazon Elastic Beanstalk, and CloudBees RUN@Cloud

The year of PaaS? In December 2010, salesforce.com acquired Heroku, a Ruby-based PaaS vendor, for $212 million in cash, a fitting end to a year of rapid growth and hype for PaaS. The buzz around PaaS has only increased since then. Industry research firm Gartner has proclaimed that 2011 is "the year of Platform as a Service," and Gartner analyst Yefim Natis predicts, "By 2015, cloud platform experience will be a listed or demanded skill in most hiring decisions by IT software projects" (see Related topics).

PaaS is a type of cloud service in which the provider delivers not only on-demand hardware and operating-system services, but also application platforms and solution stacks. PaaS services automate most of the IT management aspects associated with application deployment, including resource allocation, staging and testing, load balancing, database access, and access to platform libraries. A key feature of PaaS is multitenant architecture: multiple unrelated applications can run on the same hardware and software infrastructure, resulting in cost savings and more-efficient use of computing resources. Developers can focus on the application itself, as opposed to deployment and IT issues.

Java developers are well positioned to learn and take advantage of the PaaS development model. After all, the PaaS concept has deep roots in the early days of server-side Java. Back then, IT organizations were sold on the vision of using application servers as "containers" and then dropping in application-archive files to run in a shared-resource environment (see the PaaS precursor sidebar). That vision is remarkably similar to the PaaS services we see today.

But the early PaaS vision of Java enterprise applications didn't pan out. Java application servers never became stable enough to deploy and undeploy multiple unrelated applications at will. The archive structure ended up adding overhead to the Java application-development cycle: Whereas PHP and Ruby on Rails developers can change a line of code and reload the browser to see the difference, Java developers had to recompile, repackage, and redeploy their applications, and often even restart the application server.

PaaS precursor: Self-contained Java EE deployment units

When Java servlet technology emerged, one of the key features that distinguished Java web applications from CGI or PHP applications was that Java applications were supposed to be self-contained inside "write once, run anywhere" WAR files. The WAR file includes all the code, configuration, media, and other files the application requires. At run time, the WAR application is also supposed to be self-contained: an application can only "see" classes and resources inside its own WAR file, in order not to interfere with classes and files from other WAR files. The application archive format was later expanded to include other types of self-contained, enterprise-application modules. Those self-contained application deployment units are a natural fit for PaaS.

As it turned out, the PaaS vision was realized only with the emergence of a new generation of virtualization technologies that are far more advanced than the JVM. Pioneered by Google App Engine, the new generation of Java PaaS services fulfills the old promise of Java EE. These services also provide pay-as-you-go IT infrastructures that grow with your demand without requiring you to invest up front in expensive hardware and system-administration capacity.

In this article, I'll examine three leading Java public PaaS offerings to compare their approaches, strengths, and weaknesses. All three provide the same basic set of features, including:

Uploading and deploying application WARs

Versioning deployed applications

Testing and staging environments

Online access to log files

Automated monitoring and usage reports

Beyond those common features, however, are some important differences. Drawing from my own experience working with these emerging technologies, I'll offer a framework for comparing them and discuss potential workarounds to help you avoid problems when you use them.

Google App Engine

Google App Engine (GAE) is the first widely adopted Java PaaS platform. (The Java version is sometimes called GAE/J to distinguish it from GAE's Python-based PaaS offering.) It is also probably the "purest" PaaS offering on the market — in the sense that it almost completely abstracts away the underlying infrastructure from developers.

Java, but not quite Java

GAE has supported the Java platform as a development and deployment environment since 2009. However, GAE's Java support is limited and not standards-compliant. Because of the numerous restrictions it imposes on applications (many of them for good reasons, to maintain scalability), GAE doesn't support certain Java platform APIs: most noticeably, file write I/O (because GAE doesn't provide a file system accessible to applications) and many of the network I/O APIs (because GAE imposes severe limitations on network operations originating from applications). See Related topics for a complete list of "white listed" Java platform APIs supported by GAE.

By supporting its own limited network I/O API, GAE restricts an application's capacity to connect to other services. GAE nominally allows applications to make outbound connections to other servers. But in an effort to keep the number of threads in the system under control, GAE forces any application-initiated connection to close after 5 to 10 seconds. That makes GAE an unreliable platform for mashup-type applications. This is a major limitation of GAE for the increasing number of applications that make use of third-party web services APIs.
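One common way to cope with a hard deadline like this is to set explicit timeouts on every outbound call, so the application fails fast and can fall back gracefully instead of being cut off mid-request. The sketch below uses the standard `java.net.HttpURLConnection`. The class name `BoundedFetch`, the endpoint, and the timeout values are illustrative assumptions (chosen to stay under the 5-second lower bound mentioned above), and the sketch assumes the `java.net` classes are available to your application on the platform.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class BoundedFetch {
    // Assumed budget: 1.5 s to connect plus 3 s to read stays under 5 seconds.
    static final int CONNECT_TIMEOUT_MS = 1_500;
    static final int READ_TIMEOUT_MS = 3_000;

    /** Prepares a connection with explicit timeouts; no network traffic happens here. */
    static HttpURLConnection prepare(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
            conn.setReadTimeout(READ_TIMEOUT_MS);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical third-party endpoint, used only for illustration.
        HttpURLConnection conn = prepare("http://example.com/api/items");
        System.out.println("connect timeout (ms): " + conn.getConnectTimeout());
        System.out.println("read timeout (ms): " + conn.getReadTimeout());
    }
}
```

A call that exceeds either limit throws `java.net.SocketTimeoutException`, which the application can catch in order to serve cached or partial content.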

Furthermore, those API limitations impose challenges when you need to use existing application frameworks or move existing applications to GAE. After years of evolution, enterprise Java development is heavily dependent on frameworks. Although some popular frameworks, such as Spring and Struts, work out of the box on GAE, many others either do not work or require patches to their source code. Manually hacking framework source code to make it run on GAE is never a good idea, because you are essentially creating a fork that breaks upstream compatibility, and it could introduce hard-to-debug bugs into the framework. A good example is the JavaServer Faces (JSF) web framework: it requires source-code level hacking to run in the GAE environment, and even then many UI libraries on top of JSF are incompatible with GAE. (See Related topics for a list of GAE-supported Java frameworks.)

Likewise, large enterprise applications that are already developed are likely to use APIs that GAE prohibits. Migrating those applications to GAE might be costly, because you not only need to identify the issues and create workarounds, but also do quality assurance for the entire application all over again.

By not supporting part of the Java platform API, GAE breaks Java's promise of "write once, run anywhere." It's not a deal-breaker for many people, but it is something potential users need to be aware of.


Scalability and performance

GAE promises and delivers scalability but not necessarily raw performance. Raw performance for web applications is measured by response time to a web request. Scalability refers to the platform's ability to maintain a consistent response time regardless of how many users are accessing the system. For instance, a scalable system with a response time of 3 seconds for 100 concurrent users should still have a 3-second response time for 1 million concurrent users.

GAE provides excellent scalability as measured by a consistent response time. But its raw performance is often slow. In my own anecdotal experience, GAE often takes 1 to 3 seconds to respond to database-related requests.

That characteristic has obvious implications for application developers. For web applications that are idle most of the time (namely, most small web applications), deploying on the GAE infrastructure would not yield performance benefits over even a low-end virtual private server. The real performance benefit comes when you need to scale the application well beyond the capacity of low-end server hardware.

Another performance issue for low-traffic websites is that GAE swaps inactive JVMs out of memory to optimize for high-traffic web applications on the system. If your JVM is swapped out of memory, GAE must spend additional time starting your entire application the next time a request comes in. For low-traffic web applications, this can lead to slow performance (a wait of more than 5 seconds for a first request). GAE offers a paid option that keeps the inactive JVM in memory for more-consistent performance. A free alternative: set up a cron job inside GAE that requests a page of your own website every 2 to 3 minutes to keep the JVM active.

Benefits and limitations of BigTable

A key innovation of GAE is the use of a truly scalable data store: Google BigTable. Most web applications use relational databases as data back ends. But relational databases are notoriously difficult to scale. To solve this problem, researchers at Google developed an alternative data-storage solution called BigTable, which belongs to the family of NoSQL databases.

As in a relational database, data in BigTable can be organized into tables with rows and columns, and each row has a unique indexed ID. Unlike relational databases, BigTable tables do not have a fixed schema and are typically denormalized. Each row in a table could have different columns. The best practice is to have many columns in a row, as opposed to linking different rows across different tables via key columns. That has big implications for the design of the data models. Instead of designing a normalized relational model, application developers are encouraged to put redundant information into each row for easier retrieval. Think of the access log of a web server where the IP address and browser agent are repeated in every row, taking up space but simplifying bulk processing.
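The access-log example can be sketched with plain Java maps. This is only a model of schema-less, denormalized rows, not the GAE data-store API; the class name, column names, and sample values are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DenormalizedRows {
    /** Builds one self-describing row; client details are repeated in every row. */
    static Map<String, Object> accessLogRow(long id, String ip, String userAgent, String path) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", id);               // unique row ID
        row.put("ip", ip);               // redundant: repeated for every request
        row.put("userAgent", userAgent); // redundant: repeated for every request
        row.put("path", path);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> first =
                accessLogRow(1L, "203.0.113.7", "ExampleBrowser/1.0", "/index.html");
        Map<String, Object> second =
                accessLogRow(2L, "203.0.113.7", "ExampleBrowser/1.0", "/about.html");
        // No fixed schema: this row carries a column the first one lacks.
        second.put("referrer", "/index.html");
        for (Map<String, Object> row : List.of(first, second)) {
            System.out.println(row);
        }
    }
}
```

Because each row carries everything needed to interpret it, reading a row never requires a join against a separate table of clients.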

The benefit of BigTable is scalability. Google engineers claim that the response time of data queries in BigTable is determined only by the size of the result dataset. You get the same performance whether the query runs against a 1,000-row table or a 10-million-row table, as long as the result is limited to 1,000 rows. For its part, GAE limits the returned dataset of each query to 1,000 rows.

Adjusting to the NoSQL paradigm, although it could be challenging for developers from an SQL background, is an important skill to have as more and more IT organizations are facing the Big Data challenge. I have found that GAE is one of the best and easiest places for Java developers to get started learning NoSQL.

However, although BigTable is key to GAE's massive scalability, its current implementation leaves a lot to be desired for Java developers. Specific shortcomings of BigTable (and some potential workarounds) include:

Weak support for data queries: Queries written in Google Query Language (GQL) are used to retrieve data from BigTable. GAE requires that all the data columns involved in a query be indexed, and the index can't contain BLOB or text columns. That's fine, except that GAE allows only 100 indexes per table. That's probably sufficient for a standard SQL database, but denormalized NoSQL databases like BigTable could potentially have thousands of columns, so 100 indexes could be limiting for many applications. Making matters worse, GAE provides no easy way to delete indexes that are no longer in use. Deciding which index to create is a significant burden for GAE developers. If a query uses a combination of columns that are not indexed, GAE will only throw an exception at run time when the query is executed. Although the SDK provides tools for automatically generating index-configuration files as you test the application on your local computer, you could still miss indexes if you do not manually test all the execution paths exhaustively. Merging the autogenerated indexes into an already deployed application is also a potentially error-prone process with no indication of error until web application users hit the misconfigured indexes. Finally, it is somewhat shocking, considering that BigTable is a Google product, that it doesn't support free text search in the database. You could embed a search-engine implementation, such as Apache Lucene, into your application to index and search text columns (see Related topics). But that's a big hassle for smaller websites for which standard SQL LIKE statements are sufficient for simple text search.

Difficulty importing and exporting data: Another major issue with BigTable is the inability to import and export data. Because there's no standard API for directly accessing BigTable, you must write data-import and data-export logic into servlets inside your own application, and use your own web interface to import or export data. Because GAE terminates any web-request thread after 30 seconds, it's impossible to upload a large set of data into BigTable via a persistent connection. A common workaround is to break the data import into many pieces, with each piece requiring less than 30 seconds to upload and process. Then, you can use an automated HTTP driver, such as JMeter or Grinder, to run those tasks one by one until all data is imported. Needless to say, this is a tedious process. Exporting data from BigTable is even more problematic. Because the API limits each data query to 1,000 results, the export data must be managed in even smaller chunks than the 30-second processing timeout constraint allows.
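The chunking workaround for imports and exports can be sketched as a small helper that splits a data set into batches no larger than the 1,000-row limit. The class and method names are hypothetical, and the sketch leaves out the HTTP calls that would send each batch to your own import servlet.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedTransfer {
    // Mirrors the 1,000-row query limit described above.
    static final int MAX_ROWS_PER_REQUEST = 1_000;

    /** Splits rows into consecutive batches of at most chunkSize elements. */
    static <T> List<List<T>> chunk(List<T> rows, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            // Copy the view so each batch is independent of the source list.
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2_500; i++) {
            rows.add(i);
        }
        List<List<Integer>> batches = chunk(rows, MAX_ROWS_PER_REQUEST);
        for (int i = 0; i < batches.size(); i++) {
            // In a real import, each batch would be posted as one short request.
            System.out.println("batch " + i + ": " + batches.get(i).size() + " rows");
        }
    }
}
```

In practice the batch size may need to be smaller than 1,000 so that each request also finishes within the 30-second limit.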

Recognizing the limitations of BigTable for most developers, GAE provides access to hosted MySQL services via its paid business offerings.

Integration with other services

GAE provides excellent integration with other Google services. Notably, the application can integrate with Google Accounts so that users can log into your application using a Google username and password. That could potentially save you a lot of time, given that building a user-management system is duplicate work every website has to do. However, the downsides are that not all users have Google accounts, and that tying your website to Google Accounts would make it hard to move to another PaaS provider later.

GAE applications can also use a simple API to send email messages via Gmail servers. Compared with unsecured SMTP servers, Gmail servers are much less likely to be blocked by the recipient ISP.

If you host your domain on Google Apps, you can also configure the application to be accessed via any subdomain under your control by linking your Google Apps account with your GAE account. For instance, if mydomain.com is hosted by Google Apps, you can make your application accessible from www.mydomain.com as opposed to mydomain.appspot.com.

Verdict

Overall, GAE provides a well-designed and scalable PaaS. Its generous free quota for small websites is also appealing. However, the lack of support for the complete Java platform is a potential deal breaker, and some of the components in GAE still feel experimental rather than production-ready.

Amazon Elastic Beanstalk

Amazon Elastic Beanstalk, a relatively new offering from Amazon Web Services, provides a managed Apache Tomcat runtime environment based on the Amazon Elastic Compute Cloud (EC2) infrastructure. EC2 is an Infrastructure-as-a-Service (IaaS) offering, so it provides much more flexibility than GAE. But as a trade-off, it also requires more developer effort to manage and scale the applications.

Pure Java Tomcat

The Beanstalk environment supports a full Tomcat server running on an EC2 virtual server. It is a pure Java environment with access to the underlying file system. Because of Tomcat's popularity, almost all enterprise Java frameworks support Tomcat deployment. Those frameworks can be started or bootstrapped from a Tomcat WAR file, offering you a wide variety of choices of frameworks and libraries.

The plain Tomcat runtime has no limitation on threading and file or network I/O. A network I/O thread can stay open as long as it is needed. You are only limited by the capacity of the underlying virtual machine.

Scaling, at a price

Beanstalk scales your application by automatically starting new EC2 instances and deploying your WAR file to the new instance. All your Beanstalk EC2 instances are running behind a load balancer. You can use a web-based management console to monitor the resources available on each EC2 instance and set up rules for automatically starting new server instances behind the load balancer when the existing server load exceeds preset limits.
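A threshold rule of this kind can be modeled in a few lines. The sketch below is a toy illustration, not Beanstalk's actual rule engine: the class name, the parameters, and the scale-down branch are all assumptions added to make the example complete.

```java
class ScalingRule {
    /**
     * Returns the instance count to aim for, given the current count and the
     * average load (0.0 to 1.0) across the running instances.
     */
    static int desiredInstances(int current, double avgLoad,
                                double upper, double lower, int min, int max) {
        if (avgLoad > upper) {
            return Math.min(current + 1, max); // load exceeds the preset limit: add a node
        }
        if (avgLoad < lower) {
            return Math.max(current - 1, min); // assumed scale-down rule: remove a node
        }
        return current;                        // within bounds: leave the cluster alone
    }

    public static void main(String[] args) {
        // Hypothetical thresholds: scale up above 80% load, down below 20%.
        System.out.println(desiredInstances(2, 0.90, 0.80, 0.20, 1, 4));
        System.out.println(desiredInstances(2, 0.50, 0.80, 0.20, 1, 4));
    }
}
```

Clamping the result between a minimum and a maximum keeps a traffic spike from starting an unbounded number of billable instances.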

A common issue in a load-balanced web cluster is how to handle HTTP sessions. Each Tomcat server node creates and manages session objects for its clients. If web requests are load-balanced across multiple server nodes, you need to make sure that the server node serving the request has the correct session object. A simplistic way to achieve this is to enable "sticky sessions" in the load balancer, requiring the load balancer to remember the session cookies maintained by each server behind it, and forward requests to the right server based on the incoming cookies. Sticky sessions can be turned on in the Beanstalk load balancer administration console. More efficient and fail-safe solutions include setting up shared memory across the server nodes or simply saving the session objects into a central database. Those options allow the load balancer to forward requests to a random or the least-busy server node, because every server node has the same session-state information. But all those options require effort from the application developer. Unlike GAE, which automatically saves session data into BigTable, Beanstalk requires you to do all the work.
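To make the sticky-session idea concrete, here is a toy model of the bookkeeping a load balancer performs. It is a sketch under stated assumptions, not the Beanstalk load balancer's implementation: the class name, node names, and round-robin assignment are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StickyRouter {
    private final List<String> nodes;
    private final Map<String, String> sessionToNode = new HashMap<>();
    private int next = 0;

    StickyRouter(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = List.copyOf(nodes);
    }

    /** Returns the node for a session, assigning one round-robin on first sight. */
    synchronized String route(String sessionId) {
        String node = sessionToNode.get(sessionId);
        if (node == null) {
            node = nodes.get(next);
            next = (next + 1) % nodes.size();
            sessionToNode.put(sessionId, node); // remember the pairing for later requests
        }
        return node;
    }

    public static void main(String[] args) {
        StickyRouter router = new StickyRouter(List.of("node-a", "node-b"));
        System.out.println(router.route("session-1"));
        System.out.println(router.route("session-2"));
        System.out.println(router.route("session-1")); // same node as the first call
    }
}
```

The model also shows the weakness of the approach: the session lives on exactly one node, so losing that node loses the session, which is why a shared session store is the more fail-safe option.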

Perhaps one of Beanstalk's biggest drawbacks is its price, especially for small websites that can get free hosting elsewhere. While Amazon EC2 has a "one year free" program for new signups, Beanstalk's standard pricing runs close to $40 a month even for a single-node setup. That's a cheap price for a cluster-ready infrastructure that can scale out automatically in a matter of minutes when needed, but it's expensive compared with the likes of GAE if your application is mostly idle with an occasional surge of traffic.

Flexible database choices

One of the strengths of the Elastic Beanstalk platform is flexibility in choosing database technologies. It offers several options:

Relational databases: Through Amazon's own Relational Database Service (RDS), you can deploy a variety of relational databases. Those database servers are managed and monitored by Amazon, and it's easy to import data into and export it from them. Inside your application, all you need to do is to point your data sources to your RDS server. But be aware that each RDS instance is another dedicated server instance running your database, and a database instance is 30 percent more expensive than a comparable EC2 instance. The cost could add up, and many applications do not need a dedicated database server.

NoSQL: One issue with the RDS server is that it is a relational database that is hard to scale. If you prefer a NoSQL approach similar to Google BigTable, it's available too with Amazon SimpleDB. SimpleDB's Java API lets your application easily access the data.

Your own database server: Because EC2 provides access to raw virtual servers, you can set up your own databases or NoSQL data sources (such as Apache Cassandra) on a separate EC2 instance and just point the Beanstalk application to your own database server.
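For the relational options, pointing a data source at a database server usually comes down to a JDBC URL. The following is a minimal sketch, assuming a MySQL database and a hostname supplied through a system property; the property name `db.host`, the class name, the database name, and the placeholder host are hypothetical, not values defined by Amazon.

```java
class DataSourceConfig {
    /** Builds a MySQL JDBC URL from externally supplied connection details. */
    static String mysqlJdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // "db.host" is a hypothetical property name; the default is a placeholder host.
        String host = System.getProperty("db.host", "mydb.example.com");
        System.out.println(mysqlJdbcUrl(host, 3306, "appdb"));
    }
}
```

Keeping the hostname out of the WAR file means the same archive can point at a managed RDS instance or at a database server you run yourself on EC2.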

The flexibility in database choices, especially the ability to use Amazon managed relational databases, is likely to appeal to enterprise developers.

Integration with other services

In addition to Amazon RDS and SimpleDB, Beanstalk servers have access to other Amazon services such as Simple Queue Service (SQS), Simple Storage Service (S3), Simple Email Service (SES), and payment APIs. SES is especially interesting and offers a good comparison point with the Gmail API in GAE.

SES has a simple API, and it allows you to use Amazon's SMTP server to send out email messages. The benefit to using Amazon SMTP servers, as opposed to setting up an unsecured SMTP server on your own EC2 instance, is that Amazon servers are less likely to be blocked by major ISPs' spam filters. To this end, SES provides a rich set of tools to control the ramp-up of email volumes and to receive feedback from ISP spam filters. All those features are made available to your Beanstalk application so that you can monitor your campaigns and optimize your email content for more efficient delivery.

Verdict

Overall, Amazon Elastic Beanstalk greatly simplifies the deployment and scaling of Tomcat applications. Yet, it still provides the flexibility of the underlying EC2 infrastructure, which makes it ideally suited for enterprise applications. The cost, however, is high for low-traffic websites or hobbyist developers.

CloudBees RUN@Cloud

CloudBees is a new entrant to the Java PaaS scene. It may be a startup, but the people behind it are enterprise Java veterans. (It was started by JBoss ex-CTO Sacha Labourey, and has employed open source Java heavyweights Adrian Brock of JBoss fame and Kohsuke Kawaguchi of Hudson fame.) Its PaaS technology was acquired from Stax Networks, which has been providing hosted Java application services to enterprise customers for more than 10 years. The CloudBees RUN@Cloud service is based on the robust Stax platform, and it is available to individual developers via a self-service web portal.

In comparison with the big players, RUN@Cloud aims to find the right balance between managed scalability (as in GAE) and flexibility (as in Amazon's PaaS services) while adding its own twist of end-to-end development life-cycle support via the platform.

A robust Java runtime

The RUN@Cloud service is currently based on the EC2 infrastructure, and it can be viewed as a more automated version of Beanstalk + RDS. Like Beanstalk, RUN@Cloud also offers a dedicated Tomcat instance running on an EC2 virtual server for each web application. It provides a pure Java environment with no artificial limitation on file system access, network I/O, and threading.

One of RUN@Cloud's strengths as a small independent company is that it doesn't need to be tied to Amazon. It plans to support other infrastructure providers to supplement EC2 in the near future.

Free scalable infrastructure

Also similar to Beanstalk, RUN@Cloud provides a scalable infrastructure with a load balancer and server instances that are started on demand to meet traffic surges. But RUN@Cloud provides more automation than Beanstalk. For instance, instead of using "sticky sessions," RUN@Cloud has already configured its Tomcat servers to save sessions to databases under its management. This managed session database is transparent to developers, much as session storage is in GAE.

Because RUN@Cloud can use a shared load balancer to manage multiple Tomcat servers running on a single EC2 instance, it does not require one EC2 instance per Tomcat instance. Hence it can run low-traffic websites at much lower cost than Beanstalk. In fact, RUN@Cloud has a free usage tier that is great for low-traffic applications or hobbyist developers and students.

However, also like GAE, RUN@Cloud can swap your JVM out of memory if your application is inactive for too long, to conserve resources. That could cause slow response to the first request as the application "warms up."

Hosted MySQL relational databases

The RUN@Cloud service natively supports a managed MySQL service alongside the Tomcat service. You can create and manage databases through a web-based administration console. And you can connect to the database server directly via a MySQL client in order to manage your data.

Unlike Amazon RDS, the RUN@Cloud service deploys a shared database server across multiple applications. Each application can have its own database but not necessarily a dedicated server. The PaaS platform automatically deploys the database to maximize the utilization of a pool of database servers. Compared with RDS, the shared database server would yield potentially more-efficient use of virtual servers, and hence lower costs.

Integration with other services

RUN@Cloud provides access to platform APIs and services supported by its underlying infrastructure providers. Specifically, RUN@Cloud applications deployed on Amazon EC2 have full access to all Amazon web service APIs, such as S3, SQS, and SES.

But where RUN@Cloud really shines is its tight integration with DEV@Cloud, a cloud-based continuous integration platform. DEV@Cloud provides source-code version-control systems (Subversion and Git), a build repository (Apache Maven), and a build server (Jenkins, formerly called Hudson). It allows you to run automated building and testing of your applications in the cloud rather than on your own computer. This type of centralized build system is widely adopted by agile software teams to make sure that the source code in the repository is always tested and in a releasable state.

By integrating RUN@Cloud with DEV@Cloud, CloudBees provides a compelling set of PaaS services that can manage the entire development, testing, and deployment cycle of enterprise Java web applications. You just need to edit source code on your own computer, and everything else can be delegated to an automated system in the cloud with minimal IT overhead.

Verdict

CloudBees RUN@Cloud is a low-cost (and even free) alternative to Amazon Elastic Beanstalk and RDS. Its integration with continuous-build systems makes it appealing to agile software development teams that wish to automate all the IT functions in the development process.

Conclusions

After years of disappointment, Java PaaS services have finally reached prime time. The three services reviewed and compared in this article each take a unique approach and, as a result, have unique strengths and weaknesses.

If you are developing a new application and can live with GAE's constraints, GAE is an excellent and free choice. RUN@Cloud and Elastic Beanstalk are interchangeable runtimes at the application level. Standard Java EE applications can run on either platform unmodified. RUN@Cloud is cheaper to get started with and easier to configure, and it provides excellent support for continuous-integration development processes. I suggest starting with RUN@Cloud for free, knowing that you can easily move to Elastic Beanstalk if you are unhappy with CloudBees' services.


Related topics