Today we’re announcing the general availability of Apache Hadoop 3.0 on Azure HDInsight. In partnership with Cloudera, Microsoft Azure is the first cloud provider to offer customers the benefit of the latest innovations in the most popular open source analytics projects, with unmatched scalability, flexibility, and security. With the general availability of Apache Hadoop 3.0 on Azure HDInsight, we are building upon existing capabilities with a number of key enhancements that further improve performance and security, and deepen support for the rich ecosystem of big data analytics applications.

Bringing Apache Hadoop 3.0 and supercharged performance to the cloud

Apache Hadoop 3.0 represents over 5 years of major upgrades contributed by the open source community across key Apache frameworks such as Hive, Spark, and HBase. New features in Hadoop 3.0 provide significant improvements to performance, scalability, and availability, reducing total cost of ownership and accelerating time-to-value.

Apache Hive 3.0 – With ACID transactions on by default and several performance improvements, this latest version of Hive enables developers to build “traditional database” applications on massive data lakes. This is particularly important for enterprises that need to build GDPR- and privacy-compliant big data applications.

Hive Warehouse Connector for Apache Spark – With the Hive Warehouse Connector, the Spark and Hive worlds are coming closer together. The new connector moves the integration from the metastore layer to the query engine layer. This enables higher, more reliable performance with predicate pushdown and other functionality.

Apache HBase 2.0 and Apache Phoenix 5.0 – Apache HBase 2.0 and Apache Phoenix 5.0 introduce a number of performance, stability, and integration improvements. With HBase 2.0, in-memory compactions periodically reorganize the data in the memstore, which improves performance because data is flushed to, and read from, remote cloud storage less often. Phoenix 5.0 brings more visibility into queries through its query log, a new system table that captures information about the queries being run against the cluster.

Spark IO Cache – IO Cache is a data caching service for Azure HDInsight that improves the performance of Apache Spark jobs. IO Cache also works with Apache Tez and Apache Hive workloads, which can be run on Apache Spark clusters.
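To make the in-memory compaction idea mentioned for HBase 2.0 more concrete, here is a minimal toy sketch in Python. It is not HBase code and does not use any HBase API: the `ToyMemStore` class, the `active_limit` threshold, and the `simulate` helper are all hypothetical names invented for illustration. It only shows the general principle that merging buffered writes in memory, and dropping versions that have already been superseded, can reduce how many cells are eventually flushed to remote storage.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, int, str]  # (row key, timestamp, value)


class ToyMemStore:
    """Toy write buffer with optional in-memory compaction (illustrative only)."""

    def __init__(self, active_limit: int = 4, compact_in_memory: bool = True) -> None:
        self.active_limit = active_limit          # hypothetical segment size
        self.compact_in_memory = compact_in_memory
        self.active: List[Cell] = []              # segment currently accepting writes
        self.pipeline: List[List[Cell]] = []      # immutable segments awaiting flush

    def put(self, key: str, ts: int, value: str) -> None:
        self.active.append((key, ts, value))
        if len(self.active) >= self.active_limit:
            # Seal the active segment and start a new one.
            self.pipeline.append(self.active)
            self.active = []
            if self.compact_in_memory:
                self._compact_pipeline()

    def _compact_pipeline(self) -> None:
        # Merge all sealed segments, keeping only the newest cell per row key.
        latest: Dict[str, Cell] = {}
        for segment in self.pipeline:
            for cell in segment:
                key, ts, _ = cell
                if key not in latest or ts >= latest[key][1]:
                    latest[key] = cell
        self.pipeline = [sorted(latest.values())]

    def get(self, key: str) -> Optional[str]:
        # Return the newest buffered value for a key, or None if absent.
        newest: Optional[Cell] = None
        for segment in self.pipeline + [self.active]:
            for cell in segment:
                if cell[0] == key and (newest is None or cell[1] >= newest[1]):
                    newest = cell
        return newest[2] if newest else None

    def flush(self) -> List[Cell]:
        # Return the cells that would be written to remote storage.
        cells = [c for seg in self.pipeline for c in seg] + self.active
        self.pipeline, self.active = [], []
        return cells


def simulate(compact_in_memory: bool) -> int:
    """Write 12 updates across 3 hot rows and count the cells flushed."""
    store = ToyMemStore(active_limit=4, compact_in_memory=compact_in_memory)
    for ts in range(12):
        store.put(f"row{ts % 3}", ts, f"v{ts}")
    return len(store.flush())


if __name__ == "__main__":
    print(simulate(False), simulate(True))  # prints: 12 3
```

In this toy workload, three frequently updated rows produce 12 cells without in-memory compaction but only 3 with it, since older versions are discarded before the flush. A real memstore also has to handle version retention, deletes, and concurrency, none of which are modeled here.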

Enhanced enterprise grade security

Enterprise-grade security and compliance are critical requirements for all customers building big data applications that store or process sensitive data in the cloud.

Enterprise Security Package (ESP) support for Apache HBase – With the general availability of ESP support for HBase, customers can ensure that users authenticate to their HDInsight HBase clusters using their corporate domain credentials and are subject to rich, fine-grained access policies (authored and managed in Apache Ranger).

Bring Your Own Key (BYOK) support for Apache Kafka – Customers can now bring their own encryption keys into the Azure Key Vault and use them to encrypt the Azure Managed Disks storing their Apache Kafka messages. This gives them a high degree of control over the security of their data.

Rich developer tooling

Azure HDInsight offers rich development experiences with various integrated development environment (IDE) extensions, notebooks, and SDKs.

SDKs general availability – HDInsight SDKs for .NET, Python, and Java enable developers to easily manage clusters using the language of their choice.

VSCode – HDInsight VSCode extension enables developers to submit Hive batch jobs, interactive Hive queries, and PySpark scripts to HDInsight 4.0 clusters.

IntelliJ – Azure Toolkit for IntelliJ enables Scala and Java developers to program Spark, Scala, and Java projects with built-in templates. Developers can easily run and debug projects locally, open interactive sessions, and submit Scala/Java projects to HDInsight 4.0 Spark clusters directly from the IntelliJ integrated development environment.

Broad application ecosystem

Azure HDInsight supports a vibrant application ecosystem with a variety of popular big data applications available on Azure Marketplace, covering scenarios from interactive analytics to application migration. We are excited to support applications such as:

Starburst (Presto) – Presto is an open source, fast, and scalable distributed SQL query engine that allows you to analyze data anywhere within your organization. Architected for the separation of storage and compute, Presto can easily query data in Azure Blob Storage, Azure Data Lake Storage, SQL and NoSQL databases, and other data sources. Learn more and explore Starburst Presto on Azure Marketplace.

Kyligence – Kyligence is an enterprise online analytic processing (OLAP) engine for big data, powered by Apache Kylin. Kyligence enables self-service, interactive business analytics on Azure, achieving sub-second query latencies on trillions of records and seamlessly integrating existing Hadoop and BI systems. Learn more and explore Kyligence on Azure Marketplace.

WANDisco – WANDisco Fusion de-risks migration to the cloud by ensuring disruption-free data migrations, easy and seamless extensions of Spark and Hadoop deployments, and short- or long-term hybrid data operations. Learn more and explore WANDisco on Azure Marketplace.

Unravel Data – Unravel provides a unified view across your entire data stack, with actionable recommendations and automation for tuning, troubleshooting, and improving performance. The Unravel Data app uses Azure Resource Manager, allowing customers to connect Unravel to a new or existing HDInsight cluster with one click. Learn more and explore Unravel on Azure Marketplace.

Waterline Data – With Waterline Data Catalog and HDInsight, customers can easily discover, organize, and govern their data, all at the global scale of Azure. Learn more and explore Waterline on Azure Marketplace.

Get started now

We look forward to seeing what innovations you will bring to your users and customers with Azure HDInsight. Read the developer guide and follow the quick start guide to learn more about implementing open source analytics pipelines on Azure HDInsight. Stay up-to-date on the latest Azure HDInsight news and exciting features coming in the near future by following us on Twitter (#AzureHDInsight). For questions and feedback, please reach out to AskHDInsight@microsoft.com.

About Azure HDInsight

Azure HDInsight is an enterprise-ready service for open source analytics that enables customers to easily run popular Apache open source frameworks including Apache Hadoop, Spark, Kafka, and others. The service is available in 30 public regions and Azure Government Clouds in the US and Germany. Azure HDInsight powers mission critical applications for a wide range of sectors and use cases including ETL, streaming, and interactive querying.