Supporting a network in transition: Q&A blog post series with David Lef

In a series of blog posts, this being the second, David Lef, principal network architect at Microsoft IT, chats with us about supporting a network as it transitions from a traditional infrastructure to a fully wireless cloud computing platform. Microsoft IT is responsible for supporting 900 locations and 220,000 users around the world. David is helping to define the evolution of the network topology to a cloud-based model in Azure that supports changing customer demands and modern application designs.

David Lef explains the major factors that affect migration of IT-supported services and environments to cloud-based services, focusing on network-related practicalities and processes.

Q: Can you explain your role and the environment you support?

A: My role at Microsoft is principal network architect with Microsoft IT. My team supports almost 900 sites around the world and the networking components that connect those sites, which are used by a combination of over 220,000 Microsoft employees and vendors that work on our behalf. Our network supports over 2,500 individual applications and business processes. We are responsible for providing wired, wireless, and remote network access for the organization, implementing network security across our network (including our network edges), and providing connectivity to Microsoft Azure in the cloud. We support a large Azure tenancy using a single Azure Active Directory tenant that syncs with our internal Windows Server Active Directory forests. We have several connections from our on-premises datacenters to Azure using ExpressRoute. Our Azure tenancy supports a huge breadth of Azure resources, some of which are public-facing and some of which are apps and services internal to Microsoft but hosted on the Azure platform.

Q: What are the biggest networking challenges in migrating on-premises services to cloud-based services in Azure?

A: First of all, it's a fundamental change in traffic patterns. It used to be that we hosted most of our network traffic within our corporate network and datacenters, and selectively allowed access from the Internet into our network for apps and services that our employees needed to access while they were outside of the corporate network. From the aspect of traffic going in and out of our corporate network, we had our users accessing what you might call traditional Internet content, as well as users connecting to the corporate network using a virtual private network (VPN). Now, we are moving toward hosting the bulk of our on-premises datacenter infrastructure within Azure and choosing how we want to allow access to it.

Secondly, we’ve had network edge traffic increase a lot. Our bandwidth at the edge is more than 500 percent of what it was just a couple of years ago. The on-premises datacenter is no longer the hub of traffic for us, and the cloud is the default app and infrastructure location for new projects at Microsoft. Our traffic pattern now revolves primarily around traffic to Azure datacenters. This, of course, has brought the demand for more robust and higher-bandwidth edge connections: the resources that users formerly accessed within the corporate network are now being hosted in Azure, and those users expect the same level of responsiveness from their apps and services that they’ve been accustomed to.

We’re continuously moving apps and services from on-premises datacenters to Azure, so the connectivity requirements between Azure and our on-premises datacenters are changing as that migration continues. In addition, the pipeline between Azure and our datacenters is shrinking as more of our infrastructure moves to Azure. Our migration teams are moving as much as possible to software as a service (SaaS) and platform as a service (PaaS) in Azure and, in situations where SaaS or PaaS doesn’t offer an immediate or beneficial solution, simply lifting the infrastructure components out of on-premises and into Azure infrastructure as a service (IaaS) virtual machines and virtual networks.

A significant part of the migration for these apps and services is analysis for redesign in the cloud. Wherever possible, our engineering teams are redesigning and re-architecting for the cloud. Internet-based traffic can have a higher latency than what Microsoft experiences within its corporate network infrastructure, so designing for that and educating users on the changes they should expect is important.

Q: How do you ensure adequate service levels in an Azure-based cloud delivery model?

A: The network component has a big impact on service levels, but it really does start with service design for our Azure-based resources. Connectivity to Azure is, for all intents and purposes, Internet connectivity, so anything hosted in Azure is designed as an Internet-based solution wherever possible. Along with accommodating the higher latency that I’ve already mentioned, the redesign process also includes retry logic for when a connection experiences any type of outage, caching and prefetching of data, and compression of data across client connections.
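To make the redesign techniques named above more concrete, here is a minimal Python sketch of retry logic, caching, and compression. It is an illustration only, not Microsoft IT's implementation: the names call_with_retry, TTLCache, and compress_payload, the retry count, and the delay values are all assumptions chosen for the example.

```python
import gzip
import random
import time


def call_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff.

    All parameter defaults are illustrative assumptions, not recommended values.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay * 2^attempt, plus jitter so that many clients
            # recovering from the same outage do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


class TTLCache:
    """Tiny time-based cache so repeated reads can skip a network round trip."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (time stored, value)

    def get(self, key, loader):
        """Return the cached value for key, calling loader() if it is missing or stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value


def compress_payload(payload: bytes) -> bytes:
    """Gzip-compress a payload before sending it across a client connection."""
    return gzip.compress(payload)
```

A caller might wrap a remote read as cache.get("profile", lambda: call_with_retry(fetch_profile)), where fetch_profile is a hypothetical function that performs the network call. Prefetching would amount to calling cache.get ahead of the moment the data is needed.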

After service design, we’re doing as much as we can on the network side to ensure robust connectivity. We’re using ExpressRoute extensively for our large locations, and making sure that we locate our hop onto ExpressRoute as close as physically possible to the resources that will use that connection, whether those are servers or users. That means using network service providers that have co-location facilities close to our physical locations. We don’t rely on traditional hub-and-spoke networking architectures for our locations, and we try to avoid moving unnecessary traffic across our network backbones. We’ve found that the quicker you can drop someone onto the Internet, with the exception of cases where the provider infrastructure is very immature or limited, the better off they will be.

We monitor our environment pretty thoroughly. We’re designing the modern apps that run on Azure SaaS and PaaS to use the built-in instrumentation those platforms provide. We’re leveraging built-in synthetic transactions in those services and building in our own, using System Center products and Operations Management Suite in Azure. That allows us to get a comprehensive view of our infrastructure, both centralized and decentralized. We treat our cloud services hosted in Azure as a product for which we’re the provider and the consumer, all of Microsoft, is the customer.
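As a rough illustration of what a custom synthetic transaction can look like, the Python sketch below runs a scripted check and records whether it succeeded and how long it took. It does not use System Center or any Azure monitoring API; the function name run_synthetic_check, the result fields, and the latency budget are hypothetical and chosen only for this example.

```python
import time


def run_synthetic_check(name, transaction, latency_budget_s, clock=time.perf_counter):
    """Run one scripted transaction and report success and latency.

    transaction is any zero-argument callable that raises on failure, for
    example a function that signs in to an app and loads a known page.
    """
    start = clock()
    try:
        transaction()
        succeeded = True
        error = None
    except Exception as exc:  # a probe should report failures, not crash
        succeeded = False
        error = repr(exc)
    elapsed = clock() - start
    return {
        "check": name,
        "succeeded": succeeded,
        "latency_s": elapsed,
        # A check that succeeds but is too slow still misses the service level.
        "within_budget": succeeded and elapsed <= latency_budget_s,
        "error": error,
    }
```

Running a check like this on a schedule from several locations, and forwarding the results to a monitoring system, gives a view of the service as users experience it rather than only as the servers report it.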

Q: How does the challenge differ by geographic locations, and has that changed since the migration to cloud-based services?

A: Anytime we talk about geography, service placement is a huge consideration. We look at where our clients are for any given service, where the app-to-app dependencies lie, and plan accordingly. In most cases, we have at least one Azure datacenter within 1,000 kilometers of our clients, so we use that in our business continuity and disaster recovery planning. Azure’s built-in geo-redundancy and resiliency components also help in those respects.
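The 1,000-kilometer rule of thumb can be checked with a great-circle distance calculation. The Python sketch below uses the haversine formula with a mean Earth radius of 6,371 km; the function names and the datacenter labels and coordinates are placeholders invented for the example, not real Azure region locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for planning estimates


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def datacenters_within(site, datacenters, limit_km=1000.0):
    """Return (name, distance_km) pairs within limit_km of site, nearest first.

    site is a (lat, lon) tuple; datacenters maps a name to a (lat, lon) tuple.
    """
    in_range = []
    for name, (lat, lon) in datacenters.items():
        distance = haversine_km(site[0], site[1], lat, lon)
        if distance <= limit_km:
            in_range.append((name, distance))
    return sorted(in_range, key=lambda pair: pair[1])


# Placeholder coordinates for illustration only.
example_datacenters = {"dc-a": (0.0, 1.0), "dc-b": (0.0, 5.0), "dc-c": (0.0, 20.0)}
nearby = datacenters_within((0.0, 0.0), example_datacenters)
```

Straight-line distance is only a first filter. Actual latency depends on the network path, so a candidate location would still need to be measured.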

From a pure networking perspective, we try to place our Layer 3 management as close to the Azure datacenter as possible. That gives us the greatest control over traffic to Azure, and the best insight into what’s happening with that traffic.

Q: How do you encourage user adoption and buy-in when migrating to cloud-based services?

A: Our Azure teams provide a lot of guidance around the entire Azure experience. From a user experience perspective, we do the best we can to give users accurate expectations for their apps and services that are migrated to Azure. In many cases, the general user experience is improved for apps on Azure, so this isn’t as much about softening the blow as it is showing them how having their app hosted on Azure changes the way the app is accessed and experienced. We make sure that users are aware of the ways that making an app available in the cloud can expose new functionality or ways to use the app. We focus on providing a user experience that enables mobile access from multiple device platforms. The key idea here is access from anywhere, on anything, at any time. An excellent example of this is the re-architecting of our licensing platform for the cloud, which was written about in a case study.

For the general migration to Azure, Microsoft IT has allotted people and capital to facilitate a smooth transition whenever a migration takes place. These resources contribute to the technical migration itself, to training, and to making sure that business processes are running as well as or better than when the app or service was hosted on-premises.

Q: How have the IT teams changed to support this new delivery model?

A: The biggest change most people expect is this mass exodus or culling of traditional IT functions, but that’s not really the way it’s worked for us. We still have a network infrastructure to support throughout our physical locations, and datacenters don’t disappear overnight. Whether there are ten servers or 10,000 servers in a datacenter, disaster recovery and business continuity processes still need to happen and we need IT support for that. That being said, the requirement for on-premises infrastructure support does change. A lot of our high-level support teams are transitioning to different projects, sometimes in the Azure space. It’s given a lot of Microsoft employees the chance to improve their skill sets and shift their focus to development and innovation instead of maintenance and management.

With Azure, IT responsibilities become more compartmentalized, where we have IT staff that are focused on providing first-level support in their area of expertise, and it works without requiring a lot of people to have end-to-end knowledge of the environment or solution. Our Azure network experts provide their service and know their product and environment, and our Azure app experts do the same in their area, without needing to know specifically what’s happening with the network. The high-level knowledge is there across teams, of course, but resources and solutions become much more like plug-and-play solutions. This means that we’re more agile and able to respond to demand or start new projects more efficiently. Our teams don’t need to wait for physical servers to be built out or networking hardware to be installed; they simply request what they need, and Azure generates the resources.

Learn more

Other blog posts in this series:

Learn how Microsoft IT is evolving its network architecture.