Microsoft CSEO IPv6 Case Study

My job at Microsoft CSEO (formerly called IT) is to move our end users to IPv6-only. All our corporate and VPN networks are dual-stack, but our ultimate goal is to run a single stack in the network. Of course, it won’t happen overnight because we have a huge environment. We currently support over 220,000 users in around 800 offices globally. Our biggest user population, over 100,000 users, is in the Puget Sound region in the state of Washington. We have a distributed network with lots of different types of user profiles on the network, including developers and a big salesforce.

When I began at Microsoft CSEO in 2016, they had already been working on IPv6 in some shape or form for about ten years. They already had their backbone network enabled for IPv6, and they had done work for World IPv6 Day and World IPv6 Launch and later, in summer 2016, enabled user segments. Now, wireless and wired corporate networks are IPv6-enabled too.

We are now focusing on having a single stack in the network. It won’t happen immediately, but we are working on it, and we are further along than any comparable company I know.

Why move to an IPv6-only internal network?

There are four things that drove our decision to move to IPv6-only on our internal network.

First, IPv4 address depletion, and I’m not talking about public IPv4 depletion. The Cloud and Connectivity Engineering (CCE) organization in Microsoft CSEO, of which I am a part, surrendered our public IPv4 address space to Azure in 2011, because they needed to offer publicly routable addresses to external customers. At that point we renumbered to private addresses. We split the 10.0.0.0/8 space with Azure so they could use half on their internal network, and we would use the other half for our users. However, based on the consumption and requirements from different product groups at Microsoft, we can foresee a sliding date of depletion in two to three years.

Microsoft CSEO, Cloud and Connectivity Engineering is the IP master for all IP address needs for the whole organization. Product groups take our space when they virtualize their development and testing environments and take them to the cloud. However, we don’t have any large blocks like /16s or /17s left. We have smaller blocks, but people like the larger blocks to help them manage devices like virtual machines. Currently, we have a need for quite a lot of IPv4 address space which we don’t have in contiguous blocks.
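To make the contiguity problem concrete, here is a minimal sketch using Python's standard `ipaddress` module. The free blocks listed are invented for illustration and are not real allocations; the point is only that a /16's worth of free addresses does not add up to a usable /16 when the pieces are scattered.

```python
import ipaddress

# Hypothetical free blocks, chosen for illustration only (not real allocations).
free_blocks = [
    ipaddress.ip_network("10.20.0.0/18"),
    ipaddress.ip_network("10.20.64.0/18"),   # adjacent to the block above
    ipaddress.ip_network("10.37.128.0/18"),
    ipaddress.ip_network("10.90.192.0/18"),
]


def largest_contiguous(blocks):
    """Merge adjacent blocks and return the largest network that results."""
    merged = ipaddress.collapse_addresses(blocks)
    # The shortest prefix length is the largest block.
    return min(merged, key=lambda net: net.prefixlen)


total = sum(net.num_addresses for net in free_blocks)
largest = largest_contiguous(free_blocks)

print(f"free addresses in total:  {total}")
print(f"largest contiguous block: {largest}")
```

In this made-up inventory the four blocks hold as many addresses as a /16, yet the largest block that can actually be handed out is a /17, because only two of the pieces are adjacent.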

Since we can see how much space is routed on the network, we are also working on reclaiming IPv4 space that is not heavily used. Yet we know there’s a point in the not-so-distant future that we will run out of IPv4, and we need to be prepared for it. We don’t want to wake up one morning and realize we can’t give any more addresses to our users, and suddenly we need to do something about IPv6.

The second reason we are moving to IPv6-only is that running a dual-stack network is more complex: it increases troubleshooting time and doubles the work on security and QoS policies. Dual-stacking also does not remove our reliance on NAT44, which we have to leverage heavily. While dual-stack was good for us to get our toes in the water and for our engineers to gain experience with deploying and operating IPv6, it keeps us dependent on IPv4. Ultimately, wherever we can, we will do IPv6-only.

The third reason is that everyone uses private IPv4. This makes our acquisitions quite difficult as we must insert and operate more NAT in our environments to enable communication between Microsoft environments and the acquired companies.

The fourth reason is industry pressure. Apple’s decision to enforce IPv6 for all apps submitted to their App Store was great. It made our product groups much more aware of IPv6, and they came to my organization asking for an IPv6-only test environment in which they could verify that their apps function correctly. Currently, we are running such a network in twelve locations based on demand from product groups.

What do you need to get started?

One of the first things you need to have to adopt IPv6 is an address plan. It still amazes me today that we got the first IPv6 addressing architecture document in 2006. The engineers then did such a good job that it was adjusted only a little in 2015, and last year we made another small change to cater for announcing IPv6 prefixes for local Internet egress and not just through dedicated Internet edges. The initial addressing architecture was so well written that we didn’t need to make any fundamental changes. Based on that document the team created an IPv6 addressing engineering standard that our engineers follow today when deploying and allocating IPv6 prefixes for network segments.

Originally, we had only one /32 from ARIN that we started using for small backbone deployments and trials. Then, after 2013, we got prefixes from RIPE NCC and APNIC to give us a total of three /32s. Our engineers like that because we can look at a prefix and know which part of the world it is from. People say IPv6 is difficult because it’s hard to read, but that’s not true. You’re working with only one prefix – what changes is the bits behind it. That’s where people do their address planning. With IPv4 there was no such luxury. The first step in our addressing plan was a great foundation from 13 years ago.
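The idea of one stable prefix per region, with the planning done in the bits behind it, can be sketched with Python's `ipaddress` module. Everything specific here is an assumption made for illustration: the three /32s are placeholders taken from the IPv6 documentation range 3fff::/20 (RFC 9637), the region labels are invented, and the "one /48 per site" rule is a hypothetical plan, not the standard described in this article.

```python
import ipaddress

# Placeholder /32s from the documentation range 3fff::/20 (RFC 9637).
# They are NOT real allocations; the region labels are illustrative.
REGION_PREFIXES = {
    ipaddress.IPv6Network("3fff::/32"): "North America (ARIN)",
    ipaddress.IPv6Network("3fff:1::/32"): "Europe (RIPE NCC)",
    ipaddress.IPv6Network("3fff:2::/32"): "Asia Pacific (APNIC)",
}


def region_of(address):
    """Return the region label for an address, or None if it is not ours."""
    addr = ipaddress.IPv6Address(address)
    for prefix, region in REGION_PREFIXES.items():
        if addr in prefix:
            return region
    return None


def site_prefix(region_prefix, site_id):
    """Hypothetical plan: bits 33-48 select a site, giving each site a /48."""
    if not 0 <= site_id < 2 ** 16:
        raise ValueError("site_id must fit in 16 bits")
    base = int(region_prefix.network_address)
    # Shift the site number into the third 16-bit group of the address.
    return ipaddress.IPv6Network((base + (site_id << 80), 48))


emea = ipaddress.IPv6Network("3fff:1::/32")
site = site_prefix(emea, 0x2A)
first_segment = next(site.subnets(new_prefix=64))

print(site)                       # the /48 for hypothetical site 0x2a
print(first_segment)              # first /64 user segment in that site
print(region_of("3fff:1:2a::1"))  # region recovered from the leading /32
```

The leading 32 bits never change within a region, so an engineer reading an address only has to interpret the site and segment fields that follow.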

We also had to make an important decision about the method of address assignment. We are using SLAAC with stateless DHCPv6 and RDNSS on network segments. This was driven by the mixed level of support for DHCPv6 and RDNSS among user, infrastructure and IoT devices; as a network operator we have to support everything.

The next thing you need to do is testing. We had to make sure the features we needed were available in the existing hardware and software that we put in our network. We obviously had to do a lot of work with our vendors. To support IPv6-only, we needed to make sure default gateway routers had the capability to support RDNSS, because of the above-mentioned mixed device support for DHCPv6. We needed to make sure whatever connects to our wireless or wired network can get DNS information one way or another.

One thing I recommend is extensive training of your engineering staff. The IPv6 knowledge most of our engineers have is self-taught or learned on the job, which is okay for deploying dual-stack, but not for IPv6-only. I work with a small virtual team of IPv6 gurus, and they will not scale to support users when we grow our IPv6-only deployment from pilot into production. Everyone in my organization must be 100% comfortable with IPv6. Another problem is that even today, new hires, be they university graduates or engineers with some work experience, don’t come with working knowledge of IPv6. The industry is still not fluent in IPv6, but that’s no surprise considering the global levels of IPv6 adoption. Therefore, I’m working with my management on a consistent IPv6 training program that we would like to put in place for all Microsoft CSEO CCE engineers, so they can not only design and deploy IPv6 but also be comfortable troubleshooting IPv6-only. Getting a consistent IPv6 training program in place is key.

Proactive work with vendors required

So much goes in making a single stack-only network. Working with vendors and making them understand what IPv6 really means to us has been a big obstacle. IPv6 has been enabled or available as a feature in some shape and form for many years; however, when we talk about IPv6-only, we don’t want any IPv4 on end user segments or for managing our network devices. It took vendors a while to understand what we are really trying to do and why it matters.

For example, we’ve found cloud security an obstacle. Whereas physical network security has decent IPv6 support, many cloud security providers literally still live in clouds. They can’t inspect IPv6-only traffic on the Internet. From my perspective, that means they only secure 73% of the Internet, not the remaining 27%, if you look at current IPv6 traffic statistics. We are not comfortable being only 73% secure. Cloud security vendors are behind. In all fairness, that reflects IPv6 adoption among enterprises: they don’t ask for it because they are behind too. But we managed to get some traction, and since we started, features are either getting delivered or are committed for delivery in the not-so-distant future.

The way I look at it, vendors enable and deliver features that bring them money. Contrary to what many engineers believe, vendors do not have a moral duty to code new features; they develop what brings them money. If a feature request is backed by spending from a customer, it has a bigger impact than if someone buys the product and asks for the feature afterward. This is why proactive work with vendors is crucial. Ask for IPv6 feature support before you buy, and be completely clear about what it means. Ticking off an RFP box asking for generic IPv6 support does not always cover all cases.

Leaving IPv4’s restrictions behind

The main benefit of dual-stack is the experience. People stop being afraid of IPv6 and it becomes the normal thing. People get used to it. You can also see that there is good support in the operating systems of end-user devices. For example, Windows 10, macOS and iOS prefer IPv6 by default, so as soon as a network is enabled, we can see IPv6 traffic, which greatly justifies our engineering efforts.

The benefit of IPv6-only is losing dependency on the legacy protocol. Getting out of those restrictions means we won’t have to do multiple layers of NAT in our internal network as we’re doing today. There is an undisturbed traffic flow. We’ve observed internally faster network connections, because IPv6 is not disrupted by NAT, and we assume that the code in network devices that supports IPv6 is newer and it seems to be written in a better way. We still must find a better way to measure this but that’s our observation to date. As mentioned earlier, IPv6-only can take away a lot of headaches during mergers and acquisitions as well. The real benefit of IPv6 is when it’s a single-stack network.

From a broader perspective, deploying IPv6 can contribute to better traffic flow on the Internet, because we know the IPv4 Internet routing table is big. There is a general worry that the fragmentation of IPv4 space could potentially slow down IPv4 traffic. Because the IPv6 routing table is better organized, getting to your destination could be faster. I think lots of folks are thinking about that.

Enterprises don’t always think about the networks that are outside their control. The Internet which their employees use to access their enterprise services is changing. Some mobile and broadband service providers already started working on IPv4 as a service over IPv6-only networks, which is effectively treating IPv4 as a second-class protocol. Before that happens, I recommend enterprises enable their external facing services with IPv6 – websites, interactions with users, even their remote access, to make sure that the changing IPv4 networks don’t impact them.

Getting as much traffic as we can on IPv6

Since dual-stack was enabled in our corporate network in 2016, our telemetry shows 20-30% of internal traffic to internal resources on IPv6, which means the remainder still runs over IPv4. We know there is a dependency on Azure ExpressRoute to have the capability to connect us to our cloud environment on IPv6. That work is in progress this calendar year. When that’s completed, we want all applications that serve our internal population to be dual-stack to start with, and eventually IPv6-only. We want to get as much user traffic as we can on IPv6. The real driver is that for users on the IPv6-only segments, we want to avoid sending traffic through NAT64 and DNS64 as much as possible.

Talking about NAT64 and DNS64, that technology is essential to make sure that users can continue working in an IPv6-only environment. Even when all our internal services are enabled with IPv6, the Internet will still be IPv4-only to a certain degree. We need to keep people connected; that’s the core business of my organization. Currently, we have NAT64 and DNS64 in our North American and European regions, and we’re building it out for the Asia Pacific region because we need to deploy IPv6-only there to enable more pilot sites.
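For readers unfamiliar with the mechanism, the address mapping at the heart of DNS64 and NAT64 can be sketched in a few lines of Python. This is only an illustration of the RFC 6052 embedding for a /96 prefix, not a description of our deployment: the article does not say which NAT64 prefix is in use, so the Well-Known Prefix 64:ff9b::/96 is an assumption, and the function names are made up for this example.

```python
import ipaddress

# Well-Known Prefix from RFC 6052. Which prefix a given deployment uses is
# an assumption here; operators may choose a network-specific prefix instead.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4_literal, prefix=NAT64_PREFIX):
    """What DNS64 does: embed an IPv4 address in the low 32 bits of a /96."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 NAT64 prefixes")
    v4 = ipaddress.IPv4Address(ipv4_literal)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract_ipv4(ipv6_literal, prefix=NAT64_PREFIX):
    """What NAT64 does on the way out: recover the IPv4 destination."""
    v6 = ipaddress.IPv6Address(ipv6_literal)
    if v6 not in prefix:
        raise ValueError(f"{v6} is not inside {prefix}")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


synthetic = synthesize_aaaa("192.0.2.33")
print(synthetic)                # 64:ff9b::c000:221
print(extract_ipv4(synthetic))  # 192.0.2.33
```

An IPv6-only client that asks for a name with only an A record receives the synthesized AAAA answer, sends its packets to that address, and the NAT64 gateway translates them to the embedded IPv4 destination.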

The existing IPv6-only internal network pilot that we have been running since April 2018 is opt-in and runs in parallel to the dual-stack corporate network. We expect to have about 20 locations before the end of June 2019. The main goal of the pilot is to collect as much user feedback as possible about applications that break in an IPv6-only environment. At the same time, we are actively engaging with the owners of failing applications to rectify the deficiencies in IPv6 support. It turns out that the IPv6-only network is the easy part of this. It’s a big undertaking, and we would like to run some scream tests in selected locations late in 2019 to get more user information, unless we hit new major blockers. By a scream test we mean removing IPv4 from the standard corporate network in selected locations for a short period of time and closely monitoring user experience.

All our Internet edges are IPv6-enabled and run dual-stack. We are piloting dual-stack for our guest wireless network in 10 locations. So far it’s been going well without any complaints. Nowadays, dual-stack is business as usual. To advance this to production, we have a dependency on enabling our captive guest portal with IPv6, which is work in progress. We could leave it as IPv4-only because the end-user segment will remain dual-stack for a while, but we want to have IPv6 end-to-end in this environment in case an IPv6-only client turns up on the network.

Over the last 12 months we have deployed a new remote access solution for our employees with dual-stack enabled for the VPN tunnel connection and inside it too. We are targeting IPv6-only for VPN too. The reason is simple: inside the tunnel we must provide our addresses since the VPN is an extension of our corporate network, and we want to remove IPv4 from there. Because we use the same private IPv4 address space in VPN as everyone else, partner organizations that provide services to Microsoft and use our VPN clash with our address space. We want to get rid of the workarounds we have to build around that and make things less complicated. Obviously, outside the tunnel, the VPN gateways are dual-stack, so no matter which network (IPv4-only or dual-stack) the client connects from, they can handle it.
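The address clash described above is easy to demonstrate with Python's `ipaddress` module. The prefixes below are hypothetical examples, not real corporate or partner ranges; the sketch only shows why two organizations drawing from the same private IPv4 space collide, while globally unique IPv6 prefixes do not.

```python
import ipaddress


def find_clashes(ours, theirs):
    """Return every (our_net, their_net) pair whose address ranges overlap."""
    return [(a, b) for a in ours for b in theirs if a.overlaps(b)]


# Hypothetical IPv4 ranges: both sides picked from RFC 1918 space.
vpn_v4 = [ipaddress.ip_network("10.64.0.0/12")]
partner_v4 = [
    ipaddress.ip_network("10.70.8.0/24"),
    ipaddress.ip_network("192.168.1.0/24"),
]

# Hypothetical IPv6 ranges from documentation space: each side has its own
# globally unique prefix, so there is nothing to collide.
vpn_v6 = [ipaddress.ip_network("3fff:1:2a::/48")]
partner_v6 = [ipaddress.ip_network("2001:db8:5::/48")]

v4_clashes = find_clashes(vpn_v4, partner_v4)
v6_clashes = find_clashes(vpn_v6, partner_v6)

for ours, theirs in v4_clashes:
    print(f"IPv4 clash: {ours} overlaps partner network {theirs}")
print(f"IPv6 clashes: {len(v6_clashes)}")
```

Every IPv4 clash found this way needs a NAT rule or a renumbering exercise before the partner can use the VPN; with IPv6-only inside the tunnel that class of problem disappears.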

To support IPv6-only inside the VPN tunnel, we are deploying the gateways alongside NAT64 and DNS64 because we want to avoid cutting our users off from internal services or applications that aren’t on IPv6 just yet. To reiterate the importance of vendor engagement: when we started the IPv6-only VPN proof of concept over a year ago, we found out that the vendor did not support such a feature. Since then they have delivered not only beta code for testing but also a production version. Over the summer, we’ll move to a pilot and see how things go for the users.

We are also testing dual-stack on our management network because we want to manage all our network devices and infrastructure on IPv6 and not on IPv4. Our recently acquired out-of-band management solution can’t support IPv6 connectivity to new terminal servers, but we now have a commitment from the vendor for early 2020.

Ultimately, we want to have IPv6 everywhere we can, and preferably in the form of IPv6-only.

Take it bit by bit

Deploying IPv6 can be a big project, depending on the size and complexity of your environment. Lots of people believe they can live with NAT, but they really need to enable IPv6 on their network because there are lots of devices that are trying to connect that prefer IPv6. Start with security because IPv6 is there even on an IPv4-only network. And continue with your Internet presence and Internet facing services as I mentioned earlier.

Yes, IPv6 can be overwhelming. My advice is to take your deployment bit by bit. Focus on things that give you the biggest benefit, the biggest learning, the biggest impact on the largest group of users. You could think of that in terms of experience or in getting IPv4 space back. Start slow; it’s OK, you are setting your own deadlines. Think about the applications and all the services your internal and external customers are accessing. Work with your vendors. It is doable; it only requires energy and building enthusiasm in your own team and with your management.

Also, remember to look back and appreciate the work and the results you’ve achieved. I often need to remind myself about the progress we’ve made. Sometimes I feel we are moving too slow, but then I look back at all the things we’ve done. It’s important to go back, reflect, and appreciate the work, the learnings and how we’ve progressed our environment.

Finally, dual-stack is only a temporary solution. The ultimate solution is IPv6-only.