

Author: “No Bugs” Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek


Very very roughly – DevOps (a portmanteau of “DEVelopment” and “OPerationS”) is an understanding that there should be no firewall between the development team and the operations (deployment) team – and as long as it is understood as such, I am all for DevOps in the context of multiplayer games.

When looking at DevOps from a developer’s-plane-flying-at-30’000-feet, we can translate it into the following very important statement:

our responsibility as developers does NOT end with producing a program which passes the tests. We are also responsible for the program working after deployment.

In particular:

- The all-time favorite excuse “it works on my box” is not allowed. If there is a reproducible bug in the production environment – we must fix it.
  - Consequence: if there is an irreproducible bug in the production environment – we must fix it too <sad-smile />. We should think about “what we’ll do if there is a bug in production” well in advance; that’s where techniques such as “production post-factum debugging” come in really handy (for more details – see Vol. II’s chapter on (Re)Actors).
- While developing, we should think about minimizing dependencies. We must not use system-dependent stuff which is not strictly required for us to work.
  - As one example, for 99.99% of Server-Side game apps, the exact Linux distro where our app is running should be completely irrelevant.
  - Explosive mixtures of conflicting dependencies (such as “we’re using lib C which wants lib A version >= 3, and lib B which wants lib A version <= 2”) should be completely out of the question.
- We should enable production monitoring in our apps. We’ll discuss more on it in the [[TODO]] section below, but for now – let’s note that for our apps (especially stateful ones), monitoring their health becomes of paramount importance for Operations. And nobody but us can really say whether the app is really healthy (both logically consistent and not about-to-blow-up performance-wise).
  - I cannot count the number of times when app-level performance reporting (translated into alerts by the monitoring system) saved my bacon, allowing me to look at a problem several weeks/months before it started to cause player-observable trouble.
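To illustrate the last point, here is a minimal sketch (in Python, with all names and thresholds made up for illustration – nothing here is prescribed by the book) of the kind of app-level health reporting that a monitoring system can scrape and turn into alerts:

```python
# Hypothetical app-level health reporting: the app itself decides whether it
# is healthy, instead of Operations guessing from CPU graphs alone.
class HealthReporter:
    """Collects per-request processing times and produces a health snapshot."""

    def __init__(self, slow_threshold_ms=50.0):
        self.slow_threshold_ms = slow_threshold_ms  # made-up threshold
        self.total_requests = 0
        self.slow_requests = 0
        self.worst_ms = 0.0

    def record_request(self, duration_ms):
        self.total_requests += 1
        self.worst_ms = max(self.worst_ms, duration_ms)
        if duration_ms > self.slow_threshold_ms:
            self.slow_requests += 1

    def snapshot(self):
        slow_ratio = (self.slow_requests / self.total_requests
                      if self.total_requests else 0.0)
        return {
            "total_requests": self.total_requests,
            "slow_ratio": slow_ratio,
            "worst_ms": self.worst_ms,
            # App-level verdict; an alert here can fire weeks before
            # the slowdown becomes player-observable.
            "healthy": slow_ratio < 0.01,
        }

reporter = HealthReporter()
for d in (3.2, 4.1, 120.0, 2.7):  # one pathological request out of four
    reporter.record_request(d)
print(reporter.snapshot())
```

The point is not the specific metric, but that only the app itself knows what “healthy” means for its own logic and timings.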

“it works on my box” is not allowed. If there is a reproducible bug in production environment – we must fix it.

On the other hand, DevOps does not necessarily imply that developers are doing all the Operations; from what I’ve seen – successful DevOps does include both a Development Team and an Operations/Deployment Team; the difference between classical Development+Operations and DevOps is all about communications between these teams. Within a pre-DevOps world, developers used to think that their responsibility was to produce the program, and that after that they could forget about it. With DevOps, there is an ongoing communication channel coming from the Operations Team back to the Development Team. As a result – we as developers should be ready to deal with requests coming from Operations such as:

“What do you think is the optimal way to configure our program in such-and-such environment?”

“Could you guys provide an option to use syslog for logging? (It would be simpler for us as Operations to manage all the logs this way, and running a separate daemon which gets your text log files and feeds them to syslog is ugly and inefficient.)”
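As a sketch of how such a request might be accommodated, here is a hypothetical logger factory using Python’s stdlib `logging`; the function name, the flag, and the default socket address are my assumptions for illustration, not anything the book prescribes:

```python
import logging
import logging.handlers

def make_logger(name, use_syslog=False, syslog_address="/dev/log"):
    """Build a logger whose backend is an Operations-facing option."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    if use_syslog:
        # Hand records straight to the local syslog daemon - no separate
        # daemon re-reading our text files is needed.
        handler = logging.handlers.SysLogHandler(address=syslog_address)
    else:
        handler = logging.StreamHandler()  # default: plain text to stderr
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The design point: the choice of logging backend is configuration, decided by Operations at deployment time, rather than something baked into the code.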

“How can we monitor the load on that-mutex-or-thread-which-caused-Big-Slowdown-yesterday?”

And of course, The Ultimate Nightmare™ of both Development and Operations: “our program has crashed in production and we cannot relaunch it; let’s take a look at it Right Now™.”

This communication channel (coming from Operations back to Development) is extremely important for having a game which is able to work reliably in the real world. In a certain sense – without this communication, we’re working within the (IMO badly obsolete, at least for the vast majority of projects) waterfall model; adding this channel makes the whole process much more agile, which in turn means a much better ability to withstand ever-changing real-world requirements.

Continuous Delivery/Continuous Deployment

In most of the literature out there, DevOps is very tightly associated with the abbreviation “CD”. Unfortunately, there is absolutely no agreement over what this “CD” really means. First, let’s note that “CD” can stand either for “Continuous Delivery” or for “Continuous Deployment”, and they (depending on the point of view) can mean quite different things (see, for example, [Fowler]). As usual, I won’t spend time arguing which interpretation of the terms is “right”; instead, I’ll try to present the different interpretations and their consequences in the context of MOGs (how to name them is not that important).

Oh, and one last thing before going into the terminological minefield around “CD”. Before we can even start to speak about any CD – we should implement CI; as discussed in Vol. III’s chapter on Pre-Coding, CI stands for “Continuous Integration”, and is a pretty-much-universally Good Thing™ to have. Very briefly – CI is about making sure that the parts of your system are still working together after any change (at the very least they compile, and they successfully pass certain automated tests); more importantly for us now – CI is a prerequisite for any type of CD.

Back to CD. One rather extreme school of thought interprets “CD” as being an ability for any of the developers to push anything-which-passes-automated-tests to production. And while it might even work in certain environments and for certain sub-projects – when trying to apply it to a whole MOG, this approach won’t fly at all. The problem with Continuous-Delivery-in-this-sense is that certain changes to most systems are just way too risky to allow one single person to make the decision to push the change into production.

If a change carries a chance to kill the whole project instantly – it must be signed off by several people, including both architects and stakeholders

It is just a question of risk control and common sense: if we’re speaking about a change which has a potential not just to crash the whole thing, but to bring the whole project down for good – it must not be taken lightly.

And in the MOG world, there are lots of things which qualify as having such deadly potential: in particular, pretty much any change to the game rules can have such an effect.

On the other hand,

if we’re speaking about some peripheral change (which can be still very important for your bottom line, but which is not likely to kill your project) – then for that specific class of changes an ability to deploy things ASAP can be much more important than jumping through the hoops of multiple approvals.

A/B testing

“In marketing and business intelligence, A/B testing is a term for a controlled experiment with two variants, A and B.” — Wikipedia

One classical example of such a not-too-immediately-risky-change-which-needs-to-have-fast-turnaround-times is the layout of your “create account” form. It is very important for your game, but on the other hand – it is rather difficult for the layout of such a form (provided that it is somehow working) to kill your game. In addition – this layout is an extremely good candidate for all kinds of experimentation, including A/B testing.
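As a minimal sketch of how such an experiment might be wired up (all names here are hypothetical), a deterministic hash of the account id gives a stable A/B split, so the same player always sees the same variant of the form:

```python
import hashlib

def ab_variant(account_id, experiment="create_account_form", split=0.5):
    """Deterministically assign an account to variant 'A' or 'B'."""
    # Salting with the experiment name keeps different experiments
    # statistically independent of each other.
    digest = hashlib.sha256(f"{experiment}:{account_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "A" if bucket < split else "B"

# The same account always lands in the same bucket:
assert ab_variant(12345) == ab_variant(12345)
```

Being deterministic (rather than, say, random-per-session) matters: a player who sees variant A today and variant B tomorrow would contaminate the measurement.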

As a result – for such features-which-are-important-but-which-have-no-risk-to-kill-your-game-instantly – I’d argue that we could (and actually should) have a very very shortened list of approvals, and a very very shortened life cycle too.

As we can see – approaches to (a) deployment of immediately-critical features which can kill the whole thing with a press of one button, and (b) deployment of not-so-immediately-critical stuff which requires lots of experimentation – can and should be different. In other words,

I am arguing for each of the teams being able to move with its own development and deployment speed (at least as long as the changes do not involve cross-team interactions).

For the game team which deals with game balance – changes will require lots of discussions (including discussions with players), careful planning, and often even well-defined “beta” servers where players can playtest results of a change. As a result – such changes are likely to take many weeks.

In contrast, for a change of layout of the “create account” form – a quick look at the A/B testing candidate is often all we need (and pretty often, a simplistic A/B experiment can and should be done in a matter of hours).
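For completeness – a back-of-the-envelope significance check for such a simplistic experiment might look as follows (a standard two-proportion z-test; the conversion numbers below are made up for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-score for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# |z| > 1.96 roughly corresponds to 95% confidence that B differs from A.
z = two_proportion_z(conv_a=120, n_a=1000, conv_b=160, n_b=1000)
```

With 12% vs 16% conversion over 1000 players each, z comes out above 1.96, i.e. such an experiment can indeed produce an actionable answer within hours of traffic.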

To summarize my feelings about CI and CD:

CI (= “Continuous Integration”) is a pretty-much-universally Good Thing™. I don’t know of any environment where it wasn’t beneficial. Let’s do it.

As for CD (= “Continuous Delivery/Deployment”) – it is less obvious (and even the terminology is not well-established). However, the following is clear:

- Different teams (and different pieces of code) tend to have different requirements in this regard – which in turn leads to different practices being optimal for them. In particular:
  - For those teams where a change can lead to the instant death of the project – much longer deployment cycles are typical (for MOGs – up to several months, though weeks is more typical). While technically it might be possible to stretch a definition to name this process “CD”, it probably won’t go with the spirit of CD.
  - For those teams/fields where changes are not instantly critical – an ability to experiment can easily be much more important than the risks (and typical deployment times can be on the order of hours). Related processes (with some reservations) might qualify as “CD”.
- Most importantly, our processes should allow these different production cycles to co-exist. We MUST NOT say that those-not-so-instantly-critical-but-requiring-lots-of-experimentation things should be tied to our “core” release schedule (which will probably go at some-weeks release intervals). OTOH, for “heavy” changes (such as a game rule change) – it is typical to have regular deployment cycles (with giving players time to leave before booting them, with stopping servers, etc. etc.).
- How to name this process-with-different-speeds-for-each-team is not really important; much more important is for this multi-speed process to exist.



[[To Be Continued…

This concludes beta Chapter 25(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.

Stay tuned for beta Chapter 25(b), where we’ll proceed to discussing potential uses for containers (Docker) for game deployments]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.