Security Analysis of SMS as a Second Factor of Authentication:

The challenges of multifactor authentication based on SMS, including cellular security deficiencies, SS7 exploits, and SIM swapping. Despite their popularity and ease of use, SMS-based authentication tokens are arguably one of the least secure forms of two-factor authentication. This does not imply, however, that SMS is an invalid method for securing an online account. The current security landscape is very different from that of two decades ago. Whatever the critical nature of an online account or the individual who owns it, using a second form of authentication should always be the default option, regardless of the method chosen.

Scrum Essentials Cards:

Experiences of Scrum Teams Improving with Essence. This article presents a series of examples and case studies on how people have used the Scrum Essentials cards to benefit their teams and improve how they work. Scrum is one of the most popular agile frameworks used successfully all over the world. It has been taught and used for 15-plus years. It is by far the most-used practice when developing software, and it has been generalized to be applicable for not just software but all kinds of products. It has been taught to millions of developers, all based on the Scrum Guide.

Data on the Outside vs. Data on the Inside:

Data kept outside SQL has different characteristics from data kept inside. This article describes the impact of services and trust on the treatment of data. It introduces the notion of inside data as distinct from outside data. After discussing the temporal implications of not sharing transactions across the boundaries of services, the article considers the need for immutability and stability in outside data. This leads to a depiction of outside data as a DAG of data items being independently generated by disparate services.

The History, Status, and Future of FPGAs:

Hitting a nerve with field-programmable gate arrays. This article is a summary of a three-hour discussion at Stanford University in September 2019 among the authors. It draws on the authors’ combined experiences at and with organizations such as Zilog, Altera, Xilinx, Achronix, Intel, IBM, Stanford, MIT, Berkeley, University of Wisconsin, the Technion, Fairchild, Bell Labs, Bigstream, Google, DIGITAL (DEC), SUN, Nokia, SRI, Hitachi, Silicom, Maxeler Technologies, VMware, Xerox PARC, Cisco, and many others. These organizations are not responsible for the content, but may have inspired the authors in some ways to arrive at the colorful ride through FPGA space described in this article.

Debugging Incidents in Google’s Distributed Systems:

How experts debug production issues in complex distributed systems. This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively. It examines the research approach used to capture data, summarizing the common engineering journeys for production investigations and sharing examples of how experts debug complex distributed systems. Finally, the article extends the Google specifics of this research to provide some practical strategies that you can apply in your organization.

Is Persistent Memory Persistent?:

A simple and inexpensive test of failure-atomic update mechanisms. Power failures pose the most severe threat to application data integrity, and painful experience teaches that the integrity promises of failure-atomic update mechanisms can’t be taken at face value. Diligent developers and operators insist on confirming integrity claims by extensive firsthand tests. This article presents a simple and inexpensive testbed capable of subjecting storage devices, system software, and application software to ten thousand sudden whole-system power-interruption tests per week.
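
To make the idea concrete, here is a minimal sketch (in Python, with a hypothetical record layout and names, not the article's actual testbed) of the kind of post-power-cycle integrity check such a rig automates: records are appended with a checksum, and after each power interruption the verifier counts how many records survived intact.

```python
# Minimal sketch, not the article's testbed: append checksummed records,
# then, after a sudden power cycle, verify that an intact prefix survived --
# the kind of failure-atomicity claim the testbed is designed to check.
import os
import struct
import zlib

HEADER = struct.Struct("<II")   # payload length, CRC32 of payload

def append_record(f, payload: bytes):
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())        # ask the OS (and, we hope, the device) to persist

def count_intact_records(path):
    """Count records that survived the power interruption; stop at the
    first torn, truncated, or corrupt record."""
    intact = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break
            intact += 1
    return intact
```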

Dark Patterns: Past, Present, and Future:

The evolution of tricky user interfaces. Dark patterns are an abuse of the tremendous power that designers hold in their hands. As public awareness of dark patterns grows, so does the potential fallout. Journalists and academics have been scrutinizing dark patterns, and the backlash from these exposures can destroy brand reputations and bring companies under the lenses of regulators. Design is power. In the past decade, software engineers have had to confront the fact that the power they hold comes with responsibilities to users and to society. In this decade, it is time for designers to learn this lesson as well.

Demystifying Stablecoins:

Cryptography meets monetary policy. Self-sovereign stablecoins are interesting and probably here to stay; however, they face numerous regulatory hurdles from banking, financial tracking, and securities laws. For stablecoins backed by a governmental currency, the ultimate expression would be a CBDC (central bank digital currency). Since paper currency has been in steady decline (and disproportionately so for legitimate transactions), a CBDC could reintroduce cash with technological advantages and efficient settlement while minimizing user fees.

Beyond the Fix-it Treadmill:

The Use of Post-Incident Artifacts in High-Performing Organizations. Although humanity’s study of the sociological factors in safety is almost a century old, the technology industry’s post-incident analysis practices and how we create and use the artifacts those practices produce are all still in their infancy. So don’t be surprised that many of these practices are so similar, that the cognitive and social models used to parse apart and understand incidents and outages are few and cemented in the operational ethos, and that the byproducts sought from post-incident analyses are far and away focused on remediation items and prevention.

Managing the Hidden Costs of Coordination:

Controlling coordination costs when multiple, distributed perspectives are essential. Some initial considerations to control cognitive costs for incident responders include: (1) assessing coordination strategies relative to the cognitive demands of the incident; (2) recognizing when adaptations represent a tension between multiple competing demands (coordination and cognitive work) and seeking to understand them better rather than unilaterally eliminating them; (3) widening the lens to study the joint cognition system (integration of human-machine capabilities) as the unit of analysis; and (4) viewing joint activity as an opportunity for enabling reciprocity across inter- and intra-organizational boundaries.

Cognitive Work of Hypothesis Exploration During Anomaly Response:

A look at how we respond to the unexpected. Four incidents from web-based software companies reveal important aspects of anomaly response processes when incidents arise in web operations, two of which are discussed in this article. One particular cognitive function examined in detail is hypothesis generation and exploration, given the impact of obscure automation on engineers’ development of coherent models of the systems they manage. Each case was analyzed using the techniques and concepts of cognitive systems engineering. The set of cases provides a window into the cognitive work "above the line" in incident management of complex web-operation systems.

Above the Line, Below the Line:

The resilience of Internet-facing systems relies on what is below the line of representation. Knowledge and understanding of below-the-line structure and function are continuously in flux. Near-constant effort is required to calibrate and refresh the understanding of the workings, dependencies, limitations, and capabilities of what is present there. In this dynamic situation no individual or group can ever know the system state. Instead, individuals and groups must be content with partial, fragmented mental models that require more or less constant updating and adjustment if they are to be useful.

Revealing the Critical Role of Human Performance in Software:

It’s time to revise our appreciation of the human side of Internet-facing software systems. Understanding, supporting, and sustaining the capabilities above the line of representation require all stakeholders to be able to continuously update and revise their models of how the system is messy and yet usually manages to work. This kind of openness to continually reexamine how the system really works requires expanding the efforts to learn from incidents.

Blockchain Technology: What Is It Good for?:

Industry’s dreams and fears for this new technology. Business executives, government leaders, investors, and researchers frequently ask the following three questions: (1) What exactly is blockchain technology? (2) What capabilities does it provide? (3) What are good applications? Here we answer these questions thoroughly, provide a holistic overview of blockchain technology that separates hype from reality, and propose a useful lexicon for discussing the specifics of blockchain technology in the future.

The Reliability of Enterprise Applications:

Understanding enterprise reliability. Enterprise reliability is a discipline that ensures applications will deliver the required business functionality in a consistent, predictable, and cost-effective manner without compromising core aspects such as availability, performance, and maintainability. This article describes a core set of principles and engineering methodologies that enterprises can apply to help them navigate the complex environment of enterprise reliability and deliver highly reliable and cost-efficient applications.

Optimizations in C++ Compilers:

A practical journey. There’s a tradeoff to be made in giving the compiler more information: it can make compilation slower. Technologies such as link-time optimization can give you the best of both worlds. Optimizations in compilers continue to improve, and upcoming improvements in indirect calls and virtual function dispatch might soon lead to even faster polymorphism.

Hack for Hire:

Investigating the emerging black market of retail email account hacking services. Hack-for-hire services charging $100-$400 per contract were found to produce sophisticated, persistent, and personalized attacks that were able to bypass 2FA via phishing. The demand for these services, however, appears to be limited to a niche market, as evidenced by the small number of discoverable services, an even smaller number of successful services, and the fact that these attackers target only about one in a million Google users.

The Effects of Mixing Machine Learning and Human Judgment:

Collaboration between humans and machines does not necessarily lead to better outcomes. Based on the theoretical findings from the existing literature, some policymakers and software engineers contend that algorithmic risk assessments such as the COMPAS software can alleviate the incarceration epidemic and the occurrence of violent crimes by informing and improving decisions about policing, treatment, and sentencing. Considered in tandem, these findings indicate that collaboration between humans and machines does not necessarily lead to better outcomes, and human supervision does not sufficiently address problems when algorithms err or demonstrate concerning biases.

Persistent Memory Programming on Conventional Hardware:

The persistent memory style of programming can dramatically simplify application software. Driven by the advent of byte-addressable non-volatile memory, the persistent memory style of programming will gain traction among developers, taking its rightful place alongside existing paradigms for managing persistent application state. Until NVM becomes available on all computers, developers can use the techniques presented in this article to enjoy the benefits of persistent memory programming on conventional hardware.

Velocity in Software Engineering:

From tectonic plate to F-16. Software engineering occupies an increasingly critical role in companies across all sectors, but too many software initiatives end up both off target and over budget. A surer path is optimized for speed, open to experimentation and learning, agile, and subject to regular course correcting. Good ideas tend to be abundant, though execution at high velocity is elusive. The good news is that velocity is controllable; companies can invest systematically to increase it.

Open-source Firmware:

Step into the world behind the kernel. Open-source firmware can help bring computing to a more secure place by making the actions of firmware more visible and less likely to do harm. This article’s goal is to make readers feel empowered to demand more from vendors who can help drive this change.

Surviving Software Dependencies:

Software reuse is finally here but comes with risks. Software reuse is finally here, and its benefits should not be understated, but we’ve accepted this transformation without completely thinking through the potential consequences. The Copay and Equifax attacks are clear warnings of real problems in the way software dependencies are consumed today. There’s a lot of good software out there. Let’s work together to find out how to reuse it safely.

Industry-scale Knowledge Graphs: Lessons and Challenges:

Five diverse technology companies show how it’s done. This article looks at the knowledge graphs of five diverse tech companies, comparing the similarities and differences in their respective experiences of building and using the graphs, and discussing the challenges that all knowledge-driven enterprises face today. The collection of knowledge graphs discussed here covers the breadth of applications, from search, to product descriptions, to social networks.

Garbage Collection as a Joint Venture:

A collaborative approach to reclaiming memory in heterogeneous software systems. Cross-component tracing is a way to solve the problem of reference cycles across component boundaries. This problem appears as soon as components can form arbitrary object graphs with nontrivial ownership across API boundaries. An incremental version of CCT is implemented in V8 and Blink, enabling effective and efficient reclamation of memory in a safe manner.

Online Event Processing:

Achieving consistency where distributed transactions have failed. Support for distributed transactions across heterogeneous storage technologies is either nonexistent or suffers from poor operational and performance characteristics. In contrast, OLEP (online event processing) is increasingly used to provide good performance and strong consistency guarantees in such settings. In data systems it is very common for logs to be used as internal implementation details. The OLEP approach is different: it uses event logs, rather than transactions, as the primary application programming model for data management. Traditional databases are still used, but their writes come from a log rather than directly from the application. The use of OLEP is not simply pragmatism on the part of developers, but rather it offers a number of advantages.
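
As a toy illustration of that log-first style (hypothetical names only, not a production OLEP system), the sketch below has the application append events to a log while a consumer replays the log to derive two separate stores, so both stay consistent with the same ordered history.

```python
# Toy sketch of the OLEP idea (hypothetical names, not a real system):
# the application appends events to a log; downstream consumers apply
# them to storage, so every store is derived from the same ordered log.
from collections import defaultdict

event_log = []                 # stands in for a durable, partitioned log
balances = defaultdict(int)    # one "database" derived from the log
history = defaultdict(list)    # another store derived from the same log

def submit_transfer(src, dst, amount):
    # The application's only write is an append to the log.
    event_log.append({"type": "transfer", "src": src, "dst": dst, "amount": amount})

def apply_events(cursor=0):
    # A consumer replays the log in order and updates each store.
    for event in event_log[cursor:]:
        if event["type"] == "transfer":
            balances[event["src"]] -= event["amount"]
            balances[event["dst"]] += event["amount"]
            history[event["src"]].append(event)
            history[event["dst"]].append(event)
        cursor += 1
    return cursor

submit_transfer("alice", "bob", 30)
cursor = apply_events()
assert balances["alice"] == -30 and balances["bob"] == 30
```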

Net Neutrality: Unexpected Solution to Blockchain Scaling:

Cloud-delivery networks could dramatically improve blockchains’ scalability, but clouds must be provably neutral first. Provably neutral clouds are undoubtedly a viable solution to blockchain scaling. By optimizing the transport layer, not only can the throughput be fundamentally scaled up, but the latency could be dramatically reduced. Indeed, the latency distribution in today’s data centers is already biased toward microsecond timescales for most of the flows, with millisecond timescales residing only at the tail of the distribution. There is no reason why a BDN point of presence would not be able to achieve a similar performance. Adding dedicated optical infrastructure among such BDN points of presence would further alleviate throughput bottlenecks and reduce latency, creating the backbone of an advanced BDN.

Identity by Any Other Name:

The complex cacophony of intertwined systems. New emerging systems and protocols both tighten and loosen our notions of identity, and that’s good! They make it easier to get stuff done. REST, IoT, big data, and machine learning all revolve around notions of identity that are deliberately kept flexible and sometimes ambiguous. Notions of identity underlie our basic mechanisms of distributed systems, including interchangeability, idempotence, and immutability.

Achieving Digital Permanence:

The many challenges to maintaining stored information and ways to overcome them. Today’s Information Age is creating new uses for and new ways to steward the data that the world depends on. The world is moving away from familiar, physical artifacts to new means of representation that are closer to information in its essence. We need processes to ensure both the integrity and accessibility of knowledge in order to guarantee that history will be known and true.

Metrics That Matter:

Critical but oft-neglected service metrics that every SRE and product owner should care about. Identify your site reliability metrics, set the right targets, and go through the work to measure the metrics accurately. Then you’ll find that your service runs better, with fewer outages and greater user adoption.

A Hitchhiker’s Guide to the Blockchain Universe:

Blockchain remains a mystery, despite its growing acceptance. It is difficult these days to avoid hearing about blockchain. Despite the significant potential of blockchain, it is also difficult to find a consistent description of what it really is. This article looks at the basics of blockchain: the individual components, how those components fit together, and what changes might be made to solve some of the problems with blockchain technology.

Tear Down the Method Prisons! Set Free the Practices!:

Essence: a new way of thinking that promises to liberate the practices and enable true learning organizations. This article explains why the industry needs to break out of its repetitive, dysfunctional behavior, and it introduces Essence, a new way of thinking that promises to free the practices from their method prisons and thus enable true learning organizations.

Understanding Database Reconstruction Attacks on Public Data:

These attacks on statistical databases are no longer a theoretical danger. With the dramatic improvement in both computer speeds and the efficiency of solvers for SAT and other NP-hard problems in the last decade, DRAs (database reconstruction attacks) on statistical databases are no longer just a theoretical danger. The vast quantity of data products published by statistical agencies each year may give a determined attacker more than enough information to reconstruct some or all of a target database and breach the privacy of millions of people. Traditional disclosure-avoidance techniques are not designed to protect against this kind of attack.

Benchmarking "Hello, World!":

Six different views of the execution of "Hello, World!" show what is often missing in today’s tools. As more and more software moves off the desktop and into data centers, and more and more cell phones use server requests as the other half of apps, observation tools for large-scale distributed transaction systems are not keeping up. This makes it tempting to look under the lamppost using simpler tools. You will waste a lot of high-pressure time following that path when you have a sudden complex performance crisis.

Using Remote Cache Service for Bazel:

Save time by sharing and reusing build and test output. Remote cache service is a new development that significantly saves time in running builds and tests. It is particularly useful for a large code base and any size of development team. Bazel is an actively developed open-source build and test system that aims to increase productivity in software development. It has a growing number of optimizations to improve the performance of daily development tasks.

Why SRE Documents Matter:

How documentation enables SRE teams to manage new and existing services. SRE (site reliability engineering) is a job function, a mindset, and a set of engineering approaches for making web products and services run reliably. SREs operate at the intersection of software development and systems engineering to solve operational problems and engineer solutions to design, build, and run large-scale distributed systems scalably, reliably, and efficiently. A mature SRE team likely has well-defined bodies of documentation associated with many SRE functions.

How to Live in a Post-Meltdown and -Spectre World:

Learn from the past to prepare for the next battle. Spectre and Meltdown create a risk landscape that has more questions than answers. This article addresses how these vulnerabilities were triaged when they were announced and the practical defenses that are available. Ultimately, these vulnerabilities present a unique set of circumstances, but for the vulnerability management program at Goldman Sachs, the response was just another day at the office.

Tracking and Controlling Microservice Dependencies:

Dependency management is a crucial part of system and software design. Dependency cycles will be familiar to you if you have ever locked your keys inside your house or car. You can’t open the lock without the key, but you can’t get the key without opening the lock. Some cycles are obvious, but more complex dependency cycles can be challenging to find before they lead to outages. Strategies for tracking and controlling dependencies are necessary for maintaining reliable systems.
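
As a small illustration of what such tracking can automate (a generic sketch with a hypothetical service graph, not any particular tool), a depth-first search over declared dependencies can surface a cycle before it shows up as a circular startup or recovery dependency in production:

```python
# Illustrative sketch (hypothetical service graph): find dependency cycles
# with a depth-first search before they surface as circular startup or
# recovery dependencies in production.
def find_cycle(deps):
    """deps maps a service to the services it depends on.
    Returns one dependency cycle as a list, or None."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for dep in deps.get(node, ()):
            if dep in visiting:                  # back edge: a cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for service in deps:
        if service not in visited:
            cycle = dfs(service, [])
            if cycle:
                return cycle
    return None

deps = {"frontend": ["auth"], "auth": ["storage"], "storage": ["auth"]}
print(find_cycle(deps))   # ['auth', 'storage', 'auth']
```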

Corp to Cloud: Google’s Virtual Desktops:

How Google moved its virtual desktops to the cloud. Over one-fourth of Googlers use internal, data-center-hosted virtual desktops. This on-premises offering sits in the corporate network and allows users to develop code, access internal resources, and use GUI tools remotely from anywhere in the world. Among its most notable features, a virtual desktop instance can be sized according to the task at hand, has persistent user storage, and can be moved between corporate data centers to follow traveling Googlers. Until recently, our virtual desktops were hosted on commercially available hardware on Google’s corporate network using a homegrown open-source virtual cluster-management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP (Google Cloud Platform).

The Mythos of Model Interpretability:

In machine learning, the concept of interpretability is both important and slippery. Supervised machine-learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world?

Mind Your State for Your State of Mind:

The interactions between storage and applications can be complex and subtle. Applications have had an interesting evolution as they have moved into the distributed and scalable world. Similarly, storage and its cousin databases have changed side by side with applications. Many times, the semantics, performance, and failure models of storage and applications do a subtle dance as they change in support of changing business requirements and environmental challenges. Adding scale to the mix has really stirred things up. This article looks at some of these issues and their impact on systems.

Workload Frequency Scaling Law - Derivation and Verification:

Workload scalability has a cascade relation via the scale factor. This article presents equations that describe workload utilization scaling at a per-DVFS (dynamic voltage and frequency scaling) subsystem level. A relation between frequency, utilization, and scale factor (which itself varies with frequency) is established. Verifying these equations turns out to be tricky, since the utilization inherent to a workload also varies, seemingly in an unspecified manner, at the granularity of governance samples. Thus, a novel approach called histogram ridge trace is applied. Quantifying the scaling impact is critical when treating DVFS as a building block. Typical applications include DVFS governors and other layers that influence the utilization, power, and performance of the system.
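
For intuition only (a simplified textbook relation, not the article's derivation): if a workload needs C busy cycles in a sampling period T, its utilization is U = C / (f × T), so an ideally frequency-scalable workload satisfies U1 × f1 = U2 × f2 when the governor moves the clock from f1 to f2; a frequency-dependent scale factor then captures how far a real workload, whose busy cycles themselves change with frequency, deviates from that ideal.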

Algorithms Behind Modern Storage Systems:

Different uses for read-optimized B-trees and write-optimized LSM-trees. This article takes a closer look at two storage system design approaches used in a majority of modern databases (read-optimized B-trees and write-optimized LSM (log-structured merge)-trees) and describes their use cases and tradeoffs.
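
A minimal sketch of the write-optimized LSM idea (illustrative only, not any particular database's implementation): writes land in a small in-memory memtable, which is periodically frozen into immutable sorted segments, and reads consult the memtable first and then the segments from newest to oldest. A B-tree, by contrast, pays more on the write path by updating pages in place so that reads can follow a short path from the root.

```python
# Minimal sketch of the write-optimized LSM idea (not a real database):
# writes go to a small sorted memtable; when it fills up it is flushed as
# an immutable sorted segment, and reads check newest-first.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []            # list of dicts, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # cheap, sequential write path
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable as a sorted, immutable segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment wins
            if key in segment:
                return segment[key]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
assert db.get("k3") == 3
```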

C Is Not a Low-level Language:

Your computer is not a fast PDP-11. In the wake of the recent Meltdown and Spectre vulnerabilities, it’s worth spending some time looking at root causes. Both of these vulnerabilities involved processors speculatively executing instructions past some kind of access check and allowing the attacker to observe the results via a side channel. The features that led to these vulnerabilities, along with several others, were added to let C programmers continue to believe they were programming in a low-level language, when this hasn’t been the case for decades.

Thou Shalt Not Depend on Me:

A look at JavaScript libraries in the wild. Most websites use JavaScript libraries, and many of them are known to be vulnerable. Understanding the scope of the problem and the many unexpected ways that libraries are included is only the first step toward improving the situation. The goal is for the information included in this article to help inform better tooling, development practices, and educational efforts for the community.

Designing Cluster Schedulers for Internet-Scale Services:

Embracing failures for improving availability. Engineers looking to build scheduling systems should consider all failure modes of the underlying infrastructure they use and consider how operators of scheduling systems can configure remediation strategies, while helping keep tenant systems as stable as possible during periods of troubleshooting by the owners of the tenant systems.

Canary Analysis Service:

Automated canarying quickens development, improves production safety, and helps prevent outages. It is unreasonable to expect engineers working on product development or reliability to have statistical knowledge; removing this hurdle led to widespread CAS adoption. CAS has proven useful even for basic cases that don’t need configuration, and has significantly improved Google’s rollout reliability. Impact analysis shows that CAS has likely prevented hundreds of postmortem-worthy outages, and the rate of postmortems among groups that do not use CAS is noticeably higher.

Continuous Delivery Sounds Great, but Will It Work Here?:

It’s not magic, it just requires continuous, daily improvement at all levels. Continuous delivery is a set of principles, patterns, and practices designed to make deployments predictable, routine affairs that can be performed on demand at any time. This article introduces continuous delivery, presents both common objections and actual obstacles to implementing it, and describes how to overcome them using real-life examples. Continuous delivery is not magic. It’s about continuous, daily improvement at all levels of the organization.

Containers Will Not Fix Your Broken Culture (and Other Hard Truths):

Complex socio-technical systems are hard; film at 11. We focus so often on technical anti-patterns, neglecting similar problems inside our social structures. Spoiler alert: the solutions to many difficulties that seem technical can be found by examining our interactions with others. Let’s talk about five things you’ll want to know when working with those pesky creatures known as humans.

DevOps Metrics:

Your biggest mistake might be collecting the wrong data. Delivering value to the business through software requires processes and coordination that often span multiple teams across complex systems, and involves developing and delivering software with both quality and resiliency. As practitioners and professionals, we know that software development and delivery is an increasingly difficult art and practice, and that managing and improving any process or system requires insights into that system. Therefore, measurement is paramount to creating an effective software value stream. Yet accurate measurement is no easy feat.

Monitoring in a DevOps World:

Perfect should never be the enemy of better. Monitoring can seem quite overwhelming. The most important thing to remember is that perfect should never be the enemy of better. DevOps enables highly iterative improvement within organizations. If you have no monitoring, get something; get anything. Something is better than nothing, and if you’ve embraced DevOps, you’ve already signed up for making it better over time.

Bitcoin’s Underlying Incentives:

The unseen economic forces that govern the Bitcoin protocol. Incentives are crucial for the Bitcoin protocol’s security and effectively drive its daily operation. Miners go to extreme lengths to maximize their revenue and often find creative ways to do so that are sometimes at odds with the protocol. Cryptocurrency protocols should be placed on stronger foundations of incentives. There are many areas left to improve, ranging from the very basics of mining rewards and how they interact with the consensus mechanism, through the rewards in mining pools, and all the way to the transaction fee market itself.

Titus: Introducing Containers to the Netflix Cloud:

Approaching container adoption in an already cloud-native infrastructure. We believe our approach has enabled Netflix to quickly adopt and benefit from containers. Though the details may be Netflix-specific, the approach of providing low-friction container adoption by integrating with existing infrastructure and working with the right early adopters can be a successful strategy for any organization looking to adopt containers.

Abstracting the Geniuses Away from Failure Testing:

Ordinary users need tools that automate the selection of custom-tailored faults to inject. This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.

Network Applications Are Interactive:

The network era requires new models, with interactions instead of algorithms. The miniaturization of devices and the prolific interconnectedness of these devices over high-speed wireless networks are completely changing how commerce is conducted. These changes (a.k.a. digital) will profoundly change how enterprises operate. Software is at the heart of this digital world, but the software toolsets and languages were conceived for the host-based era. The issues that already plague software practice (such as high defect rates, poor software productivity, information vulnerability, poor software project success rates, etc.) will be more profound with such an approach. It is time for software to be made simpler, secure, and reliable.

Cache Me If You Can:

Building a decentralized web-delivery model. The world is more connected than it ever has been before, and with our pocket supercomputers and IoT (Internet of Things) future, the next generation of the web might just be delivered in a peer-to-peer model. It’s a giant problem space, but the necessary tools and technology are here today. We just need to define the problem a little better.

Bitcoin’s Academic Pedigree:

The concept of cryptocurrencies is built from forgotten ideas in research literature. We’ve seen repeatedly that ideas in the research literature can be gradually forgotten or lie unappreciated, especially if they are ahead of their time, even in popular areas of research. Both practitioners and academics would do well to revisit old ideas to glean insights for present systems. Bitcoin was unusual and successful not because it was on the cutting edge of research on any of its components, but because it combined old ideas from many previously unrelated fields. This is not easy to do, as it requires bridging disparate terminology, assumptions, etc., but it is a valuable blueprint for innovation.

Metaphors We Compute By:

Code is a story that explains how to solve a particular problem. Programmers must be able to tell a story with their code, explaining how they solved a particular problem. Like writers, programmers must know their metaphors. Many metaphors will be able to explain a concept, but you must have enough skill to choose the right one that’s able to convey your ideas to future programmers who will read the code. Thus, you cannot use every metaphor you know. You must master the art of metaphor selection, of meaning amplification. You must know when to add and when to subtract. You will learn to revise and rewrite code as a writer does. Once there’s nothing else to add or remove, you have finished your work.

Is There a Single Method for the Internet of Things?:

Essence can keep software development for the IoT from becoming unwieldy. The Industrial Internet Consortium predicts the IoT (Internet of Things) will become the third technological revolution after the Industrial Revolution and the Internet Revolution. Its impact across all industries and businesses can hardly be imagined. Existing software (business, telecom, aerospace, defense, etc.) is expected to be modified or redesigned, and a huge amount of new software, solving new problems, will have to be developed. As a consequence, the software industry should welcome new and better methods.

Data Sketching:

The approximate approach is often faster and more efficient. Do you ever feel overwhelmed by an unending stream of information? It can seem like a barrage of new email and text messages demands constant attention, and there are also phone calls to pick up, articles to read, and knocks on the door to answer. Putting these pieces together to keep track of what’s important can be a real challenge. In response to this challenge, the model of streaming data processing has grown in popularity. The aim is no longer to capture, store, and index every minute event, but rather to process each observation quickly in order to create a summary of the current state.
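
One classic example of such a summary (included here as an illustration of the approach; it is a standard technique rather than a quote from the article) is the count-min sketch, which estimates per-item counts of a stream in a small, fixed amount of memory and can overestimate but never underestimate:

```python
# A classic data sketch, shown as an illustration: a count-min sketch
# approximates per-item counts of a stream in fixed memory.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.blake2b(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["spam"] * 100 + ["ham"] * 3:
    cms.add(word)
print(cms.estimate("spam"))   # ~100 (never less than the true count)
```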

The Calculus of Service Availability:

You’re only as available as the sum of your dependencies. Most services offered by Google aim to offer 99.99 percent (sometimes referred to as the "four 9s") availability to users. Some services contractually commit to a lower figure externally but set a 99.99 percent target internally. This more stringent target accounts for situations in which users become unhappy with service performance well before a contract violation occurs, as the number one aim of an SRE team is to keep users happy. For many services, a 99.99 percent internal target represents the sweet spot that balances cost, complexity, and availability.
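
The arithmetic behind that reasoning is short (illustrative numbers, assuming independent failures and no mitigations such as caching or graceful degradation): a service’s availability is bounded by the product of its intrinsic availability and that of every critical dependency, so a nominally 99.99 percent service built directly on two critical dependencies that are each 99.99 percent available can offer only about 0.9999 × 0.9999 × 0.9999 ≈ 99.97 percent. This is why critical dependencies are typically held to a target an extra "nine" beyond the service that relies on them.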

The IDAR Graph:

An improvement over UML. UML is the de facto standard for representing object-oriented designs. It does a fine job of recording designs, but it has a severe problem: its diagrams don’t convey what humans need to know, making them hard to understand. This is why most software developers use UML only when forced to. People understand an organization, such as a corporation, in terms of a control hierarchy. When faced with an organization of people or objects, the first question usually is, "What’s controlling all this?" Surprisingly, UML has no concept of one object controlling another. Consequently, in every type of UML diagram, no object appears to have greater or lesser control than its neighbors.

Too Big NOT to Fail:

Embrace failure so it doesn’t embrace you. Web-scale infrastructure implies LOTS of servers working together, often tens or hundreds of thousands of servers all working toward the same goal. How can the complexity of these environments be managed? How can commonality and simplicity be introduced?

The Debugging Mindset:

Understanding the psychology of learning strategies leads to effective problem-solving skills. Software developers spend 35-50 percent of their time validating and debugging software. The cost of debugging, testing, and verification is estimated to account for 50-75 percent of the total budget of software development projects, amounting to more than $100 billion annually. While tools, languages, and environments have reduced the time spent on individual debugging tasks, they have not significantly reduced the total time spent debugging, nor the cost of doing so. Therefore, a hyperfocus on elimination of bugs during development is counterproductive; programmers should instead embrace debugging as an exercise in problem solving.

MongoDB’s JavaScript Fuzzer:

The fuzzer is for those edge cases that your testing didn’t catch. As MongoDB becomes more feature-rich and complex with time, the need to develop more sophisticated methods for finding bugs grows as well. Three years ago, MongoDB added a home-grown JavaScript fuzzer to its toolkit, and it is now our most prolific bug-finding tool, responsible for detecting almost 200 bugs over the course of two release cycles. These bugs span a range of MongoDB components from sharding to the storage engine, with symptoms ranging from deadlocks to data inconsistency. The fuzzer runs as part of the CI (continuous integration) system, where it frequently catches bugs in newly committed code.
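
For readers unfamiliar with the technique, here is a mutational fuzzer in miniature (a generic Python sketch, unrelated to MongoDB's actual JavaScript fuzzer): it randomly perturbs known-valid inputs and records any input that makes the target raise an unexpected exception.

```python
# A mutational fuzzer in miniature (illustrative only; MongoDB's JavaScript
# fuzzer is far more sophisticated): randomly mutate known-valid inputs and
# report any that make the system under test raise an unexpected error.
import random

def mutate(text, rng):
    chars = list(text)
    for _ in range(rng.randint(1, 3)):
        i = rng.randrange(len(chars))
        chars[i] = chr(rng.randrange(32, 127))   # flip a character at random
    return "".join(chars)

def fuzz(target, seeds, iterations=10_000, seed=0):
    rng = random.Random(seed)
    failures = []
    for _ in range(iterations):
        candidate = mutate(rng.choice(seeds), rng)
        try:
            target(candidate)
        except Exception as exc:                 # an unexpected error is a finding
            failures.append((candidate, exc))
    return failures

# Example target: a toy normalizer that should accept any string input.
failures = fuzz(lambda s: " ".join(s.split()), seeds=["hello world"], iterations=100)
assert not failures   # a robust target survives arbitrary mutations
```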

Making Money Using Math:

Modern applications are increasingly using probabilistic machine-learned models. A big difference between human-written code and learned models is that the latter are usually not represented by text and hence are not understandable by human developers or manipulable by existing tools. The consequence is that none of the traditional software engineering techniques for conventional programs (such as code reviews, source control, and debugging) are applicable anymore. Since incomprehensibility is not unique to learned code, these aspects are not of concern here.

Pervasive, Dynamic Authentication of Physical Items:

The use of silicon PUF (physical unclonable function) circuits. Authentication of physical items is an age-old problem. Common approaches include the use of bar codes, QR codes, holograms, and RFID (radio-frequency identification) tags. Traditional RFID tags and bar codes use a public identifier as a means of authentication. A public identifier, however, is static: it is the same each time it is queried and can be easily copied by an adversary. Holograms can also be viewed as public identifiers: a knowledgeable verifier knows all the attributes to inspect visually. It is difficult to make hologram-based authentication pervasive; a casual verifier does not know all the attributes to look for.

Uninitialized Reads:

Understanding the proposed revisions to the C language. Most developers understand that reading uninitialized variables in C is a defect, but some do it anyway. What happens when you read uninitialized objects is unsettled in the current version of the C standard (C11). Various proposals have been made to resolve these issues in the planned C2X revision of the standard. Consequently, this is a good time to understand existing behaviors as well as proposed revisions to the standard to influence the evolution of the C language. Given that the behavior of uninitialized reads is unsettled in C11, prudence dictates eliminating uninitialized reads from your code.

Heterogeneous Computing: Here to Stay:

Hardware and Software Perspectives. Mentions of the buzzword heterogeneous computing have been on the rise in the past few years and will continue to be heard for years to come, because heterogeneous computing is here to stay. What is heterogeneous computing, and why is it becoming the norm? How do we deal with it, from both the software side and the hardware side? This article provides answers to some of these questions and presents different points of view on others.

Time, but Faster:

A computing adventure about time through the looking glass. The first premise was summed up perfectly by the late Douglas Adams in The Hitchhiker’s Guide to the Galaxy: "Time is an illusion. Lunchtime doubly so." The concept of time, when colliding with decoupled networks of computers that run at billions of operations per second, is... well, the truth of the matter is that you simply never really know what time it is. That is why Leslie Lamport’s seminal paper on Lamport timestamps was so important to the industry, but this article is actually about wall-clock time, or a reasonably useful estimation of it.

Life Beyond Distributed Transactions:

An apostate’s opinion. This article explores and names some of the practical approaches used in the implementation of large-scale mission-critical applications in a world that rejects distributed transactions. Topics include the management of fine-grained pieces of application data that may be repartitioned over time as the application grows. Design patterns support sending messages between these repartitionable pieces of data.

BBR: Congestion-Based Congestion Control:

Measuring bottleneck bandwidth and round-trip propagation time. When bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat. When bottleneck buffers are small, loss-based congestion control misinterprets loss as a signal of congestion, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control. Finding this alternative requires an understanding of where and how network congestion originates.
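
The sketch below shows, in simplified form, how those two quantities can be estimated from per-ACK samples (an illustration of the idea with invented names, not the BBR implementation): bottleneck bandwidth as a windowed maximum of measured delivery rate, and round-trip propagation time as a windowed minimum of measured RTT.

```python
# Illustrative sketch only (not the BBR implementation): estimate bottleneck
# bandwidth (windowed max of delivery rate) and round-trip propagation time
# (windowed min of RTT) from per-ACK samples.
from collections import deque

class PathModel:
    def __init__(self, bw_window=10, rtt_window=100):
        self.bw_samples = deque(maxlen=bw_window)    # recent delivery rates
        self.rtt_samples = deque(maxlen=rtt_window)  # recent RTT measurements

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        self.bw_samples.append(delivered_bytes / interval_s)
        self.rtt_samples.append(rtt_s)

    @property
    def bottleneck_bw(self):          # bytes per second
        return max(self.bw_samples) if self.bw_samples else 0.0

    @property
    def rtprop(self):                 # seconds
        return min(self.rtt_samples) if self.rtt_samples else float("inf")

    @property
    def bdp(self):
        # Bandwidth-delay product: roughly how much data should be in flight.
        return self.bottleneck_bw * self.rtprop
```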

Faucet: Deploying SDN in the Enterprise:

Using OpenFlow and DevOps for rapid development. While SDN as a technology continues to evolve and become even more programmable, Faucet and OpenFlow 1.3 hardware together are sufficient to realize benefits today. This article describes specifically how to take advantage of DevOps practices to develop and deploy features rapidly. It also describes several practical deployment scenarios, including firewalling and network function virtualization.

Industrial Scale Agile - from Craft to Engineering:

Essence is instrumental in moving software development toward a true engineering discipline. There are many, many ways to illustrate how fragile IT investments can be. You just have to look at the way that, even after huge investments in education and coaching, many organizations are struggling to broaden their agile adoption to the whole of their organization - or at the way other organizations are struggling to maintain the momentum of their agile adoptions as their teams change and their systems mature.

Functional at Scale:

Applying functional programming principles to distributed computing projects. Modern server software is demanding to develop and operate: it must be available at all times and in all locations; it must reply within milliseconds to user requests; it must respond quickly to capacity demands; it must process a lot of data and even more traffic; it must adapt quickly to changing product needs; and in many cases it must accommodate a large engineering organization, its many engineers the proverbial cooks in a big, messy kitchen.

Scaling Synchronization in Multicore Programs:

Advanced synchronization methods can boost the performance of multicore software. Designing software for modern multicore processors poses a dilemma. Traditional software designs, in which threads manipulate shared data, have limited scalability because synchronization of updates to shared data serializes threads and limits parallelism. Alternative distributed software designs, in which threads do not share mutable data, eliminate synchronization and offer better scalability. But distributed designs make it challenging to implement features that shared data structures naturally provide, such as dynamic load balancing and strong consistency guarantees, and are simply not a good fit for every program. Often, however, the performance of shared mutable data structures is limited by the synchronization methods in use today, whether lock-based or lock-free.

Idle-Time Garbage-Collection Scheduling:

Taking advantage of idleness to reduce dropped frames and memory consumption. Google’s Chrome web browser strives to deliver a smooth user experience. An animation will update the screen at 60 FPS (frames per second), giving Chrome around 16.6 milliseconds to perform the update. Within these 16.6 ms, all input events have to be processed, all animations have to be performed, and finally the frame has to be rendered. A missed deadline will result in dropped frames. These are visible to the user and degrade the user experience. Such sporadic animation artifacts are referred to here as jank. This article describes an approach implemented in the JavaScript engine V8, used by Chrome, to schedule garbage-collection pauses during times when Chrome is idle.
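
The core scheduling decision can be sketched in a few lines (hypothetical numbers and names; V8's real heuristics are richer): a garbage-collection step is run only when its estimated pause fits inside the idle time remaining in the current 16.6-ms frame.

```python
# Simplified sketch of the scheduling decision (hypothetical numbers, not
# V8's actual heuristics): run a garbage-collection step only if its
# estimated pause fits inside the idle time left in the current frame.
FRAME_BUDGET_MS = 1000.0 / 60.0   # ~16.6 ms per frame at 60 FPS

def maybe_collect(time_spent_this_frame_ms, estimated_gc_pause_ms, run_gc_step):
    idle_ms = FRAME_BUDGET_MS - time_spent_this_frame_ms
    if estimated_gc_pause_ms <= idle_ms:
        run_gc_step()          # fits in the idle period: no dropped frame
        return True
    return False               # defer the work to a later idle period

# Example: 9 ms already spent on input and animation leaves ~7.6 ms of idle
# time, enough for a 5 ms incremental marking step.
maybe_collect(9.0, 5.0, lambda: None)
```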

Dynamics of Change: Why Reactivity Matters:

Tame the dynamics of change by centralizing each concern in its own module. Professional programming is about dealing with software at scale. Everything is trivial when the problem is small and contained: it can be elegantly solved with imperative programming or functional programming or any other paradigm. Real-world challenges arise when programmers have to deal with large amounts of data, network requests, or intertwined entities, as in UI (user interface) programming.

Cluster-level Logging of Containers with Containers:

Logging Challenges of Container-Based Cloud Deployments. This article shows how cluster-level logging infrastructure can be implemented using open source tools and deployed using the very same abstractions that are used to compose and manage the software systems being logged. Collecting and analyzing log information is an essential aspect of running production systems to ensure their reliability and to provide important auditing information. Many tools (e.g., Fluentd and Logstash) have been developed to help with the aggregation and collection of logs for specific software components (e.g., an Apache web server) running on specific servers.

The Hidden Dividends of Microservices:

Microservices aren’t for every company, and the journey isn’t easy. Microservices are an approach to building distributed systems in which services are exposed only through hardened APIs; the services themselves have a high degree of internal cohesion around a specific and well-bounded context or area of responsibility, and the coupling between them is loose. Such services are typically simple, yet they can be composed into very rich and elaborate applications. The effort required to adopt a microservices-based approach is considerable, particularly in cases that involve migration from more monolithic architectures. The explicit benefits of microservices are well known and numerous, however, and can include increased agility, resilience, scalability, and developer productivity.

Debugging Distributed Systems:

Challenges and options for validation and debugging. Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system’s communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges.

Should You Upload or Ship Big Data to the Cloud?:

The accepted wisdom does not always hold true. It is accepted wisdom that when the data you wish to move into the cloud is at terabyte scale and beyond, you are better off shipping it to the cloud provider, rather than uploading it. This article takes an analytical look at how shipping and uploading strategies compare, the various factors on which they depend, and under what circumstances you are better off shipping rather than uploading data, and vice versa. Such an analytical determination is important to make, given the increasing availability of gigabit-speed Internet connections, along with the explosive growth in data-transfer speeds supported by newer editions of drive interfaces such as SAS and PCI Express.
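
A back-of-the-envelope instance of that analysis (illustrative numbers only): uploading 10 TB over a 1-Gbps connection that sustains 80 percent utilization moves 8 × 10^13 bits at 8 × 10^8 bits per second, or about 10^5 seconds, roughly 28 hours, which is already competitive with overnight shipment of a drive. Over a 100-Mbps link the same transfer takes more than a week, and shipping clearly wins; the crossover point shifts again as link speeds, drive capacities, and drive-interface speeds change.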

The Flame Graph:

This visualization of software execution is a new necessity for performance profiling and debugging. An everyday problem in our industry is understanding how software is consuming resources, particularly CPUs. What exactly is consuming how much, and how did this change since the last software version? These questions can be answered using software profilers, tools that help direct developers to optimize their code and operators to tune their environment. The output of profilers can be verbose, however, making it laborious to study and comprehend. The flame graph provides a new visualization for profiler output and can make for much faster comprehension, reducing the time for root cause analysis.
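
The aggregation step behind a flame graph can be illustrated in a few lines (a generic sketch, not tied to any particular profiler): sampled stacks are folded into per-path counts, and each unique path then becomes one box whose width is proportional to its count.

```python
# Sketch of the aggregation behind a flame graph (generic, not any
# particular profiler): collapse sampled stacks into per-path counts.
from collections import Counter

samples = [
    ["main", "parse", "read_file"],
    ["main", "parse", "read_file"],
    ["main", "render", "draw"],
    ["main", "parse", "tokenize"],
]

folded = Counter(";".join(stack) for stack in samples)
for path, count in folded.most_common():
    print(path, count)
# main;parse;read_file 2
# main;render;draw 1
# main;parse;tokenize 1
```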

Why Logical Clocks are Easy:

Sometimes all you need is the right language. Any computing system can be described as executing sequences of actions, with an action being any relevant change in the state of the system. For example, reading a file to memory, modifying the contents of the file in memory, or writing the new contents to the file are relevant actions for a text editor. In a distributed system, actions execute in multiple locations; in this context, actions are often called events. Examples of events in distributed systems include sending or receiving messages, or changing some state in a node. Not all events are related, but some events can cause and influence how other, later events occur.
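
A Lamport clock, one of the simplest logical clocks, captures this "happened before" influence in a few lines (a standard construction, shown here as a sketch rather than anything specific to the article): every node ticks a counter on local events, stamps outgoing messages, and jumps past any larger stamp it receives.

```python
# Minimal Lamport-clock sketch (a standard logical clock, shown as an
# illustration): tick on local events, stamp sends, and on receive advance
# past the incoming timestamp so causally later events get larger stamps.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time            # timestamp carried on the message

    def receive(self, message_time):
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()          # a: 1
b.receive(t)          # b: 2 -- the receive is ordered after the send
```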

Use-Case 2.0:

The Hub of Software Development. Use cases have been around for almost 30 years as a requirements approach and have been part of the inspiration for more-recent techniques such as user stories. Now the inspiration has flown in the other direction. Use-Case 2.0 is the new generation of use-case-driven development (light, agile, and lean) inspired by user stories and the agile methodologies Scrum and Kanban.

Statistics for Engineers:

Applying statistical techniques to operations data. Modern IT systems collect an increasing wealth of data from network gear, operating systems, applications, and other components. This data needs to be analyzed to derive vital information about the user experience and business performance. For instance, faults need to be detected, service quality needs to be measured, and resource usage for the coming days and months needs to be forecast.

Borg, Omega, and Kubernetes:

Lessons learned from three container-management systems over a decade. Though widespread interest in software containers is a relatively recent phenomenon, at Google we have been managing Linux containers at scale for more than ten years and have built three different container-management systems in that time. Each system was heavily influenced by its predecessors, even though they were developed for different reasons. This article describes the lessons we’ve learned from developing and operating them.

The Verification of a Distributed System:

A practitioner’s guide to increasing confidence in system correctness. Leslie Lamport, known for his seminal work in distributed systems, famously said, "A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable." Given this bleak outlook and the large set of possible failures, how do you even begin to verify and validate that the distributed systems you build are doing the right thing?

Accountability in Algorithmic Decision-making:

A view from computational journalism. Every fiscal quarter, automated writing algorithms churn out thousands of corporate earnings articles for the AP (Associated Press) based on little more than structured data. Companies such as Automated Insights, which produces the articles for AP, and Narrative Science can now write straight news articles in almost any domain that has clean and well-structured data: finance, sure, but also sports, weather, and education, among others. The articles aren’t cardboard either; they have variability, tone, and style, and in some cases readers even have difficulty distinguishing the machine-produced articles from human-written ones.

Immutability Changes Everything:

We need it, we can afford it, and the time is now. There is an inexorable trend toward storing and sending immutable data. We need immutability to coordinate at a distance, and we can afford immutability as storage gets cheaper. This article is an amuse-bouche sampling the repeated patterns of computing that leverage immutability. Climbing up and down the compute stack really does yield a sense of déjà vu all over again.

Time is an Illusion.:

Lunchtime doubly so. - Ford Prefect to Arthur Dent in "The Hitchhiker’s Guide to the Galaxy", by Douglas Adams. One of the more surprising things about digital systems - and, in particular, modern computers - is how poorly they keep time. When most programs ran on a single system, this was not a significant issue for the majority of software developers, but once software moved into the distributed-systems realm, this inaccuracy became a significant challenge.

Non-volatile Storage:

Implications of the Datacenter’s Shifting Center. For the entire careers of most practicing computer scientists, a fundamental observation has consistently held true: CPUs are significantly more performant and more expensive than I/O devices. The fact that CPUs can process data at extremely high rates, while simultaneously servicing multiple I/O devices, has had a sweeping impact on the design of both hardware and software for systems of all sizes, for pretty much as long as we’ve been building them.

Schema.org: Evolution of Structured Data on the Web:

Big data makes common schemas even more necessary. Separation between content and presentation has always been one of the important design aspects of the Web. Historically, however, even though most Web sites were driven off structured databases, they published their content purely in HTML. Services such as Web search, price comparison, reservation engines, etc. that operated on this content had access only to HTML. Applications requiring access to the structured data underlying these Web pages had to build custom extractors to convert plain HTML into structured data. These efforts were often laborious and the scrapers were fragile and error-prone, breaking every time a site changed its layout.

A Purpose-built Global Network: Google’s Move to SDN:

A discussion with Amin Vahdat, David Clark, and Jennifer Rexford. Everything about Google is at scale, of course -- a market cap of legendary proportions, an unrivaled talent pool, enough intellectual property to keep armies of attorneys in Guccis for life, and, oh yeah, a private WAN (wide area network) bigger than you can possibly imagine that also happens to be growing substantially faster than the Internet as a whole.

It Probably Works:

Probabilistic algorithms are all around us--not only are they acceptable, but some programmers actually seek out chances to use them. Probabilistic algorithms exist to solve problems that are either impossible or unrealistic (too expensive, too time-consuming, etc.) to solve precisely. In an ideal world, you would never actually need to use probabilistic algorithms. To programmers who are not familiar with them, the idea can be positively nervewracking: "How do I know that it will actually work? What if it’s inexplicably wrong? How can I debug it? Maybe we should just punt on this problem, or buy a whole lot more servers..."

Challenges of Memory Management on Modern NUMA Systems:

Optimizing NUMA-system applications with Carrefour. Modern server-class systems are typically built as several multicore chips put together in a single system. Each chip has a local DRAM (dynamic random-access memory) module; together they are referred to as a node. Nodes are connected via a high-speed interconnect, and the system is fully coherent. This means that, transparently to the programmer, a core can issue requests to its node’s local memory as well as to the memories of other nodes. The key distinction is that remote requests will take longer, because they are subject to longer wire delays and may have to jump several hops as they traverse the interconnect.

Componentizing the Web:

We may be on the cusp of a new revolution in web development. There is no task in software engineering today quite as herculean as web development. A typical specification for a web application might read: The app must work across a wide variety of browsers. It must run animations at 60 fps. It must be immediately responsive to touch. It must conform to a specific set of design principles and specs. It must work on just about every screen size imaginable, from TVs and 30-inch monitors to mobile phones and watch faces. It must be well-engineered and maintainable in the long term.

Fail at Scale:

Reliability in the face of rapid change Failure is part of engineering any large-scale system. One of Facebook’s cultural values is embracing failure. This can be seen in the posters hung around the walls of our Menlo Park headquarters: "What Would You Do If You Weren’t Afraid?" and "Fortune Favors the Bold."

How to De-identify Your Data:

Balancing statistical accuracy and subject privacy in large social-science data sets Big data is all the rage; using large data sets promises to give us new insights into questions that have been difficult or impossible to answer in the past. This is especially true in fields such as medicine and the social sciences, where large amounts of data can be gathered and mined to find insightful relationships among variables. Data in such fields involves humans, however, and thus raises issues of privacy that are not faced by fields such as physics or astronomy.

Crash Consistency:

Rethinking the Fundamental Abstractions of the File System The reading and writing of data, one of the most fundamental aspects of any Von Neumann computer, is surprisingly subtle and full of nuance. For example, consider access to a shared memory in a system with multiple processors. While a simple and intuitive approach known as strong consistency is easiest for programmers to understand, many weaker models are in widespread use (e.g., x86 total store ordering); such approaches improve system performance, but at the cost of making reasoning about system behavior more complex and error-prone.

Testing a Distributed System:

Testing a distributed system can be trying even under the best of circumstances. Distributed systems can be especially difficult to program, for a variety of reasons. They can be difficult to design, difficult to manage, and, above all, difficult to test. Testing a normal system can be trying even under the best of circumstances, and no matter how diligent the tester is, bugs can still get through. Now take all of the standard issues and multiply them by multiple processes written in multiple languages running on multiple boxes that could potentially all be on different operating systems, and there is potential for a real disaster.

Natural Language Translation at the Intersection of AI and HCI:

Old questions being answered with both AI and HCI The fields of artificial intelligence (AI) and human-computer interaction (HCI) are influencing each other like never before. Widely used systems such as Google Translate, Facebook Graph Search, and RelateIQ hide the complexity of large-scale AI systems behind intuitive interfaces. But relations were not always so auspicious. The two fields emerged at different points in the history of computer science, with different influences, ambitions, and attendant biases. AI aimed to construct a rival, and perhaps a successor, to the human intellect. Early AI researchers such as McCarthy, Minsky, and Shannon were mathematicians by training, so theorem-proving and formal models were attractive research directions.

Beyond Page Objects: Testing Web Applications with State Objects:

Use states to drive your tests End-to-end testing of Web applications typically involves tricky interactions with Web pages by means of a framework such as Selenium WebDriver. The recommended method for hiding such Web-page intricacies is to use page objects, but there are questions to answer first: Which page objects should you create when testing Web applications? What actions should you include in a page object? Which test scenarios should you specify, given your page objects?
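
For reference, a page object in its simplest form hides a page's locators behind intent-level actions; the sketch below (Java with Selenium WebDriver; the element ids and page names are hypothetical) shows the baseline that state objects build on.

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;

    // Hypothetical page object: the login page's locators live here, and tests
    // call intent-level actions instead of poking at HTML elements directly.
    class LoginPage {
        private final WebDriver driver;

        LoginPage(WebDriver driver) { this.driver = driver; }

        // A successful login lands on the inbox, so the action returns the next page object.
        InboxPage loginAs(String user, String password) {
            driver.findElement(By.id("username")).sendKeys(user);
            driver.findElement(By.id("password")).sendKeys(password);
            driver.findElement(By.id("login-button")).click();
            return new InboxPage(driver);
        }
    }

    class InboxPage {
        private final WebDriver driver;

        InboxPage(WebDriver driver) { this.driver = driver; }

        int unreadCount() {
            return driver.findElements(By.cssSelector(".unread")).size();
        }
    }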

Dismantling the Barriers to Entry:

We have to choose to build a web that is accessible to everyone. A war is being waged in the world of web development. On one side is a vanguard of toolmakers and tool users, who thrive on the destruction of bad old ideas ("old," in this milieu, meaning anything that debuted on Hacker News more than a month ago) and raucous debates about transpilers and suchlike.

Hadoop Superlinear Scalability:

The perpetual motion of parallel performance "We often see more than 100 percent speedup efficiency!" came the rejoinder to the innocent reminder that you can’t have more than 100 percent of anything. But this was just the first volley from software engineers during a presentation on how to quantify computer system scalability in terms of the speedup metric. In different venues, on subsequent occasions, that retort seemed to grow into a veritable chorus that not only was superlinear speedup commonly observed, but also the model used to quantify scalability for the past 20 years failed when applied to superlinear speedup data.
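
For context, the standard definitions behind that exchange (stated here for reference, not quoted from the article): with T(p) the runtime on p processors,

    S(p) = T(1) / T(p)    (speedup)
    E(p) = S(p) / p       (speedup efficiency)

so "more than 100 percent speedup efficiency" means E(p) > 1, that is, S(p) > p: superlinear speedup.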

Evolution and Practice: Low-latency Distributed Applications in Finance:

The finance industry has unique demands for low-latency distributed systems. Virtually all systems have some requirements for latency, defined here as the time required for a system to respond to input. Latency requirements appear in problem domains as diverse as aircraft flight controls, voice communications, multiplayer gaming, online advertising, and scientific experiments. Distributed systems present special latency considerations. In recent years the automation of financial trading has driven requirements for distributed systems with challenging latency requirements and global geographic distribution. Automated trading provides a window into the engineering challenges of ever-shrinking latency requirements, which may be useful to software engineers in other fields.

The Science of Managing Data Science:

Lessons learned managing a data science research team What are they doing all day? When I first took over as VP of Engineering at a startup doing data mining and machine learning research, this was what the other executives wanted to know. They knew the team was super smart, and they seemed like they were working really hard, but the executives had lots of questions about the work itself. How did they know that the work they were doing was the "right" work? Were there other projects they could be doing instead? And how could we get this research into the hands of our customers faster?

Using Free and Open Source Tools to Manage Software Quality:

An agile process implementation The principles of agile software development place more emphasis on individuals and interactions than on processes and tools. They steer us away from heavy documentation requirements and guide us along a path of reacting efficiently to change rather than sticking rigidly to a pre-defined plan. To support this flexible method of operation, it is important to have suitable applications to manage the team’s activities. It is also essential to implement effective frameworks to ensure quality is being built into the product early and at all levels.

From the EDVAC to WEBVACs:

Cloud computing for computer scientists By now everyone has heard of cloud computing and realized that it is changing how both traditional enterprise IT and emerging startups are building solutions for the future. Is this trend toward the cloud just a shift in the complicated economics of the hardware and software industry, or is it a fundamentally different way of thinking about computing? Having worked in the industry, I can confidently say it is both.

Spicing Up Dart with Side Effects:

A set of extensions to the Dart programming language, designed to support asynchrony and generator functions The Dart programming language has recently incorporated a set of extensions designed to support asynchrony and generator functions. Because Dart is a language for Web programming, latency is an important concern. To avoid blocking, developers must make methods asynchronous when computing their results requires nontrivial time. Generator functions ease the task of computing iterable sequences.

Reliable Cron across the Planet:

...or How I stopped worrying and learned to love time This article describes Google’s implementation of a distributed Cron service, serving the vast majority of internal teams that need periodic scheduling of compute jobs. During its existence, we have learned many lessons on how to design and implement what might seem like a basic service. Here, we discuss the problems that distributed Crons face and outline some potential solutions.

There is No Now:

Problems with simultaneity in distributed systems Now. The time elapsed between when I wrote that word and when you read it was at least a couple of weeks. That kind of delay is one that we take for granted and don’t even think about in written media. "Now." If we were in the same room and instead I spoke aloud, you might have a greater sense of immediacy. You might intuitively feel as if you were hearing the word at exactly the same time that I spoke it. That intuition would be wrong. If, instead of trusting your intuition, you thought about the physics of sound, you would know that time must have elapsed between my speaking and your hearing.

Parallel Processing with Promises:

A simple method of writing a collaborative system In today’s world, there are many reasons to write concurrent software. The desire to improve performance and increase throughput has led to many different asynchronous techniques. The techniques involved, however, are generally complex and the source of many subtle bugs, especially if they require shared mutable state. If shared state is not required, then these problems can be solved with a better abstraction called promises. These allow programmers to hook asynchronous function calls together, waiting for each to return success or failure before running the next appropriate function in the chain.
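
As a rough illustration of that chaining idea, here is a minimal Java sketch using CompletableFuture; the fetchRecord step is hypothetical and stands in for any asynchronous call.

    import java.util.concurrent.CompletableFuture;

    public class PromiseChain {
        // Hypothetical asynchronous step: fetch a record by id on a background thread.
        static CompletableFuture<String> fetchRecord(int id) {
            return CompletableFuture.supplyAsync(() -> "record-" + id);
        }

        public static void main(String[] args) {
            fetchRecord(42)
                .thenApply(String::toUpperCase)                   // runs only if the fetch succeeded
                .thenAccept(r -> System.out.println("got " + r))  // next step in the chain
                .exceptionally(err -> {                           // failure path for any earlier step
                    System.err.println("chain failed: " + err);
                    return null;
                })
                .join();                                          // block here only for the demo
        }
    }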

META II: Digital Vellum in the Digital Scriptorium:

Revisiting Schorre’s 1962 compiler-compiler Some people do living history -- reviving older skills and material culture by reenacting Waterloo or knapping flint knives. One pleasant rainy weekend in 2012, I set my sights a little more recently and settled in for a little meditative retro-computing, ca. 1962, following the ancient mode of transmission of knowledge: lecture and recitation -- or rather, grace of living in historical times, lecture (here, in the French sense, reading) and transcription (or even more specifically, grace of living post-Post, lecture and reimplementation).

Model-based Testing: Where Does It Stand?:

MBT has positive effects on efficiency and effectiveness, even if it only partially fulfills high expectations. You have probably heard about MBT (model-based testing), but like many software-engineering professionals who have not used MBT, you might be curious about others’ experience with this test-design method. From mid-June 2014 to early August 2014, we conducted a survey to learn how MBT users view its efficiency and effectiveness. The 2014 MBT User Survey, a follow-up to a similar 2012 survey, was open to all those who have evaluated or used any MBT approach. Its 32 questions included some from a survey distributed at the 2013 User Conference on Advanced Automated Testing. Some questions focused on the efficiency and effectiveness of MBT, providing the figures that managers are most interested in.

Go Static or Go Home:

In the end, dynamic systems are simply less secure. Most current and historic problems in computer and network security boil down to a single observation: letting other people control our devices is bad for us. At another time, I’ll explain what I mean by "other people" and "bad." For the purpose of this article, I’ll focus entirely on what I mean by control. One way we lose control of our devices is to external distributed denial of service (DDoS) attacks, which fill a network with unwanted traffic, leaving no room for real ("wanted") traffic. Other forms of DDoS are similar: an attack by the Low Orbit Ion Cannon (LOIC), for example, might not totally fill up a network, but it can keep a web server so busy answering useless attack requests that the server can’t answer any useful customer requests.

Securing the Network Time Protocol:

Crackers discover how to use NTP as a weapon for abuse. In the late 1970s David L. Mills began working on the problem of synchronizing time on networked computers, and NTP (Network Time Protocol) version 1 made its debut in 1980. This was at a time when the net was a much friendlier place - the ARPANET days. NTP version 2 appeared approximately a year later, about the same time as CSNET (Computer Science Network). NSFNET (National Science Foundation Network) launched in 1986. NTP version 3 showed up in 1993.

Scalability Techniques for Practical Synchronization Primitives:

Designing locking primitives with performance in mind In an ideal world, applications are expected to scale automatically when executed on increasingly larger systems. In practice, however, not only does this scaling not occur, but it is common to see performance actually worsen on those larger systems.

Internal Access Controls:

Trust, but Verify Every day seems to bring news of another dramatic and high-profile security incident, whether it is the discovery of longstanding vulnerabilities in widely used software such as OpenSSL or Bash, or celebrity photographs stolen and publicized. There seems to be an infinite supply of zero-day vulnerabilities and powerful state-sponsored attackers. In the face of such threats, is it even worth trying to protect your systems and data? What can systems security designers and administrators do?

Disambiguating Databases:

Use the database built for your access model. The topic of data storage is one that doesn’t need to be well understood until something goes wrong (data disappears) or something goes really right (too many customers). Because databases can be treated as black boxes with an API, their inner workings are often overlooked. They’re often treated as magic things that just take data when offered and supply it when asked. Since these two operations are the only understood activities of the technology, they are often the only features presented when comparing different technologies.

A New Software Engineering:

What happened to the promise of rigorous, disciplined, professional practices for software development? What happened to software engineering? What happened to the promise of rigorous, disciplined, professional practices for software development, like those observed in other engineering disciplines? What has been adopted under the rubric of "software engineering" is a set of practices largely adapted from other engineering disciplines: project management, design and blueprinting, process control, and so forth. The basic analogy was to treat software as a manufactured product, with all the real "engineering" going on upstream of that - in requirements analysis, design, modeling, etc.

There’s No Such Thing as a General-purpose Processor:

And the belief in such a device is harmful There is an increasing trend in computer architecture to categorize processors and accelerators as "general purpose." Of the papers published at this year’s International Symposium on Computer Architecture (ISCA 2014), nine out of 45 explicitly referred to general-purpose processors; one additionally referred to general-purpose FPGAs (field-programmable gate arrays), and another referred to general-purpose MIMD (multiple instruction, multiple data) supercomputers, stretching the definition to the breaking point. This article presents the argument that there is no such thing as a truly general-purpose processor and that the belief in such a device is harmful.

The Responsive Enterprise: Embracing the Hacker Way:

Soon every company will be a software company. As of July 2014, Facebook, founded in 2004, is in the top 20 of the most valuable companies in the S&P 500, putting the 10-year-old software company in the same league as IBM, Oracle, and Coca-Cola. Of the top five fastest-growing companies with regard to market capitalization in 2014 (table 1), three are software companies: Apple, Google, and Microsoft (in fact, one could argue that Intel is also driven by software, making it four out of five).

Evolution of the Product Manager:

Better education needed to develop the discipline Software practitioners know that product management is a key piece of software development. Product managers talk to users to help figure out what to build, define requirements, and write functional specifications. They work closely with engineers throughout the process of building software. They serve as a sounding board for ideas, help balance the schedule when technical challenges occur - and push back to executive teams when technical revisions are needed. Product managers are involved from before the first code is written, until after it goes out the door.

Productivity in Parallel Programming: A Decade of Progress:

Looking at the design and benefits of X10 In 2002 DARPA (Defense Advanced Research Projects Agency) launched a major initiative in HPCS (high-productivity computing systems). The program was motivated by the belief that the utilization of the coming generation of parallel machines was gated by the difficulty of writing, debugging, tuning, and maintaining software at petascale.

JavaScript and the Netflix User Interface:

Conditional dependency resolution In the two decades since its introduction, JavaScript has become the de facto official language of the Web. JavaScript trumps every other language when it comes to the number of runtime environments in the wild. Nearly every consumer hardware device on the market today supports the language in some way. While this is done most commonly through the integration of a Web browser application, many devices now also support Web views natively as part of the operating system UI (user interface).

Security Collapse in the HTTPS Market:

Assessing legal and technical solutions to secure HTTPS HTTPS (Hypertext Transfer Protocol Secure) has evolved into the de facto standard for secure Web browsing. Through the certificate-based authentication protocol, Web services and Internet users first authenticate one another ("shake hands") using a TLS/SSL certificate, encrypt Web communications end-to-end, and show a padlock in the browser to signal that a communication is secure. In recent years, HTTPS has become an essential technology to protect social, political, and economic activities online.

Why Is It Taking So Long to Secure Internet Routing?:

Routing security incidents can still slip past deployed security defenses. BGP (Border Gateway Protocol) is the glue that sticks the Internet together, enabling data communications between large networks operated by different organizations. BGP makes Internet communications global by setting up routes for traffic between organizations - for example, from Boston University’s network, through larger ISPs (Internet service providers) such as Level3, Pakistan Telecom, and China Telecom, then on to residential networks such as Comcast or enterprise networks such as Bank of America.

Certificate Transparency:

Public, verifiable, append-only logs On August 28, 2011, a mis-issued wildcard HTTPS certificate for google.com was used to conduct a man-in-the-middle attack against multiple users in Iran. The certificate had been issued by a Dutch CA (certificate authority) known as DigiNotar, a subsidiary of VASCO Data Security International. Later analysis showed that DigiNotar had been aware of the breach in its systems for more than a month - since at least July 19. It also showed that at least 531 fraudulent certificates had been issued. The final count may never be known, since DigiNotar did not have records of all the mis-issued certificates.

Securing the Tangled Web:

Preventing script injection vulnerabilities through software design Script injection vulnerabilities are a bane of Web application development: deceptively simple in cause and remedy, they are nevertheless surprisingly difficult to prevent in large-scale Web development.

Privacy, Anonymity, and Big Data in the Social Sciences:

Quality social science research and the privacy of human subjects requires trust. Open data has tremendous potential for science, but, in human subjects research, there is a tension between privacy and releasing high-quality open data. Federal law governing student privacy and the release of student records suggests that anonymizing student data protects student privacy. Guided by this standard, we de-identified and released a data set from 16 MOOCs (massive open online courses) from MITx and HarvardX on the edX platform. In this article, we show that these and other de-identification procedures necessitate changes to data sets that threaten replication and extension of baseline analyses. To balance student privacy and the benefits of open data, we suggest focusing on protecting privacy without anonymizing data by instead expanding policies that compel researchers to uphold the privacy of the subjects in open data sets.

The Network is Reliable:

An informal survey of real-world communications failures The network is reliable tops Peter Deutsch’s classic list, "Eight fallacies of distributed computing", "all [of which] prove to be false in the long run and all [of which] cause big trouble and painful learning experiences." Accounting for and understanding the implications of network behavior is key to designing robust distributed programs; in fact, six of Deutsch’s "fallacies" directly pertain to limitations on networked communications.

Undergraduate Software Engineering:

Addressing the Needs of Professional Software Development In the fall semester of 1996 RIT (Rochester Institute of Technology) launched the first undergraduate software engineering program in the United States. The culmination of five years of planning, development, and review, the program was designed from the outset to prepare graduates for professional positions in commercial and industrial software development.

Bringing Arbitrary Compute to Authoritative Data:

Many disparate use cases can be satisfied with a single storage system. While the term 'big data' is vague enough to have lost much of its meaning, today’s storage systems are growing more quickly and managing more data than ever before. Consumer devices generate large numbers of photos, videos, and other large digital assets. Machines are rapidly catching up to humans in data generation through extensive recording of system logs and metrics, as well as applications such as video capture and genome sequencing. Large data sets are now commonplace, and people increasingly want to run sophisticated analyses on the data.

Who Must You Trust?:

You must have some trust if you want to get anything done. In his novel The Diamond Age, author Neal Stephenson describes a constructed society (called a phyle) based on extreme trust in one’s fellow members. Part of the membership requirements is that, from time to time, each member is called upon to undertake certain tasks to reinforce that trust. For example, a phyle member might be told to go to a particular location at the top of a cliff at a specific time, where he will find bungee cords with ankle harnesses attached. The other ends of the cords trail off into the bushes. At the appointed time he is to fasten the harnesses to his ankles and jump off the cliff.

Automated QA Testing at EA: Driven by Events:

A discussion with Michael Donat, Jafar Husain, and Terry Coatta To millions of game geeks, the position of QA (quality assurance) tester at Electronic Arts must seem like a dream job. But from the company’s perspective, the overhead associated with QA can look downright frightening, particularly in an era of massively multiplayer games.

Design Exploration through Code-generating DSLs:

High-level DSLs for low-level programming DSLs (domain-specific languages) make programs shorter and easier to write. They can be stand-alone - for example, LaTeX, Makefiles, and SQL - or they can be embedded in a host language. You might think that DSLs embedded in high-level languages would be abstract or mathematically oriented, far from the nitty-gritty of low-level programming. This is not the case. This article demonstrates how high-level EDSLs (embedded DSLs) really can ease low-level programming. There is no contradiction.

Finding More Than One Worm in the Apple:

If you see something, say something. In February Apple revealed and fixed an SSL (Secure Sockets Layer) vulnerability that had gone undiscovered since the release of iOS 6.0 in September 2012. It left users vulnerable to man-in-the-middle attacks thanks to a short circuit in the SSL/TLS (Transport Layer Security) handshake algorithm introduced by the duplication of a goto statement. Since the discovery of this very serious bug, many people have written about potential causes.

Domain-specific Languages and Code Synthesis Using Haskell:

Looking at embedded DSLs There are many ways to give instructions to a computer: an electrical engineer might write a MATLAB program; a database administrator might write an SQL script; a hardware engineer might write in Verilog; and an accountant might write a spreadsheet with embedded formulas. Aside from the difference in language used in each of these examples, there is an important difference in form and idiom. Each uses a language customized to the job at hand, and each builds computational requests in a form both familiar and productive for programmers (although accountants may not think of themselves as programmers).

The NSA and Snowden: Securing the All-Seeing Eye:

How good security at the NSA could have stopped him Edward Snowden, while an NSA (National Security Agency) contractor at Booz Allen Hamilton in Hawaii, copied up to 1.7 million top-secret and above documents, smuggling copies on a thumb drive out of the secure facility in which he worked, and later released many to the press. This has altered the relationship of the U.S. government with the American people, as well as with other countries. This article examines the computer security aspects of how the NSA could have prevented this, perhaps the most damaging breach of secrets in U.S. history.

The Curse of the Excluded Middle:

Mostly functional programming does not work. There is a trend in the software industry to sell "mostly functional" programming as the silver bullet for solving problems developers face with concurrency, parallelism (manycore), and, of course, Big Data. Contemporary imperative languages could continue the ongoing trend, embrace closures, and try to limit mutation and other side effects. Unfortunately, just as "mostly secure" does not work, "mostly functional" does not work either. Instead, developers should seriously consider a completely fundamentalist option as well: embrace pure lazy functional programming with all effects explicitly surfaced in the type system using monads.

Don’t Settle for Eventual Consistency:

Stronger properties for low-latency geo-replicated storage Geo-replicated storage provides copies of the same data at multiple, geographically distinct locations. Facebook, for example, geo-replicates its data (profiles, friends lists, likes, etc.) to data centers on the east and west coasts of the United States, and in Europe. In each data center, a tier of separate Web servers accepts browser requests and then handles those requests by reading and writing data from the storage system.

A Primer on Provenance:

Better understanding of data requires tracking its history and context. Assessing the quality or validity of a piece of data is not usually done in isolation. You typically examine the context in which the data appears and try to determine its original sources or review the process through which it was created. This is not so straightforward when dealing with digital data, however: the result of a computation might have been derived from numerous sources and by applying complex successive transformations, possibly over long periods of time.

Multipath TCP:

Decoupled from IP, TCP is at last able to support multihomed hosts. The Internet relies heavily on two protocols. In the network layer, IP (Internet Protocol) provides an unreliable datagram service and ensures that any host can exchange packets with any other host. Since its creation in the 1970s, IP has seen the addition of several features, including multicast, IPsec (IP security), and QoS (quality of service). The latest revision, IPv6 (IP version 6), supports 16-byte addresses.

Major-league SEMAT: Why Should an Executive Care?:

Becoming better, faster, cheaper, and happier In today’s ever more competitive world, boards of directors and executives demand that CIOs and their teams deliver "more with less." Studies show, without any real surprise, that there is no one-size-fits-all method to suit all software initiatives, and that a practice-based approach with some light but effective degree of order and governance is the goal of most software-development departments.

Eventually Consistent: Not What You Were Expecting?:

Methods of quantifying consistency (or lack thereof) in eventually consistent storage systems Storage systems continue to lay the foundation for modern Internet services such as Web search, e-commerce, and social networking. Pressures caused by rapidly growing user bases and data sets have driven system designs away from conventional centralized databases and toward more scalable distributed solutions, including simple NoSQL key-value storage systems, as well as more elaborate NewSQL databases that support transactions at scale.

Scaling Existing Lock-based Applications with Lock Elision:

Lock elision enables existing lock-based programs to achieve the performance benefits of nonblocking synchronization and fine-grain locking with minor software engineering effort. Multithreaded applications take advantage of increasing core counts to achieve high performance. Such programs, however, typically require programmers to reason about data shared among multiple threads. Programmers use synchronization mechanisms such as mutual-exclusion locks to ensure correct updates to shared data in the presence of accesses from multiple threads. Unfortunately, these mechanisms serialize thread accesses to the data and limit scalability.

Rate-limiting State:

The edge of the Internet is an unruly place By design, the Internet core is dumb, and the edge is smart. This design decision has enabled the Internet’s wildcat growth, since without complexity the core can grow at the speed of demand. On the downside, the decision to put all smartness at the edge means we’re at the mercy of scale when it comes to the quality of the Internet’s aggregate traffic load. Not all device and software builders have the skills and the quality assurance budgets that something the size of the Internet deserves.

The API Performance Contract:

How can the expected interactions between caller and implementation be guaranteed? When you call functions in an API, you expect them to work correctly; sometimes this expectation is called a contract between the caller and the implementation. Callers also have performance expectations about these functions, and often the success of a software system depends on the API meeting these expectations. So there’s a performance contract as well as a correctness contract. The performance contract is usually implicit, often vague, and sometimes breached (by caller or implementation). How can this aspect of API design and documentation be improved?

Provenance in Sensor Data Management:

A cohesive, independent solution for bringing provenance to scientific research In today’s information-driven workplaces, data is constantly being moved around and undergoing transformation. The typical business-as-usual approach is to use e-mail attachments, shared network locations, databases, and, more recently, the cloud. More often than not, there are multiple versions of the data sitting in different locations, and users of this data are confounded by the lack of metadata describing its provenance, or, in other words, its lineage. The ProvDMS project at the Oak Ridge National Laboratory (ORNL) described in this article aims to solve this issue in the context of sensor data.

Unikernels: Rise of the Virtual Library Operating System:

What if all the software layers in a virtual appliance were compiled within the same safe, high-level language framework? Cloud computing has been pioneering the business of renting computing resources in large data centers to multiple (and possibly competing) tenants. The basic enabling technology for the cloud is operating-system virtualization such as Xen or VMware, which allows customers to multiplex VMs (virtual machines) on a shared cluster of physical machines. Each VM presents as a self-contained computer, booting a standard operating-system kernel and running unmodified applications just as if it were executing on a physical machine.

Toward Software-defined SLAs:

Enterprise computing in the public cloud The public cloud has introduced new technology and architectures that could reshape enterprise computing. In particular, the public cloud is a new design center for enterprise applications, platform software, and services. API-driven orchestration of large-scale, on-demand resources is an important new design attribute, which differentiates public-cloud from conventional enterprise data-center infrastructure. Enterprise applications must adapt to the new public-cloud design center, but at the same time new software and system design patterns can add enterprise attributes and service levels to public-cloud services.

The Road to SDN:

An intellectual history of programmable networks Designing and managing networks has become more innovative over the past few years with the aid of SDN (software-defined networking). This technology seems to have appeared suddenly, but it is actually part of a long history of trying to make computer networks more programmable.

The Software Inferno:

Dante’s tale, as experienced by a software architect The Software Inferno is a tale that parallels The Inferno, Part One of The Divine Comedy written by Dante Alighieri in the early 1300s. That literary masterpiece describes the condemnation and punishment faced by a variety of sinners in their hell-spent afterlives as recompense for atrocities committed during their earthly existences. The Software Inferno is a similar account, describing a journey where "sinners against software" are encountered amidst their torment, within their assigned areas of eternal condemnation, and paying their penance.

Making the Web Faster with HTTP 2.0:

HTTP continues to evolve HTTP (Hypertext Transfer Protocol) is one of the most widely used application protocols on the Internet. Since its publication, RFC 2616 (HTTP 1.1) has served as a foundation for the unprecedented growth of the Internet: billions of devices of all shapes and sizes, from desktop computers to the tiny Web devices in our pockets, speak HTTP every day to deliver news, video, and millions of other Web applications we have all come to depend on in our everyday lives.

Intermediate Representation:

The increasing significance of intermediate representations in compilers Program compilation is a complicated process. A compiler is a software program that translates a high-level source language program into a form ready to execute on a computer. Early in the evolution of compilers, designers introduced IRs (intermediate representations, also commonly called intermediate languages) to manage the complexity of the compilation process. The use of an IR as the compiler’s internal representation of the program enables the compiler to be broken up into multiple phases and components, thus benefiting from modularity.
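
As a toy illustration of the idea (invented for this summary, not any particular compiler's IR), a three-address-code representation might look like this in Java: a front end lowers source programs into Instr values, and later phases operate on them without knowing the source language.

    // Toy three-address-code IR: one operator, one destination, up to two operands.
    enum Op { LOAD_CONST, ADD, MUL }

    record Instr(Op op, String dest, String src1, String src2) {}

    class IrDemo {
        public static void main(String[] args) {
            // IR for:  x = (a + b) * 4
            Instr[] ir = {
                new Instr(Op.ADD,        "t1", "a",  "b"),
                new Instr(Op.LOAD_CONST, "t2", "4",  null),
                new Instr(Op.MUL,        "x",  "t1", "t2"),
            };
            for (Instr i : ir) System.out.println(i);
        }
    }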

The Challenge of Cross-language Interoperability:

Interfacing between languages is increasingly important. Interoperability between languages has been a problem since the second programming language was invented. Solutions have ranged from language-independent object models such as COM (Component Object Model) and CORBA (Common Object Request Broker Architecture) to VMs (virtual machines) designed to integrate languages, such as JVM (Java Virtual Machine) and CLR (Common Language Runtime). With software becoming ever more complex and hardware less homogeneous, the likelihood of a single language being the correct tool for an entire program is lower than ever. As modern compilers become more modular, there is potential for a new generation of interesting solutions.

Agile and SEMAT - Perfect Partners:

Combining agile and SEMAT yields more advantages than either one alone Today, as always, many different initiatives are under way to improve the ways in which software is developed. The most popular and prevalent of these is the agile movement. One of the newer kids on the block is the SEMAT (Software Engineering Method and Theory) initiative. As with any new initiative, people are struggling to see how it fits into the world and relates to all the other things going on. For example, does it improve or replace their current ways of working?

Adopting DevOps Practices in Quality Assurance:

Merging the art and science of software development Software life-cycle management was, for a very long time, a controlled exercise. The duration of product design, development, and support was predictable enough that companies and their employees scheduled their finances, vacations, surgeries, and mergers around product releases. When developers were busy, QA (quality assurance) had it easy. As the coding portion of a release cycle came to a close, QA took over while support ramped up. Then when the product released, the development staff exhaled, rested, and started the loop again while the support staff transitioned to busily supporting the new product.

Passively Measuring TCP Round-trip Times:

A close look at RTT measurements with TCP Measuring and monitoring network RTT (round-trip time) is important for multiple reasons: it allows network operators and end users to understand their network performance and help optimize their environment, and it helps businesses understand the responsiveness of their services to sections of their user base.

Leaking Space:

Eliminating memory hogs A space leak occurs when a computer program uses more memory than necessary. In contrast to memory leaks, where the leaked memory is never released, the memory consumed by a space leak is released, but later than expected. This article presents example space leaks and how to spot and eliminate them.
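
The article's examples come from lazy functional programs, but the later-than-expected release it describes can be sketched in Java as well (buffer size and names are hypothetical): the buffer below stays reachable through a field during a long pause even though its only use is already finished, so the memory is reclaimed eventually, just much later than necessary.

    public class SpaceLeakDemo {
        private int[] data;                  // a large buffer kept reachable through this field

        long leaky() throws InterruptedException {
            data = new int[50_000_000];      // roughly 200 MB
            long total = 0;
            for (int x : data) total += x;   // the only use of the buffer
            Thread.sleep(60_000);            // long, unrelated work; buffer still reachable
            data = null;                     // released only now: a space leak, not a memory leak
            return total;
        }

        long fixed() throws InterruptedException {
            data = new int[50_000_000];
            long total = 0;
            for (int x : data) total += x;
            data = null;                     // drop the reference as soon as the buffer is done with
            Thread.sleep(60_000);            // the GC can reclaim the buffer during the pause
            return total;
        }
    }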

Barbarians at the Gateways:

High-frequency Trading and Exchange Technology I am a former high-frequency trader. For a few wonderful years I led a group of brilliant engineers and mathematicians, and together we traded in the electronic marketplaces and pushed systems to the edge of their capability.

Online Algorithms in High-frequency Trading:

The challenges faced by competing HFT algorithms HFT (high-frequency trading) has emerged as a powerful force in modern financial markets. Only 20 years ago, most of the trading volume occurred in exchanges such as the New York Stock Exchange, where humans dressed in brightly colored outfits would gesticulate and scream their trading intentions. Nowadays, trading occurs mostly in electronic servers in data centers, where computers communicate their trading intentions through network messages. This transition from physical exchanges to electronic platforms has been particularly profitable for HFT firms, which invested heavily in the infrastructure of this new environment.

The Balancing Act of Choosing Nonblocking Features:

Design requirements of nonblocking systems What is nonblocking progress? Consider the simple example of incrementing a counter C shared among multiple threads. One way to do so is by protecting the steps of incrementing C by a mutual exclusion lock L (i.e., acquire(L); old := C ; C := old+1; release(L);). If a thread P is holding L, then a different thread Q must wait for P to release L before Q can proceed to operate on C. That is, Q is blocked by P.
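
A minimal Java rendering of that counter example, with the blocking, lock-based increment next to one common nonblocking alternative (a compare-and-set retry loop); this illustrates the distinction rather than prescribing a design.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.ReentrantLock;

    public class Counters {
        // Blocking version: while one thread holds the lock, every other thread must wait.
        private int c = 0;
        private final ReentrantLock lock = new ReentrantLock();

        void incrementBlocking() {
            lock.lock();            // acquire(L)
            try {
                int old = c;        // old := C
                c = old + 1;        // C := old + 1
            } finally {
                lock.unlock();      // release(L)
            }
        }

        // Nonblocking version: no thread ever waits on a lock, though a thread may
        // have to retry its compare-and-set if another thread updated the counter first.
        private final AtomicInteger ac = new AtomicInteger(0);

        void incrementNonblocking() {
            int old;
            do {
                old = ac.get();
            } while (!ac.compareAndSet(old, old + 1));
        }
    }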

NUMA (Non-Uniform Memory Access): An Overview:

NUMA is becoming more common because memory controllers are moving closer to execution units on microprocessors. NUMA (non-uniform memory access) is the phenomenon that memory at various points in the address space of a processor has different performance characteristics. At current processor speeds, the signal path length from the processor to memory plays a significant role. Increased signal path length not only increases latency to memory but also quickly becomes a throughput bottleneck if the signal path is shared by multiple processors. The performance differences to memory were noticeable first on large-scale systems where data paths were spanning motherboards or chassis. These systems required modified operating-system kernels with NUMA support that explicitly understood the topological properties of the system’s memory (such as the chassis in which a region of memory was located) in order to avoid excessively long signal path lengths.

20 Obstacles to Scalability:

Watch out for these pitfalls that can prevent Web application scaling. Web applications can grow in fits and starts. Customer numbers can increase rapidly, and application usage patterns can vary seasonally. This unpredictability necessitates an application that is scalable. What is the best way of achieving scalability?

Rules for Mobile Performance Optimization:

An overview of techniques to speed page loading Performance has always been crucial to the success of Web sites. A growing body of research has proven that even small improvements in page-load times lead to more sales, more ad revenue, more stickiness, and more customer satisfaction for enterprises ranging from small e-commerce shops to megachains such as Walmart.

Best Practices on the Move: Building Web Apps for Mobile Devices:

Which practices should be modified or avoided altogether by developers for the mobile Web? If it wasn’t your priority last year or the year before, it’s sure to be your priority now: bring your Web site or service to mobile devices in 2013 or suffer the consequences. Early adopters have been talking about mobile taking over since 1999 - anticipating the trend by only a decade or so. Today, mobile Web traff