This is an informal overview of the major library enhancements in Java SE 8 to take advantage of new language features, primarily lambda expressions and default methods, specified by JSR 335 and implemented in the OpenJDK Lambda Project. The new language features for Java SE8 are described in Lambda Expression in Java 8, which should be read first.

Background

Had lambda expressions (closures) been part of the Java language from the beginning, our Collections APIs would certainly look different than they do today. As the Java language acquires lambda expressions as part of JSR 335, this has the unfortunate side effect of making our Collections interfaces look even more out of date! While it might be tempting to start from scratch and build a replacement collections framework ("Collections II"), replacing the collections framework would be a major task, as Collections interfaces permeate the entire Java ecosystem, and the adoption lag would be many years. Instead, we pursue an evolutionary strategy of adding extension methods to existing interfaces (such as Collection , List , or Iterable ), and adding a stream abstraction (for example, java.util.stream.Stream ) for performing aggregate operations on datasets, and retrofitting existing classes to provide stream views, enabling many of the desired idioms without making people trade in their trusty ArrayList s and HashMap s. (This is not to say that Collections will never be replaced; clearly, there are limitations beyond simply not being designed for lambdas. A more modern collections framework may be considered for a future version of the JDK.)

A key driver for this work is making parallelism more accessible to developers. While the Java platform provides strong support for concurrency and parallelism already, developers face unnecessary impediments in migrating their code from sequential to parallel as needed. Therefore, it is important to encourage idioms that are both sequential- and parallel-friendly. This is facilitated by shifting the focus towards describing what computation should be performed, rather than how it should be performed. It is also important to strike the balance between making parallelism easier but not going so far as to make it invisible; our goal is explicit but unobtrusive parallelism. (Making parallelism transparent would introduce nondeterminism and the possibility of data races where users might not expect it.)

Internal vs. External Iteration

The collections framework relies on the concept of external iteration, where a Collection provides, by implementing Iterable , a means to enumerate its elements, and clients use this to step sequentially through the elements of a collection. For example, if we wanted to set the color of every shape in a collection of shapes to red, we would write:

for (Shape s : shapes) { s.setColor(RED); }

This example illustrates external iteration; the for-each loop calls the iterator() method of shapes , and steps through the collection one by one. External iteration is straightforward enough, but it has several problems:

Java's for-each loop is inherently sequential, and must process the elements in the order specified by the collection.

loop is inherently sequential, and must process the elements in the order specified by the collection. It deprives the library method of the opportunity to manage the control flow, which might be able to provide better performance by exploiting reordering of the data, parallelism, short-circuiting, or laziness.

Sometimes the strong guarantees of the for-each loop (sequential, in-order) are desirable, but often are just an impediment to performance. The alternative to external iteration is internal iteration, where instead of controlling the iteration, the client delegates that to the library and passes in snippets of code to execute at various points in the computation.

The internal-iteration equivalent of the previous example is:

shapes.forEach(s -> s.setColor(RED));

This may appear to be a small syntactic change, but the difference is significant. The control of the operation has been handed off from the client code to the library code, allowing the libraries not only to abstract over common control flow operations, but also enabling them to potentially use laziness, parallelism, and out-of-order execution to improve performance. (Whether an implementation of forEach actually does any of these things is for the implementation to determine; but with internal iteration, they are at least possible, whereas with external iteration, they are not.)

Where external iteration mixes the what (color the shapes red) and the how (get an Iterator and iterate it sequentially), internal iteration lets the client dictate the what but lets the library control the how. This offers several potential benefits: client code can be clearer because it need only focus on stating the problem, not the details of how to go about solving it, and we can move complex optimization code into libraries where it can benefit all users.

Streams

The key new library abstraction introduced in Java SE 8 is a stream, defined in package java.util.stream . (There are several stream types; Stream<T> represents a stream of object references, and there are specializations such as IntStream to describe streams of primitives.) A stream represents a sequence of values, and exposes a set of aggregate operations that allow us to express common manipulations on those values easily and clearly. The libraries provide convenient ways to obtain stream views of collections, arrays, and other data sources.

Stream operations are chained together into pipelines. For example, if we wanted to color only the blue shapes red, we could say:

shapes.stream() .filter(s -> s.getColor() == BLUE) .forEach(s -> s.setColor(RED));

The stream() method on Collection produces a stream view of the elements of that collection; the filter() operation then produces a stream containing the shapes that are blue, and these elements are then made red by forEach() .

If we wanted to collect the blue shapes into a new List , we could say:

List<Shape> blue = shapes.stream() .filter(s -> s.getColor() == BLUE) .collect(Collectors.toList());

The collect() operation collects the input elements into an aggregate (such as a List ) or a summary description; the argument to collect() is a recipe for how to do this aggregation. In this case, we use toList() , which is a simple recipe that accumulates the elements into a List . (More detail on collect() will be discussed shortly.)

If each shape were contained in a Box , and we wanted to know which boxes contained at least one blue shape, we could say:

Set<Box> hasBlueShape = shapes.stream() .filter(s -> s.getColor() == BLUE) .map(s -> s.getContainingBox()) .collect(Collectors.toSet());

The map() operation produces a stream whose values are the result of applying a mapping function (here, one that takes a shape and returns its containing box) to each element in its input stream.

If we wanted to add up the total weight of the blue shapes, we could express that as:

int sum = shapes.stream() .filter(s -> s.getColor() == BLUE) .mapToInt(s -> s.getWeight()) .sum();

So far, we haven't provided much detail about the specific signatures of the Stream operations shown; these examples simply illustrate the types of problems that the streams framework is designed to address.

Streams vs. Collections

Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing the computational operations which will be performed in aggregate on that source. Accordingly, streams differ from collections in several ways:

No storage. Streams don't have storage for values; they carry values from a source (which could be a data structure, a generating function, an I/O channel, etc) through a pipeline of computational steps.

Functional in nature. Operations on a stream produce a result, but do not modify its underlying data source.

Laziness-seeking. Many stream operations, such as filtering, mapping, sorting, or duplicate removal) can be implemented lazily. This facilitates efficient single-pass execution of entire pipelines, as well as facilitating efficient implementation of short-circuiting operations.

Bounds optional. There are many problems that are sensible to express as infinite streams, letting clients consume values until they are satisfied. (If we were enumerating perfect numbers, it is easy to express this as a filtering operation on the stream of all integers.) While a Collection is constrained to be finite, a stream is not. (To terminate in finite time, a stream pipeline with an infinite source can use short-circuiting operations; alternately, you can request an Iterator from a Stream and traverse it manually.)

As an API, Streams is completely independent from Collections . While it is easy to use a collection as the source for a stream ( Collection has stream() and parallelStream() methods) or to dump the elements of a stream into a collection (using the collect() operation as shown earlier), aggregates other than Collection can be used as sources for streams as well. Many JDK classes, such as BufferedReader , Random , and BitSet , have been retrofitted to act as sources for streams, and Arrays.stream() provides stream view of arrays. In fact, anything that can be described with an Iterator can be used as a stream source, though if more information is available (such as size or metadata about stream contents like "sortedness"), the library can provide an optimized execution.

Laziness

Operations such as filtering or mapping can be performed eagerly (where the filtering is performed on all elements before the filter method returns), or lazily (where the Stream representing the filtered result only applies the filter to elements from its source as needed.) Performing computations lazily, where practical, can be beneficial. For example, if we perform filtering lazily, we can fuse the filtering with other operations later in the pipeline, so as not to require multiple passes on the data. Similarly, if we are searching a large collection for the first element that matches a given criteria, we can stop once we find one, rather than processing the entire collection. (This is especially important for infinite sources; laziness is merely an optimization for finite sources, it makes operations on infinite sources possible, whereas an eager approach would never terminate.)

Operations such as filtering and mapping can be thought of as naturally lazy, whether or not they are implemented as such. On the other hand, value-producing operations such as sum() , or side-effect-producing operations such as forEach() , are "naturally eager," because they must produce a concrete result.

In a pipeline such as:

int sum = shapes.stream() .filter(s -> s.getColor() == BLUE) .mapToInt(s -> s.getWeight()) .sum();

the filtering and mapping operations are lazy. This means that we don't start drawing elements from the source until we start the sum operation, and when we do perform the sum operation, we fuse filtering, mapping, and addition into a single pass on the data. This minimizes the bookkeeping costs required to manage intermediate elements.