Migrating methods in Collections

I'd like to start with the "What about Collections" discussion. While this topic is inextricably linked to all the other topics (value types, generic specialization, nullability, etc), to avoid getting overwhelmed, let's start by trying to keep our focus on the Collections API and the process of migrating it into an anyfied world. So, I'll offer an ahead-of-time warning: there's going to be lots here that you will want to question, and to many of those questions I'm going to answer "OK, but let's come back to that later", because I'd like to stay focused for the time being on the Collection-related issues. (Let's also try to create separate threads for separate discussion topics, though I know this is harder to remember to do.) Most of these are already prototyped in some form in the Valhalla repos (which at least demonstrates that they are plausible.) BackgroundMaterial ------------------- The last "State of the Specialization" (http://cr.openjdk.java.net/~briangoetz/valhalla/specialization.html, Dec 2014) outlines an early approach towards the specialization challenges and features. My talk at JVMLS 2015 (https://www.youtube.com/watch?v=uNgAFSUXuwc) outlines the progress that had been made between last December and August. The State of the Values (http://cr.openjdk.java.net/~jrose/values/values-0.html) outlines the basic model for value types. If you've not read/viewed these, that's a good place to start. Core Goals ---------- The key requirement that drove many of the design choices in Java 5 when adding generics is *gradual migration compatibility*. This means it must be possible to evolve a non-generic class to be generic in a manner that is binary-compatible and source-compatible for both clients and subtypes. So an existing client or subtype should continue to work whether it is recompiled or not, and generified or not, and it should be equally practical to generify subtypes and clients at the same time as a library class, or later, or never. We adopt similar goals for "anyfication" -- anyfying a class should be binary- and source-compatible for clients and subclasses, and it should be possible to anyfy a generic class without requiring that either its clients or subclasses be anyfied, and whether or not they are recompiled. Basic Model, Ultra-summarized ----------------------------- Just as the requirement of gradual migration compatibility was a significant design constraint in Java 5 (and which pushed us strongly towards erasure), this requirement is going to be a significant constraint here. Essentially, this means that parameterizations like ArrayList<String> need to continue to be erased (otherwise they'd not be binary compatible), but parameterizations like ArrayList<int> need to be reified (since we cannot erase int to Object without boxing, and if boxing were good enough, we wouldn't be bothering at all.) This pushes us towards an interpretation of anyfied generics that is erased over reference instantiations and reified over value instantiations. At the same time, there is code that is valid under the assumption that T <: Object that would not be valid if T can take on 'int'. These include domain assumptions (e.g., assignment to null) and identity-based assumptions (a T can be used as a monitor lock.) Accordingly, we cannot simply reinterpret existing reference-generic code to be any-generic; we need some indication from the user that they are opting into the broadened genericity domain (and therefore willing to accept the limitations of this broadened domain.) Our working model for this is to annotate the declaration of a type variable with an "any" modifier (e.g., "class Foo<any T>"). We sometimes use the abbreviation 'tvar' to describe a type variable, and 'avar' to describe an any-type variable (similarly, 'ivar' for an inference type variable.) Conceptually, the enhanced generics model is simple: - Non-anyfied generic code means exactly the same thing in Valhalla as it does in Java 5-9; - A reference instantiation of an anyfied generic class means exactly the same thing in Valhalla as it does under erased generics; - A class or method tvar declaration can be annotated with 'any', in which case it can range over value types (including primitives) as well as reference types. There are some restrictions on what you can do to an avar-typed variable to reflect the fact that values may not be nullable, have no monitor locks, that they don't participate in a subtyping relationship with Object (instead, they have a boxing conversion to Object), and that a V[] is not a subtype of Object[]; - There is a new wildcard type, Foo<any>, which is a supertype of both value and reference instantiations of Foo. The enhanced model embraces erasure more explicitly than the current model. Migration Challenges -------------------- Ideally, we could just sprinkle 'any' over our codebase, and be done -- and hopefully for many classes, this is all that will be required. But for some classes, there may be migration challenges, which come in two forms: - A method signature is fundamentally incompatible with anyfication (such as Map.get(), which returns null when the mapping is not present, which makes no sense if V=int); - The existing implementation appeals to assumptions of object-ness, and needs to be adjusted. For the purposes of this memo, I'm going to restrict myself to the first category; we'll come back to the second category later. Here's a more-or-less complete list of methods in Collections that may need some sort of migration help. A blank cell indicates "same as above." *Class** * *Method** * *Issues** * Collection contains(Object) Assumes all Rs are castable to Object without boxing. remove(Object) removeAll(Collection<?>) Should take Collection<? extends E>. Collection<?> not a supertype of all instantiations. retainAll(Collection<?>) containsAll(Collection<?>) toArray() Returns Object[], which is not a supertype of V[] for any value type V. toArray(T[]) Not strictly problematic, but relies on runtime checking that E <: T and on reflection for implementation. List remove(int) Not strictly problematic, but if remove(Object) becomes remove(T), will be a confusing overload. indexOf(Object) Same as contains(Object). Also, would be better as a lambda-consuming method, now that we have lambdas. lastIndexOf(Object) Map containsKey(Object) Same as contains(Object) containsValue(Object) remove(Object) put(K,V) Returns what was there before, or null if there was no mapping. Not strictly problematic, but doesn't project so well to non-nullable Vs. putIfAbsent, replace, computeIfAbsent get(K) Uses null to signal no mapping getOrDefault(Object, V) Same as contains(Object) Queue poll(),peek() Uses null to signal no element Deque poll(), peek() pollFirst(), pollLast() peekFirst(), peekLast() removeFirstOccurrence(), removeLastOccurrence() Same as remove(Object) It has often been suggested that "maybe it is time for Collections 2.0"; some would like to declare this to be the end of the road for existing collections. However, redoing collections is only part of the battle; the Collection types are riddled throughout the JDK and other libraries, so we would need a mechanism for migrating all the methods that consume/dispense collections, and if we were to build new collections, saddling them with compatibility with old collections would be a big stone to hang around their neck. So I think this path is less appealing that it might first appear. However, the techniques described here allow us to fix some of the errors of the past. Partial Methods --------------- Our approach is guided by the following observation: not all methods declared in a generic class Foo<any T> need be members of all instantiations of Foo; it is reasonable for some members to be restricted to certain instantiations. In particular, the instantiation Foo<all type vars instantiated with erased ref> is an important distinguished instantiation. If a method m() in Foo<any T> is a member of all instantiations of Foo<T>, we say m() is a *total* method. Otherwise we call it a *restricted* or *partial* method. For each method in an existing generic class to be anyfied, it could be made a total method in the new anyfied class, or it could be restricted to reference instantiations -- and either approach is fully compatible with existing clients and subclasses. This provides us a migration path to "leave behind" certain problematic methods (making them members only of reference instantiations), and also to add new methods as long as there is a default implementation at least for reference instantiations. As a simple illustration of this approach, let's say that we want to effectively rename the method List.remove(int) to removeAt(int). Clearly we cannot simply take away remove(int); existing clients would fail to link. Similarly we cannot simply add a new method removeAt(int) without a default; then existing subclasses would fail to compile. What we do here is (using a strawman syntax) add a new total method, with a ref default, and demote the existing method to a partial method on ref instantiations: interface List<any T> { // A new, total method void removeAt(int i); // demote the old method to ref-only <where ref T> // Just a strawman syntax void remove(int i); // give the new method a default in terms of the old, for ref <where ref T> default void removeAt(int i) { remove(i); } } Now: - Clients of List<ref T> can invoke either remove(int) or removeAt(int), and both will do the same thing; this allows existing clients to (if they want) migrate away from the old method completely, since removeAt(int) is total, or keep using the old method, at their choice (IDEs can also provide migration help); - New clients of List<any T> can only use removeAt(int) -- they won't even *see* remove(int) when they ask their IDE to auto-complete -- and this is not a compatibility issue as there were no existing clients of List<any T>; - Existing (ref) subclasses of List<T> still see the same set of abstract methods, and therefore continue to work exactly as before; - When anyfying a subclass of List<T>, now you have to provide a new implementation for removeAt(int), but this is OK because *you* have decided to anyfy your class, and it is reasonable to expect some possible code changes in this situation; - Further, AbstractList can insulate most subclasses from this change. This gives us at least two tactics for dealing with problematic methods: - Migration: leave the old method in the "ref layer", and create a new total method with a ref-default in the "any layer", as with the removeAt example above; - Abandonment: leave the old method in the ref layer, and do nothing else, which may be suitable if an alternate idiom already exists (e.g., we could abandon removeAll because now we can express removeAll(c) as removeIf(c::contains), without adding any new method.) With the controlled-migration option, the primary impediment is that many of the good names are already taken. These two tactics get us much of the way there, but not all the way there. I'm going to close this note with where these tactics get us, and then open a separate note for some of the options for the remaining cases. The following table shows some possibilities. There's plenty of time to bikeshed on the names. *Class** * *Method** * *Possible Approaches** * Collection contains(Object) Migrate to containsElement(E) remove(Object) Migrate to removeElement(E) removeAll(Collection<?>) Migrate to removeElements(Collection<? extends T>). Alternately, abandon in favor of existing removeIf(Predicate<? super E>). retainAll(Collection<?>) Same containsAll(Collection<?>) Migrate to containsElements(Collection<? extends E>). toArray() Nothing good yet ... toArray(T[]) Leave as is, or abandon in favor of new method toArray(IntFunction<E[]>), like Stream has. List remove(int) Leave as is, or migrate to removeAt(int). indexOf(Object) Migrate to Optional-bearing findFirst(Predicate<? super E>) lastIndexOf(Object) Migrate to findLast(Predicate<? super E>) Map containsKey(Object) Migrate to hasKey(K) containsValue(Object) Migrate to hasValue(V) remove(Object) Migrate to removeMapping(K) put(K,V) Leave as is; accept that we cannot distinguish between "nothing was there before" and "default value was there before" (as is true with null today.) get(K) Migrate to one (or all) of: Optional<V> map(K) mapOrElse(K, V) tryMap(K, Consumer<? super V>) getOrDefault(Object, V) Migrate to mapOrElse, as above Queue poll(), peek() Migrate to tryPoll(Consumer) *or* optional-bearing method Deque poll(), peek() pollFirst(), pollLast() peekFirst(), peekLast() removeFirstOccurrence(), removeLastOccurrence() Migrate to predicate-accepting method, or simply migrate to new name Notes: For Collection.contains() and remove(), while it may be tempting to try and narrow the argument type from Object to T, this is likely problematic. In the following case, we can't express something that seems pretty natural: Collection<Animal> animals = ... Collection<Dog> dogs = ... Dog d = ... // Can't say these for (Animal a : animals) if (dogs.contains(a)) ... animals.removeIf(a -> dogs.contains(a)) Its actually pretty convenient for contains/remove to accept anything that might be a member of the collection, but Object is not the top type we're looking for here (since values require boxing to convert to Object.) So neither the status quo (accept Object) nor recasting to a T-accepting method is all that great. (But see next memo for an alternative.) Its not clear whether all the other Object-accepting methods need the same treatment, or if migration to an E-accepting method is acceptable there. For removeAll/retainAll, we probably just want to abandon in favor of the more powerful removeIf added in 8. (Kevin/Louis -- would be good to get stats on actual usage of removeAll/retainAll.) For Map.get(), it really sucks that the good name is taken but the existing signature is terminally polluted with nullness (even for non-null-supporting maps.) We can migrate to multiple new methods (most of which have defaults in terms of the others): Optional<V> map(K k) V mapOrElse(K k, V defaultV) boolean tryMap(K k, Consumer<? super V>) With Optional as a value type, there is essentially no cost (either footprint or invocation overhead) for using Optional over a naked reference. So the first sig is not as problematic as it might look. The second has an obvious default in terms of the first. The Optional version, though, has an issue: if the map can contain null values, there's no way to express that. However, I think its OK if this method throws on null values -- the other three get-like methods (the two here plus legacy get()) can all express null values. (So null-lovers don't get to enjoy the Optional goodness. More motivation for them to give up on their nulls!)