Today I published lazy-init, a Rust crate that scratches an itch I’ve had for a while. lazy-init is designed for when:

1. you want to do some work (a computation, disk I/O, etc.) lazily,
2. the product of this work is immutable once it is created, and
3. you want to share this data across threads.

Rust has a good built-in solution if you only require #s 1 and 2: the Option type. But requirement #3 makes things much harder. Both of the built-in, thread-safe primitives for interior mutability have significant drawbacks, as we’ll see later. But first, the API!

    impl<T> Lazy<T> {
        /// Construct a new, uninitialized `Lazy<T>`.
        pub fn new() -> Lazy<T>;

        /// Get a reference to the contained value, invoking `f` to create it
        /// if the `Lazy<T>` is uninitialized. It is guaranteed that if multiple
        /// calls to `get_or_create` race, only one will invoke its closure, and
        /// every call will receive a reference to the newly created value.
        ///
        /// The value stored in the `Lazy<T>` is immutable after the closure returns
        /// it, so think carefully about what you want to put inside!
        pub fn get_or_create<'a, F>(&'a self, f: F) -> &'a T
            where F: FnOnce() -> T;

        /// Get a reference to the contained value, returning `Some(ref)` if the
        /// `Lazy<T>` has been initialized or `None` if it has not. It is
        /// guaranteed that if a reference is returned it is to the value inside
        /// the `Lazy<T>`.
        pub fn get<'a>(&'a self) -> Option<&'a T>;
    }

There’s a constructor and two methods, one to get an existing value and another to get_or_create the value if it does not already exist. get_or_create will ensure that the closure is invoked only once even if multiple threads race to call it on an uninitialized Lazy<T>. Simple enough, right?

Lazy<T> is actually a degenerate version of a more generic LazyTransform<T, U> included in the crate, which is initialized with a T that is later converted to a U. Lazy<Foo> is essentially LazyTransform<(), Foo>. For simplicity, I’ll refer to them interchangeably.



Rust provides two primitives for threadsafe interior mutability, std::sync::Mutex and std::sync::RwLock. Lazy<T> is better than both of them because:

- Unlike the locking types, Lazy<T> guarantees immutability after the value is created. This also means you can hold an immutable reference to the interior value without having to hold the lock.
- Unlike std::sync::Mutex, Lazy<T> does not exclude multiple readers after the value is created, and a panic while reading the value will not poison the Lazy<T>.
- Lazy<T> is at least no worse in performance compared to either locking type, and likely much better.

The first two are self-explanatory, so let’s dive into the third one. On Unix systems, std::sync::Mutex and std::sync::RwLock boil down to pthread_mutex_t and pthread_rwlock_t respectively. Lazy<T>, meanwhile, becomes a single std::sync::Mutex and a std::sync::atomic::AtomicBool.

The (slightly simplified, to elide details not relevant to synchronization) code inside get_or_create looks like:

    if !self.initialized.load(Ordering::Acquire) {
        // We *may* not be initialized. We have to block to be certain.
        let _lock = self.lock.lock().unwrap();
        if !self.initialized.load(Ordering::Relaxed) {
            // Ok, we're definitely uninitialized.
            // Safe to fiddle with the UnsafeCell now, because we're locked,
            // and there can't be any outstanding references.
            let value = unsafe { &mut *self.value.get() };
            *value = f();
            self.initialized.store(true, Ordering::Release);
        } else {
            // We raced, and someone else initialized us. We can fall
            // through now.
        }
    }
    // We're initialized, our value is immutable, no synchronization needed.
    unsafe { &*self.value.get() }



Where self.value is an UnsafeCell. This is a standard double-checked locking pattern. Jeff Preshing has a great explanation of how this pattern actually works, and why the various Ordering values are what they are here. The simple explanation is that the AtomicBool::store call with Ordering::Release after the closure synchronizes with the AtomicBool::load call with Ordering::Acquire at the top of the function. So if a thread sees the write to self.initialized, it must also see the write to self.value. If a thread doesn’t see that write, it grabs the lock. Memory accesses cannot be reordered across a lock acquisition or release (because, internally, a mutex uses semantics that are at least as strong as the acquire/release semantics mentioned before), so self.initialized is now a definitive source of truth. The lock also ensures that only one thread invokes the closure no matter how many threads are racing.

The code inside get is even simpler:

    if self.initialized.load(Ordering::Acquire) {
        // We're initialized, our value is immutable, no synchronization needed.
        Some(unsafe { &*self.value.get() })
    } else {
        None
    }

We use the same acquire semantics as before to check if we are initialized. You’ll notice that we don’t have to acquire a lock at all here. Even in get_or_create, we only have to acquire the lock if self.initialized appears to be false. Once that write propagates to all threads, Lazy<T> allows lock-free access to the underlying value at the cost of a single load-acquire check.

Best of all, with x86’s strong memory model, every load from memory has acquire semantics. The atomic operations here really just tell the compiler not to do anything crazy with reordering. This is not true on other architectures with weaker memory models. On ARM, for instance, getting load-acquire semantics does require a DMB instruction. Peter Sewell maintains a list of what the various atomic orderings map to on different architectures.

Depending on your pthreads implementation and architecture, the performance of pthread_mutex_t and pthread_rwlock_t can vary wildly. But as any sort of read-write lock needs to, at the bare minimum, both ensure there are no outstanding writers and increment the read count, a pthread_rwlock_t is never going to be any faster than the single load-acquire that Lazy<T> performs.

I hope others find this crate useful. Bug reports and pull requests are always welcome!



EDIT: Thanks to Huon for pointing out that I needed to bound the contained type with Sync.