

Author: “No Bugs” Hare

Job Title: Sarcastic Architect

Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

This is the third and final part of the article about implementing reliable persistent storage over Flash.

Previously published parts of the article:

Part I. Flash vs EEPROM

Part II. Existing Implementations by Atmel, SiLabs, TI, STM, and Microchip

In Part I we discussed the specifics of working with Flash and defined the requirements for our storage system. In Part II we analyzed five existing EEPROM-over-Flash implementations; unfortunately, none of them satisfied our requirements. Moreover, we found that none of them is a faithful EEPROM emulation (i.e. none of the considered emulations provides the same guarantees as EEPROM).

Before JoFS

In this Part III we will propose a “Journaled Flash Storage” (JoFS, not to be confused with JFS – Journaled File System): a storage system which operates over Flash while satisfying all the Requirements specified in Part I. As will be shown below, JoFS can be used as a simple (but faithful) EEPROM emulation, or as a significantly more generic ACID-compliant transactional storage.

With JoFS

In fact, JoFS is based on the same principles as the implementations described in Part II; it just fixes the problems outlined there (with one of the fixes, the one related to partial page erase, being non-trivial), and generalises over simple EEPROM emulation to enable object storage and multi-object transactions.

JoFS Generic Framework

Within JoFS there are certain design choices which depend on the intended usage of the specific JoFS instance. That’s why we will first specify the JoFS Generic Framework, which provides all the necessary properties but relies on its components to perform certain (more basic) operations with certain guarantees. Specific implementations of the JoFS components, compliant with their respective requirements, will be discussed in detail below.

JoFS Generic Framework uses two components: Frame-Writer and Page-Eraser.

Frame-Writer Requirements

Frame-Writer needs to support three operations:

WriteFrame(intra_page_addr,object_id,data). Writes a Frame into circular buffer, returns intra_page_addr of the end of frame. It is assumed that there is enough space within the circular buffer for the Frame.

ReadFrame(intra_page_addr). Reads a previously written frame, returns intra_page_addr of the end of the frame, and returns whether the frame is a correct one, a “canceled” one (see below), or a “partially written” one (i.e. previous write has been incomplete, for example, due to power loss). If the frame is a correct one, the previously written object_id and data from the frame also need to be returned.

CancelFrame(intra_page_addr). “Cancels” a previously written invalid frame, marking it as “canceled”. “Canceled” frames differ from “partially written” frames: for a “partially written” frame, its length may be undefined; for a “canceled” frame, it must be well-defined.

Above, intra_page_addr is an address within Flash page, and object_id is an identifier of the object being written or read. Frame-Writer should know size of each frame, though, depending on specific implementation of the Frame-Writer, size of each object may be constant, implicit (derived from object_id), or explicitly written to the frame.

Frame-Writer Guarantees

Frame-Writer needs to ensure that if WriteFrame() is interrupted by a power loss at any point (including power loss within underlying writeByte(), which may lead to undefined byte as specified in Requirements), on subsequent read (after the power is restored) the frame-with-interrupted-WriteFrame() MUST be recognized as one of the following: non-existing frame (nothing has been written), partially written frame, or correct frame (the last case may happen if the frame is indeed valid and contains all the data which was fed to interrupted-WriteFrame()). Also, if CancelFrame() is interrupted by a power loss at any point, on subsequent read (after the power is restored) the frame-with-interrupted-CancelFrame() MUST be recognized as either partially written frame, or canceled frame. It means that CancelFrame() operation over partially-written frame can be seen as an atomic one.

Page-Eraser Requirements

Page-Eraser is responsible for handling pages and their erasures. From the point of view of JoFS Generic Framework, each page can be in one of the following states: Valid-Data, Transfer-Ready, Transfer-In-Progress. Valid-Data is a normal page state with some data potentially present in the page. From the point of view of JoFS Generic Framework, there is always exactly one page in Transfer-Ready state; such a page has no data in it; Page-Eraser MAY rely on exactly one page being in Transfer-Ready state. Transfer-In-Progress is a temporary state, which is used during data transfer (“garbage collection”) of the data from the older pages to the newer ones; there is at most one page in Transfer-In-Progress state. All the data in Transfer-In-Progress page is always a duplicate from the data in Valid-Data pages, so it MAY be safely dropped if the transfer is interrupted.

Page-Eraser and JoFS Generic Framework rely on the concept of “page order”; pages in Valid-Data state are always ordered according to the “page creation time” of each page. How “page creation time” is represented is an implementation detail of the respective Page-Eraser; the only thing which is required from Page-Eraser is that “page creation time” MUST be strictly monotonic (this includes a prohibition on duplicate “page creation times”). In particular, it means that if wrap-around is possible for “page creation time”, it needs to be handled within Page-Eraser, without exposing its effects to JoFS Generic Framework.

Page-Eraser needs to support six operations:

AllPagesFirstInit(). AllPagesFirstInit() MUST erase (if they’re not erased yet) and initialize all the storage pages; one page MUST be initialized as a Transfer-Ready page, and all the other pages MUST be initialized as Valid-Data pages (without data); all Valid-Data pages MUST have different “page creation times”.

AllPagesInit(). AllPagesInit() may clean up inconsistencies, allowing for correct further operation. In particular: if there is a page in Transfer-In-Progress state, AllPagesInit() MUST erase it and initialize it to be in Transfer-Ready state; if there is no page in Transfer-In-Progress state, and there is no page in Transfer-Ready state, the “oldest” Valid-Data page is to be erased and brought to Transfer-Ready state.

PageErase(). Erases the “oldest” Valid-Data page and initializes it to Transfer-Ready state.

PageStartTransfer(). Changes page state from Transfer-Ready to Transfer-In-Progress. Returns page_addr of the Transfer-In-Progress page.

PageCompleteTransfer(). Replaces Transfer-In-Progress state for the page-currently-in-Transfer-In-Progress-state with Valid-Data, and sets “page creation time” which determines the order within ListOrderedValidPages(), making the page the newest page out of available Valid-Data pages.

ListOrderedValidPages(). Provides list of the pages in Valid-Data state, in the order of their respective “page creation times”, so JoFS Generic Framework can determine which of the pages contain more recent data.

Page-Eraser MUST ensure that all its modifying operations (i.e. all operations except for ListOrderedValidPages()) are atomic. That is, if any of modifying operations is interrupted by a power loss at any point (including power loss within underlying erasePage(), which may lead to undefined page data as specified in Requirements), after subsequent AllPagesInit() the operation-interrupted-by-power-loss should be seen by upper layers (i.e. by JoFS Generic Framework) as either 100% completed, or not started at all.
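To make the page state machine easier to follow, here is a toy in-memory model of it in Python. It is only a sketch: it ignores power loss entirely (which is the hard part), and the class name `InMemoryPageEraser` and its fields are made up for illustration rather than taken from any real implementation.

```python
from enum import Enum
from typing import List, Optional


class State(Enum):
    VALID_DATA = "Valid-Data"
    TRANSFER_READY = "Transfer-Ready"
    TRANSFER_IN_PROGRESS = "Transfer-In-Progress"


class InMemoryPageEraser:
    """Toy model of the Page-Eraser state machine (no power-loss handling)."""

    def __init__(self, page_count: int) -> None:
        # Mirrors AllPagesFirstInit(): one Transfer-Ready page, the rest
        # Valid-Data pages with distinct creation times 1, 2, ...
        self.state = [State.VALID_DATA] * (page_count - 1) + [State.TRANSFER_READY]
        self.created: List[Optional[int]] = list(range(1, page_count)) + [None]

    def list_ordered_valid_pages(self) -> List[int]:
        valid = [i for i, s in enumerate(self.state) if s is State.VALID_DATA]
        return sorted(valid, key=lambda i: self.created[i])

    def page_start_transfer(self) -> int:
        page = self.state.index(State.TRANSFER_READY)
        self.state[page] = State.TRANSFER_IN_PROGRESS
        return page

    def page_complete_transfer(self) -> None:
        page = self.state.index(State.TRANSFER_IN_PROGRESS)
        newest = max(t for t in self.created if t is not None)
        self.created[page] = newest + 1  # becomes the newest Valid-Data page
        self.state[page] = State.VALID_DATA

    def page_erase(self) -> None:
        oldest = self.list_ordered_valid_pages()[0]
        self.state[oldest] = State.TRANSFER_READY
        self.created[oldest] = None
```

After one full start-transfer, complete-transfer, erase cycle, the model is back to exactly one Transfer-Ready page, which is the invariant the Generic Framework relies on.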

While implementing Page-Eraser may seem trivial, ensuring correctness when power loss interrupts underlying erasePage(), is not. In fact, none of the five implementations discussed in Part II, was able to handle this failure scenario properly.

Operation of JoFS Generic Framework

Now, as we have defined Frame-Writer and Page-Eraser, we can describe the operation of the JoFS Generic Framework. The idea is to have one or more flash pages, organized as a circular buffer, which acts both as a data storage and as a “journal” of all the changes in the storage. Most of the time, one of the pages is kept in Transfer-Ready state. Whenever we’re about to run out of space in other pages, this Transfer-Ready page is used to perform Transfer (a.k.a. “garbage collection”) from the oldest-page, and then this oldest-page can be erased (and initialized to Transfer-Ready state) to free the space in the circular buffer.

Processing of JoFS Generic Framework consists of four algorithms:

JofsFirstInit(). Causes complete initialization of the storage, on the very first start. Calls PageEraser.AllPagesFirstInit().

JofsInit(). Should be called after every boot to initialize the JoFS storage system. Calls PageEraser.AllPagesInit(), and then gets an ordered list of pages via PageEraser.ListOrderedValidPages(). Then, for the oldest of the pages in the list, calls FrameWriter.ReadFrame() until the last frame, or a “partially written” frame, is encountered. If a “partially written” frame is encountered, the frame is “canceled” using FrameWriter.CancelFrame().

JofsReadObject(). Scans ordered list of all the pages returned by PageEraser.ListOrderedValidPages(). For each of the pages, scans all the frames; returns data from the last frame which corresponds to the requested object. In practice, to reduce number of scans, caching (whole or partial) is possible and recommended. Note that as this recommended caching is a read-caching, it doesn’t suffer from write-caching problems mentioned in Part II.

JofsWriteObject(). Writes are always made as FrameWriter.WriteFrame(), and are made into free space (after the last frame) of a Valid-Data page. Under no circumstances can a write be made into a page if there is any-page-with-data-which-is-newer than the page in question. Normally, a write should be made into a page-which-is-the-newest-out-of-the-Valid-Data-pages-with-data; if such a page doesn’t have enough free space for the required Frame, the next page (which is also an oldest-Valid-Data-page-without-data) is used for writing. If such a page is not available, Transfer (also known as “garbage collection”) procedure is used as follows:

PageEraser.PageStartTransfer() is invoked, returning page_addr of page T (the one where we’re going to transfer).

Then, the “oldest” Valid-Data page (let’s name it O) is scanned for all the completed (non-canceled) frames in the page O. As a result, a list of all the objects-within-these-frames is prepared (each object has its associated last occurrence within the page O).

Then, this object list is checked against all the other (newer) Valid-Data pages, and objects-which-are-present-in-newer-pages are filtered out from the list.

Now, all the objects from the list are written to the page T as Frames, using FrameWriter.WriteFrame() for each of the objects (order of these frames doesn’t matter). If desired, we may include the currently-requested object (the one which is a parameter to currently-called JofsWriteObject()) in the list of objects to be written into page T, eliminating the need to write an older version of the currently-requested object (even if such an older version was present in page O and never overwritten in a newer page), and also eliminating the need to write the currently-requested object after the Transfer procedure is completed.

Then, PageEraser.PageCompleteTransfer() is invoked.

Then, PageEraser.PageErase() is invoked, erasing page O, and making it a Transfer-Ready page.

At this point, we have another page with Valid-Data (former page T), which will be returned by PageEraser.ListOrderedValidPages() and which can be used to write the required data using normal JofsWriteObject() flow (the branch without Transfer). Note that if the currently-requested object has already been written within Transfer as described as an option above, we don’t need to write it again after the Transfer is completed.
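The read path and the object-selection step of Transfer can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: each page is modelled as a plain list of already-decoded, non-canceled `(object_id, data)` frames, and the function names `jofs_read_object` and `select_for_transfer` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A page is modelled as a list of (object_id, data) frames in write order;
# pages are ordered oldest-first, as ListOrderedValidPages() would return them.
Frame = Tuple[int, bytes]
Page = List[Frame]


def jofs_read_object(pages: List[Page], object_id: int) -> Optional[bytes]:
    """Return data from the last frame written for object_id, or None."""
    result = None
    for page in pages:            # oldest page first
        for oid, data in page:    # frames in write order
            if oid == object_id:
                result = data     # later frames override earlier ones
    return result


def select_for_transfer(pages: List[Page]) -> Dict[int, bytes]:
    """Objects from the oldest page O that must be copied to page T."""
    oldest, newer = pages[0], pages[1:]
    latest_in_oldest: Dict[int, bytes] = {}
    for oid, data in oldest:
        latest_in_oldest[oid] = data  # keep the last occurrence within O
    present_in_newer = {oid for page in newer for oid, _ in page}
    return {oid: data for oid, data in latest_in_oldest.items()
            if oid not in present_in_newer}


pages = [
    [(1, b"a0"), (2, b"b0"), (1, b"a1")],  # oldest page O
    [(2, b"b1"), (3, b"c0")],              # newer page
]
```

With the sample `pages` above, only object 1 needs to be carried over to page T: object 2 already has a newer copy in the newer page, so its old copy in O can simply be dropped with the erase.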

Strict proof of correctness of the algorithm above under the failure modes specified in Requirements, is beyond the scope of present article. Proof sketch: atomicity of all the modifying Page-Eraser operations is guaranteed by Page-Eraser. For interrupted Frame-Writer operations, we can rely on Frame-Writer being able to recognize “partially written” frame; then frame “cancel” of such “partially written” frames which is performed within JofsInit(), will effectively roll back interrupted WriteFrame() operation; this will make WriteFrame() operation essentially atomic for our purposes. CancelFrame() is atomic too, as it is guaranteed by Frame-Writer. Now, as all the modifying operations within JoFS Generic Framework are essentially atomic, all we need to do is to analyze the impact of power loss between any of the operations. Such analysis (which is admittedly quite bulky) shows that for any given point, if the operation is interrupted at that point, then all the Requirements are met.

Generic Frame-Writer Wrapper

Now, as we have a solid JoFS Generic Framework, we need to describe equally solid implementations for Frame-Writer and Page-Eraser. First, we will describe a generic compliant Frame-Writer, which provides all the necessary guarantees. Let’s name it a Generic Frame-Writer Wrapper.

The idea behind Generic Frame-Writer Wrapper is the following: we have an abstract inner-Frame with arbitrary data. Specific Frames will be defined later; the only two things which Frame needs to provide for our purposes, are object_id and size (the latter can be constant, implicit, or explicit, as described below).

Frame of Generic Frame-Writer Wrapper consists of the status byte, followed by inner-Frame data. WriteFrame() always writes inner-Frame data first, and status byte = 0xFE afterwards.

In CancelFrame():

from supplied intra_page_addr forward, it scans the page and finds the address first_free_addr of the first byte, from which all the bytes are equal to 0xFF.

for all the bytes in [intra_page_addr+1,first_free_addr) range (i.e. excluding status byte), writes them to 0x00; this is possible without erasing, due to Flash properties (we can always toggle a bit from 1 to 0, but not vice versa)

writes status byte = 0xFD

In ReadFrame():

if status byte is 0xFF, it is either a “partially written” frame, or the start of the free space in page; they can be distinguished by reading the rest of the page

if status byte is 0xFD, it indicates a “canceled” frame. “Canceled” frame is scanned while the following bytes are 0x00; the first byte which is not 0x00, is the status byte of the next frame.

if status byte is 0xFE, it indicates a normal frame with valid data. Size of such a frame is determined from the inner-Frame data.

Other values of status byte are invalid (and indicate corrupted Flash storage).
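The WriteFrame()/CancelFrame()/ReadFrame() rules above can be tried out on a simulated page. The following Python sketch is only an illustration: Flash is modelled as a byte array where a write can only clear bits, the inner-Frame layout (1-byte object_id, 1-byte length, then data) is an assumption made for this example, and the `fail_after` parameter is a hypothetical hook to simulate power loss.

```python
STATUS_FREE, STATUS_VALID, STATUS_CANCELED = 0xFF, 0xFE, 0xFD


class Page:
    """Simulated Flash page: erased bytes are 0xFF, writes can only clear bits."""

    def __init__(self, size: int = 64) -> None:
        self.mem = bytearray([0xFF] * size)

    def write_byte(self, addr: int, value: int) -> None:
        self.mem[addr] &= value  # Flash can flip 1 -> 0, never 0 -> 1


def write_frame(page, addr, object_id, data, fail_after=None):
    """Write inner-Frame first, status byte last; return end-of-frame address.

    fail_after=N simulates power loss after N inner-Frame bytes were written.
    """
    inner = bytes([object_id, len(data)]) + bytes(data)
    for i, b in enumerate(inner):
        if fail_after is not None and i >= fail_after:
            return None  # power lost: status byte stays 0xFF
        page.write_byte(addr + 1 + i, b)
    page.write_byte(addr, STATUS_VALID)
    return addr + 1 + len(inner)


def first_free_addr(page, addr):
    """Address of the first byte from which all bytes up to page end are 0xFF."""
    end = len(page.mem)
    while end > addr and page.mem[end - 1] == 0xFF:
        end -= 1
    return end


def cancel_frame(page, addr):
    """Zero the partially written bytes, then mark the frame as canceled."""
    for a in range(addr + 1, first_free_addr(page, addr)):
        page.write_byte(a, 0x00)
    page.write_byte(addr, STATUS_CANCELED)


def read_frame(page, addr):
    """Return (kind, end_addr, object_id, data) for the frame at addr."""
    status = page.mem[addr]
    if status == STATUS_FREE:
        if first_free_addr(page, addr) == addr:
            return ("free", addr, None, None)
        return ("partial", None, None, None)  # length is undefined
    if status == STATUS_CANCELED:
        end = addr + 1
        while end < len(page.mem) and page.mem[end] == 0x00:
            end += 1
        return ("canceled", end, None, None)
    if status == STATUS_VALID:
        object_id, length = page.mem[addr + 1], page.mem[addr + 2]
        end = addr + 3 + length
        return ("valid", end, object_id, bytes(page.mem[addr + 3:end]))
    raise ValueError("corrupted Flash storage")
```

A typical recovery sequence is then: an interrupted `write_frame()` is reported as "partial" by `read_frame()`, `cancel_frame()` turns it into a "canceled" frame with a well-defined end, and the next frame can be written right after it.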

The implementation above provides guarantees which are necessary for JoFS Generic Framework, but only under “Weaker” version of Requirements (see Part I for details). Strict proof is beyond the scope of the present article, but it can be obtained by analysis of failures for each of writeByte() operations involved in the described algorithms. If “Stronger” version of Requirements is needed, it can be obtained by splitting the status into 2 bytes, with one of the status bytes carrying bit0 of original status, and another status byte carrying bit1 of original status.

It should be noted that the order of writing of inner-Frame is not essential for Generic Frame-Writer to satisfy our Requirements.

Inner-Frame for EEPROM Emulation

Now we can proceed to describing inner-Frames, which can be optimized for different purposes. One common scenario for Flash storage is EEPROM emulation. In this case, the inner-Frame can, for example, consist of an address (i.e. the EEPROM address will act as an object_id from the JoFS point of view) and a value. As long as the size of both the address and the value is always constant, there is no need to store frame size within the frame itself. If both address and value are 2 bytes in size, it will make our inner-Frame look pretty much like a frame in [STM] or [Microchip] (though the full JoFS frame will be one byte larger due to the status byte in Generic Frame-Writer Wrapper).
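As a small illustration, such a fixed-size inner-Frame could be encoded as follows in Python. The little-endian byte order and the helper names are assumptions made for this sketch; the article does not prescribe them.

```python
import struct

# Assumed layout: 2-byte little-endian address followed by a 2-byte value.
EEPROM_INNER_FRAME = struct.Struct("<HH")


def pack_eeprom_frame(address: int, value: int) -> bytes:
    """Encode an (address, value) pair as a 4-byte inner-Frame."""
    return EEPROM_INNER_FRAME.pack(address, value)


def unpack_eeprom_frame(raw: bytes):
    """Decode a 4-byte inner-Frame back into (address, value)."""
    return EEPROM_INNER_FRAME.unpack(raw)
```

Because the size is constant (4 bytes here), a reader can step from frame to frame without any length field.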

Inner-Frame for Object Storage

If we want to go beyond plain EEPROM emulation, we can say that our inner-Frame consists of object_id, and object_data. Size of object_data may be constant, implicitly derived from object_id, or explicitly written within the frame itself. This will allow us to provide an API for object storage (which is the thing usually needed by the app), and (as discussed below) will allow us to improve storage efficiency for objects which are larger than 2 bytes in size.

Inner-Frame for Transactional Storage

As a nice side effect of JoFS circular buffer implementation, any JoFS Frame in fact provides not only Validity and Durability, but also Atomicity. As will be discussed later, this will allow us to achieve full ACID transactional properties for our storage. However, if we want to have transactions which involve multiple objects, we need to store more than one object within the same inner-Frame. In this case, an inner-Frame may look (for example) like a list of (object_id,data) pairs; the end of the list may be, for example, labeled with an impossible object_id. As with inner-Frame for Object Storage, size of data may be constant, implicitly derived from object_id, or explicitly written within the frame itself.
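One possible encoding of such a multi-object inner-Frame is sketched below in Python. It assumes a 2-byte object_id, an explicit 1-byte length per object, and 0xFFFF as the "impossible" terminating object_id; all of these are illustrative choices, not something fixed by JoFS.

```python
import struct

END_OF_LIST = 0xFFFF  # assumed "impossible" object_id used as terminator


def pack_transaction(objects):
    """Encode [(object_id, data), ...] as one inner-Frame."""
    out = bytearray()
    for object_id, data in objects:
        if object_id == END_OF_LIST:
            raise ValueError("object_id is reserved as the list terminator")
        out += struct.pack("<HB", object_id, len(data)) + data
    out += struct.pack("<H", END_OF_LIST)
    return bytes(out)


def unpack_transaction(raw):
    """Decode an inner-Frame back into a list of (object_id, data) pairs."""
    objects, pos = [], 0
    while True:
        (object_id,) = struct.unpack_from("<H", raw, pos)
        pos += 2
        if object_id == END_OF_LIST:
            return objects
        length = raw[pos]
        pos += 1
        objects.append((object_id, bytes(raw[pos:pos + length])))
        pos += length
```

Since the whole list sits inside a single Frame, either all of the objects become visible (status byte written) or none of them do.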

Deterministic Page-Eraser for “Weaker” Requirements

Now, as we’ve described how to implement compliant Frame-Writers, we need to start addressing the more complicated task of implementing a compliant Page-Eraser. As noted before, this is quite a non-trivial task and none of the five implementations analyzed in Part II, has properly solved this problem. We will describe two different compliant implementations for Page-Eraser. The first one is deterministic, but provides guarantees only for “Weaker” version of Requirements; the second one can provide guarantees under “Stronger” version of Requirements, but is probabilistic (though probability of failure can be made as small as necessary at a very low cost).

Let’s describe a Deterministic Page-Eraser which provides necessary guarantees for “Weaker” Requirements. With Deterministic Page-Eraser each page has a header, which consists of 4-byte inverse_page_creation_time, and 1-byte kinda_state flag, that can take values Transfer-Ready=0xFF, Transfer-In-Progress=0xFE, and Valid-Data=0xFC. Note that while kinda_state is similar to state as described in “Page-Eraser Requirements” section, they are not strictly identical.

It is very important that inverse_page_creation_time is written to the header as a bitwise negation of the real page_creation_time (for this article, we’ll denote bitwise negation as a unary ~ operator). That is, if the page_creation_time is 1, it MUST be written to the page as ~1 = 0xFFFFFFFE. We’ll discuss later why this inversion/negation is important. In the rest of this article, whenever we say something about a page’s page_creation_time, we actually mean ~inverse_page_creation_time (always treated as an unsigned 32-bit integer). It is also important that page_creation_time never wraps around; with 4-byte page_creation_time and usual limits on the number of Flash erasures being in the range of 10’000 to 100’000, we have about 1e4 reserve in this regard (i.e. pages will start to fail due to physical restrictions orders of magnitude earlier than we risk wrapping around); however, if it ever becomes a problem, expanding page_creation_time to use more than 32 bits is very straightforward.
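The conversion between the two forms is a one-liner in each direction. Here is a Python sketch (the function names are hypothetical); the mask is needed because Python integers are unbounded, while the header field is an unsigned 32-bit value.

```python
MASK32 = 0xFFFFFFFF


def to_inverse(page_creation_time: int) -> int:
    """Value stored in the page header: bitwise negation, as unsigned 32-bit."""
    return ~page_creation_time & MASK32


def from_inverse(inverse_page_creation_time: int) -> int:
    """Recover page_creation_time from the stored header field."""
    return ~inverse_page_creation_time & MASK32
```

Note that a freshly erased header (0xFFFFFFFF) decodes to page_creation_time 0, i.e. older than any page that has ever been made Valid-Data.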

Then, an implementation of our Deterministic Page-Eraser can be described as follows:

AllPagesFirstInit(). AllPagesFirstInit() erases and initializes all the storage pages; one page is initialized as a Transfer-Ready page, and all the other pages are initialized as Valid-Data pages (without data); the first of the Valid-Data pages has page_creation_time equal to 1, the next one has page_creation_time equal to 2, and so on.

AllPagesInit() – Properties. Detailed logic for AllPagesInit() will be described later; for now, it is important to note that it cleans up all the states which arise from the incomplete operations.

PageErase(). Finds the Valid-Data page with the smallest page_creation_time and erases it; as a side effect, erasure leads to kinda_state being 0xFF (equivalent to Transfer-Ready state), and inverse_page_creation_time being 0xFFFFFFFF (equivalent to page_creation_time being 0).

PageStartTransfer(). Writes Transfer-In-Progress (0xFE) to kinda_state (overwriting former Transfer-Ready=0xFF). Returns page_addr of the Transfer-In-Progress page.

PageCompleteTransfer(). First, writes ~(maximum_for_all_pages(page.page_creation_time)+1) to transfer_in_progress_page.inverse_page_creation_time. Then, writes Valid-Data (0xFC) to kinda_state (overwriting former Transfer-In-Progress=0xFE).

ListOrderedValidPages(). Reads all the pages, finds out those which have kinda_state=Valid-Data, and sorts them by page_creation_time. Provides resulting list.

AllPagesInit() – Implementation. Now we’re ready to describe implementation of AllPagesInit(). AllPagesInit() cleans up inconsistencies as follows:

if there is a page with Transfer-Ready status, but not entirely empty, it means that erasePage() operation has been interrupted. To recover from this inconsistency, the page needs to be erased (note that there can be only one page with Transfer-Ready status).

if there is a page with kinda_state being Transfer-In-Progress, it means that the Transfer has been interrupted. To recover from this situation, it is necessary to erase this page. Note that in such scenario there can be only one Transfer-In-Progress page, and no Transfer-Ready pages.

if there are no Transfer-Ready pages and no Transfer-In-Progress pages, it means that the erasePage() operation has been interrupted. In such a case, the page with the “oldest” page_creation_time must be erased, regardless of its kinda_state. As a side effect of erasure, it will result in the page having a Transfer-Ready state. For further reference, let’s name this clean-up operation “Erase-The-Oldest-if-No-Transfer-Ready Clean-Up”.
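The decision logic of ListOrderedValidPages() and of the three AllPagesInit() clean-up rules can be sketched as follows in Python. This is an illustration only: page headers are assumed to be already read into a hypothetical `PageHeader` record, and the function returns the list of pages to erase instead of touching any Flash.

```python
from dataclasses import dataclass
from typing import List

TRANSFER_READY, TRANSFER_IN_PROGRESS, VALID_DATA = 0xFF, 0xFE, 0xFC
MASK32 = 0xFFFFFFFF


@dataclass
class PageHeader:
    page_addr: int
    kinda_state: int
    inverse_page_creation_time: int
    fully_erased: bool  # True if every byte of the page reads 0xFF

    @property
    def page_creation_time(self) -> int:
        return ~self.inverse_page_creation_time & MASK32


def list_ordered_valid_pages(pages: List[PageHeader]) -> List[int]:
    """Valid-Data pages, oldest first."""
    valid = [p for p in pages if p.kinda_state == VALID_DATA]
    return [p.page_addr for p in sorted(valid, key=lambda p: p.page_creation_time)]


def pages_to_erase_on_init(pages: List[PageHeader]) -> List[int]:
    """page_addr values that AllPagesInit() would erase, per the three rules."""
    ready = [p for p in pages if p.kinda_state == TRANSFER_READY]
    in_progress = [p for p in pages if p.kinda_state == TRANSFER_IN_PROGRESS]
    # Rule 1: a Transfer-Ready page that is not entirely empty.
    to_erase = [p.page_addr for p in ready if not p.fully_erased]
    # Rule 2: an interrupted Transfer.
    to_erase += [p.page_addr for p in in_progress]
    # Rule 3: Erase-The-Oldest-if-No-Transfer-Ready Clean-Up.
    if not ready and not in_progress:
        oldest = min(pages, key=lambda p: p.page_creation_time)
        to_erase.append(oldest.page_addr)
    return to_erase
```

Rule 3 deliberately looks at all pages, not only Valid-Data ones: a partially erased page may carry a kinda_state that matches none of the three defined values, and it still has to be found by its page_creation_time.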

Proof of correctness of the Deterministic Page-Eraser is based on the following two Lemmas.

Lemma 1. For a number written in a usual binary form, changing of any of the number’s bits from 1 to 0, cannot possibly lead to the number increasing.

The proof should be fairly obvious: changing one single bit from 1 to 0 in usual binary form of the unsigned integer is equivalent to subtracting 2^i (where i is a bit number), which is always positive; changing more than one bit will result in several similar subtractions, which means that the resulting_number cannot be larger than the original_number.

Lemma 2. For a number written in an inverted binary form, a partial Flash erasure (the one where bits can change only from 0 to 1, according to “Weaker” version of Requirements) cannot possibly lead to the number increasing.

The proof is based on Lemma 1.
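For the sceptical reader, Lemma 2 is easy to check by brute force on a small word size. The Python sketch below does it for 8-bit numbers (the lemma itself does not depend on the width); it is a sanity check, not a substitute for the proof.

```python
BITS = 8
MASK = (1 << BITS) - 1


def lemma2_holds() -> bool:
    """Check Lemma 2 exhaustively for all 8-bit values and erase patterns."""
    for value in range(MASK + 1):
        stored = ~value & MASK               # inverted form kept in Flash
        for erased_bits in range(MASK + 1):  # any subset of bits flipped 0 -> 1
            partially_erased = stored | erased_bits
            decoded = ~partially_erased & MASK
            if decoded > value:
                return False
    return True
```

The reason it always holds is visible in the code: decoding gives `value & ~erased_bits`, i.e. the original number with some bits changed from 1 to 0, which is exactly the situation covered by Lemma 1.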

Strict proof of the correctness of the Deterministic Page-Eraser is beyond the scope of present article. Proof sketch: the proof is based on analysis of the recovery from power loss happening at any point during operation of Deterministic Page-Eraser. In most cases, such analysis is relatively simple; however, the scenario of power loss which leads to a partial page erasure is of specific interest. In this case, as before the erasePage() the page being erased is guaranteed to have the smallest page_creation_time, then according to Lemma 2, at any point during erasePage() this guarantee still stands. This means that while kinda_state may change during erasePage(), this page is guaranteed to be the “oldest” one at all points during erasePage(), so Erase-The-Oldest-if-No-Transfer-Ready Clean-Up will lead to erasing the-page-which-wasn’t-completely-erased, which is exactly the desired effect.

Probabilistic Page-Eraser for “Stronger” Requirements

As noted above, correctness of Deterministic Page-Eraser stands only under “Weaker” version of our Requirements. To satisfy “Stronger” version of Requirements, another algorithm can be used – Probabilistic Page-Eraser.

With Probabilistic Page-Eraser each page has a header, which consists of 4-byte page_creation_time, and 1-byte kinda_state flag, which can take values Transfer-Ready, Transfer-In-Progress, and Valid-Data (exact values are not important for Probabilistic Page-Eraser). Unlike for Deterministic Page-Eraser, format of storing page_creation_time is not important.

Overall, Probabilistic Page-Eraser operates in a manner similar to Deterministic Page-Eraser, with the following differences:

PageErase() is implemented as follows: before erasing the page, a “special Frame” is written within the current most-recent Valid-Data page (not in the page which is about to be erased); the frame can be identified, for example, by an impossible object_id. This “special Frame” consists of the page_addr of the page being erased, and a random-looking but well-known Special-Frame Signature (for example, the signature can have 128-bit length). After the page is erased, the “special Frame” in the Valid-Data page is canceled (effectively removing the Special-Frame Signature, preferably zeroing the signature too).

In AllPagesInit(), instead of Erase-The-Oldest-if-No-Transfer-Ready Clean-Up, the following clean-up is used. If there is no Transfer-Ready page and no Transfer-In-Progress page, then Probabilistic Page-Eraser scans all the pages in search of the “special Frame” described above (there MUST be exactly one page containing such a frame, otherwise it means that the JoFS has been corrupted). If there is such a frame (with a Special-Frame Signature exactly matching the signature-which-is-written-in-special-Frames), then the page which has the page_addr mentioned in the “special Frame” is the one to be erased and brought to Transfer-Ready state.
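The clean-up scan can be sketched as follows in Python. Everything concrete here is an assumption made for illustration: the reserved object_id, the 2-byte page_addr field, and the signature constant itself (any fixed random-looking 128-bit value would do). Frames are assumed to be already decoded, non-canceled `(object_id, data)` pairs collected from all pages.

```python
import struct
from typing import Iterable, Optional, Tuple

SPECIAL_OBJECT_ID = 0xFFFE  # hypothetical "impossible" object_id
# Hypothetical 128-bit signature; any fixed random-looking constant would do.
SIGNATURE = bytes.fromhex("8f3a1c5e9b7d42e6a1c0d4f7b2695e13")


def make_special_frame(page_addr: int) -> Tuple[int, bytes]:
    """Frame written before erasePage(): page_addr followed by the signature."""
    return SPECIAL_OBJECT_ID, struct.pack("<H", page_addr) + SIGNATURE


def find_page_to_erase(frames: Iterable[Tuple[int, bytes]]) -> Optional[int]:
    """Scan non-canceled frames; return page_addr named by the special Frame."""
    found = []
    for object_id, data in frames:
        if (object_id == SPECIAL_OBJECT_ID
                and len(data) == 2 + len(SIGNATURE)
                and data[2:] == SIGNATURE):
            found.append(struct.unpack_from("<H", data)[0])
    if len(found) > 1:
        raise ValueError("JoFS storage is corrupted: several special Frames")
    return found[0] if found else None
```

If no special Frame is found, the sketch returns None, meaning that no erasePage() was in flight when the power was lost.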

The proof of the correctness of Probabilistic Page-Eraser is similar to the one for Deterministic Page-Eraser. However, the “scenario of interest” which we’ve described above for Deterministic Page-Eraser is considered differently. With “Stronger” version of our Requirements, whenever we’re interrupted in the middle of erasePage(), we cannot make any assumptions about the data we read from the page, so we have an impossible-to-solve problem: how to distinguish the page-being-erased (which can have any data according to “Stronger” version of our requirements) from all the other pages? With Probabilistic Page-Eraser, we’re saying that in such a scenario, exactly one of the pages must have a non-canceled “special Frame” with a well-known Frame Signature. While theoretically, the page-being-erased can get exactly the same signature, in practice chances of it are very low (if we can assume that probabilities of 0 and 1 are the same, then chances of having a pre-defined random signature are of the order of 2^-120, and can be easily lowered further if necessary; for non-even probabilities and random pre-defined signature with about-the-same number of 0 and 1s, the analysis is more complicated, but the result won’t be significantly worse than 2^-120 or so). Therefore, chances of obtaining two “special Frames” (one legitimate, and another spurious due to erasePage()) can be made as small as necessary.

JoFS as an EEPROM Emulation and Simple Object Storage

The algorithms described above allow us to construct storage which has the Validity, Durability, and Atomicity properties. As was discussed in Part II, a faithful EEPROM emulation requires only a subset of these properties (namely Validity and Durability), so our JoFS is a faithful EEPROM emulation too. The additional implementation complexity of JoFS (compared to the more complicated algorithms such as [STM] and [Microchip]) is rather low.

In addition, JoFS has low space overhead. In particular, with the inner-Frame for EEPROM emulation described above, JoFS has a space overhead which is just a little bit (20% to be exact) more than that of [STM] and [Microchip] (one additional byte per Frame is necessary to provide the Validity and Durability guarantees). However, it also provides an object-level API, which is normally much more space-efficient; for example, if the average object is 10 bytes in size, then our JoFS Storage (using the inner-Frame for Object Storage) is about 1.5x more space-efficient than [STM] and [Microchip].

On ACID

As we’ve already discussed above, JoFS storage provides Validity, Durability, and Atomicity guarantees. In order to become a full-scale ACID transactional storage, we need to support Consistency and Isolation properties. In the ACID context, Consistency is usually defined as one or more of the following [WikiConsistency]:

The guarantee that any transactions started in the future necessarily see the effects of other transactions committed in the past

The guarantee that database constraints are not violated, particularly once a transaction commits

The guarantee that operations in transactions are performed accurately, correctly, and with validity, with respect to application semantics

The first of these guarantees is satisfied automatically as soon as we restrict our storage to one single outstanding transaction. While such a restriction can be a problem for traditional databases serving millions of simultaneous users, for MCU environments the restriction looks quite reasonable. If this restriction needs to be removed, Consistency in this sense can still be provided, though at the cost of a significant increase in complexity.

The second of these guarantees is all about database constraints; as we don’t enforce any such constraints, at least formally we do guarantee Consistency (again, adding constraints to JoFS is possible, but as of now we’re not sure if they are of practical use).

The third of the guarantees is defined quite vaguely, but it seems to us that our JoFS does provide it too.

Now let’s see how JoFS stands in relation to the Isolation property. Isolation in one of its most stringent forms can be formulated as “the system behaves as if transactions are serialized”. As soon as we have restricted our JoFS to be single-connection only (see above), Isolation comes automatically (as transactions are already serialized).

Therefore, we can say that JoFS is ACID-compliant, with the following observations:

JoFS as described doesn’t support concurrent transactions; there can be only one outstanding transaction at every point in time

JoFS as described doesn’t implement constraints for intra-storage data
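
The single-outstanding-transaction restriction, on which both the Consistency and the Isolation arguments rely, can be enforced with a very small guard at the API level. The sketch below is only an illustration: the names (jofs_storage, jofs_begin(), jofs_commit(), and the status codes) are hypothetical and do not describe the API of any actual JoFS implementation, and it is assumed that both functions are called from a single execution context (no calls from interrupt handlers).

```c
#include <stdbool.h>

/* Hypothetical storage handle; a real one would also hold page state. */
typedef struct {
    bool tx_open;   /* true while a transaction is outstanding */
} jofs_storage;

typedef enum {
    JOFS_OK = 0,
    JOFS_ERR_TX_BUSY,   /* a transaction is already outstanding */
    JOFS_ERR_NO_TX      /* commit requested without a transaction */
} jofs_status;

/* Starts a transaction; refuses to start a second one. */
jofs_status jofs_begin(jofs_storage *s)
{
    if (s->tx_open)
        return JOFS_ERR_TX_BUSY;
    s->tx_open = true;
    return JOFS_OK;
}

/* Ends the outstanding transaction. */
jofs_status jofs_commit(jofs_storage *s)
{
    if (!s->tx_open)
        return JOFS_ERR_NO_TX;
    /* A real implementation would make the transaction durable here. */
    s->tx_open = false;
    return JOFS_OK;
}
```

With such a guard, transactions are serialized by construction: a second jofs_begin() fails until the first transaction has been committed.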

Comparison with Existing EEPROM Emulations

To compare JoFS with existing EEPROM emulations, we can extend Tables from Part II to include JoFS:

| Implementation | Resilience to Power Loss while Outside of EEPROM Emulator | Resilience to Power Loss around Erasure | Resilience to Power Loss while Writing Frame | Resilience to Partially Completed Writes while Writing Frame | Resilience to Partially Completed Erasure | Is a Faithful EEPROM Emulation? | Vulnerability Window |
|---|---|---|---|---|---|---|---|
| Any with Write Caching | No | No | No | No | No | No | Largest (Worst) |
| [Atmel] | Yes | No | No | No | No | No | 2nd Largest |
| [TI] | Yes | No | Potentially Yes⁷ | | No | No | 3rd Largest |
| [STM], [Microchip] | Yes | Yes | No | No | No | No | 2nd Smallest |
| [SiLabs] | Yes | Yes | Yes | Probably Yes⁸ | No | Not Exactly | Smallest |
| JoFS | Yes | Yes | Yes | Yes | Yes | Yes | None (Best) |

| Implementation | Is a Faithful EEPROM Emulation? | Vulnerability Window | Is ACID Compliant? | Number of Byte Writes Under Example Conditions⁹ (EEPROM_Size = 10 bytes) | Number of Byte Writes Under Example Conditions⁹ (EEPROM_Size = 256 bytes) | Number of Object Writes Under Example Conditions¹⁰ (EEPROM_Size = 256 bytes, Obj_Size = 10 bytes) | Number of Object Writes Under Example Conditions¹⁰ (EEPROM_Size = 256 bytes, Obj_Size = 32 bytes) |
|---|---|---|---|---|---|---|---|
| Any with Write Caching | No | Largest (Worst) | No | Depends on Caching | Depends on Caching | Depends on Caching | Depends on Caching |
| [Atmel] | No | 2nd Largest | No | 100*N | 4*N | 4*N | 4*N |
| [TI] | No | 3rd Largest | No | 50*N to 500*N¹¹ | N/A | N/A | N/A |
| [STM], [Microchip] | No | 2nd Smallest | No | 256*N | 256*N | ~50*N | ~16*N |
| [SiLabs] | Not Exactly | Smallest | No | 100*N | 4*N | 4*N | 4*N |
| JoFS | Yes | None (Best) | Yes¹² | ~200*N | ~200*N | ~75*N | ~25*N |

JoFS Further Uses

JoFS can be used in a multitude of different scenarios. As described above, it can be used as an EEPROM emulation, or as a more efficient object storage. In addition, as JoFS guarantees ACID properties, it can be used as a building block for Flash-based file systems or for Flash-based databases.

When using JoFS as a building block for Flash-based file systems or for Flash-based databases, it should be understood that if the size of the JoFS storage is large, the page scanning inherent to JoFS (as well as to any circular-buffer-based solution) can become expensive. To mitigate this issue, non-trivial read caching can (and probably should) be used for larger storage sizes.

Alternatively, one can build a “hybrid” system with only a “catalog” of objects being JoFS-based (and ACID-compliant), and objects themselves being stored directly in Flash (outside of Flash pages allocated for JoFS). This is close to the approach of traditional journaling file systems. From the database point of view, in such a “hybrid” approach JoFS can be used as an ACID-compliant transaction log (a.k.a. transaction journal), and the rest of the database can reside outside of JoFS. In some implementations (where the log/journal is used only to track changes), JoFS can be simplified by removing “garbage collection” (as only changes are tracked, the oldest Flash page can be erased as soon as there are no active transactions there).

Implementation

Currently a JoFS implementation is being worked on as a part of an open-source “Zepto OS” project, which is in turn a part of an open-source SmartAnthill project. Within “Zepto OS”, the first working (though inevitably buggy) version of JoFS is expected approximately by the end of 2015 (which is not too far from now). It is expected that the “Zepto OS” implementation of JoFS will be able to work standalone (without requiring the rest of “Zepto OS”).

Conclusion

We have presented JoFS (Journaled Flash Storage), which can be used either as an EEPROM emulation, or as an ACID-Compliant transactional storage (which in turn can be used as a building block for databases and file systems). When used as an EEPROM Emulation, it is a faithful EEPROM Emulation (as defined in Part II). When used as an ACID-Compliant transactional storage, JoFS guarantees all the ACID properties; while JoFS as described doesn’t support the full functionality of traditional databases, all the functionality it does provide is 100% ACID-Compliant (i.e. all upper layers can safely rely on JoFS to be ACID-Compliant, as long as they’re working within JoFS functionality).

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.