commit d6869d1ec4bd24cd2c3eafa534f0849b25ec5607 (patch)
author    Clyne Sullivan <tullivan99@gmail.com>  2019-02-28 17:04:22 -0500
committer Clyne Sullivan <tullivan99@gmail.com>  2019-02-28 17:04:22 -0500
tree      79e54ed27b39c31864895535d11399708d5a45c0 /arduino/libraries/FileSystem/src/littlefs/DESIGN.md
parent    614ee97bf3a2270c413527a7f35c54cbecd9e601 (diff)

    added basic code

Diffstat (limited to 'arduino/libraries/FileSystem/src/littlefs/DESIGN.md')
 arduino/libraries/FileSystem/src/littlefs/DESIGN.md | 1226 ++++++++++++++++++
 1 file changed, 1226 insertions(+), 0 deletions(-)
diff --git a/arduino/libraries/FileSystem/src/littlefs/DESIGN.md b/arduino/libraries/FileSystem/src/littlefs/DESIGN.md
new file mode 100755
index 0000000..3afb0a2
--- /dev/null
+++ b/arduino/libraries/FileSystem/src/littlefs/DESIGN.md
@@ -0,0 +1,1226 @@

## The design of the little filesystem

A little fail-safe filesystem designed for embedded systems.

```
   | | |     .---._____
  .-----.   |          |
--|o    |---| littlefs |
--|     |---|          |
  '-----'   '----------'
   | | |
```

For a bit of backstory, the littlefs was developed with the goal of learning
more about filesystem design by tackling the relatively unsolved problem of
managing a robust filesystem resilient to power loss on devices with limited
RAM and ROM.

The embedded systems the littlefs is targeting are usually 32-bit
microcontrollers with around 32KB of RAM and 512KB of ROM. These are often
paired with SPI NOR flash chips with about 4MB of flash storage.

Flash itself is a very interesting piece of technology with quite a bit of
nuance. Unlike most other forms of storage, writing to flash requires two
operations: erasing and programming. Programming is relatively cheap and can
be very granular; for NOR flash specifically, byte-level programs are quite
common. Erasing, however, is an expensive operation that forces the state of
a large block of memory to reset in a destructive reaction that gives flash
its name. The [Wikipedia entry](https://en.wikipedia.org/wiki/Flash_memory)
has more information if you are interested in how this works.

This leaves us with an interesting set of limitations that can be simplified
to three strong requirements:

1. **Power-loss resilience** - This is the main goal of the littlefs and the
   focus of this project.

   Embedded systems are usually designed without a shutdown routine and with
   a notable lack of user interface for recovery, so filesystems targeting
   embedded systems must be prepared to lose power at any given time.

   Despite this, there are very few embedded filesystems that handle power
   loss in a reasonable manner, and most can become corrupted if the user is
   unlucky enough.

2. **Wear leveling** - Due to the destructive nature of flash, most flash
   chips have a limited number of erase cycles, usually on the order of
   100,000 erases per block for NOR flash. Filesystems that don't take wear
   into account can easily burn through the blocks used to store frequently
   updated metadata.

   Consider the [FAT filesystem](https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system),
   which stores a file allocation table (FAT) at a specific offset from the
   beginning of disk. Every block allocation will update this table, and after
   100,000 updates, the block will likely go bad, rendering the filesystem
   unusable even if there are many more erase cycles available on the storage
   as a whole.

3. **Bounded RAM/ROM** - Even with the design difficulties presented by the
   previous two limitations, we have already seen several flash filesystems
   developed on PCs that handle power loss just fine, such as logging
   filesystems. However, these filesystems take advantage of relatively cheap
   access to RAM, and use some rather... opportunistic... techniques, such as
   reconstructing the entire directory structure in RAM. These operations
   make perfect sense when the filesystem's only concern is erase cycles, but
   the idea is a bit silly on embedded systems.

   To cater to embedded systems, the littlefs has the simple limitation of
   using only a bounded amount of RAM and ROM. That is, no matter what is
   written to the filesystem, and no matter how large the underlying storage
   is, the littlefs will always use the same amount of RAM and ROM. This
   presents a unique challenge, and makes presumably simple operations, such
   as iterating through the directory tree, surprisingly difficult.

## Existing designs?

There are, of course, many different existing filesystems. Here is a very
rough summary of the general ideas behind some of them.

Most existing filesystems fall into one big category: filesystems designed in
the early days of spinny magnet disks. While there is a vast amount of
interesting technology and ideas in this area, the nature of spinny magnet
disks encourages properties, such as grouping writes near each other, that
don't make as much sense on recent storage types. For instance, on flash,
write locality is not important and can actually increase wear.

One of the most popular designs for flash filesystems is called the
[logging filesystem](https://en.wikipedia.org/wiki/Log-structured_file_system).
The flash filesystems [jffs](https://en.wikipedia.org/wiki/JFFS)
and [yaffs](https://en.wikipedia.org/wiki/YAFFS) are good examples. In a
logging filesystem, data is not stored in a data structure on disk; instead
the changes to the files are stored on disk. This has several neat advantages,
such as the fact that the data is written in a cyclic log format and naturally
wear levels as a side effect. And, with a bit of error detection, the entire
filesystem can easily be designed to be resilient to power loss. The
journaling component of most modern-day filesystems is actually a reduced
form of a logging filesystem. However, logging filesystems have difficulty
scaling as the size of storage increases, and most compensate by caching
large parts of the filesystem in RAM, a strategy that is inappropriate for
embedded systems.

Another interesting filesystem design technique is [copy-on-write (COW)](https://en.wikipedia.org/wiki/Copy-on-write).
A good example of this is the [btrfs](https://en.wikipedia.org/wiki/Btrfs)
filesystem. COW filesystems can easily recover from corrupted blocks and have
natural protection against power loss. However, if they are not designed with
wear in mind, a COW filesystem could unintentionally wear down the root block
where the COW data structures are synchronized.

## Metadata pairs

The core piece of technology that provides the backbone for the littlefs is
the concept of metadata pairs. The key idea here is that any metadata that
needs to be updated atomically is stored on a pair of blocks tagged with a
revision count and checksum. Every update alternates between the two blocks
in the pair, so that at any time there is always a backup containing the
previous state of the metadata.

Consider a small example where each metadata block has a revision count,
a number as data, and the XOR of the block as a quick checksum. If we update
the data to a value of 9, and then to a value of 5, here is what the pair of
blocks may look like after each update:
```
  block 1   block 2        block 1   block 2        block 1   block 2
.---------.---------.    .---------.---------.    .---------.---------.
| rev: 1  | rev: 0  |    | rev: 1  | rev: 2  |    | rev: 3  | rev: 2  |
| data: 3 | data: 0 | -> | data: 3 | data: 9 | -> | data: 5 | data: 9 |
| xor: 2  | xor: 0  |    | xor: 2  | xor: 11 |    | xor: 6  | xor: 11 |
'---------'---------'    '---------'---------'    '---------'---------'
                 let data = 9            let data = 5
```

After each update, we can find the most up-to-date value of data by looking
at the revision count.

Now consider what the blocks may look like if we suddenly lose power while
changing the value of data to 5:
```
  block 1   block 2        block 1   block 2        block 1   block 2
.---------.---------.    .---------.---------.    .---------.---------.
| rev: 1  | rev: 0  |    | rev: 1  | rev: 2  |    | rev: 3  | rev: 2  |
| data: 3 | data: 0 | -> | data: 3 | data: 9 | -x | data: 3 | data: 9 |
| xor: 2  | xor: 0  |    | xor: 2  | xor: 11 |    | xor: 2  | xor: 11 |
'---------'---------'    '---------'---------'    '---------'---------'
                 let data = 9            let data = 5
                                         powerloss!!!
```

In this case, block 1 was partially written with a new revision count, but
the littlefs hadn't made it to updating the value of data. However, when we
check the checksum, we notice that block 1 is corrupted. So we fall back to
block 2 and use the value 9.

Using this concept, the littlefs is able to update metadata blocks atomically.
There are a few other tweaks, such as using a 32-bit CRC and using sequence
arithmetic to handle revision count overflow, but the basic concept is the
same. These metadata pairs define the backbone of the littlefs, and the rest
of the filesystem is built on top of these atomic updates.
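
If it helps to see this in code, here's a minimal sketch in C of the fetch
step, modeled directly on the toy blocks above (a revision count, one word of
data, and an XOR checksum). The names are made up for illustration and aren't
the littlefs's actual API; the real thing uses a 32-bit CRC rather than a
single XOR, but the logic has the same shape:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Toy metadata block mirroring the example above: a revision count,
// one word of data, and an XOR of the other fields as a checksum.
typedef struct {
    uint32_t rev;   // revision count, bumped on every update
    uint32_t data;  // the metadata payload (a single value here)
    uint32_t xor_;  // checksum: rev ^ data
} mdblock_t;

// A block is valid if its checksum matches its contents; a block that
// was half-written when power was lost fails this check.
static bool mdblock_valid(const mdblock_t *b) {
    return (b->rev ^ b->data) == b->xor_;
}

// Sequence arithmetic: is revision a newer than revision b, even if
// the revision counter has wrapped around?
static bool seq_newer(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) > 0;
}

// Pick the most recent valid block of a pair, falling back to the
// older block if the newer one is corrupted. Returns NULL only if
// both blocks fail their checksums.
static const mdblock_t *mdpair_fetch(const mdblock_t pair[2]) {
    bool v0 = mdblock_valid(&pair[0]);
    bool v1 = mdblock_valid(&pair[1]);
    if (v0 && v1) {
        return seq_newer(pair[0].rev, pair[1].rev) ? &pair[0] : &pair[1];
    }
    return v0 ? &pair[0] : v1 ? &pair[1] : NULL;
}
```

Run against the power-loss example, block 1 fails its checksum (3 xor 3 is 0,
not 2), so the fetch falls back to block 2 and recovers the value 9.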

## Non-meta data

Now, the metadata pairs do come with some drawbacks. Most notably, each pair
requires two blocks for each block of data. I'm sure users would be very
unhappy if their storage was suddenly cut in half! Instead of storing
everything in these metadata blocks, the littlefs uses a COW data structure
for files, which is in turn pointed to by a metadata block. When we update a
file, we create copies of any blocks that are modified until the metadata
blocks are updated with the new copy. Once the metadata block points to the
new copy, we deallocate the old blocks that are no longer in use.

Here is what updating a one-block file may look like:
```
  block 1   block 2        block 1   block 2        block 1   block 2
.---------.---------.    .---------.---------.    .---------.---------.
| rev: 1  | rev: 0  |    | rev: 1  | rev: 0  |    | rev: 1  | rev: 2  |
| file: 4 | file: 0 | -> | file: 4 | file: 0 | -> | file: 4 | file: 5 |
| xor: 5  | xor: 0  |    | xor: 5  | xor: 0  |    | xor: 5  | xor: 7  |
'---------'---------'    '---------'---------'    '---------'---------'
    |                        |                                  |
    v                        v                                  v
 block 4                  block 4    block 5          block 4    block 5
.--------.               .--------. .--------.       .--------. .--------.
| old    |               | old    | | new    |       | old    | | new    |
| data   |               | data   | | data   |       | data   | | data   |
|        |               |        | |        |       |        | |        |
'--------'               '--------' '--------'       '--------' '--------'
            update data in file          update metadata pair
```

It doesn't matter if we lose power while writing new data to block 5, since
the old data remains unmodified in block 4. This example also highlights how
the atomic updates of the metadata blocks provide a synchronization barrier
for the rest of the littlefs.

At this point, it may look like we are wasting an awfully large amount of
space on the metadata. Just looking at that example, we are using three
blocks to represent a file that fits comfortably in one! So instead of
giving each file a metadata pair, we actually store the metadata for all
files contained in a directory in a single metadata block.

Now we could just stop here: copying the entire file on every write would
provide the synchronization without the duplicated memory requirements of
the metadata blocks. However, we can do a bit better.

## CTZ skip-lists

There are many different data structures for representing the actual files
in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg)
data structure that allows the filesystem to reuse unmodified parts of a
file without additional metadata pairs.

First, let's consider storing files in a simple linked-list. What happens
when we append a block? We have to change the last block in the linked-list
to point to this new block, which means we have to copy out the last block,
and change the second-to-last block, and then the third-to-last, and so on
until we've copied out the entire file.

```
Exhibit A: A linked-list
.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
| data 0 |->| data 1 |->| data 2 |->| data 3 |->| data 4 |->| data 5 |
|        |  |        |  |        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
```

To get around this, the littlefs, at its heart, stores files backwards. Each
block points to its predecessor, with the first block containing no pointers.
If you think about it for a while, it starts to make a bit of sense. Appended
blocks just point to their predecessors, and no other blocks need to be
updated. If we update a block in the middle, we will need to copy out the
blocks that follow, but we can reuse the blocks before the modified block.
Since most file operations either rewrite the file on each write or append
to it, this design avoids copying the file in the most common cases.

```
Exhibit B: A backwards linked-list
.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
|        |  |        |  |        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
```

However, a backwards linked-list does come with a rather glaring problem.
Iterating over a file _in order_ has a runtime cost of O(n^2). Gah! A
quadratic runtime to just _read_ a file? That's awful. Keep in mind that
reading files is usually the most common filesystem operation.

To avoid this problem, the littlefs uses a multilayered linked-list. For
every nth block where n is divisible by 2^x, the block contains a pointer
to block n-2^x. So each block contains anywhere from 1 to log2(n) pointers
that skip to various sections of the preceding list. If you're familiar with
data structures, you may have recognized that this is a type of deterministic
skip-list.

The name comes from the use of the
[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros)
instruction, which allows us to calculate the power-of-two factors
efficiently. For a given block n, the block contains ctz(n)+1 pointers.

```
Exhibit C: A backwards CTZ skip-list
.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
|        |<-|        |--|        |<-|        |--|        |  |        |
|        |<-|        |--|        |--|        |--|        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
```

The additional pointers allow us to navigate the data structure on disk much
more efficiently than in a singly linked-list.

Taking exhibit C for example, here is the path from data block 5 to data
block 1. You can see how data block 3 was completely skipped:
```
.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
| data 0 |  | data 1 |<-| data 2 |  | data 3 |  | data 4 |<-| data 5 |
|        |  |        |  |        |<-|        |--|        |  |        |
|        |  |        |  |        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
```

The path to data block 0 is even quicker, requiring only two jumps:
```
.--------.  .--------.  .--------.  .--------.  .--------.  .--------.
| data 0 |  | data 1 |  | data 2 |  | data 3 |  | data 4 |<-| data 5 |
|        |  |        |  |        |  |        |  |        |  |        |
|        |<-|        |--|        |--|        |--|        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'  '--------'
```

We can find the runtime complexity by looking at the path to any block from
the block containing the most pointers. Every step along the path divides
the search space for the block in half, giving us a runtime of O(log n).
To get to the block with the most pointers, we can perform the same steps
backwards, which puts the runtime at O(2 log n) = O(log n). An interesting
property of this data structure is that the optimal path occurs naturally
if we greedily choose the pointer that covers the most distance without
passing our target block.

So now we have a representation of files that can be appended trivially with
a runtime of O(1), and can be read with a worst-case runtime of O(n log n).
Given that the runtime is also divided by the amount of data we can store in
a block, this is pretty reasonable.
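
To see the greedy traversal spelled out, here is a small self-contained C
sketch that walks between block indices using the skip rule described above
(the pointer at level s in block n points to block n-2^s). It only simulates
the index arithmetic; actually following the pointers on disk is left
abstract, and `__builtin_ctz` is the GCC/Clang intrinsic for the CTZ
instruction:

```c
#include <stdint.h>
#include <stdio.h>

// Number of pointers stored in the block at index n; block 0 is the
// head of the list and holds no pointers.
static unsigned block_pointers(uint32_t n) {
    return n == 0 ? 0 : (unsigned)__builtin_ctz(n) + 1;
}

// Walk from block index 'from' back to block index 'to' (to < from),
// greedily taking the largest skip that doesn't pass the target.
// The pointer at level s in block n points to block n - 2^s.
static unsigned ctz_walk(uint32_t from, uint32_t to) {
    unsigned steps = 0;
    while (from > to) {
        // highest level stored in this block
        unsigned s = block_pointers(from) - 1;
        // back off until the skip no longer overshoots the target
        while (((uint32_t)1 << s) > from - to) {
            s--;
        }
        from -= (uint32_t)1 << s;
        steps++;
    }
    return steps;
}

int main(void) {
    printf("5 -> 1: %u jumps\n", ctz_walk(5, 1)); // 3 jumps: 5->4->2->1
    printf("5 -> 0: %u jumps\n", ctz_walk(5, 0)); // 2 jumps: 5->4->0
    return 0;
}
```

Block 4, with its ctz(4)+1 = 3 pointers, is what makes the second path so
short.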

Unfortunately, the CTZ skip-list comes with a few questions that aren't
straightforward to answer. What is the overhead? How do we handle more
pointers than we can store in a block? How do we store the skip-list in a
directory entry?

One way to find the overhead per block is to look at the data structure as
multiple layers of linked-lists. Each linked-list skips twice as many blocks
as the previous linked-list, or, put another way, each linked-list uses half
as much storage per block as the previous linked-list. As we approach
infinity, the number of pointers per block forms a geometric series, and
solving this geometric series gives us an average of only 2 pointers per
block:

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{ctz}(i)+1\right) = \sum_{i=0}^{\infty}\frac{1}{2^i} = 2$$

Finding the maximum number of pointers in a block is a bit more complicated,
but since our file size is limited by the integer width we use to store the
size, we can solve for it. Setting the overhead of the maximum pointers equal
to the block size, we get the following equation. Note that a smaller block
size results in more pointers, and a larger word width results in larger
pointers:

$$B = \frac{w}{8}\left\lceil\log_2\!\left(\frac{2^w}{B-2\frac{w}{8}}\right)\right\rceil$$

where:
- B = block size in bytes
- w = word width in bits

Solving the equation for B gives us the minimum block size for various word
widths:
- 32-bit CTZ skip-list = minimum block size of 104 bytes
- 64-bit CTZ skip-list = minimum block size of 448 bytes

Since littlefs uses a 32-bit word size, we are limited to a minimum block
size of 104 bytes. This is a perfectly reasonable minimum, with most block
sizes starting around 512 bytes, so we can skip any additional logic to
keep the pointers from overflowing a block's capacity in the CTZ skip-list.

So, how do we store the skip-list in a directory entry? A naive approach
would be to store a pointer to the head of the skip-list, the length of the
file in bytes, the index of the head block in the skip-list, and the offset
into the head block in bytes. However, this is a lot of information, and we
can observe that a file size maps to only one block index + offset pair. So
it should be sufficient to store only the pointer and the file size.

But there is one problem: calculating the block index + offset pair from a
file size doesn't have an obvious implementation.

We can start by just writing down an equation. The first idea that comes to
mind is to use a for loop to sum together blocks until we reach our file
size. We can write this as a summation:

$$N = \sum_{i=0}^{n-1}\left(B-\frac{w}{8}\left(\mathrm{ctz}(i)+1\right)\right)$$

where:
- B = block size in bytes
- w = word width in bits
- n = block index in the skip-list
- N = file size in bytes

And this works quite well, but is not trivial to calculate. This equation
requires O(n) to compute, which brings the entire runtime of reading a file
to O(n^2 log n). Fortunately, the additional O(n) does not need to touch
disk, so it is not completely unreasonable. But if we could solve this
equation into an easily computable form, we could avoid a big slowdown.

Unfortunately, the summation of the CTZ instruction presents a big challenge.
How would you even begin to reason about integrating a bitwise instruction?
Fortunately, there is a powerful tool I've found useful in these situations:
The [On-Line Encyclopedia of Integer Sequences (OEIS)](https://oeis.org/).
If we work out the first couple of values in our summation, we find that CTZ
maps to [A001511](https://oeis.org/A001511), and its partial summation maps
to [A005187](https://oeis.org/A005187), and surprisingly, both of these
sequences have relatively trivial equations! This leads us to a rather
unintuitive property:

$$\sum_{i=1}^{n}\left(\mathrm{ctz}(i)+1\right) = 2n-\mathrm{popcount}(n)$$

where:
- ctz(x) = the number of trailing bits that are 0 in x
- popcount(x) = the number of bits that are 1 in x

It's a bit bewildering that these two seemingly unrelated bitwise
instructions are related by this property. But if we start to dissect this
equation, we can see that it does hold. As n approaches infinity, we do end
up with an average overhead of 2 pointers, as we found earlier. And popcount
seems to handle the error from this average as it accumulates in the CTZ
skip-list.

Now we can substitute into the original equation to get a trivial equation
for the file size:

$$N = Bn - \frac{w}{8}\left(2n-\mathrm{popcount}(n)\right)$$

Unfortunately, we're not quite done. The popcount function is non-injective,
so we can only find the file size from the block index, not the other way
around. However, we can solve for an n' block index that is greater than n,
with an error bounded by the range of the popcount function. We can then
repeatedly substitute this n' into the original equation until the error is
smaller than the integer division. As it turns out, we only need to perform
this substitution once, which lets us directly calculate our block index:

$$n = \left\lfloor\frac{N-\frac{w}{8}\,\mathrm{popcount}\left(\frac{N}{B-2\frac{w}{8}}\right)}{B-2\frac{w}{8}}\right\rfloor$$

Now that we have our block index n, we can just plug it back into the above
equation to find the offset. We do need to rearrange the equation a bit to
avoid integer overflow:

$$\mathrm{off} = N - \left(B-2\frac{w}{8}\right)n - \frac{w}{8}\,\mathrm{popcount}(n)$$

The solution involves quite a bit of math, but computers are very good at
math. Now we can solve for both the block index and offset from the file
size in O(1).
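
Transcribed into C, the last two equations are only a handful of operations.
This sketch assumes a 32-bit word width (so pointers cost w/8 = 4 bytes) and
uses the GCC/Clang `__builtin_popcount` intrinsic; the real littlefs code
differs in some off-by-one details, so treat this as an illustration of the
math rather than the exact implementation:

```c
#include <stdint.h>
#include <stdio.h>

// Block index and offset for byte position 'pos' in a CTZ skip-list,
// transcribed from the closed-form equations above.
static uint32_t ctz_index(uint32_t block_size, uint32_t pos,
                          uint32_t *off) {
    uint32_t b = block_size - 2*4; // block size minus 2 average pointers

    // n = floor((N - w/8*popcount(N/(B - 2*w/8))) / (B - 2*w/8))
    uint32_t n = pos / b;                               // first over-estimate
    n = (pos - 4*(uint32_t)__builtin_popcount(n)) / b;  // one substitution

    // off = N - (B - 2*w/8)*n - w/8*popcount(n)
    *off = pos - b*n - 4*(uint32_t)__builtin_popcount(n);
    return n;
}

int main(void) {
    uint32_t off;
    uint32_t n = ctz_index(512, 4096, &off);
    printf("byte 4096 lives in skip-list block %u at offset %u\n", n, off);
    return 0;
}
```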

Here is what it might look like to update a file stored with a CTZ skip-list:
```
                 block 1   block 2
               .---------.---------.
               | rev: 1  | rev: 0  |
               | file: 6 | file: 0 |
               | size: 4 | size: 0 |
               | xor: 3  | xor: 0  |
               '---------'---------'
                                  |
                                  v
 block 3     block 4     block 5     block 6
.--------.  .--------.  .--------.  .--------.
| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |
|        |<-|        |--|        |  |        |
|        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'

| update data in file
v

                 block 1   block 2
               .---------.---------.
               | rev: 1  | rev: 0  |
               | file: 6 | file: 0 |
               | size: 4 | size: 0 |
               | xor: 3  | xor: 0  |
               '---------'---------'
                                  |
                                  v
 block 3     block 4     block 5     block 6
.--------.  .--------.  .--------.  .--------.
| data 0 |<-| data 1 |<-| old    |<-| old    |
|        |<-|        |--| data 2 |  | data 3 |
|        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'
     ^           ^           ^
     |           |           |      block 7     block 8     block 9     block 10
     |           |           |     .--------.  .--------.  .--------.  .--------.
     |           |           '-----| new    |<-| new    |<-| new    |<-| new    |
     |           '-----------------| data 2 |<-| data 3 |--| data 4 |  | data 5 |
     '-----------------------------|        |--|        |--|        |  |        |
                                   '--------'  '--------'  '--------'  '--------'

| update metadata pair
v

                 block 1   block 2
               .---------.---------.
               | rev: 1  | rev: 2  |
               | file: 6 | file: 10|
               | size: 4 | size: 6 |
               | xor: 3  | xor: 14 |
               '---------'---------'
                    |
                    '-------------------------------------------------------.
 block 3     block 4     block 5     block 6                                 |
.--------.  .--------.  .--------.  .--------.                               |
| data 0 |<-| data 1 |<-| old    |<-| old    |                               |
|        |<-|        |--| data 2 |  | data 3 |                               |
|        |  |        |  |        |  |        |                               |
'--------'  '--------'  '--------'  '--------'                               |
     ^           ^           ^                                               v
     |           |           |      block 7     block 8     block 9     block 10
     |           |           |     .--------.  .--------.  .--------.  .--------.
     |           |           '-----| new    |<-| new    |<-| new    |<-| new    |
     |           '-----------------| data 2 |<-| data 3 |--| data 4 |  | data 5 |
     '-----------------------------|        |--|        |--|        |  |        |
                                   '--------'  '--------'  '--------'  '--------'
```

## Block allocation

So those two ideas provide the grounds for the filesystem. The metadata pairs
give us directories, and the CTZ skip-lists give us files. But this leaves
one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg)
of a question: how do we get those blocks in the first place?

One common strategy is to store unallocated blocks in a big free list, and
initially the littlefs was designed with this in mind. By storing a reference
to the free list in every single metadata pair, additions to the free list
could be updated atomically at the same time the replacement blocks were
stored in the metadata pair. During boot, every metadata pair had to be
scanned to find the most recent free list, but once the list was found, the
state of all free blocks became known.

However, this approach had several issues:

- There was a lot of nuanced logic for adding blocks to the free list without
  modifying the blocks, since the blocks remain active until the metadata is
  updated.
- The free list had to support both additions and removals in FIFO order
  while minimizing block erases.
- The free list had to handle the case where the filesystem completely ran
  out of blocks and could no longer add blocks to the free list.
- If we used a revision count to track the most recently updated free list,
  metadata blocks that were left unmodified were ticking time bombs that
  would cause the system to go haywire if the revision count overflowed.
- Every single metadata block wasted space to store these free list
  references.

Actually, to simplify, this approach had one massive glaring issue:
complexity.

> Complexity leads to fallibility.
> Fallibility leads to unmaintainability.
> Unmaintainability leads to suffering.

Or at least, complexity leads to increased code size, which is a problem for
embedded systems.

In the end, the littlefs adopted more of a "drop it on the floor" strategy.
That is, the littlefs doesn't actually store information about which blocks
are free on the storage. The littlefs already stores which files _are_ in
use, so to find a free block, the littlefs just takes all of the blocks that
exist and subtracts the blocks that are in use.

Of course, it's not quite that simple. Most filesystems that adopt this "drop
it on the floor" strategy either rely on some property inherent to the
filesystem, such as the cyclic-buffer structure of logging filesystems, or
use a bitmap or table stored in RAM to track free blocks, which scales with
the size of storage and is problematic when you have limited RAM. You could
iterate through every single block in storage and check it against every
single block in the filesystem on every single allocation, but that would
have an abhorrent runtime.

So the littlefs compromises. It doesn't store a bitmap the size of the
storage, but it does store a small bit-vector that acts as a fixed-size
lookahead for block allocations. During a block allocation, the lookahead
vector is checked for any free blocks. If there are none, the lookahead
region jumps forward and the entire filesystem is scanned for free blocks.

Here's what it might look like to allocate 4 blocks on a decently busy
filesystem with a 32-bit lookahead and a total of 128 blocks (512KB of
storage if blocks are 4KB):
```
boot...         lookahead:
                fs blocks: fffff9fffffffffeffffffffffff0000
scanning...     lookahead: fffff9ff
                fs blocks: fffff9fffffffffeffffffffffff0000
alloc = 21      lookahead: fffffdff
                fs blocks: fffffdfffffffffeffffffffffff0000
alloc = 22      lookahead: ffffffff
                fs blocks: fffffffffffffffeffffffffffff0000
scanning...     lookahead: fffffffe
                fs blocks: fffffffffffffffeffffffffffff0000
alloc = 63      lookahead: ffffffff
                fs blocks: ffffffffffffffffffffffffffff0000
scanning...     lookahead: ffffffff
                fs blocks: ffffffffffffffffffffffffffff0000
scanning...     lookahead: ffffffff
                fs blocks: ffffffffffffffffffffffffffff0000
scanning...     lookahead: ffff0000
                fs blocks: ffffffffffffffffffffffffffff0000
alloc = 112     lookahead: ffff8000
                fs blocks: ffffffffffffffffffffffffffff8000
```

While this lookahead approach still has an asymptotic runtime of O(n^2) to
scan all of storage, the lookahead reduces the practical runtime to a
reasonable amount. Bit-vectors are surprisingly compact: given only 16 bytes,
the lookahead can track 128 blocks. For a 4MB flash chip with 4KB blocks, the
littlefs would only need 8 passes to scan the entire storage.

The real benefit of this approach is just how much it simplified the design
of the littlefs. Deallocating blocks is as simple as forgetting they exist,
and there is absolutely no concern of bugs in the deallocation code causing
difficult-to-detect memory leaks.
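
Here's a rough sketch of the allocator in C to make the lookahead mechanics
concrete. Everything here is illustrative rather than littlefs's actual API:
`fs_traverse` is a hypothetical stand-in for walking the directory list and
every file's skip-list (stubbed out so the sketch stands alone), and the
window is a single 32-bit word as in the trace above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOOKAHEAD   32   // window size in blocks: one 32-bit word
#define BLOCK_COUNT 128  // total blocks on the device

typedef struct {
    uint32_t window;  // bit i set => block (start + i) is in use
    uint32_t start;   // first block index covered by the window
} lookahead_t;

// Hypothetical traversal: calls back once for every in-use block.
// Stubbed to pretend blocks 0-15 hold the filesystem.
static void fs_traverse(void (*cb)(void *ctx, uint32_t block), void *ctx) {
    for (uint32_t b = 0; b < 16; b++) cb(ctx, b);
}

static void mark_used(void *ctx, uint32_t block) {
    lookahead_t *la = ctx;
    uint32_t off = (block + BLOCK_COUNT - la->start) % BLOCK_COUNT;
    if (off < LOOKAHEAD) {
        la->window |= (uint32_t)1 << off;
    }
}

// Populate the window for the current region by scanning the tree.
static void lookahead_scan(lookahead_t *la) {
    la->window = 0;
    fs_traverse(mark_used, la);
}

// Allocate one block: grab any clear bit in the window; if the window
// is exhausted, slide it forward and rescan the entire filesystem.
static bool alloc_block(lookahead_t *la, uint32_t *block) {
    for (uint32_t pass = 0; pass <= BLOCK_COUNT / LOOKAHEAD; pass++) {
        for (uint32_t i = 0; i < LOOKAHEAD; i++) {
            if (!(la->window & ((uint32_t)1 << i))) {
                la->window |= (uint32_t)1 << i; // now in use
                *block = (la->start + i) % BLOCK_COUNT;
                return true;
            }
        }
        // no free block in this window: jump ahead and rescan
        la->start = (la->start + LOOKAHEAD) % BLOCK_COUNT;
        lookahead_scan(la);
    }
    return false; // storage is genuinely full
}

int main(void) {
    lookahead_t la = {0, 0};
    lookahead_scan(&la); // done once at boot
    for (int i = 0; i < 4; i++) {
        uint32_t block;
        if (alloc_block(&la, &block)) {
            printf("alloc = %u\n", block);
        }
    }
    return 0;
}
```

Note the deallocation story: there isn't one. Freeing a block just means the
traversal never reports it again.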

## Directories

Now we just need directories to store our files. Since we already have
metadata blocks that store information about files, let's just use these
metadata blocks as the directories. Maybe turn the directories into linked
lists of metadata blocks, so a directory isn't limited by the number of
files that fit in a single block. Add entries that represent other nested
directories. Drop "." and ".." entries, cause who needs them. Dust off our
hands and we now have a directory tree.

```
            .--------.
            |root dir|
            | pair 0 |
            |        |
            '--------'
            .-'    '-------------------------.
           v                                  v
      .--------.        .--------.        .--------.
      | dir A  |------->| dir A  |        | dir B  |
      | pair 0 |        | pair 1 |        | pair 0 |
      |        |        |        |        |        |
      '--------'        '--------'        '--------'
      .-'    '-.            |             .-'    '-.
     v          v           v            v          v
.--------.  .--------.  .--------.  .--------.  .--------.
| file C |  | file D |  | file E |  | file F |  | file G |
|        |  |        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'
```

Unfortunately, it turns out it's not that simple. See, iterating over a
directory tree isn't actually all that easy, especially when you're trying
to fit in a bounded amount of RAM, which rules out any recursive solution.
And since our block allocator involves iterating over the entire filesystem
tree, possibly multiple times in a single allocation, iteration needs to be
efficient.

So, as a solution, the littlefs adopted a sort of threaded tree. Each
directory not only contains pointers to all of its children, but also a
pointer to the next directory. These pointers create a linked-list that is
threaded through all of the directories in the filesystem. Since we only use
this linked-list to check for existence, the order doesn't actually matter.
As an added plus, we can repurpose the pointer for the individual directory
linked-lists and avoid using any additional space.

```
            .--------.
            |root dir|-.
            | pair 0 | |
   .--------|        |-'
   |        '--------'
   |        .-'    '-------------------------.
   |       v                                  v
   |  .--------.        .--------.        .--------.
   '->| dir A  |------->| dir A  |------->| dir B  |
      | pair 0 |        | pair 1 |        | pair 0 |
      |        |        |        |        |        |
      '--------'        '--------'        '--------'
      .-'    '-.            |             .-'    '-.
     v          v           v            v          v
.--------.  .--------.  .--------.  .--------.  .--------.
| file C |  | file D |  | file E |  | file F |  | file G |
|        |  |        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |  |        |
'--------'  '--------'  '--------'  '--------'  '--------'
```

This threaded tree approach does come with a few tradeoffs. Now, any time we
want to manipulate the directory tree, we find ourselves having to update two
pointers instead of one. For anyone familiar with creating atomic data
structures, this should set off a whole bunch of red flags.

But unlike the data structure guys, we can update a whole block atomically!
So as long as we're really careful (and cheat a little bit), we can still
manipulate the directory tree in a way that is resilient to power loss.

Consider how we might add a new directory. Since both pointers that reference
it can come from the same directory, we only need a single atomic update to
finagle the directory into the filesystem:
```
   .--------.
   |root dir|-.
   | pair 0 | |
.--|        |-'
|  '--------'
|      |
|      v
|  .--------.
'->| dir A  |
   | pair 0 |
   |        |
   '--------'

| create the new directory block
v

              .--------.
              |root dir|-.
              | pair 0 | |
           .--|        |-'
           |  '--------'
           |      |
           |      v
           |  .--------.
 .--------.'->| dir A  |
 | dir B  |-->| pair 0 |
 | pair 0 |   |        |
 |        |   '--------'
 '--------'

| update root to point to directory B
v

         .--------.
         |root dir|-.
         | pair 0 | |
.--------|        |-'
|        '--------'
|        .-'    '-.
|       v          v
|  .--------.  .--------.
'->| dir B  |->| dir A  |
   | pair 0 |  | pair 0 |
   |        |  |        |
   '--------'  '--------'
```

Note that even though directory B was added after directory A, we insert
directory B before directory A in the linked-list, because it is convenient.

Now how about removal:
```
         .--------.        .--------.
         |root dir|------->|root dir|-.
         | pair 0 |        | pair 1 | |
.--------|        |--------|        |-'
|        '--------'        '--------'
|        .-'    '-.            |
|       v          v           v
|  .--------.  .--------.  .--------.
'->| dir A  |->| dir B  |->| dir C  |
   | pair 0 |  | pair 0 |  | pair 0 |
   |        |  |        |  |        |
   '--------'  '--------'  '--------'

| update root to no longer contain directory B
v

   .--------.              .--------.
   |root dir|------------->|root dir|-.
   | pair 0 |              | pair 1 | |
.--|        |--------------|        |-'
|  '--------'              '--------'
|      |                       |
|      v                       v
|  .--------.  .--------.  .--------.
'->| dir A  |->| dir B  |->| dir C  |
   | pair 0 |  | pair 0 |  | pair 0 |
   |        |  |        |  |        |
   '--------'  '--------'  '--------'

| remove directory B from the linked-list
v

   .--------.  .--------.
   |root dir|->|root dir|-.
   | pair 0 |  | pair 1 | |
.--|        |--|        |-'
|  '--------'  '--------'
|      |           |
|      v           v
|  .--------.  .--------.
'->| dir A  |->| dir C  |
   | pair 0 |  | pair 0 |
   |        |  |        |
   '--------'  '--------'
```

Wait, wait, wait, that's not atomic at all! If power is lost after removing
directory B from the root, directory B is still in the linked-list. We've
just created a memory leak!

And to be honest, I don't have a clever solution for this case. As a
side-effect of using multiple pointers in the threaded tree, the littlefs
can end up with orphan blocks that have no parents and should have been
removed.

To keep these orphan blocks from becoming a problem, the littlefs has a
deorphan step that simply iterates through every directory in the linked-list
and checks it against every directory entry in the filesystem to see if it
has a parent. The deorphan step occurs on the first block allocation after
boot, so orphans should never cause the littlefs to run out of storage
prematurely. Note that the deorphan step never needs to run on a read-only
filesystem.
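
The deorphan pass itself is just a nested scan over the threaded linked-list.
Here's a toy, in-memory version in C; on real storage each directory would be
a metadata pair read off flash and the unlink would be one atomic pair
update, but the shape of the scan is the same (all names here are made up
for illustration):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CHILDREN 8

// In-memory stand-in for a directory: the threaded "next directory"
// pointer plus its child entries.
typedef struct dir {
    const char *name;
    struct dir *next;                   // threaded list through all dirs
    struct dir *children[MAX_CHILDREN]; // entries referencing other dirs
} dir_t;

static bool references(const dir_t *parent, const dir_t *child) {
    for (size_t i = 0; i < MAX_CHILDREN; i++) {
        if (parent->children[i] == child) return true;
    }
    return false;
}

// Walk the threaded list; for each directory, scan every directory's
// entries for a reference to it. No reference => orphan => unlink it,
// which on real storage is a single atomic metadata-pair update.
static void deorphan(dir_t *root) {
    dir_t *prev = root;
    for (dir_t *d = root->next; d; d = prev->next) {
        bool parented = false;
        for (const dir_t *p = root; p; p = p->next) {
            if (references(p, d)) { parented = true; break; }
        }
        if (parented) {
            prev = d;
        } else {
            printf("dropping orphan %s\n", d->name);
            prev->next = d->next; // unlink from the threaded list
        }
    }
}

int main(void) {
    // root -> A -> B in the thread, but only A is referenced by an
    // entry: B is an orphan left over from a badly-timed power loss
    dir_t b = {"B", NULL, {0}};
    dir_t a = {"A", &b, {0}};
    dir_t root = {"root", &a, {&a}};
    deorphan(&root);
    return 0;
}
```

On real storage, `references` means reading every entry in a metadata pair,
which is why this O(n^2) scan is reserved for the first allocation after
boot.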

## The move problem

Now we have a real problem. How do we move things between directories while
remaining power resilient? Even looking at the problem from a high level, it
seems impossible. We can update directory blocks atomically, but atomically
updating two independent directory blocks is not an atomic operation.

Here's the steps the filesystem may go through to move a directory:
```
         .--------.
         |root dir|-.
         | pair 0 | |
.--------|        |-'
|        '--------'
|        .-'    '-.
|       v          v
|  .--------.  .--------.
'->| dir A  |->| dir B  |
   | pair 0 |  | pair 0 |
   |        |  |        |
   '--------'  '--------'

| update directory B to point to directory A
v

         .--------.
         |root dir|-.
         | pair 0 | |
.--------|        |-'
|        '--------'
|     .-----'   '-.
|     |            v
|     |        .--------.
|     |     .->| dir B  |
|     |     |  | pair 0 |
|     |     |  |        |
|     |     |  '--------'
|     |     .------'
|     v    v    |
|    .--------. |
'--->| dir A  |-'
     | pair 0 |
     |        |
     '--------'

| update root to no longer contain directory A
v

     .--------.
     |root dir|-.
     | pair 0 | |
.----|        |-'
|    '--------'
|        |
|        v
|    .--------.
| .->| dir B  |
| |  | pair 0 |
| '--|        |-.
|    '--------' |
|        |      |
|        v      |
|    .--------. |
'--->| dir A  |-'
     | pair 0 |
     |        |
     '--------'
```

We can leave any orphans up to the deorphan step to collect, but that
doesn't help the case where dir A ends up with both dir B and the root dir
as parents if we lose power inconveniently.

Initially, you might think this is fine. Dir A _might_ end up with two
parents, but the filesystem will still work as intended. But then this
raises the question of what we do when dir A wears out. For other directory
blocks we can simply update the parent pointer, but for a dir with two
parents we would need to work out how to update both parents. And the check
for multiple parents would need to be carried out for every directory, even
if the directory has never been moved.

It also presents a bad user experience: since the condition of ending up
with two parents is rare, it's unlikely user-level code will be prepared for
it. Just think about how a user would recover from a multi-parented
directory. They can't just remove one directory, since remove would report
the directory as "not empty".

Other atomic filesystems simply COW the entire directory tree. But this
introduces a significant bit of complexity, which leads to code size, along
with a surprisingly expensive runtime cost during what most users assume is
a single pointer update.

Another option is to update the directory block we're moving from to point
to the destination, with a sort of predicate that we have moved if the
destination exists. Unfortunately, the omnipresent concern of wear could
cause any of these directory entries to change blocks, and changing the
entry size before a move introduces complications if it spills out of the
current directory block.

So how do we go about moving a directory atomically?

We rely on the improbableness of power loss.

Power loss during a move is certainly possible, but it's actually relatively
rare. Unless a device is writing to a filesystem constantly, it's unlikely
that a power loss will occur during filesystem activity. We still need to
handle the condition, but the runtime during a power loss takes a back seat
to the runtime during normal operations.

So what littlefs does is inelegantly simple. When littlefs moves a file, it
marks the file as "moving". This is stored as a single bit in the directory
entry and doesn't take up much space. Then littlefs moves the directory,
finishing with the complete removal of the "moving" directory entry.

```
         .--------.
         |root dir|-.
         | pair 0 | |
.--------|        |-'
|        '--------'
|        .-'    '-.
|       v          v
|  .--------.  .--------.
'->| dir A  |->| dir B  |
   | pair 0 |  | pair 0 |
   |        |  |        |
   '--------'  '--------'

| update root directory to mark directory A as moving
v

         .----------.
         |root dir  |-.
         | pair 0   | |
.--------| moving A!|-'
|        '----------'
|        .-'    '-.
|       v          v
|  .--------.  .--------.
'->| dir A  |->| dir B  |
   | pair 0 |  | pair 0 |
   |        |  |        |
   '--------'  '--------'

| update directory B to point to directory A
v

         .----------.
         |root dir  |-.
         | pair 0   | |
.--------| moving A!|-'
|        '----------'
|     .-----'   '-.
|     |            v
|     |        .--------.
|     |     .->| dir B  |
|     |     |  | pair 0 |
|     |     |  |        |
|     |     |  '--------'
|     |     .------'
|     v    v    |
|    .--------. |
'--->| dir A  |-'
     | pair 0 |
     |        |
     '--------'

| update root to no longer contain directory A
v

     .--------.
     |root dir|-.
     | pair 0 | |
.----|        |-'
|    '--------'
|        |
|        v
|    .--------.
| .->| dir B  |
| |  | pair 0 |
| '--|        |-.
|    '--------' |
|        |      |
|        v      |
|    .--------. |
'--->| dir A  |-'
     | pair 0 |
     |        |
     '--------'
```

Now, if we run into a directory entry that has been marked as "moving", one
of two things is possible: either the directory entry exists elsewhere in
the filesystem, or it doesn't. Checking is an O(n) operation, but it only
occurs in the unlikely case we lost power during a move.
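
Sketched in C against a toy in-memory directory, the whole dance is three
commits plus the recovery rule applied after boot. Each line in `move_entry`
stands for one atomic metadata-pair commit on real storage, and the entry
layout and names are made up for illustration (the real check for "exists
elsewhere" also scans every directory via the threaded list, not just one
known destination):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_MOVING 0x80 // the single "moving" bit in an entry

typedef struct {
    char    name[8];
    uint8_t flags;
    bool    used;
} entry_t;

typedef struct {
    entry_t entries[8]; // toy stand-in for a directory's metadata pair
} dir_t;

static entry_t *find(dir_t *d, const char *name) {
    for (int i = 0; i < 8; i++) {
        if (d->entries[i].used && strcmp(d->entries[i].name, name) == 0) {
            return &d->entries[i];
        }
    }
    return NULL;
}

static void insert(dir_t *d, const char *name) {
    for (int i = 0; i < 8; i++) {
        if (!d->entries[i].used) {
            d->entries[i].used = true;
            d->entries[i].flags = 0;
            strncpy(d->entries[i].name, name, sizeof d->entries[i].name - 1);
            d->entries[i].name[sizeof d->entries[i].name - 1] = '\0';
            return;
        }
    }
}

// A move is three commits; power can be lost between any two of them.
void move_entry(dir_t *src, dir_t *dst, const char *name) {
    find(src, name)->flags |= ENTRY_MOVING;      // 1: mark source as moving
    insert(dst, name);                           // 2: create destination entry
    memset(find(src, name), 0, sizeof(entry_t)); // 3: remove source entry
}

// Recovery rule for a "moving" entry found after boot: if the entry
// already exists at the destination, the move completed, so drop the
// stale source entry; otherwise the move never happened, so just
// clear the mark.
void fix_moving(dir_t *src, dir_t *dst, const char *name) {
    entry_t *e = find(src, name);
    if (e && (e->flags & ENTRY_MOVING)) {
        if (find(dst, name)) {
            memset(e, 0, sizeof(entry_t));
        } else {
            e->flags &= (uint8_t)~ENTRY_MOVING;
        }
    }
}
```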

And we can easily fix the "moving" directory entry. Since we're already
scanning the filesystem during the deorphan step, we can also check for
moving entries. If we find one, we either remove the "moving" mark or remove
the whole entry if it exists elsewhere in the filesystem.

## Wear awareness

So now that we have all of the pieces of a filesystem, we can look at a more
subtle attribute of embedded storage: the wear-down of flash blocks.

The first concern for the littlefs is that perfectly valid blocks can
suddenly become unusable. As a nice side-effect of using a COW data structure
for files, we can simply move on to a different block when a file write
fails. All modifications to files are performed on copies, so we only replace
the old file once we are sure none of the new file has errors. Directories,
on the other hand, need a different strategy.

The solution to directory corruption in the littlefs relies on the redundant
nature of the metadata pairs. If an error is detected during a write to one
of the metadata pairs, we seek out a new block to take its place. Once we
find a block without errors, we iterate through the directory tree, updating
any references to the corrupted metadata pair to point to the new metadata
block. Just like when we remove directories, we can lose power during this
operation and end up with a desynchronized metadata pair in our filesystem.
And just like when we remove directories, we leave the possibility of a
desynchronized metadata pair up to the deorphan step to clean up.

Here's what encountering a directory error may look like with all of the
directories and directory pointers fully expanded:
```
         root dir
        block 1   block 2
       .---------.---------.
       | rev: 1  | rev: 0  |--.
       |         |         |-.|
.------|         |         |-|'
|.-----|         |         |-'
||     '---------'---------'
||      |||||'--------------------------------------------------.
||      ||||'-----------------------------------------.         |
||      |||'-----------------------------.            |         |
||      ||'--------------------.         |            |         |
||      |'-------.             |         |            |         |
||      v         v            v         v            v         v
||     dir A                  dir B                  dir C
||    block 3   block 4      block 5   block 6      block 7   block 8
||   .---------.---------.  .---------.---------.  .---------.---------.
|'-->| rev: 1  | rev: 0  |->| rev: 1  | rev: 0  |->| rev: 1  | rev: 0  |
'--->|         |         |->|         |         |->|         |         |
     |         |         |  |         |         |  |         |         |
     |         |         |  |         |         |  |         |         |
     '---------'---------'  '---------'---------'  '---------'---------'

| update directory B
v

         root dir
        block 1   block 2
       .---------.---------.
       | rev: 1  | rev: 0  |--.
       |         |         |-.|
.------|         |         |-|'
|.-----|         |         |-'
||     '---------'---------'
||      |||||'--------------------------------------------------.
||      ||||'-----------------------------------------.         |
||      |||'-----------------------------.            |         |
||      ||'--------------------.         |            |         |
||      |'-------.             |         |            |         |
||      v         v            v         v            v         v
||     dir A                  dir B                  dir C
||    block 3   block 4      block 5   block 6      block 7   block 8
||   .---------.---------.  .---------.---------.  .---------.---------.
|'-->| rev: 1  | rev: 0  |->| rev: 1  | rev: 2  |->| rev: 1  | rev: 0  |
'--->|         |         |->|         | corrupt!|->|         |         |
     |         |         |  |         | corrupt!|  |         |         |
     |         |         |  |         | corrupt!|  |         |         |
     '---------'---------'  '---------'---------'  '---------'---------'

| oh no! corruption detected
v allocate a replacement block

         root dir
        block 1   block 2
       .---------.---------.
       | rev: 1  | rev: 0  |--.
       |         |         |-.|
.------|         |         |-|'
|.-----|         |         |-'
||     '---------'---------'
||      |||||'----------------------------------------------------.
||      ||||'-------------------------------------------.         |
||      |||'-----------------------------.              |         |
||      ||'--------------------.         |              |         |
||      |'-------.             |         |              |         |
||      v         v            v         v              v         v
||     dir A                  dir B                    dir C
||    block 3   block 4      block 5   block 6        block 7   block 8
||   .---------.---------.  .---------.---------.    .---------.---------.
|'-->| rev: 1  | rev: 0  |->| rev: 1  | rev: 2  |--->| rev: 1  | rev: 0  |
'--->|         |         |->|         | corrupt!|--->|         |         |
     |         |         |  |         | corrupt!| .->|         |         |
     |         |         |  |         | corrupt!| |  |         |         |
     '---------'---------'  '---------'---------' |  '---------'---------'
                                       block 9    |
                                      .---------. |
                                      | rev: 2  |-'
                                      |         |
                                      |         |
                                      |         |
                                      '---------'

| update root directory to contain block 9
v

         root dir
        block 1   block 2
       .---------.---------.
       | rev: 1  | rev: 2  |--.
       |         |         |-.|
.------|         |         |-|'
|.-----|         |         |-'
||     '---------'---------'
||      .--------'||||'----------------------------------------------.
||      |         |||'-------------------------------------.         |
||      |         ||'-----------------------.              |         |
||      |         |'------------.           |              |         |
||      |         |             |           |              |         |
||      v         v             v           v              v         v
||     dir A                   dir B                      dir C
||    block 3   block 4       block 5     block 9        block 7   block 8
||   .---------.---------.   .---------. .---------.    .---------.---------.
|'-->| rev: 1  | rev: 0  |-->| rev: 1  |-| rev: 2  |--->| rev: 1  | rev: 0  |
'--->|         |         |-. |         | |         |--->|         |         |
     |         |         | | |         | |         | .->|         |         |
     |         |         | | |         | |         | |  |         |         |
     '---------'---------' | '---------' '---------' |  '---------'---------'
                            |  block 6                |
                            | .---------.             |
                            '>| rev: 2  |-------------'
                              | corrupt!|
                              | corrupt!|
                              | corrupt!|
                              '---------'

| remove corrupted block from linked-list
v

         root dir
        block 1   block 2
       .---------.---------.
       | rev: 1  | rev: 2  |--.
       |         |         |-.|
.------|         |         |-|'
|.-----|         |         |-'
||     '---------'---------'
||      .--------'||||'-----------------------------------------.
||      |         |||'--------------------------------.         |
||      |         ||'--------------------.            |         |
||      |         |'-----------.         |            |         |
||      |         |            |         |            |         |
||      v         v            v         v            v         v
||     dir A                  dir B                  dir C
||    block 3   block 4      block 5   block 9      block 7   block 8
||   .---------.---------.  .---------.---------.  .---------.---------.
|'-->| rev: 1  | rev: 2  |->| rev: 1  | rev: 2  |->| rev: 1  | rev: 0  |
'--->|         |         |->|         |         |->|         |         |
     |         |         |  |         |         |  |         |         |
     |         |         |  |         |         |  |         |         |
     '---------'---------'  '---------'---------'  '---------'---------'
```

Also, one question I've been getting is: what about the root directory? It
can't move, so wouldn't the filesystem die as soon as the root blocks develop
errors? And you would be correct. So instead of storing the root in the
first few blocks of the storage, the root is actually pointed to by the
superblock. The superblock contains a few bits of static data, but outside
of when the filesystem is formatted, it is only updated when the root
develops errors and needs to be moved.
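
Here's the shape of that relocation strategy in C. The `extern` helpers are
hypothetical hooks (a program-and-verify on the block device, the lookahead
allocator, and the directory-tree walk), so this is a structural sketch
rather than the littlefs's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical hooks: program a block and read it back to verify,
// grab a fresh block from the lookahead allocator, and walk the
// directory tree rewriting references from one block to another.
extern bool prog_block(uint32_t block, const void *data, uint32_t size);
extern bool alloc_block(uint32_t *block);
extern void update_references(uint32_t old_block, uint32_t new_block);

typedef struct {
    uint32_t pair[2]; // the two blocks backing this metadata pair
} mdir_t;

// Commit one block of a metadata pair. While programming fails (the
// block has worn out), allocate a replacement and retry there. Once a
// good block is found, rewrite any references to the bad one. A power
// loss mid-update just leaves a desynchronized pair for the deorphan
// step to clean up.
bool commit_metadata(mdir_t *dir, int which,
                     const void *data, uint32_t size) {
    uint32_t orig = dir->pair[which];
    uint32_t block = orig;

    while (!prog_block(block, data, size)) {
        if (!alloc_block(&block)) {
            return false; // no spare blocks left anywhere
        }
    }

    if (block != orig) {
        dir->pair[which] = block;
        update_references(orig, block);
    }
    return true;
}
```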

## Wear leveling

The second concern for the littlefs is that blocks in the filesystem may
wear unevenly. In this situation, a filesystem may meet an early demise when
there are no more non-corrupted blocks that aren't in use. It's common to
have files that were written once and left unmodified, wasting the potential
erase cycles of the blocks they sit on.

Wear leveling is a term that describes distributing block writes evenly to
avoid the early termination of a flash part. There are typically two levels
of wear leveling:

1. Dynamic wear leveling - wear is distributed evenly across all **dynamic**
   blocks. Usually this is accomplished by simply choosing the unused block
   with the lowest amount of wear. Note this does not solve the problem of
   static data.
2. Static wear leveling - wear is distributed evenly across all **dynamic**
   and **static** blocks. Unmodified blocks may be evicted for new block
   writes. This does handle the problem of static data, but may lead to
   wear amplification.

In littlefs's case, it's possible to use the revision count on metadata
pairs to approximate the wear of a metadata block, and combined with the COW
nature of files, littlefs could provide the usual implementation of dynamic
wear leveling.

However, the littlefs does not. This is for a few reasons. Most notably,
even if the littlefs did implement dynamic wear leveling, this would still
not handle the case of write-once files, and near the end of the lifetime of
a flash device, you would likely end up with uneven wear on the blocks
anyways.

As a flash device reaches the end of its life, the metadata blocks will
naturally be the first to go, since they are updated most often. In this
situation, the littlefs is designed to simply move on to another set of
metadata blocks. This travelling means that at the end of a flash device's
life, the filesystem will have worn the device down nearly as evenly as the
usual dynamic wear leveling could. More aggressive wear leveling would come
with a code-size cost for marginal benefit.

One important takeaway: if your storage stack uses highly sensitive storage
such as NAND flash, static wear leveling is the only valid solution, and in
most cases you are going to be better off using a full [flash translation layer (FTL)](https://en.wikipedia.org/wiki/Flash_translation_layer).
NAND flash already has many limitations that make it poorly suited for an
embedded system: low erase cycles, very large blocks, errors that can
develop even during reads, and errors that can develop during writes of
neighboring blocks. Managing sensitive storage such as NAND flash is out of
scope for the littlefs. The littlefs does have some properties that may be
beneficial on top of an FTL, such as limiting the number of writes where
possible, but if you have storage requirements that necessitate NAND flash,
you should have the RAM to match and should just use an FTL or a full flash
filesystem.

## Summary

So, to summarize:

1. The littlefs is composed of directory blocks
2. Each directory is a linked-list of metadata pairs
3. These metadata pairs can be updated atomically by alternating which
   metadata block is active
4. Directory blocks contain either references to other directories or files
5. Files are represented by copy-on-write CTZ skip-lists, which support O(1)
   append and O(n log n) reading
6. Blocks are allocated by scanning the filesystem for used blocks in a
   fixed-size lookahead region that is stored in a bit-vector
7. To facilitate scanning the filesystem, all directories are part of a
   linked-list that is threaded through the entire filesystem
8. If a block develops an error, the littlefs allocates a new block and
   moves the data and references from the old block to the new one
9. In any case where an atomic operation is not possible, mistakes are
   resolved by a deorphan step that occurs on the first allocation after
   boot

That's the little filesystem. Thanks for reading!