Let's Build a Simple Database
Writing a sqlite clone from scratch in C
https://cstack.github.io/db_tutorial

Part 6 - The Cursor Abstraction
<p>This should be a shorter part than the last one. We’re just going to refactor a bit to make it easier to start the B-Tree implementation.</p> <p>We’re going to add a <code class="language-plaintext highlighter-rouge">Cursor</code> object which represents a location in the table. Things you might want to do with cursors:</p> <ul> <li>Create a cursor at the beginning of the table</li> <li>Create a cursor at the end of the table</li> <li>Access the row the cursor is pointing to</li> <li>Advance the cursor to the next row</li> </ul> <p>Those are the behaviors we’re going to implement now. Later, we will also want to:</p> <ul> <li>Delete the row pointed to by a cursor</li> <li>Modify the row pointed to by a cursor</li> <li>Search a table for a given ID, and create a cursor pointing to the row with that ID</li> </ul> <p>Without further ado, here’s the <code class="language-plaintext highlighter-rouge">Cursor</code> type:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+typedef struct {
+  Table* table;
+  uint32_t row_num;
+  bool end_of_table;  // Indicates a position one past the last element
+} Cursor;
</span></code></pre></div></div> <p>Given our current table data structure, all you need to identify a location in a table is the row number.</p> <p>A cursor also has a reference to the table it’s part of (so our cursor functions can take just the cursor as a parameter).</p> <p>Finally, it has a boolean called <code class="language-plaintext highlighter-rouge">end_of_table</code>.
This is so we can represent a position past the end of the table (which is somewhere we may want to insert a row).</p> <p><code class="language-plaintext highlighter-rouge">table_start()</code> and <code class="language-plaintext highlighter-rouge">table_end()</code> create new cursors:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+Cursor* table_start(Table* table) {
+  Cursor* cursor = malloc(sizeof(Cursor));
+  cursor-&gt;table = table;
+  cursor-&gt;row_num = 0;
+  cursor-&gt;end_of_table = (table-&gt;num_rows == 0);
+
+  return cursor;
+}
+
+Cursor* table_end(Table* table) {
+  Cursor* cursor = malloc(sizeof(Cursor));
+  cursor-&gt;table = table;
+  cursor-&gt;row_num = table-&gt;num_rows;
+  cursor-&gt;end_of_table = true;
+
+  return cursor;
+}
</span></code></pre></div></div> <p>Our <code class="language-plaintext highlighter-rouge">row_slot()</code> function will become <code class="language-plaintext highlighter-rouge">cursor_value()</code>, which returns a pointer to the position described by the cursor:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-void* row_slot(Table* table, uint32_t row_num) {
</span><span class="gi">+void* cursor_value(Cursor* cursor) {
+  uint32_t row_num = cursor-&gt;row_num;
</span>   uint32_t page_num = row_num / ROWS_PER_PAGE;
<span class="gd">-  void* page = get_page(table-&gt;pager, page_num);
</span><span class="gi">+  void* page = get_page(cursor-&gt;table-&gt;pager, page_num);
</span>   uint32_t row_offset = row_num % ROWS_PER_PAGE;
   uint32_t byte_offset = row_offset * ROW_SIZE;
   return page + byte_offset;
 }
</code></pre></div></div> <p>Advancing the cursor in our current table structure is as simple as incrementing the row number.
This will be a bit more complicated in a B-tree.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void cursor_advance(Cursor* cursor) {
+  cursor-&gt;row_num += 1;
+  if (cursor-&gt;row_num &gt;= cursor-&gt;table-&gt;num_rows) {
+    cursor-&gt;end_of_table = true;
+  }
+}
</span></code></pre></div></div> <p>Finally, we can change our “virtual machine” methods to use the cursor abstraction. When inserting a row, we open a cursor at the end of the table, write to that cursor location, then close the cursor.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Row* row_to_insert = &amp;(statement-&gt;row_to_insert);
<span class="gi">+  Cursor* cursor = table_end(table);
</span>
-  serialize_row(row_to_insert, row_slot(table, table-&gt;num_rows));
<span class="gi">+  serialize_row(row_to_insert, cursor_value(cursor));
</span>   table-&gt;num_rows += 1;

+  free(cursor);
<span class="gi">+
</span>   return EXECUTE_SUCCESS;
 }
</code></pre></div></div> <p>When selecting all rows in the table, we open a cursor at the start of the table, print the row, then advance the cursor to the next row. Repeat until we’ve reached the end of the table.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ExecuteResult execute_select(Statement* statement, Table* table) {
<span class="gi">+  Cursor* cursor = table_start(table);
+
</span>   Row row;
<span class="gd">-  for (uint32_t i = 0; i &lt; table-&gt;num_rows; i++) {
-    deserialize_row(row_slot(table, i), &amp;row);
</span><span class="gi">+  while (!(cursor-&gt;end_of_table)) {
+    deserialize_row(cursor_value(cursor), &amp;row);
</span>     print_row(&amp;row);
<span class="gi">+    cursor_advance(cursor);
</span>   }
<span class="gi">+
+  free(cursor);
+
</span>   return EXECUTE_SUCCESS;
 }
</code></pre></div></div> <p>Alright, that’s it!
Like I said, this was a shorter refactor that should help us as we rewrite our table data structure into a B-Tree. <code class="language-plaintext highlighter-rouge">execute_select()</code> and <code class="language-plaintext highlighter-rouge">execute_insert()</code> can interact with the table entirely through the cursor without assuming anything about how the table is stored.</p> <p>Here’s the complete diff for this part:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -78,6 +78,13 @@</span> struct {
 } Table;

+typedef struct {
<span class="gi">+  Table* table;
+  uint32_t row_num;
+  bool end_of_table;  // Indicates a position one past the last element
+} Cursor;
+
</span> void print_row(Row* row) {
   printf("(%d, %s, %s)\n", row-&gt;id, row-&gt;username, row-&gt;email);
 }
<span class="p">@@ -126,12 +133,38 @@</span> void* get_page(Pager* pager, uint32_t page_num) {
   return pager-&gt;pages[page_num];
 }

-void* row_slot(Table* table, uint32_t row_num) {
<span class="gd">-  uint32_t page_num = row_num / ROWS_PER_PAGE;
-  void *page = get_page(table-&gt;pager, page_num);
-  uint32_t row_offset = row_num % ROWS_PER_PAGE;
-  uint32_t byte_offset = row_offset * ROW_SIZE;
-  return page + byte_offset;
</span><span class="gi">+Cursor* table_start(Table* table) {
+  Cursor* cursor = malloc(sizeof(Cursor));
+  cursor-&gt;table = table;
+  cursor-&gt;row_num = 0;
+  cursor-&gt;end_of_table = (table-&gt;num_rows == 0);
+
+  return cursor;
+}
+
+Cursor* table_end(Table* table) {
+  Cursor* cursor = malloc(sizeof(Cursor));
+  cursor-&gt;table = table;
+  cursor-&gt;row_num = table-&gt;num_rows;
+  cursor-&gt;end_of_table = true;
+
+  return cursor;
+}
+
+void* cursor_value(Cursor* cursor) {
+  uint32_t row_num = cursor-&gt;row_num;
+  uint32_t page_num = row_num / ROWS_PER_PAGE;
+  void *page = get_page(cursor-&gt;table-&gt;pager, page_num);
+  uint32_t row_offset = row_num % ROWS_PER_PAGE;
+  uint32_t byte_offset = row_offset * ROW_SIZE;
+  return page + byte_offset;
+}
+
+void cursor_advance(Cursor* cursor) {
+  cursor-&gt;row_num += 1;
+  if (cursor-&gt;row_num &gt;= cursor-&gt;table-&gt;num_rows) {
+    cursor-&gt;end_of_table = true;
+  }
</span> }

 Pager* pager_open(const char* filename) {
<span class="p">@@ -327,19 +360,28 @@</span> ExecuteResult execute_insert(Statement* statement, Table* table) {
   }

   Row* row_to_insert = &amp;(statement-&gt;row_to_insert);
<span class="gi">+  Cursor* cursor = table_end(table);
</span>
-  serialize_row(row_to_insert, row_slot(table, table-&gt;num_rows));
<span class="gi">+  serialize_row(row_to_insert, cursor_value(cursor));
</span>   table-&gt;num_rows += 1;

+  free(cursor);
<span class="gi">+
</span>   return EXECUTE_SUCCESS;
 }

 ExecuteResult execute_select(Statement* statement, Table* table) {
<span class="gi">+  Cursor* cursor = table_start(table);
+
</span>   Row row;
<span class="gd">-  for (uint32_t i = 0; i &lt; table-&gt;num_rows; i++) {
-    deserialize_row(row_slot(table, i), &amp;row);
</span><span class="gi">+  while (!(cursor-&gt;end_of_table)) {
+    deserialize_row(cursor_value(cursor), &amp;row);
</span>     print_row(&amp;row);
<span class="gi">+    cursor_advance(cursor);
</span>   }
<span class="gi">+
+  free(cursor);
+
</span>   return EXECUTE_SUCCESS;
 }
</code></pre></div></div>
Sun, 10 Sep 2017 00:00:00 +0000
https://cstack.github.io/db_tutorial/parts/part6.html

Part 7 - Introduction to the B-Tree
<p>The B-Tree is the data structure SQLite uses to represent both tables and indexes, so it’s a pretty central idea.
This article will just introduce the data structure, so it won’t have any code.</p> <p>Why is a tree a good data structure for a database?</p> <ul> <li>Searching for a particular value is fast (logarithmic time)</li> <li>Inserting / deleting a value you’ve already found is fast (constant-ish time to rebalance)</li> <li>Traversing a range of values is fast (unlike a hash map)</li> </ul> <p>A B-Tree is different from a binary tree (the “B” probably stands for the inventor’s name, but could also stand for “balanced”). Here’s an example B-Tree:</p> <table class="image"> <caption align="bottom">example B-Tree (https://en.wikipedia.org/wiki/File:B-tree.svg)</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/B-tree.png"><img src="https://cstack.github.io/db_tutorial/assets/images/B-tree.png" alt="example B-Tree (https://en.wikipedia.org/wiki/File:B-tree.svg)" /></a></td></tr> </table> <p>Unlike a binary tree, each node in a B-Tree can have more than 2 children. Each node can have up to m children, where m is called the tree’s “order”. To keep the tree mostly balanced, we also say nodes have to have at least m/2 children (rounded up).</p> <p>Exceptions:</p> <ul> <li>Leaf nodes have 0 children</li> <li>The root node can have fewer than m children but must have at least 2</li> <li>If the root node is a leaf node (the only node), it still has 0 children</li> </ul> <p>The picture from above is a B-Tree, which SQLite uses to store indexes. 
To store tables, SQLite uses a variation called a B+ tree.</p> <table> <thead> <tr> <th> </th> <th>B-tree</th> <th>B+ tree</th> </tr> </thead> <tbody> <tr> <td>Pronounced</td> <td>“Bee Tree”</td> <td>“Bee Plus Tree”</td> </tr> <tr> <td>Used to store</td> <td>Indexes</td> <td>Tables</td> </tr> <tr> <td>Internal nodes store keys</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Internal nodes store values</td> <td>Yes</td> <td>No</td> </tr> <tr> <td>Number of children per node</td> <td>Fewer</td> <td>More</td> </tr> <tr> <td>Internal nodes vs. leaf nodes</td> <td>Same structure</td> <td>Different structure</td> </tr> </tbody> </table> <p>Until we get to implementing indexes, I’m going to talk solely about B+ trees, but I’ll just refer to them as B-trees or btrees.</p> <p>Nodes with children are called “internal” nodes. Internal nodes and leaf nodes are structured differently:</p> <table> <thead> <tr> <th>For an order-m tree…</th> <th>Internal Node</th> <th>Leaf Node</th> </tr> </thead> <tbody> <tr> <td>Stores</td> <td>keys and pointers to children</td> <td>keys and values</td> </tr> <tr> <td>Number of keys</td> <td>up to m-1</td> <td>as many as will fit</td> </tr> <tr> <td>Number of pointers</td> <td>number of keys + 1</td> <td>none</td> </tr> <tr> <td>Number of values</td> <td>none</td> <td>number of keys</td> </tr> <tr> <td>Key purpose</td> <td>used for routing</td> <td>paired with value</td> </tr> <tr> <td>Stores values?</td> <td>No</td> <td>Yes</td> </tr> </tbody> </table> <p>Let’s work through an example to see how a B-tree grows as you insert elements into it. To keep things simple, the tree will be order 3. That means:</p> <ul> <li>up to 3 children per internal node</li> <li>up to 2 keys per internal node</li> <li>at least 2 children per internal node</li> <li>at least 1 key per internal node</li> </ul> <p>An empty B-tree has a single node: the root node.
The root node starts as a leaf node with zero key/value pairs:</p> <table class="image"> <caption align="bottom">empty btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree1.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree1.png" alt="empty btree" /></a></td></tr> </table> <p>If we insert a couple of key/value pairs, they are stored in the leaf node in sorted order.</p> <table class="image"> <caption align="bottom">one-node btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree2.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree2.png" alt="one-node btree" /></a></td></tr> </table> <p>Let’s say that the capacity of a leaf node is two key/value pairs. When we insert another, we have to split the leaf node and put half the pairs in each node. Both nodes become children of a new internal node which will now be the root node.</p> <table class="image"> <caption align="bottom">two-level btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree3.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree3.png" alt="two-level btree" /></a></td></tr> </table> <p>The internal node has 1 key and 2 pointers to child nodes. If we want to look up a key that is less than or equal to 5, we look in the left child. If we want to look up a key greater than 5, we look in the right child.</p> <p>Now let’s insert the key “2”. First we look up which leaf node it would be in if it were present, and we arrive at the left leaf node. The node is full, so we split the leaf node and create a new entry in the parent node.</p> <table class="image"> <caption align="bottom">four-node btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree4.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree4.png" alt="four-node btree" /></a></td></tr> </table> <p>Let’s keep adding keys: 18 and 21.
We get to the point where we have to split again, but there’s no room in the parent node for another key/pointer pair.</p> <table class="image"> <caption align="bottom">no room in internal node</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree5.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree5.png" alt="no room in internal node" /></a></td></tr> </table> <p>The solution is to split the root node into two internal nodes, then create a new root node to be their parent.</p> <table class="image"> <caption align="bottom">three-level btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree6.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree6.png" alt="three-level btree" /></a></td></tr> </table> <p>The depth of the tree only increases when we split the root node. Every leaf node has the same depth and close to the same number of key/value pairs, so the tree remains balanced and quick to search.</p> <p>I’m going to hold off on discussion of deleting keys from the tree until after we’ve implemented insertion.</p> <p>When we implement this data structure, each node will correspond to one page. The root node will exist in page 0. Each child pointer will simply be the number of the page that contains the child node.</p> <p>Next time, we start implementing the btree!</p>
Sat, 23 Sep 2017 00:00:00 +0000
https://cstack.github.io/db_tutorial/parts/part7.html

Part 8 - B-Tree Leaf Node Format
<p>We’re changing the format of our table from an unsorted array of rows to a B-Tree. This is a pretty big change that is going to take multiple articles to implement. By the end of this article, we’ll define the layout of a leaf node and support inserting key/value pairs into a single-node tree.
But first, let’s recap the reasons for switching to a tree structure.</p> <h2 id="alternative-table-formats">Alternative Table Formats</h2> <p>With the current format, each page stores only rows (no metadata) so it is pretty space efficient. Insertion is also fast because we just append to the end. However, finding a particular row can only be done by scanning the entire table. And if we want to delete a row, we have to fill in the hole by moving every row that comes after it.</p> <p>If we stored the table as an array, but kept rows sorted by id, we could use binary search to find a particular id. However, insertion would be slow because we would have to move a lot of rows to make space.</p> <p>Instead, we’re going with a tree structure. Each node in the tree can contain a variable number of rows, so we have to store some information in each node to keep track of how many rows it contains. Plus there is the storage overhead of all the internal nodes which don’t store any rows. In exchange for a larger database file, we get fast insertion, deletion and lookup.</p> <table> <thead> <tr> <th> </th> <th>Unsorted Array of rows</th> <th>Sorted Array of rows</th> <th>Tree of nodes</th> </tr> </thead> <tbody> <tr> <td>Pages contain</td> <td>only data</td> <td>only data</td> <td>metadata, primary keys, and data</td> </tr> <tr> <td>Rows per page</td> <td>more</td> <td>more</td> <td>fewer</td> </tr> <tr> <td>Insertion</td> <td>O(1)</td> <td>O(n)</td> <td>O(log(n))</td> </tr> <tr> <td>Deletion</td> <td>O(n)</td> <td>O(n)</td> <td>O(log(n))</td> </tr> <tr> <td>Lookup by id</td> <td>O(n)</td> <td>O(log(n))</td> <td>O(log(n))</td> </tr> </tbody> </table> <h2 id="node-header-format">Node Header Format</h2> <p>Leaf nodes and internal nodes have different layouts. 
Let’s make an enum to keep track of node type:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+typedef enum { NODE_INTERNAL, NODE_LEAF } NodeType;
</span></code></pre></div></div> <p>Each node will correspond to one page. Internal nodes will point to their children by storing the page number that stores the child. The btree asks the pager for a particular page number and gets back a pointer into the page cache. Pages are stored in the database file one after the other in order of page number.</p> <p>Nodes need to store some metadata in a header at the beginning of the page. Every node will store what type of node it is, whether or not it is the root node, and a pointer to its parent (to allow finding a node’s siblings). I define constants for the size and offset of every header field:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/*
+ * Common Node Header Layout
+ */
+const uint32_t NODE_TYPE_SIZE = sizeof(uint8_t);
+const uint32_t NODE_TYPE_OFFSET = 0;
+const uint32_t IS_ROOT_SIZE = sizeof(uint8_t);
+const uint32_t IS_ROOT_OFFSET = NODE_TYPE_SIZE;
+const uint32_t PARENT_POINTER_SIZE = sizeof(uint32_t);
+const uint32_t PARENT_POINTER_OFFSET = IS_ROOT_OFFSET + IS_ROOT_SIZE;
+const uint8_t COMMON_NODE_HEADER_SIZE =
+    NODE_TYPE_SIZE + IS_ROOT_SIZE + PARENT_POINTER_SIZE;
</span></code></pre></div></div> <h2 id="leaf-node-format">Leaf Node Format</h2> <p>In addition to these common header fields, leaf nodes need to store how many “cells” they contain.
A cell is a key/value pair.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/*
+ * Leaf Node Header Layout
+ */
+const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t);
+const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE;
+const uint32_t LEAF_NODE_HEADER_SIZE =
+    COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE;
</span></code></pre></div></div> <p>The body of a leaf node is an array of cells. Each cell is a key followed by a value (a serialized row).</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/*
+ * Leaf Node Body Layout
+ */
+const uint32_t LEAF_NODE_KEY_SIZE = sizeof(uint32_t);
+const uint32_t LEAF_NODE_KEY_OFFSET = 0;
+const uint32_t LEAF_NODE_VALUE_SIZE = ROW_SIZE;
+const uint32_t LEAF_NODE_VALUE_OFFSET =
+    LEAF_NODE_KEY_OFFSET + LEAF_NODE_KEY_SIZE;
+const uint32_t LEAF_NODE_CELL_SIZE = LEAF_NODE_KEY_SIZE + LEAF_NODE_VALUE_SIZE;
+const uint32_t LEAF_NODE_SPACE_FOR_CELLS = PAGE_SIZE - LEAF_NODE_HEADER_SIZE;
+const uint32_t LEAF_NODE_MAX_CELLS =
+    LEAF_NODE_SPACE_FOR_CELLS / LEAF_NODE_CELL_SIZE;
</span></code></pre></div></div> <p>Based on these constants, here’s what the layout of a leaf node looks like currently:</p> <table class="image"> <caption align="bottom">Our leaf node format</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/leaf-node-format.png"><img src="https://cstack.github.io/db_tutorial/assets/images/leaf-node-format.png" alt="Our leaf node format" /></a></td></tr> </table> <p>It’s a little space inefficient to use an entire byte per boolean value in the header, but this makes it easier to write code to access those values.</p> <p>Also notice that there’s some wasted space at the end. We store as many cells as we can after the header, but the leftover space can’t hold an entire cell.
We leave it empty to avoid splitting cells between nodes.</p> <h2 id="accessing-leaf-node-fields">Accessing Leaf Node Fields</h2> <p>The code to access keys, values and metadata all involves pointer arithmetic using the constants we just defined.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+uint32_t* leaf_node_num_cells(void* node) {
+  return node + LEAF_NODE_NUM_CELLS_OFFSET;
+}
+
+void* leaf_node_cell(void* node, uint32_t cell_num) {
+  return node + LEAF_NODE_HEADER_SIZE + cell_num * LEAF_NODE_CELL_SIZE;
+}
+
+uint32_t* leaf_node_key(void* node, uint32_t cell_num) {
+  return leaf_node_cell(node, cell_num);
+}
+
+void* leaf_node_value(void* node, uint32_t cell_num) {
+  return leaf_node_cell(node, cell_num) + LEAF_NODE_KEY_SIZE;
+}
+
+void initialize_leaf_node(void* node) { *leaf_node_num_cells(node) = 0; }
+
</span></code></pre></div></div> <p>These methods return a pointer to the value in question, so they can be used both as a getter and a setter.</p> <h2 id="changes-to-pager-and-table-objects">Changes to Pager and Table Objects</h2> <p>Every node is going to take up exactly one page, even if it’s not full.
That means our pager no longer needs to support reading/writing partial pages.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
</span><span class="gi">+void pager_flush(Pager* pager, uint32_t page_num) {
</span>   if (pager-&gt;pages[page_num] == NULL) {
     printf("Tried to flush null page\n");
     exit(EXIT_FAILURE);
<span class="p">@@ -242,7 +337,7 @@</span> void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) {
   }

   ssize_t bytes_written =
<span class="gd">-      write(pager-&gt;file_descriptor, pager-&gt;pages[page_num], size);
</span><span class="gi">+      write(pager-&gt;file_descriptor, pager-&gt;pages[page_num], PAGE_SIZE);
</span>
   if (bytes_written == -1) {
     printf("Error writing: %d\n", errno);
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void db_close(Table* table) {
   Pager* pager = table-&gt;pager;
<span class="gd">-  uint32_t num_full_pages = table-&gt;num_rows / ROWS_PER_PAGE;
</span>
<span class="gd">-  for (uint32_t i = 0; i &lt; num_full_pages; i++) {
</span><span class="gi">+  for (uint32_t i = 0; i &lt; pager-&gt;num_pages; i++) {
</span>     if (pager-&gt;pages[i] == NULL) {
       continue;
     }
<span class="gd">-    pager_flush(pager, i, PAGE_SIZE);
</span><span class="gi">+    pager_flush(pager, i);
</span>     free(pager-&gt;pages[i]);
     pager-&gt;pages[i] = NULL;
   }

<span class="gd">-  // There may be a partial page to write to the end of the file
-  // This should not be needed after we switch to a B-tree
-  uint32_t num_additional_rows = table-&gt;num_rows % ROWS_PER_PAGE;
-  if (num_additional_rows &gt; 0) {
-    uint32_t page_num = num_full_pages;
-    if (pager-&gt;pages[page_num] != NULL) {
-      pager_flush(pager, page_num, num_additional_rows * ROW_SIZE);
-      free(pager-&gt;pages[page_num]);
-      pager-&gt;pages[page_num] = NULL;
-    }
-  }
-
</span>   int result = close(pager-&gt;file_descriptor);
   if (result == -1) {
     printf("Error closing db file.\n");
</code></pre></div></div> <p>Now it makes more sense to store the number of pages in our database rather than the number of rows. The number of pages should be associated with the pager object, not the table, since it’s the number of pages used by the database, not a particular table. A btree is identified by its root node page number, so the table object needs to keep track of that.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> const uint32_t PAGE_SIZE = 4096;
 const uint32_t TABLE_MAX_PAGES = 100;
<span class="gd">-const uint32_t ROWS_PER_PAGE = PAGE_SIZE / ROW_SIZE;
-const uint32_t TABLE_MAX_ROWS = ROWS_PER_PAGE * TABLE_MAX_PAGES;
</span>
 typedef struct {
   int file_descriptor;
   uint32_t file_length;
<span class="gi">+  uint32_t num_pages;
</span>   void* pages[TABLE_MAX_PAGES];
 } Pager;

 typedef struct {
   Pager* pager;
<span class="gd">-  uint32_t num_rows;
</span><span class="gi">+  uint32_t root_page_num;
</span> } Table;
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -127,6 +200,10 @@</span> void* get_page(Pager* pager, uint32_t page_num) {
     }

     pager-&gt;pages[page_num] = page;
<span class="gi">+
+    if (page_num &gt;= pager-&gt;num_pages) {
+      pager-&gt;num_pages = page_num + 1;
+    }
</span>   }

   return pager-&gt;pages[page_num];
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -184,6 +269,12 @@</span> Pager* pager_open(const char* filename) {
   Pager* pager = malloc(sizeof(Pager));
   pager-&gt;file_descriptor = fd;
   pager-&gt;file_length = file_length;
<span class="gi">+  pager-&gt;num_pages = (file_length / PAGE_SIZE);
+
+  if (file_length % PAGE_SIZE != 0) {
+    printf("Db file is not a whole number of pages. Corrupt file.\n");
+    exit(EXIT_FAILURE);
+  }
</span>
   for (uint32_t i = 0; i &lt; TABLE_MAX_PAGES; i++) {
     pager-&gt;pages[i] = NULL;
</code></pre></div></div> <h2 id="changes-to-the-cursor-object">Changes to the Cursor Object</h2> <p>A cursor represents a position in the table. When our table was a simple array of rows, we could access a row given just the row number. Now that it’s a tree, we identify a position by the page number of the node, and the cell number within that node.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> typedef struct {
   Table* table;
<span class="gd">-  uint32_t row_num;
</span><span class="gi">+  uint32_t page_num;
+  uint32_t cell_num;
</span>   bool end_of_table;  // Indicates a position one past the last element
 } Cursor;
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Cursor* table_start(Table* table) {
   Cursor* cursor = malloc(sizeof(Cursor));
   cursor-&gt;table = table;
<span class="gd">-  cursor-&gt;row_num = 0;
-  cursor-&gt;end_of_table = (table-&gt;num_rows == 0);
</span><span class="gi">+  cursor-&gt;page_num = table-&gt;root_page_num;
+  cursor-&gt;cell_num = 0;
+
+  void* root_node = get_page(table-&gt;pager, table-&gt;root_page_num);
+  uint32_t num_cells = *leaf_node_num_cells(root_node);
+  cursor-&gt;end_of_table = (num_cells == 0);
</span>
   return cursor;
 }
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Cursor* table_end(Table* table) {
   Cursor* cursor = malloc(sizeof(Cursor));
   cursor-&gt;table = table;
<span class="gd">-  cursor-&gt;row_num = table-&gt;num_rows;
</span><span class="gi">+  cursor-&gt;page_num = table-&gt;root_page_num;
+
+  void* root_node = get_page(table-&gt;pager, table-&gt;root_page_num);
+  uint32_t num_cells = *leaf_node_num_cells(root_node);
+  cursor-&gt;cell_num = num_cells;
</span>   cursor-&gt;end_of_table = true;
   return cursor;
 }
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void* cursor_value(Cursor* cursor) {
<span class="gd">-  uint32_t row_num = cursor-&gt;row_num;
-  uint32_t page_num = row_num / ROWS_PER_PAGE;
</span><span class="gi">+  uint32_t page_num = cursor-&gt;page_num;
</span>   void* page = get_page(cursor-&gt;table-&gt;pager, page_num);
<span class="gd">-  uint32_t row_offset = row_num % ROWS_PER_PAGE;
-  uint32_t byte_offset = row_offset * ROW_SIZE;
-  return page + byte_offset;
</span><span class="gi">+  return leaf_node_value(page, cursor-&gt;cell_num);
</span> }
</code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void cursor_advance(Cursor* cursor) {
<span class="gd">-  cursor-&gt;row_num += 1;
-  if (cursor-&gt;row_num &gt;= cursor-&gt;table-&gt;num_rows) {
</span><span class="gi">+  uint32_t page_num = cursor-&gt;page_num;
+  void* node = get_page(cursor-&gt;table-&gt;pager, page_num);
+
+  cursor-&gt;cell_num += 1;
+  if (cursor-&gt;cell_num &gt;= (*leaf_node_num_cells(node))) {
</span>     cursor-&gt;end_of_table = true;
   }
 }
</code></pre></div></div> <h2 id="insertion-into-a-leaf-node">Insertion Into a Leaf Node</h2> <p>In this article we’re only going to implement enough to get a single-node tree.
Recall from the last article that a tree starts out as an empty leaf node:</p> <table class="image"> <caption align="bottom">empty btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree1.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree1.png" alt="empty btree" /></a></td></tr> </table> <p>Key/value pairs can be added until the leaf node is full:</p> <table class="image"> <caption align="bottom">one-node btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree2.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree2.png" alt="one-node btree" /></a></td></tr> </table> <p>When we open the database for the first time, the database file will be empty, so we initialize page 0 to be an empty leaf node (the root node):</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Table* db_open(const char* filename) {
   Pager* pager = pager_open(filename);
<span class="gd">-  uint32_t num_rows = pager-&gt;file_length / ROW_SIZE;
</span>
   Table* table = malloc(sizeof(Table));
   table-&gt;pager = pager;
<span class="gd">-  table-&gt;num_rows = num_rows;
</span><span class="gi">+  table-&gt;root_page_num = 0;
+
+  if (pager-&gt;num_pages == 0) {
+    // New database file. Initialize page 0 as leaf node.
+    void* root_node = get_page(pager, 0);
+    initialize_leaf_node(root_node);
+  }
</span>
   return table;
 }
</code></pre></div></div> <p>Next we’ll make a function for inserting a key/value pair into a leaf node.
It will take a cursor as input to represent the position where the pair should be inserted.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) {
+  void* node = get_page(cursor-&gt;table-&gt;pager, cursor-&gt;page_num);
+
+  uint32_t num_cells = *leaf_node_num_cells(node);
+  if (num_cells &gt;= LEAF_NODE_MAX_CELLS) {
+    // Node full
+    printf("Need to implement splitting a leaf node.\n");
+    exit(EXIT_FAILURE);
+  }
+
+  if (cursor-&gt;cell_num &lt; num_cells) {
+    // Make room for new cell
+    for (uint32_t i = num_cells; i &gt; cursor-&gt;cell_num; i--) {
+      memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
+             LEAF_NODE_CELL_SIZE);
+    }
+  }
+
+  *(leaf_node_num_cells(node)) += 1;
+  *(leaf_node_key(node, cursor-&gt;cell_num)) = key;
+  serialize_row(value, leaf_node_value(node, cursor-&gt;cell_num));
+}
+
</span></code></pre></div></div> <p>We haven’t implemented splitting yet, so we exit with an error if the node is full. Next we shift cells one space to the right to make room for the new cell.
Then we write the new key/value into the empty space.</p> <p>Since we assume the tree only has one node, our <code class="language-plaintext highlighter-rouge">execute_insert()</code> function simply needs to call this helper method:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ExecuteResult execute_insert(Statement* statement, Table* table) { <span class="gd">- if (table-&gt;num_rows &gt;= TABLE_MAX_ROWS) { </span><span class="gi">+ void* node = get_page(table-&gt;pager, table-&gt;root_page_num); + if ((*leaf_node_num_cells(node) &gt;= LEAF_NODE_MAX_CELLS)) { </span> return EXECUTE_TABLE_FULL; } Row* row_to_insert = &amp;(statement-&gt;row_to_insert); Cursor* cursor = table_end(table); <span class="gd">- serialize_row(row_to_insert, cursor_value(cursor)); - table-&gt;num_rows += 1; </span><span class="gi">+ leaf_node_insert(cursor, row_to_insert-&gt;id, row_to_insert); </span> free(cursor); </code></pre></div></div> <p>With those changes, our database should work as before! 
Except now it returns a “Table Full” error much sooner, since we can’t split the root node yet.</p> <p>How many rows can the leaf node hold?</p> <h2 id="command-to-print-constants">Command to Print Constants</h2> <p>I’m adding a new meta command to print out a few constants of interest.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void print_constants() { + printf("ROW_SIZE: %d\n", ROW_SIZE); + printf("COMMON_NODE_HEADER_SIZE: %d\n", COMMON_NODE_HEADER_SIZE); + printf("LEAF_NODE_HEADER_SIZE: %d\n", LEAF_NODE_HEADER_SIZE); + printf("LEAF_NODE_CELL_SIZE: %d\n", LEAF_NODE_CELL_SIZE); + printf("LEAF_NODE_SPACE_FOR_CELLS: %d\n", LEAF_NODE_SPACE_FOR_CELLS); + printf("LEAF_NODE_MAX_CELLS: %d\n", LEAF_NODE_MAX_CELLS); +} + </span><span class="p">@@ -294,6 +376,14 @@</span> MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) { if (strcmp(input_buffer-&gt;buffer, ".exit") == 0) { db_close(table); exit(EXIT_SUCCESS); <span class="gi">+ } else if (strcmp(input_buffer-&gt;buffer, ".constants") == 0) { + printf("Constants:\n"); + print_constants(); + return META_COMMAND_SUCCESS; </span> } else { return META_COMMAND_UNRECOGNIZED_COMMAND; } </code></pre></div></div> <p>I’m also adding a test so we get alerted when those constants change:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'prints constants' do + script = [ + ".constants", + ".exit", + ] + result = run_script(script) + + expect(result).to match_array([ + "db &gt; Constants:", + "ROW_SIZE: 293", + "COMMON_NODE_HEADER_SIZE: 6", + "LEAF_NODE_HEADER_SIZE: 10", + "LEAF_NODE_CELL_SIZE: 297", + "LEAF_NODE_SPACE_FOR_CELLS: 4086", + "LEAF_NODE_MAX_CELLS: 13", + "db &gt; ", + ]) + end </span></code></pre></div></div> <p>So our table can hold 13 rows right now!</p> <h2 id="tree-visualization">Tree Visualization</h2> <p>To help with debugging and visualization, 
I’m also adding a meta command to print out a representation of the btree.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void print_leaf_node(void* node) { + uint32_t num_cells = *leaf_node_num_cells(node); + printf("leaf (size %d)\n", num_cells); + for (uint32_t i = 0; i &lt; num_cells; i++) { + uint32_t key = *leaf_node_key(node, i); + printf(" - %d : %d\n", i, key); + } +} + </span></code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -294,6 +376,14 @@</span> MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table* table) { if (strcmp(input_buffer-&gt;buffer, ".exit") == 0) { db_close(table); exit(EXIT_SUCCESS); <span class="gi">+ } else if (strcmp(input_buffer-&gt;buffer, ".btree") == 0) { + printf("Tree:\n"); + print_leaf_node(get_page(table-&gt;pager, 0)); + return META_COMMAND_SUCCESS; </span> } else if (strcmp(input_buffer-&gt;buffer, ".constants") == 0) { printf("Constants:\n"); print_constants(); return META_COMMAND_SUCCESS; } else { return META_COMMAND_UNRECOGNIZED_COMMAND; } </code></pre></div></div> <p>And a test</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'allows printing out the structure of a one-node btree' do + script = [3, 1, 2].map do |i| + "insert #{i} user#{i} person#{i}@example.com" + end + script &lt;&lt; ".btree" + script &lt;&lt; ".exit" + result = run_script(script) + + expect(result).to match_array([ + "db &gt; Executed.", + "db &gt; Executed.", + "db &gt; Executed.", + "db &gt; Tree:", + "leaf (size 3)", + " - 0 : 3", + " - 1 : 1", + " - 2 : 2", + "db &gt; " + ]) + end </span></code></pre></div></div> <p>Uh oh, we’re still not storing rows in sorted order. 
You’ll notice that <code class="language-plaintext highlighter-rouge">execute_insert()</code> inserts into the leaf node at the position returned by <code class="language-plaintext highlighter-rouge">table_end()</code>. So rows are stored in the order they were inserted, just like before.</p> <h2 id="next-time">Next Time</h2> <p>This all might seem like a step backwards. Our database now stores fewer rows than it did before, and we’re still storing rows in unsorted order. But like I said at the beginning, this is a big change and it’s important to break it up into manageable steps.</p> <p>Next time, we’ll implement finding a record by primary key, and start storing rows in sorted order.</p> <h2 id="complete-diff">Complete Diff</h2> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -62,29 +62,101 @@</span> const uint32_t ROW_SIZE = ID_SIZE + USERNAME_SIZE + EMAIL_SIZE; const uint32_t PAGE_SIZE = 4096; #define TABLE_MAX_PAGES 100 <span class="gd">-const uint32_t ROWS_PER_PAGE = PAGE_SIZE / ROW_SIZE; -const uint32_t TABLE_MAX_ROWS = ROWS_PER_PAGE * TABLE_MAX_PAGES; </span> typedef struct { int file_descriptor; uint32_t file_length; <span class="gi">+ uint32_t num_pages; </span> void* pages[TABLE_MAX_PAGES]; } Pager; typedef struct { Pager* pager; <span class="gd">- uint32_t num_rows; </span><span class="gi">+ uint32_t root_page_num; </span> } Table; typedef struct { Table* table; <span class="gd">- uint32_t row_num; </span><span class="gi">+ uint32_t page_num; + uint32_t cell_num; </span> bool end_of_table; // Indicates a position one past the last element } Cursor; +typedef enum { NODE_INTERNAL, NODE_LEAF } NodeType; <span class="gi">+ +/* + * Common Node Header Layout + */ +const uint32_t NODE_TYPE_SIZE = sizeof(uint8_t); +const uint32_t NODE_TYPE_OFFSET = 0; +const uint32_t IS_ROOT_SIZE = sizeof(uint8_t); +const uint32_t IS_ROOT_OFFSET = NODE_TYPE_SIZE; +const uint32_t PARENT_POINTER_SIZE = 
sizeof(uint32_t); +const uint32_t PARENT_POINTER_OFFSET = IS_ROOT_OFFSET + IS_ROOT_SIZE; +const uint8_t COMMON_NODE_HEADER_SIZE = + NODE_TYPE_SIZE + IS_ROOT_SIZE + PARENT_POINTER_SIZE; + +/* + * Leaf Node Header Layout + */ +const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t); +const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE; +const uint32_t LEAF_NODE_HEADER_SIZE = + COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE; + +/* + * Leaf Node Body Layout + */ +const uint32_t LEAF_NODE_KEY_SIZE = sizeof(uint32_t); +const uint32_t LEAF_NODE_KEY_OFFSET = 0; +const uint32_t LEAF_NODE_VALUE_SIZE = ROW_SIZE; +const uint32_t LEAF_NODE_VALUE_OFFSET = + LEAF_NODE_KEY_OFFSET + LEAF_NODE_KEY_SIZE; +const uint32_t LEAF_NODE_CELL_SIZE = LEAF_NODE_KEY_SIZE + LEAF_NODE_VALUE_SIZE; +const uint32_t LEAF_NODE_SPACE_FOR_CELLS = PAGE_SIZE - LEAF_NODE_HEADER_SIZE; +const uint32_t LEAF_NODE_MAX_CELLS = + LEAF_NODE_SPACE_FOR_CELLS / LEAF_NODE_CELL_SIZE; + +uint32_t* leaf_node_num_cells(void* node) { + return node + LEAF_NODE_NUM_CELLS_OFFSET; +} + +void* leaf_node_cell(void* node, uint32_t cell_num) { + return node + LEAF_NODE_HEADER_SIZE + cell_num * LEAF_NODE_CELL_SIZE; +} + +uint32_t* leaf_node_key(void* node, uint32_t cell_num) { + return leaf_node_cell(node, cell_num); +} + +void* leaf_node_value(void* node, uint32_t cell_num) { + return leaf_node_cell(node, cell_num) + LEAF_NODE_KEY_SIZE; +} + +void print_constants() { + printf("ROW_SIZE: %d\n", ROW_SIZE); + printf("COMMON_NODE_HEADER_SIZE: %d\n", COMMON_NODE_HEADER_SIZE); + printf("LEAF_NODE_HEADER_SIZE: %d\n", LEAF_NODE_HEADER_SIZE); + printf("LEAF_NODE_CELL_SIZE: %d\n", LEAF_NODE_CELL_SIZE); + printf("LEAF_NODE_SPACE_FOR_CELLS: %d\n", LEAF_NODE_SPACE_FOR_CELLS); + printf("LEAF_NODE_MAX_CELLS: %d\n", LEAF_NODE_MAX_CELLS); +} + +void print_leaf_node(void* node) { + uint32_t num_cells = *leaf_node_num_cells(node); + printf("leaf (size %d)\n", num_cells); + for (uint32_t i = 0; i &lt; num_cells; i++) { + 
uint32_t key = *leaf_node_key(node, i); + printf(" - %d : %d\n", i, key); + } +} + </span> void print_row(Row* row) { printf("(%d, %s, %s)\n", row-&gt;id, row-&gt;username, row-&gt;email); } <span class="p">@@ -101,6 +173,8 @@</span> void deserialize_row(void *source, Row* destination) { memcpy(&amp;(destination-&gt;email), source + EMAIL_OFFSET, EMAIL_SIZE); } <span class="gi">+void initialize_leaf_node(void* node) { *leaf_node_num_cells(node) = 0; } + </span> void* get_page(Pager* pager, uint32_t page_num) { if (page_num &gt; TABLE_MAX_PAGES) { printf("Tried to fetch page number out of bounds. %d &gt; %d\n", page_num, <span class="p">@@ -128,6 +202,10 @@</span> void* get_page(Pager* pager, uint32_t page_num) { } pager-&gt;pages[page_num] = page; <span class="gi">+ + if (page_num &gt;= pager-&gt;num_pages) { + pager-&gt;num_pages = page_num + 1; + } </span> } return pager-&gt;pages[page_num]; <span class="p">@@ -136,8 +214,12 @@</span> void* get_page(Pager* pager, uint32_t page_num) { Cursor* table_start(Table* table) { Cursor* cursor = malloc(sizeof(Cursor)); cursor-&gt;table = table; <span class="gd">- cursor-&gt;row_num = 0; - cursor-&gt;end_of_table = (table-&gt;num_rows == 0); </span><span class="gi">+ cursor-&gt;page_num = table-&gt;root_page_num; + cursor-&gt;cell_num = 0; + + void* root_node = get_page(table-&gt;pager, table-&gt;root_page_num); + uint32_t num_cells = *leaf_node_num_cells(root_node); + cursor-&gt;end_of_table = (num_cells == 0); </span> return cursor; } <span class="p">@@ -145,24 +227,28 @@</span> Cursor* table_start(Table* table) { Cursor* table_end(Table* table) { Cursor* cursor = malloc(sizeof(Cursor)); cursor-&gt;table = table; <span class="gd">- cursor-&gt;row_num = table-&gt;num_rows; </span><span class="gi">+ cursor-&gt;page_num = table-&gt;root_page_num; + + void* root_node = get_page(table-&gt;pager, table-&gt;root_page_num); + uint32_t num_cells = *leaf_node_num_cells(root_node); + cursor-&gt;cell_num = num_cells; </span> 
cursor-&gt;end_of_table = true; return cursor; } void* cursor_value(Cursor* cursor) { <span class="gd">- uint32_t row_num = cursor-&gt;row_num; - uint32_t page_num = row_num / ROWS_PER_PAGE; </span><span class="gi">+ uint32_t page_num = cursor-&gt;page_num; </span> void* page = get_page(cursor-&gt;table-&gt;pager, page_num); <span class="gd">- uint32_t row_offset = row_num % ROWS_PER_PAGE; - uint32_t byte_offset = row_offset * ROW_SIZE; - return page + byte_offset; </span><span class="gi">+ return leaf_node_value(page, cursor-&gt;cell_num); </span> } void cursor_advance(Cursor* cursor) { <span class="gd">- cursor-&gt;row_num += 1; - if (cursor-&gt;row_num &gt;= cursor-&gt;table-&gt;num_rows) { </span><span class="gi">+ uint32_t page_num = cursor-&gt;page_num; + void* node = get_page(cursor-&gt;table-&gt;pager, page_num); + + cursor-&gt;cell_num += 1; + if (cursor-&gt;cell_num &gt;= (*leaf_node_num_cells(node))) { </span> cursor-&gt;end_of_table = true; } } <span class="p">@@ -185,6 +271,12 @@</span> Pager* pager_open(const char* filename) { Pager* pager = malloc(sizeof(Pager)); pager-&gt;file_descriptor = fd; pager-&gt;file_length = file_length; <span class="gi">+ pager-&gt;num_pages = (file_length / PAGE_SIZE); + + if (file_length % PAGE_SIZE != 0) { + printf("Db file is not a whole number of pages. 
Corrupt file.\n"); + exit(EXIT_FAILURE); + } </span> for (uint32_t i = 0; i &lt; TABLE_MAX_PAGES; i++) { pager-&gt;pages[i] = NULL; <span class="p">@@ -194,11 +285,15 @@</span> Pager* pager_open(const char* filename) { <span class="p">@@ -195,11 +287,16 @@</span> Pager* pager_open(const char* filename) { Table* db_open(const char* filename) { Pager* pager = pager_open(filename); <span class="gd">- uint32_t num_rows = pager-&gt;file_length / ROW_SIZE; </span> Table* table = malloc(sizeof(Table)); table-&gt;pager = pager; <span class="gd">- table-&gt;num_rows = num_rows; </span><span class="gi">+ table-&gt;root_page_num = 0; + + if (pager-&gt;num_pages == 0) { + // New database file. Initialize page 0 as leaf node. + void* root_node = get_page(pager, 0); + initialize_leaf_node(root_node); + } </span> return table; } <span class="p">@@ -234,7 +331,7 @@</span> void close_input_buffer(InputBuffer* input_buffer) { free(input_buffer); } <span class="gd">-void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) { </span><span class="gi">+void pager_flush(Pager* pager, uint32_t page_num) { </span> if (pager-&gt;pages[page_num] == NULL) { printf("Tried to flush null page\n"); exit(EXIT_FAILURE); <span class="p">@@ -242,7 +337,7 @@</span> void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) { <span class="p">@@ -249,7 +346,7 @@</span> void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) { } ssize_t bytes_written = <span class="gd">- write(pager-&gt;file_descriptor, pager-&gt;pages[page_num], size); </span><span class="gi">+ write(pager-&gt;file_descriptor, pager-&gt;pages[page_num], PAGE_SIZE); </span> if (bytes_written == -1) { printf("Error writing: %d\n", errno); <span class="p">@@ -252,29 +347,16 @@</span> void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) { <span class="p">@@ -260,29 +357,16 @@</span> void pager_flush(Pager* pager, uint32_t page_num, uint32_t size) { void db_close(Table* table) { Pager* pager = 
table-&gt;pager; <span class="gd">- uint32_t num_full_pages = table-&gt;num_rows / ROWS_PER_PAGE; </span> <span class="gd">- for (uint32_t i = 0; i &lt; num_full_pages; i++) { </span><span class="gi">+ for (uint32_t i = 0; i &lt; pager-&gt;num_pages; i++) { </span> if (pager-&gt;pages[i] == NULL) { continue; } <span class="gd">- pager_flush(pager, i, PAGE_SIZE); </span><span class="gi">+ pager_flush(pager, i); </span> free(pager-&gt;pages[i]); pager-&gt;pages[i] = NULL; } <span class="gd">- // There may be a partial page to write to the end of the file - // This should not be needed after we switch to a B-tree - uint32_t num_additional_rows = table-&gt;num_rows % ROWS_PER_PAGE; - if (num_additional_rows &gt; 0) { - uint32_t page_num = num_full_pages; - if (pager-&gt;pages[page_num] != NULL) { - pager_flush(pager, page_num, num_additional_rows * ROW_SIZE); - free(pager-&gt;pages[page_num]); - pager-&gt;pages[page_num] = NULL; - } - } - </span> int result = close(pager-&gt;file_descriptor); if (result == -1) { printf("Error closing db file.\n"); <span class="p">@@ -305,6 +389,14 @@</span> MetaCommandResult do_meta_command(InputBuffer* input_buffer, Table *table) { if (strcmp(input_buffer-&gt;buffer, ".exit") == 0) { db_close(table); exit(EXIT_SUCCESS); <span class="gi">+ } else if (strcmp(input_buffer-&gt;buffer, ".btree") == 0) { + printf("Tree:\n"); + print_leaf_node(get_page(table-&gt;pager, 0)); + return META_COMMAND_SUCCESS; + } else if (strcmp(input_buffer-&gt;buffer, ".constants") == 0) { + printf("Constants:\n"); + print_constants(); + return META_COMMAND_SUCCESS; </span> } else { return META_COMMAND_UNRECOGNIZED_COMMAND; } <span class="p">@@ -354,16 +446,39 @@</span> PrepareResult prepare_statement(InputBuffer* input_buffer, return PREPARE_UNRECOGNIZED_STATEMENT; } <span class="gi">+void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) { + void* node = get_page(cursor-&gt;table-&gt;pager, cursor-&gt;page_num); + + uint32_t num_cells = 
*leaf_node_num_cells(node); + if (num_cells &gt;= LEAF_NODE_MAX_CELLS) { + // Node full + printf("Need to implement splitting a leaf node.\n"); + exit(EXIT_FAILURE); + } + + if (cursor-&gt;cell_num &lt; num_cells) { + // Make room for new cell + for (uint32_t i = num_cells; i &gt; cursor-&gt;cell_num; i--) { + memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1), + LEAF_NODE_CELL_SIZE); + } + } + + *(leaf_node_num_cells(node)) += 1; + *(leaf_node_key(node, cursor-&gt;cell_num)) = key; + serialize_row(value, leaf_node_value(node, cursor-&gt;cell_num)); +} + </span> ExecuteResult execute_insert(Statement* statement, Table* table) { <span class="gd">- if (table-&gt;num_rows &gt;= TABLE_MAX_ROWS) { </span><span class="gi">+ void* node = get_page(table-&gt;pager, table-&gt;root_page_num); + if ((*leaf_node_num_cells(node) &gt;= LEAF_NODE_MAX_CELLS)) { </span> return EXECUTE_TABLE_FULL; } Row* row_to_insert = &amp;(statement-&gt;row_to_insert); Cursor* cursor = table_end(table); <span class="gd">- serialize_row(row_to_insert, cursor_value(cursor)); - table-&gt;num_rows += 1; </span><span class="gi">+ leaf_node_insert(cursor, row_to_insert-&gt;id, row_to_insert); </span> free(cursor); </code></pre></div></div> <p>And the specs:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'allows printing out the structure of a one-node btree' do + script = [3, 1, 2].map do |i| + "insert #{i} user#{i} person#{i}@example.com" + end + script &lt;&lt; ".btree" + script &lt;&lt; ".exit" + result = run_script(script) + + expect(result).to match_array([ + "db &gt; Executed.", + "db &gt; Executed.", + "db &gt; Executed.", + "db &gt; Tree:", + "leaf (size 3)", + " - 0 : 3", + " - 1 : 1", + " - 2 : 2", + "db &gt; " + ]) + end + + it 'prints constants' do + script = [ + ".constants", + ".exit", + ] + result = run_script(script) + + expect(result).to match_array([ + "db &gt; Constants:", + "ROW_SIZE: 293", + 
"COMMON_NODE_HEADER_SIZE: 6", + "LEAF_NODE_HEADER_SIZE: 10", + "LEAF_NODE_CELL_SIZE: 297", + "LEAF_NODE_SPACE_FOR_CELLS: 4086", + "LEAF_NODE_MAX_CELLS: 13", + "db &gt; ", + ]) + end </span> end </code></pre></div></div> Mon, 25 Sep 2017 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part8.html https://cstack.github.io/db_tutorial/parts/part8.html Part 9 - Binary Search and Duplicate Keys <p>Last time we noted that we’re still storing keys in unsorted order. We’re going to fix that problem, plus detect and reject duplicate keys.</p> <p>Right now, our <code class="language-plaintext highlighter-rouge">execute_insert()</code> function always chooses to insert at the end of the table. Instead, we should search the table for the correct place to insert, then insert there. If the key already exists there, return an error.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">ExecuteResult execute_insert(Statement* statement, Table* table) { </span> void* node = get_page(table-&gt;pager, table-&gt;root_page_num); <span class="gd">- if ((*leaf_node_num_cells(node) &gt;= LEAF_NODE_MAX_CELLS)) { </span><span class="gi">+ uint32_t num_cells = (*leaf_node_num_cells(node)); + if (num_cells &gt;= LEAF_NODE_MAX_CELLS) { </span> return EXECUTE_TABLE_FULL; } Row* row_to_insert = &amp;(statement-&gt;row_to_insert); <span class="gd">- Cursor* cursor = table_end(table); </span><span class="gi">+ uint32_t key_to_insert = row_to_insert-&gt;id; + Cursor* cursor = table_find(table, key_to_insert); + + if (cursor-&gt;cell_num &lt; num_cells) { + uint32_t key_at_index = *leaf_node_key(node, cursor-&gt;cell_num); + if (key_at_index == key_to_insert) { + return EXECUTE_DUPLICATE_KEY; + } + } </span> leaf_node_insert(cursor, row_to_insert-&gt;id, row_to_insert); </code></pre></div></div> <p>We don’t need the <code class="language-plaintext highlighter-rouge">table_end()</code> function anymore.</p> <div 
class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-Cursor* table_end(Table* table) { - Cursor* cursor = malloc(sizeof(Cursor)); - cursor-&gt;table = table; - cursor-&gt;page_num = table-&gt;root_page_num; - - void* root_node = get_page(table-&gt;pager, table-&gt;root_page_num); - uint32_t num_cells = *leaf_node_num_cells(root_node); - cursor-&gt;cell_num = num_cells; - cursor-&gt;end_of_table = true; - - return cursor; -} </span></code></pre></div></div> <p>We’ll replace it with a method that searches the tree for a given key.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/* +Return the position of the given key. +If the key is not present, return the position +where it should be inserted +*/ +Cursor* table_find(Table* table, uint32_t key) { + uint32_t root_page_num = table-&gt;root_page_num; + void* root_node = get_page(table-&gt;pager, root_page_num); + + if (get_node_type(root_node) == NODE_LEAF) { + return leaf_node_find(table, root_page_num, key); + } else { + printf("Need to implement searching an internal node\n"); + exit(EXIT_FAILURE); + } +} </span></code></pre></div></div> <p>I’m stubbing out the branch for internal nodes because we haven’t implemented internal nodes yet. 
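</p> <p>To make the contract of <code class="language-plaintext highlighter-rouge">table_find()</code> concrete (return the key’s position, or the position where it should be inserted), here is a minimal sketch on a plain sorted array instead of a leaf node. <code class="language-plaintext highlighter-rouge">find_position()</code> and <code class="language-plaintext highlighter-rouge">is_duplicate()</code> are hypothetical helpers written for this illustration only.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the index of `key` in the sorted array, or the index where it
 * would have to be inserted to keep the array sorted.
 */
uint32_t find_position(const uint32_t* keys, uint32_t num_keys, uint32_t key) {
  uint32_t min_index = 0;
  uint32_t one_past_max_index = num_keys;
  while (one_past_max_index != min_index) {
    uint32_t index = min_index + (one_past_max_index - min_index) / 2;
    if (key == keys[index]) {
      return index;
    }
    if (key < keys[index]) {
      one_past_max_index = index;
    } else {
      min_index = index + 1;
    }
  }
  assert(min_index <= num_keys); /* result may be one past the last key */
  return min_index;
}

/* A key is a duplicate only if the returned slot already holds that key. */
bool is_duplicate(const uint32_t* keys, uint32_t num_keys, uint32_t key) {
  uint32_t position = find_position(keys, num_keys, key);
  return position < num_keys && keys[position] == key;
}
```

<p>With the keys 1, 3, 5, searching for 3 returns index 1, searching for 4 returns index 2 (where 4 would go), and searching for 9 returns index 3, one past the last key. Only the first case counts as a duplicate.</p> <p>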
We can search the leaf node with binary search.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+Cursor* leaf_node_find(Table* table, uint32_t page_num, uint32_t key) { + void* node = get_page(table-&gt;pager, page_num); + uint32_t num_cells = *leaf_node_num_cells(node); + + Cursor* cursor = malloc(sizeof(Cursor)); + cursor-&gt;table = table; + cursor-&gt;page_num = page_num; + + // Binary search + uint32_t min_index = 0; + uint32_t one_past_max_index = num_cells; + while (one_past_max_index != min_index) { + uint32_t index = (min_index + one_past_max_index) / 2; + uint32_t key_at_index = *leaf_node_key(node, index); + if (key == key_at_index) { + cursor-&gt;cell_num = index; + return cursor; + } + if (key &lt; key_at_index) { + one_past_max_index = index; + } else { + min_index = index + 1; + } + } + + cursor-&gt;cell_num = min_index; + return cursor; +} </span></code></pre></div></div> <p>This will either return</p> <ul> <li>the position of the key,</li> <li>the position of another key that we’ll need to move if we want to insert the new key, or</li> <li>the position one past the last key</li> </ul> <p>Since we’re now checking node type, we need functions to get and set that value in a node.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+NodeType get_node_type(void* node) { + uint8_t value = *((uint8_t*)(node + NODE_TYPE_OFFSET)); + return (NodeType)value; +} + +void set_node_type(void* node, NodeType type) { + uint8_t value = type; + *((uint8_t*)(node + NODE_TYPE_OFFSET)) = value; +} </span></code></pre></div></div> <p>We have to cast to <code class="language-plaintext highlighter-rouge">uint8_t</code> first to ensure it’s serialized as a single byte.</p> <p>We also need to initialize node type.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-void 
initialize_leaf_node(void* node) { *leaf_node_num_cells(node) = 0; } </span><span class="gi">+void initialize_leaf_node(void* node) { + set_node_type(node, NODE_LEAF); + *leaf_node_num_cells(node) = 0; +} </span></code></pre></div></div> <p>Lastly, we need to make and handle a new error code.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-enum ExecuteResult_t { EXECUTE_SUCCESS, EXECUTE_TABLE_FULL }; </span><span class="gi">+enum ExecuteResult_t { + EXECUTE_SUCCESS, + EXECUTE_DUPLICATE_KEY, + EXECUTE_TABLE_FULL +}; </span></code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> case (EXECUTE_SUCCESS): printf("Executed.\n"); break; <span class="gi">+ case (EXECUTE_DUPLICATE_KEY): + printf("Error: Duplicate key.\n"); + break; </span> case (EXECUTE_TABLE_FULL): printf("Error: Table full.\n"); break; </code></pre></div></div> <p>With these changes, our test can change to check for sorted order:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "db &gt; Executed.", "db &gt; Tree:", "leaf (size 3)", <span class="gd">- " - 0 : 3", - " - 1 : 1", - " - 2 : 2", </span><span class="gi">+ " - 0 : 1", + " - 1 : 2", + " - 2 : 3", </span> "db &gt; " ]) end </code></pre></div></div> <p>And we can add a new test for duplicate keys:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'prints an error message if there is a duplicate id' do + script = [ + "insert 1 user1 person1@example.com", + "insert 1 user1 person1@example.com", + "select", + ".exit", + ] + result = run_script(script) + expect(result).to match_array([ + "db &gt; Executed.", + "db &gt; Error: Duplicate key.", + "db &gt; (1, user1, person1@example.com)", + "Executed.", + "db &gt; ", + ]) + end </span></code></pre></div></div> <p>That’s it! 
Next up: implement splitting leaf nodes and creating internal nodes.</p> Sun, 01 Oct 2017 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part9.html Part 10 - Splitting a Leaf Node <p>Our B-Tree doesn’t feel like much of a tree with only one node. To fix that, we need some code to split a leaf node in twain. And after that, we need to create an internal node to serve as a parent for the two leaf nodes.</p> <p>Basically our goal for this article is to go from this:</p> <table class="image"> <caption align="bottom">one-node btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree2.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree2.png" alt="one-node btree" /></a></td></tr> </table> <p>to this:</p> <table class="image"> <caption align="bottom">two-level btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree3.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree3.png" alt="two-level btree" /></a></td></tr> </table> <p>First things first, let’s remove the error handling for a full leaf node:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) { void* node = get_page(cursor-&gt;table-&gt;pager, cursor-&gt;page_num); uint32_t num_cells = *leaf_node_num_cells(node); if (num_cells &gt;= LEAF_NODE_MAX_CELLS) { // Node full <span class="gd">- printf("Need to implement splitting a leaf node.\n"); - exit(EXIT_FAILURE); </span><span class="gi">+ leaf_node_split_and_insert(cursor, key, value); + return; </span> } </code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">ExecuteResult execute_insert(Statement* statement, Table* table) { </span> void* node = get_page(table-&gt;pager, table-&gt;root_page_num); uint32_t 
num_cells = (*leaf_node_num_cells(node)); <span class="gd">- if (num_cells &gt;= LEAF_NODE_MAX_CELLS) { - return EXECUTE_TABLE_FULL; - } </span> Row* row_to_insert = &amp;(statement-&gt;row_to_insert); uint32_t key_to_insert = row_to_insert-&gt;id; </code></pre></div></div> <h2 id="splitting-algorithm">Splitting Algorithm</h2> <p>Easy part’s over. Here’s a description of what we need to do from <a href="https://play.google.com/store/books/details/Sibsankar_Haldar_SQLite_Database_System_Design_and?id=9Z6IQQnX1JEC&amp;hl=en">SQLite Database System: Design and Implementation</a></p> <blockquote> <p>If there is no space on the leaf node, we would split the existing entries residing there and the new one (being inserted) into two equal halves: lower and upper halves. (Keys on the upper half are strictly greater than those on the lower half.) We allocate a new leaf node, and move the upper half into the new node.</p> </blockquote> <p>Let’s get a handle to the old node and create the new node:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) { + /* + Create a new node and move half the cells over. + Insert the new value in one of the two nodes. + Update parent or create a new parent. + */ + + void* old_node = get_page(cursor-&gt;table-&gt;pager, cursor-&gt;page_num); + uint32_t new_page_num = get_unused_page_num(cursor-&gt;table-&gt;pager); + void* new_node = get_page(cursor-&gt;table-&gt;pager, new_page_num); + initialize_leaf_node(new_node); </span></code></pre></div></div> <p>Next, copy every cell into its new location:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ /* + All existing keys plus new key should be divided + evenly between old (left) and new (right) nodes. + Starting from the right, move each key to correct position. 
+ */ + for (int32_t i = LEAF_NODE_MAX_CELLS; i &gt;= 0; i--) { + void* destination_node; + if (i &gt;= LEAF_NODE_LEFT_SPLIT_COUNT) { + destination_node = new_node; + } else { + destination_node = old_node; + } + uint32_t index_within_node = i % LEAF_NODE_LEFT_SPLIT_COUNT; + void* destination = leaf_node_cell(destination_node, index_within_node); + + if (i == cursor-&gt;cell_num) { + serialize_row(value, destination); + } else if (i &gt; cursor-&gt;cell_num) { + memcpy(destination, leaf_node_cell(old_node, i - 1), LEAF_NODE_CELL_SIZE); + } else { + memcpy(destination, leaf_node_cell(old_node, i), LEAF_NODE_CELL_SIZE); + } + } </span></code></pre></div></div> <p>Update cell counts in each node’s header:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ /* Update cell count on both leaf nodes */ + *(leaf_node_num_cells(old_node)) = LEAF_NODE_LEFT_SPLIT_COUNT; + *(leaf_node_num_cells(new_node)) = LEAF_NODE_RIGHT_SPLIT_COUNT; </span></code></pre></div></div> <p>Then we need to update the nodes’ parent. If the original node was the root, it had no parent. In that case, create a new root node to act as the parent. I’ll stub out the other branch for now:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ if (is_node_root(old_node)) { + return create_new_root(cursor-&gt;table, new_page_num); + } else { + printf("Need to implement updating parent after split\n"); + exit(EXIT_FAILURE); + } +} </span></code></pre></div></div> <h2 id="allocating-new-pages">Allocating New Pages</h2> <p>Let’s go back and define a few new functions and constants. 
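</p> <p>To see how the split loop above distributes cells, here is a minimal sketch that applies the same index arithmetic to a plain array of keys. It assumes a node capacity of 13 (the value of <code class="language-plaintext highlighter-rouge">LEAF_NODE_MAX_CELLS</code> printed by <code class="language-plaintext highlighter-rouge">.constants</code>); <code class="language-plaintext highlighter-rouge">split_keys()</code>, <code class="language-plaintext highlighter-rouge">MAX_CELLS</code>, <code class="language-plaintext highlighter-rouge">LEFT_COUNT</code> and <code class="language-plaintext highlighter-rouge">RIGHT_COUNT</code> are names invented for this example.</p>

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CELLS 13                               /* assumed node capacity */
#define RIGHT_COUNT ((MAX_CELLS + 1) / 2)          /* 7 keys go right */
#define LEFT_COUNT ((MAX_CELLS + 1) - RIGHT_COUNT) /* 7 keys stay left */

/*
 * Split a full, sorted array `old_keys` (MAX_CELLS entries) while inserting
 * `key` at `insert_pos`. The lower LEFT_COUNT keys stay in `old_keys`; the
 * upper RIGHT_COUNT keys are written to `new_keys`.
 */
void split_keys(uint32_t* old_keys, uint32_t* new_keys,
                uint32_t insert_pos, uint32_t key) {
  assert(insert_pos <= MAX_CELLS);
  /* Go right to left so no source slot is overwritten before it is read. */
  for (int32_t i = MAX_CELLS; i >= 0; i--) {
    uint32_t* destination_keys = (i >= LEFT_COUNT) ? new_keys : old_keys;
    uint32_t index_within_node = (uint32_t)i % LEFT_COUNT;

    if ((uint32_t)i == insert_pos) {
      destination_keys[index_within_node] = key;          /* the new key */
    } else if ((uint32_t)i > insert_pos) {
      destination_keys[index_within_node] = old_keys[i - 1]; /* shifted */
    } else {
      destination_keys[index_within_node] = old_keys[i];  /* unchanged */
    }
  }
}
```

<p>Starting from the keys 10, 20, ..., 130 and inserting 35 at position 3, the left side ends up holding 10, 20, 30, 35, 40, 50, 60 and the right side holds 70 through 130.</p> <p>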
When we created a new leaf node, we put it in a page decided by <code class="language-plaintext highlighter-rouge">get_unused_page_num()</code>:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/* +Until we start recycling free pages, new pages will always +go onto the end of the database file +*/ +uint32_t get_unused_page_num(Pager* pager) { return pager-&gt;num_pages; } </span></code></pre></div></div> <p>For now, we’re assuming that in a database with N pages, page numbers 0 through N-1 are allocated. Therefore we can always allocate page number N for new pages. Eventually after we implement deletion, some pages may become empty and their page numbers unused. To be more efficient, we could re-allocate those free pages.</p> <h2 id="leaf-node-sizes">Leaf Node Sizes</h2> <p>To keep the tree balanced, we evenly distribute cells between the two new nodes. If a leaf node can hold N cells, then during a split we need to distribute N+1 cells between two nodes (N original cells plus one new one). I’m arbitrarily choosing the left node to get one more cell if N+1 is odd.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+const uint32_t LEAF_NODE_RIGHT_SPLIT_COUNT = (LEAF_NODE_MAX_CELLS + 1) / 2; +const uint32_t LEAF_NODE_LEFT_SPLIT_COUNT = + (LEAF_NODE_MAX_CELLS + 1) - LEAF_NODE_RIGHT_SPLIT_COUNT; </span></code></pre></div></div> <h2 id="creating-a-new-root">Creating a New Root</h2> <p>Here’s how <a href="https://play.google.com/store/books/details/Sibsankar_Haldar_SQLite_Database_System_Design_and?id=9Z6IQQnX1JEC&amp;hl=en">SQLite Database System</a> explains the process of creating a new root node:</p> <blockquote> <p>Let N be the root node. First allocate two nodes, say L and R. Move lower half of N into L and the upper half into R. Now N is empty. Add ⟨L, K,R⟩ in N, where K is the max key in L. Page N remains the root. 
Note that the depth of the tree has increased by one, but the new tree remains height balanced without violating any B+-tree property.</p> </blockquote> <p>At this point, we’ve already allocated the right child and moved the upper half into it. Our function takes the right child as input and allocates a new page to store the left child.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void create_new_root(Table* table, uint32_t right_child_page_num) { + /* + Handle splitting the root. + Old root copied to new page, becomes left child. + Address of right child passed in. + Re-initialize root page to contain the new root node. + New root node points to two children. + */ + + void* root = get_page(table-&gt;pager, table-&gt;root_page_num); + void* right_child = get_page(table-&gt;pager, right_child_page_num); + uint32_t left_child_page_num = get_unused_page_num(table-&gt;pager); + void* left_child = get_page(table-&gt;pager, left_child_page_num); </span></code></pre></div></div> <p>The old root is copied to the left child so we can reuse the root page:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ /* Left child has data copied from old root */ + memcpy(left_child, root, PAGE_SIZE); + set_node_root(left_child, false); </span></code></pre></div></div> <p>Finally we initialize the root page as a new internal node with two children.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ /* Root node is a new internal node with one key and two children */ + initialize_internal_node(root); + set_node_root(root, true); + *internal_node_num_keys(root) = 1; + *internal_node_child(root, 0) = left_child_page_num; + uint32_t left_child_max_key = get_node_max_key(left_child); + *internal_node_key(root, 0) = left_child_max_key; + *internal_node_right_child(root) = 
right_child_page_num; +} </span></code></pre></div></div> <h2 id="internal-node-format">Internal Node Format</h2> <p>Now that we’re finally creating an internal node, we have to define its layout. It starts with the common header, then the number of keys it contains, then the page number of its rightmost child. Internal nodes always have one more child pointer than they have keys. That extra child pointer is stored in the header.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/* + * Internal Node Header Layout + */ +const uint32_t INTERNAL_NODE_NUM_KEYS_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_NUM_KEYS_OFFSET = COMMON_NODE_HEADER_SIZE; +const uint32_t INTERNAL_NODE_RIGHT_CHILD_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_RIGHT_CHILD_OFFSET = + INTERNAL_NODE_NUM_KEYS_OFFSET + INTERNAL_NODE_NUM_KEYS_SIZE; +const uint32_t INTERNAL_NODE_HEADER_SIZE = COMMON_NODE_HEADER_SIZE + + INTERNAL_NODE_NUM_KEYS_SIZE + + INTERNAL_NODE_RIGHT_CHILD_SIZE; </span></code></pre></div></div> <p>The body is an array of cells where each cell contains a child pointer and a key. 
Every key should be the maximum key contained in the child to its left.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+/* + * Internal Node Body Layout + */ +const uint32_t INTERNAL_NODE_KEY_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_CHILD_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_CELL_SIZE = + INTERNAL_NODE_CHILD_SIZE + INTERNAL_NODE_KEY_SIZE; </span></code></pre></div></div> <p>Based on these constants, here’s how the layout of an internal node will look:</p> <table class="image"> <caption align="bottom">Our internal node format</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/internal-node-format.png"><img src="https://cstack.github.io/db_tutorial/assets/images/internal-node-format.png" alt="Our internal node format" /></a></td></tr> </table> <p>Notice our huge branching factor. Because each child pointer / key pair is so small, we can fit 510 keys and 511 child pointers in each internal node. That means we’ll never have to traverse many layers of the tree to find a given key!</p> <table> <thead> <tr> <th># internal node layers</th> <th>max # leaf nodes</th> <th>Size of all leaf nodes</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>511^0 = 1</td> <td>4 KB</td> </tr> <tr> <td>1</td> <td>511^1 = 511</td> <td>~2 MB</td> </tr> <tr> <td>2</td> <td>511^2 = 261,121</td> <td>~1 GB</td> </tr> <tr> <td>3</td> <td>511^3 = 133,432,831</td> <td>~550 GB</td> </tr> </tbody> </table> <p>In actuality, we can’t store a full 4 KB of data per leaf node due to the overhead of the header, keys, and wasted space. But we can search through something like 500 GB of data by loading only 4 pages from disk. 
This is why the B-Tree is a useful data structure for databases.</p> <p>Here are the methods for reading and writing to an internal node:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+uint32_t* internal_node_num_keys(void* node) { + return node + INTERNAL_NODE_NUM_KEYS_OFFSET; +} + +uint32_t* internal_node_right_child(void* node) { + return node + INTERNAL_NODE_RIGHT_CHILD_OFFSET; +} + +uint32_t* internal_node_cell(void* node, uint32_t cell_num) { + return node + INTERNAL_NODE_HEADER_SIZE + cell_num * INTERNAL_NODE_CELL_SIZE; +} + +uint32_t* internal_node_child(void* node, uint32_t child_num) { + uint32_t num_keys = *internal_node_num_keys(node); + if (child_num &gt; num_keys) { + printf("Tried to access child_num %d &gt; num_keys %d\n", child_num, num_keys); + exit(EXIT_FAILURE); + } else if (child_num == num_keys) { + return internal_node_right_child(node); + } else { + return internal_node_cell(node, child_num); + } +} + +uint32_t* internal_node_key(void* node, uint32_t key_num) { + return internal_node_cell(node, key_num) + INTERNAL_NODE_CHILD_SIZE; +} </span></code></pre></div></div> <p>For an internal node, the maximum key is always its right key. For a leaf node, it’s the key at the maximum index:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+uint32_t get_node_max_key(void* node) { + switch (get_node_type(node)) { + case NODE_INTERNAL: + return *internal_node_key(node, *internal_node_num_keys(node) - 1); + case NODE_LEAF: + return *leaf_node_key(node, *leaf_node_num_cells(node) - 1); + } +} </span></code></pre></div></div> <h2 id="keeping-track-of-the-root">Keeping Track of the Root</h2> <p>We’re finally using the <code class="language-plaintext highlighter-rouge">is_root</code> field in the common node header. 
Recall that we use it to decide how to split a leaf node:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">is_node_root</span><span class="p">(</span><span class="n">old_node</span><span class="p">))</span> <span class="p">{</span> <span class="k">return</span> <span class="n">create_new_root</span><span class="p">(</span><span class="n">cursor</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">,</span> <span class="n">new_page_num</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">printf</span><span class="p">(</span><span class="s">"Need to implement updating parent after split</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span> <span class="n">exit</span><span class="p">(</span><span class="n">EXIT_FAILURE</span><span class="p">);</span> <span class="p">}</span> <span class="err">}</span> </code></pre></div></div> <p>Here are the getter and setter:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+bool is_node_root(void* node) { + uint8_t value = *((uint8_t*)(node + IS_ROOT_OFFSET)); + return (bool)value; +} + +void set_node_root(void* node, bool is_root) { + uint8_t value = is_root; + *((uint8_t*)(node + IS_ROOT_OFFSET)) = value; +} </span></code></pre></div></div> <p>Initializing both types of nodes should default to setting <code class="language-plaintext highlighter-rouge">is_root</code> to false:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void initialize_leaf_node(void* node) { set_node_type(node, NODE_LEAF); <span class="gi">+ set_node_root(node, false); </span> *leaf_node_num_cells(node) = 0; } +void initialize_internal_node(void* node) { <span class="gi">+ set_node_type(node, 
NODE_INTERNAL); + set_node_root(node, false); + *internal_node_num_keys(node) = 0; +} </span></code></pre></div></div> <p>We should set <code class="language-plaintext highlighter-rouge">is_root</code> to true when creating the first node of the table:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> // New database file. Initialize page 0 as leaf node. void* root_node = get_page(pager, 0); initialize_leaf_node(root_node); <span class="gi">+ set_node_root(root_node, true); </span> } return table; </code></pre></div></div> <h2 id="printing-the-tree">Printing the Tree</h2> <p>To help us visualize the state of the database, we should update our <code class="language-plaintext highlighter-rouge">.btree</code> metacommand to print a multi-level tree.</p> <p>I’m going to replace the current <code class="language-plaintext highlighter-rouge">print_leaf_node()</code> function</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-void print_leaf_node(void* node) { - uint32_t num_cells = *leaf_node_num_cells(node); - printf("leaf (size %d)\n", num_cells); - for (uint32_t i = 0; i &lt; num_cells; i++) { - uint32_t key = *leaf_node_key(node, i); - printf(" - %d : %d\n", i, key); - } -} </span></code></pre></div></div> <p>with a new recursive function that takes any node, then prints it and its children. It takes an indentation level as a parameter, which increases with each recursive call. 
I’m also adding a tiny helper function to indent.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void indent(uint32_t level) { + for (uint32_t i = 0; i &lt; level; i++) { + printf(" "); + } +} + +void print_tree(Pager* pager, uint32_t page_num, uint32_t indentation_level) { + void* node = get_page(pager, page_num); + uint32_t num_keys, child; + + switch (get_node_type(node)) { + case (NODE_LEAF): + num_keys = *leaf_node_num_cells(node); + indent(indentation_level); + printf("- leaf (size %d)\n", num_keys); + for (uint32_t i = 0; i &lt; num_keys; i++) { + indent(indentation_level + 1); + printf("- %d\n", *leaf_node_key(node, i)); + } + break; + case (NODE_INTERNAL): + num_keys = *internal_node_num_keys(node); + indent(indentation_level); + printf("- internal (size %d)\n", num_keys); + for (uint32_t i = 0; i &lt; num_keys; i++) { + child = *internal_node_child(node, i); + print_tree(pager, child, indentation_level + 1); + + indent(indentation_level + 1); + printf("- key %d\n", *internal_node_key(node, i)); + } + child = *internal_node_right_child(node); + print_tree(pager, child, indentation_level + 1); + break; + } +} </span></code></pre></div></div> <p>And update the call to the print function, passing an indentation level of zero.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> } else if (strcmp(input_buffer-&gt;buffer, ".btree") == 0) { printf("Tree:\n"); <span class="gd">- print_leaf_node(get_page(table-&gt;pager, 0)); </span><span class="gi">+ print_tree(table-&gt;pager, 0, 0); </span> return META_COMMAND_SUCCESS; </code></pre></div></div> <p>Here’s a test case for the new printing functionality!</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'allows printing out the structure of a 3-leaf-node btree' do + script = (1..14).map do |i| + "insert #{i} user#{i} 
person#{i}@example.com" + end + script &lt;&lt; ".btree" + script &lt;&lt; "insert 15 user15 person15@example.com" + script &lt;&lt; ".exit" + result = run_script(script) + + expect(result[14...(result.length)]).to match_array([ + "db &gt; Tree:", + "- internal (size 1)", + " - leaf (size 7)", + " - 1", + " - 2", + " - 3", + " - 4", + " - 5", + " - 6", + " - 7", + " - key 7", + " - leaf (size 7)", + " - 8", + " - 9", + " - 10", + " - 11", + " - 12", + " - 13", + " - 14", + "db &gt; Need to implement searching an internal node", + ]) + end </span></code></pre></div></div> <p>The new format is a little simplified, so we need to update the existing <code class="language-plaintext highlighter-rouge">.btree</code> test:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "db &gt; Executed.", "db &gt; Executed.", "db &gt; Tree:", <span class="gd">- "leaf (size 3)", - " - 0 : 1", - " - 1 : 2", - " - 2 : 3", </span><span class="gi">+ "- leaf (size 3)", + " - 1", + " - 2", + " - 3", </span> "db &gt; " ]) end </code></pre></div></div> <p>Here’s the <code class="language-plaintext highlighter-rouge">.btree</code> output of the new test on its own:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tree: - internal (size 1) - leaf (size 7) - 1 - 2 - 3 - 4 - 5 - 6 - 7 - key 7 - leaf (size 7) - 8 - 9 - 10 - 11 - 12 - 13 - 14 </code></pre></div></div> <p>On the least indented level, we see the root node (an internal node). It says <code class="language-plaintext highlighter-rouge">size 1</code> because it has one key. Indented one level, we see a leaf node, a key, and another leaf node. The key in the root node (7) is the maximum key in the first leaf node. Every key greater than 7 is in the second leaf node.</p> <h2 id="a-major-problem">A Major Problem</h2> <p>If you’ve been following along closely you may notice we’ve missed something big. 
Look what happens if we try to insert one additional row:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db &gt; insert 15 user15 person15@example.com Need to implement searching an internal node </code></pre></div></div> <p>Whoops! Who wrote that TODO message? :P</p> <p>Next time we’ll continue the epic B-tree saga by implementing search on a multi-level tree.</p> Mon, 09 Oct 2017 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part10.html https://cstack.github.io/db_tutorial/parts/part10.html Part 11 - Recursively Searching the B-Tree <p>Last time we ended with an error inserting our 15th row:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db &gt; insert 15 user15 person15@example.com Need to implement searching an internal node </code></pre></div></div> <p>First, replace the code stub with a new function call.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> if (get_node_type(root_node) == NODE_LEAF) { return leaf_node_find(table, root_page_num, key); } else { <span class="gd">- printf("Need to implement searching an internal node\n"); - exit(EXIT_FAILURE); </span><span class="gi">+ return internal_node_find(table, root_page_num, key); </span> } } </code></pre></div></div> <p>This function will perform binary search to find the child that should contain the given key. 
Remember that the key to the right of each child pointer is the maximum key contained by that child.</p> <table class="image"> <caption align="bottom">three-level btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree6.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree6.png" alt="three-level btree" /></a></td></tr> </table> <p>So our binary search compares the key to find and the key to the right of the child pointer:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+Cursor* internal_node_find(Table* table, uint32_t page_num, uint32_t key) { + void* node = get_page(table-&gt;pager, page_num); + uint32_t num_keys = *internal_node_num_keys(node); + + /* Binary search to find index of child to search */ + uint32_t min_index = 0; + uint32_t max_index = num_keys; /* there is one more child than key */ + + while (min_index != max_index) { + uint32_t index = (min_index + max_index) / 2; + uint32_t key_to_right = *internal_node_key(node, index); + if (key_to_right &gt;= key) { + max_index = index; + } else { + min_index = index + 1; + } + } </span></code></pre></div></div> <p>Also remember that the children of an internal node can be either leaf nodes or more internal nodes. After we find the correct child, call the appropriate search function on it:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ uint32_t child_num = *internal_node_child(node, min_index); + void* child = get_page(table-&gt;pager, child_num); + switch (get_node_type(child)) { + case NODE_LEAF: + return leaf_node_find(table, child_num, key); + case NODE_INTERNAL: + return internal_node_find(table, child_num, key); + } +} </span></code></pre></div></div> <h1 id="tests">Tests</h1> <p>Now inserting a key into a multi-node btree no longer results in an error. 
And we can update our test:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> " - 12", " - 13", " - 14", <span class="gd">- "db &gt; Need to implement searching an internal node", </span><span class="gi">+ "db &gt; Executed.", + "db &gt; ", </span> ]) end </code></pre></div></div> <p>I also think it’s time we revisit another test: the one that tries inserting 1400 rows. It still errors, but the error message is new. Right now, our tests don’t handle it very well when the program crashes. If that happens, let’s just use the output we’ve gotten so far:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> raw_output = nil IO.popen("./db test.db", "r+") do |pipe| commands.each do |command| <span class="gd">- pipe.puts command </span><span class="gi">+ begin + pipe.puts command + rescue Errno::EPIPE + break + end </span> end pipe.close_write </code></pre></div></div> <p>And that reveals that our 1400-row test outputs this error:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> end script &lt;&lt; ".exit" result = run_script(script) <span class="gd">- expect(result[-2]).to eq('db &gt; Error: Table full.') </span><span class="gi">+ expect(result.last(2)).to match_array([ + "db &gt; Executed.", + "db &gt; Need to implement updating parent after split", + ]) </span> end </code></pre></div></div> <p>Looks like that’s next on our to-do list!</p> Sun, 22 Oct 2017 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part11.html https://cstack.github.io/db_tutorial/parts/part11.html Part 12 - Scanning a Multi-Level B-Tree <p>We now support constructing a multi-level btree, but we’ve broken <code class="language-plaintext highlighter-rouge">select</code> statements in the process. 
Here’s a test case that inserts 15 rows and then tries to print them.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'prints all rows in a multi-level tree' do + script = [] + (1..15).each do |i| + script &lt;&lt; "insert #{i} user#{i} person#{i}@example.com" + end + script &lt;&lt; "select" + script &lt;&lt; ".exit" + result = run_script(script) + + expect(result[15...result.length]).to match_array([ + "db &gt; (1, user1, person1@example.com)", + "(2, user2, person2@example.com)", + "(3, user3, person3@example.com)", + "(4, user4, person4@example.com)", + "(5, user5, person5@example.com)", + "(6, user6, person6@example.com)", + "(7, user7, person7@example.com)", + "(8, user8, person8@example.com)", + "(9, user9, person9@example.com)", + "(10, user10, person10@example.com)", + "(11, user11, person11@example.com)", + "(12, user12, person12@example.com)", + "(13, user13, person13@example.com)", + "(14, user14, person14@example.com)", + "(15, user15, person15@example.com)", + "Executed.", "db &gt; ", + ]) + end </span></code></pre></div></div> <p>But when we run that test case right now, what actually happens is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db &gt; select (2, user1, person1@example.com) Executed. </code></pre></div></div> <p>That’s weird. It’s only printing one row, and that row looks corrupted (notice the id doesn’t match the username).</p> <p>The weirdness is because <code class="language-plaintext highlighter-rouge">execute_select()</code> begins at the start of the table, and our current implementation of <code class="language-plaintext highlighter-rouge">table_start()</code> returns cell 0 of the root node. But the root of our tree is now an internal node which doesn’t contain any rows. The data that was printed must have been left over from when the root node was a leaf. 
<code class="language-plaintext highlighter-rouge">execute_select()</code> should really start at cell 0 of the leftmost leaf node.</p> <p>So get rid of the old implementation:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-Cursor* table_start(Table* table) { - Cursor* cursor = malloc(sizeof(Cursor)); - cursor-&gt;table = table; - cursor-&gt;page_num = table-&gt;root_page_num; - cursor-&gt;cell_num = 0; - - void* root_node = get_page(table-&gt;pager, table-&gt;root_page_num); - uint32_t num_cells = *leaf_node_num_cells(root_node); - cursor-&gt;end_of_table = (num_cells == 0); - - return cursor; -} </span></code></pre></div></div> <p>And add a new implementation that searches for key 0 (the minimum possible key). Even if key 0 does not exist in the table, this method will return the position of the lowest id (the start of the left-most leaf node).</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+Cursor* table_start(Table* table) { + Cursor* cursor = table_find(table, 0); + + void* node = get_page(table-&gt;pager, cursor-&gt;page_num); + uint32_t num_cells = *leaf_node_num_cells(node); + cursor-&gt;end_of_table = (num_cells == 0); + + return cursor; +} </span></code></pre></div></div> <p>With those changes, it still only prints out one node’s worth of rows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db &gt; select (1, user1, person1@example.com) (2, user2, person2@example.com) (3, user3, person3@example.com) (4, user4, person4@example.com) (5, user5, person5@example.com) (6, user6, person6@example.com) (7, user7, person7@example.com) Executed. 
db &gt; </code></pre></div></div> <p>With 15 entries, our btree consists of one internal node and two leaf nodes, which looks something like this:</p> <table class="image"> <caption align="bottom">structure of our btree</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/btree3.png"><img src="https://cstack.github.io/db_tutorial/assets/images/btree3.png" alt="structure of our btree" /></a></td></tr> </table> <p>To scan the entire table, we need to jump to the second leaf node after we reach the end of the first. To do that, we’re going to save a new field in the leaf node header called “next_leaf”, which will hold the page number of the leaf’s sibling node on the right. The rightmost leaf node will have a <code class="language-plaintext highlighter-rouge">next_leaf</code> value of 0 to denote no sibling (page 0 is reserved for the root node of the table anyway).</p> <p>Update the leaf node header format to include the new field:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t); const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE; <span class="gd">-const uint32_t LEAF_NODE_HEADER_SIZE = - COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE; </span><span class="gi">+const uint32_t LEAF_NODE_NEXT_LEAF_SIZE = sizeof(uint32_t); +const uint32_t LEAF_NODE_NEXT_LEAF_OFFSET = + LEAF_NODE_NUM_CELLS_OFFSET + LEAF_NODE_NUM_CELLS_SIZE; +const uint32_t LEAF_NODE_HEADER_SIZE = COMMON_NODE_HEADER_SIZE + + LEAF_NODE_NUM_CELLS_SIZE + + LEAF_NODE_NEXT_LEAF_SIZE; </span> </code></pre></div></div> <p>Add a method to access the new field:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+uint32_t* leaf_node_next_leaf(void* node) { + return node + LEAF_NODE_NEXT_LEAF_OFFSET; +} </span></code></pre></div></div> <p>Set <code class="language-plaintext 
highlighter-rouge">next_leaf</code> to 0 by default when initializing a new leaf node:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -322,6 +330,7 @@</span> void initialize_leaf_node(void* node) { set_node_type(node, NODE_LEAF); set_node_root(node, false); *leaf_node_num_cells(node) = 0; <span class="gi">+ *leaf_node_next_leaf(node) = 0; // 0 represents no sibling </span> } </code></pre></div></div> <p>Whenever we split a leaf node, update the sibling pointers. The old leaf’s sibling becomes the new leaf, and the new leaf’s sibling becomes whatever used to be the old leaf’s sibling.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -659,6 +671,8 @@</span> void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) { uint32_t new_page_num = get_unused_page_num(cursor-&gt;table-&gt;pager); void* new_node = get_page(cursor-&gt;table-&gt;pager, new_page_num); initialize_leaf_node(new_node); <span class="gi">+ *leaf_node_next_leaf(new_node) = *leaf_node_next_leaf(old_node); + *leaf_node_next_leaf(old_node) = new_page_num; </span></code></pre></div></div> <p>Adding a new field changes a few constants:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> it 'prints constants' do script = [ ".constants", <span class="p">@@ -199,9 +228,9 @@</span> describe 'database' do "db &gt; Constants:", "ROW_SIZE: 293", "COMMON_NODE_HEADER_SIZE: 6", <span class="gd">- "LEAF_NODE_HEADER_SIZE: 10", </span><span class="gi">+ "LEAF_NODE_HEADER_SIZE: 14", </span> "LEAF_NODE_CELL_SIZE: 297", <span class="gd">- "LEAF_NODE_SPACE_FOR_CELLS: 4086", </span><span class="gi">+ "LEAF_NODE_SPACE_FOR_CELLS: 4082", </span> "LEAF_NODE_MAX_CELLS: 13", "db &gt; ", ]) </code></pre></div></div> <p>Now whenever we want to advance the cursor past the end of a leaf node, we can check if the leaf node has a 
sibling. If it does, jump to it. Otherwise, we’re at the end of the table.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -428,7 +432,15 @@</span> void cursor_advance(Cursor* cursor) { cursor-&gt;cell_num += 1; if (cursor-&gt;cell_num &gt;= (*leaf_node_num_cells(node))) { <span class="gd">- cursor-&gt;end_of_table = true; </span><span class="gi">+ /* Advance to next leaf node */ + uint32_t next_page_num = *leaf_node_next_leaf(node); + if (next_page_num == 0) { + /* This was rightmost leaf */ + cursor-&gt;end_of_table = true; + } else { + cursor-&gt;page_num = next_page_num; + cursor-&gt;cell_num = 0; + } </span> } } </code></pre></div></div> <p>After those changes, we actually print 15 rows…</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db &gt; select (1, user1, person1@example.com) (2, user2, person2@example.com) (3, user3, person3@example.com) (4, user4, person4@example.com) (5, user5, person5@example.com) (6, user6, person6@example.com) (7, user7, person7@example.com) (8, user8, person8@example.com) (9, user9, person9@example.com) (10, user10, person10@example.com) (11, user11, person11@example.com) (12, user12, person12@example.com) (13, user13, person13@example.com) (1919251317, 14, on14@example.com) (15, user15, person15@example.com) Executed. 
db &gt; </code></pre></div></div> <p>…but one of them looks corrupted:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1919251317, 14, on14@example.com) </code></pre></div></div> <p>After some debugging, I found out it’s because of a bug in how we split leaf nodes:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -676,7 +690,9 @@</span> void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) { void* destination = leaf_node_cell(destination_node, index_within_node); if (i == cursor-&gt;cell_num) { <span class="gd">- serialize_row(value, destination); </span><span class="gi">+ serialize_row(value, + leaf_node_value(destination_node, index_within_node)); + *leaf_node_key(destination_node, index_within_node) = key; </span> } else if (i &gt; cursor-&gt;cell_num) { memcpy(destination, leaf_node_cell(old_node, i - 1), LEAF_NODE_CELL_SIZE); } else { </code></pre></div></div> <p>Remember that each cell in a leaf node consists of first a key then a value:</p> <table class="image"> <caption align="bottom">Original leaf node format</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/leaf-node-format.png"><img src="https://cstack.github.io/db_tutorial/assets/images/leaf-node-format.png" alt="Original leaf node format" /></a></td></tr> </table> <p>We were writing the new row (value) into the start of the cell, where the key should go. 
That means part of the username was going into the section for id (hence the crazy large id).</p> <p>After fixing that bug, we finally print out the entire table as expected:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db &gt; select (1, user1, person1@example.com) (2, user2, person2@example.com) (3, user3, person3@example.com) (4, user4, person4@example.com) (5, user5, person5@example.com) (6, user6, person6@example.com) (7, user7, person7@example.com) (8, user8, person8@example.com) (9, user9, person9@example.com) (10, user10, person10@example.com) (11, user11, person11@example.com) (12, user12, person12@example.com) (13, user13, person13@example.com) (14, user14, person14@example.com) (15, user15, person15@example.com) Executed. db &gt; </code></pre></div></div> <p>Whew! One bug after another, but we’re making progress.</p> <p>Until next time.</p> Sat, 11 Nov 2017 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part12.html https://cstack.github.io/db_tutorial/parts/part12.html Part 13 - Updating Parent Node After a Split <p>For the next step on our epic b-tree implementation journey, we’re going to handle fixing up the parent node after splitting a leaf. I’m going to use the following example as a reference:</p> <table class="image"> <caption align="bottom">Example of updating internal node</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/updating-internal-node.png"><img src="https://cstack.github.io/db_tutorial/assets/images/updating-internal-node.png" alt="Example of updating internal node" /></a></td></tr> </table> <p>In this example, we add the key “3” to the tree. That causes the left leaf node to split. 
After the split we fix up the tree by doing the following:</p> <ol> <li>Update the first key in the parent to be the maximum key in the left child (“3”)</li> <li>Add a new child pointer / key pair after the updated key <ul> <li>The new pointer points to the new child node</li> <li>The new key is the maximum key in the new child node (“5”)</li> </ul> </li> </ol> <p>So first things first, replace our stub code with two new function calls: <code class="language-plaintext highlighter-rouge">update_internal_node_key()</code> for step 1 and <code class="language-plaintext highlighter-rouge">internal_node_insert()</code> for step 2.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -670,9 +725,11 @@</span> void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) { */ void* old_node = get_page(cursor-&gt;table-&gt;pager, cursor-&gt;page_num); <span class="gi">+ uint32_t old_max = get_node_max_key(old_node); </span> uint32_t new_page_num = get_unused_page_num(cursor-&gt;table-&gt;pager); void* new_node = get_page(cursor-&gt;table-&gt;pager, new_page_num); initialize_leaf_node(new_node); <span class="gi">+ *node_parent(new_node) = *node_parent(old_node); </span> *leaf_node_next_leaf(new_node) = *leaf_node_next_leaf(old_node); *leaf_node_next_leaf(old_node) = new_page_num; <span class="p">@@ -709,8 +766,12 @@</span> void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) { if (is_node_root(old_node)) { return create_new_root(cursor-&gt;table, new_page_num); } else { <span class="gd">- printf("Need to implement updating parent after split\n"); - exit(EXIT_FAILURE); </span><span class="gi">+ uint32_t parent_page_num = *node_parent(old_node); + uint32_t new_max = get_node_max_key(old_node); + void* parent = get_page(cursor-&gt;table-&gt;pager, parent_page_num); + + update_internal_node_key(parent, old_max, new_max); + internal_node_insert(cursor-&gt;table, parent_page_num, 
new_page_num); + return; </span> } } </code></pre></div></div> <p>In order to get a reference to the parent, we need to start recording in each node a pointer to its parent node.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+uint32_t* node_parent(void* node) { return node + PARENT_POINTER_OFFSET; } </span></code></pre></div></div> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -660,6 +675,48 @@</span> void create_new_root(Table* table, uint32_t right_child_page_num) { uint32_t left_child_max_key = get_node_max_key(left_child); *internal_node_key(root, 0) = left_child_max_key; *internal_node_right_child(root) = right_child_page_num; <span class="gi">+ *node_parent(left_child) = table-&gt;root_page_num; + *node_parent(right_child) = table-&gt;root_page_num; </span> } </code></pre></div></div> <p>Now we need to find the affected cell in the parent node. The child doesn’t know its own page number, so we can’t look for that. But it does know its own maximum key, so we can search the parent for that key.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void update_internal_node_key(void* node, uint32_t old_key, uint32_t new_key) { + uint32_t old_child_index = internal_node_find_child(node, old_key); + *internal_node_key(node, old_child_index) = new_key; </span> } </code></pre></div></div> <p>Inside <code class="language-plaintext highlighter-rouge">internal_node_find_child()</code> we’ll reuse some code we already have for finding a key in an internal node. 
Refactor <code class="language-plaintext highlighter-rouge">internal_node_find()</code> to use the new helper method.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-Cursor* internal_node_find(Table* table, uint32_t page_num, uint32_t key) { - void* node = get_page(table-&gt;pager, page_num); </span><span class="gi">+uint32_t internal_node_find_child(void* node, uint32_t key) { + /* + Return the index of the child which should contain + the given key. + */ + </span> uint32_t num_keys = *internal_node_num_keys(node); <span class="gd">- /* Binary search to find index of child to search */ </span><span class="gi">+ /* Binary search */ </span> uint32_t min_index = 0; uint32_t max_index = num_keys; /* there is one more child than key */ <span class="p">@@ -386,7 +394,14 @@</span> Cursor* internal_node_find(Table* table, uint32_t page_num, uint32_t key) { } } <span class="gd">- uint32_t child_num = *internal_node_child(node, min_index); </span><span class="gi">+ return min_index; +} + +Cursor* internal_node_find(Table* table, uint32_t page_num, uint32_t key) { + void* node = get_page(table-&gt;pager, page_num); + + uint32_t child_index = internal_node_find_child(node, key); + uint32_t child_num = *internal_node_child(node, child_index); </span> void* child = get_page(table-&gt;pager, child_num); switch (get_node_type(child)) { case NODE_LEAF: </code></pre></div></div> <p>Now we get to the heart of this article, implementing <code class="language-plaintext highlighter-rouge">internal_node_insert()</code>. 
I’ll explain it in pieces.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void internal_node_insert(Table* table, uint32_t parent_page_num, + uint32_t child_page_num) { + /* + Add a new child/key pair to parent that corresponds to child + */ + + void* parent = get_page(table-&gt;pager, parent_page_num); + void* child = get_page(table-&gt;pager, child_page_num); + uint32_t child_max_key = get_node_max_key(child); + uint32_t index = internal_node_find_child(parent, child_max_key); + + uint32_t original_num_keys = *internal_node_num_keys(parent); + *internal_node_num_keys(parent) = original_num_keys + 1; + + if (original_num_keys &gt;= INTERNAL_NODE_MAX_CELLS) { + printf("Need to implement splitting internal node\n"); + exit(EXIT_FAILURE); + } </span></code></pre></div></div> <p>The index where the new cell (child/key pair) should be inserted depends on the maximum key in the new child. In the example we looked at, <code class="language-plaintext highlighter-rouge">child_max_key</code> would be 5 and <code class="language-plaintext highlighter-rouge">index</code> would be 1.</p> <p>If there’s no room in the internal node for another cell, throw an error. 
We’ll implement that later.</p> <p>Now let’s look at the rest of the function:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ + uint32_t right_child_page_num = *internal_node_right_child(parent); + void* right_child = get_page(table-&gt;pager, right_child_page_num); + + if (child_max_key &gt; get_node_max_key(right_child)) { + /* Replace right child */ + *internal_node_child(parent, original_num_keys) = right_child_page_num; + *internal_node_key(parent, original_num_keys) = + get_node_max_key(right_child); + *internal_node_right_child(parent) = child_page_num; + } else { + /* Make room for the new cell */ + for (uint32_t i = original_num_keys; i &gt; index; i--) { + void* destination = internal_node_cell(parent, i); + void* source = internal_node_cell(parent, i - 1); + memcpy(destination, source, INTERNAL_NODE_CELL_SIZE); + } + *internal_node_child(parent, index) = child_page_num; + *internal_node_key(parent, index) = child_max_key; + } +} </span></code></pre></div></div> <p>Because we store the rightmost child pointer separately from the rest of the child/key pairs, we have to handle things differently if the new child is going to become the rightmost child.</p> <p>In our example, we would get into the <code class="language-plaintext highlighter-rouge">else</code> block. First we make room for the new cell by shifting other cells one space to the right. 
(Although in our example there are 0 cells to shift.)</p> <p>Next, we write the new child pointer and key into the cell determined by <code class="language-plaintext highlighter-rouge">index</code>.</p> <p>To reduce the size of the test cases needed, I’m hardcoding <code class="language-plaintext highlighter-rouge">INTERNAL_NODE_MAX_CELLS</code> for now.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -126,6 +126,8 @@</span> const uint32_t INTERNAL_NODE_KEY_SIZE = sizeof(uint32_t); const uint32_t INTERNAL_NODE_CHILD_SIZE = sizeof(uint32_t); const uint32_t INTERNAL_NODE_CELL_SIZE = INTERNAL_NODE_CHILD_SIZE + INTERNAL_NODE_KEY_SIZE; <span class="gi">+/* Keep this small for testing */ +const uint32_t INTERNAL_NODE_MAX_CELLS = 3; </span></code></pre></div></div> <p>Speaking of tests, our large-dataset test gets past our old stub and gets to our new one:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -65,7 +65,7 @@</span> describe 'database' do result = run_script(script) expect(result.last(2)).to match_array([ "db &gt; Executed.", <span class="gd">- "db &gt; Need to implement updating parent after split", </span><span class="gi">+ "db &gt; Need to implement splitting internal node", </span> ]) </code></pre></div></div> <p>Very satisfying, I know.</p> <p>I’ll add another test that prints a four-leaf-node tree. 
Just so we test more cases than sequential ids, this test will add records in a pseudorandom order.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'allows printing out the structure of a 4-leaf-node btree' do + script = [ + "insert 18 user18 [email protected]", + "insert 7 user7 [email protected]", + "insert 10 user10 [email protected]", + "insert 29 user29 [email protected]", + "insert 23 user23 [email protected]", + "insert 4 user4 [email protected]", + "insert 14 user14 [email protected]", + "insert 30 user30 [email protected]", + "insert 15 user15 [email protected]", + "insert 26 user26 [email protected]", + "insert 22 user22 [email protected]", + "insert 19 user19 [email protected]", + "insert 2 user2 [email protected]", + "insert 1 user1 [email protected]", + "insert 21 user21 [email protected]", + "insert 11 user11 [email protected]", + "insert 6 user6 [email protected]", + "insert 20 user20 [email protected]", + "insert 5 user5 [email protected]", + "insert 8 user8 [email protected]", + "insert 9 user9 [email protected]", + "insert 3 user3 [email protected]", + "insert 12 user12 [email protected]", + "insert 27 user27 [email protected]", + "insert 17 user17 [email protected]", + "insert 16 user16 [email protected]", + "insert 13 user13 [email protected]", + "insert 24 user24 [email protected]", + "insert 25 user25 [email protected]", + "insert 28 user28 [email protected]", + ".btree", + ".exit", + ] + result = run_script(script) </span></code></pre></div></div> <p>As-is, it will output this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- internal (size 3) - leaf (size 7) - 1 - 2 - 3 - 4 - 5 - 6 - 7 - key 1 - leaf (size 8) - 8 - 9 - 10 - 11 - 12 - 13 - 14 - 15 - key 15 - leaf (size 7) - 16 - 17 - 18 - 19 - 20 - 21 - 22 - key 22 - leaf (size 8) - 23 - 24 - 25 - 26 - 27 - 28 - 29 - 30 db &gt; </code></pre></div></div> <p>Look 
carefully and you’ll spot a bug:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> - 5 - 6 - 7 - key 1 </code></pre></div></div> <p>The key there should be 7, not 1!</p> <p>After a bunch of debugging, I discovered this was due to some bad pointer arithmetic.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> uint32_t* internal_node_key(void* node, uint32_t key_num) { <span class="gd">- return internal_node_cell(node, key_num) + INTERNAL_NODE_CHILD_SIZE; </span><span class="gi">+ return (void*)internal_node_cell(node, key_num) + INTERNAL_NODE_CHILD_SIZE; </span> } </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">INTERNAL_NODE_CHILD_SIZE</code> is 4. My intention here was to add 4 bytes to the result of <code class="language-plaintext highlighter-rouge">internal_node_cell()</code>, but since <code class="language-plaintext highlighter-rouge">internal_node_cell()</code> returns a <code class="language-plaintext highlighter-rouge">uint32_t*</code>, it was actually adding <code class="language-plaintext highlighter-rouge">4 * sizeof(uint32_t)</code> bytes, which is 16 bytes instead of 4. I fixed it by casting to a <code class="language-plaintext highlighter-rouge">void*</code> before doing the arithmetic.</p> <p>NOTE! <a href="https://stackoverflow.com/questions/3523145/pointer-arithmetic-for-void-pointer-in-c/46238658#46238658">Pointer arithmetic on void pointers is not part of the C standard and may not work with your compiler</a>. I may do an article in the future on portability, but I’m leaving my void pointer arithmetic for now.</p> <p>Alright. One more step toward a fully-operational btree implementation. The next step should be splitting internal nodes. 
Until then!</p> Sun, 26 Nov 2017 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part13.html https://cstack.github.io/db_tutorial/parts/part13.html Part 14 - Splitting Internal Nodes <p>The next leg of our journey will be splitting internal nodes that are unable to accommodate new keys. Consider the example below:</p> <table class="image"> <caption align="bottom">Example of splitting an internal node</caption> <tr><td><a href="https://cstack.github.io/db_tutorial/assets/images/splitting-internal-node.png"><img src="https://cstack.github.io/db_tutorial/assets/images/splitting-internal-node.png" alt="Example of splitting an internal node" /></a></td></tr> </table> <p>In this example, we add the key “11” to the tree. This will cause our root to split. When splitting an internal node, we will have to do a few things in order to keep everything straight:</p> <ol> <li>Create a sibling node to store (n-1)/2 of the original node’s keys</li> <li>Move these keys from the original node to the sibling node</li> <li>Update the original node’s key in the parent to reflect its new max key after splitting</li> <li>Insert the sibling node into the parent (could result in the parent also being split)</li> </ol> <p>We will begin by replacing our stub code with a call to <code class="language-plaintext highlighter-rouge">internal_node_split_and_insert</code>.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void internal_node_split_and_insert(Table* table, uint32_t parent_page_num, + uint32_t child_page_num); + </span> void internal_node_insert(Table* table, uint32_t parent_page_num, uint32_t child_page_num) { /* <span class="p">@@ -685,25 +714,39 @@</span> void internal_node_insert(Table* table, uint32_t parent_page_num, void* parent = get_page(table-&gt;pager, parent_page_num); void* child = get_page(table-&gt;pager, child_page_num); <span class="gd">- uint32_t child_max_key = get_node_max_key(child); 
</span><span class="gi">+ uint32_t child_max_key = get_node_max_key(table-&gt;pager, child); </span> uint32_t index = internal_node_find_child(parent, child_max_key); uint32_t original_num_keys = *internal_node_num_keys(parent); <span class="gd">- *internal_node_num_keys(parent) = original_num_keys + 1; </span> if (original_num_keys &gt;= INTERNAL_NODE_MAX_CELLS) { <span class="gd">- printf("Need to implement splitting internal node\n"); - exit(EXIT_FAILURE); </span><span class="gi">+ internal_node_split_and_insert(table, parent_page_num, child_page_num); + return; </span> } uint32_t right_child_page_num = *internal_node_right_child(parent); <span class="gi">+ /* + An internal node with a right child of INVALID_PAGE_NUM is empty + */ + if (right_child_page_num == INVALID_PAGE_NUM) { + *internal_node_right_child(parent) = child_page_num; + return; + } + </span> void* right_child = get_page(table-&gt;pager, right_child_page_num); <span class="gi">+ /* + If we are already at the max number of cells for a node, we cannot increment + before splitting. 
Incrementing without inserting a new key/child pair + and immediately calling internal_node_split_and_insert has the effect + of creating a new key at (max_cells + 1) with an uninitialized value + */ + *internal_node_num_keys(parent) = original_num_keys + 1; </span> <span class="gd">- if (child_max_key &gt; get_node_max_key(right_child)) { </span><span class="gi">+ if (child_max_key &gt; get_node_max_key(table-&gt;pager, right_child)) { </span> /* Replace right child */ *internal_node_child(parent, original_num_keys) = right_child_page_num; *internal_node_key(parent, original_num_keys) = <span class="gd">- get_node_max_key(right_child); </span><span class="gi">+ get_node_max_key(table-&gt;pager, right_child); </span> *internal_node_right_child(parent) = child_page_num; </code></pre></div></div> <p>There are three important changes we are making here aside from replacing the stub:</p> <ul> <li>First, <code class="language-plaintext highlighter-rouge">internal_node_split_and_insert</code> is forward-declared because we will be calling <code class="language-plaintext highlighter-rouge">internal_node_insert</code> in its definition to avoid code duplication.</li> <li>In addition, we are moving the logic which increments the parent’s number of keys further down in the function definition to ensure that this does not happen before the split.</li> <li>Finally, we are ensuring that a child node inserted into an empty internal node will become that internal node’s right child without any other operations being performed, since an empty internal node has no keys to manipulate.</li> </ul> <p>The changes above require that we be able to identify an empty node - to this end, we will first define a constant which represents an invalid page number that is the child of every empty node.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+#define INVALID_PAGE_NUM UINT32_MAX </span></code></pre></div></div> <p>Now, 
when an internal node is initialized, we initialize its right child with this invalid page number.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -330,6 +335,12 @@</span> void initialize_internal_node(void* node) { set_node_type(node, NODE_INTERNAL); set_node_root(node, false); *internal_node_num_keys(node) = 0; <span class="gi">+ /* + Necessary because the root page number is 0; by not initializing an internal + node's right child to an invalid page number when initializing the node, we may + end up with 0 as the node's right child, which makes the node a parent of the root + */ + *internal_node_right_child(node) = INVALID_PAGE_NUM; </span> } </code></pre></div></div> <p>This step was made necessary by a problem that the comment above attempts to summarize - when initializing an internal node without explicitly initializing the right child field, the value of that field at runtime could be 0 depending on the compiler or the architecture of the machine on which the program is being executed. 
Since we are using 0 as our root page number, this means that a newly allocated internal node will be a parent of the root.</p> <p>We have introduced some guards in our <code class="language-plaintext highlighter-rouge">internal_node_child</code> function to throw an error in the case of an attempt to access an invalid page.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -186,9 +188,19 @@</span> uint32_t* internal_node_child(void* node, uint32_t child_num) { printf("Tried to access child_num %d &gt; num_keys %d\n", child_num, num_keys); exit(EXIT_FAILURE); } else if (child_num == num_keys) { <span class="gd">- return internal_node_right_child(node); </span><span class="gi">+ uint32_t* right_child = internal_node_right_child(node); + if (*right_child == INVALID_PAGE_NUM) { + printf("Tried to access right child of node, but was invalid page\n"); + exit(EXIT_FAILURE); + } + return right_child; </span> } else { <span class="gd">- return internal_node_cell(node, child_num); </span><span class="gi">+ uint32_t* child = internal_node_cell(node, child_num); + if (*child == INVALID_PAGE_NUM) { + printf("Tried to access child %d of node, but was invalid page\n", child_num); + exit(EXIT_FAILURE); + } + return child; </span> } } </code></pre></div></div> <p>One additional guard is needed in our <code class="language-plaintext highlighter-rouge">print_tree</code> function to ensure that we do not attempt to print an empty node, as that would involve trying to access an invalid page.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -294,15 +305,17 @@</span> void print_tree(Pager* pager, uint32_t page_num, uint32_t indentation_level) { num_keys = *internal_node_num_keys(node); indent(indentation_level); printf("- internal (size %d)\n", num_keys); <span class="gd">- for (uint32_t i = 0; i &lt; num_keys; i++) { - child = 
*internal_node_child(node, i); </span><span class="gi">+ if (num_keys &gt; 0) { + for (uint32_t i = 0; i &lt; num_keys; i++) { + child = *internal_node_child(node, i); + print_tree(pager, child, indentation_level + 1); + + indent(indentation_level + 1); + printf("- key %d\n", *internal_node_key(node, i)); + } + child = *internal_node_right_child(node); </span> print_tree(pager, child, indentation_level + 1); <span class="gd">- - indent(indentation_level + 1); - printf("- key %d\n", *internal_node_key(node, i)); </span> } <span class="gd">- child = *internal_node_right_child(node); - print_tree(pager, child, indentation_level + 1); </span> break; } } </code></pre></div></div> <p>Now for the headliner, <code class="language-plaintext highlighter-rouge">internal_node_split_and_insert</code>. We will first provide it in its entirety, and then break it down by steps.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+void internal_node_split_and_insert(Table* table, uint32_t parent_page_num, + uint32_t child_page_num) { + uint32_t old_page_num = parent_page_num; + void* old_node = get_page(table-&gt;pager,parent_page_num); + uint32_t old_max = get_node_max_key(table-&gt;pager, old_node); + + void* child = get_page(table-&gt;pager, child_page_num); + uint32_t child_max = get_node_max_key(table-&gt;pager, child); + + uint32_t new_page_num = get_unused_page_num(table-&gt;pager); + + /* + Declaring a flag before updating pointers which + records whether this operation involves splitting the root - + if it does, we will insert our newly created node during + the step where the table's new root is created. If it does + not, we have to insert the newly created node into its parent + after the old node's keys have been transferred over. 
We are not + able to do this if the newly created node's parent is not a newly + initialized root node, because in that case its parent may have existing + keys aside from our old node which we are splitting. If that is true, we + need to find a place for our newly created node in its parent, and we + cannot insert it at the correct index if it does not yet have any keys + */ + uint32_t splitting_root = is_node_root(old_node); + + void* parent; + void* new_node; + if (splitting_root) { + create_new_root(table, new_page_num); + parent = get_page(table-&gt;pager,table-&gt;root_page_num); + /* + If we are splitting the root, we need to update old_node to point + to the new root's left child, new_page_num will already point to + the new root's right child + */ + old_page_num = *internal_node_child(parent,0); + old_node = get_page(table-&gt;pager, old_page_num); + } else { + parent = get_page(table-&gt;pager,*node_parent(old_node)); + new_node = get_page(table-&gt;pager, new_page_num); + initialize_internal_node(new_node); + } + + uint32_t* old_num_keys = internal_node_num_keys(old_node); + + uint32_t cur_page_num = *internal_node_right_child(old_node); + void* cur = get_page(table-&gt;pager, cur_page_num); + + /* + First put right child into new node and set right child of old node to invalid page number + */ + internal_node_insert(table, new_page_num, cur_page_num); + *node_parent(cur) = new_page_num; + *internal_node_right_child(old_node) = INVALID_PAGE_NUM; + /* + For each key until you get to the middle key, move the key and the child to the new node + */ + for (int i = INTERNAL_NODE_MAX_CELLS - 1; i &gt; INTERNAL_NODE_MAX_CELLS / 2; i--) { + cur_page_num = *internal_node_child(old_node, i); + cur = get_page(table-&gt;pager, cur_page_num); + + internal_node_insert(table, new_page_num, cur_page_num); + *node_parent(cur) = new_page_num; + + (*old_num_keys)--; + } + + /* + Set child before middle key, which is now the highest key, to be node's right child, + and 
decrement number of keys + */ + *internal_node_right_child(old_node) = *internal_node_child(old_node,*old_num_keys - 1); + (*old_num_keys)--; + + /* + Determine which of the two nodes after the split should contain the child to be inserted, + and insert the child + */ + uint32_t max_after_split = get_node_max_key(table-&gt;pager, old_node); + + uint32_t destination_page_num = child_max &lt; max_after_split ? old_page_num : new_page_num; + + internal_node_insert(table, destination_page_num, child_page_num); + *node_parent(child) = destination_page_num; + + update_internal_node_key(parent, old_max, get_node_max_key(table-&gt;pager, old_node)); + + if (!splitting_root) { + internal_node_insert(table,*node_parent(old_node),new_page_num); + *node_parent(new_node) = *node_parent(old_node); + } +} + </span></code></pre></div></div> <p>The first thing we need to do is create a variable to store the page number of the node we are splitting (the old node from here out). This is necessary because the page number of the old node will change if it happens to be the table’s root node. We also need to remember what the node’s current max is, because that value represents its key in the parent, and that key will need to be updated with the old node’s new maximum after the split occurs.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ uint32_t old_page_num = parent_page_num; + void* old_node = get_page(table-&gt;pager,parent_page_num); + uint32_t old_max = get_node_max_key(table-&gt;pager, old_node); </span></code></pre></div></div> <p>The next important step is the branching logic which depends on whether the old node is the table’s root node. 
We record this in a flag at the start of the function because we will need it later. As the comment explains, if we are not splitting the root, we cannot insert our newly created sibling node into the old node’s parent right away: the sibling does not yet contain any keys, so it would not be placed at the right index among the key/child pairs that may already be present in the parent node.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ uint32_t splitting_root = is_node_root(old_node); + + void* parent; + void* new_node; + if (splitting_root) { + create_new_root(table, new_page_num); + parent = get_page(table-&gt;pager,table-&gt;root_page_num); + /* + If we are splitting the root, we need to update old_node to point + to the new root's left child, new_page_num will already point to + the new root's right child + */ + old_page_num = *internal_node_child(parent,0); + old_node = get_page(table-&gt;pager, old_page_num); + } else { + parent = get_page(table-&gt;pager,*node_parent(old_node)); + new_node = get_page(table-&gt;pager, new_page_num); + initialize_internal_node(new_node); + } </span></code></pre></div></div> <p>Once we have settled the question of splitting or not splitting the root, we begin moving keys from the old node to its sibling. We must first move the old node’s right child into the sibling and set the old node’s right child field to an invalid page number to indicate that it is empty. 
Now, we loop over the old node’s remaining keys, performing the following steps on each iteration:</p> <ol> <li>Obtain a reference to the old node’s key and child at the current index</li> <li>Insert the child into the sibling node</li> <li>Update the child’s parent value to point to the sibling node</li> <li>Decrement the old node’s number of keys</li> </ol> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ uint32_t* old_num_keys = internal_node_num_keys(old_node); + + uint32_t cur_page_num = *internal_node_right_child(old_node); + void* cur = get_page(table-&gt;pager, cur_page_num); + + /* + First put right child into new node and set right child of old node to invalid page number + */ + internal_node_insert(table, new_page_num, cur_page_num); + *node_parent(cur) = new_page_num; + *internal_node_right_child(old_node) = INVALID_PAGE_NUM; + /* + For each key until you get to the middle key, move the key and the child to the new node + */ + for (int i = INTERNAL_NODE_MAX_CELLS - 1; i &gt; INTERNAL_NODE_MAX_CELLS / 2; i--) { + cur_page_num = *internal_node_child(old_node, i); + cur = get_page(table-&gt;pager, cur_page_num); + + internal_node_insert(table, new_page_num, cur_page_num); + *node_parent(cur) = new_page_num; + + (*old_num_keys)--; + } </span></code></pre></div></div> <p>Step 4 is important, because it serves the purpose of “erasing” the key/child pair from the old node. 
Although we are not actually freeing the memory at that byte offset in the old node’s page, decrementing the old node’s number of keys makes that memory location inaccessible, and the bytes will be overwritten the next time a child is inserted into the old node.</p> <p>Also note the behavior of our loop bounds: if our maximum number of internal node keys changes in the future, our logic ensures that both our old node and our sibling node will end up with (n-1)/2 keys after the split, with the one remaining key going to the parent. If an even number is chosen as the maximum number of keys, n/2 keys will remain with the old node while (n-1)/2 will be moved to the sibling node. This logic would be straightforward to revise as needed.</p> <p>Once the keys have been moved, we set the old node’s last remaining child (the one at index <code class="language-plaintext highlighter-rouge">*old_num_keys - 1</code>) as its right child and decrement its number of keys.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ /* + Set child before middle key, which is now the highest key, to be node's right child, + and decrement number of keys + */ + *internal_node_right_child(old_node) = *internal_node_child(old_node,*old_num_keys - 1); + (*old_num_keys)--; </span></code></pre></div></div> <p>We then insert the child node into either the old node or the sibling node depending on the value of its max key.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ uint32_t max_after_split = get_node_max_key(table-&gt;pager, old_node); + + uint32_t destination_page_num = child_max &lt; max_after_split ? 
old_page_num : new_page_num; + + internal_node_insert(table, destination_page_num, child_page_num); + *node_parent(child) = destination_page_num; </span></code></pre></div></div> <p>Finally, we update the old node’s key in its parent, and insert the sibling node and update the sibling node’s parent pointer if necessary.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ update_internal_node_key(parent, old_max, get_node_max_key(table-&gt;pager, old_node)); + + if (!splitting_root) { + internal_node_insert(table,*node_parent(old_node),new_page_num); + *node_parent(new_node) = *node_parent(old_node); + } </span></code></pre></div></div> <p>One important change required to support this new logic is in our <code class="language-plaintext highlighter-rouge">create_new_root</code> function. Before, we were only taking into account situations where the new root’s children would be leaf nodes. If the new root’s children are instead internal nodes, we need to do two things:</p> <ol> <li>Correctly initialize the root’s new children to be internal nodes</li> <li>In addition to the call to memcpy, we need to insert each of the root’s keys into its new left child and update the parent pointer of each of those children</li> </ol> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -661,22 +680,40 @@</span> void create_new_root(Table* table, uint32_t right_child_page_num) { uint32_t left_child_page_num = get_unused_page_num(table-&gt;pager); void* left_child = get_page(table-&gt;pager, left_child_page_num); <span class="gi">+ if (get_node_type(root) == NODE_INTERNAL) { + initialize_internal_node(right_child); + initialize_internal_node(left_child); + } + </span> /* Left child has data copied from old root */ memcpy(left_child, root, PAGE_SIZE); set_node_root(left_child, false); <span class="gi">+ if (get_node_type(left_child) == NODE_INTERNAL) { + 
void* child; + for (int i = 0; i &lt; *internal_node_num_keys(left_child); i++) { + child = get_page(table-&gt;pager, *internal_node_child(left_child,i)); + *node_parent(child) = left_child_page_num; + } + child = get_page(table-&gt;pager, *internal_node_right_child(left_child)); + *node_parent(child) = left_child_page_num; + } + </span> /* Root node is a new internal node with one key and two children */ initialize_internal_node(root); set_node_root(root, true); *internal_node_num_keys(root) = 1; *internal_node_child(root, 0) = left_child_page_num; <span class="gd">- uint32_t left_child_max_key = get_node_max_key(left_child); </span><span class="gi">+ uint32_t left_child_max_key = get_node_max_key(table-&gt;pager, left_child); </span> *internal_node_key(root, 0) = left_child_max_key; *internal_node_right_child(root) = right_child_page_num; *node_parent(left_child) = table-&gt;root_page_num; *node_parent(right_child) = table-&gt;root_page_num; } </code></pre></div></div> <p>Another important change has been made to <code class="language-plaintext highlighter-rouge">get_node_max_key</code>, as mentioned at the beginning of this article. 
Since an internal node’s key represents the maximum of the tree pointed to by the child to its left, and that child can be a tree of arbitrary depth, we need to walk down the right children of that tree until we get to a leaf node, and then take the maximum key of that leaf node.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+uint32_t get_node_max_key(Pager* pager, void* node) { + if (get_node_type(node) == NODE_LEAF) { + return *leaf_node_key(node, *leaf_node_num_cells(node) - 1); + } + void* right_child = get_page(pager,*internal_node_right_child(node)); + return get_node_max_key(pager, right_child); +} </span></code></pre></div></div> <p>We have written a single test to demonstrate that our <code class="language-plaintext highlighter-rouge">print_tree</code> function still works after the introduction of internal node splitting.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gi">+ it 'allows printing out the structure of a 7-leaf-node btree' do + script = [ + "insert 58 user58 [email protected]", + "insert 56 user56 [email protected]", + "insert 8 user8 [email protected]", + "insert 54 user54 [email protected]", + "insert 77 user77 [email protected]", + "insert 7 user7 [email protected]", + "insert 25 user25 [email protected]", + "insert 71 user71 [email protected]", + "insert 13 user13 [email protected]", + "insert 22 user22 [email protected]", + "insert 53 user53 [email protected]", + "insert 51 user51 [email protected]", + "insert 59 user59 [email protected]", + "insert 32 user32 [email protected]", + "insert 36 user36 [email protected]", + "insert 79 user79 [email protected]", + "insert 10 user10 [email protected]", + "insert 33 user33 [email protected]", + "insert 20 user20 [email protected]", + "insert 4 user4 [email protected]", + "insert 35 user35 [email protected]", + "insert 76 user76 [email protected]", + "insert 
49 user49 [email protected]", + "insert 24 user24 [email protected]", + "insert 70 user70 [email protected]", + "insert 48 user48 [email protected]", + "insert 39 user39 [email protected]", + "insert 15 user15 [email protected]", + "insert 47 user47 [email protected]", + "insert 30 user30 [email protected]", + "insert 86 user86 [email protected]", + "insert 31 user31 [email protected]", + "insert 68 user68 [email protected]", + "insert 37 user37 [email protected]", + "insert 66 user66 [email protected]", + "insert 63 user63 [email protected]", + "insert 40 user40 [email protected]", + "insert 78 user78 [email protected]", + "insert 19 user19 [email protected]", + "insert 46 user46 [email protected]", + "insert 14 user14 [email protected]", + "insert 81 user81 [email protected]", + "insert 72 user72 [email protected]", + "insert 6 user6 [email protected]", + "insert 50 user50 [email protected]", + "insert 85 user85 [email protected]", + "insert 67 user67 [email protected]", + "insert 2 user2 [email protected]", + "insert 55 user55 [email protected]", + "insert 69 user69 [email protected]", + "insert 5 user5 [email protected]", + "insert 65 user65 [email protected]", + "insert 52 user52 [email protected]", + "insert 1 user1 [email protected]", + "insert 29 user29 [email protected]", + "insert 9 user9 [email protected]", + "insert 43 user43 [email protected]", + "insert 75 user75 [email protected]", + "insert 21 user21 [email protected]", + "insert 82 user82 [email protected]", + "insert 12 user12 [email protected]", + "insert 18 user18 [email protected]", + "insert 60 user60 [email protected]", + "insert 44 user44 [email protected]", + ".btree", + ".exit", + ] + result = run_script(script) + + expect(result[64...(result.length)]).to match_array([ + "db &gt; Tree:", + "- internal (size 1)", + " - internal (size 2)", + " - leaf (size 7)", + " - 1", + " - 2", + " - 4", + " - 5", + " - 6", + " - 7", + " - 8", + " - key 8", + " - leaf (size 11)", + " - 9", + " - 10", + " 
- 12", + " - 13", + " - 14", + " - 15", + " - 18", + " - 19", + " - 20", + " - 21", + " - 22", + " - key 22", + " - leaf (size 8)", + " - 24", + " - 25", + " - 29", + " - 30", + " - 31", + " - 32", + " - 33", + " - 35", + " - key 35", + " - internal (size 3)", + " - leaf (size 12)", + " - 36", + " - 37", + " - 39", + " - 40", + " - 43", + " - 44", + " - 46", + " - 47", + " - 48", + " - 49", + " - 50", + " - 51", + " - key 51", + " - leaf (size 11)", + " - 52", + " - 53", + " - 54", + " - 55", + " - 56", + " - 58", + " - 59", + " - 60", + " - 63", + " - 65", + " - 66", + " - key 66", + " - leaf (size 7)", + " - 67", + " - 68", + " - 69", + " - 70", + " - 71", + " - 72", + " - 75", + " - key 75", + " - leaf (size 8)", + " - 76", + " - 77", + " - 78", + " - 79", + " - 81", + " - 82", + " - 85", + " - 86", + "db &gt; ", + ]) + end </span></code></pre></div></div> Tue, 23 May 2023 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part14.html https://cstack.github.io/db_tutorial/parts/part14.html Part 15 - Where to go next <p>This project is no longer under active development.</p> <p>But if you’d like to keep learning how to make your own SQLite clone from scratch, or one of many other projects like Docker, Redis, Git or BitTorrent, try <a href="https://app.codecrafters.io/join?via=cstack"><b>CodeCrafters</b></a>.</p> <p>CodeCrafters maintains a <a href="https://github.com/codecrafters-io/build-your-own-x?tab=readme-ov-file#build-your-own-docker">pretty comprehensive list of “Build your own X” tutorials</a> including “Build your own Database”.</p> <p>Plus, if your company has a learning and development budget, you can use it to pay for CodeCrafters’ paid service:</p> <p><a href="https://app.codecrafters.io/join?via=cstack"><img src="https://cstack.github.io/db_tutorial/assets/images/code-crafters.jpeg" alt="" /></a></p> <p>If you use my referral link, I get a commission.</p> Mon, 04 Mar 2024 00:00:00 +0000 https://cstack.github.io/db_tutorial/parts/part15.html 
https://cstack.github.io/db_tutorial/parts/part15.html