Building a Database
I (barely) built a database from scratch; well… a small part of it.
I wouldn’t recommend using this as a database for your next startup, because this is more of an advanced file writer/fetcher.
I got the idea when I was looking through build-your-own[.]org, so I just went with it.
Before I started this project, I had some basic idea about what makes a database, so I kinda had a roadmap on how to proceed.
Moreover, I had built a minimal derivative_calculator as my MCSC202 coursework,
which made things a bit quicker; well, for the initial phase at least.
Components of a typical database
I’ll divide a typical database into two parts (rough flow sketch below):
- frontend: tokenizer (lexical analyzer), parser (syntactic analyzer)
- backend: B-tree, pager, statement executor
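Here’s a rough sketch of how a query might flow through those parts; the types and function names below are placeholders for illustration, not the actual ones from my code.

package main

import "fmt"

// Placeholder types, just to show the shape of the pipeline;
// the real Token and AST types show up in the sections below.
type Token struct{ Type, CurrentToken string }
type Statement interface{}

// frontend, step 1: raw text -> tokens (lexical analysis)
func tokenize(input string) []Token { return nil }

// frontend, step 2: tokens -> AST (syntactic analysis)
func parse(tokens []Token) (Statement, error) { return nil, nil }

// backend: walk the AST and do the actual work via the pager/B-tree
func execute(stmt Statement) error { return nil }

func main() {
    stmt, err := parse(tokenize("SELECT * FROM GirlsMyFriendCanPull;"))
    if err != nil {
        fmt.Println("invalid statement:", err)
        return
    }
    _ = execute(stmt)
}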
frontend => lexical + syntactic
Lexical analyzer (tokenization)
Let’s say I write an SQL-like command:\
SELECT * FROM GirlsMyFriendCanPull;
We need our database engine to recognize each of these keywords.
So, we should be able to detect each keyword, right? Well, we can loop through the list of raw words and assign each one its type.
Once the keywords are assigned their types, they’re called 💫tokens💫.
type TokenType string

type Token struct {
    Type         TokenType
    CurrentToken string
}

const (
    TokenSelect     TokenType = "SELECT"
    TokenAsterisk   TokenType = "ASTERISK"
    TokenIdentifier TokenType = "IDENTIFIER"
    TokenFrom       TokenType = "FROM"
    ...
)
tokens := []Token{}
for _, token := range rawTokens {
    switch token {
    case "SELECT":
        tokens = append(tokens, Token{Type: TokenSelect, CurrentToken: token})
    case "FROM":
        tokens = append(tokens, Token{Type: TokenFrom, CurrentToken: token})
    ...
    default:
        // anything we don't recognize becomes an identifier
        tokens = append(tokens, Token{Type: TokenIdentifier, CurrentToken: token})
    }
}
This will assign each recognized keyword its predefined token type, and the words which are not recognized are assigned as (guess?)… yes, an Identifier.
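One bit I glossed over is where rawTokens comes from in the first place. Here’s a minimal sketch, assuming whitespace-separated words and a trailing semicolon; a real lexer would also need to handle commas, quotes, parentheses and so on.

import "strings"

// splitRaw is an illustrative helper, not the actual lexer:
// it peels off the trailing semicolon and splits on whitespace.
func splitRaw(input string) []string {
    input = strings.TrimSuffix(strings.TrimSpace(input), ";")
    return strings.Fields(input)
}

// splitRaw("SELECT * FROM GirlsMyFriendCanPull;")
// => ["SELECT", "*", "FROM", "GirlsMyFriendCanPull"]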
Now that we have a series of tokens, we can move on to the next step, the Parser.
Syntactic analyzer (parsing)
A syntactic analyzer, or parser, is just a program that turns your series of tokens into an Abstract Syntax Tree (AST).
If you’re building a calculator/interpreter, you need to decide the precedence of operations, e.g. the BDMAS rule: the precedence of * is always above - in an equation.
So, an AST will help you assign precedence to each term (let’s say a node) as well as check the validity of some arbitrary expression.
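To make the calculator case concrete, here’s a tiny sketch of what an expression node could look like in Go; the shape is just illustrative, not taken from my derivative_calculator.

// An illustrative expression node for the calculator case.
type ExprNode struct {
    Op          string    // "-", "*", ... ; empty for a plain number
    Value       float64   // set when the node is a number (a leaf)
    Left, Right *ExprNode // set when the node is an operator
}

// 3 - 5 * 6 would be built as:
//
//        (-)
//       /   \
//     (3)   (*)
//          /   \
//        (5)   (6)
//
// The * sits deeper in the tree, so it's evaluated first: 3 - 30 = -27.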
The expression 3 * 5 */ 6 is obviously not valid. Parsing allows us to check the validity of a statement/equation by using a Grammar as reference.
The grammar of a parser defines the validity of any derived expression.
For our database, a sample grammar would look like:
Statement          ::= CreateDatabaseStmt
                     | UseDatabaseStmt
                     | CreateTableStmt
CreateDatabaseStmt ::= "CREATE" "DATABASE" DatabaseName ";"
UseDatabaseStmt    ::= "USE" DatabaseName ";"
ColumnList         ::= ColumnName { "," ColumnName }
ValueList          ::= Value { "," Value }
A given statement can be either a CreateTableStmt, a CreateDatabaseStmt, or a UseDatabaseStmt, and each of those statements can be built using multiple valid combinations of tokens.
So, grammars are just a way of defining the validity of any statement so the parser has something to check against.
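Following that grammar, the parser’s output can be modelled with one struct per statement kind; these shapes are illustrative, the real structs in my code may differ.

// Illustrative AST shapes matching the sample grammar above.
type CreateDatabaseStmt struct {
    DatabaseName string
}

type UseDatabaseStmt struct {
    DatabaseName string
}

type CreateTableStmt struct {
    TableName string
    Columns   []string // ColumnList from the grammar
}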
Now you might be wondering, well then, how are ASTs built??
This part is not as easy as tokenization. Remember how I said the keywords are transformed into tokens? Now this array of tokens is passed through a parser, which checks it against the grammar rule by rule and builds up the tree.
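To give a rough flavour (a sketch under assumptions, not my actual parser), a recursive-descent parser just walks the token slice and checks it against the rule it expects. Here’s what that might look like for UseDatabaseStmt, assuming a TokenUse constant hiding in the earlier "…" and ignoring the trailing ";" for brevity.

import "fmt"

// parseUseDatabase is an illustrative recursive-descent step for
// UseDatabaseStmt: it expects USE followed by a database name.
func parseUseDatabase(tokens []Token) (*UseDatabaseStmt, error) {
    if len(tokens) < 2 || tokens[0].Type != TokenUse {
        return nil, fmt.Errorf("expected USE, got %v", tokens)
    }
    if tokens[1].Type != TokenIdentifier {
        return nil, fmt.Errorf("expected a database name after USE")
    }
    return &UseDatabaseStmt{DatabaseName: tokens[1].CurrentToken}, nil
}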
Writing in progress