There’s no better way to understand what is going on under the hood than getting in there and trying to build it yourself.

So that’s what I’ve done with a Search Engine.
This is intentionally simplified and omits many production concerns such as advanced tokenisation, query parsing, compression, and indexing optimisations.

Just the core Information Retrieval logic. To get in there and understand what is really happening.
Let me walk you through what I have found.

The Inverted Index

The heart of Lexical search, meet the inverted index.

type InvertedIndex map[string][]int

Essentially it is a map from a “term” to the documents that match them!

For a really basic search, without worrying about relevance and ranking, this is really all we need!

For these given search terms, which documents also contain those terms!
Conceptually simple, but incredibly powerful.

If you are familiar with PostgreSQL, GIN is an implementation of an inverted index. That’s right, PostgreSQL could be a search engine if you really wanted.

Indexing into our basic inverted index

We will skip over analysers and tokenisers. There is a lot of complexity here, and it would need its own post. Just understand that before we can search for documents, they need to be normalised in some standard way so that the system knows what is a word, which words are equivalent, etc. etc.

So we have a “document”, a collection of terms, and simply record it in our inverted index as so

func (ii InvertedIndex) IndexDocument(docId int, terms []string) {
  seen := make(map[string]bool)

  for _, term := range terms {
    if seen[term] {
      continue
    }

    ii[term] = append(ii[term], docId)
    seen[term] = true
  }
}

And we can index some documents into it like

index := make(InvertedIndex)

index.IndexDocument(1, []string{"go", "search"})

In the real world, the steps of getting the data ready for this point are not to be underestimated and are key to quality search results.

Searching over our basic inverted index

Now that we have data in our index, the next thing we want to do is Search.

// Returns a slice of document IDs matching the Searched term
func (ii InvertedIndex) SearchTerm(term string) []int {
  return ii[term]
}

Indexing for Relevance

You may have noticed that while we can find all of the documents that contain the term we

cannot search for more than one term
have no idea which documents in our search result mention our term only once, and ones which are all about it

For Lexical search, the industry standard ranking algorithm is BM25.
Rather than trying to explain it, I will refer you to Elastic Search’s article on it - https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables.

You may notice there are a few extra pieces that we need to have organised in our index in order to Search using BM25.

We need to record

Term Frequency: The Frequency a Term occurred for the given Document, rather than a binary occurred or not occurred
Document Length: The number of Terms for the given Document
Document Count: The number of Documents in our Index.

Term Frequency and Postings

Rather than just recording whether the term was found in the document or not, BM25 needs the Term Frequency (TF).
This is because we want to rank a document that mentions a term a lot higher than one that doesn’t mention it very much.

In Information Retrieval, we refer to the Document + Term Frequency tuple as the Posting.

type PostingsList map[string]map[int]int

Our Inverted Index now is a Postings List

In terms of data-structure it is still an Inverted Index, but there is a specific Domain meaning to the data we are storing in it that makes it a Postings List.

Indexing Postings

The only difference to what we had before is that we now need to record the frequency of the Term in the Document.

func (pl *PostingsList) IndexDocument(docId int, terms []string) {
  for _, term := range terms {
		postings := pl[term]

    if postings == nil {
      postings = make(map[int]int)
      pl[term] = postings
    }

    postings[docId]++
	}
}

The indexing works the same as before, but we have stored a term frequency against each document for a given term.

Keeping count of Documents

Since our inverted index is all based around terms, there’s no efficient way to look at it and tell how many documents we have stored.
BM25 needs this information in order to understand how rare a Term is within the entire index.

So we can simply keep count

type PostingsList struct {
	docCount   int
	index      map[string]map[int]int
}

func NewPostingsList() *PostingsList {
	return &PostingsList{
		index:      make(map[string]map[int]int),
	}
}

func (pl *PostingsList) IndexDocument(docId int, terms []string) {
  // Our function from before...

  pl.docCount++
}

Knowing the length of a given Document

BM25 uses this to boost the relevance score of a shorter document compared to a longer document with a similar TF.
A 50 word document where all 50 words are the search term is assumed more relevant than a 500 word document with 51 instances of the term.

Again our current structure doesn’t offer a quick lookup for this, so let’s add it in

type PostingsList struct {
	docCount   int
	docLengths map[int]int
	index      map[string]map[int]int
}

func NewPostingsList() *PostingsList {
	return &PostingsList{
		index:      make(map[string]map[int]int),
		docLengths: make(map[int]int),
	}
}

func (pl *PostingsList) IndexDocument(docId int, terms []string) {
  // ...

  pl.docLengths[docId] = len(terms)
}

Putting it together

So now our Index function looks like

func (pl *PostingsList) IndexDocument(docId int, terms []string) {
  for _, term := range terms {
		postings := pl.index[term]

    if postings == nil {
      postings = make(map[int]int)
      pl.index[term] = postings
    }

    postings[docId]++
	}

  pl.docCount++

  pl.docLengths[docId] = len(terms)
}

Searching with BM25

Now we have all of the pieces in place, we can finally implement our Search algorithm.

Again rather than typing out and trying to explain the entire algorithm, let’s just refer to wikipedia - https://en.wikipedia.org/wiki/Okapi_BM25.

BM25 is natively multi-term, so we can define a signature that looks like

type ScoredDocument struct {
	docId int
	score float64
}

func (pl *PostingsList) Search(terms []string) []ScoredDocument {
  // TODO
}

Inverse Document Frequency

The first part is to calculate the IDF. For this all we need is the total number of documents in our index, and the number of documents that contain the term being Searched for.

func (pl *PostingsList) inverseDocumentFrequency(term string) float64 {
	df := float64(len(pl.index[term]))
	N := float64(pl.docCount)

	return math.Log(1 + (N-df+0.5)/(df+0.5))
}

BM25 uses this to weigh how rare a particular term is. This is important because BM25 will sum together the results of multiple terms, and let it weigh a rare term appropriately high.

Term Frequency

The next core piece of the puzzle that we need is the Frequency of the Term being queried for.

tf := float64(pl.index[term][docId])

Normalised Document Length

BM25 compares the length of this document with the average of the index. Accounting for the fact that the longer a document is the more likely it is for a term to appear multiple times without the relevance of the document increasing.

const B = 0.75

// ...

norm := 1 - B + B*(float64(pl.docLengths[docId])/avgLen)

b is a parameter which is usually set to 0.75. This is the standard value used by Lucene. A value of 0 would mean that no normalisation is applied, so document length is completely ignored by the algorithm. A value of 1 would mean that it is fully applied, meaning long documents are heavily penalised.

`k1` parameter

BM25 adjusts the Term Frequency with this parameter. It dictates how quickly the relevance score maxes out as a query term is repeated in a document.

Typical values for k1 are between 1.2 and 2.0, depending on corpus.
Lower values (e.g., (1.2)) penalize keyword stuffing, while higher values (e.g., (2.0)) reward document term repetition.

The standard value is 1.2 so let’s use that in our implementation.

const K1 = 1.2

Calculating a score

At this point, it’s easiest to just show you the code than try to describe it.

func (pl *PostingsList) scoreDocument(docId int, term string, idf float64, avgLen float64) float64 {
  tf := float64(pl.index[term][docId])

  norm := 1 - B + B*(float64(pl.docLengths[docId])/avgLen)

  score := idf * ((tf * (K1 + 1)) / (tf + K1*norm))

  return score
}

BM25 Search

Putting it all together.

func (pl *PostingsList) Search(terms []string) []ScoredDocument {
  scores := make(map[int]float64)

	avgLen := pl.avgDocLen()

  // BM25 is multi-term. Where the score for each term is simply summed together.
	for _, term := range terms {
		postings, ok := pl.index[term]
		if !ok {
			continue
		}

		idf := pl.inverseDocumentFrequency(term)

    // for each document which was in the posting, meaning it contained the term, 
    // calculate the BM25 score. This was `scoreDocument` earlier. 
    // Expanded out here so you can see it in all its glory. 
		for docId := range postings {
			tf := float64(postings[docId])

			norm := 1 - B + B*(float64(pl.docLengths[docId])/avgLen)

			score := idf * ((tf * (K1 + 1)) / (tf + K1*norm))

			scores[docId] += score
		}
	}

	results := make([]ScoredDocument, 0, len(scores))
	for docId, score := range scores {
		results = append(results, ScoredDocument{docId, score})
	}

  // return them in the order that we've ranked them to. 0 is the most relevant document.
	sort.Slice(results, func(i, j int) bool {
		if results[i].score == results[j].score {
			return results[i].docId < results[j].docId // tie-break for determinism
		}
		return results[i].score > results[j].score
	})

	return results
}

The Inverted Index#

Indexing into our basic inverted index#

Searching over our basic inverted index#

Indexing for Relevance#

Term Frequency and Postings#

Indexing Postings#

Keeping count of Documents#

Knowing the length of a given Document#

Putting it together#

Searching with BM25#

Inverse Document Frequency#

Term Frequency#

Normalised Document Length#

k1 parameter#

Calculating a score#

BM25 Search#