There’s no better way to understand what is going on under the hood than getting in there and trying to build it yourself.
So that’s what I’ve done with a Search Engine.
Of course I only spent an hour or so on this, so didn’t implement all of the edge cases, performance, and more complex queries.
Just the core Information Retrival logic. To get in there and understand what is really happening.
Let me walk you through what I have found.
The Inverted Index
The heart of Lexical search, meet the inverted index.
type InvertedIndex map[string][]int
Essentially it is a map from a “term” to the documents that match them!
For a really basic search, without worrying about relevance and ranking, this is really all we need!
For these given search terms, which documents also contain those terms!
Conceptually simple, but incredibly powerful.
If you are familiar with PostgreSQL, GIN is an implementation of an inverted index. That’s right, PostgreSQL could be a search engine if you really wanted.
Indexing into our basic inverted index
We will skip over analysers and tokenisers. There is a lot of complexity here, and it would need its own post. Just understand that before we can search for documents, they need to be normalised in some standard way so that the system knows what is a word, which words are equivilent, etc. etc.
So we have a “document”, a collection of terms, and simply record it in our inverted index as so
func (ii InvertedIndex) IndexDocument(docId int, terms []string) {
seen := make(map[string]bool)
for _, term := range terms {
if seen[term] {
continue
}
ii[term] = append(ii[term], docId)
seen[term] = true
}
}
And we can index some documents into it like
index := make(InvertedIndex)
index.IndexDocument(1, []string{"go", "search"})
In the real world, the steps of getting the data ready for this point are not to be underestimated and are key to quality search results.
Searching over our basic inverted index
Now that we have data in our index, the next thing we want to do is Search.
// Returns a slice of document IDs matching the Searched term
func (ii InvertedIndex) SearchTerm(term string) []int {
return ii[term]
}
Indexing for Relevance
You may have noticed that while we can find all of the documents that contain the term we
- cannot search for more than one term
- have no idea which documents in our search result mention our term only once, and ones which are all about it
For Lexical search, the industry standard ranking algorithm is BM25.
Rather than trying to explain it, I will refer you to Elastic Search’s article on it - https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables.
You may notice there are a few extra pieces that we need to have organised in our index in order to Search using BM25.
We need to record
- Term Frequency: The Frequency a Term occured for the given Document, rather than a binary occured or not occured
- Document Length: The number of Terms for the given Document
- Document Count: The number of Documents in our Index.
Term Frequency and Postings
Rather than just recording whether the term was found in the document or not, BM25 needs the Term Frequency (TF).
This is because we want to rank a document that mentions a term a lot higher than one that doesn’t mention it very much.
In Information Retrieval, we refer to the Document + Term Frequency tuple as the Posting.
type PostingsList map[string]map[int]int
Our Inverted Index now is a Postings List
In terms of data-structure it is still an Inverted Index, but there is a specific Domain meaning to the data we are storing in it that makes it a Postings List.
Indexing Postings
The only difference to what we had before is that we now need to record the frequency of the Term in the Document.
func (pl *PostingsList) IndexDocument(docId int, terms []string) {
for _, term := range terms {
postings := pl[term]
if postings == nil {
postings = make(map[int]int)
pl[term] = postings
}
postings[docId]++
}
}
The indexing works the same as before, but we have stored a term frequency against each document for a given term.
Keeping count of Documents
Since our inverted index is all based around terms, there’s no efficient way to look at it and tell how many documents we have stored.
BM25 needs this information in order to understand how rare a Term is within the entire index.
So we can simply keep count
type PostingsList struct {
docCount int
index map[string]map[int]int
}
func NewPostingsList() *PostingsList {
return &PostingsList{
index: make(map[string]map[int]int),
}
}
func (pl *PostingsList) IndexDocument(docId int, terms []string) {
// Our function from before...
pl.docCount++
}
Knowing the length of a given Document
BM25 uses this to boost the relevance score of a shorter document compared to a longer document with a similar TF.
A 50 word document where all 50 words are the search term is assumed more relevant than a 500 word document with 51 instances of the term.
Again our current structure doesn’t offer a quick lookup for this, so let’s add it in
type PostingsList struct {
docCount int
docLengths map[int]int
index map[string]map[int]int
}
func NewPostingsList() *PostingsList {
return &PostingsList{
index: make(map[string]map[int]int),
docLengths: make(map[int]int),
}
}
func (pl *PostingsList) IndexDocument(docId int, terms []string) {
// ...
pl.docLengths[docId] = len(terms)
}
Putting it together
So now our Index function looks like
func (pl *PostingsList) IndexDocument(docId int, terms []string) {
for _, term := range terms {
postings := pl.index[term]
if postings == nil {
postings = make(map[int]int)
pl.index[term] = postings
}
postings[docId]++
}
pl.docCount++
pl.docLengths[docId] = len(terms)
}
Searching with BM25
Now we have all of the pieces in place, we can finally implement our Search algorithm.
Again rather than typing out and trying to explain the entire algorith, let’s just refer to wikipedia - https://en.wikipedia.org/wiki/Okapi_BM25.
BM25 is natively multi-term, so we can define a signature that looks like
type ScoredDocument struct {
docId int
score float64
}
func (pl *PostingsList) Search(terms []string) []ScoredDocument {
// TODO
}
Inverse Document Frequency
The first part is to calculate the IDF. For this all we need is the total number of documents in our index, and the number of documents that contain the term being Searched for.
func (pl *PostingsList) inverseDocumentFrequency(term string) float64 {
df := float64(len(pl.index[term]))
N := float64(pl.docCount)
return math.Log(1 + (N-df+0.5)/(df+0.5))
}
BM25 uses this to weigh how rare a particular term is. This is important because BM25 will sum together the results of multiple terms, and let it weigh a rare term appropriately high.
Term Frequency
The next core piece of the puzzle that we need is the Frequency of the Term being queried for.
tf := float64(pl.index[term][docId])
Normalised Document Length
BM25 compares the length of this document with the average of the index. Accounting for the fact that the longer a document is the more likely it is for a term to appear multiple times without the relavancy of the document increasing.
const B = 0.75
// ...
norm := 1 - B + B*(float64(pl.docLengths[docId])/avgLen)
b is a parameter which is usually set to 0.75. This is the standard value used by Lucene.
A value of 0 would mean that no normalisation is applied, so document length is completely ignored by the algorithm.
A value of 1 would mean that it is fully applied, meaning long documents are heavily penalised.
k1 parameter
BM25 adjusts the Term Frequency with this parameter. It dictates how quickly the relevance score maxes out as a query term is repeated in a document.
Typical values for k1 are between 1.2 and 2.0, depending on corpus.
Lower values (e.g., (1.2)) penalize keyword stuffing, while higher values (e.g., (2.0)) reward document term repetition.
The standard value is 1.2 so let’s use that in our implementation.
const K1 = 1.2
Calculating a score
At this point, it’s easiest to just show you the code than try to describe it.
func (pl *PostingsList) scoreDocument(docId int, term string, idf float64, avgLen float64) float64 {
tf := float64(pl.index[term][docId])
norm := 1 - B + B*(float64(pl.docLengths[docId])/avgLen)
score := idf * ((tf * (K1 + 1)) / (tf + K1*norm))
return score
}
BM25 Search
Putting it all together.
func (pl *PostingsList) Search(terms []string) []ScoredDocument {
scores := make(map[int]float64)
avgLen := pl.avgDocLen()
// BM25 is multi-term. Where the score for each term is simply summed together.
for _, term := range terms {
postings, ok := pl.index[term]
if !ok {
continue
}
idf := pl.inverseDocumentFrequency(term)
// for each document which was in the posting, meaning it contained the term,
// calculate the BM25 score. This was `scoreDocument` earlier.
// Expanded out here so you can see it in all it's glory.
for docId := range postings {
tf := float64(postings[docId])
norm := 1 - B + B*(float64(pl.docLengths[docId])/avgLen)
score := idf * ((tf * (K1 + 1)) / (tf + K1*norm))
scores[docId] += score
}
}
results := make([]ScoredDocument, 0, len(scores))
for docId, score := range scores {
results = append(results, ScoredDocument{docId, score})
}
// return them in the order that we've ranked them to. 0 is the most relevant document.
sort.Slice(results, func(i, j int) bool {
if results[i].score == results[j].score {
return results[i].docId < results[j].docId // tie-break for determinism
}
return results[i].score > results[j].score
})
return results
}