Inside Wade

July 16, 2017

Wade is a 1kb JavaScript search library. It allows you to search a set of documents for a query. The query is processed, split into keywords, and then searched for within an index structure.

API

The API is extremely minimal and self-explanatory. It looks like:

const search = Wade(
  ["Moon is fast!", "Also Slash!", "Spark too!", "Is Wade fast?"]
);

search("Moon");

Processor

Wade processes all documents and queries. This works by moving each item through a processing function, which performs the following operations:

  1. Converting the text to lowercase.

  2. Removing punctuation.

  3. Removing stop words (common words such as "is" that carry little meaning).

For example, when given the data:

["Moon is fast!", "Slash is fast also!", "Spark is fast too!", "Is Wade fast?"]

It is processed into:

["moon fast", "slash fast", "spark fast", "wade fast"]

Notice how everything is lowercase, does not contain punctuation, and does not contain stop words.
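These steps can be sketched as a small function. This is not Wade's actual implementation, and its real stop word list is much longer; "is", "also", and "too" are shown here only because they appear in the example data.

```javascript
// Minimal sketch of the processing pipeline (hypothetical stop word list).
const STOP_WORDS = new Set(["is", "also", "too"]);

const process = (text) =>
  text
    .toLowerCase()                 // lowercase everything
    .replace(/[^a-z0-9\s]/g, "")   // strip punctuation
    .split(/\s+/)                  // split into words
    .filter((word) => word.length !== 0 && !STOP_WORDS.has(word))
    .join(" ");                    // drop stop words and rejoin

process("Moon is fast!"); // "moon fast"
```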

Index

After processing the data, an index can be generated to allow for optimized searches within a text. Generating this index involves:

  1. Processing each document.

  2. Splitting each document into a multiset of terms.

  3. Generating a trie of the terms and storing the indexes of the corresponding documents containing the term.

  4. Storing the weighted significance of the term in the documents.

A set is used to store multiple unique items. The cardinality of a set is the number of elements in the set.

On the other hand, a multiset is used to store multiple items (including duplicates). The multiplicity of an item in a multiset is the number of instances of that item in the multiset.
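As a sketch, a multiset of terms can be modeled as a plain object mapping each term to its multiplicity, with the cardinality recovered by summing the counts:

```javascript
// A multiset modeled as a map from term to multiplicity.
const toMultiset = (document) => {
  const multiset = {};
  for (const term of document.split(" ")) {
    multiset[term] = (multiset[term] || 0) + 1;
  }
  return multiset;
};

// Cardinality: total number of items, counting duplicates.
const cardinality = (multiset) =>
  Object.values(multiset).reduce((sum, count) => sum + count, 0);

toMultiset("fast fast moon"); // { fast: 2, moon: 1 }
```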

Wade uses a special function to find how significant a term is to the data. This function takes various factors into account, including the length of the documents, the length of a specific document with the term, and the number of occurrences of the term.

The significance of the term t in the set of documents d can be represented by the function:

wm(t, d) = 1.5 - \frac{\sum_{i=0}^{|d|} \left[\frac{\mu(t)}{\sum_{p \in d_i} \mu(p)}\right]}{|d|}

  1. d is a set of multisets d_i

  2. \exists d_i(t \in d_i) \lor \nexists d_i(t \in d_i)

  3. \mu(t) is the multiplicity of t in the multiset d_i

  4. \sum_{p \in d_i} \mu(p) is the cardinality of the multiset d_i

  5. \frac{\mu(t)}{\sum_{p \in d_i} \mu(p)} is the ratio of occurrences of the term t in the document d_i to the total number of terms in the document d_i

This works by finding the average of how often the term appears within a document. The significance is then normalized between 0.5 and 1.5, so that it is higher when the average occurrence is lower. This amplifies the significance of rarer terms.
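A minimal sketch of this significance function, operating on already-processed documents (this is an illustration of the formula, not Wade's actual implementation):

```javascript
// Sketch of wm(t, d): 1.5 minus the average per-document occurrence ratio.
const significance = (term, documents) => {
  let totalRatio = 0;
  for (const doc of documents) {
    const terms = doc.split(" ");
    const occurrences = terms.filter((t) => t === term).length;
    totalRatio += occurrences / terms.length; // μ(t) / cardinality of d_i
  }
  return 1.5 - totalRatio / documents.length;
};

const docs = ["moon fast", "slash fast", "spark fast", "wade fast"];
significance("fast", docs); // 1
significance("moon", docs); // 1.375
```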

For example, when given the processed data:

["moon fast", "slash fast", "spark fast", "wade fast"]

An index is generated:

{
  "m": {
    "o": {
      "o": {
        "n": {
          "data": [
            1.375,
            0
          ]
        }
      }
    }
  },
  "f": {
    "a": {
      "s": {
        "t": {
          "data": [
            1,
            0,
            1,
            2,
            3
          ]
        }
      }
    }
  },
  "s": {
    "l": {
      "a": {
        "s": {
          "h": {
            "data": [
              1.375,
              1
            ]
          }
        }
      }
    },
    "p": {
      "a": {
        "r": {
          "k": {
            "data": [
              1.375,
              2
            ]
          }
        }
      }
    }
  },
  "w": {
    "a": {
      "d": {
        "e": {
          "data": [
            1.375,
            3
          ]
        }
      }
    }
  }
}

This is a map of every single character so that they can be searched through individually. At the end of a specific term, there is a data property. The first item in this array is the significance of the term. The other items are the indexes of the documents in which the term appears.

Notice how the unique terms "moon", "slash", "spark", and "wade" all have a higher weight of 1.375. In contrast, the term "fast" appears in every document, making up half of each document's terms, and has a lower weight of 1.
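The index construction can be sketched as follows. The significance function is passed in as a parameter, and the nesting mirrors the generated index shown above; this is a simplified illustration rather than Wade's actual code.

```javascript
// Sketch of building the trie index: each character becomes a nested node,
// and the terminal node stores [significance, ...documentIndexes].
const buildIndex = (documents, significance) => {
  const index = {};
  documents.forEach((doc, docIndex) => {
    for (const term of new Set(doc.split(" "))) {
      let node = index;
      for (const char of term) {
        node = node[char] = node[char] || {};
      }
      node.data = node.data || [significance(term)];
      node.data.push(docIndex);
    }
  });
  return index;
};
```

Calling `buildIndex` on the processed documents with a significance function implementing wm reproduces the structure above, e.g. `index.m.o.o.n.data` is `[1.375, 0]`.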

Searching works by:

  1. Processing the query.

  2. Splitting the query into a multiset of terms.

  3. Searching the index for each term.

To correctly increment the score, the relevance must be taken into account. The relevance of the term t in the query q to the set of documents d can be represented by the function:

wr(t, q, d) = wm(t, d)\left[\frac{1}{\sum_{p \in q} \mu(p)}\right]

  1. q is a multiset

  2. \mu(p) is the multiplicity of p in the multiset q

  3. \sum_{p \in q} \mu(p) is the cardinality of the multiset q

This can be used to represent how much each term should affect the score of the query. It works by taking the significance of the term and the length of the query into account.
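As a sketch, the relevance calculation only needs the term's significance and the number of terms in the query:

```javascript
// Sketch of wr(t, q, d): the term's significance scaled by the
// reciprocal of the query's cardinality.
const relevance = (termSignificance, queryTerms) =>
  termSignificance * (1 / queryTerms.length);

relevance(1, ["fast", "s"]);     // 0.5
relevance(1.375, ["fast", "s"]); // 0.6875
```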

For example, let's say you are searching in real-time and have a partial query:

Fast S

This is processed and split into terms:

["fast", "s"]

All terms except for the last term are searched for exact matches. This means that Wade will only increment the score for documents that have the term in them.

First, the term "fast" is searched for in the index. The process looks like:

  1. Check if the letter "f" is in the index. Set it to the current node and continue.

  2. Set the current node to the "a" node.

  3. Set the current node to the "s" node.

  4. Set the current node to the "t" node.

At this point, the current node is:

"t": {
  "data": [
    1,
    0,
    1,
    2,
    3
  ]
}

It has a data property, meaning that this term was present in at least one document. We store the significance (1), and increment the score for the indexes.

In this case, the indexes are [0, 1, 2, 3]. The current relevance can be evaluated using the wr function by setting t to "fast", q to the query, and d to the documents.

wr(t, q, d) = 1\left[\frac{1}{2}\right] = 0.5

As a result, we update the score for all documents containing the term "fast" by 0.5. So far, the results are:

[
  {
    index: 0,
    score: 0.5
  },
  {
    index: 1,
    score: 0.5
  },
  {
    index: 2,
    score: 0.5
  },
  {
    index: 3,
    score: 0.5
  }
]
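The exact-match walk described in the steps above can be sketched as a loop that follows one node per character (a simplified illustration, not Wade's actual code):

```javascript
// Sketch of the exact-match walk: follow one node per character and
// return the terminal node's data, or null if the path breaks.
const exactLookup = (index, term) => {
  let node = index;
  for (const char of term) {
    node = node[char];
    if (node === undefined) {
      return null; // a character was missing: no document contains this term
    }
  }
  return node.data || null;
};
```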

Next, we have the term "s". This term is treated as a prefix, and every document containing a term with the prefix "s" will have its score updated.

After checking if "s" is in the index, we find the node:

"s": {
  "l": {
    "a": {
      "s": {
        "h": {
          "data": [
            1.375,
            1
          ]
        }
      }
    }
  },
  "p": {
    "a": {
      "r": {
        "k": {
          "data": [
            1.375,
            2
          ]
        }
      }
    }
  }
}

After looking through each individual node, we find data properties after "slash" and "spark":

// "slash"
"h": {
  "data": [
    1.375,
    1
  ]
}

// "spark"
"k": {
  "data": [
    1.375,
    2
  ]
}

The relevance for the terms "slash" and "spark" can be calculated:

wr(t, q, d) = 1.375\left[\frac{1}{2}\right] = 0.6875

With this value, we can update the score for the documents containing the terms "slash" and "spark" by 0.6875.
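This prefix search can be sketched as a depth-first collection of every data property beneath the prefix node (again, an illustration rather than Wade's actual code):

```javascript
// Sketch of the prefix search: walk to the prefix node, then collect
// every data property in the subtree below it.
const prefixLookup = (index, prefix) => {
  let node = index;
  for (const char of prefix) {
    node = node[char];
    if (node === undefined) {
      return []; // no term starts with this prefix
    }
  }
  const results = [];
  const collect = (current) => {
    for (const key in current) {
      if (key === "data") {
        results.push(current.data); // a complete term ends here
      } else {
        collect(current[key]); // descend into the next character
      }
    }
  };
  collect(node);
  return results;
};
```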

After updating the results using the indexes, the final results are:

[
  // Moon is fast!
  {
    index: 0,
    score: 0.5
  },

  // Slash is fast also!
  {
    index: 1,
    score: 1.1875
  },

  // Spark is fast too!
  {
    index: 2,
    score: 1.1875
  },

  // Is Wade fast?
  {
    index: 3,
    score: 0.5
  }
]

The most relevant results were:

"Slash is fast also!"
"Spark is fast too!"

They both have a term with the prefix "s", and the term "fast". The rest of the results were given a lower score of 0.5 because they only had the term "fast" in them.

Conclusion

In a nutshell, Wade processes data, splits it into terms, and creates an index containing the indexes of the items along with how relevant each term is to the data. A query is also processed and split into terms. These terms are individually searched within the index and the scores are updated as needed.

Wade's source is available on GitHub.
