Jan 26, 2018 5 min read

Client-side searching a large data set using tries

Understanding data structures is an important part of being a web developer when non-trivial problems crop up. What works with 50 entries may not work with 50,000 entries.

Recently, I was developing an application that required the ability to search through a set of data for a name match on two fields, name and company. We can imagine the object we are searching as a simple structure:

{
  "name": "Phil Barresi",
  "company": "Barresi Inc.",
  "id": 1,
  "bio": "I like coffee and being a 1-dimensional oversimplification for a code demonstration"
}

This search had to be performed on the frontend of the application without requerying the server. Originally, I had to make the search work through approximately 50 entries; no big deal, really. A simple [].filter(x => x.name.indexOf(query) >= 0 || x.company.indexOf(query) >= 0) was fine, and worked great.

The implementation I released to QA was simple; I gave a text box for the query, and a normal select box with options on matching results.

A few days later, a ticket came through, and the same data source that returned 50 results for my searching was now supposed to return 50,000~ results. The searching was only a small facet of the app, so I converted the data source to return all the new data, and re-released the app to QA, where it was discovered that the search I created was now totally unusable for several reasons:

Browsers do not like painting more than a handful of option elements in a select; searches would now match hundreds or thousands of results, and then the browser would crash painting that many options in the select
Searching 50,000 entries with two indexOf operations at a time is very slow.
Recomputing each time the user typed triggered way too many operations.

I had to rethink my strategy and make some constraints on the application, and build out a new way to approach this problem.

I tackled each problem seperately, because I really wanted to avoid solving problem #2 if possible. Solving problem #1 was the easiest one to immediately tackle: it was time to virtualize my elements.

Virtualizing elements is a practice where you only draw a subset of elements that are visible, and replace which elements are visible. This dramatically reduces the number of items painted and repainted on a screen. I swapped out the select element for an JavaScript solution that would only render out elements that I was currently viewing, and load more in as the user hit the up or down arrows while the dropdown was visible. This helped with the part where the browser crashed, but the operation was still way too slow.

I added a debounce into the operation that queried the result in an attempt to solve problem #3. Debouncers are structures that block operations from occuring until a certain amount of time has elapsed. For example, take two people carrying groceries from the car to the second floor of a home. The first person would take groceries back and forth from the car, and to the foyer of the home. The second would take groceries from the foyer to the second floor. The second person would "debounce" by waiting until the groceries had been brought in for a period of time so they can bring up many grocery bags at once.

In our case, we used the debouncer to wait for the user to stop typing. Each time the user input a key, we fired a timeout to run 2 seconds in the future. If any previous timeouts were currently waiting to be executed, we deleted them. In that way, we would only ever queue up one search at a time, which would only occur after the user had stopped typing for 2 seconds or longer. This was an easy fix, and didn't hurt the usability of the application too much. With how much data there was, we hoped that users wouldn't mind too much.

While the debounce helped, it still wasn't enough. Searching through the dataset was just too damn slow. The page would freeze and the browser would throw up a message that JavaScript was running that froze the page.

Damn.

Solving problem #2 took some thought. After reviewing information about how our users used our application, it became clear that they would commonly search based on the start of strings, and searching on anything except the start of either the name / company field was rare. They would often only type in a few letters as well, and then just hit search to get the results.

The best suited data structure for this use case that I know of is a trie (also known as a radix tree) that would hold references to my entries. A trie is a search tree data structure that stores data based on a string key, where each character in order represents a node. Tries are optimized for fast retrieval time, although insertion time is relatively quick as well.

For example, given two key value pairs of ("hello", "world") with and ("hi", "cat"), our tree would look like:


""
-"h": []
--"e": []
---"l": []
----"l": []
-----"o": ["world"]
--"i": ["cat"]

I built out a generic JavaScript trie package called triever and with the help of my friend Ryan, stress tested the absolute and total hell out of it. We converted the internal library code to use optimized and higher performing code in places where we profiled slowdowns, which was usually caused by using cleaner and more readable code that suffered from performance drawbacks. The purpose of the library was to make adding entries and searching based on these prefix keys extremely fast, so we took out some of the syntactical sugar that comes from using certain native methods like Array.filter and Array.each.

We dropped triever into our search operation and stored each of the 50,000 entires into two tries; one for name, and one for company name. We also only searched for entries greater than 3 characters, and trimmed all non alphanumeric characters from entry names and search queries.

We were able to search for matches in milliseconds each time after that. Rather than searching through all 50,000 entries each time, we were now only searching a small subset of entries based on the maximum key length in a particular prefix path. For example, the h prefix path would cause a user to search for all paths on h, which would only result in 7 traversals and a summation of all results. Even if we had hi, hello, helen, and hugo, for our h paths, we would no longer be searching for any other names in the trie that started with any other letter. In addition, by waiting for at least 3 characters before searching, we were able to ensure that we were going down far fewer paths; by that point, we would be able to filter out the bulk of results and start getting to results that actually mattered.

With all three fixes in place, the search operation took dramatically less time and was giving equally relevant results as the original implementation.