How to increase search engine relevance in ElasticSearch

Written by Alexander Junger, Software Engineer Backend

A couple of sprints ago my colleague and I were tasked with building a search engine. Pulling out your smartphone and opening an app every time you consume your delicious gluten-free green smoothie requires some discipline. Therefore we wanted to delight our users with a fast and smart search that – almost – knows what you ate from the food stains on your fingers.

Our technology of choice was ElasticSearch because of its rich full-text capabilities and horizontal scalability. The fact that it’s well-established in large production systems and has pretty nice documentation also helped.

First steps

The foods in our nutrient database are identified by a name – this can be something like “tomato mozzarella salad,” but also a product name – and by an optional brand name. As you can guess, we created an index and right away loaded it with our database.

{
   "_index":"foods",
   "_type":"food",
   "_id":"6afd355f-12aa-4e61-928b-516c5f3b2a41",
   "_source":{
      "name":"Apple (Gala)",
      "brand":null
   }
}

Now, in ElasticSearch there is a match query. It sounds great, right? You pass it your query text and it does the magic. We ran our first search, “Chicken Curry,” and to our surprise, it worked perfectly. First result in the list, we’re done! (Of course, we weren’t).
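Our very first query looked roughly like this – a minimal sketch against the name field from the document above, not our final production query:

{
   "query":{
      "match":{
         "name":"Chicken Curry"
      }
   }
}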

Next, we searched for “Kelloggs Müsli” and it just didn’t work, even though we had a “Kellogg’s Müsli” in the index (note the missing apostrophe for later). And that was far from the only issue.

Most people think that building a search engine is just about retrieving some documents, and that the choice of which documents to retrieve is pretty obvious. We gradually found out that this choice is only obvious to humans. A search string is not an ID that either matches or doesn’t, and at first your search engine has a very hard time understanding what you deem relevant for a given query. Rome – sorry, Google – wasn’t built in a day.

1. Be prepared to re-index a lot

As you probably know, relevance isn’t a yes/no question to ElasticSearch. The query is often the first place you tweak, but the analysis and structure of your index are at least equally important. Our search has to support 14 different languages, and we had to customize the default language analyzers a lot to fit our search problem. It might seem “good enough” to use just the default analyzers, but they rarely work well for edge cases, because your index has to be adapted to the domain of your search. This adaptation allows ElasticSearch to rank documents more accurately, so that the most relevant ones appear on top.

Coming back to the cereal killer bug, we realized that the German analyzer we used for this German food didn’t trim away the possessive “’s” during index analysis, since possessive apostrophes are not a feature of German words. Of course, the “kelloggs” token from our search could never match the indexed “kellogg’s” token. These sorts of quirks are the domain-specific problems you have to solve as you improve your search. They require a deep understanding of your content and how it will be used, as well as a creative mindset when setting up your filters and analyzers.
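One way to fix it – a sketch of the idea rather than our exact production settings, with illustrative names like strip_apostrophes and german_foods – is a custom analyzer that removes apostrophes with a pattern_replace character filter before tokenization:

{
   "settings":{
      "analysis":{
         "char_filter":{
            "strip_apostrophes":{
               "type":"pattern_replace",
               "pattern":"['’]",
               "replacement":""
            }
         },
         "filter":{
            "light_german_stemmer":{
               "type":"stemmer",
               "language":"light_german"
            }
         },
         "analyzer":{
            "german_foods":{
               "type":"custom",
               "char_filter":[
                  "strip_apostrophes"
               ],
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "german_normalization",
                  "light_german_stemmer"
               ]
            }
         }
      }
   }
}

With an analyzer along these lines applied at both index and search time, “Kelloggs” and “Kellogg’s” end up as the same “kelloggs” token.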

Most importantly, don’t shy away from iterating on your query and index structures (they influence each other!) over and over again. If you have everything checked into your VCS, you can always go back when something doesn’t work. As a consequence of iterating, you’ll also have to re-index a lot, so setting up a development environment that supports this will save you a lot of time. In our case, it looked something like this:

last_updated_at = nil
limit = 1_000

loop do
  # Fetch the next batch of foods, using the updated_at timestamp as a cursor.
  foods = Repository::Food.all(updated_after: last_updated_at, limit: limit)

  # Write the whole batch to the cluster with a single bulk request.
  Repository::Search.bulk_index_foods(foods)

  # A partial batch means we have reached the end of the table.
  break if foods.size < limit
  # Safety net in case the cursor stops advancing.
  raise "endless loop" if foods.last.updated_at == last_updated_at

  last_updated_at = foods.last.updated_at
end

We retrieve our primary food database in batches of 1,000 foods and insert them into the cluster. Note the bulk_index_foods method, which wraps ElasticSearch’s POST /_bulk endpoint and takes significantly less time than individual write operations would.
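For reference, the body that ends up at the _bulk endpoint is newline-delimited JSON: a metadata line per action, followed by the document source on the next line (the second ID below is just a placeholder):

{ "index":{ "_index":"foods", "_type":"food", "_id":"6afd355f-12aa-4e61-928b-516c5f3b2a41" } }
{ "name":"Apple (Gala)", "brand":null }
{ "index":{ "_index":"foods", "_type":"food", "_id":"placeholder-id" } }
{ "name":"Kellogg’s Müsli", "brand":"Kellogg’s" }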

2. Add context to the search query

When most people search for something on Google, they probably don’t realize that their search doesn’t just consist of the string they entered in the search box. To draw an analogy to the real world, a colleague who’s new in town might ask you where to buy groceries. Even though you know many supermarkets in the area, there is no universal answer to that question. Opening hours, their work schedule, distance, a preference for convenient delis or upscale supermarket chains, maybe Google reviews – all of that forms the context for what you consider the best answer to your friend’s question. You subconsciously assign a weighting to those factors, just as you can weight different clauses in a search query. In most use cases, there should be some context that you can incorporate into your search engine. By combining the raw text query with these factors, be it analytics data, geolocation or parameters from the user’s profile, your search engine will deliver much smarter results.

{
   "query":{
      "function_score":{
         "query":{
            "bool":{
               "must":{
                  "match":{
                     "food_name":"Apple"
                  }
               },
               "filter":{
                  "terms":{
                     "region":[
                        "AT",
                        "IT"
                     ]
                  }
               }
            }
         },
         "functions":[
            {
               "filter":{
                  "terms":{
                     "_id":[
                        "gala_apple_98128",
                        "toast_5812931"
                     ]
                  }
               },
               "weight":20
            }
         ]
      }
   }
}

As you can see in this (simplified) example, in addition to the obvious query string parameter, you could take the user’s location and their previously tracked foods as input signals. In this case, the user is searching for apples and we know that she previously tracked an apple of the “Gala” cultivar. We boost that one with a certain weight and it will likely appear as the very first result among the dozens of available apple varieties.

3. Try to ignore the response times (at first)

When you deploy code at Runtastic, the quality of your work will immediately be tested by thousands of users simultaneously hitting the feature with requests. For me, this is one of the coolest aspects of working here, and with great RPM comes great responsibility. Your search engine might feel sluggish during development, and you will be tempted to throw more hardware at the problem right away. I would urge you to start with a rather small cluster and instead focus on kickass, relevant search results first. This will also reduce load, because your users are more likely to find what they need without rephrasing (and producing more search queries that way).

As soon as the quality is there, you can start scaling up your cluster to meet performance requirements. There are many caveats here, like correctly splitting memory between the JVM heap and the kernel’s filesystem cache, or the segmentation of your index. The effects of these depend largely on your use case and there isn’t really a secret formula, but ElasticSearch is… truly elastic, and if you optimize correctly for your use case, you can make it very fast in the end. Together with our operations team, the official documentation and some trial and error, we were able to greatly improve both the response time of a single search request and the maximum throughput in our load tests.
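As one small example of such a knob – a sketch, not our actual production values – read throughput can often be scaled by adding nodes and raising the replica count, e.g. with a PUT /foods/_settings request:

{
   "index":{
      "number_of_replicas":2
   }
}

Every replica can serve search requests, so additional replicas on additional nodes spread the query load – at the cost of more disk usage and slower writes.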

How do you know when it is working?

You will soon recognize different patterns or use cases within your search domain. The patterns that come to mind for our application are product names, generic searches such as “Juice” or “Bread,” and autocomplete searches with partial names such as “Wie” (which should obviously give you “Wiener Schnitzel”). As mentioned before, when you optimize your query or index structure for one of those patterns, the other use cases often regress and produce worse results. Even though this might be frustrating, you’re not “doing it wrong,” you just have to iterate more. In order to know when your search engine is actually delivering the most relevant results overall, it really helps to have a methodical approach.
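As an aside on the autocomplete pattern mentioned above: a common approach (not necessarily the one we ended up shipping) is to index the name field with an edge_ngram filter while analyzing the search input normally, so that a partial query like “wie” matches the indexed prefixes of “wiener”. The names below are illustrative:

{
   "settings":{
      "analysis":{
         "filter":{
            "autocomplete_edge":{
               "type":"edge_ngram",
               "min_gram":2,
               "max_gram":15
            }
         },
         "analyzer":{
            "autocomplete_index":{
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "autocomplete_edge"
               ]
            }
         }
      }
   }
}

The name field would then use autocomplete_index at index time and a plain lowercase analyzer as its search_analyzer, so the query itself is not broken into n-grams.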

A structured evaluation of the relevance of your search engine’s results is, in my opinion, something of a holy grail. It deviates from the more practical engineering topics covered in ElasticSearch documentation and blog posts towards academia (information retrieval fills papers and entire conferences, as you can imagine). Agile development and delivering value to our users is something we strive for at Runtastic, but we are also given enough leeway to dive deep into a topic and do some research. Stay tuned for another blog post about what we learned while implementing an actual evaluation framework for our search engine.
