How Small Changes Led to Significant Scalability Improvements

Written by Martin Führlinger, Backend Engineer
At Runtastic we face an increase in traffic on our servers every year. We usually have a higher number of requests in the spring and summer than in the autumn and winter. Throughout the year, especially in spring and early summer, there are many sports events, like marathons, the Wings for Life World Run, Run For The Oceans, and other campaigns with a lot of participants locally or around the world.
As backend engineers, we have to keep an eye on the health of our services and we need to scale and improve our running system every year. Scaling can generally be done by either increasing the number of servers handling the requests or making the requests faster. Both actions increase the number of requests which can be handled per minute.
Adding more servers and workers sounds easy, and it is for a backend developer in our setup, as it has to be done by the OPS team. But more importantly, hardware resources are also necessary. And as we all know, hardware resources cost money.
So, the more cost-effective approach is to scale by making the requests faster. Depending on the implementation, this can either be easy, if we find low-hanging fruit, or complex, if we have to rewrite parts of the code.
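As a back-of-the-envelope illustration of why both options raise capacity, here is a minimal sketch. It assumes each worker handles one request at a time, and the worker counts and response times are illustrative numbers, not measurements from our setup.

```ruby
# Rough capacity model, assuming each worker handles one request at a time.
def requests_per_minute(workers:, response_time_ms:)
  workers * (60_000.0 / response_time_ms)
end

baseline     = requests_per_minute(workers: 10, response_time_ms: 75) # 8000.0
more_workers = requests_per_minute(workers: 12, response_time_ms: 75) # 9600.0
faster       = requests_per_minute(workers: 10, response_time_ms: 60) # 10000.0
```

In this simplified model, shaving 15 ms off a 75 ms request gains more capacity than adding two workers to a pool of ten, without any additional hardware.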
This spring my colleague Martin Landl and I invested some time in improving our core services, which I want to share. Since we use NewRelic to analyze and monitor our services, we were able to see the improvements a few minutes after deploying the changes.
Caching
Whenever you read about performance improvements, caching is a big topic. Caching means that you store a value or an object you read or calculated before in an easily accessible storage system (like memcached) so you don’t need to fetch or calculate it again. This works for many situations. So, we added some more caching (we were, of course, already using caching).
Our clients make multiple requests within a few minutes to fetch running sessions, user statistics, user data, News Feed and other things during startup. Since all of these requests pass through our central gateway, we were able to save about 75% of the database calls there by simply caching the current user, which is used for authentication. At the time, this decreased the database queries by about 60000 per minute.

The code around this is simply:
    identities_cache.fetch(cache_key) do
      identity = Identity.find(id)
      identities_cache.store(cache_key, identity) if identity
    end
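To make the effect of this pattern visible, here is a minimal, self-contained sketch of a read-through cache. `SimpleCache` and `IdentityStore` are hypothetical in-memory stand-ins for our memcached-backed cache and the database; unlike the snippet above, this sketch lets `fetch` store the block's result itself.

```ruby
# Hypothetical in-memory stand-in for a memcached-backed cache.
class SimpleCache
  def initialize
    @data = {}
  end

  # Returns the cached value, or runs the block on a miss and
  # stores its result (nil results are not cached).
  def fetch(key)
    return @data[key] if @data.key?(key)

    value = yield
    @data[key] = value unless value.nil?
    value
  end
end

# Hypothetical lookup that counts how often the "database" is hit.
class IdentityStore
  attr_reader :queries

  def initialize(rows)
    @rows = rows
    @queries = 0
  end

  def find(id)
    @queries += 1
    @rows[id]
  end
end

store = IdentityStore.new(42 => { id: 42, name: "runner" })
cache = SimpleCache.new

# Two requests for the same user: only the first one reaches the store.
first  = cache.fetch("identity:42") { store.find(42) }
second = cache.fetch("identity:42") { store.find(42) }
```

After both calls, `store.queries` is 1: the second lookup is served from the cache. A real setup would also need an expiry time and invalidation when the user changes, which this sketch leaves out.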
In another service (this time Java-based) we were able to cache some training plan meta information, which is basically static data, and reduce the response time by nearly 50%. This was done by just adding a few lines of code in a few dedicated classes:
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;

    // on the entity class holding the static data
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY)

    // where the query q is created
    DatabaseUtil.setQueryCacheableHint(q);

Caching promotion data, which only changes every few months, leads to a few hundred requests to the database instead of nearly 5000.

Although many database calls or calculations can be avoided by caching, very often caching is not the way to go. Another good improvement is to avoid unnecessary work.
Avoid unnecessary work
In our services we often use hooks, which automatically run on every save or update of an entity. Some examples are: calculating the plausibility of a running session, calculating the fastest paths within a session to be able to calculate the records (fastest 5K, fastest half marathon,…), geocoding and many other things. Using these hooks is pretty nice, as you don’t have to take care of this code in every use case. But in some cases this additional work is avoidable, because it just does not make sense.
By removing the plausibility calculation during a session that is still live (live-tracking), we saved about 15 ms per live session update request, which is actually 20% of the request (~75 down to about 60 ms). This calculation is unnecessary, because the average speed and other values of that session are not final yet, as it is still live.

As the live update request is the most frequent request in this service (with up to 60000 requests per minute), this even leads to an overall reduction of response time in that service:

And the best part of this improvement is that it is only one line of code in the right place:
return true if session.type == "run_session" && session.live_tracking_active
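To show where such a guard clause sits, here is a minimal sketch of a hook-style check. The `Session` struct, the `PlausibilityCheck` class and the speed rule are hypothetical and only illustrate the early return; the real plausibility calculation is not shown in this article.

```ruby
# Hypothetical session object; the attribute names follow the line above.
Session = Struct.new(:type, :live_tracking_active, :avg_speed_kmh, keyword_init: true)

class PlausibilityCheck
  attr_reader :runs

  def initialize
    @runs = 0
  end

  # Called from a save hook. Returns true when the session is accepted.
  def call(session)
    # Skip the expensive check while the session is still tracked live.
    return true if session.type == "run_session" && session.live_tracking_active

    @runs += 1
    plausible?(session)
  end

  private

  # Placeholder rule, purely illustrative.
  def plausible?(session)
    session.avg_speed_kmh.between?(1, 45)
  end
end

check = PlausibilityCheck.new
live = Session.new(type: "run_session", live_tracking_active: true, avg_speed_kmh: 0)
done = Session.new(type: "run_session", live_tracking_active: false, avg_speed_kmh: 11.5)

live_result = check.call(live) # guard returns early, calculation is skipped
done_result = check.call(done) # finished session, calculation runs
```

Only the finished session triggers the calculation, so `check.runs` is 1 after both calls.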
Another improvement was reducing the amount of code executed. This includes reducing database calls by prefetching entities and using them from a local variable instead of fetching them multiple times. In this case we have a request that returns sessions together with their training plan information. If the response contained a lot of sessions with a training plan assigned, this led to many database queries, as we fetched the training plan information per session. On average we had about 36 database queries per request, but some requests had many more, like this one with 1089 database queries.

A look at the histogram of that request also shows that many requests are pretty fast, but a lot of them are still pretty slow because of that.

The improved code fetches the training plan information for all sessions at once and stores it in a local variable. Assigning the information to each session then uses this local data. After deploying the improvement, we see far fewer database calls on average (5.4) and a much better histogram.
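The difference between fetching per session and prefetching once can be sketched as follows. `TrainingPlanTable`, `SessionRow` and both helper methods are hypothetical; the table is an in-memory stand-in that only counts how many queries it receives.

```ruby
# Hypothetical in-memory "table" that counts queries.
class TrainingPlanTable
  attr_reader :queries

  def initialize(plans)
    @plans = plans
    @queries = 0
  end

  # One query per call, like a per-session lookup.
  def find(id)
    @queries += 1
    @plans[id]
  end

  # One query for many ids at once.
  def find_all(ids)
    @queries += 1
    @plans.slice(*ids)
  end
end

SessionRow = Struct.new(:id, :training_plan_id)

# Before: one lookup per session.
def plans_per_session(sessions, table)
  sessions.map { |s| [s.id, table.find(s.training_plan_id)] }.to_h
end

# After: fetch all needed plans once, then assign from the local hash.
def plans_prefetched(sessions, table)
  plans = table.find_all(sessions.map(&:training_plan_id).uniq)
  sessions.map { |s| [s.id, plans[s.training_plan_id]] }.to_h
end

sessions  = (1..100).map { |i| SessionRow.new(i, i % 3) }
plan_data = { 0 => "Beginner 5K", 1 => "10K", 2 => "Half marathon" }

slow_table = TrainingPlanTable.new(plan_data)
fast_table = TrainingPlanTable.new(plan_data)

slow = plans_per_session(sessions, slow_table)
fast = plans_prefetched(sessions, fast_table)
```

Both variants return the same result, but the first issues 100 queries for 100 sessions while the second issues a single one.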

Avoid unnecessary code
Although avoiding unnecessary work is the best improvement, as code that is not executed is definitely the fastest code, sometimes necessary code can also be improved rather easily. When analysing our services, I was wondering about the time needed to render a huge entity: an entity with a lot of attributes took quite some time to render to JSON. Digging into the serialization of one such entity that was especially slow, I found these lines, which are used for 23 attributes in this entity:
    def format_timestamp(ts)
      format_timestamp_value(@object.public_send(ts)) if @object.public_send(ts)
    end
I thought it might be beneficial to not read the attribute twice, so I changed them to this:
    def format_timestamp(ts)
      val = @object.public_send(ts)
      format_timestamp_value(val) if val
    end
As you can see, this is a rather simple change: instead of calling public_send twice to get the value, store the value in a local variable first. The impact was pretty nice:

It reduced the rendering time of that huge entity (> 100 attributes) from around 40 ms to about 32 ms, which is roughly 20% less time. As this was another pretty central piece of code in this service, it also reduced the overall time from around 105 ms to about 93 ms.
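To see why the change halves the attribute reads, here is a small sketch that counts them. `Entity`, `Serializer` and the ISO 8601 formatting are assumptions made for illustration; the real `format_timestamp_value` is not shown in this article.

```ruby
require "time"

# Hypothetical entity whose attribute reader counts how often it is called.
class Entity
  attr_reader :reads

  def initialize(created_at)
    @created_at = created_at
    @reads = 0
  end

  def created_at
    @reads += 1
    @created_at
  end
end

# Hypothetical serializer with both variants side by side.
class Serializer
  def initialize(object)
    @object = object
  end

  # Old variant: reads the attribute for the check and again for the value.
  def format_timestamp_twice(ts)
    format_timestamp_value(@object.public_send(ts)) if @object.public_send(ts)
  end

  # New variant: reads the attribute once and reuses the local variable.
  def format_timestamp_once(ts)
    val = @object.public_send(ts)
    format_timestamp_value(val) if val
  end

  private

  # Assumed formatting, for illustration only.
  def format_timestamp_value(time)
    time.utc.iso8601
  end
end

time = Time.utc(2019, 5, 1, 12, 0, 0)
old_entity = Entity.new(time)
new_entity = Entity.new(time)

old_output = Serializer.new(old_entity).format_timestamp_twice(:created_at)
new_output = Serializer.new(new_entity).format_timestamp_once(:created_at)
```

Both variants produce the same output, but `old_entity.reads` is 2 while `new_entity.reads` is 1. How much time that saves depends on how expensive the attribute reader is.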

The last improvement I want to mention is moving code so that it is executed only when it is necessary, and not before. We calculate the fastest paths of a running session (as mentioned above: the fastest 5 kilometers, the fastest mile, the fastest half marathon and similar). This is already done on the mobile clients, but as there are also imported sessions (e.g. from Garmin) and manual sessions, we need the same logic in the backend. We also check whether the uploaded session already has all necessary values calculated, so that we can calculate the missing ones. Before the improvement, we always fetched the trace outside the fastest paths calculation (which we abstracted into a gem) and passed it into the method. After changing this to pass in a trace reader, which fetches the trace only when it is needed, we saved about 100 ms on average: instead of around 150 ms per job, it now takes only around 50 ms.
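The idea of passing a reader instead of the already loaded trace can be sketched like this. `TraceReader`, `fastest_paths` and `compute_fastest` are hypothetical names and do not reflect the actual interface of our gem; the calculation itself is a placeholder.

```ruby
# Hypothetical trace reader: loads the trace on first use and memoizes it.
class TraceReader
  attr_reader :fetches

  def initialize(&loader)
    @loader = loader
    @fetches = 0
  end

  def trace
    return @trace if defined?(@trace)

    @fetches += 1
    @trace = @loader.call
  end
end

# Placeholder calculation, purely illustrative.
def compute_fastest(trace, _name)
  trace.size
end

# Hypothetical entry point. `existing` holds values that were already
# calculated, e.g. by the mobile client before uploading.
def fastest_paths(existing, wanted, reader)
  missing = wanted - existing.keys
  return existing if missing.empty? # nothing to do, the trace is never loaded

  trace = reader.trace
  computed = missing.map { |name| [name, compute_fastest(trace, name)] }.to_h
  existing.merge(computed)
end

wanted = [:fastest_5k, :fastest_mile]
points = [[48.30, 14.30], [48.31, 14.31]]

complete_reader = TraceReader.new { points }
complete = fastest_paths({ fastest_5k: 1500, fastest_mile: 420 }, wanted, complete_reader)

partial_reader = TraceReader.new { points }
partial = fastest_paths({ fastest_5k: 1500 }, wanted, partial_reader)
```

For the session that already has all values, `complete_reader.fetches` stays at 0, so the expensive trace fetch is skipped entirely. Only the session with a missing value pays for loading the trace.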

Summary
To summarize our findings and thoughts, I collected some guidelines we followed when searching for bottlenecks and possible improvements.
- Improve code that is used often. Even if it only takes a few milliseconds, if it is done 2 billion times a day, it still adds up to a lot (e.g. the plausibility calculation on every update).
- Improve code that takes a long time (improve and avoid queries)
- Search for entities which are static and cache them (e.g. static data like training plan information, or data which is used often within a short time like the user in our central gateway)
- It is not efficient to spend hours improving something that is done only once a day and speed it up from e.g. 60 seconds to 30 seconds.
- If a request has dozens or hundreds of database queries, there might be something strange happening.

***