Should I Convert My Data into a More Efficient Representation?

By Diego Camargo, Software Engineer
Most likely the answer to this question is yes; consider this a reminder. What happens when you search for an element inside a collection? Do you think about the collection's internal implementation? About the advantages and disadvantages it comes with? Should you care?
Choosing a Performant Data Structure and Solution
I’ve been using Ruby for some time now, and I’ve found myself writing inefficient solutions simply because I wasn’t paying attention to the data structure. I wasn’t being careless; I was just more concerned with the methods available on the object I was working with than with its actual composition. I’ve also seen the same pattern while reading other people’s code. The reason I’m writing this is that I’ve worked with other languages in the past, and there this doesn’t happen as often: I’m usually more concerned with what’s going on under the hood than with which language I’m using, which some people might consider unusual. That’s also why I don’t call myself a “Ruby developer” (how I ended up using Ruby is a funny story). Don’t get me wrong, I do like Ruby, but I don’t like being restricted to a single language or technology.
Let’s think about the implementation of Enumerable#include?
In the case of an Array, the implementation probably looks like this:
```ruby
def include?(element)
  self.each { |elm| return true if elm == element }
  false
end
```
And if we think about a Set, it could be done in the following way:
```ruby
# Ruby doesn’t actually do this; it uses a Hash internally to represent a Set
# and stores true as the value part of the key-value tuple. However, you get
# the idea of my representation.
def include?(element)
  # this would have issues if someone wanted to store nil, however it is just an example
  @internal_values[hashing_method(element)] != nil
end
```
If we compare both solutions, the Set looks way faster: its lookup runs in O(1) time, while the Array’s runs in O(n). Nevertheless, generating a Set can be expensive, because computing a hash for each element is. One more thing to take into consideration is that Sets don’t store the same value twice.
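To make those two properties concrete, here is a small sketch (the sample values are made up for illustration):

```ruby
require "set"

numbers = [1, 2, 2, 3, 3, 3]

# Building a Set deduplicates: the same value is never stored twice.
unique = numbers.to_set
unique.size            # => 3

# Membership checks hit the internal Hash, so they don't scan the collection.
unique.include?(3)     # => true
unique.include?(42)    # => false
```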
Sometimes there’s the idea that using elaborate or complicated data structures always pays off in performance; however, generating them can itself be costly. You have to find the balance of where and when to use them. I’ll compare a few implementations of different methods while explaining at a high level what’s going on.

A side note on data structures: you don’t have to know how to implement all of them by heart. But if you know how they work internally, it’s likely that you could implement them, and that you understand the difference between using an Array vs a Set, or an Array vs a Dictionary.
Killing assumptions
I’ll be using the benchmark-ips gem to compare the performance of different implementations. I generated an Array of 1,000 elements and compared searching for the element in the last position of the Array. I’m using Ruby 2.5.1 (MRI); with other implementations you may get different results.
```
# Here’s the output of my terminal
# I’m running this code on a Thinkpad T470p (i7-7700HQ and 24 GB of RAM), if that matters
➜ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
```
i/s means iterations per second
```
# Searching for the last element out of 1,000
Set#include?(last_element):           4678718.5 i/s
Array#include?(last_element):          271223.7 i/s - 17.25x slower
Array#to_set#include?(last_element):     9964.6 i/s - 469.53x slower
```
From the results you can see that the Set is the fastest. Converting the Array to a Set and then searching, however, is slow, because it has to iterate over all of the elements and hash them. If you’re going to search several times, the conversion might pay off depending on the size of the collection and the number of searches; otherwise you should just use an Array.
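The pattern where conversion pays off can be sketched like this (the `find_members` helper is made up for illustration; measure for your own workload):

```ruby
require "set"

# One conversion, many lookups: the O(n) cost of to_set is amortized
# across every subsequent O(1) include? call.
def find_members(haystack, needles)
  lookup = haystack.to_set            # paid once
  needles.select { |n| lookup.include?(n) }
end

haystack = (1..1000).to_a
needles  = [5, 500, 1000, 2000]
find_members(haystack, needles)       # => [5, 500, 1000]
```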

Graphic by Jorge Stolfi
```
# Searching for 500 elements out of 1,000
Comparison:
Set#include?(half_of_the_elements):            20635.0 i/s
Array#to_set#include?(half_of_the_elements):    4970.2 i/s - 4.15x slower
Array#include?(half_of_the_elements):            581.6 i/s - 35.48x slower
```
Here we can see that converting to a Set in place starts to pay off. Sadly and surprisingly, it doesn’t with only 100 elements, although the gap narrows:
```
# Searching for 50 elements out of 100
Comparison:
Set#include?(half_of_the_elements):           255422.0 i/s
Array#include?(half_of_the_elements):          53188.7 i/s - 4.80x slower
Array#to_set#include?(half_of_the_elements):   47724.4 i/s - 5.35x slower
```
Here’s the code with the results, which are not what I expected. Still, it was really interesting to try it out.
After seeing the results, I dug into the Ruby source. The Set class, as mentioned before, uses a Hash internally and only stores true on the value side of the tuple. I then went to the Hash implementation, which is written in C, and couldn’t find much room for improvement; the actual bottleneck seems to be the hashing operation itself.
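A simplified sketch of that delegation (the real set.rb also handles enumeration, comparison, and more; `MiniSet` here is just my illustration of the core idea):

```ruby
# MiniSet mimics how Ruby's Set wraps a Hash: every member becomes a key,
# and the value side of the tuple is always `true`.
class MiniSet
  def initialize
    @hash = {}
  end

  def add(o)
    @hash[o] = true    # store the member as a key; the value carries no data
    self
  end

  def include?(o)
    @hash.include?(o)  # a Hash lookup: hash the key, probe the bucket
  end
end

s = MiniSet.new
s.add(:a).add(:a)      # adding twice still stores a single key
s.include?(:a)         # => true
s.include?(:b)         # => false
```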
An example: suppose you have a service that keeps a whitelist in memory. If you store that whitelist as a Set and receive 20,000 requests a day, you’ll check those requests roughly 75x faster than if you stored it as an Array.
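A sketch of that scenario (the whitelist contents and the `allowed?` helper are hypothetical):

```ruby
require "set"

# Hypothetical in-memory whitelist, built once at boot and kept as a frozen constant.
ALLOWED_IPS = Set.new(["10.0.0.1", "10.0.0.2", "192.168.1.10"]).freeze

# Each incoming request pays one O(1) hash lookup instead of an O(n) Array scan.
def allowed?(ip)
  ALLOWED_IPS.include?(ip)
end

allowed?("10.0.0.2")  # => true
allowed?("8.8.8.8")   # => false
```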
To conclude: Sets are quite useful if you’re going to search the collection several times, but converting an Array to a Set is quite expensive. With this information in hand, reach for a Set if the value is going to stay in memory as a constant or, as I said before, if you’re going to search it repeatedly for one value or several. I also find it important to say that you should always be thinking about what is happening under the hood; you’ll end up writing much better code.

***