Bloomfilters and their role in cloud storage
As software craftpersons, it is important to have a variety of tools for fixing poor performance. Bloomfilter are generally used to solve a cardinality problem: is an element probably in the list or is it definitely not there? This allows you to skip the expensive work of searching for elements that you can cheaply assert are not in the list without looking for them.
Bloomfilters are a data structure that afford several operations:
- Adding an element
- Testing for an element
- Linear counting (an algorithm to estimate the number of elements in the set)
Note that a bloomfilter doesn't afford to remove an element[1].
Bloomfilters are incredibly useful when writing an encrypted cloud storage application: They are very fast and can easily save on performance by cheaply avoiding very expensive operations: misses in exploring trees or hashmaps, which are often their worst case, and are most expensive when they happen on a slow storage media.
How does it work?
A bloomfilter uses a number of hash functions. These functions output a deterministic but random looking number for every element in the set.
The bloomfilter itself is an array of bits. For example, it may look like the following in C++:
template<typename T>
class bloomfilter {
using hash_list = std::vector<std::function<size_t(const T&)>>;
const hash_list hashes;
std::vector<bool> data;
public:
bloomfilter(size_t size, hash_list&& list)
: hashes(std::forward(list))
, data(size)
{}
void set(const T& new_element);
bool test(const T& potential_element);
double linear_count();
};
Let's break this down, our class is mainly:
- A list of hash functions
- A list of bits (we use
vector<bool>
because it may be more efficient in certain implementations)
We also have 3 methods: one to set the correct bits according to received data, one to test the list of bits, and one to execute the linear counting algorithm.
Note that this implementation is not concurrency friendly, and hence not the exact one used by our encrypted cloud storage, but it is close enough to be used as an example.
Set
template<typename T>
void bloomfilter<T>::set(const T& new_element) {
for(const auto& hash : hashes) {
data[hash(new_element)%data.size()] = true;
}
}
This will set the hashes that fingerprint the provided data. The way this works is extremely simple. In an actual production environment, this would use atomic operations to be able to use the filter in a massively concurrent way.
Test
template<typename T>
bool bloomfilter<T>::test(const T& new_element) {
for(const auto& hash : hashes) {
if(not data[hash(new_element)]%data.size()) return false;
}
return true;
}
Similarly, the test function is simple: if a single of the hashes is not set, then the tested value is not in the set, else it probably is in the set. This is often used in cloud storage software to check if a file or a user actually exist before fetching their actual data.
Bonus: linear counting
template<typename T>
bool bloomfilter<T>::linear_count() {
double n = std::count(data.begin(), data.end(), false);
double m = data.size();
return -m*std::log(n/m)/hashes.size();
}
Linear counting is a peculiar probabilistic algorithm. Note that it works best with a single hashing function in the filter.
\[n=\sum^{i \in data} \begin{cases} \text{$i$~is~true} \Longrightarrow 0 \\ \text{$i$~is~false} \Longrightarrow 1 \\ \end{cases}\]
\[m=|data|\]
\[c = -m\times ln \frac{n}{m}\]
Those equations[2] explain the inner workings of linear counting. You can find in the code an additional factor: we divide by the amount of hashes by element as the formula gives you a value as a number of hashes inserted.
The use of bloomfilters
Bloomfilters are used in cloud storage as well as in applications that would have to handle tremendous amout of data that is hard to access.
For example, a bloomfilter of a relatively small size can fingerprint a large amount of unsafe passwords so that they can be prohibited on client side without ever sending the unsafe or safe passwords through the internet. This is one of the privacy-enhancing features that you want from an encrypted cloud storage. They also allow to limit the actual slow operations count to do: if an element tests true, then it may be in the set, but if it tests false, it never will be in the set.
It is all of those features that make bloomfilters a very common and useful component of computer systems.
Conclusion
We use a lot of different types of data structures, and probabilistic ones are not the least at the party that privacy-oriented cloud storage is.
I invite you to check the information at https://nekoit.xyz/, join us on Discord or Telegram, or follow me on Mastodon Archivist@social.linux.pizza
Note that there exist a version of the bloomfilter known as a counting bloomfilter, this adds support for deletion of elements but is less efficient memory-wise. ↩︎
Probabilistic data structures and algorithms for big data applications by Andrii Gakhov ↩︎