Designing an Efficient Algorithmic Caching System: A Walkthrough (1/2)
In our fast-paced world, speed is imperative.
For every 100 ms of latency, Amazon lost 1% in sales.
The logical conclusion is to know what is unlikely to change and store it somewhere you can retrieve it quickly. That’s caching.
The danger is that users/customers receive outdated data, which, in many cases, is worse than not delivering at all, because it’s a flawed foundation for decisions and impressions.
The question is: How can we maximize speed without compromising data integrity?
I asked myself this while developing Embyte. To illustrate why caching is so beneficial in a web environment, compare the typical request time to a website (200–2000 ms) with a simple persistent database retrieval (less than 10 ms).
The difference is 50–500-fold, and that’s not even using an in-memory DB.
The caching algorithm I’m going to walk you through has been used across 5,000 requests and adds virtually no overhead (<0.5 ms).
Starting with the boundaries
Boundaries are simple conditions that speed up the decision process and define limits for our algorithm, giving us baseline safety.
- Is the data not in the cache? -> Make a request
- Is the cached entry too old? -> Make a request
- Is the cached entry too young? -> Don’t make a request
What counts as too “old” or “young” depends on your specific use case. A good starting point is knowing the time difference at which a “considerable” part of the entry has changed, on average. By time difference, I mean the time elapsed since the most recent cache entry. For example, I chose 6 minutes as the lower limit and 2 months as the upper limit.
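To make the boundary checks concrete, here is a minimal C# sketch. The type and member names (`CacheEntry`, `MinAge`, `MaxAge`, `Check`) are illustrative placeholders rather than Embyte’s actual API, and the limits are the example values from above:

```csharp
using System;

// Illustrative placeholder type; a real cache entry would hold more fields.
public record CacheEntry(DateTime Timestamp);

public static class CacheBoundaries
{
    // Example limits from above: 6 minutes lower, ~2 months upper.
    private static readonly TimeSpan MinAge = TimeSpan.FromMinutes(6);
    private static readonly TimeSpan MaxAge = TimeSpan.FromDays(60);

    // Returns true  -> make a request,
    //         false -> use the cache,
    //         null  -> no boundary applies; defer to the cost-based decision.
    public static bool? Check(CacheEntry? entry, DateTime now)
    {
        if (entry is null)
            return true;                 // not in the cache -> request

        TimeSpan age = now - entry.Timestamp;

        if (age > MaxAge)
            return true;                 // too old -> request
        if (age < MinAge)
            return false;                // too young -> use the cache

        return null;                     // in between -> decide by cost
    }
}
```

Everything between the two limits falls through to the cost-based decision that the rest of this article derives.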
The idea
To reach the necessary equations, we need to be aware of what we want to do:
We want to minimize the cost incurred in each query our algorithm handles.
By cost, I mean the total value of the resources we lose, which are either data integrity or time.
If the algorithm picks the cache, it expends virtually no time. If it picks sending a request, it keeps full data integrity. This simplifies the problem: the question is now which factor to eliminate, and when. Obviously the larger one.
Picture a coordinate system with time on the x-axis and cost on the y-axis, where t=0 is the time of the most recent entry.
At t=0, the cost of using the cache is 0, in the sense that no data integrity can have been lost yet. As time passes, this changes: the cost of using the cache increases, and relative to it, the cost of a request decreases.
The tipping point is the time when these two cost curves meet, assuming they do.
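In symbols, writing the two costs as $C_{\text{cache}}(t)$ and $C_{\text{request}}(t)$ (labels I’ll keep using below), the tipping point $t_\Delta$ is the time where

$$C_{\text{cache}}(t_\Delta) = C_{\text{request}}(t_\Delta)$$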
If more time than that elapses, we want to eliminate the potential cost in data integrity, since it is the larger factor. If less time elapses, the opposite is true, and we should focus on time. This gives us a clear decision rule: if “now” and the last entry’s time are closer together than the calculated delta ($t_{\text{now}} < t_{\text{last}} + t_\Delta$), the cache will be used.
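The check itself is tiny; here is a C# sketch (names are illustrative, and computing the delta is the subject of the rest of this article and part two):

```csharp
using System;

public static class CacheDecision
{
    // t_now < t_last + t_delta  ->  the cache is still cheap enough to use.
    public static bool ShouldUseCache(DateTime tLast, DateTime tNow, TimeSpan tDelta)
        => tNow < tLast + tDelta;
}
```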
Approximating the cost functions
The cost of using the cache can be based on the probability of the entry being deprecated, since this is the only factor that is a relative drawback compared to making a request. An exponential function ($e^{-x}$) is an intuitive choice to approximate that probability.
There are 3 more things to consider:
1) Since the entry could have become deprecated at any time between t=0 and now, we treat $P_{\text{deprecated}}(t)$ as a PDF (probability density function) and integrate it over that interval.
2) Since the two costs are not equally large (returning outdated data is far more costly), we introduce a factor $k$ into the equation. Its value depends on the specific use case.
3) Since the steepness of this curve can vary too, we take that into account with another factor $a$, placed in the exponent.
Our current equation looks like this:
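Spelled out from points 1) and 2), with $C_{\text{cache}}(t)$ as a label for the cache-side cost at time $t$ (point 3’s factor $a$ enters in the next step):

$$C_{\text{cache}}(t) = k \int_0^{t} P_{\text{deprecated}}(\tau)\, d\tau$$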
Considering $P_{\text{deprecated}}(t) = e^{-at}$, we get this equation:
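Carrying out the integral from $0$ to $t$, with $C_{\text{cache}}$ labelling the cache-side cost as before:

$$C_{\text{cache}}(t) = k \int_0^{t} e^{-a\tau}\, d\tau = \frac{k}{a}\left(1 - e^{-at}\right)$$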
Solving for $t$ brings in the Lambert W function, $W(x)$, which we used to solve an equation of the form $x\,e^{x} = a$.
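For reference, the Lambert W function is defined by

$$W(x)\,e^{W(x)} = x,$$

so $x = W(a)$ is the solution of $x\,e^{x} = a$.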
And that’s it. If you want to know how to implement it, read part two of this article: a walkthrough of the algorithm’s implementation in C#, where I’ll also show some example output.