Implementing a fast and close-to-optimal caching algorithm in C#: A Walkthrough (2/2)
Caching can increase an application’s speed by several dozen times. The algorithm we’ll implement is fast (it runs in under 1 ms), makes its decisions based on previously logged data, is close to optimal, and can be customized further.
Caching is especially beneficial in a web-based environment.
The application the code is taken from, Embyte, has made over 5000 requests to websites, which took 750 ms on average. Compare this to retrieving the same data from the cache in under 10 ms.
And because it is a user waiting for the response, not a background program, the speed increase is felt all the more.
To better make sense of the implementation, it’s best to read part 1/2, Designing an algorithmic caching system, first.
Broad overview of the algorithm
The algorithm calculates the optimal point in time to renew the cache. If this point lies in the future, the cache will be used. Otherwise, a request is made. This logic is expressed as the following function:
public Tuple<WebsiteInfo, WebsiteInfoStatus> Get(string url)
{
    // Most recent logged request for this URL (needed by the cache lookup)
    var prevEntry = DbCtx.ExtractorEntries
        .Where(x => x.Url == url)
        .OrderByDescending(entry => entry.Time)
        .FirstOrDefault();
    var ttRenew = CacheAlg.TimeToRenew(url, DbCtx.ExtractorEntries);
    var (cachedInfo, status) = GetFromCache(url, prevEntry, ttRenew);
    if (ttRenew > DateTime.Now && cachedInfo != null)
    {
        return Tuple.Create(cachedInfo, status);
    }
    return GetFromExtractor(url, prevEntry, cachedInfo);
}
To make these predictions, we log, every time the program decides to make a request, whether the cached data had already become outdated. The algorithm uses this data to make better predictions: it fits a curve to our situation, which is then used to calculate the optimal point in time to renew the cache.
Architecture
- The module that calls the algorithm is, in this application, called `WebsiteInfoGetter`.
- The algorithm’s logic is stored in a module (`CacheAlg`).
- Learning data is stored in a database (data class: `RequestEntry`).
- A sub-part of the algorithm is the fitter, stored in the `ExponentialFitter` class.
Side note: To store the data for the algorithm, I used EF Core with a PostgreSQL table. However, I kept the data logic to a minimum here to focus on the algorithm. If you want to implement it that way, have a look at the project’s code.
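If you only want a rough idea of the wiring, a minimal sketch of registering such a context with the PostgreSQL provider could look like the following. This is illustrative only: the connection-string name `EmbyteDb` is a placeholder, and the real setup lives in the project’s startup code.
// Program.cs: "builder" is the WebApplicationBuilder; requires the
// Npgsql.EntityFrameworkCore.PostgreSQL package and Microsoft.EntityFrameworkCore.
builder.Services.AddDbContext<EmbyteDbContext>(options =>
    options.UseNpgsql(builder.Configuration.GetConnectionString("EmbyteDb")));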
Learning data – RequestEntry
The algorithm uses previous entries that record whether the cache was still valid at the time of each request (e.g. request on date x: the cached data didn’t match the freshly requested data, so the cache was outdated; or it did match and was still valid). Here’s the data class; more on it later.
public class RequestEntry
{
    public int Id { get; set; }
    public DateTime Time { get; set; }
    public TimeSpan DeltaToPrevious { get; set; }
    public string Url { get; set; } = string.Empty;
    public bool DataChanged { get; set; } = false;

    public RequestEntry(TimeSpan deltaToPrevious, string url, bool dataChanged)
    {
        ...
    }
}
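A hypothetical entry, with made-up values, could look like this:
// 26 hours after the previous request to this URL, the freshly fetched data
// differed from the cached data, i.e. the cache had become outdated in between.
var entry = new RequestEntry(
    deltaToPrevious: TimeSpan.FromHours(26),
    url: "https://example.com",
    dataChanged: true);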
Calculating the optimal time to refresh the cache
We’ll now implement the `TimeToRenew`-method:
public static DateTime TimeToRenew(string url, IQueryable<RequestEntry> entriesGeneral)
{
...
}
First, the method checks boundary conditions. Their implementation is very simple: compare a value derived from the entries to a predefined threshold. The three checks below are the most common ones.
var entriesSpecific = entriesGeneral.Where(e => e.Url == url);

if (TooFewDataPoints(entriesSpecific) || TooOldDataPoints(entriesSpecific))
    return DateTime.MinValue;
if (TooRecent(entriesSpecific))
    return DateTime.MaxValue;
The goal here is to eliminate certain scenarios where the algorithm might not return the result we want. For example, if too few data points exist to make a reliable prediction.
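As an example, here is `TooFewDataPoints`, stripped of its debug logging; the full versions of all three checks are in the appendix.
public static bool TooFewDataPoints(IQueryable<RequestEntry> entriesSpecificToUrl)
{
    // Fewer than 5 logged requests for this URL: not enough data for a reliable prediction
    return entriesSpecificToUrl.Count() < 5;
}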
double ratioImportanceHitMiss = 0.1;
double lowerLimitHours = 0.1;
We then define `ratioImportanceHitMiss`, a central parameter of the algorithm. It weights the abstract **loss** of *making the request*, which slows down the application, against the loss of *using the cache*, which risks showing outdated data. In this application, making a request is exceptionally time-consuming, which is why the value is `0.1`. It should be lower for applications where data integrity is important, and higher if requests are even more costly; a factor of `0.02` is a good starting point.
`lowerLimitHours` filters out all entries with a very low time delta to the previous request. Since almost all of them contain `DataChanged=false`, they aren’t good learning input for the algorithm. (Including them would bias the learning approach.)
var (timeStart, entriesForFitting) = ChooseEntries(entriesGeneral, entriesSpecific, lowerLimitHours);
The algorithm chooses between using URL-specific entries or all entries as learning data (see the sketch below). The method also returns `timeStart`, the time of the most recent entry; it marks the start of the time span in which the probability of the cache being outdated rises continually from 0%.
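For reference, the heart of `ChooseEntries` (shown in full in the appendix) is this selection:
// Prefer URL-specific entries if at least 8 of them have a time delta above the
// lower limit; otherwise fall back to the entries of all URLs.
IQueryable<RequestEntry> entries = timeDeltasSpecific.Where(t => t > lowerLimitHours).Count() >= 8
    ? entriesSpecific
    : entriesGeneral;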
double coeff = GetPolyfittedSecondCoefficient(entriesForFitting, timeStart, lowerLimitHours);
The coefficient of a curve fitted to the training data is computed here (the fitting itself is explained later). It is turned into a renewal offset in hours, which is added to `timeStart`, and the resulting date is returned.
var t = timeStart.Add(TimeSpan.FromHours(CalculateTimeToRenewFromPolyfittedArguments(ratioImportanceHitMiss, coeff)));
return t;
}
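A call site then only has to compare the returned date to the current time, which is what `WebsiteInfoGetter.Get` (see the appendix) does. A minimal sketch, with a made-up URL and the getter’s `DbCtx` context assumed:
var ttRenew = CacheAlg.TimeToRenew("https://example.com", DbCtx.ExtractorEntries);
bool useCache = ttRenew > DateTime.Now;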
Fitting the probability curve of the cache being outdated
Our goal is to find the time to renew the cache. To derive it, we use the probability of the cache being outdated over time; in the measured `RequestEntries`, that probability is either 0 or 1. We fit an exponentially decaying curve to the complementary probability, i.e. the probability of the data being unchanged, which falls over time.
Diagram showing P(Outdated) and the fitted curve over time
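As a minimal sketch of the model: the fitter below estimates the coefficient `a` (negative, so the curve decays); the helper names here are purely illustrative.
// P(data still unchanged after tHours) is modeled as exp(a * tHours), with a < 0.
// The probability of the cache being outdated is its complement.
static double ProbUnchanged(double a, double tHours) => Math.Exp(a * tHours);
static double ProbOutdated(double a, double tHours) => 1.0 - Math.Exp(a * tHours);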
To achieve this, we use a minimal gradient descent approach, contained in a separate class.
public class ExponentialFitter
{
    public double lr;
    public int iterations;

    public ExponentialFitter(int iterations, double lr)
    {
        this.lr = lr;
        this.iterations = iterations;
    }
    ...
From testing different values, it’s clear that what matters for the gap to the optimal loss is the product of learning rate and number of iterations: if you double the number of iterations and halve the learning rate, the loss should stay the same. Considering this, we can raise the learning rate and reduce the number of iterations to make the algorithm faster.
- We can use a higher learning rate than usual (e.g. 0.25); I chose 0.5 (see the instantiation sketch after this list).
- With a learning rate this high, 100 iterations are sufficient.
- As loss function, I chose MSE (the mean of the squared differences).
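With these choices, the fitter is instantiated as in `GetPolyfittedSecondCoefficient` (see the appendix); `booleanData` and `timeDeltas` are the filtered arrays described above, and `0.00001` is the early-stopping epsilon.
var fitter = new ExponentialFitter(iterations: 100, lr: 0.5);
double a = fitter.GetCoefficientForCacheData(booleanData, timeDeltas, out double mse, 0.00001);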
This is the fitting function.
public double GetCoefficientForCacheData(bool[] data, double[] timeDeltas, out double mse, double epsilon)
{
    double[] actual = data.Select(b => b ? 1.0 : 0.0).ToArray();
    double a = -1;
    mse = 0.0;
    double msePrev = 0.0;

    for (int i = 0; i < iterations; i++)
    {
        // Predicted probability of the data being unchanged after each time delta
        double[] pred = timeDeltas.Select(t => Math.Exp(a * t)).ToArray();
        msePrev = mse;
        mse = GetLoss(pred, actual);
        double gradient = pred.Select((p, j) => p - actual[j]).Sum() * 2 / pred.Length;
        a -= lr * mse * gradient;
        // Stop early once the loss barely improves anymore
        if (msePrev - mse < epsilon && i > 1)
            break;
    }
    return a;
}
}
In each iteration, we calculate the predicted values, the loss, and the gradient. The gradient is scaled by the loss and the learning rate and points in the direction of the slope of the loss function; since we want to minimize the loss, we subtract it.
For more information, I recommend reading A gentle introduction to gradient descent through linear regression by José Tapadas.
Calculating the optimal time to renew the cache
Finally, we take the previously calculated coefficient, along with `ratioImportanceHitMiss`, and pass them into a formula to get the desired time. The formula uses `LambertW()`, which we import from `Meta.Numerics.Functions`.
using Meta.Numerics.Functions;
public static double CalculateTimeToRenewFromPolyfittedArguments(double ratioImportanceHitMiss, double fittedSecondCoefficient)
{
    var a = fittedSecondCoefficient;
    var c = ratioImportanceHitMiss;
    return -(1/a) * AdvancedMath.LambertW(-(c+1)/c*Math.Exp(-(c+1)/c)) - (c+1)/(a*c);
}
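A minimal usage sketch with made-up numbers (in the real code, the coefficient comes from the fitter and `timeStart` from `ChooseEntries`):
// Hypothetical values: fitted coefficient a = -0.05 per hour, ratio c = 0.1.
double hours = CacheAlg.CalculateTimeToRenewFromPolyfittedArguments(0.1, -0.05);
// TimeToRenew adds this offset to timeStart to obtain the renewal date.
DateTime renewAt = timeStart.Add(TimeSpan.FromHours(hours));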
If you want to know where this formula comes from, read the first part of the article series, [Designing an Optimal Caching System]().
In short, the optimal time to renew the cache is the point at which the loss values of the two options (using the cache or making a request) are equal to each other; after this point in time, the cache should not be used anymore. We equate the cumulative probabilities, weighted by `ratioImportanceHitMiss`, and solve for `t`.
Conclusion
In this article, we have provided an overview of a mathematically derived, close-to-optimal caching algorithm. Caching is useful in many applications, especially where performance is key or requests take a long time. By understanding the underlying concepts, you will be better equipped to reduce latency and increase efficiency in the applications you write.
For further understanding, view the full implementation in the public repository. The code is fully open-source-licensed.
Appendix: Original versions of all files
CacheAlg.cs
namespace Embyte.Modules.Product;
using Accord.Math;
using Embyte.Data.Storage;
using Embyte.Modules.Logging;
using Meta.Numerics.Functions;
public static class CacheAlg
{
    public static DateTime TimeToRenew(string url, IQueryable<RequestEntry> entriesGeneral)
    {
        var entriesSpecific = entriesGeneral.Where(e => e.Url == url);

        if (TooFewDataPoints(entriesSpecific) || TooOldDataPoints(entriesSpecific))
            return DateTime.MinValue;
        if (TooRecent(entriesSpecific))
            return DateTime.MaxValue;

        double ratioImportanceHitMiss = 0.1;
        double lowerLimitHours = 0.1;

        var (timeStart, entriesForFitting) = ChooseEntries(entriesGeneral, entriesSpecific, lowerLimitHours);
        double coeff = GetPolyfittedSecondCoefficient(entriesForFitting, timeStart, lowerLimitHours);
        var t = timeStart.Add(TimeSpan.FromHours(CalculateTimeToRenewFromPolyfittedArguments(ratioImportanceHitMiss, coeff)));

        Log.Debug("TimeToRenew Coeff a={coeff}", coeff);
        Log.Debug("TimeToRenew t={t}", t.ToString());

        return t;
    }

    public static bool TooFewDataPoints(IQueryable<RequestEntry> entriesSpecificToUrl)
    {
        var cond = entriesSpecificToUrl.Count() < 5;
        Log.Debug("TooFewDataPoints: {cond}", cond);
        return cond;
    }

    public static bool TooOldDataPoints(IQueryable<RequestEntry> entriesSpecificToUrl)
    {
        var mostRecentEntry = entriesSpecificToUrl
            .OrderByInDb("Time", true)
            .FirstOrDefault();
        var cond = mostRecentEntry != null && mostRecentEntry.Time < DateTime.Now.AddMonths(-EmbyteStorage.upperLimitCacheAgeMonths);
        Log.Debug("TooOldDataPoints: {cond}", cond);
        return cond;
    }

    public static bool TooRecent(IQueryable<RequestEntry> entriesSpecificToUrl)
    {
        var cond = entriesSpecificToUrl
            .OrderByInDb("Time", true)
            .First().Time > DateTime.Now.AddHours(-12);
        Log.Debug("TooRecent: {cond}", cond);
        return cond;
    }

    public static double GetPolyfittedSecondCoefficient(IQueryable<RequestEntry> entries, DateTime timeStart, double lowerLimitHours)
    {
        double[] time = entries.Select(e => e.Time.Subtract(timeStart).TotalHours).ToArray();
        // The exponential function will be fitted to the probability of data not having changed over time, because
        // it is the one declining over time. Hence the negation
        bool[] booleanData = entries.Select(e => !e.DataChanged).ToArray();
        double[] timeDeltas = timeDeltasOfEntries(entries);

        booleanData = booleanData.Skip(1).ToArray();
        booleanData = booleanData.Where((b, i) => timeDeltas[i] > lowerLimitHours).ToArray();
        timeDeltas = timeDeltas.Where(t => t > lowerLimitHours).ToArray();

        double a = new ExponentialFitter(100, 0.5).GetCoefficientForCacheData(booleanData, timeDeltas, out double mse, 0.00001);
        return a;
    }
    public static Tuple<DateTime, IQueryable<RequestEntry>> ChooseEntries(IQueryable<RequestEntry> entriesGeneral, IQueryable<RequestEntry> entriesSpecific, double lowerLimitHours)
    {
        double[] timeDeltasSpecific = timeDeltasOfEntries(entriesSpecific);
        double[] timeDeltasGeneral = timeDeltasOfEntries(entriesGeneral);

        // Prefer URL-specific entries if enough of them carry a usable time delta
        IQueryable<RequestEntry> entries = timeDeltasSpecific.Where(t => t > lowerLimitHours).Count() >= 8 ? entriesSpecific : entriesGeneral;

        DateTime timeStart = entries
            .OrderByInDb("Time", false)
            .First().Time;

        return Tuple.Create(timeStart, entries);
    }

    public static double[] timeDeltasOfEntries(IQueryable<RequestEntry> entries)
    {
        DateTime[] time = entries.Select(e => e.Time).ToArray();
        return time.Skip(1).Select((t, i) => t.Subtract(time[i]).TotalHours).ToArray();
    }
    public static double CalculateTimeToRenewFromPolyfittedArguments(double ratioImportanceHitMiss, double fittedSecondCoefficient)
    {
        if (ratioImportanceHitMiss < 0 || ratioImportanceHitMiss > 1)
            throw new ArgumentException("ratioImportanceHitMiss must be in the range of 0 and 1 (inclusive)");

        var a = fittedSecondCoefficient;
        var c = ratioImportanceHitMiss;
        return -(1/a) * AdvancedMath.LambertW(-(c+1)/c*Math.Exp(-(c+1)/c)) - (c+1)/(a*c);
    }
}
ExponentialFitter.cs
using Accord.Statistics;
namespace Embyte.Modules.Product;
public class ExponentialFitter
{
    public double lr;
    public int iterations;

    public ExponentialFitter(int iterations, double lr)
    {
        this.lr = lr;
        this.iterations = iterations;
    }
    public double GetCoefficientForCacheData(bool[] data, double[] timeDeltas, out double mse, double epsilon)
    {
        if (data.Length != timeDeltas.Length)
            throw new ArgumentException($"Input arrays must be of the same length: data.Length={data.Length}, timeDeltas.Length={timeDeltas.Length}");

        double[] actual = data.Select(b => b ? 1.0 : 0.0).ToArray();
        double a = -1;
        mse = 0.0;
        double msePrev = 0.0;

        for (int i = 0; i < iterations; i++)
        {
            double[] pred = timeDeltas.Select(t => Math.Exp(a * t)).ToArray();
            msePrev = mse;
            mse = GetLoss(pred, actual);
            double gradient = pred.Select((p, j) => p - actual[j]).Sum() * 2 / pred.Length;
            a -= lr * mse * gradient;
            if (msePrev - mse < epsilon && i > 1)
                break;
        }
        return a;
    }
    public double GetLoss(double[] pred, double[] actual)
    {
        return pred.Select((x, i) => Math.Pow(x - actual[i], 2.0)).Sum() / pred.Length;
    }
}
WebsiteInfoGetter.cs
using Embyte.Data.Product;
using Embyte.Modules.Db;
using Embyte.Data.Storage;
namespace Embyte.Modules.Product;
public class WebsiteInfoGetter
{
    EmbyteDbContext DbCtx;

    public WebsiteInfoGetter(EmbyteDbContext dbctx)
    {
        DbCtx = dbctx;
    }

    /// <summary>
    /// Runs the cache algorithm and then determines whether or not to use the cache
    /// </summary>
    /// <returns></returns>
    public Tuple<WebsiteInfo, WebsiteInfoStatus> Get(string url, bool allowCache = true)
    {
        RequestEntry? prevEntry = DbCtx.ExtractorEntries
            .Where(x => x.Url == url)
            .OrderByDescending(entry => entry.Time)
            .FirstOrDefault();

        var ttRenew = CacheAlg.TimeToRenew(url, DbCtx.ExtractorEntries);
        var (cachedInfo, status) = GetFromCache(url, prevEntry, ttRenew);

        if (ttRenew > DateTime.Now && allowCache)
        {
            if (cachedInfo != null)
                return Tuple.Create(cachedInfo, status);
        }

        return GetFromExtractor(url, prevEntry, cachedInfo);
    }
    protected Tuple<WebsiteInfo?, WebsiteInfoStatus> GetFromCache(string url, RequestEntry? prevEntry, DateTime ttRenew)
    {
        WebsiteInfoStatus status = new();
        status.parsingDurationMS = 0;
        status.statusType = WebsiteInfoStatusType.cacheSuccess;

        DateTime startRequestTime = DateTime.Now;

        var info = DbCtx.WebsiteInfos
            .Where(x => x != null && x.Url == url)
            .FirstOrDefault();

        DateTime? cacheTime = null;
        if (prevEntry != null)
            cacheTime = prevEntry.Time;

        status.message = prevEntry != null ? $"From Cache ({cacheTime})" : "From Cache (Time unknown)";

        if (cacheTime != null)
        {
            DateTime upperRenewLimit = DateTime.Now.AddMonths(EmbyteStorage.upperLimitCacheAgeMonths);
            status.CacheDateInfo = Tuple.Create(cacheTime ?? DateTime.Now, ttRenew < upperRenewLimit ? ttRenew : upperRenewLimit);
        }

        TimeSpan requestDuration = DateTime.Now - startRequestTime;
        status.requestDurationMS = (int)Math.Round(requestDuration.TotalMilliseconds, MidpointRounding.AwayFromZero);

        return Tuple.Create(info, status);
    }
    protected Tuple<WebsiteInfo, WebsiteInfoStatus> GetFromExtractor(string url, RequestEntry? prevEntry, WebsiteInfo? cachedInfo)
    {
        TimeSpan deltaToPrev;
        bool dataChanged;

        var (info, status) = WebsiteInfoExtractor.GetMetaDataFromUrl(url);

        if (prevEntry == null)
        {
            deltaToPrev = new TimeSpan(0);
            dataChanged = false;
        }
        else
        {
            deltaToPrev = DateTime.Now.Subtract(prevEntry.Time);
            dataChanged = !info.DataEqual(cachedInfo);
        }

        RequestEntry entry = new(deltaToPrev, url, dataChanged);
        DbCtx.ExtractorEntries.Add(entry);

        if (cachedInfo == null)
        {
            DbCtx.WebsiteInfos.Add(info);
        }

        if (dataChanged && cachedInfo != null)
        {
            var cachedInfoEntry = DbCtx.WebsiteInfos.Where(i => i.Url == cachedInfo!.Url).First();
            cachedInfoEntry.SetPropertiesTo(info);
        }

        DbCtx.SaveChanges();

        return Tuple.Create(info, status);
    }
}