Implement a Context-Aware Key-Value Store

old-f-key-369079-m

Previously, we discussed the importance of centralized configuration management and proposed the context-aware key-value store as a means of implementing it. We have outlined the basic API for such a system, but never discussed an implementation. We propose to do so in this article.

Before diving in, let’s quickly review the API and requirements. A traditional key-value store has the following interface:

string GetValue(string key)
void SetValue(string key, string value)

There is also a use case for deleting values, but we can ignore that for the purposes of this discussion. In a standard key-value store, there is a single value for a given key, and looking up the value is straight forward. In the context-aware variant, the value can be different depending on the context. This implies a slightly more complex API:

string GetValue(string key, Context context)
void SetValue(string key, Context context, string value)

The context can be implemented as a string-to-string dictionary where the key is the “dimension” (environment, user, etc) and the value is the context’s “location” on that dimension (prod or dev for “environment”, for example). The SetValue function records an explicit value for a key in a specific context, but the GetValue function is free to return values that are not explicitly set in the requested context – it may return a value that is explicitly set in a context that the provided context “inherits” from. Herein lies the power of this approach for centralized configuration – it allows us to set common properties once, while retaining the flexibility to override values in very specific situations if necessary.

But how does the GetValue function figure out what the value should be if it is no longer a simple lookup by key?

We know that the GetValue function will either return null (in the event that no appropriate value can be found) or one of the values that has been explicitly set for the requested key (it would make no sense to return a value that was set for a different key). So, we start the lookup algorithm by fetching all the values set for the requested key, regardless of the context in which those values are explicitly set. Call this list the “candidate values”.

The problem then reduces to determining which one of the candidate values is set in a context that most closely matches the requested context. It is tempting to say that we can find the best match by simply choosing the candidate that matches the requested context along the most dimensions. However, this won’t quite work, as we need a mechanism for breaking ties, and we likely want certain dimensions to carry more weight than others. For example, we probably want to require the environment to match before we match on other dimensions.

Therefore, we need one additional piece of information in order to implement a matching algorithm. This additional information is a list of “search paths”, each of which is in turn a list of dimensions. For example, if your dimensions are “environment”, “application”, and “machine”, you may want to configure two search paths, “environment/application/machine” and “environment/machine/application”. The search paths allow us to break ties and they guide the structure of the search algorithm.

Next, we need to “expand” the search paths, by adding all “parent” paths — up to and including the root — of the specified paths. Once we have performed the expansion, the order of the dimensions in the paths becomes irrelevant (the order only matters for the purposes of performing the expansion), so we can remove duplicates. Using the above paths as an example, we end up with the following expanded set of search paths : “”, “environment”, “environment/application”, “environment/application/machine”, and “environment/machine” (note that we removed a duplicate “environment”, as well as “environment/machine/application”).

Now, associate each candidate value with a “candidate path”. This is the search path where every dimension specified in the candidate context is present in the path. If there is not a candidate path, the candidate value cannot be returned in any requested context (it is “unsearchable”).

Next, select the candidate values with a valid candidate path where every dimension location in the candidate context matches the corresponding dimension locations in the target context. These candidate values become “finalist values”.

If there are no finalists, there is no match, and GetValue should return null. Otherwise, GetValue should return the finalist value that has the most dimensions specified in its context. Ties are broken in favor of the value whose candidate path that was listed first.

Let’s work through an example. Utilizing the example search paths and dimensions used above, say we have specified the following values for the key “number”:

  • “one” in the default (empty) context. The candidate path is “”.
  • “two” in environment=prod. The candidate path is “environment”.
  • “three” in environment=dev. The candidate path is “environment”.
  • “four” in environment=dev/application=dow. The candidate path is “environment/application”.
  • “five” in environment=dev/machine=box2. The candidate path is “environment/machine”.
  • “six” in application=dow/machine=box2. There is no candidate path, and the value is unsearchable.

Now, let’s perform some searches.

  • A search in the default context returns “one”. It is the only finalist.
  • A search in environment=prod/application=dow returns “two”. “one” and “two” are both finalists, and “two” has more dimensions in its context.
  • A search in environment=dev/application=dow/machine=box2 returns “four”. “one”, “three”, “four”, and “five” are all finalists. “four” and “five” tie for having the most dimensions in their contexts, but since the search path “environment/application/machine” was listed before the search path “environment/machine/application”, “four” wins.
  • A search in environment=dev/application=app/machine=box2 returns “five”. The logic is similar to the previous search, but “four” is not a finalist because the requested context does not match “application=dow”. Hence, “five” wins.

This algorithm is a bit complicated and we still consider it a work in progress. But it provides a flexible approach to implementing the context-aware key-value store, which we believe to be the key to a quality firm-wide configuration solution.

  1. Fascinating read Stuart. In Octopus Deploy we have a ‘variables’ system which exists to solve the same problem, and this post almost felt like reading documentation for how our own system works 🙂

    We allow configuration values to be scoped to environment, machine or step (similar to your application dimension). We gather all possible values for each key, and filter out any non-matches (e.g., environment = DEV when you are deploying to PROD).

    Rather than building paths, however, we rank them according to specificity, with step being most specific, then machine, then environment. Specificity is calculated using a points system. If the machine matches, it gets a +10. If the environment matches, it gets a +1 and so on, with more specific dimensions giving more points.

    So our process is – find all values for the key, filter the ones that don’t apply, and then rank them by specificity and take the top value for the key.

    I haven’t put much thought into whether creating paths would be better or worse than our ranking system but they both seem to achieve the same goal (would love to hear your thoughts).

    In 2.0 we’re also tacking the ability to have one value associated with multiple dimension values. For example, instead of ‘environment=PROD’, it becomes ‘environment in (DEV, PROD)’. For us the additional values only matter for the filtering phase, not the ranking phase, and I guess it would be the same in your approach?

    Thanks for sharing!

    Paul

    1. Paul,

      Thanks for your thoughts — it is really encouraging to hear that other people are thinking about these problems in a similar way.

      Regarding the “search paths” — that’s my major sore spot about the design I outlined here. Something about it feels a bit…off. The “search paths” accomplish two things — they “break ties” in the event that you have values that are specified in the same number of dimensions (you’re solving this with the point-based ranking mechanism), and they also set up “requirements” regarding what dimensions *must* be specified before you can specify a “more specific” one. For example, if my dimensions are “environment” and “application”, I want the system to consider a value set with an “application” and not an “environment” to be nonsensical. This mechanism prevents possibly “dangerous” configurations from leaking between environments. I think that there is probably a way to achieve both these goals that is more elegant than the search paths, but I haven’t come up with one yet.

      Regarding the proposal to associate one value with multiple “contexts” — I understand where you’re coming from there. I’ve considered something similar, but I worry about managing the complexity of having a single value “record” that applies to multiple contexts simultaneously. It seems do-able, but it scares me a bit. Your specific example, setting a property in both “dev” and “prod”, is one that I’d solve by setting the value at the “default” (no context) level. That way, the value applies to everyone, regardless of environment (or other context dimensions).

      Thanks again for your comment, and I hope we’ll continue to see you around here.

      Stuart

Leave a Reply

Your email address will not be published. Required fields are marked *