Extending Django's Cache - a performance story

Dec. 9, 2020

Kyle Kaniecki

Backstory

Early into my refactor of Skirmish's tournament backend, I realized that I needed to make fetching and updating the bracket as quick and efficient as I could on the frontend. I ended up doing a lot of research into how I could speed up my queries to the database, but I also was reading a lot about caching and using in-memory storage to store and serve data instead of querying the database every time. However, there was one problem that I never really liked about Django's built-in caching system -- it lacked support for clearing a view's cache and forcing a recalculation of the data.

So, to give a clear example in tournament terminology, let's say that a player had reported the results for their match, and the match advanced the winner and the loser to their next respective matches. After this, if the bracket were to be requested again, django's built in @cache_page decorator would create a cache key for the request, look for it in our cache before performing calculations, grab the old data and serve it to the user. To show specifically why this is the case, let's look at @cache_page's key creation implementation:

# https://github.com/django/django/blob/302caa40e4caab7d95ef7d0a88a90f935039ab09/django/utils/cache.py#L326
def _generate_cache_key(request, method, headerlist, key_prefix):
    """Return a cache key from the headers given in the header list."""
    ctx = hashlib.md5()
    for header in headerlist:
        value = request.META.get(header)
        if value is not None:
            ctx.update(value.encode())
    url = hashlib.md5(request.build_absolute_uri().encode('ascii'))
    cache_key = 'views.decorators.cache.cache_page.%s.%s.%s.%s' % (
        key_prefix, method, url.hexdigest(), ctx.hexdigest()
    )
    return _i18n_cache_key_suffix(request, cache_key)

Ok... there is a lot going on here, so let's boil it down into some bullet points

  • First, we create a hash context that we can progressively add content to
  • Loop over the headers included in the "headerlist" parameter, get their value, and add them to a hash context
  • Create a hash of the complete request URI (which includes path and any optional parameters in the request)
  • Append that information to a static string and return after some internationalization is applied

Ok, so what are the benefits of this approach?

  • We are hashing values that are associated with a specific requests, so if anything changes about the request, the view is recalculated and stored in the cache as a new value
    • This includes things like if any query parameters have changed or if any request headers have changed
  • We internationalize our cache string

However, what are some drawbacks?

  • We don't have an easy way to clear a whole category of requests by view path
    • Since the request URI is hashed, we lose the ability to look for prefixes
  • Since we are hashing specific request headers, if any of those headers change, we recompute the value, even if no data content has changed
  • Their cache key method is tied to request objects themselves, which wouldn't be good if we wanted to clear and create cache from something like a Celery task which requires pickling function parameters

Great, so now that we have evaluated Django's version of generating the cache key for a view response, I can determine that this will not work for my use case. I want to cache data about a tournament view (which won't change with request headers), and be able to clear a whole category of response caches if any of the data changes. Using our example from before, if a tournament game is reported and advanced, I want to make sure all of the bracket view responses that contain that tournament game are cleared from the cache and values are recalculated when the next person requests information about the bracket. Since tournament data changes tended to happen in waves as rounds were completed, say 10-15 minutes at a time, the data would largely remain static until it changes for everyone at the same time. With django's built-in cache, I wouldn't have a great way to clear all of that data all at once (or even really know which keys are which) due to key obfuscation from hashing.

Basis

All of the proposals I thought still used django's view decorator paradigm to cache a responses content. So when we are talking about the code in the proposals, all of them would be happening inside the get_cache_key funciton that our cache decorator uses to store the values in redis, to be clear. In the decorator, we would only cache the response if the view was successfuly. I used this blog article to create a basic view decorator for caching response content, I suggest you give it a look over as well.

Ok, on to the actual proposals. I thought of a few cache key formats that would let me group content together by prefix.

Proposal 1

The first proposal centered around storing the entire URI in plain text in the cache key. This would allow us to use redis' key prefixing to clear a group of cache entries all at once. It would look something like this:

def get_cache_key(request, key_prefix='response_cache', method='get'):
    return '%s.%s.%s' % (key_prefix, request.get_absolute_uri(), method)

Pros

  • Readability for manual cache manipulation
  • Any variation in request endpoint would result in a new cache entry
  • Support deleting many entries by prefix

Cons

  • Requests with many query parameters would have very long keys
  • building a cache key still relies on a request object

Proposal 2

During the second proposal, I really wanted to see if I could get the length of the cache keys down a little bit. So I decided that I could do that by only applying the request path in plain text in the key, and hashing the complete URI on the end. This allows us to have a fixed length hash on the end that is still unique to the entire request URI.

def get_cache_key(request, key_prefix='response_cache', method='get'):
    url_hash = hashlib.md5(request.build_absolute_uri().encode('ascii')).hexdigest()
    return '%s.%s.%s.%s' % (key_prefix, request.path, method, url_hash)

Pros

  • Readability for manual cache manipulation
  • Any variation in request endpoint would result in a new cache entry
  • Support deleting many entries by prefix

Cons

  • Keys do not change with common paths that require authentication headers, such as /v1/account/permissions/, so we cannot use on user endpoints
  • Still tied to Django's request object

Proposal 3

Proposal 3 was a little bit of the previous proposal, with the inclusion of django's header system in order to incorporate a hash for varying request headers This allowed me to save key space on views that didn't change based on authentication, but get unique hash values for views that do change based on request headers, such as the Authentication header. Also, proposal 3 moves away from having our cache key building function being tied to request objects. Instead, I created a more generic cache key function that builds our keys based on a string input instead. This way, functions that are not views can perform crud operations in the cache from anywhere the cache is available, like inside a Celery worker.

CACHE_SEPERATOR = "."

def get_cache_key(
    key: str, prefix: str = "", version: int = 1, seperator: str = CACHE_SEPERATOR
) -> str:
    """
    Builds a Skirmish cache key from the given values

    Args:
        key (str): The raw key to be formatted and Skirmish-ified
        prefix (:obj:`str`, optional): A key prefix that will be prepended to the cache key
        version (:obj:`int`, optional): The version of cache building that this is for
        serperator (:obj:`str`, optional): A seperator for the key, defaults to ':'

    Returns:
        str: A new formatted key that will be inserted into the cache
    """
    return seperator.join(filter(None, [prefix, str(version), key]))


def get_cache_key_for_path(
    path: str,
    query_string: Optional[str] = None,
    additional_context: Optional[List[str]] = None,
    **kwargs,
) -> str:
    """
    Builds a cache key for a url path. This is used in Skirmish's endpoint caching so that
    we can also _get_ keys to clear the cache when things change in the endpoint
    Note:
        The url path should not include query strings. Use the `query_string` kwarg to include query_params
    Args:
        path (str): The url path to get a key for
        query_string (str): a urlencoded querystring that will be hashed
            and appended to the cache key
        additional_context (:obj:`list`, optional): Additional context that will be
            hashed and appended to the cache key
    """
    # if we actually have a query string, hash it
    if query_string:
        query_string = hashlib.md5(query_string.encode("ascii")).hexdigest()

    # If we have additional context, let's hash it
    additional_hash = None
    if additional_context:
        ctx = hashlib.md5()
        for value in additional_context:
            ctx.update(value.encode("ascii"))
        additional_hash = ctx.hexdigest()

    # https://stackoverflow.com/questions/26566663/does-filter-preserve-list-ordering
    # Just to make sure
    return build_cache_key(
        CACHE_SEPERATOR.join(filter(None, [path, query_string, additional_hash])),
        **kwargs,
    )",

Pros

  • Readability for manual cache manipulation
  • Any variation in request endpoint would result in a new cache entry
  • Support deleting many entries by prefix
  • Supports having seperate view cache responses based on authentication
  • Allows building cache keys based on arbitrary strings

Cons

  • Cache values are not all the same number of separators, sometimes it is 6, others it is 5. Inconsistency
  • If we hash our authentication headers, our key would change once a user's token expired. Could be problematic with short lived tokens

Full Decorator Code example

def cache_response(seconds: int, vary_headers: Optional[List[str]] = None):
    def decorator(func: partial):
        """
        This is a partial since we *should* be using django's method_decorator in CBV
        """
        @wraps(func)
        def inner(
            request: HttpRequest, *args, **kwargs,
        ):
            # Nonlocal scope access
            # https://www.python.org/dev/peps/pep-3104/
            nonlocal vary_headers
            if vary_headers is not None:
                vary_headers = list(
                    filter(
                        None,
                        [request.headers.get(header, None) for header in vary_headers],
                    )
                )

            # Get the cache key
            cache_key = get_cache_key_for_path(
                request.path,
                query_string=request.GET.urlencode(),
                additional_context=vary_headers,
            )

            # Get the cached value, if there is one
            content = cache.get(cache_key)
            # If we have a cached value, return it
            if content is not None:
                return Response(content)

            # Get the normal return value from the view by running it
            # This will run the view completely normally with no caching
            response = func(request, *args, **kwargs)

            # If the response is a valid one, put it in the cache
            if isinstance(response, Response) and response.status_code in (200, 304):
                cache.set(cache_key, response.data, seconds)

            # Return the response
            return response

        return inner

    return decorator

And then we can use the cache key functions in functions like this in celery tasks:

for phase_id in tournament.phases.values_list("id", flat=True):
        clear_cache_by_key(
            prefix=get_cache_key_for_path(
                reverse("TournamentPhaseDetail", kwargs={"pk": phase_id})
            )
        )

Conclusion

At the end of the day, I ended up choosing proposal #3 for my needs with Skirmish. While varying headers does add quite a bit of complexity to the cache decorator and more keyword arguments to our cache key function, I think that it really adds a lot of extensionality to the view cache decorator. It will allow us to cache endpoints like "v1/account/wallet/", which would need to be unique for every user. Also, this proposal uncouples our cache keys from request objects and instead couples them to specific endpoints on our site. This allowed me to clear view cache for endpoints in celery tasks that would generate tournament brackets or tasks that would advance tournament games.

However, developers do need to be careful with this. With great power comes great responsibility. If we are blindly using caching to improve performance on all endpoints, we are probably using a band-aid on a bullet wound. Instead, small endpoints should probably use efficient database querying to serve content so we are not storing large amounts of data redundantly in the redis store. Just because our cache keys change based on very specific request parameters doesn't mean we should use the cache on all endpoints.

If you liked this article, or would like to suggest ways to improve it, drop a comment below! I'd love to hear other options and solutions to grow and learn.


comments

Please enter a valid display name
Please enter a valid email
Your email will not be displayed publicly
Please enter a comment