How to implement a disk cache plugin for Elixir's Req HTTP client?

September 09, 2022

A starter: Mix.install/2 💜💚❤️

For the past few months, I’ve been using Elixir’s newly-added ability to write self-contained programs via Mix.install/2 (https://hexdocs.pm/mix/1.14.0/Mix.html#install/2) more and more, including for scripting needs.

Here is a quick example of Mix.install/2 use, and you can find many more examples at https://github.com/wojtekmach/mix_install_examples:

Mix.install([
  {:req, "~> 0.3.0"}
])

Req.get!("https://hex.pm/api/packages/req").body["meta"]["description"]
|> IO.inspect(IEx.inspect_opts)

What I love about this is that this script is self-contained: it can be executed trivially with just elixir my_script.exs, and all the dependencies will then be installed automatically!

I find it very convenient and use it for scripting on a regular basis now.

On HTTP caching with Req

req has become my preferred Elixir HTTP client for these scripting needs:

  • It relies on a solid foundation (finch, which itself relies on mint; both are solidly maintained libraries).
  • It has useful defaults (such as JSON decoding) that you can easily bypass (see the snippet after this list).
  • Its API allows extensibility.
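
Here is a quick illustration of that second point; as far as I can tell, decode_body: false is the option that disables the automatic decoding (double-check it against the Req version you use):

# JSON is decoded automatically based on the response content-type
Req.get!("https://hex.pm/api/packages/req").body["meta"]["description"]

# With decode_body: false you should get the raw, undecoded body instead
Req.get!("https://hex.pm/api/packages/req", decode_body: false).body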

See more in the docs.

The problem

When munging data locally, something that comes up very often is the need to replay the same data-processing sequence again and again. When it involves HTTP queries, this can be painfully slow.

Req includes a simple cache, based on the if-modified-since HTTP request header.
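
For reference, enabling that built-in cache looks roughly like this (cache: true is the option documented for Req 0.3; adjust if your version differs):

# Revalidates with if-modified-since and stores the result on disk
Req.get!("https://hex.pm/api/packages/req", cache: true)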

Sadly, a lot of servers I’m querying do not handle that header well, if at all.

A Req-based solution

To solve that, no matter the tech stack, I very often use a form of permanent HTTP disk-caching. This makes my scripts work well whether or not the remote server supports caching properly.

Luckily, Req is extensible: it allows you to register “steps” in the request processing, the response processing, and also the error processing (see the documentation).

It is actually quite flexible, letting you register custom options for the plugin to use, among other things.
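
Before the real plugin, here is a minimal sketch of what a custom step looks like (the log_status name and the whole example are mine, purely for illustration): a response step receives a {request, response} tuple and must return one.

req =
  Req.new()
  |> Req.Request.append_response_steps(
    log_status: fn {request, response} ->
      IO.puts("Got HTTP #{response.status} from #{request.url}")
      {request, response}
    end
  )

Req.get!(req, url: "https://hex.pm/api/packages/req")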

Plugging my code into Req was something new to me, so I’m sharing it here. Some parts are actually extracted from the req cache step directly (I’ve provided links to the original in those cases):

defmodule CustomCache do
  @moduledoc """
  A simple HTTP cache for `req` that does not use headers.
  If the file is not found on disk, the download will occur;
  otherwise the response will be read from disk.
  """
  require Logger

  def attach(%Req.Request{} = request, options \\ []) do
    request
    |> Req.Request.register_options([:custom_cache_dir])
    |> Req.Request.merge_options(options)
    |> Req.Request.append_request_steps(custom_cache: &request_local_cache_step/1)
    |> Req.Request.prepend_response_steps(custom_cache: &response_local_cache_step/1)
  end

  def request_local_cache_step(request) do
    # TODO: handle a form of expiration - for now it is
    # acceptable to wipe out the whole folder manually for me
    # NOTE: race condition here, for parallel queries
    if File.exists?(path = cache_path(request)) do
      Logger.info("File found in cache (#{path})")
      # a request step can return a {req,resp} tuple,
      # and this will bypass the remaining request steps
      {request, load_cache(path)}
    else
      request
    end
  end

  def response_local_cache_step({request, response}) do
    unless File.exists?(path = cache_path(request)) do
      if response.status == 200 do
        Logger.info("Saving file to cache (#{path})")
        write_cache(path, response)
      else
        Logger.info("Status is #{response.status}, not saving file to disk")
      end
    end

    {request, response}
  end

  # https://github.com/wojtekmach/req/blob/102b9aa6c6ff66f00403054a0093c4f06f6abc2f/lib/req/steps.ex#L1268
  def cache_path(cache_dir, request = %{method: :get}) do
    cache_key =
      Enum.join(
        [
          request.url.host,
          Atom.to_string(request.method),
          :crypto.hash(:sha256, :erlang.term_to_binary(request.url))
          |> Base.encode16(case: :lower)
        ],
        "-"
      )

    Path.join(cache_dir, cache_key)
  end

  def cache_path(request) do
    cache_path(request.options[:custom_cache_dir], request)
  end

  # https://github.com/wojtekmach/req/blob/102b9aa6c6ff66f00403054a0093c4f06f6abc2f/lib/req/steps.ex#L1288-L1290
  def load_cache(path) do
    path |> File.read!() |> :erlang.binary_to_term()
  end

  # https://github.com/wojtekmach/req/blob/102b9aa6c6ff66f00403054a0093c4f06f6abc2f/lib/req/steps.ex#L1283-L1286
  def write_cache(path, response) do
    File.mkdir_p!(Path.dirname(path))
    File.write!(path, :erlang.term_to_binary(response))
  end
end

Now that it is available, you can use it this way:

# potentially here: Code.require_file("my_req_cache.exs")

# any URL will do; reusing the one from earlier
url = "https://hex.pm/api/packages/req"
cache_dir = Path.join(__ENV__.file, "../cache-dir") |> Path.expand()

req = Req.new() |> CustomCache.attach()

%{body: body, status: 200} = Req.get!(req,
  url: url,
  custom_cache_dir: cache_dir
)

What happens is:

  • I create a high-level Req request struct
  • I “attach” my plugin to it (the name of that function does not matter)
  • This registers the option key (via register_options), which lets Req detect mistyped option keys (pretty cool)
  • The request and response “steps” are attached to the processing pipeline
  • A cache filename is constructed based on the URL
  • The whole Response is serialized to disk with :erlang.term_to_binary(response) (which means everything is stored: body, headers & HTTP status code), and deserialized at the right moment in the pipeline (typically before JSON-decoding occurs); see the round-trip sketch after this list
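
To make that last point concrete, here is a tiny sketch of the round-trip (the hand-built Req.Response below is purely illustrative; the plugin simply serializes whatever response struct it receives):

# A hypothetical response, built by hand for illustration
response = %Req.Response{
  status: 200,
  headers: [{"content-type", "application/json"}],
  body: ~s({"ok": true})
}

# write_cache/2 boils down to this...
binary = :erlang.term_to_binary(response)

# ...and load_cache/1 to this: we get the exact same struct back
^response = :erlang.binary_to_term(binary)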

Although diving into the pipeline was a bit complicated initially, I’m fairly happy with the result, and it lets me keep leveraging all the interesting features (retries, decoding) that come with the default Req pipeline.

And keep my local data experiments fast :-)
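
One closing note: since attach/2 merges any options you pass it (via merge_options), you can also configure the cache directory once, at attach time, rather than on every request. A small sketch, reusing the url and cache_dir variables from above:

req =
  Req.new()
  |> CustomCache.attach(custom_cache_dir: cache_dir)

# Downloaded and cached on the first call, served from disk afterwards
Req.get!(req, url: url).body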

