A starter: Mix.install/2
For the past few months, I’ve increasingly used Elixir’s newly-added ability to write self-contained programs via Mix.install/2
(https://hexdocs.pm/mix/1.14.0/Mix.html#install/2), including for scripting needs.
Here is a quick example of Mix.install/2 in use; you can find many more examples at https://github.com/wojtekmach/mix_install_examples:
Mix.install([
{:req, "~> 0.3.0"}
])
Req.get!("https://hex.pm/api/packages/req").body["meta"]["description"]
|> IO.inspect(IEx.inspect_opts)
What I love about this is that the script is self-contained: it can be executed trivially with just elixir my_script.exs, and all the dependencies will be installed automatically!
I find it very convenient and use it for scripting on a regular basis now.
On HTTP caching with Req
req has become my preferred Elixir HTTP client for these scripting needs:
- It relies on a solid foundation (finch, which itself relies on mint; both are solidly maintained libraries).
- It has useful defaults (such as automatic JSON decoding) that you can bypass easily.
- Its API allows extensibility.
See more in the docs.
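As a quick illustration of those defaults (a sketch using the same Hex API endpoint as above; decode_body is a documented Req option, and this requires network access):

```elixir
Mix.install([
  {:req, "~> 0.3.0"}
])

# By default, Req decodes a JSON body into Elixir terms...
decoded = Req.get!("https://hex.pm/api/packages/req").body

# ...and decode_body: false bypasses that step, returning the raw binary.
raw = Req.get!("https://hex.pm/api/packages/req", decode_body: false).body

IO.inspect({is_map(decoded), is_binary(raw)})
```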
The problem
When munging data locally, something that comes up very often is the need to replay a data-processing code sequence repeatedly. When it involves HTTP queries, this can be cumbersomely slow.
Req includes a simple cache, based on the if-modified-since HTTP request header.
Sadly, a lot of servers I’m querying do not handle that header well, if at all.
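For servers that do handle the header properly, enabling that built-in cache is a one-liner (a sketch: cache and cache_dir are documented Req options, the directory name is just an example, and this requires network access):

```elixir
Mix.install([
  {:req, "~> 0.3.0"}
])

# The first call downloads and stores the response; subsequent calls
# send if-modified-since and reuse the cached copy when the server
# answers 304 Not Modified.
response = Req.get!("https://hex.pm/api/packages/req", cache: true, cache_dir: "tmp/req-cache")

IO.inspect(response.status)
```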
A Req-based solution
To solve that, no matter the tech stack, I very often use a form of permanent HTTP disk cache. This makes my scripts work regardless of whether the remote server supports caching properly.
Luckily, Req is extensible and lets you register “steps” in the request processing, the response processing, and also the error processing (see the documentation).
It is actually quite flexible: you can, for instance, register options that your plugin will use.
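As a tiny illustration of the step API before we get to the full plugin (the step name and header below are just examples I made up):

```elixir
Mix.install([
  {:req, "~> 0.3.0"}
])

req =
  Req.new(url: "https://example.org")
  |> Req.Request.append_request_steps(
    # a request step is a function that receives (and returns) a %Req.Request{}
    add_note_header: fn request ->
      Req.Request.put_header(request, "x-note", "hello")
    end
  )

# the step is now part of the request pipeline, before any request is made
IO.inspect(Keyword.has_key?(req.request_steps, :add_note_header))
```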
Plugging my code into Req was something new to me, so I’m sharing it here. Some parts are extracted directly from the req cache step (I’ve provided links to the original in those cases):
defmodule CustomCache do
  @moduledoc """
  A simple HTTP cache for `req` that does not use headers.

  If the file is not found on disk, the download will occur;
  otherwise the response will be read from disk.
  """

  require Logger

  def attach(%Req.Request{} = request, options \\ []) do
    request
    |> Req.Request.register_options([:custom_cache_dir])
    |> Req.Request.merge_options(options)
    |> Req.Request.append_request_steps(custom_cache: &request_local_cache_step/1)
    |> Req.Request.prepend_response_steps(custom_cache: &response_local_cache_step/1)
  end

  def request_local_cache_step(request) do
    # TODO: handle a form of expiration - for now it is
    # acceptable for me to wipe out the whole folder manually
    # NOTE: race condition here for parallel queries
    if File.exists?(path = cache_path(request)) do
      Logger.info("File found in cache (#{path})")
      # a request step can return a {request, response} tuple,
      # which bypasses the remaining request steps
      {request, load_cache(path)}
    else
      request
    end
  end

  def response_local_cache_step({request, response}) do
    unless File.exists?(path = cache_path(request)) do
      if response.status == 200 do
        Logger.info("Saving file to cache (#{path})")
        write_cache(path, response)
      else
        Logger.info("Status is #{response.status}, not saving file to disk")
      end
    end

    {request, response}
  end

  # https://github.com/wojtekmach/req/blob/102b9aa6c6ff66f00403054a0093c4f06f6abc2f/lib/req/steps.ex#L1268
  def cache_path(cache_dir, request = %{method: :get}) do
    cache_key =
      Enum.join(
        [
          request.url.host,
          Atom.to_string(request.method),
          :crypto.hash(:sha256, :erlang.term_to_binary(request.url))
          |> Base.encode16(case: :lower)
        ],
        "-"
      )

    Path.join(cache_dir, cache_key)
  end

  def cache_path(request) do
    cache_path(request.options[:custom_cache_dir], request)
  end

  # https://github.com/wojtekmach/req/blob/102b9aa6c6ff66f00403054a0093c4f06f6abc2f/lib/req/steps.ex#L1288-L1290
  def load_cache(path) do
    path |> File.read!() |> :erlang.binary_to_term()
  end

  # https://github.com/wojtekmach/req/blob/102b9aa6c6ff66f00403054a0093c4f06f6abc2f/lib/req/steps.ex#L1283-L1286
  def write_cache(path, response) do
    File.mkdir_p!(Path.dirname(path))
    File.write!(path, :erlang.term_to_binary(response))
  end
end
Now that it is available, you can use it this way:
# potentially here: Code.require_file("my_req_cache.exs")
cache_dir = Path.join(__ENV__.file, "../cache-dir") |> Path.expand()

req = Req.new() |> CustomCache.attach()

%{body: body, status: 200} =
  Req.get!(req,
    url: url,
    custom_cache_dir: cache_dir
  )
What happens is:
- I create a Req structure via the high-level API
- I “attach” my plugin to it (the name of the function does not matter)
- This registers the option key (via register_options), a feature useful for detecting mistyped option keys (pretty cool)
- The request and response “steps” are attached to the processing pipeline
- A cache filename is constructed based on the URL
- The whole Response is serialized to disk with :erlang.term_to_binary(response) (which means everything is stored: body, headers & HTTP status code), and deserialized at the right moment in the pipeline (typically before JSON decoding occurs)
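That serialization step can be checked in isolation (using a hypothetical response-shaped map for illustration; the real code serializes the full response struct):

```elixir
# any Elixir term round-trips through the Erlang External Term Format
response = %{status: 200, headers: [{"content-type", "application/json"}], body: ~s({"ok":true})}

binary = :erlang.term_to_binary(response)
restored = :erlang.binary_to_term(binary)

IO.inspect(restored == response)
```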
Although diving into the pipeline was a bit complicated initially, I’m fairly happy with the result: it lets me automatically keep leveraging all the interesting features (retries, decoding) that come with the default Req pipeline.
And keep my local data experiments fast :-)