cio: cached HTTP requests for a smooth Jupyter experience!

August 21, 2018

This library provides a thin wrapper around the wreq library (a simple HTTP client library). It is meant to be used with Jupyter: all requests will be stored on disk and served from the cache subsequently, even if your kernel gets restarted. The cache lookups are near-instantaneous thanks to the amazing LevelDB library. You can use cio just like you would wreq — instead of importing Network.Wreq, import CIO (which stands for Cached IO):

{-# LANGUAGE OverloadedStrings #-}

import CIO
import Data.Aeson.Lens
import Control.Lens

Then use the functions you are used to, like get:

get "https://api.github.com/users/nmattia" <&>
    (^.responseBody.key "name"._String)

"Nicolas Mattia"

Building cio

The simplest way to build this library is to use Nix. To get started clone the cio repository (nmattia/cio), then run the following:

$ nix-shell
helpers:
> cio_build
> cio_ghci
> cio_notebook
> cio_readme_gen

The helper functions will respectively build cio, start a ghci session for cio, start a Jupyter notebook with cio loaded and regenerate the README (this file is a Jupyter notebook!).

Using cio

Three functions are provided on top of wreq:

get :: String -> CIO Response performs a (cached) request to the given URL.
getWith :: Options -> String -> CIO Response performs a (cached) request to the given URL using the provided wreq Options.
getAllWith :: Options -> String -> Producer CIO Response performs several (cached) requests by lazily following the Link headers (see for instance GitHub’s pagination mechanism).

Let’s see what happens when a request is performed twice. First let’s write a function for timing the requests:

import Control.Monad.IO.Class
import Data.Time

timeIt :: CIO a -> CIO (NominalDiffTime, a)
timeIt act = do
    start <- liftIO $ getCurrentTime
    res <- act
    stop <- liftIO $ getCurrentTime
    pure (diffUTCTime stop start, res)

Then we’ll generate a unique string which we’ll use as a dummy parameter in order to force cio to perform the request the first time, so that we can time it:

import Data.UUID (toText)
import System.Random (randomIO)

uuid <- toText <$> randomIO

Finally we use getWith and set the dummy query parameter to the UUID we just generated and time the request:

timeIt $ getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia" <&>
    (^.responseBody.key "name"._String)

(1.214306799s,"Nicolas Mattia")

That’s a pretty long time! When playing around with data in a Jupyter notebook waiting around for requests to complete is a real productivity and creativity killer. Let’s see what cio can do for us:

timeIt $ getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia" <&>
    (^.responseBody.key "name"._String)

(0.000248564s,"Nicolas Mattia")

Pretty nice! You might have noticed that the CIO results were printed out, as Show a => IO a would be in GHCi. As mentioned before, cio is optimized for Jupyter workflows, and as such all Show-able results will be printed directly to the notebook’s output. Lists of Show-ables will be pretty printed, which we’ll demonstrate by playing with cio’s other cool feature: lazily following page links.

import Data.Conduit
import Data.Conduit.Combinators as C

In order to lazily fetch data cio uses the conduit library. The getAllWith function is a Producer of Responses (sorry, a ConduitT i Response CIO ()) which are served from the cache when possible. Here we ask GitHub to give us only two results per page, and cio will iterate the pages until the five expected items have been fetched (if you do the math that’s about 3 pages):

sourceToList $
    getAllWith
        (defaults
        & param "q" .~ ["language:haskell"]
        & param "sort" .~ ["stars"]
        & param "per_page" .~ ["2"])
        "https://api.github.com/search/repositories"
    .| awaitForever (C.yieldMany . (
        ^..responseBody
        .key "items"
        .values
        .key "full_name"
        ._String))
    .| C.take 5

"jgm/pandoc"
"koalaman/shellcheck"
"PostgREST/postgrest"
"purescript/purescript"
"elm/compiler"

What if something goes wrong?

What’s the second hardest thing in computer science, besides naming and off-by-one errors? Cache invalidation, of course. For the cache’s sake, all your requests should be idempotent, but unfortunately that’s not always possible. Here cio doesn’t assume anything but lets you deal with dirtying yourself (as in you yourself deal with dirtying) by using either of these two functions:

dirtyReq :: String -> CIO (), like get but instead of fetching the response dirties the entry in the cache.
dirtyReqWith :: Options -> String -> CIO (), like getWith but instead of fetching the response dirties the entry in the cache.

If things went really wrong, you can always wipe the cache entirely…

… but where’s the cache?

The cache is set globally (reminder: this is a Jupyter-optimized workflow):

getCacheFile
"requests.cache"

If you need a different cache file you can either change the global cache file:

:t setCacheFile
setCacheFile :: FilePath -> IO ()</span>

or run your CIO code manually:

:t runCIOWith
runCIOWith :: forall a. FilePath -> CIO a -> IO a

one more thing…

.. nope, that’s all! Enjoy!

Let me know if you enjoyed this article! You can also subscribe to receive updates.

Here's more on the topic of Haskell: