cio: cached HTTP requests for a smooth Jupyter experience!
August 21, 2018
 
 This library provides a thin wrapper around the
wreq library (a simple HTTP client library). It
is meant to be used with Jupyter: all requests will be
stored on disk and served from the cache subsequently, even if your kernel
gets restarted. The cache lookups are near-instantaneous thanks to the amazing
LevelDB library. You can use cio just like you would
wreq — instead of importing Network.Wreq, import CIO (which stands for
Cached IO):
{-# LANGUAGE OverloadedStrings #-}
import CIO
import Data.Aeson.Lens
import Control.Lens
Then use the functions you are used to, like get:
get "https://api.github.com/users/nmattia" <&>
    (^.responseBody.key "name"._String)
"Nicolas Mattia"
Building cio
The simplest way to build this library is to use Nix. To get started, clone the cio repository (nmattia/cio), then run the following:
$ nix-shell
helpers:
> cio_build
> cio_ghci
> cio_notebook
> cio_readme_gen
The helper functions will respectively build cio, start a GHCi session for cio, start a Jupyter notebook with cio loaded, and regenerate the README (this file is a Jupyter notebook!).
Using cio
Three functions are provided on top of wreq:
- get :: String -> CIO Response: performs a (cached) request to the given URL.
- getWith :: Options -> String -> CIO Response: performs a (cached) request to the given URL using the provided wreq Options.
- getAllWith :: Options -> String -> Producer CIO Response: performs several (cached) requests by lazily following the Link headers (see for instance GitHub’s pagination mechanism).
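To give an idea of what following the Link headers involves, here is a minimal sketch that is independent of cio and wreq. The names nextLink and pageUrls are made up for this illustration, the parsing is deliberately naive (it assumes URLs contain no commas and that rel="next" is written exactly like that), and the function passed to pageUrls merely stands in for a real HTTP request. It is not how cio is implemented.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, isInfixOf, unfoldr)
import Data.Maybe (listToMaybe, mapMaybe)

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
    (chunk, [])       -> [chunk]
    (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Find the URL tagged rel="next" in a Link header value, if any.
-- Hypothetical helper; naive on purpose (URLs must not contain commas).
nextLink :: String -> Maybe String
nextLink header = listToMaybe (mapMaybe parseEntry (splitOn ',' header))
  where
    -- An entry looks like: <https://example.com/items?page=2>; rel="next"
    parseEntry entry = case trim entry of
        '<' : rest
            | (url, '>' : params) <- break (== '>') rest
            , "rel=\"next\"" `isInfixOf` params -> Just url
        _ -> Nothing

-- | Lazily list the URLs of all pages reachable from a first URL.
-- The first argument stands in for "perform the request and return the
-- value of its Link header"; a real client would do this in IO.
pageUrls :: (String -> String) -> String -> [String]
pageUrls linkHeaderOf = unfoldr step . Just
  where
    step Nothing    = Nothing
    step (Just url) = Just (url, nextLink (linkHeaderOf url))
```

Because the resulting list is lazy, taking only the first few URLs never inspects the Link headers of later pages, which is the same idea getAllWith expresses with a conduit.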
Let’s see what happens when a request is performed twice. First let’s write a function for timing the requests:
import Control.Monad.IO.Class
import Data.Time
timeIt :: CIO a -> CIO (NominalDiffTime, a)
timeIt act = do
    start <- liftIO $ getCurrentTime
    res <- act
    stop <- liftIO $ getCurrentTime
    pure (diffUTCTime stop start, res)
Then we’ll generate a unique string which we’ll use as a dummy parameter in order to force cio to perform the request the first time, so that we can time it:
import Data.UUID (toText)
import System.Random (randomIO)
uuid <- toText <$> randomIO
Finally we use getWith, set the dummy query parameter to the UUID we just generated, and time the request:
timeIt $ getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia" <&>
    (^.responseBody.key "name"._String)
(1.214306799s,"Nicolas Mattia")
That’s a pretty long time! When playing around with data in a Jupyter notebook, waiting around for requests to complete is a real productivity and creativity killer. Let’s see what cio can do for us:
timeIt $ getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia" <&>
    (^.responseBody.key "name"._String)
(0.000248564s,"Nicolas Mattia")
Pretty nice! You might have noticed that the CIO results were printed out, as Show a => IO a would be in GHCi. As mentioned before, cio is optimized for Jupyter workflows, and as such all Show-able results will be printed directly to the notebook’s output. Lists of Show-ables will be pretty printed, which we’ll demonstrate by playing with cio’s other cool feature: lazily following page links.
import Data.Conduit
import Data.Conduit.Combinators as C
In order to lazily fetch data, cio uses the conduit library. The getAllWith function is a Producer of Responses (sorry, a ConduitT i Response CIO ()) which are served from the cache when possible. Here we ask GitHub to give us only two results per page, and cio will iterate the pages until the five expected items have been fetched (if you do the math, that’s three pages):
sourceToList $
    getAllWith
        (defaults
        & param "q" .~ ["language:haskell"]
        & param "sort" .~ ["stars"]
        & param "per_page" .~ ["2"])
        "https://api.github.com/search/repositories"
    .| awaitForever (C.yieldMany . (
        ^..responseBody
        .key "items"
        .values
        .key "full_name"
        ._String))
    .| C.take 5
"jgm/pandoc"
"koalaman/shellcheck"
"PostgREST/postgrest"
"purescript/purescript"
"elm/compiler"
What if something goes wrong?
What’s the second hardest thing in computer science, besides naming and off-by-one errors? Cache invalidation, of course. For the cache’s sake, all your requests should be idempotent, but unfortunately that’s not always possible. Here cio doesn’t assume anything but lets you deal with dirtying yourself (as in you yourself deal with dirtying) by using either of these two functions:
- dirtyReq :: String -> CIO (), like get but instead of fetching the response dirties the entry in the cache.
- dirtyReqWith :: Options -> String -> CIO (), like getWith but instead of fetching the response dirties the entry in the cache.
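As a rough mental model of what caching and dirtying mean, here is a toy cache, independent of cio. Everything in it is an assumption made for illustration: the names Cache, newCache, cachedFetch and dirty are hypothetical, an in-memory association list stands in for the on-disk LevelDB store, and "dirtying" is modelled as simply forgetting the entry so that the next lookup fetches again. cio's actual behaviour may differ.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- | A toy cache from request key to response body. An in-memory list is
-- used here purely as a stand-in for an on-disk store.
type Cache = IORef [(String, String)]

-- | Create an empty cache.
newCache :: IO Cache
newCache = newIORef []

-- | Return the cached body for a key. On a miss, run the fetch action
-- once and remember its result for later lookups.
cachedFetch :: Cache -> String -> IO String -> IO String
cachedFetch cache key fetch = do
    entries <- readIORef cache
    case lookup key entries of
        Just body -> pure body
        Nothing -> do
            body <- fetch
            modifyIORef' cache ((key, body) :)
            pure body

-- | Dirty an entry (assumed semantics: drop it, so that the next
-- cachedFetch for this key performs the fetch again).
dirty :: Cache -> String -> IO ()
dirty cache key = modifyIORef' cache (filter ((/= key) . fst))
```

With this model, calling cachedFetch twice with the same key runs the fetch action once, and calling dirty in between makes it run a second time.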
If things went really wrong, you can always wipe the cache entirely…
… but where’s the cache?
The cache is set globally (reminder: this is a Jupyter-optimized workflow):
getCacheFile
"requests.cache"
If you need a different cache file, you can either change the global cache file:
:t setCacheFile
setCacheFile :: FilePath -> IO ()
or run your CIO code manually:
:t runCIOWith
runCIOWith :: forall a. FilePath -> CIO a -> IO a
one more thing…
.. nope, that’s all! Enjoy!