cio: cached HTTP requests for a smooth Jupyter experience!
August 21, 2018
This library provides a thin wrapper around the wreq library (a simple HTTP client library). It is meant to be used with Jupyter: all requests are stored on disk and served from the cache on subsequent runs, even if your kernel gets restarted. The cache lookups are near-instantaneous thanks to the amazing LevelDB library. You can use cio just like you would wreq; instead of importing Network.Wreq, import CIO (which stands for Cached IO):
    {-# LANGUAGE OverloadedStrings #-}

    import CIO
    import Data.Aeson.Lens
    import Control.Lens
Then use the functions you are used to, like get:
    get "https://api.github.com/users/nmattia" <&> (^.responseBody.key "name"._String)

    "Nicolas Mattia"
Building cio
The simplest way to build this library is to use Nix. To get started, clone the cio repository (nmattia/cio), then run the following:
    $ nix-shell
    helpers:
    > cio_build
    > cio_ghci
    > cio_notebook
    > cio_readme_gen
The helper functions will respectively build cio, start a ghci session for cio, start a Jupyter notebook with cio loaded, and regenerate the README (this file is a Jupyter notebook!).
Using cio
Three functions are provided on top of wreq:

- get :: String -> CIO Response performs a (cached) request to the given URL.
- getWith :: Options -> String -> CIO Response performs a (cached) request to the given URL using the provided wreq Options.
- getAllWith :: Options -> String -> Producer CIO Response performs several (cached) requests by lazily following the Link headers (see for instance GitHub’s pagination mechanism).
Let’s see what happens when a request is performed twice. First let’s write a function for timing the requests:
    import Control.Monad.IO.Class
    import Data.Time

    timeIt :: CIO a -> CIO (NominalDiffTime, a)
    timeIt act = do
        start <- liftIO getCurrentTime
        res <- act
        stop <- liftIO getCurrentTime
        pure (diffUTCTime stop start, res)
Then we’ll generate a unique string which we’ll use as a dummy parameter, in order to force cio to perform the request the first time so that we can time it:
    import Data.UUID (toText)
    import System.Random (randomIO)

    uuid <- toText <$> randomIO
Finally we use getWith, set the dummy query parameter to the UUID we just generated, and time the request:
    timeIt $
        getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia"
            <&> (^.responseBody.key "name"._String)

    (1.214306799s,"Nicolas Mattia")
That’s a pretty long time! When playing around with data in a Jupyter notebook, waiting for requests to complete is a real productivity and creativity killer. Let’s see what cio can do for us:
    timeIt $
        getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia"
            <&> (^.responseBody.key "name"._String)

    (0.000248564s,"Nicolas Mattia")
Pretty nice! You might have noticed that the CIO results were printed out, just as a Show a => IO a would be in GHCi. As mentioned before, cio is optimized for Jupyter workflows, and as such all Show-able results are printed directly to the notebook’s output. Lists of Show-ables are pretty-printed, which we’ll demonstrate by playing with cio’s other cool feature: lazily following page links.
    import Data.Conduit
    import Data.Conduit.Combinators as C
In order to fetch data lazily, cio uses the conduit library. The getAllWith function is a Producer of Responses (sorry, a ConduitT i Response CIO ()) which are served from the cache when possible. Here we ask GitHub to give us only two results per page, and cio will iterate over the pages until the five expected items have been fetched (if you do the math, that’s about 3 pages):
    sourceToList $
        getAllWith
            (defaults
                & param "q" .~ ["language:haskell"]
                & param "sort" .~ ["stars"]
                & param "per_page" .~ ["2"])
            "https://api.github.com/search/repositories"
        .| awaitForever
            (C.yieldMany .
                (^.. responseBody . key "items" . values . key "full_name" . _String))
        .| C.take 5

    "jgm/pandoc"
    "koalaman/shellcheck"
    "PostgREST/postgrest"
    "purescript/purescript"
    "elm/compiler"
What if something goes wrong?
What’s the second hardest thing in computer science, besides naming things and off-by-one errors? Cache invalidation, of course. For the cache’s sake all your requests should be idempotent, but unfortunately that’s not always possible. Here cio doesn’t assume anything, but lets you deal with dirtying yourself (as in: you yourself deal with the dirtying) by using either of these two functions (a short example follows the list):
- dirtyReq :: String -> CIO (), like get but instead of fetching the response it dirties the entry in the cache.
- dirtyReqWith :: Options -> String -> CIO (), like getWith but instead of fetching the response it dirties the entry in the cache.
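For instance, here’s a minimal sketch of dirtying the request from the timing example above and fetching it again. It assumes the same session, with uuid still in scope, and that the cache entry is keyed on the same Options and URL:

    -- Dirty the cached entry for this exact request (same Options, same URL)...
    dirtyReqWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia"

    -- ...so that the next identical request takes the slow path to the
    -- network again instead of being served from the cache.
    timeIt $
        getWith (param "dummy" .~ [uuid] $ defaults) "https://api.github.com/users/nmattia"
            <&> (^.responseBody.key "name"._String)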
If things go really wrong, you can always wipe the cache entirely…
… but where’s the cache?
The cache is set globally (reminder: this is a Jupyter-optimized workflow):
    getCacheFile

    "requests.cache"
If you need a different cache file you can either change the global cache file:
    :t setCacheFile
    setCacheFile :: FilePath -> IO ()
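For example, a one-line sketch (the file name here is a made-up one):

    -- Hypothetical file name: subsequent requests will be cached there.
    setCacheFile "github-experiments.cache"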
or run your CIO code manually:
    :t runCIOWith
    runCIOWith :: forall a. FilePath -> CIO a -> IO a
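For instance, outside of a notebook you could run a CIO action as a plain IO program. Here is a minimal sketch; the cache path and the main wrapper are illustrative assumptions, the rest follows the session above:

    {-# LANGUAGE OverloadedStrings #-}

    import CIO
    import Control.Lens
    import Data.Aeson.Lens

    -- A sketch: run a cached request against an explicit cache file
    -- instead of the global default ("requests.cache").
    main :: IO ()
    main = do
        name <- runCIOWith "my-requests.cache" $
            get "https://api.github.com/users/nmattia"
                <&> (^. responseBody . key "name" . _String)
        print name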
one more thing…
… nope, that’s all! Enjoy!
Like Haskell? Here's more on the topic:
- Automatically generated directories for individual tasty tests
- A Fully Functional, Fully Functional Webapp