Practical use of Ruby PStore

The Arkency blog has undergone several improvements over recent weeks. One such change was opening the source of blog articles. We concluded that having posts in the open would shorten the feedback loop and allow our readers to collaborate and make the articles better for everyone.

Nanoc + Github

For years the blog has been driven by nanoc, which is a static-site generator. You put a bunch of markdown files in, drop in a layout, and out of the other side comes the HTML. Let’s call this magic “compilation”. One of nanoc’s prominent features is data sources. With them, content can be rendered not only from the local filesystem. Given an appropriate adapter, posts, pages or other data items can be fetched from a 3rd party API. Like an SQL database. Or Github!

Choosing Github as a backend for our posts was a no-brainer. Developers are familiar with it. It has quite a nice integrated web editor with Markdown preview, which gives us in-place editing. Pull requests create the space for discussion. Last but not least, there is the octokit gem for API interaction, taking much of the implementation burden off our shoulders.

An initial data adapter to fetch articles looked like this:

```ruby
class Source < Nanoc::DataSource
  identifier :github

  def items
    client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
    client
      .contents(ENV['GITHUB_REPO'])
      .select { |item| item[:name].end_with?(".md") }
      .map    { |item| client.contents(ENV['GITHUB_REPO'], path: item[:path]) }
      .map    { |item| new_item(item[:content], item, Nanoc::Identifier.new(item[:path])) }
  end
end
```

This code:

- gets a list of files in the repository
- filters it by extension, letting only markdown files stay
- gets the content of each markdown file
- transforms it into a nanoc item object

Good enough for a quick spike and exploration of the problem. It becomes problematic as soon as you start using it for real, though. Can you spot the problems?

Source data improved

- for a repository with 100 markdown files we have to make 100 + 1 HTTP requests to retrieve the content
- this takes time and becomes annoying when you’re in the change-layout-recompile-content cycle of working on the site
- there is an API request limit per hour (slightly higher when using a token, but still present)

Making those requests in parallel would only make us hit the request quota faster. Something has to be done to limit the number of requests needed.

Luckily enough, the octokit gem uses the faraday library for HTTP interaction, and some kind souls have documented how one can leverage the faraday-http-cache middleware.

```ruby
class Source < Nanoc::DataSource
  identifier :github

  def up
    stack = Faraday::RackBuilder.new do |builder|
      builder.use Faraday::HttpCache, serializer: Marshal, shared_cache: false
      builder.use Faraday::Request::Retry, exceptions: [Octokit::ServerError]
      builder.use Octokit::Middleware::FollowRedirects
      builder.use Octokit::Response::RaiseError
      builder.use Octokit::Response::FeedParser
      builder.adapter Faraday.default_adapter
    end
    Octokit.middleware = stack
  end

  def items
    repository_items.map do |item|
      identifier = Nanoc::Identifier.new("/#{item[:name]}")
      metadata, data = decode(item[:content])
      new_item(data, metadata, identifier, checksum_data: item[:sha])
    end
  end

  private

  def repository_items
    pool  = Concurrent::FixedThreadPool.new(10)
    items = Concurrent::Array.new
    client
      .contents(repository, path: path)
      .select { |item| item[:type] == "file" }
      .each   { |item| pool.post { items << client.contents(repository, path: item[:path]) } }
    pool.shutdown
    pool.wait_for_termination
    items
  rescue Octokit::NotFound
    []
  end

  def client
    Octokit::Client.new(access_token: access_token)
  end

  def repository
    # ...
  end

  def path
    # ...
  end

  def access_token
    # ...
  end

  def decode(content)
    # ...
  end
end
```

Notice two main additions here:

- the `up` method, used by nanoc when spinning up the data source, which introduces the cache middleware
- `Concurrent::FixedThreadPool` from the concurrent-ruby gem, used to make requests concurrently in multiple threads
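The fan-out in `repository_items` can be sketched with the standard library alone. The snippet below is a simplified stand-in (plain `Thread` and `Queue` instead of concurrent-ruby, a fake string instead of a real `client.contents` call) just to illustrate the pattern of a fixed number of workers draining a queue of paths:

```ruby
# Hypothetical stand-in for the concurrent fetch in repository_items.
def fetch_all(paths, workers: 10)
  queue = Queue.new
  paths.each { |p| queue << p }
  workers.times { queue << :done } # poison pills to stop the workers

  results = []
  mutex   = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (path = queue.pop) != :done
        content = "contents of #{path}" # stand-in for client.contents(repository, path: path)
        mutex.synchronize { results << content }
      end
    end
  end
  threads.each(&:join)
  results
end

puts fetch_all(%w[a.md b.md c.md]).size # => 3
```

concurrent-ruby’s `FixedThreadPool` and `Concurrent::Array` play the roles of the worker threads and the mutex-guarded array here, with less ceremony.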

If only that cache worked… Faraday ships with an in-memory cache, which is useless for the flow of work one has with nanoc. We’d very much like to persist the cache across runs of the compile process. The documentation indeed shows how one can switch the cache backend to one from Rails, but that is not helpful advice in a nanoc context either. You probably wouldn’t like to start a Redis or Memcached instance just to compile a bunch of HTML!

Time to roll up our sleeves again. Knowing what API is expected, we can build a file-based cache backend. And there is a little-known standard library gem we can use to free ourselves from reimplementing the basics. Once more we get to stand on the shoulders of giants.

Enter PStore

PStore is a file-based persistence mechanism based on a Hash. We can store arbitrary Ruby objects in it: they are serialized with Marshal before being dumped to disk. It supports transactional behaviour and can be made thread-safe. Sounds perfect for the job!
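A minimal round-trip shows the PStore API in isolation (the store lives in a throwaway temp directory here):

```ruby
require "pstore"
require "tmpdir"

# Second argument enables PStore's built-in thread safety.
store = PStore.new(File.join(Dir.mktmpdir, "example.store"), true)

# Writes happen inside a read-write transaction and are atomic:
# either the whole block commits to disk or nothing does.
store.transaction do
  store[:greeting] = "hello"
  store[:numbers]  = [1, 2, 3]
end

# Passing true opens a read-only transaction; attempts to write inside it raise.
store.transaction(true) do
  puts store[:greeting]        # => hello
  puts store[:numbers].inspect # => [1, 2, 3]
end
```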

```ruby
require "pstore"

class Cache
  def initialize(cache_dir)
    @store = PStore.new(File.join(cache_dir, "nanoc-github.store"), true)
  end

  def write(name, value, options = nil)
    store.transaction { store[name] = value }
  end

  def read(name, options = nil)
    store.transaction(true) { store[name] }
  end

  def delete(name, options = nil)
    store.transaction { store.delete(name) }
  end

  private

  attr_reader :store
end
```

In the end that cache store turned out to be merely a thin wrapper on PStore. How convenient! Thread safety is achieved here by PStore internally wrapping each transaction block in a Mutex.
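Using the wrapper is straightforward. Reproducing the class so the example is self-contained, a quick round-trip might look like this (the key name is made up for illustration):

```ruby
require "pstore"
require "tmpdir"

# The Cache class from above, reproduced here verbatim.
class Cache
  def initialize(cache_dir)
    @store = PStore.new(File.join(cache_dir, "nanoc-github.store"), true)
  end

  def write(name, value, options = nil)
    store.transaction { store[name] = value }
  end

  def read(name, options = nil)
    store.transaction(true) { store[name] }
  end

  def delete(name, options = nil)
    store.transaction { store.delete(name) }
  end

  private

  attr_reader :store
end

cache = Cache.new(Dir.mktmpdir)
cache.write("rate_limit", 4999)
puts cache.read("rate_limit") # => 4999
cache.delete("rate_limit")
puts cache.read("rate_limit").inspect # => nil
```

The `write`/`read`/`delete` trio with an ignored `options` argument mirrors the store interface that faraday-http-cache expects, which is why the wrapper can be dropped in directly.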

```ruby
class Source < Nanoc::DataSource
  identifier :github

  def up
    stack = Faraday::RackBuilder.new do |builder|
      builder.use Faraday::HttpCache,
                  serializer: Marshal,
                  shared_cache: false,
                  store: Cache.new(tmp_dir)
      # ...
    end
    Octokit.middleware = stack
  end

  # ...
end
```

With the persistent cache store plugged into Faraday we can now reap the benefits of cached responses. Subsequent requests to the Github API are skipped; responses are served directly from local files. That is, as long as the cache stays fresh…

Cache validity can be controlled by several HTTP headers. In the case of the Github API it is Cache-Control: private, max-age=60, s-maxage=60 that matters. Together with the Date header, this roughly means the content stays valid for 60 seconds after the response was received. Is that much? For frequently changing content, probably. For blog articles I’d prefer something more long-lasting…
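To make the rule concrete, here is a rough sketch of the freshness check. The `fresh?` helper is hypothetical and deliberately simplified compared to the full age calculation that a real HTTP cache performs; it only captures the “fewer than max-age seconds since the Date header” idea:

```ruby
require "time"

# Simplified freshness rule: a response counts as fresh while fewer
# than max-age seconds have passed since its Date header.
def fresh?(headers, now: Time.now)
  max_age = headers["cache-control"][/max-age=(\d+)/, 1].to_i
  age     = now - Time.httpdate(headers["date"])
  age < max_age
end

headers = {
  "cache-control" => "private, max-age=60, s-maxage=60",
  "date"          => Time.now.httpdate
}

puts fresh?(headers)                      # => true
puts fresh?(headers, now: Time.now + 120) # => false
```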

And that is how we arrive at the last piece of nanoc-github: a faraday middleware that allows extending the cache time. It is a quite primitive piece of code that substitutes the max-age value with the desired one. For my particular needs I set this value to 3600 seconds. The general idea is that we modify HTTP responses from the API before they hit the cache. The cache middleware then determines cache validity based on the modified age, rather than the original one. Simple and good enough. Just be careful to add this to the middleware stack in the correct order 😅

```ruby
class ModifyMaxAge < Faraday::Middleware
  def initialize(app, time:)
    @app  = app
    @time = Integer(time)
  end

  def call(request_env)
    @app.call(request_env).on_complete do |response_env|
      response_env[:response_headers][:cache_control] =
        "public, max-age=#{@time}, s-maxage=#{@time}"
    end
  end
end
```
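One possible wiring is sketched below; the actual order in nanoc-github may differ, so treat the placement as an assumption. In Faraday, the response travels back from the adapter up through the most recently listed middleware first, so listing `ModifyMaxAge` after `Faraday::HttpCache` means the Cache-Control header is rewritten before the cache middleware decides whether, and for how long, to store the response:

```ruby
# Assumed placement of ModifyMaxAge, below the cache in the stack.
stack = Faraday::RackBuilder.new do |builder|
  builder.use Faraday::HttpCache, serializer: Marshal,
              shared_cache: false, store: Cache.new(tmp_dir)
  builder.use ModifyMaxAge, time: 3600
  builder.use Octokit::Response::RaiseError
  builder.adapter Faraday.default_adapter
end
Octokit.middleware = stack
```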

And that’s it! I hope you found this article useful and learned a thing or two. Drop me a line on twitter or leave a star on this project:

Happy hacking!