I have an obsession with data.

In this post I’ll show you what I built with data using Elixir, Poolboy, Mogrify, and AndreaMosaic.

I like to try new technologies on real problems so I can learn from them.

Some years ago I built a web crawler that scraped news from the most popular media websites in my city. I also made a Facebook bot so people could read all the news in one place.

My first version of the crawler was “OK.” I wrote it in Java, and I learned that dealing with concurrency there is very difficult; for example, it is easy to introduce race conditions. In the end I got it all working, but it took a lot of time and effort.

Elixir

I’ve been trying to learn Elixir for the past two years. I had learned basics like pattern matching, OTP, and macros, but I hadn’t had a chance to build something from scratch. So I decided to redo the Java crawler, this time in Elixir.

I won’t explain every implementation detail, but I will tell you how it works and which tools I used.

Challenges

Read all links from the front page of the media website

Identify which links match with the pattern related to a single news item

Build an Article struct with fields like title, content, URL, etc.

Save it to the database

Save the thumbnail to my computer

Resize the thumbnail

Do all this recursively for child nodes

Use Elixir concurrent workers for doing these tasks without exhausting my system resources

Libraries

These are some of the libraries I used to do this,

HTTPotion: This is an HTTP client

Floki: HTML parser

Ecto: A database wrapper and language integrated query for Elixir

Mogrify: Wrapper to use an awesome library called ImageMagick

Poolboy: Worker pool factory

How It Works

I used HTTPotion to fetch the HTML of every single page. The first thing I did was crawl the home page of the media site. Then, with the help of Floki, I got the href attribute of every <a> tag.

The code looked something like this,

```elixir
def extract_links(html) do
  html
  |> Floki.find("a")   # get all <a> tags
  |> get_only_links()  # get href attributes
  |> filter_links()    # keep only single news links
  |> Enum.uniq()       # remove duplicate links
end
```
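The get_only_links and filter_links helpers aren’t shown in the post. As a rough sketch, they could look like this; the /news/ path pattern is purely hypothetical, since the real filter depends on each site’s URL scheme,

```elixir
# Hypothetical sketch — the real helpers depend on the site's URL patterns.
defp get_only_links(a_tags) do
  a_tags
  |> Floki.attribute("href")                       # pull the href value from each <a>
  |> Enum.filter(&String.starts_with?(&1, "http")) # keep only absolute URLs
end

defp filter_links(links) do
  # Keep only links that look like a single news article.
  Enum.filter(links, &String.contains?(&1, "/news/"))
end
```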

Once I had extracted the URLs, I looped through each of them and crawled it to get an object,

```elixir
# %{title: title, content: content, thumbnail: thumbnail … etc}

def get_article(html, url) do
  %ArticleStruct{
    title: title(html),
    slug: Slugger.slugify_downcase(title(html), ?_),  # slug
    content: content(html),
    url: url,
    thumbnail: thumbnail(html)
  }
end
```
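The title, content, and thumbnail extractors aren’t shown either. They could be thin Floki selectors like the sketch below; the CSS selectors here are made up, and each site would need its own,

```elixir
# Hypothetical selectors — every news site lays out its pages differently.
defp title(html), do: html |> Floki.find("h1.title") |> Floki.text()

defp content(html), do: html |> Floki.find("div.article-body") |> Floki.text()

defp thumbnail(html) do
  html
  |> Floki.find("meta[property='og:image']")  # Open Graph image tag
  |> Floki.attribute("content")
  |> List.first()
end
```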

Notice that I also save the slug of the title. This helps me later to identify each thumbnail by name.

Once I had this struct filled, I could go ahead and save it in my database using Ecto.
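The Ecto part isn’t shown in the post. A minimal sketch could look like this; the Scrapper.Repo module, the articles table, and its column names are all my assumptions,

```elixir
# Hypothetical schema — repo name, table name and columns are assumptions.
defmodule ScrapperApp.Article do
  use Ecto.Schema

  schema "articles" do
    field :title, :string
    field :slug, :string
    field :content, :string
    field :url, :string
    field :thumbnail, :string
  end
end

# Saving a crawled article would then be a single insert:
#   %ScrapperApp.Article{title: "...", slug: "...", url: "..."}
#   |> Scrapper.Repo.insert()
```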

In order to make our beautiful mosaic, we need tons of images stored locally. I used HTTPotion again to download the image from the thumbnail URL, and Mogrify to resize it.

```elixir
def save_image(article) do
  case HTTPotion.get(article.thumbnail) do
    %HTTPotion.Response{body: body} ->
      basepath = "/path/images/"
      filename = Path.join(basepath, "#{article.slug}.png")
      File.write!(filename, body)
      resize_image(filename, 200, 200)
      article

    _ ->
      nil
  end
end
```

Here is how I resized the image and saved it,

```elixir
def resize_image(image_path, width, height, _opts \\ []) do
  Mogrify.open(image_path)
  |> Mogrify.resize_to_limit(~s(#{width}x#{height}))
  |> Mogrify.save(path: image_path)
end
```

Once I had all this working, I needed to set up a pool of Elixir workers so my computer could do all this concurrent work without dying.

Here is where Poolboy comes into play. I used it to configure a Supervisor that keeps a pool of workers available at all times.

```elixir
defmodule ScrapperApp.Application do
  @moduledoc false
  use Application

  defp poolboy_config do
    [
      {:name, {:local, :worker}},
      {:worker_module, ScrapperApp.MyWorker},
      {:size, 3},
      {:max_overflow, 4}
    ]
  end

  def start(_type, _args) do
    children = [
      :poolboy.child_spec(:worker, poolboy_config())
    ]

    opts = [strategy: :one_for_one, name: Scrapper.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```
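The supervisor only starts the pool; the post doesn’t show how work gets handed to it. With Poolboy that is typically done by checking a worker out, for example with :poolboy.transaction/3. The sketch below assumes ScrapperApp.MyWorker is a plain GenServer and that it accepts a {:crawl, url} message; both are my assumptions,

```elixir
# Hypothetical worker — the :crawl message shape is an assumption.
defmodule ScrapperApp.MyWorker do
  use GenServer

  def start_link(_args), do: GenServer.start_link(__MODULE__, nil)

  def init(state), do: {:ok, state}

  def handle_call({:crawl, url}, _from, state) do
    # fetch, parse and persist the article for this URL here
    {:reply, :ok, state}
  end
end

# Handing a URL to the pool. transaction/3 blocks until a worker is free,
# so at most size + max_overflow pages are being crawled at once.
:poolboy.transaction(:worker, fn pid ->
  GenServer.call(pid, {:crawl, url})
end)
```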

Running the App

AndreaMosaic

AndreaMosaic is free software that creates mosaic images for you, and it’s really fast. I love this tool.

Here is a screenshot of how it looks,

To make it work, choose a background image and the folder containing your tile images. You can specify whether to repeat tiles, the size of the final image, and so on. Give it a try; it’s really easy to use.

Conclusion

I’m really impressed by how easy Elixir is to use. I highly recommend building something from scratch; it worked really well for me.

The Elixir community is still growing, and this is the time to get on board.

Resources