Putting the plan in motion

First step, creating a new project. I decided to create the Phoenix project as an umbrella project. An umbrella in Elixir is just a way of organizing a project into standalone applications that depend on each other. This makes it straightforward to reuse parts of the project in other applications, and it is also a neat way of separating the components of your application into organized, reusable, easy-to-understand modules.

mix phx.new --umbrella rent_bot

Phoenix is pretty cool and gives us a ready-to-use project with everything we need: the basic functionality of a web application (receiving and responding to requests), database configuration, unit tests and some documentation.

Crunching HTML all day long

The real work started here. I decided to begin with the web scraping of one of the classified advertisements websites and see how much effort was needed to get the metadata I need.

First of all, I needed to know how I could reliably get the page of advertisements with all of the filters for my search, and also how I could programmatically navigate through the multiple pages of search results on these sites. It turns out that, generally, when you make a search, all the filters and the pagination page are in the URL as parameters. That way, I just needed to build the URL with the parameters for the filters I wanted and a page number that I could increment until I got no more results.

https://www.randomwebsite.com/arrendar/apartamento/porto/?search%5Bfilter_float_price%3Ato%5D=600&search%5Bdescription%5D=1&page=1

Next, I needed to process the HTML of that page. I did a quick search for HTML parsing libraries for Elixir and I found Floki. Floki parses the given HTML and allows me to search for the desired DOM elements using regular CSS selectors. Just what I needed!

I had an HTML parsing library, but Floki is just that. You still need to get the HTML and pass it to the library. For that, I used the popular HTTPoison library, which allows you to make HTTP requests in Elixir.

With every tool I needed in place, I just had to write some functions to make the HTTP request to the website I wanted, pass the HTML from the response to Floki, find the elements of every advertisement on the page and, for each of those elements, get the metadata. It turns out it is pretty simple to do all of that with the power of Elixir and the available libraries.

It almost looks like pseudo-code, but Elixir gives an incentive to write small, descriptive functions that allow everyone to quickly understand the basic logic of the application. In this case we are creating a function called import that takes a page number as argument. In the first line of the function we build a string with the base URL and the page number. Then, using the awesome pipe operator, we first give the URL we created to a function called get_page_html . We don’t know how that function is implemented, but we can make a pretty good guess that it makes the HTTP request to get the HTML of the page at that URL.
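The original snippet is not reproduced here, so this is a minimal sketch of the pipeline just described. The module name and base URL are assumptions; the function names come from the article. The function is named import_page here only because import clashes with an Elixir special form when called unqualified.

```elixir
defmodule RentBot.Importer do
  # Placeholder base URL; the real one carries the search filters as parameters.
  @base_url "https://www.randomwebsite.com/arrendar/apartamento/porto/"

  # Build the URL for the given page, fetch its HTML, find the ad
  # elements and extract the metadata from each one.
  def import_page(page) do
    "#{@base_url}?page=#{page}"
    |> get_page_html()
    |> get_dom_elements()
    |> extract_metadata()
  end

  # get_page_html/1, get_dom_elements/1 and extract_metadata/1
  # are shown further below.
end
```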

It turns out that it does just that, in two lines of code! That function returns the body of the response, which in this case is the HTML of the requested page.
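A hypothetical two-line version of that function, using HTTPoison as the article describes:

```elixir
# HTTPoison.get!/1 returns a %HTTPoison.Response{} struct;
# we pattern match to keep only its body, i.e. the page's HTML.
defp get_page_html(url) do
  %HTTPoison.Response{body: body} = HTTPoison.get!(url)
  body
end
```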

The next function in the pipe is get_dom_elements . Again, just by looking at the function name, we can guess that this function parses the HTML and searches for the target elements that match our CSS selectors. Remember that the input of this function is the output of the previous function in the pipe, the get_page_html function that returned the HTML of the page.

How cool is this? You can parse and get the DOM elements you need in one line of code!
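A sketch of what that one-liner could look like with Floki; the CSS selector here is a placeholder:

```elixir
# Parse the HTML and select every advertisement element on the page.
defp get_dom_elements(html) do
  html |> Floki.parse_document!() |> Floki.find("article.offer-item")
end
```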

The last function in the pipe is the extract_metadata function. And, you guessed it, it processes the DOM elements found and extracts the metadata.

This function looks more complex, but after you analyze it, it turns out to be much simpler than it looks. We receive a list of DOM elements from the previous function (the get_dom_elements function), so we iterate through the list of elements with Elixir’s Enum.map function. It is a normal mapping function, similar to what you would find in JavaScript or Java: for each element in our list we apply a transformation, creating a new list with the transformed elements. This transformation is the creation of a regular Elixir map data structure with the title, URL, price, image URL (if available) and the name of the provider website where this data came from. You can see helper functions for each field (except the provider) that just do another search for the desired information in the given DOM element.
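A sketch of that mapping, with one example helper. The selectors and helper bodies are assumptions; the map keys follow the fields listed above:

```elixir
# Turn each ad element into a plain Elixir map of its metadata.
defp extract_metadata(elements) do
  Enum.map(elements, fn element ->
    %{
      title: get_title(element),
      url: get_url(element),
      price: get_price(element),
      image_url: get_image_url(element),
      provider: "randomwebsite"
    }
  end)
end

# Example helper: just another Floki search inside the given element.
defp get_title(element) do
  element |> Floki.find(".offer-title") |> Floki.text() |> String.trim()
end
```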

And that is it, we have a list of houses to rent!

When you feel shit is getting done

The beauty of all the code I wrote so far is that, with some minor changes to the elements to search for in the DOM, everything else stays the same for the other websites I want to search.

With that done, the next step was to find a way to schedule a task to run regularly and check for new entries on all of the websites. With another quick search for an Elixir task-scheduling library, I found Quantum. Quantum allows me to schedule recurring tasks using Cron-like notation.

In my configuration file, I created a new scheduled task for each of the provider websites I wanted to search for advertisements. Each task is configured to run the import_ads function of the given module ( RentBotWeb.Tasks.XYZ ) every five minutes with the given arguments (the argument list [1] holds the page number to start the search from).
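A hedged sketch of that Quantum configuration; the scheduler module name and app key are assumptions, while the Cron-style job tuple follows Quantum's documented format:

```elixir
# config/config.exs
# Every five minutes, call RentBotWeb.Tasks.XYZ.import_ads(1),
# where 1 is the page number to start the search from.
config :rent_bot_web, RentBotWeb.Scheduler,
  jobs: [
    {"*/5 * * * *", {RentBotWeb.Tasks.XYZ, :import_ads, [1]}}
  ]
```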

As you can see, even this function is pretty simple once you analyze it. We start by doing some logging to the console to announce the start of the task. We then call the import function we built before, specifying the page number we want to search. Then we have a condition: if that function returns an empty list, we stop the process. Otherwise, we take the returned list and give it to a process_entries function. Again, naming is important to make your code readable. Just by looking at the code, you can guess that the process_entries function will do some processing on our list and return a new list with just the entries that are new.

And of course it does just that! It maps over the entries list and passes each entry to the insert_entry function. That function takes the entry and first queries the database to see if the entry is already there. If it is, it returns nil ; otherwise it inserts the entry in the database and returns it. The final step of the process_entries function is a filter for the nil values in the list.
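A sketch of those two functions as just described. The Repo module, the Entry schema and the choice of the URL as the uniqueness key are assumptions:

```elixir
# Insert every entry, then keep only the ones that were actually new.
defp process_entries(entries) do
  entries
  |> Enum.map(&insert_entry/1)
  |> Enum.reject(&is_nil/1)
end

# Return nil if the entry already exists; otherwise insert and return it.
defp insert_entry(entry) do
  case Repo.get_by(Entry, url: entry.url) do
    nil -> Repo.insert!(struct(Entry, entry))
    _existing -> nil
  end
end
```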

Going back to the import_ads function: if the new entries list has more than zero elements, we call a notify_subscribers function, and then we continue the importing process with the next page of results. Wait, who are these subscribers anyway?
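Putting the steps above together, import_ads could look like the following sketch. The Logger calls and the recursion into the next page are assumptions consistent with the text; import_page/1 stands in for the article's import function:

```elixir
require Logger

# Import one page of results; stop on an empty page, otherwise
# persist the new entries, notify subscribers and move on.
def import_ads(page) do
  Logger.info("Starting import for page #{page}")

  case RentBot.Importer.import_page(page) do
    [] ->
      :ok

    entries ->
      new_entries = process_entries(entries)
      if length(new_entries) > 0, do: notify_subscribers(new_entries)
      import_ads(page + 1)
  end
end
```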

Finally, a Facebook Messenger chat bot

Creating a Facebook Messenger bot is pretty straightforward. In your application, you just need two endpoints: an endpoint to receive a GET request to validate the application, and another endpoint to receive POST requests with the messages for your bot.

Creating and validating your application in the Facebook Developers platform is also pretty easy and painless.

The validation endpoint is pretty easy to implement. You choose a random string to be your verification token and put it in the Facebook Developer app settings. When asked to validate your application, Facebook will send a GET request with your verification token and some other parameters. One of those parameters is the hub.challenge parameter; your endpoint should return the value of that parameter as the response.

As with everything else in Elixir, it is pretty easy to do that.
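The original snippet is missing here, so this is a sketch of such a verification action in a Phoenix controller. The parameter names follow Facebook's webhook verification protocol; the config key is an assumption:

```elixir
# Pattern match the verification parameters straight out of the request.
def verify(conn, %{"hub.verify_token" => token, "hub.challenge" => challenge}) do
  verify_token = Application.get_env(:rent_bot_web, :fb_verify_token)

  if token == verify_token do
    # Token matches: echo the challenge back with a 200.
    send_resp(conn, 200, challenge)
  else
    send_resp(conn, 401, "unauthorized")
  end
end
```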

As you can see in this code snippet, we use pattern matching right on the function parameters to get just what we need from the given parameters. On the first line of the function, we get the verify token from the project configuration file. Using that, we compare it with the verify token sent on the request and if it matches, we respond with the challenge parameter value and a 200 status code. Otherwise, the request is unauthorized. Pretty neat!

All that is left is to handle the POST requests with the messages coming in for the bot. In the context of our application, the bot is just listening for some specific text to register the user as a subscriber. This way, we can store a unique ID created for the user interacting with our bot and use it to send messages back at any time.

We just iterate over each entry in the entries list that is part of the request parameters. For each message we check if the text is our super secret string that subscribes a user to our platform, and if it matches we save the sender_psid that identifies this user. Other messages are ignored and receive a generic message as a response.
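A sketch of that handler. The subscription phrase, the Subscribers context and send_generic_reply/1 are hypothetical; the "entry"/"messaging" nesting follows Facebook's webhook payload format:

```elixir
# Walk the webhook payload and react to each incoming message.
def handle_messages(conn, %{"entry" => entries}) do
  for entry <- entries,
      %{"sender" => %{"id" => sender_psid}, "message" => message} <- entry["messaging"] do
    case message["text"] do
      # Hypothetical secret subscription phrase.
      "subscribe me" -> Subscribers.save(sender_psid)
      _other -> send_generic_reply(sender_psid)
    end
  end

  send_resp(conn, 200, "ok")
end
```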

So, with the IDs of the users interested in the notifications, we go back to the notify_subscribers function that we saw in the import_ads function.

Again, it is just a simple iteration through the subscribers, sending each one a card message with the given ad details.
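A sketch of that iteration, with a hypothetical Send API call. The Subscribers context, the @page_token attribute and the payload details are assumptions; the generic-template shape follows Facebook's Send API format:

```elixir
# Send every new ad to every stored subscriber.
defp notify_subscribers(entries) do
  for subscriber <- Subscribers.all(), entry <- entries do
    send_card_message(subscriber.psid, entry)
  end
end

# POST a generic-template "card" with the ad details to the Send API.
defp send_card_message(psid, entry) do
  body =
    Jason.encode!(%{
      recipient: %{id: psid},
      message: %{
        attachment: %{
          type: "template",
          payload: %{
            template_type: "generic",
            elements: [
              %{
                title: entry.title,
                subtitle: entry.price,
                image_url: entry.image_url,
                default_action: %{type: "web_url", url: entry.url}
              }
            ]
          }
        }
      }
    })

  HTTPoison.post!(
    "https://graph.facebook.com/me/messages?access_token=#{@page_token}",
    body,
    [{"Content-Type", "application/json"}]
  )
end
```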

And boom, you have a complete system working!

My daily batch of new apartments to rent

The next thing I did was create a release using Distillery, build a Docker image with my application, and set it to run on an EC2 machine on AWS. All of this in the two hours after dinner on the day I decided to try something new.