Scraping metadata (e.g. title, description, URL) from a URL is something Facebook does for you when you paste a link into the “Update Status” box. For a service that I’m currently building, we wanted to offer the same thing to our users. Thus Metascraper was born.

There was already a Ruby solution called link_thumbnailer, but since this is an I/O-heavy operation, I wanted to build on tools that support non-blocking I/O and can be used without getting caught in callback spaghetti. Scala, Akka, and the Play framework immediately came to mind.

Existing solutions

Before building my own solution, I did some research and found existing web-scraping libraries written in Scala or Java, such as chafed, along with others listed in this Stack Overflow question.

I wanted something more focused, something that would “intelligently” return a page’s title, description, URLs, and images. I also wanted to make sure that if the page implemented the Open Graph Protocol, the information from those tags got prioritised. Since existing Scala libraries did not meet these requirements, I set about creating my own.
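To make "prioritising Open Graph tags" concrete, here is a minimal sketch using jsoup. It is not Metascraper's actual implementation, only an illustration of the idea: look for an `og:` meta tag first, and fall back to the regular HTML tag when it is missing. The names `OpenGraphSketch`, `ogTag`, `title`, and `description` are made up for this example.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical helper, not part of Metascraper: prefers Open Graph
// meta tags and falls back to the plain HTML equivalents.
object OpenGraphSketch {

  // Content of <meta property="og:NAME" content="...">, if present and non-empty.
  def ogTag(doc: Document, name: String): Option[String] =
    Option(doc.select(s"meta[property=og:$name]").first()) // first() is null when nothing matches
      .map(_.attr("content").trim)
      .filter(_.nonEmpty)

  // og:title if available, otherwise the <title> element.
  def title(doc: Document): Option[String] =
    ogTag(doc, "title").orElse(Option(doc.title().trim).filter(_.nonEmpty))

  // og:description if available, otherwise <meta name="description">.
  def description(doc: Document): Option[String] =
    ogTag(doc, "description").orElse(
      Option(doc.select("meta[name=description]").first())
        .map(_.attr("content").trim)
        .filter(_.nonEmpty)
    )

  def main(args: Array[String]): Unit = {
    val html =
      """<html><head>
        |  <title>Plain title</title>
        |  <meta property="og:title" content="OG title">
        |  <meta name="description" content="Plain description">
        |</head><body></body></html>""".stripMargin
    val doc = Jsoup.parse(html)
    println(title(doc))       // the Open Graph title wins over <title>
    println(description(doc)) // no og:description, so the plain meta tag is used
  }
}
```

Returning `Option` keeps "tag missing" explicit, so a caller can decide what to show when a page provides neither form of a field.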

Metascraper Components

The main components of the Metascraper library include:

- Akka actors
- jsoup: While there were Scala web scrapers, the Java solution, jsoup, was very mature and easy to use.

Basic workflow (a.k.a. how to use)

This post won’t go into much detail on how to use the library, because that is covered on the Metascraper GitHub page and will probably change over time, but this is the basic workflow:

1. Instantiate a ScraperActor
2. Send a message to the scraper with ScrapeUrl(url: String)
3. When scraping is done, the actor will reply with an Either[FailedToScrapeUrl,ScrapedData]

The project is Mavenised and is available from the Central Repository, so simply add the dependency to your build.sbt (by the time you read this the version may be different, so refer to the project’s GitHub page):

```scala
libraryDependencies += "com.beachape.metascraper" %% "metascraper" % "0.0.2"
```

And to use it,

Metascraper example code (metascraper_example.scala)

```scala
import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout
import com.beachape.metascraper.Messages._
import com.beachape.metascraper.ScraperActor
import scala.concurrent.duration._

// Implicits needed by the ask pattern and for running Future callbacks
implicit val timeout = Timeout(30.seconds)
implicit val system = ActorSystem("actorSystem")
implicit val dispatcher = system.dispatcher

val scraperActor = system.actorOf(ScraperActor())

// ask returns a Future; the body below runs once the actor has replied
for {
  result <- ask(scraperActor, ScrapeUrl("https://bbc.co.uk")).mapTo[Either[FailedToScrapeUrl, ScrapedData]]
} {
  result match {
    case Left(failed) =>
      println("Failed: ")
      println(failed.message)
    case Right(data) =>
      println("Image URLs:")
      data.imageUrls.foreach(println)
  }
}

/*
#=>
Image URLs:
http://www.bbc.co.uk/img/iphone.png
http://sa.bbc.co.uk/bbc/bbc/s?name=SET-COUNTER&pal_route=index&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.16.1&pal_webapp=wwhp&blq_s=3.5&blq_r=3.5&blq_v=default-worldwide
http://static.bbci.co.uk/frameworks/barlesque/2.51.2/desktop/3.5/img/blq-blocks_grey_alpha.png
http://static.bbci.co.uk/frameworks/barlesque/2.51.2/desktop/3.5/img/blq-search_grey_alpha.png
http://news.bbcimg.co.uk/media/images/69612000/jpg/_69612953_69612952.jpg
*/
```

Example application

I’ve created an example Play2 application that integrates this library, called metascraper-service. Feel free to take a look!

Conclusion

Please give Metascraper a test drive and submit issues and pull requests!