The first full round of March Madness is Christmas morning for college basketball fans: 2 days, 32 games, lots of upsets and late-game drama. Last week, on the first full day of the tournament, WNYC transportation reporter Jim O’Grady casually mentioned that he couldn’t keep tabs on all the action during the day. He wished he could get a text message whenever a game was coming down to the wire so he would know when to neglect his professional responsibilities and tune in for the end. I started kicking around the idea in my head a little, and after work my colleague Jenny Ye and I decided to take a break from writing weird JavaScript to write some more weird JavaScript. The result was @NailbiterBot, a humble Twitter bot that posts a tweet whenever an NCAA tournament game is close late in the second half.

We needed something that constantly checks whether there's a close game in progress and, if so, posts a tweet about it. The steps to building it would be something like:

1. Find a source of live, accurate score data.
2. Decide whether a given game qualifies as a nailbiter (close score, little time left).
3. Compose and post a tweet about it.
4. Repeat every minute or so.
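Stripped of the scraping details, that plan can be sketched as a small skeleton. The function names here are hypothetical placeholders, not the bot's actual code; only the nailbiter rules come from the logic described below:

```javascript
// A sketch of the bot's overall shape. isNailbiter() captures the rules
// used later in this post; checkScoreboard and tweetNailbiter are
// hypothetical placeholders for the scraping and tweeting steps.
function isNailbiter(game) {
  // Second half, under three minutes left, within 8 points
  return game.half === 2 &&
         game.minutesLeft < 3 &&
         Math.abs(game.scores[0] - game.scores[1]) <= 8;
}

// Every minute: scrape the scoreboard, tweet any nailbiters
// setInterval(function() {
//   checkScoreboard().filter(isNailbiter).forEach(tweetNailbiter);
// }, 60 * 1000);
```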

Finding the Data

Deciding whether a game is close isn’t difficult once you have the current score and game clock, but getting that data is easier said than done. Most of the work for projects like these goes into finding accurate, up-to-the-minute data and reverse-engineering it into a usable format.

We started by going straight to the source, the NCAA.com scoreboard, and trying to scrape that page. It looks like this:

But what we actually care about is the markup under the hood, which looks like this:
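The original screenshot of the markup isn't reproduced here, but a simplified sketch, reconstructed from the selectors the scraper targets below (the real page has many more elements and attributes, and the scores shown are made up), looks roughly like:

```html
<section id="scoreboard">
  <section class="game live">
    <div class="game-status">2nd 5:40</div>
    <table class="linescore">
      <tr> <!-- ... --> <td class="final score">64</td></tr>
      <tr> <!-- ... --> <td class="final score">61</td></tr>
    </table>
  </section>
  <!-- ...one <section> per game... -->
</section>
```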

If you break down that HTML structure, you can pull out the teams in each game, the score, and the time remaining. We used Node.js and the cheerio module, which makes it easy to write scrapers if you know how to write jQuery for a browser:

var request = require('request');
var $ = require('cheerio');

//Function to call once the page is downloaded
function gotHTML(err, resp, html) {
  if (err) return console.error(err);

  //Load the HTML into cheerio
  var page = $.load(html);

  //Find all the <section> elements for games in progress
  var gamesInProgress = page("section#scoreboard section.game.live");

  //For each game in progress, see whether it's a nailbiter
  gamesInProgress.each(function(i, game) {
    var $game = page(game);
    var gameStatus = $game.find("div.game-status").text();

    //If it's not the second half, do nothing
    if (!gameStatus.match(/^2nd/i)) return true;

    //This is the clock remaining, like "5:40"
    var gameClock = gameStatus.split(" ")[1];

    //If there are three or more minutes left, do nothing
    if (parseInt(gameClock.split(":")[0], 10) >= 3) return true;

    //Get the teams' current scores
    var scores = [];
    $game.find("table.linescore td.final.score").each(function(j, score) {
      scores.push(parseInt(page(score).text(), 10));
    });

    //If the point differential is more than 8, do nothing
    if (Math.abs(scores[0] - scores[1]) > 8) return true;

    //OK, it's a nailbiter! Next step: compose a nice tweet
    //We'll figure this out later
  });
}

var url = 'http://www.ncaa.com/scoreboards/basketball-men/d1';

//Download the contents of the scoreboard page
request(url, gotHTML);

Like pretty much all the JavaScript I write, this syntax borrows heavily from Max Ogden.

But there was an insidious problem with this approach. The results seemed roughly correct on their own, but they didn't match what we saw in the browser. That's because the data in the source of the page is actually a few minutes old; it gets updated right when you load the page (if you refresh that page and look closely, you can see the flash when the numbers change). If we used the scraped data, the bot would be pretty worthless, tweeting "CLOSE GAME WITH THREE MINUTES LEFT!" when the game was actually ending.

This is a common problem when scraping data. The data you want might appear to be “on the page” but often it’s getting loaded in separately using JavaScript. A good way to check whether the data is in the raw HTML of the page is to right-click in your browser and use “View Page Source,” and then search for the same piece of data. If it doesn’t show up there, it means your data is getting mixed in after you load the page.

Fortunately we can put on our detective hats (you do own a detective hat, right?) and trace where the data is actually coming from through the magic of a browser's developer tools. In Chrome, you can go to Tools > Developer Tools and then click on the Network tab (in Firefox, you can get a similar view at Tools > Web Developer > Network). In either case, you'll get a console showing you all the OTHER files your browser is loading in addition to the page itself.

Holy gibberish filenames, Batman! This is another common problem when trying to scrape data: most major websites load about ten billion files on every pageview, everything from images to scripts to ad trackers to stylesheets and more. It can be a lot to sort through. But two tricks can save you from having to manually investigate what’s in each of these files:

1. You can filter by file type. We only really care about two types: "XHR" (the most common way data gets sucked in) and "Scripts" (less common). If you only look at those two types, the list is a lot shorter.

2. If a page is updating data automatically without you leaving the page, like this one is, that means it's requesting a new file periodically. If you clear the list and then just wait until the scoreboard updates itself, you should see only the file you care about.

If we do this we eventually find a file that contains all kinds of details about each game:

Jackpot! This is a JSON file (technically JSONP), a data format that’s perfect for analyzing with JavaScript.
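JSONP is just JSON wrapped in a function call, something like `callbackFn({...});`, so before you can hand it to `JSON.parse` you have to strip the wrapper. Here's a minimal sketch of that unwrapping; the callback name and the shape of the feed are assumptions for illustration, not the NCAA file's actual structure:

```javascript
// JSONP looks like: callbackFn({"games": [...]});
// Strip everything up to the first "(" and after the last ")",
// then parse what's left as plain JSON.
function parseJSONP(body) {
  var start = body.indexOf('(');
  var end = body.lastIndexOf(')');
  return JSON.parse(body.slice(start + 1, end));
}
```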