My current work requires keeping track of more than a dozen live websites and making sure their software versions stay up to date. We use a small Go program for this, which enables us to scrape the generator meta tag on each site and read off the version number.

The code is simple and uses goroutines to connect to the sites concurrently. The code we use in prod is a bit more specialised, but I've reproduced a version that captures the essentials below:

package main

import (
    "encoding/json"
    "io/ioutil"
    "net/http"
    "os"
    "regexp"
    "time"
)

// Site type holds information about sites
type Site struct {
    Name, URL, Version, TimeTaken string
}

var sites = []Site{
    {"Site A", "https://a.web.site", "N/A", ""},
    {"Site B", "https://b.web.site", "N/A", ""},
    {"Site C", "https://c.web.site", "N/A", ""},
}

// Parser type holds information about the parsing
type Parser struct {
    url, body string
}

func (p *Parser) setURL(url string) {
    p.url = url
    p.body = ""
}

Site is simply a struct that holds information about a particular site, with fields to capture the scraped version and the time the scrape took.

We've initialised a slice of sites with the URLs we need to scrape.
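
The fields are exported (capitalised) so that the encoding/json package, which we use at the end to print the results, can see them. As a self-contained sketch with hypothetical values, a single Site encodes like this:

package main

import (
    "encoding/json"
    "os"
)

type Site struct {
    Name, URL, Version, TimeTaken string
}

func main() {
    // Hypothetical values, just to show the JSON shape of the output.
    s := Site{"Site A", "https://a.web.site", "WordPress 6.4", "312ms"}
    json.NewEncoder(os.Stdout).Encode(s)
    // Output: {"Name":"Site A","URL":"https://a.web.site","Version":"WordPress 6.4","TimeTaken":"312ms"}
}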

The Parser simply encapsulates parsing behaviour, and has two more methods, reproduced below. For each scrape, we initialise a new parser.

func (p *Parser) getMatches(regexString string, index int) string {
    // Compile the pattern and look for submatches in the stored body.
    regex, _ := regexp.Compile(regexString)
    matches := regex.FindStringSubmatch(p.body)
    // FindStringSubmatch returns the full match followed by one entry per
    // capture group, so a hit on group `index` means len == index+1.
    if len(matches) == index+1 {
        return matches[index]
    }
    return "N/A"
}

func (p *Parser) getVersion() string {
    // Fetch the page; any transport error is reported in the output.
    res, err := http.Get(p.url)
    if err != nil {
        return "Fetch Error"
    }
    body, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    if err != nil {
        return "Parse Error"
    }
    p.body = string(body)
    // Capture group 2 holds the content attribute of the generator tag.
    return p.getMatches("(<meta name=\"generator\" content=\")([^\"]+)", 2)
}

getVersion() simply fetches the page and reads its body, and getMatches() tries to match the generator tag regex against that body and returns the captured value of the tag.
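
To make the matching concrete, here's a minimal, self-contained sketch of the same regex applied to a hypothetical page body (the WordPress string is just an example):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Hypothetical page body; a real site would return full HTML.
    body := `<html><head><meta name="generator" content="WordPress 6.4"></head></html>`

    regex := regexp.MustCompile(`(<meta name="generator" content=")([^"]+)`)
    matches := regex.FindStringSubmatch(body)
    // matches[0] is the full match, matches[1] the first capture group,
    // and matches[2] the generator value we're after.
    if len(matches) == 3 {
        fmt.Println(matches[2]) // WordPress 6.4
    }
}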

Now, we get to the main function and the getSiteVersion() goroutine:

func getSiteVersion(key int, done chan<- bool) {
    start := time.Now() // to check the time taken
    site := Parser{}
    site.setURL(sites[key].URL)
    sites[key].Version = site.getVersion()
    sites[key].TimeTaken = time.Since(start).String()
    done <- true // signal completion to main
}

func main() {
    done := make(chan bool)
    for index := range sites {
        go getSiteVersion(index, done)
    }
    // Wait for as many completion signals as there are sites.
    for i := 0; i < len(sites); i++ {
        <-done
    }
    enc := json.NewEncoder(os.Stdout)
    enc.Encode(sites)
}

getSiteVersion() gets called in a loop; it initialises the Parser, does the scrape, and keeps simple time of how long the scrape takes. We don't use a mutex when accessing the sites slice here, because each element is only ever accessed by one goroutine, but if that were not the case a mutex would be essential.
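
For illustration, here's a sketch of how getSiteVersion() might look if a mutex were needed, assuming "sync" is added to the imports (the mu variable is hypothetical and not part of the program above):

var mu sync.Mutex

func getSiteVersion(key int, done chan<- bool) {
    start := time.Now()
    site := Parser{}
    site.setURL(sites[key].URL)
    version := site.getVersion()

    // Serialise writes to the shared slice. Redundant in the program above,
    // since each goroutine writes to a distinct element, but essential if
    // goroutines could touch the same element or resize the slice.
    mu.Lock()
    sites[key].Version = version
    sites[key].TimeTaken = time.Since(start).String()
    mu.Unlock()

    done <- true
}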

In main(), we're simply using a channel called done to make sure all the goroutines have finished executing before we quit the application, and as a final step we output the sites slice as JSON. A sync.WaitGroup could be used as an alternative.
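
A minimal sketch of what the WaitGroup version might look like, again assuming "sync" is added to the imports (an inline closure replaces getSiteVersion here):

func main() {
    var wg sync.WaitGroup
    for index := range sites {
        wg.Add(1)
        // Pass the index as a parameter so each goroutine gets its own copy
        // (necessary before Go 1.22, where loop variables were shared).
        go func(key int) {
            defer wg.Done() // signal completion instead of sending on a channel
            start := time.Now()
            site := Parser{}
            site.setURL(sites[key].URL)
            sites[key].Version = site.getVersion()
            sites[key].TimeTaken = time.Since(start).String()
        }(index)
    }
    wg.Wait() // block until every goroutine has called Done

    enc := json.NewEncoder(os.Stdout)
    enc.Encode(sites)
}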

This is sadly the only piece of Go code I have written for work, but it does its job admirably, and I would highly recommend Go for tasks like these.

The full code can be found on GitHub.