The Guardian Datablog has joined forces with J. Nathan Matias of the MIT Media Lab and data scientist Lynn Cherny to collect what is, to our knowledge, the most comprehensive, high-resolution dataset available on news content by gender and audience interest.

Why gather this data?

There is evidence that women's byline counts are consistently lower than men's throughout the UK national press, and a similar situation has been observed in the US.

It's not clear whether this situation is related to audience demand or to behaviour in the newsroom. To give some idea of how newsrooms function, the Guardian's Jane Martinson makes the important observation that Dawn Neesom is now the only female editor of a national newspaper in the UK. Exploring the subtle relationships between supply and demand is a good place to start talking about how this inequality could change.

Over the coming weeks we will be exploring a unique database of every article published by the Guardian, Telegraph, and Daily Mail in the year from July 2011 to June 2012. This was an active and varied period for news, including the UK riots and the phone hacking coverage.

To give us a deeper understanding we have automatically tagged the articles with the date of publication, section, gender of author, and social popularity data from Twitter, Facebook and Google+.

How have we gathered this data?

In the past, to measure gender in the news, researchers have counted articles by hand. In one such study, Kira Cochrane and a group of researchers went through seven daily newspapers for almost a month, counting and recording the number of male and female writers.

To take the pain out of this process and to gather larger samples with richer background information, we have turned our attention to online news.

We used the Guardian OpenPlatform to download a complete copy of the Guardian online, and Matias scraped the archives of the Daily Mail and Telegraph.

To identify the gender of each author, we used the database of baby names collated by Anna Powell-Smith from Office for National Statistics data.

Using this database, Matias's software classifies each article by its byline as male, female, mixed, or unknown. Most articles of unknown gender come from the newswires, with a byline of "the associated press" or "press association." A much smaller number have ambiguous names or names that aren't in the UK birth statistics. "Unknown" also includes a very small number of articles with empty bylines.
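To make the classification step concrete, here is a minimal sketch in Python of how a byline might be sorted into the four categories. It is not Matias's software: the tiny name table, the list of organisation bylines, and the function names are all hypothetical, and the rule that a byline with any unrecognised name counts as "unknown" is our assumption.

```python
import re

# Hypothetical first-name lookup. The study used a far larger table
# built from Office for National Statistics baby-name data.
NAME_GENDER = {
    "jane": "female",
    "kira": "female",
    "nathan": "male",
    "cole": "male",
}

# Illustrative bylines that name an organisation rather than a person.
ORGANISATION_BYLINES = {
    "the associated press",
    "press association",
    "daily mail reporter",
}


def split_authors(byline):
    """Split a byline such as 'By A Smith and B Jones' into author names."""
    cleaned = re.sub(r"^\s*by\s+", "", byline.strip(), flags=re.IGNORECASE)
    parts = re.split(r"\s*(?:,|&|\band\b)\s*", cleaned, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def classify_byline(byline):
    """Return 'male', 'female', 'mixed' or 'unknown' for a byline string."""
    if not byline or byline.strip().lower() in ORGANISATION_BYLINES:
        return "unknown"
    genders = set()
    for author in split_authors(byline):
        first_name = author.split()[0].lower()
        genders.add(NAME_GENDER.get(first_name, "unknown"))
    # Assumption: any unrecognised name makes the whole byline "unknown".
    if not genders or "unknown" in genders:
        return "unknown"
    if len(genders) > 1:
        return "mixed"
    return genders.pop()
```

A call such as classify_byline("By Kira Cochrane and Nathan Matias") falls into the "mixed" category, while a newswire byline or an empty string falls into "unknown".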

A universally available measure of the popularity of an article is the number of "shares" on Facebook, Twitter and Google+. This is just one measure of the impact and influence of news, and by no means the only one. To get the shares we used the open source Amo software by Knight-Mozilla Fellow Cole Gillespie. Amo can fetch all of the Facebook, Twitter, and Google+ sharing information for any web address. Using this data, we can draw conclusions about the reach of women's voices and the nature of audience demand associated with each news organisation.
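Once each article carries a gender tag and its share counts, comparing reach is a matter of simple aggregation. The Python sketch below shows one way this could be done. The field names (gender, facebook, twitter, google_plus) are assumptions rather than the layout of our spreadsheet, and the numbers in the example are invented purely to show the calculation.

```python
from collections import defaultdict


def total_shares(article):
    """Sum the share counts from the three networks for one article."""
    return article["facebook"] + article["twitter"] + article["google_plus"]


def mean_shares_by_gender(articles):
    """Return the average total shares per article for each gender tag."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for article in articles:
        totals[article["gender"]] += total_shares(article)
        counts[article["gender"]] += 1
    return {gender: totals[gender] / counts[gender] for gender in totals}


# Invented sample rows, for illustration only.
sample = [
    {"gender": "female", "facebook": 10, "twitter": 5, "google_plus": 1},
    {"gender": "female", "facebook": 2, "twitter": 2, "google_plus": 0},
    {"gender": "male", "facebook": 4, "twitter": 3, "google_plus": 1},
]
averages = mean_shares_by_gender(sample)
```

Averages of this kind need careful reading, since a handful of very widely shared articles can pull the mean a long way from what is typical.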

An introduction to our data

In this, our first post, here is an overview of the number of articles published by each newspaper from July 2011 to June 2012:

You can see that the Guardian publishes more articles and that the gender of its authors can be identified more often. Daily Mail journalists frequently use "DAILY MAIL REPORTER" as a byline, which means we need to be careful when comparing the data.

Looking closer at just the opinion articles we can see the following:

Graphic: gender in the media

Opinion sections can shape a society's opinions and are therefore an important measure of women's voices in society. In his preliminary analysis, Matias notes:

We have found that women are more prominent in UK opinion pages than they are in American newspapers. According to Taryn Yaeger of the OpEd Project, women write 20% of op-eds in America's newspapers. Across the UK papers we studied, the rate is 26%.

Here's our data showing, for each newspaper, the gender balance of the writers and a first look at the social network sharing data. Let us know what you think in the comments.

• DATA: download the full spreadsheet
