It's a special occasion in Lingua Franca today, given the launch of a new Australian resource. Here's Maria Zijlstra with the goods.

Maria Zijlstra: Yep! Due to a collaborative effort by linguists and information technologists, a great big inventory of language in Australia called the AusNC or, more properly, the Australian National Corpus, has now been launched, and we wuz there recording the event so that you can be part of it as well, since the AusNC really belongs to all of us.

So without further ado, here's the linguist Kate Burridge launching the AusNC at a Scholarly Information Futures session at Griffith University in Brisbane:

Kate Burridge: I remember—very, very vividly—the meeting on 4 July in 2008. It was the Australian Linguistics Society conference and we were all huddled together in this overcrowded room to hear Cliff Goddard and Michael Haugh talk about this exciting possibility of a national corpus of language. So the idea of a collection of language samples, much like the American National Corpus or the British National Corpus, except of course it was—the idea at first anyway was—Australian English.

But it wasn't long before it was quite clear that what was being hatched at that meeting were plans for something a lot more ambitious. Clearly the Australian National Corpus, or AusNC as it's now fondly called, had to be multimodal; not just simply text-based collections like these other two northern hemisphere corpora, but high quality audiovisual recordings as well. So that enabled us to capture spoken language, written language and, of course, electronically mediated communication.

But then there was the thinking too that, just look at the variation out there in language, we had to look at all sorts of different groups: ethnic groups or different regional groups, age-related groups, social groups. So, suddenly, AusNC started to look very, very big! And then by the end of the meeting we were all agreeing that, of course, it had to encompass language much more broadly. So we had to have Australian Indigenous languages, Australian sign language, and the community languages of migrants, so Greek or Italian or Vietnamese, for example.

So it was looking, you know, AusNC was looking like a colossal task ahead. And I remember wondering whether this feeling in the room, this clear passion, this excitement, must have been something like what was at the meeting, this famous November meeting of PhilSoc—the Philological Society—in 1857 when Trench and Furnivall and Coleridge, they got up and they expressed their dissatisfaction with existing dictionaries. And the cry was out for something, I suppose an inventory of language, something that would chart the life story of words. And it was a whopping task, particularly given the time that we're talking about, in the 1800s.

I mean, obviously, they are very, very different beasts, the Oxford English Dictionary—the OED—and the AusNC. You'll be encountering a lot of acronyms and initialisms! But, as initiatives, they did share two ideals; one was the historic orientation, and the other the democratic approach. The cry, as I said, in the 1800s was for this inventory of language that would chart the career span of words; so, all words: old ones, archaic ones, obsolete ones, new ones, respectable ones, not so respectable ones, scruffy ones. Although I don't think all the scruffy ones made it, I'm not quite sure how much the OED in its early days was able to live up to this descriptive ideal!

And in the case of AusNC, indeed, it was historical right from the start; already we have a couple of collections that are of considerable historical significance, and I'll mention those in a moment. But it was also the idea that, and I should emphasise this point that, the idea from the start was not that it should be purely archival. As Simon Musgrave described recently, the idea was also that AusNC would attract new collections, perhaps even commission collections that would fill perhaps gaps that were there. So the idea, again, of charting language in Australia across time.

But like the OED there were no hard and fast rules about what to include and what not to include. So it would be language in all its glory: written texts, all sorts of different kinds, videoed interaction, transcripts of speech perhaps with the audio files if they were available, and e-communication, e-speak, blogs and Tweets and emails, all that would be there.

And also that this would be a research resource that was for everybody, for its users, at no cost, so an online database/database (I'm hearing both pronunciations tonight so, I better use both so I don't get up the noses of half the audience here). But I should emphasise that it's not just [for] people who are interested in languages and linguistics. The AusNC is meant for historians, for sociologists, for social psychologists—as it says on its website—for people involved in cultural studies; anyone who is interested in Australian society and what makes it tick.

What I'm saying is what a small bunch of linguists and a small bunch of language technologists have achieved in the last three years, I think, is miraculous. And I'll just read you very brief descriptions (one-liners) of the nine corpora/corpuses. (That's another one, isn't it! If I say 'corpora' I'm going to get up the noses of half of you who think that's pretentious, and if I say 'corpuses' I'll get up the noses of the other half who think it's ill informed. So I'll use both.) They are already remarkably diverse, as you'll hear. So there's the Australian Corpus of English, ACE (nice acronym this one), which has a million words of published text, so non-fiction and fiction, from the 1980s. There's the Australian Radio Talkback, ART; this is talkback radio from 2004 to 2006. So, fabulous stuff, you know, from commercial radio stations and the ABC.

There's OzLit which gives people access to hundreds of examples of poetry and fiction and criticism from the 1790s to the 1930s. There is a very interesting collection called Braided Channels, which is about 70 hours of oral history interviews. There's one with my favourite acronym of all, this is COOEE, the Corpus of Oz Early English; this is material that was collected by Clemens Fritz. It has texts of all kinds, including speech-based texts, so journals and letters, written in Australia, New Zealand or Norfolk Island between 1788 and 1900. This is a really very exciting record of early Australian English.

There's the Griffith Corpus of Spoken English. You're going to have to work on this acronym, Michael, it's an unpronounceable one, and it's a collection of transcribed and annotated recordings of spoken interaction amongst Australian speakers of English, obviously. There's the ICE, International Corpus of English, which is another million-word corpus, of transcribed, spoken and written Australian English from 1992 to '95.

There's Mitchell and Delbridge, the MAD corpus. This is a wonderful database of recordings of Australian English spoken by 7,736 students at 330 schools across Australia, mostly collected in 1960. So, like COOEE, it's a corpus of real historical significance that's there. And then finally there's the Monash Corpus of Spoken English, which also doesn't have a terribly nice acronym, a collection of recordings and transcriptions of interviews, as I said, of schoolchildren.

I should just point out, since I've been talking about the OED, that it took something like 27 years, I think, before the first part of the dictionary appeared (and it was a 352-page volume, I think) of words between 'A' and 'ant'. Obviously we are looking at very different enterprises here, very different conditions, but I suppose what I want to emphasise is that there are Herculean efforts in both cases.

So, this is all very well and good, but what can you do with a corpus? Lots of things in fact and, as I emphasised earlier, this is a research resource/research resource (there are a lot of linguistic pitfalls in tonight's talk, where you put the stress on 'research') for everybody, but because I'm a linguist, let me just mention things that are obviously linguistic. And I'll mention three things that I've been involved in. One is this idea of tracking or charting the life span of words. It was the dream of Richard Trench originally and this, of course, becomes so much easier in a corpus. Now I get hundreds, literally hundreds and hundreds, and I have them in my inbox (and this is using 'literally' in the literal sense, not in the exaggerated sense); these are questions about word etymology and meanings of words. And what I'm asked about all the time and have been recently, which is what made me think of it, is this peculiar Australian expression, to 'behave like a pork chop', which appeared in Australian English roughly in the 1950s as a much longer expression, to be 'like a pork chop in a synagogue'. Obviously a pork chop in a synagogue is something that is embarrassingly out of place, it is inappropriate. But the problem is that Australian English speakers have shortened the expression to 'behaving like a pork chop', and once you start doing that, of course, words will just go their own merry way. And what is happening now, I think, is it's shifting from inappropriateness to the idea of foolishness. But this is something you could actually look at in a corpus. You could test the currency of expressions; who is using it and when, what does it exactly mean at the time? These are wonderful things you can do with this.

There is also the fact that corpora/corpuses are fabulous things for counting. You might remember, in 2006 I think it was, when Tourism Australia launched that famous campaign 'Where the bloody hell are you?', where you had ordinary Australians set against beautiful extraordinary scenes, and then there was that very Australian message at the bottom, 'Where the bloody hell are you?' Well, it managed to get censored in North America, it was banned I think in the UK, because of the word 'bloody'.

So my colleague Keith Allan and I, a little bit later, decided we'd see what the corpus evidence said about 'bloody'. So we looked at the ART, the talkback corpus, and we found that 'bloody' appeared, I think it was, six times every 10,000 sentences. And then we looked in the ICE corpus, at the conversational data in ICE, and we found that it was 20 times per 10,000 sentences. And we compared New Zealand, just to see what the Kiwis were doing, and it was only—in a comparable corpus—seven times per 10,000 sentences. So it really is the great Australian adjective, although it is not an adjective, I hasten to add (and I could bore you with the tests for adjectivehood tonight, but I won't do that). And then, just for interest, we decided to look at the London corpus to see how 'bloody' featured there and there it was 27 times per 10,000 sentences. So we in the Antipodes lag very much behind Londoners in the use of 'bloody'. So something very interesting was going on there in the banning of 'bloody'.
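[Editorial illustration: the "times per 10,000 sentences" figures above are normalised rates, which is what lets corpora of different sizes be compared. A minimal sketch of that arithmetic follows. The function name and the sample sentences are invented for illustration, and this is not a description of how Burridge and Allan actually ran their counts.]

```python
import re

def rate_per_10k_sentences(sentences, word):
    """Occurrences of `word` per 10,000 sentences.

    Matches whole words only, ignoring case. `sentences` is a list of
    strings, one sentence each (sentence splitting is assumed done).
    """
    if not sentences:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = sum(len(pattern.findall(s)) for s in sentences)
    return hits / len(sentences) * 10_000

# Invented toy sample, not drawn from any AusNC collection.
sample = [
    "Where the bloody hell are you?",
    "It was a lovely day.",
    "Bloody oath, that's bloody good.",
    "No worries.",
]

# 3 hits across 4 sentences, scaled to a base of 10,000 sentences.
print(rate_per_10k_sentences(sample, "bloody"))
```

[Scaling every count to the same base is the design choice that matters here: a raw count of 'bloody' would simply favour whichever corpus happened to be larger.]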

The third thing that I want to say is that, yes, the historical application; obviously the current corpus, particularly with the incorporation of COOEE, is going to shed some very interesting light on early Australian English, on the linguistic processes that were happening there in the dialect melting pot in those very early years. And Clemens Fritz, who has donated this corpus, has already done some wonderful things with it. But Simon Musgrave and I decided we'd look at it quite recently to see, particularly, the evidence for Irish English input in those early years. Now I won't, you know, this is not the venue to tell you what we found or in fact what we didn't find—that was perhaps more interesting—what didn't appear in there (so 'youse', for example, the plural 'you', or some of the other Irish-isms).

But, in some ways really, the work is just beginning. There are still lots of truly marvellous collections out there, a lot of them inaccessible. But I know that AusNC are very keen to encourage researchers to get their materials together. It takes time and, of course, it takes money. So we're on the lookout for a mining magnate who has a passion for language and a passion for corpuses or corpora, whatever, I don't mind! But at some stage we really do need to toast AusNC and its creators because this is after all a birthday party for the corpus, and it needs to be celebrated.

Maria Zijlstra: The Chair of Linguistics at Monash University, Professor Kate Burridge, launching the Australian National Corpus at Griffith University in Brisbane last week. And, wherever you are in the world, why don't you have a squiz at the AusNC yourself, using the link on the program page for this Lingua Franca episode, via RN's internet address: abc.net.au/radionational.