Creativity is a complex, multi-faceted concept encompassing a variety of related aspects, abilities, properties and behaviours. If we wish to study creativity scientifically, then a tractable and well-articulated model of creativity is required. Such a model would be of great value to researchers investigating the nature of creativity and in particular, those concerned with the evaluation of creative practice. This paper describes a unique approach to developing a suitable model of how creative behaviour emerges that is based on the words people use to describe the concept. Using techniques from the field of statistical natural language processing, we identify a collection of fourteen key components of creativity through an analysis of a corpus of academic papers on the topic. Words are identified which appear significantly often in connection with discussions of the concept. Using a measure of lexical similarity to help cluster these words, a number of distinct themes emerge, which collectively contribute to a comprehensive and multi-perspective model of creativity. The components provide an ontology of creativity: a set of building blocks which can be used to model creative practice in a variety of domains. The components have been employed in two case studies to evaluate the creativity of computational systems and have proven useful in articulating achievements of this work and directions for further research.

Funding: The author(s) received no specific funding for this work. Anna Jordanous undertook part of this work during her PhD, which was part-funded by a stipend provided by the School of Informatics, University of Sussex.

Data Availability: 90 academic publications dated 1950-2009 are analysed as part of this work. All of these articles were accessed via Scopus searches, through academic publishers. A full list of these publications is given in Jordanous’s thesis and the creativity corpus publications are listed in this article, in Fig 1 . All data produced during analysis from the texts of these publications are available via Open Science Framework ( https://osf.io/nqr76/ ). In particular, this includes the lexical data for both corpora, with frequencies, the similarity data scores that we produced during analysis, and the 694 ‘creativity words’. Data from the British National Corpus (BNC) was used during analysis. The BNC data is available from http://www.natcorp.ox.ac.uk/ . The results data generated during analysis (the 694 key words for creativity and the 14 key components of creativity) are openly available online in the form of an ontology (also submitted as a Supporting Information file), published under the URL http://purl.org/creativity/ontology . As also stated in the paper, these data are made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/ . These data are also available in the PhD thesis of Anna Jordanous (2012), which is openly available via the University of Sussex library ( http://sro.sussex.ac.uk/44741/ ) or via the University of Kent’s Academic Repository ( https://kar.kent.ac.uk/42388/1/Jordanous%252C_Anna_Katerina.pdf ). The creativity Semantic Web ontology links to data from the Wordnet lexical database ( http://wordnet-rdf.princeton.edu/ ), via the openly available data published at http://wordnet.rkbexplorer.com/ .

Copyright: © 2016 Jordanous, Keller. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The need for a clearer, multi-perspective understanding of creativity is evident, but remains to be addressed. There is a large quantity of material contributory to a satisfactory model of creativity and a number of key contributions have been discussed during this section. What must be done now is to marshal this assortment of material and to unify different perspectives where possible, in order to avoid the disciplinary ‘blinkers’ or compartmentalisation that is often seen in creativity research [ 11 ]. In approaching the semantic representation of subjective and multi-faceted concepts, some useful guidance is offered through philosophical reflections on the meaning of such concepts.

The key principle emerging across these present discussions is that the meaning of words like creativity can be modelled by identifying different aspects that collectively contribute to the meaning of the concept of creativity.

Linguistics research advocates that the meaning of a word is dependent on the context it is used in [ 46 ]. In particular, Lakoff has argued that the study of language helps reveal how people think [ 13 , 47 ]. Words used frequently in discussions of the nature of a concept provide the context for the commonly understood meaning of that concept, as has been shown in various corpus linguistics contributions [ 48 – 51 ].

Wittgenstein [ 14 ] has argued that ‘a clear view of the aim and functioning of the words’ helps us ‘dispers[e] the fog’ that obscures a clear vision of the ‘working of language’ [ 14 ] (Part 1, Paragraph 5). To understand the use of a word, one must have background information and context. Wittgenstein gives the example of a chess piece, which is introduced to someone as a ‘king’ (Paragraph 31): to understand this usage the person must already know the rules of chess, or must at least know what it means to have a piece in a game. To Wittgenstein, the semantics of words and statements are determined by how we use them, grounded in rules set by our habitual use of a word and our shared consensual practices, rather than being fixed by static, pre-assigned meanings.

Similarly, with creativity, different manifestations of creativity are not all necessarily required to share the same common, core elements in order to be identified as part of the creativity ‘family’. Rather, relationships between different manifestations reveal various shared characteristics that emerge in a similar way to Wittgenstein’s ‘family resemblances’ in language. We need to identify what those family resemblances are in the case of creativity. To understand creativity, we can investigate what resemblances exist across different instantiations of the concept.

[On discussing the example of what a ‘game’ is] ‘we see a complicated network of similarities overlapping and criss-crossing: sometimes overall similarities, sometimes similarities of detail. … I can think of no better expression to characterize these similarities than “family resemblances”; for the various resemblances between members of a family: build, features, colour of eyes, gait, temperament, etc. etc. overlap and criss-cross in the same way. And I shall say: “games” form a family.’

Creativity can be seen as an essentially contested concept [ 45 ]: it is subjective, abstract and can be interpreted in a variety of acceptable ways, such that a fixed ‘proper general use’ is elusive [ 45 ] (p.167). Gallie [ 45 ] defines an essentially contested concept through several features: being internally complex in nature, but amenable to being broken down into identifiable constituent elements of varying relative importance, and dependent on a number of factors such as context and individual preference. Although there may be consensus on the meaning of such concepts in very general terms, they may defy precise interpretation. There is not a single agreed instantiation, but instead many reasonable possibilities, influenced by changing circumstances and contexts. It is more productive to acknowledge that these different interpretations exist and refer to ‘the respective contributions of its various parts or features’ [ 45 ] (p.172), rather than to argue for a single interpretation. Thus, different types of creativity manifest themselves in different ways while sharing certain characteristics (not necessarily the same across all creative instances). This is what Wittgenstein refers to as ‘family resemblances’ [ 14 ]:

This framework presents creativity in a broader context, making our understanding of the concept more generally applicable and less specific to a domain or academic discipline. In contrast, models of the creative process [ 34 , 35 , 41 ], tests of people’s creativity [ 21 , 42 , 43 ] or tests based on creative artefact generation [ 25 , 44 ] are useful only within a limited sphere. Jordanous [ 40 ] has contextualised the Four Ps in a computational context, referring to the creative Producer (person or computational agent) carrying out Processes within the environmental context of a Press, to create computational Products.

The preceding discussion indicates that creativity is a complex, multi-faceted concept that requires a broad and inclusive treatment. The Four Ps framework [ 7 , 18 , 38 – 40 ] ensures we pay attention to four key aspects of creativity:

Similarly, researchers distinguish between little-c and Big-C creativity, or psychological/P-creativity and historical/H-creativity[ 19 ], adjusting their focus accordingly to make their research more manageable. This is particularly the case in computational creativity, where endowing the computer with elements of general, human knowledge and experience is a major challenge. Little-c creative or p-creative work is perceived as creative by the creator personally but may replicate existing work (unknown to the creator) so is not necessarily creative in a wider social context. This encompasses the concept of Big-C creativity or h-creativity, where the work makes a creative contribution both to the creator and to society. To be Big-C creative/h-creative is to be little-c creative/p-creative in a way which has not been done before by anyone.

Other conflicts arise where a previously narrow view of creativity has been widened in perspective. To resolve the conflict, an inclusive, all-encompassing view of creativity should adopt the wider perspective and incorporate the narrower perspective. For example rather than focussing narrowly on creative genius, through the study of people with exceptional creative achievements (see [ 34 , 35 ]) emphasis has shifted to encompass the broader study of everyday creativity, with genius as a special case: the notion that everyone can be creative to some degree [ 36 , 37 ].

Several competing interpretations of creativity exist in the literature. Sometimes these differences of opinion do not need to be directly resolved but can be included alongside each other. Examples include whether creativity is centred around mental processes [ 19 , 27 , 28 ] or embodied and situated in an interactive environment [ 29 , 30 ]. Another example is whether creativity is domain-independent [ 31 ], or dependent on domain-specific context [ 32 ], or (as both Plucker and Baer have concluded) a combination of both [ 12 , 33 ].

The problem of identifying and quantifying creativity exists across many disciplines. How creative is this person? Does this person have the creative abilities to boost my business? Is this pupil’s story creative? Is this computational system an example of computational creativity? As a consequence, when attempts are made to define creativity, it is often from the perspective of a particular domain or research discipline. For example, psychometric tests for creativity such as [ 20 , 21 ] focus on problem solving and divergent thinking as key attributes of a creative person. In contrast, computational creativity research (for examples see [ 22 – 25 ]) has historically placed emphasis on the novelty and value of creative products. Whilst there is some consensus across academic fields, for example novelty and value are typically recognised as necessary (but arguably not sufficient) components of creativity [ 26 ], the differing emphases contribute to variations in the interpretation of creativity. These variations affect consistency across creativity research in different disciplines and potentially hinder interdisciplinary collaborations and cross-application of findings.

These more research-oriented definitions avoid the problems of self-reference and circularity noted for the dictionary entries given previously. However, whilst the definitions may provide somewhat deeper insight into the nature of creativity, the brevity of the definitions means that they still only succeed in providing shallow, summary accounts of the concept.

‘The word creativity is a noun naming the phenomenon in which a person communicates a new concept (which is the product). Mental activity (or mental process) is implicit in this definition, and of course no one could conceive of a person living or operating in a vacuum, so the term press is also implicit’

‘Creativity is the interaction among aptitude, process, and environment by which an individual or group produces a perceptible product that is both novel and useful as defined within a social context’

‘creativity is that process which results in a novel work that is accepted as tenable or useful or satisfying by a group at some point in time’

Given the problems inherent in dictionary definitions of creativity, it is not surprising that a number of creativity researchers have set out to provide their own definitions of the concept. Some examples are:

To find out the meaning of a word, a natural first step might be to consult a dictionary. Dictionary definitions of creativity provide a brief introduction to the meaning of the word. However, for the purposes of research, the utility of such definitions is severely restricted by their format and brevity, and they generally provide only cursory, shallow insights into the nature of creativity. More problematic still, dictionary entries are often self-referential or circular, defining creativity in terms of “being creative” or “creative ability”. To illustrate these limitations, there follow several typical dictionary definitions of creativity and the related words ‘creative’ and ‘create’. For readability, some definitions are edited slightly to standardise formats and remove etymological/grammatical annotations:

In short, we need to specify and justify the standards that we use to judge creativity. A more objective and well-articulated account of how creativity is manifested enables researchers to make a worthwhile contribution [ 8 – 10 ]. Particularly, in research we would like to focus on what processes and concepts relevant to creativity are ‘sufficiently important to warrant study’ [ 17 ] (p.15), based on an accumulation of the body of work on creativity to date [ 17 ].

Creativity can and should be studied and measured scientifically, but the lack of a commonly-agreed understanding causes problems for measurement [ 10 ]. Plucker et al. make recommendations about best practice based on their own survey of the creativity literature:

Two answers to this question are offered by Hennessey and Amabile, both of which are identified as desirable: to gain a deeper understanding of creativity and to learn how to boost people’s creativity.

‘Even if this mysterious phenomenon can be isolated, quantified, and dissected, why bother? Wouldn’t it make more sense to revel in the mystery and wonder of it all?’

‘[c]reativity defies precise definition … even if we had a precise conception of creativity, I am certain we would have difficulty putting it into words’

In the rest of this section we begin by noting a variety of attempts to define creativity. The representation of subjective, ambiguous, loosely structured concepts is considered. In the remaining sections, details are provided of the methodology used to identify components of creativity from an analysis of language data. The results of this analysis are then presented in terms of a model that encompasses fourteen key components. The derived set of components is evaluated in terms of how well it satisfies the need for a shared, inclusive and comprehensive account of creativity and provides a vocabulary of creativity that is accessible to both people and machines. Finally, conclusions are drawn and some directions for further work are outlined.

On our approach, statistical language processing techniques are used to identify words significantly associated with creativity in a corpus of academic papers on the subject. A corpus spanning some 60 years of research into the nature of creativity was collected together. The papers were gathered from a wide variety of disciplines including psychology, educational testing and computational creativity, amongst others. The language data drawn from this collection was then analysed and contrasted with data from a corpus of matched papers on subjects unrelated to creativity. From this analysis, a set of 694 creativity words was identified, where each creativity word appeared significantly more often than expected in the corpus of creativity papers. A measure of lexical similarity provided a basis for clustering the creativity words into groups of words with similar or shared aspects of meaning. Through inspection of these clusters, a total of fourteen key components of creativity was identified, where each represents a key theme or attribute of creativity. The set of components yields information about the nature of creativity, based on what is collectively emphasised in discussions about the concept.

The aim of the work reported in this paper is to examine the nature of creativity and to identify within it a set of components, representing key dimensions, that are recognised across a combination of different viewpoints. We present a novel, empirical approach to the problem of modelling how creative behaviour is manifested, that focuses on what is revealed about our understanding of creativity and its attributes by the words we use to discuss and debate the nature of the concept. Analysis of this language provides a sound basis for constructing a sufficiently detailed and comprehensive model of creativity [ 13 , 14 ]. The current work is intended as a significant, methodological contribution towards addressing the Grand Challenge of evaluation in computational creativity research. It should provide researchers with a firm foundation for evaluating exactly how creative so-called creative systems actually are.

There are many challenges to modelling a concept like creativity in a computational setting. Conceptually, creativity seems inherently fuzzy or vague, with a meaning that shifts depending on the domain of application. Tackling these challenges affords two key advantages, both of which motivate the current paper. First, we can take advantage of computing and artificial intelligence to perform or enhance creative activities using computational power and research expertise. Secondly, the act of modelling creativity requires us to more carefully identify what informs our intuitive notions about creativity and this can guide us towards a more rigorous and comprehensive understanding of the concept.

Creativity is a complex, multi-faceted concept encompassing a variety of related aspects, abilities, properties and behaviours. There have been many attempts to capture this concept in words; indeed the work described in this paper is based on thirty such attempts (see the Methods section and the papers listed in S1 Appendix ). In the academic literature on creativity, many common themes have emerged. However, multiple viewpoints exist, prioritising different aspects of the concept according to what are traditionally considered to be the primary factors for a particular discipline. The need for a more over-arching, inclusive, multi-dimensional account of creativity has been widely recognised [ 7 – 10 ]. Such a meta-level account would assist our understanding of creativity, highlighting areas of common ground and avoiding the pitfalls of disciplinary bias [ 11 , 12 ].

The evaluation of creative systems developed by researchers in the field of computational creativity has proven non-trivial. Creativity evaluation, a recurring topic for discussion, has been described as a ‘Grand Challenge’ for computational creativity research [ 6 ]. Difficulties are inherently linked to a question that both motivates and complicates the computational modelling of creativity: what do we mean when we talk about ‘creativity’ and what does it constitute?

Computational creativity research follows both theoretical and practical directions and crosses several disciplinary boundaries between the arts, sciences, and engineering. Research within the field is influenced by artificial intelligence, computer science, psychology and specific creative domains that have received attention from computational creativity researchers to date, such as art, music, reasoning and narrative/story telling (for examples, see [ 2 – 5 ]).

What is creativity, and how can we better understand and learn about creativity using computational modelling? Computational creativity is a relatively youthful research area that has been growing with significant pace in recent years. Computational creativity is:

Methods

Our approach makes use of an empirical study and analysis of the language used to talk about creativity in order to gather and collate knowledge about the concept. In addition, following from the observations above, a confluence approach to creativity is adopted [16, 26, 52]. This works on the principle that creativity results from several components converging and goes on to examine what these components are. Taking this approach in conjunction with the application of tools from computational linguistics and statistical analysis allows a wider disciplinary spectrum of perspectives on creativity to be captured than has previously been attempted. This is achieved by breaking down the whole into smaller and more tractable constituent parts identified through a broad cross-disciplinary examination of creativity research.

Tools from natural language processing and statistical analysis are used to identify words that appear to be highly associated with dimensions of creativity, as represented in a sample of academic papers on the topic. A key innovation is the use of a statistical measure of lexical similarity, which allows the words to be clustered into coherent and semantically-related groups. Clustering reveals a number of common themes or factors of creativity, allowing the identification of a set of fourteen components that serve as building blocks for creativity.
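This section does not name the statistical test used to decide that a word is highly associated with creativity. As a rough illustration only, the sketch below assumes a log-likelihood ratio (G²) comparison of a word's frequency in two corpora, a measure commonly used in corpus linguistics. The function names, the example counts and the 3.84 threshold (the 5% critical value for one degree of freedom) are assumptions made for this sketch, not details taken from the study.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """G2 statistic comparing one word's frequency in two corpora.

    freq_* are occurrence counts of the word; size_* are total word
    counts of each corpus. Illustrative only (assumed statistic).
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected counts if the word were equally likely in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def is_key_word(freq_target, size_target, freq_ref, size_ref, threshold=3.84):
    """True if the word is relatively more frequent in the target corpus
    and the difference exceeds the (assumed) significance threshold."""
    overused = freq_target / size_target > freq_ref / size_ref
    return overused and log_likelihood(
        freq_target, size_target, freq_ref, size_ref) > threshold


# Hypothetical counts: a word seen 100 times in a 300,000-word corpus
# and 10 times in a 700,000-word comparison corpus.
print(is_key_word(100, 300_000, 10, 700_000))
```

A word that occurs at the same relative rate in both corpora scores zero, so only words whose usage differs between the corpora stand out.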

Corpus data

A sample of academic papers discussing the nature of creativity was assembled as a creativity corpus in 2010. This creativity corpus consisted of 30 papers examining creativity from various academic standpoints ranging from psychological studies to computational models.

Creativity corpus: a collection of thirty academic papers which explicitly discuss the nature of creativity.

The 30 papers selected for the creativity corpus are listed in S1 Appendix. The strategy used to select papers for this corpus is illustrated in a flow diagram, in Fig 1.


Fig 1. A flow diagram describing the search strategy used to identify papers for the creativity corpus. https://doi.org/10.1371/journal.pone.0162959.g001

The search strategy for identifying papers for the corpus involved a literature search for the term ‘creativity’ on the academic database Scopus to identify suitable papers. This literature search was supplemented with additional influential papers which may not have appeared in a Scopus search. For example, a Computer Science conference paper on cognitive models of creativity has been included, as in Computer Science, a number of conferences carry as much or more publication weight as some journals in the field. The eligibility of each identified article was verified for inclusion in the corpus via careful manual inspection. Paper selection for the creativity corpus was governed by inclusion criteria based on measuring the influence of a paper and coverage of a wide range of years and academic disciplines. The inclusion criteria are as follows, listed in descending order of precedence:

Papers must have, as their primary focus, discussion of the nature of creativity.

Papers should be considered particularly influential. Influence was generally measured objectively, in terms of the number of times a paper had been cited by other academic authors. However, for papers published in recent years and which had consequently had little time to accrue citations, selection was based instead on a subjective judgement of influence grounded in a knowledge of the field.

Papers selected should, as far as reasonable, represent a cross-section of years over the range 1950-2009. [The corpus was compiled in 2010.] 1950 was chosen as a starting point in recognition of the effect of J. P. Guilford’s presidential address to the American Psychological Association [20], which examined contemporary creativity research (or more specifically, the lack thereof). His talk was highly influential in encouraging more creativity research activity [10].

Papers selected should, as far as reasonable, represent a cross-section of disciplines relevant to discussions of creativity. Fig 2 illustrates the disciplinary distribution of the corpus as it changes over the time period covered by the selected papers. This distribution is based on the Scopus database, which classifies journals under the main subject area(s) they cover. We should acknowledge here, though, that while many disciplines include creative practice, often the focus is on application rather than in-depth discussion of what creativity entails. Hence, while we sought to cover creativity from a broad range of perspectives, we also felt it was important not to compromise the focus of our corpus as a representation of key discussions about the nature of creativity.


Fig 2. Representation of the disciplinary breakdown of the Creativity Corpus by time period. Disciplines are as specified for the paper’s journal, by the academic database Scopus. Note that Scopus may classify a journal under more than one discipline. https://doi.org/10.1371/journal.pone.0162959.g002

Exclusion criteria for this search were as follows:

Authors were only represented more than once in the corpus if the relevant papers were written from different perspectives. For example, Mark Runco’s work is represented twice in the corpus, but covering two different topics relating to the nature of creativity (psychoeconomic approach to creativity; cognition and creativity). If the search process highlighted two or more papers with a shared author on the same or highly similar perspectives on creativity, then the more highly cited paper was chosen.

Papers had to be written in English, as the language processing tools we were working with were for English language texts.

Papers had to be available in a format that enabled us easily to extract plain text (this excluded books or book chapters).

The creativity corpus is relatively small and necessarily selective in terms of the papers that are included. As such it constitutes just a small fraction of the many academic works on creativity that have been published in the last 60 or so years. Indeed, the 30 papers in the creativity corpus cannot be regarded as comprehensively representative of the wide range of academic positions on creativity that have been discussed in the literature over the decades. However, the goal of this work is not to present a fine-grained analysis of language use drawn from this complete literature, nor to provide a comprehensive lexicon or dictionary of creativity. Rather, the goal is to identify the broader ontological themes or factors that recur in our understanding of the concept of creativity. For this purpose, what is required is a sufficiently representative sample of the academic discourse on creativity. This sample can be used to identify the way in which word use reflects key themes or factors that persist across different perspectives.

Our objective is to identify what is distinctive in the language used to discuss creativity, in contrast to the language used to discuss other topics. As a basis for comparison, therefore, a further sample of 60 academic papers on topics unrelated to creativity (the non-creativity corpus) was assembled alongside the creativity corpus, in 2010.

Non-creativity corpus: a collection of sixty academic papers on topics unrelated to creativity, from the same range of academic disciplines and publication years as the creativity corpus papers.

The non-creativity corpus papers were selected by a literature search retrieving, for each paper in the creativity corpus, the two most-cited papers in the same academic discipline (as categorised by Scopus) and published in the same year, that did not contain any words with the prefix creat (i.e. creativity, creative, creation, and so on). In other words, a paper was included in this second corpus if it was one of the two most highly cited papers at the time of the search (2010) that shared the academic discipline and publication year of a paper in the creativity corpus and that contained no words with the above-mentioned prefix. The 60 papers selected for the non-creativity corpus are listed in S2 Appendix. The search strategy used to select papers for this non-creativity corpus is illustrated in a second flow diagram, in Fig 3.
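The matching rule described above can be sketched in a few lines of code. This is a minimal illustration, assuming each candidate paper is available as a record with a discipline, year, citation count and full text. The `Paper` record, the `select_matched_papers` function and the sample data are hypothetical and are not part of the tooling used in the study, where the search was carried out through Scopus.

```python
import re
from dataclasses import dataclass


@dataclass
class Paper:  # hypothetical record; field names are assumptions
    title: str
    discipline: str
    year: int
    citations: int
    text: str


# Matches any word that starts with "creat" (creativity, creative, creation...).
CREAT_PREFIX = re.compile(r"\bcreat", re.IGNORECASE)


def select_matched_papers(target, candidates, per_paper=2):
    """Pick the most-cited candidates that share the target paper's
    discipline and year and contain no word beginning with 'creat'."""
    eligible = [
        p for p in candidates
        if p.discipline == target.discipline
        and p.year == target.year
        and not CREAT_PREFIX.search(p.text)
    ]
    eligible.sort(key=lambda p: p.citations, reverse=True)
    return eligible[:per_paper]


# Made-up example data for illustration.
target = Paper("On creativity", "Psychology", 1990, 100, "creativity is ...")
candidates = [
    Paper("A", "Psychology", 1990, 50, "memory and recall"),
    Paper("B", "Psychology", 1990, 500, "a creative act"),   # excluded: 'creat'
    Paper("C", "Psychology", 1990, 20, "attention"),
    Paper("D", "Psychology", 1990, 80, "perception"),
    Paper("E", "Medicine", 1990, 900, "surgery"),            # wrong discipline
]
print([p.title for p in select_matched_papers(target, candidates)])
```

Note that a plain prefix match also excludes unrelated words such as "creature", which is consistent with the rule as stated (no words with the prefix creat at all).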


Fig 3. A flow diagram describing the search strategy used to identify papers for the non-creativity corpus. https://doi.org/10.1371/journal.pone.0162959.g003

The non-creativity corpus is twice the size of the creativity corpus (≈ 700,000 words and ≈ 300,000 words respectively), in acknowledgement of the fact that in general the set of academic papers on creativity is only a small subset of all academic papers. Both corpora are very small in comparison to corpora such as the British National Corpus, a relatively large (≈ 100M words) corpus of written and spoken English in general usage across a number of different contexts, and tiny in comparison to more recent web-derived text collections containing billions of words. There are, however, several benefits associated with using a corpus derived from specialist academic literature:

Ease of locating relevant and appropriate papers: e.g. availability of tools to perform targeted literature searches, electronic publication of papers for download, tagging of paper content by keywords, citations in papers to other related papers.

Ability to access timestamped textual materials over a range of decades.

Publication of academic papers in an appropriate format for computational analysis: most papers that are available electronically are in formats such as PDF or HTML, which can be converted to text fairly easily.

Availability of citation data as a measure of how influential a paper is on others: whilst not a perfect reflection of a paper’s influence, citation data is often used for measuring the impact of a journal [53] or an individual researcher’s output [54].

Availability of provenance data, such as who wrote the paper and for what audience (from the disciplinary classification of the journal).

Some pre-processing was undertaken for each paper in both the creativity corpus and non-creativity corpus prior to analysis. A plain text file was generated for each paper, containing the full text of that paper. All journal headers and copyright notices were removed from each paper, as were the author names and affiliations, list of references and acknowledgements. All files were also checked for any non-ASCII characters and anomalies that may have arisen during the creation of the text file.
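The check for non-ASCII characters mentioned above could look something like the following. This is only a sketch of one way to locate such characters for manual inspection. The function name `find_non_ascii` is hypothetical, and the source does not say how the check was actually performed.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for every non-ASCII
    character in the text, using 1-based positions."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # outside the 7-bit ASCII range
                hits.append((line_no, col, ch))
    return hits


# Hypothetical extracted text containing one accented character.
sample = "plain text\nna\u00efve idea"
for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r}")
```

Reporting positions rather than silently stripping characters leaves the decision about each anomaly (for example a ligature or a curly quote left over from PDF conversion) to the person preparing the corpus.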

Natural language processing

The corpus data was first pre-processed using the RASP natural language processing toolkit [55] in order to perform lemmatisation and part-of-speech tagging. Lemmatisation permits inflectional variants of a given word to be identified with a common ‘dictionary headword’ form or ‘lemma’. For example, performs, performed and performing all occur in the creativity corpus as distinct morphological variants of the verb perform. Intuitively, we would like to count each of these inflectional variants as an instance of the same word, rather than as separate and distinct lexical tokens. Lemmatisation software enables us to do this by mapping such variants to a canonical lemma form. As a further refinement, each lemma was also mapped to lower case to ensure that capitalised word forms (e.g. Novel) were not counted separately from their non-capitalised forms (novel). While this has the potential for occasional confusion between proper names and common nouns (e.g. Apple v. apple), it is not considered that the resulting level of ‘noise’ in the data is likely to adversely affect the results of the analysis.

Each word was assigned a part-of-speech tag identifying its grammatical category (i.e. whether the word was a noun, verb, preposition, etc.). Such tagging is useful because it allows us to distinguish between different grammatical uses of a common orthographic form. For example, the use of novel as a noun in a good novel can be properly differentiated from its use as an adjective in a novel idea.

The data was further simplified and filtered so that only words of the four ‘major’ categories (i.e. noun, verb, adjective and adverb) were represented. Note that the major categories bear the semantic content of the papers making up the creativity corpus. They may be distinguished from minor categories or ‘function words’, such as pronouns (e.g. something, itself), prepositions (e.g. upon, by), conjunctions (e.g. but, or) and quantifiers (e.g. many, more). Because such words have little independent semantic content, they are of limited interest for the present study and may be removed from the data.

Following processing with RASP, a list of words found in the creativity corpus, together with their frequency counts, was generated. The non-creativity corpus was pre-processed in the same way and a corresponding list of words and frequencies was also generated.
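The counting step described above can be sketched as follows. This is an illustration only: it assumes the tagger output has already been reduced to (lemma, tag) pairs, which is not RASP's actual output format, and the single-letter tags (N, V, J, R for the major categories, P for a preposition) and the function name `count_lemmas` are assumptions made for the example.

```python
from collections import Counter

# Assumed coarse tags for the four major categories:
# N = noun, V = verb, J = adjective, R = adverb.
MAJOR_POS = {"N", "V", "J", "R"}


def count_lemmas(tagged_tokens):
    """Count (lemma, tag) pairs, lower-casing lemmas and dropping function words."""
    counts = Counter()
    for lemma, pos in tagged_tokens:
        if pos in MAJOR_POS:
            counts[(lemma.lower(), pos)] += 1
    return counts


# Invented input: lemmatised tokens paired with a coarse part-of-speech tag.
tagged = [
    ("Novel", "J"), ("idea", "N"), ("perform", "V"), ("perform", "V"),
    ("by", "P"), ("novel", "N"), ("novel", "J"),
]
frequencies = count_lemmas(tagged)
```

Keying the counts on the pair rather than on the lemma alone keeps novel as a noun distinct from novel as an adjective, while lower-casing merges Novel with novel.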

Identifying words associated with creativity

The word frequency data derived from the two corpora was used to establish which words occur significantly more often in the creativity corpus than in the non-creativity corpus. This in turn can be regarded as providing evidence of which words are salient to the concept of creativity. Salient words were identified using the log-likelihood ratio (also referred to as the G² or G-squared statistic), which is a measure of how well observed data fit a model or expected distribution [48–50, 56]. It provides an alternative to Pearson’s chi-squared (χ²) test and has been advocated as the more appropriate measure of the two for corpus analysis as it does not rely on the (unjustifiable) assumption of normality in word distribution [48, 50, 56]. This is a particular issue when analysing smaller corpora, such as those used in the present work. The log-likelihood ratio statistic is more accurate in its treatment of infrequent words in the data, which often hold useful information. By contrast, the χ² statistic tends to under-emphasise such outliers in favour of very frequently occurring data points.

Our use of the log-likelihood ratio follows that of Rayson and Garside [49]. Given two corpora (in our case, the creativity corpus cc and the non-creativity corpus nc), the log-likelihood score for a given word is calculated as shown in Eq (1) below:

$$\mathit{LLR} = 2\left(O_{cc}\ln\frac{O_{cc}}{E_{cc}} + O_{nc}\ln\frac{O_{nc}}{E_{nc}}\right) \tag{1}$$

where $O_{cc}$ ($O_{nc}$) is the observed frequency of the word in cc (nc) and similarly $E_{cc}$ ($E_{nc}$) is its expected frequency. The expected frequency $E_{cc}$ is given by:

$$E_{cc} = N_{cc}\,\frac{O_{cc} + O_{nc}}{N_{cc} + N_{nc}} \tag{2}$$

where $N_{cc}$ denotes the total number of words in corpus cc (i.e. the sum of the frequencies of all words drawn from corpus cc) and $N_{nc}$ the corresponding total for corpus nc. The expected frequency $E_{nc}$ is defined in a way analogous to Eq (2). As computed above, the log-likelihood ratio measures the extent to which the distribution of a given word deviates from what might be expected if its distribution is not corpus dependent.
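As an illustration of the calculation, the sketch below assumes the usual Rayson and Garside form of the statistic: the score is twice the sum, over the two corpora, of the observed frequency multiplied by the natural logarithm of observed over expected, with each expected frequency equal to the corpus size multiplied by the pooled relative frequency of the word. The function names and the example counts are hypothetical and are not drawn from the study's data.

```python
import math


def expected_frequencies(o_cc, o_nc, n_cc, n_nc):
    """Expected counts if the word were distributed independently of corpus."""
    pooled_rate = (o_cc + o_nc) / (n_cc + n_nc)
    return n_cc * pooled_rate, n_nc * pooled_rate


def log_likelihood_ratio(o_cc, o_nc, n_cc, n_nc):
    """Log-likelihood ratio for one word across two corpora."""
    e_cc, e_nc = expected_frequencies(o_cc, o_nc, n_cc, n_nc)
    score = 0.0
    for observed, expected in ((o_cc, e_cc), (o_nc, e_nc)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score


# Invented counts: a word seen 30 times in a 300,000-word corpus
# and 10 times in a 700,000-word corpus.
score = log_likelihood_ratio(30, 10, 300_000, 700_000)
```

With these invented counts the expected frequencies are 12 and 28, so the word is over-represented in the first corpus and the score is well above zero. A word whose counts are exactly proportional to the corpus sizes scores zero.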
The higher the log-likelihood ratio score for a given word, the greater the deviation from what is expected. It should be noted, however, that the statistic tells us only that the observed distribution of a word in the two corpora is unexpected (and to what extent). It does not tell us whether the word is more or less frequent than expected in the creativity corpus. To identify words significantly associated with creativity, therefore, it was necessary to select just those words with observed counts higher than expected in the creativity corpus. It should perhaps be further noted that the resulting words may be either positively or negatively connoted with respect to creativity. In practice this is not a problem, as the significance of a given word lies in its semantic connection to creativity, not in its sentiment or affect. Affect is taken into account as part of the later manual examination of the data used to identify components of creativity.

The results of the calculations were filtered to remove any words with a log-likelihood score less than 10.83, the critical chi-squared value for significance at p = 0.001 (one degree of freedom). In this way, the filtering process reduced the set of candidate words to just those that appear to occur significantly more often than expected in the creativity corpus. To avoid extremely infrequent words disproportionately affecting the data, any word occurring fewer than five times was also removed from the data. Finally, the words were inspected to remove any ‘spurious’ items such as proper nouns or misclassified or odd character sequences. This resulted in a total of 694 creativity words: a collection of 389 nouns, 205 adjectives, 72 verbs and 28 adverbs that occurred significantly more often than expected in the creativity corpus. Table 1 gives the top 20 results of these calculations.

Table 1. The top 20 results (in descending order) of the log-likelihood ratio (LLR) calculations. A significant LLR score at p = 0.001 is 10.83. N.B. POS = Part Of Speech: N = noun, J = adjective, V = verb, R = adverb. https://doi.org/10.1371/journal.pone.0162959.t001
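The automatic filtering steps that precede Table 1 (significance threshold, minimum frequency, and over-representation in the creativity corpus) might be sketched as below. The record type `ScoredWord`, the function name and the sample values are hypothetical, and the minimum-count test is assumed here to apply to the creativity-corpus count, which the text does not state explicitly. The final manual inspection for spurious items is not modelled.

```python
from dataclasses import dataclass

LLR_THRESHOLD = 10.83  # critical value at p = 0.001, one degree of freedom
MIN_COUNT = 5          # words occurring fewer than five times are dropped


@dataclass
class ScoredWord:
    word: str
    observed_cc: int    # observed count in the creativity corpus
    expected_cc: float  # expected count in the creativity corpus
    llr: float          # log-likelihood ratio score


def select_creativity_words(scored):
    """Keep significant, sufficiently frequent, over-represented words."""
    kept = [
        w for w in scored
        if w.llr >= LLR_THRESHOLD
        and w.observed_cc >= MIN_COUNT      # assumption: count taken from cc
        and w.observed_cc > w.expected_cc   # more frequent than expected
    ]
    return sorted(kept, key=lambda w: w.llr, reverse=True)


# Invented records, for illustration only.
candidates = [
    ScoredWord("word_a", 30, 12.0, 34.4),  # kept
    ScoredWord("word_b", 4, 1.0, 12.0),    # dropped: fewer than five occurrences
    ScoredWord("word_c", 10, 25.0, 15.0),  # dropped: under-represented
    ScoredWord("word_d", 8, 6.0, 2.1),     # dropped: not significant
    ScoredWord("word_e", 20, 8.0, 50.0),   # kept
]
selected = select_creativity_words(candidates)
```

Sorting by descending score mirrors the ordering of Table 1. The explicit comparison of observed and expected counts is what separates over-represented words from equally surprising under-represented ones.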