We propose a new paradigm for information filtering based on brain activity associated with relevance. The brain-relevance paradigm is based on the following four hypotheses evaluated empirically in this paper:

H1: Brain activity associated with relevant words is different from brain activity associated with irrelevant words.

H2: Words can be inferred to be relevant or irrelevant based on the associated brain activity.

H3: Words inferred to be relevant are more informative for document retrieval than words inferred to be irrelevant.

H4: Relevant documents can be recommended based on the inferred relevant and informative words.

The following two sections motivate the paradigm from the perspectives of both cognitive neuroscience and information science and then outline the existing foundations of the brain-relevance paradigm.

Cognitive neuroscience motivation

Event-related potentials (ERPs) are obtained by time-locking electrical potentials recorded with electroencephalography (EEG) to the onset of sensory or motor events10. The last 50 years of psychophysiology have demonstrated beyond a reasonable doubt that ERPs have a neural origin, that cognitive processes can reliably elicit them, and that the measurement of their timing, scalp distribution (“topography”), and amplitude can provide invaluable information about brain function in healthy11 and pathological12 cases.

Mentally controlling interfaces through measured ERPs has, to date, principally relied on the P300. The P300 is a distinct, positive potential that occurs at least 300 ms after stimulus onset and is traditionally obtained via so-called oddball paradigms. Sutton et al.13 presented a fast series of simple stimuli with infrequently occurring deviants (e.g., 1 in 6 tones having a high pitch) and discovered that these rare “oddballs” would on average trigger a positivity compared to the standard stimuli. Experiments later showed that the degree to which the stimulus provided new information14 and was task-relevant15 amplified the P300, whereas repetitive, unattended16, or easily processed17 stimuli could remove the P300 entirely.

In the language domain, the onset of words normally evokes a negativity at ca. 400 ms, which has been attributed to semantic processing18. This N400 was first observed as a type of “semantic oddball”: the closing word in a sentence such as “I like my coffee with milk and torpedoes” is semantically improbable but amplifies the N400 rather than causing a P300. However, if a rare syntactic violation occurs in a sentence (“I likes my coffee [..]”), the deviant word once again evokes a positivity, but now at 600 ms19. As this P600 shows similarities to the P300 in polarity and topography, it started the ongoing debate20,21,22 as to whether it is a language-specific “syntactic positive shift” or a delayed P300. Finally, research on memory has identified a late positive component (LPC) at a latency similar to the P600. The LPC has been related to semantic priming and is particularly strong in tasks where an explicit judgement on whether a word is old or new is to be made23. Consequently, it is often associated with mnemonic operations such as recollection24. In the present context, relevant words could cue recollection of the user’s intent, thereby amplifying the LPC.

Although the P300/P600 and N400 are often described as contrasting effects, this is not necessarily the case in predicting term relevance. That is, if an odd, task-relevant stimulus yields a P300 or P600 and a semantically irrelevant stimulus yields an N400, it follows that the total amount of positivity between approximately 300 and 700 ms may indicate the overall semantic task relevance. This was indeed found by Kotchoubey and Lang25, who showed that semantically relevant oddballs (animal names) randomly intermixed among words from four other categories evoked a P300-like response for semantic relevance (but at ca. 600 ms). Likewise, our previous work on inferring term relevance from event-related potentials26 showed that a search category elicited either P300s/P600s in response to relevant words or N400s in response to semantically irrelevant terms.
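The idea of reading relevance from the total positivity in a late time window can be illustrated with a minimal sketch. It assumes a single-channel epoch stored as a plain list of amplitudes that starts at stimulus onset; the function name `window_mean_amplitude`, the sampling rate, and the toy amplitude values are hypothetical and chosen only for illustration. It is not the feature extraction used in this paper.

```python
from statistics import mean


def window_mean_amplitude(epoch, sfreq, t_start=0.3, t_end=0.7, epoch_tmin=0.0):
    """Mean amplitude of one single-channel epoch between t_start and t_end.

    epoch      -- list of amplitudes, one per sample (hypothetical toy data)
    sfreq      -- sampling frequency in Hz
    t_start    -- window start in seconds after stimulus onset
    t_end      -- window end in seconds after stimulus onset (exclusive)
    epoch_tmin -- time of the first sample relative to stimulus onset
    """
    first = round((t_start - epoch_tmin) * sfreq)
    last = round((t_end - epoch_tmin) * sfreq)
    if first < 0 or last > len(epoch) or first >= last:
        raise ValueError("window falls outside the epoch")
    return mean(epoch[first:last])


# Toy epochs sampled at 10 Hz (one sample every 100 ms, starting at onset).
relevant_epoch = [0, 0, 0, 2, 4, 4, 2, 0, 0, 0]        # late positivity
irrelevant_epoch = [0, 0, 0, -1, -3, -3, -1, 0, 0, 0]  # late negativity

relevant_score = window_mean_amplitude(relevant_epoch, sfreq=10)
irrelevant_score = window_mean_amplitude(irrelevant_epoch, sfreq=10)
print(relevant_score, irrelevant_score)
```

In this toy setting a larger window mean corresponds to more positivity between 300 and 700 ms; a real analysis would average over many trials and channels before drawing any conclusion.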

Information science motivation

Relevance estimation aims to quantify how well the retrieved information meets the user’s information needs. Computational methods estimate statistical relevance measures from word occurrences in a document collection. These measures are used in many information retrieval applications, such as Web search engines, recommender systems, and digital libraries. One of the best-known measures of this kind is tf-idf (term frequency-inverse document frequency), a weight that statistically measures how important a word is to a document in a collection or corpus. We use a logarithmically scaled tf-idf formally defined in SI Equation 3 to SI Equation 5. The importance of a word increases with the number of times it appears in the document but is offset by the frequency of the word in the corpus; low- and medium-frequency words have a higher discriminating power at the level of the document collection, particularly when they have high frequency in an individual document27. For example, the word “nucleus” has a low frequency at the collection level (i.e., across all documents in the system) but a higher frequency in a document about atoms (i.e., the “Atoms” document) and is therefore considered to discriminate this document better than, for example, the word “the,” which has a high frequency at both the collection and document levels.
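The behaviour described above can be sketched with a small example. The exact definition used in this paper is given in SI Equations 3 to 5 and is not reproduced here; the sketch below assumes one common logarithmic variant, tf = 1 + ln(count) and idf = ln(N / df), and the function name `log_tf_idf` and the three-document toy corpus are hypothetical.

```python
import math
from collections import Counter


def log_tf_idf(docs):
    """Compute a log-scaled tf-idf weight for every term in every document.

    docs -- dict mapping a document name to its list of tokens.
    Assumed variant (not necessarily the paper's SI definition):
        tf  = 1 + ln(count of term in document)
        idf = ln(number of documents / number of documents containing term)
    Returns a dict mapping document name -> {term: weight}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (1.0 + math.log(count)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical toy corpus.
corpus = {
    "Atoms": "the nucleus of the atom the nucleus".split(),
    "Cells": "the cell and the membrane".split(),
    "Rivers": "the river and the sea".split(),
}

weights = log_tf_idf(corpus)
print(weights["Atoms"]["nucleus"], weights["Atoms"]["the"])
```

Under this variant, “the” occurs in every toy document, so its idf and therefore its weight is zero, whereas “nucleus” occurs only in the “Atoms” document and receives a positive weight.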

Consequently, relevance feedback for different words has a different effect in predicting an individual’s search intention. For example, detecting that an individual finds the word “the” relevant is less valuable than detecting the relevance of the word “nucleus” because the latter has a higher discriminative power and can be more effectively used to predict the individual’s intention of finding documents, such as the “Atoms” document.

In summary, the brain-relevance paradigm requires the relevance of an individual word to be inferred from the individual’s brain signals. Word informativeness is determined by the search system using the tf-idf statistic. Words that are both relevant and informative are words that discriminate relevant documents from irrelevant documents and are needed to predict the individual’s search intention and, consequently, to recommend meaningful documents. In addition to the brain activity findings related to the semantic oddball (introduced in the cognitive neuroscience motivation), recent findings in quantifying brain activity associated with language also suggest a connection between the class and frequency of a word and the corresponding brain activity. It has been shown that brain activity is different for different word classes in language28 and that high-frequency words elicit different activity than low-frequency words29.
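How relevance and informativeness might jointly drive a recommendation can be sketched as follows. This is an illustrative assumption rather than the model evaluated in this paper: each document is scored by summing the tf-idf weight of its words multiplied by a per-word relevance probability. The function name `rank_documents`, the tf-idf weights, and the relevance probabilities (standing in for the output of a brain-signal classifier) are all hypothetical.

```python
def rank_documents(tfidf, relevance):
    """Rank documents by relevance-weighted informativeness.

    tfidf     -- dict mapping document name -> {term: tf-idf weight}
    relevance -- dict mapping term -> inferred relevance probability in [0, 1]
                 (hypothetical classifier output; unseen terms count as 0)
    Returns a list of (document name, score) pairs, best match first.
    """
    scores = {
        doc: sum(weight * relevance.get(term, 0.0) for term, weight in terms.items())
        for doc, terms in tfidf.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tf-idf weights: "the" is uninformative, so its weight is zero.
toy_tfidf = {
    "Atoms": {"nucleus": 1.9, "the": 0.0},
    "Rivers": {"river": 1.1, "the": 0.0},
}
# Hypothetical inferred relevance of each word for the current user.
toy_relevance = {"nucleus": 0.9, "the": 0.8, "river": 0.1}

ranking = rank_documents(toy_tfidf, toy_relevance)
print(ranking)
```

In this toy case the high inferred relevance of “the” contributes nothing because the word carries no tf-idf weight, while “nucleus” is both relevant and informative and pushes the “Atoms” document to the top.
<test>
assert ranking[0][0] == "Atoms"
assert ranking[1][0] == "Rivers"
assert abs(ranking[0][1] - 1.9 * 0.9) < 1e-12
assert abs(ranking[1][1] - 1.1 * 0.1) < 1e-12
assert rank_documents({"A": {"x": 2.0}}, {}) == [("A", 0.0)]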