I was honored to speak about disinformation at AI for Good today.

First, my bottom line: disinformation defence is grounded in data science. It contains the familiar data science elements of people, process, data, algorithms and insights. In my talk, I went through each of these.

People: disinformation-tracking communities

Over at CogSecCollab, we’ve been working on distributed disinformation defence: how do we build teams across disciplines and geographical locations to reduce the incidence, spread and effects of large-scale disinformation, which is itself distributed across groups, locations, platforms, areas of interest and more?

We had two templates for this. We’d seen the Baltic Elves: volunteer groups in Lithuania, Latvia and neighbouring countries that team up with media and others to push back on Russian disinformation with facts and humour. And some of our members had designed the processes and technology behind crisismapping: the creation of distributed worldwide teams that find and analyse data and produce situation awareness products (e.g. maps, datasets) for responders during natural disasters. As crisismappers, we’d taken the physical cardboard box that CrisisCamp London put its tools in every week and turned it into an online “CrisisCamp in a Box” kit, with a smaller variant (“CrisisCamp in a Bag”) and a larger one (“CrisisCamp in a Container”, for country-sized responses), and used it to create new camps and groups all round the world. There was no reason a similar mix of good processes, tools, connections and mentoring couldn’t also work for disinformation.

We had plans for 2020: finish our work on disinformation countermeasures (organising our list of 200+ counters; extending our ideas about game theory and resource exhaustion); build out the toolset we’d started with the AMITT standards and their translations into STIX and MISP; expand the how-to notes we’d written for election officials into a playbook for the Elves; and continue collecting and making datasets available in machine-readable form.

And then Covid19 happened, and we found ourselves running or helping to run three different community disinformation deployments: Covid19Activation (collecting disinformation data from around the world), Covid19Disinformation (providing surge capacity for an existing disinformation team), and the CTI League’s Disinformation team (running a disinformation deployment inside a large-scale volunteer information security deployment).

Process: incident tracking

We stopped planning and writing about how to run community deployments, and started writing the playbooks as we ran them. Here’s the base playbook for one of the deployments:

Incident tracking instructions

It’s pretty basic, deliberately so: we expect volume, and people tracking incidents at speed need simple instructions. It’s also basic in the tools it uses (spreadsheets, Google folders, and two open-source tools), because we want as many teams as possible to be able to do the same thing, and to be capable of being linked together: each of the data tools has an API. It’s also going to change over time.

Data: Disinformation ‘layers’

Disinformation Pyramid

We talk about the disinformation pyramid. We typically see longer-term campaigns: something a disinformation creator is focussed on over months, like Covid19 or a specific election. Within those we see incidents, which have a relatively short duration and usually focus on one thing, like the Stafford Act. Within incidents are the narratives: the stories and memes shared between people, like “5g causes covid19”. And underneath those are the artifacts: the tangible objects that appear online, like messages, images, user accounts, groups and relationships. We typically see artifacts, and derive information about narratives and incidents from them.
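To make these layers concrete, here’s a minimal sketch of how they might be represented in code. The class and field names are illustrative only; they’re not an AMITT or STIX schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the disinformation pyramid's layers.
# Class and field names are hypothetical, not an AMITT or STIX schema.

@dataclass
class Artifact:
    """A tangible online object: a message, image, account, group..."""
    artifact_type: str   # e.g. "message", "image", "user-account"
    content: str         # the text, URL or identifier we observed

@dataclass
class Narrative:
    """A story or meme shared between people."""
    summary: str         # e.g. "5g causes covid19"
    artifacts: list[Artifact] = field(default_factory=list)

@dataclass
class Incident:
    """A short-duration event, usually focused on one thing."""
    name: str            # e.g. "Stafford Act rumour"
    narratives: list[Narrative] = field(default_factory=list)

@dataclass
class Campaign:
    """A longer-term focus, like Covid19 or a specific election."""
    name: str
    incidents: list[Incident] = field(default_factory=list)

# We typically see artifacts first, then derive the layers above them:
artifact = Artifact("message", "5G towers are spreading the virus!")
narrative = Narrative("5g causes covid19", [artifact])
incident = Incident("Covid19 5g rumours", [narrative])
```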

At CogSecCollab, most of our work is based around tracking and sharing information about individual incidents. There are several parts to this. Last year we worked on ways to decompose incidents into the tactics and techniques used in them, and on ways to counter each of those: that gave us the AMITT Framework:

AMITT Framework, in the ATT&CK Navigator tool

We’ve also been looking at ways to represent how narratives are connected to each other, how they form and ‘die’ (and sometimes come back from the ‘dead’), and how we might be able to make narrative labelling faster by showing only ‘current’ lists or auto-tagging artifacts with known narratives.

CMU IDEAS’ COVID19 narratives list, in mindmap format
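As a first sketch of the “auto-tagging artifacts with known narratives” idea above, here’s keyword matching against a current narrative list. The narratives and keywords are invented for illustration; real tagging would need something more robust.

```python
# Minimal auto-tagging sketch: match incoming artifacts against a
# "current" narrative list by keyword. Narratives/keywords are examples.
CURRENT_NARRATIVES = {
    "5g causes covid19": ["5g", "tower", "radiation"],
    "bioweapon origin": ["bioweapon", "lab", "engineered"],
}

def tag_artifact(text: str) -> list[str]:
    """Return the names of known narratives whose keywords appear in text."""
    text_lower = text.lower()
    return [
        name for name, keywords in CURRENT_NARRATIVES.items()
        if any(kw in text_lower for kw in keywords)
    ]

print(tag_artifact("New 5G tower went up and everyone got sick"))
# -> ['5g causes covid19']
```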

We deliberately designed our disinformation data standards so we could feed them into information security threat intelligence systems. These describe the objects involved in an attack: the who, what, how etc. One of the more common standards used is STIX, so we adapted it for disinformation by adding two new objects (incident and narrative) and making the AMITT framework available as a STIX model too.

STIX objects plus incident, narrative
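Here’s a hedged sketch of what those additions can look like, using the stix2 Python library’s custom-object support. The type names and property lists here are illustrative; check the published AMITT STIX model for the real definitions.

```python
from stix2 import CustomObject, properties

# Sketch: custom STIX objects for "incident" and "narrative".
# Type names and properties are illustrative, not the AMITT STIX model.

@CustomObject('x-amitt-incident', [
    ('name', properties.StringProperty(required=True)),
    ('description', properties.StringProperty()),
])
class Incident:
    pass

@CustomObject('x-amitt-narrative', [
    ('name', properties.StringProperty(required=True)),
    ('description', properties.StringProperty()),
])
class Narrative:
    pass

narrative = Narrative(name="5g causes covid19",
                      description="Claims linking 5G masts to Covid19")
print(narrative.serialize(pretty=True))
```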

Having better incident descriptions meant we could start writing playbooks for common threat situations, including our work on technique-level counters (still unfinished, but we’ll get to it…)

fake engagement playbook, including real counters (C_00223 etc)
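As a sketch of what a machine-readable playbook fragment might look like: a lookup from techniques observed in an incident to candidate counters. The IDs follow the AMITT-style T…/C_… numbering, but the specific pairings here are invented for illustration.

```python
# Sketch of a technique -> counters lookup for playbooks.
# The ID pairings below are illustrative, not the published AMITT list.
COUNTERS_BY_TECHNIQUE = {
    "T0018": ["C_00223", "C_00097"],  # hypothetical pairing
    "T0046": ["C_00122"],             # hypothetical pairing
}

def suggest_counters(observed_techniques: list[str]) -> set[str]:
    """Collect candidate counters for every technique seen in an incident."""
    counters = set()
    for technique in observed_techniques:
        counters.update(COUNTERS_BY_TECHNIQUE.get(technique, []))
    return counters

print(suggest_counters(["T0018", "T0046"]))
```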

The AI/ML part: Algorithms

I’ve stressed that data science is a large part of disinformation response, but I was speaking at an AI event this morning. First, I believe strongly that all data science should be need-led and question-led: talk to the subject matter experts, watch what they already do, work out whether you can usefully help them do that at higher volume or speed, or across more formats, and listen to what they say they need (sometimes the answer really is simpler than you think, e.g. a piece of paper, or a spreadsheet, or a decision tree).

But here’s a short laundry list of areas I’ve been looking at, and things I’ve been thinking we need.

Text analysis:

finding themes — narratives are an incredibly useful way to group artifacts, as is being able to watch them form and die (hello, Sankey topic diagrams). It would also help with finding similar (but not identical) narratives, and narrative mashups like Covid5g

classifying artifacts onto narratives — this would save us time tagging

clustering text — this would help with finding new groups (which aren’t always the same as narratives)

searching for similar text. We’ve been relatively lucky in Covid19, in that we’ve seen a lot of repeated text. Simple use of text generators, editing or obfuscation would make those repetitions much harder to find without help (a minimal clustering sketch follows the figure below).

word groups in EuVsDisinfo clusters (thanks Gabe!)
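As an example of the kind of help I mean, here’s a minimal sketch using scikit-learn: cluster artifact texts by TF-IDF similarity so that repeated or lightly edited messages land in the same group. The sample texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Minimal clustering sketch: group artifact texts by TF-IDF similarity,
# so repeated or lightly edited messages land together. Texts are invented.
texts = [
    "5G towers are spreading the virus",
    "the virus is spread by 5G towers!!",
    "drink hot water to cure covid",
    "hot water cures the coronavirus",
]

vectors = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for text, label in zip(texts, labels):
    print(label, text)
```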

Graph analysis:

Disinformation tracking and response has much in common with epidemiology (and many early papers on disinformation analysis were based on epidemiology before it was fashionable). Things that network/graph analysis algorithms could help with include:

Finding super-spreaders: the individuals, groups etc responsible for accelerating a rumour’s spread online (a minimal sketch follows the figure below)

Finding rumour origins. We do a lot of this: carefully tracking artifacts like hashtags, language quirks, images and connections through time to find a rumour’s patient zero. A lot of this could be better automated

Finding new artifacts. Often, the most useful artifacts aren’t obvious until we run network analysis or view them in network diagrams. Some of this work is manual and repetitive, and could be sped up.

Tracking movement over time. All disinformation takes place over time — it exists, it spreads, it’s countered. It’s still difficult to quantify that, and there are temporal methods that could help.

Gephi graph of accounts connected to ‘vaccination’
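Here’s a minimal super-spreader sketch using networkx, on an invented amplification graph: rank accounts by how often their content gets amplified. Real analysis would use richer centrality measures (PageRank, k-cores etc) and real share data.

```python
import networkx as nx

# Sketch: find candidate super-spreaders in a share/retweet graph.
# An edge (a, b) means account a amplified content from account b.
# Accounts and edges are invented for illustration.
g = nx.DiGraph()
g.add_edges_from([
    ("alice", "spreader"), ("bob", "spreader"), ("carol", "spreader"),
    ("dave", "spreader"), ("carol", "erin"), ("frank", "erin"),
])

# Accounts whose content is amplified most often (high in-degree), ranked.
ranked = sorted(g.in_degree(), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])  # -> [('spreader', 4), ('erin', 2), ...]
```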

Image, video, audio analysis

No, I’m not going to say “deepfake detection”. That’s important: deepfake algorithms have been used to generate botnet profile pictures and text, but of more immediate concern are:

Searching for similar images — online search algorithms aren’t always tuned to the types of image search that we need, e.g. returning images with similar colours rather than similar content

Shallowfake detection. Shallowfakes are slightly modified images and videos — the slowed-down “slurred speech” Nancy Pelosi video below is a classic. Variants of otherwise-genuine images are cheaper, more effective and more widespread than their sexier cousins, deepfakes (a perceptual-hashing sketch follows the image below).

Nancy Pelosi shallowfake (https://www.cnn.com/2019/05/23/politics/doctored-video-pelosi/index.html)
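For the similar-images and image-variant side of this, perceptual hashing already gets you surprisingly far. Here’s a minimal sketch using the third-party imagehash library, with placeholder file paths; video shallowfakes need frame-level work on top of this.

```python
from PIL import Image
import imagehash

# Sketch: perceptual hashing to find near-duplicate or lightly modified
# images. File paths are placeholders.
original = imagehash.phash(Image.open("original.jpg"))
candidate = imagehash.phash(Image.open("candidate.jpg"))

# Small Hamming distance -> probably the same image, possibly re-encoded,
# resized or slightly edited. The threshold of 10 is a rough rule of thumb.
distance = original - candidate
if distance <= 10:
    print(f"likely variant (distance {distance})")
```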

Insights: Graph Relationships

Ultimately, data science exists to give people insight and help them make better decisions.

To do this, we needed a richer description of the objects involved in a disinformation incident, so we also added AMITT to the MISP open-source threat intelligence tool (every MISP now comes with AMITT as standard), and added the Atlantic Council DFRLab’s Dichotomies of Disinformation codebook. Using MISP objects for artifacts like blog, microblog (a Twitter or Facebook post), person, user-account etc allowed us to share and link complex information about incidents in graphical ways that users can click and traverse:

MISP event graph for the disinformation incident Secondary Infektion, showing who posted what etc.
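For anyone wanting to build this kind of event themselves, here’s a minimal PyMISP sketch describing an artifact as a MISP “microblog” object inside an event. The attribute names follow the misp-objects templates, but check the template versions on your own MISP instance; the post text and URL are invented.

```python
from pymisp import MISPEvent, MISPObject

# Sketch: describe a disinformation artifact as a MISP event containing
# a "microblog" object. Values are invented for illustration.
event = MISPEvent()
event.info = "Example disinformation incident"

post = MISPObject("microblog")
post.add_attribute("post", value="5G towers are spreading the virus")
post.add_attribute("url", value="https://example.com/status/123")
event.add_object(post)

print(event.to_json(indent=2))
```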

People, Process, Technology, Data, Algorithms, Insights

People, process, technology, data, algorithms, insights: I didn’t talk much about technology above, but it’s threaded through the discussion, supporting the other parts.

Bottom line: ultimately, this is about people. Disinformation creators have tapped into a distributed online world and used the people and tools within it to their advantage. In my own humble opinion, disinformation defence needs to take a similar approach.