Final Cut X’s Missed Meta Opportunities

How to reclaim the high ground

When Final Cut Pro X was released in 2011, there was one thing it got very, very right: metadata should be the focus for the future of editing.

FCP X 1.0 shipped with some very forward-thinking features: people detection and shot detection. Realizing that FCP could sort footage on the editor’s behalf was a landmark moment for the Apple platform.

Unfortunately, while there have been many advancements to Final Cut Pro X in the time since, it has never kept up with the beauty and simplicity that proper metadata could offer the platform.

I’m Telling You

Final Cut has never properly supported the dialogue in video, and no editor has seized the opportunity it affords in scripted or unscripted projects. The key is speech-to-text technology, which Apple has been doing on-device since the 1990s. The integration of Siri on phones and then computers has refined these technologies; imagine how well they should work on professionally recorded audio from quality microphones.

The authors of QuickTime in the mid-1990s understood that text could be part of a multimedia stream, and the support was there in the QuickTime container format. Yet FCP X has kludged support for dialogue into inappropriate places in the project, like the “notes” column for clips, despite scripts and dialogue existing all through the ground-up rewrite that produced FCP X.

There needs to be an architectural change to Final Cut: it needs to understand text better. The kludge of solutions that brought us closed-caption support in the timeline was absolutely the wrong way to go, and squeezing transcriptions (generated by fingers or a computer) into the “Notes” field is poor practice.

Imagine instead a third track type alongside audio and video: timed text. The timed text track would move along with the clip, not like an attached lower third, but like the audio: connected tightly to the clip, yet able to be broken off when relevant. Obviously, the timed text would stay connected to the audio if you split them up or performed an L-cut. This timed text could be exposed as closed captions with a single click, but always available to help with searching, editing, and organization.
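The mechanics of such a track can be sketched in a few lines. This is a toy Python model of my own, not anything from FCP X’s actual data structures: a clip carries timed-text segments, and splitting the clip carries the right segments (re-timed) to each half, just as audio follows a cut.

```python
from dataclasses import dataclass, field

@dataclass
class TextSegment:
    start: float   # seconds, relative to the clip's start
    end: float
    text: str

@dataclass
class Clip:
    name: str
    duration: float
    timed_text: list = field(default_factory=list)

    def split(self, at: float):
        """Split the clip at `at`; timed text travels with the half it belongs to."""
        head = Clip(self.name + " (head)", at)
        tail = Clip(self.name + " (tail)", self.duration - at)
        for seg in self.timed_text:
            if seg.end <= at:
                head.timed_text.append(seg)
            elif seg.start >= at:
                # re-time the segment relative to the new clip's start
                tail.timed_text.append(TextSegment(seg.start - at, seg.end - at, seg.text))
            else:
                # segment straddles the cut: each half keeps its portion
                head.timed_text.append(TextSegment(seg.start, at, seg.text))
                tail.timed_text.append(TextSegment(0.0, seg.end - at, seg.text))
        return head, tail
```

Cutting a 10-second interview clip at 5 seconds leaves the first line of dialogue on the head and shifts the second line’s timing into the tail, with nothing lost.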

Timed text is vital for captioning and SDH, and can offer multiple-language support as well. It’s an essential part of a workflow for inclusion and legal compliance, and support should exist well beyond the oddball treatment it gets in EVERY edit platform, including FCP X.

Therefore, Final Cut Pro X should natively analyze imported audio tracks for speech-to-text, both on import and on command. It should offer the ability to write that timed text back to the video file’s container on a text track, for use in other editors, in other situations, or as closed captions. If AVFoundation no longer supports timed-text classes, that needs to be fixed in the very next release of the OS.

Incidentally, this would offer an opportunity for Final Cut to recapture market share. By replacing tools such as Speedscriber and Rev (and the dozens of others that try to address this need), editors would re-install FCP to generate transcripts embedded in the QuickTime file, and would hopefully consider it for the rest of the project.

A Final Cut Pro X that supported these files, with the captions embedded in the file, would be able to generate closed-captioned masters by default, instantly increasing the accessibility and compliance of broadcast and web videos. The same text could easily generate open captions, where the text is burned into the picture (as Apple does in its own “Clips” app).

Before you write: yes, I know Lumberjack exists, and I’ve used it. It’s not the answer; it’s a whole separate app with its own editing paradigm (designed for documentary work), and it should be happy to co-exist with an FCP that is smarter in every way.

Perfect for Scripted

In a scripted project, once the text inside a captured clip is understood, Final Cut X could recognize when there are multiple takes of one scene. If it looked for repeated runs of dialogue across several clips, it could then group those into takes, a marquee feature of FCP X. This is also a place where machine learning could help determine the reset point in each shot, but I’d be happy just to be presented with a stack of clips that share similar dialogue.
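To make the take-matching idea concrete, here is a minimal sketch, mine and not Apple’s, that compares clip transcripts by their overlapping three-word runs and greedily groups clips whose dialogue is similar enough to be takes of the same scene:

```python
def word_shingles(text, n=3):
    """Break a transcript into overlapping n-word runs ("shingles")."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b):
    """Jaccard similarity of the two transcripts' shingle sets (0.0 to 1.0)."""
    sa, sb = word_shingles(a), word_shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def group_takes(transcripts, threshold=0.5):
    """Greedy grouping: each clip joins the first group whose representative
    transcript overlaps enough, otherwise it starts a new group."""
    groups = []
    for name, text in transcripts.items():
        for group in groups:
            if similarity(text, transcripts[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

The threshold and shingle length are guesses; a shipping feature would tune them, but even this naive pass stacks two takes of the same speech together and keeps an unrelated clip apart.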

This could also organize multiple camera coverage, or situations where one scene was shot from multiple angles, automagically. All of this comes from adding voice-to-text to the app and making it smarter.

You could get a huge head start on a project if smarter metadata processing were added to FCP X.

Face Detection

Face detection is the next most obvious opportunity for metadata organization. Build a database of the faces in a project and let me search for all the clips containing a face. Let me assign a name to that face, then sort all the clips containing that interview subject or actor no matter the day we shot them, or even use that data to tag speakers in the transcribed audio or build lower thirds.

Face organization, new in Resolve 16, is off to a weak start but holds much promise

Blackmagic has experimented with facial recognition in version 16, but it has had a lukewarm start so far. Apple has a huge head start in the code for the Photos app (and iPhoto before it), which it has been shipping and refining for years.
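A project-level face database doesn’t need to be exotic. Here is a toy sketch, with the face-recognition step abstracted away as anonymous cluster IDs (the hard part Photos already does); the editor attaches names, and queries go by name:

```python
from collections import defaultdict

class FaceIndex:
    """Toy project-wide face database: detections arrive as anonymous
    face-cluster IDs, the editor assigns names, searches go by name."""

    def __init__(self):
        self.clips_by_face = defaultdict(set)  # face_id -> set of clip names
        self.names = {}                        # face_id -> assigned name

    def record_detection(self, clip, face_id):
        self.clips_by_face[face_id].add(clip)

    def assign_name(self, face_id, name):
        self.names[face_id] = name

    def clips_for(self, name):
        """All clips containing any face the editor has given this name."""
        return sorted(
            clip
            for face_id, clips in self.clips_by_face.items()
            if self.names.get(face_id) == name
            for clip in clips
        )
```

Tag a cluster as “Alice” once, and every clip she appears in, across every shoot day, answers a single query.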

Other Metadata Opportunities

Having more smarts inside the editor could help us keep those bins cleaner. Sometimes I accidentally import the same media more than once; Final Cut could alert me to that, and easily reject things already in the project. (It could run a checksum scan to make sure the files are different, but even just matching size, creation date, type, and name could save a lot of headaches.)

There’s another way this would be immensely helpful: a “watch folder,” so that anything I add to that folder is added to the project and tagged appropriately. I always have a folder for “graphics,” for instance, and I add assets to it as the project moves along. Then I drag the assets out of that folder into the project, which sometimes means hunting through the folder for the files I just added, and that is obnoxious. Even if I can’t set the application to watch a folder for me, if it rejected the assets it already had in the project, I could just re-import the whole folder and be sure that everything made it in.
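The duplicate check described above is cheap to sketch. In this hypothetical version, a project keeps a map of content checksums built at import time, and a (name, size) fingerprint could pre-filter candidates before paying to hash the whole file:

```python
import hashlib
import os

def fingerprint(path):
    """Cheap first pass: name + size catch most duplicates without reading the file."""
    st = os.stat(path)
    return (os.path.basename(path), st.st_size)

def checksum(path, chunk=1 << 20):
    """Definitive pass: SHA-256 of the file contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def find_new_files(folder, project):
    """Return files in `folder` whose contents aren't already in the project.
    `project` maps checksum -> clip name, built as media is imported."""
    fresh = []
    for entry in sorted(os.listdir(folder)):
        path = os.path.join(folder, entry)
        if os.path.isfile(path) and checksum(path) not in project:
            fresh.append(path)
    return fresh
```

With that in place, re-importing the whole graphics folder is safe: everything already in the project is skipped, and only the genuinely new assets come in.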

There are even metadata opportunities in file dates: files created shortly after one another might be takes of a similar shot or part of the same scene. If two files were recorded concurrently, they may be multiple camera angles, or an audio recording to be matched to a camera angle; timecode tracks inside the files should also be considered for this. Actively building those multicam and synced-audio clips would save a headache at the start of every project. This metadata could give us some of the automatic sorting described above without the computational expense of generating speech-to-text.
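The grouping logic here is a classic interval merge. A sketch, assuming we can read a recording start and end time for each clip (from file dates or timecode, however the importer gets them):

```python
def overlapping_groups(files):
    """files: list of (name, start, end) recording times in seconds.
    Clips whose time ranges overlap are multicam / dual-system candidates."""
    files = sorted(files, key=lambda f: f[1])  # order by start time
    groups, current, current_end = [], [], None
    for name, start, end in files:
        if current and start <= current_end:
            # overlaps the running group: same scene, another angle or the audio
            current.append(name)
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(current)
            current, current_end = [name], end
    if current:
        groups.append(current)
    return groups
```

Two cameras and a sound recorder rolling on the same scene fall into one group automatically, while a pickup shot hours later starts a group of its own.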

Reverse Speech Recognition

For scripted projects, imagine starting by feeding the edit software the script, so that as it processes the audio in the dialogue tracks it matches against the script. Simply parsing the 50-plus-year-old conventions of script formatting would allow matching words to character names, and would provide a baseline against which to check the speech-to-text output.
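Screenplay formatting is regular enough that even a naive parser gets most of the way there. A sketch of the convention, an all-caps character cue followed by its dialogue (action lines written in all caps would need more care than this):

```python
import re

# A cue line: all-caps name, optionally followed by an extension like (V.O.)
CUE = re.compile(r"^([A-Z][A-Z .']+?)\s*(?:\(.*\))?$")

def parse_script(text):
    """Map each line of dialogue to the character speaking it."""
    speaker, dialogue = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            speaker = None  # blank line ends the speech
            continue
        m = CUE.match(stripped)
        if m and stripped.isupper():
            speaker = m.group(1).strip()
        elif speaker:
            dialogue.append((speaker, stripped))
    return dialogue
```

Scene headings like “INT. CASTLE - NIGHT” fail the cue pattern (the hyphen is not allowed), so only true character cues pick up dialogue, and the result is exactly the character-to-words mapping a speech-to-text pass could be checked against.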

OCR

Looking for text inside video would make it more searchable. “Find the shot where the subject is standing in front of the stop sign” would be great for raw footage, and when re-editing finished pieces it could help recognize title cards in the edit. Again, these OCR libraries are decades old and require no special machine-learning magic; just scan, say, every 10th frame of the video.
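The sampling side of that is trivial; in this sketch the OCR call itself is a pluggable placeholder for whatever text-recognition library the host app links against (Vision on macOS, for instance):

```python
def sample_frames(frame_count, every=10):
    """Indices of frames to scan: every 10th frame is plenty for signs and cards."""
    return range(0, frame_count, every)

def index_onscreen_text(clip_name, frames, recognize, every=10):
    """`frames` is a sequence of decoded frames; `recognize` is any OCR
    callable (frame -> str), standing in for a real text-recognition library.
    Returns a search index: recognized text -> [(clip, frame index), ...]."""
    index = {}
    for i in sample_frames(len(frames), every):
        text = recognize(frames[i]).strip()
        if text:
            index.setdefault(text, []).append((clip_name, i))
    return index
```

Run over every clip at import, the resulting index is what makes “find the stop sign shot” a text query instead of a scrubbing session.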

Cull

The most basic level of metadata smarts could be to cull out things I probably don’t want in my project: extremely short clips; clips that are all black or all white with no audio; clips where the camera moves erratically for a majority of the shot. Mark those clips on import as candidates for deletion.
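Those heuristics reduce to a few thresholds over per-clip statistics the importer could compute while decoding. A sketch, with the stat names and thresholds being my own guesses:

```python
from dataclasses import dataclass

@dataclass
class ClipStats:
    name: str
    duration: float       # seconds
    mean_luma: float      # 0.0 (all black) .. 1.0 (all white), averaged over frames
    has_audio: bool
    motion_jitter: float  # 0.0 steady .. 1.0 chaotic, from frame-to-frame motion

def cull_candidates(clips, min_duration=1.0, jitter_limit=0.8):
    """Flag, never delete: return (clip, reason) pairs the editor can review."""
    flagged = []
    for c in clips:
        if c.duration < min_duration:
            flagged.append((c.name, "too short"))
        elif not c.has_audio and (c.mean_luma < 0.05 or c.mean_luma > 0.95):
            flagged.append((c.name, "blank (black/white, silent)"))
        elif c.motion_jitter > jitter_limit:
            flagged.append((c.name, "erratic camera"))
    return flagged
```

Note that the output is a flag plus a reason, not a deletion: the point is to surface the likely junk on import, and leave the decision with the editor.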

The Future of Video is Text

Adding any of this to an editor well will take engineering effort and time. And even adding ALL of it would not make Final Cut Pro X the world’s leading editor; there are several areas where the Final Cut team needs to catch up with other software. But adding robust support for text that is spoken and text that appears on screen, and understanding that text is part of every project, whether it’s a YouTube video or a $30M movie, would prove that Apple is serious about metadata, serious about editing, and serious about giving its editors the best head start on any project.

It should be noted that all of these suggested features could be added today, with on-device processing and no need for advanced AI or neural nets. Those things will surely enhance future generations of these features, but rolling them out sooner would show commitment to the product and to the promise of metadata.