A Primer on Automated Content Recognition
April 14th, 2014 by Kyle Brink
As human beings, we know what we’re listening to or watching (most of the time, anyway). Automated Content Recognition (ACR) is what lets an app, device, or other technology platform make that same determination. Once the content is known, it becomes possible to enhance the experience, verify content consumption, and provide additional information.
Audio-based ACR is now commonplace. The two leading methods of audio ACR (fingerprinting and watermarking) have advantages and disadvantages. Coming technologies will supplement or replace audio ACR, particularly in entertainment media with no definitive audio “track” to watermark or fingerprint, or in entertainment venues with highly variable audio environments.
COMMON USES OF ACR
ACR enables context-based enhancement, verification of content consumption, and identification of content.
Context-based enhancement
Once we know what content is being viewed or listened to, it becomes possible to enhance that content on a companion platform (such as a mobile device or website).
Television in particular has explored this opportunity. Synced experiences and show enhancements recapture some television viewers who would otherwise be “lost” to the second screen (the device in the viewer’s hand). For these to work, it is essential that the platform know not only what is being watched, but also where in the broadcast the viewer is at the moment.
This allows for seamless integration of the show being aired and the enhancements on the companion platform. Examples include “insider” information about the scene currently airing, play-along games that rely on the live broadcast, or even additional camera feeds during live shows.
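To make the “where in the broadcast” requirement concrete, here is a minimal Python sketch of how a companion app might pick the enhancement to show. Everything in it is an assumption made for illustration: the cue sheet, the function names, and the idea that the recognition step reports an offset in seconds for the moment the sample was taken.

```python
from bisect import bisect_right

# Hypothetical cue sheet: (seconds into the broadcast, enhancement to show).
CUES = [
    (0.0, "Welcome card"),
    (95.0, "Insider note about the opening scene"),
    (240.0, "Play-along question 1"),
]
CUE_TIMES = [start for start, _ in CUES]


def broadcast_position(matched_offset_s, sample_taken_at_s, now_s):
    """Advance the offset reported for the sample by the time elapsed since sampling."""
    return matched_offset_s + (now_s - sample_taken_at_s)


def active_cue(position_s):
    """Return the most recent cue at or before this position, or None before the first cue."""
    idx = bisect_right(CUE_TIMES, position_s) - 1
    return CUES[idx][1] if idx >= 0 else None


# The sample matched 90 s into the show and the answer arrived 6.5 s later.
position = broadcast_position(90.0, sample_taken_at_s=1000.0, now_s=1006.5)
print(position, active_cue(position))
```

The elapsed-time correction matters because recognition is not instantaneous: without it, the companion screen would trail the broadcast by however long the round trip took.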
Verification
Broadcast media such as radio and television depend on listeners and viewers, but they have no direct way to verify what a given audience member is listening to or watching. ACR provides a way to close that gap. The business benefits are obvious: precise audience metrics allow for accurate advertising pricing, smart content creation decisions, and measurable results from promotional efforts.
Identification of content
Sometimes, we just want to know what that show or song is. ACR is helpful to us as audience members, because it can tell us without our having to type – or know – anything about it. In many cases, an app’s ability to provide this one-touch recognition of a song or show is the reason we download the app in the first place. It’s useful and convenient.
AUDIO-BASED ACR
Today, most ACR relies on audio solutions. In fact, “ACR” is often used to mean “audio content recognition.” It makes sense. Audio is a media component that can be readily picked up by any device with a microphone, and it’s common to both radio and television – the broadcast media whose audiences and businesses need it most.
There are two main approaches to doing audio-based ACR: fingerprinting and watermarking.
Fingerprinting
Fingerprinting is a method of audio ACR that creates a compressed, unique “map” (or “fingerprint”) of the audio from a specific source. Samples of raw audio can then be compared against that fingerprint to see what source and time match the sample.
How it works
When we as audience members watch or listen to something new, we don’t yet know what it is. But we know what station we are tuned to, and we know what time it is, so we can look it up.
Fingerprinting works much the same way. The system matches a sample of audio against everything that is airing (or has aired recently), finds which station and time match that audio, and then looks up what program or song was airing then and there.
The fingerprinting itself doesn’t “know” what the content is; it simply provides a reliable means for looking up the broadcast data.
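The lookup step can be sketched in a few lines of Python. This is only an illustration: the station name, the schedule entries, and the function name are all made up, and a real system would query a live schedule service rather than an in-memory table.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical schedule: station -> list of (start time, title), sorted by start time.
SCHEDULE = {
    "KXYZ": [
        (datetime(2014, 4, 14, 20, 0), "Evening News"),
        (datetime(2014, 4, 14, 20, 30), "Quiz Hour"),
        (datetime(2014, 4, 14, 21, 30), "Late Movie"),
    ],
}


def lookup_program(station, aired_at):
    """Return the title airing on `station` at `aired_at`, or None if unknown."""
    entries = SCHEDULE.get(station, [])
    starts = [start for start, _ in entries]
    # Index of the last entry that started at or before the matched time.
    idx = bisect_right(starts, aired_at) - 1
    return entries[idx][1] if idx >= 0 else None


# The fingerprint match supplies only (station, time); the schedule supplies the title.
print(lookup_program("KXYZ", datetime(2014, 4, 14, 20, 45)))
```

Note that the answer is only as good as the table: if the schedule is stale or incomplete, a perfect audio match still produces the wrong title or none at all.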
Nerd break: What’s a fingerprint? Each audio signal has a unique pattern. Whether one’s algorithm uses amplitudes, or frequencies, or some combination, there is always a unique pattern that can be saved and matched against later. This pattern is a tiny fraction of the size of the original media, so it can be efficiently stored and searched.
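For the curious, here is a deliberately tiny Python sketch of the idea. It is not any vendor’s algorithm: it assumes a toy scheme in which each pair of adjacent audio frames is reduced to a single bit (did the energy rise or fall?), and it assumes the device’s clip happens to start exactly on a frame boundary, which real systems cannot rely on. The frame size, function names, and synthetic “broadcast” are all invented for the example.

```python
import math
import random

FRAME_SIZE = 256  # samples per frame; an arbitrary choice for this sketch


def fingerprint(samples, frame_size=FRAME_SIZE):
    """Return one bit per pair of adjacent frames: 1 if the energy rose, else 0."""
    energies = [
        sum(x * x for x in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [1 if later > earlier else 0
            for earlier, later in zip(energies, energies[1:])]


def best_offset(reference_bits, sample_bits):
    """Slide the sample along the reference; return (frame_offset, mismatches)."""
    best = (None, len(sample_bits) + 1)
    for offset in range(len(reference_bits) - len(sample_bits) + 1):
        window = reference_bits[offset:offset + len(sample_bits)]
        mismatches = sum(r != s for r, s in zip(window, sample_bits))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


# Synthetic "broadcast": a tone whose loudness changes randomly every frame.
rng = random.Random(0)
broadcast = []
for _ in range(200):
    loudness = rng.random()
    broadcast.extend(loudness * math.sin(0.3 * n) for n in range(FRAME_SIZE))

reference = fingerprint(broadcast)                    # what the server stores
clip = broadcast[50 * FRAME_SIZE:90 * FRAME_SIZE]     # what a device "heard"
offset, mismatches = best_offset(reference, fingerprint(clip))
print(offset, mismatches)
```

Here 51,200 samples shrink to 199 bits, which is the “tiny fraction” at work. The matched frame offset is what gives the time position within the source; counting mismatches instead of demanding an exact match is what lets a scheme like this tolerate some noise.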
Advantages
Fingerprinting does not require the source media to be modified in any way. This means that it can be used for any broadcast media with no change in the content creator’s or broadcaster’s workflow, and requires no rights to the media because no media is being stored or modified. This makes it possible to use it broadly across many channels and shows.
If the infrastructure is set up properly, fingerprinting works quite well for live broadcasts. By the time the broadcast reaches the audience, gets sampled by the end user’s mobile device, and the fingerprint sample gets sent back to the matching servers, many milliseconds have passed. That’s plenty of time to have the fingerprint ready for comparison.
Note: Streaming media can also be fingerprinted, though it requires that each media item in the library be “ingested” (the audio compressed into fingerprints).
Disadvantages
Matching the fingerprint to the media metadata is the key to fingerprint-based ACR. It is imperative that the system have comprehensive, up-to-the-second broadcast schedules and show information for the fingerprint process to match against.
Every channel or station to be matched must be ingested in order for matching to work. There are thousands of channels and stations, with more being added daily.
Highly variable audio environments, such as public venues, can obscure the unique audio signature of the broadcast media. A robust fingerprinting algorithm will still work in most cases, but the chance of audio interference exists.
Watermarking
The second chief audio solution for ACR in use today involves “watermarking” the audio component of a broadcast with a distinct identifier inaudible to the human ear. Think of it like adding a logo or an overlay to the bottom of your TV screen, but in a way that only your device could “see.”
How it works
When we watch or listen to something new, one way to identify it is from the “bumper” preceding or following the content. Think of this as a human-readable watermark: identifying content added to the broadcast so the audience knows what it is.
To watermark content for ACR, broadcasters or content creators add a specific identifying track to the audio of the content. This watermark then immediately declares to any equipped system what the content is.
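As a rough illustration, the Python sketch below uses a simple spread-spectrum style scheme, which is one common family of watermarking techniques and not necessarily what any given broadcaster uses. The embedder adds a faint pseudo-random sequence derived from a content ID, and the detector checks which candidate ID’s sequence correlates with the audio. The function names, IDs, strength, and threshold are all assumptions, and the detector assumes it receives the audio from its very first sample, undistorted.

```python
import math
import random


def pn_sequence(content_id, length):
    """Pseudo-random +1/-1 sequence that embedder and detector both derive from the ID."""
    rng = random.Random(content_id)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed(samples, content_id, strength=0.05):
    """Add a faint copy of the ID's sequence on top of the original audio."""
    pn = pn_sequence(content_id, len(samples))
    return [s + strength * p for s, p in zip(samples, pn)]


def detect(samples, candidate_ids, threshold=0.025):
    """Return the candidate ID whose sequence correlates best, or None if none clears the threshold."""
    best_id, best_score = None, threshold
    for cid in candidate_ids:
        pn = pn_sequence(cid, len(samples))
        score = sum(s * p for s, p in zip(samples, pn)) / len(samples)
        if score > best_score:
            best_id, best_score = cid, score
    return best_id


# Five seconds of a 440 Hz tone at 8 kHz stands in for the program audio.
host = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(40000)]
marked = embed(host, content_id=42)

print(detect(marked, candidate_ids=[7, 42, 99]))
print(detect(host, candidate_ids=[7, 42, 99]))
```

The detector only has to test a short list of known IDs, not search a library of everything that has aired. A production watermark would also shape the added signal so listeners cannot hear it and so it survives speakers, room noise, and microphones, none of which this toy attempts.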
Advantages
Watermarked content identification can be quite fast, as the set of possible matches is quite small (when compared to, say, everything that has aired on every channel in the last week).
Precision is usually good as well, because the content essentially self-identifies. There is no reliance on broadcast schedule data.
Disadvantages
Watermarking requires modification of the broadcast media, which in turn requires rights and access to that media. This doesn’t prevent it from being a viable solution for broadcasters, but it does not scale efficiently.
Ultimately, this means that watermarked ACR tends to be used in targeted applications, such as a companion piece for a specific show or network or brand campaign, but not for applications intended to work for universal media ACR.
BEYOND AUDIO
Audio is today’s solution of choice, but there are other options on the horizon. This is especially important when looking beyond just television and music.
Fundamentally, any broad ACR solution can follow the same structure as fingerprinting: take some form of compressed sample input, compare it against a schedule of what’s going on, and identify what the input relates to. So let’s take off the headphones and consider the non-audio options.
Video ACR
If one can ingest and fingerprint audio, then it stands to reason that something similar could be done for video. The same rules of signal pattern apply. The trick, which is being looked at by a number of people, is how to get that sample from a mobile device to match the fingerprint on the matching server. It’s a promising option, especially since it is immune to audio interference. Video ACR could be a killer app for the pub environment.
Location-Based ACR
What about content that isn’t broadcast? How can we perform ACR then? Concerts, sports events, and other live events call for a different kind of “fingerprint” to match. Using location data and local beacons, or even a combination of the two, has been commonplace in apps for years; marrying this information to a comprehensive schedule of live event locations unlocks the possibility of a “real world ACR.”
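A minimal Python sketch of that idea follows, assuming the app already has a latitude, longitude, and timestamp from the device. The event list, coordinates, radii, and function names are invented for illustration; a real service would draw on a maintained schedule of venues and events, and could add beacon signals as a further check.

```python
import math
from datetime import datetime

# Hypothetical schedule of live events: where, how wide an area, and when.
EVENTS = [
    {"name": "Arena concert", "lat": 40.0000, "lon": -75.0000, "radius_m": 300,
     "start": datetime(2014, 4, 14, 19, 0), "end": datetime(2014, 4, 14, 23, 0)},
    {"name": "Stadium match", "lat": 41.0000, "lon": -75.0000, "radius_m": 500,
     "start": datetime(2014, 4, 14, 13, 0), "end": datetime(2014, 4, 14, 16, 0)},
]


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (haversine formula, spherical Earth)."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def identify_event(lat, lon, when):
    """Return the name of an event under way at this place and time, or None."""
    for event in EVENTS:
        in_time = event["start"] <= when <= event["end"]
        in_range = distance_m(lat, lon, event["lat"], event["lon"]) <= event["radius_m"]
        if in_time and in_range:
            return event["name"]
    return None


# A device roughly 110 m north of the arena, during the concert.
print(identify_event(40.0010, -75.0000, datetime(2014, 4, 14, 20, 30)))
```

Structurally this is the same lookup as audio fingerprinting: the “sample” is a place and a time instead of a snippet of sound, and the schedule is what turns it into a named piece of content.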