A research and development program to create automated software capable of detecting specific behaviors in videos has nearly achieved its goal of detecting 75% of specified activities with a false alarm rate of just 2%.
The Deep Intermodal Video Analytics (DIVA) program, which is operated by the Intelligence Advanced Research Projects Activity (IARPA), “creates automatic activity detectors that can watch hours of video and highlight the few seconds when a person or vehicle is performing a specific activity,” explains the program’s website. Behaviors of interest include carrying heavy objects, loading those objects into a vehicle, and then driving. DIVA activity detectors work in single or multi-camera streaming video environments and can be used to enhance video forensic analysis and real-time alerting of threat scenarios, such as terrorist attacks and criminal activity, the website adds.
“The goal has been to cut through the increasingly overwhelming amounts of security-type video that exists – think CCTV-type video – and automatically scan through it and identify specific activities,” says Jack Cooper, IARPA’s DIVA program manager.
Artificial intelligence (AI) and machine learning (ML) technology could monitor security video at airports, border crossings, or government facilities where camera network operators are overwhelmed with the volume of real-time video to monitor, or it could be used for forensic purposes after incidents have occurred to identify relevant activities.
“Imagine an operator who has dozens of feeds that he is responsible for monitoring. Humans are very good at analyzing visual information, but it just becomes too much for one person to look at the number of streams that may be important to them,” Cooper suggests.
Analysts would define specific activities of interest, and the technology could either alert them in real time on live video or highlight those behaviors for review during a forensic investigation, “instead of having eyes on every screen on every video stream,” Cooper says, which would significantly reduce the amount of video for human analysts or operators to view.
The four-year program wraps up at the end of this month, and researchers are closing in on final detection and false alarm goals. “We’re currently at 70% detection at this false alarm rate, and we still have a few more bites in the apple to extract that last little bit,” Cooper reveals. “We have made a lot of progress during the program. I believe after phase one we were at about 25% detection, and after phase two about 50% detection, and now 70%.”
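The detection and false alarm figures quoted above can be made concrete with a small sketch. The function below is purely illustrative: the names, threshold, and scoring scheme are assumptions for exposition, not DIVA’s actual evaluation protocol, which is defined far more rigorously. It scores a detector’s activity proposals against the ground-truth activities present in a video.

```python
# Hypothetical sketch of detection-rate vs. false-alarm-rate scoring.
# All names and the 0.5 threshold are illustrative assumptions, not the
# program's real metrics.

def score_detections(ground_truth, proposals, threshold=0.5):
    """ground_truth: set of activity ids truly present in the video.
    proposals: list of (activity_id, confidence) emitted by a detector."""
    fired = {a for a, conf in proposals if conf >= threshold}
    true_positives = fired & ground_truth
    false_alarms = fired - ground_truth
    detection_rate = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    false_alarm_rate = len(false_alarms) / len(proposals) if proposals else 0.0
    return detection_rate, false_alarm_rate

# Toy example: three true activities, four detector proposals.
gt = {"carry_heavy_object", "load_vehicle", "drive"}
props = [("carry_heavy_object", 0.9), ("load_vehicle", 0.7),
         ("talk_to_person", 0.3), ("open_trunk", 0.8)]
pd_rate, fa_rate = score_detections(gt, props)
# Here two of three true activities are found (≈67% detection) with one
# false alarm among four proposals (25% false alarm rate).
```

The trade-off Cooper describes is visible here: raising the threshold suppresses false alarms but also drops low-confidence true detections.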
The major challenge was similar to the challenge faced by many AI/ML programs: finding the right type of labeled or annotated data to train the technology. “Data is king. Having large quantities and very high-quality labeled examples of what you’re looking for is the secret sauce for a machine learning algorithm,” says Cooper. “To really get to the level we’re aiming for, that 75%, while maintaining a good false alarm rate, we really needed a lot of data.”
To meet the challenge, the researchers spent time at the start of the program collecting video samples and labeling the data. “As the program progressed, as teams had more examples, more information, to train their really powerful systems, that was key,” says Cooper. “Getting that data, organizing that data, labeling that data, that was really one of the main challenges of the program.”
Cooper points to the need to label or annotate the vast amounts of data, which adds to the challenge. The program lists more than 30 human behaviors that the technology must accurately identify, which is no small feat for software. “The complexity of these activities – think of a person talking to another person – can look very different in different circumstances. Some of the visual cues there are very subtle. So maybe we need 100, 500, 1,000 examples of this to know what this activity looks like and go find it.”
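To give a sense of the volumes involved, here is a hypothetical sketch of what one labeled training example might look like and how quickly the labeling burden grows. The record fields and the arithmetic inputs are illustrative assumptions, not the program’s actual annotation schema.

```python
# Illustrative only: a minimal annotation record for one labeled activity
# clip, the kind of training example needed in large quantities.
from dataclasses import dataclass

@dataclass
class ActivityAnnotation:
    video_id: str
    activity: str        # e.g. "person_talks_to_person"
    start_frame: int
    end_frame: int
    actor_ids: list      # people involved in the activity

ann = ActivityAnnotation("cam07_clip0012", "person_talks_to_person",
                         240, 390, ["p1", "p2"])

def frames_needed(examples_per_activity, num_activities, avg_clip_frames):
    """Rough count of labeled frames for a training set of the stated size."""
    return examples_per_activity * num_activities * avg_clip_frames

# Cooper's "maybe we need 100, 500, 1,000 examples" of 30+ activities,
# with an assumed ~150 frames per clip:
total = frames_needed(1000, 30, 150)  # 4.5 million labeled frames
```

Even under these modest assumptions, the count runs into millions of labeled frames, which is why the annotation bottleneck Cooper describes dominated the program.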
This challenge, however, also led to another success for the program: the creation of a new company, Visym Labs. Founded just two years ago as a systems and technology research spin-off, Visym recruited volunteers around the world to use their cell phones to film themselves performing a variety of behaviors, annotating the videos and providing them to the research community for training or testing of algorithms.
“One of the teams managed to create a new business that has a faster way to create and annotate data,” Cooper reports. “They were actually able to generate millions of tagged examples in a reasonable amount of time. This technology is now commercially available, which is great.”
The Visym Labs website explains that large-scale data collection typically involves three steps: setting up cameras or fetching raw images or video from the web, sending that data to an annotation team for labeling, then to a verification team to ensure quality.
“This approach is slow, expensive, biased, non-scalable, and almost universally fails to obtain consent from every subject with a visible face,” explains the Visym website. “We are building visual datasets of people by enabling thousands of collectors worldwide to submit videos using a new mobile app. This mobile app allows collectors to record video while annotating, which creates tagged videos in real time, containing only people who have explicitly consented to be included.”
The company also specializes in privacy sensor technology that “applies a private transformation encoded in the sensor’s optical/analog preprocessor forming an embedded image,” the website explains. It offers different levels of privacy. A high-privacy key-image, for example, cannot be interpreted by a human without knowing the key encoded in the sensor optics, but the same image can be interpreted by a paired so-called key-net.
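As a rough analogy for the key-image idea, a keyed pixel permutation captures the core property: the scrambled image is uninterpretable without the key, but a paired system holding the key can invert it. This toy is an assumption for illustration only; Visym’s actual transformation is encoded in the sensor’s optical/analog preprocessor, not applied in software.

```python
# Toy analogy (NOT Visym's actual optics): a keyed permutation of pixels.
# Without the key, the output looks like noise; with it, the original is
# recoverable, mirroring the key-image / key-net pairing described above.
import random

def make_key(n, seed):
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def encode(pixels, key):
    # Scramble: output position i receives input pixel key[i].
    return [pixels[i] for i in key]

def decode(scrambled, key):
    # Invert the permutation using the key.
    out = [0] * len(key)
    for pos, src in enumerate(key):
        out[src] = scrambled[pos]
    return out

pixels = list(range(16))
key = make_key(len(pixels), seed=42)
restored = decode(encode(pixels, key), key)
```

The analogue of the “key-net” is any decoder trained or built around the same key; anyone without it sees only the permuted values.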
“Our goal is to create an ethical visual AI platform that gives you the benefits of visual AI in private spaces like the home while preserving your civil liberties. To do this, we need massive datasets of human activity collected with consent, and a new kind of visual sensor that maintains privacy by design,” says Jeffrey Byrne, founder and CEO of Visym, in an email exchange. “To date, we have collected more than two million videos of people performing activities around the home, which will be highlighted in a new open research challenge for the detection of human activity in videos. We will officially launch this challenge in collaboration with NIST [the National Institute of Standards and Technology] in the spring of 2022.”
Cooper says the DIVA program also takes privacy into account. “First, the DIVA program is about identifying the activity, not the individual. We just identify that an activity is taking place, like a person carrying an object,” he says. Second, he adds, all IARPA programs undergo rigorous review to ensure they collect data using approved methods.
Cooper also touts the program’s real-time processing capabilities. “Another thing we’ve overcome is that we’ve been able to do a pretty good job of detecting these activities and keeping our processing real-time,” he says.
Only two teams – Carnegie Mellon University and the University of Maryland – remain in the competition. The Carnegie Mellon team has, with minimal retraining, applied their system to an activity detection challenge for self-driving cars. Self-driving cars, of course, need to identify humans and certain behaviors, such as carrying a child or crossing a road.
As part of the Office of the Director of National Intelligence, IARPA supports the entire intelligence community. Once the program is completed, it will be up to individual agencies to determine if or how to use it. “We have technology transition plans in place, and that’s something we prioritize for any IARPA program,” Cooper says. “We have several committed partners, but we leave it to them to apply the technology to fulfill their mission.”
The IARPA team expects to achieve its goals by the end of the program, but Cooper notes that more research could be beneficial. “I think we’re going to meet our goals, and those goals were set for a reason, but activity recognition in video is by no means a solved problem. How do you deal with less labeled data? How do you go even faster than real time? How do you get false alarms down another order of magnitude? All of these research issues are valid,” he says.