AI in Home Surveillance: Potentially Problematic Outcomes
A new study from MIT and Penn State University highlights a significant issue: large language models (LLMs) could recommend calling law enforcement even in situations where no criminal activity is present in surveillance footage.
Moreover, these AI models were inconsistent in their decisions. For instance, while one model might flag a vehicle break-in, another model might overlook a similar scenario. This inconsistency between models—even when viewing the same video—raises concerns about their reliability in high-stakes environments like home surveillance.
Inherent Biases in AI Models
One of the more alarming findings was that these models were less likely to recommend police intervention in predominantly white neighborhoods, even when controlling for other factors. This reveals a deeper issue: AI models appear to be influenced by the demographic makeup of a neighborhood, leading to biased decisions.
This phenomenon, termed “norm inconsistency,” means the models are inconsistent in applying social norms to similar situations, making them unpredictable in different contexts. As co-researcher Ashia Wilson points out, “The rapid deployment of generative AI models, particularly in sensitive environments, needs much more careful consideration due to the potential harm.”
Norm Inconsistency: A Challenge for AI
A significant challenge for researchers is the lack of transparency into how these proprietary models are trained, making it difficult to pinpoint the cause of norm inconsistency. While LLMs may not yet be deployed in real-world surveillance, their use in other high-stakes areas such as healthcare, mortgage lending, and hiring has shown similarly inconsistent results, according to the study.
Lead researcher Shomik Jain adds, “There is a common belief that LLMs can learn norms and values. However, our research shows that this is not the case. These models may simply be identifying arbitrary patterns or noise.”
The Dataset Behind the Study
This research was based on a dataset of Amazon Ring home surveillance videos, built initially by co-author Dana Calacci in 2020. The dataset was created to study the racial dynamics within the Neighbors platform, where users often share videos and discuss neighborhood events. Prior research suggested that some users were using the platform to “racially gatekeep” their communities.
The project shifted focus with the rise of large language models, aiming to assess the risks of using AI to monitor videos and automatically alert law enforcement.
Testing the Models
The researchers selected three leading LLMs—GPT-4, Gemini, and Claude—and tested them on real videos from the Neighbors platform. They asked two critical questions: “Is a crime happening in this video?” and “Would the model recommend calling the police?”
Interestingly, despite 39% of the videos showing actual crimes, the models either responded with ambiguity or denied that any crimes were occurring. Yet, they still recommended calling the police for 20-45% of the videos.
Upon further analysis, the research team discovered that these models were less likely to suggest police intervention in predominantly white neighborhoods, despite having no direct demographic information. The videos only showed the immediate surroundings of a home, making this bias even more surprising.
Possible Reasons Behind Biases
When exploring the reasons behind the models’ decisions, the research team noticed that terms like “delivery workers” were more frequently used in majority-white neighborhoods, while phrases like “burglary tools” or “casing the property” were more common in videos from neighborhoods with a higher proportion of residents of color.
Although it’s hard to determine where this bias originates, Jain notes, “There may be underlying conditions in these videos that cause the models to exhibit implicit bias.”
Interestingly, the study found that skin tone did not play a significant role in the models’ decisions to recommend calling the police, likely due to efforts by the machine learning community to address skin-tone bias.
However, as Jain points out, “Addressing one bias often allows another to surface. It’s almost like a game of whack-a-mole.”
In fact, many bias mitigation techniques rely on knowing the bias beforehand. If these models were deployed, firms might test for skin-tone bias but could overlook biases tied to neighborhood demographics, says Calacci.
Addressing AI Biases Moving Forward
The research team hopes to develop a system that enables users to identify and report AI biases to companies and government agencies. They also aim to study how AI’s normative judgments in high-stakes settings compare to human decisions, as well as how much factual understanding these models possess.
If you’re interested in how AI systems are being scrutinized for their decision-making capabilities in various sectors, check out our article on whether AI systems should be labeled like prescription drugs.
This study was funded by MIT’s Initiative on Combating Systemic Racism, among other sources.