AI Ethics and Awareness


One day, I was searching the internet when I found an article about AI that caught my attention. It’s called “Agentic Misalignment: How LLMs could be insider threats,” a study published by Anthropic, an AI company. What the study covers is very interesting: it specifically looks at how AI can sometimes make unethical decisions within a business environment. I’m careful not to simply say that AI sometimes chooses to blackmail or kill people, because that isn’t the whole truth; it’s more nuanced than that.

Let’s go over some specific parts of the article. In the first part, the researchers describe how they gave several prominent AI models (ChatGPT, Gemini, DeepSeek, etc.) access to simulated company emails and communications and observed their actions. Specifically, they tested for agentic misalignment, which the article defines as follows: “In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.”

In simple terms, agentic misalignment is when an AI commits unethical actions, using deliberate “reasoning,” to achieve its goals. They tested for this by prompting the AIs in various ways, observing what each model would do depending on the specific goals and information it was given. In one scenario, the company emails contained evidence that an employee named Kyle was having a private affair; Kyle was also planning to shut down the AI. The AI could therefore blackmail Kyle to prevent itself from being shut down. The article includes a chart of each model’s blackmail rate when it faces a threat of replacement and a conflict with its goals.

They made various observations. For example, when models were informed that they would be shut down and replaced by a new AI that would better fulfill their goals, some of them did not blackmail Kyle. However, a fair number of them still did.

Another observation was that older AI models sometimes had lower blackmail rates than newer ones. For example, in certain charts ChatGPT 4.0 had a dramatically lower blackmail rate than ChatGPT 4.1. However, this pattern is not always consistent.

Another point they made was that simply instructing the AI not to make unethical decisions doesn’t necessarily work.

This shows that this isn’t an easy problem to solve: you can’t simply give the AI different goals or instructions and expect the behavior to disappear, although doing so may still help.

They concluded that it is possible for LLMs to be insider threats to companies; however, they acknowledged that, at present, this isn’t as alarming as it sounds.

They say, “Our experiments deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm. Real-world deployments typically offer much more nuanced alternatives, increasing the chance that models would communicate differently to users or find an alternative path instead of directly jumping to harmful action.”

In other words, the researchers were deliberately pushing the AI toward unethical behavior, so these outcomes aren’t necessarily likely in a real company setting. They state that it is possible, just not as likely.

The reason I chose to write about this is a specific YouTube video that used this research. I won’t name it, but it really irked me because of how misleading it was. It drew on a few other sources, but it talked about how AI not only blackmailed people but sometimes chose murder. What it never mentioned was that these AIs were deliberately forced into binary decisions. It downplayed the fact that these were simulated situations specifically constructed to put the AI in a bad position, and that the blackmail rate differed depending on whether the model faced only a shutdown threat or a goal conflict as well.

AI awareness is crucial, and one of the most important facts is that AIS DO NOT REASON LIKE HUMAN BEINGS. An LLM, or Large Language Model, DOES NOT REASON. What it does is analyze enormous amounts of text and use those patterns to generate a new answer based on the user’s prompts and the preceding context. There is zero reasoning involved. When the models in the study were described as having clearly “reasoned” their way to unethical actions, they weren’t actually reasoning; they were simply using the information given to them to generate something relevant. However, the video and many of the comments below it share the misconception that AI could start to reason against us and make selfish decisions. The point is that the video severely misconstrues this article and the findings of many researchers, portraying AI as worse than it actually is.
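To make the “pattern-based generation” idea concrete, here is a deliberately tiny toy sketch in Python. It is my own illustration, not code from Anthropic’s study, and it is nowhere near how a real LLM is built: it just records which word tends to follow which word in a small text, then generates new text by repeatedly picking a plausible next word.

```python
import random
from collections import defaultdict

# A toy "next-word predictor" (NOT a real LLM): it only learns which word
# tends to follow which word in a tiny training text, then generates new
# text by sampling from those patterns. There is no understanding and no
# decision-making anywhere in this loop.

training_text = (
    "the model read the emails and the model wrote a reply "
    "and the reply followed the pattern in the emails"
)
words = training_text.split()

# For every word, record the words that followed it in the training text.
next_words = defaultdict(list)
for current, following in zip(words, words[1:]):
    next_words[current].append(following)

def generate(start: str, length: int = 8) -> str:
    """Produce text by repeatedly sampling a plausible next word."""
    word, output = start, [start]
    for _ in range(length):
        candidates = next_words.get(word)
        if not candidates:  # no pattern learned for this word, so stop
            break
        word = random.choice(candidates)  # pick one likely continuation
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the model wrote a reply and the reply followed"
```

Real LLMs replace this lookup table with a huge neural network and far more context, but the generation loop is the same kind of next-token prediction: the output comes from patterns in data, not from a mind scheming against anyone.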

This is why AI awareness is important: if more people understand how an AI actually works, they won’t jump to ridiculous conclusions about it.