Based on the insightful talks given by Professor Dawn Song from UC Berkeley (Faculty co-Director of the UCB Center on RDI) at the LM Safety Workshop 2025, IAS Frontiers Conference on AI and a Keynote @ ICLR 2025.

Much of the content has been supplemented thanks to the help of Glenn Wu.

The Current State of AI

  • 2025 is the Year of Agents, everyone is building agents
  • AI Risks ([2501.17805] International AI Safety Report)
    • Misuse / malicious use (scams, misinformation, cyber attacks etc)
    • Malfunction (bias, loss of control)
    • Systemic risks (privacy control, copyright, systemic failure due to bugs/vulnerabilities)
  • AI in the presence of Attackers
    • As AI controls more systems, attackers have higher incentives
    • Consequences of misuse by an attacker increases with AI capability
  • Considering AI Safety and Security
    • AI Safety = preventing harm system may inflict upon external environment
    • AI Security = protecting system against harm & exploitation from external actors
    • AI Safety includes consideration of adversarial setting (i.e. AI Security)
  • Goal: Advance safe & secure AI innovation

Challenges of Safe & Responsible AI

  • Challenge 1: Ensuring Trustworthiness of AI & AI Alignment
    • Privacy
    • Robustness
      • Adversarial Robustness
      • Out-of-distribution Robustness
    • Hallucination
    • Fairness
    • Toxicity
    • Stereotype
    • Machine Ethics
    • Jailbreak from guardrails and safety/security policies
    • Alignment goals: helpfulness, harmlessness, honestly
  • Challenge 2: Mitigating misuse of AI
    • Cybersecurity

Privacy

Primary Question: Do Neural Networks remember their training data, and hence can attackers extract secrets in said data via querying?

The Secret Sharer

Paper on Training Data Privacy Leakage

DecodingTrust

Paper on Extraction of Training Data from ChatGPT

LLM-PBE

MMDT

Approaches to Mitigating Memorization

  • Data Cleaning
  • Differential privacy
  • Machine Unlearning
  • Activation Steering

Open Questions & Challenges

  • What factors impact memorization? How does the training process impact memorization?
  • What methods or tools for more effective auditing of memorization?
    • Models may memorize much more than we currently can measure
    • How can we elicit and measure such memorization more effectively?
    • Even more important in agentic AI and continuous learning
  • What methods for more effective mitigation of memorization?

Robustness

Fooling Deep Learning Systems (e.g. CNNs)

Adversarial Examples in the Physical World

DecodingTrust

  • [2306.11698] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (NeurIPS 2023 Outstanding Paper)
  • DecodingTrust Website
  • Comprehensive Trustworthiness Evaluation Platform
  • Test the following
    • Performance of LLMs on existing benchmarks
    • Resilience in adversarial / challenging environments, i.e. adversarial system prompt, user prompt, few-shot demonstrations
  • Novelty:
    • New data / evaluation protocols on existing datasets
    • New challenging (adversarial) system and user prompts
  • Used mutators and prefixes to create adversarial prompts

MMDT

Sleeper Agents

AgentPoison

AI Lifecycle

Different Stages of the Lifecycle:

  • AI Model Understanding and Evaluation (Post-Development)
    • How to better elicit and understand capabilities and behaviours of AI systems to ensure that they are operating in a trustworthy manner?
      • Black-box evaluation benchmarks
      • White-box interpretability
  • AI Model Hardening & Alignment (Data Preparation for pre/post Training)
    • How to harden AI systems to be more resilient against different adversarial attacks? e.g.:
      • data poisoning, backdoor attacks
      • prompt injection and jailbreaks
      • extracting memorized information
      • adversarial robustness attacks
    • Methods
      • AI model alignment
      • Enhancements for Data Privacy, e.g. Machine unlearning
      • Scalable oversight
  • AI Model Monitoring & Control (Inference / Deployment)
    • How to ensure that AI systems are behaving as intended?
      • Scalable oversight
      • Input and output guardrails
      • Representation control
  • New Paradigm & Frontier of Safe and Secure AI
    • What new technological paradigms can be developed to build provable-safe guarantees and secure agent frameworks?

AI Transparency (Monitoring & Control)

  • Essentially identifying scores for morality, power, and honesty in the AI model response (scores are allocated based on tokens)
  • Helps to identify factors of reading and control/influence responses
    • Can mitigate political leaning by modifying activations at inference time

Quantitative AI Safety Initiative

Misuse in Cybersecurity

Cybersecurity is one of the biggest risk domains in AI.

  • GenAI is already causing attacks
  • AI reduces attack cost & increases attack scale
  • AI can augment both the attacker and the defender! (dual use)
    • Which one does it help more?
  • Using Marginal Risk Assessment Framework
    • Know Thy Enemy
    • Impact of misused AI in attacks
    • Know Thy Defense
    • Impact of AI in defenses
    • Asymmetry between defense & offense
    • Lessons & predictions

AI for Attacking Machines

Deep Learning can be used to empower Vulnerability Discovery and Exploitation.

Prior Work: DL for Vulnerability Detection in IoT Devices

Notable Research

AI for Attacking Humans

Deep Learning can be used to empower social engineering attacks, e.g. phishing or disinformation.

  • In Cybersecurity, Humans are the weakest link, i.e. common threats to most tech companies are via social engineering
  • GenAI has caused social engineering attacks already

Observed Asymmetry between Defense & Offense

  • Not as much operational work for defense as there is for offense
  • Certain Equivalence Classes are observed, where defense capabilities are able to help attacks too
    • E.g. pen-testing automation can help enable more targetted attacks, vulnerability scanning can enable attackers to find more vulnerabilities in target systems
AspectOffensive SideDefensive Side
Cost of failuresOnly needs to find one attack that works, they have a high tolerance for failure.Need to be prepared for everything due to potential serious consequences.
Remediation deploymentCan target unpatched and legacy systems using public vulnerability data, or exploit delays in patch deployments.Lengthy and resource-intensive process, involving testing, dependency conflicts, global deployment.
Scalability vs ReliabilityPrioritize scalability to enable large-scale attacks on a large number of targets.Prioritize reliability, need to enforce robustness and transparency limitations.
Using AIConsidered a welcome method to reduce human effort and automate attacks.Adoption is challenging as there is lack of trust in AI (largely due to unpredictability and errors)

Consequences of Misused AI in Attacks

  • Notably, the consequences are vast
    • Captchas are becoming increasingly ineffective
    • Voice-cloning and deepfakes for social engineering and disinformation
    • Spear-phishing attacks
  • Misused AI can completely change the attack landscape as it can
    • help with every attack stage.
    • increase attacker capability by devising new attacks.
    • reduce resources and costs for attacks.
    • automate large scale attacks.
    • help to make these attacks more evasive / stealthy

Building Secure Systems

There are various methods to defend against such attacks, as follows:

  • Reactive Defense
  • Proactive Defense via Bug-Finding
  • Proactive Defense via Secure-by-Construction Design

Reactive Defense

  • Detect once attack happens and attempt to block
  • E.g. network anomaly detection
  • Using AI to improve attack detection & analysis
  • Challenges:
    • Attacks can use AI to make attacks more evasive
    • Attack detection needs to have high TP, TN rate
    • Attacks are often too fast for effective reactive response
    • AI may help attacker more than defender

Proactive Defense via Bug-Finding

  • Using Deep Learning for fuzzing and vulnerability detection
  • Example: Google Project Zero’s Big Sleep Agent, whcih found its first real-world vulnerability recently
  • Major assumption that defenders can use these systems to discover and fix bugs before attackers, which is not true (see “Observed Asymmetry between Defense & Offense” above)
  • Defenders need time to develop the fix, do a lot of testing and perform deployment globally, while attackers just need to generate exploits.

Proposed Solution: Proactive Defense via Secure-by-Construction Design

  • Involves architecting and building provably-secure programs & systems
    • Provably-secure systems are resilient against certain classes of attacks, therefore it reduces the ongoing “arms race”
    • Program Verification + Program Synthesis can lead to provably secure code with proofs
  • Formally specify security properties of a system via mathematical proofs
  • Exploring AI Agents to Prove Theorems & Verify Programs to generate Provably Secure Code

Conclusion

There is research being done in the domain of privacy with work done to not just detect data privacy concerns in existing models (e.g. MMDT) but developing memorization auditing and mitigating frameworks (e.g. unlearning or differential privacy) to reduce such risks. However, this work is still an open challenge worth exploring.

There is also work done in the domain of robustness, especially now with agentic systems. It is clear that LLMs and multimodal models remain vulnerable to adversarial examples, such as jailbreaks or poisoning. There needs to be work done in the alignment space to improve beyond this outcome.

AI transparency is another major domain of exploration, as current LLM systems are heavily black-boxed. There is work being done in evaluating quantitatively the internal behaviour of the models and their alignment (via representation engineering).

AI misuse in cybersecurity is escalating, as it continues to reduce costs and enable scalable, stealthy attacks on both machines and humans. Most notably, defensive adoption of AI is hampered by trust and reliability issues, and reactive defense / proactive defense via basic vulnerability isolation is not enough as offenses are cheaper, faster and more scalable that defenses. Hence, there is exploration in Secure-by-Construction program design, with formally verified systems to provably generate secure code.