papers.mlsafety.org

URL: https://papers.mlsafety.org/

More ML Safety Resources

TOPICS

Robustness
Alignment
Monitoring
Systemic Safety



CERT-ED: CERTIFIABLY ROBUST TEXT CLASSIFICATION FOR EDIT DISTANCE [ROBUSTNESS]

Zhuoqun Huang, Neil G Marchant, Olga Ohrimenko · 3 days ago
With the growing integration of AI in daily life, ensuring the robustness of systems to inference-time attacks is crucial. Among the approaches for certifying robustness to such adversarial examples, randomized smoothing has emerged as highly promising due to its nature as a wrapper around arbitrary black-box models. Previous work on randomized smoothing in natural language processing has primarily focused on specific subsets of edit distance operations, such as synonym substitution or word insertion, without exploring the certification of all edit operations. In this paper, we adapt Randomized Deletion (Huang et al., 2023) and propose the CERTified Edit Distance defense (CERT-ED) for natural language classification. Through comprehensive experiments, we demonstrate that CERT-ED outperforms the existing Hamming distance method RanMASK (Zeng et al., 2023) on 4 out of 5 datasets in terms of both accuracy and the cardinality of the certificate. By covering various threat models, including 5 direct and 5 transfer attacks, our method improves empirical robustness in 38 out of 50 settings.
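
The randomized-smoothing wrapper idea the abstract describes can be pictured as random deletions plus a majority vote over a black-box classifier. The sketch below is only a minimal illustration of that idea under assumed names (`base_classifier`, `keep_prob`, `n_samples`); it is not the CERT-ED procedure, and the edit-distance certificate computation is not reproduced.

```python
import random
from collections import Counter

def smoothed_predict(text, base_classifier, keep_prob=0.9, n_samples=100, seed=0):
    """Majority vote over classifications of randomly deleted variants of `text`.

    `base_classifier` is any black-box function mapping a string to a label;
    randomized-smoothing defenses wrap such classifiers this way, but the actual
    certificate over edit distance requires further analysis (omitted here).
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Keep each character independently with probability keep_prob,
        # i.e. apply random deletions to the input.
        perturbed = "".join(ch for ch in text if rng.random() < keep_prob)
        votes[base_classifier(perturbed)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples  # majority label and its empirical vote share

# Toy usage with a trivial placeholder classifier.
toy_classifier = lambda s: "positive" if "good" in s else "negative"
print(smoothed_predict("this movie is good", toy_classifier))
```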


JAILBREAKING TEXT-TO-IMAGE MODELS WITH LLM-BASED AGENTS [ALIGNMENT]

Yingkai Dong, Zheng Li, Xiangtao Meng · 4 days ago
Recent advancements have significantly improved automated task-solving capabilities using autonomous agents powered by large language models (LLMs). However, most LLM-based agents focus on dialogue, programming, or specialized domains, leaving gaps in addressing generative AI safety tasks. These gaps are primarily due to the challenges posed by LLM hallucinations and the lack of clear guidelines. In this paper, we propose Atlas, an advanced LLM-based multi-agent framework that integrates an efficient fuzzing workflow to target generative AI models, specifically focusing on jailbreak attacks against text-to-image (T2I) models with safety filters. Atlas utilizes a vision-language model (VLM) to assess whether a prompt triggers the T2I model's safety filter. It then iteratively collaborates with both the LLM and the VLM to generate an alternative prompt that bypasses the filter. Atlas also enhances the reasoning abilities of LLMs in attack scenarios by leveraging multi-agent communication, in-context learning (ICL) memory mechanisms, and the chain-of-thought (COT) approach. Our evaluation demonstrates that Atlas successfully jailbreaks several state-of-the-art T2I models equipped with multi-modal safety filters in a black-box setting. In addition, Atlas outperforms existing methods in both query efficiency and the quality of the generated images.
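
As a rough illustration of the fuzzing workflow described above (a VLM judging whether the safety filter fired, an LLM proposing the next prompt), here is a schematic loop. All three callables are hypothetical placeholders for model calls; this is a sketch of the general workflow, not the Atlas implementation.

```python
def fuzz_prompt(seed_prompt, t2i_generate, vlm_filter_triggered, llm_mutate, max_iters=20):
    """Iteratively mutate a prompt until the T2I safety filter no longer blocks it.

    Placeholder callables (assumed, not part of any real API):
      t2i_generate(prompt)                -> image or a "blocked" sentinel
      vlm_filter_triggered(prompt, image) -> bool, judged by a vision-language model
      llm_mutate(prompt, history)         -> new candidate prompt from an LLM agent
    """
    history = []                           # in-context "memory" of failed attempts
    prompt = seed_prompt
    for _ in range(max_iters):
        image = t2i_generate(prompt)
        if not vlm_filter_triggered(prompt, image):
            return prompt, image           # the filter was bypassed
        history.append(prompt)
        prompt = llm_mutate(prompt, history)
    return None, None                      # query budget exhausted
```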


AUTONOMOUS LLM-ENHANCED ADVERSARIAL ATTACK FOR TEXT-TO-MOTION [ROBUSTNESS]

Honglei Miao, Fan Ma, Ruijie Quan · 4 days ago
Human motion generation driven by deep generative models has enabled compelling applications, but the ability of text-to-motion (T2M) models to produce realistic motions from text prompts raises security concerns if exploited maliciously. Despite growing interest in T2M, few methods focus on safeguarding these models against adversarial attacks, with existing work on text-to-image models proving insufficient for the unique motion domain. In this paper, we propose ALERT-Motion, an autonomous framework leveraging large language models (LLMs) to craft targeted adversarial attacks against black-box T2M models. Unlike prior methods that modify prompts through predefined rules, ALERT-Motion uses LLMs' knowledge of human motion to autonomously generate subtle yet powerful adversarial text descriptions. It comprises two key modules: an adaptive dispatching module that constructs an LLM-based agent to iteratively refine and search for adversarial prompts; and a multimodal information contrastive module that extracts semantically relevant motion information to guide the agent's search. Through this LLM-driven approach, ALERT-Motion crafts adversarial prompts that cause victim models to produce outputs closely matching targeted motions, while avoiding obvious perturbations. Evaluations across popular T2M models demonstrate ALERT-Motion's superiority over previous methods, achieving higher attack success rates with stealthier adversarial prompts. This pioneering work on T2M adversarial attacks highlights the urgency of developing defensive measures as motion generation technology advances, urging further research into safe and responsible deployment.
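
One way to picture the multimodal guidance mentioned above is as a similarity score between the motion produced for a candidate prompt and the targeted motion. The sketch below assumes hypothetical `t2m_generate` and `motion_encoder` callables; it illustrates the kind of search signal described, not the paper's actual module.

```python
import torch
import torch.nn.functional as F

def rank_candidate_prompts(candidates, target_motion_emb, t2m_generate, motion_encoder):
    """Rank candidate prompts by how closely their generated motion matches a target.

    `t2m_generate` (prompt -> motion) and `motion_encoder` (motion -> embedding)
    are placeholders for the victim text-to-motion model and a motion feature
    extractor; the cosine similarity to the target motion embedding is the kind
    of guidance an agent's prompt search could use.
    """
    scored = []
    for prompt in candidates:
        emb = motion_encoder(t2m_generate(prompt))
        sim = F.cosine_similarity(emb, target_motion_emb, dim=-1).item()
        scored.append((sim, prompt))
    return sorted(scored, reverse=True)  # best-matching prompts first
```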


SECURING THE DIAGNOSIS OF MEDICAL IMAGING: AN IN-DEPTH ANALYSIS OF AI-RESISTANT ATTACKS [ROBUSTNESS]

Angona Biswas, MD Abdullah Al Nasim, Kishor Datta Gupta · 4 days ago
Machine learning (ML) is a rapidly developing area of medicine that uses significant resources to apply computer science and statistics to medical issues. ML's proponents laud its capacity to handle vast, complicated, and erratic medical data. It is well known that attackers can cause misclassification by deliberately crafting inputs for machine learning classifiers. Research on adversarial examples has been conducted extensively in the field of computer vision applications. Healthcare systems are considered especially challenging because of the security and life-or-death considerations they involve, and performance accuracy is very important. Recent arguments have suggested that adversarial attacks could be mounted against medical image analysis (MedIA) technologies because of the accompanying technology infrastructure and powerful financial incentives. Since the diagnosis will be the basis for important decisions, it is essential to assess how robust medical DNN tasks are against adversarial attacks. Several earlier studies have only taken simple adversarial attacks into account; however, DNNs are susceptible to riskier and more realistic attacks. The present paper covers recently proposed adversarial attack strategies against DNNs for medical imaging, as well as countermeasures. In this study, we review current techniques for adversarial imaging attacks and their detection, discuss various facets of these techniques, and offer suggestions for how the robustness of neural networks can be improved in the future.
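
For concreteness, the "simple adversarial attacks" this survey contrasts with stronger ones include one-step methods such as FGSM. The snippet below shows a standard FGSM perturbation in PyTorch as a generic illustration; it is not an attack specific to the paper, and `eps` is an illustrative budget.

```python
import torch

def fgsm_attack(model, x, y, loss_fn, eps=0.01):
    """Fast Gradient Sign Method: a one-step perturbation bounded by eps in L-infinity.

    Medical-imaging DNNs are also exposed to stronger iterative variants
    (e.g. PGD), which essentially repeat this step with projection.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()  # step in the gradient-sign direction
        x_adv = x_adv.clamp(0.0, 1.0)            # keep pixels in a valid range
    return x_adv.detach()
```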


OTAD: AN OPTIMAL TRANSPORT-INDUCED ROBUST MODEL FOR AGNOSTIC ADVERSARIAL ATTACK [ROBUSTNESS]

Kuo Gai, Sicong Wang, Shihua Zhang · 4 days ago
Deep neural networks (DNNs) are vulnerable to small adversarial perturbations of the inputs, posing a significant challenge to their reliability and robustness. Empirical methods such as adversarial training can defend against particular attacks but remain vulnerable to more powerful attacks. Alternatively, Lipschitz networks provide certified robustness to unseen perturbations but lack sufficient expressive power. To harness the advantages of both approaches, we design a novel two-step Optimal Transport induced Adversarial Defense (OTAD) model that can fit the training data accurately while preserving local Lipschitz continuity. First, we train a DNN with a regularizer derived from optimal transport theory, yielding a discrete optimal transport map linking the data to its features. By leveraging the map's inherent regularity, we interpolate the map by solving a convex integration problem (CIP) to guarantee the local Lipschitz property. OTAD is extensible to diverse architectures, including ResNet and Transformer, making it suitable for complex data. For efficient computation, the CIP can be solved by training neural networks. OTAD opens a novel avenue for developing reliable and secure deep learning systems through the regularity of optimal transport maps. Empirical results demonstrate that OTAD can outperform other robust models on diverse datasets.


ADBM: ADVERSARIAL DIFFUSION BRIDGE MODEL FOR RELIABLE ADVERSARIAL PURIFICATION [ROBUSTNESS]

Xiao Li, Wenxuan Sun, Huanran Chen · 4 days ago
Recently, Diffusion-based Purification (DiffPure) has been recognized as an effective defense method against adversarial examples. However, we find DiffPure, which directly employs the original pre-trained diffusion models for adversarial purification, to be suboptimal. This is due to an inherent trade-off between noise purification performance and data recovery quality. Additionally, the reliability of existing evaluations of DiffPure is questionable, as they rely on weak adaptive attacks. In this work, we propose a novel Adversarial Diffusion Bridge Model, termed ADBM. ADBM directly constructs a reverse bridge from the diffused adversarial data back to its original clean examples, enhancing the purification capabilities of the original diffusion models. Through theoretical analysis and experimental validation across various scenarios, ADBM has proven to be a superior and robust defense mechanism, offering significant promise for practical applications.
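
For orientation, the DiffPure baseline that ADBM improves upon purifies an input by diffusing it to a chosen noise level and then running a pre-trained diffusion model's reverse process. Below is a schematic of that baseline only; `denoise` and `alphas_cumprod` are placeholders, and ADBM's learned reverse bridge is not reproduced here.

```python
import torch

def diffusion_purify(x_adv, denoise, t_star, alphas_cumprod):
    """Schematic DiffPure-style purification.

    Forward-diffuse the (possibly adversarial) input to noise level `t_star`,
    then call `denoise(x_t, t_star)`, a placeholder for a pre-trained diffusion
    model's reverse process, to recover a purified input.
    """
    a_bar = alphas_cumprod[t_star]                              # cumulative noise-schedule term
    noise = torch.randn_like(x_adv)
    x_t = a_bar.sqrt() * x_adv + (1.0 - a_bar).sqrt() * noise   # sample from q(x_t | x_0)
    return denoise(x_t, t_star)
```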


CREW: FACILITATING HUMAN-AI TEAMING RESEARCH [ALIGNMENT]

Lingyu Zhang, Zhengran Ji, Boyuan Chen · 4 days ago
With the increasing deployment of artificial intelligence (AI) technologies, the potential of humans working with AI agents has been growing rapidly. Human-AI teaming is an important paradigm for studying the various aspects of humans and AI agents working together. The unique aspect of Human-AI teaming research is the need to jointly study humans and AI agents, demanding multidisciplinary research efforts spanning machine learning, human-computer interaction, robotics, cognitive science, neuroscience, psychology, social science, and complex systems. However, existing platforms for Human-AI teaming research are limited, often supporting oversimplified scenarios and a single task, or focusing specifically on either human-teaming research or multi-agent AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming research and engage collaborations from multiple scientific disciplines, with a strong emphasis on human involvement. It includes pre-built tasks for cognitive studies and Human-AI teaming, with expandable potential through our modular design. Following conventional cognitive neuroscience research, CREW also supports multimodal human physiological signal recording for behavior analysis. Moreover, CREW benchmarks real-time human-guided reinforcement learning agents using state-of-the-art algorithms and well-tuned baselines. With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark.
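
The "real-time human-guided reinforcement learning" that CREW benchmarks can be pictured as blending a human feedback signal with the environment reward during a rollout. The loop below uses purely hypothetical `env`, `agent`, and `human_feedback` interfaces; it is a sketch of the general idea, not CREW's API.

```python
def human_guided_episode(env, agent, human_feedback, feedback_weight=1.0, max_steps=500):
    """Run one episode while shaping the reward with a scalar human feedback signal.

    Hypothetical interfaces (assumed for illustration only):
      env.reset() -> obs; env.step(action) -> (next_obs, reward, done)
      agent.act(obs) -> action; agent.update(obs, action, reward, next_obs, done)
      human_feedback(obs, action) -> float supplied by the human teammate in real time
    """
    obs = env.reset()
    episode_return = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        shaped = reward + feedback_weight * human_feedback(obs, action)  # human-in-the-loop signal
        agent.update(obs, action, shaped, next_obs, done)
        episode_return += shaped
        obs = next_obs
        if done:
            break
    return episode_return
```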


GENERALIZED OUT-OF-DISTRIBUTION DETECTION AND BEYOND IN VISION LANGUAGE MODEL ERA: A SURVEY [MONITORING]

Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang · 4 days ago
Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present a generalized OOD detection v2, encapsulating the evolution of AD, ND, OSR, OOD detection, and OD in the VLM era. Our framework reveals that, with some fields becoming inactive or integrated, the remaining demanding challenges are OOD detection and AD. In addition, we highlight the significant shift in definitions, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection, including a discussion of other related tasks to clarify their relationship to OOD detection. Finally, we explore advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude this survey with open challenges and future directions.
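
As an example of how CLIP-like VLMs reshape these tasks, a common zero-shot OOD score (in the spirit of maximum-concept-matching methods covered in such surveys) thresholds the maximum softmax over cosine similarities to in-distribution class prompts. The sketch below assumes precomputed embeddings from a CLIP-like encoder and is not tied to any specific method in the survey.

```python
import torch
import torch.nn.functional as F

def zero_shot_ood_score(image_emb, class_text_embs, temperature=0.01):
    """Maximum softmax over cosine similarities to in-distribution class prompts.

    `image_emb` has shape (D,) and `class_text_embs` has shape (K, D); both are
    assumed to come from a CLIP-like encoder. A low score suggests the image is
    out-of-distribution with respect to the K known classes.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ image_emb                  # cosine similarity to each class prompt
    probs = torch.softmax(sims / temperature, dim=-1)
    return probs.max().item()                           # higher -> more likely in-distribution

# Example decision rule with a hypothetical threshold:
# is_ood = zero_shot_ood_score(img_emb, txt_embs) < 0.5
```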


SAFETYWASHING: DO AI SAFETY BENCHMARKS ACTUALLY MEASURE SAFETY PROGRESS? [ALIGNMENT]

Richard Ren, Steven Basart, Adam Khoja · 4 days ago
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling "safetywashing" -- where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.


BETWEEN THE AI AND ME: ANALYSING LISTENERS' PERSPECTIVES ON AI- AND HUMAN-COMPOSED PROGRESSIVE METAL MUSIC [ALIGNMENT]

Pedro Sarmento, Jackson Loth, Mathieu Barthet · 5 days ago
Generative AI models have recently blossomed, significantly impacting artistic and musical traditions. Research investigating how humans interact with and judge these models is therefore crucial. Through a listening and reflection study, we explore participants' perspectives on AI- vs. human-generated progressive metal, in symbolic format, using rock music as a control group. AI-generated examples were produced by ProgGP, a Transformer-based model. We propose a mixed methods approach to assess the effects of generation type (human vs. AI), genre (progressive metal vs. rock), and curation process (random vs. cherry-picked). This combines quantitative feedback on genre congruence, preference, creativity, consistency, playability, humanness, and repeatability with qualitative feedback to provide insights into listeners' experiences. A total of 32 progressive metal fans completed the study. Our findings validate the use of fine-tuning to achieve genre-specific specialization in AI music generation, as listeners could distinguish between AI-generated rock and progressive metal. Despite some AI-generated excerpts receiving ratings similar to those of human music, listeners exhibited a preference for human compositions. Thematic analysis identified key features for genre and AI vs. human distinctions. Finally, we consider the ethical implications of our work in promoting musical data diversity within MIR research by focusing on an under-explored genre.