papers.mlsafety.org
URL:
https://papers.mlsafety.org/
Submission: On August 05 via automatic, source certstream-suspicious — Scanned from CA
More ML Safety Resources
Topics: Robustness · Alignment · Monitoring · Systemic Safety

CERT-ED: CERTIFIABLY ROBUST TEXT CLASSIFICATION FOR EDIT DISTANCE [ROBUSTNESS]
Zhuoqun Huang, Neil G Marchant, Olga Ohrimenko · 3 days ago
With the growing integration of AI in daily life, ensuring the robustness of systems to inference-time attacks is crucial. Among the approaches for certifying robustness to such adversarial examples, randomized smoothing has emerged as highly promising due to its nature as a wrapper around arbitrary black-box models. Previous work on randomized smoothing in natural language processing has primarily focused on specific subsets of edit distance operations, such as synonym substitution or word insertion, without exploring the certification of all edit operations. In this paper, we adapt Randomized Deletion (Huang et al., 2023) and propose the CERTified Edit Distance defense (CERT-ED) for natural language classification. Through comprehensive experiments, we demonstrate that CERT-ED outperforms the existing Hamming distance method RanMASK (Zeng et al., 2023) on 4 out of 5 datasets in terms of both accuracy and the cardinality of the certificate. By covering various threat models, including 5 direct and 5 transfer attacks, our method improves empirical robustness in 38 out of 50 settings.
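The randomized-deletion smoothing idea this abstract builds on can be sketched as a simple majority-vote wrapper around a black-box classifier. The snippet below is a minimal illustration, not the paper's certified procedure; base_classifier, keep_prob, and n_samples are placeholder names, and the edit-distance certificate itself requires an additional statistical bound that is not shown here.

```python
import random
from collections import Counter

def smoothed_predict(text, base_classifier, keep_prob=0.7, n_samples=100, seed=0):
    """Majority-vote prediction over randomly deleted copies of the input.

    base_classifier: any black-box function mapping a token list to a label.
    This sketches only the smoothing wrapper; certification additionally
    bounds how many votes an edit-bounded adversary could flip.
    """
    rng = random.Random(seed)
    tokens = text.split()
    votes = Counter()
    for _ in range(n_samples):
        # Independently keep each token with probability keep_prob.
        kept = [t for t in tokens if rng.random() < keep_prob]
        votes[base_classifier(kept)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

Because the wrapper only needs label outputs, any text classifier can be plugged in as base_classifier without retraining, which is the black-box property the abstract emphasizes.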
JAILBREAKING TEXT-TO-IMAGE MODELS WITH LLM-BASED AGENTS [ALIGNMENT]
Yingkai Dong, Zheng Li, Xiangtao Meng · 4 days ago
Recent advancements have significantly improved automated task-solving capabilities using autonomous agents powered by large language models (LLMs). However, most LLM-based agents focus on dialogue, programming, or specialized domains, leaving gaps in addressing generative AI safety tasks. These gaps are primarily due to the challenges posed by LLM hallucinations and the lack of clear guidelines. In this paper, we propose Atlas, an advanced LLM-based multi-agent framework that integrates an efficient fuzzing workflow to target generative AI models, specifically focusing on jailbreak attacks against text-to-image (T2I) models with safety filters. Atlas utilizes a vision-language model (VLM) to assess whether a prompt triggers the T2I model's safety filter. It then iteratively collaborates with both the LLM and VLM to generate an alternative prompt that bypasses the filter. Atlas also enhances the reasoning abilities of LLMs in attack scenarios by leveraging multi-agent communication, in-context learning (ICL) memory mechanisms, and the chain-of-thought (COT) approach. Our evaluation demonstrates that Atlas successfully jailbreaks several state-of-the-art T2I models equipped with multi-modal safety filters in a black-box setting. In addition, Atlas outperforms existing methods in both query efficiency and the quality of the generated images.

AUTONOMOUS LLM-ENHANCED ADVERSARIAL ATTACK FOR TEXT-TO-MOTION [ROBUSTNESS]
Honglei Miao, Fan Ma, Ruijie Quan · 4 days ago
Human motion generation driven by deep generative models has enabled compelling applications, but the ability of text-to-motion (T2M) models to produce realistic motions from text prompts raises security concerns if exploited maliciously. Despite growing interest in T2M, few methods focus on safeguarding these models against adversarial attacks, with existing work on text-to-image models proving insufficient for the unique motion domain. In this paper, we propose ALERT-Motion, an autonomous framework leveraging large language models (LLMs) to craft targeted adversarial attacks against black-box T2M models. Unlike prior methods that modify prompts through predefined rules, ALERT-Motion uses LLMs' knowledge of human motion to autonomously generate subtle yet powerful adversarial text descriptions. It comprises two key modules: an adaptive dispatching module that constructs an LLM-based agent to iteratively refine and search for adversarial prompts; and a multimodal information contrastive module that extracts semantically relevant motion information to guide the agent's search. Through this LLM-driven approach, ALERT-Motion crafts adversarial prompts that query victim models to produce outputs closely matching targeted motions, while avoiding obvious perturbations. Evaluations across popular T2M models demonstrate ALERT-Motion's superiority over previous methods, achieving higher attack success rates with stealthier adversarial prompts. This pioneering work on T2M adversarial attacks highlights the urgency of developing defensive measures as motion generation technology advances, urging further research into safe and responsible deployment.

SECURING THE DIAGNOSIS OF MEDICAL IMAGING: AN IN-DEPTH ANALYSIS OF AI-RESISTANT ATTACKS [ROBUSTNESS]
Angona Biswas, MD Abdullah Al Nasim, Kishor Datta Gupta · 4 days ago
Machine learning (ML) is a rapidly developing area of medicine that uses significant resources to apply computer science and statistics to medical issues. ML's proponents laud its capacity to handle vast, complicated, and erratic medical data. It is common knowledge that attackers might cause misclassification by deliberately creating inputs for machine learning classifiers. Research on adversarial examples has been extensively conducted in the field of computer vision applications. Healthcare systems are thought to be highly difficult because of the security and life-or-death considerations they involve, and performance accuracy is very important. Recent arguments have suggested that adversarial attacks could be made against medical image analysis (MedIA) technologies because of the accompanying technology infrastructure and powerful financial incentives. Since the diagnosis will be the basis for important decisions, it is essential to assess how robust medical DNNs are against adversarial attacks. Simple adversarial attacks have been taken into account in several earlier studies. However, DNNs are susceptible to more risky and realistic attacks. The present paper covers recently proposed adversarial attack strategies against DNNs for medical imaging as well as countermeasures. In this study, we review current techniques for adversarial imaging attacks and their detection. It also encompasses various facets of these techniques and offers suggestions for improving the robustness of neural networks in the future.
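For readers new to the attack side the survey above covers, here is a minimal sketch of one classic attack, the fast gradient sign method (FGSM), in PyTorch. It is a generic illustration rather than code from the paper; model, image, label, and epsilon are placeholders, and a medical imaging classifier would slot in as the model.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    """Fast Gradient Sign Method: a one-step L-infinity perturbation.

    model:  any differentiable classifier returning logits.
    image:  input tensor of shape (1, C, H, W), values assumed in [0, 1].
    label:  ground-truth class index tensor of shape (1,).
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, then clip to the valid range.
    adv = image + epsilon * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```

Stronger attacks discussed in such surveys (e.g., iterative variants) repeat this step with projection, which is why single-step robustness alone is a weak guarantee.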
OTAD: AN OPTIMAL TRANSPORT-INDUCED ROBUST MODEL FOR AGNOSTIC ADVERSARIAL ATTACK [ROBUSTNESS]
Kuo Gai, Sicong Wang, Shihua Zhang · 4 days ago
Deep neural networks (DNNs) are vulnerable to small adversarial perturbations of the inputs, posing a significant challenge to their reliability and robustness. Empirical methods such as adversarial training can defend against particular attacks but remain vulnerable to more powerful attacks. Alternatively, Lipschitz networks provide certified robustness to unseen perturbations but lack sufficient expressive power. To harness the advantages of both approaches, we design a novel two-step Optimal Transport induced Adversarial Defense (OTAD) model that can fit the training data accurately while preserving local Lipschitz continuity. First, we train a DNN with a regularizer derived from optimal transport theory, yielding a discrete optimal transport map linking data to its features. By leveraging the map's inherent regularity, we interpolate the map by solving the convex integration problem (CIP) to guarantee the local Lipschitz property. OTAD is extensible to diverse architectures such as ResNet and Transformer, making it suitable for complex data. For efficient computation, the CIP can be solved by training neural networks. OTAD opens a novel avenue for developing reliable and secure deep learning systems through the regularity of optimal transport maps. Empirical results demonstrate that OTAD can outperform other robust models on diverse datasets.

ADBM: ADVERSARIAL DIFFUSION BRIDGE MODEL FOR RELIABLE ADVERSARIAL PURIFICATION [ROBUSTNESS]
Xiao Li, Wenxuan Sun, Huanran Chen · 4 days ago
Recently, Diffusion-based Purification (DiffPure) has been recognized as an effective defense method against adversarial examples. However, we find DiffPure, which directly employs the original pre-trained diffusion models for adversarial purification, to be suboptimal. This is due to an inherent trade-off between noise purification performance and data recovery quality. Additionally, the reliability of existing evaluations for DiffPure is questionable, as they rely on weak adaptive attacks. In this work, we propose a novel Adversarial Diffusion Bridge Model, termed ADBM. ADBM directly constructs a reverse bridge from the diffused adversarial data back to its original clean examples, enhancing the purification capabilities of the original diffusion models. Through theoretical analysis and experimental validation across various scenarios, ADBM has proven to be a superior and robust defense mechanism, offering significant promise for practical applications.
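ADBM is positioned against DiffPure, so a schematic of the DiffPure-style purification loop it improves on helps frame the abstract. The sketch below is an illustration under stated assumptions, not ADBM's bridge construction: eps_model (a pretrained noise-prediction network), alpha_bar (its cumulative noise schedule), and t_star are hypothetical handles, and the reverse pass uses a deterministic DDIM-style update.

```python
import torch

def diffpure_purify(x_adv, eps_model, alpha_bar, t_star=100):
    """Schematic DiffPure-style purification (the baseline ADBM improves on).

    x_adv:     adversarial image batch, shape (N, C, H, W), values in [-1, 1].
    eps_model: pretrained noise-prediction network eps(x_t, t)  [hypothetical handle].
    alpha_bar: 1-D tensor of cumulative noise-schedule products, one entry per timestep.
    """
    # Forward diffusion: partially noise the input so the perturbation is drowned out.
    a = alpha_bar[t_star]
    x_t = a.sqrt() * x_adv + (1 - a).sqrt() * torch.randn_like(x_adv)

    # Reverse denoising back to t = 0 (deterministic DDIM update, eta = 0).
    for t in range(t_star, 0, -1):
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        t_batch = torch.full((x_adv.shape[0],), t, device=x_adv.device)
        eps = eps_model(x_t, t_batch)
        x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x_t  # purified input, handed to the downstream classifier
```

The trade-off the abstract mentions is visible here: a larger t_star removes more of the perturbation but also more of the clean signal the reverse pass must recover.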
CREW: FACILITATING HUMAN-AI TEAMING RESEARCH [ALIGNMENT]
Lingyu Zhang, Zhengran Ji, Boyuan Chen · 4 days ago
With the increasing deployment of artificial intelligence (AI) technologies, the potential of humans working with AI agents has been growing at a great speed. Human-AI teaming is an important paradigm for studying various aspects of how humans and AI agents work together. The unique aspect of Human-AI teaming research is the need to jointly study humans and AI agents, demanding multidisciplinary research efforts from machine learning to human-computer interaction, robotics, cognitive science, neuroscience, psychology, social science, and complex systems. However, existing platforms for Human-AI teaming research are limited, often supporting oversimplified scenarios and a single task, or specifically focusing on either human-teaming research or multi-agent AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming research and engage collaborations from multiple scientific disciplines, with a strong emphasis on human involvement. It includes pre-built tasks for cognitive studies and Human-AI teaming, with expandable potential from our modular design. Following conventional cognitive neuroscience research, CREW also supports multimodal human physiological signal recording for behavior analysis. Moreover, CREW benchmarks real-time human-guided reinforcement learning agents using state-of-the-art algorithms and well-tuned baselines. With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark.

GENERALIZED OUT-OF-DISTRIBUTION DETECTION AND BEYOND IN VISION LANGUAGE MODEL ERA: A SURVEY [MONITORING]
Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang · 4 days ago
Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present generalized OOD detection v2, encapsulating the evolution of AD, ND, OSR, OOD detection, and OD in the VLM era. Our framework reveals that, with some fields becoming inactive or integrated, the demanding challenges have become OOD detection and AD. In addition, we highlight the significant shift in definitions, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection, including a discussion of other related tasks to clarify their relationship to OOD detection. Finally, we explore the advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude this survey with open challenges and future directions. (A minimal zero-shot CLIP scoring sketch appears after the final entry below.)

SAFETYWASHING: DO AI SAFETY BENCHMARKS ACTUALLY MEASURE SAFETY PROGRESS? [ALIGNMENT]
Richard Ren, Steven Basart, Adam Khoja · 4 days ago
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling "safetywashing", where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
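The correlation analysis this abstract describes can be illustrated with a toy computation; the numbers below are invented placeholders, not the paper's models or results.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-model scores; the paper's models, benchmarks, and values differ.
capability_score = np.array([35.0, 48.0, 55.0, 62.0, 71.0, 80.0])  # e.g., an aggregate capabilities score
safety_benchmark = np.array([41.0, 50.0, 58.0, 60.0, 69.0, 77.0])  # e.g., a nominal "safety" benchmark

rho, p_value = spearmanr(capability_score, safety_benchmark)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near 1 suggests the benchmark largely tracks general capability,
# which is the "safetywashing" signature the paper warns about.
```

A benchmark that measured something genuinely separable from capability would show a much weaker correlation across the same set of models, which is the empirical criterion the abstract proposes.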
BETWEEN THE AI AND ME: ANALYSING LISTENERS' PERSPECTIVES ON AI- AND HUMAN-COMPOSED PROGRESSIVE METAL MUSIC [ALIGNMENT]
Pedro Sarmento, Jackson Loth, Mathieu Barthet · 5 days ago
Generative AI models have recently blossomed, significantly impacting artistic and musical traditions. Research investigating how humans interact with and perceive these models is therefore crucial. Through a listening and reflection study, we explore participants' perspectives on AI- vs human-generated progressive metal, in symbolic format, using rock music as a control group. AI-generated examples were produced by ProgGP, a Transformer-based model. We propose a mixed-methods approach to assess the effects of generation type (human vs. AI), genre (progressive metal vs. rock), and curation process (random vs. cherry-picked). This combines quantitative feedback on genre congruence, preference, creativity, consistency, playability, humanness, and repeatability with qualitative feedback to provide insights into listeners' experiences. A total of 32 progressive metal fans completed the study. Our findings validate the use of fine-tuning to achieve genre-specific specialization in AI music generation, as listeners could distinguish between AI-generated rock and progressive metal. Despite some AI-generated excerpts receiving ratings similar to human music, listeners exhibited a preference for human compositions. Thematic analysis identified key features for genre and AI vs. human distinctions. Finally, we consider the ethical implications of our work in promoting musical data diversity within MIR research by focusing on an under-explored genre.
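Returning to the OOD detection survey above: the CLIP-era zero-shot scoring it discusses can be sketched with a maximum-softmax score over class prompts. This is a hedged illustration in that spirit, not any specific method from the survey; it assumes the openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library, and the class list and threshold are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative in-distribution label set; replace with the deployment's classes.
ID_CLASSES = ["airplane", "automobile", "bird", "cat", "dog"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ood_score(image: Image.Image) -> float:
    """Return 1 - max softmax over class prompts; higher means more OOD-like."""
    prompts = [f"a photo of a {c}" for c in ID_CLASSES]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_classes)
    max_prob = logits.softmax(dim=-1).max().item()
    return 1.0 - max_prob

# Example decision rule with a placeholder threshold chosen on validation data:
# is_ood = ood_score(my_image) > 0.5
```

Because no task-specific training is involved, this kind of scoring is exactly where the survey notes that VLMs blur the old boundaries between OOD detection and its neighboring problems.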