Publications
-
Field, S.
"Why do Experts Disagree on Existential Risk and P(doom)? A Survey of AI Experts"
Journal of AI and Ethics, 2025.
[arXiv]
-
Kirsch, N., Field, S., Casper, S.
"What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks."
NeurIPS 2024 Workshop on Red Teaming Generative AI.
[arXiv]
-
Yampolskiy, R., Field, S.
"Assessing Controllability through Compliance with Irrational Orders"
Handbook of Human-Centered Artificial Intelligence.
-
Clymer, J., Juang, C., Field, S.
"Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals."
arXiv:2405.05466, 2024.
[arXiv]
✨ SafeBench AI Safety $20,000 Prize Winner
-
Costarelli, A., Allen, M., Field, S.
"Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language."
arXiv:2410.02472, 2024.
[arXiv]
-
Field, S., Krueger, D.
"AI Researchers' Perspectives on Automating AI R&D and Intelligence Explosions."