Apart Updates: Growth Edition!

Presenting Recent Lab Publications, Hackathon Lineup, Career Opportunities, and Leadership News

🤗 Welcome to this update from the Apart community! This issue's highlights include:

  • 👩‍🔬 Publications from Apart and welcoming our new Lab fellows

  • 📅 Upcoming sprints in the Flagging AI Risks Sprint Season

  • 💚 Job openings at Apart

    • Head of Community and Events

  • 🌄 Thoughts on AI Safety in a for-profit context

  • 🙋‍♂️ New Co-Director: Jason Hoelscher-Obermaier

Since our last Apart community update in January, we've seen experts raising their estimates of the probability of an intelligence explosion, video models improving remarkably, NVIDIA introducing the Blackwell GPU at GTC 2024, DeepMind defining AGI, the Claude 3 and Llama 3 releases, software engineers being automated, and robots becoming more capable.

These impressive changes also mean that our mission of ensuring the safety of AI systems is more important than ever, and we're excited to share what the community has been working on for the past three months! Read on ;)

Apart Updates is your portal to the community's research and events. Subscribe to follow our updates.

👩‍🔬 Research Update

Published Papers from the Lab!

  • Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions (Neo et al., 2024)
    This paper explores how attention heads work with specific MLP neurons to predict the next token. It was written by Clement Neo (Apart Lab Fellow and now a research assistant) in collaboration with Shay B. Cohen and Fazl Barez.

  • Understanding Addition in Transformers (Quirke & Barez, 2024)
    In this work, we thoroughly reverse-engineer a one-layer transformer model trained to do n-digit addition. This paper was written by Philip Quirke (Apart Lab Fellow), who was advised by Fazl Barez.

    • Increasing Trust in Language Models through the Reuse of Verified Circuits (Quirke et al., 2024)
      In this follow-up work, Philip Quirke and Clement Neo (advised by Fazl Barez) extend their toy model setting to investigate how circuits can be reused across models of different sizes and different tasks.

  • Large Language Models Relearn Removed Concepts (Lo & Barez, 2024)
    In this work, we find that models relearn deliberately removed concepts faster than a baseline. Michelle Lo, an Apart Lab Fellow, published this research in collaboration with Shay B. Cohen and Fazl Barez and won the best paper/poster award at the Technical AI Safety Conference (TAIS) in Tokyo. Read more on the paper's website.

  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al., 2024)
    This paper from Anthropic highlights how difficult it is to remove deception from a model using standard safety training. The work was covered in Nature, among many other publications. Fazl Barez and Clement Neo from the core Apart team both contributed to this paper.

  • Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models (Marks et al., 2024)
    Marks, Abdullah, et al. use sparse autoencoders to understand how LLMs learn during RLHF fine-tuning.

Publicity from the Lab

  • Evan Anders, our Apart Lab Fellow and a postdoc at KITP, published a blog post showing that sparse autoencoders find composed features in small toy models

  • Our Apart Lab Fellow, Jacob Haimes, hosts a great podcast, Into AI Safety, and has released two episodes about his experience participating in the Apart Evals Hackathon (1, 2) and one on writing grant applications

  • Juan Pablo Rivera gave a HackTalk on their hackathon project that was published at the multi-agent security workshop at NeurIPS '23

  • Christian Schroeder de Witt gave a keynote at our multi-agent security hackathon uncovering concrete paths to the security of multi-agent systems

  • Paul Bricman of Straumli gave a HackTalk on exciting and concrete technical projects for AI governance

  • You can also check out our recent re-uploads of previous livestreams on our YouTube channel

Besides these exciting research talks and findings, our co-director Esben also spoke at the Technical AI Safety Conference in Tokyo and for the Equiano Institute's African AI governance fellowship. Jason presented Apart's work at FAR Labs, and our research advisor Fazl presented posters in both Tokyo and Malta. Apart also sponsored and helped organize the Scale-LLM workshop at EACL.

During the coming weeks, you will be able to find us presenting work at the Foresight AGI security workshop in mid-May in San Francisco and at EAGx Nordics this coming weekend.

Welcoming New Research Teams into Apart

We're excited to welcome two new cohorts into our research accelerator, the Apart Lab. From January's cohort 4, some teams are already aiming to submit to NeurIPS in a month. Exciting!

Beyond our existing fellows, 35 researchers across cohorts 4 and 5 are now actively engaged with impactful research questions in empirical AI safety, especially model evaluations, multi-agent security, and AI benchmarking.

We're excited to welcome fellows from more than 10 different countries, ranging from undergraduate students to postdocs and senior engineering managers. We're looking forward to supporting their growth and their investigation of highly impactful research questions 🥳

Highlighting Sprint Winners and Results

Over the past few months, we have hosted the global AI governance sprint, the multi-agent security sprint in Berkeley, and the METR Code Red hackathon.

To avoid evaluation data ending up in training datasets, we cannot share any concrete results from the Code Red hackathon. However, it was a massive success, with over $28,000 in prizes awarded to more than 116 participants for their high-quality ideas, specifications, and projects. We'll have a post live soon describing more of the results from this exciting hackathon. Until then, you can watch Beth Barnes' keynote introducing METR's task-based evaluation framework!

In the meantime, let's highlight a few exciting projects from our other hackathons (which we highly encourage you to check out!):

  • πŸ† Jin Suk Park et al. explores DarkGPT and writes evaluations for dark patterns in chat AI software to uncover anti-user model development decisions from AGI companies driven by misaligned corporate incentives (Github)

  • πŸ† Lutz and Duri investigates the dispersion of information in multi-agent networks of agents in Fishing for the Answer

  • πŸ† Markov writes Obsolescent Souls, an interesting science fiction narrative about a future where AI is safe but society slowly loses autonomy

  • πŸ† Segerie and Feuillade-Montixi make an overview of ways we might be able to effectively box AI in real-world scenarios

We're excited to introduce our Sprint Seasons! These are periods of the year during which we host hackathons focused on a specific agenda in AI safety, in collaboration with top-tier labs, to make significant progress on that agenda's key problems.

And today we welcome you to the Flagging AI Risks Sprint Season, where we're evaluating, benchmarking, and detecting critically high-risk AI models from March through June 2024.

Besides participating in our research sprints, you're also very welcome to become part of our international group of local hackathon organizers who host hackathon locations for the sprints. Read more.

[finished] Code Red Hackathon ($28,000 in prizes 🤯)

In March, we collaborated with METR and more than 150 participants to develop tasks for testing language models on critical capabilities. Some of these projects will directly contribute to the task suite that METR is developing with the UK AI Safety Institute, among others. Since these tasks should not end up in any future training data, we're not able to share the projects, but the lightning talks were super engaging, with a high potential for impact, and we'll have a summary of the hackathon on our blog soon!

Join us to demonstrate concrete risks to democracy from AGI and current AI systems. We'll be looking to project these demonstrations into the future and propose concrete risk mitigation ideas. You will have the chance to win part of the 🏆 $2,000 prize pool and interact with top researchers in risk demonstration research from Anthropic, the Simon Institute for Longterm Governance, and MATS. Read more and sign up!

Join us at the end of May for our hackathon inspired by the SafeBench competition, where you'll be hacking away on benchmarks and evaluation methods to assess the robustness, transparency, and alignment of AI systems. See concrete examples and sign up here.

In June, we'll be working to tackle one of the most pivotal problems for future AI risk: deception! You'll hear from researchers and experts in this area and hack towards more robust ways to evaluate and detect it. See existing projects and sign up here.

Computational Mechanics Hackathon (May 18-20)

For anyone missing interpretability, we welcome you to join our SprintX organized by PIBBSS and an independent research team. It follows up on fascinating results from applying computational mechanics, a physics field from the 1980s, to the latent space of models during training. Read more and sign up to join.

🤗 Become part of the Apart Core Team!

We're incredibly grateful to our recent funders, the Long-Term Future Fund and Scott Alexander, for supporting our mission to mitigate AI risk through global technical research hackathons, workshops, and fellowships. As a result, we're looking to grow the team, and we invite ambitious and enthusiastic individuals to apply for our new roles!
Read more about working at Apart and browse our open roles.

Head of Community and Events

We're looking for a talented and ambitious individual with experience in event organization, facilitation, and process design to take on an executive role at Apart and take ownership of our sprints and research campus efforts.

This role will be crucial in helping many researchers work on impactful AI safety research over the coming years, and we're excited to share our plans for the future if you end up joining us!

Other positions

During the coming weeks, we are opening up more research positions. Stay updated on our careers page.

👀 For-profit AI safety on the horizon

With AI capabilities advancing at an unprecedented pace, we recently wrote up our thoughts on the need for for-profit AI safety that can scale sustainably to match the development of capabilities.

With concrete examples of how to develop profitable initiatives within the categories of AI security, transparency, alignment, and AI defense, we describe the potential for differential technological development in for-profit AI safety over the coming years.

Read the entire blog post here.

🥳 Welcoming our new co-director at Apart!

Jason Hoelscher-Obermaier started working at Apart in late 2023 as our research lead and has since taken on the position of co-director, with responsibility for accelerating our community's alignment research.

With a PhD in experimental quantum physics and five years of experience at machine learning startups, Jason takes on this important position with a skillset to match! We're incredibly excited to welcome him in this new role.

🧐 Are you interested in our work?

We have big plans for 2024, so if you are excited about what we do, want to help us, or wish to collaborate, share your interest below! You are also welcome to write to us at [email protected].

See you at the next hackathon! 😎