Interpretability, evaluations and critiques of AGI risk research
Apart Newsletter #27
This week, we look at new explorations of feature space, models for analyzing training dynamics, and thoughts from the AGI risk space. We also share a few fellow newsletters starting up in AI safety, along with exciting opportunities in the field.
ML safety research
Pythia (Biderman et al., 2023) is a suite of 8 trained models with parameters ranging from 19 million to 12 billion. The models are released openly to support research on how large models learn, and they come with copies of the weights saved at many points during training. Understanding how these "AI brains" learn is important for finding new avenues for alignment.
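If you want to experiment with the training checkpoints yourself, here is a minimal sketch of one way to do so. It assumes the models and their per-step checkpoints are published on the Hugging Face Hub under names like EleutherAI/pythia-70m with revisions like "step3000" (as the Pythia repository describes); the prompt is purely illustrative.

```python
# Sketch: compare an intermediate training checkpoint with the final model.
# Assumes the Pythia models and their per-step revisions are on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # one of the smaller models in the suite
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load an intermediate checkpoint (revision name assumed) and the final weights.
early_model = AutoModelForCausalLM.from_pretrained(model_name, revision="step3000")
final_model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
for label, model in [("step3000", early_model), ("final", final_model)]:
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    print(label, tokenizer.decode(output[0]))
```

Comparing the same prompt across checkpoints is the simplest way to watch a capability emerge over training.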
A new paper from Redwood Research presents work on localizing neural network behaviors to parts of the network's internal structure (Goldowsky-Dill et al., 2023). They formalize path patching and use it to test and refine hypotheses about behaviors in GPT-2 and other models. You can explore their model behavior search tool yourself.
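Path patching refines the simpler idea of activation patching: run the model on a corrupted prompt, splice in activations from a clean run at one location, and measure how much of the original behavior returns. The sketch below illustrates only that simpler idea, not Redwood's implementation; it assumes the open-source TransformerLens library, and the prompts and metric are illustrative.

```python
# Sketch of activation patching (path patching is a more careful refinement).
# Assumes TransformerLens; prompts and metric are illustrative only.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When Mary and John went to the store, John gave a drink to"
corrupt_prompt = "When Mary and John went to the store, Mary gave a drink to"
answer_token = model.to_single_token(" Mary")

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache activations from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos):
    # Overwrite the residual stream at one position with its clean value.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

# For each layer, run the corrupted prompt but patch in the clean residual
# stream at the final position, and check how much the correct answer recovers.
final_pos = clean_tokens.shape[1] - 1
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(hook_name, lambda resid, hook: patch_resid(resid, hook, final_pos))],
    )
    print(layer, logits[0, -1, answer_token].item())
```

Layers where patching restores the clean answer are candidates for where the behavior lives; path patching goes further by restricting which downstream paths the patched activation is allowed to affect.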
In recent work, Neel Nanda builds on research into Othello-GPT (Li et al., 2023), a model trained to play random legal moves in the board game Othello. A common hypothesis is that the features a network learns are encoded linearly, and Li et al. show that this appears not to hold for the network's internal representation of the board state!
This was poised to flip our understanding of features; however, Nanda (2023) shows that if we re-interpret the board state (in terms of the current player's pieces rather than black and white), it can be extracted with a linear probe, essentially a logistic regression over the activations. With this simple reframing, the world model luckily stays linearly interpretable.
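For intuition, a linear probe of this kind is just a logistic regression trained on hidden activations. The snippet below is a generic illustration with random stand-in data, not Nanda's actual setup; in the real experiment the inputs would be Othello-GPT's residual-stream activations and the labels the reframed board-square states.

```python
# Illustrative linear probe: logistic regression decoding a binary feature
# from hidden activations. The data here is synthetic stand-in data; real use
# would cache Othello-GPT activations and label each board square ("mine"/"theirs").
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, d_model = 2000, 512

# Fake "activations": a hidden feature direction plus noise, tied to the label.
feature_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_samples, d_model)) + 0.5 * np.outer(labels * 2 - 1, feature_direction)

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

If a probe this simple reads the feature out accurately, the feature is, for practical purposes, linearly represented.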
Neel Nanda also joined us in making the interpretability hackathon 2.0 a success this weekend. You can follow the project presentations next Tuesday, but as a short summary, teams worked to:
Identify tipping points in the model's learning (link).
Develop a way to qualitatively inspect many neurons in the Othello-GPT network (link to the tool and the report).
Improve the TransformerLens library (report link and TransformerLens).
Investigate how dropout affects privileged bases (link).
Thoughts from AI risk research
Jan Kulveit and Rose Hadshar describe how the usual proposals for alignment ignore that the systems we are trying to align AI to (humans) are usually not aligned within themselves. This puts several types of alignment proposals on shaky ground.
They also provide an overview of ways to solve this problem, with examples such as aligning with Microsoft instead of humans, taking our preferences about our preferences into account, and using markets.
David Thorstad criticizes some of the extreme AI risk estimates on the grounds that several parts of the risk calculations are backed by little data or argument. This echoes previous criticism from Nuno Sempere and Ben Garfinkel, who respectively highlight issues of estimation and of deference.
An anonymous post critiques one of the largest AI safety non-profit labs, describing issues related to the researchers' experience and conflicts of interest with their grantmakers.
Steven Kaas invites people to ask questions about artificial general intelligence (AGI) safety. The post already has over 100 comments and might be interesting to explore; examples include "how is AGI a risk?" and "is alignment even possible?".
What else?
A newsletter on AI governance and navigating AI risks over the coming century has come out! It focuses on how we can govern the risks posed by transformative artificial intelligence; every two weeks you'll receive long-form thoughts on foundational questions in AI governance along with an overview of what has been happening.
Nonlinear has launched a funders' network for AI safety with over 30 private donors and invites people to send in grant applications before the 17th of May.
The Center for AI Safety has launched a newsletter covering what is happening in AI safety, with its first post published a week ago. They already publish the monthly ML Safety Newsletter, which explores topics in ML safety research.
Opportunities in ML safety
As usual, we thank our friends at aisafety.training and agisf.org/opportunities for mapping out the opportunities available in AI safety. Check them out here:
Submit your perspective on how we should expect AI to develop to Open Philanthropy's Worldview Prize. You can win up to $50,000!
Applications for the RAND Corporation's technology and security policy fellowship, which supports independent research on the governance of AI, close on the 21st of April.
Apply before the 30th of April for an internship at the Krueger Lab. They work on ML safety research and are doing great work in academic outreach.
The same deadline applies to joining the Effective Altruism Global (EAG) London conference happening next month. Apply here.
Thank you for following along and don't forget to share these with your friends interested in alignment research! You can follow both this newsletter and our hackathon updates at news.apartresearch.com.