Open Source AI

By Matt Stroud

Has ‘explainability’ just killed Open Source AI?

Despite the stellar progress in the capabilities of LLMs like ChatGPT, Claude, and Gemini, researchers still don’t really know how they work. Last month, a shaft of light was cast into these black boxes with the publication of a research paper by Anthropic: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (transformer-circuits.pub).

In the paper, they describe a technique for identifying patterns of activation within the LLM, which they call ‘features’, that correspond to particular concepts. For instance, one feature relates to the concept of the ‘Golden Gate Bridge’, while others capture higher-level concepts such as ‘anger’ or ‘honesty’.
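To make the idea concrete, here is a minimal, illustrative sketch of the dictionary-learning approach the paper builds on: a sparse autoencoder trained on an LLM’s internal activations, so that each learned ‘feature’ direction tends to fire for a single interpretable concept. The model sizes, loss weights, and training loop below are placeholders, not the paper’s actual setup.

```python
# Sketch of dictionary learning via a sparse autoencoder (illustrative sizes only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # feature coefficients -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # non-negative, mostly-zero feature activations
        recon = self.decoder(features)
        return recon, features

def train_step(sae, acts, optimizer, l1_coeff=1e-3):
    recon, features = sae(acts)
    # Reconstruction loss keeps the dictionary faithful to the model's activations;
    # the L1 penalty pushes most feature activations to zero (sparsity), which is
    # what makes individual features tend to line up with single concepts.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: random tensors stand in for a real LLM's hidden states.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(64, 512)
print(train_step(sae, batch, opt))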

Further, these insights enabled the researchers to steer the models themselves. By amplifying the activations of the features associated with specific concepts during generation, they could increase the prominence of those concepts in the LLM’s subsequent answers: making the model more ‘honest’, for instance, or so fixated on the ‘Golden Gate Bridge’ that it mentioned the bridge at every opportunity.
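Roughly, the steering step looks like the sketch below: take the direction a feature points in and add a scaled copy of it to a layer’s hidden states while the model generates text. The model object, layer index, and steering strength are hypothetical placeholders; this is a simplified PyTorch illustration of the technique, not Anthropic’s code.

```python
# Sketch of inference-time feature steering via a forward hook (placeholders throughout).
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    direction = direction / direction.norm()        # unit-length concept direction
    def hook(module, inputs, output):
        # output is the layer's hidden states; nudge every position toward the concept.
        return output + strength * direction
    return hook

# Usage sketch (assumes `model` is a PyTorch transformer whose blocks return
# hidden-state tensors of the same width as `concept_direction`):
# handle = model.layers[20].register_forward_hook(make_steering_hook(concept_direction, 8.0))
# ... generate text: the concept now surfaces far more often ...
# handle.remove()
```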

The stark fact is that these techniques allow LLMs to be manipulated in ways that will be very hard to detect after the fact. As we look forward a year or two to an era when we each have our own personal AIs, it will be imperative that those models are entirely within our control. That means being able to trust a model’s provenance and to be confident it has not been manipulated before it came into our possession. This raises profound questions for the open-source AI community, which, by its nature, may struggle to offer such assurances.
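As an illustration of how thin today’s assurances are, the sketch below checks a downloaded weights file against a publisher’s checksum; the file name and digest are placeholders. It proves you received the exact bytes the publisher released, but says nothing about whether those weights were steered before release.

```python
# Sketch of a provenance check: verify downloaded weights against a published SHA-256 digest.
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage sketch (path and digest are placeholders):
# published_digest = "<digest published by the model's maintainers>"
# if sha256_of_file("model.safetensors") != published_digest:
#     raise ValueError("weights do not match the published checksum")
```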
