A team at Anthropic, creator of the Claude models, published a paper about extracting interpretable features from Claude 3 Sonnet. This is achieved by training a sparse autoencoder on the activations of a layer halfway through the model. An autoencoder is a neural network that learns to encode its input, here the activations of a middle layer of Claude, into a vector representation and then decode that representation back to the original input. Classic autoencoders compress their input; the one used here is instead wider than the layer it reads from. In a sparse autoencoder, a sparsity penalty is added to the loss function, encouraging most units in the representation to remain inactive for any given input, so that each active unit tends to capture one recognizable feature.
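To make the encode/decode idea concrete, here is a minimal sketch in plain Python. It assumes a ReLU encoder and a plain L1 sparsity penalty, which is a common textbook form and not necessarily the exact loss used in the paper. The function names and the tiny hand-picked weights are made up for illustration; a real sparse autoencoder learns its weights by minimizing this kind of loss over many activations.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x into sparse features f, then reconstruct x_hat."""
    pre = [z + b for z, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU: negative pre-activations become inactive
    x_hat = [z + b for z, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = l1_coeff * sum(abs(a) for a in f)
    return recon + sparsity

# Toy, hand-picked weights (hypothetical): 2-dimensional input, 4 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_DEC = [0.0, 0.0]

x = [1.0, -2.0]
f, x_hat = sae_forward(x, W_ENC, B_ENC, W_DEC, B_DEC)
print(f)      # [1.0, 0.0, 0.0, 2.0]  -> only two of four features are active
print(x_hat)  # [1.0, -2.0]           -> the input is reconstructed
```

In this toy case the representation has more units than the input, yet only half of them fire; the L1 term in `sae_loss` is what pushes training toward such sparse solutions.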
It turns out that these features range from very concrete, e.g., “Golden Gate Bridge,” to highly abstract and conceptual, such as “code error” or “inner conflict.” To get a feeling for the quality of these features, it is illuminating to look at their nearest neighbors (features with similar representation vectors), as shown in Figure 1. The paper contains a link to an interactive tool for more such examples.
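A nearest-neighbor lookup of this kind can be sketched in a few lines. The example below assumes cosine similarity as the closeness measure, and the three-dimensional vectors and feature names are invented toy values, not numbers from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(name, vectors, k=2):
    """Return the k feature names whose vectors are most similar to `name`."""
    query = vectors[name]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != name]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]

# Hypothetical toy feature vectors, for illustration only.
toy_features = {
    "Golden Gate Bridge": [1.0, 0.9, 0.0],
    "San Francisco": [0.9, 1.0, 0.1],
    "Alcatraz Island": [0.7, 1.0, 0.2],
    "code error": [0.0, 0.1, 1.0],
}

print(nearest_neighbors("Golden Gate Bridge", toy_features))
# ['San Francisco', 'Alcatraz Island']
```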
Moreover, the features are multimodal: the “Golden Gate Bridge” feature activates regardless of whether the input is an image or text. The features also carry across languages: the “tourist attraction” feature (see Figure 2) activates when the model sees either “tour Eiffel” in French or 金门大桥 (Golden Gate Bridge in Mandarin).
The experiments on influencing the model’s behavior, called feature steering, make for fascinating reading. When the “transit infrastructure” feature is clamped (i.e., manually set inside the model) to five times its maximum value, the model will send you across a bridge when you ask for directions, where otherwise it wouldn’t have.
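Clamping can be pictured as follows: encode the layer's activations into features, overwrite one feature with a fixed value, decode again, and pass the result on to the rest of the model. The sketch below is a toy version under stated assumptions: the weights, the feature index, and the maximum activation of 2.0 are all hypothetical, and adding the reconstruction error back (so that everything the autoencoder does not explain stays unchanged) is my reading of how such an intervention is typically wired up, not a confirmed detail of the authors' code.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def encode(x, W_enc, b_enc):
    """ReLU encoder: activations -> sparse features."""
    return [max(0.0, z + b) for z, b in zip(matvec(W_enc, x), b_enc)]

def decode(f, W_dec, b_dec):
    """Linear decoder: features -> reconstructed activations."""
    return [z + b for z, b in zip(matvec(W_dec, f), b_dec)]

def steer(x, feature_index, clamp_value, W_enc, b_enc, W_dec, b_dec):
    """Return activations with one feature clamped to `clamp_value`."""
    f = encode(x, W_enc, b_enc)
    # Part of x the autoencoder fails to reconstruct; kept untouched.
    error = [a - b for a, b in zip(x, decode(f, W_dec, b_dec))]
    f_clamped = list(f)
    f_clamped[feature_index] = clamp_value
    return [a + e for a, e in zip(decode(f_clamped, W_dec, b_dec), error)]

# Toy, hand-picked weights (hypothetical): 2-dimensional input, 4 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_DEC = [0.0, 0.0]

MAX_ACTIVATION = 2.0  # assumed maximum observed value of feature 1
x = [1.0, -2.0]
steered = steer(x, 1, 5 * MAX_ACTIVATION, W_ENC, B_ENC, W_DEC, B_DEC)
print(steered)  # [1.0, 8.0]
```

The first coordinate is untouched, while the second is pulled strongly in the direction that feature 1 writes to, which is the toy analogue of the model suddenly bringing up bridges.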
At this point, you might be wondering whether these insights could be applied to increase models’ safety. Indeed, the paper reports finding features related to, e.g., unsafe code, bias, sycophancy (I had to look this one up: the behavior of flattering or excessively praising someone to gain favor or advantage), deception and power-seeking, and dangerous or criminal information. Could feature steering help steer models’ answers in favorable ways? The authors caution against high expectations, but I believe this research direction has sufficient potential to warrant further exploration.