Interpretable Features
A team at Anthropic, creator of the Claude models, published a paper ("Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet") about extracting interpretable features from Claude 3 Sonnet. This was achieved by training a sparse autoencoder on the activations of a layer halfway through the model. An autoencoder is a neural network that learns to encode input data, here the activations of that middle layer, into an intermediate representation and then decode that representation back into the original input. In the sparse variant, the intermediate representation is larger than the input but constrained so that only a few of its entries are active for any given input; those few active entries are the candidate interpretable features.
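To make the mechanics concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. Everything in it is illustrative rather than taken from the paper: the layer sizes, the L1 sparsity penalty and its weight, and the random tensor standing in for a batch of middle-layer activations are all assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder for dictionary learning on activations.
    Dimensions and training details here are illustrative, not the paper's."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder projects activations up into a larger feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the original activations from the features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty in the
        # loss below pushes most of them to zero, so each input activates
        # only a handful of features.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


# Hypothetical training step: reconstruct the activations faithfully while
# keeping the feature vector sparse.
d_model, d_features = 512, 4096            # illustrative sizes
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coefficient = 1e-3                      # assumed sparsity weight

activations = torch.randn(64, d_model)     # stand-in for middle-layer activations
optimizer.zero_grad()
reconstruction, features = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() \
       + l1_coefficient * features.abs().mean()
loss.backward()
optimizer.step()
```

The L1 penalty is the key design choice: it trades a little reconstruction accuracy for sparsity, so that each input is explained by a small set of features that one can then try to interpret individually.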