{"id":171,"date":"2024-05-26T07:22:45","date_gmt":"2024-05-26T07:22:45","guid":{"rendered":"https:\/\/wsw-int.de\/?p=171"},"modified":"2024-10-25T15:25:14","modified_gmt":"2024-10-25T15:25:14","slug":"interpretable-features","status":"publish","type":"post","link":"https:\/\/multai.eu\/de\/interpretable-features\/","title":{"rendered":"Interpretable Features"},"content":{"rendered":"<p>A team at \ud835\udc00\ud835\udc27\ud835\udc2d\ud835\udc21\ud835\udc2b\ud835\udc28\ud835\udc29\ud835\udc22\ud835\udc1c, creator of the Claude models, published a <a href=\"https:\/\/transformer-circuits.pub\/2024\/scaling-monosemanticity\/index.html\">paper<\/a> about extracting \ud835\udc22\ud835\udc27\ud835\udc2d\ud835\udc1e\ud835\udc2b\ud835\udc29\ud835\udc2b\ud835\udc1e\ud835\udc2d\ud835\udc1a\ud835\udc1b\ud835\udc25\ud835\udc1e \ud835\udc1f\ud835\udc1e\ud835\udc1a\ud835\udc2d\ud835\udc2e\ud835\udc2b\ud835\udc1e\ud835\udc2c from Claude 3 Sonnet. This is achieved by training a sparse autoencoder on the activations of a layer halfway through the model. An autoencoder is a neural network that learns to encode input data, here the activations of a middle layer of Claude, into an intermediate vector representation and then decode it back to the original input. 
In a sparse autoencoder, a sparsity penalty is added to the loss function, encouraging most units in the representation to remain inactive for any given input, which helps each unit capture a single, interpretable feature.<\/p>\n\n\n\n<p>It turns out that these features range from very \ud835\udc1c\ud835\udc28\ud835\udc27\ud835\udc1c\ud835\udc2b\ud835\udc1e\ud835\udc2d\ud835\udc1e, e.g., \u2018Golden Gate Bridge,\u2019 to highly \ud835\udc1a\ud835\udc1b\ud835\udc2c\ud835\udc2d\ud835\udc2b\ud835\udc1a\ud835\udc1c\ud835\udc2d \ud835\udc1a\ud835\udc27\ud835\udc1d \ud835\udc1c\ud835\udc28\ud835\udc27\ud835\udc1c\ud835\udc1e\ud835\udc29\ud835\udc2d\ud835\udc2e\ud835\udc1a\ud835\udc25, such as \u2018code error\u2019 or \u2018inner conflict.\u2019 To get a feeling for the quality of these features, it is illuminating to look at their nearest neighbors (features with similar representation vectors), as shown in Figure 1. The paper contains a link to an interactive tool for more such examples.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"973\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png\" alt=\"\" class=\"wp-image-172\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png 1024w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/conflict-300x285.png 300w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/conflict-768x730.png 768w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/conflict-1536x1460.png 1536w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/conflict.png 1732w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Nearest neighbors of the &#8216;inner conflict&#8217; feature. 
Image from the paper.<\/figcaption><\/figure>\n\n\n\n<p>Moreover, the features are <a href=\"https:\/\/wsw-int.de\/from-language-models-to-multimodal-models\">\ud835\udc26\ud835\udc2e\ud835\udc25\ud835\udc2d\ud835\udc22\ud835\udc26\ud835\udc28\ud835\udc1d\ud835\udc1a\ud835\udc25<\/a>: the \u2018Golden Gate Bridge\u2019 feature is activated regardless of whether the input is an image or text. The features also carry across \ud835\udc25\ud835\udc1a\ud835\udc27\ud835\udc20\ud835\udc2e\ud835\udc1a\ud835\udc20\ud835\udc1e\ud835\udc2c: the \u2018tourist attraction\u2019 feature (see Figure 2) is activated when the model sees either \u2018tour Eiffel\u2019 in French or \u91d1\u95e8\u5927\u6865 (Golden Gate Bridge in Mandarin).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"580\" height=\"393\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-26-091710.png\" alt=\"\" class=\"wp-image-173\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-26-091710.png 580w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-26-091710-300x203.png 300w\" sizes=\"(max-width: 580px) 100vw, 580px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Inputs activating the &#8216;tourist attraction&#8217; feature. Image from the paper.<\/figcaption><\/figure>\n\n\n\n<p>The experiments on influencing the model\u2019s behavior, called \ud835\udc1f\ud835\udc1e\ud835\udc1a\ud835\udc2d\ud835\udc2e\ud835\udc2b\ud835\udc1e \ud835\udc2c\ud835\udc2d\ud835\udc1e\ud835\udc1e\ud835\udc2b\ud835\udc22\ud835\udc27\ud835\udc20, make for fascinating reading. When clamping (i.e. 
manually setting in the model) the \u2018transit infrastructure\u2019 feature to five times its maximum value, the model will send you across a bridge when you ask it for directions, where it otherwise would not have.<\/p>\n\n\n\n<p>At this point, you might be wondering whether these insights could be applied to increase models\u2019 \ud835\udc2c\ud835\udc1a\ud835\udc1f\ud835\udc1e\ud835\udc2d\ud835\udc32. Indeed, the paper reports finding features related to, e.g., unsafe code, bias, sycophancy (I had to look this one up: flattering or excessively praising someone to gain favor or advantage), deception and power-seeking, and dangerous or criminal information. Could feature steering help steer models\u2019 answers in favorable ways? The authors caution against high expectations, but I believe this research direction has sufficient potential to warrant further exploration.<\/p>\n\n\n\n<p><strong><a href=\"https:\/\/multai.eu\/de\/\">MultAI.eu<\/a><\/strong> offers safe and easy access to Anthropic models, as well as to OpenAI&#8217;s, Google&#8217;s, and others&#8217; models.<\/p>","protected":false},"excerpt":{"rendered":"<p>A team at \ud835\udc00\ud835\udc27\ud835\udc2d\ud835\udc21\ud835\udc2b\ud835\udc28\ud835\udc29\ud835\udc22\ud835\udc1c, creator of the Claude models, published a paper about extracting \ud835\udc22\ud835\udc27\ud835\udc2d\ud835\udc1e\ud835\udc2b\ud835\udc29\ud835\udc2b\ud835\udc1e\ud835\udc2d\ud835\udc1a\ud835\udc1b\ud835\udc25\ud835\udc1e \ud835\udc1f\ud835\udc1e\ud835\udc1a\ud835\udc2d\ud835\udc2e\ud835\udc2b\ud835\udc1e\ud835\udc2c from Claude 3 Sonnet. This is achieved by placing a sparse autoencoder halfway through the model and then training it. 
An autoencoder is a neural network that learns to encode input data, here a middle layer of Claude, into [&hellip;]<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[51,52,53,10,48,27],"class_list":["post-171","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-anthropic","tag-feature-steering","tag-interpretable-features","tag-llm","tag-multimodal","tag-safety"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Interpretable Features - MultAI<\/title>\n<meta name=\"description\" content=\"Anthropic team uses sparse autoencoders to detect features in their Claude model. These features turn out to be highly abstract, multimodal and multilingual. Steering these features permits changing the model&#039;s behavior.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/multai.eu\/de\/interpretable-features\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Interpretable Features - MultAI\" \/>\n<meta property=\"og:description\" content=\"Anthropic team uses sparse autoencoders to detect features in their Claude model. These features turn out to be highly abstract, multimodal and multilingual. 
Steering these features permits changing the model&#039;s behavior.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/multai.eu\/de\/interpretable-features\/\" \/>\n<meta property=\"og:site_name\" content=\"MultAI\" \/>\n<meta property=\"article:published_time\" content=\"2024-05-26T07:22:45+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-25T15:25:14+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png\" \/>\n<meta name=\"author\" content=\"hans\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hans\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3\u00a0minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/multai.eu\/interpretable-features\/\"},\"author\":{\"name\":\"hans\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292\"},\"headline\":\"Interpretable Features\",\"datePublished\":\"2024-05-26T07:22:45+00:00\",\"dateModified\":\"2024-10-25T15:25:14+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/multai.eu\/interpretable-features\/\"},\"wordCount\":384,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/multai.eu\/#organization\"},\"image\":{\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png\",\"keywords\":[\"Anthropic\",\"feature steering\",\"interpretable 
features\",\"LLM\",\"multimodal\",\"safety\"],\"articleSection\":[\"Uncategorized\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/multai.eu\/interpretable-features\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/multai.eu\/interpretable-features\/\",\"url\":\"https:\/\/multai.eu\/interpretable-features\/\",\"name\":\"Interpretable Features - MultAI\",\"isPartOf\":{\"@id\":\"https:\/\/multai.eu\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png\",\"datePublished\":\"2024-05-26T07:22:45+00:00\",\"dateModified\":\"2024-10-25T15:25:14+00:00\",\"description\":\"Anthropic team uses sparse autoencoders to detect features in their Claude model. These features turn out to be highly abstract, multimodal and multilingual. 
Steering these features permits changing the model's behavior.\",\"breadcrumb\":{\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/multai.eu\/interpretable-features\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#primaryimage\",\"url\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png\",\"contentUrl\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/multai.eu\/interpretable-features\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/multai.eu\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Interpretable Features\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/multai.eu\/#website\",\"url\":\"https:\/\/multai.eu\/\",\"name\":\"WSW\",\"description\":\"Generative AI for your business\",\"publisher\":{\"@id\":\"https:\/\/multai.eu\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/multai.eu\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/multai.eu\/#organization\",\"name\":\"WSW\",\"alternateName\":\"MultAI\",\"url\":\"https:\/\/multai.eu\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png\",\"contentUrl\":\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png\",\"width\":225,\"height\":244,\"caption\":\"WSW\"},\"image\":{\"@id\":\"https:\/\/multai.eu\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292\",\"name\":\"hans\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g\",\"caption\":\"hans\"},\"sameAs\":[\"https:\/\/wsw-int.de\"],\"url\":\"https:\/\/multai.eu\/de\/author\/hans\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Interpretable Features - MultAI","description":"Anthropic team uses sparse autoencoders to detect features in their Claude model. These features turn out to be highly abstract, multimodal and multilingual. Steering these features permits changing the model's behavior.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/multai.eu\/de\/interpretable-features\/","og_locale":"de_DE","og_type":"article","og_title":"Interpretable Features - MultAI","og_description":"Anthropic team uses sparse autoencoders to detect features in their Claude model. 
These features turn out to be highly abstract, multimodal and multilingual. Steering these features permits changing the model's behavior.","og_url":"https:\/\/multai.eu\/de\/interpretable-features\/","og_site_name":"MultAI","article_published_time":"2024-05-26T07:22:45+00:00","article_modified_time":"2024-10-25T15:25:14+00:00","og_image":[{"url":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png"}],"author":"hans","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hans","Est. reading time":"3\u00a0minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/multai.eu\/interpretable-features\/#article","isPartOf":{"@id":"https:\/\/multai.eu\/interpretable-features\/"},"author":{"name":"hans","@id":"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292"},"headline":"Interpretable Features","datePublished":"2024-05-26T07:22:45+00:00","dateModified":"2024-10-25T15:25:14+00:00","mainEntityOfPage":{"@id":"https:\/\/multai.eu\/interpretable-features\/"},"wordCount":384,"commentCount":0,"publisher":{"@id":"https:\/\/multai.eu\/#organization"},"image":{"@id":"https:\/\/multai.eu\/interpretable-features\/#primaryimage"},"thumbnailUrl":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png","keywords":["Anthropic","feature steering","interpretable features","LLM","multimodal","safety"],"articleSection":["Uncategorized"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/multai.eu\/interpretable-features\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/multai.eu\/interpretable-features\/","url":"https:\/\/multai.eu\/interpretable-features\/","name":"Interpretable Features - 
MultAI","isPartOf":{"@id":"https:\/\/multai.eu\/#website"},"primaryImageOfPage":{"@id":"https:\/\/multai.eu\/interpretable-features\/#primaryimage"},"image":{"@id":"https:\/\/multai.eu\/interpretable-features\/#primaryimage"},"thumbnailUrl":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png","datePublished":"2024-05-26T07:22:45+00:00","dateModified":"2024-10-25T15:25:14+00:00","description":"Anthropic team uses sparse autoencoders to detect features in their Claude model. These features turn out to be highly abstract, multimodal and multilingual. Steering these features permits changing the model's behavior.","breadcrumb":{"@id":"https:\/\/multai.eu\/interpretable-features\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/multai.eu\/interpretable-features\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/multai.eu\/interpretable-features\/#primaryimage","url":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png","contentUrl":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/05\/conflict-1024x973.png"},{"@type":"BreadcrumbList","@id":"https:\/\/multai.eu\/interpretable-features\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/multai.eu\/"},{"@type":"ListItem","position":2,"name":"Interpretable Features"}]},{"@type":"WebSite","@id":"https:\/\/multai.eu\/#website","url":"https:\/\/multai.eu\/","name":"WSW","description":"Generative AI for your business","publisher":{"@id":"https:\/\/multai.eu\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/multai.eu\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/multai.eu\/#organization","name":"WSW","alternateName":"MultAI","url":"https:\/\/multai.eu\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/multai.eu\/#\/schema\/logo\/image\/","url":"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png","contentUrl":"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png","width":225,"height":244,"caption":"WSW"},"image":{"@id":"https:\/\/multai.eu\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292","name":"hans","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/multai.eu\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g","caption":"hans"},"sameAs":["https:\/\/wsw-int.de"],"url":"https:\/\/multai.eu\/de\/author\/hans\/"}]}},"_links":{"self":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts\/171","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/comments?post=171"}],"version-history":[{"count":2,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts\/171\/revisions"}],"predecessor-version":[{"id":1437,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts\/171\/revisions\/1437"}],"wp:attachment":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/media?parent=171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/categories?post=17
1"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/tags?post=171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}