Laurent Perrinet<p>A perspective on <a href="https://neuromatch.social/tags/chatGPT" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>chatGPT</span></a> (or Large Language Models <a href="https://neuromatch.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>LLMs</span></a> in general): <a href="https://neuromatch.social/tags/Hype" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>Hype</span></a> or milestone?</p><p>Rodney Brooks (<a href="https://spectrum.ieee.org/amp/gpt-4-calm-down-2660261157" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://</span><span class="ellipsis">spectrum.ieee.org/amp/gpt-4-ca</span><span class="invisible">lm-down-2660261157</span></a>) tells us that </p><blockquote><p>What large language models are good at is saying what an answer should <em>sound like</em>, which is different from what an answer should <em>be</em>.</p></blockquote><p>For a nice in-depth technical analysis, see this <a href="https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/" rel="nofollow noopener noreferrer" target="_blank">blog post by Stephen Wolfram</a> (himself!) on "What is ChatGPT Doing ... and Why Does It Work?". It is worth reading, even for non-experts, as a non-trivial effort to make the whole process explainable. The different steps are:</p><ul><li><p><u><a href="https://neuromatch.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>LLMs</span></a> compute probabilities for the next word.</u> To do this, they are trained on huge text datasets to build a function that, given a sequence of words, computes for every word in the dictionary the probability that adding it next is statistically congruent with the preceding words. 
Interestingly, this probability, conditioned on what has been observed so far, falls off as a power law, just like the global probability of words in the dictionary.</p></li><li><p><u>These <a href="https://neuromatch.social/tags/probabilities" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>probabilities</span></a> are computed by a function that is fitted to the dataset to give the best approximation.</u> Wolfram gives a detailed description of how to build such an approximation, from linear regression to models with non-linearities. This leads to deep learning methods and their potential as universal function approximators.</p></li><li><p><u>Crucially, these <a href="https://neuromatch.social/tags/models" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>models</span></a> are trainable, in particular by way of <a href="https://neuromatch.social/tags/backpropagation" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>backpropagation</span></a>.</u> This leads the author to describe the process, but also to point out some limitations of the trained model, especially, as you might have guessed, compared to potentially more powerful systems, like <a href="https://neuromatch.social/tags/cellularautomata" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>cellularautomata</span></a> of course...</p></li><li><p><u>This now brings us to <a href="https://neuromatch.social/tags/embeddings" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>embeddings</span></a>, the crucial ingredient to define "words" in these <a href="https://neuromatch.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>LLMs</span></a>.</u> To relate "alligator" to "crocodile" vs. 
a "vending machine," this technique computes distances between words based on how close together they appear in the large text corpus, so that each word is assigned an address in a high-dimensional space, with the intuition that words that are semantically closer should be closer in the embedding space. It is highly non-trivial to understand the geometry of high-dimensional spaces, especially when we try to relate it to our physical 3D space, but this technique has proven to give excellent results. I highly recommend the <a href="https://neuromatch.social/tags/cemantix" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>cemantix</span></a> puzzle to test your intuition about word embeddings: <a href="https://cemantle.certitudes.org" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://</span><span class="">cemantle.certitudes.org</span><span class="invisible"></span></a></p></li><li><p><u>Finally, these different parts are glued together by a humongous <a href="https://neuromatch.social/tags/transformer" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>transformer</span></a> network.</u> A standard <a href="https://neuromatch.social/tags/NeuralNetwork" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>NeuralNetwork</span></a> could perform a computation to predict the probabilities for the next word, but it would mostly give nonsensical answers... Something more is needed to make this work. Just as traditional Convolutional Neural Networks (<a href="https://neuromatch.social/tags/CNNs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CNNs</span></a>) hardwire the fact that operations applied to an image should be applied to nearby pixels first, transformers do not operate uniformly on the sequence of words (i.e., embeddings), but weight them differently to ultimately get a better approximation. 
It is clear that much of the mechanism is a bunch of heuristics selected based on their performance, but we can understand it as giving different weights to different tokens, specifically based on the position of each token and its importance in the meaning of the current sentence. Based on this calculation, the sequence is reweighted so that a probability is ultimately computed. When applied to a sequence of words where words are added progressively, this creates a kind of loop in which the past sequence is constantly re-processed to update the generation.</p></li><li><p><u>Can we do more and include syntax?</u> Wolfram discusses the internals of <a href="https://neuromatch.social/tags/chatGPT" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>chatGPT</span></a>, and in particular how it is trained to "be a good bot". He then raises another possibility, which is to inject the knowledge that language is organized grammatically, and asks whether <a href="https://neuromatch.social/tags/transformers" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>transformers</span></a> are able to learn such rules. This points to certain limitations of the architecture and the potential of using graphs as a generalization of geometric rules. The post ends with a comparison of <a href="https://neuromatch.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>LLMs</span></a>, which just aim to sound right, with rule-based models, a debate reminiscent of the older days of AI...</p></li></ul>
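<p>To make the steps above a bit more concrete, here is a minimal sketch in plain Python. It is not how ChatGPT or Wolfram's examples are implemented: a simple bigram counter stands in for the neural network, the tiny <code>CORPUS</code> is made up for illustration, and all function names (<code>train_bigram</code>, <code>next_word_probs</code>, <code>generate</code>, <code>attention_weights</code>) are hypothetical. It only illustrates three ideas from the list: estimating the probability of the next word from text, looping to generate a sequence word by word, and turning similarity scores between toy embedding vectors into weights, in the spirit of scaled dot-product attention but without any learned parameters.</p>

```python
import math
from collections import Counter, defaultdict

# Toy corpus, assumed for illustration only; real models use vastly more text.
CORPUS = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug .".split()


def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Return the estimated P(next | prev) as a dict whose values sum to 1."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


def generate(counts, start, n_words):
    """Greedy loop: append the most probable next word, then repeat."""
    sequence = [start]
    for _ in range(n_words):
        probs = next_word_probs(counts, sequence[-1])
        if not probs:  # no known continuation for this word
            break
        sequence.append(max(probs, key=probs.get))
    return sequence


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Weights from scaled dot products between one query and each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


if __name__ == "__main__":
    counts = train_bigram(CORPUS)
    print(next_word_probs(counts, "the"))
    print(generate(counts, "the", 4))
    # Hypothetical 2-D "embeddings": the first key is aligned with the query.
    print(attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

<p>In this toy corpus, "cat" follows "the" in 2 of 6 cases, so it gets probability 1/3. A real model conditions on the whole preceding sequence rather than on the last word only, which is where the transformer's reweighting of past tokens comes in.</p>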