
Gary Marcus reply to SSC on scaling hypothesis


Date: June 21st, 2022 2:39 AM
Author: Black Trailer Park

https://garymarcus.substack.com/p/does-ai-really-need-a-paradigm-shift

Also a note re GPT-4

https://towardsdatascience.com/gpt-4-will-have-100-trillion-parameters-500x-the-size-of-gpt-3-582b98d82253

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44715944)




Date: June 22nd, 2022 9:52 PM
Author: odious lilac parlor

>They found that at least a 5-layer neural network is needed to simulate the behavior of a single biological neuron. That’s around 1000 artificial neurons for each biological neuron.

>human brain has 100 billion neurons

>GPT-4 will have 100 trillion neurons

interesting math

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726205)




Date: June 22nd, 2022 9:55 PM
Author: Beta 180 Range Doctorate

Not a "math guy" but isn't that correct?

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726208)




Date: June 22nd, 2022 9:56 PM
Author: odious lilac parlor

that's what I'm saying

edit: turns out I'm a retard though because "parameters" are not "neurons" (parameters are the entries of the weight matrices); the number of neurons is much smaller.
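(back-of-the-envelope version of the distinction, with a made-up layer shape rather than GPT-4's actual architecture:)

# parameters grow with the *square* of layer width (the weight matrix),
# while "neurons" (units) grow only linearly, so the two counts diverge fast
width, depth = 1_000_000, 100                 # hypothetical layer width / layer count
params_per_layer = width * width + width      # weight-matrix entries + biases
total_params = depth * params_per_layer
total_units = depth * width
print(f"{total_params:.1e} parameters vs {total_units:.1e} units")
# ~1.0e14 parameters but only ~1.0e8 units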

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726212)




Date: June 22nd, 2022 10:18 PM
Author: charismatic state regret

the idea of an artificial intelligence is less threatening if these are the organic “intelligences” hard at work behind the curtain.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726285)




Date: June 22nd, 2022 10:19 PM
Author: Flickering Drunken Senate Windowlicker

i would be willing to bet that GPT-4 is nowhere near 100 trillion parameters. i would be shocked if it's above 15 trillion parameters.

there are several reasons for this. a dense transformer of that size would cost tens or hundreds of billions of dollars to train. no indication of budgets of that size yet.

the Chinchilla scaling laws also imply that GPT-3 was very inefficiently trained. they should have trained a smaller model on many more tokens. Chinchilla is better than GPT-3 with many fewer parameters because it used more data. i would be surprised if OpenAI didn't realize that before Chinchilla was released, which means they will be pumping much more of their computing power into running more data through the network instead of just increasing parameter count. a 1 trillion parameter GPT trained according to the Chinchilla scaling laws would be much better than even PaLM.

there are also other avenues for improvement that they haven't explored besides parameter scaling. recurrent transformer architectures with larger context windows. language models that learn how to plan (imagine a MuZero style training for GPT-3, where it bootstraps itself into learning how to plan out good text).

there are tons of avenues for improvement that would likely be more efficient than just making GPT-4 gigantic.
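rough back-of-the-envelope sketch of the chinchilla heuristic for anyone who wants numbers: the ~20 tokens-per-parameter rule of thumb plus the standard ~6*N*D estimate of training FLOPs. the model sizes below are illustrative, not anyone's actual plans.

def training_flops(n_params, n_tokens):
    # standard approximation: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

for n_params in (175e9, 1e12, 100e12):        # GPT-3-sized, 1T, and a hypothetical 100T
    n_tokens = 20 * n_params                  # roughly compute-optimal token count
    flops = training_flops(n_params, n_tokens)
    print(f"{n_params:.0e} params -> ~{n_tokens:.0e} tokens, ~{flops:.1e} FLOPs")

# for comparison, GPT-3 (175e9 params) was reportedly trained on ~300e9 tokens,
# an order of magnitude fewer than the heuristic suggests for that size.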

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726290)




Date: June 22nd, 2022 10:24 PM
Author: Black Trailer Park



(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726330)




Date: June 22nd, 2022 10:26 PM
Author: Flickering Drunken Senate Windowlicker

speaking of scale is all you need. did you see Google's new image model

https://parti.research.google/

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726350)




Date: June 22nd, 2022 10:33 PM
Author: Black Trailer Park

I did but ty for linking, it embiggens the spirit.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726387)




Date: June 22nd, 2022 10:48 PM
Author: Flickering Drunken Senate Windowlicker

it will be interesting to see how they intend to use that. that was surely expensive to train so i think they intend to turn it into some product.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726477)




Date: June 22nd, 2022 10:49 PM
Author: Hairless old irish cottage

gary marcus is a fraud.

a friend of mine was his doctoral student. and i have a PhD in ML. he has never done any meaningful research in the field.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726485)




Date: June 22nd, 2022 10:50 PM
Author: Black Trailer Park

I thought he was a linguist or something. Anyway, he links to various sources after his initial woke fusillade.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726493)




Date: June 22nd, 2022 10:52 PM
Author: Hairless old irish cottage

he was a psycholinguist, yes. but that doesn't mean much. a lot of the best researchers, including geoff hinton and michael i. jordan (who is the most cited 'computer scientist' in history), come from psych/cognitive science departments.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726501)




Date: June 22nd, 2022 10:55 PM
Author: Black Trailer Park

I'm aware of critiques of Marcus and the general reputation and provenance of Michael Jordan (if only due to the unusual confluence of names), but I appreciate you taking the time to inform me of them. Any chance you can recc better sources to read?

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726521)




Date: June 23rd, 2022 12:59 AM
Author: heady nofapping cuckold telephone

do you read lesswrong? the ML discussion (not alignment related, more of the capabilities side) there is often pretty high quality. some of the posters are actual researchers from places like DeepMind or OpenAI. even those who aren't are usually well read. i think most of that community believes in the scaling hypothesis.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726944)




Date: June 23rd, 2022 6:26 AM
Author: Black Trailer Park

I have in the past but it's cultic. Hard to ignore the farcical alignment side and neckbeardism. Hm.

Ty for the response.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727341)




Date: June 23rd, 2022 9:47 AM
Author: Flickering Drunken Senate Windowlicker

lesswrong is a strange place but it does encourage intellectual honesty and people at least don't dismiss ideas just for being "weird." even in ML circles the scaling hypothesis is deeply unpopular, possibly because of the implications. if your model of ML progress basically predicts that society is going to be run by machines in 3 to 20 years, odds are you will find a reason to not believe in it.

https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine

https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang



(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727822)




Date: June 23rd, 2022 9:54 AM
Author: Black Trailer Park

See, that's the issue with LessWrong: it's built around a rabbinical cult leader with silly ideas about the social impact of machines that he uses as a substitute for religion. The scaling hypothesis says nothing about society being run by machines, and it would be nice to be able to enjoy a technical discussion of it without the dire prognostications.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727847)




Date: June 23rd, 2022 10:08 AM
Author: Flickering Drunken Senate Windowlicker

there are people there that totally buy into the Eliezer kool aid, but you'll find a lot of pushback against him. here is a recent thread from a former OpenAI researcher that criticizes some of his views.

https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer

i'm not so sure the religious argument works in their case. it makes more sense if you think about the Kurzweil types that envision positive outcomes. most of that community is pessimistic about the prospect of alignment and doesn't find any comfort in these ideas.

the scaling hypothesis doesn't directly say society is going to be run by machines in a few years, but if you think simple neural circuitry trained at the size of the brain will yield human level cognition, i think it's very likely AGI is near. i can sort of imagine scenarios in which that doesn't happen (no one is willing to invest billions to do very large scale training runs), but they seem somewhat far fetched.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727922)




Date: June 23rd, 2022 10:51 AM
Author: Black Trailer Park

Thanks. I'm puzzled by your rebuttal re religion, given that gloom and doom millenarian predictions are a religious mainstay. I don't think we have a real disagreement.

As for AGI, I think suprahuman intelligence has already been achieved in some areas and will continue to be achieved in others, sometimes with synergies. I don't think we'll encounter a FOOM scenario, for substantially the reasons Hanson has laid out. I'm unconcerned about alignment: if you accept the premises, there's nothing to be done about it, and in any case I don't accept the premises, since computers are easy to control (you can turn them off).

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44728154)




Date: June 23rd, 2022 10:56 AM
Author: Flickering Drunken Senate Windowlicker

so i think it's different because religious predictions of doom are usually accompanied by some sort of salvation in an afterlife. these people just think the world is about to end and there's no hope. i don't think these views have much psychological utility, except maybe that they can feel like they are living in a unique time.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44728178)




Date: June 23rd, 2022 11:00 AM
Author: Black Trailer Park

I don't think that's quite true re salvation and the afterlife. Look at Calvinism. Rather, the unifying theme is that one must give money or status or deference or something else of value to the preacher, who speaks of supernatural or speculative dangers.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44728203)




Date: June 22nd, 2022 10:55 PM
Author: Flickering Drunken Senate Windowlicker

how did he get Uber to buy his company

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726524)




Date: June 22nd, 2022 10:57 PM
Author: Hairless old irish cottage

don't know but that ended about as disastrously as you'd predict under the presumption that he is a fraud (got fired after a few months and then they gave up on self-driving and wrote off the entire 'investment'). i would've saved uber $995 million if they'd just written me a $5 million check to advise them the guy was a fraud and had no real tech.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44726542)




Date: June 23rd, 2022 8:14 AM
Author: Black Trailer Park

A challenge for AI is that much of human society, and hence our various outputs & training data, are built on lying.

See generally The Elephant in the Brain.

Thus, for example, medicine is not about medicine, by and large, but rather palliating anxiety, which is why medical costs, but not the quality of care, increase monotonically with wealth.

The law is not about producing just outcomes, but rather about producing determinate outcomes while channeling unproductive people who can't cooperate into an extractive system.

People create art because they do not stand 6'3".

Etc., etc.

People gain implicit knowledge of such things because they operate as agents with needs, drives, and goals in a world of limited resources, cause and effect, & other agents.

I am very interested to see what emerges when some of the new AIs are put into something like the old agent-based models (ABMs), but with moar power.

Becoming increasingly plausible, cf. e.g.,

https://iopscience.iop.org/article/10.1088/1741-2552/ac6ca7/meta

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727515)




Date: June 23rd, 2022 8:18 AM
Author: Black Trailer Park

[also, fairly unrelated & more speculative but check this out -- Freeman Dyson would've been interested I think. I had not appreciated the extent to which quantum theory just is a theory of empiricism and its limits. Peirce was truly way ahead of his time with tychism + objective chance.]

***

Abstract: The Free Energy Principle (FEP) states that under suitable conditions of weak coupling, random dynamical systems with sufficient degrees of freedom will behave so as to minimize an upper bound, formalized as a variational free energy, on surprisal (a.k.a., self-information). This upper bound can be read as a Bayesian prediction error. Equivalently, its negative is a lower bound on Bayesian model evidence (a.k.a., marginal likelihood). In short, certain random dynamical systems evince a kind of self-evidencing. Here, we reformulate the FEP in the formal setting of spacetime-background free, scale-free quantum information theory. We show how generic quantum systems can be regarded as observers, which with the standard freedom of choice assumption become agents capable of assigning semantics to observational outcomes. We show how such agents minimize Bayesian prediction error in environments characterized by uncertainty, insufficient learning, and quantum contextuality. We show that in its quantum-theoretic formulation, the FEP is asymptotically equivalent to the Principle of Unitarity. Based on these results, we suggest that biological systems employ quantum coherence as a computational resource and – implicitly – as a communication resource.

Indeed while quantum theory was originally developed – and is still widely regarded – as a theory specifically applicable at the atomic scale and below, since the pioneering work of Wheeler [24], Feynman [25], and Deutsch [26], it has, over the past few decades, been reformulated as a scale-free information theory [27, 28, 29, 30, 31, 32] and is increasingly viewed as a theory of the process of observation itself [33, 34, 35, 36, 37, 38]. This newer understanding of quantum theory fits comfortably with the generalization of the FEP, and hence of self-evidencing and active inference, to all “things” as outlined in [10], and with the general view of observation under uncertainty as inference. In what follows, we take the natural next step from [10], formulating the FEP as a generic principle of quantum information theory. We show, in particular, that the FEP emerges naturally in any setting in which an “agent” or “particle” deploys quantum reference frames (QRFs), namely, physical systems that give observational outcomes an operational semantics [39, 40], to identify and characterize the states of other systems in its environment. This reformulation removes two central assumptions of the formulation in terms of random dynamical systems employed in [10]: the assumption of a spacetime embedding (or “background” in quantum-theoretic language) and the assumption of “objective” or observer-independent randomness. It further reveals a deep relationship between the ideas of local ergodicity and system identifiability, and hence the idea of “thingness” highlighted in [10], and the quantum-theoretic idea of separability, i.e., the absence of quantum entanglement, between physical systems.

https://www.sciencedirect.com/science/article/abs/pii/S0079610722000517
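[For reference, in standard notation rather than the paper's own formulation, the variational free energy the abstract keeps invoking is

\begin{aligned}
F[q] &= \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o,s)\right] \\
     &= D_{\mathrm{KL}}\!\left[q(s)\,\Vert\,p(s\mid o)\right] - \ln p(o) \;\ge\; -\ln p(o) \\
     &= \underbrace{D_{\mathrm{KL}}\!\left[q(s)\,\Vert\,p(s)\right]}_{\text{complexity}} \;-\; \underbrace{\mathbb{E}_{q(s)}\!\left[\ln p(o\mid s)\right]}_{\text{accuracy}}
\end{aligned}

so minimizing F both tightens an upper bound on the surprisal -ln p(o) (the "self-evidencing" claim) and trades off complexity against accuracy, which is the "balancing accuracy with simplicity" point that comes up further down the thread.]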

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44727524)




Date: June 25th, 2022 12:28 PM
Author: ..,.,,,...,..,,.,...,.,,.,..


this is worth listening to:

https://theinsideview.ai/raphael

good criticism of naive scaling as a solution to general intelligence. this seems better thought through than what I have seen from Marcus

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742063)




Date: June 25th, 2022 12:50 PM
Author: Adolf Anderssen

Many thanks! this looks great. I am pasting the transcript below to peruse later (& for ne1 else interested).

One thing I have noticed about GPT-3/raphael is that it is very talented at mimicking "voice" and "tone," but has no conceptual understanding. Very unusual, since usually an adept at aping voice / tone is extremely well-steeped in a genre & understands all of the ins and outs of plot and the micro- and macro system of cause & effect. It's as if it has a higher-order Williams Syndrome.

* * *

TRANSCRIPT OF https://theinsideview.ai/raphael

Raphaël: And I just posted some examples on Twitter of hybrid animals testing Dall-E 2 on combining two different, very different animals together like a hippopotamus octopus or things like that. And it’s very good at combining these concepts together and doing so even in a way that demonstrates some minimal comprehension of a world knowledge in the sense that it combines the concept, not just by haphazardly throwing together features of a hippopotamus and features of an octopus or features of a chair if it is an avocado, but combining them in a plausible way that’s consistent with how chair looks and would behave in the real world or things like that. So those are examples that are mostly semantic composition because it’s mostly about the semantic content of each concept combined together with minimal syntactic structure. The realm in which current text to image generation models seem to struggle more right now is with respect to examples of compositionality that have a more sophisticated syntactic structure. So one good example from the Dall-E 2 paper is prompting a model to generate a red cube on top of a blue cube.

Raphaël: What that example introduces compared to the conceptual blending examples I’ve given is what people call in psychology, variable binding. You need to bind the property of being blue to a cube that’s on top and the property of being red to cube that’s…I think I got it the other way around. So red to the cube that’s on top and blue to the cube that’s at the bottom and a model like Dall-E 2 is not well suited for that kind of thing. And that’s, we could talk about this, but that’s also an artifact of its architecture because it leverages the text encodings of CLIP, which is trained by contrastive learning. And so when it’s training CLIPs it’s only trying to maximize the distance between text image pairs that are not matching and minimize the distance between text image pairs that are matching where the text is the right caption for the image.

Raphaël: And so through that contrastive learning procedure, it’s only keeping information about the text that is useful for this kind of task. So it’s kind of, a lot of this can be done without modeling closely the syntactic structure of the prompts or the captions, because unless we adversely designed a new data set for CLIP that would include a lot of unusual compositional examples like a horse riding an astronaut and various examples of blue cubes and red cubes on top of one another. Given the kind of training data that CLIP has, especially for Dall-E 2 stuck [inaudible 02:10:20] and stuff like that, you don’t really need to represent rich compositional information like that to train CLIP, and hence the limitations of Dall-E 2. Imagen does better at this, because it uses a frozen language model T5-xl, I think, which, we know that language models do capture rich compositional information.

Michaël: Do you know how it uses the language model?

Raphaël: I haven’t done a deep dive into the Imagen paper yet, but it’s using a frozen T5 model to encode the prompts. And then it has some kind of other component that translates these prompts into image embeddings, and then does some gradient upscaling. So there is some kind of multimodal diffusion model that takes the T5 embedding and is trained to translate that into image embedding space. But I couldn’t, I don’t exactly remember how it does that, but I think the key part here is that the initial text embedding is not the result of contrastive learning, unlike the CLIP model that’s used for Dall-E.

Michaël: Gotcha. Yeah. I agree that…yeah. From the images you have online, you don’t have a bunch of a red cube on top of a blue on top of a green one, and it’s easy to find counter examples that are very far from the training distribution.

Raphaël’s Experience with DALLE-2

Michaël: I’m curious about your experience with Dall-E because you’ve been talking about Dall-E before you had access. And I think in the recent weeks you’ve gained the API access. So have you updated on how good it is or AI progress in general, just from playing with it and being able to see the results from octopus, I don’t know how you call it.

Raphaël: Yeah. I mean, to be honest, I think I had a fairly good idea of what Dall-E could and couldn’t do before I got access to it. And there’s nothing that I generated that kind of made me massively update that prior I had. So again, it’s very good at simple conceptual combination. It can also do fairly well some simple forms of more syntactically structured composition. So if you ask it for, I don’t know, one prompt that I tested it on, that was great. Quite funny is an angry Parisian holding a baguette. So an angry Parisian holding a baguette digital art. Basically every output is spot on. So it’s like a picture of an angry man with a beret holding a baguette, right? So this kind of simple compositional structure is doing really well at it. That’s already very impressive in my book.

Raphaël: So I was pushing back initially against some of the claims from Gary Marcus precisely on that. Around the time of the whole deep learning is hitting a wall stuff. He was emphasizing that deep learning as he would put it fails at compositionality. I think first of all, that’s a vague claim because there are various things that could mean depending on how you understand compositionality and what I spouted out in my reaction to that is that really the claim that is actually warranted by the evidence is that there are failure cases with current deep learning models, with all current deep learning models at parsing compositionally structured inputs. So there are cases in which they fail. That’s true, especially the very convoluted examples that Gary has been testing Dall-E 2 on, like a blue tube on top of a red pyramid next to a green triangle or whatever. When you get to a certain level of complexity, even humans struggle.

Raphaël: If I asked you to draw that, and I didn’t repeat the prompt, I just gave it to you once, you probably would make mistakes. The difference is that we humans can go back and look at the prompt and break it down into sub-components. And that’s actually something I’m very curious about. I think a low hanging fruit for research on these models would be to do something a little similar to chain of thought prompting, but with text to image models instead of just language models. So with chain of thought prompting or the Scratchpads paper for language models, you see that you can get remarkable improvements in in-context learning when, in your few-shot examples, you give examples of breaking down the problem into sub-steps.

Michaël: Let’s think about this problem step by step.

Raphaël: Yeah, yeah, exactly. And so, well, actually the “Let’s think about this step by step” stuff was slightly debunked in my view, by a blog post that just came out. Who did that? I think someone from MIT. I could send you the link, but someone who tried a whole bunch of different tricks for prompt engineering and found that at least with arithmetic, the only really efficient one is to do careful chain of thoughts prompting, where you really break down each step of the problem. Whereas just appending, let’s think step by step wasn’t really improving the accuracy. So there are some, perhaps some replication concerns with “Let’s think step by step.”

Raphaël: But if you do spell out all the different steps in your examples of the solution, then the model will do better. And I do think that perhaps in the near future, someone might be able to do this with text to image generation where you break down the prompt into first let’s draw a blue triangle, then let’s add a green cube on top of the blue triangle and so on.

Raphaël: And maybe if you can do it this way, you can get around some of the limitations of current models.

Michaël: Isn’t that already something, a feature of the Dall-E API? At least on the blog post, they have something where they have a flamingo that you can add it to be, remove it or move it to the right or left.

Raphaël: Yeah. So you can do inpainting and you can gradually iterate, but that’s not something that’s done automatically. What I’m thinking about would be a model that learns to do this similarly to chain of thought prompting. So there is a model that just came out a few days ago that I tweeted about that does something a little bit different, but along the same broad lines. So it’s breaking down the prompts, the compositional prompts of the diffusion models into distinct prompts, and then has this compositional diffusion model that has compositional operators like “and”. For example, if you want a blue cube and a red cube, it will first generate an embedding for a blue cube and for a red cube. And then it will use a compositional operator to combine these two embeddings together.

Raphaël: So it’s kind of like hard coding compositional operations into the architecture. And I think my intuition is that this is not the right solution for the long term, because you don’t want, again, the bitter lesson, blah, blah, blah, you don’t want to hard code too much in the architecture of your model. And I think you can learn that stuff with the right architecture. And we see that in language models, for example, you don’t need to hard code any syntactic structure, any knowledge of grammar in language models. So I think you don’t need to do it either for vision language models, but in the short term, it seems to be working better than Dall-E 2 for example, if you do it this way.

Michaël: Right, so you split your sentence with the “and” and then you combine those embeddings to engineer the image. I think, yeah, as you said, the general solution is probably as difficult as solving the understanding of language, because you would need to see in general how in a sentence the different objects relate to each other. And so to split it effectively, it would require a different understanding.

The Future of Image Generation

Michaël: I’m curious, what do you think would be kind of the new innovation? So imagine when we’re in 2024 or even 2023 and Gary Marcus is complaining about something on Twitter. Because for me, Dall-E was not very high resolution, the first one, and then we got Dall-E 2 that couldn’t generate text or, you know, faces (or maybe that’s something from the API, not really an AI problem), and then Imagen came along and did something much more photorealistic that could generate text.

Michaël: And of course there’s some problems you mentioned, but, do you think in 2023, we would just work on those compositionality problems one by one, and we would get three objects blue on top of red and top of green, or would it be like something very different? Yeah, I guess there are some long tail problems in solving fully the problem of generating images, but I don’t see what it would look like. Would it be just imaging a little bit different or something completely different?

Raphaël: So I think my intuition is that yeah, these models will keep getting better and better at this kind of compositional task. And I think it’s going to happen probably gradually just like language models have been getting better and better at arithmetic first doing two digit operations and then three digit and with Palm, perhaps more than that, or with the right channel of thought prompting more than that, but it still hits a ceiling and you get diminishing returns and that will remain the case. As long as we can’t find a way to basically approximate some form of symbolic-like reasoning in these models with things like variable binding. So I’m very interested in current efforts to augment transformers with things like episodic memory, where you can store things that start looking like variables and do some operations.

Raphaël: And then have it read and write operations. To some extent the work that’s been done by the team at Anthropic led by Chris Olah and with people like [inaudible 02:21:37], which I think is really fantastic is already shedding light on how transformers, they’re just vanilla transformers. In fact, they’re using time models without MLP layers. So just attention-only transformers can have some kind of implicit memory where they can store and retrieve information and do read and write operations in sub spaces of the model. But I think to move beyond the gradual improvement that we’ve seen for tasks such as mathematical reasoning and so on from language models to something that can more reliably and in a way that can generalize better perform these operations for arbitrary digits, for example, we need something that’s probably some form of modification of the architecture that enables more robust forms of variable binding and manipulation in a fully differentiable architecture.

Raphaël: Now, if I knew exactly what form that would take, then I would be funding the next startup that gets $600 million in series B, or maybe I would just open source it. I don’t know, but in any case I would be famous. So I don’t know exactly what form that would take. I know there is a lot of exciting work on somehow augmenting transformers with memory. There’s some stuff from the Schmidhuber lab recently on fast weight transformers. That looks exciting to me, but I haven’t done a deep dive yet. So I’m expecting a lot of research on that stuff in the coming year. And maybe then we’ll get a discontinuous improvement of text to image models too, where instead of gradually being able to do three objects, a red cube on top of a blue cube and then four objects, and gradually like that, all of a sudden it would get to arbitrary compositions. I’m not excluding that.

Conclusion

Michaël: As you said, if you knew what the future would look like, you would be funding as a series B startup in the Silicon valley, not talking on a podcast. Yeah. I think this is an amazing conclusion because it opens a window for what is going to happen next. And, yeah. Thanks for being on the podcast. I hope people will read all your tweets, all the threads on compositionality, Dall-E, GPT-3 because I learned personally a lot from them. Do you want to give a quick shout out to your Twitter account or a website or something?

Raphaël: Sure. You can follow me at, @Raphaelmilliere on Twitter. That’s Raphael with PH the French way. And my last name Milliere, M, I, L, L, I, E, R, E. You can follow my publications on raphaelmilliere.com. And I just want to quickly mention this event that I’m organizing with Gary Marcus at the end of the month, because they might interest some people who enjoy the conversation of compositionality.

Raphaël: So basically I’ve been disagreeing with Gary on Twitter about how extensive the limitations of current models are with respect to compositionality. And there’s something that I really like, a model of collaboration that’s emerged initially from economics, but that’s been applied to other fields in science called adversarial collaboration, which involves collaborating with people you disagree with to try to have productive disagreements and settle things with falsifiable predictions and things like that. So in this spirit of adversarial collaboration, instead of…I think Twitter amplifies disagreements rather than allowing reasonable, productive discussions. I suggested to Gary that we organize together a workshop, inviting a bunch of experts in compositionality and AI to try to work these questions out together. So he was enthusiastic about this and we organized these events online at the end of the month that’s free to attend. You can register compositionalintelligence.github.io.

Raphaël: And yeah, if you’re interested in that stuff, please do join the workshop. It should be fun. And thanks for having me on the podcast. That was a blast.

Michaël: Yeah, sure. I will definitely join. I will add a link below. I can’t wait to see you and Gary disagree on things and make predictions and yeah. See you around.
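[Addendum from me, not the transcript: a minimal sketch of the CLIP-style contrastive objective Raphaël describes, i.e., symmetric cross-entropy over an image/text similarity matrix. Toy random tensors stand in for real encoders; this is the general recipe, not OpenAI's exact code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (N, N) cosine-similarity matrix
    targets = torch.arange(len(img))       # the i-th caption matches the i-th image
    # classify the right caption for each image, and the right image for each caption
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))  # toy batch of 8 pairs
print(loss.item())

Because the objective only needs enough information to pick the matching caption out of a batch, fine-grained syntactic structure (which cube is on top of which) can be discarded, which is exactly the limitation discussed above.]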

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742150)




Date: June 25th, 2022 1:16 PM
Author: Valiant

“Gary Marcus doesn’t contribute to the field” is an ad hominem diversion that fails to address the substance of his arguments. He represents the “East Pole” of cognitive science with its focus on modularity and symbolic reasoning, and criticism of AI/ML from an adjacent field is fair game. You will find similar critiques from active and well-respected practitioners like Brenden Lake and Joshua Tenenbaum (if not as vocal or adversarial).

In general the field is populated with intelligent people with no philosophical training. Among its brightest lights you will not find anything close to the depth or care of a Fodor or Pylyshyn. They suffer a serious lack of perspective. Francois Chollet is the best among them, and he is a skeptic.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742312)




Date: June 25th, 2022 1:24 PM
Author: Adolf Anderssen

(Jerry Fodor).

Jokes aside yes, I think it would be better to respond to the substance of Marcus's arguments; I linked the piece I linked in part because he took pains to quote various active & very well-respected researchers.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742382)




Date: June 25th, 2022 1:40 PM
Author: Valiant

Ok I read it and I agree with LeCun and his list. The problem with “learning like a baby” is that blank slate-ism isn’t going to work. No DL model will spontaneously learn human language from raw audio data. Babies can do this because they have significant functionality built in.

I also agree strongly with LeCun’s point about interactivity. Real life intelligences act and sense and reason in tightly coupled feedback loops. These are completely missing from most static DL architectures.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742504)




Date: June 25th, 2022 1:42 PM
Author: Adolf Anderssen

TY for your response.

But query, what if Chomsky & co. are just wrong with UG and the connectionists and behaviorists [well, as modified into enactivists] were right the whole time? It's an empirical question.

{I have a whole intricate view on all of this, developed largely independently of the ML/AI foofaraw, that I'm too tired to expand on now.}

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742515)




Date: June 25th, 2022 1:43 PM
Author: Private First Class James Ramirez



(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742521)




Date: June 25th, 2022 1:49 PM
Author: Valiant

Yes ultimately it’s empirical but I think the weight of evidence so far strongly suggests innatism. Babies display all kinds of biases that hint at underlying structures, and the fact that specific brain regions (eg Broca’s area) are responsible for specific functions does as well.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742549)




Date: June 25th, 2022 2:16 PM
Author: Adolf Anderssen

There's a lot of interesting work on this that goes far beyond the old poverty-of-stimulus debates and into self-evidencing systems & the acquisition of concepts; I can without getting into the weeds give a sense of my views with some cites:

Friston et al., Generative models, linguistic communication and active inference (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7758713/)

>>>> ("Note that the mathematical formulation used here—which is described in detail in the sections that follow—differs from previous approaches in this literature. There are two key points to note here. First, the current formulation considers the uncertainty of the agent’s beliefs about the scene at hand. Second, we introduce an active component—which generates predictions about the information that an agent will seek to resolve their uncertainty. In other words: What questions should I ask next, to resolve my uncertainty about the subject of our conversation?")

>>>>

https://www.semanticscholar.org/paper/Evaluating-the-Apperception-Engine-Evans-Hern%C3%A1ndez-Orallo/8085e0b4900fc83976632a68c6db510c7b0dca81 (described in

https://www.degruyter.com/document/doi/10.1515/9783110706611/pdf#page=11 ("Delving into the rich details of this implementation is beyond the scope of this review and must be left for another occasion, but the readers may consult Evans’ contribution to this volume themselves (Evans 2022, this volume, ch. 2). This requirement explicitly exceeds the typical empiricist approaches that are purely data-driven, as criticized by the 'pessimists' such as Pearl, Mitchell and Marcus & Davis (see above). But it does not mean that Evans thereby takes his Apperception Engine to constitute a nativist system, as demanded by Marcus and Davis. With respect to the debate between optimists and pessimists, Evans objects to Marcus’ interpretation of Kant as a nativist, because it is important what is taken to be innate. That is, it makes a difference whether one claims that concepts are innate or faculties (capacities) whose application produces such concepts. Kant allegedly did not conceive of the categories as innate concepts: “The pure unary concepts are not ‘baked in’ as primitive unary predicates in the language of thought. The only things that are baked in are the fundamental capacities (sensibility, imagination, power of judgement, and the capacity to judge) [. . .]. The categories themselves are acquired – derived from the pure relations in concreto when making sense of a particular sensory sequence” (Evans 2022, this volume, p. 74). Evans follows Longuenesse (2001), who grounds her interpretation in a letter Kant wrote to his contemporary Eberhard; in it, he distinguishes an “empirical acquisition” from an 'original acquisition', the latter applying to the forms of intuition and to the categories. Evans is right in saying that, as far as the cognition of an object is concerned – like the 'I think' – the categories come into play only by being actively (spontaneously) applied through the understanding, and can thus be derived, if you will, through a process of reverse engineering which reveals that they have to be presupposed in the first place, being a transcendental condition of experience. But this is compatible with the claim that, given their a priori status (and given that they can be applied also in the absence of sensory input, though not to yield cognition in the narrow sense but still cognition in the broad sense, as characterized above), 'they have their ground in an a priori (intellectual, spontaneous) capacity of the mind' (Longuenesse 2001, p. 253)"))

>>>>

A. Pietarinen, Active Inference and Abduction https://link.springer.com/article/10.1007/s12304-021-09432-0 (drawing a link between the late-Peircean account of his semiotics & abduction & current work in active inference & the Bayesian brain).

& cf. https://www.academia.edu/24037739/Pragmatism_and_the_Pragmatic_Turn_in_Cognitive_Science (placing Fristonian self-evidencing & active inference in a Peircean pragmaticist context.)

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44742701)




Date: June 25th, 2022 3:23 PM
Author: Valiant

Yes, Fristonian “predictive processing” would impose structure by balancing accuracy with simplicity, yielding what LeCun calls a “regularized latent variable” model. Active inference would resolve the ambiguities in, say, the inverse optics problem. Bengio frames this as an optimal sampling problem and addresses it with his GFlowNets.

Good! But I don’t think this explains the fixity of neuroanatomy or such specialized (and localized) functions as facial recognition. Infant studies and lesion studies suggest it is not learned so much as developed.

There are other mysteries besides… The analog, oscillatory, synchronized aspects of the brain seem to have been completely ignored since the advent of cybernetics.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44743016)




Date: June 25th, 2022 4:18 PM
Author: ..,.,,,...,..,,.,...,.,,.,..


How much of the modularity and fixed structure is the result of where sensory information comes into the brain? The rewiring experiments strongly suggest that, in the neocortex, other parts are able to take over and learn the functions of the specialized regions if sensory information is redirected. This doesn't seem to fit with the view that there are a lot of valuable priors coded into specific regions by the genome.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44743310)




Date: June 25th, 2022 4:57 PM
Author: Adolf Anderssen

Re: the fixity of neuroanatomy, the data are (even now) frustratingly mixed. IMO, the more exciting work right now is happening at the Marr computational / algorithmic levels when it comes to complex tasks. And this remains true even if the work doesn't really translate to understanding how the human brain functions (e.g., https://psyarxiv.com/5zf4s/ (critiquing the use of DNNs as a model of the human vision system)), since engineering success is cool regardless of what it says about you.

[It seems like some sort of 'soft connectionist' paradigm is probably cr, in which architectures get re-used for energy efficiency & certain patterned problems have certain optimal solutions that are converged on, including with respect to architecture -- but the empirics seem to me to need time to season.

It has been incredibly frustrating learning all this junk in the 90s and 00's only to later discover that, say, tons of fMRI studies are garbage; or adult neurogenesis is not really a thing, etc. {Maybe it is time to go back to lesioning people. At least that's a real experimental intervention amirite! Cf. S. Siddiqi et al., Causal Mapping of Human Brain Function, Nature Reviews Neuroscience, April 20, 2022 (https://www.nature.com/articles/s41583-022-00583-8)}.

Even so, as you reference, emerging work on diff. types of brain operation--synchrony, analog computing, etc.--or just the results of new probes (optogenetics, viral labeling techniques like [https://www.nature.com/articles/s41593-022-01014-8]) promise to deliver much interesting info in years to come.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44743471)




Date: June 26th, 2022 12:53 AM
Author: .,.,...,..,.,..:,,:,...,:::,.,.,:,.,,:.,:.,:.::,.


"The problem with “learning like a baby” is that blank slate-ism isn’t going to work. No DL model will spontaneously learn human language from raw audio data. Babies can do this because they have significant functionality built in. "

even now (when they are clearly way smaller and undertrained compared to what's possible), transformers seem to understand an awful lot about human language. there's clearly a ways to go, but i wouldn't be confident at all in this assertion. remember that whatever is innately programmed can also be learned directly from data. learning from raw data decreases the efficiency, but who cares about efficiency when you can feed in millions of hours of youtube videos, billions of tweets, every book, etc? the constraints are nothing like a human during normal development.

if we had reason to believe the normal human mind is a very unique program and only a few learning systems converge on it, this might not work. it's really hard for me to buy into that notion given how versatile simple techniques are and how they rather consistently show human or superhuman performance in many domains when trained at scale. i think we would be seeing a lot more need for highly tailored architectures for specific problems at this point if that were actually true.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44745948)




Date: June 26th, 2022 10:45 AM
Author: Adolf Anderssen

I'm sympathetic to this view, but see https://psyarxiv.com/5zf4s/ (pasted above) for a good discussion re: showing "human" performance (tied to vision / visual perception, which we understand pretty well at this point).

AI systems may be showing "human-like" or "human-level" performance on tasks, but they don't seem to be implementing the _same_ algorithms that provide human affordances. This may or may not be important, depending on your area of interest.

That is why I think it is important to separate excitement about AI qua engineering feat from excitement about AI as a tool for cognitive psych.

I'll try to come back to this l8r today, I've more to say but duty calls.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44746815)




Date: June 26th, 2022 11:48 AM
Author: ..,.,,,...,..,,.,...,.,,.,..


it would be interesting to know whether they are looking at model features of convolutional neural networks only or also vision transformers. vision transformers are more resistant to things like adversarial examples and produce generally better performance, so i would expect them to have fewer failure modes like what are described in the paper.

i think there are a couple things going on here. neural networks are as good as they need to be for a particular task. if looking only at local features or texture is able to produce high performance on a particular task (or if those are the easiest parts of the loss landscape to descend), that's what they'll learn first. given sufficient scale and continued training, eventually they'll have to learn more global features and be more resistant to perturbations of the input. it's just that for most image benchmark tasks, learning things like that is probably not very beneficial for reducing loss.

the other part of this is that almost all these models are just feedforward networks designed to emulate fast human vision. people might have an initial response that's similar to a neural network, but they are also able to think about a stimulus and refine a classification. there are recurrent circuits in the brain. these don't fit in the standard feedforward model, and i expect that will put constraints on these models' performance. i think these deficiencies could be corrected, but everyone is ok for now with the reasonably good performance of models like ViT, which is already better than humans in certain contexts.
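toy sketch of that feedforward-vs-recurrent contrast: the same stimulus gets re-processed for several steps and the readout is refined each time. this is just the shape of the idea, not any published vision model.

import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    def __init__(self, d_in=64, d_hidden=128, n_classes=10, n_steps=5):
        super().__init__()
        self.n_steps = n_steps
        self.cell = nn.GRUCell(d_in, d_hidden)
        self.readout = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        h = torch.zeros(x.size(0), self.cell.hidden_size)
        beliefs = []
        for _ in range(self.n_steps):     # re-process the same stimulus each step
            h = self.cell(x, h)
            beliefs.append(self.readout(h).softmax(-1))
        return beliefs                    # one (progressively refined) belief per step

steps = RecurrentClassifier()(torch.randn(4, 64))
print(len(steps), steps[-1].shape)        # 5 refinement steps, final belief is (4, 10)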

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44747055)




Date: June 26th, 2022 3:50 PM
Author: Valiant

>transformers seem to understand an awful lot about human language

They can master the syntax, sure, and make predictions about missing words from context. But there is no semantic grounding that tethers words to reality. For example, the "sentient" Google AI wrote of "spending time with friends and family" that don't exist.

Moreover words are discretely represented (as "tokens") in LLMs. This itself encodes significant abstract, human-curated information. Very different from learning from raw audio streams without human-curated labels.

>if we had reason to believe the normal human mind is a very unique program

We have reasons to believe that various human cognitive and perceptual systems are unique, modular programs because we can tease out their quirks and regularities from careful experimentation and even selectively disable certain functionality (like short-term memory, speech, and face recognition).

>who cares about efficiency when you can feed in millions of hours of youtube videos

Certainly you can train impressive models to do impressive things with this data. What you cannot do, with today's deep learning, is to train an intelligent agent that replicates the basic functionality of a Star Wars droid:

1) model and navigate new environments (like the real world)

2) formulate and pursue goals

3) model other agents and their internal states

4) communicate with other agents about real things really happening in the real world

A simple "butler" droid that brings you a bottle of beer or fetches the newspaper does not currently exist. Why not? A "baby" droid that spontaneously develops sight, hearing, movement, and language from raw experience in real time does not currently exist. Why not? Babies can do this -- why not bots?

The point is that something is missing, and I don't think you can get AGI / HLAI without identifying and filling in these gaps.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44748428)




Date: June 26th, 2022 3:51 PM
Author: Adolf Anderssen

"3) model other agents and their internal states"

I suspect that this is the big one.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44748434)




Date: June 26th, 2022 4:00 PM
Author: Valiant

https://youtu.be/q9XR9Wbl7iI?t=870

Watch till minute 17. (Timestamped at 14:30)

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44748475)




Date: June 26th, 2022 5:17 PM
Author: ..,.,,,...,..,,.,...,.,,.,..


Language models don't see words (at least GPT-3 and the other ones I know about don't). GPT-3 uses byte-level encoding, so it sees character groupings; it has to learn the construction of words from BPEs. i should note that this is actually a major problem for certain tasks and forces the model to memorize more things than it should have to, because it's not able to map functionally identical inputs to the same representation. it's remarkable it's able to learn things like arithmetic as well as it does despite being totally crippled by the encoding scheme.

https://nostalgebraist.tumblr.com/post/620663843893493761/bpe-blues
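quick way to see the BPE issue for yourself, assuming you have the huggingface transformers package installed (GPT-2 uses the same byte-level BPE scheme GPT-3 does):

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
for s in ["1000", " 1000", "1,000", "hello", " Hello", "HELLO"]:
    print(repr(s), "->", tok.tokenize(s))

the same "word" comes out as different token sequences depending on spacing, case, and punctuation, so the model never gets a single canonical form to reason over.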

i think the modularity of human brains, at least in the neocortex, is produced by a couple of things. certain types of sensory input flow into certain areas, and the brain uses sparse activation patterns (rather than the sort of dense activations that GPT-3 uses, where the entire network processes every input). only a small part of the brain is activated by certain inputs because the brain has learned to construct a modular architecture based on its past history. networks that are better able to process an input are activated in response to it.

this is different from saying the modular architecture is programmed. the neocortex looks rather uniform anatomically, and functionally different areas are able to take over and process other types of data.

https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine#Dynamic_Rewiring

i think the future of these models is likely to be more brain like because training brain sized dense models is needlessly expensive, but we aren't likely to see human engineering of different modules. sparsity will produce modularity organically.
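toy illustration of that last point: a router sends each input to only its top-k "experts" (as in mixture-of-experts layers), so most of the network stays inactive for any given input. numbers are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_experts, d, k = 8, 16, 2
router_w = rng.normal(size=(d, n_experts))
expert_w = rng.normal(size=(n_experts, d, d))

def moe_layer(x):
    scores = x @ router_w                                    # one score per expert
    top = np.argsort(scores)[-k:]                            # indices of the k best experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the selected experts only
    out = sum(g * np.tanh(expert_w[i] @ x) for g, i in zip(gates, top))
    return out, top

x = rng.normal(size=d)
y, active = moe_layer(x)
print("active experts:", sorted(active.tolist()), "of", n_experts)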

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44748832)




Date: June 30th, 2022 10:51 PM
Author: ..,.,,,...,..,,.,...,.,,.,..


https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html

and this is basically why i don't think we need a new paradigm. connectionist models are perfectly capable of learning symbolic reasoning. worth noting that this seems to be a lot earlier than anticipated (2025!). how worthless are AI capability predictions?

https://twitter.com/Yuhu_ai_/status/1542560073838252032

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44776256)




Date: July 1st, 2022 7:00 AM
Author: Adolf Anderssen



(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44777497)




Date: July 1st, 2022 11:45 AM
Author: Valiant

It's not just about model architecture but the entire framework of offline supervised learning. Yes if you curate a huge dataset that was generated by humans, you can train cool models to do whatever was in the training set, including symbol manipulation.

But can you get Data from Star Trek? I don't think so. However Data learns, and however he solves equations through symbol manipulation, I don't think it's the same way deep learning does it. Data could explain why he made each step.

It's a qualitatively different kind of thing.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44778478)




Date: July 1st, 2022 12:55 PM
Author: ..,.,,,...,..,,.,...,.,,.,..


i think you are right in a certain sense. one of the weird things about this model is that it does well on university and high school level problems while only being slightly better on middle school problems. whatever this is doing, it's clearly different from the way a human learns with a logical capability progression. it's sort of like how language models are in general - idiot savants where they have striking capabilities while having bizarre misunderstandings.

the key question is whether these misunderstandings will go away with better scale and better training regimes. i think they will. the thing is this is really unpredictable. one thing that people have noted with these large models is that there are certain capabilities that progress smoothly with parameter count and training size, and then there are capabilities that spike suddenly during training. it seems like initially they memorize certain low level patterns that work well for most problem classes (while still having striking deficiencies and not "seeing" the big picture), until eventually they "grok" the simple program/function that provides complete generalization.

https://www.lesswrong.com/posts/JFibrXBewkSDmixuo/hypothesis-gradient-descent-prefers-general-circuits

human brains might have better priors encoded in them that provide this sort of generalization without extensive training, or maybe it's the fact that they are pre-trained on loads of audio/video data that allows them to generalize quicker. even if the better-priors story is right, i think it's unlikely that advantage doesn't go away at large scale.
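if anyone wants to watch the "sudden generalization" thing happen, here's a minimal toy setup in the spirit of the grokking paper (modular addition, heavy weight decay). the hyperparameters are guesses and the delayed jump in val accuracy may or may not show up at this tiny scale; it also takes a while on CPU.

import torch
import torch.nn as nn

P = 97                                      # modulus
X = torch.tensor([(a, b) for a in range(P) for b in range(P)])
y = (X[:, 0] + X[:, 1]) % P
perm = torch.randperm(len(X))
tr, va = perm[:len(X) // 2], perm[len(X) // 2:]   # train on half the addition table

class AddNet(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(P, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, P))
    def forward(self, x):
        return self.mlp(self.emb(x).flatten(1))

model = AddNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # strong weight decay matters
loss_fn = nn.CrossEntropyLoss()

for step in range(20001):
    opt.zero_grad()
    loss = loss_fn(model(X[tr]), y[tr])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr_acc = (model(X[tr]).argmax(-1) == y[tr]).float().mean().item()
            va_acc = (model(X[va]).argmax(-1) == y[va]).float().mean().item()
        print(f"step {step:6d}  loss {loss.item():.3f}  train {tr_acc:.2f}  val {va_acc:.2f}")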

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44778912)




Date: July 1st, 2022 1:56 PM
Author: Adolf Anderssen

Humans can generate their own training data since they can first figure out self/other boundaries & then conduct 'experiments'.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44779241)




Date: July 1st, 2022 2:25 PM
Author: ..,.,,,...,..,,.,...,.,,.,..


LMs can do that as well (someone at Google Brain or DeepMind or OpenAI just has to get around to it). MuZero is a general framework. right now they are just using it for games, but it will work with language. i think this is probably the best way to correct a lot of their deficiencies. right now GPT-3 just sequentially samples from a token distribution at every point in a sample, but it can't look into the future to see how a sentence or paragraph will go. if it picks a bad token at some point, it just has to make do with it and try to generate the rest of the sample as best as it can.

with something like AlphaZero, it's basically a tree search based on the probability distribution at every point in time. even if the network predicts a move that is ultimately bad, it can see through that by continuing down the simulated move tree and seeing the implications of it. once someone gets around to doing that with language, i anticipate very large performance gains.
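crude sketch of the lookahead idea (a depth-limited search over next-token distributions, not actual MuZero/MCTS, and the "model" here is a stand-in bigram table rather than a real LM):

import math

# hypothetical stand-in for a language model: next-token distribution given the last token
BIGRAM = {
    "the": {"cat": 0.5, "dog": 0.4, "end": 0.1},
    "cat": {"sat": 0.7, "end": 0.3},
    "dog": {"barked": 0.2, "end": 0.8},
    "sat": {"end": 1.0},
    "barked": {"end": 1.0},
}

def next_dist(token):
    return BIGRAM.get(token, {"end": 1.0})

def best_continuation(token, depth, k=2):
    """Best total log-prob reachable within `depth` steps, expanding only the top-k tokens."""
    if depth == 0 or token == "end":
        return 0.0, []
    best_score, best_path = -math.inf, []
    for nxt, p in sorted(next_dist(token).items(), key=lambda kv: -kv[1])[:k]:
        sub_score, sub_path = best_continuation(nxt, depth - 1, k)
        score = math.log(p) + sub_score
        if score > best_score:
            best_score, best_path = score, [nxt] + sub_path
    return best_score, best_path

# greedy decoding commits to the single most likely next token; the search instead
# scores each candidate by its best few-step continuation before committing.
print(best_continuation("the", depth=3))
# note: shorter continuations accumulate fewer log terms (a length bias); a real
# implementation would handle that with a learned value estimate, MuZero-style.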

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44779426)




Date: July 1st, 2022 2:30 PM
Author: Adolf Anderssen

We need to hook these things up to robot bodies and let them loose imo.

(http://www.autoadmit.com/thread.php?thread_id=5135617&forum_id=2#44779457)