Competitive-Rub-1958

Competitive-Rub-1958 t1_j7dhy93 wrote

Google is a leader in DL research. That's a fact. They chose to keep most of their research internal because, as the commenters above said, they don't have much to gain from releasing it; marketing and hype last only so long.

> It's about the UX

What UX? It's just a normal frontend, mate.

> scalability

You do realize Google was serving LLMs before OAI was even hypothesized? Or that they have TPUs, which are far more scalable and cost-efficient, and could already rip major players apart?

> liability

OAI hasn't fought anything liability- or legality-wise. They just remain in a gray area and hope no one focuses on them (bad luck: they got caught up in the AI art lawsuits too).

8

Competitive-Rub-1958 t1_j6z8a7t wrote

For someone who simply wants to use the ANE (haven't bought one, just considering) to test out bare-bones models locally for research purposes (I find remote debugging quite frustrating) before finally training them on the cloud: how good is the support with containerization solutions like Singularity, and does it even leverage the ANE?

I know the speedup won't really be anything drastic, but if it helps (i.e., is faster and more resource-efficient than the CPU/GPU), that just translates to a lower time-to-iterate anyway...

So for someone using plain PyTorch (w/ a few bells and whistles), how much of a pain would it be?
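FWIW, plain PyTorch doesn't touch the ANE at all today: its Apple backend is `mps`, which runs on the GPU via Metal, and getting onto the ANE generally means exporting the model to Core ML (e.g. via coremltools). A minimal device-selection sketch (assuming a recent PyTorch build; it falls back to CPU if torch isn't installed):

```python
# Minimal device-selection sketch. Note: PyTorch's "mps" backend targets the
# Apple GPU via Metal, NOT the ANE; reaching the ANE generally requires
# exporting the model to Core ML (e.g. via coremltools). Falls back to CPU
# when torch isn't installed, so the sketch stays self-contained.
try:
    import torch

    if torch.backends.mps.is_available():
        device = "mps"   # Apple-silicon GPU (still not the ANE)
    elif torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
except ImportError:
    device = "cpu"

print(f"iterating locally on: {device}")
```

Inside a Singularity container the same check applies, assuming the container runtime passes the Metal device through at all.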

1

Competitive-Rub-1958 t1_j3etbs5 wrote

I think HTM was doomed to fail from the very start; even Hawkins has distanced himself from it. The problem is that HTM/TBT are all iterations toward a better model of the neocortex, but it will definitely take quite a bit of time to really unlock all the secrets of the brain.

I think Numenta quickly realized that the path forward is going to be even longer than they thought (they've been researching for ~20 years now, IIRC), so they want to cash out quickly: hence their whole new push toward "integrating DL" with their ideas (spoiler: it doesn't work well; check out their latest paper on that) and toward sparsifying LLMs, an area where the NeuralMagic folks already lead quite a large part of the industry (see the recent paper: LMs can be pruned in one shot).

That argument of "If we'd put X resources into Y thing, we'd have solved ASI by now!" is quite illogical and applicable to literally every field. In the end, Numenta's work simply did not yield the results that Hawkins et al. were hoping for, and a lack of results is very tricky ground from which to attract the interest of other researchers. If HTM/TBT wants a comeback, it will have to be on the shoulders of some strong emergent abilities in their architectures...

7

Competitive-Rub-1958 t1_izo1n60 wrote

Assuming those objections were directed at my comment (as they seem to address it directly), and brushing past the antagonistic tone: my evaluation was neither unsystematic nor a conclusion reached from a few examples; suggesting so misrepresents the general consensus on this paper.

I wholeheartedly agree with you that LLMs should understand implicature 0-shot, but there are certain nuances here that seem to be ignored. What I was going for is simply this:

1> The paper should have compared their alternative prompt templates to CoT, especially since they explicitly mention CoT. The idea is quite clear - look at this paper, for instance. Complex tasks that usually involve disambiguating a chain of events ("I wore gloves" -> gloves cover the fingers -> they don't expose fingerprints -> therefore, the answer is Y/N) benefit greatly from CoT. It may seem like an insignificant demand, maybe even giving off some reviewer-2 vibes, but it seems reasonable to expect that a method that works on almost every task would have been tested here - merely out of scientific curiosity to observe the outcome had this template been incorporated.

2> More importantly: when you prompt your model k-shot, it does NOT reveal any context whatsoever about the actual target question. When you few-shot, you give the model completely independent examples of how to perform the task at hand, with no bearing on the actual question you ask. So it perceives "gloves" and the concept of fingerprints independently of the provided examples, which could be about bananas and groceries. Yet few-shot primes the LLM to better understand the task; there is plenty of literature exploring this interesting phenomenon (mostly attributed to a mix of ICL and statistical patterns).
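To make that concrete, here's a rough sketch (invented examples, not the paper's actual template) of how a k-shot prompt gets assembled; note the demonstrations share only the task *format* with the target question, never its content:

```python
# Sketch of k-shot prompt assembly: the demonstrations only show the task
# format; they share no content with the target question being asked.
demonstrations = [
    ("Are these bananas ripe?", "They are still green.", "no"),
    ("Did you buy the groceries?", "The store was closed.", "no"),
]
target = ("Did you leave fingerprints?", "I wore gloves.")

def build_prompt(demos, question, utterance):
    parts = []
    for q, u, ans in demos:
        parts.append(f"Q: {q}\nA: {u}\nImplicature: {ans}\n")
    # Target question appended last, with the label left blank for the model.
    parts.append(f"Q: {question}\nA: {utterance}\nImplicature:")
    return "\n".join(parts)

prompt = build_prompt(demonstrations, *target)
print(prompt)
```

"Gloves" never appears in the demonstrations; the k-shot examples only prime the model for the format of the task.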

This extremely important point wasn't mentioned in the paper at all; few-shot performance doesn't actually invalidate the claim that LLMs are human-aligned communicators. Hence why I noted above that there is only an ~5% accuracy gap between the average human and the few-shot LLM.

Lastly, no one's claiming ChatGPT is perfect. All I said was that I would like to see this tested on the latest iteration of RLHF models to see how it fares. It was in no way meant to denigrate the authors or the paper at hand, or to claim that ChatGPT can somehow perform tasks that GPT-3/InstructGPT cannot.

1

Competitive-Rub-1958 t1_izil2ps wrote

I feel this paper could've been written significantly more clearly and fairly. While I understand that the authors wanted a punchy title declaring "poor" 0-shot performance, it reads a bit like LLMs can't understand context or reason very well (just my impression and opinion, though).

From 4.2: the average human gets 86.2% correct; the best LLM gets 80.6% w/ natural language and 81.7% w/ a structured prompt, both few-shot.

My main gripe is that disambiguating implicature is fundamentally a reasoning task. Due to the inherent ambiguity, you have to create multiple hypotheses and test them to see which fits the best. With enough context, that task becomes simpler.

So they should've evaluated with chain-of-thought prompting. They even mention in the paper that they tried other prompt templates as alternatives to it, but they don't test w/ CoT? This is a very recent paper with some famous authors. We've seen CoT help on almost all tasks, including with U-shaped inverse scaling. I don't see why this task gets a pass.
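For illustration, a hypothetical CoT-style template for an implicature item; the worked reasoning chain here is my own invention, not the paper's:

```python
# Hypothetical CoT-style prompt for an implicature item. The demonstration's
# reasoning chain is an illustration, not the paper's actual template.
demo = (
    "Q: Did you leave fingerprints?\n"
    "A: I wore gloves.\n"
    "Reasoning: Gloves cover the fingers. Covered fingers leave no prints.\n"
    "Implicature: no\n"
)
target = (
    "Q: Are you coming to the party?\n"
    "A: I have to work late.\n"
    "Reasoning:"
)
# The prompt ends at "Reasoning:", so the model must generate the chain of
# hypotheses before committing to a yes/no implicature label.
prompt = demo + "\n" + target
print(prompt)
```

The point being: disambiguation here naturally decomposes into steps, which is exactly the regime where CoT tends to help.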

If someone tests this against ChatGPT to further probe the RLHF hypothesis, and with CoT, I'll be satisfied that understanding implicature 0-shot is indeed hard for LLMs.

88

Competitive-Rub-1958 t1_iwqmaic wrote

It does need more parameters to compensate (for instance, it has nearly a billion more parameters than GPT-J-6B without substantial performance gains) while losing out on LAMBADA (I'm ignoring the weighted average, as I don't really see the point of the weighting; it distorts the metrics).
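On the weighted-average point, a toy illustration with made-up numbers (not the actual benchmark scores) of how weighting can flip which model looks better:

```python
# Toy illustration (made-up numbers, not real benchmark scores) of how a
# weighted average can flip a model comparison relative to the plain mean.
tasks   = ["lambada", "piqa", "hellaswag"]
weights = [0.6, 0.2, 0.2]        # hypothetical per-task weights

model_a = [0.70, 0.80, 0.80]     # strong on the heavily weighted task
model_b = [0.60, 0.90, 0.90]     # stronger on the plain average

def mean(xs):
    return sum(xs) / len(xs)

def weighted(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))

# Plain mean favors B (0.800 vs ~0.767); the weighted mean favors A
# (0.74 vs 0.72). Same scores, opposite ranking.
print(mean(model_a), mean(model_b))
print(weighted(model_a, weights), weighted(model_b, weights))
```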

It's an extremely interesting direction, but I fear that as you scale this model, the scaling plot might start to flatten out, much like other RNN rewrites/variants. I hope further research can pinpoint the underlying issue and fix it. Till then, best of luck to OP! 👍

16

Competitive-Rub-1958 t1_irwn68x wrote

I definitely agree with you there, but I wouldn't take the LEGO paper's results at face value until other analyses confirm them. Basically, LEGO does show (in the appendix) that as you increase the sequence length, the model obtains more information about how to generalize to unseen lengths, with a clear trend (https://arxiv.org/pdf/2206.04301.pdf#page=23&zoom=auto,-39,737).

As the authors show, the pre-trained model also learns association and manipulation heads (if you add those at initialization to a randomly initialized model, you obtain the same performance as the pre-trained one). So the model effectively discovers a prior, just not one general enough for OOD generalization.

You're definitely right that the equivariance it learns is a shortcut. The difference is that, from the model's POV, it's not: it performs well w.r.t. the loss function, which is evaluated only on the training set.
But once you start giving it longer and longer sequences, its pre-existing priors push it toward evolving more general representations and priors.

And of course, as the paper said, the OOD failure is due to the positional encodings, so other positional encodings might have shown better results. Right now it's hard to judge, because there were no ablations over encodings (despite the paper mentioning them about 5 times).

2

Competitive-Rub-1958 t1_irs8vxl wrote

Even in the context of AGI, humans also carry many priors, most of them embedded in DNA as part of the fundamental "blueprint" of a cortical column.
It appears that, instead of relying on evolution, natural selection, and mutation, we may be able to learn those same priors faster and more efficiently with gradient-based methods.

https://twitter.com/gruver_nate/status/1578386103417069569 is a Twitter summary describing how the transformer learns positional equivariance within the scope of their dataset. This is quite a complex prior, and it's present implicitly in convolutions.

It makes sense to collate all our findings and conclude that with scale those priors simply become more general, hence the massive performance boosts, which are also predictable and haven't yet stalled (530B is a number thrown around everywhere, but people don't realize the insane amount of compute and work that went into it; it's absolutely humongous for any system to scale to that size, let alone still beat benchmarks).

I feel there are still more general priors we could embed in these models to make them more parameter-efficient. But it is clear that DL is currently the most viable route toward AGI.

2

Competitive-Rub-1958 t1_irj4p6e wrote

It's rather a trend they're trying to study and explain. It appears that as you scale models and bootstrap from pre-trained variants, you learn plenty of useful priors. This is quite crucial for LLMs, which can solve many tasks that may not be explicitly in their distribution, muddling their way along much better than models pre-trained from scratch. In that sense, transfer learning is much more about transferring priors than knowledge.

LLMs like Chinchilla and PaLM best demonstrate that, I suppose. PaLM was trained with 95% of its data being social media (which alone is 50%) and miscellaneous topics, and only 5% being the GitHub subset. Yet with 50x less code in its dataset, it's able to pull up to Codex.

This may hint at larger models learning more general priors applicable across a variety of tasks, with the trend highly correlated with scale. So I think the hope is that as you scale up, the priors these models learn capture the underlying function better, rather than shortcut-learning their way through. A good demonstration would've been fine-tuning GPT-3 on a sizeable chunk of the LEGO dataset and checking whether it generalizes better on those tasks.

2

Competitive-Rub-1958 t1_iret615 wrote

It goes to the heart of what OOD is, I suppose. But in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That, coupled with BERT's smaller pre-training dataset, lends more credence to the idea that pre-training introduces priors that chop through the hypothesis space, rather than the model simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyway).

2

Competitive-Rub-1958 t1_ir7f6mk wrote

Well, scaling improves OOD generalization, while clever pre-training induces priors into the model, shrinking the hypothesis space and pushing the model toward generalizing further and further OOD by learning the underlying function rather than taking shortcuts (since those priors resist simply learning statistical regularities).

The LEGO paper demonstrates that quite well, even showing pre-trained networks generalizing a little on unseen sequence lengths before diving down to 0, presumably because we still need to find the ideal positional encodings...

2