Competitive-Rub-1958 t1_jd40cwb wrote on March 21, 2023 at 5:56 PM

Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap

would that mean for forcing MHA to use it, I should wrap the ctxmanager around the line where I forward through it?

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True):
            x = x + self.attn_head(x, x, x, need_weights=False)[0]

because that doesn't really seem to work :(

Competitive-Rub-1958 t1_jcn8bti wrote on March 18, 2023 at 1:32 AM

Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap

cool. I just wanted to make it explicit to make sure I'm running `FlashAttention`. Perhaps there's an easy way to check that?

Competitive-Rub-1958 t1_jcm5ahk wrote on March 17, 2023 at 8:50 PM

Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap

cool! So I just need to enable `flash_sdp`, then ensure I'm basically computing self-attention and have `batch_first=True`. Would that be correct?

Competitive-Rub-1958 t1_jcl97q0 wrote on March 17, 2023 at 5:22 PM

Reply to comment by No-Belt7582 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap

> Either autograd is disabled (using torch.inference_mode or torch.no_grad) or no tensor argument requires_grad > training is disabled (using .eval())

What's the point of FlashAttention if you can't use it during training? 🤔

https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html

Competitive-Rub-1958 t1_jccyreq wrote on March 15, 2023 at 10:59 PM

Reply to [N] PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever by [deleted]

I think I may be reading things wrong here, but FlashAttention is only for calculating basic scaled QKV attention, not embedded inside their MHA module?

Competitive-Rub-1958 t1_j92lak6 wrote on February 18, 2023 at 7:27 PM

Reply to comment by PassionatePossum in [D] Coauthor Paper? by [deleted]

What about for top-tier conferences/journals? would top-3 be generally viewed as decent, or does you just have to be the first author? 🤔

Competitive-Rub-1958 t1_j7dhy93 wrote on February 6, 2023 at 12:05 AM

Reply to comment by Freed4ever in [N] "I got access to Google LaMDA, the Chatbot that was so realistic that one Google engineer thought it was conscious. First impressions" by That_Violinist_18

Google is a leader in DL research. That's a fact. They chose to keep most of their research internal because as above commenters said, they don't have much to gain through it - marketing and hype lasts only so long.

> It's about the UX

what UX? its just a normal frontend mate

> scalability

You do realize Google were serving LLMs before OAI was even hypothesized? Or that they have TPUs which are far more scalable and cost efficient, which could already rip major players apart.

> liability

OAI have fought nothing liability or legality-wise. They just remain in a gray area and hopes no one focuses on them (bad luck, they got caught in the AI art lawsuits too)

Competitive-Rub-1958 t1_j6z8a7t wrote on February 2, 2023 at 10:51 PM

Reply to comment by TheDeviousPanda in [D] Apple's ane-transformers - experiences? by alkibijad

For someone who simply wants to use ANE (haven't bought it, just considering) for testing out bare-bones models locally (I find remotely debugging quite frustrating) for research purposes before finally training them on cloud, how good is the support with Containerization solutions like Singularity - does it even leverage ANE?

I know the speedup won't really be anything drastic, but if it helps (is faster and more resource efficient than the CPU/GPU) then that just translates to a lower time-to-iterate anyways...

So for someone using plain PyTorch (w/ a bells and whistles), how much of a pain would it be?

Competitive-Rub-1958 t1_j3etbs5 wrote on January 8, 2023 at 1:35 AM

Reply to comment by rduke79 in [Discussion] Is there any alternative of deep learning ? by sidney_lumet

I think HTM was doomed to fail from the very start; even Hawkins has distanced himself from it. The problem is that HTM/TBT all are iterations for a better model of the neocortex. But it would definitely take quite a bit of time to really unlock all the secrets of the brain.

I think Numenta quickly realized that the path forward is going to be even longer than they thought (they've been researching ~20 years now IIRC) so they're wanting to quickly cash out - their whole new push towards "integrating DL" with their ideas (spoiler: doesn't work well, check out their latest paper on that) and working on sparsifying LLMs - something which the NeuralMagic folks already lead quite a huge part of the industry by (see the recent paper: LMs can be pruned in one-shot).

That argument of "If we'd put X resources in Y thing, we'd have solved ASI by now!" is quite illogical and applicable to literally every field. In the end, Numenta's work simply did not yield the results that Hawkins et al. were hoping to get. No results is a very tricky grounding to attract the interests of other researchers. If HTM/TBT wants a comeback, it would have to be on the shoulders of some strong emergent abilities in their architectures...

Competitive-Rub-1958 t1_j2mxgnz wrote on January 2, 2023 at 3:00 PM

Reply to comment by hollow_sets in [D] What do you do while you wait for training? by hollow_sets

use `%` modulo to do a eval check before you start training (i.e 0th step). Saves a ton of time to debug, because something always goes wrong.

Competitive-Rub-1958 t1_j2hy4kg wrote on January 1, 2023 at 1:13 PM

Reply to comment by MrAcurite in [D] Is there any research into using neural networks to discover classical algorithms? by currentscurrents

Incidentally, that task has already been solved (https://twitter.com/arpitbansal297/status/1580922302543167488?cxt=HHwWgICgyaidyPArAAAA) They can OOD generalize to novel, unseen mazes of arbitrary sizes as long as they compute for more iterations at test-time!

Competitive-Rub-1958 t1_izo1n60 wrote on December 10, 2022 at 3:36 PM

Reply to comment by egrefen in [R] Large language models are not zero-shot communicators by mrx-ai

Assuming those objections were directed towards my comment (as they seem to directly address them) and brushing over the antagonistic tone, I have no doubt that your evaluation was not systematic nor am I reaching conclusions with a few examples - that's misrepresenting what the general consensus towards this paper is.

I wholeheartedly agree with you that LLMs should 0-shot understanding implicature, but there are certain nuances here that seem to be ignored. What I was going for is simply this:

1> The paper should have compared their alternative prompt templates to CoT, especially if you explicitly mention CoT. The idea is quite clear - look at this paper for instance. Complex tasks which usually involve disambiguating a chain of events ("I wore gloves" -> gloves cover the finger -> they don't expose fingerprints -> therefore, the answer is Y/N) benefit greatly from CoT. It may seem like an insignifcant demand, maybe even some reviewer-2 vibes here but it seems reasonable to expect that a method that works on almost every task should have been tested here - merely out of scientific curiosity to observe the outcome had this template been incorporated.

2> And more important - when you prompt your model k-shot, it does NOT reveal any context whatsoever about the actual, target question. when you few-shot, you give it completely independent examples of how you perform the task at hand with no bearing to the actual question you ask. So it would perceive "gloves" and concept of fingerprints independently to the provided examples, which could be about bananas and groceries. Yet Few-shot primes the LLM for better understanding this task, there is so much literature exploring this interesting phenomena (mostly attributed to a mix of ICL and statistical patterns).

This extremely important point wasn't mentioned in the paper at all; few-shot doesn't actually invalidate LLMs not being human-aligned communicators. Hence why I quoted above there being an ~5% difference in accuracy between average human and Few-shot LLM.

Lastly, No one's claiming ChatGPT is perfect. All I mentioned was that I would like to see it being tested on that latest iteration of RLHF models and see how it fares. It was in no way meant to denigrate the authors or the paper at hand, or expressing some claim that ChatGPT can somehow perform tasks that GPT3/InstructGPT cannot.

Competitive-Rub-1958 t1_izil2ps wrote on December 9, 2022 at 11:08 AM

Reply to [R] Large language models are not zero-shot communicators by mrx-ai

I feel this paper could've been written significantly more clearly and fairly. While I do understand that the authors wanted to create a punchy title declaring "poor" 0-shot performance, it reads slightly a bit like LLMs can't understand context or reason very well (this is just my impression and opinion though).

From 4.2, The average human gets 86.2% correct - the best LLM gets 80.6% w/ natural language, and 81.7% with a structured prompt, both few-shot.

My main gripe is that disambiguating implicature is fundamentally a reasoning task. Due to the inherent ambiguity, you have to create multiple hypotheses and test them to see which fits the best. With enough context, that task becomes simpler.

So they should've evaluated with Chain-of-thought prompting. They even mention it in the paper they try finding other prompt templates as alternatives to it - but don't test w/CoT? This is a very recent paper, with some famous authors. We've seen CoT help in almost all tasks - additionally, with U-shaped inverse scaling too. I don't see why this task gets a pass.

If someone tests this against ChatGPT to further confirm the RLHF hypothesis, and against CoT, I will be satisfied that understanding implicature 0-shot is indeed hard for LLMs.

Competitive-Rub-1958 t1_iwqmaic wrote on November 17, 2022 at 4:53 PM

Reply to comment by ChuckSeven in [R] RWKV-4 7B release: an attention-free RNN language model matching GPT-J performance (14B training in progress) by bo_peng

It does need more parameters to compensate (For instance, it has nearly a billion more parameters than GPT-J-6B without substantial performance gains) while losing out on LAMBADA (Ignoring the weighted average as I don't really understand the point of weighing it, since it distorts the metrics).

Its an extremely interesting direction, but I fear as you scale this model the scaling plot might start to flatten out - much like other RNN rewrites/variants. Hope further research is able to pinpoint the underlying issue and fix it. Till then, best of luck to OP! 👍

Competitive-Rub-1958 t1_isorvsw wrote on October 17, 2022 at 4:05 PM

Reply to comment by sambiak in [D] Now that Colab has introduced "compute units". Which are the best free/cheap alternatives? by zuccoff

Get the Pro subscription then? that's a flat fee

Competitive-Rub-1958 t1_irwn68x wrote on October 11, 2022 at 4:23 PM

Reply to comment by yldedly in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue

I definitely agree with you there, but I wouldn't take the LEGO paper results on face value until other analyses confirm it. Basically, LEGO does show (appendix) that as you increase the sequence length, the model obtains more information about how to generalize to unseen lengths with a clear trend (https://arxiv.org/pdf/2206.04301.pdf#page=23&zoom=auto,-39,737)

As the authors show, the pre-trained model also learns an Associative and manipulation head (if you add those at initialization to a randomly-init model, you obtain same perf as pre-trained one) So the model effectively discovers a prior - just not general enough for OOD generalization.

You're definitely right in that the equivariance it learns it a shortcut. The difference is, from the model's POV its not. It performs well w.r.t the loss function which is evaluated only on the training set.
But once you start giving it longer and longer sequences, it's pre-existing priors act towards more evolving more general representations and priors.

And ofc, as the paper said that its OOD due to positional encodings - so if they'd used some other positional encodings it might've been showing better results. Right now, its hard to judge because there were no ablations for encodings (despite the paper mentioning them like 5 times)

Competitive-Rub-1958 t1_irs8vxl wrote on October 10, 2022 at 5:22 PM

Reply to comment by yldedly in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue

Even in the context of AGI, humans also carry many priors - most of them embedded in the DNA pertaining to the fundamental "blueprint" of a cortical column.
It appears that instead of evolution, natural selection and mutation if we can learn those same priors faster and more efficiently that natural selection with gradient based methods.

https://twitter.com/gruver_nate/status/1578386103417069569 is a twitter summary describing how the transformer learns positional equivariance in the scope of their dataset. This is quite a complex prior, and is present in convolutions implicitly.

It makes sense to collate all our findings, and think that with scale those priors simply become more general - hence why we obtain such massive performance boosts which are also predictable and haven't yet stopped progress (530B is a number thrown around everywhere, but people don't realize the insane amount of compute and work which went into it. It's absolutely humongous for any system to scale to that size, let alone still be able to beat benchmarks)

I feel there are still more general priors we could embed in these models to make them more parameter efficient. But it is clear that DL is still currently the most viable route towards AGI as of now.

Competitive-Rub-1958 t1_irj4p6e wrote on October 8, 2022 at 4:45 PM

Reply to comment by yldedly in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue

It's rather a trend they're trying to study and explain. It appears, as you scale models and bootstrap from pre-trained variations, you learn plenty of useful priors. this is quite crucial for LLMs which are able to solve many tasks which may not be explictly in their distribution, but are able to muddle their way along much better rather than being pre-trained from scratch. In that sense, transfer learning is much more about transferring priors than knowledge.

LLMs like Chinchilla and PaLM best demonstrate that, I suppose. PaLM was trained with 95% of that data being Social Media (which is 50% alone) and miscellaneous topics, only 5% being the GitHub subset. Yet with 50X less code in its dataset, its able to pull up to Codex.

This may hint towards larger models learning more general priors applicable on a variety of tasks, and this trend being highly correlated with scale. So, I think the hope is that as you scale up the priors these models learn the underlying function better rather than just shortcut learning their way. A good demonstration would've been fine-tuning GPT3 with a sizeable chunk of the LEGO dataset and checking if it has higher generalizability on those tasks.

Competitive-Rub-1958 t1_iret615 wrote on October 7, 2022 at 3:00 PM

Reply to comment by yldedly in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue

It goes into the heart of what OOD is, I suppose - but in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That coupled with BERT's smaller pre-training dataset lends more credence to the idea of pre-training introducing priors to chop through the hypothesis space rather than simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyways)

Competitive-Rub-1958 t1_irai5ot wrote on October 6, 2022 at 3:27 PM

Reply to comment by yldedly in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue

https://arxiv.org/abs/2206.04301

Competitive-Rub-1958 t1_ir7f6mk wrote on October 5, 2022 at 9:44 PM

Reply to comment by yldedly in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue

Well, scaling alleviates OOD generalization while cleverly pre-training induces priors into the model, shrinking the hypothesis space and pushing the model towards being able to generalize more and more OOD by learning the underlying function rather than taking shortcuts (since those priors resist simply learning statistical regularities).

The LEGO paper demonstrates that quite well - even demonstrating pre-trained networks being able to generalize a little on unseen seqlen before diving down to 0 - presumably because we still need to find the ideal positional encodings...