Additional-Escape498

Additional-Escape498 t1_j9yorbo wrote

You’re definitely right that it can’t do those things, but I don’t think it’s because of the tokenization. The wordpieces do contain individual characters, so it is possible for a model to do that with the wordpiece tokenization it uses. The issue is that the things you’re asking for (like writing a story in Pig Latin) require reasoning, and LLMs are just mapping inputs to a manifold. LLMs can’t really do much reasoning or logic and can’t do basic arithmetic. I wrote an article about the limitations of transformers if you’re interested: https://taboo.substack.com/p/geometric-intuition-for-why-chatgpt

1

Additional-Escape498 t1_j9w2ix6 wrote

LLM tokenization uses wordpieces, not words or characters. This has been standard since the original “Attention Is All You Need” paper, which introduced the transformer architecture in 2017. Vocabulary size is typically between 32k and 50k depending on the implementation. GPT-2 uses about 50k. The vocabularies include each individual ASCII character plus commonly used combinations of characters. Documentation: https://huggingface.co/docs/transformers/tokenizer_summary

https://huggingface.co/course/chapter6/6?fw=pt
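
To make the wordpiece idea concrete, here is a minimal sketch of greedy longest-match subword tokenization in the style of WordPiece. The vocabulary (toy_vocab) is made up for illustration and the function name is my own; real tokenizers learn vocabularies of tens of thousands of entries, and GPT-2 in particular uses a byte-level BPE variant rather than this exact procedure.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matched, not even a single character.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary: a few common pieces plus single characters.
toy_vocab = {
    "token", "##ization", "##s", "pig",
    "t", "o", "k", "e", "n", "##g", "##y", "##i", "##z", "##a",
}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("piggy", toy_vocab))         # ['pig', '##g', '##y']
```

Because single characters are in the vocabulary, a rare word falls back to character-level pieces instead of becoming unknown, which is why the character information is still available to the model in principle.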

4

Additional-Escape498 t1_j9vqmlh wrote

For a small dataset, still use cross-validation, but make it k-fold cross-validation: instead of dividing the dataset into three parts (train/validation/test), divide it into just two (train/test) and let the k-fold procedure subdivide the training set. Scikit-learn already has a class for this (KFold) that makes it simple. Since you have a small dataset and are using fairly simple models, I’d suggest setting k >= 10.
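
Here is a rough standard-library sketch of what that split looks like, in case it helps. The dataset size (50), the test size (10), and the helper name k_fold_indices are all assumptions for illustration; in practice scikit-learn's KFold handles the fold bookkeeping for you.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread samples so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Hypothetical small dataset of 50 samples, referred to by index.
all_idx = list(range(50))
random.Random(42).shuffle(all_idx)
test_idx = all_idx[:10]     # held out once, never used for model selection
train_pool = all_idx[10:]   # the 40 samples that k-fold subdivides

folds = list(k_fold_indices(len(train_pool), k=10))
for fold_train, fold_val in folds:
    train_samples = [train_pool[i] for i in fold_train]
    val_samples = [train_pool[i] for i in fold_val]
    # Fit on train_samples, score on val_samples, then average the k scores.
```

Each sample in the training pool is used for validation exactly once, and the test set is only touched at the very end to report the final number.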

3

Additional-Escape498 t1_j9rq3h0 wrote

EY tends to go straight to superintelligent AI robots making you their slave. I worry about problems that’ll happen a lot sooner than that. What happens when we have semi-autonomous infantry drones? How much more aggressive will US/Chinese foreign policy get when China can invade Taiwan with BigDog robots with machine guns attached? What about when ChatGPT has been combined with Toolformer and can write to the internet instead of just reading it, and can start doxxing you when it throws a temper tantrum? What about when rich people can use something like that to flood social media with bots that spew disinformation about a political candidate they don’t like?

But part of the lack of concern for AGI among ML researchers is that during the last AI winter we rebranded to machine learning because AI was such a dirty word. I remember as recently as 2015 at ICLR/ICML/NIPS you’d get side-eye for even bringing up AGI.

193