Memorizing Kings Won’t Bring Us Closer to AGI
Our grandparents had to memorize all the Visigothic kings in order (25 years later I still vaguely remember my grandpa proudly reciting them). While this task played a role in structuring their brains, there is a limit to how much it made them smarter or better at cognitive tasks. Then the Internet arrived, and people claimed we “youngsters” would ruin our memories by relying on Google for everything. Instead, we’ve simply become more efficient at using our cognitive power: we use our minds for reasoning and retrieve the information we need from external sources.
We are currently training models on the equivalent of the entire internet (15T tokens in the case of Llama 3, which is roughly equivalent to 170 million books). Much of this data is low-quality, and I would argue it helps models about as much as memorizing all the Visigothic kings helped our grandparents. It’s good to some extent, but beyond that point it wastes energy and time.
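As a quick sanity check on that comparison, here is the arithmetic implied by the two figures above. The 15T and 170 million numbers come from the text; the words-per-token ratio is my own rule-of-thumb assumption for English, not something measured on Llama 3's tokenizer.

```python
# Back-of-the-envelope check of the "15T tokens ~ 170 million books" comparison.
TRAINING_TOKENS = 15_000_000_000_000  # 15T tokens (figure quoted above)
BOOKS = 170_000_000                   # 170 million books (figure quoted above)

# ASSUMPTION: roughly 0.75 English words per token, a common rule of thumb.
WORDS_PER_TOKEN = 0.75

tokens_per_book = TRAINING_TOKENS / BOOKS
words_per_book = tokens_per_book * WORDS_PER_TOKEN

print(f"Implied tokens per book: {tokens_per_book:,.0f}")
print(f"Implied words per book (assumed ratio): {words_per_book:,.0f}")
```

Under that assumption, the comparison implies books of roughly 66,000 words each, which is in the range of an ordinary novel, so the two figures are at least consistent with each other.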
My hypothesis is that similar levels of intelligence can be achieved by identifying the minimum set of key tokens the model needs in order to memorize the essentials and learn to reason, allowing advanced intelligence to emerge. This raw reasoning power is then complemented by a large context window (think of Gemini’s 1M-token context being used to learn a new language on the fly) and smart retrieval-augmented generation (RAG, not a new idea!). In other words, we let the model do inference like an open-book exam: “Hey, you’re smart enough to figure it out; here are all the world’s sources.” I know the counter-argument is that we haven’t yet seen diminishing returns from scaling up data and compute, but I believe the idea still holds: high-quality data and efficient use of compute will trump low-quality sources like subreddits.
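To make the open-book exam idea concrete, here is a minimal sketch of the retrieval half of a RAG setup. Everything in it is illustrative: the function names (`retrieve`, `build_prompt`), the toy documents, and the bag-of-words cosine similarity are my own assumptions standing in for a real embedding model and vector store, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query (hypothetical helper)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def build_prompt(query, documents, k=2):
    """Assemble the 'open book' prompt: retrieved sources first, then the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Use only these sources:\n{context}\n\nQuestion: {query}"

# Toy corpus, made up for illustration only.
DOCS = [
    "Alaric I is usually listed as the first king of the Visigoths.",
    "Llama 3 was trained on about 15T tokens.",
    "Retrieval-augmented generation fetches documents at inference time.",
]

print(build_prompt("Who was the first king of the Visigoths?", DOCS, k=1))
```

The point of the design is that the model never has to memorize the list of kings; it only needs enough reasoning ability to use whatever the retriever puts in front of it.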
Microsoft’s Phi family of models is attempting this approach. They are fed high-quality tokens distilled by more powerful models like GPT-4 (think of a large model training a small one). Combine this with a good RAG system, and voilà. If this challenge is solved, the scaling law curves will shift left, meaning the same capability for less data and compute. That would open up competition beyond the large AGI labs with billions in funding and level the playing field.

