3 Comments

Re the data wall: while it is a constraint on building systems larger than, say, GPT-6/GPT-7, I think it is currently only the third most important constraint. The most immediate constraints are power and chips, so until 2028-2030 we can assume with reasonable probability that the data wall isn't much of a constraint in practice.

See here for Epoch's analysis, including the picture:

https://x.com/EpochAIResearch/status/1826038739015184463

author

Yeah, those guys are doing good work. Just glancing at the picture, it looks like data has the largest error bars by far, which seems right. It also seems like there'd be a lot of schlep, which is often underestimated. I may write more in conversation with their analysis at some point, though; I do think it's the thing to respond to on this question.

Nov 2·edited Nov 2

For my own inside view, I am also a synthetic data optimist in domains like programming and mathematics, and at the very least I think many of the claims that synthetic data poisons models are greatly exaggerated.

One main issue is that the paper that reported synthetic data causing models to collapse made a very questionable assumption that, if invalidated, invalidates the conclusion: that the real internet data would be deleted once the synthetic data is generated. But there's no reason to do this.

See these tweets and sources for more:

https://x.com/RylanSchaeffer/status/1849145985710399735

https://x.com/aparnadhinak/status/1819414522266091827

https://x.com/RylanSchaeffer/status/1816881533795422404

https://www.lesswrong.com/posts/CAKdA8wrDFjEAuPzJ/musings-on-text-data-wall-oct-2024

Agree that the schlep will in practice slow down solutions, which is why I have reasonably high probability mass on post-2030 timelines for AI automating everything.

That said, I think the data wall is likely to be solved by default, though I'm not certain about this.

Also, thanks for responding.
