Let's say you want to store what you believe to be a ludicrous amount of data. It may or may not actually be a ludicrous amount. It could really be a lot, or it could just seem that way because of your current setup (I just want to serve 5 TB).
There's really one question — are you storing many things, or are you storing one big thing?
If you are building a whitebox SaaS platform where one customer's data doesn't interact with another customer's data, you are storing many things. It might be petabytes overall, but as long as each individual customer stays under a certain size, it's not that much of a problem. You're running five thousand different stores — which admittedly has its own problems — but each of those stores is sane. You're doing the equivalent of adding more directories.
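A minimal sketch of that "adding more directories" idea, assuming a hypothetical per-customer file layout (the function names and directory scheme here are made up for illustration) — each customer gets its own directory, and adding a customer is just adding another directory:

```python
import json
import os


def store_path(root, customer_id):
    # Each customer's data lives in its own directory. The platform
    # as a whole may hold petabytes, but each store stays a sane size.
    return os.path.join(root, f"customer_{customer_id}")


def put(root, customer_id, key, value):
    # Write one record into one customer's isolated store.
    path = store_path(root, customer_id)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, key), "w") as f:
        json.dump(value, f)


def get(root, customer_id, key):
    # Reads never cross a customer boundary.
    with open(os.path.join(store_path(root, customer_id), key)) as f:
        return json.load(f)
```

Because customers never touch each other's data, nothing stops you from putting different customers on different machines with zero coordination between them — that's what makes "many things" the easy case.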
The inspiration for this post was a Microsoft Azure ad about how the cloud helped store police dashcam video. That's storing many things. The files are only so large, and each one is a clear, discrete unit. Even if you're going through all of it to get some kind of aggregate information, it's easily batchable.
The web is a big thing. The web links to itself in an unpredictable way — everything is talking to everything else. Any analysis is going to be on a page in relation to all the other pages it's related to, and those pages can be anywhere. You're not going to store five levels deep of depth-first link search, because that's an insane amount of storage, and at some point you'll need six levels deep. Random seeks are the enemy, but there's no way around it.
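Some back-of-the-envelope arithmetic makes the point — the branching factor of 50 outbound links per page is an assumption for illustration, not a measured number:

```python
# Hypothetical average of 50 outbound links per page.
links_per_page = 50
levels = 5

# Pages reachable within five levels of a single seed page:
# 50 + 50**2 + ... + 50**5.
reachable = sum(links_per_page ** d for d in range(1, levels + 1))
print(reachable)  # hundreds of millions of entries, per seed page
```

And that's precomputed for one page; materializing it for every page multiplies the cost again, which is why you end up chasing links at query time instead.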
The Facebook Graph is a big thing. Everybody knows everybody, or at least has a non-zero chance of knowing them. It used to be many things — the school networks.
Ten years ago, scaling up storage of unrelated items was a hard problem. Now, it's merely annoying. What's the next step — the one that makes storing and analyzing huge, complex, interconnected items easier?