Monthly Archives: March 2016

FoxDB stores many things, but HedgehogDB stores one big thing

Let's say you want to store what you believe to be a ludicrous amount of data. It may or may not actually be a ludicrous amount of data. It could really be a lot, or it could just seem that way because of your current setup (I just want to serve 5 TB).

There's really one question — are you storing many things, or are you storing one big thing?

If you are building a whitebox SaaS platform where one customer's data doesn't interact with another customer's data, you are storing many things. It might be petabytes overall, but as long as each individual customer stays under a certain size, it's not that much of a problem. You're running five thousand different stores — which admittedly has its own problems — but each of those stores is sane. You're doing the equivalent of adding more directories.
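
To make it concrete: every customer gets its own namespace, and no query ever touches more than one customer's data. A minimal sketch, Postgres-flavored, with made-up names:

CREATE SCHEMA customer_4821;

CREATE TABLE customer_4821.events (
    id          BIGSERIAL PRIMARY KEY,
    happened_at TIMESTAMP,
    payload     TEXT
);

-- five thousand customers means five thousand of these, each one a sane size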

The inspiration for this post was a Microsoft Azure ad about how the cloud helped store police dashcam video. That's storing many things. The files are only so large, and each one is a clear, discrete unit. Even if you're going through all of it to get any kind of aggregate information, it's easily batchable.

The web is a big thing. The web links to itself in a non-predictable way — everything is talking to everything else. Any analysis is going to be on a page in relation to all the other pages it's related to, and those pages can be anywhere. You're not going to store five levels of depth-first link search for every page, because that's an insane amount of storage, and at some point you'll need six levels. Random seeks are the enemy, but there's no way around it.
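
To put rough numbers on it, here's a sketch assuming a hypothetical links(src_url, dst_url) table and something like 50 outbound links per page:

WITH RECURSIVE reach AS (
    SELECT src_url, dst_url, 1 AS depth
    FROM links
  UNION ALL
    SELECT r.src_url, l.dst_url, r.depth + 1
    FROM reach r
    JOIN links l ON l.src_url = r.dst_url  -- every hop is another round of random seeks
    WHERE r.depth < 5
)
SELECT * FROM reach;

-- at ~50 links per page, that's on the order of 50^5 (about 312 million) paths per page,
-- and the day you need depth six, none of it helps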

The Facebook Graph is a big thing. Everybody knows everybody, or at least has a non-zero chance of knowing them. It used to be many things — the school networks.

Ten years ago, just increasing storage of unrelated items was the hard problem. Now, it's merely annoying. What's the step that makes storing and analyzing huge, complex, interconnected items easier?


Math and Shakespeare, one at a time please

The answer is the 15th.

If you do a deep, head-scratching analysis of the Ides of March, it's Ides + of + March = half of March. March has 31 days, so splitting it in half gives you fifteen days on either side of the middle day, which is the 16th.

This quiz was 30 minutes long. I spent 25 of them on this question, because I remembered 15 but kept getting 16 by hand.

So that's why I got an A- in Sophomore English, admissions council. We studied Julius Caesar and I used my brain.

My unfeasible dream of a data processing platform

I build a lot of charts and dashboards. Sometimes the numbers are wrong. This is the worst thing in the entire world.

Why is it wrong? Well, let's just look through the thirty or so different data sources we have; surely one of those will have an obvious error! No? Let's look at the data sources that populate those data sources! Surely we will have access to all of them, and they will be in a reasonable format, and the bizarre interactions between the different flavors of string processing and date processing layered on over a decade or so by different people will be easy to untangle!

If you're looking at this kind of disaster, you've done at least one thing right. You probably have a pretty robust data warehouse platform, because you're fucking up at scale. If you don't, everything fell to pieces a long time ago, back when you had to manage your own database servers and disk handling and...

Back to the disaster.

Imagine you could trace everything. Imagine we have made up tables like this:

SELECT * FROM enormous_table;

id | a      | b  | c    | foreign_id
1  | 122.13 | -1 | 0.32 | 1

SELECT * FROM another_table;

id | a        | e | f    | foreign_id
1  | 944.1311 | 2 | true | 1

INSERT INTO combo_table (a, foreign_id)
SELECT SUM(enormous_table.a + another_table.a), enormous_table.foreign_id
FROM enormous_table INNER JOIN another_table
  ON enormous_table.foreign_id = another_table.foreign_id
GROUP BY enormous_table.foreign_id;

And imagine we store all the history and origin on every row. When the time comes to read from the combo table, we have all the history.

SELECT ORIGIN FROM combo_table WHERE foreign_id = 1;

id | a
1  | 1066.2611
||
||
==== SUM
      || == enormous_table
             1 | 122.13 | -1 | 0.32 | 1
             || == INSERT INTO enormous_table 
                   1 | 122.13 | -1 | 0.32 | 1
      || == another_table
             1 | 944.1311 | 2 | true | 1
             || == INSERT INTO another_table
                   1 | 944.1311 | 2 | true | 1

And then you have that for every. Single. Row. Problem solved. You can look up where everything went wrong.

It's impossible to do, I think. No matter how I go about it, I wind up with a Schlemiel the Painter problem - doing one more thing involves doing everything before it, and then the one new one, who lived in the house that Jack built. How many steps were involved?

There's a record for each step. That record either has to carry all the records before it, or a pointer to its parents. Storing all the records gets insane quickly. Pointers mean exploding disk seeks.
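
The pointer version would look something like this. A sketch, not a real schema; the table and column names are invented:

CREATE TABLE row_origin (
    table_name        TEXT,
    row_id            BIGINT,
    operation         TEXT,    -- 'SUM', 'INSERT', and so on
    parent_table_name TEXT,
    parent_row_id     BIGINT
);

-- rebuilding the history of one combo_table row means walking the parent
-- pointers level by level, and every level is another round of seeks
WITH RECURSIVE history AS (
    SELECT * FROM row_origin
    WHERE table_name = 'combo_table' AND row_id = 1
  UNION ALL
    SELECT o.*
    FROM row_origin o
    JOIN history h
      ON o.table_name = h.parent_table_name
     AND o.row_id = h.parent_row_id
)
SELECT * FROM history;

And that's for one row. Do it for billions of rows and the seeks become the whole workload.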

It would be great, though.

Reverse engineering Facebook's growing pains

TRIGGER WARNING: Ivy League humblebragging

Facebook's Graph API Explorer gave me a little bit of insight into what their process must have been like when they were first getting big. Right now, if you make a new account, your id number will be very long and won't have much correlation with anything. If your friend made an account at the same time, your numbers would be very different.

My id # is around 120,000. If you were at Columbia and got your Facebook account at the same time I did, your account id would be in that range. It's true for my friends. Generally, within that range, the older the account, the lower the number; the newer, the higher.

I worked with someone who went to Cornell; his id number was around 450,000. He'd had his account for a year longer than I had. His friends clustered the same way, just around that number.

Clearly, at some point Facebook tried to carve up the id space and shard based on that -- surely, no one would have friends from other schools! (It's worth remembering that you originally needed a college email, and that the school networks used to be a lot tighter than they are now.)
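
If I had to guess at the shape of it (and this is a guess, not anything Facebook has published), it was probably the moral equivalent of a range lookup:

-- made-up boundaries, for illustration only
CREATE TABLE id_ranges (
    network  TEXT,
    min_id   BIGINT,
    max_id   BIGINT
);

INSERT INTO id_ranges VALUES
    ('columbia', 100000, 199999),
    ('cornell',  400000, 499999);

-- which shard does user 120000 live on?
SELECT network FROM id_ranges WHERE 120000 BETWEEN min_id AND max_id;

Which works great right up until people start friending across schools, and every query has to fan out across shards.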