So how do you store a lot of data if there is already over your head? The simplest answer is: partition horizontally, in other words divide and conquer. In horizontal partitioning each data instance is stored on a single storage node, so you can increase your total capacity by adding more nodes.
For horizontal partitioning to work data instances have to have no dependency between each other. If this condition is satisfied, there is no need to visit multiple storage nodes in order to reconcile a single data instance.
Sounds easy, ah? But not as it seems…
First of all, data instances have to be evenly distributed among storage nodes, otherwise some storage nodes will run out storage capacity. Next, when a data instance arrives to the system, we must be able to identify it, so that we can refer to it by an ID or a name. We also need to be able to rebalance the system whenever some node runs out of capacity. And finally it would be nice to maintain some redundancy in the system, so that when some node fails, the data it contains is not lost without a trace.
We can employ several interesting strategies in order to address some or all of these problems. I’ll give you some ideas.
This strategy yields the most optimal allocation, as it is possible to observe current location of each data instance. Think about a centralized database that keeps the location of every data instance in the system. When a new data instance arrives (e.g. file is uploaded) you insert a record into this database, specifying data instance’s key and new location. Whenever an instance is requested, you look up a record and follow specified location in order to retrieve the data instance.
It is possible to rebalance your entire system, because you are able to observe its overall state. For example, when certain node runs out of disk space, you might start a background process that moves instances away from that node. In order to do that you just need to copy an instance to a new node and update its location in centralized database. After this operation is complete, the original copy of moved data instance can be erased.
The more sofisticated your allocation algorithm is, the greater capacity you can get out of the same system. Some nodes might have some specifics, e.g. be configured for latency rather than throughput by having solid-state drives installed. You can make your allocation algorithm take that into account and populate such nodes with small files. Your system would then respond faster to requests for small files and greater capacity will be available for large files.
There are several implementations of distributed storage systems with centralized accounting. One of them is MogileFS and for nginx there is mogilefs module. MogileFS is not trivial to use, but once set up it works well for small to medium size systems.
There is, however, a huge deficiency in such kind of systems: they have a single point of failure. And it is… the centralized database! The whole system depends on this database, so when it is down or responding too slow, no matter how many storage nodes you have, the entire system experiences problems. There are some approaches to solve that, but let’s first see how else we can organize our distributed storage.
Centralized accounting requires talking to a centralized database every time you want to do something with a data instance. Could it be avoided? Yes! We just need to decide on the storage node by using limited amount of information contained in the request that came to the system. Once we obtained a hash of a request it’s easy to turn it into a node address by just computing a modulo of the hash and mapping it via a static table of installed nodes (the divisor in module is the number of available nodes). Good hash functions produce uniform distribution of hashes and thus creates relatively uniform distribution of data instances over storage nodes.
The system’s configuration, however, becomes very rigid: if you install more nodes or downsize your installation, the divisor in modulo will change and mapping of hashes to nodes will be different! Of course, you can rebalance your system, but it’s not as easy as with centralized database.
For permanent storages this disadvantage is practically intolerable. But for transient storages or write-only storages it’s quite okay! There are many application that you can use transient storages for: in-memory databases (this is exactly how memcached works, by the way), inbound and outbound mail exchangers, image conversion clusters, logging destinations and so on.
A way to avoid single points of failure is to decentralize. By pushing this idea to the limits we come to the P2P systems. There are plenty of articles on the web on P2P systems. Every aspect of them seems to researched. Let me just summarize how we can use P2P technology for a horizontally partitioned storage.
When a request to store a data instance comes, it gets forwarded to arbiter node. The arbiter then asks every storage node to bid on this instance (e.g. a file). They bid with amount of free storage, current load or other parameters. The arbiter then decides which node wins and passes the item to that node.
In case of bidding with amount of free storage, for example, the node with the most amount of free storage gets most of the inbound traffic. Once that that node gets filled, other nodes come into play. By combining the amount of free storage with the time last bid from given node was accepted, we get a nicely balanced system (the arbiter will round robin over nodes then instead of sticking to the node with the most amount of free storage).
What do we do a request to retrieve a data instance comes? As we have no centralized database any more, we cannot easily locate a storage node by instance’s key. The solution is to just ask every node to locate this instance. This sounds like overkill, but there are many ways we can optimise that. Just few examples:
- caching. We keep a map of resolved keys in memory and don’t resolve a single key repetitively;
- sharding. Keys are evenly split into sets. Each set goes to an isolated group of nodes.
But the largest optimisation comes from choosing the right protocol! And I’m talking about a datagram protocol like UDP. To avoid sending multiple packets to everyone around you can use multicast transmission. In this case you need no more that a single request and a single response packet in this communication. This effectively shifts load from given node and its network interface to network equipment, which is quite advance nowadays.
I hope this little essay gave you an idea of what horizontal partitioning is. We can go on and on by discussing how to combine these strategies and implement interesting things like de-duplication and ability to plug and unplug nodes on the fly. The most important is: horizontal partitioning seems to be the simplest way to scale up the storage and simple things tend to be more reliable and easily implementable. And I hope that’s what you are looking for.