Replication popularized

Data replication is a cornerstone of distributed systems: since the very beginnings of cloud computing, being able to guarantee that data would not be lost in an incident has always been critical. The same holds for SStorage: an encrypted cloud storage service cannot afford to lose customer data.

Let's see how common techniques of data replication work.

Complete replication

This technique is rather typical: you simply duplicate the data on several servers, generally 3. The reason for 3 is that if the network were to separate the servers into 2 groups, one of the groups would have more servers than the other and would be the only one accepting connections. For that reason, most of the time, whenever you see anything replicated, it is replicated an odd number of times.
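To make the majority rule concrete, here is a minimal Python sketch. The function name `accepts_connections` is made up for this illustration, and real systems decide this through a consensus protocol rather than a single check.

```python
def accepts_connections(group_size: int, total_servers: int = 3) -> bool:
    """Return True if a group of servers holds a strict majority.

    Hypothetical helper: after a network split, only the group that
    can still see more than half of all servers keeps serving clients.
    """
    return 2 * group_size > total_servers


# 3 servers split into a group of 2 and a group of 1:
print(accepts_connections(2))     # the group of 2 keeps serving
print(accepts_connections(1))     # the lone server stops
# With 4 servers, an even 2/2 split leaves nobody with a majority:
print(accepts_connections(2, 4))
```

The last call shows why odd numbers are preferred: with an even number of servers, a split down the middle leaves no group able to accept connections.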

Those replications are costly: you need more than 3 times the bandwidth and 3 times the storage space to perform them. Bandwidth is also consumed every time something has to be replicated again, meaning that in case of a server crash, a large amount of bandwidth is suddenly needed.

For those reasons, we deemed complete replication unsuitable for our encrypted cloud storage service: we are a small company with only 2 associates and a few friends around us, and we can't afford to waste resources.

Striped replication

This method is slightly more advanced. It consists in splitting the information in specific ways so that if a number of pieces were missing, you would still be able to reconstruct the missing information. For example, 1GB of information can be split into 2 pieces of 0.5GB, with an additional 0.5GB of parity information. Each piece can be stored on a different server, and as long as at least 2 of these 3 servers are kept online, the data will be accessible.
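The text above does not say how the additional piece is computed. A common choice, assumed here, is a bitwise XOR of the two halves, which is the idea behind RAID-style parity. The function names below are hypothetical and only serve this sketch.

```python
from typing import Optional, Tuple


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bitwise XOR of two byte strings of equal length."""
    return bytes(i ^ j for i, j in zip(x, y))


def split_with_parity(data: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split data into two halves plus one parity piece (assumed XOR)."""
    if len(data) % 2:
        raise ValueError("this sketch expects an even number of bytes")
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return first, second, xor_bytes(first, second)


def recover(first: Optional[bytes], second: Optional[bytes],
            parity: Optional[bytes]) -> bytes:
    """Rebuild the original data when at most one piece is missing (None)."""
    if [first, second, parity].count(None) > 1:
        raise ValueError("at least 2 of the 3 pieces are required")
    if first is None:
        first = xor_bytes(second, parity)
    elif second is None:
        second = xor_bytes(first, parity)
    return first + second


first, second, parity = split_with_parity(b"hello world!")
# Pretend the server holding the first half is offline:
print(recover(None, second, parity))
```

Each of the three pieces is half the size of the original, which matches the 0.5GB + 0.5GB + 0.5GB layout of the example.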

Another advantage is that, because the data is split between servers, each server only has to send half of it and can complete its job twice as fast: the system can send twice as much data with 3 times the servers. The scheme can also be scaled further to survive the loss of more than one server. Splitting the data matters to an encrypted cloud storage system in one more way: it is not a privacy feature in itself, but it makes the task of someone accessing the hardware or intercepting network traffic harder, as the data is more split up.
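As a rough comparison of storage cost, the arithmetic can be sketched as below. The helper name `storage_overhead` is an assumption of this sketch, and the figures only restate the numbers given earlier in this post: 3 full copies versus 2 data pieces plus 1 parity piece.

```python
def storage_overhead(data_pieces: int, extra_pieces: int) -> float:
    """Bytes stored per byte of user data (hypothetical helper).

    data_pieces: pieces the data is split into
    extra_pieces: additional pieces of the same size kept for safety
    """
    return (data_pieces + extra_pieces) / data_pieces


# Complete replication on 3 servers: 1 original plus 2 full copies.
print(storage_overhead(1, 2))  # 3.0
# Split in 2 halves plus 1 parity piece, as in the 1GB example.
print(storage_overhead(2, 1))  # 1.5
```

In other words, the 1GB example occupies 1.5GB across the three servers instead of 3GB.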

Striped replication in SStorage

In SStorage, the data is split into chunks of 32KB. Each chunk is then split into two 16KB pieces, plus an additional 16KB piece for parity. In the system, those pieces are named A-piece for the first half of the chunk, B-piece for the second half, and AB-piece for the parity. The gateway service handles the merging of pieces so that the client side only ever has to manage the 32KB chunks, while the storage layer handles the management of the smaller pieces.
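The layout described above could look roughly like the following Python sketch. This is not the actual SStorage code: the function names, the zero-padding of a short final chunk, and the use of XOR for the AB-piece are all assumptions made for illustration.

```python
from typing import Dict, Iterator

CHUNK_SIZE = 32 * 1024        # 32KB chunks seen by the client
PIECE_SIZE = CHUNK_SIZE // 2  # 16KB pieces seen by the storage layer


def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(i ^ j for i, j in zip(x, y))


def chunks(data: bytes) -> Iterator[bytes]:
    """Cut data into 32KB chunks (assumption: last chunk is zero-padded)."""
    for offset in range(0, len(data), CHUNK_SIZE):
        yield data[offset:offset + CHUNK_SIZE].ljust(CHUNK_SIZE, b"\0")


def to_pieces(chunk: bytes) -> Dict[str, bytes]:
    """Storage side: one chunk becomes an A, a B and an AB piece."""
    a, b = chunk[:PIECE_SIZE], chunk[PIECE_SIZE:]
    return {"A": a, "B": b, "AB": xor_bytes(a, b)}  # XOR parity is assumed


def merge(pieces: Dict[str, bytes]) -> bytes:
    """Gateway side: rebuild a 32KB chunk from any two of its pieces."""
    if "A" in pieces and "B" in pieces:
        return pieces["A"] + pieces["B"]
    if "A" in pieces and "AB" in pieces:
        return pieces["A"] + xor_bytes(pieces["A"], pieces["AB"])
    if "B" in pieces and "AB" in pieces:
        return xor_bytes(pieces["B"], pieces["AB"]) + pieces["B"]
    raise ValueError("need at least two of the A, B and AB pieces")


chunk = next(chunks(b"some client data"))
pieces = to_pieces(chunk)
# The server holding the A-piece is down; the gateway still answers:
print(merge({"B": pieces["B"], "AB": pieces["AB"]}) == chunk)
```

The client only ever calls something like `chunks` and receives whole chunks back, while the three 16KB pieces stay an internal detail of the gateway and storage layer.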

Due to the encrypted nature of the data, it is the client that makes and handles the architecture of the file-system or database that is stored on the service.

Conclusion

Replication is a critical component of privacy-focused cloud storage software. It actually is for any cloud software. I hope this post explained the fundamentals of automated replication properly.

I invite you to check the information at https://nekoit.xyz/, join us on Discord or Telegram, or follow me on Mastodon Archivist@social.linux.pizza

Image credit (header): Doc. RNDr. Josef Reischig, CSc. CC BY-SA 3.0 Modified (cropped, scaled, compressed)