The archive community is constantly asking the same question: how many copies are necessary? Let's turn the question around and ask how many copies you can afford. That is where the real debate begins. Most questions about the number of copies of a file are really reliability questions about the data. For example, I am often asked whether two copies on low-cost, low-reliability media are better than a single copy on enterprise media.
There are many variables when what you are really trying to calculate is the reliability of your data. They range from the obvious, such as media reliability and natural disasters, to the not so obvious, such as a software bug or a deliberate attack that wipes out your data or, in even worse cases, changes it.
I am regularly asked how many copies should be kept, yet people are often unwilling or unable to say, in realistic terms, what level of reliability they want and what they are trying to protect against. And 100 percent reliability for very large amounts of data in every circumstance is virtually impossible, given everything from natural disasters to known device and media failure rates, human error and whatever else might come up. So the question comes back to how many copies you need and what each copy actually buys you.
First, consider the basics:
Background Information
Many of you have seen these charts before, but they bear repeating, given the topic:
| Device | Hard error rate (bits) | Equivalent in bytes | PB equivalent |
|---|---|---|---|
| SATA consumer | 1 in 10^14 | 1.25E+13 | 0.01 |
| SATA Enterprise | 1 in 10^15 | 1.25E+14 | 0.11 |
| Enterprise SAS/FC | 1 in 10^16 | 1.25E+15 | 1.11 |
| LTO and some Enterprise SAS SSDs | 1 in 10^17 | 1.25E+16 | 11.10 |
| Enterprise Tape | 1 in 10^19 | 1.25E+18 | 1110.22 |
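The conversion behind the table is simple arithmetic: divide the error rate in bits by eight to get bytes, then divide by a petabyte. A minimal Python sketch of that calculation (the binary petabyte of 1024^5 bytes is an assumption on my part, but it matches the PB column above):

```python
# Convert a hard error rate (one unrecoverable error per N bits read)
# into bytes and binary petabytes (1 PB = 1024**5 bytes, assumed).
PB = 1024 ** 5

hard_error_rate_bits = {
    "SATA consumer": 10 ** 14,
    "SATA Enterprise": 10 ** 15,
    "Enterprise SAS/FC": 10 ** 16,
    "LTO and some Enterprise SAS SSDs": 10 ** 17,
    "Enterprise Tape": 10 ** 19,
}

for device, bits in hard_error_rate_bits.items():
    err_bytes = bits / 8            # bits read per error -> bytes per error
    print(f"{device:34s} {err_bytes:10.2e} bytes  {err_bytes / PB:8.2f} PB")
```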
Another way to view the data is to calculate how quickly you hit the hard error rate with each device running at 100 percent of its average sustained data rate (for disk, the average of the inner and outer cylinders).
Hours to reach the hard error rate at the sustained data rate:

| Device Type | 1 Device | 10 Devices | 50 Devices | 100 Devices | 200 Devices |
|---|---|---|---|---|---|
| Consumer SATA | 50.9 | 5.1 | 1.0 | 0.5 | 0.3 |
| Enterprise SATA | 301.0 | 30.1 | 6.0 | 3.0 | 1.5 |
| Enterprise SAS/FC 3.5 inch | 2,759.5 | 275.9 | 55.2 | 27.6 | 13.8 |
| Enterprise SAS/FC 2.5 inch | 1,965.2 | 196.5 | 39.3 | 19.7 | 9.8 |
| LTO-5 | 23,652.6 | 2,365.3 | 473.1 | 236.5 | 118.3 |
| Some Enterprise SAS SSDs | 7,884.2 | 788.4 | 157.7 | 78.8 | 39.4 |
| Enterprise Tape | 1,379,737.1 | 137,973.7 | 27,594.7 | 13,797.4 | 6,898.7 |
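The figures above are just the hard error rate in bytes divided by the aggregate sustained data rate of the devices. A minimal sketch of the calculation; the sustained rates below are illustrative assumptions (the exact rates used for the table are not stated), so the output only approximates it:

```python
# Hours of streaming at a sustained rate before the hard error rate is reached.
# The sustained rates here are illustrative assumptions, not the table's figures.
hard_error_bytes = {                 # from the first table (bits / 8)
    "Consumer SATA": 1.25e13,
    "Enterprise SATA": 1.25e14,
    "Enterprise Tape": 1.25e18,
}
assumed_rate_mb_s = {                # hypothetical sustained data rates
    "Consumer SATA": 68,
    "Enterprise SATA": 115,
    "Enterprise Tape": 250,
}

for device, err_bytes in hard_error_bytes.items():
    rate_bytes_s = assumed_rate_mb_s[device] * 1e6
    for n in (1, 10, 50, 100, 200):
        hours = err_bytes / (rate_bytes_s * n) / 3600
        print(f"{device:16s} {n:4d} devices: {hours:12.1f} hours")
```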
Looking at the hard error rates alone, one copy of the data is clearly not acceptable if you want to guarantee that nothing is lost; enterprise tape is a possible exception. Of course, with one copy of anything, you are susceptible to a whole range of potential failures, such as a bad lot of disks, tapes or other media (and we have all heard of that happening). There are many other factors as well, such as devices that have been used in RAID groups. Most of these factors add cost, which is always a consideration when discussing archival data.
However, while interesting, this does not answer the question of how many copies you need.
Calculating the Copies
Hard error rates and device failures are only part of the equation. Among the many other things to consider are:
- Silent data corruption
- A bad lot of media
- Natural disaster
- Network failure that prevents replication
- Human error
- Intentional data damage
- A combination of these factors
We will now look at each of these.
Silent Data Corruption
This is a big problem because you do not know when the data has been corrupted. If you have only one copy of the file, it is a serious problem. If you have more than one copy, you still need some external checksum framework and the ability to re-replicate, because both copies could go bad at (or nearly at) the same time, before you can repair one from the other.
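What that external checksum framework might look like in miniature: the sketch below is a hypothetical Python example (not any particular archive product) that records a SHA-256 digest at ingest and re-verifies it during later scrubs, so a silently corrupted copy can be caught while a good replica still exists.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("checksums.json")   # hypothetical manifest, kept outside the archive copies

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large archive files never need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record(path: Path) -> None:
    """Record the digest when the file is ingested into the archive."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    manifest[str(path)] = sha256_of(path)
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def scrub(path: Path) -> bool:
    """Return True if the copy still matches its recorded digest."""
    manifest = json.loads(MANIFEST.read_text())
    return manifest[str(path)] == sha256_of(path)
```

On a mismatch you would restore the object from another copy whose digest still verifies, which is why the manifest itself has to be protected independently of any single copy.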
A Bad Lot of Media
This has not happened in a while in the disk drive industry, but it has happened before for both disk and tape. If you keep two copies on media from the same lot, there is always the risk that both copies sit on media with the same manufacturing defect. Be sure to spread copies across at least two different media lots if you are going to use the same type of media.
Natural Disaster
Whether you live in an earthquake zone, tornado zone, hurricane zone, flood zone or your trouble zone of choice, almost every major population area in the country is exposed to something. If you have only two copies of your data and one of them is destroyed, you will be replicating from a single copy, and given the media reliability figures above and the amount of data involved, that could be a problem. Of course, you could have a computer center built into a missile silo designed to survive a nuclear attack, but most enterprises do not have a data center that can survive a disaster such as an F5 tornado.
Network Failure — Preventing Replication
Having two copies of your data via replication is only as good as your network. There are three potential issues:
- Do you have enough network bandwidth to replicate your incoming data?
- Do you have enough network bandwidth to replicate your incoming data and re-replicate to failed devices?
- Do you have enough network bandwidth to replicate all of your data in the event of a disaster?
Clearly, provisioning enough bandwidth for number three is not practical given the cost, but some planning is still needed.
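A back-of-envelope look at item three shows why, assuming a hypothetical 50 PB archive and a dedicated 10 Gb/s replication link running at 80 percent utilization (both numbers are illustrative):

```python
# How long a full re-replication of the archive would take over a single link.
archive_bytes = 50 * 1024 ** 5            # hypothetical 50 PB archive
link_bits_per_s = 10e9 * 0.80             # assumed 10 Gb/s link at 80% utilization

days = archive_bytes * 8 / link_bits_per_s / 86400
print(f"about {days:.0f} days")           # roughly 650 days, well over a year
```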
Human Error
Everyone makes mistakes, and archives can be lost through human error. Problems tend to arise when you have only two copies of the data, depending on the software you choose. How you ensure that a single human error cannot take out all of your copies is generally a function of the software and your testing procedures.
Intentional Data Damage
Whether it is an employee with an ax to grind or someone hacking into your system to change or destroy data, having multiple copies of your data is critical. Each copy must be checksummed to ensure the data has not been tampered with or silently corrupted.
Combination of Factors
Likely the worst possible scenario is that a combination of factors happens at the same time. Most people plan for one thing to happen but not a combination. This must be a consideration when you are determining how many copies you want.
Final Thoughts
So how many copies of the data do you need, on what media, and at what locations? Some of it depends on the size of your archive. If you have 1 PB of data, you might be able to keep it safe with two copies on enterprise RAIDed SATA drives. On the other hand, if you have 50 PB (50 × 1024^5 bytes) of data and want 99.9999999 percent reliability (about 56,294,995 bytes lost out of 50 PB), two copies on enterprise tape might not be enough, because some of the lost bytes might overlap across the two copies. The number of copies depends on how much risk you are willing to tolerate and, of course, your budget.
You might be willing to archive far more data with a higher risk of loss, and that might be your corporate policy. On the other hand, if you are a drug company and the FDA requires you to keep all drug trial information and you lose some of the data, then, as Ricky said to Lucy, "you have some explaining to do." In large archives (over 50 PB) with high reliability requirements, two copies might not be enough if you want, say, 99.9999999999999 percent (15 nines) data reliability, which allows only about 56 bytes lost in 50 PB. I am not sure that even three copies are enough, given the myriad issues and impacts. The media type also comes into play: three copies on non-RAIDed consumer drives are a recipe for disaster, while three copies on enterprise tape are likely close to what you need from a media perspective. However, if all three copies are in a hurricane zone or you have an employee intent on destruction, all bets are off.
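The "nines" figures in the last two paragraphs come from multiplying the archive size by the loss fraction the reliability target permits. A minimal sketch of that arithmetic, assuming binary petabytes as above:

```python
# Bytes you implicitly accept losing at a given reliability target.
archive_bytes = 50 * 1024 ** 5                    # 50 PB in binary petabytes

for nines in (9, 15):
    loss_fraction = 10.0 ** -nines                # 9 nines = 99.9999999% retained
    print(f"{nines} nines: about {archive_bytes * loss_fraction:,.0f} bytes lost")
# 9 nines  -> about 56,294,995 bytes (roughly 56 MB)
# 15 nines -> about 56 bytes
```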
Given all the variables, there are no easy answers for how many copies you need based on the media type used and the amount of data you have. Some variables, like human error or intentional damage, are not really possible to quantify, though things like WORM media can surely help. Others, like disasters, you might be able to quantify, but even that is neither easy nor cheap to figure out. Everyone in the process must be aware of the risks and issues and make the best choices the budget allows.
So back to the question I get asked so often: Are two copies on low-cost, low-reliability media better than one copy on enterprise media? My answer for large archives is that, from a media reliability perspective, one copy on enterprise media is better than two on low-cost, low-reliability media, because media failures are more probable than natural disasters, malicious employees and the like.
Everyone must know the limitations, and 100 percent data reliability is very costly, if not impossible to achieve, for large archives. As Sir Francis Bacon said, knowledge is power.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn’t require diplomatic skills. Diplomacy’s loss was HPC’s gain.