Reliability, Availability, and Data Integrity: What’s the Difference and Why Does It Matter?
Reliability, availability, and data integrity are often treated as interchangeable terms, but they are not. Understanding the differences among these concepts, and the relationships between them, will help organizations write better service level agreements and understand how “safe” their data is from physical corruption.
Reliability measures the frequency of system repair activities. From a storage architect’s perspective, reliability and MTBF (Mean Time Between Failures) are synonyms. Mathematically, reliability or MTBF is defined as the reciprocal of a system’s failure rate: MTBF = 1 / (system failure rate). Experience has taught us that complicated systems break more often than simple systems because they have more components that can fail. Hence, it is easy to mistakenly conclude that simpler is always better than complex, because a component that is not there can’t fail. The flaw in this logic is that if a storage system has only the bare minimum of components needed to make it function, all hardware failures are critical failures: failures that always result in downtime, a situation that is simply unacceptable in an always-on world. The solution is adding redundancy so that all hardware failures become non-critical; that is, all single points of failure (SPOFs) in the system are eliminated. However, an unavoidable consequence of building redundancy into a system is that the added components increase the frequency of repair activities and therefore lower system reliability.
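The trade-off between redundancy and reliability can be illustrated with a little arithmetic. The sketch below assumes independent components with constant failure rates, so that the system failure rate is simply the sum of the component failure rates; the function name and the 200,000-hour MTBF figure are hypothetical and chosen only for illustration.

```python
def system_mtbf(component_mtbfs_hours):
    """MTBF of a system in which any component failure triggers a repair.

    Assumption: components fail independently at constant rates, so the
    system failure rate is the sum of the individual failure rates.
    """
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical figures: one controller versus two redundant controllers,
# each with a 200,000-hour MTBF.
single = system_mtbf([200_000])
dual = system_mtbf([200_000, 200_000])

print(f"single controller: {single:,.0f} hours between repairs")
print(f"dual controllers:  {dual:,.0f} hours between repairs")
```

Under these assumptions, doubling the controller count halves the time between repair events, even though the redundant system no longer goes down when a single controller fails.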
Availability measures the amount of time that a storage system can satisfy I/O requests and is frequently expressed as a percentage: for example, the system is available 99.99999% of the time, which works out to roughly 3 seconds of downtime per year. By convention, availability is defined as MTBF / (MTBF + MTTR), where MTTR is the Mean Time To Repair. This formula shows that availability reaches 100% only when failures and repairs cause no downtime at all, which in practice means the system has no SPOFs and all software updates, capacity upgrades, and repair activities are nondisruptive. As a practical matter, developing reliable error recovery software is more difficult than developing the software that creates the storage system persona: functionality, manageability, and performance.
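As a rough illustration of the availability formula, the following sketch converts MTBF and MTTR into an availability figure and an expected downtime per year. The function names and the sample MTBF and MTTR values are assumptions made for this example, and a 365-day year is used for simplicity.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_seconds_per_year(avail):
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR * 3600


# Hypothetical system: 100,000-hour MTBF, 4 hours of downtime per failure.
a = availability(100_000, 4)
print(f"availability: {a:.5%}")
print(f"downtime:     {downtime_seconds_per_year(a) / 60:.1f} minutes/year")

# Downtime budget implied by "seven nines".
print(f"99.99999%:    {downtime_seconds_per_year(0.9999999):.2f} seconds/year")
```

Note that MTTR here counts only the time during which the system cannot serve I/O. A repair that is performed nondisruptively contributes nothing to it, which is how a redundant system can be less reliable and yet more available.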
In remote locations with limited access to field engineering and spares, and possibly power, storage architects should favor systems with no SPOFs and minimal parts counts. Minimizing the parts count lowers system build costs; power, cooling, and space requirements; and the number of opportunities a field engineer has to botch a repair and bring down the system. As a practical matter, in most environments that means scale-up storage architectures or scale-out architectures with minimal node counts: 2 to 4 redundant nodes.
A limitation of the availability formula is that it is silent on the system’s ability to meet service level objectives. A system may be able to satisfy I/O requests but still not meet its service level objectives. Hence, it is debatable whether a dual-controller storage array that has a controller offline and is delivering less than half of its rated performance is, from an end user’s perspective, available. Usable availability, defined as the system’s ability to meet service level objectives in the presence of hardware failures, is a measure that addresses this shortcoming. Since usable availability depends upon the system’s architecture and the workloads it is supporting, it is something that is managed toward rather than read off a specification sheet.
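One way to make the distinction concrete is to count only the hours in which the system met its performance objective. The text defines usable availability qualitatively, so the time-fraction formula below is an assumption, as are the state names, hours, and IOPS figures, which are invented for illustration.

```python
def fraction_of_time_meeting(state_hours, state_iops, required_iops):
    """Fraction of total time the system delivered at least required_iops."""
    total = sum(state_hours.values())
    ok = sum(hours for state, hours in state_hours.items()
             if state_iops[state] >= required_iops)
    return ok / total


# Hypothetical year for a dual-controller array.
state_hours = {"healthy": 8750.0, "one_controller": 9.5, "down": 0.5}
state_iops = {"healthy": 400_000, "one_controller": 180_000, "down": 0}

# Conventional availability: any ability to serve I/O counts as "up".
conventional = fraction_of_time_meeting(state_hours, state_iops, 1)
# Usable availability: only time at or above a 250,000 IOPS objective counts.
usable = fraction_of_time_meeting(state_hours, state_iops, 250_000)

print(f"conventional availability: {conventional:.4%}")
print(f"usable availability:       {usable:.4%}")
```

In this made-up example the degraded hours count as "up" under the conventional definition but not under the usable one, so the second figure is the lower of the two.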
Data integrity measures, from a storage system’s perspective, the ability to maintain and assure data accuracy as data is written to, stored in, or read from a system. Intuitively, data integrity is tightly coupled to media quality, but whether the media is DRAM, SCM, flash, or HDDs, data corruption is a constant threat. Bits on memory buses and in memory cells get flipped, and HDDs have nonzero undetected bit error rates. So it is the software, data protection algorithms, and monitoring that actually maintain and assure storage system data integrity. Strategies include: increasing the degree of resiliency (moving from RAID 5 to RAID 6 and erasure coding); background scrubbing to detect and correct single-bit errors before they become double-bit errors; shrinking the windows of vulnerability that occur when media fails by reducing rebuild times; monitoring changes in media (varying flash columns or rows offline, or flagging bad HDD sectors that could cause data corruption); and replacing media that is showing early signs of failure.
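The idea behind background scrubbing can be sketched at the block level. The example below is a deliberately simplified illustration, not how any particular array implements scrubbing: it assumes a stripe of equal-length data blocks protected by a single XOR parity block and a stored CRC32 per data block, and it assumes the parity block itself is intact. All function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(blocks, parity, checksums):
    """Check each data block against its stored CRC32; repair one bad block.

    Returns the index of the repaired block, or None if the stripe is clean.
    Raises ValueError when more than one block is corrupt, because single
    parity can rebuild only one block.
    """
    bad = [i for i, block in enumerate(blocks)
           if zlib.crc32(block) != checksums[i]]
    if not bad:
        return None
    if len(bad) > 1:
        raise ValueError("multiple corrupt blocks; single parity cannot repair")
    index = bad[0]
    survivors = [b for j, b in enumerate(blocks) if j != index]
    # XOR of the surviving data blocks and the parity yields the lost block.
    blocks[index] = xor_blocks(survivors + [parity])
    return index


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [zlib.crc32(block) for block in data]

data[1] = b"BBBC"  # simulate silent corruption of one block
repaired = scrub_stripe(data, parity, checksums)
print("repaired block index:", repaired)
```

The ValueError branch is the case scrubbing is meant to prevent: finding and repairing the first bad block early keeps a second error in the same stripe from making the data unrecoverable.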
For storage systems to meet service level objectives reliably, storage architects and operations teams must think in terms of usable availability. It is the job of storage architects, working with storage vendors, to quantify the performance impact of various hardware failures and data rebuilds on native performance, and the job of operations to reserve enough spare performance and capacity to meet usable availability objectives. Depending upon RPO, RTO, and performance SLAs, and especially at PB scale, users should favor dual-parity data protection schemes because they are orders of magnitude better than RAID 1, 10, and 5 at protecting data integrity.
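The "orders of magnitude" comparison between single and dual parity can be sanity-checked with the classic textbook approximations for mean time to data loss (MTTDL). This is a rough sketch only: it assumes independent drive failures at a constant rate, it ignores unrecoverable read errors during rebuild, and the drive MTTF, group size, and rebuild time are hypothetical.

```python
def mttdl_single_parity(drive_mttf_h, drives, rebuild_h):
    """Approximate MTTDL for a single-parity group (data lost on 2 failures)."""
    return drive_mttf_h ** 2 / (drives * (drives - 1) * rebuild_h)


def mttdl_dual_parity(drive_mttf_h, drives, rebuild_h):
    """Approximate MTTDL for a dual-parity group (data lost on 3 failures)."""
    return drive_mttf_h ** 3 / (
        drives * (drives - 1) * (drives - 2) * rebuild_h ** 2
    )


# Hypothetical group: 10 drives, 1,000,000-hour drive MTTF, 24-hour rebuild.
single = mttdl_single_parity(1_000_000, 10, 24)
dual = mttdl_dual_parity(1_000_000, 10, 24)

print(f"single parity MTTDL: {single:.3e} hours")
print(f"dual parity MTTDL:   {dual:.3e} hours")
print(f"improvement factor:  {dual / single:,.0f}x")
```

With these inputs the ratio reduces to drive MTTF / ((drives - 2) x rebuild time), which also shows why shorter rebuild times widen the advantage of dual parity.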