In statistics, reliability is the consistency of a set of measurements or of a measuring instrument, often used to describe a test. Reliability is inversely related to random error.
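There are several general classes of reliability estimates, including test-retest reliability, inter-rater reliability, alternate-forms (parallel-test) reliability, and internal consistency reliability.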
Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is analogous to precision, while validity is analogous to accuracy.
An example often used to illustrate the difference between reliability and validity in the experimental sciences involves a common bathroom scale. If someone who is 200 pounds steps on a scale 10 times and gets readings of 15, 250, 95, 140, etc., the scale is not reliable. If the scale consistently reads "150", then it is reliable, but not valid. If it reads "200" each time, then the measurement is both reliable and valid. This is what is meant by the statement, "Reliability is necessary but not sufficient for validity."
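The bathroom-scale example can be put into numbers with a minimal sketch. It assumes, for illustration only, that the spread of repeated readings stands in for (un)reliability and that the average offset from the known weight stands in for (in)validity. The function name `describe_scale` and the two constant reading lists are hypothetical; only the erratic readings come from the text.

```python
from statistics import mean, stdev

def describe_scale(readings, true_weight):
    """Return (spread, bias) for repeated readings of one known weight."""
    spread = stdev(readings)             # small spread: readings are consistent
    bias = mean(readings) - true_weight  # small bias: readings are right on average
    return spread, bias

true_weight = 200
scales = {
    "erratic": [15, 250, 95, 140],       # readings quoted in the text
    "stuck_low": [150, 150, 150, 150],   # consistent but wrong (illustrative)
    "accurate": [200, 200, 200, 200],    # consistent and right (illustrative)
}

for name, readings in scales.items():
    spread, bias = describe_scale(readings, true_weight)
    print(f"{name}: spread={spread:.1f}, bias={bias:+.1f}")
```

A scale with zero spread but a nonzero bias corresponds to the "reliable but not valid" case.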
Reliability may be estimated through a variety of methods that fall into two types: single-administration and multiple-administration. Multiple-administration methods require that two assessments are administered. In the test-retest method, reliability is estimated as the Pearson product-moment correlation coefficient between two administrations of the same measure. In the alternate forms method, reliability is estimated by the Pearson product-moment correlation coefficient of two different forms of a measure, usually administered together. Single-administration methods include split-half and internal consistency. The split-half method treats the two halves of a measure as alternate forms. This "halves reliability" estimate is then stepped up to the full test length using the Spearman-Brown prediction formula. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.^{[2]} Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder-Richardson Formula 20.^{[2]}
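The estimates described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function names, the odd/even way of splitting the items, and all of the scores are hypothetical, and the standard textbook forms of the Spearman-Brown formula and Cronbach's alpha are assumed.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_brown(r, factor=2):
    """Step a reliability up to a test `factor` times as long."""
    return factor * r / (1 + (factor - 1) * r)

def split_half_reliability(item_scores):
    """Rows are people, columns are items; halves are odd vs. even items."""
    half_a = [sum(row[0::2]) for row in item_scores]
    half_b = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(half_a, half_b))

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(item_scores[0])
    item_vars = [pvariance(col) for col in zip(*item_scores)]
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: five people tested twice, and a four-item scale.
time1 = [12, 15, 9, 20, 17]
time2 = [13, 14, 10, 19, 18]
items = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

print("test-retest:", round(pearson_r(time1, time2), 3))
print("split-half (stepped up):", round(split_half_reliability(items), 3))
print("Cronbach's alpha:", round(cronbach_alpha(items), 3))
```

The same `pearson_r` call would serve for the alternate forms method, with the two lists holding scores on the two forms rather than on two administrations.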
These measures of reliability differ in their sensitivity to different sources of error and so need not be equal. Also, reliability is a property of the scores of a measure rather than of the measure itself, and is thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population, because the true reliability is different in this second population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,^{[2]} and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability. This analysis consists of computing item difficulties and item discrimination indices, the latter involving the correlation between each item and the sum of the item scores of the entire test. If items that are too difficult, too easy, or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.
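A minimal sketch of such an item analysis follows, assuming items scored 0 (wrong) or 1 (right), difficulty taken as the proportion answering correctly, and discrimination taken as the uncorrected item-total correlation. The function names and the response matrix are hypothetical.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined here; 0.0 is a choice of this sketch
    return cov / (sx * sy)

def item_analysis(responses):
    """Rows are test-takers, columns are items scored 0 or 1.

    Returns one (item_index, difficulty, discrimination) tuple per item.
    """
    totals = [sum(row) for row in responses]
    report = []
    for j, column in enumerate(zip(*responses)):
        difficulty = mean(column)                   # proportion answering correctly
        discrimination = pearson_r(column, totals)  # item-total correlation
        report.append((j, difficulty, discrimination))
    return report

# Hypothetical responses from six test-takers on four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]

for j, difficulty, discrimination in item_analysis(responses):
    print(f"item {j}: difficulty={difficulty:.2f}, discrimination={discrimination:+.2f}")
```

Items whose difficulty is close to 0 or 1, or whose discrimination is near zero or negative, would be the candidates for replacement.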
In classical test theory, reliability is defined mathematically as the ratio of the variance of the true score to the variance of the observed score or, equivalently, one minus the ratio of the variance of the error score to the variance of the observed score:
ρ_{xx'} = σ^2_T / σ^2_X = 1 - σ^2_E / σ^2_X
where ρ_{xx'} is the symbol for the reliability of the observed score, X; σ^2_X, σ^2_T, and σ^2_E are the variances of the measured, true and error scores respectively. Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test.
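As a worked illustration of this definition, the sketch below computes the reliability both ways from hypothetical variances. It assumes the usual classical test theory model in which the observed score is the sum of a true score and an uncorrelated error score, so that the observed variance is the sum of the true and error variances. The function name and the numbers are illustrative only; in practice the true-score variance cannot be observed.

```python
def reliability_from_variances(var_true, var_error):
    """Return the reliability computed from both forms of the definition."""
    # Assumes uncorrelated true and error scores: var(X) = var(T) + var(E).
    var_observed = var_true + var_error
    via_true = var_true / var_observed        # var(T) / var(X)
    via_error = 1 - var_error / var_observed  # 1 - var(E) / var(X)
    return via_true, via_error

# Hypothetical variances: true-score variance 9, error variance 3.
via_true, via_error = reliability_from_variances(9, 3)
print(via_true, via_error)  # both forms agree
```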
Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method approaches the problem of identifying the source of error in the test somewhat differently.
It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better among test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory (IRT) extends the concept of reliability from a single index to a function called the information function. The IRT information function is the inverse of the squared conditional standard error of measurement at any given test score, so higher levels of IRT information indicate higher precision and thus greater reliability.
In statistics, reliability is the consistency of a set of measurements or of a measuring instrument. Consistency can mean that repeated measurements with the same instrument give, or are likely to give, the same result (test-retest reliability) or, in the case of more subjective instruments, that two independent assessors give similar scores (inter-rater reliability). Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance.
In experimental sciences, reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results.
In engineering, reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. It is often reported in terms of a probability. Evaluations of reliability involve the use of many statistical tools. See Reliability engineering for further discussion.
This page uses content from the English language Wikipedia. The original content was at Reliability (statistics). The list of authors can be seen in the page history. As with this Familypedia wiki, the content of Wikipedia is available under the Creative Commons License. 
