Krippendorff's alpha coefficient^{[1]} is a statistical measure of the agreement achieved when coding a set of units of analysis in terms of the values of a variable. Since the 1970s, alpha has been used in content analysis, where textual units are categorized by trained readers; in counseling and survey research, where experts code open-ended interview data into analyzable terms; in psychological testing, where alternative tests of the same phenomena need to be compared; and in observational studies, where unstructured happenings are recorded for subsequent analysis.
Krippendorff’s alpha generalizes several known statistics, often called measures of intercoder agreement, interrater reliability, or reliability of coding (as distinct from unitizing), but it also distinguishes itself from statistics that claim to measure reliability yet are unsuitable for assessing the reliability of coding or of the data it generates.
Krippendorff’s alpha is applicable to any number of coders, each assigning one value to one unit of analysis, to incomplete (missing) data, to any number of values available for coding a variable, to binary, nominal, ordinal, interval, ratio, polar, and circular metrics (Levels of Measurement), and it adjusts itself to small sample sizes of the reliability data. The virtue of a single coefficient with these variations is that computed reliabilities are comparable across any number of coders and values, different metrics, and unequal sample sizes.
Reliability data are generated in a situation in which m ≥ 2 jointly instructed (e.g., by a code book) but independently working coders assign any one of a set of values c or k of a variable to a common set of N units of analysis. In their canonical form, reliability data are tabulated in an m-by-N matrix containing the n values c_{iu} or k_{ju} that coder i or j has assigned to unit u. When data are incomplete, some cells in this matrix are empty or missing; hence, the number m_{u} of values assigned to unit u may vary. Reliability data require that values be pairable, i.e., m_{u} ≥ 2. The total number of pairable values is n ≤ mN.
where the disagreement

    D_u = \frac{1}{m_u(m_u - 1)} \sum_{c} \sum_{k \ne c} n_{cku}\,\delta_{ck}^2

is the average difference between two values c and k over all m_u(m_u − 1) pairs of values possible within unit u – without reference to coders. Here n_{cku} is the number of ordered c–k pairs of values found in unit u, and the difference function δ_{ck}^2 is a function of the metric of the variable, see below. The observed disagreement

    D_o = \frac{1}{n} \sum_{u=1}^{N} m_u D_u

is the average over all N disagreements D_u, each weighted by the number m_u of values in its unit. And the expected disagreement

    D_e = \frac{1}{n(n - 1)} \sum_{c} \sum_{k \ne c} n_c n_k\,\delta_{ck}^2

is the average difference between any two values c and k over all n(n − 1) pairs of values possible within the reliability data – without reference to coders and units. In effect, D_e is the disagreement that is expected when the values used by all coders are randomly assigned to the given set of units.
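These definitions translate directly into code. The following is a minimal sketch in Python (the function names and the default nominal difference function are ours, for illustration only):

```python
from itertools import permutations

def nominal_delta2(c, k):
    """Nominal difference function: 0 for a match, 1 for any mismatch."""
    return 0.0 if c == k else 1.0

def unit_disagreement(values, delta2=nominal_delta2):
    """D_u: average difference over all m_u(m_u - 1) ordered pairs of
    values within one unit, without reference to coders."""
    m = len(values)
    if m < 2:
        raise ValueError("a unit needs at least two pairable values")
    return sum(delta2(c, k) for c, k in permutations(values, 2)) / (m * (m - 1))

def expected_disagreement(pooled_values, delta2=nominal_delta2):
    """D_e: average difference over all n(n - 1) ordered pairs of the
    pooled values, without reference to coders and units."""
    n = len(pooled_values)
    return sum(delta2(c, k)
               for c, k in permutations(pooled_values, 2)) / (n * (n - 1))
```

For example, a unit coded [3, 3, 4] by three coders yields D_u = 4/6 ≈ 0.67 under the nominal metric.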
One interpretation of Krippendorff's alpha is:

    \alpha = 1 - \frac{D_o}{D_e}
In this general form, disagreements D_{o} and D_{e} may be conceptually transparent but are computationally inefficient. They can be simplified algebraically, especially when expressed in terms of the visually more instructive coincidence matrix representation of the reliability data.
A coincidence matrix cross-tabulates the n pairable values from the canonical form of the reliability data into a v-by-v square matrix, where v is the number of values available in a variable. Unlike contingency matrices, familiar in association and correlation statistics, which tabulate pairs of values (cross tabulation), a coincidence matrix tabulates all pairable values. A coincidence matrix omits references to coders and is symmetrical around its diagonal, which contains all perfect matches, c = k. The matrix of observed coincidences contains frequencies:

    o_{ck} = \sum_{u} \frac{\text{number of } c\text{–}k \text{ pairs in unit } u}{m_u - 1},
    \qquad n_c = \sum_{k} o_{ck}, \qquad n = \sum_{c} n_c .
Because a coincidence matrix tabulates all pairable values and
its contents sum to the total n, when four or more coders
are involved, o_{ck} may be fractions.
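The construction of o_{ck} can be sketched as follows (Python; `observed_coincidences` is our own illustrative name, and missing values are assumed to have been removed beforehand):

```python
from collections import defaultdict
from itertools import permutations

def observed_coincidences(units):
    """Tabulate o_ck: every ordered pair of values within a unit counts
    with weight 1/(m_u - 1); units with fewer than two values are
    unpairable and contribute nothing."""
    o = defaultdict(float)
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):
            o[(c, k)] += 1.0 / (m - 1)
    return dict(o)
```

With three coders, a unit coded [3, 3, 4] contributes o_{33} = 1 and o_{34} = o_{43} = 1; with four coders, the weight 1/3 can produce the fractional coincidences mentioned above.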
The matrix of expected coincidences contains frequencies:

    e_{ck} = \begin{cases} \dfrac{n_c(n_c - 1)}{n - 1} & \text{if } c = k \\[1ex]
    \dfrac{n_c n_k}{n - 1} & \text{if } c \ne k \end{cases}
which sum to the same n_{c},
n_{k}, and n as does
o_{ck}. In terms of these coincidences,
Krippendorff's alpha becomes:

    \alpha = 1 - \frac{D_o}{D_e}
           = 1 - \frac{\sum_c \sum_k o_{ck}\,\delta_{ck}^2}{\sum_c \sum_k e_{ck}\,\delta_{ck}^2}
Difference functions ^{[2]} between values c and k reflect the metric properties (Levels of Measurement) of their variable.
In general:

    \delta_{cc}^2 = 0, \qquad \delta_{ck}^2 = \delta_{kc}^2 \ge 0 .
In particular:

    \text{nominal: } \delta_{ck}^2 = \begin{cases} 0 & \text{if } c = k \\ 1 & \text{if } c \ne k \end{cases}

    \text{ordinal: } \delta_{ck}^2 = \left( \sum_{g=c}^{k} n_g - \frac{n_c + n_k}{2} \right)^2

    \text{interval: } \delta_{ck}^2 = (c - k)^2

    \text{ratio: } \delta_{ck}^2 = \left( \frac{c - k}{c + k} \right)^2
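The four difference functions can be written out as follows (a Python sketch; for the ordinal case the values are assumed to be consecutive integer ranks, with n_g giving the frequency of rank g in the reliability data):

```python
def nominal_delta2(c, k):
    """0 for a match, 1 for any mismatch."""
    return 0.0 if c == k else 1.0

def ordinal_delta2(c, k, n_g):
    """Squared sum of the frequencies of all ranks from c to k inclusive,
    minus the average frequency of the two end ranks."""
    lo, hi = min(c, k), max(c, k)
    between = sum(n_g[g] for g in range(lo, hi + 1))
    return (between - (n_g[c] + n_g[k]) / 2.0) ** 2

def interval_delta2(c, k):
    """Squared arithmetic difference."""
    return float((c - k) ** 2)

def ratio_delta2(c, k):
    """Squared relative difference, for ratio-scale values."""
    return 0.0 if c == k else ((c - k) / (c + k)) ** 2
```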
Inasmuch as mathematical statements of the statistical distribution of alpha are always only approximations, it is preferable to obtain alpha's distribution by bootstrapping^{[3]}^{[4]}. Alpha's distribution gives rise to two indices: the confidence intervals of a computed alpha at chosen levels of statistical significance, and the probability that alpha fails to reach a chosen minimum.
The minimum acceptable alpha coefficient should be chosen according to the importance of the conclusions to be drawn from imperfect data. When the costs of mistaken conclusions are high, the minimum alpha needs to be set high as well. In the absence of knowledge of the risks of drawing false conclusions from unreliable data, social scientists commonly rely on data with reliabilities α ≥ 0.800, consider data with 0.800 > α ≥ 0.667 only to draw tentative conclusions, and discard data whose agreement measures α < 0.667^{[5]}.
Krippendorff's alpha can be misunderstood^{[6]}.
Let the canonical form of reliability data be a 3-coder-by-15-unit matrix with 45 cells:
Units u:  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 

Coder A  *  *  *  *  *  3  4  1  2  1  1  3  3  *  3 
Coder B  1  *  2  1  3  3  4  3  *  *  *  *  *  *  * 
Coder C  *  *  2  1  3  4  4  *  2  1  1  3  3  *  4 
Suppose “*” indicates a default category like “cannot code,” “no answer,” or “lacking an observation.” Then, * provides no information about the reliability of data in the four values that matter. Note that units 2 and 14 contain no information and unit 1 contains only one value, which is not pairable within that unit. Thus, these reliability data consist not of mN = 45 but of n = 26 pairable values, located not in N = 15 but in 12 multiply coded units.
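These counts can be verified mechanically (a Python sketch; "*" stands for the missing code):

```python
# The three coders' rows from the example above; "*" marks missing values.
rows = [
    "* * * * * 3 4 1 2 1 1 3 3 * 3".split(),  # Coder A
    "1 * 2 1 3 3 4 3 * * * * * * *".split(),  # Coder B
    "* * 2 1 3 4 4 * 2 1 1 3 3 * 4".split(),  # Coder C
]
# Transpose into units, dropping missing values.
units = [[row[u] for row in rows if row[u] != "*"] for u in range(15)]
# Only units with at least two values contain pairable information.
pairable_units = [vals for vals in units if len(vals) >= 2]
n = sum(len(vals) for vals in pairable_units)
print(len(pairable_units), n)  # 12 multiply coded units, n = 26 pairable values
```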
The coincidence matrix for these data would be constructed as follows:
Values c or k:    1    2    3    4    n_{c}

Value 1           6    .    1    .     7
Value 2           .    4    .    .     4
Value 3           1    .    7    2    10
Value 4           .    .    2    3     5
Frequency n_{k}   7    4   10    5    26
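The coincidence matrix above can be fed straight into the coincidence form of alpha. A Python sketch (the function name is ours; o maps value pairs (c, k) to observed coincidences, omitting empty cells):

```python
def alpha_from_coincidences(o, delta2):
    """alpha = 1 - D_o/D_e, with D_o from observed and D_e from expected
    coincidences: e_ck = n_c(n_c - 1)/(n - 1) on the diagonal and
    n_c*n_k/(n - 1) off it."""
    values = sorted({c for c, _ in o} | {k for _, k in o})
    n_c = {c: sum(o.get((c, k), 0.0) for k in values) for c in values}
    n = sum(n_c.values())
    d_o = sum(o.get((c, k), 0.0) * delta2(c, k) for c in values for k in values)
    d_e = sum(n_c[c] * (n_c[k] - (1 if c == k else 0)) / (n - 1) * delta2(c, k)
              for c in values for k in values)
    return 1.0 - d_o / d_e

# The matrix tabulated above, listed by its non-empty cells:
o = {(1, 1): 6, (1, 3): 1, (3, 1): 1, (2, 2): 4,
     (3, 3): 7, (3, 4): 2, (4, 3): 2, (4, 4): 3}
print(round(alpha_from_coincidences(o, lambda c, k: 0.0 if c == k else 1.0), 3))
# nominal alpha: 0.691
```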
In terms of the entries in this coincidence matrix, Krippendorff's alpha may be calculated from:

    \alpha = 1 - (n - 1)\,\frac{\sum_c \sum_{k>c} o_{ck}\,\delta_{ck}^2}{\sum_c \sum_{k>c} n_c n_k\,\delta_{ck}^2}
For convenience, because products with δ_{cc}^2 = 0 exclude c = k pairs from being counted and coincidences are symmetrical, only the entries in one of the off-diagonal triangles of the coincidence matrix are listed in the following: o_{13} = 1 and o_{34} = 2.
Considering that all δ_{ck}^2 = 1 for c ≠ k in nominal data, the above expression yields:

    _{nominal}\alpha = 1 - 25\,\frac{1 + 2}{7 \cdot 4 + 7 \cdot 10 + 7 \cdot 5 + 4 \cdot 10 + 4 \cdot 5 + 10 \cdot 5}
                     = 1 - \frac{75}{243} \approx 0.691
With δ_{13}^2 = (1 − 3)^2 = 4 and δ_{34}^2 = (3 − 4)^2 = 1, and the corresponding interval differences weighting the denominator, the above expression yields:

    _{interval}\alpha = 1 - 25\,\frac{1 \cdot 4 + 2 \cdot 1}{28 \cdot 1 + 70 \cdot 4 + 35 \cdot 9 + 40 \cdot 1 + 20 \cdot 4 + 50 \cdot 1}
                      = 1 - \frac{150}{793} \approx 0.811
Here, _{interval}α > _{nominal}α because disagreements happen to occur largely among neighboring values, visualized by occurring closer to the diagonal of the coincidence matrix, a condition that _{interval}α takes into account but _{nominal}α does not. When the observed frequencies o_{c ≠ k} are on average proportional to the expected frequencies e_{c ≠ k}, _{interval}α = _{nominal}α.
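Both coefficients can also be reproduced end-to-end from the canonical reliability data, without building the coincidence matrix explicitly (a Python sketch; the function name is ours):

```python
from itertools import permutations

def krippendorff_alpha(units, delta2):
    """alpha = 1 - D_o/D_e computed directly from the units' values."""
    units = [u for u in units if len(u) >= 2]       # keep pairable units only
    pooled = [v for u in units for v in u]
    n = len(pooled)
    # D_o: ordered within-unit pairs, each weighted by 1/(m_u - 1)
    d_o = sum(delta2(c, k) / (len(u) - 1)
              for u in units for c, k in permutations(u, 2)) / n
    # D_e: all n(n - 1) ordered pairs of the pooled values
    d_e = sum(delta2(c, k) for c, k in permutations(pooled, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

# The example data, with "*" entries already removed:
units = [[1], [], [2, 2], [1, 1], [3, 3], [3, 3, 4], [4, 4, 4], [1, 3],
         [2, 2], [1, 1], [1, 1], [3, 3], [3, 3], [], [3, 4]]

nominal = krippendorff_alpha(units, lambda c, k: 0.0 if c == k else 1.0)
interval = krippendorff_alpha(units, lambda c, k: float((c - k) ** 2))
print(round(nominal, 3), round(interval, 3))  # 0.691 0.811
```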
Comparing alpha coefficients across different metrics can provide clues to how coders conceptualize the metric of a variable.
An SPSS and SAS macro for computing alpha^{[7]} is available at http://www.comm.ohio-state.edu/ahayes/SPSS%20programs/kalpha.htm. It computes alphas for nominal, ordinal, interval, and ratio scale data and their distributions, one variable at a time. For additional software and useful papers see http://cswww.essex.ac.uk/Research/nle/arrau/alpha.html.
Krippendorff's alpha brings several known statistics under a common umbrella; each of them has its own limitations but no additional virtues.
Evidently, Krippendorff's alpha is more general than
either of these special purpose coefficients. It adjusts to varying
sample sizes and affords comparisons across a great variety of
reliability data, mostly ignored by the familiar measures.
Semantically, reliability is the ability to rely on something, here on coded data for subsequent analysis. When a sufficiently large number of coders agree perfectly on what they have read or observed, relying on their descriptions is a safe bet. Judgments of this kind hinge on the number of coders duplicating the process and how representative the coded units are of the population of interest. Problems of interpretation arise when agreement is less than perfect, especially when reliability is absent.
Naming a statistic as one of agreement, reproducibility, or reliability does not make it a valid index of whether one can rely on coded data in subsequent decisions. Its mathematical structure must fit the process of coding units into a system of analyzable terms.
Bennett, Edward M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited response questioning. Public Opinion Quarterly, 18, 303–308.
Brennan, Robert L., & Prediger, Dale J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687–699.
Cohen, Jacob (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20 (1), 37–46.
Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16 (3), 297–334.
Fleiss, Joseph L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Goodman, Leo A., & Kruskal, William H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–764.
Hayes, Andrew F., & Krippendorff, Klaus (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77–89.
Krippendorff, Klaus (2004). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage.
Krippendorff, Klaus (1978). Reliability of binary attribute data. Biometrics, 34 (1), 142–144.
Krippendorff, Klaus (1970). Estimating the reliability, systematic error, and random error of interval data. Educational and Psychological Measurement, 30 (1), 61–70.
Lin, Lawrence I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45, 255–268.
Nunnally, Jum C., & Bernstein, Ira H. (1994). Psychometric Theory, 3rd ed. New York: McGraw-Hill.
Pearson, Karl, et al. (1901). Mathematical contributions to the theory of evolution. IX: On the principle of homotyposis and its relation to heredity, to variability of the individual, and to that of race. Part I: Homotyposis in the vegetable kingdom. Philosophical Transactions of the Royal Society (London), Series A, 197, 285–379.
Scott, William A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Siegel, Sidney, & Castellan, N. John (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd ed. Boston: McGraw-Hill.
Spearman, Charles E. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Tildesley, M. L. (1921). A first study of the Burmese skull. Biometrika, 13, 176–267.
Zwick, Rebecca (1988). Another look at interrater agreement. Psychological Bulletin, 103 (3), 347–387.
