Human genetic clustering data can be used to infer population structure and assign individuals to groups that often correspond with their self-identified geographical ancestry. Recently, Lynn Jorde and Steven Wooding argued that "Analysis of many loci now yields reasonably accurate estimates of genetic similarity among individuals, rather than populations. Clustering of individuals is correlated with geographic origin or ancestry."
A study by Neil Risch in 2005 used 326 microsatellite markers and self-identified race/ethnic group (SIRE) to represent discrete "populations": white (European American), African-American (black), Asian and Hispanic (individuals in the study had to choose one of these categories). It showed distinct and non-overlapping clustering of the white, African-American and Asian samples. The results were claimed to confirm the integrity of self-described ancestry: "We have shown a nearly perfect correspondence between genetic cluster and SIRE for major ethnic groups living in the United States, with a discrepancy rate of only 0.14%." The authors also warned, however, that: "This observation does not eliminate the potential for confounding in these populations. First, there may be subgroups within the larger population group that are too small to detect by cluster analysis. Second, there may not be discrete subgrouping but continuous ancestral variation that could lead to stratification bias. For example, African Americans have a continuous range of European ancestry that would not be detected by cluster analysis but could strongly confound genetic case-control studies." (Tang, 2005)
Studies such as those by Risch and Rosenberg use a computer program called STRUCTURE to find human populations (gene clusters). It is a statistical program that works by placing individuals into one of two clusters based on their overall genetic similarity; many possible pairs of clusters are tested per individual to generate multiple clusters. These populations are based on multiple genetic markers that are often shared between different human populations, even over large geographic ranges. The notion of a genetic cluster is that people within the cluster have, on average, allele frequencies that are more similar to each other than to those of people in other clusters. (A. W. F. Edwards, 2003, but see also infobox "Multi Locus Allele Clusters") In a test of idealised populations, STRUCTURE was found to consistently underestimate the number of populations in the data set when high migration rates between populations and slow mutation rates (such as those of single nucleotide polymorphisms) were considered.
Nevertheless, the Rosenberg et al. (2002) paper shows that individuals can be assigned to specific clusters with a high degree of accuracy. One of the underlying questions regarding the distribution of human genetic diversity is the degree to which genes are shared between the observed clusters. It has been observed repeatedly that the majority of variation in the global human population is found within populations. This variation is usually calculated using Sewall Wright's fixation index (FST), which is an estimate of between-group relative to within-group variation. The degree of human genetic variation differs slightly depending on the gene type studied, but in general it is common to claim that ~85% of genetic variation is found within groups, ~6-10% between groups within the same continent and ~6-10% between continental groups. For example, the Human Genome Project states that "two random individuals from any one group are almost as different [genetically] as any two random individuals from the entire world."
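As a rough illustration of how a fixation index apportions variation, the sketch below computes FST for a single biallelic locus using the common heterozygosity form, FST = (HT - HS) / HT, where HS is the mean expected heterozygosity within groups and HT is the expected heterozygosity of the pooled population. The function name `fst_biallelic`, the equal weighting of groups and the allele frequencies are assumptions made for this example; they are not taken from any of the studies cited here.

```python
def fst_biallelic(freqs, sizes=None):
    """F_ST for one biallelic locus from per-group allele frequencies.

    freqs: frequency of one allele in each group (hypothetical values).
    sizes: optional relative group sizes; equal weights are assumed if omitted.
    """
    n = len(freqs)
    if sizes is None:
        weights = [1.0 / n] * n
    else:
        total = float(sum(sizes))
        weights = [s / total for s in sizes]

    # Allele frequency in the pooled population
    p_bar = sum(w * p for w, p in zip(weights, freqs))

    # Expected heterozygosity of the pooled population (H_T)
    h_t = 2.0 * p_bar * (1.0 - p_bar)

    # Mean expected heterozygosity within groups (H_S)
    h_s = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, freqs))

    if h_t == 0.0:
        return 0.0  # locus is monomorphic everywhere: nothing to apportion
    return (h_t - h_s) / h_t


# Three invented groups with moderately different allele frequencies
fst = fst_biallelic([0.3, 0.5, 0.7])
print(round(fst, 3))  # 0.107
```

In this invented case roughly 11% of the variation at the locus lies between groups and roughly 89% within them, which is the kind of split the figures quoted above describe. Real estimates average over many loci and use sample-size corrections that this sketch omits.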
On the other hand, Edwards (2003) claims in his essay "Lewontin's Fallacy" that: "It is not true, as Nature claimed, that 'two random individuals from any one group are almost as different as any two random individuals from the entire world'" and Risch et al. (2002) state "Two Caucasians are more similar to each other genetically than a Caucasian and an Asian." These statements are not the same. Risch et al. simply state that two indigenous individuals from the same geographical region are more similar to each other than either is to an indigenous individual from a different geographical region, a claim few would argue with. Jorde et al. put it like this:
The picture that begins to emerge from this and other analyses of human genetic variation is that variation tends to be geographically structured, such that most individuals from the same geographic region will be more similar to one another than to individuals from a distant region.
Edwards, by contrast, claims that it is not true that the differences between individuals from different geographical regions represent only a small proportion of the variation within the human population (that is, he claims that within-group differences between individuals are not almost as large as between-group differences). Bamshad et al. (2004) used the data from Rosenberg et al. (2002) to investigate the extent of genetic differences between individuals within continental groups relative to genetic differences between individuals from different continental groups. They found that although these individuals could be classified very accurately into continental clusters, there was a significant degree of genetic overlap at the individual level: using 377 loci, individual Europeans were more genetically similar to East Asians than to other Europeans about 38% of the time.
The results obtained by clustering analyses are dependent on several criteria:
Additionally, two studies of European population clusters have been produced. Seldin et al. (2006) identified three European clusters using 5,700 genome-wide polymorphisms. Bauchet et al. (2007) used 10,000 polymorphisms to identify five distinct clusters in the European population: a south-eastern European cluster (including samples from southern Italian, Armenian, Ashkenazi Jewish and Greek "populations"); a northern European cluster (including samples from German, eastern English, Polish and western Irish "populations"); a Basque cluster (including samples from Basque "populations"); a Finnish cluster (including samples from Finnish "populations"); and a Spanish cluster (including samples from Spanish "populations"). Most "populations" contained individuals from clusters other than the dominant cluster for that population, and there were also individuals with membership of several clusters. The results of this study are presented on a map of Europe. (Bauchet, 2007)
The existence of allelic clines and the observation that the bulk of human variation is continuously distributed have led some scientists to conclude that any categorization schema attempting to partition that variation meaningfully will necessarily create artificial truncations (Kittles & Weiss 2003). It is for this reason, Reanne Frank argues, that attempts to allocate individuals into ancestry groupings based on genetic information have yielded varying results that are highly dependent on methodological design. Serre and Pääbo (2004) make a similar claim:
The absence of strong continental clustering in the human gene pool is of practical importance. It has recently been claimed that “the greatest genetic structure that exists in the human population occurs at the racial level” (Risch et al. 2002). Our results show that this is not the case, and we see no reason to assume that “races” represent any units of relevance for understanding human genetic history.
In a response to Serre and Pääbo (2004), Rosenberg et al. (2005) make three relevant observations. Firstly, they maintain that their clustering analysis is robust. Secondly, they agree with Serre and Pääbo that membership of multiple clusters can be interpreted as evidence for clinality (isolation by distance), though they comment that it may also be due to admixture between neighbouring groups (small island model). Thirdly, they comment that evidence of clusteredness is not evidence for any concepts of "biological race".
Similarly, Witherspoon et al. (2007) have shown that while it is possible to classify people into genetic clusters, this does not resolve the observation that two individuals from different populations are often genetically more similar to each other than two individuals from the same population:
Discussions of genetic differences between major human populations have long been dominated by two facts: (a) Such differences account for only a small fraction of variance in allele frequencies, but nonetheless (b) multilocus statistics assign most individuals to the correct population. This is widely understood to reflect the increased discriminatory power of multilocus statistics. Yet Bamshad et al. (2004) showed, using multilocus statistics and nearly 400 polymorphic loci, that (c) pairs of individuals from different populations are often more similar than pairs from the same population. If multilocus statistics are so powerful, then how are we to understand this finding?
All three of the claims listed above appear in disputes over the significance of human population variation and "race"...The Human Genome Project (2001, p. 812) states that "two random individuals from any one group are almost as different [genetically] as any two random individuals from the entire world."
Risch et al. (2002) state that "two Caucasians are more similar to each other genetically than a Caucasian and an Asian", but Bamshad et al. (2004) used the same data set as Rosenberg et al. (2002) to show that Europeans are more similar to Asians 38% of the time than they are to other Europeans when only 377 microsatellite markers are analysed.
In agreement with the observation of Bamshad et al. (2004), Witherspoon et al. (2007) have shown that many more than 326 or 377 microsatellite loci are required in order to show that individuals are always more similar to individuals in their own population group than to individuals in different population groups, even for three distinct populations.
In 2007 Witherspoon et al. sought to investigate these apparently contradictory observations. In their paper "Genetic similarities within and between human populations" they expand upon the observation of Bamshad et al. (2004). They show that the observed clustering of human populations into relatively discrete groups is a product of using what they call "population trait values". This means that each individual is compared to the "typical" trait for several populations, and assigned to a population based on the individual's overall similarity to one of the populations as a whole: "population membership is treated as an additive quantitative genetic trait controlled by many loci of equal effect, and individuals are divided into populations on the basis of their trait values." They therefore claim that clustering analyses cannot necessarily be used to make inferences regarding the similarity or dissimilarity of individuals between or within clusters, but only regarding the similarity or dissimilarity of individuals to the "trait values" of any given cluster. The paper measures the rate of misclassification using these "trait values" and calls this the "population trait value misclassification rate" (CT). The paper investigates the similarities between individuals by use of what the authors term the "dissimilarity fraction" (ω): "the probability that a pair of individuals randomly chosen from different populations is genetically more similar than an independent pair chosen from any single population." Witherspoon et al. show that two individuals can be more genetically similar to each other than to the typical genetic type of their own respective populations, and yet be correctly assigned to their respective populations.
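The distinction between pairwise similarity (ω) and assignment to a population can be made concrete with a toy simulation. The sketch below is not the method or data of Witherspoon et al.; it assumes two hypothetical populations whose allele frequencies differ by a fixed amount (`delta`) at every biallelic locus, estimates ω by comparing random pairs, and estimates a misclassification rate using a simple Hardy-Weinberg likelihood rule as a stand-in for the paper's trait-value classifier. All function names and parameter values are invented for illustration.

```python
import math
import random


def make_populations(n_loci, delta, rng):
    """Two hypothetical populations differing by `delta` in allele frequency per locus.

    With base frequencies in [0.2, 0.8], delta must stay below 0.4 so that
    all frequencies remain strictly between 0 and 1.
    """
    pop_a, pop_b = [], []
    for _ in range(n_loci):
        base = rng.uniform(0.2, 0.8)             # shared background frequency
        shift = rng.choice((-1, 1)) * delta / 2  # random direction of the difference
        pop_a.append(base + shift)
        pop_b.append(base - shift)
    return [pop_a, pop_b]


def sample_individual(freqs, rng):
    """Diploid genotype: copies (0, 1 or 2) of the reference allele at each locus."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]


def distance(x, y):
    """Pairwise genetic distance: summed allele-count differences over loci."""
    return sum(abs(a - b) for a, b in zip(x, y))


def estimate_omega(pops, trials, rng):
    """Share of trials where a between-population pair is closer than a within-population pair."""
    score = 0.0
    for _ in range(trials):
        a, b = rng.sample(range(len(pops)), 2)  # two different populations
        c = rng.randrange(len(pops))            # one population for the within pair
        d_between = distance(sample_individual(pops[a], rng),
                             sample_individual(pops[b], rng))
        d_within = distance(sample_individual(pops[c], rng),
                            sample_individual(pops[c], rng))
        if d_between < d_within:
            score += 1.0
        elif d_between == d_within:
            score += 0.5  # split ties evenly
    return score / trials


def log_likelihood(individual, freqs):
    """Log-probability of a genotype, assuming Hardy-Weinberg and independent loci."""
    total = 0.0
    for g, p in zip(individual, freqs):
        if g == 0:
            total += 2 * math.log(1 - p)
        elif g == 1:
            total += math.log(2 * p * (1 - p))
        else:
            total += 2 * math.log(p)
    return total


def misclassification_rate(pops, trials, rng):
    """Share of sampled individuals assigned to the wrong population by maximum likelihood."""
    errors = 0
    for _ in range(trials):
        true_pop = rng.randrange(len(pops))
        individual = sample_individual(pops[true_pop], rng)
        guess = max(range(len(pops)), key=lambda k: log_likelihood(individual, pops[k]))
        errors += guess != true_pop
    return errors / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n_loci in (10, 100, 500):
        pops = make_populations(n_loci, 0.1, rng)
        print(n_loci, estimate_omega(pops, 300, rng), misclassification_rate(pops, 300, rng))
```

Under these assumptions the misclassification rate should fall quickly as loci are added, because the classifier pools small frequency differences across all loci, while ω should fall much more slowly, because each pairwise distance is dominated by within-population variation. The exact numbers depend on the invented `delta` and should not be read as estimates for real populations.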
An important observation is that the likelihood that two individuals from different populations will be more similar to each other genetically than two individuals from the same population depends on several criteria, most importantly the number of genes studied and the distinctiveness of the populations under investigation.
Given 10 loci, three distinct populations, and the full spectrum of polymorphisms, the answer is ω ~ 0.3, or nearly one-third of the time. With 100 loci, the answer is ~20% of the time and even using 1000 loci, ω ~ 10%. However, if genetic similarity is measured over many thousands of loci, the answer becomes never when individuals are sampled from geographically separated populations.
By geographically separated populations, they mean sampling people only from distant geographical regions while omitting intermediate regions, in this case Europe, sub-Saharan Africa and East Asia. They continue:
On the other hand, if the entire world population were analyzed, the inclusion of many closely related and admixed populations would increase ω... In a similar vein, Romualdi et al. (2002) and Serre and Paabo (2004) have suggested that highly accurate classification of individuals from continuously sampled (and therefore closely related) populations may be impossible.... Classification methods typically make use of aggregate properties of populations, not just properties of individuals or even of pairs of individuals... The Structure classification algorithm (Pritchard et al. 2000) also relies on aggregate properties of populations, such as Hardy–Weinberg and linkage equilibrium. In contrast, the pairwise distances used to compute ω make no use of population-level information and are strongly affected by the high level of within-groups variation typical of human populations. This accounts for the difference in behavior between ω and the classification results.
Witherspoon et al. also add:
[The fact that,] given enough genetic data, individuals can be correctly assigned to their populations of origin is compatible with the observation that most human genetic variation is found within populations, not between them. It is also compatible with our finding that, even when the most distinct populations are considered and hundreds of loci are used, individuals are frequently more similar to members of other populations than to members of their own population.