Michael Cooley's Genetic Genealogy Blog
15 August 2018

Greater Genetic Difference Means a Greater Genealogical Relationship

Earlier this year, one of the CF09 members of the Cooley DNA Project ordered the Big Y, which looks at more than ten million locations on the Y chromosome. I wrote about the findings last April in Article 48 and asked the group to "light my fire." Some of you came through, and we now have the benefit of a total of three CF09 Big Y testers, as follows:

A definition and summary before I begin: The Y chromosome passes down only the male line, a single thread through each of our genealogies: son to father, to father, to father. It is a genetic lineage that can be traced over tens of thousands of years. Even when we arrive at the point of complete pedigree collapse (that moment in history when every person then living, and having descendants living today, was an ancestor of each of us), only a handful of men belonged to any one Y thread: a father with his living son, grandson, etc.1 (This five-generation chart illustrates the single-thread nature of the Y.)

Over time, "kinks" emerge in these threads — mutations that pass down and accumulate through the generations. A mutation called a Single Nucleotide Polymorphism, or SNP, is of particular interest. A SNP results when a single genetic letter changes from one to another — an A, C, T, or G transmuted to one of the others. Each designation noted in the above diagram is such a mutation.

The top graphic in the earlier article illustrates how the first CF09 tester (a descendant of James Cooley and Penelope Gargus) fits into a somewhat larger picture, including the very distantly-related CF02 Cooleys. It also shows that the tester was found to have nine SNPs unique to him. (Actually, I should have listed only eight.)2 But we now know that four of those SNPs (A21489, A21492, A21495, and A21496) are shared by the new testers, and that they constitute a brand new haplogroup, which I call R1b-A21489, named for the first SNP in the list. In other words, the group of nine (or so) SNPs split in two by virtue of the new tests.

The Case of SK411

SK411 is listed in this new group of SNPs. FTDNA didn't report it when the first Big Y results were released earlier this year, and I wasn't doing independent studies of the results at the time because the company had held up production of BAM files for several months. (A BAM is a compressed, multi-gigabyte file listing the raw data reads.) Yet it's undeniably there. But I have a significant issue with the SNP, if only because the folks at FTDNA have designated it as the name for the new haplogroup.3 The problem is simply that SK411 had been discovered, named, and identified in 2014 as part of haplogroup B, a major haplogroup that is separated from the R haplogroup by tens of thousands of years. There's nothing wrong with the mutation; identical mutations can be identified down multiple branches. But this designation can confuse the origins of the haplogroup. Is it B-SK411 or R-SK411? It's both, of course, but the uniquely-defined A21489 avoids the possible confusion. For example, U106 is a famous European SNP belonging in R1b. What sense would it make to identify a Southeast Asian haplogroup as U106? Yes, the reason for its appearance can be annotated, but those who know of U106 will certainly scratch their heads and wonder whether there had been an error. With all due respect to FTDNA, I'll continue to refer to this haplogroup (at least up until the time it splits in two) with A21489 at its head.

Two other SNPs in the same group have also shown up in FTDNA's recent reports: BY43760 and BY84373. I've grayed these because Yseq reports that both are unreliable.4 It should be noted, though, that setting SNPs aside for that reason is controversial among some project admins due to the reasonable position that such a SNP might defy logic and end up being a valuable marker. (I'm dearly holding onto my own YP4494, despite its status.) For example, although the quality of reads for BY84373 is high in all three testers, it exists in a highly variable region and might disappear in lineages where we'd expect to find it. The point of SNPs, then, becomes moot. (That appears, at least, to be the position of population geneticists.) In other words, a designated SNP should be consistent and verifiable. It should be forever. I'll leave these two gray, at least for now.

The Case of A22608

The genetic terms coverage and depth are reportedly poorly-defined and are often used interchangeably. They can refer to either the number of reads the sequencing has successfully performed on any one position, or to the percentage of positives found at that position. Apart from the quality scores generated by the lab, these are the most important factors used to determine whether we have a verifiable SNP.

FTDNA assessed SNP A22608 as positive only for tester #B177240 (descended from Washington Coley). But when I saw it listed as a nebulous (my word) positive for #332918 (Benjamin Coley), I became excited. Might it be, I wondered, that this SNP could be used to distinguish the CF09 Coleys from the CF09 Cooleys? Deciding to use the term depth to refer to percentage, I turned to the BAM files for each person, but I was disappointed in the results.

Perhaps a score of 81% achieves a threshold for FTDNA's algorithms, but it's clear that all three testers exhibit the presence of A22608. Although the number of successful reads is more than adequate (they can be in the thousands), the percentage of reads showing the mutation (a C at position 23848899) is rather low. (Ideally, it would be in the high 90s, if not 100%.) The SNP would become more acceptable if all CF09 testers were shown to have it, but it clearly does not define the Washington Coley lineage. And it has another factor going against it: the quality for each read (which I don't discuss here) is very low. Compare these readings with those for A21489 (designated as the upstream haplogroup). The depth scores are excellent for all three testers. So, it turns out, are the quality scores.
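For readers who want to see what "depth as a percentage" means in practice, here is a minimal sketch in Python. It assumes the reads covering one position have already been tallied by base. The function name and the read counts are invented for illustration; they are not figures from the BAM files.

```python
def derived_read_percentage(base_counts, derived_base):
    """Percentage of reads at one position that show the derived (mutated) base.

    base_counts maps a label to the number of reads carrying it at the position.
    """
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return 100.0 * base_counts.get(derived_base, 0) / total


# Hypothetical tally for one tester at one position: 81 reads show the
# derived C, and 19 reads show something else. These counts are made up.
example_counts = {"C": 81, "non-C": 19}
print(derived_read_percentage(example_counts, "C"))  # 81.0
```

A percentage like this says nothing about the quality of each read, which has to be judged separately.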

So, there's nothing here (unfortunately) that distinguishes the Coleys from the Cooleys — and I've gone through the list SNP by SNP. If anything, A22608 belongs upstream with A21489, and that's where I've placed it — provisionally.

Judging Relatedness through SNP Count

But let's not lose sight of the reason we're doing this. We now know how these three men are related genetically, but how are they related genealogically? Our ultimate object is to identify their common ancestor. We see clearly that there are three separate branches. Their confluence — the A21489 group of SNPs — represents their Most Recent Common Ancestor (MRCA). That's where Granddaddy Cooley/Coley resides. How can we home in on him?

The first step is to gain an understanding about the timeframe in which he lived. The MRCA (Granddaddy), whoever he was, lived during very specific years; he was born on X date and died on Y date — and he was born with the A21489 SNPs. (We know this because all three testers have them and, because of the nature of Y inheritance, they could only have come from him.) SNP counting can help us. Five mutations emerged during the lineage from the MRCA to tester #B12285. Only one occurred in the lineage to #B177240. And three SNP mutations were born between the birth of one of the MRCA's sons and our tester #332918. We don't know when any one SNP popped in, only that each event happened during the reproductive years of our MRCA's descendants. The law of averages helps us out a bit. Yes, averages of small numbers aren't always representative, but the law of large numbers will come to our rescue as more testers come online.

So, in this case, we have an average of three mutations per lineage since the MRCA — (5 + 3 + 1) / 3. If we multiply that by the grossly estimated mutation rate of 144 years per SNP, we arrive at a guesstimate that the MRCA was born about 432 years before the present (BP), or ca. 1518.5

That doesn't help us a great deal. If accurate, our MRCA never lived in the New World. We might reasonably guess, however, that the five mutations in #B12855, whose earliest known ancestor (EKA) is James Cooley (believed to have been in Virginia in 1758), are more than would be expected since the MRCA, assuming we knew who he was. Can it be that the average SNP count between the other two testers is more accurate: two SNPs, which would translate roughly to a birth year of 1662? That's still beyond the genealogical memory of our lineages.
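The SNP-counting arithmetic can be written out as a short Python sketch. The rate of 144 years per SNP and the 1950 reference year come from this article; the function name is only illustrative, and the result is no more than the same guesstimate expressed in code.

```python
YEARS_PER_SNP = 144  # the grossly estimated rate used in this article
PRESENT = 1950       # reference year for "before the present" (footnote 5)


def estimate_mrca_birth_year(snp_counts, years_per_snp=YEARS_PER_SNP, present=PRESENT):
    """Return (average SNPs per lineage, years before present, birth year)."""
    if not snp_counts:
        raise ValueError("need at least one lineage")
    average = sum(snp_counts) / len(snp_counts)
    years_bp = average * years_per_snp
    return average, years_bp, present - years_bp


# All three lineages: 5, 3, and 1 private SNPs since the MRCA.
print(estimate_mrca_birth_year([5, 3, 1]))  # (3.0, 432.0, 1518.0)

# Only the two lineages with fewer SNPs.
print(estimate_mrca_birth_year([3, 1]))     # (2.0, 288.0, 1662.0)
```

Leaving out a single lineage moves the estimate by nearly a century and a half, which shows how sensitive the average is with only three testers.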

Judging Relatedness through Genetic Distance

One of the advantages of the new Big Y-500 test is that FTDNA now reports the Short Tandem Repeats (STRs) of up to 561 markers on the Y chromosome. STRs are the numbers we see on the results page of the Cooley DNA Project. Each small box represents the number of times a defined series of genetic letters repeats at the stated position. For example, all the CF09 Cooleys have 12 repeats of GTT at DYS426, the sixth column listed on the results page. Genetic distance (GD) is the number of differences between two or more people or between populations. Because of the law of large numbers, the data on 561 markers theoretically provides us with more accurate data than that for a smaller set.6
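As a rough illustration of the idea, genetic distance can be sketched in Python as a count of markers whose repeat values differ. This follows the simple definition given here; other conventions, such as counting each repeat step separately, also exist. Apart from the 12 repeats at DYS426, every value below is invented, and the tester names are placeholders.

```python
from itertools import combinations


def genetic_distance(a, b):
    """Count the markers whose repeat values differ between two testers.

    Only markers reported for both testers are compared.
    """
    shared = a.keys() & b.keys()
    return sum(1 for marker in shared if a[marker] != b[marker])


def average_pairwise_gd(testers):
    """Average genetic distance over every pair of testers in a group."""
    pairs = list(combinations(testers.values(), 2))
    if not pairs:
        raise ValueError("need at least two testers")
    return sum(genetic_distance(x, y) for x, y in pairs) / len(pairs)


# Made-up four-marker haplotypes; a real comparison would use up to 561 markers.
testers = {
    "tester_a": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS426": 12},
    "tester_b": {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS426": 12},
    "tester_c": {"DYS393": 12, "DYS390": 25, "DYS19": 14, "DYS426": 12},
}

print(genetic_distance(testers["tester_a"], testers["tester_b"]))  # 1
print(genetic_distance(testers["tester_a"], testers["tester_c"]))  # 2
print(round(average_pairwise_gd(testers), 2))                      # 1.33
```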

These values are a little on the high side. The average GD over 561 markers among the testers from my own CF01 group is about 4, and that's with a known MRCA born about 1738. The average genetic distance between those of a known MRCA (died in 1702) in a group at the Strother DNA Project is 5. Closer to our mark is the GD between the two CF02 Benjamin Cooley (1615-1685) Big Y testers, which is 8. Of course, as I've described many times before, STRs are fickle. They exist in volatile areas of the Y, and the number of repeats at any one position can go either up or down (and that's why we don't want to see SNPs in these regions). So, these are by no means hard numbers. But by looking at a larger sample (the law of large numbers again), we can identify a trend. For example, when we compare CF02 against CF09 — lineages that are separated by about 2,000 years through haplogroup R1b-Y15926 — we get an average GD of 31 markers over 561 STRs.

Clearly, the more time that passes, the greater the opportunity for mutations to accumulate in the lineage. But if we have a GD of 31 over 2,000 years, does that translate to 15 or 16 per thousand years, and 8 or so during a 500-year period? Perhaps. We'll need to apply a lot of fudge, though.
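The proportional reasoning in that question can be sketched as follows. The sketch assumes that genetic distance grows in a straight line with time, which is exactly the assumption that needs the fudge; the only inputs are the GD of 31 and the 2,000 years quoted above, and the function names are illustrative.

```python
REFERENCE_GD = 31       # average GD between CF02 and CF09 over 561 STRs
REFERENCE_YEARS = 2000  # approximate separation of the two lineages


def expected_gd(years, ref_gd=REFERENCE_GD, ref_years=REFERENCE_YEARS):
    """Scale the reference GD linearly to a different span of years."""
    return ref_gd * years / ref_years


def years_for_gd(gd, ref_gd=REFERENCE_GD, ref_years=REFERENCE_YEARS):
    """Inverse of expected_gd: the span of years implied by a given GD."""
    return gd * ref_years / ref_gd


print(expected_gd(1000))  # 15.5
print(expected_gd(500))   # 7.75
```

Calling years_for_gd(8) runs the same proportion in the other direction and gives a span a little over 500 years.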

Bringing it Home — Maybe

Although Washington and Benjamin Coley Jr were born well after James, the number of SNP and STR differences makes it very unlikely they were descended from James. If we can trust these figures — and they result from a small sample group — my guess is that the CF09 MRCA was likely born in the 17th century.

But this is among only three Big Y testers in a group that presently has sixteen obviously-related Y-STR testers — and this is a population that clusters around North Carolina. Indeed, a large cluster in a specific geographical region tells a population geneticist that she might be dealing with an older population. Of course, we're not talking here about a cohesive tribe of several hundred people. But it's a viewpoint that has merit. In other words, it's conceivable that the CF09 MRCA was an early colonist in the 17th century, perhaps in Virginia. I'm not going to place money on it, but it's a hypothesis from which to work.

The Genetic Genealogy Future for CF09

Sixteen STR testers is a fair amount of potential data to pull from. Although further Big Y testing would be helpful, individual SNP testing can take us a long way, especially where there's a shared Earliest Known Ancestor (EKA). For example, there are two other testers known to have descended from James Cooley and Penelope Gargus. James was probably not born in 1758 with all five SNPs found in the first lineage. It's likely that one or two of them surfaced over the descent from James. Because shared descent means shared DNA markers, a test from a second James descendant will tell us exactly which of the five SNPs James was born with. (I know, for example, that my John Cooley was born with YP4491 simply because all of his patrilineal Cooley descendants have it.)

So, the possible field of Cooley/Gargus SNPs has been narrowed down from nine at my previous reporting to five, and, in the last analysis, that will likely end up being closer to three — a set number of SNPs held by all of James's Y descendants — a set that can prove (or otherwise) Cooley/Gargus descent. But it needn't take a Big Y to identify James's Y-DNA print. All five SNPs — A21490, A21493, A21494, A21498, and A21961 — can be ordered from Yseq.net for a total of $90.

But remember that up to this date, we've identified three distinct lineages descended from our elusive (but not fictional) MRCA. Do you have strong genealogical data that you're descended from Benjamin Coley, Jr? Testing the three SNPs in that lineage will help determine the truth of it. But it's not always that simple. Feel free to start a discussion with me about that.

The Nature of It

Genetics is science. But like all of nature — indeed, all of the cosmos — scientific inquiry reveals a heavy dosage of randomness. We find natural flaws, in the form of mutations in the case of DNA, everywhere we look. Scientists pursue and interpret those markers — whether they be the chemical traces left in an ice core or the light spectrum emanating from a star — and come up with a best-guess interpretation based on all the evidence at hand. Any degree of certainty or machine-like precision is found only in our inventions, mathematics and all. It takes thoughtful analysis of a steady stream of data to make new discoveries. Like geneticists, genealogists who have any experience at all understand that. This is a slow-moving train — destination unknown. Still, we need only follow the tracks.

FTDNA presently has a special through the month of August on many of its products, including the Big Y. Now's the time to jump on board! I'll be happy to help you out along the way.

1 The number of our ancestors grows exponentially as we go back generation to generation (2, 4, 8, 16, 32, etc). If each of our ancestors at the 33rd generation were unique individuals, there would have been more than eight billion living in about 1100 AD, which is impossible. Pedigree collapse fixes that. We could be descended tens of thousands of times from any one ancestor.

2 Yseq.net reports that A21497 is unreliable. I've grayed it out in the above graphic.

3 A haplogroup is merely a collection of SNPs (which can amount to even one SNP) belonging, as per current knowledge, at the same level. As we saw above, these levels, or haplogroups, often split upon the arrival of new testers.

4 BY43760, for example, is in a region called DYZ19, a long, disorganized series of Short Tandem Repeats (STRs) deemed unsuitable for reliable SNP calling.

5 Any standard, such as "before the present," needs a specific reference point. In this case, it's the year 1950.

6 It needs to be noted, however, that the tested sets of 12, 37, and 67 markers are carefully selected to yield the most reliable results. Furthermore, FTDNA does not guarantee the accuracy of all 561 markers.