Michael Cooley's Genetic Genealogy Blog
20 August 2016

Imagine a Bucket Full of SNPs

What if David E. Cooley, the former president of the Cooley Family Association of America, had been, hypothetically, the very first person to have 10 million positions on his Y chromosome sequenced? The results would have been a heap of nucleotides (A's, C's, T's, and G's) arranged in no particular order, thrown about willy-nilly. No sense could possibly have been made of them: there would have been no way to know which values represent mutations and which are "ancestral," meaning those shared by virtually everyone. With no other test results, a comparative analysis would have been impossible.

But first, some review. The Cooley DNA Project is a surname organization. In the standard western model, a surname is passed through the father line. It just so happens that the Y chromosome, which carries the male sex gene, also passes through the male line. Like all forms of DNA, the Y moves from one generation to another making copies of itself, only one of which gets handed off to a male fetus. These copies are not made of the same molecules as the original; they are clones. This cloning has been going on since the beginning of humanity. Indeed, we can regard our mortal bodies as mere hosts to our immortal DNA.

This passing of Y-DNA from father to son is a particularly valuable tool. Because mutations rarely occur, the Y is an archaeological treasure trove. From it, we can harvest data that has moved from host to host for tens, even hundreds, of thousands of years.

Every time a cell divides, mutations occur, with each cell having a different set of mutations. But the cells that really matter for genetic genealogy are the germ cells (the male sperm cells in this case), about half of which carry a Y chromosome, which in turn includes the male sex gene. The Y-bearing sperm that reaches the egg carries its own unique set of mutations, and those are born along with the baby boy. That's what I meant in New Big Y Results for Cooley Group CF02 when I said that SNPs are people too.1 Theoretically, each mutation could bear the name of its host. But what was the name of the person into whom the 4,000-year-old mutation known as Z2563 was born? We'll never know. Instead, we call it Z2563.

If all but a handful of men have the value of G (a guanine molecule) at a certain position on the Y, the assumption is that G is the "ancestral" value. Any other value found at that location would be a mutation. For example, men who have Z2563, which is found at position 14364877 on the Y, have a C (cytosine) instead of the T (thymine) found in the general population.
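The majority rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample of 100 men is made up, and the function names `call_ancestral` and `is_derived` are mine, not part of any testing company's tools. Only the position, the T, and the C come from the Z2563 example.

```python
from collections import Counter

def call_ancestral(values):
    """Return the most common base among the tested men at one position."""
    counts = Counter(values)
    base, _ = counts.most_common(1)[0]
    return base

def is_derived(value, ancestral):
    """A value counts as a mutation if it differs from the ancestral base."""
    return value != ancestral

# Hypothetical sample at position 14364877: most men carry T, a few carry C.
sample = ["T"] * 97 + ["C"] * 3
ancestral = call_ancestral(sample)

print(ancestral)                   # the majority base
print(is_derived("C", ancestral))  # a Z2563 man differs from the majority
```

A real analysis would compare against a reference sequence and many thousands of testers, but the logic is the same: the common value is assumed ancestral, and anything else is flagged as a mutation.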

Setting the hypothetical no-other-tests scenario aside, what do we know about this bucket (left side of the page) of more than 3,000 SNP mutations found in David Cooley's Big Y results? Each represents a specific value (A, C, T, or G). Because they're mutations, they're not found in the general population. How do we make sense of them? First, we need to discover where they're clustered on a map. They can then be arranged in a relative timeline based on geography.

For example, the SNP called M168 is found in nearly all men (the exception being two populations in Africa). We can pull it from the bucket and use it as our starting point. Looking at P143, we find it everywhere but Africa. Evidently, it emerged just after a major population movement out of the continent, perhaps about 70,000 years ago. The SNP M173 is typically found in Eurasia, and M343 (also known as R1b) is found in its largest concentration in Western Europe, as shown in this map:

Genetic genealogists would shorthand the descent of the four SNPs I've just described in this manner: M168 → P143 → M173 → M343. Three things typically occur as we move down the tree: the geographic regions get smaller, the populations initially get bigger, and the SNPs themselves get younger. Of course, there are exceptions. A SNP always originates in just one person and grows in area and numbers. It can arrive at a spot on the map, such as Britain, and grow exponentially in a small region. But that takes time. As we approach the modern era, the newer the SNP, the smaller its representation in the population.
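To make the shorthand concrete, here is a small Python sketch of how such a chain could be ordered. The four testers and their SNP lists are hypothetical, and the rule used (more carriers means an older SNP, with each younger SNP's carriers nested inside the older one's) is a simplifying assumption of mine, not how any lab actually builds the tree.

```python
def order_by_nesting(testers):
    """Order SNPs from oldest to youngest by how many testers carry them."""
    carriers = {}
    for name, snps in testers.items():
        for snp in snps:
            carriers.setdefault(snp, set()).add(name)
    # Assumption: the more testers carry a SNP, the older it is.
    ordered = sorted(carriers, key=lambda snp: len(carriers[snp]), reverse=True)
    # Each younger SNP's carriers should sit inside the older SNP's carriers.
    for older, younger in zip(ordered, ordered[1:]):
        if not carriers[younger] <= carriers[older]:
            raise ValueError(f"{younger} is not nested inside {older}")
    return ordered

# Hypothetical testers, each listing the SNPs he is positive for.
testers = {
    "tester_1": {"M168"},
    "tester_2": {"M168", "P143"},
    "tester_3": {"M168", "P143", "M173"},
    "tester_4": {"M168", "P143", "M173", "M343"},
}

chain = " → ".join(order_by_nesting(testers))
print(chain)
```

With these made-up testers the function recovers the same descent written out above.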

For example, the very bottom of the Benjamin Cooley SNP tree, covering more than 4,000 years, looks like the following. As the SNPs get younger, the known population decreases until we have only two people who have tested, David and Doug, both of whom are known to have the 17 SNPs listed in the green box above Benjamin's name. They are the only two in the worldwide SNP database who have them. Of course, all of Benjamin's patrilineal descendants had and have those SNPs, as well as all the SNPs that emerged into the lineage before them.

So what's the deal with that green block of 17 SNPs? Remember, we started with the single block, or bucket, of SNPs on the left: those found with that very first test by David. Then a second hypothetical tester came along and matched all David's SNPs from DF83 and above. That split the single block into two SNP blocks. The common ancestor of David and the second tester, the unknown person with whom the shared DF83 SNP was born, would have lived about 4,000 years ago. Likewise, the common ancestor for the CF09 Cooleys and David lived about 2,800 years ago, and that for the Brown family about 2,200 years ago. And we know, of course, that the most recent common ancestor (MRCA) for David and Doug, Benjamin, was born 400 years ago. This means that the 17 SNPs in that block emerged, one man at a time, between about 200 BC and 1615 AD. Eventually, with additional testing, most of those SNPs will be arranged in a timeline in the manner we next see with my own Cooley family.

The first CF01 (and non-Cooley) tester was a descendant of John Hackett, born in Derbyshire, England in 1746. When he tested, he had 18 SNPs that were unique to him, not found in any other tester — until my Cooleys, descendants of John Cooley of Stokes County, NC, tested. This five-frame animated GIF illustrates what happened as additional testers came along, and in the order the tests occurred. Those 18 SNPs, having emerged over about 2,000 years, were broken into four blocks, all in green:

The three Cooley SNPs (YP4491, YP4492, and YP4493) shared by myself and another tester likely came into John Cooley's line within about 500 years of his birth. Their emergence could have been evenly spaced on the timeline, say about every 150 years or about five generations, which is the currently estimated average interval between SNP mutations. But they could also have been bunched together in a short period of time. It's unlikely, but even one person might have been responsible for all three. We don't yet know. But consider that last white box of 5 SNPs. Those came into my line in under 200 years, over a period of eight generations, an average of more than one mutation every other generation. Now, consider that block of five unparsed SNPs that were left to the Hackett tester. They're thought to have emerged over a period of about 700 years, not 200. If the timeline for those turns out to be compressed over a shorter period of time, so will the timeline for the three Cooley SNPs. Time, and more testing, will tell.
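The arithmetic behind those averages is simple enough to check. The sketch below uses only the round figures given above (500, 200, and 700 years; 3 and 5 SNPs; eight generations), and the helper name `years_per_snp` is just mine. Treat the outputs as rough averages, not dates.

```python
def years_per_snp(years, snps):
    """Average number of years between SNPs if they were evenly spaced."""
    return years / snps

cooley_block = years_per_snp(500, 3)   # the YP4491, YP4492, YP4493 block
my_recent = years_per_snp(200, 5)      # the last white box of 5 SNPs
hackett_block = years_per_snp(700, 5)  # the 5 unparsed Hackett SNPs

# Eight generations for 5 SNPs: generations per mutation.
generations_per_snp = 8 / 5

print(round(cooley_block), my_recent, hackett_block, generations_per_snp)
```

An even split of the three Cooley SNPs works out to roughly 167 years apiece, in the neighborhood of the 150-year estimate, while my own five recent SNPs average one per 40 years.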

What happened to Hackett's initial 18 SNPs, which emerged over about 2,000 years, can happen to Benjamin's block of 17 SNPs, which emerged over an equivalent timeline. What can be done to break that up? Well, we need another Big Y tester, someone whose lineage broke away over that interval. Perhaps as many as four testers will come along, similarly breaking up the block in the way Hackett's were parsed.
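The way a new tester breaks up a block can be sketched in Python. This is a toy model under my own assumptions: the SNP names `s1` through `s6` and the two new testers are invented, and `split_block` is a hypothetical helper, not a tool anyone actually uses. The idea is only that whatever a new tester shares with the block is older than whatever he lacks.

```python
def split_block(block, tester_snps):
    """Split a block into SNPs the new tester shares and SNPs he lacks."""
    shared = [snp for snp in block if snp in tester_snps]
    private = [snp for snp in block if snp not in tester_snps]
    return shared, private

# Hypothetical starting block: six SNPs unique to the first tester.
block = ["s1", "s2", "s3", "s4", "s5", "s6"]

# A distant tester shares only the two oldest SNPs.
oldest, remainder = split_block(block, {"s1", "s2"})

# A closer tester shares four SNPs, which splits the remainder again.
middle, youngest = split_block(remainder, {"s1", "s2", "s3", "s4"})

print(oldest, middle, youngest)
```

Two well-placed testers turn one block into three. Each further tester whose line branched off inside a block can split it once more.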

One thing is certain, however. There is nothing to be gained by another Big Y test from another known Benjamin descendant — he'd have all those same SNPs that Doug and Dave have, plus the small number of mutations that saw the light in his lineage since the birth of Benjamin's children. To learn those would merely be an academic exercise.

What about those CF02 testers who are not known to be descended from Benjamin? A mismatch on any of that block of 17 SNPs (now dubbed the A12020 block) would be very telling. At this point, a match will not tell us anything. We're not there yet. But the A12020 SNPs can now be tested at yseq.net for $17.50 each. You'd receive a yea, a partial yea, or a nay (and no new SNP discovery) for a much smaller fee than the $575 for a Big Y. If someone wants to do that, I wouldn't discourage it. However, it might be best to wait until the A12020 block is whittled down and we know a whole lot more than we now know.
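For anyone weighing the cost, here is a rough Python sketch. The two prices come from the figures quoted above. Everything else is my assumption: that a tester orders all 17 SNPs individually, that "yea," "partial yea," and "nay" mean all, some, or none positive, and the placeholder SNP names, which are not the real names in the A12020 block.

```python
PRICE_PER_SNP = 17.50  # per-SNP price quoted above
BIG_Y_PRICE = 575.00   # Big Y price quoted above

def summarize(results):
    """Return (verdict, cost) for a set of single-SNP test results."""
    if not results:
        raise ValueError("no SNPs were tested")
    positives = sum(results.values())
    if positives == len(results):
        verdict = "yea"
    elif positives == 0:
        verdict = "nay"
    else:
        verdict = "partial yea"
    return verdict, len(results) * PRICE_PER_SNP

# Hypothetical tester positive for all 17 SNPs (placeholder names).
all_positive = {f"snp_{i}": True for i in range(1, 18)}
# Hypothetical tester positive for only the first 10.
mixed = {f"snp_{i}": i <= 10 for i in range(1, 18)}

print(summarize(all_positive))
print(summarize(mixed))
print(BIG_Y_PRICE - 17 * PRICE_PER_SNP)  # savings versus a Big Y
```

Testing the whole block one SNP at a time would come to $297.50 under these assumptions, a little over half the price of a Big Y.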

The singer Sarah Vaughan used to say that there are notes between the notes. Well, there are SNPs between the SNPs. Each of the above represents a block of them, just like Benjamin's A12020 block of 17, my YP4491 block of 3 (if it hadn't been for the Hackett tester, there'd be 7), and the 10 SNPs in CF07's Y23275 block. Sometimes they number in the hundreds. That there are more than 600 SNPs in the top L984 block likely means that a large population, probably in Africa, has not yet tested. In other words, the 34 levels of SNP blocks shown for Benjamin Cooley will slowly expand to several thousand. We merely need to find the right testers!

1 A Single Nucleotide Polymorphism (SNP) occurs when a nucleotide, aka a base, mutates from one molecule to another.