Michael Cooley's Genetic Genealogy Blog
30 June 2018

The Eldridge DNA Project

I assumed administration of the Eldridge DNA Project sometime in May 2013 while finishing my degree in history at Humboldt State University. It, like other projects I took on at that time, had been orphaned. Although my genealogical connection to the name is remote (I'm descended from Prudence Eldridge of Cape May, New Jersey, mythically known as Princess Snowflower or Toudl-Hkiligo),1 this was another opportunity to study the workings of genetic genealogy and become familiar with project management.

The time I've spent with the project has largely been devoted to maintenance, the majority of my endeavors being with the Cooley DNA Project. But Group 04 — a collection of Eldridges, Reynolds, and Gardiners — has recently stepped up to the plate with advanced testing, particularly the Big Y, which examines more than 10 million bases (the famous four genetic letters) in search of undiscovered mutations. This article is an opportunity not only to present the new findings, but to open some additional dialog, and to discuss, with illustration, the advantages in advanced testing. Specifically, I'll do my best to explain what the three new Big Y results shown at left mean to the testers' genealogy. But first, some definitions.

Repeated Genetic Strings

The results page for the Eldridge DNA Project shows Short Tandem Repeats (STRs). Each number represents the number of times a string of genetic letters (A, C, T, and G) repeats at the designated marker. For example, most of the testers have 14 repeats of TAGA at a marker called DYS19, the third marker (and column) listed on the results page. These repeats, however, can go either up (insertions) or down (deletions) from one generation to the next, making STRs impractical for clearly identifying lineages. Surely, if a half dozen testers completely match across 37 markers, it's reasonable to assume what the "ancestral" values were. But any real degree of variance makes it nearly impossible to determine which value preceded another, and therefore, the sequence of marker descent is fogged.
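For readers who think better in code, here is a minimal Python sketch of what "14 repeats of TAGA" means. The flanking letters around the repeat are invented for illustration, and the function is only a toy counter, not how a testing lab actually calls STR values.

```python
def count_tandem_repeats(sequence, motif):
    """Return the longest run of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Count consecutive copies of the motif beginning at this position.
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


# Hypothetical stretch of DNA: made-up flanks around 14 copies of TAGA.
father = "GGCC" + "TAGA" * 14 + "CTTT"
# A son born with a one-repeat deletion at the same marker.
son = "GGCC" + "TAGA" * 13 + "CTTT"

print(count_tandem_repeats(father, "TAGA"))  # 14
print(count_tandem_repeats(son, "TAGA"))     # 13
```

The single number reported for a marker is just this count, which is why a value of 13 cannot tell us by itself whether the line lost a repeat from 14 or gained one from 12.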

This just means that a different approach to the data is required. It turns out that the number of differences between any two testers can give us a hint as to the degree of relationship between them, and the broader the sample, the more reflective that hint will be. In other words, the more testers, the more reliable the results. Statisticians call it the law of large numbers. This genetic distance (GD), if not always accurate, is a strong indicator of relatedness.

Still, a degree of skepticism about GD is needed. For example, I was born with an insertion of an additional repeat at DYS449, making me the only person in the CF01 group at the Cooley DNA Project having 34 repeats at that position. Right out of the womb, I was a genetic distance of 1 from my father. Furthermore, another mutation occurred sometime between the birth of Greenbury Cooley in 1844 and my grandfather's birth in 1899. (I haven't found the right tester with whom I can further identify its entry into the lineage.) This makes me a genetic distance of at least 2 over 37 markers with all other group members. That's certainly within the bounds of acceptability, especially considering the common ancestor was born nearly 300 years ago. But a descendant of Joseph Cooley (1767-1826) has an acceptable GD of 4 out of 37 from the modal (most common) values for the group. This puts him and me at a GD of 6 out of 37 markers, which is pushing the statistical envelope. Individually each result looks good and the relationship is not doubted, but taken together the number is a little misleading. That's just the nature of STRs.
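The arithmetic behind that 2 + 4 = 6 can be sketched in a few lines of Python. This is a simplified step-wise count (one point per repeat gained or lost at each shared marker); testing companies may score some markers differently. The marker names and repeat values below are invented, not real project results.

```python
def genetic_distance(a, b):
    """Sum of absolute repeat-count differences over markers both testers share."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


# Hypothetical six-marker haplotypes; names and values are made up.
modal = {"m1": 13, "m2": 24, "m3": 14, "m4": 11, "m5": 30, "m6": 33}
# Tester A differs from the modal by one step at two markers.
tester_a = {**modal, "m5": 31, "m6": 34}
# Tester B differs from the modal by one step at four other markers.
tester_b = {**modal, "m1": 12, "m2": 25, "m3": 15, "m4": 12}

print(genetic_distance(tester_a, modal))     # 2
print(genetic_distance(tester_b, modal))     # 4
print(genetic_distance(tester_a, tester_b))  # 6
```

Because the two testers' mutations fall on different markers, their distances from the modal add up when they are compared with each other, even though each is comfortably close to the group.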

Single Point Mutations

It's different with Single Nucleotide Polymorphisms (SNPs). These mutations are pretty much as they sound: a single genetic letter has morphed to one of the other three values. For example, among those Group 04 members who have tested, it's been discovered that position 21335389 has mutated from the ancestral C value, that which the population is expected to have, to a T. This mutation has been dubbed A22119 (and later BY45361 by FTDNA). SNPs are rare events and, considering there are about 58 million nucleotides on the Y chromosome, specific SNP combinations are found only in specific lineages (family, clan, tribe, geographic region, or population, depending on how far back into time we go). And they almost invariably stay put, not just over hundreds of years, but tens and hundreds of thousands of years. That's due to the nature of the Y chromosome, the carrier of the male sex gene. The Y is passed from one generation of men to the next without any genetic influence from the maternal side of the family, and generally without intervention from natural selection (because these markers are not expressed in trait-bearing proteins). The differences we find in the population are due only to the rare and rather whimsical event of chance mutation.

Haplogroups

It's also important to define haplogroups — those points or nodes at which a tree forks off into descendant branches. The SNP at the very top of our tree, U106, which defines the haplogroup R-U106, is estimated by YFull.com to be about 4700 years old. U106 itself is a descendant of R-M343, also known as R1b, which is the most common major haplogroup in Western Europe. (Eastern Europe is dominated by R1a.)

Each parent node can have one or several child branches, each a sibling SNP to one another. But the metaphor to a human family pretty much ends there: a parent node isn't the union of two branches, as happened between your parents. After all, we're talking about the Y chromosome, which descends, in clone-like fashion, only through men. In fact, multiple SNPs can exist at the same level and in the same node. They live in the same box because every Y tester below the box possesses all member SNPs. But eventually, testers come along who have only some of those SNPs, meaning that their line broke off before the currently-defined haplogroup fully formed. The "box" has been split in two, the new haplogroup being upstream (or the parent) of the remaining SNPs. A new branch (node, subclade, or haplogroup) is thus discovered.
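The box-splitting idea can be sketched in Python as a simple set operation. The SNP names are placeholders, and the function is my own illustration under the assumption that every SNP in the box was actually read in the new tester's results; it is not any lab's tree-building procedure.

```python
def split_block(block, tester_snps):
    """Split a haplogroup "box" when a new tester carries only some of its SNPs.

    Returns (upstream, downstream) sets, or None when no split is implied
    (the tester carries all of the box's SNPs or none of them).
    """
    shared = block & tester_snps
    if not shared or shared == block:
        return None
    # SNPs the new tester shares are older (upstream); the rest arose
    # after his line branched off (downstream).
    return shared, block - shared


# Placeholder SNP names, for illustration only.
block = {"SNP_1", "SNP_2", "SNP_3", "SNP_4"}
new_tester = {"SNP_1", "SNP_3", "SNP_9"}

upstream, downstream = split_block(block, new_tester)
print(sorted(upstream))    # ['SNP_1', 'SNP_3']
print(sorted(downstream))  # ['SNP_2', 'SNP_4']
```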

It's much like following a family tree. The higher you climb, one limb at a time, the more collateral branches you encounter and the more cousins (or twigs and leaves) you find descended from (or below) them. For example, there may be two siblings who share parents, eight first cousins who share grandparents, and twelve second cousins who share great-grandparents. A Revolutionary War patriot might have hundreds of descendants. Genghis Khan has millions, even counting only his Y chromosome descendants. (Another prolific male-breeder was the Scottish Chieftain, Somerled, with whom I share similar Nordic markers, though not so many as to be patrilineally descended from him.)

In short, the more DNA we share with one another, the more closely related we are; the less we share, the more distant the cousinship. Therefore, the more descendants a SNP has, the older it is. By comparing and contrasting the SNP mutations between multiple testers, we can begin to build a tree. The example to the right comes from the Cooley DNA Project. The first slide shows that the first tester had eighteen novel SNPs (those SNPs not yet known to be shared by anyone). Three testers later, the 18-SNP haplogroup has broken down into four smaller haplogroups that together contain the original eighteen SNPs (the gray boxes). With each test, the SNP tree not only grows (white boxes) but parses (gray box/es). Most of the new subclades still have multiple SNPs, which means we could end up with several more haplogroups and obtain finer detail about the family's descent and better identify those families that are related, or collateral, to it.
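Here is a rough Python sketch of that comparison. It groups SNPs by exactly which testers carry them and lists the most widely shared groups first, on the reasoning that the more descendants a SNP has, the older it is. The tester labels and SNP names are invented, and the sketch assumes every tester has a clean reading at every position, which real results do not guarantee.

```python
from collections import defaultdict


def group_snps_by_carriers(calls):
    """Group SNPs by the set of testers who carry them, widest-shared first.

    `calls` maps a tester label to the set of SNP names he is positive for.
    """
    carriers = defaultdict(set)
    for tester, snps in calls.items():
        for snp in snps:
            carriers[snp].add(tester)

    blocks = defaultdict(set)
    for snp, who in carriers.items():
        blocks[frozenset(who)].add(snp)

    # More carriers suggests an older SNP; ties are broken by tester labels.
    return sorted(blocks.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0])))


# Hypothetical testers and SNP names.
calls = {
    "T1": {"s1", "s2", "s3", "s4"},
    "T2": {"s1", "s2", "s3", "s5"},
    "T3": {"s1", "s6"},
}

for testers, snps in group_snps_by_carriers(calls):
    print(sorted(testers), sorted(snps))
```

In this toy data, s1 sits at the top because all three testers share it, s2 and s3 form a younger box shared by T1 and T2, and s4, s5, and s6 remain novel to a single tester each.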

Recent Changes and Difficulties

The SNP waters have lately been muddied, however, by Family Tree DNA (ftdna.com) through one of their many recent (Spring 2018) policy and procedure changes. SNP mutations will often be found in unreliable regions of the Y chromosome, such as the large, jumbled mess of repeats in DYZ19. As implied above, regions of repeats tend to be fickle. It can't be expected that a SNP living in such a region will be stable; it might not be found among all of its Y descendants. Although most testing companies will flag such SNPs as unreliable (I've grayed them out in the first graphic), FTDNA is now acknowledging them as legit SNPs. Moreover, they're naming haplogroups for them. For example, haplogroup R-BY50276, shown above as an ancestor of both the Eldridges and Gardiners, consists solely of such a SNP (BY50276) and, by the reckoning of some scientists, should not be included on the tree, certainly not as a single-SNP haplogroup.2

Up until a few months ago (weeks ago, really), FTDNA was (and is) known as a premier testing company but weak on the interpretation side. A year or more would often pass before a SNP was recognized by FTDNA, a name assigned to it, then placed on their Y tree. It is clumsy to write (as above) "that position 21335389 has mutated from the ancestral C value, that which the population is expected to have, to a T," so I began reporting newly discovered SNPs to Yseq.net. The company will, for $1, verify the viability of a SNP and assign an accession number behind their company's designation of A (for Astrid, the wife half of the Yseq management team). Hence, we have SNP names such as A22119, the 22,119th SNP named by Yseq. So when the new results came in, I did exactly what I've done for the last three years and had Yseq provide names for the SNPs. But this time FTDNA jumped in right behind me and named them themselves (a BY followed by the next accession number). Thus, we have doubly-named SNPs. (This is not unusual, by the way. There are dozens of labs, all of which have their own naming schemes, and one will often make the same discoveries at about the same time as another. In fact, some SNPs have up to ten different names. It's not a huge problem but suggests that a central clearinghouse is needed.)

Simplified SNP Tree

If we remove the anomalies (those doubly-named SNP designations and the SNPs that stalk notorious neighborhoods) we can simplify our tree. The BY50276 SNP and haplogroup disappear, which, despite the SNP's reputation, is rather unfortunate because it otherwise provides us with a valuable landmark ancestor to the Eldridges, Reynolds, and Gardiners, distinguishing them from the Crumplers. But this graphic has the benefit of being a lot cleaner, if lacking the granularity of the first diagram. Two Reynolds testers in the 04 group have tested these SNPs at Yseq.net ($18 each) and were discovered to be negative for the Gardiner SNPs but predictably positive for the three SNPs in the upstream R-A22119 haplogroup, that which we can call the Eldridge / Reynolds / Gardiner / Crumpler (ERGC) MRCA (Most Recent Common Ancestor). If either tested for the Big Y, they might be found to have SNPs unique to the Reynolds, or even share one or both of the "shady" Eldridge SNPs. But that would be a large cost in exchange for the possibility of little in return. (I'm beginning to suspect that we can't expect to find meaningful SNPs younger than 300-400 years old.)
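The clean-up itself amounts to two small steps, sketched here in Python: merge names that point at the same position, and drop anything flagged as sitting in an unreliable region. The record layout is my own invention, and the position given for BY50276 is a placeholder because its real coordinate isn't stated in this article.

```python
from collections import defaultdict


def simplify(records):
    """Merge duplicate SNP names by position and drop unreliable positions.

    Each record is (position, name, in_unreliable_region).
    """
    names_by_position = defaultdict(list)
    unreliable_positions = set()
    for position, name, in_unreliable_region in records:
        names_by_position[position].append(name)
        if in_unreliable_region:
            unreliable_positions.add(position)

    return {
        position: "/".join(sorted(names))
        for position, names in names_by_position.items()
        if position not in unreliable_positions
    }


records = [
    (21335389, "A22119", False),   # Yseq name
    (21335389, "BY45361", False),  # FTDNA name for the same mutation
    (1, "BY50276", True),          # placeholder position; flagged region
]

print(simplify(records))  # {21335389: 'A22119/BY45361'}
```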

Dating SNPs

So what have we learned? As mentioned, SNPs are rare events, but they do happen on average once every 144 years. Of course, we can't set our clocks by it. It's entirely possible, if unlikely, that our Gardiner tester was born with all four of his personal SNPs and is the only person in the world with that exact combination of mutations. It's more likely, however, that they came into his lineage singly (or even in doubles) over the last 600 or so years, which could make the "ERGC" MRCA born sometime in the 15th century, or not. That we don't have strong SNP candidates for Eldridge would suggest that the MRCA lived more recently. But we also have the seven Crumpler SNPs to figure into the analysis.3 If we add up the number of qualified personal SNPs for the three testers (0 + 4 + 7), divide by the number of testers (3) and multiply that by 144 years we get 528 years. In other words, the ERGC common ancestor might have been born (with his three "good" R-A22119 Y-SNPs) in about 1500 AD. If we figure in the "gray" SNPs, he might have lived a couple of centuries earlier. (YFull.com estimates that A22119's parent haplogroup, S6881, is about 1100 years old.) That's the enduring legacy of SNPs. Once introduced into the lineage, they retain the power of the Energizer Bunny. And once they're identified, we need only triangulate and resolve multiple test results onto a timeline. If we're lucky, the life of a once-breathing and identifiable man will also converge at the same point. (See The Man Who Would Be BY23988.)
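For anyone who wants to rerun the numbers as new testers arrive, the back-of-the-envelope calculation can be written as a short Python sketch. The 144-year rate and the SNP counts come from the paragraph above; using 2018 as the reference year is my simplification, since the testers' own birth years would be the more exact starting point.

```python
YEARS_PER_SNP = 144  # average rate quoted above


def estimate_mrca_age(private_snp_counts, years_per_snp=YEARS_PER_SNP):
    """Average the testers' private SNP counts and convert to years."""
    if not private_snp_counts:
        raise ValueError("need at least one tester")
    total = sum(private_snp_counts.values())
    return total * years_per_snp / len(private_snp_counts)


# Qualified personal SNPs below the shared R-A22119 haplogroup.
counts = {"Eldridge": 0, "Gardiner": 4, "Crumpler": 7}

age = estimate_mrca_age(counts)
print(age)                # 528.0
print(2018 - round(age))  # 1490, i.e. roughly 1500 AD
```

With only three testers, a single added or disqualified SNP shifts the estimate by 48 years, which is a fair measure of how soft the figure is.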

There's another data point that can help determine the age of haplogroups. Along with the SNP discovery of the Big Y, FTDNA now sequences approximately 500 STRs. As mentioned, STRs don't have quite the staying power of SNPs. The values can go up, down, and back up again, which makes it difficult to know what the ancestral values were. But the power of big numbers comes into play again. The Eldridge and Crumpler testers have a genetic distance of 18 among the 503 STRs they both had results for. STRs mutate at a faster rate than SNPs, but it gets complicated as each position has an average mutation rate all its own. Even the age of the father is factored into the equations.4 So by itself, this figure doesn't tell us much. But as more data comes in, the big numbers of testers will further modify the big number of STRs, and the correlation between STR genetic distance and our estimate by SNP count will become more meaningful. For now, baby steps.

Conclusions

How do we break this stalemate and peer into the haze of ancestral descent? The aforementioned law of large numbers will help us. In the meantime — while those numbers remain small — the immediate conclusion is that the families I've discussed had common origins about 500-600 years ago. The Eldridge tester claims descent from a John Eldred (1419-1489) and the Gardiner tester from George Gardner (1599-1677). Although there are many more Gardiner matches than Eldridge in Group 04, there's not sufficient genetic evidence at this stage to determine whether an Eldridge ancestor changed his name from Gardiner or the other way around.

But there are other elements to the story that stand to be discovered. My Cooley clade, for example, has had sufficient testers to suggest that our paternal ancestor sailed across the North Sea to Scotland sometime during the Viking Age and settled down with what became the Cochrane clan, which is — even if proved to be only half true — very satisfying to me. Although the Eldridge R-A22119 haplogroup (aka R-BY45256) has only three testers right now, the immediate upstream haplogroup, R-S6881, has 40. That represents the makings of a definable population — an ancient tribe or clan that lived in an ascertainable geographic region. Considering that that much data is available, I would recommend that all R-U106 testers start by reading the introduction to the R-U106 Haplogroup Project, which includes several notable genetic genealogists as administrators, including author Debbie Kennett and astrophysicist Dr Iain McDonald.

We as genealogists are, of course, anxious to add the next name to our pedigree, but the top-down method can also be useful. Although we're early on in the genetic game, we can feel secure about what has been discovered to date, and know that as new results arrive the analyses will become both deeper and clearer. There may be a gap between our earliest known ancestor (EKA) and the SNP tree as it is now understood, but the combined study of history, anthropology, geography, genetics, and genealogy will slowly flesh it out.

The End

One further point. When we have multiple testers with the same identified MRCA, we acquire a distinct genetic profile for that person. Charles Duncan (1761-1838), for example, was born with the SNP A1147. Jeremiah Strother (1700-1741) was born with six SNPs, including A12273. Benjamin Cooley (1615-1684) was born with A12022, A12024, and Y23835, and my own John Cooley (c1738-1811) was born with YP4491 and three others. We can be as certain about this as we are about an ancestor's marriage or his date of death. It's real genealogical data, and the evidence of it is carried in every cell of a man's body.

1 Although Prudence certainly might have had Native American blood, her mitochondrial DNA was European haplogroup V, nicknamed by population geneticist Bryan Sykes as Clan Velda. These Cape May Eldridges, by the way, appear to be of Group 02. I would be delighted to hear from any member of that group.

2 Thomas Krahn, owner of Yseq.net, recently commented on the company's Facebook page, "We don't offer the problematic regions because they are not useful for phylogeny. It is incorrect that we can't sequence those regions with Sanger sequencing technology. It just doesn't make sense to test them."

3 The Crumpler tester also has three SNPs that reside in the DYZ19 region of repeats and should not (according to one camp of geneticists) be relied on.

4 S. Claerhout et al., "Determining Y-STR mutation rates in deep-routing genealogies: Identification of haplogroup differences," PubMed, National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/pubmed/29360602.