Michael Cooley's Genetic Genealogy Blog
9 March 2021

Introduction to the Y-DNA of Southard (Etc.) Group SF01

Most testers in the SF01 group of the Southard (etc) DNA Project have traced their lineage back to Thomas Southard, believed to have been born at Leiden in 1615. The first Big Y tester in the group, kit #10999, is a descendant. The latest tester (kit #214048) has proved descent back only to William Southern, who was born in Virginia a century later. The DNA, however, demonstrates that they're related, if distantly. These tests constitute what I hope to be the beginning of an ongoing study to explore the degree of their relationship, the degree of relationship between them and other group members, and the group's origins, not only in England but back through the millennia. Although this article concerns only SF01 DNA results, it should be just as useful to a reader belonging to another Southard family or even to another surname. The science and procedures are the same. Surnames are, after all, merely a social convention, an affectation. Some might suggest they are nothing more than a manifestation of human ego. As such, we probably shouldn't take surnames too terribly seriously, despite their usefulness to genealogists.

Before discussing the new Big Y results, let's examine a graphic for SF01's three Y-111 STR (Short Tandem Repeat) tests. The numbers found on the project's page represent the number of repeats of a particular string of genetic letters within a defined segment of the Y. For example, there are 13 repeats of AGAT at DYS393, the graphic's first listed value. This particular STR happens to mutate very slowly, which accounts for the large number of matches across the population. But the average mutation rate varies considerably from one STR to the next and, over time, patterns appear. These patterns are useful except for one thing: the counts can go either up or down, and then back the other way, making it nearly impossible to predict their historic or future values. Still, the changes are relatively slow and the values consistent enough that we can make broad judgments about which group, population, or extended family a tester belongs to.
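To make the idea of a repeat count concrete, here is a minimal Python sketch that counts how many times a motif repeats back to back in a stretch of DNA. The function name and the flanking letters in the toy sequence are invented for illustration; only the AGAT motif and the count of 13 come from the DYS393 example above. A lab's actual allele calling is considerably more involved than this.

```python
import re

def longest_tandem_repeat(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    # Find every unbroken run of the motif, e.g. "AGATAGATAGAT".
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    # Convert each run's length in letters into a count of motif copies.
    return max((len(run) // len(motif) for run in runs), default=0)

# Toy sequence: made-up flanking letters around 13 copies of AGAT.
toy_sequence = "CCGT" + "AGAT" * 13 + "GGCA"
repeat_count = longest_tandem_repeat(toy_sequence, "AGAT")
print(repeat_count)
```

A gain or loss of one copy in that run is what shows up on the project page as a value one higher or one lower.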

We make this judgment by taking the most common values in a group and assuming they belonged to the earliest common ancestor. (The more testers, the more accurate this assumption.) The resultant values, referred to as the modal, are listed in gray on the top row. The numbers in blue, next to the kit numbers, represent the number of differences between each tester and the modal.
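The modal calculation just described can be sketched in a few lines of Python. This is only an illustration: the kit labels and every value other than DYS393 = 13 are made up, and counting the markers that differ (rather than summing the size of each difference) is my assumption about how the tally works.

```python
from collections import Counter

def modal_haplotype(haplotypes):
    """Return the most common value at each marker across all testers."""
    markers = next(iter(haplotypes.values())).keys()
    modal = {}
    for marker in markers:
        counts = Counter(h[marker] for h in haplotypes.values())
        # most_common(1) gives the top value; ties go to the value seen first.
        modal[marker] = counts.most_common(1)[0][0]
    return modal

def differences_from_modal(haplotype, modal):
    """Count the markers at which one tester differs from the modal."""
    return sum(1 for marker in modal if haplotype[marker] != modal[marker])

# Hypothetical three-marker results for three hypothetical kits.
testers = {
    "kit_A": {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    "kit_B": {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    "kit_C": {"DYS393": 13, "DYS390": 24, "DYS19": 15},
}

modal = modal_haplotype(testers)
diffs = {kit: differences_from_modal(h, modal) for kit, h in testers.items()}
print(modal)
print(diffs)
```

With only three testers a single odd value can sway the modal, which is why more testers make the assumption more reliable.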

Further discussion about STRs will be left for a future date. The point now is simply that the rather chaotic nature of STR inheritance makes it next to impossible to determine the degree of relationship between any two individuals. STRs are too fluid for representing trees of descent. Still, they're great for overall grouping, as can be noted by studying the differences between the several groups on the project page.

Although Big Y testing extracts up to 700 STRs, its primary goal is to sequence up to fifteen million markers called SNPs (Single Nucleotide Polymorphisms). The ping-pong-like back and forth often found in STR values is rarely seen with SNPs. Their stability makes SNPs far more suitable for the scientific study of biological descent, known as phylogeny. Once the lab has finished its work, the sequenced markers (aka mutations or SNPs) are compared to those of all testers across the company's database and matches are noted.

A SNP is merely a mutation from one of the four genetic letters (A, C, T, and G) to another. For example, the BY99561 marker discussed below refers to a mutation at position 12958074 on the Y chromosome: a C became a G. Each of us is born with an average of 44 to 60 mutations across our entire genome, which consists of 3.1 billion genetic letters. Only about 1% or 2% of that is genes. Although a single-letter mutation in a gene can cause any number of disorders, a mutation in the vastly larger "non-coding" regions typically presents no health consequences. Mutations such as BY99561 do nothing and, therefore, play no role in natural selection. They just sit there and pass unnoticed from one generation to the next. As such, these SNPs are silent witnesses to history, somewhat akin to tree rings, layers of sediment, and ice cores. And this is especially true for the Y chromosome, which barely has purpose beyond being a vessel for the male sex gene.

Determining the descent and age of a SNP

R1b-BY99561 SNP Tree

The value of Y-DNA SNP mutations, then, is found in the fact that they constitute a permanent historical archive of a lineage. But they can't be counted like tree rings. Instead of being laid down by layers, they float like vegetables in a soup and, therefore, require a different approach. Another form of comparative analysis is used.

Taking the soup analogy further, imagine we have twenty bowls of differing flavors of soup. The contents of two are used to make up the contents of yet another bowl (half from each of the parent bowls). A person trained in soup analysis can study the makeup of the child soup and determine the recipes of the parent bowls. The same can be done with the parents. In time, a genealogy of soup recipes emerges from which we can begin to estimate cousin relationships among the many bowls. Think about it: the more ingredients shared, the closer the relationship; the fewer, the more distant. In fact, our two testers share thousands of known SNPs (I've listed only a handful) and they mismatch on only eleven unique SNPs, an average of about five per tester. They are a genetic distance (GD) of 11 from one another.
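For readers who like to see the bookkeeping, the comparison amounts to simple set arithmetic. The Python sketch below is hedged accordingly: apart from BY99561, every SNP name is a placeholder, and the 5-and-6 split of the eleven mismatches between the two testers is an assumption made only so the totals add up.

```python
def genetic_distance(snps_a, snps_b):
    """Count the SNPs found in one tester but not the other."""
    # The ^ operator gives the symmetric difference of two sets.
    return len(snps_a ^ snps_b)

# A handful of shared markers; the real testers share thousands.
shared = {"BY99561", "shared_placeholder_2", "shared_placeholder_3"}

# Placeholder private SNPs: an assumed split of 5 and 6.
tester_a = shared | {f"a_private_{i}" for i in range(1, 6)}
tester_b = shared | {f"b_private_{i}" for i in range(1, 7)}

gd = genetic_distance(tester_a, tester_b)
common_block = tester_a & tester_b  # markers both testers carry
print(gd, sorted(common_block))
```

The markers in the intersection are the shared ingredients; the ones left over on each side are what separate the two bowls.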

All SNPs shared by the testers are listed above the horizontal line. The block of SNPs (called a haplogroup) placed immediately above the named ancestors is the terminal haplogroup and broadly represents the testers' Most Recent Common Ancestor (MRCA), who is presently unknown. FTDNA chose BY99561 as the lead SNP, and that serves as the temporary name for the haplogroup, the contents of which can change with additional testing. Each block listed above the terminal haplogroup results from additional branching discovered among non-Southard testers (the actual branching is not illustrated here). This stacking of haplogroups continues for up to 300,000 years. Of course, because only a tiny percentage of the male population has tested, there are huge blanks in our understanding of the worldwide SNP tree. Still, the markers found at the topmost haplogroup, known as Y-MRCA, are shared by every living man.

A SNP tree isn't really any different from a standard genealogical tree. For siblings, the MRCA is a parent. For first cousins the MRCA is a grandparent, for second cousins a great-grandparent, and so on. In other words, SF01's terminal block (R1b-BY99561) of SNPs represents the great-grandfather to the nth degree of our two testers. Our principal goal is to gain an understanding of that man, to determine his Y chromosomal fingerprint, to calculate the degree to which our testers are cousins, and then to superimpose the Y-DNA tree over the genealogical tree, which can provide more glue and detail. By extension, we can begin to estimate where and when this mystery man lived. To that end, every new marker becomes a data point toward that understanding and provides further resolution to the overall picture. And, unless any two testers are very closely related, new data is nearly always found with each test.

In short, we can rearrange our bits of soup into layers based on the number of tested descendants and sort the same layers from oldest to youngest. FTDNA doesn't provide sufficient data to use our group as an example, but I've developed the needed data from the 21 Big Ys in my R1a-YP4248 subclade project and removed all distracting details. The topmost haplogroup, about a thousand years old, covers all tested members. Like any genealogical tree, it branches out as we move forward to the present time. As we climb up the tree, more of the population is included until we reach the threshold for R1a-YP4248.

Example of haplogrouping

Because genetic mutations are random (when not inherited), their creation is unpredictable. But thanks to the Law of Large Numbers we can make imperfect predictions about ages by taking the attributes found across all tests and averaging them. Big Y-700 testing tells us that a new mutation occurs on the Y chromosome about once every 60 to 100 years. I usually restate that as a SNP-rate of about once every three to five generations and average that to once every four generations. You can see the problem. But it's all we have for now.

An average of about five unmatched mutations per tester tells us that the MRCA for our two testers lived up to 500 years ago. At least that's a start. It can be refined. And we already know that the MRCA was born with the BY99561 block of markers, real-life biological data that can be taken to the bank.
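The arithmetic behind that estimate is nothing more than multiplication, as the short Python sketch below shows. The function name is my own invention, and the calculation simply applies the 60-to-100-year range quoted above; it is a back-of-the-envelope figure, not the method FTDNA or anyone else uses for formal age estimates.

```python
def mrca_age_range(snps_per_tester, years_per_snp=(60, 100)):
    """Rough age range for an MRCA from the average private SNP count.

    years_per_snp holds the fastest and slowest assumed rates,
    here one new SNP every 60 to 100 years.
    """
    fastest, slowest = years_per_snp
    return snps_per_tester * fastest, snps_per_tester * slowest

# About five unmatched SNPs per tester, as in our two Big Y results.
low, high = mrca_age_range(5)
print(f"MRCA lived roughly {low} to {high} years ago")

# The same count restated at one SNP every four generations.
generations = 5 * 4
print(f"or about {generations} generations back")
```

The width of that range is the problem in a nutshell: with only two testers, a single newly discovered SNP shifts the estimate by decades.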

More testing will satisfy the Law of Large Numbers

Our first Big Y tester had the earlier Y-500 version, which looked at about ten million positions on the Y. The current Y-700 product sequences about 50% more of the sample. The first tester's kit is currently being upgraded. We can reasonably expect that additional novel SNPs will be found, some of which may match some of the second tester's novels. If that happens, the MRCA will likely calculate to a more recent date.

William Southern's ancestry is uncertain, and we have only one tested descendant. But SF01 has several testers descended from Thomas Southard. Because we know Tom's year of birth, a precise Y profile for him will provide a measuring stick by which we can gauge the relationship between all lineages involved, including whether William's lineage broke away before or after Thomas's birth. That measuring stick can be acquired by Big Y testing a second known Thomas Southard descendant. Because kit #B63657 has completed the Y-111, there's a significant discount for an upgrade to Big Y-700, and that will tell us more about William Southern's descent, as well as the relationship to all others who likewise upgrade. But our journey will not end with just one more test. I've adopted a new motto for life: More data is better data.

I'm happy to answer any questions about any of this.