Michael Cooley's Genetic Genealogy Blog GEN • GEN
4 January 2019

SNP-Calling and the Strother DNA Project

Several new Big Y tests for the Strother DNA Project were processed this fall. Because my previous Strother articles were largely written to provide a basic understanding of the Y chromosome, I hope to demonstrate some more advanced, or at least more detailed, ideas here. But first, here's the new Strother SNP tree integrated with the known genealogy for the first couple of generations from William Strother, who died in Virginia in 1702. To my mind, the most important new finding is the placement of kit #126422. How that conclusion was reached is my first topic.




Strother / Struthers SNP Tree

I frequently compare Short Tandem Repeats (STRs) to the ever-changing landscape of sand dunes; and Single Nucleotide Polymorphisms (SNPs) to the (seemingly) timeless state of mountains. STRs are good for placing a tester into a general population, all the better if the surname also matches. Only SNPs, however, can trace specific lineages. Still, not all mountain summits are equal. Some, we find, were built atop unstable ground. For example, SNPs, single-letter mutations, are often found in the middle of STRs, rather fickle repeats of short strings of letters. Since we're looking for reliability in SNPs, these should be noted, but unless proven otherwise, not taken too seriously. FTDNA, however, has lately named haplogroups (a collection of SNPs sitting at a determined spot on the Y-SNP tree) for SNPs found in STR regions. For example, they've designated BY50276 as the terminal SNP for Group-04 of the Eldridge DNA Project. This marker is strongly represented in the group, but it sits right in the middle of the notorious DYZ19 STR region. It's useful, but probably not reliable.

DNA is usually sampled through saliva. Samples, which include the DNA from thousands of cells, are mashed up and the genetic material is chemically isolated. The process leaves the remaining sample highly fragmented. Much of it is lost, either chemically corrupted or so fragmented that it can't possibly be reassembled. This image from researchgate.net illustrates how a short sequence is determined by realigning three samples.



Matching segments can have thousands of copies. Sequence variations are uncovered and noted by comparing the sample against a computer model of the human genome, now in version 38, or hg38. Here's a summary for the reads of my own terminal SNP, YP4491 (which is decidedly Cooley, not Strother).


Position  Expected value (ancestral)  Actual value (derived)  Percentage (weight)  Total reads (coverage)  Average map quality  Average read quality
2846003   G                           T                       100.00               64                      60.0                 36.9
YP4491

"Percentage" simply refers to the percentage of reads that accurately reveals the SNP value.1 (There can be errors in processing, sequencing, and alignment, or even simple contamination, which one would expect from saliva. Insertions and deletions can also figure into the analysis.) Quality is determined at the lab using phred scoring, the details of which we need not know except to say that the maximum score for map quality is 60 and 40 for read quality. One hundred percent of the 64 reads at position 2846003 of my Y chromosome, above, were found to have the unique value of T. Furthermore, these reads are of the highest quality. It happens that this SNP is unique to both my brand of Cooleys and a mystery Whitfield family. (This is the kind of clue that can be found only through DNA analysis.)

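For readers who do want a feel for the phred scale, the standard definition is Q = -10 × log10(P), where P is the estimated probability that a call is wrong. The short Python sketch below inverts that formula for a few scores. The function name is an illustrative choice, not something taken from the lab's software.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a phred-scaled quality score Q into an error probability P."""
    return 10 ** (-q / 10)


# Print a few reference points on the scale.
for q in (20, 40, 60):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability of about {p:.6f}")
```

On this scale, every ten points of quality means ten times fewer expected errors, so a read quality near 40 and a map quality of 60 are both very strong.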
Using these criteria (coverage, weight, and quality) we can readily dismiss these two reads:


Position  Expected value (ancestral)  Actual value (derived)  Percentage (weight)  Total reads (coverage)  Average map quality  Average read quality
10770457  A                           T                       50.00                2                       37.5                 24.0
Poor Reads

FTDNA, however, reports quality SNP reads only when a minimum of ten reads is found in the sample. It's not a bad idea (redundancy is always good), but ten is an arbitrary number, and reliable SNPs with fewer reads are often missed. Returning to Strother results, Y133726 is an excellent example:


Kit      Position  Expected value (ancestral)  Actual value (derived)  Percentage (weight)  Total reads (coverage)  Average map quality  Average read quality
#522183  12131577  A                           G                       100.00               8                       60.0                 37.2
#555347  12131577  A                           G                       100.00               7                       60.0                 39.3
#126422  12131577  A                           G                       100.00               6                       60.0                 37.0
Y133726

These are good, clean, high-quality SNPs living outside the unstable STR or centromere regions of the chromosome. Their presence places our mysterious kit #126422 right in the middle of the Strother genealogy. The tester is missing several generations from his genealogy, but he can be assured that his early American descent proceeded from William Strother (-1709), through his son Jeremiah (1655-1741), and through his son William, who died in 1751. The genetics is fun, but, after all, it's the genealogy we're after.
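The screening by coverage, weight, and quality described above can be sketched in a few lines of Python. This is only an illustration: the cutoff values and the names `SnpSummary` and `passes_filter` are hypothetical, since the article does not state the thresholds actually used. The figures themselves come from the tables above.

```python
from dataclasses import dataclass


@dataclass
class SnpSummary:
    label: str
    weight: float        # percentage of reads showing the derived value
    coverage: int        # total reads at the position
    map_quality: float   # average map quality (maximum 60)
    read_quality: float  # average read quality (maximum 40)


# Hypothetical cutoffs, chosen only for illustration.
MIN_WEIGHT = 95.0
MIN_COVERAGE = 5
MIN_MAP_QUALITY = 55.0
MIN_READ_QUALITY = 30.0


def passes_filter(s: SnpSummary, min_coverage: int = MIN_COVERAGE) -> bool:
    """Return True when a summary clears every cutoff."""
    return (
        s.weight >= MIN_WEIGHT
        and s.coverage >= min_coverage
        and s.map_quality >= MIN_MAP_QUALITY
        and s.read_quality >= MIN_READ_QUALITY
    )


summaries = [
    SnpSummary("YP4491", 100.00, 64, 60.0, 36.9),
    SnpSummary("10770457 (poor reads)", 50.00, 2, 37.5, 24.0),
    SnpSummary("Y133726 #522183", 100.00, 8, 60.0, 37.2),
    SnpSummary("Y133726 #555347", 100.00, 7, 60.0, 39.3),
    SnpSummary("Y133726 #126422", 100.00, 6, 60.0, 37.0),
]

# Screening with the illustrative cutoffs, then with a ten-read minimum.
accepted = [s.label for s in summaries if passes_filter(s)]
accepted_with_ten = [s.label for s in summaries if passes_filter(s, min_coverage=10)]
print(accepted)
print(accepted_with_ten)
```

With the lower coverage cutoff, the three Y133726 readings survive and the poor read is dropped. Raising the minimum to ten reads would discard Y133726 as well, which is the point made above about an arbitrary floor.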

Because so much of the genealogy has been done, the Strother DNA Project is perfect for learning how to evaluate SNP data. I spent several hours going over the reports generated from project members' SAM files.2 But care needs to be taken when calling SNPs. The mechanical and chemical process of sequencing isn't as clean as we'd like it to be, and it's possible to go down a rabbit hole. The results need to be heavily filtered to avoid being distracted by genetic noise. For example, the sampling for BY26932 passed most of the thresholds I'd set, but I threw it out because it's scattershot and doesn't make genealogical sense. Likely, the reads are artifacts of an ancient SNP rather than one that will help us parse out the early generations of the family. Rather than try to make sense of it, it's best to ignore it, at least for now. I was able, however, to make this determination only because I'm using an eleven-member study. I browsed nearly 400 similar readings, which is well worth the effort if you're trying to suss out much-needed lineage-defining SNPs.


Kit      Position  Expected value (ancestral)  Actual value (derived)  Percentage (weight)  Total reads (coverage)  Average map quality  Average read quality
#83607   56849490  T                           A                       100.00               9                       58.4                 37.0
#522183  56849490  T                           A                       100.00               1                       60.0                 37.0
#555347  56849490  T                           A                       80.00                5                       58.0                 38.0
#126422  56849490  T                           A                       85.71                7                       57.0                 37.0
#386246  56849490  T                           A                       80.00                5                       55.6                 37.0
#282558  56849490  T                           A                       100.00               8                       58.5                 33.8
BY26932: Mixed Results 3
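The mixed percentages in the BY26932 table are easier to picture as read counts. Assuming the percentage is simply the number of derived reads divided by total reads, this Python sketch recovers how many reads in each kit did not show the derived value. The helper name `derived_reads` is an illustrative choice.

```python
def derived_reads(percentage: float, coverage: int) -> int:
    """Estimate how many reads carried the derived value."""
    return round(percentage / 100 * coverage)


# (percentage, total reads) per kit, taken from the BY26932 table.
by26932 = {
    "#83607": (100.00, 9),
    "#522183": (100.00, 1),
    "#555347": (80.00, 5),
    "#126422": (85.71, 7),
    "#386246": (80.00, 5),
    "#282558": (100.00, 8),
}

# Reads per kit that did NOT show the derived value.
non_derived = {
    kit: cov - derived_reads(pct, cov) for kit, (pct, cov) in by26932.items()
}
print(non_derived)
```

Under that assumption, 85.71 percent of seven reads is six derived reads and one that disagrees, and 80 percent of five reads is four and one. A single stray read in three different kits is the kind of scatter that makes the marker hard to trust.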

Low Count but Clean Reads

For this study, I looked for SNPs that were free of clutter, such as the readings for Y133726, above. No genetic noise surrounding the marker is present in the other samples. The three grayed-out SNPs in the top graphic are low-read SNPs, with two or three reads each. They're so clean and error-free, especially when compared against nearly a dozen samples from the same family, that I'm willing to bet they're stable. BY146, however, isn't strictly unique to kit #78192. That much is apparent from its low ascension number: it was only the 146th SNP discovered via Big Y testing. That would have happened some years ago. Either the prior reading and naming was a mistake or the SNP exists elsewhere on the world-wide SNP tree, perhaps even outside R1b, the most common Y haplogroup in Western Europe. Still, its three high-quality reads are unique among the Strothers and could conceivably be used to "break a tie," so to speak.

Triangulation

The term triangulation in genetic genealogy is most often used for autosomal analysis. From my experience, a fully triangulated SNP is relatively rare. It achieves its status (I call it an anchor SNP) when proven that it manifested at the birth of a specific, identifiable man. I wrote about such a Strother SNP, A20343, last September in Article 59. We know it first saw the light of day in Virginia with Francis Strother's birth in 1700, and we know this because the descendants of Francis's brothers are not positive for it, whereas the descendants of his sons are.


[Diagram: triangle labeled Francis (A20343), born 1700, with kits #83607 and #282558]

Every tested male Strother descended from Francis — that is, anyone inside the triangle — will have the SNP. Except for the rare fluke — and probably one occurring far outside the Strother genealogical sphere — no one outside the triangle will test positive for it. Anchor SNPs can, then, be invaluable for genealogy.
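The inside-versus-outside test for an anchor SNP amounts to a simple set check, sketched here in Python. The two kits from the diagram are assumed to be the Francis descendants. The "brother-line" entries are hypothetical placeholders, since the kits descending from Francis's brothers are not identified here, and the function name is likewise illustrative.

```python
def is_anchor_candidate(
    results: dict[str, bool], inside: set[str], outside: set[str]
) -> bool:
    """True when every kit inside the triangle is derived (positive)
    and every kit outside it is ancestral (negative)."""
    return all(results[k] for k in inside) and not any(results[k] for k in outside)


inside = {"#83607", "#282558"}                   # assumed descendants of Francis
outside = {"brother-line-1", "brother-line-2"}   # hypothetical placeholder kits

# True means the kit tested positive for the SNP.
results = {
    "#83607": True,
    "#282558": True,
    "brother-line-1": False,
    "brother-line-2": False,
}
print(is_anchor_candidate(results, inside, outside))
```

A single positive result outside the triangle, or a negative one inside it, would be enough to make the check fail.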

With five strong SNP candidates, the Christopher Strother descendant on the right side of the top graphic might well find a SNP that clearly defines the lineage. To that end, individual SNP testing is being conducted at Yseq.net. (The company's usual $18-per-SNP price tag has been reduced to $15 to the end of January 2019.) But try as I might, I was unable to bring myself to call any unique SNPs for the Butler lineage (far right). But hold on. Two Butler Big Y's are in the works. It's likely that some of the "noise" I struggled with will drift to the top.

The opportunity to assign the birth of a SNP to a known man and a specific birth date exists only within the narrow window of the genealogical timeframe. But even outside that window, the Y-SNP tree is dotted with single-SNP haplogroups that did, in fact, arise in specific men. By triangulating the results of dozens or more tests, we can reasonably isolate the region and era in which each SNP, and its birth host, was born. For example, by examining the results of more than 80 testers, we know that SNP YP355 was born about 2500 years ago in Scandinavia, likely on or near the North Sea. We might call him Olaf, but for all we know his name was Ugh. But never mind. He existed. Theoretically, a man can prove whether he's an Ugh descendant by testing that one SNP.4

Deep Ancestry

I've discussed the near-art of SNP calling and the cleaner process of SNP triangulation. The latter led to a brief look at deep ancestry: The higher on the tree a SNP or haplogroup is triangulated, the older it is. Because SNPs are rare, eons fly by as we amble up the branches. In fact, BY23988, which sits at the top of our Strother tree, is probably about 1200 years old, give or take an untold number of centuries. (The more testers we find, the better the estimate.) This suggests that the split between the Strothers and the Struthers (on the far left — of which there are a reasonably large number) might have occurred before the adoption of names. In other words, did the surnames arise out of a common root name or is it coincidental that the two lineages have a similar name? In a way, it really doesn't matter. We know that the two groups had a common ancestor.

STRs can help us out. As I've stated a number of times, I'm not a huge fan of STRs. They simply don't have the precision of SNPs. But just as we can use them to place testers in broad groups, the law of large numbers can help us ascertain the deep relationship among testers. I spent a good deal of space in Article 59 discussing the concept, but we now have several more tests to add to our sample. It's worth revisiting. Included here are the figures for the distantly related Reid (B139817) and Allison (556525) testers. But I won't spell out all 561 STR values here. One line will suffice:


Marker  Modal  126422  169888  282558  33480  386246  522183  555347  556525  71892  83607  90562  94368  B139817
DYS444  13     13      13      13      13     13      13      13      15      13     13     13     13     13
STR Values for DYS444

The modal is simply the most common (not the average) value among those who have tested. Obviously, our Strother testers are expected to have 13 repeats of TAGA at DYS444. The Allison tester (556525) has 15. Since we're after GD, we can represent it like this:


Marker  Modal  126422  169888  282558  33480  386246  522183  555347  556525  71892  83607  90562  94368  B139817
DYS444  13     0       0       0       0      0       0       0       2       0      0      0      0      0
Genetic Distance for DYS444
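The two DYS444 tables can be reproduced with a few lines of Python. This sketch assumes genetic distance at a marker is the absolute difference in repeat count from the modal, which matches the 15 versus 13 example. The kit numbers are as read from the table header.

```python
from collections import Counter

kits = [
    "126422", "169888", "282558", "33480", "386246", "522183", "555347",
    "556525", "71892", "83607", "90562", "94368", "B139817",
]
dys444 = [13, 13, 13, 13, 13, 13, 13, 15, 13, 13, 13, 13, 13]

# The modal is the most common value, not the average.
modal = Counter(dys444).most_common(1)[0][0]

# Assumed distance measure: absolute difference from the modal.
gd = {kit: abs(value - modal) for kit, value in zip(kits, dys444)}
print(modal)
print(gd)
```

Summing the same per-marker distance across every STR tested gives each kit's total genetic distance.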

Again, there's no point in including the entire table of genetic distance. By itself, Allison's GD of 2 at this marker isn't significant. It's just one value. But if we look at the total GD per tester for all 561 STRs, we have something very interesting:


126422  169888  282558  33480  386246  522183  555347  556525  71892  83607  90562  94368  B139817
1       6       3       6      17      5       5       16      4      6      7      5      23
Total Genetic Distance

The Strothers' total genetic distance from the modal is in single digits. But the Struthers tester (386246) has a GD closer to those of Allison and Reid. (Of course, the vast majority of the testers are Strothers, so the modal will be tipped in their direction.) Again, care needs to be taken not to read too much into this. The individual marker values for Struthers and Allison, for example, are quite different from each other. It is an indication, however, that the genetic connection (never mind the social connection) between Strother and Struthers is quite old.
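To make the split concrete, this Python sketch picks out the kits whose total genetic distance from the modal is not in single digits. The totals are read from the run-together row in the table above, split so that the Strother kits come out in single digits as the text describes. The cutoff of ten is an illustrative assumption, not a formal threshold.

```python
# Total genetic distance per kit, as read from the table above.
total_gd = {
    "126422": 1, "169888": 6, "282558": 3, "33480": 6, "386246": 17,
    "522183": 5, "555347": 5, "556525": 16, "71892": 4, "83607": 6,
    "90562": 7, "94368": 5, "B139817": 23,
}

# Kits at a distance of ten or more, most distant first.
distant = sorted(
    ((kit, gd) for kit, gd in total_gd.items() if gd >= 10),
    key=lambda pair: pair[1],
    reverse=True,
)
print(distant)
```

The three kits that stand apart are Reid (B139817), Struthers (386246), and Allison (556525), in that order.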

What Now?

Genealogy is about the search for ancestral roots and how the various parts are related. So is population genetics, the grand-daddy of genetic genealogy. In pursuit of those roots we often look for cousins, as is often effectively accomplished with autosomal testing. By comparing and triangulating distant branches we begin to learn about deep origins. The split between the Strothers, Struthers, Allisons, Reids, and Butlers might have occurred before the genealogical timeframe. And there are likely other surnames that grew out of BY23988, which likely had its own origins in Scotland or the North of England. Jeremiah's descendants have shown us that SNPs can be parsed down individually. There are six acknowledged SNPs between the very top of our tree and William Strother's birth. If we're successful in breaking those up, the timeline will become refined. The upcoming Butler testing will help with that.

Compare our top graphic with the third one from the first Strother Y-DNA report written more than two years ago. That first tester had twenty unique SNPs that had to be named, parsed, and distributed on the tree. We've come a long way, baby.

1 YFull refers to this as a "weight" of 1.0 for the value of G. For example, I report a percentage of 97.50 for T on my 80 copies of S224; YFull reports a weight of 0.975254237288. They're a picky crowd.

2 SAMs are text-based files extracted from their compressed binary equivalents, BAM files. I like SAMs because, despite their sheer volume, they can be read and interpreted by the human eye.

3 If these results belonged only to kits #83607 and #282558, I'd have taken them.

4 Because it's always possible to get anomalous readings, it's best to get the lay of the land either through an STR marker product (I recommend Y37), which would indicate whether the tester possesses a qualifying haplotype for that SNP, or a SNP pack, which includes the target SNP among its hierarchical neighbors.