It has long been a wish within the genetic genealogy community that the major Y-DNA databases become integrated in order to provide a more complete and compatible dataset. Several attempts have failed. Although more work is needed, I've largely succeeded. (I had hoped it would be a more collaborative effort, but I can be single-minded enough to get this sort of thing done.) The following addresses several points: the nature of the database, its purpose, the methodologies used for its development, its several benefits, its accessibility, and future plans. I also ask several questions directed to the genetic genealogy community in general.
The haplogroup/SNP databases for both Family Tree DNA and YFull can be downloaded by developers (without, of course, member or kit information!). Certain changes, outlined below, were made to make them compatible for the merge. Tools have been developed and several parts of the Open Y database are available for download, as described below. But why do this?
The combined data allows for a larger database and greater resolution. Here are the present stats:
| | FTDNA | YFull | Open Y
| Number of Haplogroups | 91,492 | 33,501 | 106,056
| Number of SNPs | 765,577 | 407,897 | 825,156
| Number of Recurrent SNPs | 26,485 | 12,626 | 35,766
| Coverage* | 8.37 | 12.18 | 7.78
*Coverage refers to the average number of SNPs per haplogroup. The Open Y SNP count is presently undercounted.
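The coverage row can be reproduced from the two rows above it. Here is a minimal sketch in Python using the figures from the table; the variable and function names are mine and are not part of the Open Y code.

```python
# Figures copied from the table above: (haplogroups, SNPs) per database.
stats = {
    "FTDNA": (91_492, 765_577),
    "YFull": (33_501, 407_897),
    "Open Y": (106_056, 825_156),
}


def coverage(haplogroups: int, snps: int) -> float:
    """Average number of SNPs per haplogroup, rounded to two decimals."""
    return round(snps / haplogroups, 2)


for name, (haps, snps) in stats.items():
    print(f"{name}: {coverage(haps, snps)}")
```

Because the Open Y SNP count is presently undercounted, its coverage figure should be read as a lower bound.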
Combining the data provides a more accurate timeline and estimated SNP mutation rate. This benefits project administrators and testers alike. Certain haplogroup lineages will be longer and better resolved in one database than in the other. The merge closes significant gaps, providing testers with a more complete lineage profile.
Issues arose, of course. ISOGG's YBrowse database lists 157,984 SNPs having multiple names. Companies will vary as to which name they use in their presentations. This is particularly problematic when different lead SNPs are used for the haplogroup names. For example, one company might use R-L448 while another uses S200. Open Y's download archive includes a list of the multiply-named SNPs.
Labs operate under any number of technological and genetic conventions, but the conventions adopted within the genetic genealogy community have been minimal, if generally effective. I needed to create additional in-house conventions to make this work. These ideas aren't new. We've seen them discussed within the genetic genealogy community for years. However, the community at large seems to consider these issues to be rather minor, and it is true that labels may appear to be superficial. ("You can't judge a book by its cover.") But standardized nomenclature is critical to clear communication. Certainly, the changes in the new database have made a huge difference in Open Y's development. It is my hope that the following methods, even with some modification, will propagate throughout the industry.
Earlier work to convert from the longhand method for haplogroup naming to the shorthand method was important. Most readers of this article will know that the longhand form starts with a capital letter, followed by a number, followed by a lower case letter, followed by another number and so forth for each added downstream haplogroup. These quickly became unwieldy. The adopted shorthand version starts with the initial capped letter followed by a hyphen and the SNP name chosen (by someone) as the lead for the haplogroup. For example, R1b1a1a2a1a2 is now known as R-P312, the haplogroup's chosen lead SNP. For the purpose of this study, I refer to the initial letter as the class (they're not always a basal haplogroup). Personally, I believe it would serve to include the first three characters of the longhand form. Using my own terminal haplogroup as an example, R-YP4491 would become R1a-YP4491. There's considerably more information packed in at front and, for that reason, this usage is already popular within the community. (R1b and R1a generally coincide with different regions of Europe.) Renaming the classes can be done throughout the Open Y database, but I leave this for future discussion.
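To make the two naming styles concrete, here is a small Python sketch. It assumes the longhand name and the chosen lead SNP are both known; the function name and the `class_len` parameter are mine, for illustration only.

```python
def shorthand(longhand: str, lead_snp: str, class_len: int = 1) -> str:
    """Build a shorthand haplogroup name from a longhand name and its lead SNP.

    class_len=1 gives the current convention (class letter only);
    class_len=3 gives the proposed three-character class.
    """
    return f"{longhand[:class_len]}-{lead_snp}"


print(shorthand("R1b1a1a2a1a2", "P312"))     # R-P312
print(shorthand("R1b1a1a2a1a2", "P312", 3))  # R1b-P312
```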
YBrowse.org, maintained by Thomas Krahn, solved a couple of serious problems, first by providing a central up-to-date registry. When the same SNPs were being discovered by multiple labs, they were independently named. This resulted in thousands of hornet nests populated by alternate SNP names (the lab code followed by the lab's accession number). In a very small way, even I contributed to the problem. Up until several years ago when their methodology improved, FTDNA sometimes took months to name a SNP. Being the impatient man I am, I started having them named through YSEQ. But it now appears to be a given throughout the industry that YBrowse be consulted before submitting a new name. The problem has been reduced, perhaps even to nil. However, we must now contend with thousands of multiply-named SNPs. I've done that in order to proceed with the conversion.
Of the more than 157,000 SNPs having multiple names, nearly 9,000 appear in the FTDNA and YFull databases under different names. Left alone, the databases are incompatible for merging. The solution was simple enough: choose the lowest-order SNP name, but only by its numeric portion. Otherwise, low-order lab codes, such as YSEQ's A SNPs, would always receive preference. Incompatibility was also found with YFull's continued usage of many of the upper longhand haplogroup names. They've been converted to their shorthand equivalents. A list of those conversions is found in Open Y's download archive.
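The "lowest numeric portion" rule might be sketched as follows. This is only an illustration, not the Open Y code: I assume the numeric portion is the first run of digits in the name, and I break ties alphabetically, which the rule as stated does not specify. The L448/S200 pair is the one mentioned earlier.

```python
import re
from typing import Iterable, Optional


def numeric_part(snp_name: str) -> Optional[int]:
    """Return the first run of digits in the name as an int (assumption), or None."""
    match = re.search(r"\d+", snp_name)
    return int(match.group()) if match else None


def preferred_name(synonyms: Iterable[str]) -> str:
    """Pick the synonym with the lowest numeric portion, ignoring the lab code.

    Names without digits sort last; ties fall back to alphabetical order
    (a tie-break chosen for this sketch).
    """
    def key(name: str):
        number = numeric_part(name)
        return (number is None, number if number is not None else 0, name)

    return min(synonyms, key=key)


print(preferred_name(["L448", "S200"]))  # S200, because 200 < 448
print(min(["L448", "S200"]))             # L448, what a plain alphabetical sort would pick
```

Comparing the two printed results shows why the lab code is left out of the comparison: sorting on the full name would favor whichever lab's prefix comes first in the alphabet.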
These methods worked very well and the Open Y haplogroup tree structure was easily constructed, the two parts nicely folding together. But another, far more difficult problem, emerged: recurrent SNPs.
Recurrent SNPs are those SNPs that (justifiably) appear in more than one branch of the SNP tree. They make sense. Mutations are random and, in a strict biological sense, they're not related to haplogroups (human ingenuity creates those designations). After all, the nucleotide value of every SNP is merely one of the four letters of DNA's alphabet soup. Of course, there will be repetitions. Just as the letters C, E, and L are recurrent in my first and last names, these letters appear in different patterns among the Y chromosome's 57 million bases irrespective of haplogroup designations. It's not, of course, that the repeated SNPs (position plus value) appear multiple times in a tester's kit (I'm not referring to the number of reads) but that they will show up in separate testers having differing genetic lineages. Recurrent SNPs of this nature are wholly legitimate and should be well understood. Following are examples of recurrent SNPs.
| SNP | Haplogroups
| A10526 | E-M7193 L-FTC44742 R-A10526
| ZS9784 | E-BY56570 I-Y23686 J-ZS9797
| YP1153 | G-PH3038 J-Z27220 R-YP1145
| Y99151 | E-FT133616 G-FTB3122 R-FTB7230 T-TY491561
| PF4259 | I-PF4241 Q-F1298 R-FT199292
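Finding recurrent SNPs in a merged membership list amounts to inverting the haplogroup-to-SNP relation and keeping the SNPs that land in more than one haplogroup. Here is a minimal Python sketch using two rows from the table above plus P312 as a non-recurrent control; the flat list of pairs is my own layout, not the format of the downloaded databases.

```python
from collections import defaultdict

# (haplogroup, SNP) membership pairs; the first six come from the table above.
memberships = [
    ("E-M7193", "A10526"), ("L-FTC44742", "A10526"), ("R-A10526", "A10526"),
    ("G-PH3038", "YP1153"), ("J-Z27220", "YP1153"), ("R-YP1145", "YP1153"),
    ("R-P312", "P312"),
]


def find_recurrent(pairs):
    """Map each SNP found in more than one haplogroup to its sorted haplogroups."""
    by_snp = defaultdict(set)
    for haplogroup, snp in pairs:
        by_snp[snp].add(haplogroup)
    return {snp: sorted(haps) for snp, haps in by_snp.items() if len(haps) > 1}


recurrent = find_recurrent(memberships)
for snp, haps in recurrent.items():
    print(snp, " ".join(haps))
```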
I divided the recurrent SNPs into two basic categories: derived and false (or faux). (I'll take suggestions for better names!) The derived recurrents are those discovered by the labs through deep analyses of testers' data. (I'm certainly not in a position to dispute those findings. After all, it's the labs that hold the raw data that justifies their presence.) By renaming both the SNPs and the haplogroups as described above, the vast majority of these conflicts simply melted away. To resolve the rest, I've identified two subcategories: path length and synonymous haplogroups (again, name suggestions will be welcomed). Due to a lack of kits, one company may have data only for shorter paths to the Y-root than the other. My algorithms choose the haplopath with the longest lineage. YFull often had the shorter path, which necessitated the adoption of the corresponding longer FTDNA path. However, the reverse was also often true, with YFull having the longer and more detailed path.
Synonymous haplogroups are those having exactly the same SNP members. They number about 130,000 pairs. They were resolved first by choosing those without subclades and then by selecting the name with the lowest accession number. However, more than 400 pairs remain to be resolved; they will soon be fixed.
Here are some examples of synonymous haplogroups.
| Recurrent SNPs | Synon Haps | MRCA
| Y630943 | E-BY47475 | E-CTS7761
| Y630943 | E-Y48004 | E-CTS7761
| Y23313 | J-V4911 | J-Y5176
| Y23313 | J-V5921 | J-Y5176
| Y107854 | N-BY95664 | N-YP6091
| Y107854 | N-Y81323 | N-YP6091
| Z18860 | H-Y19966 | H-P96
| Z18860 | H-Z19008 | H-P96
| Y142443 | T-Y142367 | T-Y37796
| Y142443 | T-Y142466 | T-Y37796
| FT155389 | J-FT153371 | J-ZS9061
| FT155389 | J-Y188562 | J-ZS9061
| BY131442 | E-M281 | E-PAGES00040
| BY131442 | E-V16 | E-PAGES00040
| CTS11728 | R-BY36256 | R-CTS8087
| CTS11728 | R-Y36499 | R-CTS8087
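A rough Python sketch of how synonymous haplogroups could be detected and resolved follows. Several things here are assumptions for illustration: each haplogroup's member set is reduced to the single SNP shown in the table, the subclade information is hypothetical, and the number is read as the first run of digits after the hyphen. The winners it prints are what this sketch produces, not necessarily what Open Y chose.

```python
import re
from collections import defaultdict

# Member SNPs per haplogroup, reduced to the one SNP listed in the table (assumption).
members = {
    "E-BY47475": {"Y630943"},
    "E-Y48004": {"Y630943"},
    "J-V4911": {"Y23313"},
    "J-V5921": {"Y23313"},
    "R-P312": {"P312"},
}


def synonymous_groups(members):
    """Group haplogroups whose SNP member sets are exactly equal."""
    groups = defaultdict(list)
    for haplogroup, snps in members.items():
        groups[frozenset(snps)].append(haplogroup)
    return [sorted(haps) for haps in groups.values() if len(haps) > 1]


def name_number(haplogroup):
    """First run of digits after the class prefix (assumption); inf if none."""
    match = re.search(r"\d+", haplogroup.split("-", 1)[-1])
    return int(match.group()) if match else float("inf")


def resolve(group, has_subclades):
    """Prefer haplogroups without subclades, then the lowest number, then the name."""
    return min(group, key=lambda h: (h in has_subclades, name_number(h), h))


has_subclades = set()  # hypothetical: no subclade information in this toy data
for group in synonymous_groups(members):
    print(group, "->", resolve(group, has_subclades))
```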
False recurrent SNPs are duplicate SNPs placed into different haplogroups and result only from the merge. They are SNPs that were not previously identified as recurrent by either company. In other words, they can be declared, for now, as not being legitimately recurrent. In many of the cases, the longest lineage to the shared MRCA is selected. But how many steps up the ladder are required before they can be treated as living in separate branches? For example, that one is a subclade or second-degree subclade of the other is easily resolved: the longest path is used. But do we go up the tree three, four, seven, or more steps before we recognize them as being true recurrent SNPs? These questions are open to discussion. Here's an example.
| SNP | Path to MRCA
| FGC79681 | T-BY140338 > T-Y61099 > T-CTS8603 > T-CTS660
| FGC79681 | T-Y63308 > T-FGC79705 > T-FGC3997 > T-CTS660
The FGC79681 SNP is found as a member of two haplogroups, each three generations down from their MRCA (MRCH?), T-CTS660. If that's the case, then the relationship is false. They could not have existed in the MRCA's two immediate subclades. But is it possible the SNP arose again separately in one of the two lineages? Yes, but how do we determine that? Without the kit data, it's virtually impossible to sort out. However, I do see that T-Y63308 has no subclades whereas the other has one. Should the SNP be eliminated from T-Y63308? Does it matter? (Well, it could to the kit owners!) The decision was easy with the synonymous haplogroups. But this needs more pondering.
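The MRCA and the number of steps on each side can be computed directly from the two paths in the table. The sketch below is in Python with function names of my own. It deliberately does not decide how many steps are "enough", since that threshold is the open question.

```python
def parse_path(text):
    """Split 'A > B > C' into a list running from the haplogroup up toward the root."""
    return [hap.strip() for hap in text.split(">")]


def mrca_and_steps(path_a, path_b):
    """Return (mrca, steps_a, steps_b), or None if the paths share no haplogroup.

    A step count of 0 means that haplogroup is itself an ancestor of the other.
    """
    index_b = {hap: i for i, hap in enumerate(path_b)}
    for steps_a, hap in enumerate(path_a):
        if hap in index_b:
            return hap, steps_a, index_b[hap]
    return None


a = parse_path("T-BY140338 > T-Y61099 > T-CTS8603 > T-CTS660")
b = parse_path("T-Y63308 > T-FGC79705 > T-FGC3997 > T-CTS660")
print(mrca_and_steps(a, b))  # ('T-CTS660', 3, 3)
```

When one step count is 0, the haplogroups sit on the same line and the longer path can be used. Larger counts on both sides are the cases that still need a rule.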
Again, Open Y doesn't have access to the kits themselves in order to resolve the problem, and that's as it should be. But I'm of the opinion that virtually everything can be resolved programmatically, at least where sufficient data is available. I'll keep working on this over the coming weeks before arriving at a final conclusion about how the remaining false recurrents are to be treated. At minimum, however, they can be marked as unresolved recurrents and entered in the Open Y database. (Normal recurrents are presently marked with an asterisk.)
A couple of minor hiccups arose in the FTDNA conversion. Perhaps due to a formatting problem in the JSON file, four SNP names are replaced by their positions on the Y chromosome. Not wanting to build exceptions into the code, I've left them as downloaded. Hopefully they will resolve themselves.
| SNP pos. | Haplogroup
| 12845939 | J-Z478
| 13103443 | E-CTS4051
| 13954273 | C-M208
| 19636804 | E-FTA65116
In addition, two recurrent SNPs are each listed in a pair of haplogroups where one haplogroup lies on the path to the other.
| SNP | Haplogroups
| FTD7372 | I-BY114995 I-BY75224
| Z27072 | J-ZS8904 J-ZS8914
These are very small issues, but they can't be fixed at this end, unlike the problems that arise directly from the merge.
The YBrowse, YFull, and FTDNA databases are downloaded to the server every night. Anything new is captured, processed, and written to the database. In other words, the entire database is updated every 24 hours. It's automatic and requires no human intervention. Data needing additional processing, including the false recurrents, are placed into separate files. Once all algorithms are in place and any bugs fixed, the database will be completely reliable, efficient, and accurate (to the extent that the incoming data is accurate). But despite the needed work, Open Y is up and running and is approximately 90% complete.
On the other hand, the haplogroup tree alone is often all that is needed for a tester to find his exact placement on the tree. A paying customer may not even understand or care to understand their haplogroups' member SNPs. But accurate placement provides the researcher with significant data. Counting the number of SNPs can be used to calculate the average SNP mutation rate and, from that, determine approximate haplogroup ages. (They will always be approximate. After all, we will never find a birth certificate for the first man born with P312!)
At present there is no Open Y haplogroup-only tree for download, but a child/parent database is available. Also in the archive is a simple Perl script that quickly loads the data into a Perl hash for further manipulation.
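For readers who don't use Perl, the same idea can be sketched in Python. The file layout is an assumption on my part (one tab-separated child/parent pair per line); check the actual download before relying on it. The sample rows reuse one of the paths shown earlier, with T-CTS660 standing in as the top of this small sample.

```python
def load_parents(lines):
    """Build a child -> parent dict from 'child<TAB>parent' lines (assumed layout)."""
    parents = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        child, parent = line.split("\t")
        parents[child] = parent
    return parents


def path_to_root(haplogroup, parents):
    """Walk parent links upward until a haplogroup with no recorded parent is reached."""
    path = [haplogroup]
    seen = {haplogroup}
    while path[-1] in parents:
        parent = parents[path[-1]]
        if parent in seen:  # guard against malformed, cyclic data
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


sample = [
    "T-BY140338\tT-Y61099",
    "T-Y61099\tT-CTS8603",
    "T-CTS8603\tT-CTS660",
]
parents = load_parents(sample)
print(" > ".join(path_to_root("T-BY140338", parents)))
```

With the real file, `load_parents(open(path))` would work the same way, since a file object yields lines.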
Finally, the Open Y database is not intended for medical research. Indeed, the Y is not a good candidate for that usage. It's "gene deficient."1 And although about 90% of the SNPs are likely properly placed, most of the false recurrents (not yet fully incorporated) are questionable. This makes Open Y unsuitable for medical interpretation. However, it is highly useful for both genetic genealogy and haplogroup research regarding the study of human phylogenetics, particularly when triangulated with geography, archaeology, and the historic record.
The Open Y database is housed at ysnp.info. Other tools are present, including the SNP Tree Builder designed for group administrators. But the site is not just for admins and those who do haplogroup research. Individuals can input their terminal haplogroup into another tool and their full SNP tree to the root is displayed. At present, these additional tools are using only the FTDNA data, but they will soon be upgraded to provide a choice between the three (FTDNA, YFull, and Open Y).
One thing needs to be made clear. Both YFull and FTDNA retain, of course, full rights to their data. However, I've been told that the data is publicly available (as JSON files) to see what others can come up with. Other than the reports regarding the recurrent SNPs for each database (I regard them as much-needed information), the full databases will never be available for download through ysnp.info or the Open Y Project. However, once the project is complete, all Open Y data will be available to the public, as will a number of the scripts used to process the data. It must further be noted that I will not charge for services, although I will explore funding options in order to keep the server running indefinitely.
Future plans include the offering of additional tools and the creation of various statistics, timelines for example. The CGI scripts that face the public will be rewritten for speed and efficiency, perhaps in the Rust programming language. Once the database is complete, it will be wrapped up and provided to the public as a JSON file, a text-based data interchange format. It is also my hope that the Open Y database will stimulate discussion about the in-house naming standards I've adopted that made Open Y possible. Should the three-character "class" be adopted? Should all haplogroups be named for their lowest-order SNP, thereby avoiding any future naming conflicts? The latter would certainly make all Y databases both compatible and equitable.
Potential volunteers, even if only to test the database, are welcome to contact me. There's a discussion group at Facebook: Open Y-Tree. I always welcome feedback.