Michael Cooley's Genetic Genealogy Blog
15 September 2016

Naming SNPs

I briefly wrote about the naming of SNPs, particularly of the Y chromosomal variety, in the sidebar of article #5. This is here primarily for convenient referencing.

SNPs are Single Nucleotide Polymorphisms. One occurs when a single nucleotide at a specific position on the Y chromosome, which holds one of four molecules, has "morphed" into one of the other three. The labels for these nucleotides, also referred to as bases, are simply one-letter abbreviations of the molecules Adenine, Cytosine, Guanine, and Thymine. The original value of a nucleotide is known as the ancestral or reference value, which is the value that most men have at the stated position on the Y. The value to which it mutated is called the derived value or the haplotype. For example, in a small number of men a C at position 10040518 is now a T. These men have it because the mutation occurred in a man who was their common ancestor. They descend from the same man, and that's why genetic genealogists are interested in it. With the discovery of a new SNP, we've unearthed a very real biological feature of an otherwise unnamed man who lived hundreds, thousands, even tens or hundreds of thousands of years ago.

New SNPs have to be evaluated. The quality of the reads at that position needs to be verified. There are regions on the Y chromosome, DYZ for example, that are volatile. SNPs found there are apt to change. Such SNPs will be noted and catalogued but, until additional data comes along that illustrates otherwise, they're not to be counted on. And, typically, a mutation is not considered a SNP until it's found in a second person. After that, its position on the SNP tree needs to be determined.

While this process is underway, a new SNP is referred to by its three elements: position, ancestral value, and derived value. For example, I would refer to the above SNP as C10040518T — ancestral value, position, derived value. There are other methods for presenting it but I prefer this one merely because I don't have to worry about word wrap. Regardless of the method, the ancestral value comes first. So, we can have:

C10040518T
10040518CT
10040518 C T
10040518-C-T
10040518 C->T

The last nomenclature is commonly seen, probably because it's intuitive.
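
Since all five spellings carry the same three elements, they can be reduced to one form mechanically. What follows is only a rough sketch in Python, not anything the labs themselves use; the names Snp and parse_snp and the two patterns are my own inventions for illustration, and the patterns cover only the five spellings listed above.

```python
import re
from typing import NamedTuple


class Snp(NamedTuple):
    """An unnamed SNP: position on the Y, ancestral base, derived base."""
    position: int
    ancestral: str
    derived: str


# Ancestral base, position, derived base: C10040518T
_BASE_FIRST = re.compile(r"^([ACGT])(\d+)([ACGT])$")

# Position first, then the two bases with optional separators:
# 10040518CT, 10040518 C T, 10040518-C-T, 10040518 C->T
_POSITION_FIRST = re.compile(r"^(\d+)[\s-]*([ACGT])\s*(?:->|-)?\s*([ACGT])$")


def parse_snp(text: str) -> Snp:
    """Parse any of the five spellings into (position, ancestral, derived)."""
    cleaned = text.strip().upper()
    match = _BASE_FIRST.match(cleaned)
    if match:
        ancestral, position, derived = match.groups()
    else:
        match = _POSITION_FIRST.match(cleaned)
        if match is None:
            raise ValueError(f"unrecognized SNP notation: {text!r}")
        position, ancestral, derived = match.groups()
    if ancestral == derived:
        raise ValueError(f"ancestral and derived values are the same: {text!r}")
    return Snp(int(position), ancestral, derived)


if __name__ == "__main__":
    for spelling in ["C10040518T", "10040518CT", "10040518 C T",
                     "10040518-C-T", "10040518 C->T"]:
        print(spelling, "=>", parse_snp(spelling))
```

Whatever the spelling, the ancestral value is read before the derived value, which is the one rule the formats share.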

Once the evaluation is complete, the last step is to give the SNP a name. With that, referencing it not only becomes easier but its discovery is credited to the lab that discovered it. The above SNP, for instance, is called Z16271. Sometimes more than one lab will make the same discovery at about the same time, as is the case with M9145/PF733/V3063.

The process of naming is simple. Each name starts with one or more letters, representing a company or other research entity, followed by a sequential number. For example, M9145 is the 9,145th SNP to be discovered by Peter Underhill, Ph.D. of Stanford University. The International Society of Genetic Genealogy (ISOGG) lists these prefixes at the bottom of the introductory page to its Y-SNP tree. Here are a handful of entries:




It's that simple.
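
For anyone who keeps SNP lists in a script or spreadsheet export, the letters-then-number pattern is easy to take apart. This is a minimal sketch, assuming names of exactly that form; SnpName, parse_name, and split_aliases are hypothetical helpers of my own, and names that don't fit the simple pattern are rejected rather than guessed at.

```python
import re
from typing import List, NamedTuple


class SnpName(NamedTuple):
    """A named SNP: the lab's letter prefix and its sequential number."""
    prefix: str
    number: int


# One or more letters followed by one or more digits, e.g. Z16271 or PF733.
_NAME = re.compile(r"^([A-Za-z]+)(\d+)$")


def parse_name(name: str) -> SnpName:
    """Split a single SNP name into its prefix and number."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a letters-then-number SNP name: {name!r}")
    prefix, number = match.groups()
    return SnpName(prefix.upper(), int(number))


def split_aliases(label: str) -> List[SnpName]:
    """Split a slash-separated label such as M9145/PF733/V3063."""
    return [parse_name(part) for part in label.split("/")]


if __name__ == "__main__":
    print(parse_name("Z16271"))
    for alias in split_aliases("M9145/PF733/V3063"):
        print(alias.prefix, alias.number)
```

The slash-separated form holds several names for the same SNP, so each part parses on its own and can be credited to its own lab.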