Michael Cooley's Genetic Genealogy Blog GEN • GEN
7 January 2019

SAM and I

I wrote a C program in the early '90s that quickly parsed out GEDCOMs and created a binary index for fast, easy access and display. A correspondent at the time helped translate it to Polish and it was published in 1995 by the Polish State Library on a CD of Polish genealogies. I loved working in C but it abandoned me after I transitioned to the web. Although my CGI scripts were fast, they had to be tweaked with each platform I worked on. Perl was the answer for me, and I now use it for all my projects — on and off the web.

BAM files are binary representations of sequenced DNA data. They're built for fast access but are not human readable. I was frustrated in my first attempts to glean data from BAMs using SAMtools, so I decided to take a cursory look at the extracted, text-based data itself. I found it straightforward and progressed quickly from there. The DIY method I use might not be for everyone, but the following description of the SAM format could be useful for anyone working with surname DNA projects. (It's essential, I believe, that we understand the data we're trying to work with. Besides, the democratization of information is massively important; it should be accessible outside a relatively small priesthood of technocrats.)

To use a BAM file, it first needs to be unzipped from the archive it's delivered in:

unzip $file.bam.zip

As far as I know, samtools is the only utility that will extract the SAM data. The BAM first needs an index (samtools index $filename.bam creates a .bai file alongside it) before a single chromosome can be pulled out. We're looking here for only Y-chromosome data; the -F 4 option skips reads that didn't map at all:

samtools view -F 4 $filename.bam chrY > $filename.sam

These are very large files, by the way. The BAM I'm using for this example is just under 2 gigs and extracts to a whopping 17 gigs!

SAM data records — one record per read — are delimited with a newline ("\n"); fields are separated by tabs ("\t"). Here's the first record in a SAM file, each field broken at the tabs. I've labeled only those fields used in this study:

 2  Bitwise flag    99 [TAB]
 4  Start position  2781691 [TAB]
 5  Map qual        60 [TAB]
12  AS:i:135 [TAB] ps:i:272 [TAB] RG:Z:GRC16821363_S105_L002_R

Every record can easily be parsed in Perl. In this simple example, I'm creating a new file that includes only the needed data. (Remember, the first field has an array index value of 0.)

#!/usr/bin/perl -w

while (<STDIN>) {
  my @fields = split /\t/, $_;  ## Splits fields at the tabs

  print "$fields[3] $fields[1] $fields[4] $fields[5] $fields[9] $fields[10]\n";
}

This alone reduces the same 17 gig file to 10 gigs, but we can make it even smaller.

Three of these fields are obvious. The sequenced nucleotides in field 10 start at the position stated in field 4, and the quality score for each base (field 11) sits directly under that base. It looks like a no-brainer at this point, but there are complexities. I'll start with the bitwise flag, field 2.

Translating the 99 found in this example into binary, we get 01100011. SAM uses these bits to flag certain qualities of the read. Reading the binary number from right to left, which corresponds to the list below read top to bottom, this is what each position means:

0x1     template having multiple segments in sequencing (1)
0x2     each segment properly aligned according to the aligner (2)
0x4     segment unmapped (4)
0x8     next segment in the template unmapped (8)
0x10    SEQ being reverse complemented (16)
0x20    SEQ of the next segment in the template being reversed (32)
0x40    the first segment in the template (64)
0x80    the last segment in the template (128)
0x100   secondary alignment (256)
0x200   not passing quality controls (512)
0x400   PCR or optical duplicate (1024)
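As a sketch of how the decoding works, here's a short Perl snippet. The flag value 99 is the one from our sample record; the descriptions are abbreviated from the table above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode a SAM bitwise flag into its individual meanings.
my $flag = 99;   # the flag from the sample record

my @bits = (
    [ 0x1,   'template has multiple segments' ],
    [ 0x2,   'each segment properly aligned'  ],
    [ 0x4,   'segment unmapped'               ],
    [ 0x8,   'next segment unmapped'          ],
    [ 0x10,  'SEQ reverse complemented'       ],
    [ 0x20,  'SEQ of next segment reversed'   ],
    [ 0x40,  'first segment in template'      ],
    [ 0x80,  'last segment in template'       ],
    [ 0x100, 'secondary alignment'            ],
    [ 0x200, 'not passing quality controls'   ],
    [ 0x400, 'PCR or optical duplicate'       ],
);

for my $b (@bits) {
    printf "0x%x  %s\n", @$b if $flag & $b->[0];
}
# 99 = 0x63, so bits 0x1, 0x2, 0x20, and 0x40 are set
```

Because the test is a simple bitwise AND, the order we check the flags in doesn't matter.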

I test only for the last two flags. If either is present, I move on. We can filter out those records by inserting another line of code.
#!/usr/bin/perl -w

while (<STDIN>) {
  my @fields = split /\t/, $_;  ## Splits fields at the tabs

  if (($fields[1] & 512) || ($fields[1] & 1024)) { next }

  print "$fields[3] $fields[1] $fields[4] $fields[5] $fields[9] $fields[10]\n";
}
That further reduces the file to just over 1.5 gigs, a bit smaller than the original BAM, and it zips down to under 90 megs, making it manageable for exchange.

The CIGAR field encodes the result of comparing each nucleotide to the reference data. It's a run-length encoding of the form nC, where n is the number of times the subsequent character is repeated. For example, in our sample record an equal sign (=) is repeated 94 times, followed by a single D. Completely unwinding the CIGAR provides this when laid over the other data:


FTDNA uses five values in the CIGAR:

=   Matches the reference
X   Mismatches the reference
D   A deletion
I   An insertion
S   Soft trim
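Unwinding a CIGAR is just a regex loop. The CIGAR string below is illustrative, not the full one from the sample record:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand a run-length-encoded CIGAR string into one character per base.
sub unwind_cigar {
    my ($cigar) = @_;
    my $out = '';
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        $out .= $2 x $1;     # repeat the operation character n times
    }
    return $out;
}

my $expanded = unwind_cigar('94=1D2X');   # illustrative CIGAR
print length($expanded), "\n";            # 97: 94 '=', one 'D', two 'X'
```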

Remember that the start position for this read is 2781691. Mismatches, then, were found at 2781818 and 2781828. But to properly align the sequence, we have to shift our markers to the left by one in order to make up for the deleted nucleotide. This gives us an A for 2781818 and a T for 2781828. We can verify the accuracy of this by aligning the data to the reference sequence: C has become an A, and G a T.
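That shift can be computed rather than eyeballed. The sketch below walks an assumed CIGAR of 94=1D43= — only the 94=1D prefix comes from the record; the 43= tail is made up — and maps each read base to its reference position:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map read offsets to reference positions by walking the CIGAR.
my $ref_pos  = 2781691;      # field 4: start position of the read
my $cigar    = '94=1D43=';   # assumed; only '94=1D' is from the record
my $read_off = 0;
my %ref_of;                  # read offset -> reference position

while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
    my ($n, $op) = ($1, $2);
    if ($op =~ /[M=X]/) {               # consumes both read and reference
        $ref_of{ $read_off++ } = $ref_pos++ for 1 .. $n;
    }
    elsif ($op eq 'D' or $op eq 'N') {  # reference only: this is the shift
        $ref_pos += $n;
    }
    elsif ($op eq 'I' or $op eq 'S') {  # read only
        $read_off += $n;
    }
}

# The base at read offset 94 lands one reference position right of the
# naive start + offset, because of the one-base deletion:
print $ref_of{94}, "\n";   # 2781786, not 2781785
```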


The last field to look at is the read quality. To translate it from its single-character representation, subtract 33 from the character's ASCII value. For example, the decimal ASCII value for an F is 70; subtract 33 from that and you get a quality score of 37. For FTDNA-generated BAMs, that appears to be the highest score.
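In Perl, that's one ord() per character. The quality string below is invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert quality characters to numeric scores (ASCII value minus 33).
my $qual   = 'FFF,F';   # illustrative quality string
my @scores = map { ord($_) - 33 } split //, $qual;

print "@scores\n";      # 37 37 37 11 37
```

The comma (ASCII 44) decodes to 11, a noticeably weaker call than the Fs around it.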

So, to be clear, the first column in our stack of data is a high-quality call (F) that reveals a G, which matches, or equals, the reference data (=). The vast majority of reads are of the highest score or some degree below it.

Once we figure out the positions of the mismatches (the Xs), the readings for every noted position, and their related quality scores, are extracted, counted, and saved. Simple arithmetic tells us whether any particular SNP is worth investigating further. The results can be compared to SNP databases to determine whether we have any unnamed novel SNPs. My scripts note any SNPs that are found in STR regions, a topic that seems to piss some people off.
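Here's a simplified sketch of that counting step. The positions and counts are invented for illustration, and this is not the exact arithmetic my scripts use:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally base calls per position and report the majority call.
# Positions and counts below are invented for illustration.
my %calls = (
    2781818 => { A => 12, C => 1 },
    2781828 => { T => 9 },
);

for my $pos (sort { $a <=> $b } keys %calls) {
    my $bases = $calls{$pos};
    my $total = 0;
    $total += $_ for values %$bases;

    # Pick the most frequently seen base at this position.
    my ($top) = sort { $bases->{$b} <=> $bases->{$a} } keys %$bases;
    printf "%d  %s  %d/%d calls\n", $pos, $top, $bases->{$top}, $total;
}
```

A position where nearly every read agrees on a non-reference base is a candidate SNP; a lone dissenting read is almost certainly noise.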

If there's interest, I'll write more about this.