Michael Cooley's Genetic Genealogy Blog GEN • GEN
8 January 2019

X Marks the Spot: SAM Part 2

The previous article described the SAM format, at least the part of it that is relevant to our study. I demonstrated how we can drastically reduce the file size on output while identifying each read's quality, allele value, and its status as matching or non-matching to the reference genome.

The first question might be, Why make a secondary file? Why not process it all in place — in memory? The first problem is that SAMs are huge; the example I'm using is more than 17 gigs. Resource requirements would be immense. But it turns out we need to take a second pass through the file anyway:

Mutations do not always appear in all the reads. In fact, very frequently only 50% of the reads, or even fewer, exhibit the mutation, in which case we throw it out. That means we can't look only at the X markers: we also need to know how many reads at the same position show no mutation at all. This first pass through the file is, apart from file reduction, a fact-gathering mission. And the most important fact, the very crux of the matter, is the location of the markers. We can make a list of them to reference during the second pass, the pass that does the actual collection of SNP data.
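
To make that concrete, here's a minimal sketch of the kind of filter the second pass will eventually apply. The hashes %coverage and %mutant are hypothetical placeholders for per-position counts we haven't gathered yet:
## Hypothetical counts, filled in during the second pass:
##   $coverage{$pos} = number of reads covering the position
##   $mutant{$pos}   = number of those reads showing the mutation
foreach $pos (keys %mutant)
{
  $ratio = $mutant{$pos} / $coverage{$pos};
  if ($ratio <= 0.5) { next }            ## Half or fewer: throw it out.
  print "$pos: $mutant{$pos} of $coverage{$pos} reads show the mutation\n";
}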

As explained in the last article, the CIGAR field encodes the marker data. The string consists of pairs: a number followed by a character. A fairly complex, real-life CIGAR is 10=2D117=1X12=1I9=. Simply stated, it tells us that the associated string of nucleotides, when compared to the same segment of the reference data, consists, left to right, of the following:

10 matches
2 deletions
117 matches
1 mismatch
12 matches
1 insertion
9 matches
We can easily parse out the CIGAR with one line of code and print the pairs for verification:
#!/usr/bin/perl -w

$cigar="10=2D117=1X12=1I9=";             ## Any CIGAR will do.
(@pairs) = $cigar =~ /\d+\D/g;           ## Split into pairs (any
                                         ## number of digits followed
                                         ## by a single non-digit).
foreach $p (@pairs) { print "$p\n" }     ## Print them.
The printed output is what we expect:
10=
2D
117=
1X
12=
1I
9=
The difference this time, however, is that we have the values in the computer's memory.

My real-life SAM example gives a start position of 6875649 for this record. (Never mind quality, region, etc.) We can cycle through the pairs and record the position of each marker, making allowances for INDELs (insertions and deletions). I'll use the simple chop function to trim the marker type (=, X, I, D, S) off the right-hand end and create a hash table of arrays, one for each type:

$start=6875649;                          ## The record's start position.

foreach $p (@pairs)
{
  $type = chop $p;                       ## Split the marker type off the number.
  push @{$TABLE{$type}}, $start;         ## Record the start position for the marker.
  $start+=$p;                            ## Increment to the position of the
}                                        ## next marker.
That's a good start and demonstrates the concept, but it's not quite that simple. Each marker type needs to be treated differently because INDELs jostle the positions, and the loop above records only the first position of each run:

I: 6875791
=: 6875649 6875661 6875779 6875792
D: 6875659
X: 6875778
So we need to handle each pair conditionally and record consecutive positions of the same marker (such as the two deletions) individually.
#!/usr/bin/perl

$cigar="10=2D117=1X12=1I9=";
$start=6875649;

(@pairs) = $cigar =~ /\d+\D/g;

foreach $p (@pairs)
{
  $type = chop $p;

  ## Save the position of each marker
  for ($i=0;$i<$p;$i++)
  {
    push @{$TABLE{$type}}, $start;

    ## Increment pos except for Inserts
    if ($type ne 'I') {$start+=1}
  }
}
Let's verify by printing the positions for the "D," "X," and "I" markers.
for $k ( keys %TABLE )
{
  if ($k ne '=' && $k ne 'S')
  { print "$k: @{ $TABLE{$k} }\n" }
}
This is more like what we're after.

D: 6875659 6875660
X: 6875778
I: 6875791

There's one more piece of useful data we can save before writing out the file and proceeding. It would be helpful to be able to quickly access each marker. This can be done by creating an index of file pointers to the start of each record. We need only increment the pointer by the length of the previous record. In other words, the very first record in my sample SAM, the same record I referenced last time, will begin writing at the first byte of the new, reduced data file. The read's start position, recorded in field 4 (array index 3), is 2781691. Including the length of the string of alleles will help us later down the road. Here's a Perl script that does only that much: create an index of file pointers and other pertinent data.

#!/usr/bin/perl -w

$write_position = 1;

while (<STDIN>)
{
  (@fields) = split /\t/, $_;  ## Splits fields at the tabs

  ## NOT FAILED QUAL CONTROLS & NOT PCR DUPE ## 
  if (($fields[1] & 512) || ($fields[1] & 1024)) {next}

  $write_string = "$fields[3] $fields[1] $fields[4] $fields[5] $fields[9] $fields[10]\n";

  $len = length $fields[9];

  ## DATA POSITION, LENGTH OF ALLELES, FILE POINTER ## 
  print "$fields[3] $len $write_position\n";
  $write_position += length ($write_string);
}
Here are the first few lines of output.
Pos     Len Pointer
2781691 151 1
2781762 150 339
2781805 150 664
2781818 151 986
2781826 150 1310
2781828 151 1636
2781831 151 1964
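Later on, any of these pointers will let us jump straight to its record in the reduced data file (written by the full script below). Here's a minimal sketch, keeping in mind that the pointers start at 1 while byte offsets start at 0:
## Fetch the record whose index entry reads "2781762 150 339".
open REDUCED, "<our-reduced.dat" or die "Could not open file $!";
$pointer = 339;
seek REDUCED, $pointer - 1, 0;           ## The pointer is 1-based; seek wants a 0-based offset.
$record = <REDUCED>;                     ## Reads up to the next newline.
print $record;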
We could include more information, but let's leave it there for now and get to the business of putting it all together. We'll open four files: the SAM file for input, plus three output files, namely the reduced data file, the index, and the list of X positions, our potential SNPs. (The same can be done, of course, for INDELs.)

#!/usr/bin/perl -w

open SAM, "<our.sam" or die "Could not open file $!";
open INDEX, ">our.idx" or die "Could not open file $!";
open X_POS, ">our.x" or die "Could not open file $!";
open REDUCED, ">our-reduced.dat" or die "Could not open file $!";

$write_position = 1;

while (<SAM>)
{
  (@fields) = split /\t/, $_;  ## Splits fields at the tabs

  ## NOT FAILED QUAL CONTROLS & NOT PCR DUPE ## 
  if (($fields[1] & 512) || ($fields[1] & 1024)) {next}

  ## GRAB ONLY THE FIELDS WE WANT
  $write_string = "$fields[3] $fields[1] $fields[4] $fields[5] $fields[9] $fields[10]\n";

  $len = length $fields[9];
  print INDEX "$fields[3] $len $write_position\n";
  print REDUCED $write_string;
  $write_position += length ($write_string);

  ## EXTRACT THE CIGAR, which is $fields[5] ##
  $cigar = $fields[5];
  (@pairs) = $cigar =~ /\d+\D/g;

  ## PARSE THE CIGAR. The start position is $fields[3] ##

  $start = $fields[3];
  foreach $p (@pairs)
  {
    $type = chop $p;

    ## Save the position of each marker
    for ($i=0;$i<$p;$i++)
    {
      push @{$TABLE{$type}}, $start;

      ## Increment pos except for Inserts
      if ($type ne 'I') {$start+=1}
    }
  }

  ## SAVE THE X DATA ##
  for $k ( keys %TABLE )
  {
    if ($k eq 'X')
    { push @x_array, @{$TABLE{$k}} }
  }
  undef %TABLE;
}

## DELETE DUPES, SORT, AND PRINT THE X ARRAY ##

%seen = map { $_ => 1 } @x_array;        ## The hash keys weed out duplicates.
@unique = keys %seen;
@sorted = sort { $a <=> $b } @unique;
foreach $marker (@sorted) { print X_POS "$marker\n" }

More can be done. Checking the data for consistency is always a good idea. I'd rather deal with the X data as a hash at the end of the program, but even my semi he-man laptop can't handle a hash table that large. Because the data is so simple (a single integer per record), creating and manipulating a simple array does the trick.
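
One such check, sketched below under the assumption that our.idx and our-reduced.dat came from the script above, walks the two files in parallel, re-derives each file pointer from a running byte count, and confirms that the stored allele length matches the record itself:
#!/usr/bin/perl -w

open INDEX, "<our.idx" or die "Could not open file $!";
open REDUCED, "<our-reduced.dat" or die "Could not open file $!";

$expected = 1;                           ## File pointers start at 1.

while ($idx = <INDEX>)
{
  chomp $idx;
  ($pos, $len, $pointer) = split / /, $idx;

  $record = <REDUCED>;
  (@fields) = split / /, $record;        ## pos, flag, mapq, CIGAR, alleles, quality

  if ($pointer != $expected)     { print "Pointer mismatch at $pos\n" }
  if ($len != length $fields[4]) { print "Length mismatch at $pos\n" }

  $expected += length $record;           ## The newline counts, just as it did when writing.
}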

We have all the parts we need: a reduced data file, an index of file pointers to the start of each record, and a list of all potential SNPs — nearly 400,000 in the sample SAM I'm using. The next article will attempt to sort through all this and produce a much shorter list of possibly viable SNPs.