`maf-convert sam` make BAM/CRAM-incompatible CIGAR lines for long sequences.
Hi Martin,
I have been exploring the use of CRAM as an alignment storage instead of MAF, because it can be 5 times smaller for pairwise genome alignments. On small genomes it works but on larger ones it fails because of the following:
To represent genome alignments in SAM format, the genome sequences left and right of the alignment are hard-clipped. In the CIGAR lines, this is represented by a number of clipped bases followed by H. `maf-convert` does it well.
Unfortunately, the way the CIGAR lines are binary-encoded in the BAM and CRAM format only allows for the use of 28 bits to represent the number of clipped bases. This limits their number to 2^28 - 1 = 268,435,455 bases. This causes the SAM to CRAM or BAM conversion to crash for genomes that have chromosomes longer than that.
I think that I found a workaround: I post-process the SAM file so that large hard-clip values are split in chunks of 268,435,455 or less bases. Here is I do it in Perl but I am sure it would be trivial for you to add a python function within `maf-convert`.
```
# Define the divisor
my $divisor = 268435455;
# Function to process numbers in the "numberH" format
sub split_number_H {
my ($match) = @_; # Capture the matched number
my $num = $match;
my $quotient = int($num / $divisor);
my $remainder = $num % $divisor;
# Construct the new value
return ("268435455H" x $quotient) . "${remainder}H";
}
# Read from standard input (line by line)
while (<>) {
chomp; # Remove trailing newline
my @fields = split /\t/; # Split the line into tab-separated fields
# Process column 6 (index 5)
if (defined $fields[5]) {
$fields[5] =~ s/(\d+)H/split_number_H($1)/ge; # Apply transformation to all occurrences
}
# Print the modified line, keeping tab-separated format
print join("\t", @fields), "\n";
}
```
Do you think it make sense?
Charles
issue