`maf-convert sam` make BAM/CRAM-incompatible CIGAR lines for long sequences.
Hi Martin, I have been exploring the use of CRAM as an alignment storage instead of MAF, because it can be 5 times smaller for pairwise genome alignments. On small genomes it works but on larger ones it fails because of the following: To represent genome alignments in SAM format, the genome sequences left and right of the alignment are hard-clipped. In the CIGAR lines, this is represented by a number of clipped bases followed by H. `maf-convert` does it well. Unfortunately, the way the CIGAR lines are binary-encoded in the BAM and CRAM format only allows for the use of 28 bits to represent the number of clipped bases. This limits their number to 2^28 - 1 = 268,435,455 bases. This causes the SAM to CRAM or BAM conversion to crash for genomes that have chromosomes longer than that. I think that I found a workaround: I post-process the SAM file so that large hard-clip values are split in chunks of 268,435,455 or less bases. Here is I do it in Perl but I am sure it would be trivial for you to add a python function within `maf-convert`. ``` # Define the divisor my $divisor = 268435455; # Function to process numbers in the "numberH" format sub split_number_H { my ($match) = @_; # Capture the matched number my $num = $match; my $quotient = int($num / $divisor); my $remainder = $num % $divisor; # Construct the new value return ("268435455H" x $quotient) . "${remainder}H"; } # Read from standard input (line by line) while (<>) { chomp; # Remove trailing newline my @fields = split /\t/; # Split the line into tab-separated fields # Process column 6 (index 5) if (defined $fields[5]) { $fields[5] =~ s/(\d+)H/split_number_H($1)/ge; # Apply transformation to all occurrences } # Print the modified line, keeping tab-separated format print join("\t", @fields), "\n"; } ``` Do you think it make sense? Charles
issue