Help


Overview

The application accepts a set of trace files and a reference file, performs alignments of trace sequences to corresponding reference sequences and displays an overview of resulting alignments.

Uploading Files

Trace files can be uploaded in .ab1 or .zip format. If matching to reference by ID is desired, the trace file name has to contain the corresponding reference ID. Otherwise the file name can be arbitrary.

The reference file can be an Excel spreadsheet (.xlsx file), a .csv file, a FASTA file or a GenBank file. In .csv and .xlsx files, the first row must contain a header. Recommended column names are "ID" and "Sequence". A warning is displayed if other names are used, but the functionality is not affected as long as there is a header. The first column contains reference IDs, which can be used to match corresponding trace files. The second column contains the sequences. Each row should have a unique reference ID. For reference sequences in FASTA format, the ID in the header (after ">") is used as the sequence ID. Similarly, the ID in a GenBank file is used as the sequence ID.
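The .csv layout described above can be illustrated with a minimal sketch. The function name read_reference_csv and its return shape are hypothetical and only illustrate the described layout; this is not the application's own parser.

```python
import csv
import io


def read_reference_csv(text):
    """Parse CSV text into ({reference_id: sequence}, warnings).

    Hypothetical helper: the first row is treated as a header, the first
    column as reference IDs and the second column as sequences.
    """
    rows = csv.reader(io.StringIO(text))
    header = next(rows)  # the first row is always consumed as the header
    warnings = []
    if [h.strip() for h in header[:2]] != ["ID", "Sequence"]:
        # Non-recommended names only produce a warning
        warnings.append("unexpected column names: %r" % (header[:2],))
    references = {}
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        ref_id, sequence = row[0].strip(), row[1].strip()
        if ref_id in references:
            # Assumption: duplicate IDs are rejected outright
            raise ValueError("duplicate reference ID: %s" % ref_id)
        references[ref_id] = sequence
    return references, warnings
```

Rejecting duplicate IDs is an assumption made for this sketch; the text above only says that IDs should be unique.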

Trace File Score Threshold

Each position of the sequence contained in a trace file is assigned a confidence score. A slider enables the user to choose a score threshold. Positions in the trace file with a score below the threshold are discarded, so the coverage in that part of the alignment is lower.
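As a rough illustration of the score threshold, the sketch below masks positions whose score falls below the threshold. The name mask_low_quality and the use of None for a discarded position are assumptions made for this example.

```python
def mask_low_quality(bases, scores, threshold):
    """Return the bases as a list, with low-scoring positions set to None.

    Hypothetical helper: a position is kept when its score is greater
    than or equal to the threshold.
    """
    if len(bases) != len(scores):
        raise ValueError("bases and scores must have the same length")
    return [b if s >= threshold else None for b, s in zip(bases, scores)]
```

For example, mask_low_quality("ACGT", [40, 10, 30, 5], 20) keeps the first and third base only, so the second and fourth positions would not contribute to coverage.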

End Trimming Score Threshold

As the quality tends to decrease towards the ends of trace files, both ends of each sequence are discarded until three bases in a row pass the trimming threshold. The trimming threshold can also be set by the user and should be higher than the score threshold; otherwise it has no effect.
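The trimming rule can be sketched as follows. The function name trim_ends is hypothetical, and treating "pass" as "greater than or equal to the threshold" is an assumption.

```python
def trim_ends(scores, trim_threshold, run_length=3):
    """Return (start, end) slice indices of the part kept after trimming.

    Each end is cut back until run_length consecutive scores reach
    trim_threshold. Returns (0, 0) when no such run exists.
    """
    def first_good_run(values):
        run = 0
        for i, score in enumerate(values):
            run = run + 1 if score >= trim_threshold else 0
            if run == run_length:
                return i - run_length + 1  # index where the run starts
        return None

    start = first_good_run(scores)
    if start is None:
        return (0, 0)
    # Scan the reversed scores to find the run closest to the right end
    from_end = first_good_run(scores[::-1])
    return (start, len(scores) - from_end)
```

Low-scoring positions inside the kept region are left in place by this sketch; they would be handled by the score threshold instead.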

Threshold for calling mixed peaks (f)

The application detects positions where more than one trace peak is present, resulting in one primary and one or more secondary base calls. This threshold specifies the minimal fraction of the area of the primary peak that another peak has to attain in order to be considered a secondary peak.
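A minimal sketch of the area criterion at a single position is shown below. The function name secondary_bases and the dictionary input are assumptions; only the area fraction f is modelled here, not any additional checks the application may apply.

```python
def secondary_bases(peak_areas, f):
    """Return (primary_base, [secondary_bases]) for one position.

    peak_areas maps each base to its peak area. A base counts as
    secondary when its area is at least f times the primary area.
    """
    primary = max(peak_areas, key=peak_areas.get)
    cutoff = f * peak_areas[primary]
    secondary = sorted(
        base for base, area in peak_areas.items()
        if base != primary and area >= cutoff
    )
    return primary, secondary
```

With areas of 100 for A and 20 for C, a threshold of f = 0.15 reports C as a secondary call, while f = 0.25 does not.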

Matching to Reference

Trace files are matched to reference sequences automatically. TraceTrack first checks whether a reference ID is contained in the trace file name. If no match is found this way, the reference is chosen by the best alignment score after aligning the trace file sequence to all uploaded reference sequences. The read directionality is determined automatically. Both the reference and the directionality can be changed manually by the user.
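The two-step matching can be sketched as below. All names are hypothetical: match_reference takes the scoring function as a parameter, and identity_score is a toy stand-in for a real alignment score. Trying longer IDs first is an assumption made so that "ref10" is not shadowed by "ref1". Read directionality is not covered.

```python
def identity_score(a, b):
    """Toy stand-in for an alignment score: number of equal positions."""
    return sum(x == y for x, y in zip(a, b))


def match_reference(trace_name, trace_seq, references, score):
    """Return the reference ID chosen for one trace file.

    references maps reference IDs to sequences; score(a, b) returns a
    number that is higher for better matches.
    """
    # Step 1: look for a reference ID inside the file name
    for ref_id in sorted(references, key=len, reverse=True):
        if ref_id in trace_name:
            return ref_id
    # Step 2: fall back to the best-scoring reference
    return max(references, key=lambda r: score(trace_seq, references[r]))
```

For example, with references "ref1" and "ref10", a file named ref10_fwd.ab1 is matched by name, whereas sample.ab1 is matched by score.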

Data from Multiple Groups

If you wish to upload trace files that share the same reference ID but come from different groups, and you want each group to be aligned separately, upload the first set of trace files using the available Choose Files button. Another button will then appear; use it to upload the files from the next group. You can repeat this step for multiple groups. Within each uploaded group, the trace files are sorted by reference ID and an alignment is created for each ID in each group.
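The grouping can be pictured as keying each alignment by the pair of group and reference ID. The sketch below uses a hypothetical plan_alignments function and hypothetical input tuples.

```python
from collections import defaultdict


def plan_alignments(traces):
    """Group trace files into one alignment per (group, reference ID).

    traces is an iterable of (group, reference_id, trace_name) tuples.
    """
    plan = defaultdict(list)
    for group, ref_id, name in traces:
        plan[(group, ref_id)].append(name)
    return dict(plan)
```

Two trace files with the same reference ID end up in the same alignment only if they were uploaded in the same group.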

Alignments

A multiple sequence alignment is created for each reference and its associated trace sequences. Trace files from each group are aligned separately. Substitution mutations, deletion mutations and insertion mutations are accepted when all trace files agree on them. In detail, the consensus sequence is produced by evaluating each alignment position separately using the following rules:

Every reference must contain at least one coding region (CDS feature). For reference sequences in GenBank format, features are extracted from the file. When reference sequences are uploaded in a different format, the whole sequence is considered coding. Coding and non-coding regions are displayed differently:


Reference C C C C
Read 1 A A A A
Read 2 A A C
Read 3 A T A
Result A A C C
Coverage 3 1 0 1

Mixed peak detection

Each trace file is searched for heterogeneous positions. The signal-to-noise (StN) ratio is calculated for all positions as the ratio of the area of the main peak to the sum of the areas of all other peaks. Areas with a low StN ratio are excluded from further steps, as these are noisy areas with low quality. Then an approximation of the area under the curve for each of the four traces is calculated at a given position, as well as the heights of individual peaks. When a base other than the main called one has both peak area and height greater than a threshold (set to 15% of the main peak) and its trace is concave around the center of the main peak, the position is marked as mixed. In the alignment, a position is marked as mixed if the secondary peak is detected in two or more trace files, or if it is detected in the only present trace file.
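Two parts of this procedure can be sketched in a few lines: the StN ratio at one position and the rule for marking an alignment position as mixed. The function names and inputs are hypothetical, and the sketch simplifies the alignment rule to one boolean flag per read, without checking which secondary base was detected. Peak height and concavity checks are not shown.

```python
def signal_to_noise(peak_areas, main_base):
    """Ratio of the main peak area to the summed area of all other peaks."""
    noise = sum(a for base, a in peak_areas.items() if base != main_base)
    if noise == 0:
        return float("inf")  # assumption: no competing signal at all
    return peak_areas[main_base] / noise


def position_is_mixed(read_flags):
    """Decide whether an alignment position is marked as mixed.

    read_flags holds one boolean per read covering the position, True
    when a secondary peak was detected in that read.
    """
    detected = sum(read_flags)
    if len(read_flags) == 1:
        return detected == 1  # the only present trace file decides
    return detected >= 2
```

Under this rule a single detection among several reads is not enough, which guards against marking a position as mixed because of one noisy read.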

Alignment Properties Table

The following properties of each alignment are displayed in the results table:

The table can be ordered by different columns when the header is clicked.

Displaying the Alignments

Each alignment can be displayed by clicking its Reference ID. The first character in the alignment (an asterisk) represents the Kozak sequence (GCCACC). If it is black, the intact sequence is present in at least one of the reads. If it is red, the Kozak sequence is either not covered by any read or mutated. The last character is an asterisk representing a stop codon. The same color coding applies as with the Kozak sequence. In the alignment itself, the background color represents coverage and the style of the letters represents mutations. Hovering your cursor over any nucleotide highlights the codon with black and red underlining. If gaps are present in the alignment, black underlining represents the original reading frame and red underlining represents the frame after including gaps. A tooltip shows the reference nucleotide in the first row and the individual reads in the following rows.

Download

The results can be downloaded as an Excel spreadsheet using the Download all button. The first sheet of the file will contain the same information as the results table. The reference IDs are clickable and take you to sheets containing the given alignments. The alignment sheet (named like the trace file, with "Seq" appended) contains a row for both the reference and consensus sequence, as well as each trace sequence. Amino acid translations are provided for each of these. The following sheet (marked by "Mut" appended to the sheet name) lists all the mismatched positions and regions of zero coverage. Position numbers are clickable and contain links to the corresponding position in the alignment.
If desired, each alignment can be downloaded as a separate spreadsheet using the button in the results table row.