Error Correction Evaluation Toolkit

On this page, we discuss the scripts and other utilities used to evaluate error correction methods. The first section describes the step-by-step procedure for evaluating error correction programs for Illumina and 454/Ion Torrent reads. The second section describes the tools we have developed for evaluation and their usage. The download section provides the link to the tools.

Evaluation method

We use two different methods for evaluation - one for Illumina reads, and one for 454 or Ion Torrent reads.

Evaluation method for Illumina reads

Evaluation of output from error correction methods that can handle Illumina reads consists of the following steps:

  1. Align reads prior to correction.
  2. Convert pre-correction alignment to Target Error Format (TEF).
  3. Run error correction program.
  4. Convert error correction output to Target Error Format (TEF).
  5. Compare TEFs from steps (2) and (4) and measure the required statistics.

The following sub-sections describe these steps.
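As a roadmap, the commands below sketch how these steps chain together for a Coral run. This is an illustrative sequence only: the file names are placeholders, BWA's classic `index'/`aln'/`samse' commands are assumed, and the exact options of each script are detailed in the sections that follow.

$ bwa index genome.fa
$ bwa aln genome.fa all.fa > all.sai
$ bwa samse genome.fa all.sai all.fa > all.sam
$ sam-analysis.py --file=all.sam --outfile=aln.er --ambig=ambig.lst \
    --unmapped=unmapped.lst --trim=trim.lst
$ cat unmapped.lst ambig.lst > unmapped-all.lst
$ coral-analy.pl corrected.fa all.fa corrected.er > conv.log
$ comp2pcalign corrected.er aln.er unmapped-all.lst 4 stats.out trim.lst

Step (3), running the error correction program itself, produces `corrected.fa' and is specific to each tool; the `4' stands for the edit distance used during alignment.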

Target error format (TEF)

We use a neutral format called target error format (TEF) to perform the analysis for Illumina reads. TEF represents the errors of a read as below:

readid n-errors [pos tb wb ind]+

In the above format, the fields are described below:

Field     Description
readid    ID of the corrected read
n-errors  Integer; the number of errors corrected in the read
pos       Position of the fix (0 <= pos < length of the read)
tb        True value of the base at pos
wb        Wrong value of the base at pos; wb should be the current base in the read
ind       Type of error; one of {0,1,2}:
          0 = substitution (bad char in the read at pos)
          1 = deletion (missing char in the read after pos)
          2 = insertion (extra char in the read at pos)

tb and wb are each one of {0,1,2,3,4,5}, where 0 = 'A', 1 = 'C', 2 = 'G', 3 = 'T', 5 = '-'.
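To make the format concrete, the following is a minimal sketch of a TEF record parser in Python; the function name and the example record are illustrative only, not part of the toolkit.

# Minimal TEF parser sketch (illustrative, not part of the toolkit).
# A record reads: readid n-errors, then n-errors groups of (pos tb wb ind).
def parse_tef_line(line):
    fields = line.split()
    read_id, n_errors = fields[0], int(fields[1])
    errors = []
    for i in range(n_errors):
        pos, tb, wb, ind = (int(x) for x in fields[2 + 4*i : 6 + 4*i])
        errors.append({"pos": pos, "true": tb, "wrong": wb, "type": ind})
    return read_id, errors

# Read 17 with one substitution at position 5: true base 'A' (0), read base 'G' (2).
print(parse_tef_line("17 1 5 0 2 0"))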

Align uncorrected reads with BWA

  1. Before performing alignment, the reads are pre-processed (Section Pre-processing Data).
  2. Uncorrected reads are aligned with BWA, which generates alignments in SAM format. The script sam-analysis.py (Section SAM to TEF Conversion) converts the alignments from SAM to TEF.

Correct reads

The error correction program is run on the uncorrected reads. The output of the error correction program is then converted to the target error format (TEF) using the script corresponding to that program:

Program Script
Coral coral-analy.pl
HiTEC hitec-analy.pl
Quake quake-analy.py
ECHO quake-analy.py

Reptile generates output in TEF directly. The usage of these scripts is described in Section Error Correction output to TEF conversion.

Measure Gain

The target error format files generated in the previous two sub-sections, one from the alignment and one from the correction, are compared using the utility Comp2PCAlign (Section Comp2PCAlign), which measures Gain and Sensitivity.
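Gain and Sensitivity follow the definitions in the survey paper cited in the Reference section: with TP the number of errors correctly removed, FP the number of errors newly introduced, and FN the number of errors left uncorrected, Gain = (TP - FP) / (TP + FN) and Sensitivity = TP / (TP + FN). A minimal sketch of the computation:

# Sketch of the Gain and Sensitivity computations (definitions as stated above).
def gain(tp, fp, fn):
    # fraction of errors removed, penalized by errors newly introduced
    return (tp - fp) / (tp + fn)

def sensitivity(tp, fn):
    # fraction of existing errors that were corrected
    return tp / (tp + fn)

print(gain(90, 5, 10), sensitivity(90, 10))  # 0.85 0.9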

Evaluation method for 454/Ion Torrent Reads

Evaluation consists of the following steps:

  1. Align reads prior to correction.
  2. Construct the set of errors prior to error correction.
  3. Construct the set of errors post error correction.
  4. Measure Gain by comparing (2) and (3).

Steps (2), (3) and (4) are done by a single script. The following sub-sections explain these steps.

Align uncorrected reads with Mosaik/TMAP

Before performing alignment, the reads are pre-processed as given in Section Pre-processing Data. For 454 reads, alignments are performed using Mosaik. For Ion Torrent reads, TMAP is used for alignment. Alignments are converted to SAM format for further processing.

Measure Gain

The script `compute-stats.py' (Section Corrected 454/Ion Torrent Reads Analysis) measures Gain for 454/Ion Torrent reads using the method explained in the paper.

Tools

Requirements

Tools used for evaluation have the following dependencies:

  1. GCC C++ compiler v. 4.3
  2. Perl v. 5
  3. Python v. 2.7.2
  4. MPI
  5. mpi4py python package

No build is required for the Python and Perl scripts. Run `make' in the `tools/CompToPCAlign/' directory to build the Comp2PCAlign executable.

Pre-processing Data

The NCBI Short Read Archive stores reads in '.sra' format, which can be converted to FASTQ using the 'fastq-dump' tool available with the NCBI SRA toolkit, downloadable from NCBI.
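For example, assuming the toolkit is installed and the data-set is paired-end, a conversion might look like the following; the `--split-3' flag, which writes mates to `_1'/`_2' files and unpaired reads to a third file, is one common choice:

$ fastq-dump --split-3 SRR001665.sra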

We use FASTA format as input to error correction programs. To convert FASTQ to FASTA, the script 'fastq-converter-v2.0.pl' is used as follows:

$ fastq-converter-v2.0.pl fastq/ fasta/ 1

Here, the `fastq/' directory contains all the FASTQ files of the data-set; output is generated in the `fasta/' directory. The last argument `1' causes the script to ignore reads with ambiguous `N' characters. `fastq-converter-v2.0.pl' generates a unique id for each read and inserts it in the FASTA header. The read ids are unique across all the files, and they are used to match reads pre- and post-correction. The order in which `fastq-converter-v2.0.pl' performs the conversion is saved, and the identifiers are assigned to the reads in this order from 0 to n-1, where n is the total number of reads. For example, for the SRR001665 data-set, `fastq-converter-v2.0.pl' processes the fastq files in the following order:

SRR001665_2.fastq
SRR001665.fastq
SRR001665_1.fastq

Here, the first valid read in the SRR001665_2.fastq file gets id 0, and the last valid read in SRR001665_1.fastq gets id n-1.
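The id assignment thus amounts to the following sketch (illustrative Python; the real assignment is performed by `fastq-converter-v2.0.pl'):

# Sketch: ids 0..n-1 assigned in file-processing order, skipping ambiguous reads.
def fastq_records(path):
    # Yield (header, sequence) pairs from a FASTQ file (4 lines per record).
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()  # '+' separator line
            fh.readline()  # quality line
            yield header, seq

next_id = 0
for path in ["SRR001665_2.fastq", "SRR001665.fastq", "SRR001665_1.fastq"]:
    for header, seq in fastq_records(path):
        if "N" in seq:          # reads with ambiguous characters are ignored
            continue
        print(next_id, header)  # this read receives id next_id
        next_id += 1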

Some error correction methods accept only FASTQ. To make sure the read ids are the same for all error correction methods, an identifier is given to each read in the FASTQ headers as well, again ranging from 0 to n-1, where n is the total number of reads. To accomplish this, we use the script `merge-fastq.py', which merges all FASTQ files in a given input list. The identifiers in the FASTQ are useful for comparison between pre- and post-corrected reads. 'merge-fastq.py' is run as follows:

$ merge-fastq.py --list=lst --outfile=all.fastq

where `lst' is a file containing the paths to the fastq files, one per line. For example, for the SRR001665 data-set the `--list' argument is the path to a file with the following contents:

SRR001665_2.fastq
SRR001665.fastq
SRR001665_1.fastq

For the comparison to make sense, this order should be the same as the order in which the FASTQ converter (`fastq-converter-v2.0.pl') performed the conversion.

While pre-processing, we ignore reads with invalid characters because some error correction programs cannot handle them.

Error Correction output to TEF conversion

To get TEF from corrected reads, different scripts are used for the different error correction methods. All of these scripts exploit the identifiers inserted in the FASTA/FASTQ headers during the pre-processing step (Section Pre-processing Data). Usage of these scripts for converting output to TEF is explained below.

coral-analy.pl

coral-analy.pl converts a Coral-corrected FASTA file to TEF as below:

$ coral-analy.pl corrected.fa all.fa coral-output.er > coral_conv.log

In the above example, corrected.fa is the corrected FASTA file, all.fa is the uncorrected FASTA file and coral-output.er is the output in TEF.

quake-analy.py

The conversion program for both Quake and ECHO is quake-analy.py. It is run as below:

$ quake-analy.py -f all.fastq -c corrected.fastq -o echo-output.er -t echo-trim > missing.log

Here, `all.fastq' is the input file, `corrected.fastq' is the ECHO/Quake-corrected fastq, `echo-output.er' is the output in TEF, and `echo-trim' is the list of reads with trimmed regions (these regions are ignored in the comparison).

hitec-analy.pl

Output from HiTEC is converted to TEF as below:

$ hitec-analy.pl corrected.fa all.fa hitec-output.er

Again, `all.fa' is the uncorrected FASTA, `corrected.fa' is the HiTEC-corrected FASTA, and `hitec-output.er' is the output in TEF.

SAM to TEF Conversion

Alignments in SAM file are converted to TEF file using the script `sam-analysis.py'.

sam-analysis.py --file=/path/to/sam-file-input
              --outfile=/path/to/err-output
              --ambig=/path/to/ambig-output
              --unmapped=/path/to/unmapped-output
              --trim=/path/to/trim-file-output
              [--genomeFile=/path/to/genome-file]
              [--dry (dry run; no output is generated)]

The `outfile' option is the path of the output TEF file (write access required). Ambiguous reads are written to the file given by the `ambig' option, and unmapped reads to the file given by the `unmapped' option; the unmapped and ambiguous outputs may be the same file. The `trim' option gives the file to which trimmed positions are written (ranges allowed).

Here, the genome file is optional; it is used if the MD string is not available in the SAM file. If a genome file is given, it is loaded completely into memory. The script does not handle genomes with multiple chromosomes.
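For readers unfamiliar with the MD field, the sketch below shows how it encodes substitutions against the reference, which is why a genome file can substitute for it. This is a simplification: it ignores insertions (which MD does not record) and skips deletion runs.

import re

# The MD tag interleaves match lengths with mismatched reference bases,
# e.g. "10A5" means 10 matches, reference base 'A' at a mismatch, 5 matches.
def md_mismatch_positions(md):
    # Return (read_offset, reference_base) pairs for substitutions.
    pos, out = 0, []
    for num, dele, sub in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if num:
            pos += int(num)
        elif sub:
            out.append((pos, sub))
            pos += 1
        # deletion runs ('^' + bases) consume reference, not read, positions
    return out

print(md_mismatch_positions("10A5"))  # [(10, 'A')]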

Comp2PCAlign

Comp2PCAlign measures the Gain and Sensitivity from the outputs generated in the previous two sub-sections. Usage is as below:

$ comp2pcalign [correction-rslt] [pre-correct-aln-rslt] [unmapped-pre-correct-aln] \
   [m-value] [fpfn-rslt] [optional trimmed-file]

It takes six arguments, given in the following order:

  1. Correction Result converted to TEF.
  2. Alignment SAM converted to TEF.
  3. File with list of unmapped reads.
  4. Edit distance used for alignment.
  5. Output file (write access required) to which the statistics are written.
  6. [Optional] List of reads with trimmed regions.

(1) is generated from the error correction output as described in Section Error Correction output to TEF conversion. (2), (3), (4) and (6) come from the alignment step as described in Section SAM to TEF Conversion. (3) is a concatenation of the unmapped and ambiguous read lists.

Corrected 454/Ion Torrent Reads Analysis

The procedure to analyse 454/Ion Torrent reads is given in the paper; `compute-stats.py' is the script implementing it. It is used as below:

compute-stats.py --aln=/path/to/pre-correction-alignment-sam-file
              --corrected=/path/to/corrected-reads-fa-file
              --outfile=/path/to/stats-output (write access reqd.)
              --records=number of reads
              [--genomeFile=/path/to/genome-file]
              [--band=value of k used for k-band alignment (default 5)]
            (OR)
compute-stats.py -a /path/to/pre-correction-alignment-sam-file
              -c /path/to/corrected-reads-fa-file
              -o /path/to/stats-output-file (write access reqd.)
              -r number of reads
              [-g /path/to/genome-file]
              [-b value of k used for k-band alignment (default 5)]

The script accepts only FASTA files. It requires that the FASTA is pre-processed as given in Section Pre-processing Data, because it exploits the sorted identifiers to process the SAM and FASTA together in a single pass.

The `band' option gives the band size used for k-band alignment. The genome file is optional; it is used if the MD string is not available. If a genome file is given, it is loaded completely into memory. The script does not handle genomes with multiple chromosomes.
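For context, k-band alignment restricts the edit-distance dynamic program to cells within k of the main diagonal; the following is a minimal sketch of the idea, not the toolkit's implementation:

# Sketch of k-band (banded) edit distance; cells outside the band are
# treated as unreachable.
def kband_edit_distance(a, b, k=5):
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return INF  # the true distance necessarily exceeds the band
    prev = {j: j for j in range(0, min(m, k) + 1)}  # row i = 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            best = INF
            if j - 1 in cur:
                best = min(best, cur[j - 1] + 1)   # insertion
            if j in prev:
                best = min(best, prev[j] + 1)      # deletion
            if j > 0 and j - 1 in prev:
                best = min(best, prev[j - 1] + (a[i - 1] != b[j - 1]))
            cur[j] = best
        prev = cur
    return prev.get(m, INF)

print(kband_edit_distance("ACGT", "AGGT"))  # 1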

`compute-stats.py' uses MPI and therefore requires an MPI installation and the mpi4py package.
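A run on four MPI processes might look like the following; the exact invocation depends on your MPI installation, and the file names and read count are placeholders:

$ mpirun -np 4 python compute-stats.py -a aln.sam -c corrected.fa -o stats.out -r 4000000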

Download

Source code for the tools is available here.

Reference

When using the toolkit, please cite:

X. Yang, S. Chockalingam and S. Aluru, “A Survey of Error Correction Methods for Next Generation Sequencing”, Briefings in Bioinformatics, 2012.

Contributors

Thanks to Giorgio Gonnella for fixing a bug in the handling of soft padding.

Downloads:
tools.zip (41.55 KB)