Note that the simulation framework employed here represents ideal conditions for assembly.

Short reads have been combined with other sources of data to generate and improve de novo genome assemblies; examples include the rice pathogen Pseudomonas syringae, the forest pathogen Grosmannia clavigera, plant chloroplast genomes, and Arabidopsis thaliana strains. Recently, individual human genome datasets were assembled into fragments by ABySS and SOAPdenovo, yielding numerous small contigs that together cover up to 80% of the human genome. The first use of high-throughput sequencing alone to assemble a large animal or plant genome was recently reported for the giant panda. However, the ‘true’ quality of the resulting assembly remains unclear, because it was estimated from comparisons to the dog genome, a limited number of pre-existing mRNA annotations, and various repeat estimation techniques. A fundamental concern in de novo genome assembly is the limited confidence in the assembled contigs, since they represent only one possible way of mapping the sequence fragments to contiguous sequences. There have been efforts to computationally simulate certain aspects of the assembly process in order to gauge the performance of existing approaches. For example, benchmarking datasets and assembly evaluations for metagenomic sequencing data have been presented. In addition, the original publications describing a novel assembly algorithm typically include some validation and comparison with existing methods. Some very recent studies compare short-read assembly methods under various conditions and for various types of genomic input. Having a few long contigs is desirable; however, an equally important consideration is the correctness of the contigs. In this paper, we study de novo assembly through simulation.
From several reference sequences, ranging from viral to plant, we generated simulated reads with lengths between 50 and 100 nt, which are typical of current short-read generating platforms. We introduce and employ a protocol for evaluating a de novo assembly strategy for a genome for which no reference sequence exists. Our protocol calls for generating simulated sequencing reads from a carefully chosen related reference genome, assembling them de novo, and finally aligning the assembled contigs to the reference and quantifying the correctly and erroneously assembled nucleotides. From the results, we can determine whether the sequencing and assembly strategy employed in the simulation would yield meaningful results on the related unsequenced genome. By injecting errors at varying rates into the reads, and by investigating different degrees of sequencing coverage, we obtain limits on the error rate that the assembler tolerates and determine which coverage ranges are most useful. Finally, we examine the extent of improvement that results from the use of paired-read information.
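The read-generation step of such a protocol can be illustrated with a minimal sketch. It is not the simulator used in this study. It assumes reads are sampled uniformly from one strand of the reference, that errors are independent single-base substitutions, and that the number of reads is chosen so that (number of reads × read length) ≈ (coverage × genome length). The function name `simulate_reads` and all parameter names are hypothetical.

```python
import random


def simulate_reads(reference, read_length, coverage, error_rate, seed=0):
    """Sample fixed-length reads from `reference` and inject substitution errors.

    Assumptions (illustrative only): uniform start positions, forward strand
    only, independent per-base substitution errors, no indels, no pairing.
    Returns a list of (start_position, read_sequence) tuples.
    """
    rng = random.Random(seed)
    genome_length = len(reference)
    if read_length > genome_length:
        raise ValueError("read_length exceeds reference length")

    # Choose the read count so that n_reads * read_length ~ coverage * genome_length.
    n_reads = round(coverage * genome_length / read_length)

    reads = []
    for _ in range(n_reads):
        # randint is inclusive on both ends, so the last valid start is allowed.
        start = rng.randint(0, genome_length - read_length)
        bases = list(reference[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the other three nucleotides.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


if __name__ == "__main__":
    # Toy reference; a real run would load a related reference genome instead.
    toy_reference = "".join(random.Random(1).choice("ACGT") for _ in range(1000))
    for rate in (0.0, 0.01, 0.05):
        reads = simulate_reads(toy_reference, read_length=50, coverage=10,
                               error_rate=rate)
        print(rate, len(reads))
```

Keeping the start position alongside each read is a convenience for this sketch: it allows simulated reads to be checked against the reference directly. Sweeping `error_rate` and `coverage` over a grid, as in the loop above, mirrors the kind of parameter exploration described in the text.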