Simulator User Guide

1. Installing Dependencies

VarSimLab can be run from the command line with any Python 3 version.

Installing ART: VarSimLab uses art_illumina to generate short reads with realistic sequencing errors. Installation instructions are in the ART documentation.

Installing SInC: VarSimLab uses the SInC simulator to generate biologically realistic tumor genomic variations. The source files and compilation instructions are available from the SInC project.

Installing Bedtools (optional): Bedtools is required only if you would like to use VarSimLab's exome sequencing capabilities. Installation instructions are in the Bedtools documentation.

Installing BWA (optional): If you would like VarSimLab to automatically align the generated reads back to your sequence of interest, generating SAM files, download and compile BWA. The BWA documentation covers both the download and the alignment algorithm.
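Before running the simulator, it can help to confirm that the external tools are reachable. The following is a minimal sketch, assuming the tools are expected on your PATH (the guide does not say how VarSimLab locates them). The helper name find_missing is hypothetical. The SInC executable name depends on how it was built, so it is left for you to add.

```python
import shutil

# Executable names for the tools named above. SInC is omitted on purpose:
# its executable name depends on your build, so add it yourself if needed.
REQUIRED = ["art_illumina"]
OPTIONAL = ["bedtools", "bwa"]


def find_missing(executables, which=shutil.which):
    """Return the executables from the list that are not found on PATH."""
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = find_missing(names)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label} tools: {status}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.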

2. Prepare The Reference Genome

There are two ways to run VarSimLab with a reference genome:

  1. Download a pre-prepared reference folder from http://nabavilab.uconn.edu/datasets/varsimlab/
  2. Prepare your own reference (super easy).

To prepare a reference genome, follow these steps:

  1. Copy the reference genome in FASTA format (.FASTA | .FA) to a folder of your own. You can download a genome from the UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway).
  2. Copy the target regions file in BED format (.BED) to the same folder. You can download target files from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables). More information about preparing the BED file is available at our GitHub repository (https://github.com/NabaviLab/VarSimLab).
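To double-check a prepared folder, you could classify its file names by extension. This is only a sketch: the function name classify_reference_files and the example file names are hypothetical, and matching extensions case-insensitively is an assumption rather than documented VarSimLab behavior.

```python
from pathlib import PurePath

FASTA_SUFFIXES = {".fasta", ".fa"}
BED_SUFFIXES = {".bed"}


def classify_reference_files(names):
    """Split file names into (fasta_files, bed_files) by extension."""
    fasta = sorted(n for n in names if PurePath(n).suffix.lower() in FASTA_SUFFIXES)
    bed = sorted(n for n in names if PurePath(n).suffix.lower() in BED_SUFFIXES)
    return fasta, bed


if __name__ == "__main__":
    # Hypothetical folder contents; in practice pass os.listdir(your_folder).
    fasta, bed = classify_reference_files(["chr1.FA", "targets.bed", "notes.txt"])
    print("FASTA files:", fasta)
    print("BED files:", bed)
```

A folder with exactly one FASTA file, plus one BED file if you plan exome simulation, matches the steps listed above.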

That’s it! Your reference is ready to generate reads from.

3. Running VarSimLab

Here are the arguments VarSimLab accepts at the command line:

required positional arguments:

filename name of output file
genome genome to be processed

VarSimLab also requires one of the following arguments:

-use_genome generate tumor and normal for the entire provided sequence. Used for whole genome sequence simulation
-bed generate tumor and normal based on a BED file containing exonic regions. Used for whole exome sequence simulation

read generation parameters:

arguments to adjust read generation

-c C read depth of coverage
-s use single end reads (default paired)
-l L read length. default 100 bp
-m M maximum distance for two bed ranges to be merged into one range. If zero, merges only those ranges that directly overlap with each other
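The effect of -m can be illustrated with a small interval-merging sketch. This is not VarSimLab's own code: the function name merge_ranges is hypothetical, ranges are assumed to be (start, end) pairs on a single chromosome, and at distance zero this sketch also merges ranges that touch end to start (as `bedtools merge -d 0` does), which may differ from VarSimLab's exact behavior.

```python
def merge_ranges(ranges, max_distance=0):
    """Merge (start, end) ranges whose gap is at most max_distance."""
    merged = []
    for start, end in sorted(ranges):
        # Gap between this range and the last merged one; negative if they overlap.
        if merged and start - merged[-1][1] <= max_distance:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


if __name__ == "__main__":
    targets = [(100, 200), (150, 300), (310, 400), (1000, 1100)]
    print(merge_ranges(targets, 0))   # [(100, 300), (310, 400), (1000, 1100)]
    print(merge_ranges(targets, 10))  # [(100, 400), (1000, 1100)]
```

With a distance of 0 only the overlapping pair is joined; raising it to 10 also absorbs the range that starts 10 bases after the previous one ends.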

error parameters:

arguments to adjust tumor error generation

-cnv percent of total input to be incorporated into CNVs. Values from 0 to 100; a value of 4 means 4 percent of the input should be included in CNVs
-cnv_min_size minimum size of CNVs
-cnv_max_size maximum size of CNVs
-snp percent of total input to be turned into SNPs. Values from 0 to 100; a value of 5 means 5 percent of the genome should be turned into SNPs
-indel percent of total input to be included in indels. Values from 0 to 100; a value of 1 means 1 percent of the genome should be included in indels
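As a rough illustration of how these arguments fit together, the sketch below assembles a command line from the documented options. Several things here are assumptions: the function name build_command, the script path and the python3 launcher are placeholders, and -bed is assumed to take the path to the BED file. Check the tool's own help output before relying on it.

```python
def build_command(script, filename, genome, bed=None, coverage=None,
                  read_length=None, single_end=False,
                  cnv=None, snp=None, indel=None, python="python3"):
    """Assemble an argument list from the options documented above."""
    cmd = [python, script, filename, genome]
    # One of -bed / -use_genome is required (assumption: -bed takes a file path).
    cmd += ["-bed", str(bed)] if bed is not None else ["-use_genome"]
    if coverage is not None:
        cmd += ["-c", str(coverage)]
    if read_length is not None:
        cmd += ["-l", str(read_length)]
    if single_end:
        cmd.append("-s")
    for flag, value in (("-cnv", cnv), ("-snp", snp), ("-indel", indel)):
        if value is None:
            continue
        if not 0 <= value <= 100:
            raise ValueError(f"{flag} is a percentage and must be between 0 and 100")
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Hypothetical script path and file names, for illustration only.
    cmd = build_command("path/to/varsimlab_script.py", "my_output", "reference.fa",
                        bed="targets.bed", coverage=30, snp=5)
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the script path points at your actual VarSimLab installation.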

4. Understanding Simulator Results

The output_prefix folder contains two folders:

  • Normal: contains the .FASTQ files for reads that represent the control (or normal) sample. There are two .FASTQ files if paired end reads were generated, and one if single end reads were generated.
  • Tumor: contains the .FASTQ files for reads that represent the tumor sample. There are two .FASTQ files per allele for paired end sequencing, or one per allele for single end sequencing. It also contains the benchmark data that tells you where SNPs, indels and CNVs were generated for each allele in each subclone.
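To see what a run produced, you could list the read files in each folder. This is a sketch under assumptions: the function name list_fastq is hypothetical, the exact output file names are not specified in this guide, and both .fastq and .fq extensions are accepted (case-insensitively) because the naming may vary.

```python
from pathlib import Path

READ_SUFFIXES = {".fastq", ".fq"}


def list_fastq(output_dir, samples=("Normal", "Tumor")):
    """Return {sample: sorted read file names} for each sample folder."""
    found = {}
    for sample in samples:
        folder = Path(output_dir) / sample
        if not folder.is_dir():
            found[sample] = []
            continue
        found[sample] = sorted(
            p.name for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in READ_SUFFIXES
        )
    return found


if __name__ == "__main__":
    # "output_prefix" stands in for your own output folder.
    for sample, files in list_fastq("output_prefix").items():
        print(sample, files)
```

For a paired end run you would expect two entries under Normal, and two per allele under Tumor, as described above.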

5. Understanding Benchmarking Files

If a bed file was supplied, two sets of positions are calculated, one relative to the genome, the other relative to the exome. The genome position is likely to be much greater than the exome position, since the exome is much smaller than the genome, and exons are usually surrounded by large noncoding stretches.
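The relationship between the two coordinate systems can be sketched as follows. This guide does not state how VarSimLab computes exome positions, so the sketch assumes the exome position is the 0-based offset of a base within the target regions concatenated in order, with regions given as sorted, non-overlapping, half-open (start, end) ranges. The function name genome_to_exome is hypothetical.

```python
def genome_to_exome(position, targets):
    """Map a genome position to its offset in the concatenated target regions.

    Returns None when the position falls outside every target region.
    """
    offset = 0  # number of exome bases in the regions already passed
    for start, end in targets:
        if start <= position < end:
            return offset + (position - start)
        offset += end - start
    return None


if __name__ == "__main__":
    targets = [(1000, 1100), (5000, 5200)]
    print(genome_to_exome(5050, targets))  # 150
    print(genome_to_exome(3000, targets))  # None
```

Here genome position 5050 maps to exome position 150, because the 3,900 noncoding bases between the two regions do not count toward the exome coordinate.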