Pre-Cancer Genomics


CNAnorm is a Bioconductor package to estimate Copy Number Aberrations (CNA) in cancer samples.

It is described in the paper:

Gusnanto, A., Wood, H.M., Pawitan, Y., Rabbitts, P. and Berri, S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next generation sequence data. 2012. Bioinformatics, 28(1):40-47

CNAnorm performs ratio calculation, GC content correction and normalization of data obtained using very low coverage (one read every 100-10,000 bp) high-throughput sequencing. It performs a “discrete” normalization that looks for the ploidy of the genome. It also estimates tumour content if at least two ploidy states can be found.


Get the latest (recommended) version of CNAnorm and its documentation from Bioconductor. Linux/Unix users might need a Fortran compiler and make to compile the latest version. If you install it using biocLite("CNAnorm") from within R, you will install the latest release version.

You can also obtain the perl script bam2windows.pl (or the latest version from googlecode) to convert sam/bam files to the text files required by CNAnorm. For documentation on usage, run the script without arguments:

perl bam2windows.pl

For further information on both programs, please contact Stefano Berri

Related software

NGSoptwin is an R package designed to choose the optimal window size for CNAnorm. It is available here.

Additional data files

GC content

We provide gc1000Base.txt.gz, an example file for GC content (build GRCh37/hg19) to optionally use with bam2windows.pl. It provides average GC content every 1000 bp. The size of the window in the GC content file should be at least an order of magnitude smaller than the window used for CNAnorm to minimise boundary effects. If you require higher resolution, you can download the gc5Base tables from UCSC and/or make your own. The smaller the window size in the GC content file, the larger the file will be, and the longer bam2windows.pl will take to process it.
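
For readers who want to make their own GC content file, the following is a minimal sketch of averaging GC content over fixed-size windows. The function name gc_windows, the plain-sequence input (bases only, no FASTA header) and the two-column output are assumptions made for illustration. They are not the layout of gc1000Base.txt.gz or necessarily what bam2windows.pl expects, so check the script's documentation before adapting it.

```shell
# gc_windows SIZE
# Hypothetical helper: reads one sequence (plain bases, no FASTA header)
# on stdin and prints "start<TAB>GC percent" for each window of SIZE bases.
# The output layout is illustrative only.
gc_windows() {
    awk -v size="$1" '
        { seq = seq toupper($0) }            # join all lines into one sequence
        END {
            n = length(seq)
            for (start = 1; start <= n; start += size) {
                win = substr(seq, start, size)
                len = length(win)            # last window may be shorter
                gc = gsub(/[GC]/, "", win)   # gsub returns the number of G/C bases
                printf "%d\t%.1f\n", start - 1, 100 * gc / len
            }
        }'
}

# Example: three windows of 4 bases each
printf 'GGCC\nAATT\nGCAT\n' | gc_windows 4
```

With a window of 1000 this gives one value per 1000 bp, which is the resolution of the example file. A smaller window gives finer resolution at the cost of a larger file.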

LS041 bam files

We provide the bam files used to produce the dataset included in CNAnorm

LS041_tumour_500K.bam (28 MB)
LS041_control_500K.bam (27 MB)

They contain 500,000 reads randomly extracted from the following larger, unsorted files:

LS041_tumour.bam (139 MB)
LS041_control.bam (130 MB)

To produce the text file used as input for CNAnorm, enter the following:

perl bam2windows.pl --readNum 50 --gc_file gc1000Base.txt.gz LS041_tumour_500K_sorted.bam LS041_control_500K_sorted.bam > LS041.tab

It will produce the file LS041.tab.
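
The command above reads files named with a _sorted suffix, while the downloads listed earlier do not carry that suffix. Assuming the _sorted files are simply coordinate-sorted copies of the downloaded bam files (this page does not say how they were made), they could be produced along these lines. The helper names sorted_name and sort_bam are hypothetical, and the samtools sort -o form shown applies to recent samtools 1.x releases; older versions use a different syntax.

```shell
# sorted_name FILE.bam  ->  prints FILE_sorted.bam
# (hypothetical helper; the naming only mirrors the command shown above)
sorted_name() {
    printf '%s_sorted.bam\n' "${1%.bam}"
}

# sort_bam FILE.bam
# Assumption: a plain coordinate sort is what is needed.
# Requires samtools on $PATH.
sort_bam() {
    samtools sort -o "$(sorted_name "$1")" "$1"
}

# Usage (needs samtools and the downloaded files):
#   sort_bam LS041_tumour_500K.bam
#   sort_bam LS041_control_500K.bam
```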

You need samtools installed in a directory in your $PATH if your input files are in bam format.
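
A quick way to confirm this before running bam2windows.pl is to ask the shell whether it can find samtools. This is a small sketch; the helper name have_tool is hypothetical.

```shell
# have_tool NAME: succeeds if NAME can be found on $PATH
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool samtools; then
    echo "samtools found: $(command -v samtools)"
else
    echo "samtools not found on \$PATH; bam input will not work" >&2
fi
```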


