Advances in DNA sequencing technology over the past decade have had an enormous impact on the approach to discovering cancer genes. Vast international collaborations such as The Cancer Genome Atlas (TCGA), which include a number of laboratories in Vancouver, have been formed with the goal of generating comprehensive lists of genes that drive the progression of cancer. The approach is as straightforward as the expectations are intuitive: by sequencing ever more matched tumor-normal samples (to date, more than 5000 have been sequenced), driver mutations should begin to rise above the background “noise” – the random passenger mutations that occur throughout the rest of the genome. These mutations should converge to a small list of verifiable drivers, which can then be analyzed further for their therapeutic potential.
Instead, the opposite appears to be occurring. With each new sequencing study released, dozens more significant drivers are discovered, to the point that the list of “driver mutations” has burgeoned into the hundreds. The sheer number of potential therapeutic targets makes validating them a daunting task, which in turn stunts opportunities for drug discovery: faced with a hundred genes each mutated in only 1% of cancers, how could any biotech or pharmaceutical company justify investing in any one of them as a new target? Choosing a target for development is even harder to justify when the biological mechanism is difficult to explain, or even to fathom – such as the discovery, in a recent study of squamous cell lung carcinoma, that 101 of 450 significantly mutated genes encoded olfactory receptors, and that the list was significantly enriched for genes encoding extremely large proteins, such as the muscle protein titin.
A recent study from the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, sheds some light on these seemingly strange phenomena. Most sequencing studies assume a uniform mutation rate across the genome, and use mutation rates for each category of mutation from previously published data for the type of cancer in question. However, using a brief thought experiment in which some genes have a high mutation rate and others a low one, Lawrence et al. show that this assumption quickly leads to overcounting the highly mutated genes and undercounting the less frequently mutated ones. Their re-analysis of mutational heterogeneity across 27 forms of cancer (encompassing 3000 tumor-normal pairs) showed that the mutation rate within a single cancer type varies by up to four orders of magnitude, invalidating the assumption of a uniform mutation rate within a tumor type. The most significant sources of heterogeneity appeared to be expression level and replication timing: Lawrence et al. found that highly expressed genes were less frequently mutated (which they attributed to transcription-coupled repair), while genes lying in regions of the genome replicated later in S phase were mutated more frequently (which they attributed to a decline in the nucleotide pool). Using their newly developed analysis method, MutSigCV, a re-analysis of a lung cancer sequencing study saw the list of highly mutated genes shrink from 450 to 11, while still identifying some new biology. Such methods to eliminate noise in cancer sequencing efforts will go a long way to furthering anticancer drug development, and demonstrate that cancer-sequencing biologists and bioinformaticians work best when they work together!
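The thought experiment can be made concrete with a small simulation. The numbers below are hypothetical (not taken from Lawrence et al.), and the exact binomial tail function is a helper written for this sketch: a gene whose true passenger mutation rate is merely ten times the genome-wide average, when tested against a uniform background rate, comes out looking like a highly significant “driver”.

```python
# Minimal sketch of the uniform-background-rate pitfall.
# All rates, lengths, and cohort sizes here are hypothetical.
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as an exact tail sum."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_patients = 500
gene_length = 3000          # coding bases (hypothetical)
assumed_rate = 3e-6         # uniform per-base rate a naive test assumes
true_rate = 3e-5            # this gene's actual passenger rate, 10x higher

# Per-patient probability of carrying >=1 mutation, approximated as rate x length
p_assumed = assumed_rate * gene_length   # 0.009 under the naive model
p_true = true_rate * gene_length         # 0.09 in reality

observed = round(p_true * n_patients)    # roughly what sequencing would find
p_value = binom_sf(observed, n_patients, p_assumed)

print(f"observed mutated patients: {observed}")
print(f"p-value under uniform-rate assumption: {p_value:.2e}")
# The vanishingly small p-value flags a pure passenger gene as a "driver" -
# exactly the overcounting Lawrence et al. describe. Correcting the background
# rate per gene (e.g. using expression level and replication timing as
# covariates) removes this artifact.
```

With these toy numbers the gene is mutated in roughly 9% of patients, yet the naive test expects under 1%, so the spurious significance is enormous; this is the same mechanism by which long, late-replicating genes such as titin and the olfactory receptors dominated earlier driver lists.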