A major obstacle (and opportunity) in the quest to understand the human genome is its variability from person to person. Individual genomes can differ greatly, from single-letter changes to complex structural differences over chunks of up to 1,000,000 base pairs of genetic code.
To analyze these complex variations, researchers at Harvard Medical School developed a novel set of molecular and statistical tools to scrutinize a particularly intriguing region of the genome associated with female fertility and neurological disease.
Their work, published online July 1 in Nature Genetics, gives scientists an unprecedented view of the variability of the human genome, with potential applications that range from predicting risk of disease to illuminating our evolutionary history.
Traditionally, studies of genome variation relied on comparisons of single base-pair differences between individual genomes. More recently, scientists identified much larger differences, known as copy number variations (CNVs), where regions of up to 1,000,000 base pairs could be either missing or present in extra copies. These regions potentially encode several genes, and duplications or deletions have been shown to be strong risk factors for diseases such as autism, schizophrenia and developmental delay.
However, scientists lacked the tools to analyze regions in the human genome where large-scale variation is more complex than the simple gain or loss of a segment. For example, one individual’s genome might have seven copies of a gene where another’s has five, each sandwiched by other combinations of duplicated or missing genes within a large area that differs as a whole. These wildly varying permutations make studying these areas of the genome extremely difficult—like trying to compare giant decks of shuffled cards.
“These complex regions are messy parts of our genomes, and we haven’t been able to get our heads around them until now,” said Steven McCarroll, assistant professor of genetics at Harvard Medical School.
So McCarroll and his team, including first author Linda Boettger, a graduate student in biological and biomedical sciences, developed a suite of techniques that drew on molecular biology, population genetics and statistics to reveal the physical structures underlying large-scale variations between individual genomes.
To measure the copy number of genome segments in large populations, the team refined a method of counting short chunks of DNA by amplifying them in nanoliter-sized droplets, a technique called droplet-based digital polymerase chain reaction, or ddPCR. They targeted specific genes for amplification with a fluorescent “reporter” molecule and subjected the regions of interest in each genome to 20,000 simultaneous ddPCR reactions. By counting the fluorescence-labeled droplets, they could determine the number of copies of a gene present in each genome, carefully calibrating these measurements against control genes.
In parallel, the researchers developed an algorithm that compared hundreds of genomes, using whole-genome sequence data from each genome to inform the analysis of the others. With this algorithm, they were able to accurately identify copy-number variations, which were independently verified by ddPCR.
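The droplet-counting step described above can be illustrated with a short calculation. The sketch below is not the team's code. It assumes the standard Poisson correction used in digital PCR (a droplet can hold more than one template molecule, so the mean number of copies per droplet is estimated as -ln(1 - fraction of positive droplets)) and a control gene known to be present at two copies per diploid genome. The function names, parameters and droplet counts are hypothetical and chosen only for illustration.

```python
import math


def mean_copies_per_droplet(positive_droplets: int, total_droplets: int) -> float:
    """Poisson-corrected mean number of template copies per droplet.

    Assumption: templates are distributed randomly across droplets, so the
    fraction of empty droplets equals exp(-mean).
    """
    if total_droplets <= 0 or not 0 <= positive_droplets < total_droplets:
        raise ValueError("need 0 <= positive_droplets < total_droplets")
    return -math.log(1.0 - positive_droplets / total_droplets)


def estimate_copy_number(
    target_positive: int,
    control_positive: int,
    total_droplets: int,
    control_copies: int = 2,  # assumed: control gene has 2 copies per diploid genome
) -> float:
    """Copy number of the target gene, calibrated against a control gene."""
    target = mean_copies_per_droplet(target_positive, total_droplets)
    control = mean_copies_per_droplet(control_positive, total_droplets)
    if control == 0.0:
        raise ValueError("control gene produced no positive droplets")
    return control_copies * target / control


if __name__ == "__main__":
    # Hypothetical counts out of 20,000 droplets, for illustration only.
    estimate = estimate_copy_number(
        target_positive=6594, control_positive=3625, total_droplets=20000
    )
    print(f"estimated copies per genome: {estimate:.2f}")
```

In practice the raw estimate would be rounded to the nearest whole number and checked against replicate measurements. The calibration against a control gene is what makes the result independent of how much DNA was loaded into the reaction.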
To demonstrate the efficacy of their tools, the team examined a structurally complex region of chromosome 17, running 946 unrelated genomes from the 1000 Genomes Project through their algorithm and analyzing 120 parent–offspring trios from HapMap by ddPCR. The region, known as locus 17q21.31, contains genes associated with female fertility and risk of Parkinson’s disease.
Although 17q21.31 had long been thought to exist in two structural forms in the general population, McCarroll and his team found that it in fact exists in nine distinct structural forms, each differing in the gain, loss or rearrangement of more than 100,000 base pairs of genetic code. All populations they looked at, including individuals of African, European and Chinese ancestry, carried at least four of the nine forms.
Interestingly, five of these variants were found primarily in genomes from West Eurasia, which includes Europe, India and the Mediterranean region. Four in particular were carried by over 40 percent of the European genomes analyzed, and were found to contain similar functional duplications of a portion of a gene known as KANSL1. McCarroll and his co-authors hypothesize that this gene, previously shown to regulate the timing of female fertility in fruit flies, may have a similar biological function in humans.
“Genetic variants can rise to high frequency by random chance, but observing this twice, in very similar KANSL1 duplications, supports the idea that they are evolutionarily beneficial,” said Boettger.
“It will take much more research, but KANSL1 immediately goes to the top of the list of genes that scientists would want to study in relation to the timing of fertility in humans,” McCarroll said. “This wasn’t necessarily about figuring out any one disease, but about understanding a general question about how genomes evolve over time, and how genomes vary in human populations. The work now makes it possible to relate such complex genome structures to clinical phenotypes, by providing the first set of molecular and statistical tools for doing so.”
His lab has already begun studying variations in other complex regions of the human genome.
This work was supported by a Smith Family Award for Excellence in Biomedical Research to S.A.M., by the National Human Genome Research Institute (U01HG005208) and by startup resources from the Harvard Medical School Department of Genetics.