MATTHEW HARE: I'm going to give an overview of some of the molecular issues of generating data in a nonmodel context and of making decisions about what kind of data we want and what's the best way to collect it. There's a proliferation of options, so it's a little bewildering and can be overwhelming. With the amount of time I have, I'm not going to get into the details of any protocols, but I'll give you an overview and comparison of some of the most common protocols, plus some of the more recent ones that might be helpful in certain circumstances.
I already sort of said what I mean by nonmodel. Many of you, like me, have been working on these questions associated with historical demography and phylogeny for a long time, working with tens or dozens of neutral markers, and we've now got the ability to collect much more data. As I said before, that gives us the capacity to start looking at functional genetic variation. Whether we can make sense of it or not is another question.
And so part of it is putting it in the context of the theory that Andy talked about, part of it is collecting data in a smart way that's going to give us power in those tests, and part of it is going to be recognizing a gene when you've got a gene, the coding sequence, being able to annotate it, being able to understand the functional context. But this opens up some very exciting arenas: looking at adaptation, looking at evolutionary constraints, measuring gene flow across landscapes, measuring gene flow heterogeneity across genomes (islands of divergence), and looking at genotype-environment associations.
So this was old school. I, and a lot of you, have done a lot of single-gene work where we're using Sanger sequencing to get a single sequence from an amplicon. And within that, it's just the majority sequence, basically. So it's hiding any misincorporations. In pooled sequence from multiple individuals, it'd be hiding rare variants. And now with sequencing by synthesis, we have the ability to see all those variants, for better or worse. We're going to see all of the errors, whether they're from misincorporations or from sequencing error. But we're also able to see rare variants, if this is a population pool that's being sequenced. So we're getting individual reads from individual templates being sequenced.
So I want to just take a short moment to talk about sequencing genomes. Many of you probably have come across the opinion at your sequencing centers or somewhere that if not today, tomorrow, or next week, sequencing cost will come down, and you can sequence whole genomes. Don't bother subsampling. Or why are you going to do de novo? Just sequence the whole genome and assemble it, and then you'll have something to map your reads to.
Fair enough. That is where we're going. We hopefully will get there sooner rather than later, but many of us need answers now. And so my talk is mostly about just moving forward with subsampling and doing the best we can. You can do a whole lot more if you do have a reference genome. And then I'm going to go through a comparison of the reduced representation methods, that is, subsampling of genomes.
So people will tell you, sequence your genome, but it depends on the genome whether that is going to be easy or hard. It depends upon the repetitive content of the genome, and many times it's very hard. It's hard to make good mate-pair libraries, large insert sizes with adapters on the ends. And it's often hard, even if you have good libraries, to assemble genomes that are highly repetitive. And you don't necessarily know with a nonmodel until you get into it.
For a lot of the questions that we're interested in, the genome is full of DNA that's not going to be informative. If there's a lot of linkage disequilibrium, it may just be redundant variation. So we don't necessarily want to sample it. And finally, a lot of what we're going to be talking about here is optimizing our sampling, both in the populations and in the genome. And part of that is trade-offs: do we need a larger population sample, and where are we going to put our sequencing coverage. In many cases, we're going to want more samples, to sample more populations, more habitats, and get better population sample sizes, which, given constraints, is going to mean subsampling the genome.
So that said, when possible, if your genome is amenable and you've got the resources, it's going to be really valuable for you to sequence and assemble the genome, even if it's not a perfect assembly. And it won't be, the first time around, probably. In all these ways, it makes your bioinformatic work a whole lot easier and gives you the strength to do much more powerful tests, like extended haplotype tests, or to collect data in that framework. So enough said.
I'm going to go through several restriction enzyme-based reduced representation libraries and then briefly describe some other, more targeted approaches. And Andy already sort of motivated transcriptomes as a good first entry into a genome because, by homology, we can get annotations. And there'll be a lot of talk in this workshop about transcriptome data, but not exclusively.
So there's not one answer to these questions. And I'm going to be giving opinions up here that aren't necessarily shared by the other instructors. It depends on what you need and what your organism is like. Do you desire haplotype data to run some IM analyses or phylogenetic analyses, where there are assumptions that you have no intragenic recombination? You can't really assess that if you just have SNP data. Do you want to barcode individuals with a multiplex identifier, MID, or is it sufficient to get just SNP frequencies from pooled data? There are some efficiencies there, but that's an important trade-off. You get a lot more information when you've got individual data and individual genotypes. You can do a lot more with it.
When you're multiplexing multiple individuals, whether they're pooled or barcoded, it's really hard, even with good quantification of DNA, to get even coverage across them. So when you're figuring out just how much coverage you'll have for the number of lanes you can buy, expect that you're going to get dramatically uneven coverage, and account for it in thinking through these methodological choices. And finally, I've found in my lab, and I think in general, that pilot libraries are really helpful with a new species, just to make sure you're making these choices and getting the coverage right before you scale up and do all of your population samples. Almost goes without saying.
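To make that warning concrete, here is a rough simulation (my own sketch, not something from the talk) of per-individual depth in a multiplexed library. It models pooling bias as a lognormal factor with a chosen coefficient of variation; all parameter values are illustrative, not measurements.

```python
# Hypothetical sketch: how many individuals in a pooled library fall below
# a usable genotyping depth, if per-individual representation varies with
# a lognormal distribution (an assumption of this sketch, not the talk).
import math
import random

random.seed(1)

def simulate_coverage(n_individuals, mean_depth, cv, threshold):
    """Return the fraction of individuals whose mean depth falls below
    `threshold`, given pooling bias with coefficient of variation `cv`."""
    sigma = math.sqrt(math.log(1 + cv ** 2))        # lognormal shape
    mu = math.log(mean_depth) - sigma ** 2 / 2      # preserves the mean
    depths = [random.lognormvariate(mu, sigma) for _ in range(n_individuals)]
    return sum(d < threshold for d in depths) / n_individuals

# 96 barcoded individuals, nominal 20x mean depth, fairly typical unevenness:
print(simulate_coverage(96, mean_depth=20, cv=0.5, threshold=8))
```

Even a modest coefficient of variation pushes a noticeable fraction of individuals below a usable depth, which is exactly why budgeting for the nominal mean coverage alone is optimistic.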
So CRoPS is a method described in 2007. Alex, one of our instructors, Alex Buerkle, has used this and published with it. I have not; I've done some AFLPs. But the idea is you're either pooling initially or later after barcoding. And you're double digesting as in a classic AFLP approach; EcoRI and MseI are the classic pair of enzymes that are often used in AFLPs, both leaving sticky ends, so then you're putting on adapters. This is an AFLP figure, but these would be adapters that would allow for sequencing and barcoding if you wish.
And then, in these figures, I'm going to put in yellow the point in the protocol where you're making decisions about how narrowly to sample the genome. In general, these restriction-based methods are sampling quote "randomly" across the genome; that is, they're not targeted. They may be biased, more or less GC-rich, so they're not strictly random, but relative to a targeted approach they're randomly scattered through the genome.
And in this approach, in your PCR you're using a primer that's not just priming on the known adapter sequence that's flanking; you're adding one or two or three nucleotides that are going to prime into the unknown region. You don't know what the sequence is, but some subset of your fragments are going to have a C here that this G will be complementary to. Some subset will have an A there, and a T there. So you can very stringently subsample from the initial pool of fragments that you've got. And that's a very flexible approach for narrowing down the fragment pool that you're going to sequence.
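As a back-of-envelope illustration of this complexity reduction (my arithmetic, not a claim from the talk): with roughly even base composition, each selective nucleotide matches about a quarter of the fragments at that position, so n selective bases retain about (1/4)^n of the pool.

```python
# Illustrative sketch: expected fragments surviving selective amplification,
# assuming unbiased base composition at the selective positions.
# The starting fragment count is a hypothetical example value.

def fragments_retained(initial_fragments, selective_bases):
    """Expected fragments kept when each selective base matches ~1 in 4."""
    return initial_fragments * (0.25 ** selective_bases)

for n in range(4):
    print(n, round(fragments_retained(1_000_000, n)))
# 0 selective bases keeps the whole pool; 3 bases keeps ~1/64 of it.
```

This is why the approach is so tunable: one, two, or three selective bases lets you dial the fragment pool down by roughly 4-fold, 16-fold, or 64-fold.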
So, in the past, this has often been done creating small fragments that can be mostly sequenced on a 454, but there's no reason why it needs to be sequenced that way. It's amenable to single-end or paired-end Illumina short-read sequencing. The pros, as I see them: it's very easy, and you can start with relatively low amounts of DNA because you're using PCR to amplify it. There's this flexible extent of complexity reduction. And you could potentially use that selective PCR to bias your fragments towards GC-rich regions if you have a taxon where you suspect those are going to be more gene rich, for example. So you can use that in your favor to bias the sampling across the genome.
The cons: anytime we're using PCR, we've got the potential for misincorporations and for biases that are difficult to control. And this next one is going to crop up in every one of these restriction-based reduced representation methods. Our goal is to look at polymorphisms within the fragment, but if there are polymorphisms in the restriction sites themselves, and there always are to some extent, that's going to create allele dropout, or null alleles, where that fragment isn't produced from a particular chromosome because the restriction site isn't there. It's a polymorphic restriction site.
So RADseq, the high-throughput version as described by Baird in 2008, is a single digest, ligating on multiplexed P1 adapters, so it's a P1-P1 construct, and then typically pooling at that stage. Then you've got fragments, and the restriction enzyme that you choose is your primary determinant of how many fragments you're going to get. This method is basically designed to give you a sequenceable construct for every one of the sites in the genome for the restriction enzyme that you choose.
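A rough way to think about that enzyme choice (under a uniform-base-composition assumption that is mine, not a claim from the talk): a k-base recognition site is expected about once every 4^k bp, so the expected tag count is roughly genome size divided by 4^k.

```python
# Back-of-envelope sketch: expected restriction site counts under a
# uniform-base model. Genome size below is a hypothetical example.

def expected_sites(genome_size_bp, site_length):
    """Expected number of restriction sites if all bases are equally
    frequent: one site about every 4**site_length base pairs."""
    return genome_size_bp / (4 ** site_length)

print(round(expected_sites(1e9, 6)))  # a 6-cutter in a 1 Gb genome
print(round(expected_sites(1e9, 8)))  # an 8-cutter: ~16x fewer sites
```

Real genomes deviate from this because of GC content and repeats, but it gives the order of magnitude you are committing to before any pilot data.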
The way that's done, given that this construct is typically too big to sequence, is to randomly shear. So now you've got P1 on one end, you've got some intermediate fragments from the middle of a large fragment that don't have any adapters on them, and then you've got the other end with a single P1. At that stage, you size select (not a severe size selection, just selecting the portion that is amenable to sequencing on the platform you're using) and ligate on the P2 adapters, so now you've got constructs that are P1-P2. It's that subsample that you're going to sequence, and you can do either single-end or paired-end sequencing. So P2-P2 drops out, P1-P1 drops out, and anything without adapters, of course, drops out.
So in this very nice review by Davey, this figure shows an example of a pair of chromosomes with restriction sites. Note that the gray arrow shows a polymorphic restriction site, one example in this figure. After cutting, we've got these fragments with the restriction site in red, and shades of blue for the different samples. This restriction fragment did not get cut because of that polymorphism. And then with RADseq, what we're getting is a tag, a RAD tag, on the sequence associated with the restriction site, and we get a tag on both sides of that restriction site, sort of a duplicate tag, because you're getting both flanks.
And because they're tightly linked, they provide redundant information in many cases. We don't get this one; this is an example of allele dropout. We're not getting that restriction site tagged because it was polymorphic and didn't exist in that sample. We're also not getting the tags associated with really long fragments... no, that's not true, sorry, we do get the tags there for the RAD tag. And what these lines indicate is that if you're doing paired-end sequencing and trying to assemble contigs, you can assemble some of the sequence in the intermediate region of that fragment.
So comparing that to GBS, described by Elshire and developed here in the Buckler lab: again, the standard protocol is a single digest, ligating here on the barcodes with a P1, and on the other end, a common adapter. So your P1s all have different barcodes; the common adapter you only have to buy once, and you use it across all the different barcodes. And then you're skipping all these steps: you're pooling, skipping these steps, and amplifying down here, from the P1 to the common adapter, and sequencing.
The complexity reduction is, again, based upon enzyme choice, but it's also by PCR, since that's going to be biased towards short amplicons, which get amplified more efficiently. And then the sequencing has efficiencies such that you can throw long fragments at it and just not get sequence out, and it doesn't apparently diminish the efficiency of the sequencing; that is, you get just as many reads of your short fragments. So they haven't found that to be disruptive. So, without having to take extra lab steps, this is also simplifying the complexity of the fragment pool in GBS, which makes it very quick.
And here, the end result is that you're not sequencing tags associated with every restriction site in the genome. So the goals are kind of different. With RAD tagging, you're trying to get a complete sampling of restriction sites for that particular restriction enzyme, whereas with GBS you're subsampling based upon fragment size. What I see as the pros of RAD: again, you're doing that complete sampling, though there are a lot of steps in the protocol, and you get this two-fold redundancy because you've got both flanks of each restriction site.
Davey just had a nice report showing that there are biases from the shearing: no matter which of the common methods for randomly shearing DNA you use, there's ineffective shearing of small fragments relative to large fragments. And so you get biases in how much coverage you have relative to fragment size. You also get allele dropout, as you do with any restriction-based library construction. GBS is a much easier lab protocol, but there may be reductions in efficiency from allowing the complexity reduction to happen through the idiosyncrasies of PCR and sequencing efficiencies rather than doing it discretely and precisely with a gel size selection, and, again, allele dropout.
So a variation of the RAD tagging approach, described by Peterson at Harvard, is double digest RAD. To my eye, it should have been called ddGBS, because it's basically GBS with a double digestion and a gel size selection. And some of what they took pains to show is that if you do this gel-based size selection very precisely, with a Pippin Prep or in other ways, there are benefits: it appears to give you greater homogeneity of coverage across individuals and, therefore, less missing data for a given level of coverage. Unless you're just maxing out on coverage; then all the methods are similar.
So let me show you the double digest. What they're suggesting is that you have a lot of flexibility in choice of enzymes. The purple line is two 4-cutters, so you're getting primarily short fragments. These are simulations, an in silico description of the fragment size distribution from the closest reference genome you have to work with. In general, they show that this works pretty well: these size frequency distributions for fragments are generalizable across genomes with respect to what you need for sampling in this method.
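An in silico digest like the one behind those curves can be sketched in a few lines. This toy version (entirely illustrative: random toy genome, arbitrary recognition sequences and gel window, and cut offsets within each site ignored) finds cut positions for two enzymes and keeps fragments whose two ends come from different enzymes, the class that a double digest protocol sequences.

```python
# Toy in silico double digest: tabulate fragment sizes with one end from
# each enzyme, then count how many fall in a hypothetical gel window.
import random
import re

def double_digest_sizes(genome, site_a, site_b):
    """Sizes of fragments bounded by one site_a cut and one site_b cut."""
    cuts = sorted(
        [(m.start(), 'A') for m in re.finditer(site_a, genome)] +
        [(m.start(), 'B') for m in re.finditer(site_b, genome)]
    )
    return [p2 - p1
            for (p1, e1), (p2, e2) in zip(cuts, cuts[1:])
            if e1 != e2]  # keep only mixed-end (A-B) fragments

random.seed(0)
toy_genome = ''.join(random.choice('ACGT') for _ in range(200_000))
sizes = double_digest_sizes(toy_genome, 'GAATTC', 'TTAA')   # 6- and 4-cutter
in_window = [s for s in sizes if 100 <= s <= 400]           # gel excision
print(len(sizes), len(in_window))
```

Swapping the recognition sequences (two 4-cutters versus a 6-cutter plus a 4-cutter) and the window bounds reproduces the kind of trade-off the purple and other curves illustrate: enzyme choice plus excision width sets how many loci you commit to.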
And so if you take a wide cut of this fragment size distribution, again the purple, with almost 5 million reads you would get this blue region of really good coverage: 200,000 regions, or tags, at 7x in here. Out at the edges of your gel excision, you're getting those fragments for some individuals and not for others, so that's where you're going to have more missing data; they focus their expectation for coverage on that middle part.
If you're using a 6-cutter and a 4-cutter, you're going to have a very different size distribution that you're sampling from in the gel. Here, you're designing this to get many fewer loci, because you're trying to spread your coverage out across many more population samples. And again, your actual gel excision is broader than the region where you're going to get even, high coverage.
Compare that to GBS. This is from the Elshire paper, showing the in silico results for the maize reference versus the fragment size distribution that they got out of a GBS protocol. What the sequencer ended up sequencing was largely 100 to 200 base pair fragments. It sequenced all this too, but this is what's enriched relative to the size distribution you see in the reference. So they're letting the PCR and the sequencer do the size selection.
So what the double digest RAD folks, Peterson et al., argue is that with precise gel isolation, you can get much more even coverage among individuals. At low coverage, in reads per individual, you start to see regions that are sampled for most of your sample size, so less missing data at low coverage compared to a strict RADseq protocol with a single digest. And this was not only an impact of the precise gel excision, but also of the double digest.
So the pros, as I see them: the fine-tuning of the complexity reduction is pretty flexible, both by enzyme choice and the combination of two enzymes, as well as the size selection window. At low coverage, obviously, there are advantages to reducing missing data and getting greater uniformity across individuals. And this approach, double digest RAD, is only sampling maybe 5% of all the restriction sites in the genome; you're no longer trying to get every one. It can get flanking tags on both sides of a site, but the likelihood of that is very low. So, in general, it's not doing as much redundant sampling at each restriction site.
Both of these cons have to do with allele dropout. They're going to scale with heterozygosity, and they scale with the combined length of the two restriction sites. The longer the restriction sites, the more chance you have of a polymorphism within them in your population samples; the higher the heterozygosity, the more chance you have of a polymorphism in your restriction sites. Either way, you're going to get allele dropout and null alleles. You also have the possibility of allele dropout if there's not a polymorphism within a restriction site but instead a novel restriction site within your fragment. So there are a couple of different ways you can get these allele dropouts.
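That scaling argument can be made explicit with a simple model (my own sketch, with illustrative parameter values): if each base of the combined recognition sequence is polymorphic independently with probability theta, the chance of at least one polymorphism, and hence a null allele at that locus, is 1 - (1 - theta)^L.

```python
# Hedged sketch of the allele dropout scaling: per-site polymorphism
# probability theta and combined restriction site length L are
# illustrative parameters, not estimates from the talk.

def dropout_probability(theta, combined_site_length):
    """P(at least one polymorphism among the recognition site bases),
    assuming independence across sites."""
    return 1 - (1 - theta) ** combined_site_length

for theta in (0.001, 0.01):
    for L in (10, 12):   # e.g. a 6+4 vs a 6+6 cutter combination
        print(theta, L, round(dropout_probability(theta, L), 4))
```

The table this prints shows both effects the talk describes: dropout risk rises roughly linearly with site length when theta is small, and roughly linearly with theta itself, so high-heterozygosity taxa with long-site enzymes are hit hardest.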
A recent development, in 2012 by Wang et al., is 2b-RAD. It uses type IIB restriction enzymes that recognize a particular sequence, a total of six base pairs but a broken-up recognition sequence, and cut out the same length fragment every time they see that recognition site. So these data are short fragments, only 32 or 33 base pairs, but they're all the same size, which is convenient in some ways. They found that de novo clustering was feasible; it's going to depend upon a lot of factors, like how repetitive your genome is and how heterozygous your genome is.
So the pros: it's an extremely rapid library prep without a lot of cleanup steps; they say four hours. You can easily adjust the density of your sampling with selective amplification, just the same as AFLPs. And you can get Illumina sequencing of only 35 base pairs, which costs less, at least at our facility, so you're paying for only the sequence that you need. But the con is that you're not getting away from allele dropout.
So, this is what I asked Andy about. You can get paired-end data with any of these protocols in order to assemble, around a RAD tag, the haplotype, or at least the sequence flanking the tag. This can be really valuable if you've got a low-polymorphism species, where the tag itself may not have as much variation as you need, or if you want to try to infer haplotypes.
So, really quickly, targeted methods: you either go the old route, where you've got primers for some nuclear genes, or you can develop them, and you're just multiplexing those amplicons. That's becoming easier with the MiSeq, and you can get the paired-end 250 reads so that you can actually get a 500 base pair sequence. But this is pretty laborious to do gene by gene and then assemble. Nonetheless, if you can share runs, it may be the way to go for a small number of loci.
Sequence capture: a lot of expense up front. If you're not going to continue to work on the system, it probably doesn't make sense. But this company in particular, I think, has much better costs than some of the others. And this study showed that it works OK in many cases to design baits from a species that's 10% or 12% divergent, so in many cases you can use what's out there to design your baits. The approach we're taking in my lab is to use a RAD tagging or GBS type of approach in a mapping family to get mapped markers, then get baits made for those mapped markers and proceed with those as the most useful markers to genotype in the natural populations.
And finally, transcriptomes. I'll be brief here because this will come up a lot in the workshop. Obviously, there's a huge reduction in genome complexity, because coding sequences are typically just a few percent of large genomes. You're targeting the functional fraction, so if that's what you're interested in, go for it. It's also the most easily annotated. You're not going to take most of your genomic RAD tags and find anything homologous to them in the databases; so if one is showing an interesting pattern in a selection test, OK, where do you go from there? It's still useful, but you won't necessarily be able to track down why it's showing that pattern using the genomic databases.
Because coding sequence for functional genes is not going to accumulate indels the way noncoding sequence does, you have the bioinformatic benefit of relatively few indels when sequencing cDNA and working with coding sequence, and you have a predictable codon structure to work with. There's also the potential to combine the population genetics you get from SNP calling on RNA-seq data with RNA-seq studies measuring gene expression as a phenotype of interest. So there's a lot of potential there.
The cons, in terms of RNA-seq for calling SNPs, not for gene expression: you need to be more careful in terms of sampling in the field and getting those samples into RNAlater or TRIzol, so you need to put more thought into that sampling. As Andy alluded to, you may only get good data across all your populations for the highly expressed genes, the housekeeping genes that are expressed everywhere under all the different conditions in which you sampled. And then there's another possibility; we still have a lot of work to do to evaluate how much of an issue this is going to be.
Allele-specific expression could generate data in which an allele is present in a population but is not being expressed. That's interesting regulatory variation, but it's going to influence the data you're working with in ways you're not necessarily going to be able to sort out. And alternative splice variants are going to contribute a lot of the variation in the data; that comes into play in assembling it and interpreting what the assembly means, and also in deciding what variation you want to score and try to understand across the populations.
So, I've posed all of these questions before. There are different forks in the road, and it really depends on what your question is, what your organism is, and what your resources are. There are a number of different views, I think, about how to move forward, represented among the instructors and probably among you. So I think it'd be fun to flesh those out in a discussion period.
Matthew Hare, associate professor of natural resources, gives an overview of reduced representation strategies for genomic sampling, as part of a population genomics workshop at Cornell University, July 23-24, 2013.
Population genomics involves sampling, financial, and bioinformatics trade-offs, so proper experimental design requires understanding probability, sequencing technologies and evolutionary theory and how they relate to research trade-offs. The workshop, "Next Generation Population Genomics for Nonmodel Taxa," explored the strengths and weaknesses of different approaches to genome sequencing and bioinformatics when studying population genomics in nonmodel species.
Go to the workshop website for information associated with these videos, including lecture notes, descriptions of exercises, and computer code. The website is a site for ongoing learning about methods for population genomics.