SPEAKER 1: This is a production of Cornell University.
ANDY CLARK: So it's an enormous pleasure to introduce the A. D. White Professor speaker this evening, J. Craig Venter. It was fairly easy to get enough material to put together an introduction for Craig, having just read his book, A Life Decoded, a book that I encourage you all to study. In fact, the challenge, of course, is to winnow it down to the finite amount of material that can give you some sense of who this remarkable person you're going to be hearing from this evening is.
In many ways, he really is an exceptional human being. Started out as a surfer bum, and all his life, he's always been sort of a maverick. And I expect he's going to get--
I expect he's going to get a little mavericky on us today.
Anyway, in 1972 Craig got his bachelor's degree in biochemistry at UCSD. In 1975-- that's three years later-- he got his PhD, also at UCSD, in physiology. And then he got a faculty position at the State University of New York at Buffalo, where he was until 1984, getting intimately familiar with snow.
He then went from Buffalo to join the NIH, which is where he really started to get engaged with working at a genomic level. That's where he developed the method for sequencing cDNAs at very high copy number, at a genomic scale, a method known as expressed sequence tags or ESTs.
And that became a method he was pushing as a means of understanding the entire transcribed genome of organisms. He got into a little bit of trouble for patenting some of these, but nevertheless it became a method that is widely used to this day. In fact, if one ever sequences a complete genome, having sequences of those expressed sequence tags is a remarkably effective means of identifying the transcribed gene set.
In 1992 he organized the company-- or the institute called The Institute for Genome Research or TIGR. And at TIGR we got the first glimpse of the sort of scale that his mind works at, the ability to imagine an institute where there was this combination of computer scientists and molecular biologists tackling really large-scale problems sort of in concert.
There aren't too many papers where, decades later, you remember the day you read them, but his 1995 paper from the TIGR group, where they assembled the Haemophilus genome sequence entirely by this method, random shotgun sequencing, is one-- I remember the day that paper came out.
And it was remarkable, because it really showed the fantastic power of fusing computational methods and molecular biology. Simply by sequencing random fragments from the genome, shoving them into a computer, and devising the appropriate algorithms to assemble them, we would have a remarkably more efficient means of assembling genomes.
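The core idea, sequencing random fragments and letting an algorithm stitch them back together by their overlaps, can be sketched with a toy greedy assembler. This is illustrative only; the assemblers built at TIGR and Celera were vastly more sophisticated, handling sequencing errors, repeats, and millions of reads.

```python
# Toy illustration of shotgun assembly: repeatedly merge the pair of reads
# with the longest suffix-prefix overlap. Real assemblers are far more
# elaborate; this just shows the principle.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; remaining fragments stay separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads

# Random overlapping fragments of a short "genome":
reads = ["ATGGCGTGCAA", "GTGCAATTGCC", "TTGCCAGGTAC"]
print(assemble(reads))  # -> ['ATGGCGTGCAATTGCCAGGTAC']
```

The greedy merge recovers the original sequence whenever the overlaps are unambiguous; repeats are exactly what makes real genomes so much harder.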
And the rest has been history. A short while after that in 1999 there was a realization that, well, this probably could scale up. We might be able to do the entire human genome by random shotgun assembly. And in that year, 1999, they organized-- Craig organized the company Celera Genomics, and within a month or so, got $1 billion of market capitalization for that enterprise.
That group worked in collaboration with Applied Biosystems and Compaq. They assembled a computer that was larger than the World Bank's, and banks and banks of these brand new capillary sequencers that ran multiple lanes all at once, and started that project.
The remarkable thing about Celera Genomics was Craig's realization that this isn't really your standard, run-of-the-mill molecular biology; it's an industrial process. And as an industrial process, who did he turn to for help figuring out how to make it work but industrial engineers. So they modeled the entire system, from picking the original clones, through PCR amplification and sequencing, to dealing with those sequence fragments, as an industrial process, and understood the entire dynamics of where there might be bottlenecks in the flow of that information and where there might be problems with quality control.
They had full industrial-scale quality assurance and quality control mechanisms built into that system. It was really an incredible thing to see that. And pretty soon, people realized that this really was the future of the way genomics would be done by scaling things up in that sort of industrial fashion. And of course, as you know, it worked remarkably well.
The other thing that was really remarkable about Celera, which was different from any public effort for genome sequencing, was that from the very start, they did it with five individuals, so that they would get polymorphism data across the entire genome.
I had the great good fortune of being invited as a consultant for Celera Genomics at that time, starting in February 1999, largely because of this realization that really there was a population genetics kind of problem right from the very beginning. And I remember the first simulations we did, trying to estimate how many single nucleotide polymorphisms we should see and so forth. So that was really a tremendous joy for me to be part of the whole enterprise.
Anyway, the draft sequence of the human genome was done in the year 2000, published in the year 2001. It was about the same time as the public effort. And that method of sequencing by random shotgun, which at the beginning was widely assumed by many, many geneticists to be doomed to fail, is now, of course, the de facto standard for genome sequencing.
Craig was fired from Celera Genomics in 2002 and used that as an excuse to go on and do even greater things. Formed the J. Craig Venter Institute, which gave him the independence to pursue all kinds of interesting and novel problems, thinking about where genomics would take us, the study of metagenomics. What are all the organisms that are out in the sea? The microbial world, that we know we can only culture less than 1% of, what's out there? So he's using his fertile imagination to pursue problems of that sort.
That institute also organized an effort to synthesize life, and that's largely the topic of tonight's lecture. And as a result of imagining what we might do if we had the ability to actually generate sequences that could then become living organisms, one of the prime applications, of course, is to think about the major problems in the world, one of which is the problem of energy. And so the opportunity for really fine-tuning efficient production of biofuels became the focus of the company Synthetic Genomics, which was also formed by Craig.
Craig was put on Time's 2007 and 2008 lists of the 100 most influential people in the world, and so we're especially delighted that he could come and be our A. D. White professor. He actually is taking the time, with three different courses, to sit and have discussions with students. This is all the more remarkable given the comment Craig made last night that in the past six years, he's never spent more than a few hours in any one spot. The roughly two days that he's spending at Cornell University are going to smash all records for his time staying in one place. So this is an enormous honor for us. And please welcome J. Craig Venter.
CRAIG VENTER: Well, thank you for the extremely warm introduction. The only thing I really take exception to is, after the last election, I don't like being called a maverick anymore.
I think John McCain and Sarah Palin ruined that for mavericks everywhere.
So we have to come up with some new term. I don't know what it's going to be.
It is a pleasure to be back here. I think it was about 10 years ago was my last trip to Cornell and I don't recognize anything on the campus. I don't know if that's my age, or the campus changed a lot, or both. But it's exciting to be back here.
And I think I accepted this professorship five years ago and it's been a rough negotiation ever since then to find the right time to get me here. So I'm delighted. I wanted to wait till we really accomplished something to come and so it's nice to finally be here.
Andy gave you sort of an overview, in fact, of what I'm going to talk about, going through some of these things. But let me mention a little bit about the Venter Institute. So a couple of years ago, we merged my foundation and the various not-for-profits, including TIGR, into one institute. For a while, I actually had five not-for-profit research organizations running, and they all had basically the same board. And they said, we thought you were a smart person. And I said, yeah, of course I am. They said, well, you're not smart, because you have five boards. So they merged everything together.
And this is the campus in Rockville, Maryland. We have five buildings there. And this is the building in La Jolla, California. We've taken over the building of the former Allergy and Immunology Institute near the UCSD campus.
But we're actually designing and going to be building a building right on the UCSD campus. We're trying to put some of our thoughts into action with the first zero-carbon research building, and they gave us a nice spot overlooking the ocean. So that same building probably wouldn't work here in Ithaca, but the climate in La Jolla is certainly prime for doing this. And the Synthetic Genomics company, which is trying to do some of the environmental applications, is also in La Jolla.
So maybe you got out of what Andy covered in the introduction that we like to look at big issues and big questions. It's either a big question or it's a small question. What is life? It's something that certainly every one of you has asked yourselves at times. At various stages, after alcohol or other substances, you might have come up with answers. I don't know.
We've tried to take a unique approach to this: using reductionist biology, can we pare life down to its most basic components? A key element of this is, can you digitize biology? And whether we can regenerate life out of that digital world became one of the corollaries.
So the first parts are pretty easy. Digitizing biology is what we've been doing for the last few decades on reading the genetic code, going from this analog world of biology, reading the genetic code into the computer.
And this started with some pretty simple concepts back in the work at NIH. This is an article that several of us wrote for the 10th anniversary of Nature Genetics, because we published our first papers on human genome sequencing in the first issue of Nature Genetics. And it was concepts like randomness that have sort of permeated all the work we're doing. Massive parallelism certainly helped, in terms of the computing.
And while everybody was wringing their hands when the early automated sequencers didn't do everything, I had a very profound concept: if we wanted to double our throughput, we just got a second machine. That really seemed to confound geneticists at the time. But we were able to scale up; when Andy was working with us at Celera, at one time I think we had up to 500 machines. We had to get a lot of extras, because they failed at a rate of about 10% a day, and so we had to have all these spares.
But I think the concept of randomness is the most important going through things. After sequencing some regions of the human genome as a test, we found we couldn't interpret that sequence information. Finding genes algorithmically worked nicely in E. coli, but it didn't work at all in humans, whereas if we had a cDNA clone, it was very easy just to align it and know where the genes were.
So the EST method was an idea I had for just randomly getting a whole lot of cDNAs very quickly, to interpret the human genome. And after a few years of doing this, we published this paper in Science in 1991, and it's something that did catch on, with a little bit of a lag phase. Ten years after the first one, there were over 10 million ESTs in this new division of GenBank, dbEST.
Then it started doubling quite quickly. The most recent release that I looked at had over 50 million entries. Human sequences, I think, are the most abundant, with now over 8 million cDNA transcripts from EST projects. But the number of species this keeps getting applied to expands on a constant basis, because it's how you get the expressed component of the genome.
Now, it was just a problem of numbers that was limiting us. So we had thousands, then tens of thousands, then hundreds of thousands of sequences. Even at the time, people thought there were maybe somewhere between 50,000 and 300,000 genes.
And so we wanted to see if we could reduce the redundancy in the sequencing of cDNAs. So I asked Granger Sutton, who had just joined us as a post-doc at TIGR, to construct the algorithm for assembling these sequences. And it turned out to be such an incredible tool that we decided to apply it back to genomics to see if we could change how genomics was being done, and that resulted, as Andy said, in the first genome of a living organism in 1995, with Haemophilus.
The yeast genome took 1,000 scientists 10 years to sequence and ended up being the fourth genome sequenced, and this sort of came out of the blue in showing how we could shrink those 10 years. It took us four months to do this with a relatively small team. And as you know, these don't even qualify as graduate student projects now, because you can do a genome a day in a lot of different centers.
We couldn't get funding to do this from NIH. This grant was turned down with extreme prejudice. And I joke, because Francis Collins had Bob Strasbourg write the letter to us on how it was not going to be funded. Bob's now the deputy director of the Institute. But every time I introduce him, I remind him that he wrote that letter.
But it shows in terms of ideas of funding science. We took money out of the bank to fund this project, because we believed in it. And that roughly million and half dollars we spent has resulted in now over a half a billion dollars in additional grant funding, because that experiment worked. So sometimes if the experiment works, it goes in the right direction.
That year we actually sequenced another genome. We deliberately picked the organism that supposedly had the smallest genome of any independently self-replicating organism, and that's Mycoplasma genitalium. There are smaller genomes, but they all belong to obligate symbionts.
And the third genome that was done was the first archaeon, and this really changed our view of gene repertoires. It was done to help prove that Carl Woese's idea of three branches of life was, in fact, real. But it was stunning to everybody how many genes in this genome were of completely unknown function, never seen before, for this autotroph that lives happily at 85 degrees centigrade. And there were no obvious differences in the protein structures, either, that would enable them to work at these high temperatures.
So sequencing began to advance biology, and it was only five years after the Haemophilus genome that we took on the human genome. There was an important intermediate step of several hundred genomes, the most important being the Drosophila genome, which was, I think, the most fun genome that we ever did, because the entire community came to Celera to annotate the Drosophila genome over a few-week period.
The first draft that we did was published at the same time as the first draft from the public effort. It turns out both of these were wrong, and they were wrong for two different reasons. I talked to some of the classes today about this. The public version gave the wrong answer about variation in the human genome because it was based on selected BAC clones. And once you select a clone, you have nothing else to compare it to, and so they totally underestimated sequence diversity.
Celera was wrong for an unusual reason. In fact, had I listened to the bioinformaticians, we would have had the right answer seven years earlier. They were very worried about human variation and its extent. I wanted to measure variation, so I'm the one that came up with the idea of sequencing a large number of genomes. It got pared down to five people. But the assembly team just wanted to do one genome, and I felt that would not have been very satisfactory.
But because of the consensus-making part of the algorithm, even though we had parts of five diploid genomes in, it subtracted out all the major variation, particularly all the indels. And so both groups came up with this notion that we vary from each other in one out of 1,000 letters of the genome.
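How a majority-rule consensus erases variation can be shown with a toy example. This is purely illustrative; the Celera assembler's consensus stage was far more elaborate, but the principle is the same: with five haplotypes in the pool, minority alleles, including indels, get voted out of the final sequence.

```python
# Toy demonstration: majority-rule consensus over a gapped alignment
# ('-' marks a deletion). Any allele carried by a minority of the
# haplotypes, SNP or indel, disappears from the consensus.

from collections import Counter

def consensus(aligned_seqs):
    """Return the majority base at each alignment column, dropping gaps."""
    out = []
    for col in zip(*aligned_seqs):
        base, _ = Counter(col).most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)

haplotypes = [
    "ACGTAC",
    "ACGTAC",
    "ACG-AC",   # one haplotype carries a deletion
    "ACGTAC",
    "ACCTAC",   # one carries a SNP
]
print(consensus(haplotypes))  # -> "ACGTAC": both variants are voted out
```

So two genuinely variable sites contribute nothing to the consensus, which is exactly how both drafts converged on the too-low one-in-1,000 figure.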
And it was only as we continued to sequence, and a little over a year ago, published the first complete diploid genome, that the picture changed. This was expanding on my genome. And actually comparing my two parental chromosomes, they differed from each other by 0.5%. So it was a totally different picture.
I argue, right now, if we were going to compare any two of us, we'd probably differ by on the order of 1% to 3%. The bioinformaticians here can spend a lot of time calculating what the real number is based on how you view it.
But to put that number into context, we did some work at Celera with Svante Paabo on the chimpanzee genome. And what we did was sequence a few hundred thousand isolated clones from chimp, layered them on the genome assembly, and came up with a number that we varied from chimps by 1.27%. But now we know we were different from each other by 1% to 3%, and hopefully some of the evolutionary biologists in the room are hoping the chimpanzee number changed too. And now know we differ from chimps by about 5% to 6% when you take into account all the indels.
The other aspect of having both sets of chromosomes is that you can do haplotype phasing. In the original version, about half the genome was phased into pieces over 200 kb where we knew they came from one parent or the other. And as we add on sequences, we're pretty close, we think, to having all the chromosomes completely haplotype phased. That means I would be able to tell that this particular set of sequences came from my mother and this set from my father. As we look at these differences, that can become very important in tracking heritable lineages exactly.
The other surprise with this, as shown on this slide, is about 75% of the base pairs that differ between any two of us are not in SNPs. We have a huge industry that measures SNPs and they thought that that accounted for basically all the variation. Because of these large insertions and deletions, we have a lot more base pairs in non-SNP variants.
I think the other surprise that came out of it was the number and the extent of genes in my genome that had major variations. About 44% had one or more heterozygous variants. Now, some people argued it was just because I was strange, and there may be some truth to that, but it's not exactly comforting that Watson has as many.
I'm not sure if that proves the rule or not. But I think if you think with large populations, with all of us having that many unique protein structures, it changes certainly my view of biochemistry and my view of human variation, and the challenges associated with that, including pharmaceutics to address our various problems.
So the first two were mine, in 2007, and Watson's, published in Nature in 2008. And I just had the privilege of sharing the Double Helix award with him, and I commented there, looking at genetic variation, that phenotypically we look somewhat alike, both being bald Caucasians. And it's been a pet peeve of mine, what I call race-based medicine, where a lot of physicians think it's legitimate to practice different types of medicine based on the skin color of the patient in front of them.
And so we decided to compare my genome to Jim's genome, and found, in fact, that even though you would think we would be remarkably similar in things like drug metabolism, being bald Caucasians, it turns out, if the 454 sequence of Watson's genome is accurate, that he has a very rare set of variants, much more common in a Chinese population, and has very slow drug metabolism. I have very fast drug metabolism. So if you gave us each the same dose, he would be overdosed and I would be underdosed. People have said that probably explains a lot.
It certainly explains some things. And so knowing your own drug metabolism can have a big impact.
And of the examples I gave in some of the classes today, my favorite is caffeine metabolism. Because we've all had these confusing reports: is caffeine good or bad for you? Well, it depends on your genome. I have two copies of the fast-metabolizing allele, and so I can have multiple doses of caffeine. I had at least three or four cups of coffee today, and if a Red Bull were available, I would've washed it down with that.
And if you separate out of the population the fast metabolizers, the rest of you cannot drink that much caffeine without having a greatly increased risk of heart attacks, whereas, in fact, for the fast metabolizers, it looks like caffeine could even be beneficial. I certainly find it that way.
So as we look at populations, what seemed to be very confusing when you just do a clinical trial or a test on a population, when you break out people by genotype, the world looks very different.
So we're trying to scale up now. While it's entertaining to have a few genomes, we need a large number with complete phenotypic information, and that's the next phase of what we're trying to scale up to do. And the challenge, I think, is going to be getting the phenotype information on all of us.
Once we do that, we will know what's genetic and what's environmental. And we've also taken a sequencing approach to trying to understand the environment. After we finished the first draft of the human genome, I was looking for a new job. As Andy said, I sequenced the human genome, I raised $1 billion, and they didn't need me anymore. So they fired me, and I ended up being grateful for it.
But we tried to apply these same techniques to understanding the environment. And the first test was trying to do a shotgun sequencing of the Sargasso Sea. Now, the Sargasso Sea was chosen because it was supposed to be, according to marine ecologists and biologists, a very simple environment with very few organisms.
In fact, what we did was just filter some seawater onto different-sized fraction filters, take the filters back to the lab, isolate all the DNA off them at once, and shotgun sequence it. And we stopped sequencing on the first study after about 1.2 million new genes had been discovered, and as many as 40,000 new species, just in one barrel of water.
And to try and put that data in context, we thought maybe if we sampled elsewhere, we'd find organisms that were very rare in the first sample but more abundant elsewhere. And also I was looking for an excuse to sail my boat around the world. So I was fortunate to be able to combine the two things of doing an expedition and taking samples every 200 miles.
So this is the route that we took on the circumnavigation, starting in Halifax, just like the Challenger expedition. And it's interesting that sampling every 200 miles is the same thing the Challenger did in 1872 as the first oceanographic expedition, except they stopped every 200 miles and sent a dredge to the bottom of the sea to pull things up, because the notion then was that there was no life below a certain depth in the ocean, since obviously life could not survive in those conditions.
We went through the Panama Canal, spent three months on the Galapagos, sailed across the Pacific, then around Australia, and across the Indian Ocean. I was arrested twice for having the audacity to try and do science. We were arrested for the first time sailing into French Polynesia. We got in the middle of a dispute between the French government and the French Polynesia government about who owned the organisms in their water.
And little did I know that sailing across the Pacific, we would run into French genetic heritage in the middle of the Pacific Ocean, but that's apparently where it originates.
And then we were arrested again in the middle of the Indian Ocean, sailing too close to-- trying to stop to dive on an island whose inhabitants had all been kicked off so there could be a B-52 bomber base there. And this time, the British were not so kind, and they threatened to impound the boat and lock us up for a long time, again because they were afraid that science might bring attention to that part of the region where they had kicked people off the island. So I argued that if Darwin had encountered all these same obstacles, we probably would not know as much about evolution, because everybody would have owned all the organisms in their waters.
We extended, last year, the expedition, covering the entire West Coast from Panama up to Glacier Bay, and sailed right up to melting glaciers to get maybe 10,000 year-old organisms. And we're about to start the third phase of the expedition, leaving San Diego this winter to go back through the Panama Canal, then across the Atlantic, and to cover the Baltic, the Mediterranean, and the Black Sea over the next few years.
So trust me, science can be fun. And the best way to understand the history of the world is to see it from the deck of a sailboat, because that's how much of that history happened.
Well, we published the first third of the expedition findings in a special issue of PLOS Biology, and the first set from Bermuda doubled the number of proteins in the public databases. And then we described 6 million new genes in one single paper in that PLOS special issue, which, again, more than doubled the genes known and available to science. And we're about to do that again, because now we have about 20 million genes in the database from continued expeditions across the Indian Ocean. We still have about a third of the samples left from the initial circumnavigation, and the entire West Coast to add into this, plus a few trips to Antarctica.
So we still don't know what the gene pool of the planet is. We did a set of experiments that took over a million CPU hours, trying to compare everything in our data set to everything that was known, to see if we were starting to see saturation or whether there was linear growth of new gene families. It turned out to be completely linear, with around 40,000 to 50,000 major gene families. Any new sample adds new gene families that have never been seen before, in a pretty straightforward fashion.
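The saturation question can be caricatured as a discovery curve: after each new sample, count how many distinct gene families have been seen so far. If the curve keeps climbing linearly, discovery is unsaturated. The sample data here are invented for illustration; the real comparison involved millions of sequences, not three little sets.

```python
# Sketch of a gene-family discovery curve. Each sample is the set of
# gene families detected at one sampling site; the curve records the
# cumulative count of distinct families after each sample is added.

def discovery_curve(samples):
    """Cumulative number of distinct gene families after each sample."""
    seen = set()
    curve = []
    for families in samples:
        seen.update(families)
        curve.append(len(seen))
    return curve

samples = [
    {"famA", "famB", "famC"},          # first site: 3 families
    {"famB", "famC", "famD", "famE"},  # new site adds 2 new families
    {"famA", "famF", "famG"},          # still adds 2 more
]
print(discovery_curve(samples))  # -> [3, 5, 7]
```

A curve that flattens would signal saturation; the expedition data showed no such flattening for bacterial and viral families.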
It's a little bit different if you break things out by types of organisms. For example, if we look at mammalian genes, basically, the discovery of mammalian genes is saturated. We and our fellow mammals are a very small part of the gene pool. And sequencing more mammalian genomes probably won't discover many, if any, new genes. It'll give us new variants and new combinations of those.
But if we look at bacteria, and certainly viral genomes, we're still in a pretty linear phase of discovery. And even with all the new data, we see no hint of it saturating.
Some of the discoveries were pretty cool and a big surprise. Many of us knew about one or two bacterial rhodopsins and they were thought to be rare molecules. This is just what was discovered out of the initial sample in the Sargasso Sea. All the red lines were new rhodopsin-like molecules discovered by their similarity to our own visual pigments.
So we sort of were wondering what role these have. And we tried to assemble all of these, and developing new tools to align thousands of proteins has been a challenge. And some of the gene families now have as many as 50,000 members.
But with the rhodopsins, a single amino acid residue can determine the wavelength of light that these receptors interact with. Because each sequence has a GPS coordinate, we could ask unique kinds of questions. Do we see any unusual distribution of these? And the answer was yes.
And in what seems like a very logical result now, but was surprising at the time, out in the middle of the Sargasso Sea, where it's blue water, the receptors see primarily blue light. You get into coastal waters, where there's a lot of chlorophyll, and they see primarily green light. In fresh water, like the Panama Canal, it's almost entirely green light receptors, due to all the algae.
And on further analysis, we found that, in fact, just a single base change can switch between the blue and the green light receptors. Apparently selectivity and survival are determined by just this one molecule in the membrane, by whether there's sufficient light to activate it. And the switch between blue and green has happened at least four times in recent evolution.
A study out of a Swedish group, in fact, proved that these receptors do drive the biology of these cells, and they're totally stimulated by light, and this is just like your own visual pigments.
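Since each sequence carried a GPS coordinate, the spectral-tuning observation reduces to a lookup: inspect the residue at the single tuning position and tag the sequence blue or green. The position index and residue identities below are made up for the example, not the real tuning site reported in the proteorhodopsin work.

```python
# Hedged illustration: classify a rhodopsin-like sequence as blue- or
# green-tuned from the residue at a single (hypothetical) tuning position.

TUNING_SITE = 4        # hypothetical alignment position, for illustration
BLUE_RESIDUE = "Q"     # hypothetical: this residue -> blue-shifted receptor
GREEN_RESIDUE = "L"    # hypothetical: this residue -> green-shifted receptor

def predict_color(seq):
    """Predict spectral class from the single tuning residue."""
    residue = seq[TUNING_SITE]
    if residue == BLUE_RESIDUE:
        return "blue"
    if residue == GREEN_RESIDUE:
        return "green"
    return "unknown"

# Each real sequence carried a GPS coordinate, so predictions like these
# could be mapped against open-ocean versus coastal sampling sites.
print(predict_color("MKLLQAVG"))  # residue at index 4 is 'Q' -> "blue"
print(predict_color("MKLLLAVG"))  # residue at index 4 is 'L' -> "green"
```

The point is that a one-character test on the sequence, combined with location metadata, is enough to reveal the blue-offshore/green-coastal pattern.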
In an effort to put some context around all the new discoveries, the Gordon and Betty Moore Foundation funded us to do what we call the Moore 160, organisms selected from culture from all these different unusual environments around the world. And this is an important message as we go into human metagenomics in more detail now: very few of these recruited many sequences. We have a few dominant taxa with a huge tail of rare organisms. And so we're at the early stages of understanding species diversification in the microbial world.
And the same is being applied to our own bodies. We have our 23,000 genes and 100 trillion cells. Most of us carry an additional 100 or 200 trillion cells in the form of microbes. We did this experiment right after the Sargasso Sea experiment, and it turns out we probably have a spare 10 million or so genes associated with each one of us.
So just because these organisms have been invisible and they haven't been cultured for the most part, we've been ignoring them in terms of trying to understand human biology, and I think these studies are going to be pretty interesting as we go forward.
One way to put them in context is to look at metabolism. So a company in North Carolina, Metabolon, has been characterizing all the chemicals that our genes can put together, and they come up with about 2,400 different chemicals that our enzyme set can synthesize.
But if you look in our bloodstreams, particularly after a meal, we find about 500 different chemicals. Only 60% of them are from our own metabolism. 30% of the chemicals circulating in your bloodstream are actually from what you ate, all the different species that you consumed.
But I think the most fascinating number is that we have about 50 chemicals circulating in our bloodstream that are bacterial metabolites. So the bacteria associated with us metabolize various compounds from our food, et cetera, and we have no idea what role they play in our own biology.
Many of you think if you have a drink, or you smoke something, or take a pill, that that's basically all that's circulating in your bloodstream. And so now we know we have this background of about 500 chemicals and we need to understand their role in human physiology.
Well, obviously, trying to get to a simpler system makes sense when you look at that complexity. And so we've been spending a lot of time trying to characterize this minimal genome of Mycoplasma genitalium and asking simple questions. So this has around 500 genes. Haemophilus had 1,800 genes. And we just asked the question: if one needs 1,800 and another 500, and others need more, is there a minimal gene set with which we could define life?
This is obviously a very naive question. But we started out trying to do it. How many of these genes are essential? What is really the smallest number of genes it takes to run a simple cellular machinery? And ultimately, we went back to the question, could we design and construct a genome of a minimal cell?
We did a lot of work with comparative genomics. We spent about a decade trying to knock out genes. And then I'll go into the construct of this simple genome.
So for the first two genomes we did, several people did these kinds of comparative studies and came up with an interesting overlap. We're glad we didn't publish this paper, because this group, Koonin's team, said the gene pool on planet Earth was remarkably small and that we probably had most of the genes in these two organisms. But when we did the third organism, the archaea, it didn't overlap at all with this, so that wasn't going to take us very far.
There's now 13 mycoplasmas that we've sequenced. They have some core genes common to all of them, but that was not a minimal set that would take us anywhere. We could expand that minimal core to about 310 genes, looking at non-orthologous displacements, genes of different structures but similar functions. But still looking at those, we didn't think that would truly define life.
We started doing a technique developed by Clive Hutchison at the Institute called whole genome transposon mutagenesis, where we can randomly insert transposons into genes in the genome. Because we have the complete sequence of the genome, we can just sequence off the transposons and know exactly where they are, and we generated this kind of map. I don't know if you can see these little black and red triangles. That's where transposons inserted in the genome without killing this organism.
So if they inserted in the middle of a gene and the cell lived, we called those nonessential genes. If you couldn't get any transposons in at all-- all the ones here without transposons-- we considered those essential genes.
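The essentiality call described here can be sketched in a few lines of Python. The gene coordinates and insertion positions below are invented purely for illustration; the real maps came from sequencing off the transposons.

```python
# Sketch of scoring genes from a transposon insertion map.
# Coordinates and insertion sites are made up for illustration.

genes = {
    "geneA": (100, 400),    # (start, end) position on the genome
    "geneB": (500, 900),
    "geneC": (950, 1300),
}
# Positions where a transposon inserted and the cell still lived
viable_insertions = [250, 310, 1100]

def classify(genes, insertions):
    """A gene hit by at least one viable insertion scores nonessential;
    a gene with no recovered insertions is provisionally essential."""
    calls = {}
    for name, (start, end) in genes.items():
        hit = any(start <= pos <= end for pos in insertions)
        calls[name] = "nonessential" if hit else "essential"
    return calls

print(classify(genes, viable_insertions))
# geneA and geneC are hit, so they score nonessential; geneB scores essential
```

As the next paragraphs explain, this classification is only provisional: whether a knockout is survivable depends on what is in the growth environment, not just the genome.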
Now, the trouble is it depended on the environment, and these were very humbling experiments for molecular biologists who wanted to define life in terms of genes. We found that what was in the environment determined which gene set would be there. A simple example: the cell grows off of both glucose and fructose. There's a gene for a cell surface transporter for each sugar. If you have both sugars in the environment and you knock out the glucose transporter, the cell lives. And so we'd score that as a nonessential gene.
But if you only have glucose in the environment and you do that same experiment, the cell dies. Under those conditions, you'd say it was an essential gene. So we learned there are two components to life. We need to have the genetic code and we have to have the environment, and it's hard to even define at this level.
When we look at the metabolic map that we generated for this system, if you knocked out, all at once, all the genes that can be knocked out one at a time, it's likely this would not result in a viable organism. So we decided the only way to do this-- because there weren't enough markers to do sequential knockouts of 100 genes-- would be to synthetically make a chromosome.
As soon as we went down this route, we had new questions. Would the chemistry permit even making large molecules? And even if we made them, could we boot them up or activate them?
So here's sort of what's been happening with gene synthesis. It moves remarkably slowly compared to reading the genetic code, but it is starting to change dramatically.
We tried experiments early on. We chose phiX174 as the target, because it was the first viral genome that Fred Sanger's team had sequenced. And Clive Hutchison was in Sanger's lab at the time and stuck a little vial of phiX in his pocket for historical purposes. So we had the original samples and we could resequence those to make these.
It turns out all of DNA synthesis is a degenerative process. If you're making PCR sequencing primers, it doesn't matter. But with an N minus 1 error rate, the longer the sequence, the more errors.
Obviously, we can't tolerate a lot of errors if we're trying to synthesize a genome, so over the years we developed all these new techniques to do error correction. And finally we were able to go from the genetic code of phiX in the computer-- we had resequenced it to verify it-- to, two weeks later, having the 5,000 base pair piece of DNA made from the four bottles of chemicals.
And the exciting part happened when we inserted that into E. coli. E. coli recognized the synthetically made DNA as though it were naturally occurring DNA and started to make all the proteins. They self-assembled to form the virus. The virus was not very grateful. It started killing the E. coli. But that's how you can detect it, with these clear spots on the plate.
This is actually from the structure of phiX. And we call this a case where the software builds its own hardware. All we did was chemically make the phiX genome. We put it into the cell and the cell started making the viral particles.
So if we take this concept to new levels, it's a pretty exciting concept. And we thought we could build a bacterial chromosome by making cassettes of viral-sized pieces, because we knew we could make those with 100% accuracy. And the challenge was, how would you put these pieces together and get enough DNA to try and activate it?
And it turns out, again, design is really important. In 1995, the best standard for DNA sequencing was one error in 10,000 base pairs. So we went back and resequenced the mycoplasma genome to see how we did. In fact, we found 30 errors. A few of them probably would have been lethal if you were trying to construct the genetic code from that.
So we started with a new sequence again. And you have to design small pieces so they overlap to form larger pieces, which overlap with each other to form still larger pieces, and so on. We were initially trying to make the complete Mycoplasma genitalium genome, so that if that worked, we could start leaving out pieces. And when you think about artifacts that could really mess you up, even one molecule of contamination from the native chromosome could fool all of us into thinking we had succeeded.
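The hierarchical build-up of overlapping pieces can be sketched as follows. The toy fragments and the fixed overlap length are assumptions for illustration; the real assembly used much longer pieces and biological joining steps.

```python
# Sketch of hierarchical assembly: small overlapping fragments are
# joined pairwise into larger pieces, doubling in size each round.
# Fragments and the 4-base overlap are toy values for illustration.

OVERLAP = 4  # assumed overlap length shared by adjacent fragments

def join(left, right, overlap=OVERLAP):
    """Join two fragments whose ends share an exact overlap."""
    assert left[-overlap:] == right[:overlap], "fragments do not overlap"
    return left + right[overlap:]

def assemble(fragments):
    """Repeatedly join adjacent pairs until one molecule remains."""
    while len(fragments) > 1:
        fragments = [join(fragments[i], fragments[i + 1])
                     for i in range(0, len(fragments), 2)]
    return fragments[0]

pieces = ["ATGCAACG", "AACGTTGG", "TTGGCCAT", "CCATGGTA"]
print(assemble(pieces))  # ATGCAACGTTGGCCATGGTA
```

The design constraint described in the paragraph above is visible here: every piece must be designed so its ends overlap its neighbors, at each level of the hierarchy.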
So we decided to create a foolproof way of doing this. And we knew from the transposons places where we could insert DNA into the genome as watermarks. You've probably seen that some people have complained that we didn't write poetry with this-- we just autographed the genome. But we have a four-letter genetic code. Triplets, as you know, code for amino acids. There are 20 amino acids that all have single-letter designations. So you can write names, words, and sentences in genetic code, and that's what we inserted in the genome: the names of the various people assembling it.
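The watermarking scheme described here can be sketched directly: treat each letter of a name as an amino acid's single-letter symbol and substitute a codon for that amino acid. The codon table below picks one standard codon per amino acid; these are not the specific codon choices the team used, and the example name is just an illustration.

```python
# Sketch of encoding a name as a DNA watermark via the genetic code:
# each letter is read as an amino acid single-letter symbol and
# replaced by one standard codon for that amino acid. The particular
# codon picked per amino acid is an arbitrary but valid choice.

CODON_FOR = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
}

def watermark(name):
    """Encode a name as DNA, skipping letters (B, J, O, U, X, Z)
    that have no amino acid assignment."""
    return "".join(CODON_FOR[ch] for ch in name.upper() if ch in CODON_FOR)

print(watermark("VENTER"))  # GTTGAAAATACTGAACGT
```

Decoding works the same way in reverse: translate the watermark region with the standard genetic code and read off the single-letter symbols.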
We started down this process of putting larger and larger segments together, and we were going along quite readily. We passed the world's record, which was 30 kb, just at our second round. Excuse me. But at each step, we grew these pieces up in E. coli so we could sequence them, and validate them, and have enough DNA for the next step.
When we got above 100 kb, the pieces didn't like to grow so well in E. coli, and so we started hunting around for new systems, and we came up with several. But the whole idea for building these was to use homologous recombination.
Now, early on we sequenced the Deinococcus radiodurans genome. This is the one that can take 3 million rads of radiation. As you can see at the top panel, its chromosome gets blown apart with a couple hundred double-stranded breaks, but 12 to 24 hours later, it reassembles its chromosome exactly as it was before and it can start replicating again. So this is basically a biological equivalent of our whole genome assembler that does this with sequenced fragments.
So we've been trying to duplicate this in the laboratory with machinery out of Deinococcus, but we were having trouble doing so until we found that, simply by cloning the pieces into yeast and growing them there, yeast will automatically assemble the pieces of DNA for us. So in fact, not only did we grow them up in yeast, but yeast put them all together and we got the entire chromosome out as a single piece.
So this publication earlier this year is actually the largest human-made molecule of a defined structure. It's over 582,000 base pairs. It would take 147 pages to print it at 10-point font with no spacing. Its molecular weight is over 300 million. And on sequencing, it's 100% accurate to what we designed.
So I think we've answered the first question: with these techniques, you can now make large molecules. We had to develop all kinds of new processes for these. We don't pipette them; we move them around in gel blocks. We do all the enzymology in gel blocks, et cetera.
Here's actually a picture of it. It's large enough that it shows up quite readily. And it's a circular molecule.
So the big question is now, how do you boot up a synthetic chromosome? And this is something we've been working on in parallel. In the last year, we published, I think, one of the most important papers that we've ever published, and that's where we transformed one species into another by changing its operating system. So let me walk you through that a little bit. So we had two types of mycoplasmas, and Mycoplasma mycoides was our donor genome. We isolated the chromosome. We treated it with harsh enzymes to remove any proteins. We added a couple extra genes to it. And the notion was we would inject it into another species, capricolum.
Now, having the complete genome gives us unique bits of information. Ham Smith, my friend and colleague, discovered restriction enzymes, and we knew they were going to be important in this process. We knew that the M. mycoides genome had several restriction enzymes that actually recognize capricolum DNA. But we also found that capricolum had several enzymes, none of which recognize the mycoides genome.
So we can insert this piece of DNA into the cell. And just like with the synthetic DNA with the virus, the cell machinery immediately started reading the genetic code, expressing all the proteins, including the restriction enzymes. The restriction enzymes immediately recognized the capricolum chromosome as foreign material and chewed it up, leaving us only with cells with the transplanted chromosome.
In a short while, we had bright blue cells that we knew contained our transplanted chromosome. And on examination, that's all they had. There was no capricolum DNA left.
The next part was even more interesting. Cell surface antigens that existed in capricolum totally disappeared in the transplants, and antigens coded for in the mycoides genome appeared in all the transplants.
We did 2D gels and sequenced a large number of proteins. And we found that, basically, there was no remains whatsoever of capricolum proteins. Everything had switched over to what was driven by the new operating system.
So if you ever doubted that DNA was, in fact, the software of life, hopefully these experiments convince you. And this is a pretty stunning mechanism: just by changing that one molecule, all of the characteristics of one species were gone and the new one appeared. So this is the ultimate in identity theft.
And you can understand the importance of restriction enzymes in preventing this, but we see lots of cases where it was not prevented. For example, cholera has two chromosomes that are very different from each other. So we think we discovered a step that's actually very important in evolution.
Let me close with, so why do this? There's lots of reasons other than understanding the minimal basic components of life and maybe understanding more processes of evolution. Our population is growing quite substantially. We're around a little over 6 and 1/2 billion people. Within 40 years or less, predictions are we'll be around 9 billion people.
To put that number in context, I was born in 1946. There's three people alive for everybody that existed the year that I was born. Soon there'll be four. Never before in human history have we gone through this type of expansion. And what we know is even with developing countries, as their economic systems grow, there's an exponential increase in their demand for energy, electricity, proportionate to the GDP.
Except for the dip for the election and for the slowdown in the economy, we are running out of oil. We'll, I'm sure, be back in triple-digit barrel oil prices relatively soon. And even if it was cheap, we're adding about 4.2 billion tons of carbon dioxide net to our atmosphere each year, and everybody's pretty aware of the predictions from that.
In addition, we can't feed the 6.5 billion people on our planet. Water's becoming even more scarce than food. And as we made the dumb mistake of switching corn from food and feedstocks to trying to convert it into fuel, we more than doubled the cost of feedstocks in less than a year.
We need new technologies. We need new sources of food, of clean water, and of energy. So we've been working on some of these problems, thinking that if we can use these new tools for this, and just as an idea, my view is we can look at these 20 million genes that our team has discovered as design components.
Much in the same way that the electronics industry grew out of the '40s and '50s from resistors, and capacitors, and transistors, and then integrated circuits, we have far more components. We're basically unlimited in our tool chest right now. This number, within five years, will probably be well over 100 million different gene components that we can assemble in different combinations.
We actually have software for designing organisms for specific purposes. Right now, we're doing this for single-cell organisms, but when you look at all the great agricultural work that's done at this university, it might be nice to actually be able to design a plant that grows with very little water, not needing rich soil, not needing oil-based fertilizers, that produces unique things that we want.
We can pick the types of metabolism. Do we want to metabolize sugar? Do we want to go from CO2 to methane? And we can control these so they won't live outside of the laboratory or the facility.
But that's not sufficient. The example I use: if we have a 10-gene metabolic pathway and you have 10 versions of each of those genes, that's 10 to the 10th combinations you'd have to build to test them all. We have some gene families with 30,000 to 50,000 different members. If you're going to really optimize something, we need a combinatorial method to do this.
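The combinatorial arithmetic here is worth making explicit: with v variants of each of g genes, exhaustive testing requires v to the power g builds. A two-line sketch:

```python
# The combinatorics behind pathway optimization: v variants of each of
# g genes means v ** g distinct pathway builds to test exhaustively.

def combinations_needed(genes, variants_per_gene):
    return variants_per_gene ** genes

print(combinations_needed(10, 10))  # 10000000000, i.e. 10 to the 10th
```

Even a 3-gene pathway drawn from a 30,000-member gene family would need 30,000 cubed, or 2.7 times 10 to the 13th, builds, which is why an automated combinatorial assembly method, rather than exhaustive manual construction, is the only plausible route.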
So we've developed a robot that can actually do the assembly of gene components using homologous recombination in yeast, building yeast artificial chromosomes, and we're trying to get it scaled up to see if we could do thousands to millions of chromosomes a day. It's one of the few ways that I can see of starting to process these huge numbers that we have and understanding life on our planet, let alone trying to optimize processes.
If you consider ethanol a first generation fuel, we can make almost any molecule in the lab from bacteria starting with sugar as a starting point. But we decided that we could do a lot better than that, and we have been working on what I call a fourth generation fuel where we start with carbon dioxide directly as the feed stock.
We're not short of energy on this planet. Even today, there was a lot of sunlight, even here. We only need a tiny portion of the sunlight that reaches the Earth to cover all transportation. I've argued this would be a great use of Nevada.
There's been some disagreement. They say take Arizona instead.
So we're working on processes that can do this. Obviously, plants do this. They take CO2 out of the atmosphere and process it, but not very efficiently.
We have major projects on oil palm and jatropha. We finished the oil palm genome. We're looking at microbes associated with the very extensive root system on oil palm. We chose oil palm because, per hectare, it produces more oil than anything else. And jatropha has the promise of doing that because it's a very simple genome and it can be modified very rapidly, and selected even with different strains, and not much has been done with it. I don't know if you can actually-- if these numbers show up. If you look at oil palm, it was on the order of 600 gallons per acre.
Something that sort of goes off the scale theoretically is obviously microalgae. The trouble is, with microalgae, people have been viewing this as a farming problem of growing algae and then harvesting it for the lipids. We viewed it as a genetic engineering problem, and so we've been working on ideas for artificial cells, starting with existing bacterial algae.
But we've had a major breakthrough, because by changing the genetic code, we've gotten the algae to manufacture the lipid and then just secrete it into the media. So we can get them to secrete different chain length lipid molecules. These can be isolated in a pure form from the media.
And so now these are mini production machines. It's not a farm crop. And so we're trying to develop systems right now where we can do continuous growth and continuous production of these chemicals off of sunlight. We actually need concentrated CO2 for optimal growth-- they grow slower off of atmospheric CO2-- but there's no shortage of CO2 sources to do this in a concentrated way.
I got the initial ideas for these from the Methanocaldococcus jannaschii genome. That's a complete autotroph that just takes CO2 going to methane.
We're actually taking it further in terms of looking at hydrocarbon recovery in a program we have with BP. And this is where we combined all our tools of environmental genomics to look at things, and this was really an exciting study. We took a water sample from a mile deep in the Earth from some coal mines in Colorado, and we knew there was life there, but we didn't know to what extent and how much diversity there would be.
We were stunned by, on any microscopic slide, literally hundreds of unique microorganisms swimming around at high speed. So a much higher density of organisms than we find in the ocean even at these high temperatures.
And now in the lab, we have a large number of organisms that live on coal as a substrate. They metabolize the coal into a wide variety of chemical substrates that we can convert into chemicals, or convert into methane, as happens naturally in the ground with methanogens doing this biological conversion. That's why there's all these coal bed methane wells developing around the country and other parts of the world. We're looking to see if we can enhance this by understanding the metabolic processes from all these organisms that have never been seen before.
This is actually exciting for evolutionary purposes, because this particular water sample had not been interacting with the surface for over 50 million years. We have another sample now that's 135 million years old. And when we look at the organisms in there, some of them, we've never seen anything like them before.
But some of them look like the Shewanellas that we get out of the Baltic or Black Sea or other parts of the ocean that can metabolize heavy metals. So Shewanella can reduce uranium. It can live off of lead. It can live off of a variety of heavy metals. And here's stuff that's been isolated and separated for 50 to 100 million years and we find similar organisms there. So I think these are unique evolutionary tools as well.
This is actually-- I don't know if the bubble will show up there on the left. These are methane bubbles coming off of the coal just from having two unique organisms growing in a culture.
There's other applications in medicine, and Bert Vogelstein constantly comes up with new breakthroughs in cancer. The one that actually blew my mind-- if somebody had asked me, I would have said it probably wouldn't work-- is that he's treating cancer patients by injecting them with anaerobic bacteria. Imagine trying to get that through the FDA. And he really had difficulty with this.
So he uses this Clostridium that, because it only likes an anaerobic environment, goes to the center of these solid tumors, which are anaerobic, where it grows, expands, releases toxins, and kills the tumors. It's absolutely stunning. And so we're working with that to see if we can actually design a microorganism to do this very specifically and also produce unique anti-tumor chemicals at the same time.
Also-- and this is truly the last thing-- is the first genomic-based vaccine is about to come out. It's in Phase 3 clinical trials right now with Novartis. Just to show you the time course of this, this is an experiment we started in 1996 with [INAUDIBLE] at Chiron, and we sequenced the Neisseria meningitidis genome. And based on observations of genetic elements that control phasing in the genome, we picked a number of antigens that we knew were stable and not subject to phase variation. We used informatic tools to find them common amongst a number of Neisseria species.
And finally, there's a new childhood vaccine that works against all major strains of Neisseria group B meningitis, which is one of the major killers of young children, and of young adults at universities and military bases. And so hopefully this will come out soon, totally based on understanding the genetic code and understanding gene regulation.
But I thought, why not go further? There's a number of diseases, like malaria, that have been intractable to vaccines-- certainly HIV. We could do it the same way that microorganisms do: actually design a bacterium to produce every vaccine molecule that's used, but subject them to phase variation so they're constantly evolving and changing, keeping our immune system totally updated all the time. And theoretically, we could have a one-time vaccination in our lifetimes that could constantly keep up and keep going.
So going from reading to writing the genetic code, I think the future is limited only by our imaginations: going from this digital information in the computer, and four bottles of chemicals, to booting up this new software of biology that we think we can take in all these new directions. Thank you very much.
ANDY CLARK: We'll have a chance maybe to take a couple of questions. This one.
AUDIENCE: What domains of biology or science do you think or do you worry about [? being ?] outdated by your [INAUDIBLE] and are you justified?
CRAIG VENTER: So the question is, what domains of biology or science do I consider most likely to be outdated by these new techniques? I don't know. I've never thought about it in that way. I'd rather think of it in all the new areas that could be stimulated by new ideas.
Hopefully, we can overcome some of the limitations of understanding our own human genetics, and overcome the limitations for developing new therapies by all these wild ideas. And maybe in our spare time, save the planet from what we're doing with CO2.
AUDIENCE: So did you criticize the French Polynesians [INAUDIBLE]?
CRAIG VENTER: No, I criticized the French government. The French Polynesians were our friends.
CRAIG VENTER: So the question is, is there a valid concern if people want to patent these genes? No. Actually, it would have been much easier to have agreements with most countries if there were economic terms and the genes were being patented. We wanted to put them all in the public domain. And what the French argued is that it was a clever ploy on my part, because I could study them faster than the French could, so that I didn't need to patent them, and it was a way of keeping them from doing so. So no matter what you do, you get arguments.
So they're all in the public domain, and different countries dealt with it in different ways. Some didn't want to let us sample at all, and so we didn't. Others, like Australia, trying to compromise with all these new rules, asked us to label every sequence that came out of Australia so that if somebody wants to do commercial development, they need to call somebody in Australia and ask their permission. It doesn't say who. And so these are all labeled in the database as having come from Australia.
Nobody knows what to do with all these. And the silly notions that started with the NIH trying to patent all the genes-- not me-- ended up with pretty clear decisions. Patenting genes at early stages is not a viable approach. All the companies that were doing that were trading for less than their cash value despite how many gene patents they had in their portfolios.
So when we scaled up orders of magnitude, there were some countries that discussed trying to patent all the genes in their waters, and they decided it would break their national budgets to do so. But some are still uncomfortable with the data even being in global public databases. So it's a complicated issue where global science still suffers from-- instead of individuals or individual companies, entire countries trying to tie up and block information.
The good thing about this country is that, other than in a national park, anybody can sample anywhere and do anything with it.
AUDIENCE: Yes, what are your concerns-- [INAUDIBLE]. And two, what are your concerns [INAUDIBLE] having [INAUDIBLE]?
CRAIG VENTER: So it's an excellent question, one that we've dealt with from the beginning. So the question was, what are my concerns that we might build something that ends up being harmful instead of helpful? And I think it's a very critical question, because there's been a lot of ecological disasters around the world from scientists thinking they could solve problems by introducing a new species in Australia or someplace else to solve a problem with another species, and it hasn't worked out.
We asked those questions before we did the first experiment. We asked for a complete bioethical review. It was done at the University of Pennsylvania. They brought in all kinds of public groups and most major religions, discussed all these issues, and gave us the green light to proceed on the basis and for the reasons we were proceeding.
Everything changed again when we came up with the method for making phiX174 very quickly. That work was sponsored by the Department of Energy, so we had to notify them first. And there are some passages in the last part of my book describing these early scenes with the head of Homeland Security and others really freaking out about this.
And I think it's one of the very few things the Bush administration did right in science. I still don't know why they did it. It'd be important to try and understand. But it went through this extensive review two or three times in the White House, whether they were going to classify our research or allow us to go ahead and publish it, and they came down on the side of open publication. And basically, they decided there's no way just containing one lab would block progress in science and that it was better to address things at other levels.
So one of the agreements going forward was that they would start a new review group. It's the NSABB that reviews this type of dual use research. They formed it around the phiX work and what we were doing with synthetic bacteria. Now they don't even care about what we're doing, because it's so far below the scale and horizon that they deal with for biological threats.
I think we do need some new rules. I was hoping for self-governance with the companies that make synthetic DNA. There's groups at Stanford now and at MIT that encourage biohacking-- be the first one on your block to make a virus or something. And I don't think that's good or healthy. I think we need to teach responsibility along with this.
So most reputable companies screen against A-list organisms so nobody could deliberately make something lethal. And the Sloan Foundation funded us to do an extensive study, along with MIT, looking at the risks versus benefits, which got published just a while ago. This is constantly open to public discussion, I think, because we've been so open and public about our advances and the discussion around them.
With any new technology, you can never prevent a highly motivated crazy person from doing harm. It happens with any technology. Fortunately, in modern science, these have been remarkably rare events. There have been about 10 million experiments putting genes from every species under the sun into E. coli with no adverse consequences.
But monitoring's not bad. We're trying to establish the ethics. Nothing should be made in the lab that could survive outside a laboratory or a facility. Whether everybody follows our rules remains to be seen.
But these were the same concerns that existed at the start of the molecular biology revolution. And we were talking today about how Cambridge and other cities passed laws against doing molecular biology within the city limits. Imagine if Cornell University didn't allow molecular biology experiments.
So I think it's a matter of the new rules and the understanding of this developing. And I don't know any scientists who are really approaching things in an irresponsible way, associated with trying to do harm. So I think we're trying to take reasonable approaches to those concerns that we have and that the public may have.
J. Craig Venter, noted geneticist and A.D. White Professor-at-Large, is the founder of Celera Genomics, whose team of researchers was one of two which first completed sequencing of the human genome. Beginning with his invention and development of EST sequencing to identify sequences of large numbers of expressed genes, Dr. Venter went on to determine the structural sequence of the hereditary materials of other mammalian, insect, bacterial, and viral genomes. This work increased our understanding of genetic diversity and enabled new strategies for improving the human condition.
The focus of his current research is on synthetic biology, and the intersection between genomics and environmental and energy policy.
Venter delivered a public lecture at Cornell on November 17, 2008 in Kennedy Hall's Call Auditorium.