The race to improve DNA sequencing
Giulio Formenti
Published on February 28, 2023
Cost, availability and sample preservation are the main concerns but progress is rapid.
As we celebrate the 70th anniversary of the discovery of the DNA structure, it is astounding to look back at the progress made in our ability to read the ‘book of life’, the DNA sequence that provides the instructions to make all living beings.
The most recent and promising advance is certainly long-read DNA sequencing technology. Long-read methods began to be developed around 2010, as it became increasingly clear that read length limits our ability to reconstruct and study genomes.
Using long reads we can now reconstruct entire human genomes almost without errors or prior information, as demonstrated last year by the first complete ‘telomere-to-telomere’ assembly of a human genome.
We can confidently make inferences in complex regions of the genome that were previously inaccessible to investigation, and as such often described as dark matter.
These regions can harbour genes associated with human diseases that were known to have a genetic basis, even though that basis could never be found.
Owing to the specific characteristics of how long-read sequencing technologies read DNA, they also provide additional information that was not available with the previous short-read methods.
For instance, they allow us to immediately describe the ‘methylome’ — how the DNA is modified in ways that are cell-specific. Methylation patterns are often influenced by our lifestyle, and can also be associated with specific diseases.
Even after the DNA structure was elucidated, it took several years before methods to actually read nucleic acid sequences became available.
The first such attempt was published in 1965, and sequencing just 76 nucleotides (DNA building blocks) required five people working for three years with one gram of pure material isolated from 140kg of yeast.
More efficient approaches were clearly needed, and much of the progress in the field is owed to Frederick Sanger, a pioneer in reading the sequences of complex biological molecules.
Sanger invented the first method to read protein sequences in 1953, the same year the DNA structure was discovered. This was a landmark for DNA sequencing as well, as it established the general principle of ‘shotgun sequencing’, which is still what we use today to reconstruct the genome of any living being.
Since no technology can read an entire chromosome end-to-end in a single pass, in shotgun sequencing multiple copies of the same chromosome are first fragmented into smaller pieces at random positions, then — like in a puzzle — overlaps between fragments are used to reconstruct the original sequence.
This approach is much less laborious than earlier methods and therefore can scale to thousands, and nowadays to billions of sequences at once.
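The puzzle analogy can be made concrete with a toy sketch in Python. This is an illustration only, not a real assembler (real ones handle sequencing errors, both DNA strands and enormous data volumes): overlapping fragments sampled from many copies of a sequence are merged greedily by their longest overlaps until the original sequence re-emerges.

```python
import random

def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        if best_k == 0:
            break  # no overlaps left: a coverage gap in the puzzle
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads

genome = "ATGGCGTGCAATTC"
# Overlapping fragments, as produced by fragmenting many copies of the genome
reads = [genome[i:i + 6] for i in range(0, len(genome) - 5, 2)]
random.shuffle(reads)  # the fragment order is unknown to the assembler
print(greedy_assemble(reads))  # ['ATGGCGTGCAATTC']
```

Because every overlap of four or more nucleotides in this toy genome is unique, the greedy merging recovers the original sequence regardless of the order the fragments arrive in.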
The approach is not without disadvantages: when scaled to millions or billions of reads, the sequences generated are often only in the order of 150 nucleotides long.
This is a significant limitation because, for instance, the human genome is over three billion nucleotides long and approximately 50 percent of its sequence is repeated more than once.
When a sequence is repeated, the overlaps between the fragments generated by sequencing are not unique, limiting our ability to use this information to reconstruct the original sequence of chromosomes.
This results in many gaps and errors in the sequences generated, ultimately confounding all downstream analyses. This is where long-read sequencing technologies come into play.
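The ambiguity that repeats create can be shown with another toy example (hypothetical sequences, chosen only for illustration): two different genomes containing the same repeated element yield exactly the same collection of short reads, so short reads alone cannot tell them apart, whereas a single long read spanning a repeat copy can.

```python
def kmers(seq, k):
    """The collection of all length-k substrings (short 'reads') of a sequence."""
    return sorted(seq[i:i + k] for i in range(len(seq) - k + 1))

REPEAT = "G" * 10  # a repeated element, present in three copies in each genome
# Two genuinely different genomes: the unique blocks between the repeat
# copies ("CACC" and "TCTT") appear in a different order.
genome1 = "ATTA" + REPEAT + "CACC" + REPEAT + "TCTT" + REPEAT + "AGAA"
genome2 = "ATTA" + REPEAT + "TCTT" + REPEAT + "CACC" + REPEAT + "AGAA"

# Short reads (8 nt) are too short to span a repeat copy: both genomes
# produce exactly the same collection of reads, so assembly is ambiguous.
print(kmers(genome1, 8) == kmers(genome2, 8))  # True

# A long read spanning an entire repeat copy plus unique sequence on both
# sides occurs in only one of the genomes, resolving the ambiguity.
long_read = genome1[:20]  # "ATTAGGGGGGGGGGCACCGG"
print(long_read in genome1, long_read in genome2)  # True False
```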
In long-read sequencing, large DNA fragments, usually in the order of 10,000 to 20,000 nucleotides, often longer than a hundred thousand nucleotides and sometimes even over a million, are read by sequencing machines at once.
Increasing read length by several orders of magnitude essentially resolves the repeat problem associated with ‘short-read sequencing’, allowing entire chromosomes to be reconstructed with minimal computational effort.
While these important advantages should make long reads the technology of choice for most genome projects, adoption is hampered by the higher cost and the relatively limited availability of long-read sequencing machines around the world.
However, sequencing machine companies are releasing new instruments with increased throughput and reduced cost every year, making this less of an issue as we transition to long-read sequencing.
One outstanding challenge is the sequencing errors still present in the reads, although the technologies keep improving.
A challenge that will be harder to overcome is the quality of the starting material. To generate long reads, the DNA material needs to be relatively intact to begin with.
DNA, while one of the longest-lived biological molecules, can still degrade if preservation conditions are not ideal. We need to rethink how we preserve and store biological samples.
DNA sequencing is probably the domain of the biological sciences that has evolved the most over the last century. This is clearly due to our interest, as human beings, in understanding the process of our making, our origins and our destiny.
It will only keep progressing, making this century of discoveries at least as exciting as the last.
Giulio Formenti is Research Assistant Professor at the Rockefeller University in New York. He declares no conflict of interest.
Originally published under Creative Commons by 360info™.