A long-context language model for deciphering and generating bacteriophage genomes

Affiliations
  • 1. Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China. shaobinlx@gmail.com.
  • 2. Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, USA. shaobinlx@gmail.com.
  • 3. Independent researcher, 100 N Gushan Rd, Shanghai, 200135, China.

Abstract

Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity, and the taxonomy of unannotated sequences. Furthermore, it generates de novo sequences of up to 96 kbp that contain potential regulatory elements and annotated proteins with phage-related functions.
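
To make "nucleotide-level tokenization" concrete, the following is a minimal sketch in which every base becomes one token. The vocabulary, the special tokens, and the function names `encode` and `decode` are assumptions chosen for illustration; they are not taken from the megaDNA code and may differ from what the authors use.

```python
# Hypothetical vocabulary for illustration only; not megaDNA's actual vocabulary.
VOCAB = ["<pad>", "<start>", "<end>", "A", "C", "G", "T", "N"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}
BASES = {"A", "C", "G", "T", "N"}


def encode(sequence: str) -> list[int]:
    """Map a DNA string to token ids, one token per nucleotide.

    Characters outside A/C/G/T are mapped to the "N" token (an assumption).
    """
    ids = [TOKEN_TO_ID["<start>"]]
    for base in sequence.upper():
        ids.append(TOKEN_TO_ID.get(base, TOKEN_TO_ID["N"]))
    ids.append(TOKEN_TO_ID["<end>"])
    return ids


def decode(ids: list[int]) -> str:
    """Map token ids back to a DNA string, dropping special tokens."""
    return "".join(VOCAB[i] for i in ids if VOCAB[i] in BASES)


if __name__ == "__main__":
    token_ids = encode("acgtx")
    print(token_ids)
    print(decode(token_ids))
```

With one token per base, a 96 kbp genome becomes a sequence of roughly 96,000 tokens, which is why a long-context architecture is needed for this kind of input.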
