The gastropod mollusk Aplysia is an important model for cellular and molecular neurobiological studies, particularly for investigations of molecular mechanisms of learning and memory. We developed an optimized assembly pipeline to generate an improved Aplysia nervous system transcriptome. This improved transcriptome enabled us to explore the evolution of cognitive capacity at the molecular level. Were there evolutionary expansions of neuronal genes between this relatively simple gastropod Aplysia (20,000 neurons) and Octopus (500 million neurons), the invertebrate with the most elaborate neuronal circuitry and greatest behavioral complexity? Are the tremendous advances in cognitive power in vertebrates explained by expansion of the synaptic proteome that resulted from multiple rounds of whole genome duplication in this clade? Overall, the complement of genes linked to neuronal function is similar between Octopus and Aplysia. As expected, a number of synaptic scaffold proteins have more isoforms in humans than in Aplysia or Octopus. However, several scaffold families present in mollusks and other protostomes are absent in vertebrates, including the Fifes, Lev10s, SOLs, and a NETO family. Thus, whereas vertebrates have more scaffold isoforms from select families, invertebrates have additional scaffold protein families not found in vertebrates. This analysis provides insights into the evolution of the synaptic proteome. Both synaptic proteins and synaptic plasticity evolved gradually, yet the last deuterostome-protostome common ancestor already possessed an elaborate suite of genes associated with synaptic function, and critical for synaptic plasticity.
Motivation: Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term "error correction" to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available. Results: We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error-corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.
Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu. Contact: gmarcais@umd.edu.
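The link between sequencing errors and low-count k-mers described in the Motivation can be illustrated with a small sketch. It is not QuorUM's algorithm: the function names, the toy reads, and the count threshold are all hypothetical and chosen only to show why a single wrong base produces several rare k-mers, and why a 1% error rate at 100× coverage gives about one erroneous read base per genome position.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_count_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (candidate errors).

    min_count is a hypothetical, hand-picked threshold for this toy example.
    """
    return {kmer for kmer, n in counts.items() if n < min_count}


def expected_errors_per_position(error_rate, coverage):
    """Expected number of erroneous read bases covering one genome position."""
    return error_rate * coverage


# Five error-free copies of a read plus one copy with a single substitution
# (A -> T at index 4). These reads are made up for illustration.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspects = low_count_kmers(counts, min_count=2)
```

In this toy data set the single substituted base introduces four distinct 4-mers that each occur once, while the true k-mers occur five or more times. Real error correctors must additionally handle uneven coverage and repeats, which this sketch ignores.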
Abstract Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals¹. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Abstract Background: The release of the first reference genome of walnut (Juglans regia L.) enabled many achievements in the characterization of walnut genetic and functional variation. However, it is highly fragmented, preventing the integration of genetic, transcriptomic, and proteomic information to fully elucidate walnut biological processes. Findings: Here, we report the new chromosome-scale assembly of the walnut reference genome (Chandler v2.0) obtained by combining Oxford Nanopore long-read sequencing with chromosome conformation capture (Hi-C) technology. Relative to the previous reference genome, the new assembly features an 84.4-fold increase in N50 size, with the 16 chromosomal pseudomolecules assembled and representing 95% of its total length. Using full-length transcripts from single-molecule real-time sequencing, we predicted 37,554 gene models, with a mean gene length greater than that of the previous gene annotations. Most of the new protein-coding genes (90%) present both start and stop codons, which represents a significant improvement compared with Chandler v1.0 (only 48%). We then tested the potential impact of the new chromosome-level genome on different areas of walnut research. By studying the proteome changes occurring during male flower development, we observed that the virtual proteome obtained from Chandler v2.0 presents fewer artifacts than that of the previous reference genome, enabling the identification of a new potential pollen allergen in walnut. Also, the new chromosome-scale genome facilitates in-depth studies of intraspecies genetic diversity by revealing previously undetected autozygous regions in Chandler, likely resulting from inbreeding, and 195 genomic regions highly differentiated between Western and Eastern walnut cultivars. Conclusion: Overall, Chandler v2.0 will serve as a valuable resource to better understand and explore walnut biology.
Abstract Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reported. There are 2 species of domestic water buffalo, the river (2n = 50) and the swamp (2n = 48) buffalo. Here we describe a draft quality reference sequence for the river buffalo created from Illumina GA and Roche 454 short read sequences using the MaSuRCA assembler. The assembled sequence is 2.83 Gb, consisting of 366 983 scaffolds with a scaffold N50 of 1.41 Mb and contig N50 of 21 398 bp. Annotation of the genome was supported by transcriptome data from 30 tissues and identified 21 711 predicted protein coding genes. Searches for complete mammalian BUSCO gene groups found 98.6% of curated single copy orthologs present among predicted genes, which suggests a high level of completeness of the genome. The annotated sequence is available from NCBI at accession GCA_000471725.1.
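The scaffold and contig N50 values quoted above follow the usual definition: the length L such that sequences of length at least L together contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name and the example lengths are hypothetical and are not taken from the buffalo assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; N50 is the length of the
    sequence at which the running total first reaches half of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Made-up contig lengths: total 54, so half is 27, reached at 10 + 9 + 8.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why completeness measures such as the BUSCO search are reported alongside it.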
Abstract Adaptation requires genetic variation, but founder populations are generally genetically depleted. Here we sequence two populations of an inbred ant that diverge in phenotype to determine how variability is generated. Cardiocondyla obscurior has the smallest of the sequenced ant genomes and its structure suggests a fundamental role of transposable elements (TEs) in adaptive evolution. Accumulations of TEs (TE islands) comprising 7.18% of the genome evolve faster than other regions with regard to single-nucleotide variants, gene/exon duplications and deletions and gene homology. A non-random distribution of gene families, larvae/adult specific gene expression and signs of differential methylation in TE islands indicate intragenomic differences in regulation, evolutionary rates and coalescent effective population size. Our study reveals a tripartite interplay between TEs, life history and adaptation in an invasive species.
Ants are some of the most abundant and familiar animals on Earth, and they play vital roles in most terrestrial ecosystems. Although all ants are eusocial, and display a variety of complex and fascinating behaviors, few genomic resources exist for them. Here, we report the draft genome sequence of a particularly widespread and well-studied species, the invasive Argentine ant (Linepithema humile), which was accomplished using a combination of 454 (Roche) and Illumina sequencing and community-based funding rather than federal grant support. Manual annotation of >1,000 genes from a variety of different gene families and functional classes reveals unique features of the Argentine ant's biology, as well as similarities to Apis mellifera and Nasonia vitripennis. Distinctive features of the Argentine ant genome include remarkable expansions of gustatory (116 genes) and odorant receptors (367 genes), an abundance of cytochrome P450 genes (>110), lineage-specific expansions of yellow/major royal jelly proteins and desaturases, and complete CpG DNA methylation and RNAi toolkits. The Argentine ant genome contains fewer immune genes than Drosophila and Tribolium, which may reflect the prominent role played by behavioral and chemical suppression of pathogens. Analysis of the ratio of observed to expected CpG nucleotides for genes in the reproductive development and apoptosis pathways suggests higher levels of methylation than in the genome overall. The resources provided by this genome sequence will offer an abundance of tools for researchers seeking to illuminate the fascinating biology of this emerging model organism.
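The observed-to-expected CpG ratio mentioned above can be sketched as follows. The abstract does not state which normalization was used, so this example assumes one common form, CpG o/e = (number of CG dinucleotides × sequence length) / (number of C × number of G); the function name and test sequences are hypothetical. A ratio well below 1 is commonly read as a sign of historical methylation, because methylated cytosines in CpG context tend to be lost by deamination over evolutionary time.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for one DNA sequence.

    Assumed normalization: (CG count * length) / (C count * G count).
    Returns None when the ratio is undefined (no C or no G in the sequence).
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return None
    return cpg * n / (c * g)


# Toy sequences, not real genes.
ratio_enriched = cpg_obs_exp("CGCG")   # every C is followed by G
ratio_neutral = cpg_obs_exp("CCGG")    # one CpG among two C and two G
ratio_undefined = cpg_obs_exp("ATAT")  # no C or G present
```

In practice the ratio is computed per gene and the distributions for gene sets are compared with the genome-wide distribution, which this per-sequence sketch leaves to the caller.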
The American lobster, Homarus americanus, is integral to marine ecosystems and supports an important commercial fishery. This iconic species also serves as a valuable model for deciphering neural networks controlling rhythmic motor patterns and olfaction. Here, we report a high-quality draft assembly of the H. americanus genome with 25,284 predicted gene models. Analysis of the neural gene complement revealed extraordinary development of the chemosensory machinery, including a profound diversification of ligand-gated ion channels and secretory molecules. The discovery of a novel class of chimeric receptors coupling pattern recognition and neurotransmitter binding suggests a deep integration between the neural and immune systems. A robust repertoire of genes involved in innate immunity, genome stability, cell survival, chemical defense, and cuticle formation represents a diversity of defense mechanisms essential to thrive in the benthic marine environment. Together, these unique evolutionary adaptations contribute to the longevity and ecological success of this long-lived benthic predator.
Ensemble Kalman filtering was developed as a way to assimilate observed data to track the current state in a computational model. In this paper we show that the ensemble approach makes possible an additional benefit: the timing of observations, whether they occur at the assimilation time or at some earlier or later time, can be effectively accounted for at low computational expense. In the case of linear dynamics, the technique is equivalent to instantaneously assimilating data as they are measured. The results of numerical tests of the technique on a simple model problem are shown.
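For readers unfamiliar with the basic assimilation step, the following is a minimal sketch of an ensemble Kalman analysis for a single scalar state that is observed directly. It illustrates only the standard idea of estimating forecast uncertainty from the ensemble spread; it does not implement the observation-timing technique of this paper. The function name, the deterministic (square-root style) update, and the numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def enkf_scalar_update(ensemble, obs, obs_var):
    """Deterministic ensemble Kalman analysis for a directly observed scalar.

    ensemble: forecast values of the state, one per ensemble member
    obs:      the observed value
    obs_var:  observation error variance (assumed positive)
    """
    m = mean(ensemble)
    p = variance(ensemble)           # sample variance of the forecast ensemble
    gain = p / (p + obs_var)         # scalar Kalman gain
    m_analysis = m + gain * (obs - m)
    shrink = sqrt(1.0 - gain)        # scales spread so variance is (1 - gain) * p
    return [m_analysis + shrink * (x - m) for x in ensemble]


# Hypothetical forecast ensemble with mean 2.0 and sample variance 2.5,
# combined with an observation of 4.0 that has the same error variance.
forecast = [0.0, 1.0, 2.0, 3.0, 4.0]
analysis = enkf_scalar_update(forecast, obs=4.0, obs_var=2.5)
```

With equal forecast and observation variances the gain is 0.5, so the analysis mean lies halfway between the forecast mean and the observation, and the ensemble variance is halved.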
In this paper, we introduce a new, local formulation of the ensemble Kalman filter approach for atmospheric data assimilation. Our scheme is based on the hypothesis that, when the Earth’s surface is divided up into local regions of moderate size, vectors of the forecast uncertainties in such regions tend to lie in a subspace of much lower dimension than that of the full atmospheric state vector of such a region. Ensemble Kalman filters, in general, take the analysis resulting from the data assimilation to lie in the same subspace as the expected forecast error. Under our hypothesis the dimension of the subspace corresponding to local regions is low. This is used in our scheme to allow operations only on relatively low-dimensional matrices. The data assimilation analysis is performed locally in a manner allowing massively parallel computation to be exploited. The local analyses are then used to construct global states for advancement to the next forecast time. One advantage, which may take on more importance as ever-increasing amounts of remotely-sensed satellite data become available, is the favorable scaling of the computational cost of our method with increasing data size, as compared to other methods that assimilate data sequentially. The method, its potential advantages, properties, and implementation requirements are illustrated by numerical experiments on the Lorenz-96 model. It is found that accurate analysis can be achieved at a cost which is very modest compared to that of a full global ensemble Kalman filter.
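The Lorenz-96 model used for the numerical experiments is a cyclic system of ordinary differential equations, dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F. A minimal sketch of the model and one fourth-order Runge-Kutta step is given below. The forcing F = 8, the 40-variable state, the time step, and the function names are assumptions for illustration and are not taken from the abstract.

```python
def lorenz96_tendency(x, forcing=8.0):
    """Time derivative of the Lorenz-96 state x with cyclic boundaries."""
    n = len(x)
    # Negative indices wrap around in Python, giving the cyclic x[i-1], x[i-2].
    return [
        (x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
        for i in range(n)
    ]


def rk4_step(x, dt, forcing=8.0):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(shifted(x, k1, dt / 2), forcing)
    k3 = lorenz96_tendency(shifted(x, k2, dt / 2), forcing)
    k4 = lorenz96_tendency(shifted(x, k3, dt), forcing)
    return [
        xi + dt / 6 * (a + 2 * b + 2 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


# The uniform state x_i = F is an equilibrium: its tendency is zero everywhere.
rest_state = [8.0] * 40
rest_tendency = lorenz96_tendency(rest_state)

# A small perturbation of one variable moves the state off the equilibrium.
perturbed = list(rest_state)
perturbed[0] += 0.01
next_state = rk4_step(perturbed, dt=0.05)
```

In an ensemble experiment, each ensemble member would be advanced with such a step between analysis times, and the local analyses described above would then be applied region by region.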