Each living cell contains DNA molecules in which all of its genetic information is stored. Part of this DNA corresponds to genes, the elements that code for functional proteins. The remainder, wrongly called junk DNA, contains elements that mainly contribute to the structure of DNA and to mechanisms of polymorphism, self-defense and regulation. The genome is the set of DNA molecules in a cell. These molecules can be of different kinds: chromosomes, plasmids and viral structures in the broad sense. Genomics, the understanding and analysis of genomes, makes it possible to list all the biological functions of an organism and to carry out complex phylogenetic studies.
To access the genome, it must first be ensured that the isolated cells come from a single individual.
A poor selection at the start can lead to the reconstruction of chimeric genetic elements. The DNA is then
isolated by extraction methods that lyse the cell membranes (and the nuclear membrane in eukaryotes) and purify
the DNA molecules. The extraction is quality-controlled by a spectral approach that estimates the proportion
of DNA relative to co-purified proteins. This step does not detect any PCR inhibitors that might
also be present.
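
The spectral control described above can be illustrated with a short sketch. It assumes the common practice of comparing absorbance at 260 nm (nucleic acids) and 280 nm (proteins); the text does not name the wavelengths, and the function names and the 1.7 cut-off are hypothetical choices made for illustration only.

```python
def purity_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 absorbance ratio of a DNA extract."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def looks_protein_contaminated(a260: float, a280: float,
                               threshold: float = 1.7) -> bool:
    # The threshold is an illustrative assumption: a ratio near 1.8 is
    # usually quoted for pure DNA, and lower values suggest protein carry-over.
    return purity_ratio(a260, a280) < threshold


if __name__ == "__main__":
    print(purity_ratio(0.9, 0.5))
    print(looks_protein_contaminated(0.6, 0.5))
```

As the paragraph notes, a good ratio says nothing about PCR inhibitors, so a check like this is only a first filter.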
The DNA is then fragmented into pieces of 500 bp because NGS sequencing technologies are limited in the length of DNA molecule they can read. They have another limitation: to sequence a DNA fragment, the machine needs a sufficient quantity of it to obtain a robust signal. An amplification step is therefore performed, which allows both the isolation of fragments and the attachment of sequencing adapters. Sequencing itself then starts and produces a set of sequences with their quality scores.
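
To make the notion of a quality score concrete, here is a minimal sketch. It assumes the scores are Phred values stored as ASCII characters with an offset of 33, as in the widely used FASTQ format; the text does not specify the encoding, and the function names are hypothetical.

```python
def decode_phred33(quality_string: str) -> list[int]:
    """Convert an ASCII quality string into integer Phred scores (offset 33 assumed)."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q: int) -> float:
    """Probability that a base call is wrong, from the Phred definition Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    bases = "ACGT"
    qualities = "II5#"  # made-up example values, one character per base
    for base, q in zip(bases, decode_phred33(qualities)):
        print(base, q, error_probability(q))
```

Under this convention a score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.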
There are two types of information processing: the qualification of raw sequencer data and the exploitation of
genetic information. The first qualification steps are cleaning phases that consist of selecting good-quality sequences
(filtering) and/or good-quality bases (trimming). Depending on the nature of the experiment,
steps for detecting PCR chimeras can also be added.
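
The difference between trimming and filtering can be sketched as follows. This is a simplified illustration and not the behavior of any particular tool: it assumes each read comes with a list of integer quality scores, and the thresholds (20 per base, mean of 25, minimum length of 3) are arbitrary example values.

```python
Read = tuple[str, list[int]]


def trim_3prime(seq: str, quals: list[int], min_q: int = 20) -> Read:
    """Trimming: remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_filter(quals: list[int], min_mean_q: float = 25, min_len: int = 3) -> bool:
    """Filtering: keep a read only if it is long enough and good enough on average."""
    return len(quals) >= min_len and sum(quals) / len(quals) >= min_mean_q


def clean_reads(reads: list[Read]) -> list[Read]:
    """Trim every read, then discard those that fail the filter."""
    kept = []
    for seq, quals in reads:
        seq, quals = trim_3prime(seq, quals)
        if passes_filter(quals):
            kept.append((seq, quals))
    return kept


if __name__ == "__main__":
    reads = [
        ("ACGTAC", [38, 37, 36, 30, 10, 5]),  # good start, poor 3' end
        ("GGGG", [12, 10, 8, 6]),             # poor throughout
    ]
    print(clean_reads(reads))
```

Here the first read is shortened to its reliable prefix, while the second is discarded entirely.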
The second step is the reconstruction of the DNA molecules from the fragments. There are two strategies: using a reference genome as a model (mapping) or searching for overlapping regions between sequences (de novo assembly). The two strategies can be combined to improve the discovery of previously unknown regions. Depending on the quality of the sequencing coverage, contigs (assembled fragments) are obtained that correspond totally or partially to the chromosome. If regions are missing, it is possible to estimate the size of the missing region and join the contigs into a scaffold.
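
The de novo idea of searching for overlaps, and the joining of contigs across a gap of estimated size, can be sketched as follows. This is a toy illustration with exact matching only. Real assemblers tolerate sequencing errors and use far more efficient data structures. The function names are hypothetical, and filling the gap with `N` characters follows a common convention that the text does not mention.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into one contig if they overlap sufficiently, else return None."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]


def scaffold(contig_a: str, contig_b: str, gap_size: int) -> str:
    """Join two contigs across a gap of estimated size, filled with N (unknown base)."""
    return contig_a + "N" * gap_size + contig_b


if __name__ == "__main__":
    print(merge_reads("ATGCGTAC", "GTACCTGA"))
    print(scaffold("ATGCGTAC", "TTGACA", 5))
```

The two example reads share the four bases `GTAC`, so they merge into a single 12-base contig.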
The final step in qualifying the raw data is the annotation phase. It is often based on the reference genome of the species, but a more comprehensive approach at the kingdom level is also possible. The annotation of Bacteria and Archaea remains easier because the definition of coding regions (ORFs) is more consistent. In eukaryotes, it is necessary to take into account less constrained genetic contexts (e.g. the Kozak sequence).
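
To show why ORF detection is comparatively simple in prokaryotes, here is a deliberately naive sketch. It assumes that an ORF starts at an `ATG` codon and ends at the first in-frame stop codon (`TAA`, `TAG` or `TGA`), and it scans only the forward strand. Real annotation tools also handle the reverse strand, alternative start codons and ribosome binding sites. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of forward-strand ORFs.

    Positions are 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and look for the next start codon
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

In eukaryotes a scan like this is not sufficient, since introns interrupt the coding sequence and the context around the start codon is less constrained.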
The exploitation phase of genomic data is very broad and depends on the biological objective sought. We find:
Genomics and bioinformatics have several applications in the agri-food sector:
It is also possible to perform bioinformatic analyses on all available genomic data
in order to carry out in silico screenings for particular functions or pathogenicity capabilities.
It is important to note, however, that the presence of genetic elements or of a plasmid does not necessarily imply that the corresponding ability is expressed under environmental conditions. Indeed, genes are subject to genetic and epigenetic regulation. A supercoiled plasmid, for example, is not usable by the bacterium until it is relaxed by a topoisomerase.