Glossary - Genomics and Genetics

Allele – Classically an allele is an alternative form of a gene that is expressed in the phenotype. However, in NGS an allele is one form of a sequence variant that occurs in any position on any chromosome, or a sequence variant on any sequence read aligned to the genome. In some cases, the term allele is used interchangeably with the term genotype.

Annotation - The process of identifying the locations of genes and all other elements in the genome as well as their functions. This may include the likely functional impact of variants.

BAM file - BAM is a binary sequence file format that uses BGZF compression and indexing. BAM is the binary compressed version of the SAM (Sequence Alignment/Map) format, which contains information about each sequence read in an NGS data set. This includes its alignment position on a reference genome, variants in the read versus the reference genome, mapping quality, and the sequence quality.

Clinical Actionability - The ability to use genomic data to change clinical management or therapy.

Coverage – The number of sequence reads covering a particular position in the genome or the average number of aligned reads that overlap all positions on the target genome.
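As a rough illustration of both senses of coverage (depth at one position and mean depth over a target), the sketch below counts overlapping reads from a list of aligned intervals. The function names, the 0-based end-exclusive interval convention, and the toy reads are assumptions made for this example, not the behaviour of any particular tool.

```python
def per_position_depth(reads, target_length):
    """Count how many reads overlap each position of a target region.

    reads: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumed convention for this sketch).
    """
    depth = [0] * target_length
    for start, end in reads:
        # Clip each read to the target so out-of-range reads are ignored.
        for pos in range(max(start, 0), min(end, target_length)):
            depth[pos] += 1
    return depth


def mean_coverage(reads, target_length):
    """Average number of reads overlapping each position of the target."""
    depth = per_position_depth(reads, target_length)
    return sum(depth) / target_length


# Three hypothetical reads aligned to a 10-base target.
toy_reads = [(0, 5), (3, 8), (4, 10)]
print(per_position_depth(toy_reads, 10))
print(mean_coverage(toy_reads, 10))
```

Real pipelines usually obtain these numbers from an aligned BAM file rather than from explicit interval lists.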

Chromatin – the protein-DNA complex that makes up a chromosome. The state of chromatin – open or closed - usually determines whether a gene is expressed or not.

CRISPR/Cas9 – A technique commonly used for performing genome editing.

De novo sequencing - The sequencing of the genome of a new, previously unsequenced organism or DNA segment. This term is also used whenever a genome (or sequence data set) is assembled by methods of sequence overlap without the use of a known reference sequence. De novo sequencing might be used for a region of a known genome that has significant mutations and/or structural variation from the reference.

Dominant - Refers to the member of a pair of alleles that is expressed in the phenotype of the organism while the other allele is not, even though both alleles are present. It is the opposite of recessive.

Dominant negative mutation - A mutation that dominantly affects the phenotype by means of a defective protein or RNA molecule that interferes with the function of the normal gene product from the same cell.

Epigenetics – the study of how alterations to chromatin structure, rather than DNA sequence itself, can impact heritable characteristics. The sum total of chemical changes to DNA and its associated histone proteins is sometimes known as the epigenome.

Exome sequencing - Technique for enriching and sequencing most or all of the protein-coding gene segments (exons) in a genome.

FASTQ file - A text file format for NGS reads that contains both the DNA sequence and quality information about each base.
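A minimal sketch of reading FASTQ text, assuming the common four-line record layout (header starting with "@", sequence, separator starting with "+", quality string) and the Phred+33 ASCII encoding of qualities. Both are assumptions here: older data sets may use a different quality offset, and the function name parse_fastq is hypothetical.

```python
def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records into (read_id, sequence, qualities).

    offset=33 assumes Phred+33 quality encoding.
    """
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        # Unpacking fails with ValueError if the last record is truncated.
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # One quality score per base, decoded from the ASCII character.
        records.append((header[1:], seq, [ord(c) - offset for c in qual]))
    return records


example = "@read1\nGATTA\n+\nII5!#\n"
for read_id, seq, quals in parse_fastq(example):
    print(read_id, seq, quals)
```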

Genome - Totality of genetic information belonging to a cell or an organism.

Genome editing – a technology that allows particular mutations to be introduced into a genome with high specificity and efficiency.

Genotype - Genetic constitution of an individual cell or organism.

Histone proteins – form the beads (nucleosomes) around which DNA is wrapped in the chromatin that makes up chromosomes.

Homolog - Refers to one of two or more genes that are similar in sequence as a result of derivation from a common ancestral gene.

Indels - Insertions or deletions in one DNA sequence with respect to another. Indels may be a product of errors in DNA sequencing, the result of alignment errors, or true mutations in one sequence with respect to another. In NGS, indels are detected in sequence reads after alignment to a reference genome.

Knockout mouse – a mouse completely lacking a particular gene that may model some aspect of a human phenotype.

Linkage - refers to coinheritance of two genetic loci that lie near each other on the same chromosome. The closer together the two loci, the greater the linkage and the lower the frequency of recombination between them.

Loss-of-function variant - variant causing the reduction or complete loss of a gene product, thereby impairing its biochemical function. Most loss-of-function variants are only predicted, with no supporting experimental evidence.

Mendelian disease - a genetic disease determined by a single locus, exhibiting an inheritance pattern that follows the laws of Mendel.

Mouse model – a genetically altered mouse that models some aspect of a human disease phenotype. This might be because it has a DNA alteration in the same gene that causes a human phenotype, or a related gene.

Mutation - The changing of the structure of DNA, resulting in a variant form which may be transmitted to subsequent generations, caused by the alteration of single base units in DNA, or the deletion, insertion, or rearrangement of larger sections of genes or chromosomes.

Next generation sequencing (NGS) – any technology that allows the very rapid sequencing of a whole genome or related population of DNA molecules. Also known as deep sequencing. DNA bases are sequenced from many millions of DNA templates in a single reaction volume. The sequences of all templates are determined in parallel (massively parallel sequencing). Can also be used to examine gene expression (RNAseq) and chromatin structure (ChIPseq).

Paired-end sequencing - A technology that obtains sequence reads from both ends of a DNA fragment template. The use of paired-end sequencing can greatly improve overall sequencing quality by allowing contigs to be joined when they contain read pairs from a single template fragment.

Penetrance - the frequency, expressed as a fraction or a percentage, with which a genotype results in a particular phenotype. If only a proportion of people carrying the genotype display the phenotype, the trait is said to show incomplete penetrance. If all carriers show the phenotype then the trait is said to have complete or full penetrance.
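The fraction in this definition can be written out as a small calculation. This is only a sketch; the function names and the carrier counts are hypothetical.

```python
def penetrance(affected_carriers, total_carriers):
    """Fraction of genotype carriers who display the phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= affected_carriers <= total_carriers:
        raise ValueError("affected_carriers must lie between 0 and total_carriers")
    return affected_carriers / total_carriers


def describe(p):
    """Label a penetrance value as complete (all carriers affected) or incomplete."""
    return "complete" if p == 1.0 else "incomplete"


# Hypothetical cohort: 18 of 24 carriers show the phenotype.
p = penetrance(18, 24)
print(f"{p:.0%} penetrance ({describe(p)})")
```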

Phenotype – the collection of observable or measurable traits of an individual.

Phred score – Widely used in NGS to measure sequence quality. Phred assigns a quality score to each base, which encodes the estimated probability that the base call is wrong. The Phred score is -10 times the log (base 10) of the error probability, and therefore a base with an accuracy of 99% (error probability 0.01) receives a Phred score of 20. Lower Phred scores signify poorer quality and hence potentially inaccurate data.
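The relationship between Phred score Q and error probability P is Q = -10 × log10(P), so P = 10^(-Q/10). A short sketch of the conversion in both directions follows; the function names are chosen for illustration only.

```python
import math


def phred_from_error(p_error):
    """Phred quality score for a given base-call error probability."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


# 99% accuracy means an error probability of 0.01, i.e. Phred 20.
print(round(phred_from_error(0.01)))
# Each additional 10 points corresponds to a tenfold lower error probability.
for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```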

Polymorphism - A variant that appears in at least 1% of a population. This value is arbitrary and has been established in human genetics by convention.

Recessive - Refers to the member of a pair of alleles that fails to be expressed in the phenotype of the organism when the dominant allele is present. Also refers to the phenotype of the individual that has only the recessive allele.

Reference sequence - The formally recognized, official sequence of a known genome, gene, or artificial DNA construct. A reference sequence is usually stored in a public database and may be referred to by an accession number or other designation, such as human genome hg19.

Sanger sequencing - The method developed by Frederick Sanger in 1977 to determine the nucleotide sequence of cloned, purified DNA fragments based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication. Widely used to validate potential candidate mutations identified by NGS.

Sequence alignment - An algorithmic approach to find the best matching of consecutive letters in one sequence (text symbols that represent the polymer subunits of DNA or protein sequences) with another. Generally sequence alignment methods balance gaps with mismatches, and the relative scoring of these two features can be adjusted by the user.
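To make the trade-off between gaps and mismatches concrete, the sketch below computes a global alignment score by dynamic programming (the classic Needleman-Wunsch recurrence, used here as one illustrative method, not the one any specific NGS aligner uses). The scoring values are arbitrary defaults chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    match, mismatch and gap are user-adjustable scores; the defaults
    are arbitrary values chosen for this sketch.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two letters, gap in b, or gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# With costly gaps the best alignment keeps two mismatches;
# with cheaper gaps it opens two gaps to gain a third match.
print(global_alignment_score("ACGT", "AGCT"))
print(global_alignment_score("ACGT", "AGCT", gap=-1))
```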

Sequence assembly - A computational process of finding overlaps of identical (or nearly identical) strings of letters among a set of sequence fragments and iteratively joining them together to form longer sequences.

Sequence read - When DNA sequence is obtained by any experimental method, including both Sanger and NGS, the data are obtained from individual template molecules as a string of nucleotide bases (G, A, T, C). This string of letters is called a sequence read.

Single nucleotide polymorphism (SNP) - A polymorphic variation at a single position in a DNA sequence among individuals. If a SNP occurs within a gene, then the gene is described as having more than one allele.

Transgenic mouse – sometimes used to refer to any sort of mouse mutant; traditionally, it describes a mouse that has a piece of foreign or altered DNA inserted into its genome (a transgene) that expresses a novel gene product, often at high levels.

Variant Call Format (VCF) – A generic file format that was used by the 1000 Genomes project for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome.

Variant Calling - identifying the nucleotide or structural differences between a sequence of interest and the reference sequence.

Variant Effect Predictor (VEP) – A widely used tool within Ensembl for the functional annotation of variants generated by NGS.

Variants - Differences at specific positions between two aligned sequences. Variants include single-nucleotide polymorphisms (SNPs), insertions and deletions, copy number variants, and structural rearrangements.
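As a simple illustration of differences at specific positions between two aligned sequences, the sketch below walks two already-aligned strings column by column and reports substitutions and single-base gaps. It assumes "-" marks a gap and that both strings have equal length. The function name and the toy sequences are hypothetical, and larger events such as copy number variants are out of scope.

```python
def list_variants(ref_aligned, alt_aligned, gap="-"):
    """Report per-column differences between two aligned sequences.

    Returns tuples of (alignment_column, kind, ref_base, alt_base),
    with 0-based columns (an assumed convention for this sketch).
    """
    if len(ref_aligned) != len(alt_aligned):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    for col, (r, a) in enumerate(zip(ref_aligned, alt_aligned)):
        if r == a:
            continue
        if r == gap:
            kind = "insertion"      # base present only in the other sequence
        elif a == gap:
            kind = "deletion"       # reference base missing in the other sequence
        else:
            kind = "substitution"   # single-base change
        variants.append((col, kind, r, a))
    return variants


for variant in list_variants("GATTACA", "GACTA-A"):
    print(variant)
```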