Sequencing interpretation with different databases (notes for each database)

Database notes (answers were looked up through ChatGPT - incredibly fast, but every answer requires validation)

https://brb.nci.nih.gov/seqtools/colexpanno.html


To predict function, each bioinformatics tool uses its own training datasets and its own machine learning approach to build the prediction model, so scores from different tools are not on the same scale.


1000g -- 1000 Genomes Project database -- version 2015aug; ALL (all populations combined), AFR (African), AMR (admixed American), EAS (East Asian), EUR (European), SAS (South Asian)


ExAC -- Exome Aggregation Consortium (exome sequencing data from 60,706 unrelated individuals); ALL, AFR (African/African American), AMR (Latino/admixed American), EAS (East Asian), FIN (Finnish), NFE (Non-Finnish European), OTH ("Other": individuals who were not assigned to any of the other populations), SAS (South Asian)


Exome Sequencing Project (ESP, run by NHLBI); aa (African American ancestry), ea (European American ancestry)


gnomAD (Genome Aggregation Database) - genome/exome; ALL, AFR, AMR, ASJ (Ashkenazi Jewish ancestry), EAS, FIN, NFE, OTH, SAS


Kaviar (Known VARiants) database; AF (Allele Frequency), AC (Allele Count), AN (Allele Number). "Kaviar_AC" is the number of times the variant allele was observed, "Kaviar_AN" is the total number of alleles genotyped at that position in Kaviar, and "Kaviar_AF" = Kaviar_AC / Kaviar_AN is the resulting frequency. These values show how common a variant is and help prioritize variants for further analysis.
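The relation between the three Kaviar fields can be written out as a small Python sketch. The function name is made up for illustration; it is not part of Kaviar or ANNOVAR.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Return AF = AC / AN, where AN is the total number of alleles observed."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele_count must lie between 0 and allele_number")
    return allele_count / allele_number


# Example: a variant allele seen on 13 of 26,000 genotyped alleles
print(allele_frequency(13, 26000))
```

Because AN counts alleles and not people, 26,000 alleles correspond to 13,000 diploid individuals.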


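HRC (Haplotype Reference Consortium); allele frequencies from a large reference panel of human haplotypes that was built mainly for genotype imputation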


GME (Greater Middle East Variome Project; not the GenomeAsia 100K Project) -- AF (overall allele frequency), NWA (Northwest Africa), NEA (Northeast Africa), AP (Arabian Peninsula), Israel, SD (Syrian Desert), TP (Turkish Peninsula), CA (Central Asia)


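NCI-60; a panel of 60 human cancer cell lines from the National Cancer Institute that has been extensively characterized at the genetic and molecular level, including DNA sequencing, gene expression profiling, and proteomics. The annotation column gives the allele frequency of the variant in the NCI-60 exome sequencing data.


Avsnp150 - ANNOVAR's version of dbSNP build 150, the database of genetic variants maintained by the National Center for Biotechnology Information (NCBI). It returns the rs identifier of a variant, after multi-allelic records have been split and left-normalized.


Cosmic70 -- version 70 of COSMIC (Catalogue of Somatic Mutations in Cancer), a database of somatic mutations in cancer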


CLNALLELEID -- the unique ClinVar identifier for the specific variant allele.

CLNDN -- the disease or condition name associated with the variant in ClinVar.

CLNDISDB -- the source databases and identifiers for that disease or condition (for example MedGen or OMIM entries).

CLNREVSTAT -- the ClinVar review status of the variant, i.e. how well the interpretation is supported (for example a single submitter versus an expert panel).

CLNSIG -- the clinical significance of the variant in ClinVar (for example Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, Benign).


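InterVar_automated is the automated classification from InterVar, which applies the guidelines established by the American College of Medical Genetics and Genomics (ACMG) in 2015 for the interpretation of genetic variants. Each variant is scored against the evidence criteria listed below and then assigned to one of five classes: Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, or Benign.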


PVS1 and PS1-PS4 are the very strong and strong criteria for pathogenicity in the ACMG 2015 guidelines for variant interpretation:

  • PVS1 (pathogenic very strong) -- a null variant (nonsense, frameshift, canonical +/-1 or 2 splice site, initiation codon, single- or multi-exon deletion) in a gene where loss of function is a known mechanism of disease
  • PS1 (pathogenic strong) -- same amino acid change as a previously established pathogenic variant, regardless of the nucleotide change
  • PS2 -- de novo variant (maternity and paternity both confirmed) in a patient with the disease and no family history
  • PS3 -- well-established in vitro or in vivo functional studies show a damaging effect on the gene or gene product
  • PS4 -- the prevalence of the variant in affected individuals is significantly higher than in controls


PM1-PM6 are the moderate criteria for pathogenicity (ACMG 2015):

  • PM1 (pathogenic moderate) -- located in a mutational hot spot and/or a critical, well-established functional domain without benign variation
  • PM2 -- absent from controls (or at extremely low frequency if recessive) in population databases such as ESP, 1000 Genomes, or ExAC
  • PM3 -- for recessive disorders, detected in trans with a pathogenic variant
  • PM4 -- protein length changes caused by in-frame deletions/insertions in a non-repeat region, or by stop-loss variants
  • PM5 -- novel missense change at an amino acid residue where a different missense change has already been determined to be pathogenic
  • PM6 -- assumed de novo, but without confirmation of paternity and maternity


PP1-PP5 are the supporting criteria for pathogenicity (ACMG 2015); PP stands for "pathogenic supporting", not benign:

  • PP1 -- the variant co-segregates with disease in multiple affected family members, in a gene definitively known to cause the disease
  • PP2 -- missense variant in a gene that has a low rate of benign missense variation and in which missense variants are a common mechanism of disease
  • PP3 -- multiple lines of computational evidence support a deleterious effect on the gene or gene product (conservation, evolutionary, splicing impact, etc.)
  • PP4 -- the patient's phenotype or family history is highly specific for a disease with a single genetic etiology
  • PP5 -- a reputable source recently reports the variant as pathogenic, but the evidence is not available for independent evaluation


BA1 and BS1-BS4 are the stand-alone and strong criteria for a benign (non-pathogenic) classification (ACMG 2015):

  • BA1 (benign stand-alone) -- allele frequency is > 5% in ESP, 1000 Genomes, or ExAC
  • BS1 (benign strong) -- allele frequency is greater than expected for the disorder
  • BS2 -- observed in a healthy adult individual for a recessive (homozygous), dominant (heterozygous), or X-linked (hemizygous) disorder with full penetrance expected at an early age
  • BS3 -- well-established in vitro or in vivo functional studies show no damaging effect on protein function or splicing
  • BS4 -- lack of segregation in affected members of a family


BP1-BP7 are the supporting criteria for a benign classification (ACMG 2015); BP stands for "benign supporting" and these criteria are not specific to splicing:

  • BP1 -- missense variant in a gene for which primarily truncating variants are known to cause disease
  • BP2 -- observed in trans with a pathogenic variant for a fully penetrant dominant gene/disorder, or observed in cis with a pathogenic variant in any inheritance pattern
  • BP3 -- in-frame deletions/insertions in a repetitive region without a known function
  • BP4 -- multiple lines of computational evidence suggest no impact on the gene or gene product
  • BP5 -- variant found in a case with an alternate molecular basis for disease
  • BP6 -- a reputable source recently reports the variant as benign, but the evidence is not available for independent evaluation
  • BP7 -- a synonymous (silent) variant for which splicing prediction algorithms predict no impact on the splice consensus sequence and no new splice site, and the nucleotide is not highly conserved
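The criteria codes are not used one by one: the ACMG 2015 guidelines combine them into one of five classes. The Python sketch below follows my reading of those combining rules; the function name is hypothetical, it is not InterVar's actual code, and the exact counts in the guideline (for example "1 strong and 1-2 moderate") are relaxed to "at least" checks evaluated from the strongest class downward.

```python
from collections import Counter


def classify_acmg(criteria):
    """Combine ACMG evidence codes (e.g. ["PVS1", "PM2"]) into a five-tier class."""
    # Strip the trailing digits: "PVS1" -> "PVS", "PM2" -> "PM", "BA1" -> "BA"
    n = Counter(code.rstrip("0123456789") for code in criteria)
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Contradictory evidence falls back to uncertain significance
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"


print(classify_acmg(["PM1", "PM2", "PP3", "PP5"]))
```

An automated result like this is only a starting point; criteria such as PS2 or PP4 need family and clinical information that an annotation table does not contain.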


SIFT (Sorting Intolerant From Tolerant); bioinformatics tool used to predict the potential impact of a variant on protein function. It is based on the premise that highly conserved protein sequences are more likely to be functionally important, and variants that alter highly conserved residues are more likely to have an impact on protein function.

SIFT_score; ranges from 0 (most deleterious) to 1 (most tolerated).

SIFT_converted_rankscore; a normalized score. The raw SIFT score is first converted to 1 - SIFT so that larger means more deleterious, then ranked against all SIFT scores in dbNSFP (rank divided by the total number of scores), not only against variants in the same protein. A higher SIFT converted rankscore indicates a higher predicted impact on protein function.

SIFT_pred (prediction); a categorical prediction of the impact of a variant on protein function, based on the SIFT score. Variants with a SIFT score at or below the threshold of 0.05 are classified as "deleterious" and are predicted to affect protein function, while variants with higher SIFT scores are classified as "tolerated" and are predicted to be functionally neutral. D: Deleterious (sift<=0.05); T: Tolerated (sift>0.05)

SIFT score, SIFT converted rankscore, and SIFT prediction are used to provide a quantitative and categorical assessment of the potential impact of a variant on protein function.
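The SIFT cutoff can be expressed as a short Python sketch. The function name is illustrative only; the 0.05 threshold is the one given in these notes.

```python
def sift_pred(score: float, threshold: float = 0.05) -> str:
    """Return "D" (deleterious) if score <= threshold, otherwise "T" (tolerated)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    return "D" if score <= threshold else "T"


for s in (0.0, 0.05, 0.2):
    print(s, sift_pred(s))
```

Note the direction: for SIFT a low score is the worrying one, which is the opposite of most of the other tools in this list.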


Polyphen2 (Polymorphism Phenotyping v2); used to predict the potential impact of a variant on protein function. It is based on the premise that amino acid substitutions that occur at highly conserved sites are more likely to be functionally important and potentially disease-causing.

HDIV (HumDiv) and HVAR (HumVar); two PolyPhen-2 models trained on different datasets, so they have different performance characteristics. HumDiv is the model usually recommended for rare alleles that may be involved in complex phenotypes, where even mildly deleterious variants matter. HumVar is the model usually recommended for the diagnosis of Mendelian diseases, where variants with drastic effects must be separated from all other human variation. Both models classify variants as "benign," "possibly damaging," or "probably damaging."

Polyphen2_HDIV_score; a raw score that reflects the predicted impact of a variant on protein function, with scores ranging from 0 (most benign) to 1 (most damaging).

Polyphen2_HDIV_rankscore; a normalized score: the rank of the raw Polyphen2_HDIV_score among all such scores in dbNSFP, divided by the total number of scores (not a rank within the same protein).

Polyphen2_HDIV_pred;  a categorical prediction of the impact of a variant on protein function, based on the Polyphen2_HDIV_score. D: Probably damaging (>=0.957), P: possibly damaging (0.453<=pp2_hdiv<=0.956), B: benign (pp2_hdiv<=0.452)

Polyphen2_HVAR_score; a raw score that reflects the predicted impact of a variant on protein function, from 0 (most benign) to 1 (most damaging). Polyphen2_HVAR_rankscore is the normalized version: the rank of the raw score among all such scores in dbNSFP, divided by the total number of scores.

Polyphen2_HVAR_pred; a categorical prediction of the impact of a variant on protein function, based on the Polyphen2_HVAR_score. The HVAR cutoffs differ from the HDIV ones. D: Probably damaging (>=0.909), P: possibly damaging (0.447<=pp2_hvar<=0.908); B: benign (pp2_hvar<=0.446)
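The three-way PolyPhen-2 call is a pair of cutoffs applied to the score. In the Python sketch below the function name is made up, and the default cutoffs are the HDIV values listed above; they are parameters so that model-specific values can be passed in.

```python
def polyphen2_pred(score: float, d_min: float = 0.957, p_min: float = 0.453) -> str:
    """Return "D" (probably damaging), "P" (possibly damaging) or "B" (benign)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen-2 scores range from 0 to 1")
    if score >= d_min:
        return "D"
    if score >= p_min:
        return "P"
    return "B"


for s in (0.99, 0.60, 0.10):
    print(s, polyphen2_pred(s))
```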


LRT (Likelihood Ratio Test); predicts the potential impact of a variant on protein function. The likelihood ratio test compares two evolutionary models for the affected codon: one in which it evolves neutrally, and one in which it is conserved (under negative selection). A variant at a position that is significantly conserved across species is more likely to be damaging.

LRT_score; the p-value of the test, from 0 to 1. Lower scores mean stronger evidence of conservation and therefore a higher likelihood that the variant is damaging.

LRT_converted_rankscore; a normalized score. The LRT score is first converted so that larger means more deleterious, then ranked against all LRT scores in dbNSFP (rank divided by the total number of scores).

LRT_pred; a categorical prediction of the impact of a variant on protein function. D: Deleterious; N: Neutral; U: Unknown. Lower LRT scores are more deleterious, but the category is not determined by the score alone.

LRT score, LRT converted rankscore, and LRT prediction are used to provide a quantitative and categorical assessment of the potential impact of a variant on protein function


MutationTaster and MutationAssessor; used to predict the potential impact of a genetic variant on protein function and disease risk. 

MutationTaster; a probabilistic model to predict the impact of a variant on protein function based on various criteria, including conservation, protein structure, and splice site prediction.

MutationTaster_score; a probability between 0 and 1 that the prediction is correct. A value close to 1 means high confidence in the call, whichever category was predicted, so the score must be read together with MutationTaster_pred.

MutationTaster_converted_rankscore; a normalized score. The raw score is first converted so that larger means more damaging, then ranked against all MutationTaster scores in dbNSFP (rank divided by the total number of scores).

MutationTaster_pred; a categorical prediction of the impact of a variant on protein function. A: "disease_causing_automatic"; D: "disease_causing"; N: "polymorphism" (probably harmless); P: "polymorphism_automatic" (known to be harmless). A and D are the damaging categories.

MutationAssessor; predicts the functional impact of an amino acid substitution from the evolutionary conservation of the affected residue in alignments of homologous protein sequences.

MutationAssessor_score; the raw functional impact score; higher scores indicate a higher likelihood that the variant is damaging.

MutationAssessor_score_rankscore; a normalized score: the rank of the raw MutationAssessor_score among all such scores in dbNSFP, divided by the total number of scores.

MutationAssessor_pred; a categorical prediction of the impact of a variant on protein function, based on the MutationAssessor_score. H: high; M: medium; L: low; N: neutral. H/M means functional (likely to affect function) and L/N means non-functional. Higher values are more deleterious.


FATHMM (Functional Analysis through Hidden Markov Models); predict the functional impact of a genetic variant on protein function. It uses a combination of evolutionary and biochemical information to make predictions about the potential impact of a variant on protein function.

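FATHMM_score; the raw prediction score. FATHMM_converted_rankscore; the normalized value (rank among all FATHMM scores in dbNSFP after conversion). FATHMM_pred; D: Deleterious; T: Tolerated. Lower raw values are more deleterious; the usual cutoff is a score below -1.5 for D.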


PROVEAN (Protein Variation Effect Analyzer); predict the functional impact of genetic variants on protein function. It compares the sequence of the wild-type protein and the mutated protein to predict whether the mutation is likely to be deleterious or neutral.

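PROVEAN_score; the raw prediction score. PROVEAN_converted_rankscore; the normalized value (rank among all PROVEAN scores in dbNSFP after conversion). PROVEAN_pred; D: Deleterious; N: Neutral. Lower (more negative) raw values are more deleterious; the usual cutoff is a score at or below -2.5 for D. Only the converted rankscore runs the other way, with higher meaning more deleterious.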


VEST (Variant Effect Scoring Tool); predict the potential impact of genetic variants on protein function. It uses a random forest classifier trained on a variety of features related to sequence conservation, protein structure, and other genomic and functional annotations. 

VEST3_score; VEST3_rankscore (scores range from 0 to 1; higher scores are more likely deleterious)


MetaSVM; predicts the functional impact of missense variants. It is an ensemble score: a support vector machine (SVM) combines the scores of several individual predictors (conservation-based and function-based tools such as those above) with allele frequency information.

MetaSVM_score; MetaSVM_rankscore; MetaSVM_pred -- D: Deleterious; T: Tolerated; higher scores are more deleterious.


MetaLR; uses logistic regression to integrate nine independent variant deleteriousness scores and allele frequency information to predict the deleteriousness of missense variants. Variants are classified as 'tolerated' or 'damaging'; a score between 0 and 1 is also provided and variants with higher scores are more likely to be deleterious.

MetaLR_score, MetaLR_rankscore, MetaLR_pred -- D: Deleterious; T: Tolerated; higher scores are more deleterious
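Since every tool has its own letters for "damaging", a quick way to get an overview is to count how many *_pred columns call a variant damaging. The Python sketch below uses the prediction letters listed in these notes. The function name is hypothetical, counting PolyPhen-2 "P" as damaging is my own choice, and the column names assume an ANNOVAR table annotated with dbNSFP.

```python
# Letters treated as damaging for each prediction column (taken from the notes above)
DAMAGING = {
    "SIFT_pred": {"D"},
    "Polyphen2_HDIV_pred": {"D", "P"},
    "Polyphen2_HVAR_pred": {"D", "P"},
    "LRT_pred": {"D"},
    "MutationTaster_pred": {"A", "D"},
    "MutationAssessor_pred": {"H", "M"},
    "FATHMM_pred": {"D"},
    "PROVEAN_pred": {"D"},
    "MetaSVM_pred": {"D"},
    "MetaLR_pred": {"D"},
}


def count_damaging(row):
    """Return (damaging calls, tools that made a call); "." or missing means no call."""
    called = 0
    damaging = 0
    for column, letters in DAMAGING.items():
        value = row.get(column, ".")
        if value in (".", ""):
            continue
        called += 1
        if value in letters:
            damaging += 1
    return damaging, called


row = {"SIFT_pred": "D", "Polyphen2_HDIV_pred": "B",
       "MutationTaster_pred": "A", "PROVEAN_pred": "."}
print(count_damaging(row))
```

Working from the letters instead of the raw scores avoids mixing up tools where low scores are deleterious (SIFT, FATHMM, PROVEAN, LRT) with tools where high scores are.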


M-CAP (Mendelian Clinically Applicable Pathogenicity); predict the pathogenicity of missense variants in the human genome. It helps researchers and clinicians to identify genetic variants that may cause disease.

M.CAP_score, M.CAP_rankscore, M.CAP_pred -- categorical prediction of whether a missense variant is pathogenic or benign. D: possibly pathogenic (score > 0.025); T: likely benign.


REVEL (Rare Exome Variant Ensemble Learner); predicts the pathogenicity of missense variants in the human genome. It is an ensemble method: a random forest combines the scores of multiple individual prediction tools to estimate the likelihood that a particular variant is disease-causing. Scores range from 0 to 1, and higher scores are more likely pathogenic.

REVEL_score, REVEL_rankscore


MutPred -- predict the pathogenicity of missense variants in the human genome. It uses machine learning algorithms to estimate the likelihood that a particular amino acid substitution is disease-causing or deleterious based on various features derived from protein sequences and structures.

MutPred_score; MutPred_rankscore


CADD (Combined Annotation-Dependent Depletion); scoring the deleteriousness of genetic variants, including single nucleotide variants (SNVs) and short insertions and deletions (indels), in the human genome. It integrates multiple diverse annotations and conservation metrics to provide a single score that estimates the potential pathogenicity of a variant.

CADD_raw; CADD_raw_rankscore; CADD_phred (a transformation of the raw CADD score onto a PHRED-like scale, the log scale that genomics also uses for base and variant quality. The CADD_phred score is easier to interpret because it is a rank: higher values mean the variant ranks higher for predicted deleteriousness among all possible single nucleotide substitutions in the reference genome. A CADD_phred score of 10 means the variant is in the top 10% most deleterious, 20 means the top 1%, and 30 means the top 0.1%.)
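The PHRED-like scaling can be checked with a few lines of Python. This is a sketch that assumes the usual definition phred = -10 * log10(rank / total); the function names are made up for illustration.

```python
import math


def cadd_phred_from_rank(rank: int, total: int) -> float:
    """PHRED-like score for the variant ranked `rank` (1 = most deleterious) of `total`."""
    if not 1 <= rank <= total:
        raise ValueError("rank must lie between 1 and total")
    return -10 * math.log10(rank / total)


def cadd_top_fraction(phred: float) -> float:
    """Fraction of all variants scoring at least as deleterious as this PHRED value."""
    return 10 ** (-phred / 10)


for phred in (10, 20, 30):
    print(phred, cadd_top_fraction(phred))
```

The scale is logarithmic, so every 10 points is one more factor of ten in rank; the difference between 20 and 30 is not "50% more deleterious".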


DANN (Deleterious Annotation of genetic variants using Neural Networks); predicts the functional impact of non-coding and coding single nucleotide variants (SNVs) in the human genome. It uses a deep learning approach based on artificial neural networks to estimate the likelihood that a particular variant is deleterious or pathogenic.

DANN_score; DANN_rankscore


FATHMM-MKL (Functional Analysis Through Hidden Markov Models - Multiple Kernel Learning); predicting the functional impact of coding and non-coding single nucleotide variants (SNVs) in the human genome. It combines multiple diverse data sources and employs a multiple kernel learning approach to estimate the likelihood that a particular variant is deleterious or pathogenic.

fathmm.MKL_coding_score; fathmm.MKL_coding_rankscore; fathmm.MKL_coding_pred (D: Deleterious; T: Tolerated. Score >= 0.5: D; Score < 0.5: T)


Eigen - predicts the functional impact of non-coding and coding single nucleotide variants (SNVs) in the human genome. The Eigen method is unsupervised: it integrates diverse genomic annotations, such as conservation metrics, functional genomic data, and annotations of known regulatory elements, into a single summary score for each variant without training on labeled variants.

Eigen_coding_or_noncoding - indicates whether the Eigen score of the variant comes from the coding model or the noncoding model.

Eigen.raw - raw Eigen score for a given variant, representing the predicted functional impact of that variant.

Eigen.PC.raw - raw Eigen-PC score, a variant of the method in which the annotations are weighted by the first principal component of their covariance matrix.


GenoCanyon; analyzes and predicts the functional potential of genomic regions in the human genome. It is designed to identify functional elements in the genome by integrating diverse genomic data, such as sequence conservation across species, epigenetic markers, and transcription factor binding sites. GenoCanyon employs an unsupervised statistical learning method to assess the functional potential of each base pair in the human genome without relying on prior knowledge of known functional elements or annotations.

GenoCanyon_score; GenoCanyon_score_rankscore


integrated_fitCons; estimates the fitness consequences of functional genomic elements in the human genome, i.e. how likely a mutation at a position is to affect fitness. It groups genomic positions by their functional genomic signatures (from assays such as ChIP-seq, DNase-seq, and RNA-seq) and combines these with patterns of conservation and polymorphism; "integrated" means the score combines several cell types. This helps researchers and clinicians focus on the most relevant regions when investigating gene regulation, functional genomics, or disease-causing variants.

integrated_fitCons_score; integrated_fitCons_score_rankscore; integrated_confidence_value (higher scores are more deleterious)


GERP++_RS (Genomic Evolutionary Rate Profiling) -- quantifies evolutionary constraint on genomic positions by comparing multiple sequence alignments. GERP++_RS is the GERP++ rejected substitution (RS) score, which estimates the number of substitutions "rejected" by the evolutionary process at each position. Higher scores mean stronger constraint, so a variant there is more likely deleterious.

GERP++_RS_rankscore; the rank of the GERP++_RS score of a position among all GERP++_RS scores in dbNSFP, divided by the total number of scores.


phyloP100way_vertebrate (phylogenetic p-values from an alignment of 100 vertebrate species; see the dbNSFP information table for details); measures conservation at individual positions in a multiple sequence alignment. Positive scores indicate conserved positions (evolving more slowly than expected) and negative scores indicate fast-evolving positions.

phyloP100way_vertebrate_rankscore


phyloP20way_mammalian (phylogenetic p-values from an alignment of 20 mammalian species); measures conservation at individual positions in a multiple sequence alignment, in the same way as the 100-way vertebrate score but restricted to mammals. Higher scores mean stronger conservation, so a variant there is more likely deleterious.

phyloP20way_mammalian_rankscore


phastCons100way_vertebrate; phastCons score from an alignment of 100 vertebrate species; identifies conserved elements in multiple sequence alignments with a phylogenetic hidden Markov model (phylo-HMM). The score at each position is the probability (0 to 1) that the position belongs to a conserved element.

phastCons100way_vertebrate_rankscore


phastCons20way_mammalian; identifies conserved elements in multiple sequence alignments with a phylogenetic hidden Markov model (phylo-HMM), using 20 mammalian species. The score at each position is the probability (0 to 1) that the position belongs to a conserved element; higher scores mean stronger conservation, so a variant there is more likely deleterious.

phastCons20way_mammalian_rankscore


SiPhy; a method that uses a probabilistic model to detect conserved elements in multiple sequence alignments.

SiPhy_29way_logOdds -- log odds score from an alignment of 29 species; probabilistic framework (HMM); higher scores are more deleterious

SiPhy_29way_logOdds_rankscore


Interpro_domain; the protein domain(s) in which the variant falls, according to InterPro, a database of protein families, domains, and functional sites. The annotation gives information about the function and structure of the affected part of the protein.


GTEx_V6p_gene; GTEx (Genotype-Tissue Expression) is a comprehensive public resource for studying human gene expression and its relationship with genetic variation across a wide range of tissue types. V6p is one of the data releases of the project (version 6). In the dbNSFP annotation this column lists the target gene(s) whose expression is associated with the variant, i.e. the gene(s) for which the variant is an expression QTL (eQTL).

GTEx_V6p_tissue -- the tissue type(s) in which that variant-gene eQTL association was detected in the GTEx V6p data.


Regsnp - annotation of regulatory single nucleotide polymorphisms (regSNPs). RegSNPs are genetic variants that can affect gene regulation, such as gene expression or splicing, potentially leading to phenotypic changes or disease susceptibility.

Regsnp_fpr; false positive rate (FPR) associated with the prediction. The FPR is the proportion of negative instances incorrectly predicted as positive; here it is the rate at which harmless SNPs would be mistakenly predicted to be damaging at this score, so a lower value means a more confident prediction.

Regsnp_disease; the predicted disease impact of the SNP, i.e. whether it is likely to alter gene regulation or splicing in a way that results in a disease phenotype.

Regsnp_splicing_site; whether the SNP lies on a splicing site. Splicing sites are essential for the proper processing of pre-mRNA into mature mRNA, and variants at these sites can disrupt splicing, leading to aberrant transcripts and potential disease phenotypes.


Genotype -- het -- heterozygous (the change is found on only one of the two alleles)

Depth -- the number of sequencing reads that cover that position. The higher the depth, the more reliable the call compared with a position at lower depth. The depth to aim for also depends on how large the genome is, because genome size determines the minimum depth that should be obtained.


When assessing whether a variant could explain a rare disease:

  • Look at the allele frequency, for example from the ExAC database. The value is an allele frequency: 0.001 means the variant allele is seen on about 1 of every 1,000 chromosomes (not in 1 of every 1,000 people, because each person carries two alleles). For rare disease, keep variants with frequency < 0.05, i.e. fewer than 5 of every 100 alleles carry the variant. The cutoff still has to fit the genetics of the disease and changes with the disease model; for example, for an autosomal recessive disease the frequency cutoff may be set lower than this.

  • If starting from the phenotype --> first check which genes are known to be associated with the disease, filter out the high-frequency variants, and consider only low-frequency variants in genes that are likely to be related to the disease.

  • Then remove synonymous variants (nucleotide changes that leave the amino acid unchanged) and remove deep intronic variants (positions far from the donor and acceptor sites).

  • Focus on non-synonymous variants (missense, stop gain, stop loss, frameshift)
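The filtering steps above can be sketched in Python. This is a minimal example under several assumptions: the column names (ExAC_ALL, Func.refGene, ExonicFunc.refGene, Gene.refGene) and the value strings follow a typical ANNOVAR refGene annotation and may differ in your table, the gene names are placeholders, and keeping "splicing" variants is my own choice because they lie next to the donor and acceptor sites.

```python
# Exonic consequences to keep (missense, stop gain, stop loss, frameshift)
KEEP_EXONIC = {
    "nonsynonymous SNV", "stopgain", "stoploss",
    "frameshift deletion", "frameshift insertion",
}


def parse_af(value):
    """ANNOVAR writes "." when a variant is absent from a database; treat that as AF 0."""
    return 0.0 if value in (".", "", None) else float(value)


def keep_variant(row, max_af=0.05, candidate_genes=None):
    """Return True if the variant passes the frequency, gene and consequence filters."""
    if parse_af(row.get("ExAC_ALL")) >= max_af:
        return False  # too common for a rare disease
    if candidate_genes is not None and row["Gene.refGene"] not in candidate_genes:
        return False  # gene not linked to the phenotype
    if row["Func.refGene"] == "splicing":
        return True   # close to a donor/acceptor site
    # Drops synonymous, intronic and other non-exonic variants
    return row["Func.refGene"] == "exonic" and row["ExonicFunc.refGene"] in KEEP_EXONIC


rows = [
    {"Gene.refGene": "GENE_A", "Func.refGene": "exonic",
     "ExonicFunc.refGene": "nonsynonymous SNV", "ExAC_ALL": "0.0002"},
    {"Gene.refGene": "GENE_A", "Func.refGene": "exonic",
     "ExonicFunc.refGene": "synonymous SNV", "ExAC_ALL": "."},
    {"Gene.refGene": "GENE_B", "Func.refGene": "intronic",
     "ExonicFunc.refGene": ".", "ExAC_ALL": "0.001"},
    {"Gene.refGene": "GENE_B", "Func.refGene": "exonic",
     "ExonicFunc.refGene": "stopgain", "ExAC_ALL": "0.21"},
]
kept = [r for r in rows if keep_variant(r)]
print([(r["Gene.refGene"], r["ExonicFunc.refGene"]) for r in kept])
```

The max_af default matches the 0.05 cutoff mentioned above; a stricter value can be passed when the disease model calls for it.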
