GeneDistiller

This software is being further developed. You might occasionally discover new features that haven't yet made it into the manual. We apologise in advance if this causes confusion, but we want to make new features available as soon as possible.

Query mask

If you check the box display online help in the first line, pop-ups will display some help whenever the mouse pointer is moved over an input element. Below, some examples illustrate the use, capabilities and limitations of GeneDistiller.
The query interface consists of different sections which can be either displayed or hidden.

general settings

The general section lets you select the species (at present, only human genes can be queried, but mouse genes will be included soon) and, in a drop-down menu, the order in which the results are to be displayed (order / prioritise genes). Genes can be ordered either by their physical position on the chromosome (default) or by other features. GeneDistiller can also perform a prioritisation based on user-defined criteria. These options are indicated by a preceding 'prioritise' and focus the prioritisation on different information (e.g. phenotype data or tissue-specific expression). When prioritisation is selected, another segment will appear that allows fine-tuning of the weights assigned to each parameter. This option can be used to raise or lower the influence of certain kinds of data in the prioritisation.

target genes

At first, possible candidate genes must be defined. This can be done in several ways:

interval-based

by selecting the chromosome and the physical position* of start and end of an interval
by selecting the first recombinant microsatellite or SNP markers outside an interval obtained by linkage analysis
by selecting neighbouring genes that delineate the interval

At least one delimiter of the interval must be specified in a text field. If only physical positions are used, the chromosome must be selected in a drop-down menu. The types of delimiter can be mixed, e.g. an interval can begin with a genetic marker and end with a physical position.

* Physical positions below 400 are treated as Mb.
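
The footnote above can be read as a simple unit rule. The following is a minimal sketch of that rule only, not GeneDistiller's actual code; the function name to_base_pairs and the constant names are made up for illustration.

```python
# Sketch of the "below 400 means Mb" rule from the footnote.
# to_base_pairs, MB_THRESHOLD and BASES_PER_MB are hypothetical names.
MB_THRESHOLD = 400          # positions below this value are read as Mb
BASES_PER_MB = 1_000_000


def to_base_pairs(position: float) -> int:
    """Convert a user-entered physical position to base pairs."""
    if position < MB_THRESHOLD:
        # Small numbers are interpreted as megabases.
        return round(position * BASES_PER_MB)
    # Anything else is taken to be a position in base pairs already.
    return round(position)


print(to_base_pairs(45.2))      # 45200000
print(to_base_pairs(45200000))  # 45200000
```

Under this reading, entering 45.2 and entering 45200000 address the same position.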

gene-based

by directly specifying genes to be treated as candidates [single gene(s)]

Target genes are specified either by their NCBI Entrez gene ID or by their HGNC gene symbol (in distinct input fields). It is also possible to search for known disease genes (with OMIM IDs or keywords) that are then treated as candidate genes. These different means of addressing target genes can be combined. However, the search within intervals and the search for single genes cannot be combined. If you specify both, the region will be ignored.

genome-wide / mitochondriome
If whole genome analysis or mitochondrial genes à la MitoCarta analysis is selected, the whole genome or all genes encoding mitochondrial proteins, respectively, will be analysed. This option ignores pre-set intervals!

output
The output can be restricted to certain types of genes using the type drop-down menu, e.g. to protein-coding genes, non-coding RNAs (ncRNA) etc.

number of target genes
The number of genes within an interval can be quickly accessed by clicking the count button. Here, no filters except for the limits directly above will be applied.

comparison with known genes or phenotypes

Target genes (defined in Target genes) can be compared with genes known to be causing similar disease phenotypes. These genes can be defined either by their HGNC symbol (synonyms cannot be used) or NCBI ID (both specified in the same input field, comma separated). Furthermore, OMIM IDs or parts of the title of an OMIM record can be entered; here all genes connected with the respective OMIM records will be added to the list of genes to be compared.

Related genes (or proteins) can be found by similarity of expression patterns in all available tissues (co-expression, by Pearson correlation) and/or via protein-protein interactions. Both options can be activated in the first row of this menu by selecting the check boxes compare expression (which might take a few extra milliseconds or even seconds) and search for interactions. Additionally, the annotations (such as protein domains, KEGG pathways or GO IDs) of the genes the target genes are compared with are studied. Similarities are highlighted and, if desired, used for the prioritisation.
By entering a cut-off value, only similarly expressed genes (with a score above the cut-off) will be listed. Expression similarity can vary between -1 and 1 (-1 indicates perfect negative correlation).
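
To illustrate the cut-off, here is a minimal sketch of a Pearson correlation between expression profiles followed by a cut-off filter. It is an illustration of the general technique, not GeneDistiller's implementation; the function names, gene names and expression values are all made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def similar_genes(reference, candidates, cutoff):
    """Keep candidates whose correlation with the reference exceeds the cut-off."""
    kept = {}
    for gene, profile in candidates.items():
        r = pearson(reference, profile)
        if r > cutoff:
            kept[gene] = r
    return kept


# Made-up expression values across four tissues.
reference = [1, 2, 3, 4]
candidates = {
    "GENE_A": [2, 4, 6, 8],   # perfectly correlated (r = 1)
    "GENE_B": [4, 3, 2, 1],   # perfectly anti-correlated (r = -1)
    "GENE_C": [1, 3, 2, 4],   # partly correlated (r = 0.8)
}
print(sorted(similar_genes(reference, candidates, 0.9)))  # ['GENE_A']
```

Lowering the cut-off to 0.5 would also keep GENE_C in this toy data set.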

display options

Here, the users can choose which data shall be included in the output and how the output shall be formatted. They can also choose to shorten some phenotypic information to the first paragraph (OMIM) or the title (MGD phenotypes). Please note that the MGD phenotype information for human genes does not include any details.

Hyperlinks to the original data source are provided whenever possible. For each transcript, a further hyperlink to ExonPrimer is generated.

The settings made here do not affect the GeneDistiller prioritisation or the selections made by the user, i.e. user-defined conditions are regarded even when the corresponding data is not displayed.

phenotypes

This section allows choosing which phenotypic information shall be included.

Highlight keywords will search for occurrences of the entered words in the genes' OMIM reports, geneRIFs and descriptions. These terms will be highlighted in different colours in the output.
The checkbox below restricts the output to those genes with at least one of the words in their phenotypic description. Please note that this is a full text search which can require some time when a large number of genes is queried.
Highlight these Human Phenotype Ontology IDs (and their subclasses) will highlight the specified Human Phenotype Ontology IDs when they are connected with a gene. This includes any subordinate terms, e.g. when the user selects the HPO ID of Abnormality of the heart (HP:0001627), subclasses like Cardiomegaly will also be highlighted. HPO IDs must be entered as numbers! The checkbox below restricts the output to those genes to which at least one of the phenotypes applies.
Highlight these MGD phenotypes will highlight the presence of one or more selected MGD phenotypes connected with a gene. The checkbox below restricts the output to those genes to which at least one of the phenotypes applies.
Highlight these GO IDs (and their subclasses) will highlight the selected GO terms when they are assigned to a gene. This includes any subordinate terms, e.g. when the user selects the GO ID of nucleic acid binding (GO:0003676), subclasses of nucleic acid binding like DNA binding will also be highlighted. GO IDs must be entered as numbers!
The checkbox below restricts the output to those genes to which at least one of the terms (or their subclasses) applies.
Highlight these KEGG pathways (specified by number) will highlight the defined KEGG pathways when they are assigned to a gene. KEGG pathways must be entered as numbers! The checkbox below restricts the output to those genes to which at least one of the pathways applies.
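
The handling of subclasses for HPO and GO IDs can be pictured as collecting all descendants of the selected terms before matching them against a gene's annotations. The sketch below assumes a simple parent-to-children table; the function name, the gene names and the tiny excerpt of the GO hierarchy are for illustration only and do not reflect how GeneDistiller stores its data.

```python
def with_subclasses(selected, children):
    """Return the selected term IDs together with all of their subclasses."""
    expanded = set()
    stack = list(selected)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(children.get(term, ()))
    return expanded


# Toy excerpt of the GO hierarchy, with IDs entered as plain numbers.
children = {
    3676: [3677, 3723],  # nucleic acid binding -> DNA binding, RNA binding
    3677: [3690],        # DNA binding -> double-stranded DNA binding
}

# Made-up gene annotations (5515 is protein binding).
annotations = {"GENE_A": {3690}, "GENE_B": {5515}}

selected = with_subclasses([3676], children)
highlighted = {gene: terms & selected for gene, terms in annotations.items()}

# With the restricting checkbox set, only genes with a hit would remain.
kept = [gene for gene, hits in highlighted.items() if hits]

print(sorted(selected))  # [3676, 3677, 3690, 3723]
print(kept)              # ['GENE_A']
```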

expression

In this segment, users can choose among different tissues which can be queried for expression. Tissues are sorted hierarchically. The checkboxes on the left include tissue-specific expression in the output. The elements on the right side define conditions, which can be used either to exclude genes that do not fulfil them (show only genes fulfilling the conditions) or to prioritise genes according to how well they fulfil them (rank genes by their fulfillment of conditions). To include a condition, the operator (usually >) must be selected and a value specified. These values indicate the gene's expression in this tissue and are defined as [value] x median expression. Hence, to find a gene that is expressed above median, the condition should simply be '>1'. Use higher values to screen for genes with an expected high expression. In case of exclusive queries, conditions can be combined either with AND or with OR (Connect expression conditions with).
When more than one probe exists for a given gene (according to Affymetrix' annotations), the mean of the expression values of all these probes will be used.
Please note that although prioritisation according to tissue-specific expression will normally list the correct gene among the top 5 genes (provided the right tissues have been selected), the correct gene will not always make it to the first place.
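
The expression conditions described in this section can be sketched as follows. This is only an illustration under stated assumptions: values are taken to be already expressed as multiples of the median, the set of operators is assumed, and all function names, gene names and numbers are made up.

```python
import operator
from statistics import mean

# Comparison operators offered for a condition (assumed set).
OPERATORS = {">": operator.gt, "<": operator.lt}


def gene_value(probe_values):
    """Average the values of all probes annotated to one gene."""
    return mean(probe_values)


def fulfils(expression, conditions, connect="AND"):
    """Check one gene against the expression conditions.

    expression: tissue -> expression in multiples of the median
    conditions: list of (tissue, operator, threshold)
    connect:    "AND" or "OR"
    """
    results = [
        tissue in expression and OPERATORS[op](expression[tissue], threshold)
        for tissue, op, threshold in conditions
    ]
    return all(results) if connect == "AND" else any(results)


# Made-up values; GENE_A has two probes for whole brain.
genes = {
    "GENE_A": {"whole brain": gene_value([2.0, 3.0]), "prefrontal cortex": 4.0},
    "GENE_B": {"whole brain": 1.5, "prefrontal cortex": 0.8},
}
conditions = [("whole brain", ">", 1), ("prefrontal cortex", ">", 3)]

kept_and = [g for g, expr in genes.items() if fulfils(expr, conditions, "AND")]
kept_or = [g for g, expr in genes.items() if fulfils(expr, conditions, "OR")]
print(kept_and)  # ['GENE_A']
print(kept_or)   # ['GENE_A', 'GENE_B']
```

In this sketch a tissue without a value counts as not fulfilling its condition; whether GeneDistiller treats missing expression data the same way is not stated in this manual.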

cellular localisation

In this part of the query interface, the localisation (cellular, extracellular or organellar, respectively) of the gene products can be queried. The locations are presented in a hierarchical structure and can be selected by checking the boxes. If a gene product is located in one of the selected structures (or a substructure of them), its location will be highlighted in the output.

Above the localisations, a further checkbox restricts the output to those genes which fulfil the conditions, i.e. whose gene products are located in any of the selected structures.

prioritisation settings

The prioritisation settings allow fine-tuning the weight given to each parameter when a prioritisation approach is chosen. For instance, the impact of occurrences of search terms in OMIM reports can be increased by increasing the value assigned to OMIM text. When fields are set to zero, these parameters will not be used for the prioritisation.
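
How the weights act can be pictured with a plain weighted sum. Note that this manual does not describe GeneDistiller's actual scoring formula, so the sketch below is an assumption that only illustrates the effect of raising a weight or setting it to zero; the function names, parameter names other than OMIM text, gene names and values are made up.

```python
def priority_score(evidence, weights):
    """Weighted sum of a gene's evidence values (assumed scoring scheme).

    A weight of zero removes that kind of evidence from the score.
    """
    return sum(weights.get(kind, 0) * value for kind, value in evidence.items())


def rank(genes, weights):
    """Order gene names by descending score."""
    return sorted(genes, key=lambda g: priority_score(genes[g], weights), reverse=True)


# Made-up evidence values per gene.
genes = {
    "GENE_A": {"OMIM text": 2, "interaction": 0, "expression": 1.0},
    "GENE_B": {"OMIM text": 0, "interaction": 1, "expression": 0.5},
}

equal_weights = {"OMIM text": 1, "interaction": 1, "expression": 1}
print(rank(genes, equal_weights))  # ['GENE_A', 'GENE_B']

# Switch off OMIM text and emphasise interactions instead.
pathway_focus = {"OMIM text": 0, "interaction": 5, "expression": 1}
print(rank(genes, pathway_focus))  # ['GENE_B', 'GENE_A']
```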

This section opens automatically when a prioritisation approach is selected under order / prioritise genes; values entered here are not considered when another sorting approach is chosen.

A note on synonyms

GeneDistiller stores synonyms for genes. However, when synonyms are used instead of gene symbols, GeneDistiller will display a warning message presenting the appropriate gene symbol(s) instead.
Synonyms cannot be used directly as this might cause ambiguities - the same synonym may name more than one gene now or in the future.
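
The behaviour described above, a warning that lists the official symbol(s) rather than a silent substitution, might look like this in a minimal sketch. The lookup tables, the gene names and the wording of the messages are hypothetical.

```python
def resolve(name, symbols, synonyms):
    """Return (symbol, warning); exactly one of the two is None."""
    if name in symbols:
        return name, None
    matches = sorted(synonyms.get(name, ()))
    if matches:
        # A synonym is never substituted silently: it may name several genes.
        return None, f"'{name}' is a synonym; please use: {', '.join(matches)}"
    return None, f"'{name}' is not a known gene symbol"


# Hypothetical symbols and synonyms; SHARED names two different genes.
symbols = {"GENE1", "GENE2", "GENE3"}
synonyms = {"OLD1": {"GENE1"}, "SHARED": {"GENE2", "GENE3"}}

print(resolve("GENE1", symbols, synonyms))
# ('GENE1', None)
print(resolve("SHARED", symbols, synonyms))
# (None, "'SHARED' is a synonym; please use: GENE2, GENE3")
```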

Examples

Positional candidates

First example:
Show all protein-coding genes in the interval defined by the microsatellite markers D15S1042 and D15S659, ordered by their position. To create the interval, set microsatellite (from) to D15S1042 and microsatellite (to) to D15S659. Now change type to protein coding to display only the protein-coding genes. The sorting order is adjusted with the drop-down menu order / prioritise genes; position is the default value.
Show example #1.

Comparison with known phenotypes

Second example:
Leigh syndrome is a disease group with a number of different aetiologies. Common to this group, however, is that the affected genes have a function in the mitochondrion. As genes involved in the same disease, pathway or organelle (here the mitochondrial genes) are frequently co-regulated, the prioritisation is focussed on genes with a common mitochondrial expression pattern and mitochondrial organellar localisation. Imagine that mutations in the LRPPRC gene had not yet been found as the cause of the French-Canadian type of Leigh syndrome. A candidate region of 5.2 cM between the markers D2S2294 and D2S2291, containing 15 genes, would have been mapped. Therefore we choose the prioritisation setting (order / prioritise genes) prioritise with focus on possible pathways (interaction and expression similarity) and additionally increase the weight of the prioritisation settings Maestro score and Mitopred to 5.
Show example #2.
As a result we see the LRPPRC gene ranking top, mainly due to its expression correlation with other Leigh syndrome related genes (PDHA1, COX15, NDUFV1, PC, SURF1, NDUFS3, NDUFS4, NDUFS8, DLD, NDUFS7) and its mitochondrial localisation.

Third example:
Imagine that TSC2 had not yet been found as a second gene that causes Tuberous sclerosis if mutated. A pedigree with several individuals affected with Tuberous sclerosis would have been mapped and a candidate region between D16S521 and D16S3124 delineated. This interval comprises 2.3 Mbp and 126 known genes. As the researcher already knows that TSC1 may cause Tuberous sclerosis if mutated, he or she may assume that the new candidate in the interval might interact with TSC1 or show the same expression. Therefore we choose the prioritisation setting (order / prioritise genes) prioritise with focus on possible pathways (interaction and expression similarity).
Show example #3.
As a result we see the TSC2 gene ranking top (score = 36.9, next gene in succession only 8.2) among the 126 genes in the region, mainly due to its protein-protein interaction with TSC1.

Fourth example:
In a homozygosity mapping for nephronophthisis, a target region between microsatellite D16S475 and SNP rs1529917 was found to be associated with the disease. These markers limit the interval to be analysed by GeneDistiller and we enter them at the correct places (microsatellite, from: D16S475; dbSNP ID, to: rs1529917). We also hope to find a renal phenotype described for the disease-causing gene in MGD. In the query interface, we therefore select renal/urinary system phenotype under highlight these MGD phenotypes and check show only genes to which at least one of these phenotypes was assigned to reduce the number of genes. We also expect the gene to be expressed in the kidney, so we open the expression tab and check the box left of kidney.
Assuming this is all the information we have in advance, the query mask would look like this:
Show example #4.
After clicking on submit, GeneDistiller will display 2 genes (out of 48 that are located in the interval). We can see that while one of the genes (DNASE1) comes with detailed data, the information available for the other one (GLIS2) is scarce (there are, for instance, no interactions listed and no expression values available). However, we can read in the OMIM report that the latter gene is indeed responsible for nephronophthisis.
Note: We have not used OMIM terms, because the OMIM entry was created after this gene was identified.

Tissue-specific expression

Fifth example:
This example suggests candidate genes on the basis of their tissue-specific expression in the brain or its substructures (ordered by: prioritise with focus on tissue-specific expression). Here, genes likely to be involved in GEFS+ (Generalized epilepsy with febrile seizures plus) are ranked. Note that no phenotypic criteria are given in example 5a. In example 5b, more background knowledge is applied.
Show example #5a. | Show example #5b.

Examples from the manuscript

In our manuscript, we describe two different strategies, selection and prioritisation, to determine the most likely gene involved in epilepsy from a 60 Mbp region on chromosome 2, a region containing more than 300 genes of all classes.

Selection:
Genes are filtered for those with a known murine nervous system phenotype and behaviour/neurological phenotype (select both values in the MGD phenotypes and limit the query to the respective genes with the show only genes to which at least one of these phenotypes was assigned checkbox). A further condensation can be reached when known human phenotypes are considered: enter the broad term brain into the field highlight these keywords and restrict the search to genes in whose descriptions this keyword appears (check show only genes with at least one of these words in their OMIM reports). Note that the more specific epilepsy is not used in this example because we cannot be sure in advance that our candidate is already known to cause epilepsy in humans.
Since a gene responsible for epilepsy is likely to be expressed in brain, open the expression tab and select >1 (x median) for the expression in whole brain. The output is restricted to genes with an expression above median when show only genes fulfilling the conditions is selected. Setting a filter for prefrontal cortex expression > 3 (x median) and connecting both expression filters with AND further shortens the list. Add the Gene Ontology ID for ion transport (GO:0006811) into the highlight these GO IDs field and restrict the search to those genes carrying this GO ID or a subclass (show only genes to which at least one of these GO IDs applies checkbox).
Selection example
Now, only 2 genes, SCN1A and SCN3A, remain in the list, both of which are excellent candidates for an epilepsy phenotype.

Prioritisation:
If you change the order / prioritise genes drop-down to prioritise with focus on possible pathways, uncheck all the restricting checkboxes and change the expression setting to rank genes by their fulfillment of conditions, a prioritisation strategy will be applied. To search for similarities with genes known to be involved in epilepsy, enter the term epilepsy into the compare with these OMIM entries (MIM ID or keyword) field and check compare expression and search for interactions.
Prioritisation example
Again, SCN1A will be listed on top. Another gene, SCN2A, will appear as the second best candidate - it was not considered in the selection approach because no mouse phenotypes have been described yet.

Output

GeneDistiller prints the desired data in HTML format. If figures, e.g. for expression data, are included, they will be produced as PNG and seamlessly integrated into the output.
Below the actual output, two hyperlinks are presented. The first one will restore the query mask with the settings made by the user, the second will restore the actual output. Bookmark the second link if you only want to return to the output page (you might as well save the page); bookmark the first one if you might want to modify your settings at a later date.

The output includes hyperlinks to the original data on the providers' web pages.

Browsers

GeneDistiller has been developed on Mozilla Firefox 2. It has also been tested with Microsoft Internet Explorer. However, it should work with any web browser with JavaScript enabled.

Integrated data

NCBI Entrez Gene (gene information and geneRIFs)
Maglott D, Ostell J, Pruitt KD, Tatusova T.
Entrez Gene: gene-centered information at NCBI.
Nucleic Acids Res. 2007 Jan;35(Database issue):D26-31. Epub 2006 Dec 5.

Mitchell JA, Aronson AR, Mork JG, Folk LC, Humphrey SM, Ward JM.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
AMIA Annu Symp Proc. 2003;:460-4.
NCBI dbSNP (SNP positions)
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K.
dbSNP: the NCBI database of genetic variation.
Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
NCBI UniSTS (microsatellite marker positions)
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E.
Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res. 2007 Jan;35(Database issue):D5-12. Epub 2006 Dec 14.
ENSEMBL (genes, exons, transcripts)
Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Ouverdin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E.
Ensembl 2007.
Nucleic Acids Res. 2007 Jan;35(Database issue):D610-7. Epub 2006 Dec 5.
Swiss-Prot (only protein IDs)
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M.
The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.
Nucleic Acids Res. 2003 Jan 1;31(1):365-70.
STRING
Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C.
STRING 8--a global view on proteins and their functional interactions in 630 organisms.
Nucleic Acids Res. 2009 Jan;37(Database issue):D412-6. Epub 2008 Oct 21.
MGD phenotypes (mouse phenotypes linked to a gene, mapped to human genes)
Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE; Mouse Genome Database Group.
The mouse genome database (MGD): new features facilitating a model system.
Nucleic Acids Res. 2007 Jan;35(Database issue):D630-7. Epub 2006 Nov 29.
OMIM (phenotype information linked to a gene)
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA.
Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.
Nucleic Acids Res. 2005 Jan 1;33(Database issue):D514-7.
GeneAtlas (tissue-specific expression of a gene)
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB.
A gene atlas of the mouse and human protein-encoding transcriptomes.
Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7. Epub 2004 Apr 9.
GO (gene ontology)
Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS, Sethuraman A, Theesfeld CL, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee SY, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz EM, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la Cruz N, Tonellato P, Jaiswal P, Seigfried T, White R; Gene Ontology Consortium.
The Gene Ontology (GO) database and informatics resource.
Nucleic Acids Res. 2004 Jan 1;32(Database issue):D258-61.
Human Phenotype Ontology
Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S.
The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease.
Am J Hum Genet. 2008 Nov;83(5):610-5. Epub 2008 Oct 23.
Maestro & Mitopred (prediction of mitochondrial genes)
Calvo S, Jain M, Xie X, Sheth SA, Chang B, Goldberger OA, Spinazzola A, Zeviani M, Carr SA, Mootha VK.
Systematic identification of human mitochondrial disease genes through integrative genomics.
Nat Genet. 2006 May;38(5):576-82. Epub 2006 Apr 2.

Guda C, Guda P, Fahy E, Subramaniam S.
MITOPRED: a web server for the prediction of mitochondrial proteins.
Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W372-4.
InterPro
Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Nikolskaya AN, Orchard S, Orengo C, Petryszak R, Selengut JD, Sigrist CJ, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C.
New developments in the InterPro database.
Nucleic Acids Res. 2007 Jan;35(Database issue):D224-8.

All data is based on NCBI genome build #37 (GRCh37.p5).

Citing GeneDistiller

If you feel that GeneDistiller has helped you in your research, please cite the following publication:

Seelow D, Schwarz JM, Schuelke M.
GeneDistiller--distilling candidate genes from linkage intervals.
PLoS ONE. 2008;3(12):e3874. Epub 2008 Dec 5.

Contact

In case you discover bugs, have suggestions or questions, please write an e-mail to
Markus Schuelke (markus.schuelke AT charite.de) or to
Dominik Seelow (dominik.seelow AT charite.de).
We also appreciate hearing about your general experiences using GeneDistiller.