Getting genes from pathway databases
Gene Ontology
The following SQL query gets the symbols of all human gene products annotated to Gene Ontology terms, together with their database cross-references.
SELECT DISTINCT T.acc, T.name, GPR.symbol, DBX.xref_dbname, DBX.xref_key
FROM term T
  INNER JOIN graph_path GP ON GP.term1_id = T.id
  INNER JOIN association A ON A.term_id = GP.term2_id AND A.is_not = 0
  INNER JOIN gene_product GPR ON GPR.id = A.gene_product_id
  INNER JOIN species S ON S.id = GPR.species_id AND S.genus = 'Homo' AND S.species = 'sapiens'
  INNER JOIN dbxref DBX ON DBX.id = GPR.dbxref_id;
For example, you can run this query in GOOSE or in a locally downloaded copy of the Gene Ontology database.
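If you would rather drive the query from a script, here is a minimal sketch in Python. It assumes you already have a DB-API 2.0 connection to your local copy of the database (for the real thing that would come from whichever MySQL driver you use, which is not shown here). The function names and the tiny in-memory SQLite stand-in are my own inventions for illustration: the stand-in contains only the columns the query touches, and its rows are made up for the demo.

```python
import sqlite3

# The query from the post (graph_path columns qualified with the GP alias).
GO_HUMAN_GENES_SQL = """
SELECT DISTINCT T.acc, T.name, GPR.symbol, DBX.xref_dbname, DBX.xref_key
FROM term T
  INNER JOIN graph_path GP ON GP.term1_id = T.id
  INNER JOIN association A ON A.term_id = GP.term2_id AND A.is_not = 0
  INNER JOIN gene_product GPR ON GPR.id = A.gene_product_id
  INNER JOIN species S ON S.id = GPR.species_id
    AND S.genus = 'Homo' AND S.species = 'sapiens'
  INNER JOIN dbxref DBX ON DBX.id = GPR.dbxref_id
"""


def fetch_go_human_genes(connection):
    """Run the query on a DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(GO_HUMAN_GENES_SQL)
        return cursor.fetchall()
    finally:
        cursor.close()


def build_toy_database():
    """Hypothetical in-memory stand-in for the GO database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term (id INTEGER, acc TEXT, name TEXT);
        CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
        CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER,
                                  is_not INTEGER);
        CREATE TABLE gene_product (id INTEGER, symbol TEXT,
                                   species_id INTEGER, dbxref_id INTEGER);
        CREATE TABLE species (id INTEGER, genus TEXT, species TEXT);
        CREATE TABLE dbxref (id INTEGER, xref_dbname TEXT, xref_key TEXT);

        INSERT INTO term VALUES (1, 'GO:0008150', 'biological_process'),
                                (2, 'GO:0006915', 'apoptotic process');
        -- Each term reaches itself, and term 1 is an ancestor of term 2.
        INSERT INTO graph_path VALUES (1, 1), (2, 2), (1, 2);
        INSERT INTO species VALUES (1, 'Homo', 'sapiens'),
                                   (2, 'Mus', 'musculus');
        INSERT INTO dbxref VALUES (1, 'UniProtKB', 'P04637'),
                                  (2, 'MGI', 'MGI:98834');
        INSERT INTO gene_product VALUES (1, 'TP53', 1, 1), (2, 'Trp53', 2, 2);
        -- Both gene products are annotated to term 2 only.
        INSERT INTO association VALUES (2, 1, 0), (2, 2, 0);
    """)
    return conn


if __name__ == "__main__":
    for row in fetch_go_human_genes(build_toy_database()):
        print(row)
```

In the toy data the human gene product is annotated only to the more specific term, but it is also reported for the ancestor term, because the query joins through graph_path. The mouse gene product is filtered out by the species join.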
Note: Many more example queries can be found here.
KEGG
The Kyoto Encyclopedia of Genes and Genomes has an FTP site. It seems to consist of text files and images in custom formats, and I found it pretty hard to understand. However, hidden in there are files containing pathway ids and associated gene ids; for example, here is the one for humans. I learned this from the GenomeNet support team, who also said:
“Column 2 is the list of KEGG gene IDs and Column 3 is that of representative gene names. KEGG GENES is a collection of gene catalogs generated from publicly available resources, mostly NCBI RefSeq. Thus, gene names are got from NCBI RefSeq.”
And much indebted I am to them, too.
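For what it is worth, here is a rough sketch of how such a file could be read in Python, going only by the column description quoted above. Several things here are my assumptions and not documented facts about the KEGG format: that the columns are tab-separated, that column 1 holds the pathway id, and that the lists inside columns 2 and 3 are separated by whitespace. The function name and the sample line are made up for illustration, so check them against the real file before relying on this.

```python
from collections import namedtuple

# One record per pathway line: id, KEGG gene ids, representative gene names.
PathwayGenes = namedtuple("PathwayGenes", ["pathway_id", "gene_ids", "gene_names"])


def parse_pathway_gene_lines(lines):
    """Parse lines of the form: pathway_id <TAB> gene ids <TAB> gene names.

    ASSUMPTIONS (not confirmed against the KEGG files): tab-separated
    columns, with whitespace-separated lists in columns 2 and 3.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        columns = line.split("\t")
        if len(columns) < 3:
            raise ValueError(
                "line %d: expected at least 3 columns, got %d"
                % (line_number, len(columns))
            )
        records.append(
            PathwayGenes(
                pathway_id=columns[0].strip(),
                gene_ids=columns[1].split(),
                gene_names=columns[2].split(),
            )
        )
    return records


if __name__ == "__main__":
    # Made-up sample line in the assumed layout, for illustration only.
    sample = ["hsa00010\thsa:10327 hsa:124\tAKR1A1 ADH1A\n"]
    for record in parse_pathway_gene_lines(sample):
        print(record.pathway_id, record.gene_ids, record.gene_names)
```

To read a downloaded file, pass an open file object in place of the sample list, since iterating over a file yields its lines.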
Reactome
The easiest way to get this information from Reactome seems to be the Reactome Biomart. (Personally, I’d much rather run a database query, because I’m of that sort of persuasion, but this does not seem possible in a simple way with the Reactome database. My guess – and it is only a guess – is that Reactome’s object-relational mapping is designed to be manipulated using an object-oriented code framework, and that transitive membership is not computable using the database directly.)
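To make "transitive membership" concrete: a gene belongs to a pathway transitively if it is listed for that pathway or for any of its sub-pathways, at any depth. The sketch below shows that computation in Python over plain dictionaries. It is not Reactome's schema or API; the function name, the two dictionaries and the identifiers in the demo are all hypothetical and only illustrate the idea.

```python
def transitive_members(children, direct_members, pathway_id):
    """Return every gene in pathway_id or in any of its descendants.

    children:       dict mapping a pathway id to its direct sub-pathway ids
    direct_members: dict mapping a pathway id to the genes listed on it
    Both structures are hypothetical; they are not Reactome's data model.
    """
    seen = set()    # pathways already visited (guards against cycles)
    genes = set()
    stack = [pathway_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        genes.update(direct_members.get(current, ()))
        stack.extend(children.get(current, ()))
    return genes


if __name__ == "__main__":
    # Made-up hierarchy: P1 contains P2, which contains P3.
    children = {"P1": ["P2"], "P2": ["P3"]}
    direct_members = {"P1": ["geneA"], "P2": ["geneB"], "P3": ["geneC"]}
    print(sorted(transitive_members(children, direct_members, "P1")))
```

In the Gene Ontology database this walk has effectively been done in advance and stored in the graph_path table, which is why the query above can get by with ordinary joins.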
Cheers! Will have a look at some point and see if this works for me. However, I don’t understand SQL in the slightest, so perhaps not.
If you happen to be blogging from Newcastle next week, please give me a good review!