Saturday, June 21, 2008

Fun with Gene Expression (for Bio-Nerds Only)

Hop on over here and you'll be greeted with an awesome gene expression resource. It's a database of about 33,000 cDNA sequences that correspond to virtually every gene in the body. You could type something like "204285_s_at" (that's how these cDNA sequences are identified) in the search field. Alternatively, you could type something like "top2a" (which happens to be the symbol for topoisomerase 2A). Then you hit enter, and you see all sorts of expression data for that gene. You might see that it's heavily expressed in the pancreas, but not in the brain. You'll see links for papers that reference that gene. Etc.

Folks gawk at the achievement of sequencing entire genomes. Impressive, but personally I'm more astonished by the DNA microarrays that tell you which genes, out of 20,000 or more, are being expressed in a tissue sample. Biochemists have been sequencing DNA for something like 30 years now, with various increments in efficiency along the way. But these microarrays required a whole new approach. Specifically, biochemists had to collaborate with experts in semiconductor lithography to create these "gene chips".

"How much would you pay for your own genome sequence?" is a question sometimes posed to various scientists. Currently, it would cost a few million dollars, I would think. But a fairly affluent individual can have a tissue sample analyzed via microarray right now, and the results would probably be quite a bit more useful (at this juncture in history, at least).

It also amazes me that so many of these genes and their protein products have been analyzed in depth. Choose any of 20,000 genes. Which diseases are connected with its mutations? What sort of protein family/structure are we talking about? What is its function? How heavily is it expressed in any of 100 different tissues and cancers? Who has written papers on the subject? Nobody will win a Nobel Prize here, because tens of thousands of researchers have contributed to this huge knowledge base. And much of the data is freely available to anyone nerdy enough to probe it.

I'm one of those nerds. Having downloaded the aforementioned database, transformed the data, and squished it through my own statistical sausage machine, I now offer up the winners of the "Gene Expression Awards". Before continuing, I should make it clear that I'm taking the expression data at "face value", ignoring error bars, and the simple fact that heavy/light mRNA concentrations don't necessarily translate into heavy/light protein concentrations.

First, the award for Most Consistently Expressed Gene. Here, we're talking about a gene that is expressed across all sorts of tissues, not just a few. The winner is...Insulin-Like Growth Factor Binding Protein 6! You find it everywhere in the body. Runner-Up candidates include "proline arginine-rich end leucine-rich repeat protein", "phosphoinositide-3-kinase, class 3", and "glypican 4". Special mention should go to proteins like "gelsolin" and "growth arrest-specific 6", which are not only consistently expressed, but also heavily expressed.

How about the Least Consistently Expressed Gene? Here, we'll go with "myosin, heavy polypeptide 7, cardiac muscle, beta", found almost entirely in muscle tissue. Another good candidate would be "protamine 2", which is only expressed in testis tissue.
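For the curious, here is a minimal sketch of one way a "consistency" ranking like the two awards above could be computed. It is not necessarily the calculation behind the awards; it assumes the coefficient of variation (standard deviation divided by mean) across tissues as the score, and the gene names, tissue names, and numbers are made-up toy data rather than values from the database.

```python
from statistics import mean, pstdev

# Hypothetical toy data: expression level per tissue for three made-up genes.
# Real values would come from the downloaded expression database.
expression = {
    "gene_a": {"liver": 100.0, "brain": 110.0, "muscle": 90.0, "testis": 100.0},
    "gene_b": {"liver": 5.0, "brain": 5.0, "muscle": 400.0, "testis": 5.0},
    "gene_c": {"liver": 50.0, "brain": 150.0, "muscle": 100.0, "testis": 100.0},
}


def coefficient_of_variation(levels):
    """Population standard deviation divided by the mean (lower = more uniform)."""
    values = list(levels.values())
    avg = mean(values)
    if avg == 0:
        # A gene with no signal anywhere has no meaningful score; rank it last.
        return float("inf")
    return pstdev(values) / avg


def rank_by_consistency(data):
    """Return gene names sorted from most to least consistently expressed."""
    return sorted(data, key=lambda gene: coefficient_of_variation(data[gene]))


ranking = rank_by_consistency(expression)
print("Most consistent:", ranking[0])
print("Least consistent:", ranking[-1])
```

Dividing by the mean keeps heavily expressed genes from looking "inconsistent" just because their raw numbers are bigger. A gene found almost entirely in one tissue, like the toy gene_b, ends up at the bottom of the ranking.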

How about "Most Overexpressed in Cancer Cells"? Let's go with "phorbol-12-myristate-13-acetate-induced protein 1" as the champion. Runners-up include "neuromedin U", "topoisomerase (DNA) II alpha 170kDa", "DNA replication complex GINS protein PSF2", "ribonucleotide reductase M2 polypeptide", "activator of S phase kinase", and more.

There are plenty of proteins that are essentially unexpressed in cancer cells. There's always a slim chance that these proteins must be actively suppressed in order for cancer cells to proliferate. Some examples: clusterin, fibronectin 1, v-fos FBJ murine osteosarcoma viral oncogene homolog, and insulin-like growth factor binding protein 7.
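A common way to put a number on "overexpressed" or "unexpressed" in cancer is a log2 fold change between the average cancer sample and the average normal sample. The sketch below assumes that score, which may differ from whatever was used for the lists above; the gene names, sample values, and the pseudocount of 1.0 are all illustrative assumptions.

```python
from math import log2
from statistics import mean

# Hypothetical toy data: expression per sample, grouped into normal and cancer.
samples = {
    "gene_x": {"normal": [10.0, 12.0, 8.0], "cancer": [80.0, 90.0, 70.0]},
    "gene_y": {"normal": [200.0, 180.0, 220.0], "cancer": [25.0, 20.0, 30.0]},
    "gene_z": {"normal": [50.0, 55.0, 45.0], "cancer": [50.0, 45.0, 55.0]},
}


def log2_fold_change(groups, pseudocount=1.0):
    """log2 of mean cancer expression over mean normal expression.

    The pseudocount (an assumed value) avoids dividing by zero when a gene
    is essentially silent in one of the two groups.
    """
    cancer = mean(groups["cancer"]) + pseudocount
    normal = mean(groups["normal"]) + pseudocount
    return log2(cancer / normal)


fold_changes = {gene: log2_fold_change(groups) for gene, groups in samples.items()}

# Positive scores mean higher in cancer, negative scores mean lower in cancer.
most_overexpressed = max(fold_changes, key=fold_changes.get)
most_underexpressed = min(fold_changes, key=fold_changes.get)

for gene, score in sorted(fold_changes.items(), key=lambda item: item[1], reverse=True):
    print(f"{gene}: log2 fold change = {score:+.2f}")
```

Like the awards themselves, this takes the numbers at face value: it ignores error bars and says nothing about whether a difference is statistically significant.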

"Most Overexpressed in Adult Tissue" (i.e. least expressed in fetal tissue): "major histocompatibility complex, class II, DP alpha 1", followed by any number of other immunoglobulin-related proteins. "Prostaglandin D2 synthase 21kDa (brain)" should also figure in the list.

Conversely, there's "Most Underexpressed in Adult Tissue": "alpha-2-HS-glycoprotein", "glycophorin A", "hemoglobin, gamma G", and more.

Some Trivia...

*Genes that are generally heavily expressed tend not to be expressed in testis germ cells. Rather odd.

*Want to know which genes are heavily expressed in the appendix? Look for genes that are also heavily expressed in the superior cervical ganglia. Why in hell should there be a relation between these two tissues? Heavy expression in the ovaries also correlates strongly with expression in the appendix, and low expression in the ovaries correlates with low expression in the appendix.

*Heavy expression in the Spinal Cord correlates overwhelmingly with expression in the Olfactory Bulb. More reasonably, heavy expression in the Prefrontal Cortex and Hypothalamus also correlates strongly with Olfactory Bulb expression.

*Various brain tissues cross-correlate very strongly. Strong expression in the whole brain is very well correlated with heavy expression in the amygdala, followed by the prefrontal cortex, occipital lobe, etc. The differences between these tissues are fairly subtle, apparently.

*Gene expression in the prostate is strongly correlated with expression in the lung! Huh?

*Despite the proximity of the organs, expression in the prostate is negatively correlated with expression in the testis.

*Heavy expression in the atrioventricular node (of the heart) corresponds to heavy expression in the skin!

*Heavy/low expression in the blood is negatively correlated with heavy/low expression in the brain. There are also inverse correlations between testis expression and whole blood expression.

*Low expression in the adrenal cortex is correlated with heavy expression in the Medulla Oblongata.

*If it's heavily expressed in smooth muscle, it's probably expressed in low quantities in skeletal muscle. This correlation isn't strong, but it's a bit surprising nevertheless.

*Proteins heavily expressed in adipocytes (fat cells) tend to be expressed in smooth muscle. Collagen and collagen binding proteins, for example, but also "melanoma associated gene" (ds2448), laminin, and more.
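Tissue-versus-tissue trivia like the items above boils down to correlating one tissue's expression profile against another's, gene by gene. Here is a minimal sketch using the Pearson correlation, which is an assumption about the statistic; the tissue profiles are made-up toy numbers, with each list holding the same five hypothetical genes in the same order.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: one expression value per gene, same gene order per tissue.
tissue_profiles = {
    "whole_brain": [10.0, 200.0, 50.0, 5.0, 120.0],
    "amygdala": [12.0, 190.0, 55.0, 6.0, 110.0],
    "whole_blood": [150.0, 8.0, 60.0, 180.0, 20.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("profiles must cover the same genes")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / sqrt(var_x * var_y)


# Correlate every pair of tissues once.
correlations = {
    (a, b): pearson(tissue_profiles[a], tissue_profiles[b])
    for a, b in combinations(tissue_profiles, 2)
}

for (a, b), r in sorted(correlations.items(), key=lambda item: item[1], reverse=True):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

With real microarray data it is common to log-transform the expression values first, so that a handful of extremely abundant transcripts don't dominate the correlation.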
