Ooohhh look, a blog post. Not seen one of those in a while around these parts…

I’ve been rather busy since taking on my new job (which I guess I should now refer to just as ‘my job’). Hence the lack of posts for quite some time. Mostly I’ve been focussed on making sure the Bioinformatics Support Unit continues to run smoothly, but like anyone in a new job, I’ve also wanted to make my mark by changing the way things operate a little bit. So this year we’ve run a proper training course for the first time, for instance. I’m also trying to make the unit more central to the way bioinformatics is done throughout the faculty, by establishing and running the Newcastle University Bioinformatics Special Interest Group. The aim of this group is to foster communication between bioinformaticians working at the University, and hopefully establish some sort of mutually supportive local geek community. The first meeting took place a couple of weeks ago, and I wrote it up for the Special Interest Group blog, but I thought I would reproduce that post here as well. The remainder of this post is taken from that site, with permission of the author (me).

In the first Bioinformatics Special Interest Group meeting, we heard a talk from Dr Andrew “Harry” Harrison entitled ‘On the causes of correlations seen between probe intensities in Affymetrix GeneChips’.

Harry started his talk with a brief overview of the Affymetrix microarray platform, including the important observation (as will become obvious later) that the distance between full length probes on the surface of a GeneChip is around 3nm. Full length probes are around 20nm long, so there is plenty of scope for adjacent probes to interact with one another. Also reviewed was the progress made in the summarisation of probe information from GeneChips into probeset observations per gene.

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene

The Affymetrix-developed MAS5.0 algorithm, which takes the Tukey biweight mean of the difference in logs of perfect match (PM) and mismatch (MM) probes, was swiftly shown to be outperformed by methods developed in academia, once Affymetrix released data that could be used to develop other summarisation algorithms (in particular dChip, RMA and GCRMA, which take into account systematic hybridisation patterns, i.e. the fact that some probes are “stickier” than others).
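To make the idea of a Tukey biweight mean concrete, here is a minimal sketch in Python. It is not the MAS5.0 implementation: the function names, the tuning constants (c = 5, epsilon = 0.0001) and the toy intensities are assumptions for illustration, and it uses the raw MM values where the real algorithm adjusts them first.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a mean that down-weights outliers.

    c and epsilon are assumed tuning constants, not taken from the talk.
    """
    centre = median(values)
    mad = median(abs(v - centre) for v in values)  # median absolute deviation
    weights = []
    for v in values:
        u = (v - centre) / (c * mad + epsilon)
        # Values far from the median (|u| >= 1) get zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def probeset_signal(pm, mm):
    """Summarise one probeset from paired PM and MM intensities.

    Assumes every PM is larger than its MM, so each log difference is defined.
    """
    log_diffs = [math.log2(p) - math.log2(m) for p, m in zip(pm, mm)]
    return tukey_biweight(log_diffs)


# Hypothetical intensities: four well-behaved probe pairs and one "sticky" outlier.
pm = [200.0, 220.0, 180.0, 210.0, 4000.0]
mm = [100.0, 110.0, 90.0, 105.0, 100.0]
signal = probeset_signal(pm, mm)
print(round(signal, 3))
```

The outlying fifth pair is ignored because its distance from the median is far larger than the spread of the other pairs, which is the behaviour that makes a robust mean attractive here.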

Finally for his introductory segment, Harry also mentioned the “curse of dimensionality” – the fact that high-throughput ‘omics experiments make tens of thousands of measurements, and identifying small but significant differences that express what’s going on in the biology suffers from an enormous multiple-testing problem. Therefore, we want to be sure that those things we are measuring are truly indicative of the underlying biology.
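The scale of the multiple-testing problem is easy to see with a little arithmetic. The sketch below is an illustration only: the figure of 20,000 tests and the 0.05 significance level are assumptions, and the Bonferroni correction is just one simple (and conservative) way of compensating.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives if no measurement truly differs."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_tests = 20000  # assumed number of measurements per experiment
false_positives = expected_false_positives(n_tests)
threshold = bonferroni_threshold(n_tests)
print(false_positives, threshold)
```

In other words, testing every measurement at the usual 0.05 level would be expected to throw up around a thousand spurious hits, which is why small real effects are so hard to pick out.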

For the main portion of his talk, Harry went on to detail a number of features of GeneChip data that mean the correlations we measure using this technology may not be due entirely to biology. This was split into four sections, each with their own conclusions.

Section 1

Different probesets mapping to the same gene may not always be up- and down-regulated together. The obvious explanation for this is that probes map to different exons, and alternative splicing means that differing probes may be differentially regulated, even if they map to the same gene. The follow-on suggestion from this is that while genes come in pieces, exons do not, and the exon can be considered the ‘atomic unit’ of transcription.

Conclusions: Exons need to be considered and classified separately. We should be careful of assumptions that contradict known biology.

Section 2

By investigating correlations across >6,000 GeneChips (HGU-133A, from experiments that are publicly available in the Gene Expression Omnibus), the causes of coherent signals across these experiments can be investigated. Colour map correlation plots can show at a glance the relationships between the probes in many probesets, and anomalous probesets can be easily targeted for investigation. One such probeset (209885_at) was one that looked like it was showing splicing in action (3 of 11 probes clearly did not correlate with the remainder of the probeset across the arrays in GEO), but on further investigation it was found that all the probes in the probeset mapped to the same exon. Another probeset (31846_at) that also mapped to the same exon showed a very similar pattern. By investigating the correlation of all of the probes in the 2 probesets, Harry clearly demonstrated that those 4 outlier probes correlated with one another, even though they did not correlate with any of the other probes.
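As a rough sketch of this kind of analysis, the Python below computes pairwise Pearson correlations between probes across a set of arrays and scores each probe by its average correlation with the rest. The probe names and intensities are made up for illustration, and this is not the code used to produce the results in the talk.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of intensities."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_correlation_with_rest(probes):
    """For each probe, average its correlation with every other probe."""
    scores = {}
    for name, values in probes.items():
        others = [pearson(values, other_values)
                  for other, other_values in probes.items() if other != name]
        scores[name] = sum(others) / len(others)
    return scores


# Hypothetical log intensities for four probes across five arrays.
probes = {
    "probe_1": [7.0, 8.0, 9.0, 10.0, 11.0],
    "probe_2": [7.2, 8.1, 9.3, 9.9, 11.2],
    "probe_3": [6.9, 8.2, 8.8, 10.1, 10.9],
    "probe_4": [9.0, 7.0, 10.0, 6.5, 8.0],  # does not follow the others
}
scores = mean_correlation_with_rest(probes)
outlier = min(scores, key=scores.get)
print(outlier)
```

A probe whose average correlation with the rest of its probeset is low is a candidate for closer inspection, which is what the colour map plots make visible at a glance.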

The probes in the 2 probesets under investigation (centre panel, red bars) can clearly be seen to all be located in the final exon of the RHOD gene (top panel) on chromosome 11, in spite of the fact that the Affy annotation (bottom panel) has the probesets annotating the entire gene.

Further investigation showed that all of these 4 outlier probes contain long (4 or more) runs of guanine in their sequence. Harry showed that if you compare all probes with runs of guanine, you find more correlation than you would expect, and the more Gs, the better the correlation. A possible explanation for this was provided, with the suggestion being that the runs of Gs found in the probes could lead to G-quadruplexes being formed between adjacent probes on the GeneChip surface. This would mean that any RNA molecule with a run of Cs could hybridise to the remaining, free probes, and with a much greater affinity than at normal spots on the array, due to a much lower effective probe density in that spot (see for more details on the physics of this).

Conclusions: Probes containing runs of 4 or more guanines are correlated with one another, and therefore are not measuring the expression of the gene they represent. It is proposed that the signals of these probes should be ignored when analysing a GeneChip experiment.
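Finding the affected probes is straightforward once you have the probe sequences. The following Python sketch measures the longest run of guanines in each probe and flags those with 4 or more; the probe names and sequences are invented for illustration.

```python
import re

G_RUN = re.compile(r"G+")


def longest_g_run(sequence):
    """Length of the longest uninterrupted run of G in a probe sequence."""
    runs = G_RUN.findall(sequence.upper())
    return max((len(run) for run in runs), default=0)


def has_long_g_run(sequence, min_run=4):
    """True if the probe contains a run of at least min_run guanines."""
    return longest_g_run(sequence) >= min_run


# Hypothetical 25-mer probe sequences.
probes = {
    "example_probe_a": "ATCGGGGTACCTGATCGATTGCAAT",
    "example_probe_b": "ATCGGATACCTGATCGATTGCAATC",
    "example_probe_c": "TTGGGGGGACCTGATCGATTGCAAT",
}
flagged = sorted(name for name, seq in probes.items() if has_long_g_run(seq))
print(flagged)
```

Keeping the run length, rather than just a yes/no flag, makes it possible to look at how correlation changes as the number of Gs increases.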

Section 3

Probes that contain the sequence GCCTCCC are, just like probes containing runs of guanine, more correlated with one another than you might expect them to be (see picture below). The proposed reason for this is that this sequence will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridizing to the GeneChip.

Pairwise correlations for probes containing GCCTCCC. From Briefings in Functional Genomics and Proteomics (2009) 8 (3): 199-212.

Conclusions: Probes containing the complementary sequence to the primer spacer are probably not measuring gene expression. As with GGGG probes, they should be ignored in analysis.
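One way to act on this is to split a set of probes into those to keep and those to set aside before summarisation. This is a minimal Python sketch under that assumption; the function name and the example sequences are hypothetical.

```python
PRIMER_SPACER_MOTIF = "GCCTCCC"


def split_by_motif(probes, motif=PRIMER_SPACER_MOTIF):
    """Split probes into (keep, drop) according to whether they contain motif."""
    keep, drop = {}, {}
    for name, sequence in probes.items():
        target = drop if motif in sequence.upper() else keep
        target[name] = sequence
    return keep, drop


# Hypothetical 25-mer probe sequences.
probes = {
    "example_probe_x": "AAGCCTCCCTTGACTGACTGACTGA",  # contains GCCTCCC
    "example_probe_y": "AAGCCTCGCTTGACTGACTGACTGA",  # one base different
}
keep, drop = split_by_motif(probes)
print(sorted(keep), sorted(drop))
```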

Section 4

In the final section of his talk, Harry focussed on physical reasons for correlations between probes, showing that many probes show a correlation purely because they are found adjacent to very bright probes. So their correlated measurements are almost entirely due to poor focussing on the instrument capturing the image of the array. It can be shown that sharply focussed arrays have big values right next to small values, whereas poorly focussed arrays will have smaller differences between adjacent spots, because the large values have some of their intensity falling into their small neighbours. Harry also showed that you can use this objective measure to show the “quality” of a particular array scanner, and how it changes over time (since the scanner ID is contained within the metadata in a CEL file).

Conclusions: There is evidence that many GeneChip images are blurred. This blurring can confound the measurement of biology that you are trying to take in your experiment.
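To illustrate the principle, here is a small Python sketch of one possible sharpness score: the average absolute difference in log intensity between neighbouring spots. This is an assumed stand-in and not necessarily the measure Harry used; the function name and the toy grids are hypothetical.

```python
import math


def adjacency_contrast(grid):
    """Mean absolute log2 difference between horizontally and vertically
    adjacent spots. Lower values suggest a more blurred image."""
    rows, cols = len(grid), len(grid[0])
    diffs = []
    for r in range(rows):
        for c in range(cols):
            here = math.log2(grid[r][c])
            if c + 1 < cols:  # neighbour to the right
                diffs.append(abs(here - math.log2(grid[r][c + 1])))
            if r + 1 < rows:  # neighbour below
                diffs.append(abs(here - math.log2(grid[r + 1][c])))
    return sum(diffs) / len(diffs)


# Toy 2x2 arrays: bright spots next to dim spots, in focus and out of focus.
sharp = [[100.0, 10000.0], [10000.0, 100.0]]
blurred = [[1000.0, 5000.0], [5000.0, 1000.0]]
sharp_score = adjacency_contrast(sharp)
blurred_score = adjacency_contrast(blurred)
print(sharp_score > blurred_score)
```

Computing a score like this per array and grouping by scanner ID would give a way of tracking an instrument over time.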

The take home message from Harry’s engaging and thought-provoking talk is that the analysis of high-throughput experiments like those using Affymetrix GeneChips cannot happen in isolation. The things we can learn from considering the statistics and bio-physics (among other things) of these experiments can be invaluable in interpreting the data.

Further resources:

One of the questions after the talk asked how to generate custom CDFs for removing the problematic probes that Harry highlighted during his talk. The answer was to use a tool like Xspecies (NASC) for achieving this.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!


My needs are small: I want an RSS feed of the stuff I want to share from Google Reader, so that other people can follow the things I share in Reader (if they want), and I can pipe that information elsewhere (I use dlvr.it to post selected RSS feeds into Twitter). Google doesn’t want to provide that anymore, so I’ll hack something together.

The ingredients:

1. These simple instructions for how to render an RSS feed from a MySQL backend.
3. My rudimentary PHP hackery skills

The code:
All source is available on BitBucket.

First, we need a database connection. The database is set up exactly as described in (1), above.

<?php
// Note: the mysql_* functions used throughout were removed in PHP 7,
// so this code targets PHP 5.
DEFINE('DB_USER', 'db_user');
DEFINE('DB_PASSWORD', 'db_password');
DEFINE('DB_HOST', 'localhost');
DEFINE('DB_NAME', 'db_name');
// Make the connection and then select the database.
$dbc = @mysql_connect(DB_HOST, DB_USER, DB_PASSWORD) OR die(mysql_error());
mysql_select_db(DB_NAME) OR die(mysql_error());
?>

Now, when the page is visited, we want to render what is in the database as an RSS feed (again, this is a simple adaptation of the code in (1)):

<?php
class RSS {
    public function RSS() {
        require_once ('mysql_connect.php');
    }

    public function GetFeed() {
        return $this->getDetails() . $this->getItems();
    }

    private function dbConnect() {
        //only define the link once, since this is called for each table
        if (!defined('LINK')) {
            DEFINE('LINK', mysql_connect(DB_HOST, DB_USER, DB_PASSWORD));
        }
    }

    private function getDetails() {
        //header of the RSS feed
        $detailsTable = "webref_rss_details";
        $this->dbConnect();
        $query = "SELECT * FROM " . $detailsTable;
        $result = mysql_db_query(DB_NAME, $query, LINK);
        $details = '';
        while ($row = mysql_fetch_array($result)) {
            //fairly minimal description of the feed
            $details = '<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="2.0">
<channel>
<title>' . $row['title'] . '</title>
<link>' . $row['link'] . '</link>
<description>' . $row['description'] . '</description>
<language>' . $row['language'] . '</language>
';
        }
        return $details;
    }

    private function getItems() {
        //return all the items for the RSS feed
        $itemsTable = "webref_rss_items";
        $this->dbConnect();
        $query = "SELECT * FROM " . $itemsTable;
        $result = mysql_db_query(DB_NAME, $query, LINK);
        $items = '';
        while ($row = mysql_fetch_array($result)) {
            //escape title and link so that characters like & do not break the XML
            $items .= '<item>
<title>' . htmlspecialchars($row["title"]) . '</title>
<link>' . htmlspecialchars($row["link"]) . '</link>
<description><![CDATA[' . $row["description"] . ']]></description>
</item>';
        }
        //close the feed
        $items .= '</channel>
</rss>';
        return $items;
    }
}
?>


Finally, we need a method for adding new stuff for the feed. This code takes the GET variables passed to it by Google Reader, and stores them in the DB:

<?php
if (isset($_GET['url'])) {
    //receive google reader 'send to' items, and store in mysqldb
    $url = $_GET['url'];
    $source = $_GET['source'];
    $title = $_GET['title'];
    $simple_check = $_GET['check'];
    //stops anyone adding new items to your feed unless they have the key
    if ($simple_check == 'uniquepasscodehere') {
        require_once('mysql_connect.php');
        //escape the values from the query string before putting them into SQL
        $insert_statement = sprintf(
            "INSERT INTO webref_rss_items(title, description, link) VALUES('%s', '%s', '%s')",
            mysql_real_escape_string($title, $dbc),
            mysql_real_escape_string($source, $dbc),
            mysql_real_escape_string($url, $dbc)
        );
        $result = mysql_query($insert_statement, $dbc);
        if ($result) {
            echo "<p>Success!";
            //would be nice to close the window automatically after a couple of seconds
        }
        else {
            die('<p>Invalid query: ' . mysql_error());
        }
    }
}
else {
    //render everything in the db as RSS
    $rss = new RSS();
    echo $rss->GetFeed();
}
?>


Now, I can set up the Send To: item in Google Reader:


An announcement courtesy of Colin Gillespie, a lecturer in Maths & Stats here in Newcastle:

The School of Mathematics & Statistics at Newcastle University is again running some R courses. In January 2012, we will run:

• January 16th: Introduction to R;
• January 17th: Programming with R;
• January 18th & 19th: Advanced graphics with R.

The courses aren’t aimed at teaching statistics; rather, they aim to go through the fundamental concepts of R programming.

Further information is available at the course website.

It is hard not to get carried away in a room full of people who seem mostly to want the same things. You come away from a conference like Science Online thinking that the open science revolution is inevitable, and there is nothing anyone can do to stop it. Then you get back to your day job and talk of REF and impact factors and get brought back to earth with a bump.

Word Cloud of #solo11 Tweets (tagxedo.com)

The take home message of the conference this year seemed to be this: for open science to work, long term, reward mechanisms within the profession have to change in a comprehensive and profound way. Do I think this is possible? Of course. Do I think this is inevitable? Not by a long chalk. There are too many parties with a vested interest in things remaining the same, some of whom were represented here, despite all of the talk being about openness.

NPG certainly don’t seem that interested in opening things up too far, as the breakout session on APIs demonstrated. Nothing outside of their paywall was discussed, and even more broadly applicable tools, like Connotea, seem to be being quietly dropped in the background. The research councils are still more interested in “impact” (whatever that means) than genuinely original thinking.

But for all this pessimism, there are interesting things happening, and a mainstream breakthrough becomes more likely as the volume of those agitating for change grows. MaryAnn Martone’s keynote was genuinely inspiring, a clear case for breaking down the garden walls. Michael Nielsen made a compelling case for wholesale revolution (however unlikely I think this sort of change may be). We showed that in an afternoon, you can set up a collaborative blog and populate it with interesting scientific content, using freely available tools. The interest we always encounter for the Knowledgeblog project enthuses me, and encourages me that something similar will make hay someday soon (even if we don’t manage to be the people who make the breakthrough).

It may be difficult for me to get to SoLo12, but I will try very hard to return, because I always leave with a smile on my face.


This is a cross-post from the Blogging for Science Online London group blog. During the Saturday workshop at Science Online London 2011, a bunch of us wrote content relating to Spinal Muscular Atrophy. My post was a short summary of a small scale drug trial, which shows promising results.

This is a summary of a paper that shows that Salbutamol promotes SMN2 expression in vivo.

Patients with Spinal Muscular Atrophy (SMA) have no functioning copy of the gene SMN1. The SMN2 gene can theoretically function in its place, but a change in this gene means that only a small amount of functional protein is produced from the gene.

It is therefore suggested that any intervention that can increase the level of functional SMN2 transcript could well be effective as a treatment for SMA.

Salbutamol is a short acting beta-adrenergic agonist that is primarily used for treating asthma. A previous study has shown that Salbutamol is effective in raising SMN2 full length (SMN2-fl) levels in cultured SMA fibroblasts.

In this study, the researchers administered Salbutamol to 12 patients with SMA, and measured the levels of SMN2-fl 3 times (0, 3 and 6 months). The levels of SMN2-fl were significantly increased in all but 3 patients after 3 months (average increase of 48.9%), and in all patients after 6 months (average increase of 91.8%). They also showed that patients with more copies of the SMN2 gene (some patients had 3 copies, some had 4) showed a larger response to Salbutamol treatment. This increase in expression cannot be explained by normal fluctuations over time in these patients, since studies have shown that levels of SMN2-fl are usually stable over time. Clearly the big question now is whether this molecular response to the drug is reflected in a beneficial clinical response in the patient. This study does not address this question, but does propose that a full double-blind, placebo controlled trial should be carried out to ascertain whether or not this treatment is effective in treating the symptoms of SMA.
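For anyone wanting to relate those percentages back to transcript levels, the arithmetic is sketched below in Python. The baseline of 100 arbitrary units is an assumption for illustration and does not come from the paper.

```python
def percent_increase(baseline, follow_up):
    """Percentage change of follow_up relative to baseline."""
    return (follow_up - baseline) / baseline * 100.0


def level_after_increase(baseline, percent):
    """Level reached after a given percentage increase over baseline."""
    return baseline * (1.0 + percent / 100.0)


baseline = 100.0  # assumed SMN2-fl level at 0 months, arbitrary units
three_months = level_after_increase(baseline, 48.9)
six_months = level_after_increase(baseline, 91.8)
print(three_months, six_months)
```

So an average increase of 91.8% means transcript levels a little under double the starting point.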

Tiziano, F., Lomastro, R., Pinto, A., Messina, S., D’Amico, A., Fiori, S., Angelozzi, C., Pane, M., Mercuri, E., Bertini, E., Neri, G., & Brahe, C. (2010). Salbutamol increases survival motor neuron (SMN) transcript levels in leucocytes of spinal muscular atrophy (SMA) patients: relevance for clinical trial design. Journal of Medical Genetics, 47 (12), 856-858. DOI: 10.1136/jmg.2010.080366
