Carl's Selfie

Carl Sagan made a habit of humbling hubris. He was a master of narrative, pacing and awe, but Sagan's salient talent was recalibrating perspective in a cosmological context. Where most people saw a periodic table, Sagan saw star stuff. Where I would definitely see an apple pie, Sagan saw the origins of the universe. And when NASA flung Voyager 1 through our solar system, Sagan saw an opportunity to take the most ambitious selfie of all time.

With Voyager 1 some 3.7 billion miles from Earth, Sagan convinced NASA to turn its camera around so that it could take this picture:

Earth "suspended in a sunbeam"

On first inspection, it doesn't look like much. A pale blue dot, perhaps. But that dot is Earth. And where you and I might see a few scruffy pixels, Sagan saw an opportunity to contextualize humanity.

Sagan published his thoughts on this image in his book Pale Blue Dot: A Vision of the Human Future in Space. Despite his deteriorating health, Sagan also managed to record an audiobook version before he died. One monologue from this reading became particularly famous and has since been widely used by amateur video editors in the YouTube age. The following video is a fantastic example (please watch it if you have time):

I first saw a version of this video in 2007. I was struck not just by its conclusions (most cosmology is humbling), but by the tightness of Sagan's prose and his penetrating delivery. Many variations of this video exist. Most are good, some are great. Whilst it's nice to see such widespread appreciation for Sagan's words in video, most versions are (understandably) very similar.

I was therefore delighted to find this cartoon version of the Pale Blue Dot narrative by Gavin Aung Than. It's refreshing to see the piece so gorgeously articulated in a different medium. And whilst most videos (like the one above) try to capture the grandeur of Sagan's words, Than's cartoon focuses on the Pale Blue Dot from the perspective of a young Carl Sagan. If the purpose of the original Pale Blue Dot image was to re-contextualize humanity, then this cartoon re-imagines Sagan as an adult with child-like fascination. He epitomised universal curiosity, and it's this naive, juvenile perspective, the image of a child looking into space, that all scientists share.

Sloshed Sequencers

A delightfully cantankerous Mike Yaffe writing for Science Signaling:

"We biomedical scientists are addicted to data, like alcoholics are addicted to cheap booze. As in the old joke about the drunk looking under the lamppost for his lost wallet, biomedical scientists tend to look under the sequencing lamppost where the “light is brightest”—that is, where the most data can be obtained as quickly as possible. Like data junkies, we continue to look to genome sequencing when the really clinically useful information may lie someplace else."

Scientists are not just "addicted" to data. They literally need it.

It is impossible to act empirically without data — and obtaining good-quality data is the core objective for every scientist. Ideas are cheap, and in a scientific utopia we would have the resources to study them all. Unfortunately — tethered to reality as we are — a pragmatic filter must be applied to all testable theories. Feasible ideas (might) get funded, whilst theoretically sound but technically challenging ones often remain untested.

DNA sequencing represents the former. It's a mature field where technical advances strive for increased speed, improved fidelity and reduced cost — not changes in the data output. A sequenced genome is a sequenced genome. It can be expensive and time-consuming, but sequencing a genome is exceptionally feasible. In contrast, consider protein sequencing: existing technology is extremely expensive, frustratingly stochastic and years away from true proteome-wide coverage. However, whilst quantitative proteome sequencing may be more challenging than genomic sequencing, proteomic results have the potential to be far more valuable. So although, as a person working in proteomics, I envy the pragmatic simplicity of DNA sequencing, I have less envy for the dreary results of large DNA sequencing projects.

Yaffe shares my boredom with genomics and laments the continued funding of pragmatically feasible (although still very expensive) efforts to sequence cancer genomes:

"So far, the results have been pretty disappointing. Various studies on common human tumors, many under the auspices of The Cancer Genome Atlas (TCGA), have demonstrated that essentially all, or nearly all, of the mutated genes and key pathways that are altered in cancer were already known."

Of course, that doesn't mean they shouldn't have looked. Confirming that something is how you think it is still has value. Continuing to fund such projects is substantially less exciting, however, and Yaffe makes a great case for why our "drunk scientist" should leave this "lamppost" and look for their wallet in a more sensible place.

Impenetrable Drafts

Samuel Arbesman writing for Slate:

“A professor of mine once taught a class on a Tuesday, only to read a paper the next day that invalidated what he had taught. So he went into class on Thursday and told the class, ‘Remember what I told you on Tuesday? It’s wrong. And if that worries you, you need to get out of science.’

Science is always in this draft form.”

Permanent incompletion is one of the unique features of science. No matter how accurate our model of the universe appears to be, further experimentation, hypothesis testing and repetition have always produced greater predictive accuracy. There is never a reason to think science is ‘finished’. There is always something else to do.

To draw a contrast with an alternative philosophy, consider the following diagram:

science_vs_faith.jpg

For me, the salient point in this diagram is that science does not ‘End’. It’s an infinite loop — obsessed with obtaining a more accurate understanding of the universe. 

This perpetual cycle has an interesting side-effect: As our methods for data collection and data processing have improved, the models we develop to understand the universe have, unsurprisingly, become increasingly complicated. Thus, as the amount of data increases, we increasingly have to externalise the processing of this data to computers. Our models of the universe are no longer thought, they are computed.

If science continues its perpetual cycle of mass data collection and deeper modelling, we will eventually generate data so baffling that only computers will ‘understand’ it. Science will reach a point where the human brain cannot comprehend the complexity of its conclusions.

Arbesman cites evidence that this is already happening: 

“A computer program known as Eureqa that was designed to find patterns and meaning in large datasets not only has recapitulated fundamental laws of physics but has also found explanatory equations that no one really understands. And certain mathematical theorems have been proven by computers, and no one person actually understands the complete proofs, though we know that they are correct.”

I would argue this has actually been happening for quite a while. In the 1970s, Richard Feynman famously popularised the idea that quantum mechanics is impenetrable to the human mind. Yet, despite its incomprehensibility, quantum mechanics remains one of the most accurately predictive theories in science. It may be deeply strange, but it provides an accurate, testable model of the universe. And that, I think, is the real danger with increasing data complexity: Testability.

Science permits the presence of ideas that a human cannot understand. What it does not permit are ideas that cannot be tested.

Totes Emosh

Scientists are often viewed as cold, insipid people. Platonic robots who appear overly connected with data and inhumanly disconnected from emotions.

This is perversely inaccurate. 

To be clear: Academic scientists are freelance researchers who literally investigate the unknown. They operate with limited resources, minimal job security, finite time, poor pay (relative to their qualifications) and maximum uncertainty. They spend huge periods of time doing work that may not produce anything useful, and should a technical error occur during an experiment, weeks of (expensive) hard work may have to be scrapped. To top it off, their success is judged by where (not if) they publish — something that is highly dependent on contemporary fashion and the recent performance of rivals, and which, as a result, has a questionable association with ‘talent’.

Such an environment ensures scientists are very emotional people.

To address this psychological misconception, The Guardian has published several first-person reports describing the emotional experiences of scientists. They’re all interesting pieces, although I’m not convinced any truly capture the core driver of scientific emotion: Uncertainty. No profession provides unequivocal certainty — but as the purpose of science is to investigate the unknown, it always operates in uncertainty. When coupled with transient job stability, unpredictable results and an extremely competitive applicant-to-position ratio, it’s surprising anyone would want to do it.

For all these lows, however, scientists can achieve comparable highs. One benefit of investigating the unknown is that scientists are assured a front row seat at the frontier of human discovery. Scientists see things before anyone else in human history — and when things work, they can, sincerely, change the world. Whilst uncertainty brings professional instability, there is really no feeling like a result that works. For every Sheldon Cooper, there's a Doc Brown.

You can't say that about many nine-till-fives.

Big Data

In an adaptation from his new book “Antifragile”, Nassim Taleb describes the problem with “big data” for Wired:

“Big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal).”

So more data provides more signal. That’s the whole point of doing large experiments. However, Taleb is warning us that as data size increases, noise grows faster than signal. Thus, bigger datasets have a worse signal-to-noise ratio than smaller datasets. This means researchers can easily find completely bogus (but statistically significant) results in big datasets. Given the ubiquity of large experiments in modern science, Taleb believes: “Researchers have brought cherry-picking to an industrial level.”
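
Taleb's point about spurious significance is easy to demonstrate. The sketch below is a minimal simulation of my own (nothing from Taleb's book): it correlates a purely random outcome against 1,000 purely random predictors. At the conventional p < 0.05 threshold, roughly 5% of them will look 'significant' by chance alone.

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n_samples, n_predictors = 50, 1000

# A purely random 'outcome' -- there is no real signal to find
outcome = [random.gauss(0, 1) for _ in range(n_samples)]

# For n = 50, |r| > 0.279 corresponds to p < 0.05 (two-tailed)
critical_r = 0.279

spurious = sum(
    1
    for _ in range(n_predictors)
    if abs(pearson([random.gauss(0, 1) for _ in range(n_samples)], outcome)) > critical_r
)
print(f"{spurious} of {n_predictors} noise predictors look 'significant'")
```

Every 'hit' here is noise, yet a few dozen pass the significance filter. The follow-up experiments discussed below are exactly what separates these false positives from real signal.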

Taleb could be correct here — if it were not for one huge presumption: That researchers don’t follow up big experiments with smaller ones. 

Large biological experiments (e.g. anything ending in “omics”) produce massive datasets impenetrable to the isolated human mind. No human can look at thousands of unprocessed data points and see the pattern. We can, however, apply mathematical modelling to help explain the data.

These models are not the endpoint. They are not the answer. If they were, Taleb would have a point. These models are, in the very classical sense, hypotheses. Just as a human can look at several data points from a small experiment and derive a testable hypothesis, computational interpretation of big datasets can be used to produce a testable hypothesis from thousands of data points. 

And where there is a hypothesis, there is an experiment to test it.  

The needle may come “in an increasingly larger haystack.” We just use a robot with a magnet to help sift it — then check the needle isn’t hay. It’s science.

Maybe Taleb’s cynicism comes from a career in the financial sector. It’s not a domain famous for reproducible, predictive or testable empiricism.


Cross-Domain Contrast

To test a hypothesis, scientific techniques must adequately distinguish between information relevant to the hypothesis (‘signal’) and information that is unrelated to it (‘noise’). They must have contrast.

In biochemistry, where molecules are routinely defined at the atomic level, such contrast is commonplace. Unfortunately, the environment in which biochemistry operates — the cell — is less granular. Cells are chaotic snake-pits of interweaving biochemistry. This complexity can make studying cells a rather soft, fuzzy, low-contrast endeavour.

Fortunately for scientists (and biology in general), the more easily defined world of biochemistry underpins cellular chaos. Despite their complex environment, cellular molecules interact specifically and, for the most part, order prevails. As a result, researchers can use tools from the high-contrast domain of biochemistry to study the complex domain of cell biology.

In his book “The Nature of Technology”, W. Brian Arthur suggests technical innovation occurs when a set of tools from one domain is used to solve a problem in another domain. Innovation is the “re-domaining” of a technology. As the ‘new’ tools must already exist in a domain external to the problem, innovation requires the cross-fertilization of different technical domains.

Modern labs are littered with the products of such cross-domain innovation. For example, basic immunology knowledge is routinely exploited (through the medium of antibodies) to investigate the expression and location of proteins in cells (see western blots and immunofluorescence). Research into the thermodynamics of nucleic acid hydrogen bonding and thermostable DNA polymerases revolutionised the way genes are isolated (see PCR). Instruments initially designed for measuring the mass-to-charge ratio of chemical ions (mass spectrometers) now underpin the way we understand cell signaling (see phosphoproteomics).

In all cases, technical knowledge from one domain is applied to solve a problem in another domain.

A fresh example comes from Alice Ting’s group, as reported last week in Science. First, the authors engineered a new enzyme to specifically label proteins in the mitochondria (and not anywhere else in the cell). They then used this specific labelling to separate mitochondrial proteins from the noisy snake-pit of other cellular proteins. Once isolated, the mitochondrial proteins could be measured with unparalleled accuracy using mass spectrometry.

By cross-fertilizing the domain of recombinant enzyme engineering with contemporary proteomics, this technique can record detailed measurements of something surrounded by noise. 

High-contrast indeed.

1K Lexicon

The predominant barrier to understanding scientific methodology is not complex ideas, but complex language.

People are smart. They understand elaborate things every day. They are not, however, trans-linguistic gymnasts capable of computing and contextualising unfamiliar words.

I am not a fan of ‘dumbing down’ methodology (i.e. removing detail). It’s dishonest and patronising. Non-experts are people without an expert knowledge — they are not stupid. 

Translating methodology into language non-experts can understand is a more productive form of communication. It respects an audience’s intellect and ensures they are exposed to the complete story. 

But how does one ensure a narrative contains non-expert language? 

An extreme example would be to try ‘The Up-Goer Five Text Editor’. Just as a word processor underlines misspelt words with a wavy red line, this text editor highlights words that fall outside the 1,000 most common words.
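
The metric is simple enough to sketch. The toy checker below mimics the Up-Goer approach with a stand-in word list — the real editor uses the 1,000 most common English words, whereas the tiny COMMON_WORDS set here is purely an illustrative assumption:

```python
# Stand-in for the real ~1,000-word list used by the Up-Goer Five editor
COMMON_WORDS = {
    "the", "a", "is", "are", "and", "of", "to", "in", "that", "it",
    "this", "not", "but", "can", "people", "things", "most", "they",
    "we", "you", "i", "my", "be", "with", "for", "what", "see",
}

def impenetrability(text: str) -> float:
    """Fraction of words falling outside the common-word list."""
    words = [w.strip(".,;:!?'\"").lower() for w in text.split()]
    words = [w for w in words if w]
    unfamiliar = [w for w in words if w not in COMMON_WORDS]
    return len(unfamiliar) / len(words)

score = impenetrability("The cell is a chaotic snake-pit of interweaving biochemistry")
print(f"{score:.0%} impenetrable")
```

A real implementation would load the actual word list and highlight the offending words, but the underlying metric is just this ratio.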

By this metric my 'About' blurb is 46% impenetrable.

And Repeat

Scientific conclusions should be reproducible. If A+B = X today, then all things being equal, A+B should = X tomorrow. But if A+B inconveniently = Y, then there was probably something wrong with the original idea that A+B = X and a new hypothesis may be required. As this happens frequently in science, researchers must repeat experiments to ensure initial results aren’t anecdotal flukes. 

At the individual level repeating experiments is not a problem. If an individual scientist can run one experiment, they can (time and money permitting) run another. It may be boring, but it is feasible. Experimental reproducibility is required by peer-reviewed journals — and although there is no general threshold of accepted reproducibility, most authors demonstrate the fidelity of their results through the medium of ‘statistical significance’. 

Thus at the individual level, experimental reproducibility is rewarded in the currency of research: Publications. So far, so good. 

The next tier of experimental reproducibility is the replication of results by other scientists. This is an unrewarding endeavour. In fact, due to the bias of selectively publishing ‘new’ work, there is an active discrimination against publishing repeated results in leading (i.e. career-bumping) journals. This anti-replicant discrimination is so fierce that researchers often live in fear of competitors publishing ‘their’ findings first. There is no prize for being second.

As publications drive careers, there is no mainstream incentive for researchers to spend their time and money replicating other people’s experiments. As a result, many conclusions at the publication/community level are n = 1.

This highlights a bizarre paradox. On one hand, statistically significant reproducibility is the essence of individual science. Yet, at the community level, reproducing existing results is career suicide.

Tom Bartlett at Percolator reports on a charming challenge to this trend. In short, a group of researchers (led by the Open Science Framework) have decided to replicate every study from three psychology journals for a year. Their aim is to estimate the reproducibility of a sample of studies from the scientific literature.

I can’t wait to see what they find.