Archive for the 'statistics' Category

Exploratory Versus Confirmatory Data Analysis?

Monday, February 15th, 2010

In 1977, John Tukey published a book called Exploratory Data Analysis. It introduced many new ways of analyzing data, all relatively simple. Most of the new ways involved plotting your data. A few involved transforming your data. Tukey’s broad point was that statisticians (taught by statistics professors) were missing a lot: Conventional statistics focussed too much on confirmatory data analysis (testing hypotheses) to the omission of exploratory data analysis — data analysis that might show you something new. Here are some tools to help you explore your data, Tukey was saying.

No question the new tools are useful. I have found great benefits from plotting and transforming my data. No question that conventional statistics textbooks place far too little emphasis on graphs and transformations. But I no longer agree with Tukey’s exploratory versus confirmatory distinction. The distinction that matters — at least to historians, if not to data analysts — is between low-status and high-status. A more accurate title of Tukey’s book would have been Low-Status Data Analysis. Exploratory data analysis already had a derogatory name: Descriptive data analysis. As in mere description. Graphs and transformations are low-status. They are low-status because graphs are common and transformations are easy. Anyone can make a graph or transform their data. I believe they were neglected for that reason. To show their high status, statistics professors focused their research and teaching on more difficult and esoteric stuff — like complicated regression. That the new stuff wasn’t terribly useful (compared to graphs and transformations) mattered little. Like all academics — like everyone — they cared enormously about showing high status. It was far more important to be impressive than to be useful. As Veblen showed, it might have helped that the new stuff wasn’t very useful. “Applied” science is lower status than “pure” science.

That most of what statistics professors have developed (and taught) is less useful than graphs and transformations strikes me as utterly clear. My explanation is that in statistics, just as in every other academic area I know about, desire to display status led to a lot of useless highly-visible work. (What Veblen called conspicuous waste.) Less visibly, it led to the best tools being neglected. Tukey saw the neglect –  underdevelopment and underteaching of graphs, for example — but perhaps misdiagnosed the cause. Here’s why Tukey’s exploratory versus confirmatory distinction was misleading: Because the tools that Tukey promoted for exploration also improve confirmation. They are neglected everywhere. For example:

1. Graphs improve confirmatory data analysis. If you do a t test (or compute a p value in any way) but don’t make an associated graph, there is room for improvement. A graph will show whether the assumptions of the computation are reasonable. Often they aren’t.

2. Transformations improve confirmatory data analysis. That a good transformation will make the assumptions of the test more reasonable many people know. What few people seem to know is that a good transformation will make the statistical test more sensitive. If a difference exists, the test will be more likely to detect it. This is like increasing your sample size at no extra cost.

3. Exploratory data analysis is sometimes thought of as going beyond the question you started with to find other structure in the data — to explore your data. (Tukey saw it this way.) But to answer the question you started with as well as possible you should find all the structure in the data. Suppose my question is whether X has an effect.  I should care whether Y and Z have an effect in order to (a) make my test of X more sensitive (by removing the effects of Y and Z) and (b) assess the generality of the effect of X (does it interact with Y or Z?).

Most statistics professors and their textbooks have neglected all uses of graphs and transformations, not just their exploratory uses. I used to think exploratory data analysis (and exploratory science more generally) needed different tools than confirmatory data analysis and confirmatory science. Now I don’t. A big simplification.

Exploration (generating new ideas) and confirmation (testing old ideas) are outputs of data analysis, not inputs. To explore your data and to test ideas you already have you should do exactly the same analysis. What’s good for one is good for the other.

Likewise, Freakonomics could have been titled Low-status Economics. That’s essentially what it was, the common theme. Levitt studied all sorts of things other economists thought were beneath them to study. That was Levitt’s real innovation — showing that these questions were neglected. Unsurprisingly, the general public, uninterested in the status of economists, found the work more interesting than high-status economics. I’m sensitive to this because my self-experimentation was extremely low-status. It was useful (low-status), cheap (low-status), small (low-status), and anyone could do it (extremely low status).

More Andrew Gelman comments. Robin Hanson comments.

Assorted Links

Tuesday, February 2nd, 2010

Thanks to Aaron Blaisdell, Tim Beneke, and Stephen Marsh.

How Much Should We Trust Clinical Trials?

Thursday, July 2nd, 2009

Suppose you ask several experts how to choose a good car. Their answers reveal they don’t know how to drive. What should you conclude? Suppose these experts build cars. Should we trust the cars they’ve built?

Gina Kolata writes that “experts agree that there are three basic principles that underlie the search for medical truth and the use of clinical trials to obtain it.” Kolata’s “three basic principles” reveal that her experts don’t understand experimentation.

Principle 1.  “It is important to compare like with like. The groups you are comparing must be the same except for one factor — the one you are studying. For example, you should compare beta carotene users with people who are exactly like the beta carotene users except that they don’t take the supplement.” An expert told her this. This — careful equation of two groups — is not how experiments are done. What is done is random assignment, which roughly (but not perfectly) equates the groups on pre-experimental characteristics. A more subtle point is that the X versus No X design is worse than a design that compares different dosages of X. The latter design makes it less likely that control subjects will get upset because they didn’t get X and makes the two groups more equal.

Principle 2. “The bigger the group studied, the more reliable the conclusions.” Again, this is not what happens. No one with statistical understanding judges the reliability of an effect by the size of the experiment; they judge it by the p value (which takes account of sample size). The more subtle point is that the smaller the sample size, the stronger the effect must be to get reliable results. Researchers try to conserve resources so they try to keep experiments as small as possible. Small experiments with reliable results are more impressive than large experiments with equally reliable results — because the effect must be stronger. This is basically the opposite of what Kolata says.

Principle 3. In the words of Kolata’s expert, it’s “Bayes theorem”. He means consider other evidence — evidence from other studies. This is not only banal, it is meaningless. It is unclear — at least from what Kolata writes — how to weigh the various sources of evidences (what if the other evidence and the clinical trials disagree?).

Kolata also quotes David Freedman, a Berkeley professor of statistics who knew the cost of everything and the value of nothing. Perhaps it starts in medical school. As I blogged, working scientists, who have a clue, don’t want to teach medical students how to do research.

If this is the level of understanding of the people who do clinical trials, how much should we trust them? Presumably Kolata’s experts were better than average — a scary thought.

Will Like vs. Might Love vs. Might Hate

Saturday, February 28th, 2009

What to watch? Entertainment Weekly has a feature called Critical Mass: Ratings of 7 critics are averaged. Those averages are the critical response that most interests me. Rotten Tomatoes also computes averages over critics. It uses a 0-100 scale. In recent months, my favorite movie was Gran Torino, which rated 80 at Rotten Tomatoes (quite good). Slumdog Millionaire, which I also liked, got a 94 (very high).

Is an average the best way to summarize several reviews? People vary a lot in their likes and dislikes — what if I’m looking for a movie I might like a lot? Then the maximum (best) review might be a better summary measure; if the maximum is high, it means that someone liked the movie a lot. A score of 94 means that almost every critic liked Slumdog Millionaire, but the more common score of 80 is ambiguous: Were most critics a bit lukewarm or was wild enthusiasm mixed with dislike? Given that we have an enormous choice of movies — especially on Rotten Tomatoes – I might want to find five movies that someone was wildly enthusiastic about and read their reviews. Movies that everyone likes (e.g., 94 rating) are rare.

Another possibility is that I’m going to the movies with several friends and I just want to make sure no one is going to hate the chosen movie. Then I’d probably want to see the minimum ratings, not the average ratings.

So: different questions, wildly different “averages”. I have never heard a statistician or textbook make this point except trivially (if you want the “middle” number choose the median, a textbook might say).  The possibility of “averages” wildly different from the mean or median is important because averaging is at the heart of how medical and other health treatments are evaluated. The standard evaluation method in this domain is to compare the mean of two groups — one treated, one untreated (or perhaps the two groups get two different treatments).

If there is time to administer only one treatment, then we probably do want the treatment most likely to help. But if there are many treatments available and there is time to administer more than one treatment — if the first one fails, try another, and so on — then it is not nearly so obvious that we want the treatment with the best mean score. Given big differences from person to person, we might want to know what treatments worked really well with someone. Conversely, if we are studying side effects, we might want to know which of two treatments was more likely to have extremely bad outcomes. We would certainly prefer a summary like the minimum (worst) to a summary like the median or mean.

Outside of emergency rooms, there is usually both a wide range of treatment choice and plenty of time to try more than one. For example, you want to lower your blood pressure. This is why medical experts who deride “anecdotal evidence” are like people trying to speak a language they don’t know — and don’t realize they don’t know. (Their cluelessness is enshrined in a saying: the plural of anecdote is not data.) In such situations, extreme outcomes, even if rare, become far more important than averages. You want to avoid the extremely bad (even if rare) outcomes, such as antidepressants that cause suicide. And if a small fraction of people respond extremely well to a treatment that leaves most people unchanged, you want to know that, too. Non-experts grasp this, I think. This is why they are legitimately interested in anecdotal evidence, which does a better job than means or medians of highlighting extremes. It is the medical experts, who have read the textbooks but fail to understand their limitations, whose understanding has considerable room for improvement.

viagra stopped working
Viagra Sale
cheap free free viagra viagra