a companion discussion area for blog.codinghorror.com

An Initiate of the Bayesian Conspiracy


I think it is easier to frame the question if you start with a set diagram that plots the populations with the various attributes properly.

Note that the initial 2 populations (with and without cancer) are mutually exclusive, but other sets (like who take the tests) overlap with both populations.

It becomes even better if you try drawing them with some degree of proportions like make the cancer set smaller, etc.


One thing you can do is simulate the situation described eg

     (defun random-woman ()
       (if (< (random 100.0) 1)
           ;; One percent of women have cancer
           (cons 'cancer
                 (if (< (random 100.0) 80)
                     ;; Of whom 80% test positive
                     'positive
                     'negative))
           ;; of the others
           (cons 'well
                 (if (< (random 100.0) 9.6)
                     ;; 9.6% false positives
                     'positive
                     'negative))))

     (loop repeat 100
           for (status . result) = (random-woman)
           when (eql result 'positive)
           count (eql status 'well) into false-positive
           and count (eql status 'cancer) into true-positive
           finally (format t "With ~D true positives and ~D false positives ~$% of positives are true positives."
                           true-positive false-positive
                           (* 100 (/ true-positive
                                     (+ false-positive true-positive)))))

With 2 true positives and 9 false positives 18.18% of positives are true positives.

This kind of Monte Carlo approach is hopeless if you want an accurate answer, but it offers you something else instead.

Think of a typical primary care doctor. After he has ordered 100 mammograms and gotten the results of the follow-up investigations, how do things appear? Running the simulation a few times:

With 2 true positives and 5 false positives 28.57% of positives are true positives.

With 0 true positives and 12 false positives 0.00% of positives are true positives.

With 0 true positives and 11 false positives 0.00% of positives are true positives.

With 2 true positives and 10 false positives 16.67% of positives are true positives.

With 1 true positives and 5 false positives 16.67% of positives are true positives.

With 2 true positives and 18 false positives 10.00% of positives are true positives.

With 1 true positives and 10 false positives 9.09% of positives are true positives.

The results are all over the shop. So my guess is that doctors working in primary care have varied experiences, with some seeing mammograms actually detecting cancers and others seeing only false positives.
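For those who don't read Lisp, the same simulation can be sketched in Python (my own port, using the 1%, 80%, and 9.6% rates from the problem statement):

```python
import random

def random_woman():
    """Simulate one screened woman as a (status, test_result) pair."""
    if random.random() * 100 < 1:        # 1% of women have cancer
        status = 'cancer'
        result = 'positive' if random.random() * 100 < 80 else 'negative'
    else:                                # the other 99% are well
        status = 'well'
        result = 'positive' if random.random() * 100 < 9.6 else 'negative'
    return status, result

def run_trial(n=100):
    """Screen n women; return (true_positives, false_positives)."""
    women = [random_woman() for _ in range(n)]
    tp = sum(1 for s, r in women if r == 'positive' and s == 'cancer')
    fp = sum(1 for s, r in women if r == 'positive' and s == 'well')
    return tp, fp

for _ in range(5):
    tp, fp = run_trial()
    total = tp + fp
    pct = 100 * tp / total if total else 0.0
    print(f"{tp} true positives, {fp} false positives: "
          f"{pct:.2f}% of positives are true positives")
```

Running it a few times reproduces the same scatter as the Lisp version: some batches of 100 screenings contain no true positives at all.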


I decided to use Excel, and it worked!

I mocked up a study with 1000 women being screened:

10 women actually have cancer (1% of 1000)

8 of the women who have cancer test positive (80% of 10)

95 women show false positives (9.6% of the 990 cancer-free women)

103 women test positive (8 + 95)

The 8 true positives are about 7.8% of the 103 total positives. (8/103 ≈ 0.0777)

For some reason I was uncomfortable solving this in a purely abstract manner and sticking to the problem description of sample sets of women really helped me reason through this.
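That spreadsheet tally is short enough to write out in Python too (a transcription of the arithmetic above, not the actual Excel sheet):

```python
screened = 1000
have_cancer = round(0.01 * screened)                 # 10 women with cancer
true_pos = round(0.80 * have_cancer)                 # 8 of them test positive
false_pos = round(0.096 * (screened - have_cancer))  # 95 false positives among the 990
total_pos = true_pos + false_pos                     # 103 positives in all
print(f"{100 * true_pos / total_pos:.2f}% of positives are true")  # 7.77%
```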


I’m also a bit creeped out that only 46% of the doctors tested got this simplified version of the problem right:

100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammography. 950 out of 9,900 women without breast cancer will also get a positive mammography. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?

Guess doctors don’t have to be math whizzes, do they?
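For what it's worth, the simplified version reduces to a single division over the given counts; a quick check in Python:

```python
# Out of 10,000 screened women (numbers from the problem statement):
cancer_positives = 80     # positives among the 100 with breast cancer
healthy_positives = 950   # positives among the 9,900 without
fraction = cancer_positives / (cancer_positives + healthy_positives)
print(f"{fraction:.4f}")  # 0.0777 -- about 1 woman in 13
```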


I think we are missing the point here that 0% of men have breast cancer. Based on the comments, only about 1 in 70+ people realize this :smiley:


The author heavily references the work of Daniel Kahneman and Amos Tversky, especially their book Judgment Under Uncertainty, which is an outstanding reference for those interested in how humans make (often fallible) decisions.

The most accessible introduction to their work, IMHO, was an article written in Discover Magazine in 1985 titled “Decisions, Decisions” (see footnote below). It would be well worth the time to print this article from the microfiche files in some academic library. The closest online reference of their work I’ve found is at:


Kevin McKean, “Decisions, Decisions,” Discover, June 1985, pp. 22-31


When I approach these kinds of problems I often find it conceptually easier to deal with actual numbers rather than percentages. It’s the same calculation in the end, just easier for my brain to reason about.

So given 1000 women who go in for testing, 10 (1%) will have breast cancer on average, which means that 990 do not. Of the 10 who have cancer, 8 (80%) will test positive. Of the 990 who are cancer-free, 95.04 (9.6%) will test positive. So in total, 103.04 women will test positive, but only 8 of those actually have cancer. 8/103.04 ≈ 7.76%

Many probability problems are relatively simple when it comes to the actual calculations. Oftentimes the hard part is finding the right calculation to do. You need to look carefully at exactly what each percentage is really saying. It’s easy to get stuck on the 80% or 9.6% figure without realizing they aren’t directly dividing the groups you’re trying to reason about.


Setting aside all the Bayesian chatter, I am wondering if the Hidden Markov Model is similar to what Apple does with their Mail application (see http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html). The MacDevCenter article talks about using vector space – the combination of words found together cluster into volumes of highly-dimensional space. The HMM might do something similar (although I suspect CRM114 is order-sensitive, whereas Mail isn’t).

Still, even though Apple Mail is pretty good, I’d hardly call it better than 98%. I have to write special rules for image spam, and it doesn’t seem to look at tags.


Isn’t the correct answer: “the probability cannot be determined”?
People resort to the equations before they understand the question:

  1. 1% of women at age forty who participate in routine screening have breast cancer.
  • 1% of only age-40 women (stat with rate specific to age)
  2. 80% of women with breast cancer will get positive mammographies.
  • no age variable
  3. 9.6% of women without breast cancer will also get positive mammographies.
  • no age variable
A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

From the information presented, it seems you cannot calculate a percent chance of having breast cancer (by applying the second two stats), since the second two stats cannot be rationally related to the first. They should not be allowed to skew the first stat because they are only applicable on their own. Therefore, they don’t change the first stat, and the most reasonable reading you could give her is a 1% probability that she has breast cancer.


Thanks for everyone’s explanations.

@Darren: men do get breast cancer, but in much smaller percentages. I vaguely recall seeing a news special about it, with one of the comments being that there aren’t the same support groups for men as there are for women. A quick web search can find medical information about it. I can’t vouch for the support groups.



The key here isn’t Bayes’ amazing work. The entire answer, and how close Bayes’ logic gets, depends on how accurately you can approximate the probabilities.

99% of the work goes into determining the probabilities and the confidence intervals. It’s an amazing formula, but being off by even 0.5% leads you to the wrong conclusion, possibly by a large factor (1:10 vs. 1:7). After that, it’s just arithmetic.
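The point about sensitivity to the inputs is easy to demonstrate. A small Python sketch (the function and names are mine) that nudges the false-positive rate by half a percentage point in each direction:

```python
def posterior(p_cancer=0.01, sensitivity=0.80, false_pos_rate=0.096):
    """P(cancer | positive) via Bayes' theorem."""
    p_positive = (p_cancer * sensitivity
                  + (1 - p_cancer) * false_pos_rate)
    return p_cancer * sensitivity / p_positive

for fp_rate in (0.091, 0.096, 0.101):  # false-positive rate +/- 0.5%
    print(f"false-positive rate {fp_rate:.3f}: "
          f"P(cancer|positive) = {posterior(false_pos_rate=fp_rate):.4f}")
```

A half-point error in the false-positive rate moves the posterior from about 8.2% down to about 7.4%, so the odds quoted to a patient really do hinge on how well the input probabilities are known.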


(BTW, I didn’t even remember the formula so I didn’t even try)


The real problem is that both Bayes and Markov are theorems within mathematical statistics (what real statisticians call statistics; not at all what baseball junkies mean when they use the word). Normally, statistical inference rests on the bedrock of independence (whether the practitioner realizes it or not; econometricians widely/wildly ignore the requirement). Bayes says phooey.

The main point being, to quote Allen Holub in a negative way, it doesn’t pay to know a little bit about SQL or mathematical statistics. Or relational databases. Or anything else that requires real thinking.


This is silly. The term “Bayesian” has not a damn thing to do with words, single words, sentences, etc. It’s a simple method dealing with prior, likelihood, and posterior probabilities, and methods of determining parameters from data. Markovian simply means that the infinite past’s influence on the present is minimal. So whatever relation to spam filtering this has needs to PICK NEW WORDS. Stop confusing terms; it just leads to confusion. I know because I’ve been guilty of it myself before.


The answer is NOT 7.7764%!

Read for content, people. The lady already has taken a mammogram, so that 1% deal doesn’t apply. She’s also got a positive result, so the question is: given a positive result, what’s the probability she has it? And the answer is 80%, because that’s exactly what it tells you in the second sentence.


Am I wrong or what?

1st statement: 1% HAS CANCER, 99% WITHOUT CANCER.
2nd statement: 80% of those who HAVE CANCER get positive results. This leads to:
(a) 1% * 80% - has cancer, with positive.
(b) 1% * 20% - has cancer, without positive.
3rd statement: 9.6% of women WITHOUT CANCER get positive:
(c) 99% * 9.6% - no cancer, but positive.
(d) 99% * 90.4% - no cancer, negative.

She got a positive. That means she is in (a) or in (c).
Of all participants, 10.304% (0.01*0.8 + 0.99*0.096) GET POSITIVE results.
Of all participants, 0.8% HAVE CANCER and GET POSITIVE.

So 0.8/10.304, or about 7.76%, of those who GOT a POSITIVE result HAVE CANCER.


Read for content, indeed. I misread the first sentence, thought it said 1% of women get tested. Whoops!


Here’s my solution (I just finished a statistics course at Washington State University, go Cougs!)

The problem statement tells us that the probability that a screened woman has cancer is 1%. P(cancer) = P(c) = 0.01

Conditional probability comes in here; we use the notation P(A|B) for the probability that A is true given that B is true.

It tells us that the probability that a woman will get a positive result given that she has cancer is 80%. P(positive|cancer) = P(p|c) = 0.8

It tells us that the probability that a woman will get a positive result given that she does not have cancer is 9.6%. P(positive|NOT cancer) = P(p|NOT c) = 0.096

Our knowledge of conditional probability tells us that P(A|B) = P(A and B) / P(B). That is, the probability of A being true given that B is true is equal to the probability that A and B are true, divided by the probability that B is true. Drawing a Venn diagram can help to clarify this.

We can now solve for P(positive AND cancer) = P(p|c)/P(c) = 0.008
P(NOT cancer) = 1 - P(cancer) = 0.99
P(positive AND not cancer) = P(p|NOT c)/P(NOT c) = 0.09504
P(positive) = P(p AND c) + P(p AND NOT c) = 0.10304

Finally, P(c|p) = P(c AND p)/P(p) = 0.0776

Given this data, the probability that any positive result corresponds to a woman with cancer is 7.76%


oops, I made a couple typos. these lines are correct:

We can now solve for P(positive AND cancer) = P(p|c) * P(c) = 0.008
P(positive AND not cancer) = P(p|NOT c) * P(NOT c) = 0.09504
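The whole derivation, with the corrected products, is easy to check numerically; a quick Python transcription (variable names are mine):

```python
p_c = 0.01                  # P(cancer)
p_pos_given_c = 0.80        # P(positive | cancer)
p_pos_given_not_c = 0.096   # P(positive | NOT cancer)

p_pos_and_c = p_pos_given_c * p_c                # 0.008
p_pos_and_not_c = p_pos_given_not_c * (1 - p_c)  # 0.09504
p_pos = p_pos_and_c + p_pos_and_not_c            # 0.10304

p_c_given_pos = p_pos_and_c / p_pos
print(f"P(cancer | positive) = {p_c_given_pos:.4f}")  # 0.0776
```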


Sounds great to me