Read about AsyncTask in a great article from Google Android…

Question

0

Editorial Team

Asked: May 15, 20262026-05-15T04:08:43+00:00 2026-05-15T04:08:43+00:00

I am building a website in python/django and want to predict wether a user

0

I am building a website in python/django and want to predict wether a user submission is valid or wether it is spam.

Users have an accept rate on their submissions, like this website has.

Users can moderate other users’ submissions; and these moderations are later metamoderated by an admin.

Given this:

the registered user A with an submission accept rate of 60% submits something.
user B moderates A’s post as a valid submission. However, user B is wrong 70% of the time.
user C moderates A’s post as spam. User C is usually right. If user C says something is spam/ no spam, this will be correct 80% of the time.

How can I predict the chance of A’s post being spam?

Edit: I made a python script simulating this scenario:

#!/usr/bin/env python

import random

def submit(p):
    """Return 'ham' with (p*100)% probability"""
    return 'ham' if random.random() < p else 'spam'

def moderate(p, ham_or_spam):
    """Moderate ham as ham and spam as spam with (p*100)% probability"""
    if ham_or_spam == 'spam':
        return 'spam' if random.random() < p else 'ham'
    if ham_or_spam == 'ham':
        return 'ham' if random.random() < p else 'spam'

NUMBER_OF_SUBMISSIONS = 100000 
USER_A_HAM_RATIO = 0.6 # Will submit 60% ham
USER_B_PRECISION = 0.3 # Will moderate a submission correctly 30% of the time
USER_C_PRECISION = 0.8 # Will moderate a submission correctly 80% of the time

user_a_submissions = [submit(USER_A_HAM_RATIO) \
                        for i in xrange(NUMBER_OF_SUBMISSIONS)]

print "User A has made %d submissions. %d of them are 'ham'." \
        % ( len(user_a_submissions), user_a_submissions.count('ham'))

user_b_moderations = [ moderate( USER_B_PRECISION, ham_or_spam) \
                        for ham_or_spam in user_a_submissions]

user_b_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_b_moderations) if i == j]

print "User B has correctly moderated %d submissions." % \
    len(user_b_moderations_which_are_correct)

user_c_moderations = [ moderate( USER_C_PRECISION, ham_or_spam) \
                        for ham_or_spam in user_a_submissions]

user_c_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_c_moderations) if i == j]

print "User C has correctly moderated %d submissions." % \
    len(user_c_moderations_which_are_correct)

i = 0
j = 0    
k = 0 
for a, b, c in zip(user_a_submissions, user_b_moderations, user_c_moderations):
    if b == 'spam' and c == 'ham':
        i += 1
        if a == 'spam':
            j += 1
        elif a == "ham":
            k += 1

print "'spam' was identified as 'spam' by user B and 'ham' by user C %d times." % j
print "'ham' was identified as 'spam' by user B and 'ham' by user C %d times." % k
print "If user B says it's spam and user C says it's ham, it will be spam \
        %.2f percent of the time, and ham %.2f percent of the time." % \
         ( float(j)/i*100, float(k)/i*100)

Running the script gives me this output:

User A has made 100000 submissions. 60194 of them are ‘ham’.
User B has correctly moderated 29864 submissions.
User C has correctly moderated 79990 submissions.
‘spam’ was identified as ‘spam’ by user B and ‘ham’ by user C 2346 times.
‘ham’ was identified as ‘spam’ by user B and ‘ham’ by user C 33634 times.
If user B says it’s spam and user C says it’s ham, it will be spam 6.52 percent of the time, and ham 93.48 percent of the time.

Is the probability here reasonable? Would this be the correct way to simulate the scenario?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-15T04:08:43+00:00

Bayes’ Theorem tells us:

$<code>P(A|B)=P(B|A)P(A)/P(B)</code>$

Let’s change the letters for the events A and B to X and Y resp. because you’re using A, B and C to stand for people, and it would make things confusing:

P(X|Y) = P(Y|X) P(X) / P(Y)

Edit: the following is slightly wrong because X should be this post _by A_ is spam, not just “this post is spam” (and thus Y should just be “B accepts A’s post, C rejects it”). I’m not redoing the math right here because the numbers are changed anyway — see other edit below for the right number and correct arithmetic.

You want X to mean “this post is spam”, Y to stand for the combination of circumstances A has posted it, B approved it, C rejected it (and let’s assume conditional independence of the circumstances in question).

We need P(X), the a priori probability that any post (no matter who makes it or approves it) is spam; P(Y), the a priori probability that a post would be made by A, approved by B, rejected by C (whether it’s spam or not); and P(Y | X), same as the latter but with the post being spam.

As you may be noticing, you haven’t really given us all the bits and pieces we need for the computation. You three points tell us: a given post by A is spam with a probability of 0.4 (that seems to be how the first point reads); B’s acceptance probabilty is 0.3 but we have no idea how that differs for spam and non-spam, except that there should be “little” difference (low accuracy); C’s is 0.8 and again we don’t know how that’s influenced by spam vs non-spam, except that there should be “large” difference (high accuracy).

So we need some more numbers! The fact that C has high accuracy while accepting 80% of the posts tells us that overall spam must be surprisingly low — if spam overall was the same 40% as for A, then C would have to accept half of it (even if he was perfect on always accepting non-spam) to get the overall 80% accept rate, and that would be hardly “high accuracy”. So say spam overall is just 20% and C only accepts 1/4 of it (and rejects 1/16 of non-spam), pretty good accuracy indeed and overall matching the numbers you’re giving.

Guessing for B, who accepts at 30% overall, and now “knowing” that spam overall is 20%, we could guess that B accepts 1/4 of the spam and only 5/16 of non-spam.

So: P(X)=0.2; P(Y)=0.3*0.2=0.06 (B’s overall acceptance times C’s rejection prob); P(Y|X)=0.4*0.25*0.75=0.075 (A’s prob of spamming time B’s prob of accepting spam time C’s prob of rejecting spam).

So P(X|Y)=0.075*0.2/0.06=0.25 — unless I’ve made some arithmetic error (quite possible, the point is mostly to show you how one can reason in such cases;-), the probability of this particular post being spam is 0.25 — a bit higher than the probability of any random post being spam, lower than the probability of a random post by A being spam.

But of course (even under the simplifying hypothesis of conditional independence all over hte place;=) this computation is highly sensitive to my guesses/hypotheses about the ratios of false positives vs false negatives for B and C, and the overall spam ratio. There are five numbers of this kind involved (overall spam prob, conditional prob for each of B and C for spam and non-spam) and you only give us two relevant (linear) constraints (unconditional prob of acceptance for B and C) and two vague “handwaving” statements (about low and high accuracy), so there’s plenty of degrees of freedom there.

If you can better estimate the five key numbers, the computation can be made more precise.

And, BTW, Python (and a fortiori Django) have absolutely nothing to do with the case — I recommend you remove those irrelevant tags to get a broader range of responses!

Edit: the user clarifies (in a comment — shd really edit his Q!):

When I said “B’s moderations’ accept
rate is a mere 30%” I mean that for
every ten times B moderates something
spam/no spam he makes the wrong
decision 7 times. So there is a 70%
chance he will tag something spam/no
spam when it is not. For user C, “His
moderations’ accept rate is 80%” means
that if C says something is spam or no
spam, he is right 80% of the time.
Overall chance of a registered user
spamming is 20%.

…and asks me to redo the math (I’m assuming false positives and negatives are equally likely for each of B and C). Note that B’s an excellent “contrarian indicator”, since he’s wrong 70% of the time!-).

Anyway: B’s overall acceptance rate of A’s posts must then be 0.6*0.3 (for when he accepts A’s nonspam) + 0.4*0.7 (for when he accepts A’s spam) = 0.18 + 0.28 = 0.46; C’s must be 0.8*0.4 + 0.2*0.6 = 0.32 + 0.12 = 0.44. So we have…:

P(X)=0.4 (I had it wrong at 0.2 before since I was ignoring the fact that A‘s probability of spamming is 0.4 — the overall prob of spam is not relevant, since we know that the post is A’s!); P(Y)=0.46*0.56=0.2576 (B’s overall acceptance rate for A times C’s rejection prob for A); P(Y|X)=0.7*0.8=0.56 (B’s prob of accepting spam time C’s prob of rejecting spam).

So P(X|Y)=0.56*0.4/0.2576=0.87 (rounding). IOW: while a priori the probability that a post of A’s is spam is 0.4, both B’s acceptance and C’s rejection heighten it, so this specific post of A’s has about 87% chance of being spam.

How to approach applying for a job at a company ...

What is a programmer’s life like?

How to handle personal stress caused by utterly incompetent and ...

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions