Choosing Amazon Sellers with Laplace's Rule

Code used in this project: code.txt + snippets embedded in this article.

Laplace's rule of succession is an example of a mathematical method that can be described in a convoluted way, yet reduces to something almost trivial. It is a good demonstration of why understanding the inner workings of data analytic methods matters.

I am going to illustrate this with a scenario taken from a 2020 YouTube video, which in turn credits a 2011 blog post. The scenario goes like this: suppose we have to choose between two Amazon sellers offering the exact same product at the exact same price.

                     Seller A   Seller B
Number of Reviews    100        20
Number of Positive   85         18
Percent Positive     85%        90%

Seller B has a higher share of positive reviews (90%), but that figure comes from just 20 reviews. Seller A's 100 reviews carry more information, but only 85% of them are positive. Which seller should we choose?

The Bayesian Method

Both the YouTube video and the blog post propose resolving this problem with the Bayesian method. Their proposal first assumes that each seller has a probability \( p \) of giving us a good experience. This \( p \) is the value that we want to estimate.

To do so, classic statistics uses the sample mean \( \frac{y}{n} \) as an unbiased estimator of \( p \), where \( y \) is the number of positive reviews and \( n \) is the total number of reviews. \( \frac{y}{n} \) is also the maximum likelihood estimator of \( p \). Formal details on these concepts, and their derivations, can be found in this set of lecture notes.

The Bayesian method follows this algorithm instead.

  1. Assume a uniform prior: before seeing any data, \( p \) is equally likely to be any real number between \( 0 \) and \( 1 \).
  2. Use Bayes' theorem to update this assumption.
  3. Use this updated assumption to calculate the probability of getting a good experience.
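The three steps above can be sketched numerically in a few lines of plain Python. This is a minimal illustration, not the video's original code: the function name `prob_next_success` and the grid size are my own choices, and the uniform prior is approximated on a discrete grid.

```python
def prob_next_success(k, n, grid_size=100_000):
    """Probability the next trial succeeds, after k successes in n trials."""
    # Step 1: uniform prior -- every value of p in (0, 1) is equally likely.
    ps = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Step 2: Bayes' theorem -- reweight each p by the likelihood of the data.
    weights = [p**k * (1 - p)**(n - k) for p in ps]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Step 3: probability of a good experience = posterior mean of p.
    return sum(p * w for p, w in zip(ps, posterior))

print(prob_next_success(18, 20))  # close to (18 + 1) / (20 + 2) ≈ 0.8636
```

The printed value matches the closed-form result derived in the next section, which is the whole point: the elaborate-looking update collapses to a one-line formula.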

Laplace's Rule of Succession

The algorithm appears to be doing a lot. We start with no knowledge of \( p \) in the first step, and then update our knowledge as data comes in. However, with a little bit of effort, we can show that this reduces to a simple formula \[ \frac{k+1}{n+2}. \] (Starting from a uniform prior, observing \( k \) successes in \( n \) trials yields a \( \mathrm{Beta}(k+1,\, n-k+1) \) posterior, whose mean is exactly \( \frac{k+1}{n+2} \).) This formula is known as "Laplace's Rule Of Succession". It was an attempt at answering the question: if we see \( k \) successes in \( n \) independent trials, what is the probability that the next trial will be a success?

Despite starting from a sophisticated Bayesian framework, and requiring some mathematics to derive, it all boils down to a simple alteration of the mean proportion of successes, \( \frac{k}{n} \). All we had to do was add 1 to the numerator and 2 to the denominator!
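Plugging the two sellers' review counts into the rule settles the original question. A short sketch (the helper name `laplace` is my own):

```python
def laplace(k, n):
    """Laplace's rule of succession: (k + 1) / (n + 2)."""
    return (k + 1) / (n + 2)

print(laplace(85, 100))  # Seller A: 86/102, about 0.843
print(laplace(18, 20))   # Seller B: 19/22, about 0.864
```

Seller B still comes out ahead, although the gap narrows from five percentage points to roughly two.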

Laplace's Rule vs Arithmetic Mean

For large \( k \) and \( n \), the difference between Laplace's rule \( \frac{k+1}{n+2} \) and the mean \( \frac{k}{n} \) is tiny. As the sample grows, the Bayesian estimate converges to the same result as the traditional frequentist one. So, Laplace's rule is only relevant for small samples. We can visually compare the difference between Laplace's rule and the arithmetic mean with Python.
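A minimal sketch of such a comparison, assuming we fix the observed success rate at 90% and let \( n \) grow. The setup is illustrative, not the article's original code:

```python
# Illustrative setup: fix the success count at k = round(0.9 * n) and
# let the number of trials n grow from 2 to 100.
ns = list(range(2, 101))
mean = [round(0.9 * n) / n for n in ns]
laplace = [(round(0.9 * n) + 1) / (n + 2) for n in ns]

# The two estimators converge quickly as n grows.
print(abs(laplace[0] - mean[0]))    # gap at n = 2
print(abs(laplace[-1] - mean[-1]))  # gap at n = 100, far smaller
```

Plotting `mean` and `laplace` against `ns` (e.g. with matplotlib) shows the two curves becoming indistinguishable well before \( n = 100 \).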

The Python code that produced the above can be found here. The difference falls off rapidly even for small increases in \( n \). This is especially obvious when we plot the absolute difference between the two.

diff = []
for i in range(len(mean)):  # `mean` and `laplace` hold the two estimators computed earlier
    diff.append(abs(laplace[i] - mean[i]))


Inertia of Laplace's Rule

By plotting the absolute distance from \( p = 0.5 \) for each estimator, we can see that Laplace's rule has some form of inertia, and tends to sit closer to \( p = 0.5 \) than the arithmetic mean.

l_dist, m_dist = [], []
for i in range(len(mean)):
    l_dist.append(abs(laplace[i] - 0.5))  # Laplace's rule's distance from 0.5
    m_dist.append(abs(mean[i] - 0.5))     # arithmetic mean's distance from 0.5

This makes sense since the formula for Laplace's rule, \( \frac{k+1}{n+2} \), is equivalent to adding two additional trials to the data, consisting of one success and one failure. Also, this is expected behavior for a Bayesian method: we start with a prior assumption that \( p = 0.5 \), and revise away from this assumption as data comes in.
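That "two extra trials" reading can be checked directly: padding the data with one phantom success and one phantom failure, then taking the ordinary mean, reproduces the rule exactly (function names are my own):

```python
def laplace(k, n):
    return (k + 1) / (n + 2)

def mean_with_pseudocounts(k, n):
    successes = k + 1        # observed successes plus one phantom success
    failures = (n - k) + 1   # observed failures plus one phantom failure
    return successes / (successes + failures)

print(laplace(18, 20) == mean_with_pseudocounts(18, 20))  # True
```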

In my humble opinion, Laplace's rule does not offer much improvement over the arithmetic mean. It is only relevant for small \( n \), and even there it only provides a small, arbitrary, possibly irrelevant shift towards \( p = 0.5 \).