__Environment:__ Any, C++

### Abstract

This article covers five discrete probability distributions commonly used in computer simulations, gaming, artificial intelligence decision making, and environment modeling. The covered distributions are the Binomial, Geometric, Negative Binomial, HyperGeometric, and Poisson probability distributions, along with an abstract probability class. The classes differ from routine mathematical approaches to probability computation through a few implementation tricks that extend the range of allowable inputs. A comparison with Monte Carlo techniques for probability computation is also given.

### Overview

If you are a computer science or mathematics major, it is likely you will take a course on probability and statistics. Probability is a large area of study, but the basic concepts can be learned by anyone. The following five probability classes are likely the first you will encounter in your studies and fall under the heading of ‘discrete’ probabilities, meaning that the events they model either happen or they don’t. I will give a brief introduction to each probability and an example of when it might be used. Discrete probabilities use a lot of factorials and often require multiplying very large numbers by very small numbers. Some of the problems this causes, and the workaround, are discussed, along with a very common alternative many programmers use: generating probabilities with random numbers.

### Probability Basics

A probability is the chance of an event happening. A probability is always between 0.0 and 1.0, inclusive. A probability can ALWAYS be defined as “the number of ways the event can happen in an experiment” divided by “the total number of ways an experiment can turn out.” Rolling a die is the most common example, next to flipping a coin. The chance to roll any number is one divided by the number of sides on the die, usually 1/6 for a normal die. The chance to land heads on a flipped coin is 1/2. These are pretty obvious observations, but they are the basis for more meaningful questions, such as what the chance is to roll the same number three times in a row. This is not so obvious. The five probability distributions we are going to see all depend on prior knowledge of a discrete event such as rolling a die, flipping a coin, hitting a ball, shooting a target, failure of a device, a lottery, or drilling for oil, to name a few. If we know the likelihood of a discrete event, we can start to ask questions about how this event plays out under certain circumstances.
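The “ways the event can happen” over “total ways the experiment can turn out” definition can be checked by brute force. The sketch below (an illustrative helper, not part of the article's classes) enumerates all 6*6*6 outcomes of three rolls and counts those where every roll shows the same number:

```cpp
#include <cassert>
#include <cmath>

// Enumerate every outcome of three die rolls and count the favorable
// ones, per the definition: favorable outcomes / total outcomes.
double SameNumberThreeRolls()
{
    int favorable = 0, total = 0;
    for (int a = 1; a <= 6; ++a)
        for (int b = 1; b <= 6; ++b)
            for (int c = 1; c <= 6; ++c)
            {
                ++total;
                if (a == b && b == c)
                    ++favorable;
            }
    return static_cast<double>(favorable) / total;
}
```

Six of the 216 outcomes qualify, so the answer is 6/216 = 1/36: the first roll can be anything, and each of the next two rolls must match it.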

### The Binomial Probability

The binomial probability models the probability of a number of successful trials, Y (also called a ‘random variable’), out of a total number of trials, N. Each trial has the EXACT same chance of success or failure. For example, when rolling a die, the chance to roll a one is 1/6. If we call rolling a one a success and denote its chance by p, then p = 1/6, and the chance to not roll a one must be 1.0 - p, denoted by q. The symbol p means success and the symbol q means failure. Let’s say we have five dice, or we are rolling the same die five times, and we want to know the probability that a one comes up exactly three out of the five times, three being the number of successful trials of rolling a one. The full equation for the binomial probability is:

P(Y) = N! / ((N-Y)! * Y!) * p^Y * q^(N-Y)

and for this example N = 5, Y = 3, p = 1/6, and q = 5/6. The probability is notated like so: P(Y=3). If we wanted to know the probability of three or fewer ones in five rolls, we would say P(Y<=3). This is called a cumulative probability and is simply the sum P(Y=0)+P(Y=1)+P(Y=2)+P(Y=3). The abstract class Probability contains a member variable called m_RVT, meaning ‘random variable type’, which is an enumeration describing the comparison in P(Y ? y). The GetResult() method looks at this enumeration and calls the ComputeResult() function in the derived class for each P(Y=y) in the cumulative probability.
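Plugging the example's numbers straight into the formula gives a value we can sanity-check the classes against. The helper names below are illustrative, not part of the article's library:

```cpp
#include <cmath>
#include <cassert>

// C(n, k) = n! / ((n-k)! * k!), built up factor by factor so no
// full factorial is ever formed.
double Choose(int n, int k)
{
    double c = 1.0;
    for (int i = 1; i <= k; ++i)
        c = c * (n - k + i) / i;
    return c;
}

// One term of the binomial formula: C(N, Y) * p^Y * q^(N-Y).
double BinomialTerm(int n, int y, double p)
{
    return Choose(n, y) * std::pow(p, y) * std::pow(1.0 - p, n - y);
}
```

For N = 5, Y = 3, p = 1/6 this gives 10 * (1/216) * (25/36) = 250/7776, or about 0.032. Summing BinomialTerm for y = 0 through 3 gives the cumulative P(Y<=3).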

```
double Probability::GetResult() throw (ProbabilityException)
{
    if(m_probability != UNDEFINED)
        return m_probability;
    try
    {
        m_probability = 0.0;
        int i = 0;
        switch(m_RVT)
        {
        case EQUAL:
            m_probability = ComputeResult();
            break;
        case LESSTHAN:
            for(SetRV(i); i < m_RV; SetRV(++i))
                m_probability += ComputeResult();
            break;
        case GREATERTHANOREQUAL:
            for(SetRV(i), m_probability = 1.0; i < m_RV; SetRV(++i))
                m_probability -= ComputeResult();
            break;
        case GREATERTHAN:
            for(SetRV(i), m_probability = 1.0; i <= m_RV; SetRV(++i))
                m_probability -= ComputeResult();
            break;
        case LESSTHANOREQUAL:
            for(SetRV(i); i <= m_RV; SetRV(++i))
                m_probability += ComputeResult();
            break;
        case NOTEQUAL:
            m_probability = 1.0 - ComputeResult();
            return m_probability;
        default:
            throw ProbabilityException
                ("Probability error - Random Variable type is unset");
        }
    }
    catch(ProbabilityException pe)
    {
        m_probability = UNDEFINED;
        SetRV(m_RV);
        throw pe;
    }
    SetRV(m_RV);
    return m_probability;
}
```

The definition for the abstract class Probability is small and fairly straightforward:

```
class Probability
{
public:
    enum RandomVariableType { EQUAL, LESSTHAN, GREATERTHAN,
                              LESSTHANOREQUAL, GREATERTHANOREQUAL,
                              NOTEQUAL };

    /* Constructor Parameters:
       rv   - The Random Variable value.
       rvt  - Random Variable type, whether it is cumulative, and
              which way it is cumulative.
       prob - The probability of the event. This can be set
              beforehand if it is known and the GetResult function
              will return it.
    */
    Probability(int rv = 0, RandomVariableType rvt = EQUAL,
                double prob = UNDEFINED)
        : m_RVT(rvt), m_probability(prob), m_RV(rv) {}

    virtual ~Probability() {}

    class ProbabilityException
    {
    public:
        ProbabilityException(const char* whatString) : What(whatString) {}
        inline const char* what() const { return What; }
    protected:
        const char* What;
    };

    virtual double GetResult() throw (ProbabilityException);
    virtual double GetExpectedValue() const = 0;
    virtual double GetVariance() const = 0;

protected:
    virtual double ComputeResult() = 0;
    virtual void SetRV(int Y) = 0;

    RandomVariableType m_RVT;
    double m_probability;
    int m_RV;
};
```

Each specialized probability class constructor will call this base constructor, like so:

```
BinomialProbability::BinomialProbability(int N, int Y, double p,
                                         Probability::RandomVariableType rvt)
    : Probability(Y, rvt), m_trials(N), m_successes(Y),
      m_chance_of_success(p)
{
    assert(Y <= N && Y >= 0);
    assert(p >= 0.0 && p <= 1.0);
}
```

### Tackling Factorials and Large Multiplications

Any factorial above 12! promptly overflows a 32-bit unsigned integer. All of the probability classes involve multiplying a very large number by a very small number to get an end value between 0.0 and 1.0 inclusive. I use this knowledge to my advantage when computing the result so that overflow is not an issue. Each equation in the class’s ComputeResult() method is broken up into its individual factors. If the running result is over 1.0, we know that we want to divide by something or multiply by a fraction. If the running result is under 1.0, we want to multiply by a factor greater than one. This keeps the result floating around the value of 1.0 as it is being computed, and it also reduces loss of significant digits. Most of the probability classes need to compute a combination, which is basically a factorial divided by two other factorials. An interesting and very handy cancellation occurs in a combination that reduces the number of multiplications dramatically. Many uses of factorials require division by other factorials or multiplication by small numbers, so this general method of looking at the whole equation and optimizing it based on the range of the result is a good way to compute factorials, or any other equations whose intermediate results would normally be very large or very small.

```
double BinomialProbability::ComputeResult() throw (ProbabilityException)
{
    if(m_trials == 0)
        return 0.0;

    // initialize some variables
    double result = 1.0;
    int Y = m_successes;
    int N = m_trials;
    double P = m_chance_of_success;
    double Q = 1.0 - P;
    int range = 0, np = 0, nq = 0, nnumer = 0, ndenom = 0;

    // validate
    assert(Y <= N && Y >= 0);
    assert(P <= 1.0 && P >= 0.0);

    // check optimizations
    if(Y == 0)
        return result = pow(Q, N);
    if(Y == N)
        return result = pow(P, Y);

    // Reorder the factorials to account for cancellations
    // in the numerator and denominator.
    if(Y < N - Y)
        range = Y;      // (N-Y)! cancels out
    else
        range = N - Y;  // Y! cancels out

    np = Y;
    nq = N - Y;
    ndenom = range;
    nnumer = N;

    while(np > 0 || nq > 0 || ndenom > 0 || nnumer > (N - range))
    {
        // If the result is greater than one, we want to divide by
        // a denominator digit or multiply by percentage p or q.
        // If we are out of numerator digits, finish multiplying with
        // our powers of p or q or dividing by a denominator digit.
        if(result >= 1.0 || nnumer == (N - range))
        {
            if(ndenom > 0)
            {
                result /= ndenom;
                --ndenom;
            }
            else if(nq > 0)
            {
                result *= Q;
                --nq;
            }
            else if(np > 0)
            {
                result *= P;
                --np;
            }
            else
            {
                throw(ProbabilityException
                    ("Binomial Probability computation error - "
                     "check success percentage between 0 and 1"));
            }
        }
        // If the result is less than one, we want to multiply
        // by a numerator digit. If we are out of denominator digits
        // and powers of p or q, multiply the rest of the result
        // by the numerator digits.
        else if(result < 1.0 || np == 0)
        {
            if(nnumer > (N - range))
            {
                result *= nnumer;
                --nnumer;
            }
            else
            {
                throw(ProbabilityException
                    ("Binomial Probability computation error - unknown error"));
            }
        }
        else
        {
            throw(ProbabilityException
                ("Binomial Probability computation error - "
                 "possible value infinity or NaN"));
        }
    }
    return result;
}
```
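The same interleaving idea, stripped of the class plumbing, fits in a short standalone function. This is only a sketch of the technique under the same assumptions (0 &lt;= k &lt;= n, 0.0 &lt;= p &lt;= 1.0); the names are illustrative:

```cpp
#include <cassert>
#include <cmath>

// Compute C(n,k) * p^k * q^(n-k) by interleaving factors: multiply by a
// large numerator digit only while the running result is below 1.0,
// otherwise divide by a denominator digit or fold in a power of p or q,
// so intermediate values never stray far from 1.0.
double BinomialPmfSketch(int n, int k, double p)
{
    double q = 1.0 - p;
    int range = (k < n - k) ? k : n - k;   // the larger factorial cancels
    int numer = n;                         // n down to n-range+1
    int denom = range;                     // range down to 1
    int np = k, nq = n - k;                // powers of p and q remaining
    double result = 1.0;

    while (np > 0 || nq > 0 || denom > 0 || numer > n - range)
    {
        bool shrink = (result >= 1.0 || numer == n - range);
        if (shrink && denom > 0)
            result /= denom--;
        else if (shrink && nq > 0)
        {
            result *= q;
            --nq;
        }
        else if (shrink && np > 0)
        {
            result *= p;
            --np;
        }
        else
            result *= numer--;
    }
    return result;
}
```

BinomialPmfSketch(5, 3, 1.0/6.0) matches the article's dice example (N = 5, Y = 3, p = 1/6), giving 250/7776.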

### Monte Carlo Methods

One of the most intuitive ways of computing a probability is to use a random number generator and count the number of hits or misses. Again, we will figure the probability of rolling exactly three ones out of five trials. Consider the following code.

```
int Accuracy = 1000;
int N = 5;
int Y = 3;
int successes = 0;
for(int i = 0; i < Accuracy; i++)
{
    int hits = 0;
    for(int j = 0; j < N; j++)
    {
        int die = rand() % 6 + 1;  // get a random number from 1 to 6
        if(die == 1)               // check for a one rolled
            hits++;
    }
    if(hits == Y)                  // check for exactly 3 hits
        successes++;
}
double result = 1.0 * successes / Accuracy;
```

To estimate the probability for this experiment, you must run the rolling of five dice many times, count how many times three ones were rolled, and divide by the number of times the experiment was run. The problem with this method is that there is no guarantee of consistency. If the random number generator is not as random as one would like, accuracy is moot. To increase accuracy with this method, you must repeat the experiment many times; usually, to get a decent result, you would run it at least 1000 times. To get accuracy to more decimal places, you have to increase the value of Accuracy further. In complex situations, all this takes a lot of CPU cycles, proportionately more for the accuracy you need.
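As a side note, the standard library's &lt;random&gt; facilities (a C++11 addition, so an assumption beyond the article's era) make the same experiment repeatable when seeded with a fixed value:

```cpp
#include <random>
#include <cassert>
#include <cmath>

// Re-run the dice experiment with a seeded Mersenne Twister so every
// run produces the same estimate; the exact answer is 250/7776 ~ 0.0322.
double EstimateThreeOnes(int experiments)
{
    std::mt19937 gen(12345);                       // fixed seed
    std::uniform_int_distribution<int> die(1, 6);
    int successes = 0;
    for (int i = 0; i < experiments; ++i)
    {
        int hits = 0;
        for (int j = 0; j < 5; ++j)
            if (die(gen) == 1)
                ++hits;
        if (hits == 3)
            ++successes;
    }
    return static_cast<double>(successes) / experiments;
}
```

With 200,000 experiments the estimate typically lands within a few thousandths of the true value; the error shrinks only with the square root of the number of experiments, which is why the exact formulas are attractive.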

### The Geometric Probability

This probability class models the number of the trial on which an event first succeeds (or fails). Let’s assume that a weapon has a 2% chance of misfiring in competition. What is the chance the weapon will misfire on the 4th shot? Here, we only care about the 4th shot, P(Y=4), but we could use the cumulative P(Y<=4) to see the chance of a misfire before the 5th shot. When we say P(Y=4), we are asking “what is the chance of a fire, fire, fire, then a misfire?” For our example, the arguments are simply:

```
GeometricProbability gp(4, .02);
// and for cumulative
GeometricProbability gpc(4, .02, Probability::LESSTHANOREQUAL);
```
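The geometric mass function behind these calls is q^(Y-1) * p: Y-1 ordinary fires followed by one misfire. A standalone check of the arithmetic (the helper name is illustrative, not from the library):

```cpp
#include <cmath>
#include <cassert>

// P(Y = y): the first y-1 trials fail to be the event (probability q
// each), then the event occurs on trial y (probability p).
double GeometricTerm(int y, double p)
{
    return std::pow(1.0 - p, y - 1) * p;
}
```

GeometricTerm(4, 0.02) works out to 0.98^3 * 0.02, roughly 0.0188, matching P(Y=4) for the misfire example.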

### The Negative Binomial Probability

This probability class is similar to the Geometric and models the number of the trial on which the Kth event succeeds (or fails). Like the last example, we could ask what the chance is that the weapon’s third misfire comes on the 20th shot. This means that there will be two misfires within the first 19 shots and that the 20th shot is the third misfire.

```
NegativeBinomialProbability np(20, 3, .02);
```
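Behind the constructor, the negative binomial mass function pins the Kth event to trial Y exactly: the other K-1 events fall somewhere in the first Y-1 trials, giving P(Y=y) = C(y-1, k-1) * p^k * q^(y-k). A standalone sketch with illustrative names:

```cpp
#include <cmath>
#include <cassert>

// C(y-1, k-1) ways to place the first k-1 events among the first y-1
// trials, times p^k for the events and q^(y-k) for the non-events.
double NegativeBinomialTerm(int y, int k, double p)
{
    double c = 1.0;
    for (int i = 1; i <= k - 1; ++i)       // builds C(y-1, k-1)
        c = c * (y - k + i) / i;
    return c * std::pow(p, k) * std::pow(1.0 - p, y - k);
}
```

For the example, NegativeBinomialTerm(20, 3, 0.02) = C(19,2) * 0.02^3 * 0.98^17 = 171 * 0.000008 * 0.98^17, a little under 0.001.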

### The HyperGeometric Probability

This class models a more complex event: choosing from two sets. It is also the probability that models modern-day lotteries. Suppose we have a bag with three red chips and five black chips. We want to know the chance of picking exactly one red chip in three picks. The bag is dark and picking from the bag is assumed random. The HyperGeometric probability takes four arguments:

- N—the population size, which is eight chips total: three red and five black.
- n—the sample size; we are making three picks (this assumes chips are not put back into the bag after picking!).
- r—the set of items we are interested in; we want red chips and there are three in the set.
- y—the number of items from the interesting set that we want in the sample. We only care about getting EXACTLY one. If we wanted to consider one or more, or two or fewer, it would be a cumulative probability.

```
HyperGeometricProbability hgp(8, 3, 3, 1);
```
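The formula the class evaluates is P(Y=y) = C(r,y) * C(N-r, n-y) / C(N,n): choose y items from the interesting set, the rest of the sample from the other set, divided by all possible samples. A standalone sketch (helper names are illustrative):

```cpp
#include <cmath>
#include <cassert>

// C(n, k) built factor by factor.
double Choose(int n, int k)
{
    double c = 1.0;
    for (int i = 1; i <= k; ++i)
        c = c * (n - k + i) / i;
    return c;
}

// P(Y = y) when drawing n items without replacement from a population
// of N that contains r items of interest.
double HyperGeometricTerm(int N, int n, int r, int y)
{
    return Choose(r, y) * Choose(N - r, n - y) / Choose(N, n);
}
```

For the chips, HyperGeometricTerm(8, 3, 3, 1) = 3 * 10 / 56, about 0.54. For the lawsuit scenario that follows, HyperGeometricTerm(20, 4, 3, 3) = 17 / 4845, about 0.0035.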

A more practical example might be finding the probability of an event in a legal case. Twenty workers, three of whom are minorities, are chosen to fill two types of positions: 16 regular labor positions and four heavy labor positions. The minorities claim in a lawsuit that all of them were chosen for heavy labor positions, while the employer contends the assignments were random. To figure out the chance that all minorities are chosen for the heavy labor jobs, we set the population size N=20; the sample size n=4, the number of heavy labor positions; the set of items we are interested in r=3, the number of minorities; and y=3, the case where all the minorities are chosen.

```
HyperGeometricProbability hgp(20, 4, 3, 3);
```

### The Poisson Probability

The Poisson probability models events based around a known average. The average is based on events that can happen at any time and are considered instantaneous (meaning two events don’t happen at the EXACT same time). For example, the number of typing errors per page, or the number of traffic accidents at a street intersection in a year, has an average that is Poisson-“ish”. If we know that a certain intersection has an average of seven traffic accidents a year and we want to output the probability that there will be five or fewer accidents this year:

```
cout << PoissonProbability(5, 7, Probability::LESSTHANOREQUAL).GetResult();
```
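The mass function being summed is P(Y=y) = lambda^y * e^(-lambda) / y!. Each term can be built from the previous one, which avoids factorials entirely; a standalone sketch with illustrative names:

```cpp
#include <cmath>
#include <cassert>

// Cumulative Poisson P(Y <= y): start from P(Y=0) = e^(-lambda) and
// derive each following term from the previous one, since
// P(Y=i) = P(Y=i-1) * lambda / i.
double PoissonCumulative(int y, double lambda)
{
    double term = std::exp(-lambda);
    double sum = term;
    for (int i = 1; i <= y; ++i)
    {
        term *= lambda / i;
        sum += term;
    }
    return sum;
}
```

PoissonCumulative(5, 7.0) comes out near 0.30, so even with an average of seven accidents a year, a quiet year of five or fewer is not that unusual.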

### Set Theory and Composite Events

One of the things not included with these probability classes is the set theory involved in combining the probabilities to answer your own questions. To model more complex events, you need to understand the concepts of independence and mutual exclusivity between sub-events. Probability has addition and multiplication rules for independent and mutually exclusive events; these involve the concepts of union and intersection in set theory. It is beyond the scope of this article to explain them in detail. The reader should learn about Bayes’ theorem as well. There are numerous examples and explanations on the Web.

For example, the chance that a seven is rolled with two dice involves a bit more thought. To show modeling of composite events, we multiply the probabilities of the random variables Y1 and Y2, where Y1 + Y2 = 7, because each event of rolling a die is independent of the other. The notation is P(Y1 = a) intersect P(Y2 = 7-a), which equals P(Y1=a) * P(Y2=7-a).

```
int Y1, Y2;
int N = 6;
int Roll = 7;
double p = 1.0/6.0;
double result = 0.0;
for(Y1 = 1; Y1 <= 6; Y1++)
{
    Y2 = Roll - Y1;
    BinomialProbability bp1(N, Y1, p);
    BinomialProbability bp2(N, Y2, p);
    // The answer is all the ways a seven can be rolled, so we
    // accumulate each independent probability of
    // P(Y1=a) * P(Y2=7-a).
    result += bp1.GetResult() * bp2.GetResult();
}
cout << result;
```

Also included with each probability class are the Expected Value (or mean) and the Variance of the distribution. These can be retrieved with the GetExpectedValue() and GetVariance() functions.
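For the binomial case, the well-known closed forms are E[Y] = N*p and Var(Y) = N*p*q, which is presumably what GetExpectedValue() and GetVariance() return for that class. A standalone check with illustrative names:

```cpp
#include <cmath>
#include <cassert>

// Standard closed forms for the binomial distribution's mean and variance.
double BinomialExpectedValue(int n, double p) { return n * p; }
double BinomialVariance(int n, double p)      { return n * p * (1.0 - p); }
```

For the five-roll dice example, the expected number of ones is 5 * (1/6) = 5/6, and the variance is 5 * (1/6) * (5/6) = 25/36.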