By Julienne Walker
License: Public Domain
There is nothing wrong with rand() per se, though
it is notorious for being implemented poorly such that the low order bits tend to
cycle in uncomfortably short amounts. These days that problem has been largely corrected,
but despite this, rand() is still not recommended if one wants a reasonable sequence
of random numbers. For many applications, rand() will perform admirably when used
correctly, but with the current sad state of affairs, rand() is very rarely used
correctly.
The problem is that of distribution. A random number
generator is designed to distribute numbers evenly across a range such that the
probability of any newly generated number being any number in the range is 1/p.
This is called a “uniform” distribution.
Everything is great so far, but nine times out of ten, a programmer will want to
define the range. When rolling a six sided die, it just does not do for the random
number generator to give you 1845373. The typical solution for this is to use C
and C++'s remainder operator to force the range down to whatever is desired. For
example, to shrink the range into [0..N), one might do this:
1 int r = rand() % N;
To instead add a lower bound and set the range to
[M..N), one might do this:
1 int r = M + rand() % ( N - M );
Anyone who does this will be rewarded with a seemingly
random sequence and be thrilled that their clever solution worked. Unfotunately,
this does not work. The first solution only works when N evenly
divides into RAND_MAX. The second solution isn't any better. The
reason is because forcing the range in this way eliminates any chance of having a
uniform distribution. Now, this is okay if you care nothing about some numbers being
more probable than others, but to be correct, you must work with the distribution
instead of destroy it.
Most experienced C and C++ programmers will tell you
that using remainder with random numbers is bad, but not many of them will be able
to tell you why. This is similar to the goto issue, where beginners are taught to
blindly believe that any code with a goto statement is spaghetti code and any use
of goto is evil. The solution suggested to get around the remainder issue is usually
division by RAND_MAX. For a range of [0..N), you would do something
like this:
1 int r = rand() / ( RAND_MAX / N + 1 );
To get a more flexible range, you would do something
more like this:
1 int r = M + rand() / ( RAND_MAX / ( N - M ) + 1 );
Once again, this produces a seemingly random sequence.
The seasoned community is happy because you are not using the remainder operator
with rand, and you are happy because you got your random numbers and foiled the
evils of rand(). But wait! This solution is not a fix all for rand()'s problems.
In fact, the problem of distribution is still there. The trick of dividing by
RAND_MAX uses the high order bits of a random number produced by rand(),
thus alleviating the old problem of poor rand() implementations with non-random
low order bits. But as I said, these days that problem has been largely fixed by
library vendors, so the only thing you have managed to do is make the code more
complicated. It buys you nothing, and before you start thinking that doing the same
thing with floating-point will help, I assure you that it will not. Floating-point
can be used to solve the problem, but probably not the way you were thinking.
A real fix is to work with the distribution rather
than try to circumvent it. Simply call rand() until you get a number that fits within
your selected range:
1 int r;
2
3 do
4 r = rand();
5 while ( r < low || high < r );
Not only does this fix the distribution problem, it
also avoids the low order bit problem entirely. The down side to this solution is
that the performance is dictated by the range. A small range could potentially call
rand() many times. If you are worried about that possible performance hit, an alternative
is to create a uniform deviate from the random number. This floating-point value in the
range [0..1) can then be multiplied to fix the range. The uniform deviate would be used
like so for [0..N):
1 double uniform_deviate ( int seed )
2 {
3 return seed * ( 1.0 / ( RAND_MAX + 1.0 ) );
4 }
5
6 int r = uniform_deviate ( rand() ) * N;
For a more flexible range, the code is simple:
1 int r = M + uniform_deviate ( rand() ) * ( N - M );
Uniform deviates also may incur a performance penalty if
floating-point operations are significantly slower on the machine than integer operations.
It should be kept in mind that these distribution
problems are consistent with all random number generators that return a ranged integer,
not just rand(). So please do not read this article as me harping on rand(), rather
read it as me harping on the majority of programmers who use it incorrectly.
Seeding rand()
Probably the most vexing problem with how people use
rand() is how the generator is seeded. The usual solution is to get the system time
using the time() function. In theory, this is a great idea because not only does
it seed the random number generator with an ever changing value, but it is also
portable:
1 srand ( (unsigned int)time ( NULL ) );
Right? Well no, not really. The first issue is that
time_t is a restricted type, and may not be meaningfully converted
to unsigned int. That is not a terribly big issue as I have yet
to see a system where it would fail to work correctly, and I do not know anyone
who has either. However, a more noticeable problem is when a program is run many
times in quick succession. For example, take the following program:
1 #include <cstdlib>
2 #include <ctime>
3 #include <iostream>
4
5 using namespace std;
6
7 double uniform_deviate ( int seed )
8 {
9 return seed * ( 1.0 / ( RAND_MAX + 1.0 ) );
10 }
11
12 /* This is a test */
13 int main ( void )
14 {
15 // Another test
16 srand ( static_cast<unsigned int> ( time ( 0 ) ) );
17
18 for ( int i = 0; i < 10; i++ ) {
19 int r = uniform_deviate ( rand() ) * 10;
20 cout<< r <<' ';
21 }
22
23 cout<<'\n';
24 }
Run it several times, giving yourself a few seconds
between each run. Notice how the first number is always the same? This is the problem
with using the system time directly as a seed for rand(). The seed is different
each time the program is run, but if it is not sufficiently different, then the
first number generated by rand will be within a small cluster of values. This is
hardly random behavior, and it is because the system time grows in a predictable
direction at predictable intervals.
If the range has been shrunk, such as with the above
program, it may be some time before the seed will be different enough to change
the first random number. For example, on one of my systems, I would have to wait
about 17 minutes before the first number changed in the above program.
These two reasons make using the system time directly as a seed for rand() a flawed
solution, but there is a way to use the result of time() portably as a seed for
rand() while fixing this nasty little progression problem; just hash the bytes of
a time_t:
1 unsigned time_seed()
2 {
3 time_t now = time ( 0 );
4 unsigned char *p = (unsigned char *)&now;
5 unsigned seed = 0;
6 size_t i;
7
8 for ( i = 0; i < sizeof now; i++ )
9 seed = seed * ( UCHAR_MAX + 2U ) + p[i];
10
11 return seed;
12 }
13
14 srand ( time_seed() );
A hash will take advantage of the way the system time
changes while maintaining an unpredictable air about it. Even better, the C and
C++ standards guarantee that type punning is a portable operation for simple types,
and time_t is a simple type. So hashing the system time and seeding
rand() with it is a portable solution with desirable properties.
Of course, one might wonder why we should go to all of the trouble when time_t
is extremely likely to work and we can just throw away the first random number with
a dummy call to rand. That's certainly another solution, but it's more of a Band-Aid
than a true correction of the problem. The uncertainty of being able to use
time_t alone is a good enough reason to avoid seeding rand with one.
There is really nothing wrong with rand(), despite
what you may be told. The problem is with the people who use it, and how they use
it incorrectly and without having all of the facts. Another problem is that one
can use rand() incorrectly and still get seemingly random results, so the incorrect
use appears to work as expected.
Now for the big question: Does any of this matter to you?
Yes, of course it does. A keen understanding of how things work will give you insight
into writing good code. Will you encounter these problems in everyday programming?
Probably not. Pseudorandom numbers are typically expected to be sufficiently random
for the application but no more. As such, most people who use rand() will either not
see, or not care, about distribution issues. In fact, these issues are generally not
noticeable except on a very large scale.
An article on rand() isn't complete without the usual caveat that rand() isn't required
to be a strong random number generator. In fact, it's usually pretty weak. For little
things that don't need a good random number generator, rand() will do fine. For everything
else, a much stronger algorithm with more guarantees, such as the Mersenne Twister,
is recommended. In other words, for a dice rolling school project, rand() is just
peachy. But for a slot machine in Las Vegas, you need the very best you can get,
and rand() isn't it.