Thursday, November 27, 2008

That was random

One thing that computers are really bad at is coming up with random numbers. You ask the computer to do something specific and tell it exactly how, and it will do it exactly right every time. But ask it to come up with a random number, and it will ask "how?" Well, there is no "how" for that question, you just pick a number at random. A human can do it better, but even we're not that good at it. If you ask a million people to each pick a random number between 1 and 1000, the number of people who choose numbers less than 10 or greater than 990 should be around 2% but is likely to be much less. And who's going to pick 500? Or 666? That's not very random, right?

In recent years, some computers have been built with real random number generators; these use things like thermal noise generated by the processor itself to come up with truly random numbers. In the majority of cases, though, this isn't available, or at least it may not be so you have to assume it isn't. You are then forced to use a pseudo-random number generator (PRNG), which gives you a sequence of numbers that appear to be random, but are actually completely reproducible if you "seed" the generator with the same value each time. This is helpful for us programmers when trying to reproduce a problem, but in general you want your pseudo-random numbers to be closer to real randomness, so you need to seed the PRNG with a different value each time. Coming up with enough entropy in your PRNG seed can be a difficult problem.

Many programs choose the number of seconds since midnight January 1, 1970, since it's a fairly easy number to obtain in the C language (for historical reasons that I am not going to go into here, mainly because I have no idea what they are). However, if you have multiple programs starting at the same time, they can end up using the same seed and therefore the same sequence of pseudo-random numbers, which can be a serious security hole. So some programs go much further in picking a seed — combining the current time with the process ID or some other piece of data that changes frequently in an unpredictable way. I have heard of programs that ask the user to type a sentence, and calculate entropy by analyzing the typing pattern of the user — the average number of milliseconds between keystrokes and stuff like that.

The end result is that programs sometimes have to go to a lot of trouble to come up with a seed for the PRNG that has sufficient entropy. Essentially, you need to come up with a good pseudo-random number in order to generate pseudo-random numbers.

No comments: