

Statistical Tests





  1. $\chi^2$-goodness-of-fit test for the uniform distribution

    The following test is considered in order to check whether the pseudo-random numbers $ u_1,\ldots,u_n$ can be regarded as realizations of independent random variables $ U_1,\ldots,U_n$ that are uniformly distributed on $ (0,1]$.

    Remarks
     
    • As a generalization of the $\chi^2$-goodness-of-fit test for the uniform distribution of a sample, one can also check
      • whether, for a given natural number $ d\ge 1$ (e.g. $ d=2$ or $ d=3$), the pseudo-random vectors $ (u_1,\ldots,u_d),\ldots,(u_{(n-1)d+1},\ldots,u_{nd})$
      • can be regarded as realizations of independent random vectors $ (U_1,\ldots,U_d),\ldots,(U_{(n-1)d+1},\ldots,U_{nd})$ that are uniformly distributed on $ (0,1]^d$.
    • For this purpose the unit cube $ (0,1]^d$ is divided into $ r^d$ smaller cubes $ B_j$ of equal size,
      • which are of the form $ ((i_1-1)/r,i_1/r]\times\ldots\times((i_d-1)/r,i_d/r]$ with $ i_1,\ldots,i_d\in\{1,\ldots,r\}$.
      • Furthermore, we consider the $ (r^d-1)$-dimensional (hypothetical) vector $ {\mathbf{p}}_0=(1/r^d,\ldots,1/r^d)$ of parameters and
      • the test statistic $ T_n:\mathbb{R}^{nd}\to[0,\infty)$ where

        $\displaystyle T_n({\mathbf{u}}_1,\ldots,{\mathbf{u}}_n)=\sum_{j=1}^{r^d}\;\frac{\bigl(Z_j({\mathbf{u}}_1,\ldots,{\mathbf{u}}_n)-n/r^d\bigr)^2}{n/r^d}\;,$

        where $ {\mathbf{u}}_i=(u_{(i-1)d+1},\ldots,u_{id})$ and $ Z_j({\mathbf{u}}_1,\ldots,{\mathbf{u}}_n)=\#\{i:\, 1\le i\le n,\, {\mathbf{u}}_i\in B_j\}$. Notice that $ Z_j({\mathbf{u}}_1,\ldots,{\mathbf{u}}_n)$ denotes the number of pseudo-random vectors falling into the cube $ B_j$; a small numerical sketch of this test is given below.

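As an illustration of this multidimensional $\chi^2$-test, the following is a minimal Python sketch; the function name chi2_uniformity_test, the choices of $ d$, $ r$ and $ \alpha$, and the NumPy generator that merely stands in for the pseudo-random numbers to be tested are all illustrative assumptions, not part of the construction above. It computes the counts $ Z_j$, the statistic $ T_n$ and the critical value $ \chi^2_{r^d-1,1-\alpha}$.

import numpy as np
from scipy.stats import chi2

def chi2_uniformity_test(u, d=2, r=4, alpha=0.05):
    """Chi-square goodness-of-fit test of uniformity on (0,1]^d.

    u is a flat sequence of pseudo-random numbers; it is grouped into n
    non-overlapping d-dimensional vectors, and Z_j counts how many of
    them fall into each of the r^d cubes B_j.
    """
    n = len(u) // d                                    # number of d-dimensional vectors
    vectors = np.asarray(u[:n * d]).reshape(n, d)
    # component-wise cell index in {0, ..., r-1}; NumPy generators return
    # values in [0, 1), so the cells are treated as right-open here, which
    # does not affect the statistic
    cells = np.minimum(np.floor(r * vectors).astype(int), r - 1)
    j = np.ravel_multi_index(cells.T, dims=(r,) * d)   # flat index of cube B_j
    z = np.bincount(j, minlength=r ** d)               # counts Z_j
    expected = n / r ** d                              # hypothetical value n / r^d
    t_n = np.sum((z - expected) ** 2 / expected)       # test statistic T_n
    critical = chi2.ppf(1 - alpha, df=r ** d - 1)      # chi^2_{r^d - 1, 1 - alpha}
    return t_n, critical, t_n > critical               # True = reject H_0

# example usage with a stand-in generator for the numbers to be tested
rng = np.random.default_rng(seed=1)
print(chi2_uniformity_test(rng.random(20000), d=2, r=4))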

  2. Run Test

    There are a number of other significance tests that allow one to evaluate the quality of random number generators. In particular, it can be verified whether the lengths of the monotonically increasing runs of the generated sequence $ u_1,u_2,\ldots$ behave as they would for independent and $ (0,1]$-uniformly distributed random variables.


    The significance test that will be constructed is based on the following property of the runs $ W_1,W_2,\ldots$.

    Theorem 3.3   The random variables $ W_1,W_2,\ldots$ introduced in (5) and (6) are independent and identically distributed such that

    $\displaystyle P(W_j=k)=\frac{k}{(k+1)!}\;,\qquad\forall\, k=1,2,\ldots\,,$ (7)

    if the random variables $ U_1,U_2,\ldots$ are independent and uniformly distributed on $ (0,1]$.

    Proof
     
    • Let $ U_1,U_2,\ldots$ be independent and uniformly distributed on $ (0,1]$.
      • Then for all $ n\ge 1$ and for arbitrary natural numbers $ k_1,\ldots,k_n\ge 1$, we get that
        $\displaystyle \begin{aligned}
        P(W_1=k_1,\ldots,W_n=k_n)
        &= P\bigl(V_1=k_1,\,V_2-V_1-1=k_2,\ldots,V_n-V_{n-1}-1=k_n\bigr)\\
        &= P\bigl(V_1=k_1,\,V_2=k_2+k_1+1,\ldots,V_n=k_n+\ldots+k_1+n-1\bigr)\\
        &= P\bigl(U_i\le U_{i+1}\;\forall\,i=1,\ldots,k_1-1,\;U_{k_1}>U_{k_1+1},\\
        &\qquad\; U_i\le U_{i+1}\;\forall\,i=k_1+2,\ldots,k_1+1+k_2-1,\;U_{k_1+1+k_2}>U_{k_1+1+k_2+1},\;\ldots,\\
        &\qquad\; U_i\le U_{i+1}\;\forall\,i=k_1+1+\ldots+k_{n-1}+2,\ldots,k_1+1+\ldots+k_{n-1}+1+k_n-1,\\
        &\qquad\qquad U_{k_1+1+\ldots+k_{n-1}+1+k_n}>U_{k_1+1+\ldots+k_{n-1}+1+k_n+1}\bigr)\\
        &= P\bigl(U_i\le U_{i+1}\;\forall\,i=1,\ldots,k_1-1,\;U_{k_1}>U_{k_1+1}\bigr)\\
        &\qquad\cdot\, P\bigl(U_i\le U_{i+1}\;\forall\,i=k_1+2,\ldots,k_1+1+k_2-1,\;U_{k_1+1+k_2}>U_{k_1+1+k_2+1}\bigr)\\
        &\qquad\cdot\,\ldots\,\cdot\, P\bigl(U_i\le U_{i+1}\;\forall\,i=k_1+1+\ldots+k_{n-1}+2,\ldots,k_1+1+\ldots+k_{n-1}+1+k_n-1,\\
        &\qquad\qquad\quad U_{k_1+1+\ldots+k_{n-1}+1+k_n}>U_{k_1+1+\ldots+k_{n-1}+1+k_n+1}\bigr)\\
        &= P\bigl(U_i\le U_{i+1}\;\forall\,i=1,\ldots,k_1-1,\;U_{k_1}>U_{k_1+1}\bigr)\cdot\ldots\cdot
           P\bigl(U_i\le U_{i+1}\;\forall\,i=1,\ldots,k_n-1,\;U_{k_n}>U_{k_n+1}\bigr)\,,
        \end{aligned}$

        where the factorization in the second-to-last step uses the independence of $ U_1,U_2,\ldots$ and the last equality follows since the $ U_i$ are identically distributed.

      • This implies that the runs $ W_1,W_2,\ldots$ are independent and identically distributed.
    • Furthermore, an induction argument shows that for arbitrary $ k\in\mathbb{N}$ and $ t\in(0,1]$

      $\displaystyle P(U_1\le\ldots\le U_k\le t)=\frac{t^k}{k!}\,.$ (8)

      • For $ k=1$, equation (8) obviously holds. By the formula of total probability we obtain
        $\displaystyle \begin{aligned}
        P(U_1\le\ldots\le U_{k+1}\le t)
        &= \int_0^1 P(U_1\le\ldots\le U_k\le U_{k+1}\le t\mid U_{k+1}=x)\,P(U_{k+1}\in dx)\\
        &= \int_0^1 P(U_1\le\ldots\le U_k\le x\le t\mid U_{k+1}=x)\,P(U_{k+1}\in dx)\\
        &= \int_0^1 P(U_1\le\ldots\le U_k\le x\le t)\,dx\,,
        \end{aligned}$

        where the last equality is a consequence of the independence and $ (0,1]$-uniform distribution of $ U_1,U_2,\ldots$.
      • Assume now that (8) is true for some $ k\ge 1$. Then
        $\displaystyle \begin{aligned}
        P(U_1\le\ldots\le U_{k+1}\le t)
        &= \int_0^t P(U_1\le\ldots\le U_k\le x)\,dx\\
        &= \int_0^t \frac{x^k}{k!}\,dx \;=\; \frac{t^{k+1}}{(k+1)!}\;,
        \end{aligned}$

        where the second equality uses the induction hypothesis.
      • This shows (8) for any $ k\ge 1$.
    • Moreover, by (8) we can conclude that for any $ k\in\mathbb{N}$
      $\displaystyle \begin{aligned}
      P(U_1\le\ldots\le U_k,\,U_k>U_{k+1})
      &= \int_0^1 P(U_1\le\ldots\le U_k,\,U_k>x)\,dx\\
      &= \int_0^1 \Bigl(P(U_1\le\ldots\le U_k\le 1)-P(U_1\le\ldots\le U_k\le x)\Bigr)\,dx\\
      &= \int_0^1 \Bigl(\frac{1}{k!}\;-\;\frac{x^k}{k!}\Bigr)\,dx\\
      &= \frac{1}{k!}\;-\;\frac{1}{(k+1)!}\;=\;\frac{k}{(k+1)!}\;.
      \end{aligned}$


       
        $ \Box$

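    As a quick consistency check, note that (7) indeed defines a probability distribution on $ \{1,2,\ldots\}$: using the representation $ \frac{k}{(k+1)!}=\frac{1}{k!}-\frac{1}{(k+1)!}$ obtained at the end of the proof, the probabilities telescope to

    $\displaystyle \sum_{k=1}^\infty P(W_j=k)=\sum_{k=1}^\infty\Bigl(\frac{1}{k!}-\frac{1}{(k+1)!}\Bigr)=1\,.$

    In particular, $ P(W_j=1)=1/2$, $ P(W_j=2)=1/3$ and $ P(W_j=3)=1/8$, so short runs dominate.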

    Remarks
     
    • Let us assume that sufficiently many pseudo-random numbers $ u_1,u_2,\ldots$ have been generated, resulting in the $ n$ runs $ w_1,\ldots,w_n$ according to (5) and (6).
    • We choose $ r$ pairwise disjoint intervals $ (a_1,b_1],\ldots,(a_r,b_r]$ on the positive real axis such that
      • the probabilities

        $\displaystyle p_{0,j}=\sum_{k\in\mathbb{N}\cap(a_j,b_j]}\frac{k}{(k+1)!}\;,\qquad j=1,\ldots,r,$

        are almost equal.
      • For these probabilities we consider the $ (r-1)$-dimensional (hypothetical) vector
        $ {\mathbf{p}}_0=(p_{0,1},\ldots,p_{0,r-1})$ and
      • the test statistic $ T_n:\mathbb{R}^n\to[0,\infty)$ where

        $\displaystyle T_n(w_1,\ldots,w_n)=\sum_{j=1}^r\;\frac{\bigl(Y_j(w_1,\ldots,w_n)-np_{0,j}\bigr)^2}{np_{0,j}}\;,$

        and $ Y_j(w_1,\ldots,w_n)=\#\{i:\, 1\le i\le n,\, a_j<w_i\le b_j\}$ denotes the number of run lengths among $ w_1,\ldots,w_n$ belonging to class $ j$.
    • According to Theorem 3.3, for large $ n$ the hypothesis $ H_0:{\mathbf{p}}={\mathbf{p}}_0$ will be rejected if $ T_n(w_1,\ldots,w_n)>\chi^2_{r-1,1-\alpha}$. Note that this requires the generation of a sufficiently large number of pseudo-random numbers $ u_1,u_2,\ldots$; a sketch of the whole procedure in code is given below.

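To illustrate the whole procedure, here is a minimal Python sketch of the run test. The run extraction follows the decomposition used in the proof of Theorem 3.3 (the element immediately after each descent is skipped); the four classes $ \{1\},\{2\},\{3\},\{4,5,\ldots\}$, the helper names run_lengths and run_test, and the NumPy generator are illustrative choices only. In practice the classes should be chosen so that the probabilities $ p_{0,j}$ are almost equal, as recommended above.

import math
import numpy as np
from scipy.stats import chi2

def run_lengths(u):
    """Run lengths w_1, w_2, ...: length of each increasing run, where the
    element immediately after every descent is skipped (cf. the proof of
    Theorem 3.3, which makes consecutive runs independent)."""
    runs, start, i = [], 0, 0
    while i + 1 < len(u):
        if u[i] > u[i + 1]:
            runs.append(i - start + 1)   # run of length i - start + 1 ends here
            start, i = i + 2, i + 2      # skip u[i + 1]
        else:
            i += 1
    return np.array(runs)

def run_test(u, alpha=0.05):
    """Chi-square run test with the illustrative classes {1}, {2}, {3}, {4, 5, ...}."""
    w = run_lengths(u)
    n = len(w)
    y = np.array([np.sum(w == 1), np.sum(w == 2), np.sum(w == 3), np.sum(w >= 4)])
    p = [k / math.factorial(k + 1) for k in (1, 2, 3)]   # k / (k+1)! from (7)
    p0 = np.array(p + [1.0 - sum(p)])                    # remaining mass for the last class
    t_n = np.sum((y - n * p0) ** 2 / (n * p0))           # test statistic T_n
    critical = chi2.ppf(1 - alpha, df=len(p0) - 1)       # chi^2_{r-1, 1-alpha}
    return t_n, critical, t_n > critical                 # True = reject H_0

# example usage with a stand-in generator for the numbers to be tested
rng = np.random.default_rng(seed=3)
print(run_test(rng.random(100000)))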


Ursa Pantle 2006-07-20