Simple sampling
Rather than talking about sampling very generally, I’ll focus on self-avoiding walks (because they are relatively simple to play with) and some similar objects like random walks. Since we cannot compute some of these quantities (and their associated statistics etc) exactly in a reasonable amount of computer time, we can trade precision for speed and try to compute them approximately. My own introduction to that idea came through something called the Rosenbluth method, which I’ll touch on, but to get to that, and to better approaches, we should start with some very simple sampling methods. Let’s start with a 1d random walk. Part of the reason to introduce this is also to introduce some simple bits of python. Some of you might have plenty of python experience, and some of you almost none — just Patrick’s introduction to visualisation etc yesterday. So I’ll start with some very basic things and we’ll work up from there. My apologies if this is very basic for some of you.
A simple random walk in 1d
Let’s just build a random walk in 1d by starting at 0, flipping a coin at each step, and recording our movements.
```python
from random import randint

N=10
x=0
walk=[x]
for n in range(0,N):
    if(randint(1,2)==1): x+=1
    else: x-=1
    walk.append(x)
print(walk)
```
This gives output (say)
[0, -1, 0, 1, 0, 1, 0, 1, 2, 3, 4]
Which is just our 1d random walk at each time step.
A simple random walk in 2d
Let’s go up one dimension and look at a simple random walk on the plane:
```python
from random import randint

N=10
x,y=0,0
walk=[(x,y)]
for n in range(0,N):
    c=randint(0,3) #0=R, 1=L, 2=U, 3=D
    if(c==0): x+=1
    elif(c==1): x-=1
    elif(c==2): y+=1
    elif(c==3): y-=1
    walk.append( (x,y) )
print(walk)
```
This gives output (say)
[(0, 0), (0, 1), (1, 1), (0, 1), (1, 1), (0, 1), (0, 2), (-1, 2), (-1, 1), (-1, 0), (-1, 1)]
Which is a list of the coordinates of our 2d random walk at each time step.
This method of producing a sample takes $O(N)$ time — ie it is linear in the length of the walk.
Notice that we can clean up our “next vertex in direction c” code by storing the move for each direction in a list. We’ll do that in the next code blurb.
A simple random walk in 2d — now with pictures!
Let’s pull in matplotlib and use it to plot things. Notice this is slightly complicated by the fact that we are producing a list of tuples $[(x_1,y_1), (x_2,y_2),\dots]$, but the plot command wants the data arranged as $[x_1,x_2,\dots], [y_1,y_2,\dots]$. Thankfully there is a python command that does precisely that — zip.
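As a tiny standalone illustration (my own, separate from the walk code below), here is zip(*...) transposing a short list of coordinate pairs:

```python
# zip(*walk) unpacks the list of tuples and zips the first coordinates
# together and the second coordinates together, ie it transposes the data.
walk = [(0,0), (1,0), (1,1), (1,2)]
xs, ys = zip(*walk)
print(xs) # (0, 1, 1, 1)
print(ys) # (0, 0, 1, 2)
```

Now back to the walk: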
```python
from random import randint
import matplotlib.pyplot as plt

#0=R, 1=L, 2=U, 3=D
#encode delta x,y for each direction
xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]

N=10
x,y=0,0
walk=[(x,y)]
for n in range(0,N):
    c=randint(0,3)
    x += xMoves[c]
    y += yMoves[c]
    walk.append( (x,y) )
print(walk)

#Plot function wants [x-coords],[y-coords] - use zip to do this
xs,ys = zip( *walk ) #split our list of coords
# just add a little buffer around the edges of our plot
plt.axis([ min(xs)-1, max(xs)+1, min(ys)-1, max(ys)+1])
# now plot the data with blue blobs and a line.
plt.plot(xs,ys,'bo-')
plt.show()
```
This returns to us a figure like this
Notice that
- we can clearly see that our random walk self-intersects,
- there is a 1/4 chance that our next step will cause an immediate problem (a self-intersection), and
- we could improve this by requiring that our next step does not reverse the last one — a random walk with no immediate backtracking.
A simple random walk in 2d — no immediate backtracking
To ensure no backtracking, we record the step we are not allowed to take next. So we create another list which encodes “if you stepped in direction $c$, then your next step cannot be in the reverse direction wayBack[c]”.
```python
from random import randint
import matplotlib.pyplot as plt

N=20
x,y=0,0
walk=[(x,y)]

#0=R, 1=L, 2=U, 3=D
xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2] #reverse direction
dontGo=-1 #record step direction.

for n in range(0,N):
    c=randint(0,3)
    while(c==dontGo):
        c=randint(0,3) #keep picking until we get a different direction
    x+=xMoves[c]
    y+=yMoves[c]
    dontGo=wayBack[c]
    walk.append( (x,y) )

## Now build the plot
xs,ys = zip( *walk )
plt.axis([ min(xs)-1, max(xs)+1, min(ys)-1, max(ys)+1])
plt.plot(xs,ys,'bo-')
plt.show()
```
This returns to us a figure like this
It is better than before, but there are still lots of intersections.
A simple random walk in 2d — how many self-intersections
We can make small changes to our code in order to keep track of how many self-intersections our random walk is creating:
```python
from random import randint
from collections import defaultdict
# defaultdict is a useful map container which handles missing keys easily

occupancy=defaultdict(int) # track multiply occupied sites; "int" means missing keys default to 0

N=20
x,y=0,0
walk=[(x,y)]
occupancy[ (x,y) ]=1 # the first vertex occupied

xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2] #reverse direction
dontGo=-1 #record step direction.

for n in range(0,N):
    c=randint(0,3)
    while(c==dontGo):
        c=randint(0,3)
    x+=xMoves[c]
    y+=yMoves[c]
    dontGo=wayBack[c]
    walk.append( (x,y) )
    if( (x,y) in occupancy ):
        print( "ooo! an intersection at ", (x,y) )
    occupancy[(x,y)]+=1

print(walk)
print("Multiply occupied sites are:")
for p in occupancy:
    if( occupancy[p]>1 ):
        print(p, occupancy[p])
```
It returns to us something like
```
ooo! an intersection at  (-1, 0)
ooo! an intersection at  (0, 0)
[(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-2, 1), (-2, 0), (-1, 0), (0, 0), (1, 0), (1, 1), (2, 1), (3, 1), (4, 1), (4, 0), (3, 0), (3, -1), (4, -1), (4, -2), (4, -3), (4, -4)]
Multiply occupied sites are:
(0, 0) 2
(-1, 0) 2
```
You can see that we get plenty of intersections — indeed, as the walk gets longer, the chance of a self-intersection gets higher.
A simple random walk in 2d — how many SAWs?
“Higher” is not terribly precise, so let’s quantify it by looking at the survival time. We’ll run our simple non-backtracking random walk and stop as soon as we get a self-intersection (ie as soon as the walk ceases to be a SAW). Run this many times and record how many walks survive to time $n$. It turns out that this number decays exponentially (we’ll even prove this), so we’ll present our data using a semi-log plot.
```python
from random import randint
import matplotlib.pyplot as plt

N=100   #walks of max length N
S=10000 #We'll run it S times

# the number of walks surviving to time n
counts=[0 for i in range(0,N+1)]

xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2]

def survivalTime():
    x,y=0,0
    dontGo=-1
    occupied=set() #clean each time
    for n in range(0,N):
        occupied.add( (x,y) )
        counts[n]+=1
        c=randint(0,3)
        while(c==dontGo):
            c=randint(0,3)
        x+=xMoves[c]
        y+=yMoves[c]
        dontGo=wayBack[c]
        if( (x,y) in occupied ):
            break #stop as soon as self-intersection
    return

for s in range(0,S):
    survivalTime()

plt.bar(range(N+1), counts, color="blue")
plt.yscale('log')
plt.show()
```
This returns to us a figure like this
from which the exponential decay of the survival probability is very obvious.
Proving it
We can actually prove this exponential decay pretty easily. The number of non-backtracking random walks of length $n$ is $4\cdot 3^{n-1}$, and since every such walk is equally likely, the probability of surviving to length $n$ is exactly $c_n/(4\cdot 3^{n-1})$, where $c_n$ is the number of SAWs of length $n$. From the submultiplicativity of SAWs we know that
$$c_{n+m} \le c_n \cdot c_m.$$
A little Feketing (Feketelemmarising?) then gives us
$$\mu = \lim_{n\to\infty} c_n^{1/n} = \inf_{n\ge 1} c_n^{1/n}.$$
So provided we can show that $c_n^{1/n} < 3$ for some $n$, then we know that the limit
$$\lim_{n\to\infty} \frac{c_n}{4\cdot 3^{n-1}} = 0$$
and that it goes to zero exponentially quickly. To prove that $c_n^{1/n} < 3$ we can just count SAWs up to some short length. The easiest way is to do a simple recursive backtracking walk. Note — this is definitely not the fastest way to do this, but it is a very quick and easy way to get some numbers.
```python
N=10
lattice=set()
count=[0 for i in range(0,N+1)]

#depth first search through possible SAWs
def step(l,x,y):
    count[l]+=1
    if(l<N):
        lattice.add( (x,y) ) # occupy site
        # recursively step in each legal direction
        if( (x+1,y) not in lattice ): step(l+1,x+1,y)
        if( (x-1,y) not in lattice ): step(l+1,x-1,y)
        if( (x,y+1) not in lattice ): step(l+1,x,y+1)
        if( (x,y-1) not in lattice ): step(l+1,x,y-1)
        lattice.remove( (x,y) ) #clean up

step(0,0,0) #start with length 0 at (0,0)
for l in range(1,N+1):
    print(l, count[l], count[l]**(1.0/l) )
```
This returns to us a list of counts $c_l$ and also $c_l^{1/l}$:
```
1 4 4.0
2 12 3.4641016151377544
3 36 3.3019272488946263
4 100 3.1622776601683795
5 284 3.0950214840037007
6 780 3.034001331989801
7 2172 2.997051875398709
8 5916 2.961443972633954
9 16268 2.937149267706373
10 44100 2.913693458576192
```
So — since $c_7^{1/7} \approx 2.997 < 3$ (and getting smaller thereafter), we know that the probability of our simple non-backtracking random walk surviving to produce a SAW of length $n$ decays to zero exponentially quickly. This proves that simple sampling of these walks is never going to be much use.
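To spell out the arithmetic behind that conclusion (a sketch, using $c_7 = 2172$ from the table above and, for simplicity, lengths that are multiples of 7): submultiplicativity gives $c_{7k} \le c_7^{\,k}$, so for $n = 7k$
$$\Pr(\text{survive to length } n) = \frac{c_n}{4\cdot 3^{n-1}} \le \frac{3}{4}\left(\frac{c_7^{1/7}}{3}\right)^{n} \approx \frac{3}{4}\,(0.999)^{n}.$$
This particular bound decays very slowly because $c_7^{1/7}$ is only just below 3; the true decay rate, $\mu/3 \approx 0.88$ per step, is much faster.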
A quick aside to estimate the growth rate
As an aside — we can also use our quick and dirty survival time histogram data to estimate the rate of decay. Just a semi-log plot and some regression.
```python
from random import randint
import matplotlib.pyplot as plt
import numpy as np #lots of handy numerical stuff

N=100   #walks of max length N
S=10000 #We'll run it S times

counts=[0 for i in range(0,N+1)]

xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2]

def survivalTime():
    x,y=0,0
    dontGo=-1
    occupied=set() #clean each time
    for n in range(0,N):
        occupied.add( (x,y) )
        counts[n]+=1
        c=randint(0,3)
        while(c==dontGo):
            c=randint(0,3)
        x+=xMoves[c]
        y+=yMoves[c]
        dontGo=wayBack[c]
        if( (x,y) in occupied ):
            break #stop as soon as self-intersection
    return

for s in range(0,S):
    survivalTime()

#to do simple linear regression we need lists of x_i, y_i
x=[]; y=[]
for l in range(10,55):
    if( counts[l]>0 ): #log of 0 is bad
        x.append(l)
        y.append( np.log(counts[l]) )

fit = np.polyfit(x,y,1)      #linear regression
fitFunction = np.poly1d(fit) #the regression line
print( fit, np.exp(fit[0])*3 ) #fit coeffs and mu est.
plt.plot(x,y,'bo', x,fitFunction(x),'r-')
plt.show()
```
This gives us an estimate of the decay rate of about $0.88$ per step, which (after multiplying by 3) gives a rough estimate of the exponential growth rate of the number of SAWs as $\mu \approx 2.64$. Not too far off the actual $\mu = 2.63815\dots$.
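To spell out why multiplying by 3 recovers $\mu$ (a sketch, assuming the survivor counts decay geometrically): the number of survivors at time $n$ behaves like
$$\text{counts}(n) \approx S\cdot\frac{c_n}{4\cdot 3^{n-1}} \sim A\left(\frac{\mu}{3}\right)^{n},$$
so the slope $b$ of the fitted line through $\log(\text{counts}(n))$ satisfies $b \approx \log(\mu/3)$, and hence $\mu \approx 3e^{b}$, which is exactly the `np.exp(fit[0])*3` printed by the code.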
Not so simple sampling
We could — perhaps — make things better by altering the way we choose the next step. At present we just say “don’t go immediately backwards”, but we could also say “don’t step on an occupied site”. Ie — rather than picking one of the 3 directions uniformly at random, we could pick uniformly from the directions that are vacant:
We’d keep iterating this until we get to the desired length — ie we “grow” the walk.
If we denote the number of choices at the end of a walk of length $k$ as $a(\varphi_k)$, then the probability of picking a particular edge is just $1/a(\varphi_k)$. So the probability of growing a particular walk $\varphi$ of length $n$ (given by its edges) is
$$\Pr(\varphi) = \prod_{k=0}^{n-1} \frac{1}{a(\varphi_k)}$$
where we have used $\varphi_k$ to denote the first $k$ steps of the walk $\varphi$.
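Here is a tiny sketch of this formula in action (my own illustration; the helper name growthProbability is not from the notes). It evaluates the product for two concrete walks of the same length:

```python
# Probability of growing a given walk under the "pick uniformly from
# vacant neighbours" rule: the product of 1/a(phi_k) over all proper
# prefixes phi_k of the walk.
def growthProbability(phi):
    occupied=set()
    prob=1.0
    for k in range(len(phi)-1):
        occupied.add( phi[k] )
        x,y = phi[k]
        # a(phi_k) = number of vacant neighbours of the current endpoint
        a = sum( 1 for p in [(x+1,y),(x-1,y),(x,y+1),(x,y-1)] if p not in occupied )
        prob *= 1.0/a
    return prob

# a tightly curled walk and a straight walk, both of length 4
print( growthProbability([(0,0),(1,0),(1,1),(0,1),(0,2)]) ) # 1/72  ~ 0.0139
print( growthProbability([(0,0),(1,0),(2,0),(3,0),(4,0)]) ) # 1/108 ~ 0.0093
```

The curled walk comes out with probability $1/72$ while the straight walk of the same length gets $1/108$, so the growth rule is biased towards tighter configurations.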
A quick exploration of this formula (as in the little example above) shows that the probability of sampling a walk of length $n$ is no longer uniform. It depends heavily on how tight the walk was as it was grown (which makes the factors $a(\varphi_k)$ smaller or larger). Further… we still get “trapping” events where we cannot grow the walk any further.
The persistence of such trapping events was proved by Madras (1988), and then their density estimated by Owczarek & Prellberg (2008) as being about 1%.
Since we know the probability with which we sample a walk, we can “undo” the bias that causes by reweighting each sample by the inverse of that probability. Hence we weight a sample $\varphi$ of length $n$ by
$$W(\varphi) = \prod_{k=0}^{n-1} a(\varphi_k) = \frac{1}{\Pr(\varphi)}.$$
So using this we can still do accurate (weighted) statistics. Curiously, the average weight is just
$$\langle W \rangle = \sum_{\varphi} \Pr(\varphi)\,W(\varphi) = \sum_{\varphi} 1 = c_n,$$
where the sums run over all SAWs of length $n$ (every SAW of length $n$ can be grown this way, since each of its prefixes is itself a SAW). This is the basis of Rosenbluth sampling — due first to Hammersley & Morton and then Rosenbluth & Rosenbluth in the mid 1950s. It is easy to code up something to explore this. Let’s do that now.
```python
from random import choice
import matplotlib.pyplot as plt

S=10000
N=71

counts=[0 for i in range(0,N+1)]
sumWeight=[0 for i in range(0,N+1)]

# exact SAW counts c_0 .. c_71
correct=[1, 4, 12, 36, 100, 284, 780, 2172, 5916, 16268,
  44100, 120292, 324932, 881500, 2374444, 6416596, 17245332,
  46466676, 124658732, 335116620, 897697164, 2408806028,
  6444560484, 17266613812, 46146397316, 123481354908,
  329712786220, 881317491628, 2351378582244, 6279396229332,
  16741957935348, 44673816630956, 119034997913020, 317406598267076,
  845279074648708, 2252534077759844, 5995740499124412,
  15968852281708724, 42486750758210044, 113101676587853932,
  300798249248474268, 800381032599158340, 2127870238872271828,
  5659667057165209612, 15041631638016155884, 39992704986620915140,
  106255762193816523332, 282417882500511560972, 750139547395987948108,
  1993185460468062845836, 5292794668724837206644, 14059415980606050644844,
  37325046962536847970116, 99121668912462180162908, 263090298246050489804708,
  698501700277581954674604, 1853589151789474253830500, 4920146075313000860596140,
  13053884641516572778155044, 34642792634590824499672196, 91895836025056214634047716,
  243828023293849420839513468, 646684752476890688940276172, 1715538780705298093042635884,
  4549252727304405545665901684, 12066271136346725726547810652, 31992427160420423715150496804,
  84841788997462209800131419244, 224916973773967421352838735684, 596373847126147985434982575724,
  1580784678250571882017480243636, 4190893020903935054619120005916]

def Rosenbluth():
    x,y=0,0
    occupied=set()
    weight=1.0
    for n in range(0,N+1):
        occupied.add( (x,y) )
        counts[n]+=1
        sumWeight[n]+=weight
        atm=[] # list of available sites
        if( (x+1,y) not in occupied ): atm.append((x+1,y))
        if( (x-1,y) not in occupied ): atm.append((x-1,y))
        if( (x,y+1) not in occupied ): atm.append((x,y+1))
        if( (x,y-1) not in occupied ): atm.append((x,y-1))
        weight *= len(atm)
        if(len(atm)>0):
            x,y=choice(atm)
        else:
            break

for s in range(S):
    Rosenbluth()

normWeight = list(map( lambda x:x/S, sumWeight ))
plt.plot(range(N+1), normWeight, "bo", range(N+1), correct, 'r-')
plt.yscale('log')
plt.show()
```
You can still see the attrition caused by “trapping”
But you can also see that our estimates of the mean weight line up very nicely with the exact enumeration data (from Iwan Jensen):
This is our first example of “approximate” enumeration.
Unfortunately the “trapping” becomes a very large issue — you can see that we are already having difficulty sampling walks of length around 70, let alone anything much longer. Further, though it is less obvious, the weights of the walks have massive variance — our statistics can look completely fine and then be hit by just one walk of massive weight that throws everything off.
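To see this variance problem directly, here is a small sketch (my own addition; the routine below is a trimmed variant of the Rosenbluth code above) that also accumulates squared weights, so we can watch the relative spread of the weights grow with length:

```python
from random import choice

N=50     # walks of max length N
S=10000  # number of Rosenbluth samples

sumW =[0.0 for i in range(0,N+1)] # sum of weights at each length
sumW2=[0.0 for i in range(0,N+1)] # sum of squared weights at each length

def rosenbluth():
    x,y=0,0
    occupied=set()
    weight=1.0
    for n in range(0,N+1):
        occupied.add( (x,y) )
        sumW[n]+=weight
        sumW2[n]+=weight*weight
        atm=[ p for p in [(x+1,y),(x-1,y),(x,y+1),(x,y-1)] if p not in occupied ]
        weight *= len(atm)
        if( len(atm)>0 ):
            x,y=choice(atm)
        else:
            break

for s in range(S):
    rosenbluth()

# print the relative standard deviation of the weights at a few lengths
for n in [10,20,30,40,50]:
    mean = sumW[n]/S
    var  = sumW2[n]/S - mean*mean
    print( n, mean, (var**0.5)/mean )
```

The relative spread grows steadily with length, which is why a single huge-weight walk can come to dominate the averages.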
The methods of “pruning” and “enrichment” can help with these problems — see the PERM algorithm of Grassberger (1997) (and its various refinements — eg Bachmann & Janke (2003), or Prellberg & Krawczyk (2004)) — but we’ll work towards another method which is (arguably) easier to use and more general.
So for the moment we’ll back away from approximate enumeration and go back to thinking about sampling.