Simple sampling
Rather than talking about sampling very generally, I’ll focus on self-avoiding walks (because they are relatively simple to play with) and some similar objects like random walks. Since we cannot compute some of these quantities (and their associated statistics etc) exactly in a reasonable amount of computer time, we can trade precision for speed and try to compute them approximately. My own introduction to that idea came through something called the Rosenbluth method, which I’ll touch on, but to get to that, and to better approaches, we should start with some very simple sampling methods. Let’s start with a 1d random walk. Part of the reason to introduce this is also to introduce some simple bits of python. Some of you might have plenty of python experience, and some of you almost none — just Patrick’s introduction to visualisation etc yesterday. So I’ll start with some very basic things and we’ll work up from there. My apologies if this is very basic for some of you.
A simple random walk in 1d
Let’s just build a random walk in 1d by starting at 0, flipping a coin at each step, and recording our movements.
```python
from random import randint

N=10
x=0
walk=[x]
for n in range(0,N):
    if(randint(1,2)==1): x+=1
    else: x-=1
    walk.append(x)
print(walk)
```
This gives output (say)
[0, -1, 0, 1, 0, 1, 0, 1, 2, 3, 4]
Which is just our 1d random walk at each time step.
A simple random walk in 2d
Let’s go up one dimension and look at a simple random walk on the plane:
```python
from random import randint

N=10
x,y=0,0
walk=[(x,y)]
for n in range(0,N):
    c=randint(0,3) #0=R, 1=L, 2=U, 3=D
    if(c==0): x+=1
    elif(c==1): x-=1
    elif(c==2): y+=1
    elif(c==3): y-=1
    walk.append( (x,y) )
print(walk)
```
This gives output (say)
[(0, 0), (0, 1), (1, 1), (0, 1), (1, 1), (0, 1), (0, 2), (-1, 2), (-1, 1), (-1, 0), (-1, 1)]
Which is a list of the coordinates of our 2d random walk at each time step.
This method of producing a sample takes $O(N)$ time — ie it is linear in the length of the walk.
Notice that we can clean up our “next vertex in direction c” code by storing the move for each direction in a list. We’ll do that in the next code blurb.
A simple random walk in 2d — now with pictures!
Let’s pull in matplotlib and use it to plot things. Notice this is slightly complicated by the fact that we are producing a list of tuples $[(x_1,y_1), (x_2,y_2),\dots]$, but the plot command wants the data arranged as $[x_1,x_2,\dots], [y_1,y_2,\dots]$. Thankfully there is a python command that does precisely that — zip.
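As a tiny standalone illustration (my own, separate from the walk code below), here is zip(*...) transposing a short list of coordinate pairs:

```python
# zip(*walk) unpacks the list of tuples and zips the first coordinates
# together and the second coordinates together, ie it transposes the data.
walk = [(0,0), (1,0), (1,1), (1,2)]
xs, ys = zip(*walk)
print(xs) # (0, 1, 1, 1)
print(ys) # (0, 0, 1, 2)
```

Now back to the walk: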
```python
from random import randint
import matplotlib.pyplot as plt

#0=R, 1=L, 2=U, 3=D
#encode delta x,y for each direction
xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]

N=10
x,y=0,0
walk=[(x,y)]
for n in range(0,N):
    c=randint(0,3)
    x += xMoves[c]
    y += yMoves[c]
    walk.append( (x,y) )
print(walk)

#Plot function wants [x-coords],[y-coords] - use zip to do this
xs,ys = zip( *walk ) #split our list of coords
# just add a little buffer around the edges of our plot
plt.axis([ min(xs)-1, max(xs)+1, min(ys)-1, max(ys)+1])
# now plot the data with blue blobs and a line.
plt.plot(xs,ys,'bo-')
plt.show()
```
This returns to us a figure like this
Notice that
- we can clearly see that our random walk self-intersects,
- there is a 1/4 chance that our next step will cause an immediate problem (a self-intersection), and
- we could improve this by requiring that our next step does not reverse the last one — a random walk with no immediate backtracking.
A simple random walk in 2d — no immediate backtracking
To ensure no backtracking, we record the step we are not allowed to take next. So we create another list which encodes “if you stepped in direction $c$, then your next step cannot be in the reverse direction wayBack[c]”.
```python
from random import randint
import matplotlib.pyplot as plt

N=20
x,y=0,0
walk=[(x,y)]

#0=R, 1=L, 2=U, 3=D
xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2] #reverse direction
dontGo=-1 #record step direction.

for n in range(0,N):
    c=randint(0,3)
    while(c==dontGo):
        c=randint(0,3) #keep picking until we get a different direction
    x+=xMoves[c]
    y+=yMoves[c]
    dontGo=wayBack[c]
    walk.append( (x,y) )

## Now build the plot
xs,ys = zip( *walk )
plt.axis([ min(xs)-1, max(xs)+1, min(ys)-1, max(ys)+1])
plt.plot(xs,ys,'bo-')
plt.show()
```
This returns to us a figure like this
It is better than before, but there are still lots of intersections.
A simple random walk in 2d — how many self-intersections
We can make small changes to our code in order to keep track of how many self-intersections our random walk is creating:
```python
from random import randint
from collections import defaultdict
# defaultdict is a useful map container which handles missing keys easily

occupancy=defaultdict(int) # track multiply occupied sites; "int" means missing keys default to 0

N=20
x,y=0,0
walk=[(x,y)]
occupancy[ (x,y) ]=1 # the first vertex occupied

xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2] #reverse direction
dontGo=-1 #record step direction.

for n in range(0,N):
    c=randint(0,3)
    while(c==dontGo):
        c=randint(0,3)
    x+=xMoves[c]
    y+=yMoves[c]
    dontGo=wayBack[c]
    walk.append( (x,y) )
    if( (x,y) in occupancy ):
        print( "ooo! an intersection at ", (x,y) )
    occupancy[(x,y)]+=1

print(walk)
print("Multiply occupied sites are:")
for p in occupancy:
    if( occupancy[p]>1 ):
        print(p, occupancy[p])
```
It returns to us something like
```
ooo! an intersection at  (-1, 0)
ooo! an intersection at  (0, 0)
[(0, 0), (0, -1), (-1, -1), (-1, 0), (-1, 1), (-2, 1), (-2, 0), (-1, 0), (0, 0), (1, 0), (1, 1), (2, 1), (3, 1), (4, 1), (4, 0), (3, 0), (3, -1), (4, -1), (4, -2), (4, -3), (4, -4)]
Multiply occupied sites are:
(0, 0) 2
(-1, 0) 2
```
You can see that we get plenty of intersections — indeed, as the walk gets longer, the chance of a self-intersection gets higher.
A simple random walk in 2d — how many SAWs?
“Higher” is not terribly precise, so let’s quantify it by looking at the survival time. We’ll run our simple non-backtracking random walk and stop as soon as we get a self-intersection (ie as soon as the walk ceases to be a SAW). Run this many times and record how many walks survive to time $n$. It turns out that this number decays exponentially (we’ll even prove this), so we’ll present our data using a semi-log plot.
```python
from random import randint
import matplotlib.pyplot as plt

N=100   #walks of max length N
S=10000 #We'll run it S times

# the number of walks surviving to time n
counts=[0 for i in range(0,N+1)]

xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2]

def survivalTime():
    x,y=0,0
    dontGo=-1
    occupied=set() #clean each time
    for n in range(0,N):
        occupied.add( (x,y) )
        counts[n]+=1
        c=randint(0,3)
        while(c==dontGo):
            c=randint(0,3)
        x+=xMoves[c]
        y+=yMoves[c]
        dontGo=wayBack[c]
        if( (x,y) in occupied ):
            break #stop as soon as self-intersection
    return

for s in range(0,S):
    survivalTime()

plt.bar(range(N+1), counts, color="blue")
plt.yscale('log')
plt.show()
```
This returns to us a figure like this
from which the exponential decay of the survival probability is very obvious.
Proving it
We can actually prove this exponential decay pretty easily. The number of non-backtracking random walks of length $n$ is $4\cdot 3^{n-1}$, and since every such walk is equally likely, the probability of surviving to length $n$ is exactly $c_n/(4\cdot 3^{n-1})$, where $c_n$ is the number of SAWs of length $n$. From the submultiplicativity of SAWs we know that
$$c_{n+m} \le c_n \cdot c_m.$$
A little Feketing (Feketelemmarising?) then gives us
$$\mu = \lim_{n\to\infty} c_n^{1/n} = \inf_{n\ge 1} c_n^{1/n}.$$
So provided we can show that $c_n^{1/n} < 3$ for some $n$, then we know that the limit
$$\lim_{n\to\infty} \frac{c_n}{4\cdot 3^{n-1}} = 0$$
and that it goes to zero exponentially quickly. To prove that $c_n^{1/n} < 3$ we can just count SAWs up to some short length. The easiest way is to do a simple recursive backtracking walk. Note — this is definitely not the fastest way to do this, but it is a very quick and easy way to get some numbers.
```python
N=10
lattice=set()
count=[0 for i in range(0,N+1)]

#depth first search through possible SAWs
def step(l,x,y):
    count[l]+=1
    if(l<N):
        lattice.add( (x,y) ) # occupy site
        # recursively step in each legal direction
        if( (x+1,y) not in lattice ): step(l+1,x+1,y)
        if( (x-1,y) not in lattice ): step(l+1,x-1,y)
        if( (x,y+1) not in lattice ): step(l+1,x,y+1)
        if( (x,y-1) not in lattice ): step(l+1,x,y-1)
        lattice.remove( (x,y) ) #clean up

step(0,0,0) #start with length 0 at (0,0)
for l in range(1,N+1):
    print(l, count[l], count[l]**(1.0/l) )
```
This returns to us a list of counts $c_l$ and also $c_l^{1/l}$:
```
1 4 4.0
2 12 3.4641016151377544
3 36 3.3019272488946263
4 100 3.1622776601683795
5 284 3.0950214840037007
6 780 3.034001331989801
7 2172 2.997051875398709
8 5916 2.961443972633954
9 16268 2.937149267706373
10 44100 2.913693458576192
```
So — since $c_7^{1/7} \approx 2.997 < 3$ (and getting smaller thereafter), we know that the probability of our simple non-backtracking random walk surviving to produce a SAW of length $n$ decays to zero exponentially quickly. This proves that simple sampling of these walks is never going to be much use.
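To spell out the arithmetic behind that conclusion (a sketch, using $c_7 = 2172$ from the table above and, for simplicity, lengths that are multiples of 7): submultiplicativity gives $c_{7k} \le c_7^{\,k}$, so for $n = 7k$
$$\Pr(\text{survive to length } n) = \frac{c_n}{4\cdot 3^{n-1}} \le \frac{3}{4}\left(\frac{c_7^{1/7}}{3}\right)^{n} \approx \frac{3}{4}\,(0.999)^{n}.$$
This particular bound decays very slowly because $c_7^{1/7}$ is only just below 3; the true decay rate, $\mu/3 \approx 0.88$ per step, is much faster.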
A quick aside to estimate the growth rate
As an aside — we can also use our quick and dirty survival time histogram data to estimate the rate of decay. Just a semi-log plot and some regression.
```python
from random import randint
import matplotlib.pyplot as plt
import numpy as np #lots of handy numerical stuff

N=100   #walks of max length N
S=10000 #We'll run it S times

counts=[0 for i in range(0,N+1)]

xMoves=[1,-1,0,0]
yMoves=[0,0,1,-1]
wayBack=[1,0,3,2]

def survivalTime():
    x,y=0,0
    dontGo=-1
    occupied=set() #clean each time
    for n in range(0,N):
        occupied.add( (x,y) )
        counts[n]+=1
        c=randint(0,3)
        while(c==dontGo):
            c=randint(0,3)
        x+=xMoves[c]
        y+=yMoves[c]
        dontGo=wayBack[c]
        if( (x,y) in occupied ):
            break #stop as soon as self-intersection
    return

for s in range(0,S):
    survivalTime()

#to do simple linear regression we need lists of x_i, y_i
x=[]; y=[]
for l in range(10,55):
    if( counts[l]>0 ): #log of 0 is bad
        x.append(l)
        y.append( np.log(counts[l]) )

fit = np.polyfit(x,y,1)      #linear regression
fitFunction = np.poly1d(fit) #the regression line
print( fit, np.exp(fit[0])*3 ) #fit coeffs and mu est.
plt.plot(x,y,'bo', x,fitFunction(x),'r-')
plt.show()
```
This gives us an estimate of the decay rate of about $0.88$ per step, which (after multiplying by 3) gives a rough estimate of the exponential growth rate of the number of SAWs as $\mu \approx 2.64$. Not too far off the actual $\mu = 2.63815\dots$.
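To spell out why multiplying by 3 recovers $\mu$ (a sketch, assuming the survivor counts decay geometrically): the number of survivors at time $n$ behaves like
$$\text{counts}(n) \approx S\cdot\frac{c_n}{4\cdot 3^{n-1}} \sim A\left(\frac{\mu}{3}\right)^{n},$$
so the slope $b$ of the fitted line through $\log(\text{counts}(n))$ satisfies $b \approx \log(\mu/3)$, and hence $\mu \approx 3e^{b}$, which is exactly the `np.exp(fit[0])*3` printed by the code.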
Not so simple sampling
We could — perhaps — make things better by altering the way we choose the next step. At present we just say “don’t go immediately backwards”, but we could also say “don’t step on an occupied site”. Ie — rather than picking one of the 3 directions uniformly at random, we could pick uniformly from the directions that are vacant:
We’d keep iterating this until we get to the desired length — ie we “grow” the walk.
If we denote the number of choices at the end of a walk of length $k$ as $a(\varphi_k)$, then the probability of picking a particular edge is just $1/a(\varphi_k)$. So the probability of growing a particular walk $\varphi$ of length $n$ (given by its edges) is
$$\Pr(\varphi) = \prod_{k=0}^{n-1} \frac{1}{a(\varphi_k)}$$
where we have used $\varphi_k$ to denote the first $k$ steps of the walk $\varphi$.
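Here is a tiny sketch of this formula in action (my own illustration; the helper name growthProbability is not from the notes). It evaluates the product for two concrete walks of the same length:

```python
# Probability of growing a given walk under the "pick uniformly from
# vacant neighbours" rule: the product of 1/a(phi_k) over all proper
# prefixes phi_k of the walk.
def growthProbability(phi):
    occupied=set()
    prob=1.0
    for k in range(len(phi)-1):
        occupied.add( phi[k] )
        x,y = phi[k]
        # a(phi_k) = number of vacant neighbours of the current endpoint
        a = sum( 1 for p in [(x+1,y),(x-1,y),(x,y+1),(x,y-1)] if p not in occupied )
        prob *= 1.0/a
    return prob

# a tightly curled walk and a straight walk, both of length 4
print( growthProbability([(0,0),(1,0),(1,1),(0,1),(0,2)]) ) # 1/72  ~ 0.0139
print( growthProbability([(0,0),(1,0),(2,0),(3,0),(4,0)]) ) # 1/108 ~ 0.0093
```

The curled walk comes out with probability $1/72$ while the straight walk of the same length gets $1/108$, so the growth rule is biased towards tighter configurations.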
A quick exploration of this formula (as in the little example above) shows that the probability of sampling a walk of length $n$ is no longer uniform. It depends heavily on how tight the walk was as it was grown (which makes the factors $a(\varphi_k)$ smaller or larger). Further… we still get “trapping” events where we cannot grow the walk any further.
The persistence of such trapping events was proved by Madras (1988), and then their density estimated by Owczarek & Prellberg (2008) as being about 1%.
Since we know the probability with which we sample a walk, we can “undo” the bias that causes by reweighting each sample by the inverse of that probability. Hence we weight a sample $\varphi$ of length $n$ by
$$W(\varphi) = \prod_{k=0}^{n-1} a(\varphi_k) = \frac{1}{\Pr(\varphi)}.$$
So using this we can still do accurate (weighted) statistics. Curiously, the average weight is just
$$\langle W \rangle = \sum_{\varphi} \Pr(\varphi)\,W(\varphi) = \sum_{\varphi} 1 = c_n,$$
where the sums run over all SAWs of length $n$ (every SAW of length $n$ can be grown this way, since each of its prefixes is itself a SAW). This is the basis of Rosenbluth sampling — due first to Hammersley & Morton and then Rosenbluth & Rosenbluth in the mid 1950s. It is easy to code up something to explore this. Let’s do that now.
```python
from random import choice
import matplotlib.pyplot as plt

S=10000
N=71

counts=[0 for i in range(0,N+1)]
sumWeight=[0 for i in range(0,N+1)]

# exact SAW counts c_0 .. c_71
correct=[1, 4, 12, 36, 100, 284, 780, 2172, 5916, 16268,
  44100, 120292, 324932, 881500, 2374444, 6416596, 17245332,
  46466676, 124658732, 335116620, 897697164, 2408806028,
  6444560484, 17266613812, 46146397316, 123481354908,
  329712786220, 881317491628, 2351378582244, 6279396229332,
  16741957935348, 44673816630956, 119034997913020, 317406598267076,
  845279074648708, 2252534077759844, 5995740499124412,
  15968852281708724, 42486750758210044, 113101676587853932,
  300798249248474268, 800381032599158340, 2127870238872271828,
  5659667057165209612, 15041631638016155884, 39992704986620915140,
  106255762193816523332, 282417882500511560972, 750139547395987948108,
  1993185460468062845836, 5292794668724837206644, 14059415980606050644844,
  37325046962536847970116, 99121668912462180162908, 263090298246050489804708,
  698501700277581954674604, 1853589151789474253830500, 4920146075313000860596140,
  13053884641516572778155044, 34642792634590824499672196, 91895836025056214634047716,
  243828023293849420839513468, 646684752476890688940276172, 1715538780705298093042635884,
  4549252727304405545665901684, 12066271136346725726547810652, 31992427160420423715150496804,
  84841788997462209800131419244, 224916973773967421352838735684, 596373847126147985434982575724,
  1580784678250571882017480243636, 4190893020903935054619120005916]

def Rosenbluth():
    x,y=0,0
    occupied=set()
    weight=1.0
    for n in range(0,N+1):
        occupied.add( (x,y) )
        counts[n]+=1
        sumWeight[n]+=weight
        atm=[] # list of available sites
        if( (x+1,y) not in occupied ): atm.append((x+1,y))
        if( (x-1,y) not in occupied ): atm.append((x-1,y))
        if( (x,y+1) not in occupied ): atm.append((x,y+1))
        if( (x,y-1) not in occupied ): atm.append((x,y-1))
        weight *= len(atm)
        if(len(atm)>0):
            x,y=choice(atm)
        else:
            break

for s in range(S):
    Rosenbluth()

normWeight = list(map( lambda x:x/S, sumWeight ))
plt.plot(range(N+1), normWeight, "bo", range(N+1), correct, 'r-')
plt.yscale('log')
plt.show()
```
You can still see the attrition caused by “trapping”
But you can also see that our estimates of the mean weight line up very nicely with the exact enumeration data (from Iwan Jensen):
This is our first example of “approximate” enumeration.
Unfortunately the “trapping” becomes a very large issue — you can see that we are already having difficulty sampling walks of length around 70, let alone anything much longer. Further, though it is less obvious, the weights of the walks have massive variance — our statistics can look completely fine and then be hit by just one walk of massive weight that throws everything off.
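To see this variance problem directly, here is a small sketch (my own addition; the routine below is a trimmed variant of the Rosenbluth code above) that also accumulates squared weights, so we can watch the relative spread of the weights grow with length:

```python
from random import choice

N=50     # walks of max length N
S=10000  # number of Rosenbluth samples

sumW =[0.0 for i in range(0,N+1)] # sum of weights at each length
sumW2=[0.0 for i in range(0,N+1)] # sum of squared weights at each length

def rosenbluth():
    x,y=0,0
    occupied=set()
    weight=1.0
    for n in range(0,N+1):
        occupied.add( (x,y) )
        sumW[n]+=weight
        sumW2[n]+=weight*weight
        atm=[ p for p in [(x+1,y),(x-1,y),(x,y+1),(x,y-1)] if p not in occupied ]
        weight *= len(atm)
        if( len(atm)>0 ):
            x,y=choice(atm)
        else:
            break

for s in range(S):
    rosenbluth()

# print the relative standard deviation of the weights at a few lengths
for n in [10,20,30,40,50]:
    mean = sumW[n]/S
    var  = sumW2[n]/S - mean*mean
    print( n, mean, (var**0.5)/mean )
```

The relative spread grows steadily with length, which is why a single huge-weight walk can come to dominate the averages.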
The methods of “pruning” and “enrichment” can help with these problems — see the PERM algorithm of Grassberger (1997) (and its various refinements — eg Bachmann & Janke (2003), or Prellberg & Krawczyk (2004)) — but we’ll work towards another method which is (arguably) easier to use and more general.
So for the moment we’ll back away from approximate enumeration and go back to thinking about sampling.