Counting with iterators

Ever wanted to loop over a million elements in R, but felt bad that

for (i in 1:1e6) do.stuff(i)

allocated a vector of a million indices (several megabytes) that you didn't actually need to store? That's where iterators come in.

Iterators are new to R (REvolution Computing just released the iterators package to CRAN last month), but will be familiar to programmers of languages like Java or Python. You can think of an iterator as something like a cursor or pointer to a predefined sequence of elements. Each time you access the iterator, it returns the current element being pointed to, and advances to the next one.

This is probably easier to explain with an example. We can create an iterator for a sequence of integers 1 to 5 with the icount function:

> require(iterators)

Loading required package: iterators

> i <- icount(5)

The function nextElem returns the current value of the iterator, and advances it to the next. Iterators created with icount always start at 1:

> nextElem(i)

[1] 1

> nextElem(i)

[1] 2

> nextElem(i)

[1] 3

When an iterator runs out of values to return, it signals an error:

> nextElem(i)

[1] 4

> nextElem(i)

[1] 5

> nextElem(i)

Error: StopIteration
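If you drive an iterator by hand rather than through foreach, you need to catch that error yourself. Here is a minimal sketch, assuming the error message is exactly "StopIteration" as shown above; the names it and total are made up for illustration:

```r
library(iterators)

it <- icount(3)   # iterator over 1, 2, 3
total <- 0

repeat {
  # Fetch the next value; turn the end-of-sequence error into NULL
  # and re-raise anything else.
  value <- tryCatch(
    nextElem(it),
    error = function(e) {
      if (identical(conditionMessage(e), "StopIteration")) NULL else stop(e)
    }
  )
  if (is.null(value)) break
  total <- total + value
}

total   # 1 + 2 + 3 = 6
```

Using NULL as the "finished" marker only works because this iterator never returns NULL as a real value; a different sentinel would be needed otherwise.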

So, if we wanted to make a loop of a million iterations, all we need to do is make an iterator and then loop using the foreach function (from the foreach package):

> require(foreach)

Loading required package: foreach

> m <- icount(1e6)

> foreach (i = m) %do% { do.stuff(i) }

One nice thing about this construction is that m is a very small object: you don't need to waste a bunch of RAM on index values you only need one at a time. The other nice thing is that by replacing %do% with %dopar% (and registering a parallel backend) you can run multiple iterations in parallel. Because every iteration draws its value from the single iterator m, i takes each value between one and a million exactly once, even if the iterations don't complete in sequence.
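Since do.stuff is just a placeholder, here is a small runnable sketch of the same pattern on five elements. The squaring body and the variable names are invented for illustration, and registerDoSEQ() (the sequential backend that ships with foreach) stands in for a real parallel backend so that the %dopar% line runs without a separate package:

```r
library(iterators)
library(foreach)

# Sequential: collect the results into a vector instead of the default list.
squares <- foreach(i = icount(5), .combine = c) %do% {
  i^2
}

# Same loop with %dopar%. The sequential backend is registered here only as
# a stand-in; a real parallel backend would be registered in its place.
registerDoSEQ()
squares_par <- foreach(i = icount(5), .combine = c) %dopar% {
  i^2
}

squares       # 1 4 9 16 25
squares_par   # same values, in the same order
```

Note that foreach returns the results of every iteration, so for a million-iteration loop it is worth thinking about what the loop body returns.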

An iterator isn't constrained to simply return integers, either. You can set up an iterator on a matrix, so that each call to nextElem returns the next row (or column) as a vector. Or, you can set up an iterator on a MySQL or Oracle database, so that each call to nextElem returns the next record in the table. Iterators can even return infinite, irregular sequences: the sequence of all primes, for example. You can see examples of all these kinds of iterators in my recent UseR! talk.
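To make two of these concrete, here is a rough sketch (not the code from the talk). The first part uses iter with by = "row" on a small matrix; as.vector is applied so the result is a plain vector whatever shape the iterator hands back. The second part is a hand-rolled prime iterator: iprime and is_prime are hypothetical names, and the class vector follows the custom-iterator convention from the iterators package documentation.

```r
library(iterators)

# Row-by-row iteration over a matrix.
m <- matrix(1:6, nrow = 2)
rows <- iter(m, by = "row")
first_row  <- as.vector(nextElem(rows))   # 1 3 5
second_row <- as.vector(nextElem(rows))   # 2 4 6

# A hypothetical infinite iterator over the primes (trial division).
iprime <- function() {
  last <- 1L
  is_prime <- function(n) {
    if (n < 4L) return(n > 1L)           # 2 and 3 are prime
    if (n %% 2L == 0L) return(FALSE)
    d <- 3L
    while (d * d <= n) {
      if (n %% d == 0L) return(FALSE)
      d <- d + 2L
    }
    TRUE
  }
  next_el <- function() {
    repeat {
      last <<- last + 1L
      if (is_prime(last)) return(last)
    }
  }
  # nextElem() dispatches on "abstractiter" and calls the nextElem field.
  structure(list(nextElem = next_el),
            class = c("iprime", "abstractiter", "iter"))
}

primes <- iprime()
first_primes <- vapply(1:5, function(k) nextElem(primes), integer(1))
first_primes   # 2 3 5 7 11
```

The prime iterator never signals StopIteration, so it should only be consumed by code that decides for itself when to stop.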
