Note: This is not relevant for Linq to Sql or Entity framework, where Skip and Take are transalted into their SQL equivalents.
This week I had to write some code where some 500.000 objects in a collection had to chopped up in pieces of 100 objects. My first thought was to use Skip and Take in a construction like this:
int count = collection.Count();
for(int i=0; i<count; i+=100)
var x = collection.Skip(i).Take(i);
//do stuff with x
But it turns out this does not perform at all. I wanted to understand why, and I also wanted to understand how bad the problem is.
First I made a micro benchmark. Normally I don’t micro benchbark very much, as I often find the small differences you find doesn’t make a difference in the big picture. But then again: sometimes it does. In my case the collection is a List<T> so an alternative is to use GetRange(index, Count). I decided to benchmark this for collection sizes from 1000 item to 1000000 items. The Y-axis below shows number of milliseconds to iterate through the collection. The X-axis is number of items in the collection. “do stuff with x” in my test is a simple x.ToList() in order to simulate that all items in x is enumerated. The lower graph shows time elapsed when GetRange is used.
Beside a few bad measurement points we se that the GetRange graph is constant at 1-3 milliseconds, while the time to itereate through the complete collection rises with something like n2.
A collection size of 500.000 elements takes about 13 seconds to iterate throgh. Double the size and it takes more than 45 seconds.
I then did a second experiment where I kept the collection size constant, but changed how many items to Take() at a time. Note that the X-axis is logarithmic in this case.
It is clear that the more we Take() at a time, the fewer time we have to call Skip(), and the faster it goes.
Why is Skip() so slow and why is GetRange() for List<T> so fast?
Well looking at the MS reference code shows that everytime you invoke Skip() it will have to iterate you collection from the beginning in order to skip the number of elements you desire, which gives a loop within a loop (n2 behaviour).
GetRange() on the other hand can use the knowledge that a list is used, and do some highly optimized copying of the desired elements to new collection. It doesn’t have to iterate to get to the first item to copy since the collection is in one continuus piece of memory, so a simple calculation can find the memory location to start copying from.
Conclusion: For large collections, don’t use Skip and Take. Find another way to iterate through your collection and divide it.