Thursday 28 December 2023

A Tale of Two Contains Methods

Image generated by DALL-E 3 from a prompt by Bob Buzzard 


Eagle-eyed regular readers of this blog may have noticed that I dropped away during the build up to Christmas this year, and there was a good reason for this. I was taking part in Advent of Code, having been introduced to it by our Credera brethren. A challenge a day for 25 days - two a day in fact, as if/when you solve the first, it gets tweaked to make it harder. 

I opted to use Apex to take on the challenges, which ensured that I couldn't brute force any solutions, and that I typically had to switch to a batch/asynchronous approach when I finally ran out of CPU or heap. This didn't always help by the way - a few times I had to give up after encountering variants of "the batch class is too large" error. A few other times I had to accept solving one challenge of the two when I couldn't figure out how to even start on the second!  I ended up with 39/50 successes though, which didn't seem like a bad return for a constrained language. It was extremely enjoyable, but be aware that this can easily soak up all of your spare time and more, and may not make you popular at home!

A common theme of the challenge was walking a route through a grid and keeping track of the tiles that I'd encountered before, usually with some additional information around which path I was following, the direction I'd moved in and how many steps I'd taken in a direction. If it was net new then I needed to add it to a collection, but I also needed to order the collection, as I was looking for the shortest or longest route possible, so they needed to end up in a List.

A Tale of Two Contains Methods

It was the best of methods, it was the worst of methods

One difference between Apex and some other languages I've worked with in the past is the contains method on the List class - this handy helper returns true if the list contains the element passed in to it. This saves me from either iterating the list each time I consider adding an element, or maintaining a separate "lookup" collection - typically a Set that matches the list and I'd check if the element was in there first.

I used the List contains method in my first attempt on one of the challenges, and found that I had to quickly go to batch apex. In order to walk the path I was carrying out a breadth-first search, adding every possible option for each step to a queue as a complex object, but always processing the shortest option first. Once the queue got to around 3,000 elements (complex objects), I found that I could only process a few of them before breaching the 60,000 millisecond CPU limit, and I was looking at an extremely large set of batches that would likely take multiple hours to complete.  After a bit of digging it looked like the check/add to the queue of steps wasn't scaling overly well, so I switched back to maintaining a separate lookup Set and using the Set contains method to determine if I'd seen it before. Once I did this, the CPU use dropped to the point where I could complete the whole thing in 2-3 batches, which I did.

I was somewhat taken aback by this, as I'd assumed that the List contains method would be using an efficient mechanism under the hood and would perform well regardless of the size of the list/complexity of the object. This turns out not to be the case, but that's really my fault for assuming - there's nothing in the docs to suggest that it will be doing anything of the sort.

Now that Advent of Code has completed, I've had the time to run some numbers on the CPU consumption of each of these contains methods (hence the witty title, with apologies to Charles Dickens of course), and present the results.

The Methodology

I have defined a (not particularly) complex object so that there's a bit of work involved to determine if another object matches it:

public class ContainsObject 
    public String stringRep;
    public Integer intRep;
    public Long square;
    public DateTime timestamp;
    public ContainsObject(Integer idx)

I then add two thousand of these to a List, checking each one to see if I've seen it before. The CPU consumed is captured for every 100 elements and gives the following results:

   Count            CPU

   0 - 100           76
 400 - 500           97
 900 -1000          169
1400 - 1500         460
1900 - 2000         582

from this I can deduce that there isn't anything particularly efficient going on under the hood - as the size of the List increases, so does the time taken to check and add 100 elements. In the 1900-2000 range the average is over 5ms per check/insert, which is quite a chunk for a couple of statements.

Switching to the List and lookup Set approach, I create a Set of the complex objects to mirror the contents of the List, but without any ordering, that I can use for the check part. If the element isn't present in the Set, I add it to both the Set and the List.

Executing this for the same number of complex objects gives:

   Count            CPU

   0 - 100           4
 400 - 500           6
 900 -1000           5
1400 - 1500          4
1900 - 2000          7

This is much more the kind of result I want to see - the performance isn't really changing regardless of the size of the Set, and while the final hundred takes slightly longer than the first hundred, the average is 0.07ms, which leaves me plenty of CPU to play with in the rest of the transaction.

No Downside?

As always, there's no such thing as a free lunch - the fact that I have to maintain another collection for speedy lookup does incur a heap penalty. It is a pretty cheap lunch though, as I'm only holding references to the objects stored in the List rather than copies, so the 2,000 entries in the Set consume another 8kb of heap. This feels like a pretty decent trade off to me, but if your transaction has plenty of spare CPU and is butting up against the heap limit, you'll likely feel different.

Related Posts