Joys of cancelling a TBB task group · Aras' website
Blender issue #152467 (“File Browser thumbnail cache broken with large amount of images”) reminded me to write this up. That particular issue comes down to a (documented) surprise: when you have a parallel_for in TBB, some of the loop iterations might not execute at all if the task group gets cancelled.
Similar to C++ exceptions, the effect is “global” - something might throw an exception, and now a completely unrelated part of your code needs to be aware of that possibility, even if you don’t want to. Same with task group cancellation - you might write a parallel_for, and assume that all the loop iterations will execute. That’s what all the code within Blender does, I think :) But! Because some caller of your code way up above might do a task_group.cancel(), now your code needs to be prepared to handle that possibility.
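To make that concrete, here is a minimal sketch of the defensive pattern. It does not use TBB at all: `CancelToken`, `cancellable_for` and `double_all` are hypothetical names, and the loop runs serially, so it only models the contract (iterations that have not started when the group is cancelled never run), not how TBB actually schedules or cancels work. The point is just that the loop reports whether it completed, and the caller checks instead of assuming.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cancellable task group. NOT the TBB API.
struct CancelToken {
    std::atomic<bool> cancelled{false};
    void cancel() { cancelled.store(true); }
    bool is_cancelled() const { return cancelled.load(); }
};

// Serial model of a cancellable loop: once the token is cancelled,
// iterations that have not started yet are skipped.
// Returns true only if every iteration actually ran.
template <typename Fn>
bool cancellable_for(std::size_t count, const CancelToken& token, Fn&& body) {
    for (std::size_t i = 0; i < count; ++i) {
        if (token.is_cancelled()) {
            return false; // remaining iterations never execute
        }
        body(i);
    }
    return true;
}

// Doubles every input element and reports whether the output is complete.
// `cancel_at` simulates some unrelated caller cancelling the group while
// iteration `cancel_at` is running (assumption made purely for illustration).
bool double_all(const std::vector<int>& in, std::vector<int>& out,
                CancelToken& token, std::size_t cancel_at) {
    out.assign(in.size(), 0);
    return cancellable_for(in.size(), token, [&](std::size_t i) {
        out[i] = in[i] * 2;
        if (i == cancel_at) {
            token.cancel();
        }
    });
}
```

With `cancel_at = 1` on a four-element input, `double_all` returns false and the last two outputs stay at zero. A caller that ignores the return value would happily go on to use a half-filled buffer, which is roughly the shape of the surprise described above.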
Anyway, all that reminded me of another Blender bug that I was involved with some months ago, which is more curious.
The bug
The reported bug was #143662:
Blender crashes, while you have a file browser dialog open with “sufficient amount” of thumbnails,
But only if some thumbnails were freshly generated (i.e. have not been cached previously),
And only if you had rendered anything with the path-tracing Cycles renderer,
And only if you have “Persistent Data” Cycles option on,
The crash does not happen with Address Sanitizer being on.
So that’s… fun.
Part of the possible cause was my change that added more multi-threading to parts of the image processing code, some of which gets executed during thumbnail generation. Which means I had to investigate.
What is wrong with this code?
Suppose you have a path-traced renderer (like Blender’s Cycles) that as part of scene initialization does something like this:
parallel_for_each(all_scene_geometries, build_geometry_bvh); // 1.
and each Geometry object contains something like:
struct Geometry {
    BVH bvh; // bounding volume hierarchy object backed by Embree library // 2.
    // ...
};
So far so good. This builds bounding volume hierarchies for all geometries, in parallel. The parallel_for_each is implemented by the TBB library, and the BVH data is backed by the Embree library. Both are well known, production & battle-tested libraries.
Now, in a completely unrelated part of the application, like in file dialog UI code, you have on-demand file thumbnail generation code. Thumbnails are cached to disk, but if some of them are not cached, they are rebuilt in the background and saved. While scrolling the dialog with potentially many thumbnails, some of the queued requests might no longer be needed if the visible portion changes drastically. The exact logic is somewhat complex, but essentially it has a queue of “thumbnail generation tasks”, and sometimes decides to cancel a whole group of pending tasks.
So there’s code like this somewhere:
if (some_condition) {
    tbb_thumb_task_group->cancel(); // TBB cancel functionality // 3.
    // ...
and somewhere within each thumbnail generation job, there is potential image buffer colorspace conversion or scaling code that contains something like:
parallel_for(all_thumbnail_pixels, process_the_pixel); // 4.
Does all of that code make sense? You would think so! And yet it crashes, from the innards of oneTBB, when doing the cancel() call. But only sometimes. And only when Address Sanitizer is off.
And only if all of points 1, 2, 3, and 4 above are present:
Remove the parallel build of Cycles geometry BVH, i.e. build them sequentially? All good.
Switch the Cycles BVH to something not backed by Embree? All good.
Switch Cycles to not persist scene data across frames/renders? All good.
Stop doing cancel on the thumbnail task group? All good.
Process thumbnail pixels sequentially instead of parallel? All good too.
🤯
The crash cause
Turns out, there’s nothing particularly wrong with the code above, it is “just” a surprising implementation detail of TBB task group cancellation, one that is not intuitive at all.
It might make sense if you thought really hard about “so, how would I actually implement task cancellation with nested parallelism?”, but most people do not think about this question every day.
What happens is this:
Each parallel_for creates a “task group context” (TBB task_group_context) as an on-stack / local variable. Our parallel_for_each(all_scene_geometries, build_geometry_bvh) above has just created one.
Whenever something “uses” a task group context, TBB “binds” it (whatever that is). As part of this “binding”, the context records the currently executing context as its parent (task_group_context.cpp:118), and adds itself into a per-thread list of live contexts (task_group_context.cpp:105). Turns out, Embree library also uses TBB internally, so building a BVH for a geometry does a...