Dániel Domján
Eötvös Loránd University, Hungary
Commits on GitHub | Revisions on Phabricator
In modern C++ projects it's common to use structured bindings to make the code more readable and easier to maintain. They allow developers to bind specified names, also referred to as identifiers, to underlying fields or elements of a given initializer. For example:
Color color_instance{176, 183, 187};
auto [r, g, b] = color_instance;
uint8_t red = r;
The structured binding declaration can be seen on the second line, where r, g and b are the specified names or identifiers, and color_instance is the initializer. r, g and b are bound to 176, 183 and 187 respectively. On the next line the value of r is assigned to red, so the value of red is 176, however the Clang Static Analyzer was not aware of that prior to this project due to the lack of support for structured bindings. Since the analyzer didn't know what r is, in some cases it falsely reported a warning about assigning an uninitialized value, as it happened in #42387.
Incomplete support for structured bindings in the analyzer was a common source of false positives. The goal of this project was to introduce structured bindings to the analyzer and to reduce the number of false positive reports while creating new opportunities to catch more bugs.
There are multiple ways to declare a structured binding, though from an analysis perspective only 2 of them need to be treated differently: binding by value and binding by reference.
When a binding is created by value, the initializer is first copied and the identifiers refer to the fields/elements of the copy. In the other case no copy is created; the identifiers refer to the values inside the original initializer instead.
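As a minimal illustration of the difference (the Pixel struct and the variable names are made up for this example):

struct Pixel { int r; int g; int b; };

void example() {
  Pixel p{1, 2, 3};

  auto [a, b, c] = p;   // by value: p is copied, a, b and c name the members of the copy
  a = 10;               // only the copy changes, p.r is still 1

  auto &[x, y, z] = p;  // by reference: x, y and z refer directly to p.r, p.g and p.b
  x = 10;               // p.r itself becomes 10
}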
There are also 3 structured binding syntaxes defined by the standard. Their simplified versions look like this:
auto [identifier-list] = initializer; (1)
auto [identifier-list]{initializer}; (2)
auto [identifier-list](initializer); (3)
Semantically there's a difference between (1) and (2)/(3) only if the initializer is an array and the binding is created by value. In that case, if syntax (1) is used, each element of the copy will be copy-initialized when the initializer is copied, while using syntax (2)/(3) results in each element being direct-initialized. When it comes to static analysis, the same operations are performed in all 3 cases, so there's no need to treat them separately.
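For completeness, this is roughly what the 3 syntaxes look like with an array initializer (a small sketch, not taken from the original examples):

void syntaxes() {
  int arr[2] = {1, 2};
  auto [a, b] = arr;  // (1) the array is copied, the elements of the copy are copy-initialized
  auto [c, d]{arr};   // (2) the array is copied, the elements of the copy are direct-initialized
  auto [e, f](arr);   // (3) behaves like (2)
}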
There are also 3 different kinds of initializers a structured binding can be created from, and each of them behaves differently. While handling binding by reference is relatively simple in all cases, binding by value requires major changes in the analyzer for most of them.
When a binding to data members is created by value, the object is first copied and the identifiers refer to the non-static data members of the copy.
struct S {
int x;
double y;
};
void fn() {
S s{1, 2.0};
auto [a, b] = s;
}
In this snippet s is copied first, and a and b refer to the data members of the copy, in the same order as they are declared. Identifier a corresponds to x and b corresponds to y respectively.
The identifiers themselves are not variables, just names referring to the members of the copy, so their evaluation happens when they are used somewhere. For example:
auto [a, b] = s;
int fst = a;
When the statement int fst = a; is reached, the value of a is the value of the first field, x, inside the copy of s, so all that's left to do is telling the analyzer to look that value up and assign it to fst.
Initially the analyzer was able to create a copy of the struct, so the only work to be done was implementing how to look the required value up.
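The expected behaviour can be sketched with the kind of test used in the analyzer's own test suite, where clang_analyzer_eval (provided by the debug.ExprInspection checker) prints whether the analyzer knows a condition to be true. The snippet reuses the S struct from above and is only an illustration:

void clang_analyzer_eval(bool); // provided by the debug.ExprInspection checker

void test() {
  S s{1, 2.0};
  auto [a, b] = s;
  int fst = a;
  clang_analyzer_eval(fst == 1); // TRUE once a is looked up in the copy of s
}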
The final step of this case was to handle when data members are bound by reference. In this case the initializer is not copied and the identifiers are references to the data members of the original initializer.
S s{1, 2.0};
auto &[x1, y1] = s;
auto &&[x2, y2] = s;
Inside the analyzer there's no need to differentiate between binding by & (lvalue reference) and && (rvalue reference). In both cases it's enough to look up the specific value from the original object.
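For example, a write through one of the identifiers simply modifies the original object (an illustrative snippet, reusing the S struct from above):

void modify() {
  S s{1, 2.0};
  auto &[x1, y1] = s;
  x1 = 42;             // writes through the reference, s.x becomes 42
  auto &&[x2, y2] = s;
  int val = x2;        // reads s.x through the binding, val is 42
}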
Binding a tuple-like type by value is similar to binding data members in that the tuple-like object is first copied. After that, instead of treating each identifier as an alias to the corresponding member of the copy, a new variable is created for each of them, which has the same type and value as its respective member.
std::pair<int, char> p{1, 'a'};
auto [a, b] = p;
The snippet above can be viewed as if there were 2 variable declarations: int a = 1; and char b = 'a';
The idea here is to replicate this behaviour, and after the object has been copied, create a new variable for each identifier, and initialize them with the proper values from the copy. The variable creation is driven by the Control Flow Graph, so the analyzer needed to be modified so that when it sees a structured binding declaration to a tuple-like type during CFG construction, it also creates a new variable declaration for each identifier in the CFG.
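Conceptually the transformation looks like this (a sketch based on how the standard describes tuple-like bindings, not on the analyzer's actual implementation):

#include <utility>

void fn() {
  std::pair<int, char> p{1, 'a'};
  auto [a, b] = p;
  // roughly equivalent to:
  //   auto copy = p;
  //   std::tuple_element<0, std::pair<int, char>>::type a = std::get<0>(copy); // int
  //   std::tuple_element<1, std::pair<int, char>>::type b = std::get<1>(copy); // char
}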
When a tuple-like type is bound by reference, the initializer is not copied, but the variables are still created for each identifier and in this case they are reference types.
std::pair<int, char> p{1, 'a'};
auto &[a1, b1] = p;
auto &&[a2, b2] = p;
In both cases the new variables are lvalue reference types, so both a1 and a2 have the type int &, and b1 and b2 have the type char &. Also, these variables are references to the elements of p.
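A small example of what this means in practice (illustrative only):

#include <utility>

void modify_pair() {
  std::pair<int, char> p{1, 'a'};
  auto &[a1, b1] = p;
  a1 = 2;    // a1 has the type int & and refers to p.first, so p.first becomes 2
  b1 = 'b';  // b1 has the type char & and refers to p.second
}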
Binding an array by value works like binding data members by value. First the array is copied, and the identifiers refer to the values of the corresponding elements of the copy.
int arr[] = {1, 2, 3};
auto [x, y, z] = arr;
In the snippet above arr is copied and the identifiers (x, y, z) refer to the elements of the copy. The copy of the array is created by an ArrayInitLoopExpr, which is also responsible for creating a copy in an implicit copy/move constructor of a class with an array member, and when a lambda captures an array by value. This means that handling the mentioned expression can provide more information to the analyzer about aggregate types and lambdas.
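The other two situations where an ArrayInitLoopExpr shows up can be seen in this sketch (the type and variable names are made up for illustration):

struct Aggregate {
  int data[3];
};

void contexts() {
  Aggregate a{{1, 2, 3}};
  Aggregate b = a;         // the implicit copy constructor copies the array member
                           // with an ArrayInitLoopExpr

  int arr[3] = {4, 5, 6};
  auto lambda = [arr] {};  // capturing an array by value also creates the copy
                           // with an ArrayInitLoopExpr
}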
The idea is to copy the array with as little overhead as possible, so for POD arrays the desired approach is to create a LazyCompoundVal, which basically means that the analyzer associates the new array with a snapshot of the old one, instead of creating a memberwise copy.
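In other words, the identifiers can be resolved from the snapshot without copying the array element by element. A sketch of the expected behaviour, again using the debug.ExprInspection checker:

void clang_analyzer_eval(bool); // debug.ExprInspection

void test_pod_array() {
  int arr[] = {1, 2, 3};
  auto [x, y, z] = arr;
  clang_analyzer_eval(y == 2); // TRUE, the value is read from the snapshot of arr
}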
While this is the desired approach for POD arrays, it is inaccurate for non-POD ones, because the copy constructor can have side effects. For example, take a look at this snippet:
struct S {
inline static int c = 0;
int i;
S() : i(++c) {}
S(const S &copy) { i = copy.i + 1; }
};
void fn() {
S arr[2];
auto [x, y] = arr;
}
The copy constructor of S has a side effect, which means it must be evaluated for every element to get an accurate result.
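Working through the snippet above: the default constructor sets arr[0].i to 1 and arr[1].i to 2, so evaluating the copy constructor for both elements gives the copy the values 2 and 3. A sketch of the expected result, reusing the S struct above:

void clang_analyzer_eval(bool); // debug.ExprInspection

void test_non_pod() {
  S arr[2];           // arr[0].i == 1, arr[1].i == 2
  auto [x, y] = arr;  // the copy constructor runs for both elements
  clang_analyzer_eval(x.i == 2); // TRUE only if the side effects were modeled
  clang_analyzer_eval(y.i == 3); // TRUE
}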
Initially the analyzer wasn't able to model non-POD arrays, so before it could start evaluating structured bindings to such arrays, it needed to be taught how to create and destroy them.
The idea is the same for both creating the array and creating a copy of it. The analyzer should call the constructor/copy constructor for each of the elements in order, and construct each element into the specific target memory region. Later, when the lifetime of the array ends, the destructors also need to be invoked, but in reverse order: the destructor of the last element is invoked first.
Binding by reference works the same way as in case of data members. The array is not copied and the identifiers are references to the elements in the original array.
To summarize, the main changes made during the project are the following:
- When a LazyCompoundVal of a sufficiently small (smaller than 5 elements) array is about to be created, a by value copy is created instead. This allows the analyzer to produce more accurate warning messages.
- Implemented the evaluation of ArrayInitLoopExpr and structured bindings to POD arrays.
- Fixed the type of bindings to data members of reference type, like in the following snippet:
struct S { int &x; };
void fn() {
int i = 1;
S s{i};
auto [a] = s;
}
Here the type of a is int &, however it was treated as int, which led to false positives.
- Extended the evaluation of ArrayInitLoopExpr, and made it possible to create structured bindings to small non-POD arrays as well as to evaluate implicit copy and move constructors of classes/structs with non-POD array fields properly.
- Added a ConstructionContext for lambda captures, and introduced capturing non-POD arrays in lambdas by value. This also applies to init-captures like [var = S()](){}; compiled with -std=c++17 or higher.
- Improved the handling of ArrayInitLoopExpr further, so the analyzer can properly model the arrays involved when they are used in structured bindings, lambda captures and implicit copy/move constructors.
- Fixed an issue where an object created by new[] evaluates to a garbage value.
Structured bindings are a C++17 addition, so testing the changes required projects written in C++17 or above. Finding sufficiently large projects with extensive structured binding usage was quite challenging, since the majority of large open-source C++ projects are written in C++14 or below.
Most of the testing was done on Acid and WebKit and their dependencies, including Bullet, Glslang and GLFW. On top of that, the changes have also been evaluated on several C++14 or earlier open-source projects, including Qt, SQLite and LLVM itself.
Both the coverage and the accuracy of the analysis have been improved. On WebKit, 81 false positives have been replaced by 22 true positives, and Issue #42387 has been resolved too.
There are several improvements that could increase the accuracy and coverage of the analysis. Most of them are not structured binding specific, but are strongly connected to the implementations used to teach the analyzer about structured bindings.
One of them is related to LazyCompoundVal, which can't tell the difference between Unknown and Undefined values. This sometimes leads to false negatives, so an uninitialized assignment might not be detected. Modifying the analyzer so that it can decide what such values are could increase the accuracy of the analysis.
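A made-up example of the kind of case that can be missed because of this:

struct Uninit {
  int a;
  int b;
};

void false_negative() {
  Uninit u;         // u.a and u.b are left uninitialized
  auto [x, y] = u;  // the copy is a LazyCompoundVal snapshot of u
  int val = x;      // ideally this should warn about an uninitialized value,
                    // but the value read from the snapshot may only be Unknown
}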
Another one is a ConstructionContext for lambdas. Currently lambdas are constructed into a temporary memory region, from which they are moved to their final region. A ConstructionContext would allow the analyzer to construct the lambdas directly into their final regions, so the temporary construction could be avoided.
I'd like to thank Artem Dergachev, Gábor Horváth, Kristóf Umann and Rashmi Mudduluru for mentoring the project, helping with testing the various patches on a couple of open-source projects and answering my questions even during the weekends.
I'd also like to thank Usama Hameed, Ziqing Luo and Malavika Samak for being part of our weekly discussions, as well as Balázs Benics and Gábor Márton for code reviews and ideas about how the new features could be improved.