C++17 structured bindings in the Clang Static Analyzer

Google Summer of Code 2022 @ LLVM Compiler Infrastructure

Dániel Domján Eötvös Loránd University, Hungary
Commits on GitHub | Revisions on Phabricator

Motivation

In modern C++ projects it's common to use structured bindings to make the code more readable and easier to maintain. They allow developers to bind specified names, also referred to as identifiers to underlying fields or elements of a given initializer. For example:

Color color_instance{176, 183, 187};
auto [r, g, b] = color_instance;
uint8_t red = r;

The structured binding declaration can be seen on the second line, where r, g and b are the specified names or identifiers, and color_instance is the initializer. r, g and b are bound to 176, 183 and 187 respectively.

On the next line the value of r is assigned to red, so the value of red is 176, however the Clang Static Analyzer was not aware of that prior to this project due to the lack of support for structured bindings. Since the analyzer didn't know what r is, in some cases it falsely reported a warning about assigning an uninitialized value, as it happened in #42387.

Incomplete support for structured bindings in the analyzer was a common source of false positives. The goal of this project was to introduce structured bindings to the analyzer, and reduce the amount of false positive reports while creating new opportunities to catch more bugs.

Approach

There are multiple different ways to declare a structured binding, though from an analysis perspective only 2 of them needs to be treated differently. Binding by value and binding by reference.

When a binding is created by value, the initializer is first copied, and the identifiers refer to the fields/elements of the copy. In the other case no copy gets created, the identifiers refer to the values inside the original initializer instead.

There are also 3 structured binding syntaxes defined by the standard. Their simplified versions look like this:

auto [identifier-list] = initializer; (1)
auto [identifier-list]{initializer};  (2)
auto [identifier-list](initializer);  (3)

In meaning there's a difference between (1) and (2)/(3) only if the initializer is an array and the binding is created by value. In that case if syntax (1) is used, when the initializer is copied, each element of the copy will be copy-initialized, while using syntax (2)/(3) will result in each element being direct-initialized. When it comes to static analysis, the same operations are performed in all 3 cases so there's no need to treat them separately.

Although there are also 3 different cases when structured bindings can be created, and each of them does behave differently. While handling binding by reference is relatively simple in all case, by value binding requires major changes in the analyzer for most of them.

Case 1: binding to data members

When a binding to data members is created by value, the object is first copied and the identifiers refer to the the non-static data members of the copy.

struct S {
    int x;
    double y;
};

void fn() {
    S s{1, 2.0};

    auto [a, b] = s;
}

In this snippet s is copied first and a and b refer to the data members of the copy, in the same order as they are declared. Identifier a corresponds to x and b corresponds to y respectively.

The identifiers themselves are not variables, just names referring to the members of the copy, so their evaluation happens when they are used somewhere. For example:

auto [a, b] = s;
int fst = a;

When the the statement int fst = a; is reached, the value of a is the value of the first field, x inside the copy of s, so all that's left to do is telling the analyzer to look that value up and assign it to fst.

Initially the analyzer was able to create a copy of the struct, so the only work to be done was implementing how to look the required value up.

The final step of this case was to handle when data members are bound by reference. In this case the initializer is not copied and the identifiers are references to the data members of the original initializer.

S s{1, 2.0};

auto &[x1, y1] = s;
auto &&[x2, y2] = s;

Inside the analyzer there's no need to differentiate between binding by & (lvalue reference) and && (rvalue reference). In both cases it's enough to look up the specific value from the original object.

Case 2: binding a tuple-like type

Binding a tuple-like type by value is similar to data members in terms of the tuple-like type is first copied. After that instead of treating each identifier as an alias to the corresponding member of the copy, a new variable is created for each of them, which has the same type and value as it's respective member.

std::pair<int, char> p{1, 'a'};
                    
auto [a, b] = p;

The snippet above can be viewed as if there are 2 variable declarations, int a = 1; and char b = 'a';.

The idea here is to replicate this behaviour, and after the object has been copied, create a new variable for each identifier, and initialize them with the proper values from the copy. The variable creation is driven by the Control Flow Graph, so the analyzer needed to be modified so that when it sees a structured binding declaration to a tuple-like type during CFG construction, it also creates a new variable declaration for each identifier in the CFG.

When a tuple-like type is bound by reference, the initializer is not copied, but the variables are still created for each identifier and in this case they are reference types.

std::pair<int, char> p{1, 'a'};
                    
auto &[a1, b1] = p;
auto &&[a2, b2] = p;

In both cases the new variables are lvalue reference types, so both a1 and a2 have the type of int & and b1 and b2 have the type of char &. Also these variables are references to the elements of p.

Case 3: binding an array

Binding an array by value works like binding data members by value. First the array is copied, and the identifiers refer to the values of the corresponding elements of the copy.

int arr[] = {1, 2, 3};

auto [x, y, z] = arr;

In the snippet above arr is copied and the identifiers (x, y, z) refer to the elements of the copy. The copy of the array is created by an ArrayInitLoopExpr, which is also responsible for creating a copy in an implicit copy/move constructor of a class with an array member and when a lambda captures an array by value. This means that handling the mentioned expression can provide more information to the analyzer about aggregate types and lambdas.

The idea is to copy the array with as little overhead as possible, so for POD arrays the desired approach is to create a LazyCompoundVal, which basically means that the analyzer associates the new array with a snapshot of the old one, instead of creating a memberwise copy.

While it is the desired approach for POD arrays, it is inaccurate for non-POD ones, because the copy constructor can have side effects. For example take a look at this snippet:

struct S {
    inline static int c = 0;
    int i;
                  
    S() : i(++c) {}
    S(const S &copy) { i = copy.i + 1; }
};

void fn() {
    S arr[2];

    auto [x, y] = arr;
}

The copy constructor of S has a side effect, which means it must be evaluated for every element to get an accurate result.

Initially the analyzer wasn't able to model non-POD arrays, so before it could start evaluating structured bindings to such arrays, it needed to be taught how to create and destroy them.

The idea is the same for both creating the array and creating a copy of it. The analyzer should call the constructor/copy constructor for each of the elements in order, and construct each element into the specific target memory region. Later when the lifetime of the array ends, the destructors also need to be invoked, but in a reverse order. The destructor of the last element is invoked first.

Binding by reference works the same way as in case of data members. The array is not copied and the identifiers are references to the elements in the original array.

Implementation

Additional patches

Results

Structured bindings are a C++17 addition, so testing the changes required projects written in C++17 or above. Finding sufficiently large projects with extensive structured binding usage was quite challenging, since the majority of large open-source C++ projects are written in C++14 or below.

Most of the testing was done on Acid and WebKit and their dependencies, including Bullet, Glslang and GLFW. On top that, the changes have also been evaluated on several C++14 or earlier open-source projects, including Qt, SQLite and LLVM itself.

Both the coverage and the accuracy of the analysis have been improved. On WebKit 81 false positives have been replaced by 22 true positives and Issue #42387 has been resolved too.

Future work

There are several improvements that could increase the accuracy and coverage of the analysis. Most of them are not structured binding specific, but are strongly connected to the implementations used to teach the analyzer about structured bindings.

Special thanks

I'd like to thank Artem Dergachev, Gábor Horváth, Kristóf Umann and Rashmi Mudduluru for mentoring the project, helping with testing the various patches on a couple of open-source projects and answering my questions even during the weekends.

I'd also like to thank Usama Hameed, Ziqing Luo and Malavika Samak for being part our weekly discussions, as well as Balázs Benics and Gábor Márton for code reviews and ideas about how the new features could be improved.