Dániel Domján
Eötvös Loránd University, Hungary
Commits on GitHub | Revisions on Phabricator
In modern C++ projects it's common to use structured bindings to make the code more readable and easier to maintain. They allow developers to bind specified names, also referred to as identifiers, to underlying fields or elements of a given initializer. For example:
Color color_instance{176, 183, 187};
auto [r, g, b] = color_instance;
uint8_t red = r;
The structured binding declaration can be seen on the second line, where r, g and b are the specified names or identifiers, and color_instance is the initializer. r, g and b are bound to 176, 183 and 187 respectively. On the next line the value of r is assigned to red, so the value of red is 176, however the Clang Static Analyzer was not aware of that prior to this project due to the lack of support for structured bindings. Since the analyzer didn't know what r is, in some cases it falsely reported a warning about assigning an uninitialized value, as it happened in #42387.
Incomplete support for structured bindings in the analyzer was a common source of false positives. The goal of this project was to introduce structured bindings to the analyzer and to reduce the number of false positive reports while creating new opportunities to catch more bugs.
There are multiple ways to declare a structured binding, though from an analysis perspective only 2 of them need to be treated differently: binding by value and binding by reference.
When a binding is created by value, the initializer is first copied and the identifiers refer to the fields/elements of the copy. In the other case no copy is created; the identifiers refer to the values inside the original initializer instead.
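As a minimal illustration of the difference (the Pixel struct and the variable names are made up for this example):

struct Pixel { int r; int g; int b; };

void example() {
  Pixel p{1, 2, 3};

  auto [a, b, c] = p;   // by value: p is copied, a, b and c name the members of the copy
  a = 10;               // only the copy changes, p.r is still 1

  auto &[x, y, z] = p;  // by reference: x, y and z refer directly to p.r, p.g and p.b
  x = 10;               // p.r itself becomes 10
}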
There are also 3 structured binding syntaxes defined by the standard. Their simplified versions look like this:
auto [identifier-list] = initializer; (1)
auto [identifier-list]{initializer}; (2)
auto [identifier-list](initializer); (3)
Semantically there's a difference between (1) and (2)/(3) only if the initializer is an array and the binding is created by value. In that case, if syntax (1) is used, each element of the copy will be copy-initialized when the initializer is copied, while using syntax (2)/(3) results in each element being direct-initialized. When it comes to static analysis, the same operations are performed in all 3 cases, so there's no need to treat them separately.
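For completeness, this is roughly what the 3 syntaxes look like with an array initializer (a small sketch, not taken from the original examples):

void syntaxes() {
  int arr[2] = {1, 2};
  auto [a, b] = arr;  // (1) the array is copied, the elements of the copy are copy-initialized
  auto [c, d]{arr};   // (2) the array is copied, the elements of the copy are direct-initialized
  auto [e, f](arr);   // (3) behaves like (2)
}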
There are also 3 different kinds of initializers a structured binding can be created from, and each of them behaves differently. While handling binding by reference is relatively simple in all cases, binding by value requires major changes in the analyzer for most of them.
When a binding to data members is created by value, the object is first copied and the identifiers refer to the non-static data members of the copy.
struct S {
int x;
double y;
};
void fn() {
S s{1, 2.0};
auto [a, b] = s;
}
In this snippet s is copied first, and a and b refer to the data members of the copy, in the same order as they are declared. Identifier a corresponds to x and b corresponds to y respectively.
The identifiers themselves are not variables, just names referring to the members of the copy, so their evaluation happens when they are used somewhere. For example:
auto [a, b] = s;
int fst = a;
When the statement int fst = a; is reached, the value of a is the value of the first field, x, inside the copy of s, so all that's left to do is telling the analyzer to look that value up and assign it to fst.
Initially the analyzer was able to create a copy of the struct, so the only work to be done was implementing how to look the required value up.
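The expected behaviour can be sketched with the kind of test used in the analyzer's own test suite, where clang_analyzer_eval (provided by the debug.ExprInspection checker) prints whether the analyzer knows a condition to be true. The snippet reuses the S struct from above and is only an illustration:

void clang_analyzer_eval(bool); // provided by the debug.ExprInspection checker

void test() {
  S s{1, 2.0};
  auto [a, b] = s;
  int fst = a;
  clang_analyzer_eval(fst == 1); // TRUE once a is looked up in the copy of s
}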
The final step of this case was to handle when data members are bound by reference. In this case the initializer is not copied and the identifiers are references to the data members of the original initializer.
S s{1, 2.0};
auto &[x1, y1] = s;
auto &&[x2, y2] = s;
Inside the analyzer there's no need to differentiate between binding by & (lvalue reference) and && (rvalue reference). In both cases it's enough to look up the specific value from the original object.
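For example, a write through one of the identifiers simply modifies the original object (an illustrative snippet, reusing the S struct from above):

void modify() {
  S s{1, 2.0};
  auto &[x1, y1] = s;
  x1 = 42;             // writes through the reference, s.x becomes 42
  auto &&[x2, y2] = s;
  int val = x2;        // reads s.x through the binding, val is 42
}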
Binding a tuple-like type by value is similar to binding data members in that the tuple-like object is first copied. After that, instead of treating each identifier as an alias to the corresponding member of the copy, a new variable is created for each of them, which has the same type and value as its respective member.
std::pair<int, char> p{1, 'a'};
auto [a, b] = p;
The snippet above can be viewed as if there were 2 variable declarations: int a = 1; and char b = 'a';
The idea here is to replicate this behaviour, and after the object has been copied, create a new variable for each identifier, and initialize them with the proper values from the copy. The variable creation is driven by the Control Flow Graph, so the analyzer needed to be modified so that when it sees a structured binding declaration to a tuple-like type during CFG construction, it also creates a new variable declaration for each identifier in the CFG.
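Conceptually the transformation looks like this (a sketch based on how the standard describes tuple-like bindings, not on the analyzer's actual implementation):

#include <utility>

void fn() {
  std::pair<int, char> p{1, 'a'};
  auto [a, b] = p;
  // roughly equivalent to:
  //   auto copy = p;
  //   std::tuple_element<0, std::pair<int, char>>::type a = std::get<0>(copy); // int
  //   std::tuple_element<1, std::pair<int, char>>::type b = std::get<1>(copy); // char
}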
When a tuple-like type is bound by reference, the initializer is not copied, but the variables are still created for each identifier and in this case they are reference types.
std::pair<int, char> p{1, 'a'};
auto &[a1, b1] = p;
auto &&[a2, b2] = p;
In both cases the new variables are lvalue reference types, so both a1 and a2 have the type int &, and b1 and b2 have the type char &. Also, these variables are references to the elements of p.
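A small example of what this means in practice (illustrative only):

#include <utility>

void modify_pair() {
  std::pair<int, char> p{1, 'a'};
  auto &[a1, b1] = p;
  a1 = 2;    // a1 has the type int & and refers to p.first, so p.first becomes 2
  b1 = 'b';  // b1 has the type char & and refers to p.second
}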
Binding an array by value works like binding data members by value. First the array is copied, and the identifiers refer to the values of the corresponding elements of the copy.
int arr[] = {1, 2, 3};
auto [x, y, z] = arr;
In the snippet above arr is copied and the identifiers (x, y, z) refer to the elements of the copy. The copy of the array is created by an ArrayInitLoopExpr, which is also responsible for creating a copy in an implicit copy/move constructor of a class with an array member, and when a lambda captures an array by value. This means that handling the mentioned expression can provide more information to the analyzer about aggregate types and lambdas.
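The other two situations where an ArrayInitLoopExpr shows up can be seen in this sketch (the type and variable names are made up for illustration):

struct Aggregate {
  int data[3];
};

void contexts() {
  Aggregate a{{1, 2, 3}};
  Aggregate b = a;         // the implicit copy constructor copies the array member
                           // with an ArrayInitLoopExpr

  int arr[3] = {4, 5, 6};
  auto lambda = [arr] {};  // capturing an array by value also creates the copy
                           // with an ArrayInitLoopExpr
}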
The idea is to copy the array with as little overhead as possible, so for POD arrays the desired approach is to create a LazyCompoundVal, which basically means that the analyzer associates the new array with a snapshot of the old one, instead of creating a memberwise copy.
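In other words, the identifiers can be resolved from the snapshot without copying the array element by element. A sketch of the expected behaviour, again using the debug.ExprInspection checker:

void clang_analyzer_eval(bool); // debug.ExprInspection

void test_pod_array() {
  int arr[] = {1, 2, 3};
  auto [x, y, z] = arr;
  clang_analyzer_eval(y == 2); // TRUE, the value is read from the snapshot of arr
}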
While this is the desired approach for POD arrays, it is inaccurate for non-POD ones, because the copy constructor can have side effects. For example, take a look at this snippet:
struct S {
inline static int c = 0;
int i;
S() : i(++c) {}
S(const S &copy) { i = copy.i + 1; }
};
void fn() {
S arr[2];
auto [x, y] = arr;
}
The copy constructor of S has a side effect, which means it must be evaluated for every element to get an accurate result.
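Working through the snippet above: the default constructor sets arr[0].i to 1 and arr[1].i to 2, so evaluating the copy constructor for both elements gives the copy the values 2 and 3. A sketch of the expected result, reusing the S struct above:

void clang_analyzer_eval(bool); // debug.ExprInspection

void test_non_pod() {
  S arr[2];           // arr[0].i == 1, arr[1].i == 2
  auto [x, y] = arr;  // the copy constructor runs for both elements
  clang_analyzer_eval(x.i == 2); // TRUE only if the side effects were modeled
  clang_analyzer_eval(y.i == 3); // TRUE
}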
Initially the analyzer wasn't able to model non-POD arrays, so before it could start evaluating structured bindings to such arrays, it needed to be taught how to create and destroy them.
The idea is the same for both creating the array and creating a copy of it. The analyzer should call the constructor/copy constructor for each of the elements in order, and construct each element into the specific target memory region. Later, when the lifetime of the array ends, the destructors also need to be invoked, but in reverse order: the destructor of the last element is invoked first.
Binding by reference works the same way as in case of data members. The array is not copied and the identifiers are references to the elements in the original array.
To summarize, the main changes made during the project are the following:
- When a LazyCompoundVal of a sufficiently small (smaller than 5 elements) array is about to be created, a by value copy is created instead. This allows the analyzer to produce more accurate warning messages.
- Implemented the evaluation of ArrayInitLoopExpr and structured bindings to POD arrays.
- Fixed the type of bindings to data members of reference type, like in the following snippet:
struct S { int &x; };
void fn() {
int i = 1;
S s{i};
auto [a] = s;
}
Here the type of a is int &, however it was treated as int, which led to false positives.
- Extended the evaluation of ArrayInitLoopExpr, and made it possible to create structured bindings to small non-POD arrays as well as to evaluate implicit copy and move constructors of classes/structs with non-POD array fields properly.
- Added a ConstructionContext for lambda captures, and introduced capturing non-POD arrays in lambdas by value. This also applies to init-captures like [var = S()](){}; compiled with -std=c++17 or higher.
- Improved the handling of ArrayInitLoopExpr further, so the analyzer can properly model the arrays involved when they are used in structured bindings, lambda captures and implicit copy/move constructors.
- Fixed an issue where an object created by new[] evaluates to a garbage value.
Structured bindings are a C++17 addition, so testing the changes required projects written in C++17 or above. Finding sufficiently large projects with extensive structured binding usage was quite challenging, since the majority of large open-source C++ projects are written in C++14 or below.
Most of the testing was done on Acid and WebKit and their dependencies, including Bullet, Glslang and GLFW. On top of that, the changes have also been evaluated on several C++14 or earlier open-source projects, including Qt, SQLite and LLVM itself.
Both the coverage and the accuracy of the analysis have been improved. On WebKit, 81 false positives have been replaced by 22 true positives, and Issue #42387 has been resolved too.
There are several improvements that could increase the accuracy and coverage of the analysis. Most of them are not structured binding specific, but are strongly connected to the implementations used to teach the analyzer about structured bindings.
One of them is related to LazyCompoundVal, which can't tell the difference between Unknown and Undefined values. This sometimes leads to false negatives, so an uninitialized assignment might not be detected. Modifying the analyzer so that it can decide what such values are could increase the accuracy of the analysis.
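A made-up example of the kind of case that can be missed because of this:

struct Uninit {
  int a;
  int b;
};

void false_negative() {
  Uninit u;         // u.a and u.b are left uninitialized
  auto [x, y] = u;  // the copy is a LazyCompoundVal snapshot of u
  int val = x;      // ideally this should warn about an uninitialized value,
                    // but the value read from the snapshot may only be Unknown
}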
Another one is a ConstructionContext for lambdas. Currently lambdas are constructed into a temporary memory region, from which they are moved to their final region. A ConstructionContext would allow the analyzer to construct the lambdas directly into their final regions, so the temporary construction could be avoided.
I'd like to thank Artem Dergachev, Gábor Horváth, Kristóf Umann and Rashmi Mudduluru for mentoring the project, helping with testing the various patches on a couple of open-source projects and answering my questions even during the weekends.
I'd also like to thank Usama Hameed, Ziqing Luo and Malavika Samak for being part of our weekly discussions, as well as Balázs Benics and Gábor Márton for code reviews and ideas about how the new features could be improved.