Approximately every six months someone writes to us from the employees of Yandex company, asks about licensing of PVS-Studio, downloads the trial and disappears. It's normal, we got used to a slow processes of selling our analyzer to large companies. However, once I have an opportunity, it wouldn't be an extra thing to say hello to Yandex developers and remind about PVS-Studio tool.

Honestly, the article turned out to be random in many respects. We have already been offered to check ClickHouse, but somehow this idea was forgotten. The other day, surfing the Internet, I met again the mention of ClickHouse and got interested in the project. This time I decided not to postpone and check out this project.

ClickHouse

ClickHouse is a column database for OLAP (online analytical requests processing). ClickHouse was designed in Yandex to meet the challenges of Yandex.Metrica. ClickHouse allows you to perform analytical requests on updated data in real time. The linearly scalable system is able to work both with trillions of records and petabytes of data. In June of 2016 ClickHouse was posted in open-source under the Apache license 2.0.

Website: clickhouse.yandex.

Page in Wikipedia: ClickHouse.

Repository on GitHub.com site: yandex/ClickHouse.

Analysis of project using PVS-Studio

I checked the ClickHouse source code taken from repository of August 14, 2017. To test, I used the beta version of PVS-Studio v6.17. By the time we published the article, this version have already been released.

The following directories were excluded from the check:

ClickHouse/contrib

ClickHouse/libs

ClickHouse/build

various tests were also excluded, for example, ClickHouse/dbms/src/Common/tests

The size of the rest of the source code in C++ is 213 KLOC. At the same time, 7.9% of lines are comments. It turns out that the size of code itself that has been checked is not that great: about 196 KLOC.

As you can see, the ClickHouse project has a small size. Besides that, the quality of code is uniquely high and I won't be able to write a shocking article. In total the analyzer issued 130 warnings (General analysis, High and Medium warnings).

I'm not sure about the number of false positives. There are many warnings, that formally cannot be named as false, but at the same time there is no practical use in them. The easiest way to explain it, is to give an example.

int format_version; .... if (format_version < 1 || format_version > 4) throw Exception("Bad checksums format version: " + ....); if (format_version == 1) return false; if (format_version == 2) return read_v2(in); if (format_version == 3) return read_v3(in); if (format_version == 4) return read_v4(in); return false;

Analyzer draws attention to the fact that if the expression (format_version == 4) starts evaluating, then it will always be true. As you can see, there is a check above, that if a value format_version goes beyond [1..4], then an exception is thrown. The operator return false will never get executed.

Formally, the analyzer is right and it is not clear how to prove that it is a false positive. On the other hand, it is obvious that this code is correct and is simply written with a "safety margin".

In such cases, a programmer can suppress the analyzer warnings in various ways or rewrite the code. For example, you can write as follows:

switch(format_version) { case 1: return false; case 2: return read_v2(in); case 3: return read_v3(in); case 4: return read_v4(in); default: throw Exception("Bad checksums format version: " + ....); }

There are some warnings about that I just can't say if they point out an error or not. I'm not familiar with the project and have no idea how some code fragments have to be run. Let's consider such a case.

There is some scope with 3 functions:

namespace CurrentMemoryTracker { void alloc(Int64 size); void realloc(Int64 old_size, Int64 new_size); void free(Int64 size); }

The names of formal arguments of functions suggest that some sizes are passed into the functions. Some cases are suspicious for the analyzer. For instance, when the size of a pointer, but not the size of a structure, is passed to the alloc function.

using Large = HyperLogLogCounter<K, Hash, UInt32, DenominatorType>; Large * large = nullptr; .... CurrentMemoryTracker::alloc(sizeof(large));

The analyzer does not know if it is an error or not. I also do not know, but in my opinion, this code is suspicious.

Well, I will not write about such cases. If ClickHouse developers are interested, they can check the project themselves and explore the list of warnings in more detail. I'll review in the article only those code fragments that seemed to me the most interesting.

Interesting code fragments

1. CWE-476: NULL Pointer Dereference (3 errors)

bool executeForNullThenElse(....) { .... const ColumnUInt8 * cond_col = typeid_cast<const ColumnUInt8 *>(arg_cond.column.get()); .... if (cond_col) { .... } else if (cond_const_col) { .... } else throw Exception( "Illegal column " + cond_col->getName() + // <= " of first argument of function " + getName() + ". Must be ColumnUInt8 or ColumnConstUInt8.", ErrorCodes::ILLEGAL_COLUMN); .... }

PVS-Studio warning: V522 Dereferencing of the null pointer 'cond_col' might take place. FunctionsConditional.h 765

Here the situation is handled incorrectly when an error occurs. Instead of throwing an exception, null pointer dereference will occur.

To create an error message, the function call occurs: cond_col->getName(). You cannot do this, because the cond_col pointer will be null.

A similar error is found here: V522 Dereferencing of the null pointer 'cond_col' might take place. FunctionsConditional.h 1061

Let's consider another variant on the issue of using a null pointer:

void processHigherOrderFunction(....) { .... const DataTypeExpression * lambda_type = typeid_cast<const DataTypeExpression *>(types[i].get()); const DataTypes & lambda_argument_types = lambda_type->getArgumentTypes(); if (!lambda_type) throw Exception("Logical error: .....", ErrorCodes::LOGICAL_ERROR); .... }

PVS-Studio warning: V595 The 'lambda_type' pointer was utilized before it was verified against nullptr. Check lines: 359, 361. TypeAndConstantInference.cpp 359

In the beginning the lambda_type pointer is dereferenced, and only then it is checking. To fix the code, you need to move the pointer checking higher:

if (!lambda_type) throw Exception("Logical error: .....", ErrorCodes::LOGICAL_ERROR); const DataTypes & lambda_argument_types = lambda_type->getArgumentTypes();

2. CWE-665: Improper Initialization (1 errors)

struct TryResult { .... explicit TryResult(Entry entry_) : entry(std::move(entry)) // <= , is_usable(true) , is_up_to_date(true) { } .... Entry entry; .... }

V546 Member of a class is initialized by itself: 'entry(entry)'. PoolWithFailoverBase.h 74

Due to typos, entry member is initializing itself and as a result it actually remains uninitialized. To fix the code, you must add the underscore symbol correctly:

: entry(std::move(entry_))

3. CWE-672: Operation on a Resource after Expiration or Release (1 error)

using Strings = std::vector<std::string>; .... int mainEntryClickhousePerformanceTest(int argc, char ** argv) { .... Strings input_files; .... for (const String filename : input_files) // <= { FS::path file(filename); if (!FS::exists(file)) throw DB::Exception(....); if (FS::is_directory(file)) { input_files.erase( // <= std::remove(input_files.begin(), // <= input_files.end(), // <= filename) , // <= input_files.end() ); // <= getFilesFromDir(file, input_files, recursive); } else { if (file.extension().string() != ".xml") throw DB::Exception(....); } } .... }

PVS-Studio warning: V789 Iterators for the 'input_files' container, used in the range-based for loop, become invalid upon the call of the 'erase' function. PerformanceTest.cpp 1471

Input_files container is used in range-based for loop. At the same time, inside the loop, the container may vary due to the removal of some elements. If it is not very clear to a reader why you cannot do so, I suggest to read the description of the diagnostics V789.

4. CWE-563: Assignment to Variable without Use ('Unused Variable') (1 error)

struct StringRange { const char * first; const char * second; .... StringRange(TokenIterator token_begin, TokenIterator token_end) { if (token_begin == token_end) { first = token_begin->begin; // <= second = token_begin->begin; // <= } TokenIterator token_last = token_end; --token_last; first = token_begin->begin; // <= second = token_last->end; // <= } };

The analyzer issues two warnings:

V519 The 'first' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 26, 33. StringRange.h 33

V519 The 'second' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 27, 34. StringRange.h 34

When a certain condition in the beginning first and second variables are assigned to the token_begin->begin value. Further on, the value of these variables in any case is changing again. Most likely this code contains a logical error or something is missing. For example, the return operator may be forgotten:

if (token_begin == token_end) { first = token_begin->begin; second = token_begin->begin; return; }

5. CWE-570: Expression is Always False (2 errors)

DataTypePtr getReturnTypeImpl(const DataTypes & arguments) const override { .... if (!((.....)) || ((left_is_string || left_is_fixed_string) && (.....)) || (left_is_date && right_is_date) || (left_is_date && right_is_string) || (left_is_string && right_is_date) || (left_is_date_time && right_is_date_time) // 1 || (left_is_date_time && right_is_string) // 1 || (left_is_string && right_is_date_time) // 1 || (left_is_date_time && right_is_date_time) // 2 || (left_is_date_time && right_is_string) // 2 || (left_is_string && right_is_date_time) // 2 || (left_is_uuid && right_is_uuid) || (left_is_uuid && right_is_string) || (left_is_string && right_is_uuid) || (left_is_enum && right_is_enum && .....) || (left_is_enum && right_is_string) || (left_is_string && right_is_enum) || (left_tuple && right_tuple && .....) || (arguments[0]->equals(*arguments[1])))) throw Exception(....); .... }

In this condition three sub-expressions are repeated twice. PVS-Studio warnings:

V501 Instantiate FunctionComparison < EqualsOp, NameEquals >: There are identical sub-expressions '(left_is_date_time && right_is_date_time)' to the left and to the right of the '||' operator. FunctionsComparison.h 1057

V501 Instantiate FunctionComparison < EqualsOp, NameEquals >: There are identical sub-expressions '(left_is_date_time && right_is_string)' to the left and to the right of the '||' operator. FunctionsComparison.h 1057

V501 Instantiate FunctionComparison < EqualsOp, NameEquals >: There are identical sub-expressions '(left_is_string && right_is_date_time)' to the left and to the right of the '||' operator. FunctionsComparison.h 1057

There are two options. First, there is no error, the condition is simply superfluous and can be simplified. The second - there is an error here and some conditions are not checked. In any case, the authors should check this code fragment.

Let's look at another case where a condition is always false.

static void ipv6_scan(const char * src, unsigned char * dst) { .... uint16_t val{}; unsigned char * colonp = nullptr; while (const auto ch = *src++) { const auto num = unhex(ch); if (num != -1) { val <<= 4; val |= num; if (val > 0xffffu) // <= return clear_dst(); saw_xdigit = 1; continue; } .... }

PVS-Studio warning: V547 Expression 'val > 0xffffu' is always false. The value range of unsigned short type: [0, 65535]. FunctionsCoding.h 339

When parsing a string containing a IPv6 address, some invalid IPv6 addresses will be taken as correct. It is expected that numbers can be recorded between the separators in hexadecimal format, with a value of less than FFFF. If the number is greater, then the address must be considered incorrect. To identify this situation in code there is a test "if (val > 0xffffu)". But it does not work. Val variable is of uint16_t type, which means it cannot be greater than 0xFFFF. As a result, the function will "swallow" the incorrect address. As a regular part of the address, 4 last hexadecimal numbers before the separator, will be represented.

6. CWE-571. Expression is Always True (1 error)

static void formatIP(UInt32 ip, char *& out) { char * begin = out; for (auto i = 0; i < 3; ++i) *(out++) = 'x'; for (size_t offset = 8; offset <= 24; offset += 8) { if (offset > 0) // <= *(out++) = '.'; /// Get the next byte. UInt32 value = (ip >> offset) & static_cast<UInt32>(255); /// Faster than sprintf. if (value == 0) { *(out++) = '0'; } else { while (value > 0) { *(out++) = '0' + value % 10; value /= 10; } } } /// And reverse. std::reverse(begin, out); *(out++) = '\0'; }

PVS-Studio warning: V547 Expression 'offset > 0' is always true. FunctionsCoding.h 649

"offset > 0" condition is always executed, therefore the point is always added. It seems to me there is no error and a check is just superfluous. Although, of course, I'm not sure. If it wasn't an error, a check should be deleted, so that it wouldn't confuse other programmers and static code analyzers.

Conclusion

Perhaps, project developers will also be able to find a number of errors, looking through the analyzer warnings, which were reflected in the article. I'd like to finish a storytelling especially since I had enough of material to "give regards".

In general, I would like to note the high quality of the code of ClickHouse project developers. However, even highly skilled developers are not immune to having errors and this article proves it again. PVS-Studio static code analyzer will help to prevent many errors. The greatest effect from static analysis developers get when writing new code. It makes no sense to spend time debugging errors that can be detected by the analyzer immediately after checking new code.

I invite you all to download and try PVS-Studio.