About the project

Hadoop Common

MapReduce

Hadoop Distributed File System

YARN

About the check

Using the maven plugin;

Using the gradle plugin;

Using the gradle IntellJ IDEA;

Directly using the analyzer.

Production code

Hadoop Common

public class MiniKdc { .... private static final Set<String> PROPERTIES = new HashSet<String>(); .... static { PROPERTIES.add(ORG_NAME); PROPERTIES.add(ORG_DOMAIN); PROPERTIES.add(KDC_BIND_ADDRESS); PROPERTIES.add(KDC_BIND_ADDRESS); // <= PROPERTIES.add(KDC_PORT); PROPERTIES.add(INSTANCE); .... } .... }

MapReduce

public synchronized void setup(JobConf conf, JobID jobId) throws IOException { .... // Update the configuration object with localized data. if (!localArchives.isEmpty()) { conf.set(MRJobConfig.CACHE_LOCALARCHIVES, StringUtils .arrayToString(localArchives.toArray(new String[localArchives // <= .size()]))); } if (!localFiles.isEmpty()) { conf.set(MRJobConfig.CACHE_LOCALFILES, StringUtils .arrayToString(localFiles.toArray(new String[localArchives // <= .size()]))); } .... }

Suppose, we have localArchives (size = 4) and localFiles (size = 2);

(size = 4) and (size = 2); When creating the array localFiles.toArray(new String[localArchives.size()]) , the last 2 elements will be null ([«pathToFile1», «pathToFile2», null, null]);

, the last 2 elements will be ([«pathToFile1», «pathToFile2», null, null]); Then org.apache.hadoop.util.StringUtils.arrayToString will return the string representation of our array, in which the last files names will be presented as «null» («pathToFile1, pathToFile2, null, null» ) ;

will return the string representation of our array, in which the last files names will be presented as «null» («pathToFile1, pathToFile2, null, null» ; All this will be passed further and God only knows what kind of checks there are for such cases =).

boolean isHierarchySameAs(Queue newState) { .... if (children == null || children.size() == 0) { .... } else if(children.size() > 0) { .... } .... }

HDFS

RollingWindow(int windowLenMs, int numBuckets) { buckets = new Bucket[numBuckets]; for (int i = 0; i < numBuckets; i++) { buckets[i] = new Bucket(); } this.windowLenMs = windowLenMs; this.bucketSize = windowLenMs / numBuckets; if (this.bucketSize % bucketSize != 0) { // <= throw new IllegalArgumentException( "The bucket size in the rolling window is not integer: windowLenMs= " + windowLenMs + " numBuckets= " + numBuckets); } }

YARN

public static ApplicationReport convertToApplicationReport(TimelineEntity entity) { .... if (metrics != null) { long vcoreSeconds = 0; long memorySeconds = 0; long preemptedVcoreSeconds = 0; long preemptedMemorySeconds = 0; for (TimelineMetric metric : metrics) { switch (metric.getId()) { case ApplicationMetricsConstants.APP_CPU_METRICS: vcoreSeconds = getAverageValue(metric.getValues().values()); break; case ApplicationMetricsConstants.APP_MEM_METRICS: memorySeconds = ....; break; case ApplicationMetricsConstants.APP_MEM_PREEMPT_METRICS: preemptedVcoreSeconds = ....; // <= break; case ApplicationMetricsConstants.APP_CPU_PREEMPT_METRICS: preemptedVcoreSeconds = ....; // <= break; default: // Should not happen.. break; } } .... } .... }

@Override public synchronized void synchronizePlan(Plan plan, boolean shouldReplan) { .... try { setQueueEntitlement(planQueueName, ....); } catch (YarnException e) { LOG.warn("Exception while trying to size reservation for plan: {}", currResId, planQueueName, e); } .... }

Test code

Hadoop Common

public void testMultiple(int order) throws Exception { .... currentSecretA = secretProviderA.getCurrentSecret(); allSecretsA = secretProviderA.getAllSecrets(); Assert.assertArrayEquals(secretA2, currentSecretA); Assert.assertEquals(2, allSecretsA.length); // <= Assert.assertArrayEquals(secretA2, allSecretsA[0]); Assert.assertArrayEquals(secretA1, allSecretsA[1]); currentSecretB = secretProviderB.getCurrentSecret(); allSecretsB = secretProviderB.getAllSecrets(); Assert.assertArrayEquals(secretA2, currentSecretB); Assert.assertEquals(2, allSecretsA.length); // <= Assert.assertArrayEquals(secretA2, allSecretsB[0]); Assert.assertArrayEquals(secretA1, allSecretsB[1]); .... }

private int readPrepWithUnknownLength(Scanner scanner, int start, int n) throws IOException { for (int i = start; i < start; i++) { String key = String.format(localFormatter, i); byte[] read = readKey(scanner); assertTrue("keys not equal", Arrays.equals(key.getBytes(), read)); try { read = readValue(scanner); assertTrue(false); } catch (IOException ie) { // should have thrown exception } String value = "value" + key; read = readLongValue(scanner, value.getBytes().length); assertTrue("values nto equal", Arrays.equals(read, value.getBytes())); scanner.advance(); } return (start + n); }

MapReduce

GenerateOutput writeSegment(long byteAm, OutputStream out) throws IOException { long headerLen = getHeaderLength(); if (byteAm < headerLen) { // not enough bytes to write even the header return new GenerateOutput(0, 0); } // adjust for header length byteAm -= headerLen; if (byteAm < 0) { // <= byteAm = 0; } .... }

HDFS

public void testWebHdfsErasureCodingFiles() throws Exception { .... final Path normalDir = new Path("/dir"); dfs.mkdirs(normalDir); final Path normalFile = new Path(normalDir, "file.log"); .... // logic block #1 FileStatus expectedNormalDirStatus = dfs.getFileStatus(normalDir); FileStatus actualNormalDirStatus = webHdfs.getFileStatus(normalDir); // <= Assert.assertEquals(expectedNormalDirStatus.isErasureCoded(), actualNormalDirStatus.isErasureCoded()); ContractTestUtils.assertNotErasureCoded(dfs, normalDir); assertTrue(normalDir + " should have erasure coding unset in " + ....); // logic block #2 FileStatus expectedNormalFileStatus = dfs.getFileStatus(normalFile); FileStatus actualNormalFileStatus = webHdfs.getFileStatus(normalDir); // <= Assert.assertEquals(expectedNormalFileStatus.isErasureCoded(), actualNormalFileStatus.isErasureCoded()); ContractTestUtils.assertNotErasureCoded(dfs, normalFile); assertTrue( normalFile + " should have erasure coding unset in " + ....); }

private void verifyNodesAndCorruptBlocks( final int numDn, final int numLiveDn, final int numCorruptBlocks, final int numCorruptECBlockGroups, final DFSClient client, final Long highestPriorityLowRedundancyReplicatedBlocks, final Long highestPriorityLowRedundancyECBlocks) throws IOException { /* init vars */ .... final String expectedCorruptedECBlockGroupsStr = String.format( "Block groups with corrupt internal blocks: %d", numCorruptECBlockGroups); final String highestPriorityLowRedundancyReplicatedBlocksStr = String.format( "\tLow redundancy blocks with highest priority " + "to recover: %d", highestPriorityLowRedundancyReplicatedBlocks); final String highestPriorityLowRedundancyECBlocksStr = String.format( "\tLow redundancy blocks with highest priority " + "to recover: %d", highestPriorityLowRedundancyReplicatedBlocks); .... }

private void verifyFileContent(...., SlowWriter[] slowwriters) throws IOException { LOG.info("Verify the file"); for (int i = 0; i < slowwriters.length; i++) { LOG.info(slowwriters[i].filepath + ....); FSDataInputStream in = null; try { in = fs.open(slowwriters[i].filepath); for (int j = 0, x;; j++) { x = in.read(); if ((x) != -1) { Assert.assertEquals(j, x); } else { return; } } } finally { IOUtils.closeStream(in); } } }

if we read something relevant, we compare the read byte with the current value of the j counter though assertEquals (if the check is not successful, we fail out of the test);

counter though (if the check is not successful, we fail out of the test); if the file is checked successfully and we get to the end of the file (we read -1), the method exits.

YARN

@Test public void testCreationOfNodeLabelsProviderService() throws InterruptedException { try { .... } catch (Exception e) { Assert.fail("Exception caught"); e.printStackTrace(); } }

V6019 Unreachable code detected. It is possible that an error is present. TestResourceTrackerService.java(928)

V6019 Unreachable code detected. It is possible that an error is present. TestResourceTrackerService.java(737)

V6019 Unreachable code detected. It is possible that an error is present. TestResourceTrackerService.java(685)

....

@Test public void testDirectoryCleanupOnNewlyCreatedStateStore() throws IOException, URISyntaxException { .... // verify directory creation for (Path p : localDirs) { p = new Path((new URI(p.toString())).getPath()); // logic block #1 Path usercache = new Path(p, ContainerLocalizer.USERCACHE); verify(spylfs).rename(eq(usercache), any(Path.class), any()); // <= verify(spylfs).mkdir(eq(usercache), ....); // logic block #2 Path publicCache = new Path(p, ContainerLocalizer.FILECACHE); verify(spylfs).rename(eq(usercache), any(Path.class), any()); // <= verify(spylfs).mkdir(eq(publicCache), ....); .... } .... }

Conclusion

In order to get high quality production code, it's not enough just to ensure maximum coverage with tests. No doubts, great results require the main project code and tests to work efficiently together. Therefore, tests have to be paid as much attention as the main code. A decent test is a key success factor, as it will catch regression in production. Let's take a look at PVS-Studio static analyzer warnings to see the importance of the fact that errors in tests are no worse than the ones in production. Today's focus: Apache Hadoop.Those who were formerly interested in Big Data have probably heard or even worked with the Apache Hadoop project. In a nutshell, Hadoop is a framework that can be used as a basis for building Big Data systems and working with them.Hadoop consists of four main modules, each of them performs a specific task required for a big data analytics system:Anyway, there are plenty of materials about it on the Internet.As shown in the documentation, PVS-Studio can be integrated in the project in a variety of ways:Hadoop is based on the maven build system, therefore, there were no obstacles with the check.After I integrated the script from the documentation and edited one of the pom.xml files (there were modules in dependencies that were not available), the analysis started!After the analysis was completed, I chose the most interesting warnings and noticed that I had the same number of warnings in production code and in tests. Normally, I don't consider analyzer warnings, given for tests. But when I divided them, I couldn't leave 'tests' warnings unattended. «Why not take a look at them?»- I thought, because bugs in tests might also have adverse consequences. They can lead to incorrect or partial testing, or even to mishmash (they exist just to check the box, that they are always green).After I selected the most intriguing warnings, I divided them by the following groups: production, test and the four main Hadoop modules. And now I'm glad to offer for your attention the review of analyzer warnings. V6033 An item with the same key 'KDC_BIND_ADDRESS' has already been added. MiniKdc.java(163), MiniKdc.java(162)The value added twice inis a very common defect when checking projects. The second addition will be actually ignored. I hope this duplication is just a needless tragedy. What if another value was meant to be added? V6072 Two similar code fragments were found. Perhaps, this is a typo and 'localFiles' variable should be used instead of 'localArchives'. LocalDistributedCacheManager.java(183), LocalDistributedCacheManager.java(178), LocalDistributedCacheManager.java(176), LocalDistributedCacheManager.java(181)The V6072 diagnostic sometimes yields some interesting findings. The purpose of this diagnostic is to detect the same type code fragments resulted from copy-paste and replacement of one-two variables. In this case, some variables have even been left «underchanged».The code above demonstrates this. In the first block thevariable is used, in the second similar fragment —. If you study this code with due diligence, rather then quickly run through it, as it often happens while code reviewing, you'll notice the fragment, where the author forgot to replace thevariable.This gaffe can lead to the following scenario: V6007 Expression 'children.size() > 0' is always true. Queue.java(347)Due to the fact that the number of elements is checked for 0 separately, the further checkwill always be true. V6001 There are identical sub-expressions 'this.bucketSize' to the left and to the right of the '%' operator. RollingWindow.java(79)Here the defect is that the variable is divided by itself. As a result, the check for multiplicity will always be successful and in case of getting incorrect inputs (), the exception will never be thrown. V6067 Two or more case-branches perform the same actions. TimelineEntityV2Converter.java(386), TimelineEntityV2Converter.java(389)Same code fragments in twobranches. It's just all over the place! In the prevailing number of cases, this is not a real error, but just a reason to think aboutrefactoring of. But not for the case at hand. Repeated code fragments set the value of the variable. If you look closely at the names of all variables and constants, you will probably conclude that in case ifthe value must be set for thevariable, not for. In this regard, after the 'switch' operator,will always remain 0, while the value ofmight be incorrect. V6046 Incorrect format. A different number of format items is expected. Arguments not used: 2. AbstractSchedulerPlanFollower.java(186)Thevariable isn't used when logging. In this case, either too much is copied or the format string isn't finished. But I still blame the good old copy-paste, which in some cases is just great to shoot yourself in the foot. V6072 Two similar code fragments were found. Perhaps, this is a typo and 'allSecretsB' variable should be used instead of 'allSecretsA'. TestZKSignerSecretProvider.java(316), TestZKSignerSecretProvider.java(309), TestZKSignerSecretProvider.java(306), TestZKSignerSecretProvider.java(313)And again the V6072. Look closely at variablesand V6043 Consider inspecting the 'for' operator. Initial and final values of the iterator are the same. TestTFile.java(235)A test that's always green? =). A part of the loop, which is a part of the test itself, will never get executed. This is due to the fact that the initial and final counter values are equal in thestatement. As a result, the conditionwill immediately become false, leading to such behavior. I ran through the test file and leapt to the conclusion that actuallyhad to be written in the loop condition. V6007 Expression 'byteAm < 0' is always false. DataWriter.java(322)The conditionis always false. To figure it out, let's give the code above another glance. If the test execution reaches the operation, it means thatFrom here, after subtraction, thevaluewill never be negative. That's what we had to prove. V6072 Two similar code fragments were found. Perhaps, this is a typo and 'normalFile' variable should be used instead of 'normalDir'. TestWebHDFS.java(625), TestWebHDFS.java(615), TestWebHDFS.java(614), TestWebHDFS.java(624)Believe it or not, it's V6072 again! Just follow the variablesand V6027 Variables are initialized through the call to the same function. It's probably an error or un-optimized code. TestDFSAdmin.java(883), TestDFSAdmin.java(879)In this fragment, variablesandare initialized by the same values. Often it should be so, but this is not the case. Variables' names are long and similar, so it's not surprising that copy-pasted fragment wasn't changed in any way. To fix it, when initializing thevariable, the author has to use the input parameter. In addition, most likely, they still need to correct the format line. V6019 Unreachable code detected. It is possible that an error is present. TestReplaceDatanodeFailureReplication.java(222)The analyzer complains that thecounter in the loop can't be changed. Which means that in theloop no more than one iteration will execute. Let's find out why. So, in the first iteration, we link the thread with the file that corresponds tofor further reading. Next, we read the file contents via the loopTherefore, whatever happens during the check of, it won't get to checking subsequent elements. Most likely,has to be used instead of V6019 Unreachable code detected. It is possible that an error is present. TestNodeManager.java(176)In this situation, themethod will interrupt the test and stacktrace won't be printed in case of an exception. If the message about the caught exception is enough here, it's better to remove printing of stacktrace to avoid confusion. If printing is necessary, you just need to swap them.Many similar fragments have been found: V6072 Two similar code fragments were found. Perhaps, this is a typo and 'publicCache' variable should be used instead of 'usercache'. TestResourceLocalizationService.java(315), TestResourceLocalizationService.java(309), TestResourceLocalizationService.java(307), TestResourceLocalizationService.java(313)And finally, V6072 again =). Variables to view the suspicious fragment:andHundreds of thousands of code lines are written in development. Production code is usually kept clean from bugs, defects and flaws (developers test their code, the code is reviewed and so on). Tests are definitely inferior in this regard. Defects in tests can easily hide behind a «green tick». As you've probably got from today's warnings review, a green test is not always a successful check.This time, when checking the Apache Hadoop code base, static analysis turned out to be highly needed both in production code and tests that also play an important role in development.So if you care about your code and tests quality, I suggest that you set sights on static analysis. Let PVS-Studio be the first contender in this undertaking.