WAP 2.0 is a source code static analysis and data mining tool to detect and correct input validation vulnerabilities in web applications written in PHP (version 4.0 or higher) and with a low rate of false positives. WAP detects and corrects the following vulnerabilities:

SQL Injection (SQLI)

Cross-site scripting (XSS)

Remote File Inclusion (RFI)

Local File Inclusion (LFI)

Directory Traversal or Path Traversal (DT/PT)

Source Code Disclosure (SCD)

OS Command Injection (OSCI)

PHP Code Injection

This tool semantically analyses the source code. More precisely, it does taint analysis (data-flow analysis) to detect the input validation vulnerabilities. The aim of the taint analysis is to track malicious inputs inserted by entry points ($_GET, $_POST arrays) and to verify if they reaches some sensitive sink (PHP functions that can be exploited by malicious input). After the detection, the tool uses data mining to confirm if the vulnerabilities are real or false positives. At the end, the real vulnerabilities are corrected with the insertion of the fixes (small pieces of code) in the source code. WAP is written in Java language and is constituted by three modules:

Code Analyzer : composed by tree generator and taint analyser. The tool has integrated a lexer and a parser generated by ANTLR, and based in a grammar and a tree grammar written to PHP language. The tree generator uses the lexer and the parser to build the AST (Abstract Sintatic Tree) to each PHP file. The taint analyzer performs the taint analysis navigating through the AST to detect potentials vulnerabilities.

: composed by tree generator and taint analyser. The tool has integrated a lexer and a parser generated by ANTLR, and based in a grammar and a tree grammar written to PHP language. The tree generator uses the lexer and the parser to build the AST (Abstract Sintatic Tree) to each PHP file. The taint analyzer performs the taint analysis navigating through the AST to detect potentials vulnerabilities.

False Positives Predictor : composed by a supervised trained data set with instances classified as being vulnerabilities and false positives and by the Logistic Regression machine learning algorithm. For each potential vulnerability detected by code analyser, this module collects the presence of the attributes that define a false positive. Then, the Logistic Regression algorithm receives them and classifies the instance as being a false positive or not (real vulnerability).

: composed by a supervised trained data set with instances classified as being vulnerabilities and false positives and by the Logistic Regression machine learning algorithm. For each potential vulnerability detected by code analyser, this module collects the presence of the attributes that define a false positive. Then, the Logistic Regression algorithm receives them and classifies the instance as being a false positive or not (real vulnerability).

Code Corrector: Each real vulnerability is removed by correction of its source code. This module for the type of vulnerability selects the fix that removes the vulnerability and signalizes the places in the source code where the fix will be inserted. Then, the code is corrected with the insertion of the fixes and new files are created.



