Apache Eagle, an open-source solution for identifying security and performance issues on big data platforms, graduated to an Apache Top-Level Project on January 10, 2017.

First open-sourced by eBay in October 2015, Eagle was created to instantly detect access to sensitive data or malicious activities and to take action in a timely fashion. In addition to data activity monitoring, Eagle also provides node anomaly detection as well as cluster and job performance analytics.

Job performance is analyzed by crunching YARN application logs and by taking snapshots of all running jobs in YARN. Eagle can detect single-job trends, data skew problems and failure reasons, and can assess overall cluster performance by taking all running jobs into account. Eagle calculates task failure ratios for each node to detect nodes that behave differently from the others and require attention. For cluster performance, Eagle accounts for the resources used by each YARN job and correlates them with metrics from cross-cutting services (e.g. the HDFS namenode) to help identify the causes of overall cluster slowness.
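The per-node failure ratio idea can be illustrated with a minimal Python sketch. It is not Eagle's implementation: the function names, the input shape (node name mapped to failed and total task counts) and the z-score cut-off are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def failure_ratios(task_counts):
    """Map each node to failed / total tasks; nodes with no tasks are skipped."""
    return {
        node: failed / total
        for node, (failed, total) in task_counts.items()
        if total > 0
    }


def outlier_nodes(task_counts, z_threshold=2.0):
    """Return nodes whose failure ratio is unusually high compared to the rest.

    Hypothetical rule: flag a node when its ratio lies more than
    `z_threshold` population standard deviations above the mean.
    """
    ratios = failure_ratios(task_counts)
    values = list(ratios.values())
    if len(values) < 2:
        return []  # nothing to compare against
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every node behaves identically
    return sorted(
        node for node, ratio in ratios.items() if (ratio - mu) / sigma > z_threshold
    )
```

With only a handful of nodes a z-score is bounded and can miss outliers, so a median-based rule may be a better fit for small clusters.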

Apache Eagle relies on Apache Storm for stream processing of data activity and operational logs, and can perform policy-based detection and alerting. It provides multiple APIs: a streaming API as an abstraction on top of the Storm API, and a policy engine provider API that exposes WSO2's open-source Siddhi CEP engine as a first-class citizen. The Siddhi CEP engine supports hot deployment of alerting rules, and alerts can be defined with attribute filtering and window-based rules (e.g. more than three accesses in a 10-minute interval).
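To make the window-based rule concrete, the following Python sketch mimics the semantics of "more than three accesses in a 10-minute interval" for a single user. It is neither Siddhi query syntax nor Eagle code: the class name, method name and parameters are hypothetical, and it assumes timestamps arrive in non-decreasing order per user.

```python
from collections import defaultdict, deque


class AccessWindowRule:
    """Fire when a user exceeds `max_accesses` within a sliding time window."""

    def __init__(self, max_accesses=3, window_seconds=600):
        self.max_accesses = max_accesses
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user -> timestamps inside the window

    def observe(self, user, timestamp):
        """Record one access (timestamp in seconds); return True if the rule fires."""
        window = self._events[user]
        window.append(timestamp)
        # Drop accesses that are now 10 minutes old or older.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        return len(window) > self.max_accesses
```

In a CEP engine the same condition would typically be expressed declaratively as a time window plus a count filter, which is what allows rules to be redeployed without restarting the topology.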

A machine-learning-based policy provider is also included. It learns from past user behaviour to classify a data access as anomalous or not. The machine learning policy provider evaluates models trained offline in the Apache Spark framework. Eagle ships with two machine learning methods to calculate a user profile: a density estimation that computes a Gaussian probability density for each user and activity, together with a threshold, and an eigenvalue decomposition that captures behavioural patterns by reducing the dimensionality of user and activity features.
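The density estimation approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Eagle's model: activities are treated as independent Gaussians, the threshold is taken as the lowest density seen in the training data, and all function names and the `min_std` floor are hypothetical.

```python
import math
from statistics import mean, pstdev


def density(params, row):
    """Product of independent Gaussian densities, one per activity."""
    p = 1.0
    for activity, (mu, sigma) in params.items():
        x = row.get(activity, 0.0)
        p *= math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
            sigma * math.sqrt(2 * math.pi)
        )
    return p


def fit_profile(history, min_std=1e-3):
    """Fit one user's profile from per-interval activity counts.

    `history` is a list of dicts mapping activity name -> count.
    Returns (params, threshold); `min_std` avoids division by zero.
    """
    activities = sorted({a for row in history for a in row})
    params = {}
    for activity in activities:
        values = [row.get(activity, 0.0) for row in history]
        params[activity] = (mean(values), max(pstdev(values), min_std))
    # Assumption: anything less likely than the least likely training row is anomalous.
    threshold = min(density(params, row) for row in history)
    return params, threshold


def is_anomalous(params, threshold, row):
    """Classify one interval of activity against the fitted profile."""
    return density(params, row) < threshold
```

A profile fitted on intervals with roughly ten reads and at most one delete would flag an interval with forty deletes, since its density falls far below the threshold.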

Data integration is achieved with Apache Kafka, via a Logstash forwarder agent or a log4j Kafka appender. Log entries from multiple Hadoop daemons (e.g. namenode, datanode) are fed into Kafka and consumed by the Storm topology. Eagle supports classification of data assets into multiple sensitivity types.
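As a rough illustration of sensitivity classification, the sketch below tags an HDFS path with the label of the longest matching directory prefix. The paths, the labels and the `classify` function are all hypothetical and do not reflect how Eagle stores or evaluates its sensitivity metadata.

```python
# Hypothetical mapping from directory prefix to sensitivity type.
SENSITIVITY_BY_PREFIX = {
    "/data/customers": "PII",
    "/data/payments": "FINANCIAL",
}


def classify(path, rules=SENSITIVITY_BY_PREFIX):
    """Return the sensitivity type of `path`, or None if no rule matches.

    The longest matching prefix wins, and a prefix only matches on a
    directory boundary ("/data/customers" does not match "/data/customersX").
    """
    best = None
    for prefix, label in rules.items():
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, label)
    return best[1] if best else None
```

Each log entry consumed from Kafka could then be enriched with such a label before policies are evaluated against it.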

Eagle supports Apache HBase as well as a relational database for alert persistence. Alerts can be sent via e-mail or Kafka, or stored in an Eagle-supported storage. It is also possible to develop your own alert notification plugin.