
Objective

This is a comprehensive guide to the various Spark and Hadoop certifications offered by Cloudera. In this Cloudera certification tutorial we will discuss all the aspects: the different certifications offered by Cloudera, the pattern of each certification exam, the number of questions, passing score, time limit, required skills, and the weightage of each topic. We will cover all the certifications offered by Cloudera: “CCA Spark and Hadoop Developer Exam (CCA175)”, “Cloudera Certified Administrator for Apache Hadoop (CCAH)”, “CCP Data Scientist”, and “CCP Data Engineer”.

1. CCA Spark and Hadoop Developer Exam (CCA175)

For the CCA Spark and Hadoop Developer certification, you need to write code in Scala and Python and run it on a cluster to prove your skills. The exam can be taken from any computer, at any time, anywhere in the world.

CCA175 is a hands-on, practical exam using Cloudera technologies. Each candidate is given their own CDH5 (currently 5.3.2) cluster pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many of the other tools candidates may need.

a. CCA Spark and Hadoop Developer Certification Exam (CCA175) Details:

Number of Questions: 10–12 performance-based (hands-on) tasks on CDH5 cluster

Time Limit: 120 minutes

Passing Score: 70%

Language: English, Japanese (forthcoming)

CCA Spark and Hadoop Developer certification Cost: USD $295

b. CCA175 Exam Question Format

Each CCA question requires you to solve a particular scenario. In some cases, a tool such as Impala or Hive may be used; in other cases, coding is required. For Spark problems, a template (in Scala or Python) is often provided that contains a skeleton of the solution, and the candidate is asked to fill in the missing lines with functional code.
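To give a feel for this format, here is a minimal, purely hypothetical template of the kind such a question might provide, written in PySpark. The input path, file layout, and the lines marked TODO are illustrative assumptions, not material from an actual exam.

from pyspark import SparkContext

sc = SparkContext(appName="cca175-template-sketch")

# Hypothetical input: tab-delimited lines of the form product_id \t price
orders = sc.textFile("/user/exam/orders.tsv")

# TODO: parse each line into (product_id, price) pairs
pairs = orders.map(lambda line: line.split("\t")) \
              .map(lambda fields: (fields[0], float(fields[1])))

# TODO: sum the prices per product and save the result back to HDFS
totals = pairs.reduceByKey(lambda a, b: a + b)
totals.saveAsTextFile("/user/exam/solution/product_revenue")

sc.stop()

In the real exam the TODO lines would be left blank with the rest of the skeleton already written; the candidate supplies only the missing logic.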

c. Prerequisites

There are no prerequisites required to take any Cloudera certification exam.

d. Exam selection and related topics

I. Required Skills

Data Ingest: These are the skills required to transfer data between external systems and your cluster (a sketch follows this list). They include:

Import data from a MySQL database into HDFS using Sqoop, changing the delimiter and file format of the data during import

Export data to a MySQL database using Sqoop

Ingest real-time and near-real-time (NRT) streaming data into HDFS using Flume

Load data into and out of HDFS using Hadoop File System (FS) commands
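Sqoop and Flume are driven from the command line, but the underlying ingest idea can be sketched in PySpark as well, using a JDBC read in place of a Sqoop import. This is only an illustrative alternative, assuming Spark 2.x and a MySQL JDBC driver on the cluster; the host, database, table, credentials, and paths are all hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

# Read a table from MySQL over JDBC (hypothetical connection details)
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://db-host:3306/retail_db")
             .option("dbtable", "customers")
             .option("user", "exam")
             .option("password", "exam")
             .load())

# Store it in HDFS, changing the file format to Parquet on the way in
customers.write.mode("overwrite").parquet("/user/exam/customers_parquet")

# Or store it as delimited text with a custom delimiter
customers.write.mode("overwrite").option("sep", "|").csv("/user/exam/customers_pipe")

spark.stop()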

II. Transform, Stage, Store:

This means converting a set of data values in a given format stored in HDFS into new data values and/or a new data format, and writing them back into HDFS. It includes writing Spark applications in Scala or Python for the tasks below (a sketch follows the list):

Load data from HDFS and store results back to HDFS

Join disparate datasets together

Calculate aggregate statistics (e.g., average or sum)

Filter data into a smaller dataset

Write a query that produces ranked or sorted data
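The following PySpark sketch touches each of these tasks in one small job, assuming Spark 2.x with the DataFrame API. The input paths and column names (customer_id, amount, state) are invented for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

# Load two hypothetical datasets from HDFS
orders = spark.read.parquet("/user/exam/orders")        # order_id, customer_id, amount
customers = spark.read.parquet("/user/exam/customers")  # customer_id, state

# Join the disparate datasets, filter to a smaller dataset,
# then calculate aggregate statistics (average and sum)
stats = (orders.join(customers, "customer_id")
               .filter(F.col("amount") > 0)
               .groupBy("state")
               .agg(F.avg("amount").alias("avg_amount"),
                    F.sum("amount").alias("total_amount")))

# Produce sorted output and store the result back in HDFS
(stats.orderBy(F.desc("total_amount"))
      .write.mode("overwrite")
      .parquet("/user/exam/state_stats"))

spark.stop()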

III. Data Analysis

Use Data Definition Language (DDL) to create tables in the Hive metastore for use by Hive and Impala (a sketch follows this list):

Read and/or create a table in the Hive metastore in a given schema

Extract an Avro schema from a set of data files

Create a table in the Hive metastore using the Avro file format and an external schema file

Improve query performance by creating partitioned tables in the Hive metastore

Evolve an Avro schema by changing JSON files
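As one example of these DDL tasks, a partitioned Hive metastore table can be created from Spark SQL (or equivalently from the Hive shell). The database, table, columns, and location below are made up, and STORED AS AVRO assumes a Hive version with built-in Avro support on the cluster.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ddl-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS exam")

# Partitioned, Avro-backed external table over data already sitting in HDFS
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS exam.web_logs (
      ip    STRING,
      url   STRING,
      bytes BIGINT
  )
  PARTITIONED BY (log_date STRING)
  STORED AS AVRO
  LOCATION '/user/exam/web_logs'
""")

# Register an existing partition and query the table through the metastore
spark.sql("ALTER TABLE exam.web_logs ADD IF NOT EXISTS PARTITION (log_date='2024-01-01')")
spark.sql("SELECT url, count(*) AS hits FROM exam.web_logs GROUP BY url").show()

spark.stop()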

2. Cloudera Certified Administrator for Apache Hadoop (CCAH)

The Cloudera Certified Administrator for Apache Hadoop (CCAH) certification demonstrates your technical knowledge, skills, and ability to configure, deploy, monitor, manage, maintain, and secure an Apache Hadoop cluster.

a. Cloudera Certified Administrator for Apache Hadoop (CCA-500) details

Number of Questions: 60 questions

Time Limit: 90 minutes

Passing Score: 70%

Language: English, Japanese

Cloudera Certified Administrator for Apache Hadoop (CCAH) certification Price: USD $295

b. Exam sections and related topics

I. HDFS (17%)

HDFS features, design principles, and the function of HDFS daemons

Describe the operation of an Apache Hadoop cluster, both in data storage and in data processing

Features of current computing systems that motivated a system like Apache Hadoop, and the commands to handle files in HDFS

Given a scenario, identify appropriate use cases for HDFS Federation

Identify the components and daemons of an HDFS HA-Quorum cluster

HDFS security (Kerberos) and file read-write paths

Determine the best data serialization choice for a given scenario

Internals of HDFS read operations and HDFS write operations

II. YARN (17%)

Understand how to deploy core ecosystem components along with Spark, Impala, and Hive

Understand YARN and MapReduce v2 (MRv2) deployments

Understand the basic design strategy for YARN and how it handles resource allocation

Understand the ResourceManager and NodeManager

Identify the workflow of a job running on YARN

Determine which files you must change and how in order to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN

III. Hadoop Cluster Planning (16%)

Principal points to consider when choosing the hardware and operating systems to host an Apache Hadoop cluster

Understand kernel tuning and disk swapping

Identify a hardware configuration and ecosystem components your cluster needs for the given scenario

Cluster sizing: identify the specifics for the workload, including CPU, memory, storage, and disk I/O, for a given case

Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster

Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario

IV. Hadoop Cluster Installation and Administration (25%)

Understand how to install and configure a Hadoop cluster

Identify how the cluster will handle disk and machine failures for a given case

Analyze a logging configuration and logging configuration file format

Understand the basics of Hadoop metrics and cluster health monitoring

Install ecosystem components in CDH 5 such as Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig

Identify the function and purpose of available tools for managing the Apache Hadoop file system

V. Resource Management (10%)

Understand the overall design goals of each of the Hadoop schedulers and the resource manager

Given a scenario, determine how the Fair/FIFO/Capacity Scheduler allocates cluster resources under YARN

VI. Monitoring and Logging (15%)

Understand the functions and features of Hadoop’s metric collection abilities

Analyze the NameNode and YARN web UIs

Understand how to monitor cluster daemons

Identify and monitor CPU usage on master nodes

Describe how to monitor swap and memory allocation on all nodes

Interpret a log file and identify how to manage Hadoop’s log files

3. CCP Data Scientist

A Cloudera Certified Professional Data Scientist is able to perform descriptive and inferential statistics, apply advanced analytical techniques, and build machine learning models using standard tools. Candidates need to prove their abilities on a live cluster with large datasets in a variety of formats. Earning the credential requires passing three CCP Data Scientist exams (DS700, DS701, and DS702), which may be taken in any order; all three must be passed within 365 days of each other.

a. Common Skills (all exams)

Extract relevant features from a large dataset containing bad records, partial records, errors, or other forms of “noise”

Extract features from data in multiple formats, such as JSON, XML, raw text logs, industry-specific encodings, and graph link data
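To give these common skills some shape, here is a hedged PySpark sketch that pulls a few features out of noisy JSON logs; the path, field names, and the derived feature are all invented for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-sketch").getOrCreate()

# Read semi-structured JSON; in the default permissive mode, malformed lines do not fail the job
events = spark.read.json("/user/exam/events.json")

# Drop records missing key fields, then derive a simple feature
features = (events
            .filter(F.col("user_id").isNotNull() & F.col("duration").isNotNull())
            .withColumn("long_session", (F.col("duration") > 300).cast("int"))
            .select("user_id", "duration", "long_session"))

features.write.mode("overwrite").parquet("/user/exam/features")
spark.stop()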

b. Descriptive and Inferential Statistics on Big Data (DS700)

Determine the confidence for a hypothesis using statistical tests

Calculate common summary statistics, such as mean, variance, and counts

Fit a distribution to a dataset and use it to predict event likelihoods

Perform complex statistical calculations on a large dataset
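The summary-statistics part of DS700 might look something like the following PySpark sketch; the dataset and its value column are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stats-sketch").getOrCreate()

# Hypothetical dataset with a numeric 'value' column
measurements = spark.read.parquet("/user/exam/measurements")

# Common summary statistics: count, mean, and variance
measurements.agg(
    F.count("value").alias("n"),
    F.mean("value").alias("mean"),
    F.variance("value").alias("variance")
).show()

spark.stop()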

c. Advanced Analytical Techniques on Big Data (DS701)

Build a model that contains relevant features from a large dataset

Define relevant data groupings and assign data records from a large dataset into a defined set of data groupings

Evaluate goodness of fit for a given set of data groupings and a dataset

Apply advanced analytical techniques, such as network graph analysis or outlier detection
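For the data-grouping objectives, one common (though by no means required) approach is k-means clustering with Spark MLlib, evaluated with a silhouette score. The feature columns and the choice of k below are assumptions, and ClusteringEvaluator assumes Spark 2.3 or later.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("grouping-sketch").getOrCreate()

# Hypothetical numeric columns describing each customer
data = spark.read.parquet("/user/exam/customer_metrics")

# Assemble the numeric columns into a single feature vector
assembler = VectorAssembler(inputCols=["recency", "frequency", "monetary"],
                            outputCol="features")
vectors = assembler.transform(data)

# Assign records to k groupings and evaluate the goodness of fit
model = KMeans(k=5, seed=42, featuresCol="features").fit(vectors)
clustered = model.transform(vectors)
print(ClusteringEvaluator(featuresCol="features").evaluate(clustered))

spark.stop()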

d. Machine Learning at Scale (DS702)

Build a model with relevant features from a large dataset and select a classification algorithm for it

Predict labels for an unlabeled dataset using a labeled dataset for reference

Tune algorithm metaparameters to maximize algorithm performance

Determine the success of a given algorithm for the given dataset using validation techniques
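A hedged sketch of that workflow in PySpark MLlib, using logistic regression with cross-validated tuning of the regularization parameter; the datasets and columns ('features', 'label') are invented, and logistic regression is just one possible algorithm choice.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("classification-sketch").getOrCreate()

# Hypothetical data: a labeled set with 'features' and 'label', and an unlabeled set with 'features'
labeled = spark.read.parquet("/user/exam/labeled")
unlabeled = spark.read.parquet("/user/exam/unlabeled")

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

# Cross-validation tunes the metaparameter and validates the model
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(labeled)

# Predict labels for the unlabeled dataset with the best model found
(model.transform(unlabeled)
      .select("features", "prediction")
      .write.mode("overwrite")
      .parquet("/user/exam/predictions"))

spark.stop()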

e. What technologies/languages do you need to know?

You’ll be provided with a cluster pre-loaded with Hadoop technologies, plus standard tools like Python and R. Among these standard technologies, it is your choice what to use to solve each problem.

4. CCP Data Engineer

A Cloudera Certified Data Engineer is able to perform the core competencies required to ingest, transform, store, and analyze data in Cloudera’s CDH environment.

a. What do you need to know?

I. Data Ingestion

These are the skills required to transfer data between external systems and your cluster. They include:

Import and export data between an external RDBMS and your cluster, including specific subsets, changing the delimiter and file format of imported data during ingest, and altering the data access pattern or privileges.

Ingest real-time and near-real time (NRT) streaming data into HDFS, including distribution to multiple data sources and converting data on ingest from one format to another.

Load data into and out of HDFS using the Hadoop File System HDFS commands.

II. Transform, Stage, Store

This means converting a set of data values in a given format stored in HDFS into new data values and/or a new data format, and writing them into HDFS or Hive/HCatalog. It includes the following (a sketch follows this list):

Convert data from one file format to another and write it with compression

Convert data from one set of values to another (e.g., Lat/Long to Postal Address using an external library)

Purge bad records from a data set, e.g., null values

De-duplicate and merge data

De-normalize data from multiple disparate data sets

Evolve an Avro or Parquet schema

Partition an existing data set according to one or more partition keys

Tune data for optimal query performance
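Several of these tasks fit naturally into one small PySpark job, sketched below under assumed paths, column names, and formats: read delimited text, purge records with null keys, de-duplicate, then write a compressed, partitioned Parquet copy.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read comma-delimited text with a header (layout is hypothetical)
raw = (spark.read
            .option("sep", ",")
            .option("header", "true")
            .csv("/user/exam/raw_events"))

# Purge bad records (null keys) and de-duplicate on the key
clean = raw.dropna(subset=["event_id"]).dropDuplicates(["event_id"])

# Convert to a columnar format with compression and partition by a key column
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .option("compression", "snappy")
      .parquet("/user/exam/events_clean"))

spark.stop()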

III. Data Analysis

This includes filtering, sorting, joining, aggregating, and/or transforming one or more data sets in a given format stored in HDFS to produce a specified result. The queries will include complex data types (e.g., array, map, struct), the use of external libraries, partitioned data, and compressed data, and will require the use of metadata from Hive/HCatalog (a sketch follows this list).

Write a query to aggregate multiple rows of data and to filter data

Write a query that produces ranked or sorted data

Write a query that joins multiple data sets

Read and/or create a Hive or an HCatalog table from existing data in HDFS
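As an example of the ranked-query requirement, a window function over a Hive/HCatalog table can produce per-group rankings; the database, table, and columns below are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("analysis-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Rank products by revenue within each category and keep the top 10 per category
spark.sql("""
  SELECT category,
         product_id,
         revenue,
         rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM exam.product_revenue
""").filter("rnk <= 10").show()

spark.stop()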

IV. Workflow

This is the ability to create and execute various jobs and actions that move data towards greater value and use in a system. It includes:

Create and execute a linear workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.

Create and execute a branching workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.

Orchestrate a workflow to execute regularly at predefined times, including workflows that have data dependencies

b. What should you expect?

You are given five to eight customer problems, each with a unique, large data set, a CDH cluster, and four hours. For each problem, you must implement a technical solution that meets all the requirements using any tool or combination of tools on the cluster (see list below); you get to pick the tool(s) that are right for the job.