Today, we’re extremely happy to announce Amazon SageMaker Debugger, a new capability of Amazon SageMaker that automatically identifies complex issues developing in machine learning (ML) training jobs.

Building and training ML models is a mix of science and craft (some would even say witchcraft). From collecting and preparing data sets to experimenting with different algorithms to figuring out optimal training parameters (the dreaded hyperparameters), ML practitioners need to clear quite a few hurdles to deliver high-performance models. This is the very reason why we built Amazon SageMaker: a modular, fully managed service that simplifies and speeds up ML workflows.

As I keep finding out, ML seems to be one of Mr. Murphy’s favorite hangouts, and everything that may possibly go wrong often does! In particular, many obscure issues can happen during the training process, preventing your model from correctly extracting and learning patterns present in your data set. I’m not talking about software bugs in ML libraries (although they do happen too): most failed training jobs are caused by an inappropriate initialization of parameters, a poor combination of hyperparameters, a design issue in your own code, etc.

To make things worse, these issues are rarely visible immediately: they grow over time, slowly but surely ruining your training process, and yielding low-accuracy models. Let’s face it, even if you’re a bona fide expert, it’s devilishly difficult and time-consuming to identify them and hunt them down, which is why we built Amazon SageMaker Debugger.

Let me tell you more.

Introducing Amazon SageMaker Debugger

In your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost, you can use the new SageMaker Debugger SDK to save internal model state at periodic intervals; as you can guess, it will be stored in Amazon Simple Storage Service (S3).

This state is composed of:

The parameters being learned by the model, e.g. weights and biases for neural networks,

The changes applied to these parameters by the optimizer, aka gradients,

The optimization parameters themselves,

Scalar values, e.g. accuracies and losses,

The output of each layer,

Etc.

Each specific set of values – say, the sequence of gradients flowing over time through a specific neural network layer – is saved independently, and referred to as a tensor. Tensors are organized in collections (weights, gradients, etc.), and you can decide which ones you want to save during training. Then, using the SageMaker SDK and its estimators, you configure your training job as usual, passing additional parameters defining the rules you want SageMaker Debugger to apply.

A rule is a piece of Python code that analyses tensors for the model in training, looking for specific unwanted conditions. Pre-defined rules are available for common problems such as exploding/vanishing tensors (parameters reaching NaN or zero values), exploding/vanishing gradients, loss not changing, and more. Of course, you can also write your own rules.
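To make the idea of a rule concrete, here is a minimal sketch of the kind of check an exploding-tensor or vanishing-gradient rule performs. It is plain Python, not the SageMaker Debugger rule API: the function names, the `threshold` default and the sample values are all assumptions made for illustration.

```python
import math

def has_exploded(values):
    """Return True if any value is NaN or infinite."""
    return any(math.isnan(v) or math.isinf(v) for v in values)

def has_vanished(values, threshold=1e-7):
    """Return True if the mean absolute value is below `threshold` (hypothetical default)."""
    mean_abs = sum(abs(v) for v in values) / len(values)
    return mean_abs < threshold

# Hypothetical gradient values captured at three training steps
steps = {
    0: [0.12, -0.08],
    1: [3.0e-9, -1.0e-9],
    2: [float("nan"), 0.5],
}

for step, grads in steps.items():
    if has_exploded(grads):
        print(f"step {step}: exploding tensor")
    elif has_vanished(grads):
        print(f"step {step}: vanishing gradient")
    else:
        print(f"step {step}: ok")
```

A real rule would read these values from the tensors saved by the training job rather than from a hard-coded dictionary.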

Once the SageMaker estimator is configured, you can launch the training job. Immediately, it fires up a debug job for each rule that you configured, and they start inspecting available tensors. If a debug job detects a problem, it stops and logs additional information. A CloudWatch Events event is also sent, should you want to trigger additional automated steps.

So now you know that your deep learning job suffers from, say, vanishing gradients. With a little brainstorming and experience, you’ll know where to look: maybe the neural network is too deep? Maybe your learning rate is too small? As the internal state has been saved to S3, you can now use the SageMaker Debugger SDK to explore the evolution of tensors over time, confirm your hypothesis and fix the root cause.

Let’s see SageMaker Debugger in action with a quick demo.

Debugging Machine Learning Models with Amazon SageMaker Debugger

At the core of SageMaker Debugger is the ability to capture tensors during training. This requires a little bit of instrumentation in your training code, in order to select the tensor collections you want to save, the frequency at which you want to save them, and whether you want to save the values themselves or a reduction (mean, maximum, norm, etc.).
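As a rough illustration of what saving a reduction means, the sketch below collapses a tensor into a few scalars instead of keeping every value. It is plain Python rather than the SDK itself: the helper name and the sample values are hypothetical, and the three reductions (mean, maximum of absolute values, L1 norm) are simply common choices.

```python
def reduce_tensor(values):
    """Summarize a flat list of tensor values with three scalar reductions."""
    return {
        "mean": sum(values) / len(values),
        "abs_max": max(abs(v) for v in values),
        "l1_norm": sum(abs(v) for v in values),
    }

# Hypothetical weight values at one training step
weights = [10.0, -2.0, 4.0, 0.0]
summary = reduce_tensor(weights)
print(summary)
```

Reductions trade detail for storage: a layer with millions of weights still produces only three numbers per saved step.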

For this purpose, the SageMaker Debugger SDK provides simple APIs for each framework that it supports. Let me show you how this works with a simple TensorFlow script, trying to fit a two-dimensional linear regression model. Of course, you’ll find more examples in this GitHub repository.

Let’s take a look at the initial code:

import argparse
import numpy as np
import tensorflow as tf
import random

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)
args = parser.parse_args()

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print(f'Step={i}, Loss={_loss}')

Let’s train this script using the TensorFlow Estimator. I’m using SageMaker local mode, which is a great way to quickly iterate on experimental code.

import sagemaker
from sagemaker.tensorflow import TensorFlow

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='local',
    entry_point='script-v1.py',
    framework_version='1.13.1',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters)

# Launch the training job
estimator.fit()

Looking at the training log, things did not go well.

Step=0, Loss=7.883463958023267e+23
algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23
algo-1-hrvqg_1 | Step=2, Loss=nan
algo-1-hrvqg_1 | Step=3, Loss=nan
algo-1-hrvqg_1 | Step=4, Loss=nan
algo-1-hrvqg_1 | Step=5, Loss=nan
algo-1-hrvqg_1 | Step=6, Loss=nan
algo-1-hrvqg_1 | Step=7, Loss=nan
algo-1-hrvqg_1 | Step=8, Loss=nan
algo-1-hrvqg_1 | Step=9, Loss=nan

Loss does not decrease at all, and even blows up to NaN… This looks like an exploding tensor problem, which one of the built-in rules in SageMaker Debugger is designed to detect. Let’s get to work.

Using the Amazon SageMaker Debugger SDK

In order to capture tensors, I need to instrument the training script with:

A SaveConfig object specifying the frequency at which tensors should be saved,

A SessionHook object attached to the TensorFlow session, putting everything together and saving required tensors during training,

An (optional) ReductionConfig object, listing tensor reductions that should be saved instead of full tensors,

An (optional) optimizer wrapper to capture gradients.

Here’s the updated code, with extra command line arguments for SageMaker Debugger parameters.

import argparse
import numpy as np
import tensorflow as tf
import random
import smdebug.tensorflow as smd

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)
parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors')
parser.add_argument('--debug_frequency', type=int, help="How often to save tensor data", default=10)
feature_parser = parser.add_mutually_exclusive_group(required=False)
feature_parser.add_argument('--reductions', dest='reductions', action='store_true', help="save reductions of tensors instead of saving full tensors")
feature_parser.add_argument('--no_reductions', dest='reductions', action='store_false', help="save full tensors")
args = parser.parse_args()

reduc = smd.ReductionConfig(reductions=['mean'], abs_reductions=['max'], norms=['l1']) if args.reductions else None

hook = smd.SessionHook(out_dir=args.debug_path,
                       include_collections=['weights', 'gradients', 'losses'],
                       save_config=smd.SaveConfig(save_interval=args.debug_frequency),
                       reduction_config=reduc)

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")
    hook.add_to_collection('losses', loss)

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer = hook.wrap_optimizer(optimizer)
optimizer_op = optimizer.minimize(loss)

hook.set_mode(smd.modes.TRAIN)

with tf.train.MonitoredSession(hooks=[hook]) as sess:
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print(f'Step={i}, Loss={_loss}')

I also need to modify the TensorFlow Estimator, to use the SageMaker Debugger-enabled training container and to pass additional parameters.

from sagemaker.debugger import Rule, rule_configs

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='ml.c5.2xlarge',
    image_name=cpu_docker_image_name,
    entry_point='script-v2.py',
    framework_version='1.15',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters,
    rules=[Rule.sagemaker(rule_configs.exploding_tensor())]
)

estimator.fit()

2019-11-27 10:42:02 Starting - Starting the training job...
2019-11-27 10:42:25 Starting - Launching requested ML instances
********* Debugger Rule Status *********
*
* ExplodingTensor: InProgress
*
****************************************

Two jobs are running: the actual training job, and a debug job checking for the rule defined in the Estimator. Quickly, the debug job fails!

Describing the training job, I can get more information on what happened.

description = client.describe_training_job(TrainingJobName=job_name)
print(description['DebugRuleEvaluationStatuses'][0]['RuleConfigurationName'])
print(description['DebugRuleEvaluationStatuses'][0]['RuleEvaluationStatus'])

ExplodingTensor
IssuesFound
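With several rules configured, indexing the first entry by hand gets tedious. The helper sketched below collects every rule whose status is IssuesFound. It relies only on the response fields used in the describe_training_job call; the helper name is hypothetical, and the sample dictionary is a hand-written stand-in for a real response.

```python
def rules_with_issues(description):
    """Return the names of rules whose evaluation status is 'IssuesFound'."""
    statuses = description.get('DebugRuleEvaluationStatuses', [])
    return [
        s['RuleConfigurationName']
        for s in statuses
        if s.get('RuleEvaluationStatus') == 'IssuesFound'
    ]

# Hand-written stand-in for the response returned by describe_training_job
sample_description = {
    'DebugRuleEvaluationStatuses': [
        {'RuleConfigurationName': 'ExplodingTensor',
         'RuleEvaluationStatus': 'IssuesFound'},
    ]
}
print(rules_with_issues(sample_description))
```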

Let’s take a look at the saved tensors.

Exploring Tensors

I can easily grab the tensors saved in S3 during the training process.



from smdebug.trials import create_trial

s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"]
trial = create_trial(s3_output_path)

Let’s list available tensors.

trial.tensors()

['loss/loss:0',
 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0',
 'initialize/weight1:0']

All values are numpy arrays, and I can easily iterate over them.

tensor = 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0'
for s in list(trial.tensor(tensor).steps()):
    print("Value: ", trial.tensor(tensor).step(s).value)

Value: [[1.1508383e+23] [1.0809098e+23]]
Value: [[1.0278440e+23] [1.1347468e+23]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
Value: [[nan] [nan]]
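Rather than eyeballing the printout, the first step at which a tensor stops being finite can be located programmatically. The sketch below works on plain nested lists copied from the gradient values printed for this run (truncated to four steps); the function name is hypothetical, and the list index matches the step number only because tensors were saved at every step.

```python
import math

def first_non_finite_step(values_by_step):
    """Return the index of the first step containing NaN or infinity, or None."""
    for step, rows in enumerate(values_by_step):
        if any(not math.isfinite(v) for row in rows for v in row):
            return step
    return None

nan = float("nan")
# Gradient values copied from the printed output, one entry per saved step
gradients = [
    [[1.1508383e+23], [1.0809098e+23]],
    [[1.0278440e+23], [1.1347468e+23]],
    [[nan], [nan]],
    [[nan], [nan]],
]
print(first_non_finite_step(gradients))
```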

As tensor names include the TensorFlow scope defined in the training code, I can easily see that something is wrong with my matrix multiplication.
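To see how tensor names map back to scopes, the names can be split on the `/` separator. This is a minimal sketch using the three tensor names saved by this job; the helper name is hypothetical, and it assumes the usual TensorFlow convention of a `:N` output-index suffix.

```python
def scope_parts(tensor_name):
    """Split a tensor name into its scope components, dropping the ':N' output index."""
    base = tensor_name.rsplit(':', 1)[0]
    return base.split('/')

names = [
    'loss/loss:0',
    'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0',
    'initialize/weight1:0',
]
for name in names:
    print(scope_parts(name))
```

The gradient tensor carries the `multiply` scope, which is what points to the matrix multiplication.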

# Compute true label
y = tf.matmul(x, w0)
# Compute "predicted" label
y_hat = tf.matmul(x, w)

Digging a little deeper, the x input is modified by a scaling parameter, which I set to 100000000000 in the Estimator. The learning rate doesn’t look sane either. Bingo!

x_ = np.random.random((10, 2)) * args.scale

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

As you probably knew all along, setting these hyperparameters to more reasonable values will fix the training issue.
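A back-of-the-envelope calculation shows why these values were doomed from the first step. With initial weights of 10 and true weights of 1, each prediction is off by 9 * (x1 + x2), and the inputs are uniform in [0, 1) multiplied by the scale. The sketch below derives the expected initial loss from those facts alone. The function name is hypothetical, and the comparison with the float32 limit is offered as one plausible contributor to the NaN values (Adam keeps a running average of squared gradients), not as a confirmed diagnosis.

```python
def expected_initial_loss(scale, w_init=10.0, w_true=1.0):
    """Expected mean squared error at step 0 for inputs uniform in [0, scale)."""
    weight_error = w_init - w_true
    # For two independent uniform [0, 1) inputs: E[(x1 + x2)^2] = 1/6 + 1 = 7/6
    second_moment = 7.0 / 6.0
    return (weight_error ** 2) * second_moment * (scale ** 2)

FLOAT32_MAX = 3.4028235e38          # largest finite single-precision value
observed_gradient = 1.1508383e+23   # first gradient value saved by the demo job

for scale in (1.0, 100000000000.0):
    print(f"scale={scale:g}: expected initial loss ~ {expected_initial_loss(scale):.3g}")

# A squared gradient of this size cannot be represented in float32
print(observed_gradient ** 2 > FLOAT32_MAX)
```

At a scale of 100000000000 the estimate lands around 9.45e+23, in line with the first two loss values in the training log.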

Now Available!

We believe Amazon SageMaker Debugger will help you find and solve training issues more quickly, so it’s now your turn to go bug hunting.

Amazon SageMaker Debugger is available today in all commercial regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.