Machine Learning

The exam for me (questions are random, so I don't know about other people's experience) was really heavy on Machine Learning topics. There were about 15 questions on this, mostly related to:

Overfit models

How to deal with a high RMSE, for example by making your model more complex and robust.

Neurons, features, epochs, labels.

L1 and L2 regularization
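
A quick way to internalize the difference: both penalties are added to the training loss, but L1 uses absolute values (pushing some weights to exactly zero, which acts as feature selection) while L2 uses squares (shrinking all weights smoothly). A minimal sketch; the weights and lambda below are made up:

```python
def l1_penalty(weights, lam=0.01):
    # L1 (lasso): sum of absolute weights; pushes weights to exactly zero
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=0.01):
    # L2 (ridge): sum of squared weights; shrinks weights smoothly
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]
# l1_penalty(weights) is about 0.04, l2_penalty(weights) about 0.065
```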

Dialogflow

Cloud AutoML to label some logos within an image.

Cloud Vision API, Speech to Text, etc.

TensorFlow models in C++

Cloud TPU and GPU

BigQuery

The exam was heavy on BigQuery questions too: questions related to updates in BigQuery, IAM roles, slots, storage, etc.

Update DML: how to use UPDATE DML statements in BigQuery, and how to handle a quota-exceeded error in your project. How many simultaneous updates can you run in a day? Best ways to update a table (for example, if you have a partitioned table).
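
The standard answer for the quota-exceeded scenario is to retry with exponential backoff. A generic sketch of that retry loop; the callable and its error are illustrative, not a real client API:

```python
import random
import time

def run_with_backoff(operation, max_retries=5, base_delay=1.0):
    """Retry `operation` with exponential backoff plus jitter.

    `operation` is any callable that raises on failure (for example a
    quota-exceeded error from a BigQuery DML job); names here are
    illustrative, not a real client API.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # wait base_delay * 2^attempt seconds, plus random jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```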

Authorized views: I recommend studying everything about authorized views, in particular how to give your data science team access to query results without exposing every column of a table.
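
The pattern, as a sketch (project, dataset, and column names below are hypothetical): create a view in a separate dataset that selects only the columns the team needs, authorize that dataset against the source dataset, and grant the team access to the view's dataset only:

```python
# Hypothetical names; the point is that the view exposes only a subset
# of columns, and the analysts get access to shared_views but never to
# private_data itself (the view's dataset is "authorized" on the source).
view_ddl = """
CREATE VIEW `my_project.shared_views.orders_limited` AS
SELECT order_id, order_date, total_amount  -- no sensitive columns
FROM `my_project.private_data.orders`
"""
```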

Allocated slots and available slots: what can you do if you don't have any slots left and you don't want to create a new project in your organization?

Storage: questions related to the best place to store raw data, for example choosing between BigQuery and Cloud Storage. The answer will depend on the context of the question and whether price or performance is the priority.

BigQuery Data Transfer Service and the connection available with BI tools.

Partition and clustering

HASH, MERGE, and data manipulation. There were 2 questions related to this.
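
For partitioning and clustering it helps to have the DDL shape in your head. A hedged sketch; the table and column names are invented:

```python
# Partition by a date derived from a TIMESTAMP column, then cluster by a
# frequently filtered column; queries filtering on event_time only scan
# the matching partitions, which cuts cost and improves performance.
ddl = """
CREATE TABLE `my_project.analytics.events`
PARTITION BY DATE(event_time)
CLUSTER BY user_id AS
SELECT * FROM `my_project.staging.raw_events`
"""
```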

Bigtable

I have no professional experience working with Bigtable, so my knowledge is mostly theoretical.

There were 2 questions related to row key performance and how you can update your cluster if performance is not optimal due to heavy reads or writes.
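
The typical row-key question is about avoiding hotspots: a key that starts with a timestamp sends all new writes to one node. A common fix is field promotion (lead with a well-distributed field) plus an optional reversed timestamp. A sketch with invented names:

```python
MAX_TS = 10**13  # illustrative ceiling for epoch-millisecond timestamps

def row_key(device_id, ts_millis):
    # Lead with a well-distributed field (field promotion) so writes
    # spread across nodes; a timestamp-first key would hotspot one node.
    # The reversed timestamp makes the newest row per device sort first.
    reversed_ts = MAX_TS - ts_millis
    return f"{device_id}#{reversed_ts:013d}"

# newer events sort before older ones for the same device
assert row_key("sensor-42", 2000) < row_key("sensor-42", 1000)
```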

How you can scale your cluster and synchronize the data.

Single-cluster routing and multi-cluster routing.

Key Visualizer Metrics

Spanner

Knowing about default indexes and secondary indexes is a must, as is knowing when to choose Spanner over Datastore, Bigtable, or Cloud SQL.
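
For the index questions it's worth remembering the secondary-index DDL, including STORING, which copies extra columns into the index so a query can be answered without a join back to the base table. The table and columns below follow the well-known Singers example from the docs:

```python
index_ddl = (
    "CREATE INDEX SingersByLastName ON Singers(LastName) "
    "STORING (FirstName)"  # covered query: no base-table lookup needed
)
```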

Regional configuration and replicas.

Monitoring CPU

Cloud SQL

I remember just one question about Cloud SQL. I recommend knowing how to export data from Cloud SQL to BigQuery, how to migrate on-premises databases to Cloud SQL, and how Cloud SQL HA and read replicas work.

Datastore

Again, the key here is knowing when to choose Datastore over other databases like Cloud SQL, Bigtable, BigQuery, etc.

How you can export data from Datastore to BigQuery.

Replicas across other projects.

Knowing about multiple indexes and the syntax to create composite indexes is going to be really helpful.

Dataflow

This topic was really technical, and if you don't have experience working with Dataflow it may be a little tricky.

How to discard erroneous data and, for example, send it to Pub/Sub or Cloud Storage.

Transforms, DoFn, side inputs, side outputs.
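
The erroneous-data question ties into side outputs: in Beam, a DoFn can emit bad records to a tagged side output that you then write to Cloud Storage or publish to Pub/Sub. A plain-Python sketch of the routing logic only; the real thing uses beam.pvalue.TaggedOutput, and the record format here is invented:

```python
def parse_record(raw):
    # expected format "user_id,amount"; anything else is erroneous
    user_id, amount = raw.split(",")
    return {"user_id": user_id, "amount": float(amount)}

def run_pipeline(records):
    main, dead_letter = [], []
    for raw in records:
        try:
            main.append(parse_record(raw))
        except (ValueError, TypeError):
            # in Beam: yield beam.pvalue.TaggedOutput("dead_letter", raw),
            # then write that PCollection to GCS or publish it to Pub/Sub
            dead_letter.append(raw)
    return main, dead_letter

good, bad = run_pipeline(["u1,9.99", "garbage", "u2,3.50"])
```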

IAM Roles for Developers and how to secure the data.

Windowing, all kinds of it. There were about 3 questions on sliding windows, session windows, and the best way to deal with late data.
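
Sliding windows are the easiest to get wrong because one element belongs to several overlapping windows at once. A toy sketch of window assignment, with seconds as plain ints and invented sizes:

```python
def sliding_windows(ts, size=60, period=30):
    """Return every [start, end) window that an event at time `ts`
    falls into, for windows of `size` seconds starting every `period`
    seconds (size > period means the windows overlap)."""
    # smallest window start s (a multiple of period) with s + size > ts
    first_start = ((ts - size) // period) * period + period
    return [(s, s + size) for s in range(first_start, ts + 1, period)]

print(sliding_windows(75))  # -> [(30, 90), (60, 120)]
```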

Bounded and unbounded data.

How to connect with Pub/Sub, BigQuery, BigTable, etc.

Pub/Sub

I recommend understanding really well the differences between push and pull subscriptions and what you need to implement a push solution.

This service is the glue between other cloud components, so there were some questions about Pub/Sub / Dataflow / BigQuery implementations.

Streaming, and how to implement a streaming solution with Dataflow.

Globally Unique Identifier (GUID)
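
Pub/Sub guarantees at-least-once delivery, so the same message can be delivered twice; the GUID question is about attaching your own unique ID at publish time so the subscriber (or Dataflow, which deduplicates by ID for you) can drop duplicates. An in-memory sketch; a real subscriber would keep the seen IDs in a persistent store:

```python
import uuid

def publish(payload):
    # attach a GUID at publish time (Pub/Sub itself is at-least-once)
    return {"id": str(uuid.uuid4()), "payload": payload}

class Subscriber:
    def __init__(self):
        self.seen = set()   # a real system would persist this
        self.processed = []

    def handle(self, message):
        if message["id"] in self.seen:
            return          # duplicate delivery: ack and drop
        self.seen.add(message["id"])
        self.processed.append(message["payload"])

msg = publish("order-123")
sub = Subscriber()
sub.handle(msg)
sub.handle(msg)  # redelivery of the same message is ignored
```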

Handle subscriber code errors

How to connect Kafka to Pub/Sub

How to know when your topic is currently not working well. This is mostly related to Stackdriver Monitoring.

Dataproc

This was pretty heavy on on-prem Hadoop implementations and how to migrate to GCP.

Migrate jobs to the cloud.

Which role the service account needs to work properly with Dataproc (Dataproc Worker).

SOCKS proxy and YARN for the web interface.

Custom images.

Use Cloud Storage instead of HDFS.

Always remember that Google recommends one cluster per task. If you need both analytics and transactional solutions with Dataproc, it's better to create two separate clusters for that kind of implementation.

IAM Roles

It's going to be useful to know the most important roles for every service.

Difference between the jobUser and user roles for BigQuery.

Dataproc Worker, Dataflow Developer.

Billing Administrator role.

Difference between the Writer and Reader roles for BigQuery.

Which roles can you administer for the Pub/Sub service?

Aggregated logs for multiple projects.

Dealing with roles across projects: what are the best practices for that? Creating a group of users for those projects? Using the resource hierarchy? Service accounts for Cloud Storage and BigQuery?

Cloud Storage

Here, it's a must to know the differences between the Cloud Storage classes. The exam questions were really tricky when choosing between Coldline and Nearline implementations.
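
A rule of thumb that decides most of those questions is access frequency: Nearline for data accessed roughly once a month, Coldline roughly once a quarter, matching their 30- and 90-day minimum storage durations, with Standard for hot data. A toy decision helper built only on those thresholds, ignoring pricing details:

```python
def storage_class(days_between_accesses):
    # thresholds mirror the classes' minimum storage durations:
    # Nearline ~30 days, Coldline ~90 days; hotter data stays Standard
    if days_between_accesses < 30:
        return "STANDARD"
    if days_between_accesses < 90:
        return "NEARLINE"
    return "COLDLINE"
```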

Most questions were related to how to secure raw data for audit.

Data Transfer vs Storage transfer service.

How can you stay in sync with on-prem storage if the on-prem environment doesn't allow connections from any outside IP?

Composer

This cloud component is quite easy to figure out. Remember that a Cloud Composer environment runs Airflow, and Airflow itself is an orchestration tool. So when you want to integrate Dataflow jobs with Dataproc jobs that depend on each other, the best solution is almost always going to be Cloud Composer.

Data Studio

Study the difference between Viewer credentials and Owner credentials if you want to share some dashboards.

Default caching and prefetch caching.

How to connect BigQuery with Data Studio and other services like YouTube.

Dataprep

There were some questions related to Dataprep. For example: if you want a fairly easy implementation to deal with outliers, what's the best tool for that? Transform recipes. And finally, how to schedule a Dataprep job: do you need Cloud Scheduler for that, or can you do it directly from the Dataprep UI?