I’ve been working on building data infrastructure at Coursera for about 3.5 years. This week, I had an opportunity to speak at the Data Engineering in EdTech event at Udemy about our data infrastructure. To better suit readers, this article is adapted from my notes for the talk, in which I shared a few lessons from building a real-world data infrastructure from scratch.

At Coursera, data plays a big role in our daily work, from overall aggregated numbers (e.g., 27M learners on the platform) to specific zoomed-in metrics (e.g., 45% of our learners are from emerging markets). All these numbers help us make data-driven decisions every day. Empowering people across the company with easy access to our data has always been the first priority of the data infrastructure team, and it is definitely not an easy task.

Challenge 1: Data is everywhere

Like at every other internet company, data is everywhere at Coursera. When I joined Coursera, we had started rebuilding our web application and building our mobile applications. Data was tracked through multiple channels in a few inconsistent ways: some data was tracked in an unstructured format and sent directly to our eventing system, some was tracked in our MySQL or Cassandra databases, and some lived only in third-party tools like SurveyMonkey. I even heard a story of someone manually logging into each database to calculate the daily active learner metric. Luckily I didn’t need to do any of that, as we had just started building our enterprise data warehouse (EDW). Though there were merely a few dozen tables in the EDW at the time, it was still a solid start.

We picked Redshift as our EDW system. Besides the standard SQL interface that every data scientist understands, Redshift is fast and, more importantly, reliable. We only had three engineers at that time; with only a few clicks, we could reboot, resize, and perform other operations on Redshift through its console. Unlike Hadoop or Spark at that time (I hear that both Hadoop and Spark have gotten much better now, and we will look into them again when Redshift is no longer sufficient for us), we rarely needed to debug issues (e.g., OOM); Redshift could reliably execute most of our queries without any memory or performance optimization on the query.

We tried Hadoop and Spark at the time; compared with Redshift, they carried a huge operational cost that we couldn’t afford as a team. Thanks to Redshift, we now focus on building tools that have a direct impact on our business instead of spending time on operational tasks. Redshift has served us well over the past four years, and we haven’t looked back yet.

Solution: build an EDW system to keep all your data in one place (latency is OK in most cases)

Challenge 2: Data requests come from everywhere.

Once we started moving data into Redshift, data requests flooded in from everywhere: engineers, data scientists, marketing, customer support, sales, and external users like universities, enterprise content providers, and customers. Everyone in the company wanted to understand our data in a more quantitative way, and we saw a variety of requests across a huge spectrum of domains.

To meet the demand while the team was small, our solution was very simple: we built an internal query page, which gives people the ability to write SQL queries, along with simple charting and basic sharing functions, all from a web page. This helped address a few issues:

Access centralization. Before this tool, people used all sorts of tools to access Redshift, which gave us a hard time, as any misuse of a tool could potentially bring down the EDW. For example, some tools don’t release the locks they hold on Redshift tables until disconnection, and if people forget to disconnect (which they often do), our ETL system breaks because it cannot write any new data into tables that are still locked by the lingering connection. On the other hand, because people always access the EDW through this tool, we can easily monitor and operate on people’s queries; e.g., if the EDW is hot and overwhelmed, we can limit access to the tool to throttle the jobs sent to the EDW until it cools down.

Democratization. Since this tool is built on the web, every query and execution result is saved in the tool as well. People can share any query or result by copying and pasting a URL. This allows anyone in the company to conduct a simple ad hoc analysis through the tool and share the result with others. Non-data-scientist roles especially love it: as long as they can access the internet, they can go to the query page and write queries to get the answers they want. This self-serve query system helps reduce the daily load on data scientists, allowing them to focus more on deep-dive analysis and less on day-to-day data inquiry support.

Solution: focus on building self-serve data access by providing a centralized access point; avoid the situation where your customers choose their own tools to access the EDW, because that is hard to debug and manage.
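As a rough sketch of the idea, the query page boils down to one HTTP endpoint that runs every query against the warehouse and saves each result under an id, so the result is shareable by URL. Everything below is hypothetical and stdlib-only: SQLite stands in for Redshift, and Python’s built-in HTTP server stands in for the real web app.

```python
import json
import sqlite3
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory "EDW" with a toy learners table.
edw = sqlite3.connect(":memory:", check_same_thread=False)
edw.execute("CREATE TABLE learners (country TEXT)")
edw.executemany("INSERT INTO learners VALUES (?)", [("IN",), ("US",), ("IN",)])

saved = {}  # query id -> result rows; this is what makes results shareable

class QueryPage(BaseHTTPRequestHandler):
    def _send(self, payload):
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):  # run a query and save the result under a new id
        sql = self.rfile.read(int(self.headers["Content-Length"])).decode()
        rows = edw.execute(sql).fetchall()
        qid = str(len(saved))
        saved[qid] = rows
        self._send({"id": qid, "rows": rows})

    def do_GET(self):  # GET /<id> replays a previously shared result
        self._send({"rows": saved[self.path.strip("/")]})

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), QueryPage)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

sql = b"SELECT country, COUNT(*) FROM learners GROUP BY country ORDER BY country"
result = json.loads(urllib.request.urlopen(urllib.request.Request(base, data=sql)).read())
shared = json.loads(urllib.request.urlopen("%s/%s" % (base, result["id"])).read())
print(shared["rows"])  # the same rows, fetched through the shareable URL
server.shutdown()
```

Because every query passes through this single choke point, monitoring, throttling, and saved-result sharing all fall out of the same design for free.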

Challenge 3: Everyone hates ETLs, everyone needs ETLs.

As we expanded our product lines and business, the demand for writing ETLs grew higher and higher.