Open data will be the next leap forward

Every revolution starts as a grassroots movement. Driven by the democratization of data science skill sets, a proliferation of data, and trivial on-demand access to cloud compute and storage, the promises of the AI age are nearly within reach. In the age of automated machine learning, an increasingly simplified workflow means extracting predictive and prescriptive insights from vast volumes of information has never been easier, as long as you have the assets.

However, for small businesses and private citizens, the process of accessing, collecting, cleaning, and analyzing data is far from trivial. While enterprise firms can lean on platforms such as Amazon Mechanical Turk to handle the dirty data cleaning and labelling work that large data science projects require, individual developers are often left at the mercy of the API purchase screen if they hope to access the aggregates of data monetized by enterprise aggregators.

AI in the modern workplace

Since the first breakthroughs in data-driven statistics in the 1990s, the analytics skill set has become indispensable, commanding a premium from employers and students alike. Simply adding “Data” to a job title seems to qualify new hires for low-six-figure salaries, and the Master of Data Science has become one of the more coveted degrees coming out of business schools, preparing graduates for careers in machine learning, analytics, and other industry applications.

Less conventional roads to a data science career are becoming more common. Class Central now lists over 20 basic “introductory data science” courses, each containing approximately 20 hours of content. This doesn’t include specializations from institutions like Johns Hopkins University, or coverage of tangential skill sets like data visualization and data engineering.

Google-owned Kaggle now boasts an impressive 3 million data scientists competing to solve prediction and regression problems, and Grand Masters (the highest rank, awarded over a series of tournaments) are often inundated with requests from recruiters for both internal and external company positions. The community hosts its own coursework as well as a jobs board, where members of Kaggle can be recruited for industry applications ranging from tech to healthcare and finance.

Even Amazon has jumped onto the re-education bandwagon, pledging $700 million in a July press release to upskill 100,000 employees with technical skill sets by 2025. In addition to covering university-level programs and upskilling efforts for software developers through the launch of programs like MLU, the initiative funds Associate2Tech and Growing Career Choice, programs directed specifically at developing Amazon fulfillment centre (FC) employees by funding tuition or offering IT apprenticeships. The commitment to employee upskilling was driven by Amazon’s review of its own hiring data, which revealed staggering growth of 832% and 505% respectively for “data mapping specialist” and “data scientist” positions over the previous five years.

Missing pieces of the data puzzle

The most remarkable feature of this new workforce is that, unlike in previous waves of highly skilled work, a university degree is not a necessary prerequisite. Increasingly, employers want to see a history of production via portfolios, community involvement, and open source projects: the best evidence of ability is a demonstrated history of performance. Data scientists and developers who participate broadly in open source work are more likely to have the teamwork and communication skills that help employees thrive in a collaborative workplace and form the core foundation of any strong engineering organization.

This flexibility on the part of employers is likely driven, at least in part, by an acute shortage of big data skill sets. A 2018 LinkedIn survey found a shortage of nearly 151,000 data analytics specialists in the US, with acute shortages in both eastern and western population centres. Like any other specialty, data science is not a skill set that can be acquired by theory alone: practitioners require a marriage of theory and deep hands-on experience, and coursework won’t solve for the latter, whether tuition costs students $40 or $40,000.

Automation and a wider breadth of access to data may hold the key to the problem. “Making data science products easier for citizen data scientists to use will increase vendors’ reach across the enterprise as well as help overcome the skills gap,” said Alexander Linden, research vice president at Gartner. “The key to simplicity is the automation of tasks that are repetitive, manual intensive and don’t require deep data science expertise.”

The “citizen data scientist”, a person who “creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics”, seems to hold the key to scaling data science. The questions faced by many business lines today are less about the latest innovations in federated or transfer learning, or the intricacies of building fast GNNs. Rather, for a majority of applications, a focus on simpler use cases lets a business extract maximum value from its own data by supporting decision makers and line staff in day-to-day tasks. AI is moving towards a commodity model of consumption, in which the role of applications is to support domain experts in their daily work.

Open data for a more human future

A 2017 Gartner prediction, forecasting automation of a significant proportion of machine learning tasks, has in many ways come true. Innovative firms like DataRobot and H2O.ai are leading the charge, bringing scalable machine learning to a variety of industry lines. The promise of these firms is to democratize the AI skill set by letting employees across organizational functions leverage predictive and prescriptive modelling, freeing data scientists to focus on the truly challenging technical problems that only they can solve.

Of course, solutions like these do little for ordinary business owners not working as cogs in large organizations, who lack not only the skill set but also the data. Data, collected by effectively free services tracking what people search, share, and buy, builds $25B technological empires in Silicon Valley. Industrial giants like Siemens and GE increasingly market themselves as data companies, and financial services firms are working towards platform approaches by productizing data assets in the form of APIs.

A grassroots movement towards open data, a view of data as necessary public infrastructure, has been gaining ground in policy as well as practice. Public-good projects like the W3C, along with cities and government organizations, have long abided by a standard of making internal data available via portals, so that the insights it holds are accessible to the public. The movement is largely reminiscent of the open source revolution, which reshaped the way software is created by introducing the concept of a common code base maintained collaboratively by the general public. By open sourcing this infrastructure, software development solved for the duplication of labour among developers repeatedly rebuilding commonly used components. Thanks to that shared infrastructure, the nascent industry ballooned in size, permeating every part of modern society.
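In practice, many of these portals expose datasets as simple CSV downloads or JSON APIs. A minimal sketch of consuming such a dataset, using only Python's standard library; the portal URL and column names (neighbourhood, permits_issued) here are hypothetical, stand-ins for whatever schema a real portal documents:

```python
import csv
import io

def summarize_permits(csv_text):
    """Tally permits per neighbourhood from an open-data CSV payload.

    The schema (neighbourhood, permits_issued) is illustrative; real
    portals publish their own column definitions.
    """
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["neighbourhood"]
        totals[name] = totals.get(name, 0) + int(row["permits_issued"])
    return totals

# In practice the payload would be fetched from a portal, e.g.:
#   urllib.request.urlopen("https://data.example.gov/permits.csv")
sample = "neighbourhood,permits_issued\nDowntown,12\nRiverside,7\nDowntown,3\n"
print(summarize_permits(sample))  # {'Downtown': 15, 'Riverside': 7}
```

The point is less the code than the barrier to entry: when a city publishes data in a documented, machine-readable form, a single developer can extract insight without negotiating an enterprise API contract.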

Open data promises to do the same, not only by enabling access to data that is commonly aggregated by those seeking insight, but by unifying the discipline under a common set of interoperable standards, allowing researchers to focus on the problems that remain to be solved.

Firm.ai, a project focused on aggregating open source AI-BI applications, is just one instance of progress towards this goal. The project, operated by Derek Snow out of Auckland, NZ, has amassed over 500 entries to date and has made significant headway in cataloguing the huge variety of data collection, aggregation, analysis, and delivery solutions (free and paid) available to consumers. Firm.ai envisions a future of open data access and the facilitation of small-to-medium enterprise automation.

The open data science approach of Firm.ai may be revolutionary in the U.S., where organizations like Google, Refinitiv, and Bloomberg have been competing over the title of “creator of the data standard”, but China and Korea have both made significant headway in reorganizing society to support broad AI flourishing. Government endorsement of open data programs (by sponsoring data producers to make data openly available), standardization of data structures across business and industry lines for better interoperability, and a commons approach to the tools of the data science trade are quickly positioning China as the leader of the Industry 4.0 pack, with innovation hubs springing up along the Pacific coast from Shenzhen to Shenyang.

To stay competitive, the U.S. will need to reconsider its approach to data access and ownership. The promise of the fourth industrial revolution is a high-stakes bet that technology can augment the human experience, supporting knowledge workers, executives, and even common social functions by letting people do what they are best at: innovating, creating, and discovering. By automating the banal and accounting for the trivial, people can focus on the orthogonal thinking that built the varied social landscape we enjoy today. Whether this promise is kept, or whether the spoils are funnelled up to create an underclass beneath the API, remains to be seen. But one thing is certain: like any significant shift undergone by society at large, if this one is to happen, it will happen bottom up.