This article was originally published on PolicyHub (July 3, 2017).
The rapid development of information and communication technologies (ICT) is significantly changing our data landscape and influencing everything, from our daily lives to business, science and public governance.
Data explosion is the term that describes the contemporary state of data production. New (and usually networked) ICT devices – so-called new media, the internet of things – are just some of the factors that have contributed to the growing volume and variety of available data. Almost every person, company, organization or institution produces data on a daily basis, using computers, smartphones, smart TVs, self-driving cars and various other equipment. Some estimates suggest a 4,300% increase in annual data generation by 2020, meaning that data production will be 44 times greater in 2020 than it was in 2009.
In that sense, new technologies have created the possibility – and also the need – for more sophisticated manipulation and analysis of data. However, coping with data has become increasingly challenging. The new data reality poses many challenges for traditional approaches to empirical research and data analysis, making it clear that this ‘new reality’ cannot be met without new, technology-driven techniques. On the other hand, awareness of new technological capacities and opportunities is creating growing demand for more sophisticated forms of data usage, such as real-time analytics, automated data processing and decision-making through machine learning.
This gap between the opportunities that have emerged from the contemporary data and technology landscape, on the one hand, and older analytical techniques, on the other, has recently been filled by so-called data science.
What is Data Science?
Although the term data science is in widespread use, there is no full consensus on its definition, nor clarity in its meaning. Data science is often associated exclusively with big data, i.e. it is incorrectly perceived as an interdisciplinary professional field oriented solely towards manipulating and analysing big data. However, even a small amount of data can be the focus of data science, so the definition of the field cannot be limited to big data alone.
Data science is a relatively young and emerging field that can, in the simplest way, be defined as “the science of extracting knowledge from data”[i] or an interdisciplinary field oriented toward making useful and applicable insights from an amount of data, using scientific methods as well as advanced information technology and techniques that can support highly sophisticated data processing and analysis. In a more technical way, data science can be defined by the OSEMN model, i.e. “according to the following five steps: (1) obtaining data, (2) scrubbing data, (3) exploring data, (4) modelling data, and (5) interpreting data.”[ii]
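The five OSEMN steps can be sketched end to end in a few lines of code. The following is a minimal, hypothetical illustration using only Python's standard library; the records and the "model" (flagging above-average values) are invented for demonstration, not a real pipeline.

```python
import statistics

# 1. Obtain: in practice, pull from a file, API or database; here, inline records.
raw = [
    {"city": "A", "incidents": "12"},
    {"city": "B", "incidents": ""},   # missing value
    {"city": "C", "incidents": "7"},
]

# 2. Scrub: drop records with missing values and cast strings to integers.
clean = [
    {"city": r["city"], "incidents": int(r["incidents"])}
    for r in raw if r["incidents"].strip()
]

# 3. Explore: compute a simple summary statistic.
counts = [r["incidents"] for r in clean]
threshold = statistics.mean(counts)

# 4. Model: a trivial 'model' that flags cities above the mean.
flagged = [r["city"] for r in clean if r["incidents"] > threshold]

# 5. Interpret: turn the result into a statement a decision-maker can use.
print(f"Cities above the average of {threshold}: {flagged}")
```

Real projects replace each step with far heavier machinery (database queries, cleaning frameworks, statistical models), but the sequence of concerns stays the same.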
Therefore, data science is often understood as the intersection of advanced statistics, computer science and particular domain expertise (see Picture 1).
Picture 1: Data science in the Venn diagram
Source: Management Circle[iii]
Some authors, such as Ben Lorica and Michael Li, claim that there is a distinction between data science for humans and data science for machines. In the first case, the focus is placed on getting insights from complex data sets by applying technological tools and solutions and the “ultimate decision maker and consumer of the analysis here is another human”. In the second case, the focus is put on finding machine-based, automated, solutions for data processing, modelling and decision-making with machine learning and powerful algorithms. In this case, “the ultimate decision maker and consumer of the analysis is a computer”. However, in both cases machine tools are used to make data meaningful and/or applicable.
Data science is a relatively new field, yet it is increasingly present in both the business sector and academia. As a result, there is a lot of misunderstanding and confusion about the distinction between data science, on the one hand, and empirical research and traditional data analysis, on the other. The most common question is: what is the difference between data science and statistics?
First of all, data science is broader and, as explained earlier, an interdisciplinary field that has “borrowed from statistics, machine learning and database management to create a whole new set of tools for those working with data”. The essential difference is that “data science emphasizes the data problems of the 21st century, like accessing information from large databases, writing code to manipulate data, and visualizing data”. In other words, data science seeks technology-supported ways to process and manipulate complex data – especially ‘big data’ – and to accelerate, simplify and even automate these processes. Unlike a traditional researcher, statistician or data analyst, a data scientist applies knowledge of methodology, mathematics and analytical skills through programming languages (e.g. Python, R, SQL) and algorithms, as well as other computer tools (e.g. visualization software such as Tableau), to make complex data insightful and applicable to solving problems. Because data science requires a wide spectrum of knowledge, skills and, very often, specific domain expertise, data science tasks are usually carried out by teams of professionals with different skills and backgrounds.
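"Writing code to manipulate data" often means nothing more exotic than programmatic tallying and reshaping that would be tedious by hand. A tiny, hypothetical example in Python, using only the standard library (the survey responses are invented):

```python
from collections import Counter

# Hypothetical raw survey responses, as they might arrive from a form export.
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes"]

# Tally responses per category in one line.
tally = Counter(responses)

# Order categories from most to least frequent.
print(tally.most_common())
```

At scale, the same idea is expressed with SQL `GROUP BY` queries or dataframe libraries, but the programmatic habit of mind is identical.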
For the sake of illustration and better understanding, some key differences between the market research approach and the data science approach, where end goals are similar, are outlined in Table 1.
Table 1: Differences between traditional market research and data science
Source: Chris Martin[iv]
Although the field is still in the phase of articulation, data science expertise is in high demand, and IBM predicts that demand for data scientists will soar 28% by 2020. Harvard Business Review described the data scientist as the ‘sexiest’ job of the 21st century. According to Glassdoor’s job ranking – which takes into consideration the number of job openings, the job satisfaction rating, and the median annual base salary – the data scientist profession is at the very top of the scale in the United States in 2017, making it the country’s best job for the second year in a row. However, there is a shortage of practitioners. For example, according to the McKinsey Global Institute’s estimate, “The United States will face a shortage of up to 190,000 data scientists with advanced training in statistics and machine learning” by 2018.
Data Science in Policy Making
In recent decades, there has been a growing awareness that data “can reduce uncertainty about the best course of action” in policy design, i.e. that it can inform a better policy making process and lead to more adequate, more efficient and more effective public policies. Therefore, policymakers and policy advocates often tend to provide data-based arguments for particular policy solutions, usually gained through sound empirical research or analysis on that topic.
In that sense, datafication – described in the words of Daniel Diermeier, former dean at the Chicago Harris School of Public Policy, as “the ability to transform nontraditional information sources such as text, images, and transactional records into data” – has created the opportunity for policy makers to have deeper, data-driven insights into respective issues and “allowed quantitative analysis to penetrate the policy process more deeply than ever before”. And this technological reality has generated the opportunity and/or necessity for a more complex, more sophisticated and technology-driven approach to transforming data into policy action that goes beyond traditional empirical research – data science.
Another important trend – political, rather than technical – that allows data science to penetrate the public policy sphere is the opening of government data. Namely, the growing demand for more transparent, accountable and responsive government – coming from citizens as well as from initiatives such as the Open Government Partnership – is resulting in more and more governments deciding to open up and make their data accessible. Illustratively, until the former president of the United States of America, Barack Obama, “launched his ‘Digital Government’ directive in 2012, data science played a minor role in constructing governmental policies” considering that data was “relatively inaccessible for both governmental staff and the public”. However, the US government has opened the door to its ‘big data’ – with 194,263 datasets at the moment of writing – by launching data.gov and, through that, allowing stakeholders to access, analyse or use the government’s data for various purposes. Once the data is open and available, the opportunities to apply it in different spheres, from business to policy monitoring and analysis, become endless.
However, there is a growing awareness that data science can provide “new sources of evidence for policy-making”. In that sense, possibilities and opportunities for the application of data science solutions in public policy are increasingly gaining interest. For example, Data for Policy, an independent initiative launched in 2015, seeks to debate the “theory and applications of Data Science as relevant to governments and policy research”, by gathering “academic institutions, government departments, international agencies, non-profit institutions, and businesses”. A central part of this initiative is an international annual conference which covers topics relevant to this domain.[v]
Using technological solutions and data science for providing policy measures is becoming especially popular among city governments in the US. In that sense, American cities “have started to use the ever-increasing amounts of data they collect to improve planning, offer better services and engage citizens”.
Thus, for example, the city government of San Francisco has employed a form of the data science approach to reduce frequent traffic collisions in the city. Due to frequent traffic accidents, resulting in a number of deaths every year, the Department of Public Health and the Department of Transportation were tasked by the government with developing an adequate policy to address this issue. They decided to seek solutions in a data-driven and technologically sophisticated way. The first step was to establish a mechanism for continuous mapping and visualization of traffic-related incidents across the city through the TransBase online software platform. Second, based on the gathered data, the Vision Zero High Injury Network was developed to identify where the main problems occurred and to provide insights into what kind of policy actions the government should undertake. They found that “just 12 percent of intersections result in 70% of major injuries”. Finally, insights gained through this process were transformed into policy solutions, introducing ‘protected intersections’, underway intersections and protected bike lanes as some of the measures.[vi]
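The "12 percent of intersections, 70% of major injuries" finding is a classic concentration (Pareto-style) analysis: sort locations by injury count and ask what share of the total the worst few account for. The sketch below illustrates the idea on invented numbers; the data and the 30% cut-off are hypothetical, not San Francisco's figures.

```python
# Hypothetical injury counts per intersection, sorted in descending order.
injuries_per_intersection = [40, 35, 30, 2, 1, 1, 1, 0, 0, 0]

total = sum(injuries_per_intersection)

# Take the top 30% of intersections (at least one).
top_n = max(1, round(0.30 * len(injuries_per_intersection)))

# Share of all injuries concentrated in that top slice.
share = sum(injuries_per_intersection[:top_n]) / total

print(f"Top {top_n} intersections account for {share:.0%} of injuries")
```

When injuries are this concentrated, the policy implication is immediate: targeted redesign of a small set of locations can address most of the harm.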
Picture 2: Screenshot of Vision Zero High Injury Network
Source: Abhi Nemani[vii]
More and more cities across the globe are being equipped with different technological solutions (e.g. phone apps that connect citizens with public services, various types of sensors and similar), provided by governments and other stakeholders, which improve citizens’ well-being and the general functioning of cities. However, very often, government officials and decision-makers are not aware of the full potential of the data generated through these solutions, i.e. they lack the knowledge, skills, technical expertise and/or technological infrastructure to gain useful insights from these sources. In other words, despite the ‘technologization’ of the cities’ governance, the link that is missing in order to extract relevant policy insights from the amount of collected data and maximise the effectiveness of the implemented ICT solutions is data science. Furthermore, “cities need to play a more active role as brokers of urban data”, by becoming “the guardians of the local data ecosystem” and encouraging citizens and relevant actors to share data by protecting privacy and ensuring the transparency of algorithms.
The importance and potential of data science in governance and policy making is also becoming recognized by academic institutions. In that sense, university degree programmes in this field are being established in order to provide students and the community with knowledge and skills that are in step with the technological developments. The University of Chicago is among the pioneers in this domain, establishing education and research programmes in the field. Thus, for example, the Master’s Degree programme in Computational Analysis and Public Policy at the university combines “a traditional public policy curriculum with computer science training – including topics such as programming, databases, and machine learning”, providing students “with the hard and soft skills needed to fill the talent gap in public-sector data analytics.” Within the same university, the Center for Data Science and Public Policy was established with a mission to support the education of professionals who are able to apply data science knowledge to a policy context, to facilitate research in this domain and to develop “open-source tools that provide non-profits and government organizations a starting point to use data science.”
Data Science and Non-government Actors in Public Policy
Although still in its infancy, it is certain that such technology-influenced trends will shape the future of public governance and policy making to a great extent. Governments are becoming increasingly aware of and adaptable to these trends and data science is starting to play an important role in public policy.
Additionally, there is a growing awareness that such trends should be taken up by civil society in order to ensure more democratic governance, accountability and influence on policy making by non-institutional actors. Accordingly, initiatives such as the Civic Analytics Network and Data Science for Social Good promote the application of data science to solving real-world community problems by providing the necessary tools (or tech infrastructure), educating professionals and empowering civil society to participate in policy making through technologies and data science solutions. In other words, such initiatives are trying to help the community to overcome the technological and knowledge barriers to participation in policy making in the contemporary data environment.
The new data landscape brings new opportunities not only to governments but also to other stakeholders involved in policy processes. While governments are already making the first steps towards incorporating data science into governance, non-government actors will have to take this trend more seriously in the upcoming period and adapt their work and capacities so that they can respond to policy issues through analysis and by putting complex data sets into action. In this regard, every actor will have to give more attention to the strengthening of data science capacity, from civic movements to think tanks. This ultimately means that data scientists will play an increasingly important role, along with researchers and analysts, in the analysis of public policies and the formulation of policy demands.
[v] The following topics were covered in the past two years: “Policy-making in the Big Data Era: Opportunities and Challenges” (2015) and “Frontiers of Data Science for Government: Ideas, Practices, and Projections” (2016). The third annual conference, named “Government by Algorithm”, will be held in September 2017.
[vi] Abhi Nemani, “‘Data-Driven Policy’: San Francisco just showed us how it should work”, Medium (28.8.2016)
[vii] Abhi Nemani, “‘Data-Driven Policy’: San Francisco just showed us how it should work”, Medium (28.8.2016)