Guest Column: A Reader's Guide to Policy Analysis with Big Data
By Paul and Kim Thomas
This is a short note about a BIG topic. The term "Big Data" has been used to describe the burgeoning practice of using massive amounts of data from various sources--including unstructured data collected for other purposes--to do statistical analyses. The readers of this note are, we presume, policy advisors and policy makers who are now encountering internal or published studies based on Big Data analysis.
Here are eight issues that should be considered when evaluating Big Data studies:
- Bigger is not always better. Analyses can be flawed for many reasons besides small samples: confirmation bias, overfitting, and omitted variables are common examples. Having more data will not magically fix these problems.
- Big Data studies often join several large data sets and therefore have an embarrassment of riches in terms of possible explanatory variables for analyses and forecasts. This invites overfitting, in which the analyst finds spurious associations in the data: ones that look good statistically but have happened only by chance and don't reflect any real causal connection. Such associations can confound analyses and forecasts alike (see the first sketch after this list).
- Advocates of Big Data methods sometimes argue that preconceived theories have no part in data analysis (or even in the scientific method!), and the data should be allowed to "speak for themselves." We would say that data may speak, but to understand them you need a skilled interpreter, someone who knows how to separate the whisper of truth about variable relationships from the cacophony of lies due to spurious correlations. In ideal research, theory and data must continually inform each other.
- Big Data studies often collate several data sets, each with its own set of measured variables. With a multitude of possible combinations of variables and branch points to try in models, it is tempting to use machine learning techniques to try them all. These techniques will discover many spurious relationships. To make matters worse, some commonly used statistical measures will no longer be valid: if our machine tries 1,000 models, about 50 will pass a conventional 5 percent significance test just by chance. There are, however, ways to correct the statistics and reject most or all of the spurious 50, and the policy analyst should require that these corrections be used (the second sketch after this list illustrates the arithmetic).
- Big Data projects analyze huge amounts of data using esoteric analytic techniques. The reader can be overwhelmed by all the data and all the manipulations, and the data scientist can easily lose track of the steps taken. To improve the quality of the work and allow critical judgment, certain questions need answers. Where did the data come from? What techniques were used to wrangle them into usable form, and how were common problems with data manipulation avoided? How were the models chosen and tested? Fortunately, data scientists have good tools available for documenting each step of Big Data procedures. Users of their studies should insist that the documentation be made available along with the results.
- Big Data studies can have a surprisingly small focus. Datasets called "micro-data" contain many observations of many variables pertaining to individual persons, families, companies, and so on. Two or more sets of micro-data can sometimes be combined if a common key can be found to ensure that data for an individual in one set are matched with data for the same individual in another. For instance, a study of how the amenities available in each locale shape educational attainment as families migrate from state to state would ideally track individual students across school lines. So-called "administrative data" such as names and home addresses, perhaps assembled from change-of-address records collected by the U.S. Postal Service, might be used to recognize that student "ABC" in school "I" is the same person as student "DEF" in school "J" (the third sketch after this list shows the mechanics of such a linkage). Some of the most useful Big Data studies from a policy perspective have used administrative data to assemble such detailed micro-data, and policy makers should encourage such studies.
- At the same time, policy makers must be sensitive to possible infringements of privacy associated with micro-data. Consider the school study mentioned above. There might be individuals in the dataset who are dealing with substance abuse issues. They might appear, for instance, as students whose grades falter, who then take leaves from school and show up later in the data, possibly at other schools. Suppose we know which schools were involved and the years during which these movements occurred. How hard would it be for a third party to deduce the affected students' names and addresses? (The final sketch after this list shows how easily small groups can be singled out.) Measures have to be taken to safeguard the anonymity of the data subjects, even though this may mean less transparency in the overall process.
- Good Big Data studies certainly exist. But Big Data is already big business, so we should beware of marketing hype. Claims for Big Data today often resemble the highly exaggerated claims made for Data Mining about 20 years ago, and for Artificial Intelligence methods before that. Good results are finally being achieved with those methods, though only partially and much later than promised.
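To make the overfitting danger in the second point concrete, here is a minimal sketch in Python. Every "explanatory" variable below is pure noise, generated independently of the outcome, yet screening enough of them will always turn up one that looks impressively correlated. The sample sizes and variable names are our own inventions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_obs, n_predictors = 100, 500

# Outcome and predictors are independent noise: any association
# we "find" below is spurious by construction.
y = rng.standard_normal(n_obs)
X = rng.standard_normal((n_obs, n_predictors))

# Correlation of each candidate predictor with the outcome.
correlations = np.array(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(n_predictors)]
)

best = np.argmax(np.abs(correlations))
print(f"Strongest 'predictor': column {best}, r = {correlations[best]:.2f}")
# With 500 noise predictors and 100 observations, the winning |r| is
# typically around 0.3 -- strong enough to tempt an unwary analyst.
```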
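The multiple-testing arithmetic in the fourth point, and one textbook correction, can likewise be simulated. In the sketch below every null hypothesis is true by construction, so each "significant" result is a false positive; a naive 5 percent test flags roughly 50 of 1,000 models, while the Bonferroni correction flags essentially none. The simulation setup is our own; Bonferroni is a standard method, though analysts often prefer less conservative false-discovery-rate procedures such as Benjamini-Hochberg.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_models, n_obs, alpha = 1000, 50, 0.05

# Each "model" tests a relationship that is truly absent, so every
# rejection below is a false positive.
p_values = np.array([
    stats.pearsonr(rng.standard_normal(n_obs), rng.standard_normal(n_obs))[1]
    for _ in range(n_models)
])

naive = (p_values < alpha).sum()
bonferroni = (p_values < alpha / n_models).sum()

print(f"Naive 5% test flags {naive} of {n_models} models")  # typically ~50
print(f"Bonferroni-corrected test flags {bonferroni}")      # typically 0
```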
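The record linkage described in the sixth point is, mechanically, a join on a common key. The sketch below links two hypothetical school records through a shared home address; the data, column names, and single-column key are all assumptions for illustration, since real linkage uses richer keys and fuzzy matching to handle typos and moves.

```python
import pandas as pd

# Hypothetical records: the same student appears under different
# names at two schools, but administrative data (here, the home
# address) provide a common key.
school_i = pd.DataFrame({
    "student": ["ABC"],
    "address": ["12 Elm St"],
    "gpa":     [3.4],
})
school_j = pd.DataFrame({
    "student": ["DEF"],
    "address": ["12 Elm St"],
    "gpa":     [2.1],
})

# Join on the key; suffixes distinguish each school's columns.
linked = school_i.merge(school_j, on="address", suffixes=("_i", "_j"))
print(linked)  # one row: students "ABC" and "DEF" are the same person
```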
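Finally, the re-identification worry in the seventh point can be screened for mechanically. A common first check is a k-anonymity count: how many records share each combination of quasi-identifiers, such as school and year of transfer? Any combination held by fewer than k records is a re-identification risk. The data below are invented, and a real privacy review would go much further (suppression, aggregation, or formal methods such as differential privacy).

```python
import pandas as pd

# Invented student-level micro-data.
records = pd.DataFrame({
    "school":    ["I", "I", "J", "J", "J"],
    "move_year": [2015, 2015, 2015, 2016, 2016],
})

# Count how many records share each quasi-identifier combination.
group_sizes = records.groupby(["school", "move_year"]).size()

k = 2  # records in groups smaller than k are easy to single out
print(group_sizes[group_sizes < k])
# ("J", 2015) appears only once: anyone who knows a student moved to
# school J in 2015 can pick that record out and read the rest of it.
```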
Those who contemplate using Big Data results should recognize their limitations but not despair. The insights gained from the best statistical studies are proving more important to our understanding of the world than particular quantitative results from analyses and forecasts. This was true before Big Data, and will continue to be true.
Paul and Kim Thomas are Principals and Co-founders of Economic Stories LLC. The opinions in this article are presented in the spirit of spurring discussion and reflect those of the authors and not necessarily the Treasurer, his office, or the state of California.