I recently came across an interesting thread on Quora entitled “What is a data scientist?” The thread had 85 responses, and while no consistent definition emerged, perhaps the most popular was a variation on this tweet by Josh Wills:
After reading all the replies, as well as some of the linked articles and blog posts on the thread, it became clear that the term data scientist is one that should be retired immediately for being (a) wrong and (b) redundant. This statement may strike some readers as an extreme position, but I think my critique is valid. Indeed, let’s consider each part of my conclusion specifically, and let’s start that analysis with a definition of the word scientist.
To me, a scientist is someone who (a) employs the scientific method to (b) advance a body of knowledge. In general, scientists do three things, usually (but not always) in this order:
- Theorize: create hypotheses about natural phenomena
- Test: create experiments to prove or disprove a theory
- Discover: find things that add to a body of knowledge
To illustrate this definition, imagine that Jane, a scientist at NASA, theorizes that there is life on Planet X. She then gets engineers to design and build a ship that can travel to Planet X. Within the ship is an astronaut who will look for signs of life. One day, NASA launches the ship and does indeed find life on Planet X, just as Jane theorized. For me, someone is only a scientist if she engages in all three parts of the process described above. Why is that the case? Well, someone who explains why there may be life on Planet X but does not test that idea is a theorist. There have been many brilliant theorists, but coming up with a great theory that you never put to the test does not make you a scientist. Second, someone who only builds the spaceship, no matter how special the design or construction, is an engineer. Again, history is full of great and immensely gifted engineers, but that does not mean they were scientists. Lastly, the astronaut who finds the signs of life is also not a scientist, since his role is to be a (perhaps very brave) field technician who executes the last step in the process that Jane designed.
The example above illustrates the first part of my definition. But my definition has two parts, and the second is equally important. A scientist uses the scientific method not just to “find interesting things” or to “prove a point” but to advance a body of knowledge. Any given scientist is always building on what others have done in the past and (hopefully) laying the foundation for what others will do in the future. Science builds and expands human understanding of a field in a continuous, nonlinear but nonetheless systematic way. Every real scientist I know (and I know several) understands not just his work but how it fits into what has come before and what may come in the future. None of them conducts work in isolation; all of them, and their efforts, are consciously part of intellectual sequences that can trace their origins back hundreds (in the case of biologists, for example), or even thousands (as is the case with some mathematicians and physicists), of years. Likewise, these same people can project decades and even centuries into the future, hypothesizing about how ideas in their fields might evolve as new data and techniques emerge.
With that background in mind, reading through the Quora posts was amusing, since almost all of what goes by the name of “data science” today is not science at all. Most of the work does not follow the scientific method and most has no backward or forward connection to any given body of knowledge. Moreover, while the mechanics of data analysis have improved dramatically with the advent of modern computing, the quality of the outputs has not yet significantly raised the bar on the best work done by the greatest “analog” data analysts. For example, anyone who has read the books of Edward Tufte knows the amazing chart made by Charles Joseph Minard, reproduced below:
Minard made this famous “viz” in 1869, and it brilliantly depicts Napoleon’s disastrous 1812 campaign into Russia. Minard was not a scientist, and great visualizations can, and often do, come from non-scientists. Which is another way of saying that there is no need to add the label “science” to great achievements in data analysis or visualization. To do so confuses the subject and suggests that data analysis is somehow not worthy on its own to be taken seriously as a field.
At the start of this post I wrote that the term data scientist has two flaws: that it is wrong (which I have tried to prove above) and that it is redundant. Turning to my second critique, I would note that all good science relies on data to complete the scientific cycle noted above. To call anyone a “data scientist” is as ridiculous as calling someone a “water sailor” or a “wood carpenter.” This redundancy is not just pointless; it is dangerous, since it implies that the new breed of data analysts (and analyst is the right term for most people calling themselves “data scientists” these days) is involved in something more rigorous than is really the case for the most part. Indeed, as Drew Conway put it in his post on the Quora thread:
The term “data science” is a misnomer with respect to what most people consider endeavors classified as such. Fundamentally, “science” is about formalizing a hypothesis given a reasonable set of observations and assumptions, designing an experiment around that hypothesis, testing it and analyzing the data generated through that process to either confirm or falsify the hypothesis. Therefore, “data” is simply a natural byproduct of science. Very (very) rarely are things labeled as data science actually scientific.
In the end, what is important to note is that science does not exist without data; however, the presence of data does not necessarily prove the presence of science. It should be enough for the people doing the best work in statistics and data visualization to label themselves theorists, engineers, analysts or even (and this is a merited term in some cases) artists. There is nothing wrong with those terms. They do not need to borrow the label of scientist to justify their value or contribution to our understanding of the world. Indeed, Minard’s graphic above is, to this day, considered by many experts the single greatest data visualization ever created.
Minard was a civil engineer.