Data Science Programs Need More Emphasis on Data and Scientific Thinking, Less on Math/Stat/CS

At last Friday’s workshop we ran out of time to fully explore the question: “[Is] data science just statistics?”

In my view, “data science” is not just statistics, but rather, at its core, it is more-or-less exactly what its name suggests: scientific thinking applied at the scale of modern data-oriented problems.

Broadly speaking, scientific thinking is about building models to explain what has been observed and testing those models against future observations. If data science sounds a lot like statistics, that’s because statistical thinking - attending to variability and uncertainty when collecting and analyzing data - has become a critical component of the modern scientific thought process. But as a retired statistician once told me, “Nobody does a science project because they want to learn how to do a t-test.”

Modern statistical and computational tools allow us to build, test, and communicate models in ways that were impossible even 50 years ago. Data scientists do need to know how to use these tools properly and understand how they work. However, it would be a mistake to focus on teaching students the tools of data science without helping them develop their scientific thinking skills and intuition about data. Especially in an age when AI can write better Python code than I can.

Last semester, I assigned students a project in which the goal was to build a small neural network to predict whether a patient presenting at a clinic with snakebite would be discharged from the clinic or die before being discharged. Not one of the students mentioned the massive survivorship bias in the data - the people who had arrived at the clinic 24 or more hours after being bitten were, by definition, the ones who had already survived a snakebite for at least a day without medical attention.

I shouldn’t blame the students for that oversight; they were being graded primarily on their ability to fit the network and explain mathematically how it was being fit, and that’s where they focused their efforts. In fact, I myself didn’t catch the issue until I started writing my solutions.

The point is that by placing most of the grade reward on technical knowledge and skill, I lost an opportunity to direct their attention toward what actually mattered - determining whether the model they fit would have any prayer of being clinically usable. That question cannot be answered solely by looking at the F1 score or AUC or whatever fancy accuracy metric you want to use; it can only be answered by thinking critically about how the model would be used “in production” and whether the survivorship bias would mislead clinicians into making poor medical decisions about future patients.
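The survivorship effect here is easy to reproduce in a toy simulation (every number below is invented for illustration - this is not the actual clinic data): if severe, untreated bites tend to kill quickly, then victims who arrive at the clinic late are, by construction, disproportionately mild cases, and the dataset makes delay look harmless.

```python
import random

random.seed(0)

# Toy model: each bite has a latent severity in [0, 1], and victims take
# anywhere from 0 to 72 hours to reach the clinic (both hypothetical).
population = []
for _ in range(100_000):
    severity = random.random()                 # 0 = mild, 1 = severe
    hours_to_clinic = random.uniform(0, 72)
    # Assumption for illustration: severe untreated bites (severity > 0.7)
    # are fatal if the victim takes more than 24 hours to reach care.
    died_en_route = severity > 0.7 and hours_to_clinic > 24
    population.append((severity, hours_to_clinic, died_en_route))

# The clinic's dataset only contains people who arrived alive.
clinic = [(s, h) for s, h, dead in population if not dead]

early = [s for s, h in clinic if h <= 24]
late = [s for s, h in clinic if h > 24]

print(f"mean severity, early arrivals: {sum(early) / len(early):.2f}")
print(f"mean severity, late arrivals:  {sum(late) / len(late):.2f}")
# Late arrivals look *less* severe - not because delay is protective, but
# because the severe late cases never made it into the data. A model fit
# to this dataset would happily learn that long delays predict survival.
```

A classifier trained on `clinic` can post an excellent F1 or AUC on held-out clinic data while still being dangerously wrong about the next bitten patient deciding whether to rush to care.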

We need to spend more time helping students pose more precise questions, develop better intuition about when and why something doesn’t look right, and think more critically about how and why their models would be used, and less time teaching them about the shiniest new data science tool. I don’t have all the answers for how to do this, but I’m noticing where I fall short and thinking about how to do better in the future.


These are such great points, Dwight! And I love that you had a “high stakes” assignment for the students to tackle in the sense that they were dealing with a serious medical issue rather than contrived questions about things like marbles in urns and call center wait times (I’m trying to remember some of my stats courses and those premises never motivated me to learn anything haha).

I agree with your point about infusing context into the statistical thinking that underlies data science. When I think about my statistics courses, most of what I remember is procedural (how to find a p-value, how to compute moments of a random variable, etc.). Those were important tools for me to learn! But I was rarely challenged to think very hard about where the data came from, who collected it, when it was collected, or what any of my findings might mean through an ethical lens. Looking back, I wasn’t really asked to do those things until I started doing research projects. So I think one great aspect of data science in the classroom is giving lots of students something like a research experience where they have to think critically about the context of the problems, not just apply procedural skills. Curious what others think!


I really appreciate this perspective.

I’m increasingly convinced that the distinction we sometimes draw between “statistics” and “data science” is less about content and more about emphasis, and that emphasis has real implications for how we prepare students for the workforce.

From a higher education standpoint, this isn’t about replacing statistics with data science. It’s about integration. How do we design curricula so that technical depth and contextual judgment grow together on purpose, rather than one after the other or by chance?

As faculty, we play a central role in that design. Through the problems we assign, the criteria we assess, and the feedback we provide, we implicitly define what “doing good work” means in our field. Students take their cues from what we reward. If our goal is to prepare them for the workforce, then our courses should reflect the environments they are likely to enter: interdisciplinary, high-stakes, data-rich, and often ambiguous.

Expanding our notion of rigor is part of that work. Rigor need not be confined to mathematical correctness or procedural precision. It can also include disciplined reasoning about data, modeling assumptions, incentives, and consequences.

Interesting posts. I agree that just applying statistical methods/models to a dataset is not sufficient. Students need to learn that merely reporting a result (for example, a t-statistic or p-value) is neither adequate nor meaningful; the results have to be interpreted correctly in the context of the problem. This is particularly true for statistics courses in other disciplines like public health or business, where students can see how statistical thinking drives decision-making and why it is relevant across different contexts and application areas. Understanding the importance of checking model assumptions and evaluating a model before deploying it is necessary to ensure that the inferences and conclusions drawn from the analyses are valid and accurate. Hence, developing that mindset of statistical thinking among students at an early stage is critical.


I agree, Sinjini! I think teaching data science “should” be easier in disciplines like business and public health, where you can presume that students have some background knowledge or interest in how the data was collected and how the data is to be used. (Whether that presumption is anywhere near accurate is another story…)

Another example is that we tend to just teach “off the shelf” metrics like RMSE, AUC, F1-score, etc. without having students think about the “cost” associated with false positives/false negatives/overpredictions/underpredictions. But in business, for example, there are serious and asymmetric real-world consequences for overpredictions vs. underpredictions.
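A small sketch of that asymmetry (all dollar figures are made up for illustration): two forecasts can have identical RMSE while one is far more expensive, because a symmetric metric can’t see that under-predicting (a lost sale) costs more than over-predicting (excess inventory).

```python
# Hypothetical costs: an under-predicted unit loses a $50 sale, while an
# over-predicted unit only incurs $5 of excess inventory.
COST_UNDER = 50  # dollars per unit of under-prediction
COST_OVER = 5    # dollars per unit of over-prediction

def business_cost(actual, predicted):
    """Asymmetric dollar cost of forecast errors (unlike symmetric RMSE)."""
    total = 0
    for a, p in zip(actual, predicted):
        err = p - a
        total += COST_OVER * err if err > 0 else COST_UNDER * (-err)
    return total

actual = [100, 120, 90, 110]
model_a = [95, 115, 85, 105]   # always 5 units under
model_b = [105, 125, 95, 115]  # always 5 units over

# Both models miss by exactly 5 units everywhere, so their RMSE is
# identical - but their business costs differ by a factor of 10:
print(business_cost(actual, model_a))  # 50 * 5 * 4 = 1000
print(business_cost(actual, model_b))  # 5 * 5 * 4  = 100
```

Having students design a cost function like this for their own problem forces exactly the kind of contextual reasoning the thread is advocating: where do the costs come from, who bears them, and are they even measurable in dollars?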
