Jonathan Haidt and Robby Soave debate the impact of social media, some initial thoughts
In my little circle of the world, the debate between Jonathan Haidt and Robby Soave on the impact of social media has been making the rounds. The overall event was fascinating, but towards the end a statistician pushed back against Haidt in a heated way, which can be found here.
The audience member says,
I think at the heart of what you said was that anxiety, first among young girls, had increased, had tripled, and the analysis you had done showed that there was a 0.20 correlation.
Continuing the audience member says,
I’m not sure that you understand that what you were saying was that four-fifths of that tripling was due to something else and that at best one-fifth of that tripling was due to social media, if there was causation.
In other words, on the audience member’s framing, a correlation of 0.2 leaves 0.8 unaccounted for. (Strictly speaking, the variance explained is the square of the correlation, so an r of 0.2 explains only 4 percent of the variance; either way, most of the change is unaccounted for.)
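The arithmetic is worth making concrete. Below is a minimal sketch with simulated data (the numbers are illustrative, not from any of the datasets discussed here) showing that a correlation of roughly 0.2 corresponds to only about 4 percent of variance explained:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate a predictor (social media use) and an outcome (anxiety)
# constructed to be correlated at roughly r = 0.2. Illustrative only.
use = rng.normal(size=n)
anxiety = 0.2 * use + np.sqrt(1 - 0.2**2) * rng.normal(size=n)

r = np.corrcoef(use, anxiety)[0, 1]
print(f"r   = {r:.2f}")    # close to 0.2
print(f"r^2 = {r**2:.3f}")  # variance explained: close to 0.04, i.e. 4%
```

The point is that the jump from "r = 0.2" to "four-fifths unexplained" understates the case: squaring the correlation gives the share of variance accounted for, which is far smaller.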
Haidt responds in three ways that I found interesting. First, he notes that correlation doesn’t mean causation. Second, he concedes a bit about the implications of a 0.2 correlation. Third, and most importantly, he shifts to discussing dose-response for social media.
It was a telling moment because it was the moment that he gave up the bag. Haidt knows that the correlations don’t matter all that much. What matters is the impact.
Before I found the paper and read it, I had three impressions of his reaction. Some of this is a bit technical, so stay with me.
Yes, obviously correlation doesn’t equal causation, but why is correlation even being brought up? Correlation (or rather its square) is typically used as a measure of explained variance, and we care about variance only to the extent that it matters for the model. We are not there yet.
Second, yes, the statistician is right. A correlation of 0.2 means that the model is limited. It doesn’t explain all that much in the dependent variable. There is still a big part of the variance that remains unaccounted for.
In this understanding of correlation, R^2 is sometimes made out to be a measure of the goodness of fit. But it isn’t quite that. It only tells us what portion of the variance in the dependent variable the model explains.
Finally, it is interesting that Haidt shifted to dose-response because that’s the important part. Correlations aren’t what we should be concerned with. We should be interested in effect sizes. We should care about impact. We should care about oomph, not precision.
Interlude: Oomph, not precision
Auntie D, as Deirdre McCloskey liked to be called by her students, has one of the best ways of teaching the difference between oomph and precision. All of this is detailed in her work with Stephen T. Ziliak, titled “The Cult of Statistical Significance.” Think about your own mother, they write,
Suppose you get a call from your Mother, who wants to lose weight. Your Mom is enlisting you—a rational statistician, educator and Web surfer—to find and choose a weight-loss pill. Let’s say you do the research and after ranking the various alternatives you deign to propose two diet pills. Mom will choose just one of them. The two pills come at identical prices and side effects (dry mouth, nausea, et cetera) but they differ in weight loss-ability and precision.
You find two kinds of pills for sale, Oomph and Precision,
The first pill, called “Oomph,” will shed from Mom an average of 20 pounds. Fantastic! But Oomph is very uncertain in its effects—at plus or minus 10 pounds (you can if you wish take “plus or minus X-pounds” as a heuristic device to signify in general the amount of “standard error” or deviation around an estimated mean or other coefficient).
The other pill you found, pill “Precision,” will take 5 pounds off Mom on average but it is very precise—at plus or minus 0.5 pounds. Precision is the same as Oomph in price and side effects but Precision is much more certain in its effects.
So what do you choose?
All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way. “Well,” say our significance testing colleagues, “the pill with the highest signal to noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision.”

But Precision—precision commonly defined as a large t-statistic or small p-value on a coefficient—is obviously the wrong choice. Wrong for Mother’s weight-loss plan and wrong for the many other victims of the sizeless scientist.
In the traditional method of hypothesis testing, Precision would have been selected: it provides a more precise estimate with a tighter confidence bound. But that would have been the wrong choice when we wanted the largest impact.
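The choice can be made concrete. Treating Ziliak and McCloskey’s “plus or minus X pounds” as a standard error, per their own heuristic, the signal-to-noise ratio is just the mean effect divided by its standard error:

```python
# Oomph vs. Precision, using the pill numbers from Ziliak and McCloskey.
# "Plus or minus X pounds" is treated as a standard error, per their heuristic.
pills = {
    "Oomph":     {"mean_loss": 20.0, "se": 10.0},
    "Precision": {"mean_loss": 5.0,  "se": 0.5},
}

for name, p in pills.items():
    t_stat = p["mean_loss"] / p["se"]  # signal-to-noise ratio
    print(f"{name}: expected loss {p['mean_loss']} lbs, t = {t_stat:.1f}")

# Precision "wins" on the t-statistic (10.0 vs 2.0), but Oomph delivers
# four times the expected weight loss. Significance testing picks the
# wrong pill if what Mom cares about is pounds lost.
```

The t-statistic ranks the pills exactly backwards relative to what Mom actually wants.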
Think about it,
Mom wants to lose weight, not gain precision. Mom cares about the spread around her waist. She cares little—for example, not at all—about the spread around the average of a hypothetically infinitely repeated random sample.
Similarly, an R^2 is a measure of precision, but we want to know the oomph. We want to know the impact of tech use. The language of economists isn’t all that different from that of psychologists. Psychologists care about oomph; they just call it the dose-response. Economists care about oomph, and they call it effect sizes.
It is a long windup to the point, but it explains why I looked askance when I first encountered Haidt’s response on video. When pressed about the topic, Haidt switched from correlation (precision) to dose-response (oomph). His instincts were right. We should care about the impact of tech and its expected range.
The windup: The history of the debate
The paper that Haidt references in the video is actually a response to work from researchers Amy Orben and Andy Przybylski. I wrote about the Orben and Przybylski paper when it came out. As I noted back in 2019,
The problem that Orben and Przybylski tackle is an endemic one in social science. Sussing out the causal relationship between two variables will always be confounded by other related variables in the dataset. So how do you choose the right combination of variables to test?

An analytical approach first developed by Simonsohn, Simmons and Nelson outlines a method for solving this problem. As Orben and Przybylski wrote, “Instead of reporting a handful of analyses in their paper, (researchers) report all results of all theoretically defensible analyses.” The result is a range of possible coefficients, which can then be plotted along a curve, a specification curve. Below is the specification curve from one of the datasets that Orben and Przybylski analyzed.
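The mechanics of a specification curve can be sketched in a few lines. The sketch below uses invented data and hypothetical variable names ("sleep", "breakfast", "exercise"); the idea is simply to fit every defensible combination of controls, collect the coefficient of interest from each specification, and sort the results:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: an outcome, a focal predictor, and candidate controls.
controls = {name: rng.normal(size=n) for name in ["sleep", "breakfast", "exercise"]}
tech_use = rng.normal(size=n)
wellbeing = -0.1 * tech_use + 0.3 * controls["sleep"] + rng.normal(size=n)

coefs = []
names = list(controls)
# Every subset of controls counts as a "theoretically defensible" specification.
for k in range(len(names) + 1):
    for subset in combinations(names, k):
        X = np.column_stack([np.ones(n), tech_use] + [controls[c] for c in subset])
        beta, *_ = np.linalg.lstsq(X, wellbeing, rcond=None)
        coefs.append(beta[1])  # coefficient on tech_use in this specification

coefs.sort()  # sorted, these trace out the specification curve
print(f"{len(coefs)} specifications, coefficients from "
      f"{coefs[0]:.3f} to {coefs[-1]:.3f}")
```

With three candidate controls this gives 2^3 = 8 models; Orben and Przybylski’s real exercise does the same thing across thousands of defensible analytic choices.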
They explained why their method is important to policymakers:
Although statistical significance is often used as an indicator that findings are practically significant, the paper moves beyond this surrogate to put its findings in a real-world context. In one dataset, for example, the negative effect of wearing glasses on adolescent well-being is significantly higher than that of social media use. Yet policymakers are currently not contemplating pumping billions into interventions that aim to decrease the use of glasses.
The analysis from Orben and Przybylski (2019), which continues with a follow-up (2021), should be considered the starting point for discussions of teens and tech use. They run all possible variables in all possible model combinations to test the entire space of possibilities. In total, they ended up testing 20,004 different models. It is only with the rise of cheap computing power in recent years that these massive datasets could be run. But testing all of those models means that you can combine them all into an estimated effect, an oomph.
These kinds of methods are powerful, garnering inclusion in a recent paper titled, “What are the most important statistical ideas of the past 50 years?” I hope that social scientists embrace these methods because they can be a key source of understanding the total impact of a topic. It is a method of calculating the total amount of oomph.
As I explained at Truth on the Market, once you go through each of the different surveys that track mental health for teens, it seems that social media doesn’t have much explanatory power. It accounts for only 0.4% of the variation in well-being; much greater welfare gains can be made by concentrating on other policy issues. For example, regularly eating breakfast, getting enough sleep, and avoiding marijuana use play much larger roles in the well-being of adolescents. Social media is only a tiny portion of what determines well-being, as the chart below helps to illustrate.
The paper - “Underestimating digital media harm”
The paper that the video highlights is titled “Underestimating digital media harm.” It is a team effort, co-authored by Jean M. Twenge, Jonathan Haidt, Thomas E. Joiner, and W. Keith Campbell.
The paper opens, “Orben and Przybylski use a new and advanced statistical technique to run tens of thousands of analyses across three large datasets. The authors conclude that the association of screen time with wellbeing is negative but ‘too small to warrant policy change.'”

Haidt and his co-authors are unconvinced. They offer six responses that are telling. The first worry that they express is about the relationship between more screen time and more depressive symptoms. They write, “Associations between digital media use and well-being are often non-monotonic; in fact, Przybylski himself named this the Goldilocks hypothesis. Associations often follow a J-shaped curve (see Extended Data Fig. 1).”
Fig 1 is a dose-response chart and it is included below.
Just to be clear, a monotonically increasing function never decreases at any point along the graph. (In technical parlance, its first derivative is never negative.) A J-shaped curve is not monotonic: its slope changes sign, dipping before it rises. That bend at the front of the curve is what gives the J-shape its form and what makes the function non-monotonic.
If social media effects were non-monotonic, if they were J-shaped, then the dose-response curve would first fall and only then rise. In other words, the bad effects would first decrease with use, hit a low point, and then increase with heavier usage. See this post for more.
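The distinction is easy to check numerically: a sequence is monotonically increasing when every successive difference is non-negative, and a J-shaped dose-response fails that test because it dips before it rises. A toy check, with made-up dose-response values:

```python
import numpy as np

def is_monotonic_increasing(ys):
    """True if the sequence never decreases (first differences all >= 0)."""
    return bool(np.all(np.diff(ys) >= 0))

# Made-up dose-response values, for illustration only.
linear_rise = [1.0, 1.5, 2.1, 2.8, 3.6]     # steadily worsening symptoms
j_shaped    = [2.0, 1.4, 1.1, 1.6, 2.9]     # dips first, then rises

print(is_monotonic_increasing(linear_rise))  # True
print(is_monotonic_increasing(j_shaped))     # False
```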
The paper doesn’t linger on this point long, but it is important. There is a difference between boys and girls. Boys seem to display a J-shaped response curve when it comes to social media: too little time on social media is actually indicative of more depression. There is a sweet spot of technology use for boys, right around 3 hours.
Teen girls, on the other hand, seem to have steadily increasing depressive symptoms. The dose-response curve isn’t J-shaped. It starts to increase slowly, then around 3 hours the rate of increase picks up, and past 5 hours it rockets up. There are really no good points on the technology use curve: regardless of their technology use, girls will tend to have higher levels of depressive symptoms.
This relates to the second concern that Haidt and his co-authors point out. They claim it is a mistake to aggregate data across screen time types and gender. Here they are right: “The mental health crisis among adolescents that began after 2012 is hitting girls far harder than boys, in multiple countries. Thus, it is vital that researchers pay special attention to girls, and to the types of media that became more popular after 2012.”
It is beyond the scope of this post, but most of the literature points to the same concern. There is something happening with teen girls.
The fourth issue that they raise with the Orben and Przybylski paper is a little eyebrow-raising, and it seems to be where Haidt got the 0.20 correlation.
In other words, the paper takes issue with Orben and Przybylski because they are missing a measure of depression. For psychologists like Twenge and Haidt, as well as Orben and Przybylski, such a measure is important because it tracks changes in depression over time.
They aim to prove that point by charting out all of the linear r values “between well-being and various factors in boys and girls from two datasets.”
But this graph is confusing because it says that well-being is highly correlated with heroin use and social media. It also suggests that exercise for boys, heroin use in girls, and exercise in girls all have about the same correlation. Great, I guess. But why does this matter at all?
To summarize, Twenge et al. made an index and then tested on it. But it matters a lot how this index has changed over time. This is where I am starting because it is the bedrock. Everyone should be wary of testing on indices. For example, of the six items, what if two of the questions changed substantially over time? What if those two were the reason why depression got more prevalent? It would change the analysis completely.
Haidt presents good evidence that the impact of tech use seems to be more prevalent among teen girls than teen boys. See, for example, Figure 1. And yet the paper never really tests or explores it. A simple mean-difference test would show exactly that. Why not conduct a simple test? Maybe there is more that I am missing, but this should have been done.
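The test in question is standard. Below is a minimal sketch with a hand-rolled Welch t statistic, using invented depressive-symptom scores (on the 1-to-5 scale the surveys use) purely for illustration:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for a difference in means."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)

# Hypothetical depressive-symptom scores (1-5 scale), illustration only.
girls = [3.1, 3.4, 2.9, 3.8, 3.5, 3.2, 3.6, 3.0]
boys  = [2.6, 2.9, 2.4, 3.1, 2.7, 2.5, 3.0, 2.8]

t = welch_t(girls, boys)
print(f"mean difference = {mean(girls) - mean(boys):.2f}, t = {t:.2f}")
```

A large t here would directly support the claim that the girls’ mean differs from the boys’; the point is only that such a check is cheap to run.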
This is why Orben and Przybylski analyzed each one of these survey responses in their (2019) paper. I could go on, but their work aims to get at effect sizes, and it shows that tech does have a negative impact, but it is a small one all things considered. To read the back and forth, start with Orben and Przybylski (2019), then this piece from Twenge, Haidt, et al. (2020), then Orben and Przybylski (2020).
Then think about this, the cherry on the pie.
Most social media research relies on self-reporting methods, which are systematically biased and often unreliable. Communication professor Michael Scharkow, for example, compared self-reports of Internet use with the computer log files, which show everything that a computer has done and when, and found that “survey data are only moderately correlated with log file data.” A quartet of psychology professors in the UK discovered that self-reported smartphone use and social media addiction scales face similar problems in that they don’t correctly capture reality. Patrick Markey, Professor and Director of the IR Laboratory at Villanova University, summarized the work, “the fear of smartphones and social media was built on a castle made of sand.”
So, um, none of the measures of Internet use are reliable.
Teen suicide rates
Still, the most alarming trends come in hospital visits for suspected suicide and in suicides themselves. Haidt’s biggest concern is that ER visits for suspected suicide are up this year for teen girls.
But when you pull out a couple more decades, the low point in the early 2000s can be seen for what it is: a low point in the record. The CDC release has more information. And here is some more on the historical trend.
Among adolescents aged 12–17 years, the number of weekly ED visits for suspected suicide attempts decreased during spring 2020 compared with that during 2019 (Figure 1) (Table). ED visits for suspected suicide attempts subsequently increased for both sexes. Among adolescents aged 12–17 years, mean weekly number of ED visits for suspected suicide attempts were 22.3% higher during summer 2020 and 39.1% higher during winter 2021 than during the corresponding periods in 2019, with a more pronounced increase among females. During winter 2021, ED visits for suspected suicide attempts were 50.6% higher among females compared with the same period in 2019; among males, such ED visits increased 3.7%. Among adolescents aged 12–17 years, the rate of ED visits for suspected suicide attempts also increased as the pandemic progressed (Supplementary Figure 1, https://stacks.cdc.gov/view/cdc/106695). Compared with the rate during the corresponding period in 2019, the rate of ED visits for suspected suicide attempts was 2.4 times as high during spring 2020, 1.7 times as high during summer 2020, and 2.1 times as high during winter 2021 (Table). This increase was driven largely by suspected suicide attempt visits among females.
Among men and women aged 18–25 years, a 16.8% drop in the number of ED visits for suspected suicide attempts occurred during spring 2020 compared with the 2019 reference period (Figure 2) (Table). Although ED visits for suspected suicide attempts subsequently increased, they remained consistent with 2019 counts (Figure 2). However, the ED visit rate for suspected suicide attempts among adults aged 18–25 years was higher throughout the pandemic compared with that during 2019 (Supplementary Figure 2, https://stacks.cdc.gov/view/cdc/106696). Compared with the rate in 2019, the rate was 1.6 times as high during spring 2020, 1.1 times as high during summer 2020, and 1.3 times as high during winter 2021 (Table).
But the CDC has cautioned about divining causes. In part, xxx.
Relatively little research has focused on children and young people (CYP) whose mental health and wellbeing improved during Covid-19 lockdown measures, but about 1/3 of those in the UK surveyed did better. A deep read: https://buff.ly/377WMXo
In some ways I am not able to parse this research because I am still thinking through measurement. A lot of this research relies upon an index to measure changes in wellbeing, explored in “Increases in Depressive Symptoms, Suicide-Related Outcomes, and Suicide Rates Among U.S. Adolescents After 2010 and Links to Increased New Media Screen Time.”
This index is constructed using six items from the Bentler Medical and Psychological Functioning Inventory depression scale, including “Life often seems meaningless,” “I enjoy life as much as anyone”, “The future often seems hopeless,” “I feel that I can’t do anything right,” “I feel that my life is not very useful,” and “It feels good to be alive.” As Twenge et al wrote of this index, “Response choices ranged from 1 (disagree) to 5 (agree). After recoding the two reverse-scored items, item-mean scores were computed (α = .86).”
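The construction can be sketched directly: reverse-score the two positively worded items (“I enjoy life as much as anyone” and “It feels good to be alive”), take the item mean per respondent, and compute Cronbach’s alpha over the six items. The recipe below follows the paper’s description; the data are entirely invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented responses: 200 respondents x 6 items on a 1 (disagree) to
# 5 (agree) scale, driven by one latent factor. Items 2 and 6 are the
# positively worded ones, so they load in the opposite direction.
latent = rng.normal(size=(200, 1))
signs = np.array([1, -1, 1, 1, 1, -1])
items = np.clip(np.round(3 + signs * latent + rng.normal(scale=0.8, size=(200, 6))), 1, 5)

# Reverse-score the two positively worded items before averaging.
reverse = [1, 5]
items[:, reverse] = 6 - items[:, reverse]

index = items.mean(axis=1)  # item-mean depression score per respondent

def cronbach_alpha(x):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

print(f"alpha = {cronbach_alpha(items):.2f}")
```

Twenge et al. report α = .86 for the real items; the simulated alpha here depends entirely on the made-up data, and the sketch only shows where a number like that comes from.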
WTF Happened in 20XX?
I am still munching on this, so please consider it a draft. It gave me a lot to think about and some ideas. And I am sorry for how long this response is. It takes a while to wind up.
Expert bodies have been changing their tune as well. The American Academy of Pediatrics took a hardline stance for years, preaching digital abstinence. But the organization has since backpedaled and now says that screens are fine in moderation. It now suggests that parents and children should work together to create boundaries.
Once this pandemic is behind us, policymakers and experts should reconsider the screen time debate. We need to move from loaded terms like addiction and embrace a more realistic model of the world. The truth is that everyone’s relationship with technology is complicated. Instead of paternalistic legislation, leaders should place the onus on parents and individuals to figure out what is right for them.
- “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.” (link)
- Luke Stein’s Stanford graduate economics core. Always start here. It’s 65 pages of clean stats review.
- “People often ask me how social media and the internet contribute to teenagers’ risk of suicide. The teens we spoke with rarely discussed them alone as a trigger for their suicidal thoughts. However, for already vulnerable adolescents, technology can provide a forum for more trauma, worsening conflict or isolation. Further, having easy access to information on the internet about how to engage in self-harm can be dangerous for teens with mental health concerns.” (link)
- Jonathan Haidt and Jean Twenge put together a useful Google Doc summarizing the available evidence.
First published Mar 31, 2022