Where broadband likely exists in the United States: an estimation of the broadband gap using logit models

All Internet service providers are required to file data with the FCC twice a year (Form 477) describing where they offer Internet access service at speeds exceeding 200 kbps in at least one direction. This data is compiled into Form 477 data, which can be downloaded in its raw format on the FCC’s Open Data site.

There is a lot of data here to access, but one of the most important datasets is the Area Table. Search under the tag of Form477 for the most current data. It is usually 18 months behind. So in late 2021, I was able to access Dec 2020 data. Here is the permalink to that sheet.

Area Tables provide county-level units for broadband provider counts. At the FCC’s map site, which is connected to the FCC site, this table gives county-level readouts, but the information is keyed to county names instead of FIPS codes. That’s bad practice, so it is best to use the Open Data site.

First set the geographic type, which is going to be county. Use the filter function to add a condition. In this case, type is “county.”

Then, filter this dataset to include just the broadband speed of concern. Since broadband thresholds are currently set at 25/3, ensure that the first filter selects “25.”

Then, add another filter selecting the technology you’d like to analyze. Most analyses report broadband access for ADSL, cable and fiber, so select acfow under the tech code, where a=ADSL; c=cable; f=fiber; o=other; s=satellite; w=wireless. As noted in the data set, the combinations always appear in alphabetic order and mean that any of the technologies on the list are considered. Export this for a bit of cleanup in R.
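The three filters above can also be applied after export. The author did this cleanup in R; the following is only a minimal Python sketch of the same idea. The column names (`type`, `id`, `tech`, `speed`, `has_1`) and the sample rows are assumptions made for illustration and may not match the actual export.

```python
import csv
import io

# Hypothetical excerpt of an Area Table export. The column names and
# values are assumptions for illustration, not the real file layout.
SAMPLE = """type,id,tech,speed,has_1
county,13001,acfow,25,17500
county,13001,acfosw,25,18400
county,13001,acfow,100,12000
state,13,acfow,25,9800000
county,50001,acfow,25,30100
"""

def filter_area_table(rows, geo_type="county", tech="acfow", speed="25"):
    """Keep rows matching the geography type, technology code, and speed tier."""
    return [
        r for r in rows
        if r["type"] == geo_type and r["tech"] == tech and r["speed"] == speed
    ]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
county_rows = filter_area_table(rows)
for r in county_rows:
    print(r["id"], r["has_1"])
```

Keeping the county identifier as a string preserves leading zeros in FIPS codes, which matters when joining to other county-level data.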

The problem with 477

The current problem with the FCC data is that it only takes one household in a Census block for the entire block to be considered completely served. In small blocks in urban regions, this bias doesn’t matter all that much, but larger blocks in more rural areas probably mean larger bias.

Everyone worries about this problem. Politicians and FCC chairs have dedicated themselves to it. We know that the FCC’s estimates are biased. At least, that is the going assumption.

Late last year, the Georgia Broadband Deployment Initiative released a dataset of served and unserved locations based on the state’s commercially sourced address dataset. The raw data can be found here.

As one would expect, George Ford has written another great paper on Form 477 data following up on previous work on the topic. Using Georgia’s data, Ford found that “there are 14 million unserved locations in the U.S., though about 5 million of these were addressed in the recent Rural Digital Opportunity Fund (“RDOF”) auction, leaving approximately 9.1 million unserved locations.”

In his earlier work, Ford (2019) estimated an overstatement of 5.6 percentage points (using 2019 data). Meanwhile, Busby, Tanberk, and Cooper (2021) estimate an overstatement of 11.2 percentage points. The true difference is smaller than both of these estimates, and much smaller than the one from Busby, Tanberk, and Cooper (2021).

Georgia isn’t the only state that has released data. Forty other states have been working on the data problem with the NTIA through the National Broadband Availability Map and a handful of them have released data. But the data that Georgia released is important because it counts every address as either connected or not connected. Vermont is the only other state to release data in this format. For more information on all of the states, see my Google Sheet.

Both Georgia and Vermont offer an estimate of the true number of served and unserved locations. Importantly, their data collection methods differ from the FCC’s. They don’t ask providers if they are offering service in the block as the FCC does. Rather, they collected their map data via a commercial address system and then went to each address and found the number and kind of providers for Internet service. Thus the two datasets, with their geographic and demographic differences, stand as two true estimates for broadband.

Georgia data

The Georgia Broadband Deployment Initiative released a dataset of served and unserved locations based on the state’s commercially sourced address dataset. The raw data can be found here. The readme provides the following data on the columns:

  • CBID - Census Block Code
  • CBYear - Census Block year
  • CountyFIPS - County FIPS code
  • County - County name
  • GeorgiaOne - Georgia One county designator (0 = No, 1 = Yes, 2 = Conditional)
  • RC - Regional Commission
  • USCongress - US Congressional District
  • GAHouse - Georgia House District
  • GASenate - Georgia Senate District
  • Served - Number of served locations in block
  • Unserved - Number of unserved locations in block
  • Status - Served Status of block
  • UnservPCT - Percent of locations unserved
  • TimeStamp - GBDI data cycle currency

For each Census block, the state of Georgia provides the number of served and unserved locations. This provides one estimate of the number of unconnected. According to this data, there are 5,283,882 locations, of which 4,825,081 are served, or 91.3 percent.
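The statewide share is just the sum of the Served column over the sum of Served plus Unserved. Here is a minimal Python sketch of that tally (the original work was done in R). The three block records are made up for illustration; only the column names and the statewide totals come from the text above.

```python
# Hypothetical block-level records using the readme's column names.
# The CBID values and counts are invented for illustration.
blocks = [
    {"CBID": "130010001001000", "Served": 40, "Unserved": 0},
    {"CBID": "130010001001001", "Served": 12, "Unserved": 8},
    {"CBID": "130010001001002", "Served": 0, "Unserved": 5},
]

def served_share(records):
    """Share of all locations that are served, pooled across blocks."""
    served = sum(r["Served"] for r in records)
    unserved = sum(r["Unserved"] for r in records)
    return served / (served + unserved)

toy_share = served_share(blocks)

# The same arithmetic applied to the statewide totals quoted in the text.
georgia_share = 4_825_081 / 5_283_882
print(round(100 * toy_share, 1), round(100 * georgia_share, 1))
```

Pooling the counts before dividing weights each block by its number of locations, which is different from averaging the per-block UnservPCT values.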

The second estimate of the unconnected uses primary Form 477 data and implements the FCC’s counting system in R. It counts the entire population of a block as served if one building is provided service. Under that method, the FCC estimates are sums of binaries, described later. This method suggests 96.9% of the population is served.

Finally, the FCC’s own estimates at the county level are summarized. Again, here is the permalink to that sheet. This suggests that about 96.0% of people in GA have broadband.

State data - Vermont

Vermont also provides this information in the Vermont GeoData. This layer presents the availability of broadband service at buildings in Vermont by speed, as reported by broadband service providers as of 12-31-2020 but updated through the date of publication. The data were submitted by the service providers and aggregated by the PSD (Vermont Public Service Department).

  • GEOID: Census Block Number (2020 geography)
  • WIRECENTER: Wirecenter name from Vermont Wirecenter Boundary layer
  • CO_NAME: Incumbent Local Exchange Carrier (ILEC) from Vermont Wirecenter Boundary layer
  • TOWNNAME_1: Vermont Town name, from Vermont BNDHASH layer
  • TOWNNAMEEMC: Vermont Town name lower case, from Vermont BNDHASH layer
  • CNTY: County number, from Vermont BNDHASH layer
  • TOWNGEOID: Vermont town ID, from Vermont BNDHASH layer
  • CUD: Communications Union District
  • BB_Status: The maximum reported broadband service available at this location, in one of five speed tiers (in Mbps): Served 100/100, Served 100/20, Served 25/3, Served 4/1, or Lacking 4/1.
  • COMPANY: Type of electric distribution utility
  • COMPANYNAM: Name of electric distribution utility

In contrast to GA, the state of Vermont has released data listing each location in the state and its access to broadband. After summarizing, there are 310,633 locations, of which 249,439 are served, or 80.3 percent.
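Because Vermont’s file has one row per location, it has to be rolled up to blocks before it can be compared with Georgia’s. A minimal Python sketch of that roll-up follows (the author’s own code is in R). It assumes that the three tiers at or above 25/3 count as served, which matches the FCC threshold used here but is my reading rather than something the dataset states. The sample rows are invented.

```python
# Assumption: a location counts as served when its tier is at least 25/3.
SERVED_TIERS = {"Served 100/100", "Served 100/20", "Served 25/3"}

# Hypothetical location-level rows using the GEOID and BB_Status columns.
locations = [
    {"GEOID": "500010001001000", "BB_Status": "Served 100/100"},
    {"GEOID": "500010001001000", "BB_Status": "Served 25/3"},
    {"GEOID": "500010001001000", "BB_Status": "Served 4/1"},
    {"GEOID": "500010001001001", "BB_Status": "Lacking 4/1"},
    {"GEOID": "500010001001001", "BB_Status": "Served 100/20"},
]

def block_counts(locs):
    """Return {GEOID: (served, unserved)} by rolling locations up to blocks."""
    counts = {}
    for loc in locs:
        served, unserved = counts.get(loc["GEOID"], (0, 0))
        if loc["BB_Status"] in SERVED_TIERS:
            served += 1
        else:
            unserved += 1
        counts[loc["GEOID"]] = (served, unserved)
    return counts

counts = block_counts(locations)
total_served = sum(s for s, _ in counts.values())
total_locations = sum(s + u for s, u in counts.values())
print(counts, total_served / total_locations)

# The same division applied to the statewide totals quoted in the text.
vermont_share = 249_439 / 310_633
```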

The second estimate of the unconnected in VT also uses primary Form 477 data and implements the FCC’s counting system in R. This method suggests 94.2% of the population of VT is served.

Finally, the FCC’s own estimates at the county level are summarized. Again, here is the permalink to that sheet. This suggests that about 95% of people in VT have broadband.

After a good amount of work, which can be found in my raw R code, I am not sure that I can really trust the VT data. Importantly, the web site says that “VCGI and the State of VT make no representations of any kind, including but not limited to the warranties of merchantability or fitness for a particular use, nor are any such warranties to be implied with respect to the data.” I think that is probably the case here. Georgia, in contrast, released its data to help out the community.

All of this became clear when I started doing some exploration of VT data. Oftentimes, I will throw some data into a regression to see what comes out. When I did that for both GA and VT, I got weird things for VT and normal things for GA. Models 1 and 2 were GA and Models 3 and 4 were VT. The results are below.

Dependent variable: Total locations

| | (1) | (2) | (3) | (4) |
| --- | --- | --- | --- | --- |
| hu2020 | 1.075\*\*\* | | 0.125\*\*\* | |
| | (0.001) | | (0.007) | |
| pop2020 | | 0.412\*\*\* | | 0.072\*\*\* |
| | | (0.001) | | (0.004) |
| Constant | 3.330\*\*\* | 5.297\*\*\* | 13.955\*\*\* | 13.806\*\*\* |
| | (0.076) | (0.094) | (0.285) | (0.290) |
| Observations | 176,259 | 176,259 | 9,808 | 9,808 |
| R² | 0.816 | 0.717 | 0.029 | 0.029 |
| Adjusted R² | 0.816 | 0.717 | 0.029 | 0.029 |
| Residual Std. Error | 29.408 (df = 176257) | 36.519 (df = 176257) | 24.454 (df = 9806) | 24.460 (df = 9806) |
| F Statistic | 784,060.700\*\*\* (df = 1; 176257) | 446,476.500\*\*\* (df = 1; 176257) | 296.028\*\*\* (df = 1; 9806) | 291.202\*\*\* (df = 1; 9806) |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01

The setup was simple. I summed all of the locations to the Census block level and then regressed the totals against FCC population data. As that site notes, the data file includes “housing unit, household and population counts for each block for 2010 (US Census) and 2020 (Commission staff estimate).” Housing units (hu2020) and population counts (pop2020) were selected. It is worth noting that Ford (2021) uses housing unit counts, so that provides the model here as well.

As the table above highlights, the regressions for GA look right. There is about 1 location for every housing unit in 2020 and 0.412 locations for every person. This comes out to roughly 2.4 people per home.

On the other hand, the coefficients for VT imply about 8 housing units for every location and about 14 people for every location. Neither makes much sense. I have contacted the VT government, but I haven’t heard back. Because of the problems with the data, and because it was never meant to be a dataset for broadband, unlike GA’s, which was designed for that purpose, I haven’t included VT in the final analysis.
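The figures of 2.4, 8, and 14 in the two paragraphs above are reciprocals of the slope coefficients in the table. The Python sketch below shows a one-variable least squares fit on invented data, and then the reciprocal arithmetic on the published slopes. The original regressions were run in R; the toy arrays here are assumptions for illustration only.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Invented block-level data: housing units and summed locations per block.
hu = [10, 20, 30, 40]
total_locations = [13, 24, 35, 46]
intercept, slope = ols_fit(hu, total_locations)
print(intercept, slope)

# Reciprocals of the published slopes from the table above.
ga_people_per_location = 1 / 0.412   # GA, pop2020 slope
vt_hu_reciprocal = 1 / 0.125         # VT, hu2020 slope
vt_pop_reciprocal = 1 / 0.072        # VT, pop2020 slope
print(round(ga_people_per_location, 1),
      round(vt_hu_reciprocal, 1),
      round(vt_pop_reciprocal, 1))
```

A slope near 1 on housing units, as in GA, means locations and housing units move together roughly one for one, which is what an address-based dataset should show.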

Comparing models

The goal of this project is to compare a couple of models to estimate a new measure of United States broadband access using the data that Georgia provides. Naturally, this project is indebted to Ford (2021).

The problem is simple, but the estimate is tricky.

The problem: If fewer people are truly connected than the FCC estimate suggests, then the FCC estimate is a biased measure of the true number (truᵢ) of households connected. The true number might be understood as a function of the FCC estimates (fccᵢ) plus some set of controls plus an error term (εᵢ). Each of these variables is coded to the Census block, hereafter i.

Putting it together,

truᵢ = fccᵢ + [controls]ᵢ + εᵢ

In turn, the true share (truᵢ) of locations served by broadband equals the served locations (servᵢ) over the total number of locations (servᵢ + unservᵢ) such that,

truᵢ = servᵢ/(servᵢ + unservᵢ)

As noted earlier, it only takes one household in a Census block for the entire block to be considered completely served by a service. So each Census block becomes a binary variable, which is then summed at the county level, since each block is subsumed underneath a county. More on Census blocks here.

By assigning each Census block a population estimate (popᵢ), the FCC total served stat becomes

Σ fccᵢpopᵢ
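The sum above can be sketched in a few lines of Python. This is an illustration of the counting rule only, not the author’s R implementation. The field name `providers_25_3` and the block values are hypothetical.

```python
# Hypothetical blocks: number of providers reporting 25/3 service, and population.
blocks = [
    {"block": "A", "providers_25_3": 2, "pop": 120},
    {"block": "B", "providers_25_3": 0, "pop": 30},
    {"block": "C", "providers_25_3": 1, "pop": 50},
]

def fcc_served_population(records):
    """Sum of fcc_i * pop_i, where fcc_i is 1 if any provider reports service."""
    total = 0
    for r in records:
        fcc = 1 if r["providers_25_3"] > 0 else 0  # one provider marks the whole block served
        total += fcc * r["pop"]
    return total

served_pop = fcc_served_population(blocks)
total_pop = sum(r["pop"] for r in blocks)
print(served_pop, served_pop / total_pop)
```

Block A and block C count their full populations as served regardless of how many locations inside them actually have service, which is the source of the overstatement.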

We know from the work of Ford (2021) that fccᵢ is correlated with the size of the Census block and the total number of housing units (huᵢ) in a region. The age, income, and education of a region might also affect deployment. Combining all of the features into a full model with a constant yields,

truᵢ = consᵢ + fccᵢ + sizeᵢ + huᵢ + ageᵢ + incᵢ + educᵢ + εᵢ

In a way, this project is perfectly made for a logit model. As Ford (2021) noted,

“Since tᵢ lies on the unit interval, Equation (4) is estimated by a General Linear Model of the binomial family with a logit link (thus constraining predictions to the unit interval).”

The classic use of logit models is for a binary that takes the form 0 or 1. This is the case for broadband. Either households have broadband at the FCC level of 25/3 or they do not. Because it is coded at the block level, truᵢ is just its mean value.

Using glm to model a quasi-binomial logit

Part of the purpose of this project was to become more familiar with different estimation methods of the loss function. Fundamentally though, I wanted to understand the models using a typical glm estimation. According to the documentation, glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

I kept refining my analysis throughout this project. I am not trying to test causality. I want to estimate bounds. Public policy cares about conditional means, given the bias in the data collection efforts. My messy thoughts were scratched out here.

A common problem for large datasets using the glm() method is that the models don’t converge. To solve this, each of the models was estimated using a quasi-binomial distribution. While similar to the binomial distribution, it has an extra parameter ϕ (limited to |ϕ| ≤ min{p/n, (1−p)/n}) that attempts to describe additional variance in the data that cannot be explained by a binomial distribution alone. For more information see this.
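To make the estimation step concrete, here is a small Python sketch of a logit fit to a fractional outcome, followed by a dispersion estimate. It is not the author’s code, which used R’s glm(). It relies on two facts about R’s quasibinomial family: the coefficients are the same as in the binomial fit, and the dispersion is estimated as the Pearson statistic divided by the residual degrees of freedom. The function name, the single-predictor setup, and the data are assumptions for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_quasibinomial_logit(x, y, max_iter=50, tol=1e-10):
    """Fit y ~ logistic(b0 + b1 * x) by Newton-Raphson, then estimate dispersion."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = logistic(b0 + b1 * xi)
            var = mu * (1.0 - mu)
            g0 += yi - mu              # score for the intercept
            g1 += (yi - mu) * xi       # score for the slope
            h00 += var                 # entries of X'WX
            h01 += var * xi
            h11 += var * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Dispersion: Pearson chi-square over residual degrees of freedom (n - 2).
    pearson = 0.0
    for xi, yi in zip(x, y):
        mu = logistic(b0 + b1 * xi)
        pearson += (yi - mu) ** 2 / (mu * (1.0 - mu))
    phi = pearson / (len(x) - 2)
    return b0, b1, phi

# Invented data: x is a 0/1 FCC flag per block, y is the block's served share.
x = [0, 0, 0, 1, 1, 1, 1]
y = [0.1, 0.3, 0.2, 0.9, 0.8, 1.0, 0.7]
b0, b1, phi = fit_quasibinomial_logit(x, y)
print(b0, b1, phi)
```

With a single binary predictor, the fitted probabilities reproduce the two group means, so logistic(b0) is the mean share where the flag is 0 and logistic(b0 + b1) is the mean share where it is 1.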

The data are as follows. Housing Units Served Percent, the dependent variable, was calculated using GA data, described above. FCCserved is a measure calculated by checking every Census block to see if it was served by a 25/3 service. The result was a binary. LogALAND10 is the natural log of the Census block size, following Ford (2021). The RuralUrbanContinuumCode2013 offers a measure of the rural-urban mix of a region. It wasn’t used in the final fitted dataset, but it was calculated. FCC block estimates were used for housing units (hu2020) and population (pop2020). Pop2020 doesn’t show up in the final fitted data either, but it is a worthwhile measure to include. Both FCCserved:logALAND10 and FCCserved:hu2020 are interaction terms, following Ford (2021). Finally, the 2019 ACS 5-year estimates provided median age, median income, and education measures. Because of the problems inherent in the 2020 data, I have reverted to using 2019 data instead. This is a flat dataset. There is no panel, so time isn’t considered. As one might expect, the usual caveats apply here.

The results are displayed below.

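Dependent variable: Total locations served percent

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCCserved | 3.290\*\*\* | 3.379\*\*\* | 3.053 | 5.496 | 3.082\*\*\* | 2.942 | 2.950 |
| | (0.017) | (0.020) | (1,118.002) | (6,226.771) | (0.017) | (325.040) | (325.909) |
| logALAND10 | | -0.548\*\*\* | -0.643 | -0.460 | | -0.561 | -0.588 |
| | | (0.004) | (214.797) | (467.369) | | (65.006) | (64.157) |
| RuralUrbanContinuumCode2013 | | | -0.118 | | | | |
| | | | (52.800) | | | | |
| hu2020 | | | 0.029 | 0.007 | | 0.022 | 0.025 |
| | | | (24.098) | (84.935) | | (6.914) | (6.977) |
| FCCserved:logALAND10 | | | | -0.222 | | | |
| | | | | (525.816) | | | |
| FCCserved:hu2020 | | | | 0.024 | | | |
| | | | | (88.763) | | | |
| medincome | | | | | -0.00001\*\*\* | -0.00000 | 0.00000 |
| | | | | | (0.00000) | (0.008) | (0.008) |
| medage | | | | | -0.051\*\*\* | -0.023 | -0.029 |
| | | | | | (0.001) | (14.474) | (14.181) |
| educ | | | | | 5.227\*\*\* | 2.620 | 2.700 |
| | | | | | (0.075) | (1,355.767) | (1,345.201) |
| Constant | -1.308\*\*\* | 5.275\*\*\* | 6.187 | 4.209 | 0.230\*\*\* | 6.346 | 6.153 |
| | (0.016) | (0.046) | (2,616.689) | (5,438.467) | (0.038) | (961.179) | (951.441) |
| Observations | 176,259 | 176,259 | 176,259 | 176,259 | 170,944 | 170,944 | 170,944 |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01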

Modeling the true number of homes connected

The regression table offers some interesting insights, but the directions are largely where you’d expect. There is a positive relationship between the FCC served variable and the true number. Across nearly all of the models, the estimate is about the same, roughly 3 (Model 4, at 5.496, is the exception).
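A logit coefficient of about 3 is easier to read once it is converted back to a share. The Python sketch below does that for Model 1, using the constant (-1.308) and FCCserved coefficient (3.290) from the table. The conversion is standard logit arithmetic; the variable names are mine.

```python
import math

def logistic(z):
    """Inverse of the logit link: maps a linear predictor to a share in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Model 1 estimates from the regression table.
CONSTANT = -1.308
B_FCCSERVED = 3.290

# Predicted served share in blocks the FCC counts as unserved and as served.
share_when_fcc_unserved = logistic(CONSTANT)
share_when_fcc_served = logistic(CONSTANT + B_FCCSERVED)

# Multiplicative change in the odds of being served when FCCserved goes from 0 to 1.
odds_ratio = math.exp(B_FCCSERVED)

print(round(share_when_fcc_unserved, 3),
      round(share_when_fcc_served, 3),
      round(odds_ratio, 1))
```

Under Model 1, blocks that the FCC flags as served are predicted to be well short of fully served, which is the gap the rest of the analysis tries to measure.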

The relationship between the true number of served homes and the size of the Census block is negative, which is what is expected as well. As the block gets bigger, the share of served homes drops.

According to FCC data, 97.5% of the population has access to broadband. Predicting on the models, the best guess is 92.49 percent, but it could be 89.85 percent on the low end.

| Model | Housing units estimate | Percent of homes served |
| --- | --- | --- |
| Model 2 | 123,168,314 | 85.76% |
| Model 3 | 132,830,990 | 92.49% |
| Model 5 | 122,637,243 | 85.39% |
| Model 7 | 129,043,167 | 89.85% |
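One way to arrive at national figures like these is to weight each block’s predicted share by its housing units and sum. That weighting is my assumption about the aggregation, not something stated above, and the toy predictions are invented. The second half of this Python sketch only checks that the four published rows imply the same housing unit total.

```python
def national_estimate(predicted_shares, housing_units):
    """Housing-unit-weighted total served and overall share (assumed aggregation)."""
    served = sum(p * hu for p, hu in zip(predicted_shares, housing_units))
    total = sum(housing_units)
    return served, served / total

# Invented block-level predictions and housing unit counts.
toy_served, toy_share = national_estimate([0.9, 0.5], [100, 20])
print(toy_served, toy_share)

# Published rows: (served housing units, percent served).
published = {
    "Model 2": (123_168_314, 85.76),
    "Model 3": (132_830_990, 92.49),
    "Model 5": (122_637_243, 85.39),
    "Model 7": (129_043_167, 89.85),
}

# Implied total housing units behind each row.
implied_totals = {
    name: served / (pct / 100) for name, (served, pct) in published.items()
}
for name, total in implied_totals.items():
    print(name, round(total))
```

All four rows imply a total of roughly 143.6 million housing units, so the models differ only in the predicted share, not in the base.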

What about where there is disagreement?



First published Apr 28, 2022