R code for economists, a scratchpad

This is a scratchpad of sorts for economists wanting to explore R.


I am going to assume that you already have R and RStudio installed. If you need help with that, go here.

As your code grows, you will want to consider making it public and open-sourcing it. An easy and common route is GitHub. Learn about GitHub and R integration here.

CRAN’s task views help you install or update all of the key libraries you are going to need for a given kind of analysis. Here are all the CRAN task views. The code below installs the Econometrics task view.
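A sketch of what that looks like using the ctv helper package. Fair warning: a whole task view pulls in dozens of packages, so this can take a while.

```r
# ctv manages CRAN task views; install.views() installs every package in one.
install.packages("ctv")
ctv::install.views("Econometrics")

# later on, update.views() refreshes everything in the view
ctv::update.views("Econometrics")
```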


I typically use the following code for my libraries. easypackages is great for loading a bunch of libraries at once. Remember install.packages("easypackages") to get it all started.
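A minimal sketch of that preamble; the package list here is illustrative, so swap in your own:

```r
library(easypackages)

# load a bunch of libraries in one call
libraries("data.table", "dplyr", "tidyr", "ggplot2")

# my default: discourage scientific notation in printed output
options(scipen = 999)
```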



Just an FYI, options(scipen = 999) stops R from printing numbers in scientific notation. That will come in handy down the road, and I make it my default.

If you need to install all of those packages, just change around the code a bit.
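One base-R way to change it around, installing only what is missing before loading (package names again illustrative):

```r
pkgs    <- c("data.table", "dplyr", "tidyr", "ggplot2")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)

# then load everything
invisible(lapply(pkgs, library, character.only = TRUE))
```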


Loading data and some basic tips and tricks

I like to use fread from the data.table library to load data. It is powerful and consistently among the fastest ways to read delimited files. More documentation here.

Below is code from a project of mine. It loads a CSV from a subfolder and reads every column in as character class, which preserves the leading zeroes.

GA.state <- fread(file="./data-raw/GBDI_Unserved_CB_Jun2021.csv",
                  colClasses = "character")

To learn more about R’s class system, go here. If you are doing joins, you might have to change classes so the key columns match. To see the classes of all columns in a data frame, run the code below.
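A base-R sketch, with a toy data frame standing in for your real one:

```r
df <- data.frame(fips = c("01", "02"), pop = c(10L, 20L))

sapply(df, class)  # class of every column, as a named vector
str(df)            # same information plus dimensions and example values

# if a join key's classes don't match across tables, convert before joining
df$pop <- as.character(df$pop)
```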


If you ever want to know more about a function or package, just put a question mark ? before the name. RStudio will open the official documentation in its Help pane. So, for example, the code below will give you information on “Fitting Generalized Linear Models”.
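That is:

```r
?glm  # opens the help page "Fitting Generalized Linear Models"
```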


When you load data, you might need to strip the thousands-separator commas before converting to numeric. The code below uses gsub() to replace each “,” with “”. Find more information on the syntax of gsub here.

startup.data <- startup.data %>%
  mutate(Amount=as.numeric(gsub(",", "", AmountInUSD)))

Basic functions: subsets, joins, etc.

dplyr and pipe operators

I mean, you really should be learning the tidyverse. It is powerful and easy. Go here for that.

GAVTdf2 <- GAVTdf2 %>%
  left_join(., block.area, by = c("FIPS" = "GEOID10")) %>%
  mutate(CountyFIPS = substr(FIPS, 1, 5),
         BlockFIPS = substr(FIPS, 1, 12)) %>%
  left_join(., rural.atlas, by = c("CountyFIPS" = "FIPS")) %>%
  left_join(., ACSgroup, by = c("BlockFIPS" = "GEOID")) %>%
  mutate(FCCserved = as.numeric(FCCserved)) %>%
  mutate(servper = served/totloc,
         popden = pop2020/ALAND10) %>%
  replace_na(list(totloc = 0, served = 0,
                  unserved = 0, max.down = 0,
                  max.up = 0, FCCserved = 0))

Another course, if you can call it that, can be found here.

American Community Survey (ACS) data in R

First go here to find everything you need to know about how to get variables from the American Community Survey (ACS).

vars <- load_variables(year = 2019, dataset = "acs5", cache = TRUE)

Some common ACS variables:

  • B02001_001: Total (to divide the rest of the series for percentages)
  • B03002_003: White alone (Not Hispanic or Latino)
  • B03002_004: Black or African American alone (Not Hispanic or Latino)
  • B03002_012: Hispanic or Latino
  • B03002_005: Native American alone (Not Hispanic or Latino)
  • B03002_006: Asian alone (Not Hispanic or Latino)
  • B03002_007: Native Hawaiian or Pacific Islander alone (Not Hispanic or Latino)
  • B03002_009: Multiple Races (Not Hispanic or Latino)
  • B03002_008: Other (Not Hispanic or Latino)
  • B25064_001: Median gross rent
  • B25071_001: Rent Burden (median gross rent as percentage of household income)
  • B19013_001: Median household income in the last 12 months
  • B01002_001: Median age
  • B25115_016: Renter Occupied - family
  • B25115_027: Renter Occupied - nonfamily
  • B15003_022: Total with Bachelor’s degree for the population 25 years and older
  • B15003_023: Total with Master’s degree for the population 25 years and older
  • B15003_024: Total with professional school degree for the population 25 years and older
  • B15003_025: Total with doctorate degree for the population 25 years and older
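Once you have pulled counts like these (e.g. with tidycensus::get_acs()), the usual move is to divide by the B02001_001 total to get shares. A sketch with invented numbers standing in for a real ACS extract:

```r
library(dplyr)

# toy stand-in for an ACS extract in wide form (all values invented)
acs <- data.frame(
  GEOID      = c("13001", "13003"),
  B02001_001 = c(1000, 2000),   # total population (the denominator)
  B03002_003 = c(600, 1500),    # White alone, not Hispanic or Latino
  B03002_012 = c(250, 300)      # Hispanic or Latino
)

acs <- acs %>%
  mutate(pct_white    = B03002_003 / B02001_001,
         pct_hispanic = B03002_012 / B02001_001)
```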







R libraries

  • The new ppsr package visualizes predictive relations and correlations side-by-side with one line of code!

That’s reasonably specific, but you’ll have success looking for lecture notes.

This is a great text: https://m-clark.github.io/mixed-models-with-R/

Drawing supply demand curves

Economics charts in R using ggplot2 and econocharts.
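econocharts provides ready-made helpers for this, but a plain ggplot2 sketch works too. Here two linear curves, P = 2 + Q (supply) and P = 10 - Q (demand), cross at (4, 6):

```r
library(ggplot2)

curves <- data.frame(
  Q     = rep(0:8, 2),
  P     = c(2 + 0:8, 10 - 0:8),
  curve = rep(c("Supply", "Demand"), each = 9)
)

p <- ggplot(curves, aes(Q, P, colour = curve)) +
  geom_line(linewidth = 1) +
  annotate("point", x = 4, y = 6, size = 3) +   # the equilibrium
  labs(title = "Supply and demand", x = "Quantity", y = "Price")

p  # printing the object draws the chart
```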

Web scraping
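rvest is the usual starting point. A self-contained sketch that parses an inline HTML string instead of a live page; for a real site, pass the URL to read_html():

```r
library(rvest)

page <- read_html('
  <table>
    <tr><th>State</th><th>Served</th></tr>
    <tr><td>GA</td><td>1</td></tr>
  </table>')

# html_table() converts a <table> node into a data frame
tbl <- page %>% html_element("table") %>% html_table()
```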



Other libraries of interest

R packages

  • The anomalous package provides some tools to detect unusual time series in a large collection of time series. (source)

Basic functions and resources




Use Cases

  • A regression model for smartphone trade-in value prediction (link). Question: I have a dataset with average selling prices of pre-owned iPhones, and I am trying to predict what the price of a newly launched iPhone would look like down the line. Right now I am using a log-transformed regression model with a bunch of categorical predictors. My questions: 1) Since different iPhones have different price ranges, how do I go about building the model? 2) Apart from linear regression, what are some other models I can use to improve accuracy? Answer: If they have different price ranges then you should definitely apply a standard scaler to the data. The obvious other regression algorithm to try is the random forest regressor. I think one regression model scales better, because you can always include the categorical variable, say “smart.phone.type”, as one of the features; imagine having 200 different phones and having to build 200 different models. Applying transformations in linear regression comes down to adhering to its assumptions as closely as possible, and this should be the metric for choosing between scaling and a log transformation of your features/target. (I actually jumped the gun in the previous comment, because I forgot that linear regression is scale invariant, so the features don’t need to be normalized in pre-processing.) You might also want to consider a Box-Cox transformation, as well as Lasso, Ridge, and Elastic Net regression as other algorithms to try.

  • Building a binary classifier from scratch in Python: http://www.jeannicholashould.com/what-i-learned-implementing-a-classifier-from-scratch.html

  • Big Data Analysis: The main thing to keep in mind is that with this amount of data, every coefficient will probably come out as statistically significant. In order to find out which regressors are really important (as contrasted with statistically significant), I recommend using a holdout sample: fit your model to only 4 million data points, predict the other million points and compare to the actual values. Do this for a couple of different models (using or not using regressors, transforming regressors etc.) and see which ones yield the best predictions, by e.g. calculating the Mean Absolute Deviation (MAD) between the predictions and the actual observations. Better yet: iterate this over the entire dataset five times, using a different million points as a holdout sample each time. This is known as “cross-validation” (five-fold cross-validation in this case). (link)
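The five-fold scheme described in that answer, sketched in base R with a toy regression and MAD as the comparison metric:

```r
set.seed(1)
n  <- 500
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

fold <- sample(rep(1:5, length.out = n))  # assign each row to one of 5 folds

mad_by_fold <- sapply(1:5, function(k) {
  fit  <- lm(y ~ x, data = df[fold != k, ])        # fit on 4/5 of the data
  pred <- predict(fit, newdata = df[fold == k, ])  # predict the held-out fifth
  mean(abs(pred - df$y[fold == k]))                # mean absolute deviation
})

mean(mad_by_fold)  # compare this number across candidate models
```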








Books and other resources



A highly recommended time series analysis book: https://www.otexts.org/fpp






  • Export all of your Reddit saves to HTML (link)

Visual Learning Algorithms

  • Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license. (source)
  • What is the best image recognition algorithm? (link)
  • Analyzing 50k fonts using deep neural networks (link)


Analysis of election data with R: http://thinktostart.com/analyzing-us-elections-facebook-r/ 

Interval Estimate of Population Mean with Unknown Variance: http://www.r-tutor.com/elementary-statistics/interval-estimation/interval-estimate-population-mean-unknown-variance 

Working with SQL in R:
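A sketch with DBI and an in-memory SQLite database; swap the connector for your actual backend (e.g. Postgres via RPostgres):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# ordinary SQL comes back as a data frame
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```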


Social Media / Social Network Analysis

Sentiment Analysis 

Analyzing Facebook:

Data Sources:



  • Lev Konstantinovskiy talking about Gensim (link)

Learn Java

I’d approach the data differently for sure, and understand the variation more before running regressions. But when it comes to causal inference, random forests wouldn’t help. I’d now consider multi-level modeling a la Andrew Gelman in addition to the 2-way fixed effects model I used. Hope that helps.


The 40 data science techniques

  1. Linear Regression 

  2. Logistic Regression 

  3. Jackknife Regression *

  4. Density Estimation 

  5. Confidence Interval 

  6. Test of Hypotheses 

  7. Pattern Recognition 

  8. Clustering - (aka Unsupervised Learning)

  9. Supervised Learning 

  10. Time Series 

  11. Decision Trees 

  12. Random Numbers 

  13. Monte-Carlo Simulation 

  14. Bayesian Statistics 

  15. Naive Bayes 

  16. Principal Component Analysis - (PCA)

  17. Ensembles 

  18. Neural Networks 

  19. Support Vector Machine - (SVM)

  20. Nearest Neighbors - (k-NN)

  21. Feature Selection - (aka Variable Reduction)

  22. Indexation / Cataloguing *

  23. (Geo-) Spatial Modeling 

  24. Recommendation Engine *

  25. Search Engine *

  26. Attribution Modeling *

  27. Collaborative Filtering *

  28. Rule System 

  29. Linkage Analysis 

  30. Association Rules 

  31. Scoring Engine 

  32. Segmentation 

  33. Predictive Modeling 

  34. Graphs 

  35. Deep Learning 

  36. Game Theory 

  37. Imputation 

  38. Survival Analysis 

  39. Arbitrage 

  40. Lift Modeling 

  41. Yield Optimization

  42. Cross-Validation

  43. Model Fitting

  44. Relevancy Algorithm *

  45. Experimental Design

Here we discuss general applications of statistical models, whether they arise from data science, operations research, engineering, machine learning or statistics. We do not discuss specific algorithms such as decision trees, logistic regression, Bayesian modeling, Markov models, data reduction or feature selection. Instead, I discuss frameworks - each one using its own types of techniques and algorithms - to solve real life problems.   

Most of the entries below are found in Wikipedia, and I have used a few definitions or extracts from the relevant Wikipedia articles, in addition to personal contributions.


  1. Spatial Models

Spatial dependency is the co-variation of properties within geographic space: characteristics at proximal locations appear to be correlated, either positively or negatively. Spatial dependency leads to the spatial auto-correlation problem in statistics since, like temporal auto-correlation, this violates standard statistical techniques that assume independence among observations.

  2. Time Series

Methods for time series analyses may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and recently wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In time domain, correlation analyses can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in frequency domain.

Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure.

Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate.
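A small parametric example of the above: fitting an AR(1) to lh, a series that ships with base R, and estimating its single autoregressive parameter:

```r
fit <- arima(lh, order = c(1, 0, 0))  # AR(1): p = 1, no differencing, no MA term

coef(fit)                        # estimated ar1 coefficient and series mean
predict(fit, n.ahead = 3)$pred   # three-step-ahead forecast
```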

  3. Survival Analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? Survival models are used by actuaries and statisticians, but also by marketers designing churn and user retention models.

Survival models are also used to predict time-to-event (time from becoming radicalized to turning into a terrorist, or time between when a gun is purchased and when it is used in a murder), or to model and predict decay (see section 4 in this article).
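A minimal R example using the survival package (which ships with R) and its built-in lung cancer data:

```r
library(survival)

# Kaplan-Meier estimate of the overall survivor function
km <- survfit(Surv(time, status) ~ 1, data = lung)
summary(km, times = c(180, 365))   # estimated survival at roughly 6 and 12 months

# Cox proportional hazards: how age and sex shift the hazard
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)
```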

  4. Market Segmentation

Market segmentation, also called customer profiling, is a marketing strategy which involves dividing a broad target market into subsets of consumers, businesses, or countries that have, or are perceived to have, common needs, interests, and priorities, and then designing and implementing strategies to target them. Market segmentation strategies are generally used to identify and further define the target customers, and provide supporting data for marketing plan elements such as positioning to achieve certain marketing plan objectives. Businesses may develop product differentiation strategies, or an undifferentiated approach, involving specific products or product lines depending on the specific demand and attributes of the target segment.

  5. Recommendation Systems

Recommender systems or recommendation systems (sometimes replacing “system” with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item.

  6. Association Rule Learning

Association rule learning is a method for discovering interesting relations between variables in large databases. For example, the rule { onions, potatoes } ==> { burger } found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. In fraud detection, association rules are used to detect patterns associated with fraud. Linkage analysis is performed to identify additional fraud cases: if a credit card transaction from user A was used to make a fraudulent purchase at store B, then by analyzing all transactions from store B we might find another user C with fraudulent activity.

  7. Attribution Modeling

An attribution model is the rule, or set of rules, that determines how credit for sales and conversions is assigned to touchpoints in conversion paths. For example, the Last Interaction model in Google Analytics assigns 100% credit to the final touchpoints (i.e., clicks) that immediately precede sales or conversions. Macro-economic models use long-term, aggregated historical data to assign, for each sale or conversion, an attribution weight to a number of channels. These models are also used for advertising mix optimization.

  8. Scoring

A scoring model is a special kind of predictive model. Predictive models can predict defaulting on loan payments, risk of accident, client churn or attrition, or the chance of buying a good. Scoring models typically use a logarithmic scale (each additional 50 points in your score reducing the risk of defaulting by 50%), and are based on logistic regression and decision trees, or a combination of multiple algorithms. Scoring technology is typically applied to transactional data, sometimes in real time (credit card fraud detection, click fraud).

  9. Predictive Modeling

Predictive modeling leverages statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects, after the crime has taken place. They may also be used for weather forecasting, to predict stock market prices, or to predict sales, incorporating time series or spatial models. Neural networks, linear regression, decision trees and naive Bayes are some of the techniques used for predictive modeling. They are associated with creating a training set, cross-validation, and model fitting and selection.

Some predictive systems do not use statistical models, but are data-driven instead. See example here

  10. Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Unlike supervised classification (below), clustering does not use training sets, though there are some hybrid implementations, called semi-supervised learning.
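A base-R illustration, clustering the iris measurements without looking at the species labels:

```r
set.seed(42)

# cluster the four iris measurements into 3 groups, 20 random restarts
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

table(km$cluster, iris$Species)  # compare found clusters to the true species
```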

  11. Supervised Classification

Supervised classification, also called supervised learning, is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called label, class or category). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. 

Examples, with an emphasis on big data, can be found on DSC. Clustering algorithms are notoriously slow, though a very fast technique known as indexation or automated tagging will be described in Part II of this article.

  12. Extreme Value Theory

Extreme value theory or extreme value analysis (EVA) is a branch of statistics dealing with the extreme deviations from the median of probability distributions. It seeks to assess, from a given ordered sample of a given random variable, the probability of events that are more extreme than any previously observed. For instance, floods that occur once every 10, 100, or 500 years. These models have recently performed poorly at predicting catastrophic events, resulting in massive losses for insurance companies. I prefer Monte-Carlo simulations, especially if your training data is very large. This will be described in Part II of this article.

  13. Simulations

Monte-Carlo simulations are used in many contexts: to produce high quality pseudo-random numbers, in complex settings such as multi-layer spatio-temporal hierarchical Bayesian models, to estimate parameters (see picture below), to compute statistics associated with very rare events, or even to generate large amount of data (for instance cross and auto-correlated time series) to test and compare various algorithms, especially for stock trading or in engineering.
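A tiny example of the statistics-of-rare-events use: estimating a tail probability by simulation and checking it against the known answer:

```r
set.seed(1)
n <- 1e5

# P(X > 2) for X ~ N(0, 1): Monte-Carlo estimate vs. the exact value
draws <- rnorm(n)
mc    <- mean(draws > 2)
exact <- pnorm(2, lower.tail = FALSE)

c(monte_carlo = mc, exact = exact)  # the two should agree to a few decimals
```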

  14. Churn Analysis

Customer churn analysis helps you identify and focus on higher value customers, determine what actions typically precede a lost customer or sale, and better understand what factors influence customer retention. Statistical techniques involved include survival analysis (see Part I of this article) as well as Markov chains with four states: brand new customer, returning customer, inactive (lost) customer, and re-acquired customer, along with path analysis (including root cause analysis) to understand how customers move from one state to another, to maximize profit. Related topics: customer lifetime value, cost of user acquisition, user retention.

  15. Inventory Management

Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell as well as the overseeing and controlling of quantities of finished products for sale. Inventory management is an operations research technique leveraging analytics (time series, seasonality, regression), especially for sales forecasting and optimum pricing - broken down per product category, market segment, and geography. It is strongly related to pricing optimization (see item #17).  This is not just for brick and mortar operations: inventory could mean the amount of available banner ad slots on a publisher website in the next 60 days, with estimates of how much traffic (and conversions) each banner ad slot is expected to deliver to the potential advertiser. You don’t want to over-sell or under-sell this virtual inventory, and thus you need good statistical models to predict the web traffic and conversions (to pre-sell the inventory), for each advertiser category.

  16. Optimum Bidding

This is an example of automated, black-box, machine-to-machine communication system, sometimes working in real time, via various API’s. It is backed by statistical models. Applications include detecting and purchasing the right keywords at the right price on Google AdWords, based on expected conversion rates for millions of keywords, most of them having no historical data; keywords are categorized using an indexation algorithm (see item #18 in this article) and aggregated into buckets (categories) to get some historical data with statistical significance, at the bucket level. This is a real problem for companies such as Amazon or eBay. Or it could be used as the core algorithm for automated high frequency stock trading.

  17. Optimum Pricing

While at first glance it sounds like an econometric problem handled with efficiency curves, or even a pure business problem, it is highly statistical in nature. Optimum pricing takes into account available and predicted inventory, production costs, prices from competitors, and profit margins. Price elasticity models are often used to determine how high prices can be boosted before reaching strong resistance. Modern systems offer prices-on-demand, in real time, for instance when booking a flight or a hotel room. User-dependent pricing - a way to further optimize pricing by offering different prices based on user segment - is a controversial issue. It is accepted in the insurance industry: bad car drivers pay more than good ones for the same coverage, and smokers / women / old people pay a different fee for healthcare insurance (the only price discrimination allowed by Obamacare).

  18. Indexation

Any system based on taxonomies uses an indexation algorithm, created to build and maintain the taxonomy. For instance product reviews (both products and reviewers must be categorized using an indexation algorithm, then mapped onto each other), scoring algorithms to detect the top people to follow in a specific domain (click here for details), digital content management (click here for details, read part 2), and of course search engine technology. Indexation is a very efficient clustering algorithm, and the time needed to index massive amounts of content grows linearly - that is, very fast - with the size of your dataset. Basically, it relies on a few hundred categories manually selected after parsing tons of documents, extracting billions of keywords, filtering them, producing a keyword frequency table, and focusing on top keywords. Indexation is also used in systems that provide related keywords associated with user-entered keywords, for instance in this example.

Last but not least, an indexation algorithm can be used to automatically create an index for any document - report, article, blog, website, data repository, metadata, catalog, or book. Indeed, that’s the origin of the word indexation. Surprisingly, publishers still pay people today for indexing jobs: you can find these jobs listed on the American Society for Indexing website. This is an opportunity for data scientist entrepreneurs: offering publishers a software that does this job automatically, at a fraction of the cost.

  19. Search Engines

Good search engine technology relies heavily on statistical modeling. Enterprise search engines help companies - for instance Amazon - sell their products, by providing users with an easy way to find them. Our own Data Science Central search is of high quality (superior to Google search), and one of the most used features on our website. The core algorithm used in any search engine is an indexation (see item #18 in this article) or automated tagging system. Google search could be improved as follows: (1) eliminate page rank - this algorithm has been fooled by cheaters developing link farms and other web spam; (2) add new content to the index more frequently, to make search results less static, less frozen in time; (3) show more relevant articles using better user / search keyword / landing page matching algorithms, which ultimately means better indexation systems; and (4) use better attribution models to show the source of an article, not copies published on LinkedIn or elsewhere (this could be as simple as putting more weight on small publishers and identifying the first occurrence of an article, that is, time-stamp detection and management).

  20. Cross-Selling

Usually based on collaborative filtering algorithms, the idea is to find - especially in retail - which products to sell to a client based on recent purchases or interests. For instance, trying to sell engine oil to a customer buying gasoline. In banking, a company might want to sell several services: a checking account first, then a saving account, then a business account, then a loan and so on, to a specific customer segment. The challenge is to identify the correct order in which products must be promoted, the correct customer segments, and the optimum time lag between the various promotions. Cross-selling is different from up-selling.

  21. Clinical Trials

Clinical trials are experiments done in clinical research, usually involving small data. Such prospective biomedical or behavioral research studies on human participants are designed to answer specific questions about biomedical or behavioral interventions, including new treatments and known interventions that warrant further study and comparison. Clinical trials generate data on safety and efficacy. Major concerns include how test patients are sampled (especially if they are compensated), conflict of interests in these studies, and the lack of reproducibility.

  22. Multivariate Testing

Multivariate testing is a technique for testing a hypothesis in which multiple variables are modified. The goal is to determine which combination of variations performs the best out of all of the possible combinations. Websites and mobile apps are made of combinations of changeable elements that are optimized using multivariate testing. This involves careful design-of-experiment, and the tiny, temporary difference (in yield or web traffic) between two versions of a webpage might not have statistical significance. While ANOVA and tests of hypotheses are used by industrial or healthcare statisticians for multivariate testing, we have developed systems that are model-free, data-driven, based on data binning and model-free confidence intervals (click here and here for details). Stopping a multivariate testing experiment (they usually last 14 days for web page optimization) as soon as the winning combination is identified helps save a lot of money. Note that external events - for instance a holiday or a server outage - can impact the results of multivariate testing, and need to be addressed.

  23. Queuing Systems

A queue management system is used to control queues. Queues of people form in various situations and locations in a queue area, for instance in a call center. The process of queue formation and propagation is defined as queuing theory. Arrival of people in a queue is typically modeled using a Poisson process, with time to serve a client modeled using an exponential distribution. While being a statistical problem, it is considered to be part of operations research. 
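That M/M/1 setup (Poisson arrivals at rate lambda, exponential service at rate mu, one server) can be simulated directly in base R:

```r
set.seed(7)
n      <- 5000
lambda <- 0.8   # arrival rate (per minute)
mu     <- 1.0   # service rate; lambda < mu keeps the queue stable

arrive  <- cumsum(rexp(n, lambda))  # Poisson process: exponential inter-arrival gaps
service <- rexp(n, mu)

start <- finish <- numeric(n)
for (i in 1:n) {
  start[i]  <- max(arrive[i], if (i > 1) finish[i - 1] else 0)  # wait for the server
  finish[i] <- start[i] + service[i]
}

mean(start - arrive)  # average wait; M/M/1 theory: lambda / (mu * (mu - lambda)) = 4
```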

  24. Supply Chain Optimization

Supply chain optimization is the application of processes and tools to ensure the optimal operation of a manufacturing and distribution supply chain. This includes the optimal placement of inventory (see item #15 in this article) within the supply chain, minimizing operating costs (including manufacturing costs, transportation costs, and distribution costs). This often involves the application of mathematical modelling techniques such as graph theory to find optimum delivery routes (and optimum locations of warehouses), the simplex algorithm, and Monte Carlo simulations. Read 21 data science systems used by Amazon to operate its business for typical applications. Again, despite being heavily statistical in nature, this is considered to be an operations research problem.


Statistics / Econometrics

When to use Fixed effects 

Use fixed-effects (FE) whenever you are only interested in analyzing the impact of variables that vary over time.

FE explore the relationship between predictor and outcome variables within an entity (country, person, company, etc.). Each entity has its own individual characteristics that may or may not influence the predictor variables (for example, being a male or female could influence the opinion toward certain issue; or the political system of a particular country could have some effect on trade or GDP; or the business practices of a company may influence its stock price).

When using FE we assume that something within the individual may impact or bias the predictor or outcome variables and we need to control for this. This is the rationale behind the assumption of the correlation between entity’s error term and predictor variables. FE remove the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable.

Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different, therefore the entity’s error term and the constant (which captures individual characteristics) should not be correlated with the others. If the error terms are correlated, then FE is not suitable, since inferences may not be correct, and you need to model that relationship (probably using random effects); this is the main rationale for the Hausman test (presented later on in this document).

For fixed effects, as Doug noted: when you look at fifty states, how likely is it that you pulled fifty random states? For random effects, by contrast, that is the assumption. How likely is it that the DV is random/stochastic?
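The FE / RE / Hausman workflow above, sketched with the plm package and its built-in Grunfeld investment panel:

```r
library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")   # fixed effects
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")   # random effects

phtest(fe, re)  # Hausman test: a small p-value favours fixed effects
```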

I’ll stay away from code examples myself as they seem rather shop-specific and thus best designed locally, but if you want some questions, here you go. These questions are intentionally difficult and are more on the statistics/modeling side than the data processing side. That’s important, but someone else would be better poised to write those questions.

You might want “I don’t know, but what I would do is read the following sources….” to be part of your accepted answers, as that’s partly testing honesty and forthrightness of the candidate. The last thing an organization needs is bullshit artists who overpromise what they can do or just make things up.

You really want to be wary of wanting the “unicorn hire.”

Note: These aren’t definitive or even representative and reflect my own areas of expertise. These are prototype questions; you should alter/edit them or formulate your own. You should add details to the questions to deal with the data types you typically deal with.

Your organization needs to define what people being hired into the job actually need to know how to do, and ask about that. If they’re not doing a lot of cluster analysis, why would you ask about that? If the person is mostly doing data management/cleaning, primarily ask about that.

  • Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and LASSO?

  • Explain what a local optimum is and why it is important in a specific context, such as k-means clustering. What are specific ways for determining if you have a local optimum problem? What can be done to avoid local optima?

  • Assume you need to generate a predictive model of a quantitative outcome variable using multiple regression. Explain how you intend to validate this model.

  • Explain what precision and recall are. How do they relate to the ROC curve?

  • Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?

  • What is latent semantic indexing? What is it used for? What are the specific limitations of the method?

  • What is the Central Limit Theorem? Explain it. Why is it important? When does it fail to hold?

  • What is statistical power?

  • Explain what resampling methods are and why they are useful. Also explain their limitations.

  • Explain the differences between artificial neural networks with softmax activation, logistic regression, and the maximum entropy classifier.

  • Explain selection bias (with regards to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

  • Provide a simple example of how an experimental design can help answer a question about behavior. For instance, explain how an experimental design can be used to optimize a web page. How does experimental data contrast with observational data?

  • Explain the difference between “long” and “wide” format data. Why would you use one or the other?

  • Is mean imputation of missing data acceptable practice? Why or why not?

  • Explain Edward Tufte’s concept of “chart junk.” 

  • What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what you would do if you found them in your dataset.

  • What is principal components analysis (PCA)? Explain the sorts of problems you would use PCA for. Also explain its limitations as a method.

  • You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test (even graphically) whether your expectations are borne out?

  • Explain what a false positive and a false negative are. Why is it important to differentiate these from each other? Provide examples of situations where (1) false positives are more important than false negatives, (2) false negatives are more important than false positives, and (3) these two types of errors are about equally important.

  • Explain likely differences encountered between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problems do they bring?

  • Ask someone to explain what the terms “p-value” and “confidence interval” mean, with an example of their own choosing, in a way that a non-technical person with only high-school maths could understand.

  • What is a gold standard?

  • Believe it or not, there are data scientists (even at very senior levels) who claim to know a hell of a lot about supervised machine learning and yet know nothing about what a gold standard is!

  • What is the difference between supervised learning and unsupervised learning? Give concrete examples.

  • What does NLP stand for?

  • Some data scientists claim to also do NLP.  

  • Write code to count the number of words in a document using any programming language. Now, extend this for bi-grams.

  • I have seen a senior level data scientist who actually struggled to implement this. 

  • What are feature vectors?

  • When would you use SVMs vs. random forests, and why?

  • What is your definition of Big Data, and what is the largest size of data you have worked with? Did you parallelize your code?

  • If their notion of big data is just volume - you may have a problem. Big Data is more than just volume of data. If the largest size of data they have worked with is 5MB - again you may have a problem.

  • How do you work with large data sets? If the answer only comes down to Hadoop, it clearly shows that their view of solving problems is extremely narrow. Large data problems can be solved with:

    1. efficient algorithms
    2. multi-threaded applications
    3. distributed programming
    4. more…
  • Write a mapper function to count word frequencies (even if it’s just pseudocode)

  • Write a reducer function for counting word frequencies (even if it’s just pseudocode)
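For what it’s worth, here is one way to sketch the word-count questions above in R (the mapper/reducer pair is simulated in memory, not run on an actual Hadoop cluster; the sample sentence is made up):

```r
# Count single-word frequencies in a document.
count_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]  # drop empty tokens
  table(words)
}

# Extend to bi-grams by pairing each word with its successor.
count_bigrams <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(table(character(0)))
  table(paste(head(words, -1), tail(words, -1)))
}

# MapReduce-style pair: the mapper emits (word, 1) pairs,
# the reducer sums the values collected for one key.
mapper  <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  lapply(words[nzchar(words)], function(w) list(key = w, value = 1L))
}
reducer <- function(key, values) sum(unlist(values))

doc <- "the quick brown fox jumps over the lazy dog the fox"
wc  <- count_words(doc)
wc["the"]  # 3
wc["fox"]  # 2
```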

You can choose your path – but this is probably what I would do:


Introduction to Computer Science and Programming using Python – edX.org

Intro to Data Science – Udacity

Workshop videos from PyCon and SciPy – some of them are mentioned here

Selectively pick from the vast tutorials available on the net in the form of IPython notebooks


The Analytics Edge – edX.org

Pick out a few courses from Data Science specialization to complement Analytics Edge

Other courses (applicable for both the stacks):

Machine Learning from Andrew Ng – Coursera

Statistics course on Udacity

Introduction to Hadoop and MapReduce on Udacity




  • Learning Python and/or R is good. Learn to do everything in those, from data cleaning to visualizations and modeling.

  • Get some actual CS knowledge. You should take algorithms courses, either at your university or on Coursera if you’re out of school. Also probably data structures. You should be able to tell me the rough big-O of an algorithm easily and know basic data structures and the algorithms around them.

  • Learn some real software engineering. The best way to do that is to participate on a large open source project, or to translate a non-small (>5k lines of code) project from one language to another (that’s how I did it). Get some knowledge and opinions on design patterns.

Database stuff

  • Learn some SQL. It’s not hard. Get yourself a dataset from Google’s BigQuery (using the free trial) with a SQL query.


  • Learn yourself some time series. See this. Fit some time series models to FRED data.

  • Learn some numerical optimization. Learn gradient-based methods (Newton, gradient descent, BFGS, etc.) and non-gradient-based methods (Nelder-Mead, Powell, etc.). Learn where each is useful.

  • Learn the method of maximum likelihood. Do an OLS regression by maximum likelihood on whatever data (hint: you minimize the sum of squared errors)

  • Find a dataset on kaggle.com where the outcome variable of interest is better fit by a Poisson (or whatever non-normal distribution) than a Gaussian. Why does that distribution fit better? Code the likelihood function yourself by hand and maximize it (no packages!). Example of MLE from scratch. Learn where common distributions are useful.

  • Implement a finite mixture model, by hand, the same way as above.

  • Learn the typical microeconometrics stuff – See Mostly Harmless Econometrics
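The “OLS by maximum likelihood” exercise above can be sketched in base R on the built-in cars dataset. Under Gaussian errors the MLE of the coefficients coincides with OLS (minimizing the sum of squared errors), so the estimates should match lm().

```r
# Negative Gaussian log-likelihood for a simple linear model.
neg_loglik <- function(par, y, x) {
  b0 <- par[1]; b1 <- par[2]
  sigma <- exp(par[3])  # optimize log(sigma) so sigma stays positive
  -sum(dnorm(y, mean = b0 + b1 * x, sd = sigma, log = TRUE))
}

# Reasonable starting values help optim converge.
start <- c(mean(cars$dist), 0, log(sd(cars$dist)))
fit <- optim(start, neg_loglik, y = cars$dist, x = cars$speed,
             method = "BFGS")

# Compare the MLE coefficients to lm()'s OLS estimates.
rbind(mle = fit$par[1:2],
      ols = unname(coef(lm(dist ~ speed, data = cars))))
```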


  • Go through an introductory ML course. Examples are this, this. You should know a few basic classification algorithms, like naive Bayes, LDA/QDA, SVM, logit, multinomial and ordered probit/logit, and be able to tell where each performs well. You should also know about loss functions, regularization, and the bias/variance tradeoff.

  • Learn some clustering. What’s the practical difference between a finite mixture model and clustering, then regressing on each cluster?

  • Learn some decision tree based methods

  • Learn some neural nets. Implement an MNIST classifier in TensorFlow.
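The clustering bullet can be tried in a couple of lines with base R’s kmeans() on the built-in iris measurements; the species labels are held out (this is unsupervised) and only compared to the clusters after the fact.

```r
# k-means on the four numeric iris columns; nstart re-runs the algorithm
# from several random starts to avoid a bad local optimum.
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

km$size                          # points per cluster
table(km$cluster, iris$Species)  # clusters vs. the true species
```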

The model above is fit using the lm() function in R, and the output is produced by calling the summary() function on the model.

Below we define and briefly explain each component of the model output:

Formula Call

As you can see, the first item shown in the output is the formula R used to fit the data. Note the simplicity in the syntax: the formula just needs the predictor (speed) and the target/response variable (dist), together with the data being used (cars).
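As a concrete sketch, the figures quoted throughout this section match R’s built-in cars data. One note: the 42.98 intercept discussed below corresponds to a mean-centered speed variable; with raw speed the intercept is -17.58, while the slope and every other statistic are unchanged.

```r
# Center speed so the intercept is the expected stopping distance
# at the average speed in the dataset.
cars$speed_c <- cars$speed - mean(cars$speed)
model <- lm(dist ~ speed_c, data = cars)
summary(model)

round(coef(model), 4)  # intercept 42.98, slope 3.9324
```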


The next item in the model output talks about the residuals. Residuals are essentially the difference between the actual observed response values (distance to stop, dist, in our case) and the response values that the model predicted. The Residuals section of the model output breaks it down into 5 summary points. When assessing how well the model fit the data, you should look for a distribution across these points that is symmetrical around zero. In our example, the distribution of the residuals does not appear to be strongly symmetrical. That means that the model predicts certain points that fall far away from the actual observed points. We could take this further and plot the residuals to see whether they are normally distributed, etc., but we will skip that for this example.


The next section in the model output talks about the coefficients of the model. Theoretically, in simple linear regression, the coefficients are two unknown constants that represent the intercept and slope terms in the linear model. If we wanted to predict the Distance required for a car to stop given its speed, we would get a training set and produce estimates of the coefficients to then use it in the model formula. Ultimately, the analyst wants to find an intercept and a slope such that the resulting fitted line is as close as possible to the 50 data points in our data set.

Coefficient - Estimate

The coefficient Estimate contains two rows; the first one is the intercept. The intercept, in our example, is essentially the expected value of the distance required for a car to stop when we consider the average speed of all cars in the dataset. In other words, it takes an average car in our dataset 42.98 feet to come to a stop. The second row in the Coefficients is the slope, or in our example, the effect speed has on the distance required for a car to stop. The slope term in our model says that for every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324088 feet.

Coefficient - Standard Error

The coefficient Standard Error measures the average amount that the coefficient estimate would vary if we re-estimated the model on fresh samples; we’d ideally want it to be small relative to its coefficient. In our example, we’ve previously determined that for every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324088 feet. The Standard Error tells us that this estimate could vary by about 0.4155128 feet if we ran the model again and again. The Standard Errors can also be used to compute confidence intervals and to statistically test the hypothesis that a relationship between speed and stopping distance exists.

Coefficient - t value

The coefficient t-value is a measure of how many standard errors our coefficient estimate is away from 0. We want it to be far from zero, as this would indicate we could reject the null hypothesis; that is, we could declare that a relationship between speed and distance exists. In our example, the t-statistic values are relatively far from zero and large relative to the standard error, which suggests a relationship exists. In general, t-values are also used to compute p-values.

Coefficient - Pr(>|t|)

The Pr(>|t|) entry in the model output is the probability of observing any value equal to or larger than |t| under the null hypothesis. A small p-value indicates that it is unlikely we would observe a relationship between the predictor (speed) and response (dist) variables due to chance. Typically, a p-value of 5% or less is a good cut-off point. In our model example, the p-values are very close to zero. Note the ‘signif. codes’ associated with each estimate. Three stars (or asterisks) represent a highly significant p-value. Consequently, a small p-value for the intercept and the slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between speed and distance.

Residual Standard Error

Residual Standard Error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. Due to the presence of this error term, we are not capable of perfectly predicting our response variable (dist) from the predictor (speed). The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line. In our example, the actual distance required to stop can deviate from the true regression line by approximately 15.3795867 feet, on average. In other words, given that the mean distance for all cars to stop is 42.98 and that the Residual Standard Error is 15.3795867, we can say the percentage error (by which any prediction would still be off) is 35.78%. It’s also worth noting that the Residual Standard Error was calculated with 48 degrees of freedom. Simplistically, degrees of freedom are the number of data points that went into the estimation of the parameters, after accounting for those parameters (the restriction). In our case, we had 50 data points and two parameters (intercept and slope).

Multiple R-squared, Adjusted R-squared

The R-squared statistic (R²) provides a measure of how well the model fits the actual data. It takes the form of a proportion of variance. R² is a measure of the linear relationship between our predictor variable (speed) and our response/target variable (dist). It always lies between 0 and 1 (a number near 0 represents a regression that does not explain the variance in the response variable well, and a number close to 1 does explain the observed variance in the response variable). In our example, the R² we get is 0.6510794, so roughly 65% of the variance found in the response variable (dist) can be explained by the predictor variable (speed). Step back and think: if you were able to choose any metric to predict the distance required for a car to stop, would speed be one, and would it be an important one that could help explain how distance varies? It’s easy to see that the answer would almost certainly be yes. That’s why we get a relatively strong R². Nevertheless, it’s hard to define what level of R² is appropriate to claim the model fits well. Essentially, it will vary with the application and the domain studied.

A side note: in multiple regression settings, R² will always increase as more variables are included in the model. That’s why the adjusted R² is the preferred measure, as it adjusts for the number of variables considered.


The F-statistic is a good indicator of whether there is a relationship between our predictor and the response variables. The further the F-statistic is from 1, the better. However, how much larger it needs to be depends on both the number of data points and the number of predictors. Generally, when the number of data points is large, an F-statistic only a little larger than 1 is already sufficient to reject the null hypothesis (H0: there is no relationship between speed and distance). Conversely, if the number of data points is small, a large F-statistic is required to ascertain that there may be a relationship between predictor and response variables. In our example, the F-statistic is 89.5671065, which is quite a bit larger than 1 given the size of our data.
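All of the quantities discussed in this section can be pulled straight out of the summary object (again using R’s built-in cars data, which these figures match):

```r
model <- lm(dist ~ speed, data = cars)
s <- summary(model)

s$sigma          # residual standard error: ~15.38
s$df[2]          # residual degrees of freedom: 48 (50 points - 2 parameters)
s$r.squared      # ~0.6511
s$adj.r.squared  # adjusted R-squared
s$fstatistic[1]  # ~89.57
```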

Research Tools



15 Education Search Engines      

  1. What is the bias-variance trade-off?


“Bias is error introduced in your model due to oversimplification of the machine learning algorithm.” It can lead to underfitting. When you train your model, it makes simplified assumptions to make the target function easier to understand.

Low-bias machine learning algorithms: decision trees, k-NN, and SVM

High-bias machine learning algorithms: linear regression, logistic regression


“Variance is error introduced in your model due to a complex machine learning algorithm: your model learns noise from the training dataset and performs badly on the test dataset.” It can lead to high sensitivity and overfitting.

Bias-variance trade-off:

The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.

  1. The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.

  2. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by adjusting the C parameter, which controls how many violations of the margin are tolerated in the training data: a smaller C allows more violations, which increases the bias but decreases the variance.

There is no escaping the relationship between bias and variance in machine learning.

Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
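A small base-R simulation (made-up noisy sine data) makes the trade-off visible: a straight line underfits (high bias), while a degree-12 polynomial chases the training noise (high variance) and does worse out of sample.

```r
set.seed(42)
x_train <- runif(30, 0, 2 * pi)
y_train <- sin(x_train) + rnorm(30, sd = 0.3)
x_test  <- runif(200, 0, 2 * pi)
y_test  <- sin(x_test) + rnorm(200, sd = 0.3)

train <- data.frame(x = x_train, y = y_train)
mse <- function(fit, x, y) mean((y - predict(fit, data.frame(x = x)))^2)

simple  <- lm(y ~ x, data = train)            # high bias: cannot bend
complex <- lm(y ~ poly(x, 12), data = train)  # high variance: fits the noise

mse(simple,  x_train, y_train)  # large training error (underfit)
mse(complex, x_train, y_train)  # small training error (overfit)
mse(complex, x_test,  y_test)   # error climbs back up on new data
```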

  1. What are exploding gradients?

“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of weights can become so large as to overflow and result in NaN values.

This has the effect of making your model unstable and unable to learn from your training data. Now let’s understand what a gradient is.


The gradient is the direction and magnitude calculated during training of a neural network; it is used to update the network weights in the right direction and by the right amount.
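A toy numeric sketch of that update rule: gradient descent on f(w) = w², whose gradient is 2w. A modest step size shrinks w toward the minimum; too large a step overshoots on every update and |w| grows without bound, which is the same runaway dynamic behind exploding gradients.

```r
descend <- function(lr, steps = 20, w = 5) {
  for (i in seq_len(steps)) {
    grad <- 2 * w       # gradient of w^2
    w <- w - lr * grad  # the standard update: step against the gradient
  }
  w
}

descend(lr = 0.1)  # converges toward the minimum at 0
descend(lr = 1.1)  # each step multiplies w by -1.2, so it explodes
```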

  1. What is the difference between supervised and unsupervised machine learning?

Supervised Machine learning:

Supervised machine learning requires labeled training data.

Unsupervised Machine learning:

Unsupervised machine learning doesn’t require labeled data.

  1. How does the KNN supervised machine learning algorithm work?

  2. What is a confusion matrix ?

The confusion matrix is a 2×2 table that contains the 4 outputs produced by a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it.

A dataset used for performance evaluation is called test dataset. It should contain the correct labels and predicted labels.


The predicted labels will be exactly the same if the performance of the binary classifier is perfect.


The predicted labels usually match only part of the observed labels in real-world scenarios.


A binary classifier predicts all data instances of a test dataset as either positive or negative. This produces four outcomes:

  1. True positive - Correct positive prediction

  2. False positive - Incorrect positive prediction

  3. True negative - Correct negative prediction

  4. False negative - Incorrect negative prediction


Basic measures derived from the confusion matrix

  1. Error Rate = (FP+FN)/(P+N)

  2. Accuracy = (TP+TN)/(P+N)

  3. Sensitivity (Recall or True positive rate) = TP/P

  4. Specificity (True negative rate) = TN/N

  5. Precision (Positive predicted value) = TP/(TP+FP)

  6. F-Score (weighted harmonic mean of precision and recall) = (1+b²)(PREC·REC)/(b²·PREC+REC), where b is commonly 0.5, 1, or 2.
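The measures above, computed from scratch in base R on a small made-up pair of observed and predicted binary label vectors (1 = positive):

```r
observed  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

TP <- sum(predicted == 1 & observed == 1)  # true positives:  3
FP <- sum(predicted == 1 & observed == 0)  # false positives: 1
TN <- sum(predicted == 0 & observed == 0)  # true negatives:  5
FN <- sum(predicted == 0 & observed == 1)  # false negatives: 1

error_rate  <- (FP + FN) / (TP + TN + FP + FN)  # 0.2
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.8
sensitivity <- TP / (TP + FN)                   # recall / TPR: 0.75
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)                   # 0.75
f1 <- 2 * precision * sensitivity / (precision + sensitivity)  # 0.75
```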


The as.* functions perform explicit coercion from one class to another




NAs are introduced when coercion is nonsensical (e.g. as.numeric("x") returns NA with a warning)

A list is a special kind of vector that can contain elements of different classes

x <- list(1, TRUE, 1+4i)

dim(x) is the dimension function and will give the dimensions of a matrix

matrix(nrow = 2, ncol = 3)

m <- 1:10

dim(m) <- c(2, 5)

cbind & rbind are column binding and row binding

factors are used to represent categorical data

unclass(x) will strip the factor class, taking it from male, female to the underlying integer codes 2, 1
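A quick illustration of both points:

```r
x <- factor(c("male", "female", "female", "male"))
levels(x)   # "female" "male" (levels are alphabetical by default)
unclass(x)  # the underlying integer codes: 2 1 1 2
```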


is.na() used to test objects to see if they are NA

is.nan() used to test objects to see if they are NaN 

NA values have a class as well

Unlike matrices, data frames can store different classes of objects in each column; matrices must have every element be the same class

data frames can have the special attribute row.names

read.table() or read.csv()

a data frame can be converted to a matrix by calling data.matrix()


read.table default separator is space

read.csv default separator is comma



con <- url("http://…")

x <- readLines(con)

[ always returns an object of the same class as the original

[[ used to extract elements of a list or a data frame; 

$ extract element of a list or a dataframe by name

df[,] # All Rows and All Columns

df[1,] # First row and all columns

df[1:2,] # First two rows and all columns

df[ c(1,3), ] # First and third row and all columns

df[1, 2:3] # First Row and 2nd and third column

df[1:2, 2:3] # First, Second Row and Second and Third Column

df[, 1] # Just First Column with All rows

df[,c(1,3)] # First and Third Column with All rows

x <- c(1,2,NA,3)

mean(x) # returns NA

mean(x, na.rm=TRUE) # returns 2                      

Regression Analysis with Correlation Matrix


metro <- read.csv("metro.csv", header=T, row.names=1)

fullModel <- lm(M_3_passengers~., data=metro)

partialModel <- lm(M_3_passengers~S_7_vehicles+S_8_passengers+T_11_passengers, data=metro)





library(reshape2)  # for melt(); with reshape2, the melted columns are Var1, Var2, value
library(ggplot2)

corMatrix <- round(cor(metro), 2)

meltedCor <- melt(corMatrix)

corPlot <- ggplot(data=meltedCor, aes(Var1, Var2, fill=value)) +
    geom_tile() +
    scale_fill_gradient2(low="red", mid="white", high="blue", limit=c(-1,1)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    geom_text(aes(label = value), color = "black", size = 4)



First published Jun 23, 2022