
Modeling


One-liners

  • “All models are wrong, but some are useful.” — George Box
  • “A model is a simplification or approximation of reality and hence will not reflect all of reality … While a model can never be “truth,” a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless.” — Ken Burnham and David Anderson
  • The best model is the simplest model that explains the data — Occam’s razor
  • Avoid Simpson’s paradox
  • Always be aware of overfitting

Background

  • dataset $d$, where $d = \{(x_i, y_i),\ i = 1, 2, \ldots, N\}$
  • $x_i$ is a vector of predictors, or “covariates”, taking its value in some space $\mathcal{X}$
  • response $y_i$
  • an algorithm wants to output a rule $r_d(x)$ where $\hat{y} = r_d(x)$; a rule is sometimes also called a hypothesis, with notation $h_d(x)$, or a model.
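
As a minimal sketch of this notation in code (the simulated data and the choice of scikit-learn's LinearRegression as the learner are illustrative assumptions, not anything prescribed above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# dataset d = {(x_i, y_i), i = 1, ..., N}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # each x_i is a covariate vector in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# the algorithm outputs a rule r_d(x), a.k.a. a hypothesis h_d(x), or a model
r_d = LinearRegression().fit(X, y)

# y_hat = r_d(x) for a new observation x
x_new = rng.normal(size=(1, 3))
y_hat = r_d.predict(x_new)
print(y_hat)
```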

Three major uses of modeling

  • For prediction
    • Given a new observation of $x$, use $\hat{y} = r_d(x)$ to predict $y$, e.g. email spam prediction
  • For estimation
    • Use the rule to describe a regression surface $\hat{S}$ over $\mathcal{X}$, where $\hat{S} = \{r_d(x),\ x \in \mathcal{X}\}$
    • For estimation, but not necessarily for prediction, we want $\hat{S}$ to accurately portray $S$, the true regression surface.
  • For explanation
    • The relative contribution of the different selected predictors to $r_d(x)$ is of interest to explain the response. How the regression surface is composed is of prime concern in this use, but not in the prediction or estimation uses.

The three different uses of $r_d(x)$ raise different inferential questions. Prediction use calls for estimates of prediction error. For estimation, the accuracy of $r_d(x)$ as a function of $x$, perhaps in standard deviation terms, $sd(x) = sd(\hat{y} \mid x)$, would tell how closely $\hat{S}$ approximates $S$. Explanation requires more elaborate inferential tools, saying for example which of the regression coefficients $\alpha_i$ can safely be set to zero. book
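
A hedged sketch of the three inferential questions on simulated data, assuming scikit-learn and statsmodels as tooling: cross-validated MSE for prediction error, the standard error of the fitted surface for estimation, and coefficient p-values for explanation.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)   # third covariate is irrelevant

# Prediction: estimate the prediction error, here via 5-fold cross-validated MSE
mse = -cross_val_score(LinearRegression(), X, y,
                       scoring="neg_mean_squared_error", cv=5).mean()

# Estimation and explanation with OLS
fit = sm.OLS(y, sm.add_constant(X)).fit()
sd_x = fit.get_prediction(sm.add_constant(X)).se_mean   # sd(y_hat | x): accuracy of S_hat

print(mse)
print(sd_x[:5])
print(fit.pvalues)   # which alpha_i can safely be set to zero
```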

Distributional modeling

Structural modeling

  • Generalized linear models for modeling the conditional distribution of the response variable $Y$ given the predictors $X$
  • Zero-inflated models for count data with excess zeros
    • structural zeros come from some other probabilistic process that prevents a non-zero outcome, e.g. the non-zero outcome failed to be captured, while sampling zeros come from the data generating process itself, i.e. actual zeros were observed. ref
    • it can be modeled as a mixture of data-generating processes, usually with a logistic regression model and the target model. example with ZOIB example2 example3
    • Zero-inflated means the main count model can itself produce zeros, whereas in a zero-augmented (hurdle) model the main model can only produce non-zeros.
    • A mixture between continuous and discrete is not really a mixture but rather a model with two outcomes/likelihoods (one binomial for the discrete zeros and one continuous for the rest). Since there is no crosstalk between the two components, one can model them separately or ignore one altogether (e.g. drop the zeros) without loss of information for the kept parameters. ref
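
A minimal sketch of a zero-inflated count model on simulated data, assuming statsmodels' ZeroInflatedPoisson: a logit component for the structural (excess) zeros mixed with a Poisson component that can also produce sampling zeros on its own.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
X = sm.add_constant(x)

# simulate: a logistic process producing structural zeros, plus a Poisson count
# process that can also produce (sampling) zeros on its own
p_structural_zero = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x)))
counts = rng.poisson(np.exp(0.5 + 0.6 * x))
y = np.where(rng.random(n) < p_structural_zero, 0, counts)

# mixture model: logit for the excess zeros, Poisson for the counts
model = ZeroInflatedPoisson(y, X, exog_infl=X, inflation="logit")
result = model.fit(maxiter=200, disp=False)
print(result.summary())
```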

  • Quantile regression for data with outliers, or when the mean is not of interest but the median or other quantiles are (a quantile regression sketch follows after this list). ref
  • Censored or truncated data models for correcting the underestimation of the parameter estimates, by updating the likelihood function with the knowledge that there is zero probability of observing the data beyond a certain threshold (see the censored-likelihood sketch after this list).
  • Two way fixed effects: Freedom, Hierarchies and Confounded Estimates
  • Probability modeling, i.e. using proportions as the target, is much better than classification with a binary target most of the time
  • Multi-output regression for forecasting or quantile regression. There are three approaches: one independent model per output (no interaction between submodels, i.e. independent ensembles), one model for multiple outputs (joint optimization, which might benefit from joint constraints), or a recursive model per sequential output, i.e. chained multi-output regression, using predicted values as X for the next steps (see the multi-output sketch after this list). In the M5 forecasting Kaggle competition, most of the winning solutions used one model per output. Note that the one-model-for-multiple-outputs approach is needed to generate multiple quantiles using the quantile loss function. ref blog
  • log-log regression: Whenever we consider the percentage change of a variable, we take a log transformation, due to the benefits of taking log. When we want to know the relationship describing the percentage change in one variable divided by that in another variable, we can take the log transformation of both variables and use linear regression. In this setting, the coefficient $\beta = \frac{dY/Y}{dX/X}$ would be exactly the required relationship, which is easy to interpret: for every one percent change in X, we get a $\beta$ percent change in Y. However, under this approach, the relationship is assumed to be constant. Price elasticity of demand would be a great use case (see the log-log sketch after this list). Interpretation of Regression Coefficients: Elasticity and Logarithmic Transformation
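
For the quantile regression bullet above, a minimal sketch assuming statsmodels' QuantReg and simulated heavy-tailed data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
y = 1.0 + 0.5 * x + rng.standard_t(df=2, size=500)   # heavy-tailed noise, i.e. outliers
X = sm.add_constant(x)

median_fit = sm.QuantReg(y, X).fit(q=0.5)   # median regression, robust to the outliers
upper_fit = sm.QuantReg(y, X).fit(q=0.9)    # 90th-percentile regression
print(median_fit.params)
print(upper_fit.params)
```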
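
For the censored data bullet, a sketch of updating the likelihood by hand for right-censored observations (a Tobit-style model; the simulated data, the threshold, and the use of scipy's optimizer are assumptions for illustration):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(size=n)   # latent outcome
c = 2.0
y = np.minimum(y_star, c)                      # right-censored at threshold c
censored = y_star >= c

def neg_loglik(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)
    mu = b0 + b1 * x
    # uncensored observations contribute the normal density,
    # censored ones contribute P(y* >= c) = 1 - Phi((c - mu) / sigma)
    ll_obs = stats.norm.logpdf(y[~censored], mu[~censored], sigma)
    ll_cens = stats.norm.logsf(c, mu[censored], sigma)
    return -(ll_obs.sum() + ll_cens.sum())

result = optimize.minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="BFGS")
print(result.x[:2])   # close to the true (1.0, 2.0); plain OLS on y would be attenuated
```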
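
For the multi-output regression bullet, a sketch contrasting one independent model per output with chained multi-output regression, assuming scikit-learn's MultiOutputRegressor and RegressorChain wrappers:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

X, Y = make_regression(n_samples=300, n_features=10, n_targets=3, random_state=0)

# one independent model per output (no interaction between submodels)
independent = MultiOutputRegressor(GradientBoostingRegressor()).fit(X, Y)

# chained multi-output regression: predictions for earlier outputs are fed
# as extra features to the later submodels
chained = RegressorChain(GradientBoostingRegressor(), order=[0, 1, 2]).fit(X, Y)

print(independent.predict(X[:2]))
print(chained.predict(X[:2]))
```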
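
For the log-log regression bullet, a sketch recovering a known price elasticity of demand from simulated data (the elasticity of -1.5 is an arbitrary assumption):

```python
import numpy as np
import statsmodels.api as sm

# simulated price/demand data with a constant elasticity of -1.5
rng = np.random.default_rng(4)
price = rng.uniform(1, 20, size=400)
demand = 100 * price ** -1.5 * np.exp(rng.normal(scale=0.1, size=400))

# regress log(demand) on log(price); the slope is the elasticity beta
fit = sm.OLS(np.log(demand), sm.add_constant(np.log(price))).fit()
print(fit.params[1])   # approximately -1.5: a 1% price increase gives ~1.5% less demand
```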

Modeling considerations

  • Number of observations vs number of features: linear regression requires more observations than features
  • Assumed relationship between the features and the target: linear regression only works if the relationship is linear
  • Correlation between the features: linear regression assumes the features are not strongly correlated with each other (multicollinearity inflates the variance of the coefficient estimates)
  • Interpretability

Frequentist vs Bayesian vs Machine Learning

The differences could be summarized in terms of assumptions:

  • Frequentist: It assumes asymptotic properties, caring about the correctness of the procedures or experiments in the long run, and treats the true parameters as fixed. When a specific procedure is chosen, stricter assumptions from that model are made, e.g. the z-test assumes the population variance is known, or that the sampling distribution follows a normal distribution, i.e. the central limit theorem holds, while the t-test also assumes normality, iid samples, and homogeneity of variances; details are explained in t-test violations. The assumptions are so strict that, if there is any irregularity, e.g. missing data, outliers, or measurement errors, such a noisy data environment can be a challenge for traditional methods that rely solely upon the data to draw conclusions. why bayes
  • Bayesian: It assumes prior distributions, and treats the parameters as random, reflecting our uncertainty about the parameters of the model. With the prior assumptions, it allows modeling small-sample data, e.g. new products or rare events, while still quantifying the uncertainty. It also allows modeling complex data through hierarchical modeling, because the priors can serve as building blocks. Community support is a great plus practically: the open-source environment in Bayesian modeling is quite active, many cases can be studied, and many questions have already been answered.
  • Machine learning: It assumes the samples collected can reconstruct the data generating process, which represents the true distribution of the population. It also assumes that the suggested hypothesis can represent the data generating process and can be learned, given the data and computational resources.
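
A small sketch of the frequentist vs Bayesian contrast on the same small sample; the conjugate normal prior and the known noise variance are simplifying assumptions for illustration:

```python
import numpy as np
from scipy import stats

# small sample, e.g. early data on a new product
rng = np.random.default_rng(6)
data = rng.normal(loc=0.3, scale=1.0, size=8)

# Frequentist: a one-sample t-test about a fixed, unknown true mean
t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)

# Bayesian: treat the mean as random with a Normal(0, 1) prior (noise variance
# assumed known for simplicity); the conjugate posterior is again normal
n, xbar = len(data), data.mean()
prior_mean, prior_var, noise_var = 0.0, 1.0, 1.0
post_var = 1.0 / (1.0 / prior_var + n / noise_var)
post_mean = post_var * (prior_mean / prior_var + n * xbar / noise_var)

print(p_value)                 # a long-run error-rate statement about the procedure
print(post_mean, post_var)     # a full distribution over the parameter itself
```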

Why inflation forecasting is difficult

discussion

Readings

  • Efron, B., & Hastie, T. (2021). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.