R Is Not So Hard! A Tutorial, Part 4: Fitting a Quadratic Model

Stage 2

In Part 3 we used the lm() command to perform least squares regressions. In Part 4 we will look at more advanced aspects of regression models and see what R has to offer.

One way of checking for non-linearity in your data is to fit a polynomial model and check whether it fits the data better than a linear model. However, you may also wish to fit a quadratic or higher-order model because you have reason to believe that the relationship between the variables is inherently polynomial in nature. Let's see how to fit a quadratic model in R.

We will use a data set of counts of a variable that is decreasing over time. Cut and paste the following data into your R workspace.

A <- structure(list(Time = c(0, 1, 2, 4, 6, 8, 9, 10, 11, 12, 13, 
14, 15, 16, 18, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30), 
Counts = c(126.6, 101.8, 71.6, 101.6, 68.1, 62.9, 45.5, 41.9, 
46.3, 34.1, 38.2, 41.7, 24.7, 41.5, 36.6, 19.6, 
22.8, 29.6, 23.5, 15.3, 13.4, 26.8, 9.8, 18.8, 25.9, 19.3)), .Names = c("Time", "Counts"),
row.names = c(1L, 2L, 3L, 5L, 7L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L, 29L, 30L, 31L),
class = "data.frame")

Let’s attach the entire dataset so that we can refer to all variables directly by name.

attach(A)
names(A)
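
As an aside, attach() is convenient, but it can cause confusion if your workspace already contains other objects called Time or Counts. An alternative, sketched below, is to skip attach() and pass the data frame to lm() through its data argument; either approach gives the same fit.

lm(Counts ~ Time, data = A)   # equivalent fit without attaching A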

First, let’s set up a linear model, though really we should plot first and only then perform the regression.

linear.model <- lm(Counts ~ Time)

We now obtain detailed information on our regression through the summary() command.

summary(linear.model)
Call:
lm(formula = Counts ~ Time)

Residuals:
   Min     1Q Median     3Q     Max 
-20.084 -9.875 -1.882   8.494 39.445 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)   
(Intercept) 87.1550     6.0186 14.481 2.33e-13 ***
Time         -2.8247     0.3318 -8.513 1.03e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.16 on 24 degrees of freedom
Multiple R-squared: 0.7512,   Adjusted R-squared: 0.7408 
F-statistic: 72.47 on 1 and 24 DF, p-value: 1.033e-08

The model explains over 74% of the variance, has highly significant coefficients for both the intercept and Time, and the overall model p-value is also highly significant. However, let's plot the counts over time and superpose our linear model.

plot(Time, Counts, pch=16, ylab = "Counts", cex.lab = 1.3, col = "red")

abline(linear.model, col = "blue")

[Figure: scatterplot of Counts against Time with the fitted linear regression line superposed]

Here the argument cex.lab = 1.3 produces axis labels of a nice size.
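
If you would like a more formal check of the curvature before moving on, a residual plot is useful: systematic curvature in the residuals suggests that a straight line is not enough. A minimal sketch, using the linear.model object we fitted above:

plot(Time, resid(linear.model), pch = 16, col = "red", ylab = "Residuals", cex.lab = 1.3)
abline(h = 0, lty = 2)   # systematic curvature of the residuals about this line suggests adding a quadratic term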

The model looks good, but we can see that the plot has curvature that is not explained well by a linear model. Now we fit a model that is quadratic in time. We create a variable called Time2 which is the square of the variable Time.

Time2 <- Time^2
quadratic.model <- lm(Counts ~ Time + Time2)

Note the syntax involved in fitting a linear model with two or more predictors. We include each predictor and put a plus sign between them.
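
As an aside, you do not have to create Time2 by hand. Two equivalent ways of requesting a quadratic term directly in the formula are sketched below (the object names are just for illustration). The I() wrapper protects ^ from being interpreted as formula syntax, while poly() fits an orthogonal polynomial: the coefficients are parameterised differently, but the fitted curve is identical.

quad.raw  <- lm(Counts ~ Time + I(Time^2))   # same coefficients as quadratic.model
quad.orth <- lm(Counts ~ poly(Time, 2))      # orthogonal polynomial; same fitted values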


summary(quadratic.model)
Call:
lm(formula = Counts ~ Time + Time2)

Residuals:
     Min       1Q   Median       3Q     Max 
-24.2649 -4.9206 -0.9519   5.5860 18.7728 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 110.10749   5.48026 20.092 4.38e-16 ***
Time         -7.42253   0.80583 -9.211 3.52e-09 ***
Time2         0.15061   0.02545   5.917 4.95e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.754 on 23 degrees of freedom
Multiple R-squared: 0.9014,   Adjusted R-squared: 0.8928
F-statistic: 105.1 on 2 and 23 DF, p-value: 2.701e-12

Our quadratic model is still a linear model in the statistical sense: it is linear in the parameters, with two predictors, one of which is the square of the other. Good as the linear model was, the quadratic model performs even better, explaining an additional 15% of the variance. A formal comparison of the two fits is sketched below; after that, we plot the quadratic model by setting up a grid of time values running from 0 to 30 seconds in increments of 0.1 s.
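
Because the linear model is nested within the quadratic model, the two fits can be compared with an F test via anova(); with a single added term, this test is equivalent to the t test on Time2 in the summary above.

anova(linear.model, quadratic.model)   # does adding Time2 significantly improve the fit?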

timevalues <- seq(0, 30, 0.1)
predictedcounts <- predict(quadratic.model, list(Time = timevalues, Time2 = timevalues^2))
plot(Time, Counts, pch=16, xlab = "Time (s)", ylab = "Counts", cex.lab = 1.3, col = "blue")
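
A note on the predict() call above: its second argument supplies new values for every predictor that appears in the model formula, and either a named list or a data frame works, as long as the names match the predictors. An equivalent form of the same call (the name newtimes is just for illustration) is:

newtimes <- data.frame(Time = timevalues, Time2 = timevalues^2)
predictedcounts <- predict(quadratic.model, newdata = newtimes)   # same values as before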

Now we add the quadratic model to the plot using the lines() command.

lines(timevalues, predictedcounts, col = "darkgreen", lwd = 3)

[Figure: scatterplot of Counts against Time with the fitted quadratic curve superposed]

The quadratic model appears to fit the data better than the linear model. We will look again at fitting curved models in our next blog post.


About the Author:
David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

See our full R Tutorial Series and other blog posts regarding R programming.


Comments

  1. Sergey says

    It's a bit difficult to find in the references how the predict() function works when given a list as its argument. There is a line in the post like predict(quadratic.model, list(Time=timevalues, Time2=timevalues^2)), and its meaning is a little unclear.

  2. Rens says

    Dear David,

    Much thanks for your tutorial.

    A question about one of your examples though:
    Time2 = Time^2
    [quadratic.model <-lm(Counts ~ Time + Time2)]

    Some resources seem to suggest orthogonalizing such a model, e.g.:
    [quadratic.model <- lm(Counts ~ poly(Time,2))]

    The difference being that the first version introduces strong collinearity between Time and Time2.

    Your insights are much appreciated. Thank you.

    Best wishes,
    Rens

  3. Todd says

    If I estimate a quadratic model (Y on X), I can estimate the derivative of Y with respect to X as b1 + 2*b2*X. I can then compute where the fitted curve reaches its maximum (or minimum) by setting that marginal effect equal to zero, giving X* = -b1/(2*b2).
    Since b1 and b2 are estimated, R can give me a p-value for the nonlinear combination. However, I'm not sure how to interpret it when it is significant versus not.
    Help?

    • Karen Grace-Martin says

      Hi Todd,

      It's not clear to me what hypothesis you're trying to test or what exactly you're trying to interpret. Is it whether X has an effect on Y in general? Or whether the curvature is significant, i.e., whether the effect of X on Y is quadratic rather than merely linear?

  4. Mia says

    How would I tackle a quadratic model if my x variables consist of dates in POSIXct? I am unable to square the values in this form, but I can when I switch them to numeric. I would prefer to keep them as dates in POSIXct if possible.

  5. Maria Eugenia says

    Do you know how to estimate the minimum value predicted for the quadratic model? And to estimate the precise value of time that would predict that minimum?

    Thank you
    Regards

  6. obu, obu Enang says

    Y <- data.frame(Distant = c(253, 337, 395, 451, 495, 534, 574),
    Height = c(100, 200, 300, 450, 600, 800, 1000)). How can quadratic and cubic models be run on these data in R?

  7. Kanisios Mukwashi says

    The quadratic fit works, but the problem is that the fitted curve extends beyond the data points until it touches the x-axis (on both ends of the fit), while my data points are somewhere in the middle of the plot. How do I confine the fitted curve to the region where there are data points?


  8. Natarajan says

    Thank you very much. I hit your site exactly when I needed it.
    I was sitting with a dataset that is not linear, and the results were not interesting at all.
    Many thanks.

  9. Floriano says

    Dear Karen,

    I just visited this interesting website, and wonder if you could help me with an error message I am receiving when I try to run a network meta-regression model in R studio. The error says: Regressor variable “hiv” not found. Thank you in advance for your time. Please let me know if you want me to send you the dataset and the code.

    All the best.
    Floriano

  10. Proquotient says

    For beginners who just got started with learning Data Science, languages such as R can be a little confusing but step by step tutorials such as these make the learning experience much better and easier. Thanks for sharing this great knowledge!

  11. David Lillis says

    Hello.

    You are quite correct. In this context, the word "linear" refers to the estimated parameters, and therefore models with quadratic or higher-order polynomial terms are in fact linear models.

    Applied statisticians and researchers trained in the natural and social sciences (rather than in statistics) often use the terms 'quadratic model', 'exponential model', etc. somewhat loosely in the context of regression (i.e. referring to linear or higher-order functions of the predictors rather than to linearity in the estimated parameters). Various texts (e.g. Statistics: An Introduction Using R, by Michael Crawley) and other works available on-line often refer to such models in this way.

    In writing blogs such as this one, I attempt to make the examples understandable to a wide variety of people, including those relatively new to statistical modelling and those new to R. I find it convenient to refer to such regression models as "quadratic models", "exponential models", etc. I believe that this approach makes the learning process easier for those new to statistical modelling, because those descriptors reflect the statistical models we are attempting to fit.

    The challenge for those new to R is to master the syntax and the various commands that are available in R, but very often I encounter people who are both new to R and learning statistics at the same time. Assisting them to achieve competence in R syntax is the objective of my blogs and workshops. For them, clarity and simplicity are paramount.

    Thank you for your comment. Your point is well made, and in future blogs and workshops in which I present material similar to that given in this blog, I will explain that when I refer to linear and quadratic models etc. I am referring to the form of the predictors.

    Best wishes,

    David

  12. Alberto says

    Regression models with polynomial variables are linear models. The “linear” in linear model refers to the parameters, not the variables. Models with quadratic, cubic, quartic, or higher order polynomial variables are therefore linear. The post seems a little misleading on this point.

