Relationship between dependent and independent variables in linear regression

Linear regression attempts to forecast the value of a dependent variable from the value of an independent variable. It assumes that the relationship between the dependent and independent variables is linear. A simple linear regression relates a dependent variable to one independent variable by estimating that linear relationship.

A dependent variable is a variable predicted by the independent variable. It is also known as the explained variable or endogenous variable. On the other hand, the independent variable explains the variation in the dependent variable. It is also known as the exogenous variable, explanatory variable, or predicting variable.

The following is a simple linear regression equation:

$$Y=b_0+b_1X+\epsilon$$

Where:

\(Y\) = Dependent variable

\(b_0\) = Intercept

\(b_1\) = Slope coefficient

\(X\) = independent variable

\(\epsilon\) = Error term (Noise)

\(b_0\) and \(b_1\) are known as regression coefficients.

The error term is the part of the dependent variable that the independent variable cannot explain.
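The estimates of \(b_0\) and \(b_1\) can be computed directly from a sample. The following is a minimal Python sketch (the function and variable names are ours, not from the reading): the slope is the covariance of \(X\) and \(Y\) divided by the variance of \(X\), and the intercept places the fitted line through the point of means.

```python
# Least-squares estimates for Y = b0 + b1*X + eps (illustrative sketch;
# variable names mirror the regression equation above).
def fit_simple_linear_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of X and Y divided by variance of X.
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1
```

For example, `fit_simple_linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])` returns an intercept of 0.15 and a slope of 1.94.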

The figure below illustrates a simple linear regression model:

Example: Dependent and Independent Variables

Artur is using regression analysis to forecast inflation given unemployment data from 2011 to 2020.

The following table shows the relevant data from 2011 to 2020.

$$\small{\begin{array}{c|c|c}\textbf{Year} & \textbf{Unemployment Rate} & \textbf{Inflation rate}\\ \hline 2011 & 6.1\% & 1.7\%\\ \hline 2012 & 7.4\% & 1.2\%\\ \hline 2013 & 6.2\% & 1.3\%\\ \hline 2014 & 6.2\% & 1.3\%\\ \hline 2015 & 5.7\% & 1.4\%\\ \hline 2016 & 5.0\% & 1.8\%\\ \hline 2017 & 4.2\% & 3.3\%\\ \hline 2018 & 4.2\% & 3.1\%\\ \hline 2019 & 4.0\% & 4.7\%\\ \hline 2020 & 3.9\% & 3.6\%\\ \end{array}}$$

A scatter plot of the inflation rates against unemployment rates from 2011 to 2020 is shown in the following figure.

Which variable is the dependent variable?

Solution

Inflation is the dependent variable: it is the variable being predicted using the unemployment rates.
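Running the regression on the table's data makes the roles concrete. Below is a Python sketch (the computed coefficients are ours, not given in the reading): inflation is the dependent variable and unemployment the independent variable.

```python
# Regressing inflation (dependent) on unemployment (independent) using the
# 2011-2020 data from the table above, via ordinary least squares.
unemployment = [6.1, 7.4, 6.2, 6.2, 5.7, 5.0, 4.2, 4.2, 4.0, 3.9]
inflation = [1.7, 1.2, 1.3, 1.3, 1.4, 1.8, 3.3, 3.1, 4.7, 3.6]

n = len(unemployment)
mx = sum(unemployment) / n
my = sum(inflation) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(unemployment, inflation))
      / sum((x - mx) ** 2 for x in unemployment))
b0 = my - b1 * mx
```

With these data the estimated slope is negative (about \(-0.9\)), consistent with the inverse pattern between unemployment and inflation visible in the scatter plot.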

Question

The independent variable in a regression model is most likely the:

    A. Predicted variable.

    B. Predicting variable.

    C. Endogenous variable.

Solution

The correct answer is B.

An independent variable explains the variation of the dependent variable. It is also called the explanatory variable, exogenous variable, or predicting variable.

A and C are incorrect. A dependent variable is a variable predicted by the independent variable. It is also known as the predicted variable, explained variable, or endogenous variable.

Reading 0: Introduction to Linear Regression

LOS 0 (a) Describe a simple linear regression model and the roles of the dependent and independent variables in the model

Simple and multiple linear regression analyses are statistical methods used to investigate the link between the activity/property of active compounds and their structural chemical features. One assumption of linear regression is that the errors follow a normal distribution. This paper introduces a new approach to solving the simple linear regression in which no assumptions about the distribution of the errors are made. The proposed approach maximizes the probability of observing the sample according to the random error. Its use is illustrated on ten classes of compounds with different activities or properties. The proposed method proved reliable and was shown to fit the observed data properly compared to the conventional approach of assuming normally distributed errors.

1. Introduction

The quantitative structure-activity/property relationships (QSARs/QSPRs) are computational techniques that quantitatively relate chemical features (such as descriptors) to a biological activity or property [1]. Linear regression is one of the earliest methods [2] used to link the activity/property with structural information and is frequently used due to its relatively easy interpretation [3]. Sometimes, linear regression is misused because it is applied without investigation of its assumptions (such as linearity, independence of the errors, normality, homoscedasticity, and absence of multicollinearity [4]).

The error, “a measure of the estimated difference between the observed or calculated value of a quantity and its true value” [5], was first used in mathematics/statistics in 1726 in Astronomiae Physicae & Geometricae Elementa [6]. In the late 1800s, Adcock [7, 8] suggested that the fitted line must pass through the centroid of the data. The method proposed by Adcock, named orthogonal regression, explores the distance between a point and the line in a direction perpendicular to the line [7, 8]. Kummell [9] investigated directions other than perpendicular between the points and the line. The regression slope was described by Galton in 1894, based on an experiment with sweet pea seeds [10]. Two years later, Pearson generalized errors in the variables and published a rigorous description of correlation and regression analysis [11] (Pearson acknowledged the contribution of Bravais [12] to the mathematical formula of correlation). Due to its ability to produce best linear unbiased parameters [13], the coefficients of simple linear regression (SLR) models are estimated by minimizing the sum of squared deviations (least squares estimation, a method introduced by Legendre in 1805 [14] and applied by Gauss in 1809 [15]). Furthermore, Fisher introduced the concept of maximum likelihood within linear models [16, 17].

The generic equation of simple linear regression (1) between the observed dependent variable \(Y\) and the observed independent variable \(X\) is

$$\hat{Y}_i=a+bX_i \tag{1}$$

where \(a\) and \(b\) are unknown constant values (estimators of the statistical parameters of the simple linear regression), \(\hat{Y}_i\) is the value of the dependent variable estimated by the model, \(Y_i\) is the observed value of the dependent variable, and \(X_i\) is the observed value of the predictor variable.

The array used to estimate the residuals is given by the formula \(|\varepsilon_i|^k=|Y_i-\hat{Y}_i|^k\), where \(Y_i\) is the \(i\)th observation in the sample (\(1\le i\le n\), where \(n\) = sample size) and \(k\) is an unknown coefficient. The unknown coefficient \(k\) is an estimator of the power of the errors in the simple linear regression.

In the SLR-LS (simple linear regression least squares), the residuals \(\varepsilon_i\) (where \(\varepsilon_i=Y_i-\hat{Y}_i\)) follow the Gauss-Laplace distribution, with \(\mu\), \(\sigma\), and \(k\) being unknown statistical parameters:

$$GL(x;\mu,\sigma,k)=\frac{k}{2\sigma\Gamma(1/k)}e^{-\left(\frac{|x-\mu|}{\sigma}\right)^{k}} \tag{2}$$

where \(\mu\) is the population mean, \(\sigma\) is the population standard deviation, \(k\) is the power of the errors, and \(\Gamma\) is the gamma function.

The Gauss-Laplace distribution is symmetrical and has three statistical parameters (population mean, population standard deviation, and power of the errors) [15, 18], with two main particular cases. The first particular case is the Gauss distribution (\(k=2\)) [15], often observed on arrays of biochemical data [19–21], while the second particular case is the Laplace distribution (\(k=1\), with a mean of zero and variance \(2\sigma^2\)) [22, 23], commonly seen on astrophysical data [24, 25].
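For concreteness, the family can be sketched with the generalized normal (generalized error) density below. The paper's own formula did not survive extraction, so this specific parametrization is an assumption; it does, however, recover the Laplace density at \(k=1\) and a Gaussian density at \(k=2\).

```python
import math

# Density of the Gauss-Laplace (generalized normal) family with location mu,
# scale sigma, and error power k. This parametrization is an assumption,
# chosen so that k=1 gives the Laplace density and k=2 a Gaussian density.
def gauss_laplace_pdf(x, mu, sigma, k):
    norm = k / (2 * sigma * math.gamma(1 / k))  # normalizing constant
    return norm * math.exp(-((abs(x - mu) / sigma) ** k))
```

At \(k=1\), `gauss_laplace_pdf(0, 0, 1, 1)` equals \(1/2\), the Laplace density at its mode; at \(k=2\) the density at the mode is \(1/\sqrt{\pi}\).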

The problem of estimating the parameters of the SLR (1) for the first particular case (the Gauss distribution) considers residuals with \(k=2\) (where \(k\) is the power of the errors related to the experimental errors). The coefficients of regression for this particular case are obtained by solving a system of linear equations under the assumption that \(\varepsilon_i\sim N(0,\sigma^2)\) [26], i.e., by minimizing \(\sum_i(Y_i-a-bX_i)^2\), where \(a\) and \(b\) are the unknown parameters.

The second particular case is when the residuals follow the Laplace distribution (\(k=1\)). In view of the fact that the absolute value function “is not differentiable everywhere” [27], the solution is more difficult to obtain for this particular case.

One question can be asked: what is the proper value of \(k\) that should be used in the simple linear regression analysis (1)? A previous study showed that, for different sets of biologically active compounds, the distribution of the dependent variable can be approximated by the Gauss distribution (\(k=2\)) in only a relatively small number of cases when the whole Gauss-Laplace family is investigated [28]. Based on this result, the aim of the present study was to formulate the problem of solving the simple linear regression equation (1) without making any assumptions about the power of the errors \(k\).

2. Materials and Methods

2.1. Mathematical Approach

The problem of regression (1) is transformed into a problem of estimation if the residuals are introduced in (2) with a slight modification: in the quantity \(Y_i-(a+bX_i)\) the constants \(a\) and \(\mu\) are equivalent, and just one of them will be used further. The Gauss-Laplace distribution is symmetrical, and the observed mean is an unbiased estimator of the population mean \(\mu\). This can be expressed in terms of (1) as

$$\mu=M(Y-bX) \tag{3}$$

where \(\mu\) is the population mean of the Gauss-Laplace quantity (2), \(Y\) is the observed/measured dependent variable, \(X\) is the independent/predictor variable, and \(M\) is the mean operator. For a given array of paired observations \((X_i,Y_i)\), the problem of regression expressed in (1) is transformed into a problem of estimating the parameters of the Gauss-Laplace distribution:

$$Y_i-bX_i\sim GL(\mu,\sigma,k) \tag{4}$$

An efficient instrument for solving (4) is maximum likelihood estimation (MLE), a method proposed by Fisher [16, 17]. The main assumption of the MLE is that the array has been observed because it had the highest chance of being observed (simultaneously and independently). This translates into maximizing the joint probability of the sample:

$$\max_{b,\mu,\sigma,k}\prod_{i=1}^{n}GL(Y_i-bX_i;\mu,\sigma,k) \tag{5}$$

By including (4) in (5) and taking the natural logarithm, the problem presented in (1) becomes a problem of optimization:

$$\max_{b,\mu,\sigma,k}\;n\ln\frac{k}{2\sigma\Gamma(1/k)}-\sum_{i=1}^{n}\left(\frac{|Y_i-bX_i-\mu|}{\sigma}\right)^{k} \tag{6}$$

where \(n\) is the number of pairs.

The optimization problem presented in (6) can be solved iteratively if the starting point is a good initial solution (situated near the optimal solution). In this research, the starting point of the optimization was the solution of a particular case of (6):

$$k=2,\quad \mu=M(Y-bX),\quad \sigma^2=\mathrm{Var}(Y-bX) \tag{7}$$

where \(k\) is the power of the errors, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation, \(M\) is the average (central tendency operator), and \(\mathrm{Var}\) is the variance (dispersion operator).

2.2. Algorithm Implementation

The classical simple linear regression (SLR) uses the least squares method to estimate the \(b\), \(\mu\), and \(\sigma\) coefficients (see (7)), with the power of the errors fixed at 2. The supplementary material contains the program, implemented in PHP, that finds the solutions of (6) (maximum likelihood estimation, MLE) starting from the values of the coefficients identified by (7). The program makes small changes to the values of the coefficients and keeps the changed coefficients whenever they increase the MLE value.
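The described scheme (start from the least-squares solution with the power of the errors fixed at 2, then greedily perturb the coefficients to raise the likelihood) can be sketched as follows. The paper's implementation is in PHP; this Python version, including the function names, the generalized-normal density used for the residuals, the step size, and the iteration count, is our assumption.

```python
import math
import random

def log_likelihood(params, x, y):
    # Sum of log densities of the residuals under an assumed Gauss-Laplace
    # (generalized normal) family; params = (b, mu, sigma, k), with mu
    # playing the role of the intercept.
    b, mu, sigma, k = params
    if sigma <= 0 or k <= 0:
        return float("-inf")
    norm = math.log(k / (2 * sigma * math.gamma(1 / k)))
    return sum(norm - (abs(yi - b * xi - mu) / sigma) ** k
               for xi, yi in zip(x, y))

def mle_fit(x, y, steps=20000, seed=0):
    # Starting point: the least-squares solution with k = 2.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    mu = my - b * mx
    resid = [yi - b * xi - mu for xi, yi in zip(x, y)]
    sigma = max(1e-6, (sum(r * r for r in resid) / n) ** 0.5)
    params = [b, mu, sigma, 2.0]
    best = log_likelihood(params, x, y)
    rng = random.Random(seed)
    for _ in range(steps):
        # Perturb one coefficient at a time and keep the change only if it
        # improves the likelihood -- the greedy scheme described above.
        i = rng.randrange(4)
        trial = params[:]
        trial[i] += rng.gauss(0, 0.05)
        ll = log_likelihood(trial, x, y)
        if ll > best:
            params, best = trial, ll
    return params, best
```

Because only improving moves are accepted, the fitted slope stays close to the least-squares start while the power of the errors \(k\) is free to drift away from 2 when the data favor it.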

What is the relationship between independent and dependent variables?

The independent variable is the cause. Its value is independent of other variables in your study. The dependent variable is the effect. Its value depends on changes in the independent variable.

What are dependent and independent variables in linear regression?

Naming the Variables. There are many names for a regression's dependent variable. It may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.