CE Board Exam Randomizer


Correlation and Regression

Correlation measures how strongly two variables move together along a straight line. The correlation coefficient $r$ ranges from $-1$ to $+1$: $r \approx +1$ means a strong positive linear relationship (both increase together), $r \approx -1$ means a strong negative linear relationship (one increases as the other decreases), and $r \approx 0$ means little or no linear pattern, although a nonlinear pattern may still exist.

Regression finds the best-fit line $y = a + bx$ so you can predict $y$ from $x$. The slope $b$ means: "for every 1-unit increase in $x$, the predicted $y$ changes by $b$ units." The coefficient of determination $r^2$ gives the fraction of the variation in $y$ that is explained by the linear relationship with $x$. Remember: correlation ≠ causation, and predictions far outside the data range (extrapolation) are unreliable.

$$y=a+bx, \quad b=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}, \quad a=\bar{y}-b\bar{x}$$

Least-Squares Line

Fit a line to the points $(1,2)$, $(2,4)$, $(3,5)$, and $(4,7)$. Estimate $y$ at $x=5$.

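Using least squares, $\bar{x}=2.5$ and $\bar{y}=4.5$. The deviation sums are $\sum(x-\bar{x})(y-\bar{y})=3.75+0.25+0.25+3.75=8$ and $\sum(x-\bar{x})^2=2.25+0.25+0.25+2.25=5$.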

$$b=\frac{8}{5}=1.6, \qquad a=4.5-1.6(2.5)=0.5$$
$$y=0.5+1.6x$$

At $x=5$, $y=0.5+1.6(5)=8.5$.
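The same fit can be checked numerically. The sketch below applies the slope and intercept formulas from the top of this page to the four points; the function name `fit_line` is only an illustrative choice.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
a, b = fit_line(xs, ys)
y_at_5 = a + b * 5  # estimate at x = 5

print(round(a, 4), round(b, 4), round(y_at_5, 4))
```

The printed values should agree with the hand computation ($a=0.5$, $b=1.6$, $y=8.5$) up to floating-point rounding.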

Correlation Interpretation

A data set has correlation coefficient $r=-0.82$. Interpret the result.

The value is close to $-1$, so the variables have a strong negative linear relationship. As one variable increases, the other tends to decrease.

Prediction from Regression

The fitted equation for construction time is $y=12+0.08x$, where $x$ is floor area in square meters and $y$ is time in days. Estimate the time for $x=250$.

$$y=12+0.08(250)=32$$

Final answer: 32 days.
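As a quick check, the substitution can be written as a small function. This is a minimal sketch; the name `predict_days` and its default arguments are assumptions made for illustration, taken from the equation $y=12+0.08x$.

```python
def predict_days(area_m2, intercept=12.0, slope=0.08):
    """Predicted construction time in days from y = 12 + 0.08x."""
    return intercept + slope * area_m2


days = predict_days(250)
print(round(days, 2))
```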

Computing r from Data

Given the pairs $(x,y)$: $(1,2)$, $(2,3)$, $(3,5)$, $(4,4)$, $(5,6)$. Compute the correlation coefficient $r$ using the formula:

$$r=\frac{n\sum xy - \sum x\sum y}{\sqrt{\left[n\sum x^2-\left(\sum x\right)^2\right]\left[n\sum y^2-\left(\sum y\right)^2\right]}}$$

$n=5$, $\sum x=15$, $\sum y=20$, $\sum xy=69$, $\sum x^2=55$, $\sum y^2=90$.

$$r=\frac{5(69)-15(20)}{\sqrt{[5(55)-225][5(90)-400]}}=\frac{345-300}{\sqrt{[275-225][450-400]}}=\frac{45}{\sqrt{50\cdot50}}=\frac{45}{50}=0.90$$

Final answer: $r=0.90$ — strong positive linear relationship.
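The sums and the final value can be verified with a short script. The sketch below follows the computational formula used in this example; `pearson_r` is a hypothetical helper name, not a library function.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
r = pearson_r(xs, ys)
print(round(r, 2))
```

For a data set where all $x$ values (or all $y$ values) are equal, the denominator is zero and $r$ is undefined, so this sketch would raise a `ZeroDivisionError`.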

Coefficient of Determination r²

A regression analysis of load (x) vs. deflection (y) gives $r = 0.92$. Interpret $r^2$.

$$r^2=(0.92)^2=0.8464$$

This means 84.64% of the variation in deflection is explained by its linear relationship with load. The remaining 15.36% is due to other factors (material inconsistency, temperature, etc.). A higher $r^2$ means the regression line fits the data more closely.
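The split into explained and unexplained variation can be sketched in code. The first part uses $r=0.92$ from this example. The second part is an illustrative check only: no load and deflection data are given here, so it borrows the five pairs from the "Computing r from Data" example to show that, for a least-squares line, $r^2$ equals $1-\text{SSE}/\text{SST}$. The helper name `r_squared` is an assumption.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line, as 1 - SSE/SST."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total
    return 1 - sse / sst


# Part 1: values from this example (r = 0.92)
r = 0.92
explained = r ** 2
unexplained = 1 - explained
print(f"{explained:.2%} explained, {unexplained:.2%} unexplained")

# Part 2: illustrative check on borrowed data, where r = 0.90
r2 = r_squared([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
print(round(r2, 4))
```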

Interpreting the Slope

A regression equation relating study time $x$ (hours) to exam score $y$ is $y = 45 + 5.2x$. What does the slope mean?

The slope $b = 5.2$ means: for each additional hour of study, the predicted exam score increases by 5.2 points on average. The y-intercept 45 is the predicted score for 0 hours of study; treat it cautiously, because it is an extrapolation unless the data include study times near 0 hours.

$$\Delta y = 5.2\Delta x$$
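To see the slope as a change per unit, compare two predictions one hour apart. In this minimal sketch the study times of 3 and 4 hours are arbitrary illustrative inputs, not values from the problem.

```python
def predicted_score(hours):
    """Predicted exam score from y = 45 + 5.2x."""
    return 45 + 5.2 * hours


# Difference between predictions one hour apart equals the slope
delta = predicted_score(4) - predicted_score(3)
print(round(delta, 2))
```

Any other pair of inputs one hour apart gives the same difference, because the model is linear.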

Residual Calculation

A regression line gives $\hat{y}=3+2x$. For the data point $(4, 12)$, find the residual.

A residual is the difference between the actual $y$ and the predicted $\hat{y}$. A positive residual means the actual value lies above the line; a negative residual means it lies below the line.

$$\hat{y}=3+2(4)=11$$
$$\text{Residual}=y-\hat{y}=12-11=1$$

Final answer: Residual = +1 (actual value is 1 unit above the predicted value).
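The residual calculation can be sketched as a one-line function. The name `residual` and its default coefficients are illustrative, taken from $\hat{y}=3+2x$.

```python
def residual(x, y_actual, intercept=3.0, slope=2.0):
    """Actual y minus predicted y for the line y_hat = 3 + 2x."""
    y_hat = intercept + slope * x
    return y_actual - y_hat


res = residual(4, 12)
print(res)  # 1.0
```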

Extrapolation Warning

A regression model using data from buildings with 5–20 floors gives $y=50+30x$ (x = floors, y = cost in million pesos). Should you use this to predict cost for a 50-floor building?

Substituting $x=50$: $y=50+30(50)=1550$ million pesos. However, this is extrapolation: predicting far beyond the range of the data (5–20 floors). The linear trend may not hold at 50 floors (economies of scale, structural changes, etc.), so the result should not be relied on without additional data.

$$x=50 \text{ is far outside the data range }[5,20]$$

Final answer: The model gives 1,550 million pesos, but the prediction is unreliable due to extrapolation.
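One way to guard against this in practice is to flag any input outside the fitted range. The sketch below is an assumption about how such a check might look; `predict_cost` and its range parameters are hypothetical names, with the bounds 5 and 20 taken from the problem statement.

```python
def predict_cost(floors, x_min=5, x_max=20):
    """Return (cost in million pesos, extrapolation flag) for y = 50 + 30x."""
    cost = 50 + 30 * floors
    # True when the input lies outside the range of the fitted data
    is_extrapolation = not (x_min <= floors <= x_max)
    return cost, is_extrapolation


cost, flagged = predict_cost(50)
print(cost, flagged)  # 1550 True
```

A call such as `predict_cost(12)` stays inside the data range, so its flag is `False`.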
