
Linear regression notes

Classification is a great method for predicting discrete values from a given dataset, but sometimes you need to predict a continuous value, e.g. height, weight, or prices. That's when linear regression techniques come in handy.

What is linear regression?

The definition I read on Wikipedia didn't help me at all. Instead, when I related it to a line, it started to make sense. Suppose we've got a linear function, that is, a function describing a line, y = ŵ·x + b, where ŵ is the slope of the line and b is the intercept, a constant value.

For every x value a new point is produced, and together those points form a line. So, if you think about it visually, given a set of input values, a simple linear regression algorithm will try to come up with a line passing as close as possible to the majority of the input dataset points. When you then ask for a prediction for a new input value, the model picks the corresponding value from that line.
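
As a rough illustration (the slope and intercept below are made-up values, not the result of any real fit), predicting with a simple linear model is just evaluating that line:

predicting with a fitted line (sketch)
// hypothetical slope (ŵ) and intercept (b) of an already-fitted line
double w = 2.5
double b = 10.0

// predicting is just evaluating y = w * x + b for each input value
def predict = { double x -> w * x + b }

[1.0d, 2.0d, 3.0d].each { x ->
    println "x=${x} -> predicted y=${predict(x)}"
}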

Linear regression techniques differ mainly in whether they use regularization (Ridge and Lasso) or not (simple linear regression). It's also worth mentioning normalization, feature generation (polynomial transformation) and feature compression (PCA).

Prerequisites

For this entry you will need the following dependencies:

Gradle
implementation "com.github.grooviter:underdog-ml:VERSION"
implementation "com.github.grooviter:underdog-plots:VERSION"

Maven
<dependency>
    <groupId>com.github.grooviter</groupId>
    <artifactId>underdog-ml</artifactId>
    <version>VERSION</version>
</dependency>
<dependency>
    <groupId>com.github.grooviter</groupId>
    <artifactId>underdog-plots</artifactId>
    <version>VERSION</version>
</dependency>

Grapes
@Grapes([
    @Grab("com.github.grooviter:underdog-ml:VERSION"),
    @Grab("com.github.grooviter:underdog-plots:VERSION")
])

Note

The ml and plots modules already include underdog-dataframe as a transitive dependency, so you don't have to declare it explicitly.

Simple linear regression

The most popular form of linear regression uses the least squares technique. It tries to find a slope (w) and an intercept (b) that minimize the mean squared error of the model. It doesn't have parameters to control model complexity; everything it needs is estimated from the training data.
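
To make that objective concrete, here's a minimal sketch (with made-up toy numbers) of the mean squared error that least squares minimizes for a candidate slope w and intercept b:

mean squared error (sketch)
// toy data and a candidate (w, b); least squares searches for the pair minimizing this error
double[] xs = [1, 2, 3, 4]
double[] ys = [2.1, 4.2, 5.9, 8.1]
double w = 2.0
double b = 0.0

double mse = (0..<xs.length).collect { i ->
    double error = ys[i] - (w * xs[i] + b)
    error * error
}.sum() / xs.length

println "MSE for w=${w}, b=${b}: ${mse}"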

UC Irvine Dataset

The dataset used for this entry is the Bike Sharing dataset from the UCI Machine Learning Repository.

First of all I'm loading the bike sharing daily dataset, removing non-numerical series and rows with missing values:

loading data
import underdog.Underdog

def df = Underdog.df()
    .read_csv(FILE_PATH, dateFormat: 'yyyy-MM-dd')
    .dropna()  // removing missing values
    .sort_values(by: ['dteday'])
    .drop('instant', 'yr', 'casual', 'cnt', 'dteday') // removing some columns

Which outputs:

                                            day.csv
 season  |  mnth  |  holiday  |  weekday  |  workingday  |  weathersit  |    temp    |   atemp    |    hum     |  windspeed  |  registered  |
---------------------------------------------------------------------------------------------------------------------------------------------
      1  |     1  |        0  |        6  |           0  |           2  |  0.344167  |  0.363625  |  0.805833  |   0.160446  |         654  |
      1  |     1  |        0  |        0  |           0  |           2  |  0.363478  |  0.353739  |  0.696087  |   0.248539  |         670  |
      1  |     1  |        0  |        1  |           1  |           1  |  0.196364  |  0.189405  |  0.437273  |   0.248309  |        1229  |
      1  |     1  |        0  |        2  |           1  |           1  |       0.2  |  0.212122  |  0.590435  |   0.160296  |        1454  |
      1  |     1  |        0  |        3  |           1  |           1  |  0.226957  |   0.22927  |  0.436957  |     0.1869  |        1518  |

First, I’d like to see how features could be related to each other using a correlation matrix:

correlation matrix
def plot = Underdog
    .plots()
    .correlationMatrix(df)

plot.show()

There are a lot of features, but I'm focusing on just one, temp, which is the normalized temperature in Celsius on the day of the rental. I'd like to see visually what the relationship looks like between the registered number of rentals (the registered variable) and the temperature feature I've chosen:

scatter matrix: temp vs registered
def plot = Underdog
    .plots()
    .scatterMatrix(df['temp', 'registered'])

plot.show()

What I'm looking for at this point in the scatter plot is a tendency. In this case it seems that points tend to go diagonally from the bottom left to the upper right part of the graph. Roughly speaking, the clearer the tendency, the better a linear model is likely to work. Now I'm creating a linear regression model using the ordinary least squares (OLS) algorithm:

linear regression
// features (X) and labels (y)
def X = df['temp'] as double[][]
def y = df['registered'] as double[]

def ml = Underdog.ml()

// train test split (0.75 training, 0.25 test)
def (
    xTrain,
    xTest,
    yTrain,
    yTest
) = ml.utils.trainTestSplit(X, y)

// model creation and training
def model = ml.regression.ols(xTrain, yTrain)

// predicting and getting r2_score for training and test sets
def scoreTrain = model.score(xTrain, yTrain).round(6)
def scoreTest = model.score(xTest, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")
output
train: 0.307325, test: 0.223381

If we draw the regression line we’ve got:

scatter & line plot
def plt = Options.create {
    title {
        text('Linear Regression (Least Squares - No Polynomial)')
        left('center')
        top('5%')
    }
    xAxis {
        name('temp')
        nameGap(25)
        nameLocation('center')
        show(true)
    }
    yAxis {
        name('registered')
        nameGap(50)
        nameLocation('center')
        show(true)
    }
    // SCATTER PLOT
    series(ScatterSeries){
        data(
            toData(
                xTest, // x1, x2,...xn
                yTest  // y1, y2,...yn
            ) // [[x1,y1], [x2,y2],...[xn, yn]]
        )
    }
    // REGRESSION LINE
    series(LineSeries) {
        data(
            toData(
                xTest,
                model.predict(xTest)
            )
        )
    }
}

plt.show()

A straight line won't be able to make good predictions. One way of helping the linear model adapt better to the shape of the data is to use a polynomial transformation.

Polynomial transformation

When the problem doesn't easily fit a straight line, or there are many features, it can become complicated to find a good relationship between them, especially with a simple line. The polynomial transformation helps finding those relationships: applying it to our problem can help the linear regression adapt better to the shape of the data.
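
Conceptually (this is just a sketch, not necessarily what polynomialFeatures produces internally), a degree-2 expansion of a single feature x adds the squared term as an extra column:

degree-2 expansion (sketch)
// single-feature matrix: one column with the temp values
double[][] singleFeature = [[0.34], [0.36], [0.20]]

// degree-2 expansion: each row [x] becomes [x, x^2]
double[][] expanded = singleFeature.collect { row ->
    double x = row[0]
    [x, x * x] as double[]
} as double[][]

expanded.each { println(it as List) }

With that idea in mind, this is the same linear regression example as before, but this time applying the polynomialFeatures function prior to the linear regression fit.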

applying polynomial transformation
// transforming X adding new generated features
def xPoly = ml.features.polynomialFeatures(X)

// train test split (more data for training)
def (
    xTrain,
    xTest,
    yTrain,
    yTest
) = ml.utils.trainTestSplit(xPoly, y)

// creating and training model
def model = ml.regression.ols(xTrain, yTrain)

// predicting and getting r2_score for training and test sets
def scoreTrain = model.score(xTrain, yTrain).round(6)
def scoreTest = model.score(xTest, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")
output
train: 0.355879, test: 0.296751

Because the polynomial transformation creates more features, the model covers a wider spectrum of the data and is therefore more likely to improve accuracy, at least on the training dataset. If we now draw the result:

linear regression polynomial plot
// building chart
def plt = Options.create {
    title {
        text('Linear Regression (Least Squares - Polynomial)')
        left('center')
        top('5%')
    }
    xAxis {
        name('temp')
        nameGap(25)
        nameLocation('center')
        show(true)
    }
    yAxis {
        name('registered')
        nameGap(50)
        nameLocation('center')
        show(true)
    }
    // SCATTER PLOT
    series(ScatterSeries){
        data(toData(X, y))
    }
    // REGRESSION LINES
    def colors = ['gray', 'green', 'brown'].indexed()

    (0..<xTest.shape().cols).each {feature ->
        series(LineSeries) {
            smooth(true)
            itemStyle { color(colors[feature]) }
            data(
                toData(
                    xTest.collect { it[feature] },
                    model.predict(xTest)
                )
            )
        }

    }
}

// showing chart
plt.show()

This covers much more of the data than the previous example. However, there are a couple of things to keep in mind when applying the polynomial transformation:

  • A polynomial transformation with a high degree value could overfit the model
  • It's better to combine it with a regularized regression method like Ridge (see the sketch below)
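
A minimal sketch of that combination, reusing the polynomialFeatures, trainTestSplit and ridge calls already shown in this entry (the alpha value is arbitrary, and X, y and ml come from the previous snippets):

polynomial features + Ridge (sketch)
// expand the features first, then fit a regularized (Ridge) regression on the expanded set
def xPolyRidge = ml.features.polynomialFeatures(X)
def (xTr, xTe, yTr, yTe) = ml.utils.trainTestSplit(xPolyRidge, y)

def ridgePoly = ml.regression.ridge(xTr, yTr, alpha: 40)

println "train: ${ridgePoly.score(xTr, yTr).round(6)}, test: ${ridgePoly.score(xTe, yTe).round(6)}"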

However, so far it's clear that with just one feature we don't get anywhere: the models we've got barely work on the training set and are useless on the test set. In the regularization and normalization sections we will be using more features to try to create a viable model.

Feature selection

So far I've been working with just one feature, temp, to predict a possible outcome. I chose this feature by using the correlation table as a guide. When looking for just one variable to work with that could be enough, but when looking at many possible features it can become cumbersome.

Lasso regression can be one method for telling me which features perform and which don't. How? Keeping it short, because of how the L1 regularization method works, Lasso drives the coefficients of the less important features to 0; therefore the features with a coefficient greater than 0 are worth using to train the model (the higher, the better). Let's use this knowledge to find out which features could be useful to train the model.

Lasso to get feature coefficients
import underdog.Underdog

import static groovy.json.JsonOutput.toJson
import static groovy.json.JsonOutput.prettyPrint

// taking all available features but the one for labelling
def featureNames = df.columns - "registered"

def X = df[featureNames] as double[][]
def y = df['registered'] as double[]

// creating and training model
def model = Underdog.ml().regression.lasso(X, y)

// extracting coefficients
def coefficients = model.coefficients()
def featNamesAndCoefficients = [featureNames, coefficients].transpose().collectEntries()

println(prettyPrint(toJson(featNamesAndCoefficients)))
Which shows the following map:

features along with their coefficients
{
    "season": 424.11017152754937,
    "mnth": -14.868338993584615,
    "holiday": -211.17143310755654,
    "weekday": 36.06114879750305,
    "workingday": 941.7383145447468,
    "weathersit": -397.9241830877648,
    "temp": 1136.7307642112696,
    "atemp": 2730.771127273043,
    "hum": -1670.0999227731259,
    "windspeed": -2200.526972819417
}
Now, as the theory states, we can discard the features with a coefficient of 0, and maybe also those which are negatively correlated. For this example I'm only interested in validating whether I chose the most significant feature or not, so I'm collecting the features with the highest positive coefficients:

def bestFeatures = featNamesAndCoefficients
    .findAll { it.value > 0 } // filtering all features w/ coefficient > 0
    .sort { -it.value } // sorting by coefficient (desc)
    *.key as List<String> // getting feature names

println(bestFeatures)
output
['atemp', 'temp', 'workingday', 'season', 'weekday']

Regularization and normalization

Regularization is a technique used to reduce model complexity, and thus it helps deal with over-fitting:

  • It reduces the model size by shrinking the number of parameters the model has to learn
  • It penalizes large coefficient values so that the model tends to favor smaller ones

Regularization penalizes certain values by adding a cost term to the loss function. This cost can be of one of two types (a small sketch of both penalties follows the list):

  • L1: The cost is proportional to the absolute value of the weight coefficients (Lasso)
  • L2: The cost is proportional to the square of the value of the weight coefficients (Ridge)
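
Just to make both penalties tangible, here is a tiny sketch computing them for a made-up coefficient vector (alpha is the regularization strength; the numbers are arbitrary):

L1 and L2 penalties (sketch)
// hypothetical coefficients and regularization strength
double[] w = [1.5, -0.3, 0.0, 2.0]
double alpha = 40

// L1 (Lasso) penalty: alpha * sum of absolute coefficient values
double l1 = alpha * w.collect { Math.abs(it) }.sum()

// L2 (Ridge) penalty: alpha * sum of squared coefficient values
double l2 = alpha * w.collect { it * it }.sum()

println "L1 penalty: ${l1}, L2 penalty: ${l2}"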

Tip

Regularization really shines when there is high dimensionality, that is, when there are multiple features, so in these examples it won't have a huge impact on the scores.

Data normalization is the process of rescaling one or more features to a common scale. It's normally used when the features used to create the model have different scales. There are a few advantages of using normalization in such a scenario (a small min-max sketch follows the list):

  • It could improve the numerical stability of your model
  • It could speed up the training process
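
As a minimal sketch (not the library's implementation), min-max scaling maps each feature value into the [0, 1] range using the minimum and maximum observed in the training set:

min-max scaling (sketch)
// hypothetical raw values of a single feature column from the training set
def values = [10d, 15d, 20d, 40d]
def lowest = values.min()
def highest = values.max()

// min-max scaling: x' = (x - min) / (max - min)
def scaled = values.collect { (it - lowest) / (highest - lowest) }

println scaled // roughly [0.0, 0.167, 0.333, 1.0]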

Normalization is especially important when applying certain regression techniques, as regression is sensitive to the scale of the model features.

Tip

When using only ONE feature, normalization doesn't make much difference, but when using multiple features, each of them on a different scale, then we should use normalization.

Regularization Baseline

Let's do the simple linear regression again with the best features obtained from the Lasso regression coefficients to set the baseline for the regularization & normalization examples:

Baseline
def X = df['atemp', 'temp', 'workingday', 'season', 'weekday'] as double[][]
def y = df['registered'] as double[]
def ml = Underdog.ml()

// train test split
def (xTrain, xTest, yTrain, yTest) = ml.utils.trainTestSplit(X, y)

// creating and training model
def model = ml.regression.ols(xTrain, yTrain)

// getting scores
def scoreTrain = model.score(xTrain, yTrain).round(6)
def scoreTest = model.score(xTest, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")

Obtaining the scores:

baseline scores
train: 0.460528, test: 0.311946

Ridge

  • Follows the least-squares criterion but uses regularization as a penalty for large variations in the w parameters
  • Regularization prevents over-fitting by restricting the model, which normally reduces its complexity
  • Regularization is controlled by the alpha parameter
  • The higher the value of alpha, the simpler the model, that is, the less likely the model is to over-fit

Now I'm using Ridge regression with the same dataset:

using Ridge regression
def X = df['atemp', 'temp', 'workingday', 'season', 'weekday'] as double[][]
def y = df['registered'] as double[]
def ml = Underdog.ml()

// train test split (more data for training)
def (xTrain, xTest, yTrain, yTest) = ml.utils.trainTestSplit(X, y)

// creating and training model (RIDGE)
def model = ml.regression.ridge(xTrain, yTrain, alpha: 40)

// predicting and getting r2_score for training and test sets
def scoreTrain = model.score(xTrain, yTrain).round(6)
def scoreTest = model.score(xTest, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")

Giving me the following scores:

output
train: 0.459691, test: 0.321293

It looks better than the simple OLS example. The takeaway here is that Ridge regression combined with a high value of alpha reduces the complexity of the model and makes generalization better (or at least more stable).

The Ridge regression score can be improved by applying normalization to the source dataset. It is important for some ML methods that all features are on the same scale. In this case we're applying a MinMax normalization. However, there are some basic tips to be aware of:

  • Fit the scaler with the training set and then apply the same scaler to transform the training and test feature sets
  • Don’t use the test dataset to fit the scaler. That could lead to data leakage.
Ridge with scaled set
def (
    xTrain,
    xTest,
    yTrain,
    yTest
) = ml.utils.trainTestSplit(X, y)

// NORMALIZATION: training scaler with training set
def minMaxTransformation = ml.features.minMaxScaler(xTrain)

// NORMALIZATION: transforming training set
def xTrainScaled = minMaxTransformation.apply(xTrain)

// NORMALIZATION: transforming testing set
def xTestScaled = minMaxTransformation.apply(xTest)

// creating and training model
def model = ml.regression.ridge(xTrainScaled, yTrain, alpha: 40)

// predicting and getting r2_score for training and test sets
def scoreTrain = model.score(xTrainScaled, yTrain).round(6)
def scoreTest = model.score(xTestScaled, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")
output
train: 0.459691, test: 0.321293

Well, it didn't change a bit. I'm not sure whether that's because the model is not that complex, or because the features don't change much, so regularization + normalization don't add much either.

Lasso

  • It uses an L1 type regularization penalty, meaning it minimizes the sum of the absolute values of the coefficients
  • It works as a kind of feature selection
  • It also has an alpha parameter to control regularization
using lasso regression
def model = ml.regression.lasso(xTrain, yTrain, alpha: 40)

// predicting and getting r2_score for training and test sets
def scoreTrain = model.score(xTrain, yTrain).round(6)
def scoreTest = model.score(xTest, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")
output
train: 0.460528, test: 0.31195

And finally, using the min-max scaler to try to improve the regression score:

lasso with scaled features
// train test split
def (
        xTrain,
        xTest,
        yTrain,
        yTest
) = ml.utils.trainTestSplit(X, y)

// normalization
def minMaxTransformation = ml.features.minMaxScaler(xTrain)
def xTrainScaled = minMaxTransformation.apply(xTrain)
def xTestScaled = minMaxTransformation.apply(xTest)

// creating and training model
def model = ml.regression.lasso(xTrainScaled, yTrain, alpha: 40)

def scoreTrain = model.score(xTrainScaled, yTrain).round(6)
def scoreTest = model.score(xTestScaled, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")
output
train: 0.460528, test: 0.311954

Unfortunately this didn't improve the result either.

Ridge vs Lasso

In this case we've used both algorithms with the same dataset, but there are situations where one or the other fits better (a simple comparison sketch follows the list):

  • Ridge: Many small/medium sized effects
  • Lasso: Few medium/large sized effects
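
When in doubt, one pragmatic approach is to train both on your own split and compare test scores across a few alpha values. This sketch reuses the ridge and lasso calls from the previous sections and assumes the xTrainScaled, xTestScaled, yTrain and yTest variables from the last snippet are still available:

comparing Ridge and Lasso (sketch)
// sweeping a few regularization strengths and printing the test score of each model
[1, 10, 40, 100].each { strength ->
    def ridgeModel = ml.regression.ridge(xTrainScaled, yTrain, alpha: strength)
    def lassoModel = ml.regression.lasso(xTrainScaled, yTrain, alpha: strength)

    println "alpha=${strength} -> ridge: ${ridgeModel.score(xTestScaled, yTest).round(6)}, lasso: ${lassoModel.score(xTestScaled, yTest).round(6)}"
}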

PCA

Principal component analysis (PCA) is an orthogonal linear transformation that converts a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. Long story short, it tries to build the same model with fewer features, performing a kind of data compression. In this example we are trying to obtain similar results with one feature less by using PCA:

pca
// train test split
def (
    xTrain,
    xTest,
    yTrain,
    yTest
) = ml.utils.trainTestSplit(X, y)

// normalization
def normalization = ml.features.standardizeScaler(xTrain)
//      |
//      v
def xTrainScaled = normalization.apply(xTrain)
def xTestScaled = normalization.apply(xTest)

// compression
def compression = ml.features.pca(xTrainScaled, 4)
//      |
//      v
def xTrainScaledReduced = compression.apply(xTrainScaled)
def xTestScaledReduced = compression.apply(xTestScaled)

// creating and training model
def model = ml.regression.ols(xTrainScaledReduced, yTrain)

// predicting and getting r2_score for training and test sets
def scoreTrain = model.score(xTrainScaledReduced, yTrain).round(6)
def scoreTest = model.score(xTestScaledReduced, yTest).round(6)

print("train: ${scoreTrain}, test: ${scoreTest}")

Which shows pretty much the same results while reducing the data needed to train the model; this may not be relevant in this example, but it could mean a huge amount of data for some models.

output
train: 0.460506, test: 0.311325