Pink bollworm prediction using regression
Random Forest Regression is a machine learning algorithm that is based on the ensemble learning technique. It is used for solving regression problems, where the goal is to predict a continuous numerical value.
Random Forest Regression combines the power of multiple decision trees to make accurate predictions. Each decision tree in the random forest independently predicts the target value, and the final prediction is obtained by aggregating the predictions of all the trees.
The mathematical expression for Random Forest Regression:
$$ \hat{m} = \frac{1}{N}\sum_{i=1}^{N} f_{i}(x) $$
Where:
- 𝑚̂ is the predicted value for a given input x.
- N is the number of decision trees in the random forest.
- fi(x) is the prediction of the i-th decision tree.
Each decision tree in the random forest is constructed by randomly selecting a subset of the training data and a subset of the input features. The trees are built using a process called recursive partitioning, where the data is split into smaller subsets based on certain conditions.
To make a prediction, each tree in the forest independently predicts a value for the input x. The final prediction is then obtained by averaging the predictions of all the trees.
Random Forest Regression is a powerful algorithm that can handle a large number of input features and capture complex relationships between the features and the target variable. It is widely used in various domains, including finance, healthcare, and engineering, for tasks such as stock market prediction, disease prognosis, and quality control.
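As an illustration, the sketch below fits a random forest regressor with scikit-learn; the synthetic data and the hyperparameter values are placeholders, not the configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: in practice X would hold the predictor variables
# (e.g., weather parameters) and y the observed pink bollworm incidence.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = rng.random(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each tree is grown on a bootstrap sample using a random subset of features;
# the forest prediction is the average of the N individual tree predictions.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
```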
Support Vector Regression (SVR) is a regression algorithm that uses support vector machines to perform regression tasks. It aims to find a function that approximates the relationship between the input variables and the continuous target variable. Here's the mathematical expression for SVR:
$$ y = f(x) = \sum_{i=1}^{n} \alpha_{i} K(x, x_{i}) + b $$
Where:
- y is the predicted value for the input x.
- αi are the coefficients determined during the training process.
- K(x, xi) is the kernel function that measures the similarity between x and xi.
- b is the bias term.
During the training process, SVR aims to minimize the following objective function:
$$ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n}\left(\xi_{i} + \xi_{i}^{*}\right) $$
Subject to the following constraints:
- yi - λi ≤ ε + ξi
- λi - yi ≤ ε + ξi*
- ξi, ξi* ≥ 0
Where:
- w is the weight vector of the regression function.
- λi is the predicted value for the i-th input sample.
- C is the regularization parameter that controls the trade-off between the model's complexity and the training error.
- ε is the epsilon-tube, which defines the margin of tolerance around the predicted values.
- ξi and ξi* are the slack variables that measure how far a prediction falls outside the epsilon-tube.
Support Vector Regression is a powerful algorithm that can handle non-linear regression tasks by using appropriate kernel functions. It is also widely used in various domains, including finance, engineering, and economics, for tasks such as stock price prediction, demand forecasting, and time series analysis.
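A comparable sketch for SVR with scikit-learn is shown below; the data and the kernel and parameter choices (RBF kernel, C, epsilon) are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the study's predictors and target.
rng = np.random.default_rng(42)
X, y = rng.random((200, 5)), rng.random(200)

# C and epsilon correspond to the regularization parameter and the epsilon-tube
# in the objective above; the RBF kernel plays the role of K(x, x_i).
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)
y_pred = svr.predict(X)
```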
LASSO (Least Absolute Shrinkage and Selection Operator) regression is a regularization technique used in linear regression to select relevant features and control model complexity. It adds a penalty term to the least squares objective function, encouraging sparse solutions.
The mathematical expression for LASSO regression:
$$ y = f(x) = \beta_{0} + \sum_{j=1}^{p} \beta_{j} x_{j} $$
Where:
- y is the predicted value for the input x.
- β0 is the intercept term.
- βj represents the coefficients corresponding to the j-th feature xj.
- p is the number of features in the dataset.
To incorporate the LASSO penalty, the objective function is modified as follows:
$$ \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - f\left(x_{i}\right)\right)^{2} + \lambda \sum_{j=1}^{p}\left|\beta_{j}\right| $$
Where:
- The first term represents the least squares loss, which measures the squared difference between the predicted and actual values.
- The second term represents the LASSO penalty, which is the sum of the absolute values of the coefficients multiplied by the regularization parameter λ.
LASSO regression encourages feature selection by driving the coefficients of irrelevant or redundant features towards zero, effectively performing automatic feature elimination.
LASSO regression is widely used in various domains, including statistics, machine learning, and economics, for tasks such as feature selection, variable importance determination, and high-dimensional data analysis.
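A minimal LASSO sketch with scikit-learn is given below, under the same placeholder-data assumption; the alpha value (playing the role of λ) is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic placeholder data standing in for the study's predictors and target.
rng = np.random.default_rng(42)
X, y = rng.random((200, 5)), rng.random(200)

# alpha plays the role of the regularization parameter lambda; larger values
# shrink more coefficients beta_j exactly to zero (automatic feature elimination).
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
print(lasso.intercept_, lasso.coef_)  # beta_0 and the beta_j
```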
Elastic Net regression is a regularization technique that combines both L1 (LASSO) and L2 (Ridge) regularization penalties to control model complexity and perform feature selection.
The mathematical expression for Elastic Net regression:
$$ y = f(x) = \beta_{0} + \sum_{j=1}^{p} \beta_{j} x_{j} $$
Where:
- y is the predicted value for the input x.
- β0 is the intercept term.
- βj represents the coefficients corresponding to the j-th feature xj.
- p is the number of features in the dataset.
To incorporate both L1 and L2 regularization penalties, the objective function is modified as follows:
$$ \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - f\left(x_{i}\right)\right)^{2} + \lambda_{1} \sum_{j=1}^{p}\left|\beta_{j}\right| + \lambda_{2} \sum_{j=1}^{p}\beta_{j}^{2} $$
Where:
- The first term represents the least squares loss, which measures the squared difference between the predicted and actual values.
- The second term represents the L1 (LASSO) penalty, which is the sum of the absolute values of the coefficients multiplied by the regularization parameter λ1.
- The third term represents the L2 (Ridge) penalty, which is the sum of the squared coefficients multiplied by the regularization parameter λ2.
Elastic Net regression combines the strengths of both L1 and L2 regularization, making it suitable for handling datasets with high-dimensional features and correlated predictors.
Elastic Net regression is widely used in various domains, including machine learning, statistics, and bioinformatics, for tasks such as feature selection, dimensionality reduction, and predictive modeling.
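A corresponding Elastic Net sketch with scikit-learn follows, again with placeholder data and illustrative alpha and l1_ratio values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic placeholder data standing in for the study's predictors and target.
rng = np.random.default_rng(42)
X, y = rng.random((200, 5)), rng.random(200)

# scikit-learn parameterizes the combined penalty through alpha and l1_ratio:
# alpha * l1_ratio * sum|beta_j| + 0.5 * alpha * (1 - l1_ratio) * sum(beta_j^2),
# which together play the role of lambda_1 and lambda_2 in the objective above.
enet = ElasticNet(alpha=0.01, l1_ratio=0.5)
enet.fit(X, y)
print(enet.intercept_, enet.coef_)
```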
The XGBoost regression model was used to build a predictive model. XGBoost stands for extreme gradient boosting. For a given dataset with n samples and m features, $$D=\left\{\left(X_{i},y_{i}\right)\right\}\quad\left(|D|=n,\; X_{i}\in \mathbb{R}^{m},\; y_{i}\in \mathbb{R}\right),$$ a tree ensemble model uses K additive functions to predict the output:
$$ \mathrm{estimate}\left(y_{i}\right)=\Phi\left(X_{i}\right)= \sum_{k=1}^{K}f_{k}\left(X_{i}\right),\quad f_{k}\in F \qquad (1) $$
where $$F=\left\{f\left(x\right)=w_{q\left(x\right)}\right\}\;\left(q:\mathbb{R}^{m}\rightarrow T,\; w\in \mathbb{R}^{T}\right)$$ is the space of regression trees. Here ‘q’ represents the structure of each tree, mapping an example to the corresponding leaf index. T is the number of leaves in a tree. Each $f_{k}$ corresponds to an independent tree structure q and leaf weights w. Unlike decision trees, each regression tree contains a continuous score on each leaf; we use $w_{i}$ to represent the score on the i-th leaf.
To learn the set of functions used in the model, we minimize the following regularized objective
$$ L\left(\Phi\right)= \sum_{i}l\left(\mathrm{estimate}\left(y_{i}\right),y_{i}\right)+\sum_{k}\Omega \left(f_{k}\right) \qquad (2) $$
where $$ \Omega \left(f_{k}\right)= \gamma T+\frac{1}{2}\lambda \lVert w\rVert^{2} $$
Here $l$ is a differentiable convex loss function that measures the difference between the prediction $\mathrm{estimate}\left(y_{i}\right)$ and the target $y_{i}$. The second term, Ω, penalizes the complexity of the regression tree functions. This additional regularization term helps to smooth the final learned weights and avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.
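A minimal sketch of fitting such a model with the xgboost library is given below; the data and the hyperparameter values (number of trees K, depth, learning rate, gamma, lambda) are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic placeholder data standing in for the study's predictors and target.
rng = np.random.default_rng(42)
X, y = rng.random((200, 5)), rng.random(200)

# n_estimators corresponds to K (the number of additive trees), while gamma and
# reg_lambda correspond to gamma and lambda in the complexity penalty Omega(f_k).
model = XGBRegressor(n_estimators=100, max_depth=4, learning_rate=0.1,
                     gamma=0.0, reg_lambda=1.0, objective="reg:squarederror")
model.fit(X, y)
y_hat = model.predict(X)
```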
The predictive models were validated considering the following statistical measures:
$$ RMSE= \sqrt{\frac{\sum_{i=1}^{n}\left(\mathrm{Predicted}\left(y_{i}\right) - \mathrm{Observed}\left(y_{i}\right)\right)^{2}}{n}} \qquad (3) $$
To check the predictive accuracy of each model, the relative mean absolute percentage error (RMAPE) was used. RMAPE is widely used to validate forecast accuracy; it indicates the average size of the prediction error expressed as a percentage of the relevant observed value, irrespective of whether that prediction error is positive or negative.
$$ RMAPE= \frac{1}{n} \sum_{i=1}^{n}\frac{\left|\mathrm{Observed}\left(y_{i}\right) - \mathrm{Predicted}\left(y_{i}\right)\right|}{\mathrm{Observed}\left(y_{i}\right)}\times 100 \qquad (4) $$
$$ Model\ Accuracy = 100 - RMAPE \qquad (5) $$
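These three measures translate directly into code. The helper functions below are a sketch (the names rmse, rmape, and model_accuracy are ours), assuming non-zero observed values so that the percentage in Eq. (4) is defined.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error, Eq. (3)."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.sqrt(np.mean((predicted - observed) ** 2))

def rmape(observed, predicted):
    """Relative mean absolute percentage error, Eq. (4)."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.mean(np.abs(observed - predicted) / observed) * 100

def model_accuracy(observed, predicted):
    """Model accuracy, Eq. (5)."""
    return 100 - rmape(observed, predicted)
```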
Once the model has been trained, we cannot assume that it will work well on data it has not seen before. In other words, we cannot be certain that the model will achieve the desired accuracy and variance in a production environment; we need some confirmation that the predictions the model produces are accurate. For that, the model needs to be validated. Validation is the process of determining whether the numerical results that quantify hypothesized relationships between variables are acceptable as descriptions of the data. To evaluate the performance of any machine learning model, we need to test it on unseen data; based on the model's performance on unseen data, we can say whether the model is under-fitting, over-fitting, or well generalized. Cross-validation (CV) is one such validation technique used to test the effectiveness of the built model; it is a re-sampling procedure in which a portion of the data is held out and not used to train the model. A 10-fold cross-validation method was applied to verify the accuracy of the models, and accuracy was tested based on the RMSE value.
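A minimal sketch of this 10-fold CV procedure with scikit-learn is given below, using a random forest as the example model and synthetic placeholder data; the study's own models and data would be substituted in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic placeholder data standing in for the study's predictors and target.
rng = np.random.default_rng(42)
X, y = rng.random((200, 5)), rng.random(200)

# 10-fold cross-validation scored by RMSE; scikit-learn returns the negated
# error, so the sign is flipped before averaging across folds.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
print("Mean CV RMSE:", -scores.mean())
```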