Impact of feature selection on the prediction of global horizontal irradiation under Ouarzazate city climate

Accurate forecasts of Global Horizontal Irradiance (GHI) are pivotal to the efficient use of solar energy resources. Machine learning techniques offer promising prospects for predicting GHI. Within machine learning, however, the importance of feature selection cannot be overstated, as it is crucial in determining the performance and reliability of predictive models. To address this, a machine learning approach has been developed that leverages feature importance techniques to forecast GHI. The proposed models draw upon historical data encompassing solar irradiance characteristics and environmental variables for the Ouarzazate region, Morocco, spanning from 1 January 2018 to 31 December 2018, with readings taken at 60-minute intervals. The findings underscore the impact of feature selection on the predictive capabilities of machine learning models for GHI forecasting. By identifying and prioritizing the most informative features, the models achieve significantly better accuracy metrics, thereby improving the reliability, efficiency, and practical applicability of GHI forecasts. This advancement not only holds promise for optimizing solar energy utilization but also contributes to the broader discourse on leveraging machine learning for renewable energy forecasting and sustainability initiatives.


INTRODUCTION
For solar energy systems to be designed, operated, and controlled as efficiently as possible, (1,2,3) precise forecasting of solar irradiance is necessary. One of the most crucial variables in solar irradiance prediction models, global horizontal irradiance (GHI), describes the total quantity of solar radiation that reaches a horizontal surface at a specific location. (4,5) Predicting GHI is difficult, however, since it depends on a complex interplay of several meteorological and environmental variables, including temperature and humidity. (6,7,8) In recent years, a number of forecasting models have been introduced to target various prediction time horizons, such as very short-term (within an hour), short-term (within a day or day ahead), medium-term (1 month), and long-term (1 year). (9)
antian et al. (10) introduce a new feature selection approach for short-term solar irradiance prediction, integrating conditional mutual information (CMI) and Gaussian process regression (GPR). The aim is to assess feature correlations and redundancies using conditional mutual information, and to use GPR with adaptively determined hyper-parameters for prediction. Sequential forward selection (SFS) is used to identify the optimal subset of features and the GPR covariance function. The results demonstrate improved prediction accuracy with reduced feature dimensions, showing that the proposed method enhances solar irradiance prediction performance.
Takahiro et al. (11) explore enhancing solar irradiance forecasting in the Kanto region by integrating the mesoscale ensemble prediction system (MEPS) into support vector regression (SVR)-based predictors. Despite previous challenges in accuracy stemming from numerical weather prediction (NWP), MEPS offers a promising solution. SVR models were developed using MEPS inputs with multiple network configurations, resulting in improved accuracy and a reduced maximum prediction error for global horizontal irradiance (GHI). The integration of MEPS not only enhances average accuracy but also addresses systematic errors, offering a robust framework for solar irradiance forecasting in the Kanto region.
Domingos S et al. (12) introduce HetDS, a heterogeneous ensemble dynamic selection model designed to forecast solar irradiance more accurately than individual models. HetDS selects the most suitable forecasting model from a pool of seven well-established methods: ARIMA, SVR, MLP, ELM, DBN, RF, and GB. Experimental evaluation using four datasets of hourly solar irradiance measurements in Brazil demonstrates HetDS's superior overall accuracy across five prominent error metrics. By mitigating the risk of selecting an inappropriate model, HetDS enhances system generalization, outperforming standalone methods in nearly all comparisons.
Marco et al. (13) examine the efficacy of exogenous inputs for short-term GHI forecasting, employing various feature selection techniques to identify pertinent variables such as ultraviolet index, cloud cover, and temperature. Five machine learning models, including LSTM, were used to assess predictive performance, with LSTM exhibiting the highest accuracy (MAD of 24.51 %) for GHI forecasting up to 4 hours ahead. Integrating exogenous inputs notably enhanced forecasting performance beyond 15-minute horizons, reducing errors by over 22 % in 4-hour predictions, yet offered minimal improvement for very short horizons.
David Puga-Gil et al. (14) investigated the application of machine learning models, including random forest (RF), support vector machine (SVM), and artificial neural network (ANN), to predict monthly global solar irradiation based on data from six measurement stations in Rias Baixas, Spain. The ANN models exhibited superior performance during both development and testing phases, as well as in extrapolating knowledge to other locations. Achieving mean absolute percentage error (MAPE) values between 3.9 % and 13.8 % for model development and an overall MAPE between 4.1 % and 12.5 % for seven additional locations, ANNs emerged as effective tools for modeling and predicting monthly global solar irradiation in data-rich environments, with potential for extrapolation to nearby areas. Additionally, RF, a widely used method for regression and classification tasks, was employed in the study to forecast solar irradiation. RF models were developed using the data available from 13 measurement stations in the Rias Baixas area along the Galician coast in Spain.
Using meteorological and environmental data, (15) machine learning (ML) algorithms have become a strong tool for estimating solar irradiance. (16) However, the quality and number of input features used in ML-based GHI prediction models can affect their accuracy and reliability. (17) Feature selection is the process of choosing the most pertinent and informative input features for a specific prediction task in order to increase the precision and robustness of the prediction models. (18) A better understanding of the impact of feature selection on GHI prediction models can improve the accuracy and dependability of solar irradiation forecasts. This, in turn, may help with resource planning, solar energy system optimization, and sustainable grid-scale solar power integration. (19,20) The ability of the proposed method to handle non-linear relationships, perform automatic and interpretable feature selection, and efficiently handle high-dimensional and potentially incomplete datasets makes it an excellent choice for feature selection in global horizontal irradiation forecasting. While other methods have their own strengths, the comprehensive benefits of the proposed methodology align well with the specific challenges and requirements of accurate global horizontal irradiation (GHI) forecasting.
Data and Metadata. 2024; 3:363
The solar power that a surface receives per unit area is known as solar irradiance; accumulated over a specific time period, it gives the solar irradiation. The components of solar irradiance include:
Direct Normal Irradiance (DNI): the amount of solar radiation received per unit area by a surface held perpendicular (normal) to the rays that arrive in a straight line from the sun. In general, keeping a surface normal to the incoming radiation maximizes the irradiance it receives over the year. This quantity is especially important for concentrating solar thermal systems that track the position of the sun. (21)
Diffuse Horizontal Irradiance (DHI): the amount of solar radiation received per unit area by a horizontal surface that does not arrive on a direct path from the sun. It arrives equally from all directions.
Global Horizontal Irradiance (GHI): the total quantity of solar radiation per unit area on a horizontal surface, including both the direct component that comes straight from the sun and the diffuse component scattered by the atmosphere. GHI is usually measured in watts per square meter (W/m²). (21,22)
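The three components are tied together by the standard closure relation GHI = DNI × cos(θz) + DHI, where θz is the solar zenith angle. The snippet below is a minimal sketch of that relation, not the authors' code; the function name and the sample values are illustrative assumptions.

```python
import math

def ghi_from_components(dni, dhi, zenith_deg):
    """Return GHI in W/m^2 from DNI and DHI (W/m^2) and the solar zenith angle (degrees)."""
    # The direct beam only contributes while the sun is above the horizon,
    # so a negative cosine is clipped to zero.
    cos_zenith = max(math.cos(math.radians(zenith_deg)), 0.0)
    return dni * cos_zenith + dhi

# Illustrative values, not measurements from the Ouarzazate dataset.
example_ghi = ghi_from_components(dni=800.0, dhi=100.0, zenith_deg=60.0)
print(round(example_ghi, 1))
```

With the sun below the horizon (zenith angle above 90 degrees), only the diffuse term remains.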

Proposed Methodology
In machine learning methods, feature selection is frequently employed to select the set of variables that most effectively represents the source data, reducing data size and model complexity and increasing prediction precision. It is common practice to test several feature selection techniques and select the variable set that yields the highest forecasting performance. As an alternative, ensemble feature selection can be used to combine the benefits of various feature selection techniques.
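One simple way to combine several feature selection techniques is to average the rank each technique assigns to every feature. The sketch below assumes this rank-averaging scheme purely for illustration; the paper does not state which aggregation rule was used, and the score values and the helper name `ensemble_rank` are hypothetical.

```python
import numpy as np

def ensemble_rank(score_table):
    """Average per-method ranks (1 = most important) across methods.

    score_table has shape (n_methods, n_features); larger means more important.
    """
    scores = np.asarray(score_table, dtype=float)
    # argsort of the negated scores lists features best-first for each method;
    # a second argsort converts that ordering into a rank for every feature.
    order = np.argsort(-scores, axis=1)
    ranks = np.argsort(order, axis=1) + 1
    return ranks.mean(axis=0)

# Hypothetical importance scores from three methods for four features.
feature_names = ["DNI", "DHI", "zenith_angle", "temperature"]
scores = [
    [0.60, 0.25, 0.10, 0.05],   # e.g. a tree-based importance
    [0.50, 0.12, 0.30, 0.08],   # e.g. absolute regression coefficients
    [0.55, 0.30, 0.10, 0.05],   # e.g. absolute correlation with the target
]
mean_ranks = ensemble_rank(scores)
selected = [feature_names[i] for i in np.argsort(mean_ranks)[:2]]
print(selected)
```

Averaging ranks rather than raw scores keeps methods with different score scales from dominating the result.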
The five basic steps of the suggested method for predicting solar radiation are shown in figure 3. First, solar radiation data and other meteorological data are collected into a sizeable database. (23) The data are then cleaned by pre-processing, which includes eliminating anomalies and imputing missing values. The following phase involves selecting the most important variables using an ensemble feature selection method. Then, several machine learning techniques are applied after the data have been divided into training, validation, and test sets. Finally, different statistical metrics are used to rate the forecasting accuracy of the system. Experiments were carried out with Python 3.8. The results of each model are presented in terms of accuracy, in order to assess the effectiveness of each feature importance technique. The results of the experiment were then examined. (24)
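The cleaning and splitting steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the valid range of 0 to 1500 W/m², linear interpolation for gap filling, the 70/15/15 chronological split, and both function names are choices made here, not details given in the paper.

```python
import numpy as np

def clean_series(values, lower=0.0, upper=1500.0):
    """Mark out-of-range readings as missing, then fill gaps by linear interpolation."""
    x = np.asarray(values, dtype=float).copy()
    with np.errstate(invalid="ignore"):        # NaN comparisons are simply False
        x[(x < lower) | (x > upper)] = np.nan  # anomalies become missing values
    missing = np.isnan(x)
    idx = np.arange(x.size)
    x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    return x

def chronological_split(n_samples, train_pct=70, val_pct=15):
    """Return train/validation/test index arrays in time order, without shuffling."""
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    idx = np.arange(n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Toy hourly GHI readings with one gap and two implausible values.
ghi = [0.0, 120.0, np.nan, 480.0, -5.0, 700.0, 9999.0, 300.0, 50.0, 0.0]
cleaned = clean_series(ghi)
train_idx, val_idx, test_idx = chronological_split(len(cleaned))
print(cleaned)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting in time order, instead of at random, keeps later readings out of the training set, which matters for forecasting tasks.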

Heat map
A heat map is a graphical representation of data in which the values of a matrix are represented by color gradients. Darker colors generally indicate higher values, while lighter colors reflect lower values. The heat map analysis gives insight into the relationship between GHI, the solar zenith angle, and the solar radiation components. According to the heat map data, DNI and the solar zenith angle are the two most important meteorological factors influencing the components of solar radiation, and in particular the anticipated performance. When the sky is clear, figure 2 shows that there is a strong correlation between the calculated and regular components of solar radiation.
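The matrix behind such a heat map is typically a table of pairwise correlation coefficients. The sketch below computes one with NumPy on synthetic data; the variable ranges and the random generator seed are assumptions chosen for illustration and do not reproduce the Ouarzazate measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for hourly variables; not the Ouarzazate measurements.
zenith = rng.uniform(10.0, 85.0, n)          # solar zenith angle, degrees
dni = rng.uniform(200.0, 900.0, n)           # W/m^2
dhi = rng.uniform(50.0, 150.0, n)            # W/m^2
temperature = rng.normal(20.0, 5.0, n)       # deg C, independent of GHI here
ghi = dni * np.cos(np.radians(zenith)) + dhi

names = ["GHI", "DNI", "DHI", "zenith", "temperature"]
data = np.vstack([ghi, dni, dhi, zenith, temperature])
corr = np.corrcoef(data)                     # rows are variables, so corr is 5 x 5

for name, value in zip(names, corr[0]):
    print(f"corr(GHI, {name}) = {value:+.2f}")
```

Passing `corr` to a plotting routine such as matplotlib's `imshow` would render the matrix as a color grid.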

Machine learning for feature selection
Feature importance scores are crucial in a predictive modeling approach, as they provide information about the data, information about the model, and a foundation for selecting features and performing dimensionality reduction. These steps can significantly enhance the effectiveness and efficiency of a predictive model for the given problem. Feature selection is a critical step in the development of accurate models for predicting global horizontal irradiance (GHI). By identifying and using the most relevant variables, the performance of machine learning models can be improved, overfitting reduced, and computational efficiency enhanced. The suggested methodology is used to assess the effectiveness of multiple ML algorithms in predicting solar radiation. The algorithms evaluated are Random Forest (RF), Linear Regression (LR), Lasso Least Angle Regression (LARS), and Gradient Boosting Regression (GBR).
Random Forest Regressor is a machine learning algorithm from the family of models based on decision trees. It is a type of ensemble model that combines several decision trees to produce a more accurate and robust prediction model. (25) The RF regressor is particularly useful for regression problems, where the aim is to forecast a continuous output variable. It can handle both numerical and categorical input features, and is relatively insensitive to outliers and missing values in the data. Additionally, it gives a measure of feature importance, which can be used for feature selection and to gain insight into the underlying relationships between the input and output variables.
Lasso Least Angle Regression (LARS) is a machine learning approach used for regularization and feature selection. It offers effective solutions for high-dimensional datasets by taking into account both the correlation and the predictive power of variables concurrently. LARS favors sparse solutions by combining elements of least squares estimation and forward stepwise selection. It is particularly helpful for datasets with more features than samples, as it reduces overfitting and provides insight into the pertinent variables.
Linear Regression (LR) is a machine learning algorithm used to forecast a continuous output based on one or more input features (26) that have a linear relationship with the output. It is one of the simplest and most widely used algorithms for regression tasks. The algorithm defines the relationship between the input features and the output variable using a linear equation, which can be represented as y = b0 + b1x1 + b2x2 + ... + bnxn, where y is the predicted output variable, x1, x2, ..., xn are the input features, and b0, b1, b2, ..., bn are the coefficients that the algorithm learns during training.
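The coefficients b0, b1, ..., bn of the linear equation can be obtained by ordinary least squares. The snippet below is a minimal NumPy sketch on noise-free toy data; the helper name `fit_linear` and the data are illustrative assumptions, not part of the study.

```python
import numpy as np

def fit_linear(X, y):
    """Return (b0, b) minimizing the squared error of y = b0 + X @ b."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])   # leading column of ones for b0
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef[0], coef[1:]

# Noise-free toy data generated from y = 2 + 3*x1 - 1*x2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]
b0, b = fit_linear(X, y)
print(round(float(b0), 6), np.round(b, 6))
```

Because the toy data contain no noise, the fit recovers the generating coefficients up to floating-point error.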
Gradient Boosting Regression (GBR) is a machine learning method that combines several weak learners, usually decision trees, one after the other to create a prediction model. It repeatedly fits new models to the residuals of the earlier predictions, iteratively minimizing the errors. GBR is popular for a variety of tasks, such as forecasting, classification, and ranking, due to its strong predictive accuracy and resilience against overfitting. It is a valuable tool in the field of machine learning because of its adaptability, interpretability, and capacity to handle complex relationships in data.
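The four algorithms described above each expose a quantity that can serve as a feature importance score. The sketch below uses scikit-learn on synthetic data to show one plausible way of extracting and comparing these scores; the hyper-parameters, the standardization step, the use of absolute coefficients for the linear models, and the data themselves are assumptions made here, not the configuration reported in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoLars, LinearRegression

rng = np.random.default_rng(42)
n = 400
names = ["DNI", "DHI", "zenith", "temperature", "noise"]
X = np.column_stack([
    rng.uniform(200.0, 900.0, n),    # DNI, W/m^2
    rng.uniform(50.0, 150.0, n),     # DHI, W/m^2
    rng.uniform(10.0, 85.0, n),      # solar zenith angle, degrees
    rng.normal(20.0, 5.0, n),        # temperature, deg C (not used by the target)
    rng.normal(0.0, 1.0, n),         # pure noise column
])
y = X[:, 0] * np.cos(np.radians(X[:, 2])) + X[:, 1]   # synthetic GHI

# Standardize so that linear coefficients are comparable across features.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "LR": LinearRegression(),
    "LARS": LassoLars(alpha=1.0),
}
importance = {}
for label, model in models.items():
    model.fit(Xs, y)
    if hasattr(model, "feature_importances_"):
        raw = model.feature_importances_      # impurity-based, tree ensembles
    else:
        raw = np.abs(model.coef_)             # coefficient magnitude, linear models
    importance[label] = raw / raw.sum()       # normalize so each method sums to 1

for label, scores in importance.items():
    ranked = [names[i] for i in np.argsort(scores)[::-1]]
    print(label, ranked[:3])
```

Normalizing each score vector to sum to one makes the four methods easier to compare side by side.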

RESULTS AND DISCUSSION
This section presents the results of the suggested methodology for solar forecasting using several ML algorithms. (27) The utility of ensemble feature selection is also investigated. In this part, four distinct machine learning (ML) methods for estimating solar radiation are compared: Random Forest Regressor, Linear Regressor, Lasso Least Angle Regression, and Gradient Boosting Regression. Algorithm performance is evaluated by means of the coefficient of determination (R²). It typically ranges from 0 to 1 and measures the proportion of the variance in the observed values that the predictions explain. A coefficient of 1 implies that the model reproduces the observed data exactly, whereas a value of 0 shows that the model does no better than always predicting the mean of the observations. The initial phase of the architecture involves identifying the most relevant features. Figures 4-7 showcase the outcomes of this process. Specifically, figure 5 presents the results obtained from the LR algorithm. It reveals that three key parameters, Clearsky Global Horizontal Irradiance (Clearsky GHI), Direct Normal Irradiance (DNI), and DHI, significantly influence our target feature.
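The coefficient of determination can be written as R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the observations from their mean. The snippet below is a small sketch of that definition; the sample values are illustrative and are not results from the study.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [100.0, 300.0, 500.0, 700.0]     # illustrative GHI values, W/m^2
perfect = list(observed)                    # exact predictions
mean_only = [400.0] * 4                     # always predict the mean
close = [110.0, 290.0, 510.0, 690.0]        # each prediction off by 10 W/m^2

print(r2_score(observed, perfect))
print(r2_score(observed, mean_only))
print(round(r2_score(observed, close), 4))
```

The three cases correspond to a perfect fit, a mean-only baseline, and a close but imperfect fit.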
By examining the difference between actual and predicted values of GHI, the effectiveness of forecasting models can be assessed, highlighting their predictive accuracy. In addition, the analysis highlights the profound influence and interconnection of key variables such as DNI, DHI, the zenith angle of the sun, and temperature on global horizontal irradiance. These variables, as shown by correlation maps and feature importance techniques, play an essential role in shaping solar radiation patterns. Their collective impact highlights the complex dynamics inherent in solar energy forecasting, underscoring the need for comprehensive modeling approaches that take into account both meteorological conditions and solar radiation characteristics. By recognizing these complex relationships, forecasters can refine their models to deliver more accurate and reliable forecasts, improving the effectiveness and efficiency of solar energy utilization.
The results of the proposed architecture show that, of all the algorithms evaluated, the RF regressor is the best approach in terms of feature importance, achieving maximum accuracy (99.95 %) when 11 features are used.

CONCLUSIONS
This study assessed the effectiveness of four prominent machine learning methods in forecasting Global Horizontal Irradiance (GHI): Random Forest (RF), Linear Regression (LR), Lasso Least Angle Regression (LARS), and Gradient Boosting Regression (GBR). Through experimentation, we demonstrated the potential of feature selection techniques to optimize prediction accuracy while maintaining the integrity of the forecast. By identifying and incorporating essential variables, these methods exploit the interplay between the diverse data attributes that are crucial for precise GHI prediction. Our findings not only highlight the practical utility of feature selection methodologies but also underscore their pivotal role in advancing renewable energy forecasting applications. By clarifying the relationships between input variables and GHI outcomes, we offer valuable insights for future research. Subsequent investigations can examine more closely how specific data features contribute to prediction accuracy, thus driving further advances in renewable energy forecasting models. In essence, this study not only

Figure 1. Equation of global horizontal irradiation using DNI and DHI

Figure 3. Heat map data and correlation between variables

Table 1 shows all the database variables.

Table 2. Proposed approach results
Figure 8. Random forest regressor feature importance