I. Introduction
Time series prediction has long been studied in fields such as finance, climate, medicine, and transportation. Traditional statistics-based forecasting models include ARIMA and VAR. Recently, deep learning models that effectively learn the nonlinear and irregular characteristics of time series data have been actively researched. Deep learning methods offer considerable promise for time series forecasting, such as automatic learning of temporal dependence and automatic handling of temporal structures like trends and seasonality. Experimental results have shown that deep learning models perform better than the traditional ARIMA or SARIMA methodology [1-5]. Notably, with the explosive increase in deep learning models, the best model among them needs to be identified for better accuracy and efficiency.
The main purpose of this paper is to propose Informer as the best deep learning based time series prediction model for Seoul's air pollution (NO2) prediction, in comparison with LSTM (Long Short-Term Memory), BI-LSTM (Bidirectional LSTM), and Transformer [6-8].
II. Description of 4 deep learning prediction models
Deep learning is an artificial intelligence (AI) approach that teaches computers to process data in ways inspired by the human brain. Deep learning models can recognize complex patterns in pictures, text, sound, and other data to generate accurate insights and predictions. Generally, the core component of deep learning, an artificial neural network, consists of an input layer, hidden layers, and an output layer. In this paper, among the various kinds of neural networks, we use RNNs and the extensions described below to analyze our data. For the analysis, the Python deep learning package PyTorch (https://pytorch.org/) is used.
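As an illustration of this layered structure, a minimal PyTorch sketch is given below; the layer sizes and input dimension are arbitrary examples, not settings from this study.

```python
import torch
import torch.nn as nn

# A minimal feed-forward network: an input layer, hidden layers,
# and an output layer, as described above. Sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(10, 32),   # input layer: 10 features -> 32 hidden units
    nn.ReLU(),
    nn.Linear(32, 32),   # hidden layer
    nn.ReLU(),
    nn.Linear(32, 1),    # output layer: one predicted value
)

x = torch.randn(8, 10)   # a batch of 8 samples with 10 features each
y_hat = model(x)         # predictions, shape (8, 1)
```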
1. Recurrent Neural Networks (RNNs)
An RNN is a type of artificial neural network designed for sequential or time series data (Fig. 1). This deep learning algorithm is commonly used for sequence problems such as language translation, natural language processing (NLP), speech recognition, image captioning, and time series forecasting, as in our analysis. The main weakness of the conventional RNN, however, is the long-term dependency problem. LSTM was therefore introduced for processing such sequence data, solving the vanishing gradient problem that RNNs suffer from [9-10].
Fig. 1. RNN structure.
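The following minimal PyTorch sketch shows how an RNN consumes a window of a time series; the sizes are illustrative, not the settings used in this study.

```python
import torch
import torch.nn as nn

# An RNN reads a window step by step and carries a hidden state forward,
# which is what makes it suitable for sequential and time series data.
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)       # maps the last hidden state to a forecast

seq = torch.randn(8, 100, 1)  # batch of 8 windows, 100 steps, 1 feature
out, h_n = rnn(seq)           # out: (8, 100, 16); h_n: (1, 8, 16)
pred = head(out[:, -1])       # one-step-ahead prediction, shape (8, 1)
```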
2. LSTM
LSTM has been widely used to predict NO2. Previous studies showed that the LSTM model has high application value in NO2 concentration prediction [11-14]. Suhartono et al. (2019) proposed a hybrid model combining time series regression (TSR) and LSTM with higher accuracy to forecast NO2 in Surabaya City, Indonesia [15]. Drewil and Al-Bahadili (2022) forecasted air pollution using an LSTM method [16]. In this sense, LSTM has become a major prediction model in deep learning. As shown in Fig. 2, LSTM adjusts the information contained in the cell state through a total of three gates and transfers it to the next state [9, 17]. The forget gate determines whether to discard or use information: the sigmoid function yields a number between 0 and 1, which is multiplied by the previous cell state c_{t-1} to decide how much of the previous state to keep. The input gate determines which new information to store in the cell state: a tanh layer produces a candidate vector, and a sigmoid layer determines which of the candidates to use. The value kept by the forget gate and the new information selected by the input gate are then added. Finally, in the output gate, a sigmoid is applied to the input to determine which value to emit from the cell state; the cell state is passed through tanh and multiplied by this gate, so that only the required output value is produced. A minimal sketch of these gate computations is given below Fig. 2.
Fig. 2. LSTM.
LSTM can handle time series with longer sequences than conventional RNNs [9, 11-12, 18].
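The gate computations described above can be written compactly as follows; this is a minimal sketch of a single LSTM step, not the implementation used in this study (torch.nn.LSTM implements the same equations efficiently).

```python
import torch

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked weights and biases
    for the forget, input, and output gates and the candidate state."""
    z = x @ W + h_prev @ U + b        # all four pre-activations at once
    f, i, g, o = z.chunk(4, dim=-1)
    f = torch.sigmoid(f)              # forget gate: how much of c_{t-1} to keep
    i = torch.sigmoid(i)              # input gate: which new candidates to use
    g = torch.tanh(g)                 # candidate cell state
    o = torch.sigmoid(o)              # output gate
    c = f * c_prev + i * g            # updated cell state
    h = o * torch.tanh(c)             # only the required output is emitted
    return h, c
```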
3. BI-LSTM
Wu et al. (2023) showed that a hybrid model, the Res-GCN-BiLSTM model, adapted better and improved prediction accuracy, with a nearly 11% improvement in mean absolute error for NO2 compared with the best performing baseline model [14]. Verma et al. (2018) showed that predictions can be significantly improved using BI-LSTM models that capture the long-term, short-term, and immediate effects of PM2.5 severity levels [19]. BI-LSTM has thus become a better prediction mechanism for NO2 than LSTM. BI-LSTM is a bidirectional LSTM that runs over the time series in both the forward and backward directions, which is useful when the network should learn from the full sequence context at each step. As shown in Fig. 3, BI-LSTM adds an LSTM layer that processes the sequence backward to the existing forward LSTM layer. The final hidden state is a vector concatenating the hidden states of the two LSTM layers; besides concatenation, adding or averaging can also be applied. RNN and LSTM results tend to converge toward the previous pattern, and both this convergence problem and the long-sequence problem motivate BI-LSTM instead of BI-RNN or LSTM [10, 13, 20]. A sketch follows Fig. 3.
Fig. 3. BI-LSTM many to many structure.
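In PyTorch, the backward layer of Fig. 3 is obtained with bidirectional=True, and the forward and backward hidden states are concatenated, doubling the output feature size. The following sketch is illustrative; sizes are not the settings of this study.

```python
import torch
import torch.nn as nn

# BI-LSTM: a forward and a backward LSTM whose hidden states are concatenated.
bilstm = nn.LSTM(input_size=1, hidden_size=32,
                 batch_first=True, bidirectional=True)
head = nn.Linear(64, 1)            # 64 = 32 (forward) + 32 (backward)

seq = torch.randn(8, 100, 1)       # batch of 8 windows of length 100
out, (h_n, c_n) = bilstm(seq)      # out: (8, 100, 64)
pred = head(out[:, -1])            # NO2 forecast from the last time step
```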
4. Transformer
Hickman et al. (2022) predicted European ozone air pollution using a Transformer [21]. The Transformer has recently attracted growing interest among the various deep learning models.
A major issue in long sequence time-series forecasting (LSTF) is increasing the prediction capacity to satisfy the demand for ever longer sequences. This requires (a) the ability to capture dependencies over long ranges and (b) efficient operation on long sequence inputs and outputs. The Transformer model (Fig. 4) showed better performance than RNNs in capturing long-range dependencies [22]. Its self-attention mechanism reduces the maximum length of the network signal propagation path to the theoretical shortest O(1) and avoids recurrent structures, showing excellent potential for LSTF problems. Nevertheless, this mechanism violates requirement (b) because it consumes quadratic, O(L^2), computation and memory for inputs and outputs of length L. Thus, the vanilla Transformer structure does not apply well to LSTF problems; a sketch of this bottleneck follows Fig. 4.
Fig. 4. Transformer.
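The quadratic cost noted above comes from the L x L attention score matrix. The following minimal sketch of scaled dot-product self-attention (with identity projections for brevity, not the full multi-head form) makes the bottleneck explicit.

```python
import math
import torch

def self_attention(x):
    # x: (L, d) sequence. The (L, L) score matrix below is what makes
    # time and memory quadratic in the sequence length L.
    L, d = x.shape
    q, k, v = x, x, x                    # identity projections for brevity
    scores = q @ k.T / math.sqrt(d)      # (L, L): the O(L^2) bottleneck
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                   # (L, d)

out = self_attention(torch.randn(100, 16))
```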
5. Informer
As mentioned earlier, the Transformer model has problems in direct application to LSTF due to its time complexity, high memory usage, and the inherent limitations of the encoder-decoder architecture. To address these problems, an efficient Transformer for LSTF called Informer was recently proposed [23]. It solves the LSTF problem with an encoder-decoder structure; an outline of the model is shown in the following figure [23].
Informer applies the ProbSparse self-attention mechanism when the encoder receives a long sequence (Fig. 5). The small cloned encoder next to it increases the robustness of the model. When the decoder receives a long sequence as input, the portion to be predicted is padded with zeros. The decoder then predicts the entire output in one forward pass from the concatenated feature map generated by the encoder and the encoder-decoder attention. In previous studies, predictions were made with RNN models, LSTM, Transformer, and Informer, and the Informer model showed the highest prediction accuracy [24-25]. The Informer model greatly improves LSTF problem solving owing to these structural advantages. For more details on the Informer model, see Zhou et al. (2021) [23].
Fig. 5. Informer.
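The zero-padded decoder input described above can be sketched as follows; label_len and pred_len are illustrative values, not the exact settings of this study or of [23].

```python
import torch

# Informer-style generative inference: a "start token" slice of the known
# series is concatenated with zeros standing in for the values to predict,
# so the decoder emits the whole forecast in one forward pass.
label_len, pred_len, d = 48, 24, 1
known = torch.randn(1, label_len, d)        # recent observed NO2 values
padding = torch.zeros(1, pred_len, d)       # placeholder for the forecast
decoder_input = torch.cat([known, padding], dim=1)  # (1, label_len+pred_len, d)
```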
III. Data preparation
1. Data acquisition
The daily average air pollution dataset for Seoul provides daily information on average air pollution, including the air quality index, fine dust, ozone, nitrogen dioxide, carbon monoxide, and sulfur dioxide.
Among them, daily air pollution (NO2) data (Seocho-gu) in Seoul from January 1, 2018 to January 1, 2022 were selected as a sample (https://data.seoul.go.kr/dataList/OA-2218/S/1/datasetView.do).
2. Preprocessing
We first preprocessed the downloaded NO2 data by replacing the 19 missing values with the mean. Outliers were defined as values outside 1.5 times the interquartile range. Figure 6 shows the training and test data for the daily average NO2 series. Since the mean and variance in this plot are largely constant, the series can be considered stationary, so no other transformations were required. MinMax scaling was performed on the training data; the preprocessing is sketched after Fig. 6.
Fig. 6. Plot of daily average NO2 versus date for train (including validation data) and test data.
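The preprocessing steps can be sketched as below; the file and column names are hypothetical, and the scaler is fitted on the training portion only.

```python
import pandas as pd

df = pd.read_csv("seoul_no2.csv", parse_dates=["date"])   # hypothetical file
df["no2"] = df["no2"].fillna(df["no2"].mean())            # impute the 19 missing values

# Outlier rule: values outside 1.5 times the interquartile range.
q1, q3 = df["no2"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (df["no2"] < q1 - 1.5 * iqr) | (df["no2"] > q3 + 1.5 * iqr)

# MinMax scaling, fitted on the training data (first 80%, incl. validation).
train = df.iloc[: int(len(df) * 0.8)]
lo, hi = train["no2"].min(), train["no2"].max()
df["no2_scaled"] = (df["no2"] - lo) / (hi - lo)
```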
3. Train data, validation data and test data
Of the 4383 total NO2 observations, 877 (20%) were held out as test data. The training data comprise 2804 observations and the validation data 702, giving a train : validation : test ratio of 0.64 : 0.16 : 0.20.
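Continuing the hypothetical df from the preprocessing sketch, the chronological split with the reported counts looks as follows.

```python
# 2804 / 702 / 877 of the 4383 daily observations (0.64 : 0.16 : 0.20).
n_train, n_val = 2804, 702
train = df.iloc[:n_train]
val = df.iloc[n_train:n_train + n_val]
test = df.iloc[n_train + n_val:]    # the remaining 877 observations (20%)
```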
IV. Experiment and Results
1. Computing environment
The models were trained on Google Colab with the Python package PyTorch.
2. Settings
For comparative analysis, LSTM, BI-LSTM, Transformer, and Informer were trained. Prediction accuracy was evaluated by comparing out-of-sample predictions with the actual values in the test data.
3. Results
The training data were used to train LSTM, BI-LSTM, Transformer, and Informer, while the validation data were used to check that the error was small enough. Predictions were made with a window size of 100, and the optimal model was derived by adjusting the number of epochs, the loss function, the number of input layers, the number of hidden layers, and so forth. As a result, the Adam optimizer with a learning rate of 0.001 and the Huber loss function (delta = 0.1) were used. LSTM and BI-LSTM were stopped early at the 68th and 70th epochs, respectively. For BI-LSTM, dropout was set to 0.5 and batch normalization was performed with a momentum of 0.5.
These network settings are summarized in Tables 1 and 2, respectively, and a minimal sketch of the training loop is given after the tables.
Table 1. LSTM network
Table 2. BI-LSTM network
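As an illustration of these settings, a minimal training loop is sketched below; `model`, `train_loader`, and `val_loader` are assumed to exist, and the patience value is an assumption rather than a reported setting.

```python
import torch
import torch.nn as nn

criterion = nn.HuberLoss(delta=0.1)                         # as reported above
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001

best_val, patience, bad_epochs = float("inf"), 10, 0        # patience assumed
for epoch in range(200):
    model.train()
    for x, y in train_loader:          # windows of length 100
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

    # Early stopping on the validation error, as used for LSTM (epoch 68)
    # and BI-LSTM (epoch 70).
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```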
In this study, the RMSE (Root Mean Square Error) and MAE (Mean Absolute Error) were used to evaluate prediction accuracy; their computation is sketched after Fig. 8. In Fig. 7, BI-LSTM has a lower RMSE than LSTM; similarly, in Fig. 8, the MAE of BI-LSTM is lower than that of LSTM. Both RMSE and MAE declined gradually over the epochs.
Fig. 7. RMSE.
Fig. 8. MAE.
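For reference, the two metrics are computed as follows (a straightforward sketch, equivalent to the standard definitions).

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error: square root of the mean squared residual.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    # Mean Absolute Error: mean of the absolute residuals.
    return np.mean(np.abs(y_true - y_pred))
```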
Table 3 shows that Informer is preferable to LSTM, BI-LSTM, and Transformer in terms of smaller MAE and RMSE on the test data. Informer and Transformer achieve MAEs of 0.0138 and 0.0160 and RMSEs of 0.0167 and 0.0187, respectively, so Informer performs slightly better than Transformer. This is because Informer deals better with the LSTF problem posed by the long range of daily NO2 data.
Table 3. Prediction error
Fig. 9 compares the NO2 values predicted by LSTM, BI-LSTM, Transformer, and Informer with the real NO2 values in the test set. Over the test period from September 1, 2021 to January 1, 2022, Informer and Transformer appear much more accurate than LSTM and BI-LSTM, even though the latter are known as accurate prediction models. Informer is thus effective for predicting air pollution (NO2) in Seoul, solving an LSTF problem.
Fig. 9. Predicted values of NO2 by LSTM, BI-LSTM, Transformer, and Informer versus real NO2.
V. Comparison with other studies for NO2 concentration prediction models
So far, only a few deep learning prediction models have been introduced for NO2, a main air pollutant. Yammahi and Aung (2023) predicted NO2 concentration with ARIMA, SARIMA, LSTM, and a nonlinear autoregressive neural network (NAR-NN) with both open- and closed-loop architectures, showing that predictions based on the open loop are better than those based on the closed loop [7]. In another study, Heydari et al. (2021) developed a hybrid model based on long short-term memory (LSTM) and the multi-verse optimization algorithm (MVO) to predict NO2 [8]. Liu et al. (2021) constructed a daily NO2 concentration prediction model for Beijing based on LSTM and showed that the LSTM model has high application value in NO2 concentration prediction [13]. Wu et al. (2023) proposed a hybrid model called Res-GCN-BiLSTM, combining a residual neural network (ResNet), a graph convolutional network (GCN), and BI-LSTM, for predicting short-term NO2 rather than its long-term behavior [14]. These studies combined LSTM-based methods with other techniques, but they did not include recently presented methods such as Transformer-based models. In this respect, our Informer-based prediction model for NO2 is meaningful from a meteorological perspective. This research focused on Seoul's NO2 data only, but several other sources of air pollutants remain to be studied in the future. Moreover, hybrids of deep learning methodologies or other newly presented methods need to be assessed for more exact air pollutant prediction.
VI. Conclusions
In this paper, LSTM, BI-LSTM, Transformer, and Informer were applied to NO2 data and their performance was assessed. Informer showed the best prediction accuracy for Seoul's air pollution (NO2), followed by Transformer, BI-LSTM, and LSTM. Our NO2 data are a typical example of an LSTF problem, and the results show that, among the four deep learning methods, Informer processes such problems most effectively. Another manuscript, 'The prediction model of Ultraviolet-B (UV-B) using Deep learning model (Bidirectional LSTM) in comparison with LSTM and SARIMA', analyzed Ultraviolet-B data by comparing LSTM, BI-LSTM, and SARIMA [25]. That data showed strong seasonality. When fitting such data, deep learning models such as Transformer-based models or LSTM (or BI-LSTM) do not decompose the time series into seasonal and trend-cycle components well, which causes interpretation problems. SARIMA was expected to perform better than the other models because of this seasonality, but it did not in practice. Based on a few other previous studies, SARIMA did better for short-term prediction only, whereas LSTM or BI-LSTM performed better overall [26]. In fact, for the UV-B data, Transformer and Informer did not even converge completely and trained more slowly; they do not express seasonality well and still cause a memory bottleneck.
References
- G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, "Time Series Analysis, 4th Edition", Wiley Series in Probability and Statistics, June 2008. DOI:10.1002/9781118619193
- C. Lee and J. Kim, "A study on the short-term prediction of power consumption by using the ARIMA model", Journal of the Korean Data Analysis Society, Vol. 19, No. 3, pp. 1349-1362, June 2017. DOI:10.37727/jkdas.2017.19.3.1349
- D.-C. Han, D. W. Lee, and D. Y. Jung, "A Study of the Traffic Volume Correction and Prediction Using SARIMA Algorithm", J. Korea Inst. Intell. Transp. Syst., Vol. 20, No. 6, pp. 1-13, 2021. https://doi.org/10.12815/kits.2021.20.6.1
- J. Korstanje, "The SARIMAX Model", Advanced Forecasting with Python, Springer, pp 125-131, July 2021.
- S. S.-Namini, N. Tavakoli, A. S. Namin, "A Comparison of ARIMA and LSTM in Forecasting Time Series", 2018 17th IEEE International Conference on Machine Learning and Applications, pp. 1-8, 2018. DOI: 10.1109/ICMLA.2018.00227
- P. T. Yamak, L. Yujian, and P. K. Gadosey, "A Comparison between ARIMA, LSTM, and GRU for Time Series Forecasting", ACAI 2019: Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, pp. 49-55, December 2019. https://doi.org/10.1145/3377713.3377722
- A. A. Yammahi and Z. Aung, "Forecasting the concentration of NO2 using statistical and machine learning methods: A case study in the UAE", Heliyon, Vol. 9, Issue 2, February 2023, e12584
- A. Heydari, M. M. Nezhad, D. A. Garcia, F. Keynia, and L. D. Santoli, "Air pollution forecasting application based on deep learning model and optimization algorithm", Clean Technologies and Environmental Policy, 2021.
- J. Ilonen, J.-K. Kamarainen, and J. Lampinen, "Differential evolution training algorithm for feed-forward neural networks", Neural Processing Letters, Vol. 17, No. 1, pp. 93-105, 2003, https://doi.org/10.1023/A:1022995128597
- J. Kim and J.-Y. Kim, "Comparative analysis of performance of BI-LSTM and GRU algorithm for predicting the number of Covid-19 confirmed cases", Journal of the Korea Institute of Information and Communication Engineering, Vol. 26, No. 2, pp. 187-192, 2022. https://doi.org/10.6109/JKIICE.2022.26.2.187
- S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, Vol. 9, No. 8, pp. 1735-1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
- J. Jung and J. Kim, "A Performance Analysis by Adjusting Learning Methods in Stock Price Prediction Model Using LSTM", Journal of Digital Convergence, Vol. 18, No. 11, pp. 259-266, 2020. https://doi.org/10.14400/JDC.2020.18.11.259
- B. Liu, X. Yu, Q. Wang, S. Zhao, and L. Zhang, "A Long Short-Term Memory Neural Network for Daily NO2 Concentration Forecasting", International Journal of Information Technology and Web Engineering (IJITWE), Vol. 16, Issue 4, 2021.
- C.-L. Wu, H.-D. He, R.-F. Song, X.-H. Zhu, Z.-R. Peng, Q.-Y. Fu, and J. Pan, "A hybrid deep learning model for regional O3 and NO2 concentrations prediction based on spatiotemporal dependencies in air quality monitoring network", Environmental Pollution, Vol. 320, 1 March 2023.
- Suhartono, H. Prabowo, and S.-F. Fam, "A Hybrid TSR and LSTM for Forecasting NO2 and SO2 in Surabaya", Soft Computing in Data Science 2019, Springer, first online: 24 September 2019.
- G. I. Drewil and R. J. Al-Bahadili, "Air pollution prediction using LSTM deep learning and metaheuristics algorithms", Measurement: Sensors, Vol. 24, 2022, 100546.
- L. Nashold and R. Krishnan, "Using LSTM and SARIMA Models to Forecast Cluster CPU Usage", arXiv:2007.08092[cs.LG], pp. 1-11, Jul 2020, https://doi.org/10.48550/arXiv.2007.08092
- Preeti, R. Bala, and R. P. Singh, "Financial and Non-Stationary Time Series Forecasting using LSTM Recurrent Neural Network for Short and Long Horizon", International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1-7, Kanpur, India, July 2019. https://doi.org/10.1109/ICCCNT45670.2019.8944624
- I. Verma, R. Ahuja, H. Meisheri, and L. Dey, "Air Pollutant Severity Prediction Using Bi-Directional LSTM Network", 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), December 2018, DOI:10.1109/WI.2018.00-19.
- S. Ko, H.-Y. Yun, and D.-M. Shin, "Electronic Demand Data Prediction using Bidirectional Long Short Term Memory Networks", Journal of Software Assessment and Valuation, Vol. 14, No. 1, pp. 33-40, 2018.
- S. H. M. Hickman, P. T. Griffiths, P. Nowack, E. Alhajjar, and A. T. Archibald, "Forecasting European Ozone Air Pollution With Transformers", Tackling Climate Change with Machine Learning: workshop at NeurIPS 2022.
- A. Zeng, M. Chen, L. Zhang, Q. Xu, "Are Transformers Effective for Time Series Forecasting?", arXiv:2205.13504v3 [cs.AI], pp. 1-15, 2022.
- H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", arXiv:2012.07436v3 [cs.LG], pp. 1-15, Mar 2021.
- H. Wei, W.-S. Wang, and X.-X. Kao, "A novel approach to ultra-short-term wind power prediction based on feature engineering and Informer", Energy Reports, Vol. 9, pp. 1236-1250, Dec 2023. https://doi.org/10.1016/j.egyr.2022.12.062
- M. Kang and J. Kang, "The prediction model of Ultraviolet-B using Deep learning model (Bidirectional LSTM) in comparison with LSTM and SARIMA", Journal of The Korea Society of Computer and Information (JKSCI), submitted, 2023.
- T. Falatouri, F. Darbanian, P. Brandtner and C. Udokwu, "Predictive Analytics for Demand Forecasting - A Comparison of SARIMA and LSTM in Retail SCM", Procedia Computer Science, Vol. 200, pp. 993-1003, 2022. https://doi.org/10.1016/j.procs.2022.01.298