Machine Learning Approaches to Corn Yield Estimation Using Satellite Images and Climate Data: A Case of Iowa State

  • Kim, Nari (Division of Earth Environmental System Science, Pukyong National University) ;
  • Lee, Yang-Won (Department of Spatial Information Engineering, Pukyong National University)
  • Received : 2016.07.22
  • Accepted : 2016.08.23
  • Published : 2016.08.31

Abstract

Remote sensing data has been widely used to estimate crop yields with statistical methods such as regression models. Machine learning, an efficient empirical method for classification and prediction, is another approach to crop yield estimation. This paper describes corn yield estimation in Iowa using four machine learning approaches, namely SVM (Support Vector Machine), RF (Random Forest), ERT (Extremely Randomized Trees) and DL (Deep Learning), and compares their validation statistics. To examine the seasonal sensitivities of the corn yields, three period groups were set up: (1) MJJAS (May to September), (2) JA (July and August) and (3) OC (optimal combination of months). Overall, the DL method showed the highest accuracy in terms of the correlation coefficient for all three period groups. The accuracies were relatively favorable in the OC group, which indicates that an optimal combination of months can be significant in statistical modeling of crop yields. The differences between our predictions and the USDA (United States Department of Agriculture) statistics were about 6-8 %, which shows that machine learning approaches can be a viable option for crop yield modeling. In particular, DL showed more stable results by overcoming the overfitting problem of generic machine learning methods.

1. Introduction

Monitoring crop yield is important for many agronomic issues such as farm management, food security and international crop trade. Because South Korea depends heavily on imports of most major grains except rice, reasonable estimates of crop yields are increasingly needed under recent conditions of climate change and various disasters.

Remote sensing data has been widely used to estimate crop yields with statistical methods such as regression models. Prasad et al. (2006) conducted multivariate regression analyses to estimate corn and soybean yields in Iowa using MODIS (Moderate Resolution Imaging Spectroradiometer) NDVI (Normalized Difference Vegetation Index), climate factors and soil moisture. Ren et al. (2008) presented regression models for estimating winter wheat yields in Shandong, China, using MODIS NDVI and weather data. Kim et al. (2014) estimated corn and soybean yields for the Midwestern United States (US) using several MODIS products and climatic variables and reported prediction errors of about 10 %. Hong et al. (2015) built multiple regression models using MODIS NDVI and weather data to estimate rice yields in North Korea and reported an RMSE of 0.27 ton/ha. Most previous studies are thus based on multivariate regression analysis of the relationship between crop yields and agro-environmental factors such as vegetation indices, climate variables and soil properties.

Machine learning, an efficient empirical method for classification and prediction, is another approach to crop yield estimation. Jiang et al. (2004) adopted an ANN (Artificial Neural Network) technique to estimate winter wheat yields from an AVHRR (Advanced Very High Resolution Radiometer) dataset, and the ANN model showed higher accuracy than multivariate regression models. Jaikla et al. (2008) estimated rice yields using SVM (Support Vector Machine) and compared the result with a simulation by the DSSAT (Decision Support System for Agrotechnology Transfer) model, which showed similar performance. Kuwata and Shibasaki (2015) employed DL (Deep Learning) methods to estimate corn yields for Illinois and showed that DL achieved higher accuracy than SVM. Despite the predictive power of machine learning techniques, their applications in crop yield estimation remain relatively scarce, and comparative studies among various machine learning methods for crop yield estimation have not been reported yet.

The objective of this study is to estimate crop yields with several major machine learning techniques, namely SVM, RF (Random Forest), ERT (Extremely Randomized Trees) and DL, and to compare their validation statistics. We used satellite images from MODIS and the climate reanalysis data created by PRISM (Parameter-Elevation Regressions on Independent Slopes Model) for the machine learning analyses. To account for phenological effects on prediction accuracy, we set up three types of data period: (1) May to September, (2) July and August and (3) an optimal combination of the months.

 

2. Data and Method

2.1 Study area

Iowa is a state in the Midwestern US and belongs to the Corn Belt (Fig. 1). Iowa produces approximately 18 % of US corn, which ranks it first among the states (USDA, 2012). Out of the 99 counties of Iowa, we selected the 94 counties in which cropland exceeded 10 % of the county area. The study period is 2004 to 2014, as determined by data availability.

Fig. 1. Study area

2.2 Data

2.2.1 Remote sensing data

Satellite remote sensing data was acquired from NASA (National Aeronautics and Space Administration) and the ESA (European Space Agency) CCI (Climate Change Initiative). The Terra/MODIS products by NASA, such as NDVI, EVI (Enhanced Vegetation Index), LAI (Leaf Area Index), FPAR (Fraction of Photosynthetically Active Radiation), GPP (Gross Primary Production) and ET (Evapotranspiration), are closely related to crop yields. In addition, the SM (Soil Moisture) dataset was obtained from ESA CCI, which produces the most complete and consistent global soil moisture data on a 0.25° grid using active and passive microwave sensors. Table 1 summarizes the datasets used. Previous studies (Prasad et al., 2006; Na et al., 2014; Kim et al., 2014) showed that these variables are associated with corn yield.

Table 1. Summary of dataset used in this study

2.2.2 Climate data

The PRISM Climate Group (http://www.prism.oregonstate.edu/) provides daily and monthly reanalysis of seven climate elements in the US: precipitation (PPT), maximum temperature (Tmax), minimum temperature (Tmin), mean temperature (Tmean), mean dew point temperature (TDmean), minimum vapor pressure deficit (VPDmin) and maximum vapor pressure deficit (VPDmax). We used monthly data for PPT, Tmax, Tmin and Tmean at 4-km resolution.

2.2.3 Crop yield data

As a reference dataset, county-level corn yield statistics were obtained from the NASS (National Agricultural Statistics Service) of the USDA (United States Department of Agriculture) (http://quickstats.nass.usda.gov). The unit of corn yield (bushels per acre) was converted to tons per hectare for convenience.
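
The unit conversion can be sketched as follows. The paper does not state the factor it used; this sketch assumes the standard US test weight of 56 lb per bushel of shelled corn, and the function name is hypothetical.

```python
# Assumption (not stated in the paper): one bushel of shelled corn = 56 lb.
LB_TO_KG = 0.45359237          # definition of the avoirdupois pound
ACRE_TO_HA = 0.40468564224     # definition of the international acre
CORN_LB_PER_BUSHEL = 56.0      # assumed standard corn bushel weight


def bu_per_acre_to_t_per_ha(yield_bu_ac: float) -> float:
    """Convert a corn yield from bushels per acre to metric tons per hectare."""
    kg_per_acre = yield_bu_ac * CORN_LB_PER_BUSHEL * LB_TO_KG
    return kg_per_acre / 1000.0 / ACRE_TO_HA


# Under these assumptions, 1 bu/acre is roughly 0.0628 t/ha.
factor = bu_per_acre_to_t_per_ha(1.0)
```

With this factor, the RMSE values reported in ton/ha can be converted back to bushels per acre by dividing by `factor`.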

2.2.4 Data processing

Because the cropland area of each county had to be determined first, we extracted from the MODIS land cover data the pixels that were recorded as cropland (land cover ID = 12) throughout the period of 2004-2014. Fig. 2 shows that the distribution of the cropland pixels is similar to the pattern of the major corn-producing counties in Iowa. For these cropland pixels, we constructed a database including satellite images and climate variables. Because the crop yield statistics are aggregated by county, the satellite and climate data needed to be averaged at the county level. We employed a zonal operation to summarize the pixel values for each county.
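
A minimal sketch of the zonal averaging step is given below. It assumes the pixels have already been flattened into (county, land cover ID, value) tuples, which is a hypothetical layout chosen for brevity; the sample values are invented, and the check that a pixel is cropland in every year of 2004-2014 is reduced here to a single land cover ID.

```python
from collections import defaultdict

CROPLAND_ID = 12  # MODIS land cover class for cropland, as given in the text


def zonal_mean(pixels):
    """Average pixel values per county, using cropland pixels only.

    `pixels` is an iterable of (county_id, land_cover_id, value) tuples;
    this flat layout is a hypothetical stand-in for real raster grids.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for county_id, land_cover_id, value in pixels:
        if land_cover_id != CROPLAND_ID or value is None:
            continue  # skip non-cropland and missing pixels
        sums[county_id] += value
        counts[county_id] += 1
    return {county: sums[county] / counts[county] for county in sums}


# Invented NDVI values for illustration only.
sample = [
    ("Story", 12, 0.80),
    ("Story", 12, 0.70),
    ("Story", 13, 0.20),  # non-cropland pixel, ignored
    ("Polk", 12, 0.60),
]
county_ndvi = zonal_mean(sample)
```

Each resulting county value can then be matched with the county-level yield statistic of the same year.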

Fig. 2. (a) Corn yields by county and (b) cropland pixels derived from MODIS land cover data (Iowa outlined by the dashed line)

Environmental factors related to crop yields can differ in their sensitivity to the stage of the growing season. Hence, we derived 13 month combinations for calculating the correlation coefficients (Table 2): MJJAS (May to September), each individual month between May and September (May, Jun, Jul, Aug and Sep), two successive months (MJ, JJ, JA and AS), and three successive months (MJJ, JJA and JAS). From these combinations, we selected three period groups: (1) MJJAS for the whole growing season, (2) JA as the group with the highest correlation coefficients for most variables and (3) OC, the optimal combination of periods in terms of the correlation coefficient (shaded in gray in Table 2). To estimate the corn yield in the 94 counties of Iowa, we built a matchup database consisting of 11 input variables from satellite images (NDVI, EVI, LAI, FPAR, GPP, ET and SM) and the climate dataset (PPT, Tmin, Tmax and Tmean) for the three period groups between 2004 and 2014.

Table 2. Correlation coefficients of the variables against corn yields, 2004-2014

2.3 Methods

2.3.1 Support vector machine

SVM is a powerful general-purpose classification technique designed to minimize classification error (Vapnik, 1998). For estimation or prediction, regression methods are combined with each classified group. SVM finds the optimal separating classifier between two classes by maximizing the margin between support vectors, using kernel functions such as linear, Gaussian RBF (Radial Basis Function), polynomial and hyperbolic tangent (Cortes and Vapnik, 1995; Karatzoglou et al., 2006). The Gaussian RBF kernel was used in our experiment.
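
For reference, the Gaussian RBF kernel has the form exp(-gamma * ||x - y||^2). The sketch below evaluates it for small feature vectors; the `gamma` value is hypothetical, since the paper does not report its kernel parameters.

```python
from math import exp


def rbf_kernel(x, y, gamma=0.1):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2).

    `gamma` is a hypothetical value; the paper does not report its setting.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


def gram_matrix(samples, gamma=0.1):
    """Pairwise kernel matrix that a kernel method such as SVM operates on."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]
```

The kernel equals 1 for identical samples and decays toward 0 as samples move apart, so `gamma` controls how local the fitted function is.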

2.3.2 Random forest

RF, an improved version of CART (Classification and Regression Trees), is an ensemble method using bootstrap aggregating (Breiman, 2001). RF builds decision trees from random samples drawn from the training data and makes predictions by voting (for classification) or averaging (for regression) over a large number of trees (Ali et al., 2012). In our experiment, the number of trees was 500, and the number of variables used for splitting nodes was set to n/3 (n = number of input variables). In addition, the out-of-bag error was used as the criterion of model suitability.
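
The bagging mechanics described above can be sketched without a tree learner: each tree sees a bootstrap sample, the samples it never saw form its out-of-bag set, and the forest prediction is the mean over trees. Rounding n/3 down is an assumption (the text only says n/3), and the sample size of 94 is used purely for illustration.

```python
import random


def mtry_for_regression(n_inputs):
    """Variables tried per split: n/3, rounded down (rounding is an assumption)."""
    return max(1, n_inputs // 3)


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def forest_predict(tree_predictions):
    """Regression forests average the predictions of the individual trees."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
N_TREES = 500        # value reported in the text
N_SAMPLES = 94       # illustrative sample size, not the paper's training size
bags = [bootstrap_split(N_SAMPLES, rng) for _ in range(N_TREES)]
oob_fraction = sum(len(oob) for _, oob in bags) / (N_TREES * N_SAMPLES)
```

On average a bit more than a third of the samples are out-of-bag for any one tree, which is what makes the out-of-bag error usable as an internal validation measure.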

2.3.3 Extremely randomized trees

ERT is an ensemble method using unpruned decision trees. ERT differs from other tree-based ensemble methods such as RF in that it splits nodes by choosing cut-points at random and uses the complete learning sample (no bootstrap copies) to grow the trees (Geurts et al., 2006). The rationale for such randomization comes from bias-variance analysis (Friedman, 1997). Randomization increases the bias and variance of individual trees, but these effects can be attenuated by averaging over a sufficiently large ensemble of trees. In our experiment, the number of trees and the number of variables used for splitting nodes were set to the same values as for RF.
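
The random cut-point idea can be illustrated with a single node split for regression. This is a simplified sketch in the spirit of Geurts et al. (2006), not the implementation used in the paper: one uniform cut-point is drawn for each of `k` randomly chosen features, and the candidate with the largest variance reduction is kept.

```python
import random


def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def extra_trees_split(X, y, k, rng):
    """Pick a node split using random cut-points.

    For each of k randomly chosen features, draw ONE cut-point uniformly
    between the feature's min and max, then keep the candidate with the
    largest variance reduction. Returns (feature_index, cut_point) or None.
    """
    n_features = len(X[0])
    best, best_score = None, -1.0
    for j in rng.sample(range(n_features), k):
        column = [row[j] for row in X]
        lo, hi = min(column), max(column)
        if lo == hi:
            continue  # constant feature, cannot split
        cut = rng.uniform(lo, hi)
        left = [t for v, t in zip(column, y) if v < cut]
        right = [t for v, t in zip(column, y) if v >= cut]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        score = variance(y) - weighted
        if score > best_score:
            best, best_score = (j, cut), score
    return best
```

Unlike RF, no search over all possible thresholds is performed and the full sample `X`, `y` is passed in without bootstrapping.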

2.3.4 Deep learning

DL is a machine learning method similar to ANN, but it can process large and complex input data by learning tasks with a feed-forward multi-layer network (Ali et al., 2015). The training process of DL usually consists of pre-training and fine-tuning. Pre-training is a phase that uses unsupervised learning to reduce the generalization error of trained deep architectures. Fine-tuning by supervised learning is then performed to reduce the classification error (Erhan et al., 2010). Our experiment used a 200×200 multi-layer network.
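
To give a sense of model size, the sketch below counts the trainable parameters of a fully connected network. It assumes that "200×200" means two hidden layers of 200 units each, with the 11 input variables and a single yield output; this reading is an assumption and is not confirmed by the text.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# Assumed architecture: 11 inputs, two hidden layers of 200 units, 1 output.
LAYERS = [11, 200, 200, 1]
n_params = count_parameters(LAYERS)
```

Under this assumed layout, most of the parameters sit in the connection between the two hidden layers.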

2.3.5 Validation

Leave-one-year-out cross-validation, also known as the Jackknife, was conducted to examine the accuracy of the corn yield estimates produced by the machine learning methods. We calculated the mean bias, MAE (Mean Absolute Error), RMSE (Root-Mean-Square Error), MAPE (Mean Absolute Percentage Error) and the correlation coefficient (r) between the observed and predicted yields for the period of 2004-2014.
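
The validation procedure can be sketched as two helpers: one that produces the yearly train/test splits and one that computes the five statistics. The sign convention for the bias (predicted minus observed) and the function names are assumptions.

```python
from math import sqrt


def validation_stats(observed, predicted):
    """Mean bias, MAE, RMSE, MAPE (%) and Pearson r for paired yields."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]  # assumed sign convention
    bias = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    r = cov / sqrt(var_o * var_p)
    return {"bias": bias, "mae": mae, "rmse": rmse, "mape": mape, "r": r}


def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        test = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, test
```

For a dataset spanning 2004-2014, `leave_one_year_out` produces 11 folds; fitting a model on each `train` set, predicting the `test` set and averaging the per-year `validation_stats` gives summary figures of the kind reported in the result tables.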

 

3. Results and Discussion

We implemented the machine learning methods (SVM, RF, ERT and DL) using R libraries (https://www.r-project.org/). We first estimated the corn yields using the MJJAS dataset for the whole growing season and compared the results with the USDA yield statistics. The leave-one-year-out cross-validation produced 11 sets of validation results, one for each year between 2004 and 2014. Table 3 shows the averages of the 11-year validation results in terms of the mean bias, MAE, MAPE, RMSE and r. Fig. 3 shows the scatter plots of the predicted corn yields against the USDA statistics between 2004 and 2014. According to the results, DL achieved the highest accuracy, with a correlation coefficient of 0.776 and an RMSE of 0.844 ton/ha, although the three methods (RF, ERT and DL) presented similar accuracies. In particular, RF and ERT showed very similar results, with correlation coefficients of 0.651 and 0.654 and RMSE values of 0.879 and 0.891 ton/ha, respectively. This is likely because both approaches are based on regression trees, although their randomization strategies for tree splitting differ somewhat. SVM showed the lowest accuracy, with a correlation coefficient of 0.560 and an RMSE of 0.959 ton/ha.

Table 3. Validation statistics for the period group MJJAS (May to September)

Fig. 3. Scatter plots for observed vs. predicted corn yields, 2004-2014 (red dots: 2012, black dots: all years except for 2012)

Tables 4 and 5 show the 11-year averaged statistics for JA and OC, respectively. Comparing the three period groups (MJJAS, JA and OC), the correlation coefficients for SVM were almost the same (MJJAS=0.590, JA=0.575, OC=0.606), but the RMSE of OC (0.852 ton/ha) was somewhat better than those of MJJAS (0.959 ton/ha) and JA (0.936 ton/ha). For RF and ERT, the correlation coefficients (JA=0.774 and 0.774, OC=0.772 and 0.785, respectively) and the RMSE (JA=0.803 and 0.802 ton/ha, OC=0.767 and 0.756 ton/ha, respectively) were similar for JA and OC, and both were better than those of MJJAS. Hence, it is notable that the seasonal sensitivities of corn yields were well captured by the RF and ERT methods. The DL method produced the highest accuracies for all three period groups in terms of the correlation coefficient (MJJAS=0.776, JA=0.796 and OC=0.800).

Table 4. Validation statistics for the period group JA (July and August)

Table 5. Validation statistics for the period group OC (optimal combination of months)

Moreover, DL presented more stable results in the scatter plots, while the other three methods tended to overfit. Machine learning techniques such as SVM, RF and ERT can suffer from overfitting, which occurs when a model is very complex with many parameters and shows poor predictive performance because it overreacts to minor fluctuations in the dataset. The red dots in Fig. 3 are the cases of 2012, when an extreme drought occurred in the Midwestern US. The models used to predict 2012 (that is, the models built from the data of all years except 2012, following the Jackknife) were fitted too closely to non-drought years, so they could not predict the corn yield under conditions of abrupt drought. However, the DL method can overcome the overfitting problem through a pre-training process based on unsupervised learning (Erhan et al., 2010). Fig. 3(j), 3(k) and 3(l) for the DL method show that the red dots for 2012 are located more closely around the 1:1 line.

 

4. Conclusions

This paper described the estimation of corn yields in Iowa using four machine learning techniques (SVM, RF, ERT and DL) and compared their validation statistics. We set up three period groups (MJJAS, JA and OC) to examine the seasonal sensitivities of the corn yields. Overall, the DL method showed the highest accuracy in terms of the correlation coefficient for all the period groups. The accuracies were relatively favorable in the OC group, which indicates that an optimal combination of months can be influential in statistical modeling of crop yields. The differences between our predictions and the USDA statistics were about 6-8 %, which shows that machine learning approaches can be a viable option for crop yield modeling. In particular, DL showed more stable results by overcoming the overfitting problem of generic machine learning methods. To exploit the temporal characteristics of crop yields, time-series machine learning techniques such as RNN (Recurrent Neural Network) are worth investigating in future work. A sensitivity test that examines the contribution of climate change to crop yields by including or excluding the climate variables is another possible direction for future work.

References

  1. Ali, I., Greifeneder, F., Stamenkovic, J., Neumann, M., and Notarnicola, C. (2015), Review of machine learning approaches for biomass and soil moisture retrievals from remote sensing data, Remote Sensing, Vol. 7, No. 12, pp. 16398-16421. https://doi.org/10.3390/rs71215841
  2. Ali, J., Khan, R., Ahmad, N., and Maqsood, I. (2012), Random forests and decision trees, International Journal of Computer Science Issues, Vol. 9, No. 5, pp. 272-278.
  3. Breiman, L. (2001), Random forests, Machine Learning, Vol. 45, No. 1, pp. 5-32. https://doi.org/10.1023/A:1010933404324
  4. Cortes, C. and Vapnik, V. (1995), Support-vector networks, Machine Learning, Vol. 20, No. 3, pp. 273-297. https://doi.org/10.1007/BF00994018
  5. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., and Vincent, P. (2010), Why does unsupervised pre-training help deep learning?, Journal of Machine Learning Research, Vol. 11, pp. 625-660.
  6. Friedman, J.H. (1997), On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Mining and Knowledge Discovery, Vol. 1, pp. 55-77. https://doi.org/10.1023/A:1009778005914
  7. Geurts, P., Ernst, D., and Wehenkel, L. (2006), Extremely randomized trees, Machine Learning, Vol. 63, No. 1, pp. 3-42. https://doi.org/10.1007/s10994-006-6226-1
  8. Hong, S.Y., Na, S.I., Lee, K.D., Kim, Y.S., and Baek, S.C. (2015), A study on estimating rice yield in DPRK using MODIS NDVI and rainfall data, Korean Journal of Remote Sensing, Vol. 31, No. 5, pp. 441-448. (in Korean with English abstract) https://doi.org/10.7780/kjrs.2015.31.5.8
  9. Jaikla, R., Auephanwiriyakul, S., and Jintrawet, A. (2008), Rice yield prediction using a support vector regression method, Proceedings of Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology 2008, 14-17 May, Krabi, Thailand, pp. 908-913.
  10. Jiang, D., Yang, X., Clinton, N., and Wang, N. (2004), An artificial neural network model for estimating crop yields using remotely sensed information, International Journal of Remote Sensing, Vol. 25, No. 9, pp. 1723-1732. https://doi.org/10.1080/0143116031000150068
  11. Karatzoglou, A., Meyer, D., and Hornik, K. (2006), Support vector machines in R, Journal of Statistical Software, Vol. 15, No. 9. pp. 1-28.
  12. Kim, N., Cho, J., Shibasaki, R., and Lee, Y.W. (2014), Estimation of corn and soybean yields of the US Midwest using satellite imagery and climate dataset, Journal of Climate Research, Vol. 9, No. 4, pp. 315-329. (in Korean with English abstract) https://doi.org/10.14383/cri.2014.9.4.315
  13. Kuwata, K. and Shibasaki, R. (2015), Estimating crop yields with deep learning and remotely sensed data, Proceedings of 2015 IEEE International Geoscience and Remote Sensing Symposium, 26-31 July, Milan, Italy, pp. 858-861.
  14. Na, S., Hong, S., Kim, Y., and Lee, K. (2014), Estimation of corn and soybean yields based on MODIS data and CASA model in Iowa and Illinois, USA, Korean Journal of Soil Science and Fertilizer, Vol. 47, No. 2, pp. 92-99. (in Korean with English abstract) https://doi.org/10.7745/KJSSF.2014.47.2.092
  15. Prasad, A.K., Chai, L., Singh, R.P., and Kafatos, M. (2006), Crop yield estimation model for Iowa using remote sensing and surface parameters, International Journal of Applied Earth Observation and Geoinformation, Vol. 8, pp. 26-33. https://doi.org/10.1016/j.jag.2005.06.002
  16. Ren, J.Q., Chen, Z.X., Zhou, Q.B., and Tang, H.J. (2008), Regional yield estimation for winter wheat with MODIS-NDVI data in Shandong, China, International Journal of Applied Earth Observation and Geoinformation, Vol. 10, pp. 403-413. https://doi.org/10.1016/j.jag.2007.11.003
  17. USDA (2012), Census of agriculture, United States Department of Agriculture, https://www.agcensus.usda.gov/ (last date accessed: 17 August 2016).
  18. Vapnik, V. (1998), Statistical Learning Theory, Wiley, New York, NY.
