Nonparametric analysis of income distributions among different regions based on energy distance with applications to China Health and Nutrition Survey data

  • Ma, Zhihua (Department of Statistics, University of Connecticut)
  • Xue, Yishu (Department of Statistics, University of Connecticut)
  • Hu, Guanyu (Department of Statistics, University of Connecticut)
  • Received : 2018.10.23
  • Accepted : 2018.12.05
  • Published : 2019.01.31


Income distribution is a major concern in economic theory, and in regional economics it is often of interest to compare income distributions across regions. Traditional methods typically compare income inequality between regions either by assuming parametric forms for the income distributions or by using summary statistics such as the Gini coefficient. In this paper, we propose a nonparametric procedure, based on energy distance, to test for heterogeneity in income distributions among regions, together with a K-means procedure for clustering income distributions based on energy distance. Simulation studies show that the energy distance based test performs competitively with other common methods of hypothesis testing, and that the energy distance based clustering method performs well. The proposed approaches are applied to data from the 2011 China Health and Nutrition Survey. The results indicate significant differences among the income distributions of the 12 provinces in the dataset. Applying a 4-means clustering algorithm then groups the income distributions of the 12 provinces into four clusters.
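
As an illustration of the kind of test described above, the following is a minimal sketch of a two-sample energy distance statistic with a permutation p-value for one-dimensional data such as incomes. It is not the authors' implementation: the paper tests heterogeneity among several regions at once, whereas this sketch compares only two samples, and the function names (`mean_abs_diff`, `energy_distance`, `permutation_pvalue`), the number of permutations, and the toy data are all assumptions made for illustration.

```python
import random


def mean_abs_diff(xs, ys):
    """Average of |x - y| over all pairs with x in xs and y in ys."""
    total = sum(abs(x - y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Plug-in (V-statistic) estimate of the energy distance between
    two one-dimensional samples: 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (2.0 * mean_abs_diff(xs, ys)
            - mean_abs_diff(xs, xs)
            - mean_abs_diff(ys, ys))


def permutation_pvalue(xs, ys, n_perm=199, seed=0):
    """Permutation p-value for H0: both samples share one distribution.

    Group labels are reshuffled n_perm times; the p-value is the share
    of reshuffled statistics at least as large as the observed one,
    counting the observed split itself (hypothetical default settings).
    """
    rng = random.Random(seed)
    observed = energy_distance(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_distance(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Toy data (made up for illustration, not survey values).
region_a = [float(v) for v in range(10)]
region_b = [float(v) for v in range(100, 110)]
p_value = permutation_pvalue(region_a, region_b)
```

The permutation approach needs no parametric assumption about the income distributions, which is the property the abstract emphasizes. Because the sample sizes stay fixed across permutations, the usual sample-size scaling of the statistic is omitted here without changing the p-value.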

