『Graphics』Q-Q Plot (Quantile-Quantile Plot)

Quantile-Quantile Plot (Q-Q plot for short)

Theory

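In statistics, a Q-Q plot (Q stands for quantile) is a probability plot that compares two probability distributions graphically by plotting their quantiles against each other. First, a set of quantile intervals is chosen. A point (x, y) on the plot pairs a quantile of the second distribution (y-coordinate) with the same quantile of the first distribution (x-coordinate). The plot is therefore a parametric curve whose parameter is the quantile interval. If the two distributions are similar, the points lie approximately on the line y = x. If the two distributions are linearly related, the points lie approximately on a straight line, but not necessarily on y = x. A Q-Q plot can also be used as a graphical way of estimating the parameters of a location-scale family of distributions.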

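As the definition shows, a Q-Q plot is mainly used to check whether two distributions are similar. To check data for normality, put the quantiles of the normal distribution on the x-axis and the sample quantiles on the y-axis. If the resulting points fall approximately on a straight line, the sample is linearly related to the normal distribution, which suggests (but does not prove) that the data are normally distributed.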

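(My own understanding: a Q-Q plot shows the relationship between theoretical and observed values, with x = theoretical quantiles and y = observed quantiles.)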
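The location-scale statement above can be checked numerically. The following is a minimal sketch, assuming two normal distributions and a probability grid that I picked purely for illustration. It computes the quantiles of both distributions at the same probabilities and fits a line through the pairs.

```python
import numpy as np
from scipy import stats

# Illustrative choice: 19 evenly spaced probabilities 0.05, 0.10, ..., 0.95
probs = np.linspace(0.05, 0.95, 19)

# Quantiles of the first distribution (x-axis): standard normal
x_q = stats.norm.ppf(probs)
# Quantiles of the second distribution (y-axis): normal with mean 3 and std 2
y_q = stats.norm.ppf(probs, loc=3, scale=2)

# Fit a straight line through the (x, y) quantile pairs
slope, intercept = np.polyfit(x_q, y_q, 1)
print(slope, intercept)  # slope is close to the scale, intercept close to the loc
```

The points lie on a straight line whose slope and intercept recover the scale and location of the second distribution, which is why the line is generally not y = x.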

Worked example

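Test whether a sequence of values follows a normal distribution. The sequence is $X=(x_1, x_2, \ldots, x_i, \ldots, x_N),\ (N>0)$

1. Sort the original sequence in ascending order

$$x_1 \leq x_2 \leq \cdots \leq x_i \leq \cdots \leq x_N$$

2. Compute the Q-Q sequence

Sample mean: $\bar{x}=\frac{\sum_{i=1}^{N} x_i}{N}$

Sample standard deviation: $\sigma=\sqrt{\frac{\sum_{i=1}^{N}(x_i-\bar{x})^{2}}{N-1}}$

Quantiles: $Q_{i}=\frac{x_i-\bar{x}}{\sigma}, \quad t_{i}=\frac{i-0.5}{N}$. Looking up $t_i$ in a standard normal table gives the corresponding theoretical quantile $P_{i}^{\prime}$.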

The computation is shown in the table below:

[Image: table of the computed values]

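3. Draw the Q-Q plot. If the points lie approximately on a straight line, the data are consistent with a normal distribution. Because $Q_i$ is already standardized, normally distributed data fall near y = x; if the raw $x_i$ are plotted instead, points near y = x indicate a standard normal distribution.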
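The three steps can be sketched in a few lines of NumPy and SciPy. The sample values below are hypothetical, and `stats.norm.ppf` stands in for the table lookup of $P_i^{\prime}$.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; any 1-D sequence of numbers works
x = np.array([4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4])
N = len(x)

# Step 1: sort in ascending order
x_sorted = np.sort(x)

# Step 2: sample mean and sample standard deviation (N - 1 in the denominator)
x_bar = x_sorted.mean()
sigma = x_sorted.std(ddof=1)

Q = (x_sorted - x_bar) / sigma        # standardized sample quantiles Q_i
t = (np.arange(1, N + 1) - 0.5) / N   # plotting positions t_i
P = stats.norm.ppf(t)                 # theoretical quantiles P'_i

# Step 3: the Q-Q plot is the scatter of (P, Q); here the pairs are just printed
for p_i, q_i in zip(P, Q):
    print(f"{p_i:8.4f} {q_i:8.4f}")
```

Passing `P` and `Q` to `plt.scatter` would give the plot described in step 3.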

gofplots.ProbPlot

Signature:

class statsmodels.graphics.gofplots.ProbPlot(data, 
                                             dist=<scipy.stats._continuous_distns.norm_gen object>, 
                                             fit=False, 
                                             distargs=(), a=0, loc=0, scale=1)

Q-Q and P-P Probability Plots

Can take arguments specifying the parameters for dist or fit them automatically.

Parameters:

  • data : array_like A 1d data array
  • dist : callable, Compare x against dist. A scipy.stats or statsmodels distribution. The default is scipy.stats.distributions.norm (a standard normal). Can be a SciPy frozen distribution.
  • fit : bool, If fit is false, loc, scale, and distargs are passed to the distribution. If fit is True then the parameters for dist are fit automatically using dist.fit. The quantiles are formed from the standardized data, after subtracting the fitted loc and dividing by the fitted scale. fit cannot be used if dist is a SciPy frozen distribution.
  • distargs : tuple, A tuple of arguments passed to dist to specify it fully so dist.ppf may be called. distargs must not contain loc or scale. These values must be passed using the loc or scale inputs. distargs cannot be used if dist is a SciPy frozen distribution.
  • a : float , Offset for the plotting position of an expected order statistic, for example. The plotting positions are given by (i - a)/(nobs - 2*a + 1) for i in range(0,nobs+1)
  • loc : float, Location parameter for dist. Cannot be used if dist is a SciPy frozen distribution.
  • scale : float, Scale parameter for dist. Cannot be used if dist is a SciPy frozen distribution.
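The plotting-position formula for `a` can be tried out directly. This is a minimal sketch with a hypothetical helper, `plotting_positions`, and it assumes that i runs from 1 to nobs. That assumption appears to match the 16 theoretical quantiles printed in Ex. 1 below, whose first value is about -1.5647.

```python
import numpy as np
from scipy import stats

def plotting_positions(nobs, a=0.0):
    """(i - a) / (nobs - 2*a + 1) for i = 1, ..., nobs (hypothetical helper)."""
    i = np.arange(1, nobs + 1)
    return (i - a) / (nobs - 2 * a + 1)

pos = plotting_positions(16)   # a=0 gives i / (nobs + 1)
theo = stats.norm.ppf(pos)     # theoretical quantiles for the default normal dist
print(theo[0], theo[-1])

# a=0.5 gives (i - 0.5) / nobs, the t_i used in the worked example
print(plotting_positions(8, a=0.5))
```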

Methods

ppplot([xlabel, ylabel, line, other, ax]) Plot of the percentiles of x versus the percentiles of a distribution.
probplot([xlabel, ylabel, line, exceed, ax]) Plot of unscaled quantiles of x against the prob of a distribution.
qqplot([xlabel, ylabel, line, other, ax]) Plot of the quantiles of x versus the quantiles/ppf of a distribution.

The three plotting methods are summarized below:

  • ppplot : Probability-Probability plot

    Compares the sample and theoretical probabilities (percentiles).

  • qqplot : Quantile-Quantile plot

    Compares the sample and theoretical quantiles.

  • probplot : Probability plot

    Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.

Properties

sample_percentiles : Sample percentiles
sample_quantiles : Sample quantiles
sorted_data : Sorted data
theoretical_percentiles : Theoretical percentiles
theoretical_quantiles : Theoretical quantiles

Examples

The first example shows a Q-Q plot for regression residuals

import statsmodels.api as sm
import matplotlib.pyplot as plt

# load_pandas() is used because newer statsmodels releases may not accept load(as_pandas=False)
data = sm.datasets.longley.load_pandas()
data.exog = sm.add_constant(data.exog)
model = sm.OLS(data.endog, data.exog)
mod_fit = model.fit()
res = mod_fit.resid.to_numpy()  # residuals as a NumPy array
'''
res
array([ 267.34002979,  -94.01394237,   46.28716779, -410.1146219 ,
        309.71459079, -249.3112153 , -164.04895636,  -13.18035684,
         14.30477263,  455.39409458,  -17.26892708,  -39.05504249,
       -155.54997356,  -85.67130801,  341.93151399, -206.75782516])
'''

pplot = sm.ProbPlot(res)  # create the ProbPlot before reading its attributes
'''
pplot.theoretical_quantiles
array([-1.56472647, -1.18683143, -0.92889949, -0.72152228, -0.54139509,
       -0.37739194, -0.22300783, -0.07379127,  0.07379127,  0.22300783,
        0.37739194,  0.54139509,  0.72152228,  0.92889949,  1.18683143,
        1.56472647])
'''
fig = pplot.qqplot()
plt.title("Ex. 1 - qqplot - residuals of OLS fit")

qqplot of the residuals against quantiles of t-distribution with 4 degrees of freedom:

import scipy.stats as stats

pplot = sm.ProbPlot(res, stats.t, distargs=(4,))
'''
pplot.theoretical_quantiles
array([-1.98853847, -1.39569844, -1.05006478, -0.79602407, -0.58782183,
       -0.40547071, -0.23806378, -0.07853211,  0.07853211,  0.23806378,
        0.40547071,  0.58782183,  0.79602407,  1.05006478,  1.39569844,
        1.98853847])
'''
fig = pplot.qqplot()
plt.title("Ex. 2 - qqplot - residuals against quantiles of t-dist")

qqplot against the same t-distribution as above, but with loc=3 and scale=10:

pplot = sm.ProbPlot(res, stats.t, distargs=(4,), loc=3, scale=10)
'''
pplot.theoretical_quantiles
array([-16.88538475, -10.95698438,  -7.50064777,  -4.96024066,
        -2.87821833,  -1.05470709,   0.61936217,   2.21467888,
         3.78532112,   5.38063783,   7.05470709,   8.87821833,
        10.96024066,  13.50064777,  16.95698438,  22.88538475])
'''
fig = pplot.qqplot()
h = plt.title("Ex. 3 - qqplot - resids vs quantiles of t-dist")

Automatically determine parameters for t distribution including the loc and scale:

pplot = sm.ProbPlot(res, stats.t, fit=True)
fig = pplot.qqplot(line="45")
h = plt.title("Ex. 4 - qqplot - resids vs. quantiles of fitted t-dist")

'''
pplot.sample_quantiles
array([-1.79364223, -1.09036424, -0.90425585, -0.71746747, -0.68029695,
       -0.41116747, -0.37468074, -0.17080326, -0.0755211 , -0.05763963,
        0.06256732,  0.20244318,  1.16922377,  1.35455004,  1.49545161,
        1.99168324])

pplot.theoretical_quantiles
array([-1.56472676, -1.18683159, -0.92889958, -0.72152234, -0.54139512,
       -0.37739197, -0.22300784, -0.07379128,  0.07379128,  0.22300784,
        0.37739197,  0.54139512,  0.72152234,  0.92889958,  1.18683159,
        1.56472676])
'''
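What fit=True does to the sample quantiles can be sketched with SciPy alone. For simplicity this sketch fits a normal distribution rather than the t-distribution used in Ex. 4, and the data are randomly generated; it is an illustration of the description above, not the library's internal code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # illustrative data only
data = rng.normal(loc=5.0, scale=2.0, size=200)

# Fit the distribution; for scipy distributions loc and scale are the last two parameters
params = stats.norm.fit(data)
loc, scale = params[-2], params[-1]

# Standardize the sorted data with the fitted loc and scale
sample_q = (np.sort(data) - loc) / scale
print(sample_q.mean(), sample_q.std())
```

After standardizing, the sample quantiles are on the same scale as the theoretical quantiles, which is why a 45-degree reference line makes sense with fit=True.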

A second ProbPlot object can be used to compare two separate sample sets by using the other kwarg in the qqplot and ppplot methods.

import numpy as np
np.random.seed(1)
x = np.random.normal(loc=8.25, scale=2.75, size=37)
y = np.random.normal(loc=8.75, scale=3.25, size=37)

pp_x = sm.ProbPlot(x, fit=True)
pp_y = sm.ProbPlot(y, fit=True)
'''
Both use the sample quantiles
pp_x.sample_quantiles.min()
-2.254490936888067
pp_y.sample_quantiles.min()
-1.89620626452838
'''

fig = pp_x.qqplot(line='45', other=pp_y)
plt.title("Ex. 5 - qqplot - compare two sample sets")

In qqplot, the sample size of other can be equal to or larger than that of the first sample. If it is larger, other is reduced to the size of the first sample by interpolation.

x = np.random.normal(loc=8.25, scale=2.75, size=37)
y = np.random.normal(loc=8.75, scale=3.25, size=57)
pp_x = sm.ProbPlot(x, fit=True)
pp_y = sm.ProbPlot(y, fit=True)
fig = pp_x.qqplot(line="45", other=pp_y)
title = "Ex. 6 - qqplot - compare different sample sizes"
h = plt.title(title)
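The size reduction can be pictured as follows. This is only a sketch of the idea: it uses `np.quantile` with its default linear interpolation and the a=0 plotting positions, both of which are my assumptions; statsmodels may use a different quantile estimator internally.

```python
import numpy as np

rng = np.random.default_rng(1)   # illustrative data only
x = rng.normal(loc=8.25, scale=2.75, size=37)
y = rng.normal(loc=8.75, scale=3.25, size=57)

n = len(x)
# Probabilities at which the smaller sample's order statistics sit (a = 0 convention)
probs = np.arange(1, n + 1) / (n + 1)

x_q = np.sort(x)                 # 37 quantiles of the first sample
y_q = np.quantile(y, probs)      # 57 points reduced to 37 by interpolation

print(x_q.shape, y_q.shape)
```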

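In ppplot, the sample sizes of other and the first sample can differ. other is used to estimate an empirical cumulative distribution function (ECDF). ECDF(x) will be plotted against p(x)=0.5/n, 1.5/n, …, (n-0.5)/n where x are sorted samples from the first.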

x = np.random.normal(loc=8.25, scale=2.75, size=37)
y = np.random.normal(loc=8.75, scale=3.25, size=57)
pp_x = sm.ProbPlot(x, fit=True)
pp_y = sm.ProbPlot(y, fit=True)
pp_y.ppplot(line="45", other=pp_x)
plt.title("Ex. 7A- ppplot - compare two sample sets, other=pp_x")
pp_x.ppplot(line="45", other=pp_y)
plt.title("Ex. 7B- ppplot - compare two sample sets, other=pp_y")
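The quantities behind a two-sample P-P plot, as described in the text before this example, can be computed by hand. The `ecdf` helper below is hypothetical and written only for this sketch; the exact positions statsmodels uses may depend on the `a` parameter.

```python
import numpy as np

def ecdf(sample):
    """Return a function giving the fraction of sample values <= v (hypothetical helper)."""
    s = np.sort(sample)
    return lambda v: np.searchsorted(s, v, side="right") / len(s)

rng = np.random.default_rng(2)   # illustrative data only
x = rng.normal(loc=8.25, scale=2.75, size=37)   # the first sample
y = rng.normal(loc=8.75, scale=3.25, size=57)   # plays the role of other

n = len(x)
p = (np.arange(1, n + 1) - 0.5) / n   # 0.5/n, 1.5/n, ..., (n-0.5)/n
ecdf_y_at_x = ecdf(y)(np.sort(x))     # ECDF of other at the sorted first sample

# A P-P plot would scatter ecdf_y_at_x against p
print(p[:3], ecdf_y_at_x[:3])
```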

gofplots.qqline

Signature:

statsmodels.graphics.gofplots.qqline(ax, line, 
                                     x=None, y=None, 
                                     dist=None, fmt='r-', **lineoptions)

Plot a reference line for a qqplot.

Parameters:

  • ax : matplotlib axes instance, The axes on which to plot the line

  • line : str {“45”, “r”, “s”, “q”}, Options for the reference line to which the data is compared.

    • “45” - 45-degree line
    • “s”- standardized line, the expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them
    • “r” - A regression line is fit
    • “q” - A line is fit through the quartiles.
    • None - By default no reference line is added to the plot.
  • x : ndarray

    X data for plot. Not needed if line is “45”.

  • y : ndarray

    Y data for plot. Not needed if line is “45”.

  • dist : scipy.stats.distribution

    A scipy.stats distribution, needed if line is “q”.

  • fmt : str, optional

    Line format string passed to plot.

  • **lineoptions

    Additional arguments to be passed to the plot command.

There is no return value. The line is plotted on the given ax.
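The “s”, “r” and “q” options each amount to a slope and an intercept. The sketch below computes them by hand on idealized data under my reading of the descriptions above; details such as which standard deviation estimator or percentile method the library uses are assumptions here.

```python
import numpy as np
from scipy import stats

n = 50
x = stats.norm.ppf(np.arange(1, n + 1) / (n + 1))   # theoretical quantiles
y = 2.0 + 3.0 * x                                   # idealized sample quantiles

# "s": slope = standard deviation of the sample, intercept = its mean
slope_s, intercept_s = np.std(y), np.mean(y)

# "r": ordinary least-squares line of y on x
slope_r, intercept_r = np.polyfit(x, y, 1)

# "q": line through the first and third quartiles
y25, y75 = np.percentile(y, [25, 75])
x25, x75 = stats.norm.ppf([0.25, 0.75])
slope_q = (y75 - y25) / (x75 - x25)
intercept_q = y25 - slope_q * x25

print(slope_s, slope_r, slope_q)
```

On exactly linear data the regression line recovers the slope and intercept exactly, while the “s” and “q” lines only approximate them because they depend on sample estimates.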

Examples

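Import the food expenditure dataset. Plot annual food expenditure on the x-axis and household income on the y-axis. Use qqline to add a regression line to the plot.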

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqline

# load_pandas() is used because newer statsmodels releases may not accept load(as_pandas=False)
foodexp = sm.datasets.engel.load_pandas()
x = foodexp.exog
y = foodexp.endog

#x.shape  #(235, 1)
# y.shape #(235,)

ax = plt.subplot(111)
plt.scatter(x,y)
ax.set_xlabel(foodexp.exog_name[0])
ax.set_ylabel(foodexp.endog_name)
qqline(ax, 'r', x, y)
plt.show()

gofplots.qqplot

Similar to scipy.stats.probplot.

Signature:

statsmodels.graphics.gofplots.qqplot(data, 
                                     dist=<scipy.stats._continuous_distns.norm_gen object>, 
                                     distargs=(), 
                                     a=0, loc=0, scale=1, 
                                     fit=False, line=None, 
                                     ax=None, **plotkwargs)

Q-Q plot of the quantiles of x versus the quantiles/ppf of a distribution.

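Can take arguments specifying the parameters for dist or fit them automatically.

Parameters:

  • data : array_like, A 1d data array.

  • dist : callable , Comparison distribution. The default is scipy.stats.distributions.norm (a standard normal).

  • distargs : tuple, Arguments passed to dist to specify it fully so that dist.ppf can be called.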

  • a : float , Offset for the plotting position of an expected order statistic, for example. The plotting positions are given by (i - a)/(nobs - 2*a + 1) for i in range(0,nobs+1)

  • loc : float , Location parameter for dist

  • scale : float , Scale parameter for dist

  • fit : bool, If fit is False, loc, scale, and distargs are passed to the distribution. If fit is True, the parameters for dist are fit automatically using dist.fit. The quantiles are formed from the standardized data, after subtracting the fitted loc and dividing by the fitted scale.

  • line : {None, “45”, “s”, “r”, “q”},

    Options for the reference line to which the data is compared:

    • “45” - 45-degree line
    • “s” - standardized line, the expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them
    • “r” - A regression line is fit
    • “q” - A line is fit through the quartiles.
    • None - by default no reference line is added to the plot.
  • ax : AxesSubplot, optional

    If given, this subplot is used to plot in instead of a new figure being created.

  • **plotkwargs

    Additional matplotlib arguments to be passed to the plot command.

Returns:

Figure

If ax is None, the created figure. Otherwise the figure to which ax is connected.

Examples

import statsmodels.api as sm
import matplotlib.pyplot as plt 
from statsmodels.graphics.gofplots import qqplot

# load_pandas() is used because newer statsmodels releases may not accept load(as_pandas=False)
data = sm.datasets.longley.load_pandas()

exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, exog).fit()
res = mod_fit.resid.to_numpy()  # residuals as a NumPy array
'''
res
array([ 267.34002979,  -94.01394237,   46.28716779, -410.1146219 ,
        309.71459079, -249.3112153 , -164.04895636,  -13.18035684,
         14.30477263,  455.39409458,  -17.26892708,  -39.05504249,
       -155.54997356,  -85.67130801,  341.93151399, -206.75782516])
'''
fig = sm.qqplot(res)

qqplot of the residuals against quantiles of t-distribution with 4 degrees of freedom:

import scipy.stats as stats
fig = sm.qqplot(res, stats.t, distargs=(4,))

qqplot against the same t-distribution as above, but with loc=3 and scale=10:

fig = sm.qqplot(res, stats.t, distargs=(4,), loc=3, scale=10)

Automatically determine parameters for t distribution including the loc and scale:

fig = sm.qqplot(res, stats.t, fit=True, line="45")

Once the residuals are available, qqplot can also be imported and called directly:

from statsmodels.graphics.gofplots import qqplot

fig = qqplot(res, stats.t, fit=True,line='45')
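For comparison with the scipy.stats.probplot function mentioned at the start of this section, here is a short sketch on randomly generated data (my own illustrative sample). Without a plot argument, probplot only returns the numbers behind the plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)   # illustrative data only
sample = rng.normal(loc=1.0, scale=2.0, size=200)

# osm: theoretical quantiles, osr: ordered sample values
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# For normal data, slope and intercept roughly estimate scale and loc,
# and r is the correlation of the points with the fitted line
print(slope, intercept, r)
```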

gofplots.qqplot_2samples

Q-Q Plot of two samples’ quantiles

Can take either two ProbPlot instances or two array-like objects. In the case of the latter, both inputs will be converted to ProbPlot instances using only the default values, so use ProbPlot instances if finer-grained control of the quantile computations is desired.

Signature:

statsmodels.graphics.gofplots.qqplot_2samples(data1, data2, 
                                              xlabel=None, ylabel=None, 
                                              line=None, ax=None)

Parameters:

  • data1 : {array_like, ProbPlot} Data to plot along x axis.

  • data2 : {array_like, ProbPlot} Data to plot along y axis. Does not need to have the same number of observations as data 1.

  • xlabel : {None, str} User-provided labels for the x-axis. If None (default), other values are used.

  • ylabel : {None, str} User-provided labels for the y-axis. If None (default), other values are used.

  • line : {None, “45”, “s”, “r”, “q”}

    Options for the reference line to which the data is compared:

    • “45” - 45-degree line
    • “s” - standardized line, the expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them
    • “r” - A regression line is fit
    • “q” - A line is fit through the quartiles.
    • None - by default no reference line is added to the plot.
  • ax : AxesSubplot, optional

    If given, this subplot is used to plot in instead of a new figure being created.

Returns:

  • figure : If ax is None, the created figure. Otherwise the figure to which ax is connected.

Notes

  1. Depends on matplotlib.
  2. If data1 and data2 are not ProbPlot instances, instances will be created using the default parameters. Therefore, it is recommended to use ProbPlot instances if fine-grained control is needed in the computation of the quantiles.

Examples

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot_2samples

x = np.random.normal(loc=8.5, scale=2.5, size=37)
y = np.random.normal(loc=8.0, scale=3.0, size=37)

pp_x = sm.ProbPlot(x)
pp_y = sm.ProbPlot(y)
qqplot_2samples(pp_x, pp_y)
plt.show()
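The difference between passing raw arrays and passing ProbPlot instances can be shown side by side. This sketch uses randomly generated data and the Agg backend (my choices); the right-hand panel fits loc and scale first, which is the kind of finer-grained control mentioned in the notes above.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.gofplots import ProbPlot, qqplot_2samples

rng = np.random.default_rng(5)   # illustrative data only
x = rng.normal(loc=8.5, scale=2.5, size=37)
y = rng.normal(loc=8.0, scale=3.0, size=37)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Raw arrays: ProbPlot instances are created internally with default settings
qqplot_2samples(x, y, line="r", ax=axes[0])
# Explicit ProbPlot instances: loc and scale are fitted before comparing
qqplot_2samples(ProbPlot(x, fit=True), ProbPlot(y, fit=True), line="45", ax=axes[1])
axes[0].set_title("raw arrays")
axes[1].set_title("fitted ProbPlot instances")
```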
  • Copyright notice: Unless otherwise stated, all articles on this blog are copyrighted by the author. Please credit the source when reposting.
  • Copyrights © 2019-2021 HG