Box plots in R are a good way to visualize how tightly or loosely your data is distributed. They are also known as box-and-whisker plots. Every data distribution has measures of central tendency: mean, median and mode.
Some distributions sit closely around the median and mean, while others spread across a wide range of values and contain a number of outliers. Box plots let you examine your data using a five-number summary: the minimum, the first quartile (Q1), the median, the third quartile (Q3) and the maximum.
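In a box plot, the minimum and maximum are the ends of the whiskers. By default, R's boxplot() extends each whisker to the most extreme data point that lies within 1.5 times the interquartile range of the box. Any data point beyond the whiskers is treated as an outlier and drawn separately. Thus the box plot gives you a compact overview of the data distribution.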
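As a minimal sketch of the five-number summary and the outlier rule, the snippet below uses base R's fivenum() and boxplot.stats() on a small made-up vector. The vector x and the variable names are illustrative assumptions, not data from this tutorial.

```r
# Hypothetical sample: seven moderate values and one extreme value.
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
five <- fivenum(x)
print(five)      # 2.0 4.5 6.5 8.5 30.0

# boxplot.stats() returns the numbers a box plot is drawn from.
# Its default coef = 1.5 is the whisker rule that boxplot() also uses.
s <- boxplot.stats(x)
print(s$stats)   # whisker ends and box: 2.0 4.5 6.5 8.5 9.0
print(s$out)     # points beyond the whiskers: 30
```

Here the upper whisker stops at 9 because 30 lies more than 1.5 times the box height above the upper hinge, so 30 is reported as an outlier rather than as the maximum.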
Box plots can be created using the boxplot() function in R. Let us try creating our first box plot using R's built-in airquality dataset.
This is a data frame with 6 columns and 153 rows, recording daily measurements such as wind speed, temperature and ozone level. Let us try making a box plot for the wind speed column of the dataset.
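Before plotting, it can help to confirm the shape of the data. The following is a small sketch using only base R functions; airquality ships with R's datasets package, so nothing needs to be installed or loaded.

```r
# Dimensions of the built-in data frame: rows, then columns.
dims <- dim(airquality)
print(dims)

# Column names and types.
str(airquality)

# Numeric summary of the column we are about to plot.
print(summary(airquality$Wind))

# Count missing values in Wind; boxplot() silently drops NAs.
wind_missing <- sum(is.na(airquality$Wind))
print(wind_missing)
```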
boxplot(airquality$Wind)
Interpretations:
Let us try plotting a box plot for another variable in the dataset.
boxplot(airquality$Ozone)
It can be observed that this variable has two outliers above the upper whisker, and that the data is spread more widely above the median than below it.
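The outliers seen in the plot can also be read off numerically. This is a sketch using boxplot.stats(), which applies the same default whisker rule as boxplot(); the variable names are illustrative. Note that the Ozone column contains missing values, which are dropped before the statistics are computed.

```r
ozone <- airquality$Ozone

# Ozone has missing readings; they are ignored by boxplot() and boxplot.stats().
n_missing <- sum(is.na(ozone))
print(n_missing)

ozone_stats <- boxplot.stats(ozone)

# The fifth element of $stats is the end of the upper whisker.
upper_whisker <- ozone_stats$stats[5]
print(upper_whisker)

# Values drawn as individual points beyond the whiskers.
ozone_outliers <- ozone_stats$out
print(ozone_outliers)
```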
R also makes it possible to compare the distribution of two variables using multiple box plots.
boxplot(airquality$Ozone, airquality$Temp, names = c('Ozone', 'Temperature'), col = c('red', 'orange'))
The command uses two different colors to distinguish the variables. The labels for the individual plots are supplied through the names argument of the function.
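An alternative sketch, assuming you are happy to use the column names as labels: boxplot() also accepts a data frame and draws one box per column. The name two_vars is illustrative.

```r
# Select the two columns to compare; the column names become the box labels.
two_vars <- airquality[, c("Ozone", "Temp")]

# boxplot() invisibly returns the statistics it plotted.
b <- boxplot(two_vars, col = c("red", "orange"))

print(b$names)  # labels used on the axis
print(b$n)      # number of non-missing observations per box
```

Keep in mind that Ozone and Temp are measured in different units, so the shared axis is useful for a quick look at spread and outliers rather than for comparing magnitudes.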
It is also possible to break a variable down by any categorical variable in the dataset. For example, to look at the distribution of temperature for each individual month, we only need to pass the two variables as a formula, Temp ~ Month, and set the data argument to the name of the data frame.
Temp ~ Month means that we wish to see Temp grouped by Month, so one box is drawn per month. Let us now execute the command and try building a horizontal plot instead of a vertical one.
boxplot(Temp ~ Month, data=airquality, horizontal= TRUE, col=c('red','green'))
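To check the grouped plot against numbers, the sketch below computes the per-month median and group size with base R's tapply() and table(). The variable names are illustrative.

```r
# Median temperature for each month; these are the center lines of the boxes.
month_medians <- tapply(airquality$Temp, airquality$Month, median)
print(month_medians)

# Number of daily observations behind each box.
month_counts <- table(airquality$Month)
print(month_counts)
```

The dataset covers five months (5 to 9), so the plot has five boxes. When col holds fewer colors than there are boxes, as with the two colors above, R recycles them in order.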
A variation of the box plot is sometimes seen with notches added. A notch is a narrowing of the box around the median line.
The notch marks an approximate confidence interval around the median. If the notches of two boxes do not overlap, the medians of the two distributions are likely to be different; if they overlap, the plot gives no such evidence. Notches can be added by setting the notch parameter to TRUE.
Let us make a notched variant of the per-month plot above.
boxplot(Temp ~ Month, data = airquality, horizontal = TRUE, notch = TRUE, col = c('red', 'green', 'orange', 'blue', 'purple'))
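The notch limits can also be inspected as numbers. The sketch below relies on the conf component returned by boxplot() and boxplot.stats(), which R documents as median ± 1.58 × IQR / sqrt(n). The variable names and the May/July comparison are illustrative choices.

```r
# Compute the statistics without drawing; $conf holds the notch limits.
b <- boxplot(Temp ~ Month, data = airquality, notch = TRUE, plot = FALSE)
notches <- b$conf                 # row 1 = lower limit, row 2 = upper limit
colnames(notches) <- b$names      # label columns by month
print(notches)

# Reproduce the notch for May by hand from the documented formula.
may <- airquality$Temp[airquality$Month == 5]
s <- boxplot.stats(may)
manual <- s$stats[3] + c(-1.58, 1.58) * diff(s$stats[c(2, 4)]) / sqrt(s$n)
print(manual)

# Two notches overlap when each lower limit is below the other's upper limit.
may_july_overlap <- notches[1, "5"] <= notches[2, "7"] &&
  notches[1, "7"] <= notches[2, "5"]
print(may_july_overlap)
```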