File talk:Motor vehicle deaths in the US histogram.svg


Note that this image is not what is called a histogram. A histogram is a graph of data values versus the frequencies of those values in the data set, or of the values after grouping them into bins of a certain size. For example, if the bin size were 1000, then the values 10390 and 10896 from the data set would fall in the same bin (let's assume that the label of the bin is 10000). Then in the histogram we would have a point at (10000, 2). The 2 is due to having two points in the original data set, (1918, 10390) and (1919, 10896), with values falling in the bin 10000 with the assumed width 1000. I could learn how to create an actual histogram in gnuplot, but probably the author or someone else can do it better and faster. Until then, this is just a warning so that the image doesn't get used for something that it is not currently showing. Cactus0192837465 (talk) 16:28, 27 December 2018 (UTC)
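
To make the binning idea concrete, here is a minimal sketch in plain Python. It takes four neighbouring values from the deaths data posted further down this page and counts how many fall in each bin. The bin width of 1000 and the choice to label each bin by its lower edge are assumptions made only for illustration.

```python
from collections import Counter

# Four neighbouring values from the deaths data posted further down this page.
values = [9630, 10390, 10896, 12155]

# Assumed bin width, chosen only for illustration.
bin_width = 1000

# Label each bin by its lower edge and count the values that fall in it.
counts = Counter((v // bin_width) * bin_width for v in values)

# Print one "bin label, frequency" pair per non-empty bin.
for lower_edge in sorted(counts):
    print(lower_edge, counts[lower_edge])
```

Under these assumptions, 10390 and 10896 land in the same bin, so that bin gets a frequency of 2 while its neighbours get 1.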

If it helps, here is the deaths data decomposed into bins. I used numpy.histogram.

   >>> import numpy
   >>> data = [36,54,79,117,172,252,338,581,751,1174,1599,2043,2968,4079,4468,6779,7766,9630,10390,10896,
            12155,13253,14859,17870,18400,20771,22194,22727,23165,24470,26557,26785,27007,27979,29592,29746,30246,
            30775,30895,31083,31193,31204,31874,31963,32479,32719,32744,32914,32999,33186,33782,33883,33890,34240,34494,35092,35309,
            35331,36088,36126,36190,36223,36285,36399,36688,36932,37423,37819,37965,38142,38980,39250,40150,40716,41259,41501,41508,
            41717,41723,41817,41945,42013,42065,42196,42589,42708,42836,42884,43005,43510,43825,43945,44257,44525,44599,45196,45523,
            45582,45645,46087,46390,47087,47089,47878,49301,50331,50724,50894,51091,51093,52542,52627,52725,53543,54052,54589]
   >>> numpy.histogram(data,bins=70,range=(0,70000))
   (array([9, 2, 2, 0, 2, 0, 1, 1, 0, 1, 2, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 2,
          1, 1, 0, 2, 2, 0, 2, 3, 5, 5, 4, 2, 3, 8, 3, 2, 1, 2, 7, 7, 4, 3, 4,
          2, 3, 0, 1, 3, 2, 3, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0]), array([     0.,   1000.,   2000.,   3000.,   4000.,   5000.,   6000.,
            7000.,   8000.,   9000.,  10000.,  11000.,  12000.,  13000.,
           14000.,  15000.,  16000.,  17000.,  18000.,  19000.,  20000.,
           21000.,  22000.,  23000.,  24000.,  25000.,  26000.,  27000.,
           28000.,  29000.,  30000.,  31000.,  32000.,  33000.,  34000.,
           35000.,  36000.,  37000.,  38000.,  39000.,  40000.,  41000.,
           42000.,  43000.,  44000.,  45000.,  46000.,  47000.,  48000.,
           49000.,  50000.,  51000.,  52000.,  53000.,  54000.,  55000.,
           56000.,  57000.,  58000.,  59000.,  60000.,  61000.,  62000.,
           63000.,  64000.,  65000.,  66000.,  67000.,  68000.,  69000.,
           70000.]))

Here 'data' is the deaths data. I told the command to use 70 bins (implicitly of equal size) over the range from 0 to 70000. The output consists of two arrays. The second holds the bin edges (71 of them for 70 bins). The first holds the number of data points falling in each bin. Drawn as a bar plot, this histogram would have a bar between 0 and 1000 of height 9, a bar between 1000 and 2000 of height 2, and so on. The number of bins, their sizes, the range, etc. can all be modified, so histograms are not unique. Maybe other settings yield a nicer-looking image for illustration. For example,

   array([ 16.,   6.,   3.,   8.,  20.,  21.,  29.,  13.]), array([  3.60000000e+01,   6.85512500e+03,   1.36742500e+04,
        2.04933750e+04,   2.73125000e+04,   3.41316250e+04,
        4.09507500e+04,   4.77698750e+04,   5.45890000e+04]

is another histogram of the deaths data. Cactus0192837465 (talk) 20:52, 27 December 2018 (UTC)
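
The settings behind the second output are not stated above. Judging by the printed edges (36 up to 54589 in eight equal steps), it looks like 8 equal-width bins spanning the minimum to the maximum of the data. The sketch below works under that assumption. It calls numpy.histogram with bins=8 and no explicit range, then lists each bar as (left edge, right edge, height) with a rough text rendering. The text bars are only a stand-in for a real plot.

```python
import numpy

data = [36,54,79,117,172,252,338,581,751,1174,1599,2043,2968,4079,4468,6779,7766,9630,10390,10896,
        12155,13253,14859,17870,18400,20771,22194,22727,23165,24470,26557,26785,27007,27979,29592,29746,30246,
        30775,30895,31083,31193,31204,31874,31963,32479,32719,32744,32914,32999,33186,33782,33883,33890,34240,34494,35092,35309,
        35331,36088,36126,36190,36223,36285,36399,36688,36932,37423,37819,37965,38142,38980,39250,40150,40716,41259,41501,41508,
        41717,41723,41817,41945,42013,42065,42196,42589,42708,42836,42884,43005,43510,43825,43945,44257,44525,44599,45196,45523,
        45582,45645,46087,46390,47087,47089,47878,49301,50331,50724,50894,51091,51093,52542,52627,52725,53543,54052,54589]

# Assumption: 8 equal-width bins. With no explicit range,
# numpy.histogram spans min(data) to max(data).
counts, edges = numpy.histogram(data, bins=8)

# Bar i runs from edges[i] to edges[i + 1] and has height counts[i].
bars = [(float(edges[i]), float(edges[i + 1]), int(counts[i]))
        for i in range(len(counts))]

# Rough text rendering: one '#' per data point in the bin.
for left, right, height in bars:
    print(f"{left:10.2f} to {right:10.2f}: {'#' * height} ({height})")
```

With this data the counts come out as 16, 6, 3, 8, 20, 21, 29, 13, the same as in the output quoted above, although as integers rather than floats. The same list of bars could be handed to any plotting tool to draw the actual image.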