comparison env/lib/python3.7/site-packages/boltons/statsutils.py @ 5:9b1c78e6ba9c draft default tip

"planemo upload commit 6c0a8142489327ece472c84e558c47da711a9142"
author shellac
date Mon, 01 Jun 2020 08:59:25 -0400
parents 79f47841a781
children
4:79f47841a781 5:9b1c78e6ba9c
1 # -*- coding: utf-8 -*-
2 """``statsutils`` provides tools aimed primarily at descriptive
3 statistics for data analysis, such as :func:`mean` (average),
4 :func:`median`, :func:`variance`, and many others.
5
6 The :class:`Stats` type provides all the main functionality of the
7 ``statsutils`` module. A :class:`Stats` object wraps a given dataset,
8 providing all statistical measures as property attributes. These
9 attributes cache their results, which allows efficient computation of
10 multiple measures, as many measures rely on other measures. For
11 example, relative standard deviation (:attr:`Stats.rel_std_dev`)
12 relies on both the mean and standard deviation. The Stats object
13 caches those results so no rework is done.
14
15 The :class:`Stats` type's attributes have module-level counterparts for
16 convenience when the computation reuse advantages do not apply.
17
18 >>> stats = Stats(range(42))
19 >>> stats.mean
20 20.5
21 >>> mean(range(42))
22 20.5
23
24 Statistics is a large field, and ``statsutils`` is focused on a few
25 basic techniques that are useful in software. The following is a brief
26 introduction to those techniques. For a more in-depth introduction, see
27 `Statistics for Software
28 <https://www.paypal-engineering.com/2016/04/11/statistics-for-software/>`_,
29 an article I wrote on the topic. It introduces key terminology vital
30 to effective usage of statistics.
31
32 Statistical moments
33 -------------------
34
35 Python programmers are probably familiar with the concept of the
36 *mean* or *average*, which gives a rough quantitative middle value by
37 which a sample can be generalized. However, the mean is just
38 the first of four `moment`_-based measures by which a sample or
39 distribution can be measured.
40
41 The four `Standardized moments`_ are:
42
43 1. `Mean`_ - :func:`mean` - theoretical middle value
44 2. `Variance`_ - :func:`variance` - width of value dispersion
45 3. `Skewness`_ - :func:`skewness` - symmetry of distribution
46 4. `Kurtosis`_ - :func:`kurtosis` - "peakiness" or "long-tailed"-ness
47
48 For more information check out `the Moment article on Wikipedia`_.
49
50 .. _moment: https://en.wikipedia.org/wiki/Moment_(mathematics)
51 .. _Standardized moments: https://en.wikipedia.org/wiki/Standardized_moment
52 .. _Mean: https://en.wikipedia.org/wiki/Mean
53 .. _Variance: https://en.wikipedia.org/wiki/Variance
54 .. _Skewness: https://en.wikipedia.org/wiki/Skewness
55 .. _Kurtosis: https://en.wikipedia.org/wiki/Kurtosis
56 .. _the Moment article on Wikipedia: https://en.wikipedia.org/wiki/Moment_(mathematics)
57
58 Keep in mind that while these moments can give a bit more insight into
59 the shape and distribution of data, they do not guarantee a complete
60 picture. Wildly different datasets can have the same values for all
61 four moments, so generalize wisely.
62
63 Robust statistics
64 -----------------
65
66 Moment-based statistics are notorious for being easily skewed by
67 outliers. The whole field of robust statistics aims to mitigate this
68 dilemma. ``statsutils`` also includes several robust statistical methods:
69
70 * `Median`_ - The middle value of a sorted dataset
71 * `Trimean`_ - Another robust measure of the data's central tendency
72 * `Median Absolute Deviation`_ (MAD) - A robust measure of
73 variability, a natural counterpart to :func:`variance`.
74 * `Trimming`_ - Reducing a dataset to only the middle majority of
75 data is a simple way of making other estimators more robust.
76
77 .. _Median: https://en.wikipedia.org/wiki/Median
78 .. _Trimean: https://en.wikipedia.org/wiki/Trimean
79 .. _Median Absolute Deviation: https://en.wikipedia.org/wiki/Median_absolute_deviation
80 .. _Trimming: https://en.wikipedia.org/wiki/Trimmed_estimator
81
82
83 Online and Offline Statistics
84 -----------------------------
85
86 Unrelated to computer networking, `online`_ statistics involve
87 calculating statistics in a `streaming`_ fashion, without all the data
88 being available. The :class:`Stats` type is meant for the more
89 traditional offline statistics when all the data is available. For
90 pure-Python online statistics accumulators, look at the `Lithoxyl`_
91 system instrumentation package.
92
93 .. _Online: https://en.wikipedia.org/wiki/Online_algorithm
94 .. _streaming: https://en.wikipedia.org/wiki/Streaming_algorithm
95 .. _Lithoxyl: https://github.com/mahmoud/lithoxyl
96
97 """
98
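The four moment-based measures described in the module docstring can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `moments` is hypothetical, the input must be non-empty with a nonzero standard deviation, and the formulas mirror the ones used later in this file (variance divides by n, while skewness and kurtosis divide by n - 1).

```python
def moments(data):
    """Return (mean, variance, skewness, kurtosis) for a non-empty dataset.

    Hypothetical helper for illustration; assumes std_dev > 0.
    """
    data = list(data)
    n = len(data)
    mean = sum(data, 0.0) / n
    # variance: average squared distance from the mean
    variance = sum((v - mean) ** 2 for v in data) / n
    std_dev = variance ** 0.5
    # third and fourth moments, scaled the same way as in this module
    skewness = sum((v - mean) ** 3 for v in data) / ((n - 1) * std_dev ** 3)
    kurtosis = sum((v - mean) ** 4 for v in data) / ((n - 1) * std_dev ** 4)
    return mean, variance, skewness, kurtosis


mean_97, variance_97, skewness_97, kurtosis_97 = moments(range(97))
kurtosis_9 = moments(range(9))[3]
```

For `range(97)` the mean and variance agree with the doctests in this file (48.0 and 784.0), and the skewness is zero because the data is symmetric around the mean.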
99 from __future__ import print_function
100
101 import bisect
102 from math import floor, ceil
103
104
105 class _StatsProperty(object):
106 def __init__(self, name, func):
107 self.name = name
108 self.func = func
109 self.internal_name = '_' + name
110
111 doc = func.__doc__ or ''
112 pre_doctest_doc, _, _ = doc.partition('>>>')
113 self.__doc__ = pre_doctest_doc
114
115 def __get__(self, obj, objtype=None):
116 if obj is None:
117 return self
118 if not obj.data:
119 return obj.default
120 try:
121 return getattr(obj, self.internal_name)
122 except AttributeError:
123 setattr(obj, self.internal_name, self.func(obj))
124 return getattr(obj, self.internal_name)
125
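The descriptor above computes a measure once and then stores it on the instance under an underscore-prefixed name. A minimal standalone sketch of the same caching pattern follows; `CachedMeasure` and `Sample` are hypothetical names used only for this illustration, and the sketch omits the empty-data check that `_StatsProperty` performs.

```python
class CachedMeasure(object):
    """Hypothetical stand-in for _StatsProperty: compute once, then reuse."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.internal_name = '_' + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class, not an instance
        try:
            return getattr(obj, self.internal_name)
        except AttributeError:
            # first access: compute the value and cache it on the instance
            value = self.func(obj)
            setattr(obj, self.internal_name, value)
            return value


class Sample(object):
    """Hypothetical container used only to exercise the descriptor."""

    def __init__(self, data):
        self.data = list(data)
        self.calls = 0  # how many times the mean was actually computed

    def _calc_mean(self):
        self.calls += 1
        return sum(self.data, 0.0) / len(self.data)

    mean = CachedMeasure('mean', _calc_mean)


sample = Sample([1, 2, 3, 4])
first = sample.mean
second = sample.mean  # served from the cached `_mean` attribute
```

Deleting the cached attribute (here `sample._mean`) forces a recomputation on the next access, which is the idea behind `Stats.clear_cache`.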
126
127 class Stats(object):
128 """The ``Stats`` type is used to represent a group of unordered
129 statistical datapoints for calculations such as mean, median, and
130 variance.
131
132 Args:
133
134 data (list): List or other iterable containing numeric values.
135 default (float): A value to be returned when a given
136 statistical measure is not defined. 0.0 by default, but
137 ``float('nan')`` is appropriate for stricter applications.
138 use_copy (bool): By default Stats objects copy the initial
139 data into a new list to avoid issues with
140 modifications. Pass ``False`` to disable this behavior.
141 is_sorted (bool): Presorted data can skip an extra sorting
142 step for a little speed boost. Defaults to False.
143
144 """
145 def __init__(self, data, default=0.0, use_copy=True, is_sorted=False):
146 self._use_copy = use_copy
147 self._is_sorted = is_sorted
148 if use_copy:
149 self.data = list(data)
150 else:
151 self.data = data
152
153 self.default = default
154 cls = self.__class__
155 self._prop_attr_names = [a for a in dir(self)
156 if isinstance(getattr(cls, a, None),
157 _StatsProperty)]
158 self._pearson_precision = 0
159
160 def __len__(self):
161 return len(self.data)
162
163 def __iter__(self):
164 return iter(self.data)
165
166 def _get_sorted_data(self):
167 """When using a copy of the data, it's better to have that copy be
168 sorted, but we do it lazily using this method, in case no
169 sorted measures are used. I.e., if median is never called,
170 sorting would be a waste.
171
172 When not using a copy, it's presumed that all optimizations
173 are on the user.
174 """
175 if not self._use_copy:
176 return sorted(self.data)
177 elif not self._is_sorted:
178 self.data.sort()
179 return self.data
180
181 def clear_cache(self):
182 """``Stats`` objects automatically cache intermediary calculations
183 that can be reused. For instance, accessing the ``std_dev``
184 attribute after the ``variance`` attribute will be
185 significantly faster for medium-to-large datasets.
186
187 If you modify the object by adding additional data points,
188 call this function to have the cached statistics recomputed.
189
190 """
191 for attr_name in self._prop_attr_names:
192 attr_name = getattr(self.__class__, attr_name).internal_name
193 if not hasattr(self, attr_name):
194 continue
195 delattr(self, attr_name)
196 return
197
198 def _calc_count(self):
199 """The number of items in this Stats object. Returns the same as
200 :func:`len` on a Stats object, but provided for pandas terminology
201 parallelism.
202
203 >>> Stats(range(20)).count
204 20
205 """
206 return len(self.data)
207 count = _StatsProperty('count', _calc_count)
208
209 def _calc_mean(self):
210 """
211 The arithmetic mean, or "average". Sum of the values divided by
212 the number of values.
213
214 >>> mean(range(20))
215 9.5
216 >>> mean(list(range(19)) + [949]) # 949 is an arbitrary outlier
217 56.0
218 """
219 return sum(self.data, 0.0) / len(self.data)
220 mean = _StatsProperty('mean', _calc_mean)
221
222 def _calc_max(self):
223 """
224 The maximum value present in the data.
225
226 >>> Stats([2, 1, 3]).max
227 3
228 """
229 if self._is_sorted:
230 return self.data[-1]
231 return max(self.data)
232 max = _StatsProperty('max', _calc_max)
233
234 def _calc_min(self):
235 """
236 The minimum value present in the data.
237
238 >>> Stats([2, 1, 3]).min
239 1
240 """
241 if self._is_sorted:
242 return self.data[0]
243 return min(self.data)
244 min = _StatsProperty('min', _calc_min)
245
246 def _calc_median(self):
247 """
248 The median is either the middle value or the average of the two
249 middle values of a sample. Compared to the mean, it's generally
250 more resilient to the presence of outliers in the sample.
251
252 >>> median([2, 1, 3])
253 2
254 >>> median(range(97))
255 48
256 >>> median(list(range(96)) + [1066]) # 1066 is an arbitrary outlier
257 48
258 """
259 return self._get_quantile(self._get_sorted_data(), 0.5)
260 median = _StatsProperty('median', _calc_median)
261
262 def _calc_iqr(self):
263 """Inter-quartile range (IQR) is the difference between the 75th
264 percentile and 25th percentile. IQR is a robust measure of
265 dispersion, like standard deviation, but safer to compare
266 between datasets, as it is less influenced by outliers.
267
268 >>> iqr([1, 2, 3, 4, 5])
269 2
270 >>> iqr(range(1001))
271 500
272 """
273 return self.get_quantile(0.75) - self.get_quantile(0.25)
274 iqr = _StatsProperty('iqr', _calc_iqr)
275
276 def _calc_trimean(self):
277 """The trimean is a robust measure of central tendency, like the
278 median, that takes the weighted average of the median and the
279 upper and lower quartiles.
280
281 >>> trimean([2, 1, 3])
282 2.0
283 >>> trimean(range(97))
284 48.0
285 >>> trimean(list(range(96)) + [1066]) # 1066 is an arbitrary outlier
286 48.0
287
288 """
289 sorted_data = self._get_sorted_data()
290 gq = lambda q: self._get_quantile(sorted_data, q)
291 return (gq(0.25) + (2 * gq(0.5)) + gq(0.75)) / 4.0
292 trimean = _StatsProperty('trimean', _calc_trimean)
293
294 def _calc_variance(self):
295 """\
296 Variance is the average of the squares of the difference between
297 each value and the mean.
298
299 >>> variance(range(97))
300 784.0
301 """
302 global mean # defined elsewhere in this file
303 return mean(self._get_pow_diffs(2))
304 variance = _StatsProperty('variance', _calc_variance)
305
306 def _calc_std_dev(self):
307 """\
308 Standard deviation. Square root of the variance.
309
310 >>> std_dev(range(97))
311 28.0
312 """
313 return self.variance ** 0.5
314 std_dev = _StatsProperty('std_dev', _calc_std_dev)
315
316 def _calc_median_abs_dev(self):
317 """\
318 Median Absolute Deviation is a robust measure of statistical
319 dispersion: http://en.wikipedia.org/wiki/Median_absolute_deviation
320
321 >>> median_abs_dev(range(97))
322 24.0
323 """
324 global median # defined elsewhere in this file
325 sorted_vals = sorted(self.data)
326 x = float(median(sorted_vals))
327 return median([abs(x - v) for v in sorted_vals])
328 median_abs_dev = _StatsProperty('median_abs_dev', _calc_median_abs_dev)
329 mad = median_abs_dev # convenience
330
331 def _calc_rel_std_dev(self):
332 """\
333 Standard deviation divided by the absolute value of the average.
334
335 http://en.wikipedia.org/wiki/Relative_standard_deviation
336
337 >>> print('%1.3f' % rel_std_dev(range(97)))
338 0.583
339 """
340 abs_mean = abs(self.mean)
341 if abs_mean:
342 return self.std_dev / abs_mean
343 else:
344 return self.default
345 rel_std_dev = _StatsProperty('rel_std_dev', _calc_rel_std_dev)
346
347 def _calc_skewness(self):
348 """\
349 Indicates the asymmetry of a curve. Positive values mean the bulk
350 of the values are on the left side of the average and vice versa.
351
352 http://en.wikipedia.org/wiki/Skewness
353
354 See the module docstring for more about statistical moments.
355
356 >>> skewness(range(97)) # symmetrical around 48.0
357 0.0
358 >>> left_skewed = skewness(list(range(97)) + list(range(10)))
359 >>> right_skewed = skewness(list(range(97)) + list(range(87, 97)))
360 >>> round(left_skewed, 3), round(right_skewed, 3)
361 (0.114, -0.114)
362 """
363 data, s_dev = self.data, self.std_dev
364 if len(data) > 1 and s_dev > 0:
365 return (sum(self._get_pow_diffs(3)) /
366 float((len(data) - 1) * (s_dev ** 3)))
367 else:
368 return self.default
369 skewness = _StatsProperty('skewness', _calc_skewness)
370
371 def _calc_kurtosis(self):
372 """\
373 Indicates how much data is in the tails of the distribution. The
374 result is always positive, with the normal "bell-curve"
375 distribution having a kurtosis of 3.
376
377 http://en.wikipedia.org/wiki/Kurtosis
378
379 See the module docstring for more about statistical moments.
380
381 >>> kurtosis(range(9))
382 1.99125
383
384 With a kurtosis of 1.99125, [0, 1, 2, 3, 4, 5, 6, 7, 8] is more
385 centrally distributed than the normal curve.
386 """
387 data, s_dev = self.data, self.std_dev
388 if len(data) > 1 and s_dev > 0:
389 return (sum(self._get_pow_diffs(4)) /
390 float((len(data) - 1) * (s_dev ** 4)))
391 else:
392 return 0.0
393 kurtosis = _StatsProperty('kurtosis', _calc_kurtosis)
394
395 def _calc_pearson_type(self):
396 precision = self._pearson_precision
397 skewness = self.skewness
398 kurtosis = self.kurtosis
399 beta1 = skewness ** 2.0
400 beta2 = kurtosis * 1.0
401
402 # TODO: range checks?
403
404 c0 = (4 * beta2) - (3 * beta1)
405 c1 = skewness * (beta2 + 3)
406 c2 = (2 * beta2) - (3 * beta1) - 6
407
408 if round(c1, precision) == 0:
409 if round(beta2, precision) == 3:
410 return 0 # Normal
411 else:
412 if beta2 < 3:
413 return 2 # Symmetric Beta
414 elif beta2 > 3:
415 return 7
416 elif round(c2, precision) == 0:
417 return 3 # Gamma
418 else:
419 k = c1 ** 2 / (4 * c0 * c2)
420 if k < 0:
421 return 1 # Beta
422 raise RuntimeError('missed a spot')
423 pearson_type = _StatsProperty('pearson_type', _calc_pearson_type)
424
425 @staticmethod
426 def _get_quantile(sorted_data, q):
427 data, n = sorted_data, len(sorted_data)
428 idx = q / 1.0 * (n - 1)
429 idx_f, idx_c = int(floor(idx)), int(ceil(idx))
430 if idx_f == idx_c:
431 return data[idx_f]
432 return (data[idx_f] * (idx_c - idx)) + (data[idx_c] * (idx - idx_f))
433
434 def get_quantile(self, q):
435 """Get a quantile from the dataset. Quantiles are floating point
436 values between ``0.0`` and ``1.0``, with ``0.0`` representing
437 the minimum value in the dataset and ``1.0`` representing the
438 maximum. ``0.5`` represents the median:
439
440 >>> Stats(range(100)).get_quantile(0.5)
441 49.5
442 """
443 q = float(q)
444 if not 0.0 <= q <= 1.0:
445 raise ValueError('expected q between 0.0 and 1.0, not %r' % q)
446 elif not self.data:
447 return self.default
448 return self._get_quantile(self._get_sorted_data(), q)
449
450 def get_zscore(self, value):
451 """Get the z-score for *value* in the group. If the standard deviation
452 is 0, then 0, inf, or -inf will be returned to indicate whether the value is
453 equal to, greater than or below the group's mean.
454 """
455 mean = self.mean
456 if self.std_dev == 0:
457 if value == mean:
458 return 0
459 if value > mean:
460 return float('inf')
461 if value < mean:
462 return float('-inf')
463 return (float(value) - mean) / self.std_dev
464
465 def trim_relative(self, amount=0.15):
466 """A utility function used to cut a proportion of values off each end
467 of a list of values. This limits the influence of
468 outliers.
469
470 Args:
471 amount (float): A value between 0.0 and 0.5 to trim off of
472 each side of the data.
473
474 .. note::
475
476 This operation modifies the data in-place. It does not
477 make or return a copy.
478
479 """
480 trim = float(amount)
481 if not 0.0 <= trim < 0.5:
482 raise ValueError('expected amount between 0.0 and 0.5, not %r'
483 % trim)
484 size = len(self.data)
485 size_diff = int(size * trim)
486 if size_diff == 0.0:
487 return
488 self.data = self._get_sorted_data()[size_diff:-size_diff]
489 self.clear_cache()
490
491 def _get_pow_diffs(self, power):
492 """
493 A utility function used for calculating statistical moments.
494 """
495 m = self.mean
496 return [(v - m) ** power for v in self.data]
497
498 def _get_bin_bounds(self, count=None, with_max=False):
499 if not self.data:
500 return [0.0] # TODO: raise?
501
502 data = self.data
503 len_data, min_data, max_data = len(data), min(data), max(data)
504
505 if len_data < 4:
506 if not count:
507 count = len_data
508 dx = (max_data - min_data) / float(count)
509 bins = [min_data + (dx * i) for i in range(count)]
510 elif count is None:
511 # freedman algorithm for fixed-width bin selection
512 q25, q75 = self.get_quantile(0.25), self.get_quantile(0.75)
513 dx = 2 * (q75 - q25) / (len_data ** (1 / 3.0))
514 bin_count = max(1, int(ceil((max_data - min_data) / dx)))
515 bins = [min_data + (dx * i) for i in range(bin_count + 1)]
516 bins = [b for b in bins if b < max_data]
517 else:
518 dx = (max_data - min_data) / float(count)
519 bins = [min_data + (dx * i) for i in range(count)]
520
521 if with_max:
522 bins.append(float(max_data))
523
524 return bins
525
526 def get_histogram_counts(self, bins=None, **kw):
527 """Produces a list of ``(bin, count)`` pairs comprising a histogram of
528 the Stats object's data, using fixed-width bins. See
529 :meth:`Stats.format_histogram` for more details.
530
531 Args:
532 bins (int): maximum number of bins, or list of
533 floating-point bin boundaries. Defaults to the output of
534 Freedman's algorithm.
535 bin_digits (int): Number of digits used to round down the
536 bin boundaries. Defaults to 1.
537
538 The output of this method can be stored and/or modified, and
539 then passed to :func:`statsutils.format_histogram_counts` to
540 achieve the same text formatting as the
541 :meth:`~Stats.format_histogram` method. This can be useful for
542 snapshotting over time.
543 """
544 bin_digits = int(kw.pop('bin_digits', 1))
545 if kw:
546 raise TypeError('unexpected keyword arguments: %r' % kw.keys())
547
548 if not bins:
549 bins = self._get_bin_bounds()
550 else:
551 try:
552 bin_count = int(bins)
553 except TypeError:
554 try:
555 bins = [float(x) for x in bins]
556 except Exception:
557 raise ValueError('bins expected integer bin count or list'
558 ' of float bin boundaries, not %r' % bins)
559 if self.min < bins[0]:
560 bins = [self.min] + bins
561 else:
562 bins = self._get_bin_bounds(bin_count)
563
564 # floor and ceil really should have taken ndigits, like round()
565 round_factor = 10.0 ** bin_digits
566 bins = [floor(b * round_factor) / round_factor for b in bins]
567 bins = sorted(set(bins))
568
569 idxs = [bisect.bisect(bins, d) - 1 for d in self.data]
570 count_map = {} # would have used Counter, but py26 support
571 for idx in idxs:
572 try:
573 count_map[idx] += 1
574 except KeyError:
575 count_map[idx] = 1
576
577 bin_counts = [(b, count_map.get(i, 0)) for i, b in enumerate(bins)]
578
579 return bin_counts
580
581 def format_histogram(self, bins=None, **kw):
582 """Produces a textual histogram of the data, using fixed-width bins,
583 allowing for simple visualization, even in console environments.
584
585 >>> data = list(range(20)) + list(range(5, 15)) + [10]
586 >>> print(Stats(data).format_histogram(width=30))
587 0.0: 5 #########
588 4.4: 8 ###############
589 8.9: 11 ####################
590 13.3: 5 #########
591 17.8: 2 ####
592
593 In this histogram, five values are between 0.0 and 4.4, eight
594 are between 4.4 and 8.9, and two values lie between 17.8 and
595 the max.
596
597 You can specify the number of bins, or provide a list of
598 bin boundaries themselves. If no bins are provided, as in the
599 example above, `Freedman's algorithm`_ for bin selection is
600 used.
601
602 Args:
603 bins (int): Maximum number of bins for the
604 histogram. Also accepts a list of floating-point
605 bin boundaries. If the minimum boundary is still
606 greater than the minimum value in the data, that
607 boundary will be implicitly added. Defaults to the bin
608 boundaries returned by `Freedman's algorithm`_.
609 bin_digits (int): Number of digits to round each bin
610 to. Note that bins are always rounded down to avoid
611 clipping any data. Defaults to 1.
612 width (int): integer number of columns in the longest line
613 in the histogram. Defaults to console width on Python
614 3.3+, or 80 if that is not available.
615 format_bin (callable): Called on each bin to create a
616 label for the final output. Use this function to add
617 units, such as "ms" for milliseconds.
618
619 Should you want something more programmatically reusable, see
620 the :meth:`~Stats.get_histogram_counts` method, the output of
621 which is used by format_histogram. The :meth:`~Stats.describe`
622 method is another useful summarization method, albeit less
623 visual.
624
625 .. _Freedman's algorithm: https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule
626 """
627 width = kw.pop('width', None)
628 format_bin = kw.pop('format_bin', None)
629 bin_counts = self.get_histogram_counts(bins=bins, **kw)
630 return format_histogram_counts(bin_counts,
631 width=width,
632 format_bin=format_bin)
633
634 def describe(self, quantiles=None, format=None):
635 """Provides standard summary statistics for the data in the Stats
636 object, in one of several convenient formats.
637
638 Args:
639 quantiles (list): A list of numeric values to use as
640 quantiles in the resulting summary. All values must be
641 0.0-1.0, with 0.5 representing the median. Defaults to
642 ``[0.25, 0.5, 0.75]``, representing the standard
643 quartiles.
644 format (str): Controls the return type of the function,
645 with one of three valid values: ``"dict"`` gives back
646 a :class:`dict` with the appropriate keys and
647 values. ``"list"`` is a list of key-value pairs in an
648 order suitable to pass to an OrderedDict or HTML
649 table. ``"text"`` converts the values to text suitable
650 for printing, as seen below.
651
652 Here is the information returned by a default ``describe``, as
653 presented in the ``"text"`` format:
654
655 >>> stats = Stats(range(1, 8))
656 >>> print(stats.describe(format='text'))
657 count: 7
658 mean: 4.0
659 std_dev: 2.0
660 mad: 2.0
661 min: 1
662 0.25: 2.5
663 0.5: 4
664 0.75: 5.5
665 max: 7
666
667 For more advanced descriptive statistics, check out my blog
668 post on the topic `Statistics for Software
669 <https://www.paypal-engineering.com/2016/04/11/statistics-for-software/>`_.
670
671 """
672 if format is None:
673 format = 'dict'
674 elif format not in ('dict', 'list', 'text'):
675 raise ValueError('invalid format for describe,'
676 ' expected one of "dict"/"list"/"text", not %r'
677 % format)
678 quantiles = quantiles or [0.25, 0.5, 0.75]
679 q_items = []
680 for q in quantiles:
681 q_val = self.get_quantile(q)
682 q_items.append((str(q), q_val))
683
684 items = [('count', self.count),
685 ('mean', self.mean),
686 ('std_dev', self.std_dev),
687 ('mad', self.mad),
688 ('min', self.min)]
689
690 items.extend(q_items)
691 items.append(('max', self.max))
692 if format == 'dict':
693 ret = dict(items)
694 elif format == 'list':
695 ret = items
696 elif format == 'text':
697 ret = '\n'.join(['%s%s' % ((label + ':').ljust(10), val)
698 for label, val in items])
699 return ret
700
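Two calculations inside the `Stats` class may be easier to follow in isolation: the linear interpolation in `_get_quantile` and the Freedman-Diaconis bin width in `_get_bin_bounds`. The sketch below restates both as standalone functions. The names `quantile` and `freedman_diaconis_width` are illustrative, not part of the module, and both assume the input is already sorted and non-empty.

```python
from math import floor, ceil


def quantile(sorted_data, q):
    """Linear interpolation between the two closest ranks; q is in [0.0, 1.0]."""
    n = len(sorted_data)
    idx = q * (n - 1)  # fractional position within the sorted data
    lo, hi = int(floor(idx)), int(ceil(idx))
    if lo == hi:
        return sorted_data[lo]  # position lands exactly on a datapoint
    # weight each neighbor by how close it is to the fractional position
    return sorted_data[lo] * (hi - idx) + sorted_data[hi] * (idx - lo)


def freedman_diaconis_width(sorted_data):
    """Bin width of 2 * IQR / n ** (1/3), the rule used for default bins."""
    n = len(sorted_data)
    iqr = quantile(sorted_data, 0.75) - quantile(sorted_data, 0.25)
    return 2 * iqr / (n ** (1 / 3.0))


median_100 = quantile(list(range(100)), 0.5)
lower_quartile = quantile([1, 2, 3, 4, 5, 6, 7], 0.25)
width = freedman_diaconis_width(list(range(8)))
```

The first two results line up with the doctests in this file: the median of `range(100)` is 49.5, and the 0.25 quantile of `range(1, 8)` is 2.5.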
701
702 def describe(data, quantiles=None, format=None):
703 """A convenience function to get standard summary statistics useful
704 for describing most data. See :meth:`Stats.describe` for more
705 details.
706
707 >>> print(describe(range(7), format='text'))
708 count: 7
709 mean: 3.0
710 std_dev: 2.0
711 mad: 2.0
712 min: 0
713 0.25: 1.5
714 0.5: 3
715 0.75: 4.5
716 max: 6
717
718 See :meth:`Stats.format_histogram` for another very useful
719 summarization that uses textual visualization.
720 """
721 return Stats(data).describe(quantiles=quantiles, format=format)
722
723
724 def _get_conv_func(attr_name):
725 def stats_helper(data, default=0.0):
726 return getattr(Stats(data, default=default, use_copy=False),
727 attr_name)
728 return stats_helper
729
730
731 for attr_name, attr in list(Stats.__dict__.items()):
732 if isinstance(attr, _StatsProperty):
733 if attr_name in ('max', 'min', 'count'): # don't shadow builtins
734 continue
735 if attr_name in ('mad',): # convenience aliases
736 continue
737 func = _get_conv_func(attr_name)
738 func.__doc__ = attr.func.__doc__
739 globals()[attr_name] = func
740 delattr(Stats, '_calc_' + attr_name)
741 # cleanup
742 del attr
743 del attr_name
744 del func
745
746
747 def format_histogram_counts(bin_counts, width=None, format_bin=None):
748 """The formatting logic behind :meth:`Stats.format_histogram`, which
749 takes the output of :meth:`Stats.get_histogram_counts` and passes
750 it to this function.
751
752 Args:
753 bin_counts (list): A list of ``(bin, count)`` pairs.
754 width (int): Number of character columns in the text output,
755 defaults to 80 or console width in Python 3.3+.
756 format_bin (callable): Used to convert bin values into string
757 labels.
758 """
759 lines = []
760 if not format_bin:
761 format_bin = lambda v: v
762 if not width:
763 try:
764 import shutil # python 3 convenience
765 width = shutil.get_terminal_size()[0]
766 except Exception:
767 width = 80
768
769 bins = [b for b, _ in bin_counts]
770 count_max = max([count for _, count in bin_counts])
771 count_cols = len(str(count_max))
772
773 labels = ['%s' % format_bin(b) for b in bins]
774 label_cols = max([len(l) for l in labels])
775 tmp_line = '%s: %s #' % ('x' * label_cols, count_max)
776
777 bar_cols = max(width - len(tmp_line), 3)
778 line_k = float(bar_cols) / count_max
779 tmpl = "{label:>{label_cols}}: {count:>{count_cols}} {bar}"
780 for label, (bin_val, count) in zip(labels, bin_counts):
781 bar_len = int(round(count * line_k))
782 bar = ('#' * bar_len) or '|'
783 line = tmpl.format(label=label,
784 label_cols=label_cols,
785 count=count,
786 count_cols=count_cols,
787 bar=bar)
788 lines.append(line)
789
790 return '\n'.join(lines)
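The bar scaling in `format_histogram_counts` can be reduced to a few lines. The sketch below is a simplified illustration, not the module's implementation: `render_bars` is a hypothetical name, the bar width is passed in directly instead of being derived from the terminal width, labels are not aligned, and at least one count is assumed to be positive.

```python
def render_bars(bin_counts, bar_cols=20):
    """Render (bin, count) pairs as text, scaling the largest count to bar_cols."""
    count_max = max(count for _, count in bin_counts)
    scale = float(bar_cols) / count_max  # assumes count_max > 0
    lines = []
    for bin_val, count in bin_counts:
        # empty bins get a single '|' so the row stays visible
        bar = ('#' * int(round(count * scale))) or '|'
        lines.append('%s: %s %s' % (bin_val, count, bar))
    return '\n'.join(lines)


text = render_bars([(0.0, 5), (4.4, 10), (8.9, 0)], bar_cols=10)
```

The input has the same shape as the list returned by `Stats.get_histogram_counts`, so stored counts could be rendered this way without keeping the original data.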