Name: Plotivy
Author: Francesco Villasmunta

Histograms are the fastest way to inspect how values in a dataset are distributed. Before choosing a statistical test, checking for outliers, or deciding whether to log-transform data, a histogram gives you the visual context you need.

Many tutorials stop at plt.hist(data). That is enough for a screenshot, but not enough for technical work where binning choices, scaling, weighting, and reproducibility affect your scientific conclusions. This guide goes from baseline usage to edge cases you will encounter in real data pipelines.

You will get multiple code variants, interpretation guidance, and practical troubleshooting patterns for common failure modes. If your goal is a publication figure or a defensible exploratory analysis, this is the level of histogram control you need.

Bin count

Too few bins hide structure; too many add noise. Start with bins="auto" and adjust.

density=True

Normalises the y-axis to probability density, making groups with different sample sizes comparable.

alpha

Set alpha=0.5–0.6 when overlapping two or more distributions so both remain readable.

1. Minimal histogram

Pass your array to ax.hist(). The most important optional argument is bins - an integer sets the number of equal-width bins. For fast exploratory checks, this is fine, but for reporting or comparing cohorts, you should avoid hardcoding a bin count before inspecting spread, outliers, and sample size.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5, scale=1.5, size=300)

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(data, bins=30, color="#6b21a8", edgecolor="white", linewidth=0.4)
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.set_title("Basic Histogram")
plt.tight_layout()

2. Choosing bins with intent

Bin selection controls what structure you can see. Too few bins hide multimodality. Too many bins amplify noise and produce unstable tails. Instead of guessing, compare rule-based options on the same dataset and keep your choice explicit in methods sections.

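Sturges - conservative, good for small near-normal datasets; it tends to under-bin large or skewed samples.
Freedman-Diaconis (fd) - sets the bin width from the interquartile range, so it is robust to outliers and often strong for skewed experimental data.
auto - NumPy's default rule, which uses the narrower of the Sturges and Freedman-Diaconis bin widths; good as a baseline but still inspect visually.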

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = np.r_[rng.normal(0, 1, 700), rng.normal(4, 0.7, 300)]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharey=True)
rules = ["sturges", "fd", "auto"]

for ax, rule in zip(axes, rules):
    ax.hist(data, bins=rule, color="#6b21a8", edgecolor="white", linewidth=0.4)
    ax.set_title(f"bins='{rule}'")
    ax.set_xlabel("Value")

axes[0].set_ylabel("Count")
fig.suptitle("How bin rules change your interpretation")
plt.tight_layout()
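
To keep the choice explicit in a methods section, it can help to print what each rule actually produces. The sketch below is a minimal illustration, assuming the same synthetic bimodal data as above. It uses np.histogram_bin_edges, which applies the same string rules without drawing anything; the summary dictionary is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.r_[rng.normal(0, 1, 700), rng.normal(4, 0.7, 300)]

# Record the bin count and bin width that each rule yields on this dataset
summary = {}
for rule in ["sturges", "fd", "auto"]:
    edges = np.histogram_bin_edges(data, bins=rule)
    n_bins = len(edges) - 1
    width = edges[1] - edges[0]
    summary[rule] = (n_bins, width)
    print(f"{rule:>8}: {n_bins} bins, width {width:.3f}")
```

Reporting the rule together with the resulting bin width lets a reader reproduce the figure even if their library version picks a different default.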

3. Add a KDE curve on top

Use density=True to normalize the histogram, then overlay a Gaussian KDE for a smooth estimate of the underlying distribution. In reports, describe KDE as a smoothed estimate rather than raw frequency.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
data = rng.normal(loc=5, scale=1.5, size=300)

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(data, bins=30, density=True, color="#6b21a8", edgecolor="white",
        linewidth=0.4, alpha=0.6, label="Histogram")

xs = np.linspace(data.min(), data.max(), 200)
kde = gaussian_kde(data)
ax.plot(xs, kde(xs), lw=2, color="#f59e0b", label="KDE")

ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend(frameon=False)
plt.tight_layout()
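
A quick numerical check makes clear what density=True does: bar heights become densities, so the bar areas (height times width) sum to 1. This is why the KDE curve sits on the same scale as the bars. A minimal sketch with np.histogram, assuming the same synthetic data as above:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5, scale=1.5, size=300)

# density=True returns heights such that sum(height * width) == 1
density, edges = np.histogram(data, bins=30, density=True)
widths = np.diff(edges)
area = float(np.sum(density * widths))

print(f"Total area under the bars: {area:.6f}")
print(f"Tallest bar height (a density, not a probability): {density.max():.3f}")
```

Because heights are densities, an individual bar can exceed 1 when bins are narrow; only the total area is constrained.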

4. Overlapping histograms for group comparison

When comparing two groups, plot both histograms on the same axes with alpha set below 1. Use density=True if group sizes differ. Counts can be misleading when one cohort has more observations.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(4, 1.2, 200)
group_b = rng.normal(6, 1.0, 200)

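# Shared bin edges so both groups are binned identically
edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=25)

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(group_a, bins=edges, density=True, alpha=0.55, color="#6b21a8", label="Group A")
ax.hist(group_b, bins=edges, density=True, alpha=0.55, color="#f59e0b", label="Group B")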
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend(frameon=False)
plt.tight_layout()
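
To see why raw counts mislead when cohorts differ in size, compare peak heights with and without normalisation. The sketch below is illustrative only: the cohort sizes (1000 and 200) are assumptions chosen to exaggerate the effect, and both groups share one set of bin edges.

```python
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(4, 1.2, 1000)  # larger cohort (illustrative size)
group_b = rng.normal(6, 1.0, 200)   # smaller cohort (illustrative size)

# One shared set of edges so both groups are binned identically
edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=25)

counts_a, _ = np.histogram(group_a, bins=edges)
counts_b, _ = np.histogram(group_b, bins=edges)
density_a, _ = np.histogram(group_a, bins=edges, density=True)
density_b, _ = np.histogram(group_b, bins=edges, density=True)

# With counts, the larger cohort dominates; with density, shapes are comparable
print(f"Peak count   A / B: {counts_a.max() / counts_b.max():.2f}")
print(f"Peak density A / B: {density_a.max() / density_b.max():.2f}")
```

If absolute volume matters to your claim, keep the counts version and state n for each group alongside it.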

5. Weighted histograms for non-uniform sampling

In many workflows, samples do not contribute equally. You may have confidence weights from measurement quality, inverse probability weights in surveys, or replicate-level weighting from merged instruments. Use the weights argument so your histogram reflects weighted frequency, not raw row count.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
signal = rng.normal(0, 1, 600)

# Simulate acquisition confidence for each sample
weights = np.clip(rng.normal(loc=1.0, scale=0.25, size=signal.size), 0.2, 1.8)

fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)
axes[0].hist(signal, bins=30, color="#1d4ed8", edgecolor="white", linewidth=0.4)
axes[0].set_title("Unweighted")
axes[0].set_xlabel("Signal")

axes[1].hist(signal, bins=30, weights=weights, color="#b45309", edgecolor="white", linewidth=0.4)
axes[1].set_title("Weighted")
axes[1].set_xlabel("Signal")

axes[0].set_ylabel("Count / weighted count")
plt.tight_layout()
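
A useful sanity check for weighted histograms is that the bars sum to the total weight, not to the number of rows. A minimal sketch with np.histogram, assuming the same simulated signal and weights as above:

```python
import numpy as np

rng = np.random.default_rng(42)
signal = rng.normal(0, 1, 600)
weights = np.clip(rng.normal(loc=1.0, scale=0.25, size=signal.size), 0.2, 1.8)

# Use identical edges for both versions so only the weighting differs
edges = np.histogram_bin_edges(signal, bins=30)
raw_counts, _ = np.histogram(signal, bins=edges)
weighted_counts, _ = np.histogram(signal, bins=edges, weights=weights)

print("Rows:                ", int(raw_counts.sum()))
print("Sum of weights:      ", round(float(weights.sum()), 3))
print("Sum of weighted bars:", round(float(weighted_counts.sum()), 3))
```

If the last two numbers disagree in your own pipeline, some samples probably fall outside the bin range or the weights array is misaligned with the data.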

6. Skewed data: log bins and log axes

Particle size, latency, concentration, and many biological intensity datasets are right-skewed. Linear bins pack the dense low-value bulk into a few bars and spread the sparse right tail across many nearly empty ones. Logarithmic bins plus a log x-axis often produce a more interpretable shape. Log bins require strictly positive values, and their widths are no longer equal in linear space, so report this explicitly in figure captions.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=1.6, sigma=0.9, size=1500)

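# Build 41 logarithmic bin edges (40 bins, matching the linear panel).
# np.geomspace returns the endpoints exactly, so the min and max stay in range.
log_bins = np.geomspace(data.min(), data.max(), 41)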

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(data, bins=40, color="#6b21a8", edgecolor="white", linewidth=0.4)
axes[0].set_title("Linear bins")
axes[0].set_xlabel("Value")
axes[0].set_ylabel("Count")

axes[1].hist(data, bins=log_bins, color="#6b21a8", edgecolor="white", linewidth=0.4)
axes[1].set_xscale("log")
axes[1].set_title("Log bins + log x-axis")
axes[1].set_xlabel("Value (log scale)")

plt.tight_layout()
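
To report the edge definition precisely, note that logarithmic bins have a constant ratio between consecutive edges rather than a constant width. The sketch below is a minimal check under the same lognormal assumption as above, using np.geomspace to build the edges:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=1.6, sigma=0.9, size=1500)

n_bins = 40
log_edges = np.geomspace(data.min(), data.max(), n_bins + 1)

ratios = log_edges[1:] / log_edges[:-1]  # constant for log-spaced edges
widths = np.diff(log_edges)              # grows from left to right
counts, _ = np.histogram(data, bins=log_edges)

print(f"Edge ratio (constant): {ratios[0]:.4f}")
print(f"First bin width: {widths[0]:.4f}, last bin width: {widths[-1]:.4f}")
print(f"Samples binned: {counts.sum()} of {data.size}")
```

A caption can then state the number of bins and the edge ratio, which is enough for another lab to rebuild the same edges from the data range.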


7. Matplotlib vs Seaborn vs Plotly for histograms

The best library depends on your output target. Matplotlib offers maximum control for journal exports. Seaborn accelerates statistical defaults. Plotly gives interactivity for dashboards and exploratory notebooks.

Library	Strength	Trade-off
matplotlib	Precise publication styling and export control	More manual code for polished defaults
seaborn	Fast statistical visuals and clean defaults	Less granular control in edge styling cases
plotly	Interactive hover, zoom, and web-native output	Static publication output needs extra tuning

Quick reference: key parameters

Parameter	What it does
bins	Integer, sequence of edges, or `"auto"` / `"fd"`
density	Normalise to probability density (default False)
alpha	Transparency 0–1; use 0.5–0.6 for overlapping groups
edgecolor	Border colour of bars; `"white"` separates dense bins
log	Log-scale y-axis, useful for power-law distributions
range	Restrict binning to a (min, max) window; values outside it are ignored, not clamped
weights	Apply per-sample contribution when observations are non-uniform
histtype	Bar style: `bar`, `barstacked`, `step`, or `stepfilled`
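
The range parameter is the one most often misread, so a tiny example may help. With np.histogram, which applies the same binning as ax.hist, values outside the window are dropped from the counts rather than piled into the edge bins. The data values here are made up for illustration.

```python
import numpy as np

# Made-up values: a compact cluster plus two extreme readings
data = np.array([0.5, 1.2, 1.9, 2.4, 3.1, 3.3, 4.8, 25.0, 90.0])

# Without range, the extremes stretch the bins across the full span
counts_all, edges_all = np.histogram(data, bins=5)

# With range, only values inside [0, 5] are binned; the rest are ignored
counts_win, edges_win = np.histogram(data, bins=5, range=(0, 5))

print("Full span counts:", counts_all.tolist())
print("Windowed counts: ", counts_win.tolist())
print("Values dropped by the window:", data.size - int(counts_win.sum()))
```

Because dropped values disappear silently, state the window and the number of excluded observations whenever you use range in a reported figure.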

8. Common mistakes and robust fixes

Histogram bugs are often data quality bugs in disguise. Before styling, validate numeric types, remove non-finite values, and lock your bin strategy. If your figure changes drastically across reruns, inspect random seeds and preprocessing order.

Too few bins - wide bins hide bimodality or skewness. Use bins="auto" as a starting point.
Comparing raw counts - if sample sizes differ, normalise with density=True before overlapping.
No axis label - always label both axes, including the units on the x-axis.
Missing caption - state the number of observations (n) in the figure caption or on the plot.
Ignoring NaN or inf values - clean arrays before plotting or you will get unstable bins and hard-to-debug behavior.
Changing bins between cohorts - when comparing groups, use a shared bin edge definition.

import matplotlib.pyplot as plt
import numpy as np

def clean_numeric(series):
    arr = np.asarray(series, dtype=float)
    arr = arr[np.isfinite(arr)]
    if arr.size == 0:
        raise ValueError("No finite numeric values available for histogram")
    return arr

raw = [1.1, 2.0, np.nan, np.inf, 3.4, 4.1, -np.inf, 2.8]
data = clean_numeric(raw)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(data, bins="auto", color="#6b21a8", edgecolor="white", linewidth=0.4)
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.set_title("Histogram after numeric cleaning")
plt.tight_layout()

9. Publication-ready histogram checklist

Report sample size (n) in caption or panel text.
Declare bin rule or fixed bin width explicitly.
Use consistent bins across compared cohorts.
Label x-axis units and whether y-axis is count or density.
Document preprocessing: outlier handling, filtering, and transforms.
Export at target journal dimensions and verify readability at print scale.
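
Several checklist items can be handled in code. The sketch below is one possible arrangement, not a required template: the 3.5 x 2.6 inch figure size, the 300 dpi setting, the axis unit text, and the output file name are all assumptions that you should replace with your journal's requirements.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5, scale=1.5, size=300)

# Declare the bin rule once and compute explicit edges from it
bin_rule = "fd"
edges = np.histogram_bin_edges(data, bins=bin_rule)

fig, ax = plt.subplots(figsize=(3.5, 2.6))  # assumed single-column size in inches
ax.hist(data, bins=edges, color="#6b21a8", edgecolor="white", linewidth=0.4)
ax.set_xlabel("Value (arbitrary units)")  # replace with your real units
ax.set_ylabel("Count")

# Put n and the bin rule on the panel so they survive cropping
label = f"n = {data.size}\nbins: {bin_rule} ({len(edges) - 1})"
ax.text(0.97, 0.95, label, transform=ax.transAxes, ha="right", va="top", fontsize=8)
fig.tight_layout()

# Hypothetical output location; export at the final print size and resolution
out_path = Path(tempfile.gettempdir()) / "histogram_fd.png"
fig.savefig(out_path, dpi=300)
```

Open the exported file at 100% zoom to confirm that tick labels and the annotation remain legible at print scale.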

10. When to use a histogram vs alternatives

Histograms are excellent for showing distribution shape, but they are not always the best final figure. If your audience needs quartiles and outlier emphasis, a box plot may communicate faster. If you need full density shape comparison with small samples, violin plots can be more stable than aggressively binned histograms.

A practical workflow is: start with a histogram for quick distribution diagnostics, validate assumptions, then choose the final chart form based on the message you need to communicate in the paper or report.
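
When deciding between chart forms, it can help to draw them side by side on the same data. This is a minimal sketch using a synthetic right-skewed sample (an assumption for illustration) with Matplotlib's own boxplot and violinplot methods:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=1.0, sigma=0.5, size=200)

fig, axes = plt.subplots(1, 3, figsize=(11, 3.5))

# Histogram: full distribution shape
axes[0].hist(data, bins="auto", color="#6b21a8", edgecolor="white", linewidth=0.4)
axes[0].set_title("Histogram")
axes[0].set_xlabel("Value")
axes[0].set_ylabel("Count")

# Box plot: quartiles and outlier emphasis
box = axes[1].boxplot(data)
axes[1].set_title("Box plot")
axes[1].set_ylabel("Value")
axes[1].set_xticks([])

# Violin plot: smoothed density shape with the median marked
violin = axes[2].violinplot(data, showmedians=True)
axes[2].set_title("Violin plot")
axes[2].set_ylabel("Value")
axes[2].set_xticks([])

fig.tight_layout()
```

The violin panel is a smoothed estimate, so describe it as such if it becomes the final figure.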

11. Advanced edge cases you should handle explicitly

Technical datasets rarely follow clean textbook assumptions. If you are working with sensor streams, assay readouts, or merged multi-site data, your histogram pipeline should include edge-case handling as first-class logic. Treat these as part of the analysis method, not cosmetic cleanup.

Zero-inflated distributions - many biological and reliability datasets include a large spike at zero plus a continuous positive tail. Consider plotting a dedicated zero bar annotation and a second panel for non-zero values to avoid flattening useful variation.
Censored measurements - if values below detection limit are imputed, your left tail may be artificial. Mark detection thresholds with vertical reference lines and mention censoring rules in captions.
Integer count data - for low-range count variables (for example read counts or defect counts), use bin edges aligned to integer boundaries. Arbitrary fractional bins can imply precision that does not exist in the measured variable.
Streaming windows - in online QC systems, histograms computed on rolling windows can drift due to changing sample volume. Keep a fixed bin edge policy and version your windowing settings for reproducible audits.
Mixture populations - multimodal peaks may represent true subpopulations rather than noise. Before smoothing them away, verify whether modality aligns with treatment group, batch, instrument, or site.
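
Two of the cases above translate directly into bin-edge logic. The sketch below is illustrative only: the defect counts and the zero-inflated readout are simulated, and the variable names are hypothetical. It builds integer-aligned edges for count data, then separates the zero spike from the positive tail before binning.

```python
import numpy as np

rng = np.random.default_rng(42)

# Integer count data: one bin per integer, with edges at half-integers
defects = rng.poisson(lam=2.5, size=400)  # simulated defect counts
int_edges = np.arange(defects.min() - 0.5, defects.max() + 1.5, 1.0)
counts, _ = np.histogram(defects, bins=int_edges)
print("Bins per integer value:", counts.tolist())

# Zero-inflated data: report the zero spike separately from the positive tail
is_zero = rng.random(500) < 0.4
readout = np.where(is_zero, 0.0, rng.lognormal(mean=0.5, sigma=0.6, size=500))
n_zero = int(np.sum(readout == 0))
positive = readout[readout > 0]
pos_counts, pos_edges = np.histogram(positive, bins="fd")

print(f"Zeros: {n_zero} of {readout.size} ({n_zero / readout.size:.0%})")
print(f"Positive values: {positive.size}, binned into {len(pos_counts)} bins")
```

Both edge arrays can be passed straight to ax.hist through the bins argument, with the zero count shown as an annotation or in its own panel.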

12. Interpreting histogram shapes without overclaiming

Histogram interpretation is useful but easy to overstate. Shape alone does not prove mechanism. Use the plot to generate hypotheses, then test them with domain checks and formal statistics.

Observed shape	Common interpretation	What to verify next
Right skew	Rare high values or multiplicative process	Check log transform, outlier provenance, measurement floor
Bimodal	Two latent groups or regime change	Color by batch/treatment and inspect subgroup histograms
Heavy tails	Extreme events matter to summary metrics	Use robust statistics (median, IQR) and tail diagnostics
Sharp central spike	Quantization, rounding, or imputation artifact	Inspect raw instrument precision and preprocessing rules
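
The "verify next" column usually means computing a few numbers alongside the plot. As a minimal sketch for the right-skew row, assuming synthetic lognormal data, the helper below (shape_summary is a hypothetical name) reports robust statistics and sample skewness before and after a log transform:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
data = rng.lognormal(mean=1.6, sigma=0.9, size=1500)

def shape_summary(values):
    """Return mean, median, IQR, and sample skewness for a 1-D array."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "mean": float(np.mean(values)),
        "median": float(median),
        "iqr": float(q3 - q1),
        "skewness": float(skew(values)),
    }

raw = shape_summary(data)
logged = shape_summary(np.log(data))

for name, stats in [("raw", raw), ("log-transformed", logged)]:
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

A mean well above the median and a large positive skewness support the right-skew reading; skewness near zero after the transform suggests a multiplicative process is a plausible description, not a proven one.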

13. Domain-specific implementation notes

The same plotting function behaves differently across domains because data collection constraints differ. These patterns reduce avoidable review comments in technical papers and internal reports.

Clinical measurements - always annotate units and detection limits. If subgroup sample sizes differ strongly, favor density-normalized overlays plus subgroup-specific sample counts.
Materials characterization - grain size and particle distributions are typically right-skewed. Use log bins and report the exact edge definition so another lab can reproduce your visual summary.
A/B experimentation - compare control and treatment with shared bins and explicit cohort sizes. Pair histograms with confidence intervals or bootstrap summaries to avoid visual-only conclusions.
Sensor engineering - if calibration changes over time, stratify histograms by calibration era. Mixed-calibration histograms can create false multimodality and mislead threshold setting.
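
For the A/B case, shared bins and a bootstrap summary can be produced together. The following is a sketch under stated assumptions: the control and treatment samples are simulated, the statistic is the difference in means, and the interval is a simple percentile bootstrap with 2000 resamples. Other statistics or interval methods may suit your experiment better.

```python
import numpy as np

rng = np.random.default_rng(42)
control = rng.normal(10.0, 2.0, 400)    # simulated control metric
treatment = rng.normal(10.6, 2.0, 250)  # simulated treatment metric

# Shared edges so both cohorts are binned identically
edges = np.histogram_bin_edges(np.concatenate([control, treatment]), bins="fd")
control_density, _ = np.histogram(control, bins=edges, density=True)
treatment_density, _ = np.histogram(treatment, bins=edges, density=True)

# Percentile bootstrap for the difference in means (treatment minus control)
n_boot = 2000
diffs = np.empty(n_boot)
for i in range(n_boot):
    resampled_c = rng.choice(control, size=control.size, replace=True)
    resampled_t = rng.choice(treatment, size=treatment.size, replace=True)
    diffs[i] = resampled_t.mean() - resampled_c.mean()

observed = treatment.mean() - control.mean()
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"n control = {control.size}, n treatment = {treatment.size}")
print(f"Difference in means: {observed:.3f} (95% bootstrap CI {ci_low:.3f} to {ci_high:.3f})")
```

Reporting the interval next to the overlaid histograms keeps the visual impression tied to a quantified effect.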

14. Common error messages and immediate fixes

If you are building production notebooks or reusable plotting utilities, add guards for these failure cases. Most histogram runtime issues come from invalid data arrays rather than plotting API syntax.

ValueError: autodetected range of [nan, nan] is not finite - clean non-finite values before binning.
TypeError with mixed string and numeric arrays - cast to numeric explicitly and log rejected rows.
Unreadable overlapping fills - lower alpha and use step histograms or faceting.
Histogram shape changes between runs - control random seed, filtering order, and bin-edge policy.
Axis labels clipped in export - apply plt.tight_layout() and verify target figure size before saving.
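
The first two guards can live in one small helper. This is a minimal sketch in plain Python; coerce_numeric is a hypothetical name, and the input rows are made up to show typical mixed content.

```python
import math

def coerce_numeric(rows):
    """Split rows into finite floats and rejected (index, value) pairs."""
    accepted, rejected = [], []
    for index, value in enumerate(rows):
        try:
            number = float(value)
        except (TypeError, ValueError):
            # Not convertible at all, for example "n/a" or None
            rejected.append((index, value))
            continue
        if math.isfinite(number):
            accepted.append(number)
        else:
            # Converts to nan or inf, which would break bin-range detection
            rejected.append((index, value))
    return accepted, rejected

raw_rows = ["1.2", 3, "n/a", None, "4.5e1", float("nan"), "inf", 2.75]
values, rejected = coerce_numeric(raw_rows)

print("Accepted values:", values)
print("Rejected row indices:", [index for index, _ in rejected])
```

Logging the rejected indices, rather than silently dropping rows, makes it possible to report how many observations were excluded and why.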

15. FAQ for technical teams

Should I show counts or density in publications?

If you compare cohorts with unequal sample sizes, density is usually clearer. If absolute volume matters to the claim, keep counts and show sample size for each cohort. In many papers, the strongest approach is counts in supplemental plots and density overlays in the main figure where shape comparison matters.

How many bins are appropriate for n around 100 to 500?

There is no universal answer. Start with rule-based edges such as fd or auto, then compare visual stability across nearby settings. If conclusions change drastically with small bin adjustments, report that sensitivity and avoid overconfident narrative claims.
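
One crude way to check stability is to count prominent peaks at a few bin counts around the rule-based choice. The sketch below assumes a synthetic two-component sample of n = 300. The offset of 4 bins and the 10% prominence threshold are arbitrary assumptions, and the peak count is a rough diagnostic, not a formal test of modality.

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(42)
data = np.r_[rng.normal(0, 1, 200), rng.normal(4, 0.7, 100)]

# Start from the Freedman-Diaconis bin count, then probe nearby settings
base_bins = len(np.histogram_bin_edges(data, bins="fd")) - 1
candidates = sorted({max(base_bins - 4, 3), base_bins, base_bins + 4})

results = {}
for n_bins in candidates:
    counts, _ = np.histogram(data, bins=n_bins)
    # Pad with zeros so peaks in the first or last bin are also detected
    padded = np.r_[0, counts, 0]
    peaks, _ = find_peaks(padded, prominence=0.1 * counts.max())
    results[n_bins] = len(peaks)
    print(f"{n_bins:>3} bins -> {len(peaks)} prominent peak(s)")
```

If the peak count changes across neighbouring settings, treat any claim about modality as sensitive to binning and say so in the text.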

When should I choose KDE alone instead of histogram + KDE?

KDE alone is useful for cleaner visual summaries, especially when many groups overlap. Histogram + KDE is better when readers need to see both empirical bin occupancy and smoothed trend. In regulated or high-stakes contexts, showing the empirical histogram reduces ambiguity about where data points actually lie.

Can I use histograms for hypothesis testing directly?

Histograms are descriptive, not inferential. Use them to check assumptions, detect anomalies, and motivate model choice. Then run appropriate statistical tests on the underlying data. Keep the histogram as diagnostic evidence, not as the sole proof of significance.


Tags: #histogram #matplotlib #seaborn #python #distribution


Francesco Villasmunta

Experimental Physicist & Photonics Researcher

Hands-on experience in silicon photonics, semiconductor fabrication (DRIE/ICP-RIE), optical simulation, and data-driven analysis. Built Plotivy to help researchers focus on discoveries instead of data struggles.


How to Plot a Histogram in Python with Matplotlib (Complete Guide)