PCA Visualization in Python: Scores, Loadings, and Biplots

Technique overview

Create scree plots, scores plots with confidence ellipses, loadings plots, and biplots for PCA results. Covers interpretation for omics and environmental data.

Principal Component Analysis is the default dimensionality-reduction technique when you have a matrix of samples by features and need to visualise the dominant sources of variation. Whether you are exploring metabolomics profiles from an LC-MS experiment, checking for batch effects in RNA-seq counts, or separating geological samples by elemental composition, PCA reduces dozens or hundreds of correlated variables into a few orthogonal components that capture most of the variance. The statistical computation is one line of sklearn code; the real challenge is producing the four-panel figure that reviewers expect: scree plot, scores plot with confidence ellipses, loadings bar chart, and biplot. This page delivers all four.

Libraries: sklearn, numpy, matplotlib

Mathematical Foundation

Equation

X = U * S * V^T (SVD of the centered, scaled data matrix)

Parameter breakdown

X: (n x p) data matrix, centered and optionally scaled to unit variance

U: (n x k) matrix of left singular vectors (scores = U * S)

S: (k x k) diagonal matrix of singular values (s_i^2 is proportional to the variance explained by component i)

V: (p x k) matrix of right singular vectors (loadings / feature weights)

Explained variance ratio = s_i^2 / sum(s^2)
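
The decomposition can be checked numerically. The following is a minimal sketch using only NumPy on synthetic data (the random matrix and all variable names are illustrative assumptions, not part of the original workflow). It computes scores as U * S, takes the columns of V as loadings, and compares s_i^2 / sum(s^2) with the eigenvalue shares of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(30, 6))            # synthetic data: 30 samples, 6 features

# Center and scale each column to unit variance (ddof=0, as StandardScaler does)
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

scores = U * s                              # sample coordinates on each PC
loadings = Vt.T                             # columns are PC directions (p x k)
explained_ratio = s**2 / np.sum(s**2)

# Cross-check against the eigenvalues of the covariance matrix (descending order)
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
eig_ratio = eigvals / eigvals.sum()

print(np.round(explained_ratio, 3))
print(np.allclose(explained_ratio, eig_ratio))
```

Projecting the data onto the loadings (X @ loadings) gives the same scores as U * s, which is why the two views of PCA (SVD and covariance eigen-decomposition) are interchangeable.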

When to use this technique

Use PCA when you want to reduce dimensionality for visualization, identify outliers, check for batch effects, or pre-process data before clustering or classification. Always scale your data (StandardScaler) unless all features share the same units and dynamic range.

Apply This Technique Now

Run this analysis workflow with AI in seconds. Use the prepared technique prompt or bring your own dataset.

Example AI Prompt

"Run PCA on my dataset, show a scree plot, a 2D scores plot with 95% confidence ellipses colored by group, and a loadings biplot overlay"

How to apply this technique in 30 seconds

Upload Data

Upload your CSV or Excel file in Analyze and keep your column names as-is.

Generate

Run the example prompt and let AI generate this technique automatically.

Refine and Export

Adjust code or prompt, then export publication-ready figures.

Implementation Code

The core data processing logic. Copy this block and replace the sample data with your measurements.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# --- Simulated metabolomics data: 40 samples, 20 metabolites, 2 groups ---
np.random.seed(42)
n_samples, n_features = 40, 20
group_labels = np.array(['Control'] * 20 + ['Treated'] * 20)

X_ctrl = np.random.normal(0, 1, (20, n_features))
X_trt  = np.random.normal(0, 1, (20, n_features))
X_trt[:, :5] += 1.5  # Shift first 5 metabolites in treated group
X = np.vstack([X_ctrl, X_trt])

# --- Scale and run PCA ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)

print("Explained variance ratios:")
for i, ev in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {ev:.3f} ({ev*100:.1f} %)")
print(f"  Cumulative (PC1-5): {pca.explained_variance_ratio_.sum()*100:.1f} %")

# Loadings
loadings = pca.components_.T  # shape: (n_features, n_components)
feature_names = [f'Metabolite_{i+1}' for i in range(n_features)]
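
Once `loadings` and `feature_names` exist, a common next step is to list which features drive a component. The helper below is a hypothetical convenience function (`top_loadings` is not part of sklearn). It is shown on a small hand-written loadings matrix so the result is easy to verify by eye.

```python
import numpy as np

def top_loadings(loadings, names, pc=0, k=3):
    """Return the k (name, loading) pairs with the largest |loading| on one PC.

    `loadings` has shape (n_features, n_components), as in pca.components_.T.
    """
    order = np.argsort(np.abs(loadings[:, pc]))[::-1][:k]
    return [(names[i], float(loadings[i, pc])) for i in order]

# Toy loadings matrix: 4 features, 2 components (illustrative values only)
toy_loadings = np.array([
    [ 0.10,  0.70],
    [-0.80,  0.05],
    [ 0.55, -0.20],
    [ 0.20,  0.68],
])
toy_names = ['A', 'B', 'C', 'D']

print(top_loadings(toy_loadings, toy_names, pc=0, k=2))
# [('B', -0.8), ('C', 0.55)]
```

With the simulated data above, the call would be `top_loadings(loadings, feature_names, pc=0, k=5)`.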


Visualization Code

Complete matplotlib code for a publication-ready figure. Copy, paste into your notebook, and adjust labels to match your data.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X_ctrl = np.random.normal(0, 1, (20, 20))
X_trt  = np.random.normal(0, 1, (20, 20))
X_trt[:, :5] += 1.5
X = np.vstack([X_ctrl, X_trt])
labels = ['Control'] * 20 + ['Treated'] * 20

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)
loadings = pca.components_.T
feat_names = [f'M{i+1}' for i in range(20)]
ev = pca.explained_variance_ratio_

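def confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
    """Draw a covariance ellipse for the points (x, y) on ax.

    n_std is the half-axis length in standard deviations. For bivariate
    normal data, n_std=2.0 covers about 86 %; 95 % needs about 2.45.
    """
    cov = np.cov(x, y)
    # Eigen-decompose the 2x2 covariance; sort so the major axis comes first
    vals, vecs = np.linalg.eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Rotation of the major axis relative to the x-axis, in degrees
    angle = np.degrees(np.arctan2(*vecs[:, 0][::-1]))
    # Full width and height span 2 * n_std standard deviations along each axis
    w, h = 2 * n_std * np.sqrt(vals)
    ell = Ellipse(xy=(x.mean(), y.mean()), width=w, height=h,
                  angle=angle, **kwargs)
    ax.add_patch(ell)
    return ell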

fig, axes = plt.subplots(2, 2, figsize=(11, 9))

# 1. Scree plot
axes[0, 0].bar(range(1, 6), ev * 100, color='#9240ff', edgecolor='black')
axes[0, 0].plot(range(1, 6), np.cumsum(ev) * 100, 'o-', color='black')
axes[0, 0].set_xlabel('Principal Component')
axes[0, 0].set_ylabel('Variance Explained (%)')
axes[0, 0].set_title('Scree Plot')

# 2. Scores plot
colors_map = {'Control': '#888888', 'Treated': '#9240ff'}
for grp in ['Control', 'Treated']:
    mask = np.array(labels) == grp
    axes[0, 1].scatter(scores[mask, 0], scores[mask, 1], s=40,
                       color=colors_map[grp], label=grp, zorder=5)
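    # In 2D, 95 % coverage needs sqrt(-2 ln 0.05) ~ 2.45 SD (bivariate normal);
    # n_std=2 would cover only about 86 %.
    confidence_ellipse(scores[mask, 0], scores[mask, 1], axes[0, 1],
                       n_std=np.sqrt(-2 * np.log(0.05)),
                       facecolor=colors_map[grp], alpha=0.15,
                       edgecolor=colors_map[grp], lw=1.5)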
axes[0, 1].set_xlabel(f'PC1 ({ev[0]*100:.1f} %)')
axes[0, 1].set_ylabel(f'PC2 ({ev[1]*100:.1f} %)')
axes[0, 1].set_title('Scores Plot with 95 % Ellipses')
axes[0, 1].legend(frameon=False)

# 3. Loadings bar chart (PC1)
order = np.argsort(np.abs(loadings[:, 0]))[::-1][:10]
axes[1, 0].barh(range(len(order)), loadings[order, 0], color='#9240ff')
axes[1, 0].set_yticks(range(len(order)))
axes[1, 0].set_yticklabels([feat_names[i] for i in order])
axes[1, 0].set_xlabel('Loading on PC1')
axes[1, 0].set_title('Top 10 Loadings (PC1)')
axes[1, 0].invert_yaxis()

# 4. Biplot
axes[1, 1].scatter(scores[:, 0], scores[:, 1], s=15, color='#888', alpha=0.5)
scale = 3
for i in range(min(8, len(feat_names))):
    axes[1, 1].arrow(0, 0, loadings[i, 0]*scale, loadings[i, 1]*scale,
                     head_width=0.08, head_length=0.05, fc='#9240ff',
                     ec='#9240ff', alpha=0.7)
    axes[1, 1].text(loadings[i, 0]*scale*1.1, loadings[i, 1]*scale*1.1,
                    feat_names[i], fontsize=7, color='#9240ff')
axes[1, 1].set_xlabel(f'PC1 ({ev[0]*100:.1f} %)')
axes[1, 1].set_ylabel(f'PC2 ({ev[1]*100:.1f} %)')
axes[1, 1].set_title('Biplot')

plt.tight_layout()
plt.savefig('pca_4panel.png', dpi=300, bbox_inches='tight')
plt.show()
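
A note on ellipse size: for bivariate normal scores, an ellipse drawn at n_std standard deviations contains 1 - exp(-n_std^2 / 2) of the probability mass, so the radius for a given coverage level is sqrt(-2 ln(1 - level)). The sketch below converts between the two. The function names `coverage_2d` and `n_std_for_level` are illustrative, not library functions, and the result only holds to the extent that each group's scores are roughly bivariate normal.

```python
import numpy as np

def coverage_2d(n_std):
    """Probability mass inside an n_std ellipse of a bivariate normal."""
    return 1.0 - np.exp(-0.5 * n_std**2)

def n_std_for_level(level):
    """Ellipse radius (in standard deviations) that covers `level` in 2D."""
    return np.sqrt(-2.0 * np.log(1.0 - level))

print(f"n_std=2.0 covers {coverage_2d(2.0):.3f}")        # 0.865
print(f"95 % needs n_std={n_std_for_level(0.95):.3f}")    # 2.448
```

To label a panel "95 % Ellipses", pass `n_std=n_std_for_level(0.95)` to `confidence_ellipse`.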


Group-Colored PCA with Custom Legend

When your dataset has three or more experimental groups (e.g., time points, concentrations, cell types), you need a clear color scheme and legend. This extension shows how to assign distinct colors per group, draw separate confidence ellipses, and produce a polished multi-group scores plot.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
groups = ['Wildtype', 'Knockout', 'Rescue']
n_per = 15
X_list, labels = [], []
for i, g in enumerate(groups):
    Xi = np.random.normal(0, 1, (n_per, 15))
    Xi[:, :3] += i * 1.5
    X_list.append(Xi)
    labels += [g] * n_per
X = np.vstack(X_list)

X_sc = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_sc)
ev = pca.explained_variance_ratio_

colors = {'Wildtype': '#888', 'Knockout': '#9240ff', 'Rescue': '#e67e22'}

fig, ax = plt.subplots(figsize=(7, 6))
for grp in groups:
    mask = np.array(labels) == grp
    ax.scatter(scores[mask, 0], scores[mask, 1], s=50,
               color=colors[grp], label=grp, zorder=5, edgecolors='white', lw=0.5)
    cov = np.cov(scores[mask, 0], scores[mask, 1])
    vals, vecs = np.linalg.eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]
    angle = np.degrees(np.arctan2(*vecs[:, 0][::-1]))
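    # 2.0 SD covers about 86 % of a bivariate normal;
    # use np.sqrt(-2 * np.log(0.05)) (~2.45) in place of 2.0 for 95 %
    w, h = 2 * 2.0 * np.sqrt(vals)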
    ell = Ellipse(xy=(scores[mask, 0].mean(), scores[mask, 1].mean()),
                  width=w, height=h, angle=angle,
                  facecolor=colors[grp], alpha=0.12,
                  edgecolor=colors[grp], lw=1.5)
    ax.add_patch(ell)

ax.set_xlabel(f'PC1 ({ev[0]*100:.1f} %)')
ax.set_ylabel(f'PC2 ({ev[1]*100:.1f} %)')
ax.set_title('PCA Scores - Multi-Group Comparison', fontsize=13)
ax.legend(frameon=False)
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('pca_multigroup.png', dpi=300, bbox_inches='tight')
plt.show()


Common Errors and How to Fix Them

Features dominate PCA because data is not scaled

Why: PCA maximizes variance. If one feature has values in the thousands and another in decimals, the large-valued feature will dominate PC1 regardless of its biological relevance.

Fix: Always use StandardScaler (zero mean, unit variance) before PCA unless all features have the same units and comparable ranges.
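
The effect of scaling can be seen in a small NumPy-only sketch. The two synthetic features and the helper `pc1_loadings` are assumptions made for illustration; the scaling step mirrors what StandardScaler does (zero mean, unit variance per column).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Two independent features on very different scales (synthetic example)
big = rng.normal(0, 1000, n)      # e.g. a raw peak area
small = rng.normal(0, 0.1, n)     # e.g. a ratio
X = np.column_stack([big, small])

def pc1_loadings(X):
    """First right singular vector of the column-centered matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

raw = pc1_loadings(X)
scaled = pc1_loadings((X - X.mean(axis=0)) / X.std(axis=0))

print("PC1 loadings, unscaled:", np.round(np.abs(raw), 3))
print("PC1 loadings, scaled:  ", np.round(np.abs(scaled), 3))
```

Without scaling, PC1 is essentially the large-valued feature alone. After scaling, both features carry equal weight (about 0.707 each in the two-feature case), so neither dominates because of its units.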

Too few samples relative to features (n < p)

Why: With more features than samples, PCA can produce at most n-1 non-zero components, and the results may be dominated by noise.

Fix: Use regularised PCA or pre-filter features (e.g., remove low-variance columns). Many omics datasets are inherently n << p, which is acceptable as long as you interpret cautiously.
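
The n-1 limit is easy to confirm. The sketch below uses synthetic data with 8 samples and 50 features (sizes chosen only for illustration) and counts the singular values of the centered matrix that are numerically non-zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 50                                # fewer samples than features
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                     # centering removes one degree of freedom

s = np.linalg.svd(Xc, compute_uv=False)     # min(n, p) = 8 singular values
n_nonzero = int(np.sum(s > 1e-10 * s[0]))   # count values above numerical noise

print(len(s), n_nonzero)                    # 8 7
```

Requesting more than n-1 components in this situation adds only components with zero variance.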

Interpreting Euclidean distance in the scores plot as "similarity"

Why: Distances in the PC1-PC2 plane ignore variance in higher PCs. Two points close in PC1/PC2 may be far apart in PC3+.

Fix: Check the cumulative explained variance. If PC1+PC2 < 60 %, distances in the 2D plot can be misleading. Consider a 3D plot or include more components.
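
A quick way to apply this check is to compute the cumulative ratio and find how many components are needed to reach a chosen threshold. The helper `n_components_for` and the ratio values below are made up for illustration; in practice the input would be `pca.explained_variance_ratio_`, and the 60 % cut-off is the rule of thumb from the text, not a formal criterion.

```python
import numpy as np

def n_components_for(explained_ratio, threshold=0.60):
    """Smallest number of leading PCs whose cumulative ratio reaches threshold.

    Returns None if the supplied components never reach the threshold.
    """
    cum = np.cumsum(explained_ratio)
    if cum[-1] < threshold:
        return None
    return int(np.searchsorted(cum, threshold) + 1)

# Illustrative ratios (not from real data)
ratios = np.array([0.30, 0.20, 0.15, 0.10, 0.05])

print(f"PC1 + PC2: {np.cumsum(ratios)[1] * 100:.0f} %")
print(n_components_for(ratios, 0.60))  # 3
```

Here PC1 and PC2 together fall short of 60 %, so a third component would be needed before distances in the reduced space reflect most of the variance.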

Mixing up loadings sign (positive vs negative)

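Why: Within a component, the sign of a loading shows whether a feature increases or decreases along that PC, but the overall sign of each PC is arbitrary. Multiplying both the scores and the loadings of a component by -1 gives an equally valid solution, and different software may return either orientation.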

Fix: Focus on the magnitude and relative sign of loadings within a component. A feature with loading = -0.4 on PC1 contributes as much as one with +0.4, just in the opposite direction.
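
The arbitrariness of the sign can be demonstrated directly. This sketch (NumPy only, synthetic data, variable names chosen for illustration) flips PC1 in both scores and loadings and checks that the reconstruction and the per-component variance do not change.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores, loadings = U * s, Vt.T

# Flip the sign of PC1 in both scores and loadings
flip = np.array([-1.0, 1.0, 1.0, 1.0])
scores_f, loadings_f = scores * flip, loadings * flip

# The reconstruction and the variance per component are unchanged
same_fit = np.allclose(scores @ loadings.T, scores_f @ loadings_f.T)
same_var = np.allclose(scores.var(axis=0), scores_f.var(axis=0))
print(same_fit, same_var)   # True True
```

A mirrored scores plot or a loadings chart with all bars reversed therefore describes the same component.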

Including the target variable in PCA for a classification task

Why: PCA is an unsupervised method. Including the class label as a feature introduces data leakage.

Fix: Remove class labels before running PCA. If you want supervised dimensionality reduction, use Linear Discriminant Analysis (LDA) instead.
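
As a minimal sketch of this fix, the example below builds a synthetic table whose last column is the class label (an assumption made for illustration), splits the label off before scaling, and then runs PCA without labels and LDA with them. LDA yields at most (number of classes - 1) components, so two classes give a single discriminant axis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Synthetic table: 40 samples, 10 features, last column holds the class label
features = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
features[y == 1, :3] += 1.5
table = np.column_stack([features, y])

# Split the label off before any scaling or PCA
X, labels = table[:, :-1], table[:, -1].astype(int)
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised: PCA never sees the labels
pca_scores = PCA(n_components=2).fit_transform(X_scaled)
# Supervised alternative: LDA uses the labels to find the separating axis
lda_scores = LinearDiscriminantAnalysis().fit_transform(X_scaled, labels)

print(pca_scores.shape, lda_scores.shape)   # (40, 2) (40, 1)
```

The labels can still be used to color the PCA scores plot afterwards; they just must not be among the columns passed to `fit_transform`.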

Learn More Before You Run It

Scientific Data Visualization: Complete Guide

Context for PCA in exploratory analysis and storytelling workflows.

Apply PCA Visualization in Python: Scores, Loadings, and Biplots to Your Data

Upload your dataset and Plotivy generates the Python code, runs the analysis, and produces a publication-ready figure.
