Preclinical Sample Size Calculation for Fibrosis Models
Calculate optimal sample size for fibrosis studies using Power Analysis and G*Power. Balance statistical rigor with 3Rs ethics in MASH, IPF, CKD models.
Introduction: The Eternal Question—"How Many Mice?"
When planning a preclinical in vivo efficacy study, the most frequent and critical question asked by researchers and sponsors alike is: "What should the sample size (N per group) be?"
If the N number is too small, you risk failing to detect a statistically significant difference even if your drug works perfectly (a False Negative / Type II error). Conversely, if the N number is excessively large, you unnecessarily sacrifice more animals (violating the 3Rs principles) and waste significant budget and time.
Particularly in fibrosis models (such as MASH, IPF, and CKD), where histological readouts like Sirius Red morphometry carry inherent inter-individual variability, relying on "rules of thumb" (like "let's just use N=8") is a recipe for study failure. This article breaks down the science of Power Analysis, demonstrating how to logically calculate your required sample size using the free, industry-standard tool, G*Power.
1. The Four Pillars of Sample Size Calculation
To reverse-engineer the required N number, you must pre-define or estimate four key statistical parameters:
- α (Alpha / Significance Level / Type I error rate): The probability of concluding there is an effect when none actually exists (False Positive). By scientific consensus, this is almost always set at 0.05 (5%). This is the threshold where P < 0.05 is considered "statistically significant."
- Power (1 − β): The probability of correctly detecting a true effect (β is the Type II error rate, the probability of a False Negative). The industry standard for preclinical drug screening is generally 0.80 (80%), sometimes raised to 0.90 (90%).
- Effect Size (The Expected Difference): How much of a difference (e.g., in mean fibrosis percentage) do you realistically expect your test article to achieve compared to the vehicle control? This must be estimated from previous in-house data or robust published literature.
- Standard Deviation (SD): A measure of how "noisy" or variable the data is within your specific animal model. This is notoriously critical in fibrosis evaluation.
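Under a normal approximation, these four inputs combine into a single closed-form estimate for a two-group comparison (G*Power refines this with the noncentral t distribution, but the approximation makes the trade-offs explicit):

$$
n_{\text{per group}} \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}, \qquad d = \frac{|\mu_1 - \mu_2|}{SD}
$$

With $\alpha = 0.05$ and Power $= 0.80$, the numerator constant is $2(1.96 + 0.84)^2 \approx 15.7$, so $n \approx 15.7 / d^2$: halving the effect size quadruples the required N.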
2. Practical Example: Using G*Power (Fibrosis Area Comparison)
Let's walk through a standard "T-test (2 independent groups)" scenario using G*Power, a widely utilized free statistical software.
[The Scenario] You are testing a novel compound in a CCl4 Liver Fibrosis model, evaluating Sirius Red stained fibrosis area (%). From historical CRO data, the Vehicle group averages a 10.0% fibrosis area with a Standard Deviation (SD) of 2.5%. You hypothesize your drug will reduce fibrosis by 30%, resulting in a mean of 7.0%. Question: How many mice per group are required to detect this difference with α=0.05 and Power=0.80?
Step 1: Calculate the Effect Size (Cohen's d)
Cohen’s $d$ is the difference between means divided by the pooled standard deviation.
- Difference = 10.0 - 7.0 = 3.0
- SD = 2.5
- Effect Size (d) = 3.0 / 2.5 = 1.2
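The effect-size step is easy to script so it can be reused across scenarios. A minimal sketch, assuming two equal-sized groups (the function name `cohens_d` is illustrative; with equal SDs the pooled SD reduces to the common SD of 2.5):

```python
import math

def cohens_d(mean1, mean2, sd1, sd2):
    """Cohen's d using the pooled SD of two equal-sized groups."""
    pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)
    return abs(mean1 - mean2) / pooled_sd

# Scenario values: both groups share the historical SD of 2.5%
d = cohens_d(10.0, 7.0, 2.5, 2.5)
print(d)  # -> 1.2
```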
Step 2: Input into G*Power
- Test family: t tests
- Statistical test: Means: Difference between two independent means (two groups)
- Type of power analysis: A priori: Compute required sample size
- Tail(s): Two (two-sided test)
- Effect size d: 1.2
- α err prob: 0.05
- Power (1 − β err prob): 0.80
- Allocation ratio N2/N1: 1 (equal group sizes)
Step 3: Interpret the Result
- Click 'Calculate'. The Output will show Total sample size = 24, meaning you need exactly 12 mice per group.
[!WARNING] Accounting for Attrition (Dropouts) G*Power tells you how many mice you need at the end of the study for the final analysis. Because animal models inherently carry mortality risks (e.g., Bleomycin IT administration has a ~10-20% mortality rate), always add 10–20% extra animals to your starting cohort. For this scenario, you should start with N=14 per group.
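For readers who prefer to script this calculation (or batch it across many scenarios), the G*Power numbers above can be reproduced with SciPy's noncentral t distribution, which is the same model G*Power solves. A sketch under that assumption; `n_per_group` is an illustrative helper, not a G*Power API:

```python
import math
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.80):
    """Smallest N per group for a two-sided, two-sample t-test,
    found by stepping n until the noncentral-t power reaches the target."""
    n = 2
    while True:
        df = 2 * n - 2
        nc = d * math.sqrt(n / 2)                 # noncentrality parameter
        t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
        achieved = (stats.nct.sf(t_crit, df, nc)
                    + stats.nct.cdf(-t_crit, df, nc))
        if achieved >= power:
            return n
        n += 1

n_final = n_per_group(1.2)             # -> 12 per group (24 total), matching G*Power
n_start = math.ceil(n_final * 1.15)    # ~15% attrition buffer -> 14 per group
print(n_final, n_start)
```

The attrition buffer is applied at the end: the solver returns the N needed for the final analysis, and the starting cohort is inflated from there.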
3. Recommended N Numbers Based on Model Variability (SD)
The necessary sample size fluctuates wildly depending on the intrinsic "noisiness" of the chosen model.
① Low Variability Models (e.g., CCl4 Liver Fibrosis)
- Characteristics: Because chemical toxicity is applied uniformly via injection to genetically identical mice, the resulting fibrosis is highly uniform and consistent.
- Typical N Requirement: Due to the low SD and resultant high effect size (often d of 1.3 or higher), studies can successfully achieve 80% power with relatively small numbers, typically N=8 to 10 per group.
② High Variability Models (e.g., Dietary MASH Models, Bleomycin Lung Fibrosis)
- Characteristics: Dietary models (like GAN or CDAHFD) rely on individual feeding behaviors and metabolic responses, leading to highly heterogeneous staging (e.g., a mix of F2 and F3 within the same vehicle group). Bleomycin models suffer from variable surgical instillation precision.
- Typical N Requirement: The "noise" heavily dilutes the effect size (often dropping 'd' to 0.8–1.0). At those effect sizes, 80% power demands roughly 17 (d = 1.0) to 26 (d = 0.8) mice per group, which is why these models frequently require N=15 or more per group to robustly detect therapeutic efficacy.
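To see how sharply the required N climbs as a model gets noisier, the same a priori calculation can be swept across effect sizes. A sketch assuming the statsmodels package is available (its `TTestIndPower.solve_power` mirrors G*Power's a priori mode and returns a fractional per-group N):

```python
import math
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d in (1.2, 1.0, 0.8):
    # Fractional N from the solver, rounded up to whole mice per group
    n = math.ceil(solver.solve_power(effect_size=d, alpha=0.05,
                                     power=0.80, alternative='two-sided'))
    print(f"d = {d}: N = {n} per group")
```

The sweep reproduces the familiar benchmarks: 12 per group at d = 1.2, rising to 17 at d = 1.0 and 26 at d = 0.8.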
4. 3Rs Compliance and Ethical Reductions
If ethical or budgetary constraints demand a "Reduction" in N numbers without sacrificing statistical power, consider these advanced strategies:
- Upgrade to Continuous Quantitative Endpoints: Avoid relying purely on categorical, semi-quantitative pathologist scores (e.g., 0, 1, 2, 3, 4). Using AI pathology or ImageJ to generate continuous data (e.g., a 14.3% area fraction) removes the coarse granularity of ordinal scoring and boosts statistical power, allowing you to use fewer animals.
- Stringent Baseline Stratification: Before starting drug treatment, stratify the randomization on precise baseline biomarkers (e.g., matched body weights or pre-bleed ALT levels). Eliminating baseline bias shrinks the within-group SD, which directly reduces the required N.
- Harness ANOVA with Multiple Doses: Instead of relying solely on a massive 'Vehicle vs. High Dose' t-test, employing a thoughtful 'Low, Mid, High' dose-response design analyzed via ANOVA can sometimes yield better overall confidence in the drug's mechanism of action using fewer animals per specific arm.
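One way to sanity-check the "fewer animals per arm" claim is an a priori calculation for the ANOVA omnibus test. A sketch using SciPy's noncentral F distribution; the helper name `anova_n_per_group` and the choice of Cohen's f = 0.6 (the two-group equivalent of d = 1.2, since f = d/2) are illustrative assumptions:

```python
import math
from scipy import stats

def anova_n_per_group(f, k_groups, alpha=0.05, power=0.80):
    """Smallest N per group for a one-way ANOVA omnibus test,
    given Cohen's f, using the noncentral F distribution."""
    n = 2
    while True:
        total = n * k_groups
        df1, df2 = k_groups - 1, total - k_groups
        f_crit = stats.f.ppf(1 - alpha, df1, df2)
        nc = f * f * total                 # noncentrality = f^2 * N_total
        if stats.ncf.sf(f_crit, df1, df2, nc) >= power:
            return n
        n += 1

print(anova_n_per_group(0.6, 2))   # -> 12, identical to the two-group t-test
print(anova_n_per_group(0.6, 4))   # 4-arm dose-response: fewer mice per arm
```

For two groups the F-test is mathematically equivalent to the two-sided t-test, so the function reproduces the N=12 answer; adding dose arms spreads the same statistical question across more groups with fewer animals in each.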
ARRIVE 2.0 (Animal Research: Reporting of In Vivo Experiments 2.0) compliance (Percie du Sert N, et al. PLoS Biol. 2020;18(7):e3000410) is now the international standard for preclinical study design and reporting. Sample size justification (power calculation) should be documented under Item 2 (Study Design) of the ARRIVE 2.0 checklist.
Summary
The era of arbitrarily selecting "N=10 because that's what we usually do" is over. Modern IND submissions and high-impact peer-reviewed journals strictly require a priori sample size justifications to prove both scientific validity and ethical adherence to the 3Rs. By utilizing robust historical data and tools like G*Power, sponsors can design lean, highly powered studies that maximize the chances of confidently identifying the next blockbuster anti-fibrotic drug.
References
- Dell RB, et al. Sample size determination. ILAR J. 2002;43(4):207-213. PMID: 12391396
- Festing MFW, Altman DG. Guidelines for the Design and Statistical Analysis of Experiments Using Laboratory Animals. ILAR J. 2002;43(4):244-258. PMID: 12391400
- Percie du Sert N, et al. The ARRIVE guidelines 2.0: Updated guidelines for reporting animal research. PLoS Biol. 2020;18(7):e3000410. PMID: 32663219