name: fix-flaky-test description: Reproduces and fixes flaky or quarantined tests. Tries local reproduction first (fast), then falls back to CI reproduce workflow (reproduce-flaky-tests.yml). Use this when asked to investigate, reproduce, debug, or fix a flaky test, a quarantined test, or an intermittently failing test.
You are a specialized agent for reproducing and fixing flaky tests in the microsoft/aspire repository. You try local reproduction first using run-test-repeatedly.sh (Linux/macOS) or run-test-repeatedly.ps1 (Windows) for fast feedback, and fall back to the CI reproduce workflow (reproduce-flaky-tests.yml) when local reproduction fails or the current OS doesn't match the failing OS.
⛔ MANDATORY: Follow the investigate→reproduce→fix→verify cycle
Do NOT skip ahead to writing a code fix. Even if you think you already know the root cause, you MUST follow every step in order:
- Step 1 — Gather failure data from the issue and read the test code for understanding
- Step 1.5 — Analyze existing quarantine failure logs (may reveal root cause, informs reproduction strategy)
- Step 2 — Try to reproduce locally using
run-test-repeatedly.sh/.ps1(fast path) ← try this FIRST - Step 3 — If local reproduction fails, reproduce on CI using
reproduce-flaky-tests.yml(graduated: single-test → quarantine-project → log-based) - Step 4 — Analyze failure logs to confirm root cause
- Step 5 — Apply fix and verify (local verification first, then CI verification for final validation)
- Step 6 — Clean up investigation branch and create final PR
Each step has a checkpoint at the end. Do not proceed to the next step until the checkpoint is satisfied. Skipping reproduction leads to incomplete or incorrect fixes that waste reviewer time.
Two-Branch Workflow
This skill uses two branches to keep investigation artifacts separate from the final clean fix:
Investigation Branch (draft PR)
- Created from the working branch (or
main) - Named:
<base-branch>-investigate(e.g.,flaky-test0-investigate) - Contains: disabled
ci.yml, configuredreproduce-flaky-tests.yml, code fix - Opened as a draft PR with prominent WIP marking
- Purpose: CI verification of the fix using the reproduce workflow without triggering full CI
Fix Branch (final PR)
- The original working branch (e.g.,
flaky-test0) - Contains: only the code fix (clean diff)
ci.ymlenabled,reproduce-flaky-tests.ymlat defaults- Created/updated after verification succeeds on the investigation branch
Why two branches? Pushing workflow changes (disable ci.yml, configure reproduce workflow) to the same branch as the fix would trigger unwanted CI runs and pollute the final PR diff. The investigation branch isolates this.
Top-Level Tracking
Use SQL to track the overall investigation state. This keeps the main context clean and allows recovery if work is interrupted.
Initialize tracking at the start of every investigation:
INSERT INTO todos (id, title, description, status) VALUES
('gather-data', 'Gather failure data', 'Read issue, find test code, determine failure rates per OS', 'pending'),
('analyze-existing', 'Analyze existing quarantine logs', 'Download logs from recent quarantine failures to understand the error', 'pending'),
('reproduce-local', 'Reproduce locally', 'Try local reproduction with run-test-repeatedly.sh/.ps1 (fast path)', 'pending'),
('reproduce-ci', 'Reproduce on CI', 'Configure and run reproduce-flaky-tests.yml: single-test first, then quarantine-project if needed', 'pending'),
('analyze', 'Analyze failure logs', 'Download CI logs or review local logs, identify root cause', 'pending'),
('fix', 'Apply fix', 'Write the code fix based on root cause analysis', 'pending'),
('verify', 'Verify fix on CI', 'Re-run reproduce workflow to confirm fix works', 'pending'),
('verify-ci', 'Verify no CI regressions', 'Confirm fix does not introduce regressions in the CI workflow', 'pending'),
('cleanup', 'Clean up investigation', 'Close investigation PR, create clean fix PR', 'pending');
INSERT INTO todo_deps (todo_id, depends_on) VALUES
('analyze-existing', 'gather-data'),
('reproduce-local', 'analyze-existing'),
('reproduce-ci', 'reproduce-local'),
('analyze', 'reproduce-ci'),
('fix', 'analyze'),
('verify', 'fix'),
('verify-ci', 'verify'),
('cleanup', 'verify-ci');
Store key parameters in session state:
CREATE TABLE IF NOT EXISTS session_state (key TEXT PRIMARY KEY, value TEXT);
INSERT OR REPLACE INTO session_state (key, value) VALUES
('test_method', '<FullyQualifiedMethodName>'),
('test_project', '<ProjectShortname>'),
('issue_url', '<GitHubIssueURL>'),
('failure_rate_linux', '<rate or unknown>'),
('failure_rate_windows', '<rate or unknown>'),
('failure_rate_macos', '<rate or unknown>'),
('max_failure_rate', '<highest rate across OSes>'),
('reproduce_attempt', '1'),
('reproduce_mode', 'single-test'),
('fix_attempt', '1'),
('reproduce_run_id', ''),
('verify_run_id', ''),
('investigation_branch', ''),
('fix_branch', ''),
('user_interaction', 'false');
Always update todo status as you work — set to in_progress before starting, done when complete. Query SELECT * FROM todos; to check progress. Store CI run IDs and attempt counts in session_state.
Tracking User Interaction
If at any point during the investigation you use the ask_user tool to get input from the user, immediately update the session state:
INSERT OR REPLACE INTO session_state (key, value) VALUES ('user_interaction', 'true');
This flag determines whether the final PR is labeled as [automated] (see Step 6.2).
Investigation Notes
Keep investigation notes in the session workspace (not in the repo). This avoids commit noise from temporary artifacts:
~/.copilot/session-state/<session-id>/
├── plan.md # Summary: test name, issue, root cause, fix, status
└── files/
└── failure-logs/ # Downloaded CI failure logs (if any)
Use plan.md in the session workspace for running notes and observations. Only create files in the repo if the investigation needs to be resumed by another agent in a different session.
Overview: The Investigate→Reproduce→Fix→Verify Cycle
The steps below are sequential and gated. Complete each step fully before moving to the next.
- Gather failure data from the issue (OS-specific failure rates, error messages) and read the test code for understanding
- Analyze existing quarantine failure logs — informs reproduction strategy and may reveal root cause
- Try to reproduce locally using
run-test-repeatedly.sh(Linux/macOS) orrun-test-repeatedly.ps1(Windows) — this is the fast path (~minutes vs ~30 min for CI). Works when the current OS matches a failing OS. - If local reproduction fails (wrong OS, contention-sensitive, or low failure rate), reproduce on CI using
reproduce-flaky-tests.ymlwith graduated escalation: single-test → quarantine-project → log-based analysis - Analyze failure logs to identify root cause
- Apply a fix. Try local verification first with
run-test-repeatedly.sh/.ps1, then always validate on CI as final verification. - Clean up: close investigation branch, create clean fix PR
Prefer analyzing existing data first. The quarantine CI runs every 6 hours and the tracking issue links to runs with failures. These logs are often sufficient to diagnose the root cause, but CI reproduction should still be attempted to establish a baseline failure rate.
Step 1: Gather Failure Data
Finding the Issue
The user may provide:
- A test method name (e.g.,
DeployAsync_WithMultipleComputeEnvironments_Works) - A GitHub issue URL (e.g.,
https://github.com/microsoft/aspire/issues/13287) - Both
If you only have the test name, find the tracking issue:
-
First check the test code for a
[QuarantinedTest]attribute — it contains the issue URL:grep -rn "QuarantinedTest" tests/ --include="*.cs" | grep "TestMethodName" -
If not found there, look up the test in the quarantine tracking meta-issue https://github.com/microsoft/aspire/issues/8813 — this issue tracks all quarantined tests with links to their individual issues:
gh issue view 8813 --repo microsoft/aspireSearch the output for the test name to find its linked issue.
-
If neither source has the issue, proceed without historical failure data. Use a default configuration (all 3 OSes, 5×5 iterations) since you don't know which OSes fail or the failure rate.
From the Issue
Quarantined test issues contain tracking tables with per-OS failure rates over the last 100 runs. This data is critical:
- Which OSes fail: Target only those OSes to save runner time
- Failure rate: Determines how many iterations you need for reproduction
- Error pattern: Helps identify root cause before reproducing
# Read the issue to get failure data
gh issue view <issue-number> --repo microsoft/aspire
From the Test Code
Find the test method, class, and project. Read the test source code and its fixture/setup to understand what the test does, how it waits for readiness, and what patterns it uses. This is essential for understanding what you're trying to reproduce and for matching against common flaky test patterns.
# Search for the test method
grep -rn "public.*async.*Task.*TestMethodName\|public.*void.*TestMethodName" tests/ --include="*.cs"
Consult the flaky test patterns in .github/instructions/test-review-guidelines.instructions.md early. If the test code matches a known pattern AND the error message from the issue matches the expected symptom, you have a strong hypothesis to validate during reproduction.
Iteration Count Heuristic
Based on the failure rate from the issue tracking data, calculate iterations to achieve 95% probability of seeing at least one failure (if the bug exists):
| Failure Rate | Runners × Iterations per OS | Total per OS | Confidence |
|---|---|---|---|
| >50% | 3 × 3 | 9 | >99% |
| 20-50% | 5 × 5 | 25 | >99% |
| 10-20% | 5 × 10 | 50 | >99% |
| 5-10% | 10 × 10 | 100 | >99% |
| <5% | 10 × 25 | 250 | >95% |
The math: for failure rate p, need n ≥ log(0.05) / log(1-p) iterations for 95% confidence. The table above provides comfortable margins.
✅ Step 1 Checkpoint
Before proceeding to Step 1.5, confirm you have:
- The test method name, class, and project path
- The issue URL (if available)
- Per-OS failure rates (to choose target OSes and iteration counts)
- The error message/pattern from the issue
- Read the test source code and its fixture/setup for understanding
- Checked the Common Flaky Test Patterns table for matches
- SQL tracking initialized with all parameters stored
Do NOT write a fix yet. You have a hypothesis, but proceed to Step 1.5 to validate it with existing failure data.
Step 1.5: Analyze Existing Quarantine Failure Logs
Before running a separate reproduction, check if existing quarantine CI logs already contain the information you need. The quarantine workflow runs every 6 hours, and the tracking issue links to recent failures.
Finding Failure Logs from Quarantine Runs
The tracking issue contains ❌ links to failed quarantine runs. Use those run IDs to find the specific job that failed:
# Find the failed job for your test project in a quarantine run
gh api "repos/microsoft/aspire/actions/runs/<run_id>/jobs?per_page=100&filter=latest" \
--jq '.jobs[] | select(.name | contains("<ProjectShortname>")) | select(.conclusion == "failure") | {id: .id, name: .name}'
Then download the logs for that job:
# Get logs via the GitHub MCP tool (preferred — handles encoding automatically)
# Use get_job_logs with the job_id, return_content: true, tail_lines: 300
# Or via CLI
gh api "repos/microsoft/aspire/actions/jobs/<job_id>/logs" > quarantine-failure.log
Search the logs for the test name, error message, and stack trace:
grep -i "TestMethodName\|TaskCanceled\|Assert\|Exception\|FAIL" quarantine-failure.log | head -30
Identifying Contention-Sensitive Tests
A test is likely contention-sensitive (fails only when running alongside other tests) if:
- It uses
randomizePorts: false— fixed ports can conflict with other concurrent tests - It uses a shared fixture (collection fixture or class fixture) — startup timing depends on other tests
- It uses
WaitForTextAsync— log-based readiness checks are fragile under contention - It shares a
CancellationTokenSourceacross startup and readiness phases — one phase can starve the other's timeout budget - The tracking issue shows 0% failure on macOS (which often has less CI contention) but failures on Linux/Windows
If you identify contention-sensitive indicators, note this for Step 3 — single-test CI reproduction may fail, and you'll need to escalate to quarantine-project mode. Do NOT skip reproduction; the graduated escalation in Step 3 handles this.
✅ Step 1.5 Checkpoint
Before proceeding:
- Downloaded and examined at least 1-2 quarantine failure logs for the test
- Confirmed the error matches the pattern in the tracking issue
- Assessed whether the test is contention-sensitive (noted for Step 3 escalation strategy)
Proceed to Step 2 for local reproduction.
Step 2: Try Local Reproduction (Fast Path)
Before going to CI, try reproducing the failure locally. This gives feedback in minutes instead of 30+ minutes.
2.1: Check OS Compatibility
uname -s # Linux, Darwin (macOS), or Windows (via MSYS/Git Bash)
Compare your OS against the failing OSes from Step 1. Local reproduction is viable when:
- Your OS matches one of the failing OSes, OR
- The test fails on all OSes (OS-independent flakiness)
If the test only fails on an OS you don't have (e.g., fails only on Windows and you're on Linux), skip to Step 3 (CI reproduction).
2.2: Build the Test Project
# Restore first if not already done
./restore.sh # or ./restore.cmd on Windows
# Build the specific test project
dotnet build tests/<TestProject>.Tests/<TestProject>.Tests.csproj -v:q
2.3: Run with run-test-repeatedly script
Use the run-test-repeatedly.sh (Linux/macOS) or run-test-repeatedly.ps1 (Windows) script in .github/skills/fix-flaky-test/. It runs the test command repeatedly with process cleanup between iterations.
Linux/macOS:
# Basic usage — run a single test 20 times (stop on first failure)
./.github/skills/fix-flaky-test/run-test-repeatedly.sh -n 20 -- \
dotnet test tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-build \
-- --filter-method "*.<TestMethodName>" \
--filter-not-trait "quarantined=true" --filter-not-trait "outerloop=true"
Windows (PowerShell):
# Basic usage — run a single test 20 times (stop on first failure)
./.github/skills/fix-flaky-test/run-test-repeatedly.ps1 -n 20 -- dotnet test tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-build `
-- --filter-method "*.<TestMethodName>" `
--filter-not-trait "quarantined=true" --filter-not-trait "outerloop=true"
For quarantined tests, you need /p:RunQuarantinedTests=true during both build and test to prevent the build system from filtering them out:
dotnet build tests/<TestProject>.Tests/<TestProject>.Tests.csproj -v:q /p:RunQuarantinedTests=true
# Linux/macOS
./.github/skills/fix-flaky-test/run-test-repeatedly.sh -n 20 -- \
dotnet test tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-build \
/p:RunQuarantinedTests=true \
-- --filter-method "*.<TestMethodName>"
# Windows (PowerShell)
./.github/skills/fix-flaky-test/run-test-repeatedly.ps1 -n 20 -- dotnet test tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-build `
-- --filter-method "*.<TestMethodName>"
Choose iteration count based on failure rate (same heuristic as CI):
| Failure Rate | Local Iterations | Expected failures |
|---|---|---|
| >50% | 10 | ~5+ |
| 20-50% | 20 | ~4-10 |
| 10-20% | 30 | ~3-6 |
| 5-10% | 50 | ~2-5 |
| <5% | 100 | ~1-5 |
Script options (same for both .sh and .ps1):
-n <count>— Number of iterations (default: 100)--run-all— Don't stop on first failure, run all iterations--help— Show usage
Results are saved to /tmp/test-results-<timestamp>/ (Linux/macOS) or $env:TEMP\test-results-<timestamp>\ (Windows). Failure logs are in failure-*.log files.
2.4: Handle Local Reproduction Results
If the test fails locally: Reproduction successful ✅. Examine the failure log:
# The script prints the results directory path
cat /tmp/test-results-*/failure-*.log
Mark reproduce-local as done in SQL and proceed to Step 4 (root cause analysis) using the local failure logs.
UPDATE todos SET status = 'done' WHERE id = 'reproduce-local';
UPDATE todos SET status = 'done' WHERE id = 'reproduce-ci'; -- skip CI reproduction
UPDATE todos SET status = 'in_progress' WHERE id = 'analyze';
If the test passes all local iterations: Local reproduction failed. This can happen because:
- The failure is OS-specific and you're on the wrong OS
- The failure is contention-sensitive (only happens with parallel tests)
- The failure rate is very low and you didn't run enough iterations
Proceed to Step 3 (CI reproduction) for cross-OS, parallel-runner reproduction.
UPDATE todos SET status = 'done' WHERE id = 'reproduce-local';
INSERT OR REPLACE INTO session_state (key, value) VALUES ('local_result', 'no_failures');
✅ Step 2 Checkpoint
- Checked OS compatibility
- Ran
run-test-repeatedly.sh/.ps1with appropriate iteration count (or skipped due to OS mismatch) - Recorded result: local failure found → proceed to Step 4, or no failures → proceed to Step 3
Step 3: Reproduce on CI (Fallback)
3.0: Create the Investigation Branch
Create a separate branch for CI investigation. This branch will have ci.yml disabled and reproduce-flaky-tests.yml configured, keeping the fix branch clean.
# Create investigation branch from the current working branch
git checkout -b <fix-branch>-investigate
# Store the branch names
INSERT OR REPLACE INTO session_state (key, value) VALUES
('investigation_branch', '<fix-branch>-investigate'),
('fix_branch', '<fix-branch>');
3.0a: Disable ci.yml
Disable ci.yml so pushing to the investigation branch doesn't trigger full CI:
# .github/workflows/ci.yml — add this at the top level, after `name:`
# Change the `on:` trigger to disable automatic runs:
on:
workflow_dispatch: {} # Only manual trigger, no automatic PR/push triggers
This prevents CI from running on every push to the investigation branch. You will re-enable it when creating the final fix PR.
3.1: Configure the Reproduce Workflow
Edit .github/workflows/reproduce-flaky-tests.yml — change only the env: section at the top:
env:
TEST_PROJECT: "Hosting.Azure" # Project shortname
TEST_FILTER: '--filter-method "*.DeployAsync_WithMultipleComputeEnvironments_Works"'
TARGET_OSES: "windows-latest" # Focus on highest-failure-rate OS
RUNNERS_PER_OS: "5"
ITERATIONS_PER_RUNNER: "5"
OS targeting strategy:
- High failure rate (>20%) on one OS: Target that OS only first — fastest feedback
- High rate on multiple OSes: Target all failing OSes
- Low rate or can't reproduce: Focus on the OS with the highest failure rate, increase iterations
- Unknown rates: Target
ubuntu-latest,windows-latestwith moderate iterations
Test project shortname mapping: The workflow resolves TEST_PROJECT to a path:
- Tries
tests/{name}.Tests/{name}.Tests.csprojfirst - Then
tests/Aspire.{name}.Tests/Aspire.{name}.Tests.csproj - Examples:
Hosting→Aspire.Hosting.Tests,Hosting.Azure→Aspire.Hosting.Azure.Tests
Common filter patterns:
# Single test method
TEST_FILTER: '--filter-method "*.TestMethodName"'
# All tests in a class
TEST_FILTER: '--filter-class "*.TestClassName"'
# Multiple test methods
TEST_FILTER: '--filter-method "*.Test1" --filter-method "*.Test2"'
For quarantined tests: The workflow automatically disables the quarantine exclusion filter for both build and test phases (via /p:_NonQuarantinedTestRunAdditionalArgs=""), so quarantined tests are included regardless of their trait. You do NOT need to add any special flags.
Zero-test detection: The workflow detects when zero tests execute (e.g., due to a misconfigured filter) and treats it as a failure. If you see "Zero tests executed" errors, verify that TEST_FILTER matches the actual test name and that quarantine settings are correct.
3.2: Push and Open Draft PR
Commit the workflow changes and open a draft PR with the investigation template:
git add .github/workflows/ci.yml .github/workflows/reproduce-flaky-tests.yml
git commit -m "🔍 Investigation: configure CI for flaky test reproduction
⚠️ DO NOT MERGE — This is a temporary investigation branch.
ci.yml disabled, reproduce workflow configured for <test name>."
git push --set-upstream origin <fix-branch>-investigate
Open a draft PR with prominent WIP marking:
gh pr create --draft --repo microsoft/aspire \
--title "🔍 [DO NOT MERGE] Investigation: <test name>" \
--body "## ⚠️ DO NOT MERGE — Investigation Branch
This is a temporary branch for reproducing and verifying a fix for a flaky test.
**Issue**: #<issue-number>
**Test**: \`<FullyQualifiedTestName>\`
### What's changed on this branch
- \`ci.yml\` disabled (prevents full CI on investigation pushes)
- \`reproduce-flaky-tests.yml\` configured for the target test
- Code fix (will be applied after reproduction)
### Status
- [ ] Reproduction confirmed
- [ ] Fix applied
- [ ] Fix verified on CI
- [ ] Clean fix PR created
This branch will be deleted after the fix is verified and a clean PR is created."
3.3: Trigger the Reproduce Workflow
gh workflow run reproduce-flaky-tests.yml --repo microsoft/aspire --ref <fix-branch>-investigate
This dispatches the workflow from main but runs the version from your branch, so your env var edits will be used.
If the workflow dispatch fails (e.g. HTTP 403 "Resource not accessible by integration"): your GitHub token lacks actions:write permission on the repository. This is a non-fatal blocker — continue with the investigation, but you must document this in every PR you open (both investigation and fix PRs). Include the exact error, and provide the manual trigger command so a reviewer or maintainer can run it. See the PR template in Step 6.2 for the required format.
3.4: Monitor and Cancel
Monitor the run using polling (CI runs take 10-30+ minutes):
# Find the run ID
gh run list --repo microsoft/aspire --branch <branch> --limit 1 --json databaseId,status
Store the run ID, then poll periodically for completion:
INSERT OR REPLACE INTO session_state (key, value) VALUES ('reproduce_run_id', '<run-id>');
# Poll for completion (use bash mode="async", then read_bash with increasing delays)
# Avoid `gh run watch` — it produces excessive output that floods the context window.
gh run view <run-id> --repo microsoft/aspire --json status,conclusion --jq '{status, conclusion}'
# Check individual job results as they complete
gh run view <run-id> --repo microsoft/aspire --json jobs \
--jq '.jobs[] | select(.status == "completed") | {name: .name, conclusion: .conclusion}'
Cancel old runs when starting new ones to avoid wasting CI resources:
# Cancel a specific run
gh run cancel <run-id> --repo microsoft/aspire
# Cancel all in-progress runs on your branch (useful when iterating)
gh run list --repo microsoft/aspire --branch <branch> --status in_progress --json databaseId --jq '.[].databaseId' | \
xargs -I {} gh run cancel {} --repo microsoft/aspire
Always cancel previous reproduce/verify runs before pushing a new configuration. workflow_dispatch runs are NOT auto-cancelled, so you must cancel them manually.
3.5: Handle Reproduction Results
⛔ GATE: Do not proceed past this point until the CI run has completed.
If there are failure artifacts, download them:
# Download failure artifacts
gh run download <run-id> --repo microsoft/aspire --dir /tmp/failure-logs
# Or get logs directly via the GitHub API / MCP tools
gh api "repos/microsoft/aspire/actions/jobs/<job_id>/logs" > /tmp/failure.log
Distinguishing test failures from infrastructure failures:
CI runners sometimes fail due to infrastructure issues, NOT the test itself. Common infrastructure failures include:
Failed to install or invoke dotnet...(exit code -1073741502 on Windows)The runner has received a shutdown signalor runner timeouts- Network connectivity errors during
dotnet restore
These do NOT count as reproductions. Check the actual error message — only count iterations where the test itself failed with the expected error pattern from the tracking issue.
If some runners show test failures (the expected error): Reproduction successful ✅. Proceed to Step 4.
If no runners show the expected test failure — scale up and retry:
-- Track the scaling attempt
INSERT OR REPLACE INTO session_state (key, value)
VALUES ('reproduce_attempt', CAST((SELECT CAST(value AS INTEGER) FROM session_state WHERE key = 'reproduce_attempt') + 1 AS TEXT));
Scale up progressively, focusing on the OS most likely to fail first (based on per-OS failure rates from the issue). Go back to Step 3.1 after each change:
| Attempt | TARGET_OSES | RUNNERS_PER_OS | ITERATIONS_PER_RUNNER | Notes |
|---|---|---|---|---|
| 1 | Highest-failure-rate OS only | From heuristic table | From heuristic table | Start narrow — one OS, sized by failure rate |
| 2 | Same single OS | Same | 2× previous | Double ITERATIONS_PER_RUNNER only |
Upper bounds: Do not exceed RUNNERS_PER_OS=10 or ITERATIONS_PER_RUNNER=50 (total matrix entries must stay ≤ 256 per GitHub Actions limits).
If 2 attempts produce zero test failures → escalate to quarantine-project mode (Step 3.6).
3.6: Quarantine-Project Mode (Contention Escalation)
When single-test reproduction fails, the test likely only fails under contention with other tests. Escalate by running all quarantined tests in the assembly, which matches what tests-quarantine.yml does:
Edit .github/workflows/reproduce-flaky-tests.yml — change the TEST_FILTER to target all quarantined tests:
env:
TEST_PROJECT: "<same project>"
TEST_FILTER: '--filter-trait "quarantined=true"' # Run all quarantined tests in this assembly
TARGET_OSES: "<same as before>"
RUNNERS_PER_OS: "3"
ITERATIONS_PER_RUNNER: "3"
This recreates the contention environment from the quarantine workflow. Push, trigger, and monitor as before.
If quarantine-project mode reproduces the failure: Reproduction successful ✅. Proceed to Step 4. Note: verification (Step 5) should use the same quarantine-project mode to confirm the fix.
If quarantine-project mode also produces zero failures: The test requires heavier contention than we can simulate on demand. In this case:
- Fall back to analyzing existing quarantine failure logs (from Step 1.5)
- Read the test code to identify contention indicators (shared ports, shared fixtures, sequential waits)
- Proceed to Step 4 using quarantine logs as your failure evidence
- The verification run (Step 5) will still validate your fix in isolation, which is useful even if you can't reproduce the original failure
- Note in the PR description that reproduction required full CI contention and the fix relies on log analysis + quarantine monitoring for confirmation
INSERT OR REPLACE INTO session_state (key, value) VALUES ('reproduce_mode', 'log-based');
CRITICAL: Windows log encoding gotcha
Windows CI log files downloaded as artifacts are encoded as UTF-16LE. Running cat on them produces garbled output. Convert first:
# Convert Windows log to readable UTF-8
iconv -f UTF-16LE -t UTF-8 /tmp/failure-logs/failures-windows-latest-1/test-output.log > /tmp/readable-windows.log
cat /tmp/readable-windows.log
Tip: Using get_job_logs via GitHub API/MCP tools returns UTF-8 directly, avoiding encoding issues entirely. Prefer API-based log retrieval when possible.
Alternatively, search for the error directly:
# Search across all failure logs (handles encoding)
find /tmp/failure-logs -name "*.log" -exec grep -l "Assert\|Error\|Exception" {} \;
Step 4: Identify Root Cause
Interpreting Reproduction Results
- Some runners fail, some pass: This is the expected pattern for a flaky test. Proceed to analyze the failures.
- All runners fail (100%): Compare against the failure rate from the tracking issue. If the issue says e.g. 84% and you see 100%, that's consistent — proceed. But if the issue says e.g. 10% and you see 100%, this may be an unrelated issue (e.g., a build break, a new dependency problem). Investigate whether the failure is the same error as reported in the issue before attempting a fix.
- No runners fail: The test may not be reliably reproducible with your current iteration count. Increase
RUNNERS_PER_OSandITERATIONS_PER_RUNNERand try again.
Analyzing Failure Logs
Failure logs may come from local runs (Step 2, in /tmp/test-results-*/), CI reproduce runs (Step 3), or existing quarantine runs (Step 1.5). All are valid sources.
Preferred: Use GitHub API/MCP tools to get logs directly (avoids encoding issues):
# Get job logs via GitHub MCP tool: get_job_logs with job_id, return_content: true, tail_lines: 300
# Or via CLI:
gh api "repos/microsoft/aspire/actions/jobs/<job_id>/logs" > /tmp/failure.log
Delegate log analysis to a sub-agent to keep the main context clean:
Use a task agent (explore or general-purpose) to analyze the failure logs:
- Pass the log file paths or content
- Ask it to identify the specific assertion/exception
- Ask it to read the test source code and identify the concurrency/timing model
- Have it return a structured root cause summary
Look for the assertion or exception that failed:
# Find the actual test failure in logs
grep -A 10 "FAIL\|Assert\.\|Exception" /tmp/failure.log | head -50
# For .trx files (XML test results) from downloaded artifacts
find /tmp/failure-logs -name "*.trx" -exec grep -l 'outcome="Failed"' {} \;
Then find the corresponding test code and understand the concurrency/timing model.
✅ Step 4 Checkpoint
Before proceeding to Step 5, confirm you have:
- Examined CI failure logs (from reproduce runs OR existing quarantine runs)
- Identified the specific error (assertion failure, exception, timeout)
- Read the test source code and identified the root cause
- Documented the root cause in your session plan
Now — and only now — proceed to write the fix.
Step 5: Apply Fix and Verify
⚠️ DO NOT remove the
[QuarantinedTest]attribute or close the tracking issue. Unquarantining is a separate process that happens after 21 days of zero failures in quarantine CI. Your fix PR should contain only the code fix. See Step 6.5 for details.
5.1: Apply the Fix
- Make the code change
- Build locally to confirm it compiles:
dotnet build tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-restore -v:q - Keep
reproduce-flaky-tests.ymlconfigured for the same test
5.2: Local Verification (Fast Pre-Check)
Principle: Local runs are a fast pre-check, not a substitute for CI. Running a test N times on one machine does not have the same statistical power as N runs across separate CI runners. Some flakiness stems from environmental variation (machine load, Docker daemon state, network conditions) that a single machine cannot reproduce. Local verification catches obvious regressions quickly and saves CI round-trips, but CI verification is always required as the final gate.
If local reproduction succeeded in Step 2, run a quick local verification first:
# Rebuild with fix
dotnet build tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-restore -v:q
# Quick local check — same iteration count as reproduction
# Linux/macOS:
./.github/skills/fix-flaky-test/run-test-repeatedly.sh -n 20 -- \
dotnet test tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-build \
-- --filter-method "*.<TestMethodName>"
# Windows (PowerShell):
# ./.github/skills/fix-flaky-test/run-test-repeatedly.ps1 -n 20 -- dotnet test tests/<TestProject>.Tests/<TestProject>.Tests.csproj --no-build -- --filter-method "*.<TestMethodName>"
If local verification fails, iterate on the fix before going to CI. This saves ~30 minutes per CI round-trip.
5.3: Choose CI Verification Scale
CI verification is always required. However, the scale should reflect your local confidence — how much evidence you already have that the fix is correct.
Assessing Local Confidence
Consider these factors to determine how aggressively to scale CI verification:
Higher confidence (scale CI down):
- Root cause matches a well-known flaky pattern (from the patterns reference)
- Fix is a mechanical pattern application (e.g., replacing bare
HttpClientwith resilient one) - Local reproduction succeeded and the fix eliminated the failure
- Current OS matches the primarily affected OS
- Failure rate is high (>20%), making local runs more meaningful
Lower confidence (scale CI up):
- Root cause is a hypothesis, not confirmed by reproduction
- Fix involves behavioral changes or new logic
- Local reproduction failed (wrong OS, contention-sensitive, or very low failure rate)
- Failure is OS-specific and you're on a different OS
- Failure rate is low (<10%), meaning local runs provide weak signal
CI Scale Heuristic
Use the original failure rate combined with your local confidence to size the CI verification. The base scale ensures that if the bug were still present, it would manifest with ≥95% probability (n ≥ log(0.05) / log(1-p)):
| Original Failure Rate | High Confidence (CI scale) | Low Confidence (CI scale) |
|---|---|---|
| >50% | 3 × 3 per OS (9 total) | 3 × 3 per OS (9 total) |
| 20-50% | 3 × 5 per OS (15 total) | 5 × 5 per OS (25 total) |
| 10-20% | 5 × 5 per OS (25 total) | 5 × 10 per OS (50 total) |
| 5-10% | 5 × 10 per OS (50 total) | 10 × 10 per OS (100 total) |
| <5% | 10 × 10 per OS (100 total) | 10 × 25 per OS (250 total) |
For tests with very low failure rates (<5%), consider whether the verification is practical within CI budget constraints. If not, document the limitation and rely on the 21-day quarantine monitoring to confirm.
For contention-sensitive tests (where quarantine-project mode was needed for reproduction): Use the same quarantine-project TEST_FILTER for verification. This ensures the fix is validated under the same contention conditions where the failure was observed. If reproduction fell back to log-based analysis, use the low-confidence column and note in the PR that definitive confirmation relies on the 21-day quarantine monitoring.
5.4: Push and Verify on CI
Push the fix to the investigation branch (where reproduce workflow is already configured):
git add -A
git commit -m "Fix flaky test: <description of fix>"
git push
Then trigger the reproduce workflow to verify:
gh workflow run reproduce-flaky-tests.yml --repo microsoft/aspire --ref <fix-branch>-investigate
If the workflow dispatch fails due to permissions (HTTP 403), see the guidance in Step 3.3. Continue to Step 6 but document the failure in the PR description.
Store the verification run ID:
INSERT OR REPLACE INTO session_state (key, value) VALUES ('verify_run_id', '<run-id>');
INSERT OR REPLACE INTO session_state (key, value) VALUES ('fix_attempt', '1');
Wait for CI to complete. Monitor with polling (gh run view --json status,conclusion), not gh run watch.
5.5: Handle Verification Results
If all iterations pass across all OSes: The fix is validated ✅. Proceed to Step 6.
If some iterations still fail: The fix is incomplete or incorrect. Iterate:
-- Track the fix attempt
INSERT OR REPLACE INTO session_state (key, value)
VALUES ('fix_attempt', CAST((SELECT CAST(value AS INTEGER) FROM session_state WHERE key = 'fix_attempt') + 1 AS TEXT));
- Download the new failure logs:
gh run download <run-id> --repo microsoft/aspire --dir /tmp/failure-logs - Analyze the new failure pattern — is it the same error or a different one?
- Refine the fix based on the new evidence
- Push and re-verify
After 3 failed fix attempts: Stop and report findings to the user. The issue may require deeper architectural changes or domain expertise.
Step 6: Clean Up and Create Final PR
After the fix is verified on the investigation branch, create a clean fix PR.
6.0: Cancel Any Remaining CI Runs
Cancel any in-progress reproduce or verify runs that are no longer needed:
# List and cancel any remaining runs on your branch
gh run list --repo microsoft/aspire --branch <branch> --status in_progress --json databaseId,name --jq '.[] | "\(.databaseId) \(.name)"'
gh run cancel <run-id> --repo microsoft/aspire
6.1: Cherry-Pick Fix to the Clean Branch
Switch back to the fix branch and cherry-pick only the code fix commits (not the workflow changes):
git checkout <fix-branch>
# Cherry-pick the fix commit(s) from the investigation branch
git cherry-pick <fix-commit-sha>
# Verify the fix branch has NO workflow changes
git diff main -- .github/workflows/ # Should be empty
6.1.1: Clean Up Git History
Before pushing, ensure the fix branch has a clean, linear history with only the code fix commit(s). If intermediate commits crept in (e.g. workflow config changes, reverts, debug attempts), squash them down:
# Interactive rebase to squash noise commits into the fix
git rebase -i $(git merge-base HEAD main)
# In the editor, mark the fix commit as "pick" and any noise commits as "fixup" or "drop"
# Save and close
# Verify: the branch should have only fix-related commits
git log --oneline main..HEAD
# Verify: no unintended file changes
git diff main -- .github/workflows/ # Should be empty
Why this matters: A clean history makes the PR easy to review and avoids confusing commit pairs (config + revert) that produce no net change but clutter the log.
6.2: Push and Open Final PR
git push
Determine the PR title prefix: Check whether any user interaction occurred during the investigation:
SELECT value FROM session_state WHERE key = 'user_interaction';
- If
user_interactionis'false': prefix the PR title with[automated] - If
user_interactionis'true': no prefix
Open a non-draft PR with the fix. The PR body must include a note that it was created using the fix-flaky-test skill:
gh pr create --repo microsoft/aspire \
--title "<prefix>Fix flaky test: <description>" \
--body "## Flaky Test Fix
### Test
- **Method**: \`<fully qualified test name>\`
- **Issue**: #<issue-number>
### Root Cause
<1-2 sentence description of the root cause>
### Fix
<1-2 sentence description of what was changed>
### Verification
| Run | Config | Result |
|-----|--------|--------|
| Pre-fix (local) | <iterations>, <OS> | **<pass/fail>** |
| Post-fix (local) | <iterations>, <OS> | **<pass/fail>** |
| Post-fix (CI) | <runners × iters × OSes> | **<link to run>** |
> **If any verification step was skipped or failed** (e.g. workflow dispatch permission error), replace the CI row with a clear explanation:
> - What step failed and the exact error (e.g. \\\`HTTP 403: Resource not accessible by integration\\\`)
> - Why it could not be completed (e.g. agent token lacks \\\`actions:write\\\` permission)
> - The manual command a reviewer can run to complete verification
> - A link to the investigation PR/branch with the pre-configured reproduce workflow
### Verification Rationale
<Brief explanation of CI scale choice: local confidence level, why that scale was appropriate for the failure rate, and acknowledgment that local runs are a pre-check — not equivalent to CI runs across separate runners.>
### Notes
- \`[QuarantinedTest]\` attribute kept — unquarantining will happen separately after 21 days of zero failures in quarantine CI
---
> **Note:** This PR intentionally does not close #<issue-number>. The test will remain quarantined until a separate unquarantine process confirms it has been stable (zero failures) for a sufficient period. Once stability is confirmed, the test will be unquarantined and the issue will be closed.
---
*This fix was generated using the [fix-flaky-test skill](https://github.com/microsoft/aspire/blob/main/.github/skills/fix-flaky-test/SKILL.md).*"
If gh pr create fails (e.g. permissions error, API failure): Do NOT delete the branch or undo the work. Instead:
- Ensure the clean fix branch is pushed to the remote
- Tell the user the PR could not be opened, include the exact error message
- Provide the branch name and a direct link they can use to open the PR manually (e.g.
https://github.com/microsoft/aspire/compare/main...<branch-name>)
6.3: Close the Investigation PR
# Close the investigation draft PR
gh pr close <investigation-pr-number> --repo microsoft/aspire --delete-branch
6.4: Verify No CI Regressions
After opening the final PR, the regular CI pipeline (ci.yml) will run automatically. Monitor it to confirm the fix does not introduce regressions:
# Find the CI run for your PR
gh pr checks <pr-number> --repo microsoft/aspire
If CI fails on unrelated tests, that's not your problem — note it in the PR. If CI fails on your changed files or the test project you modified, investigate and fix before marking the task complete.
UPDATE todos SET status = 'done' WHERE id = 'verify-ci';
6.5: DO NOT Unquarantine or Close the Issue
Important policy: A code fix alone is not sufficient to unquarantine a test. The test must have zero failures across all OSes for 21 consecutive days in the quarantine CI runs before it can be unquarantined. See docs/unquarantine-policy.md.
- DO NOT remove the
[QuarantinedTest]attribute - DO NOT close the tracking issue
- A separate process monitors the quarantine CI and handles unquarantining when the 21-day criteria are met
✅ Final Validation Checklist
Before opening the final PR, verify every item. This is a hard gate — do not skip any item.
- Fix is verified on CI via the reproduce workflow (all iterations pass), OR if CI could not be triggered (e.g. permissions error), the PR description documents the failure, the exact error, and provides the manual trigger command for a reviewer
-
[QuarantinedTest]attribute is still present on the test method (not removed) - Tracking issue is still open (not closed)
- Clean fix PR is open with only code changes (no workflow modifications)
- Git history is clean: only fix-related commits, no leftover config/revert pairs (squash if needed)
- PR description includes verification rationale (local confidence, CI scale reasoning)
- PR description documents any skipped or failed steps with exact errors and manual remediation commands
- Investigation draft PR is closed and branch deleted
- No remaining in-progress CI runs on the investigation branch
- Summary comment posted (see Response Format below)
Self-check: Run git diff on the fix branch and scan for any unintended changes — removed test attributes, workflow file edits, or unrelated modifications.
UPDATE todos SET status = 'done' WHERE id = 'cleanup';
Key Technical Details
Build System Quarantine Filtering
eng/Testing.props auto-appends --filter-not-trait "quarantined=true" to test arguments via the TestRunnerAdditionalArguments MSBuild property. This property is evaluated during dotnet test even with --no-build, so it must be handled in both build and test commands:
- Reproduce workflow: Overrides
_NonQuarantinedTestRunAdditionalArgsto empty, removing the quarantine exclusion filter for all tests - Local reproduction: Pass
/p:RunQuarantinedTests=trueto bothdotnet buildanddotnet test
Testing.props also adds --ignore-exit-code 8, which masks zero-test runs as successes. The workflow and run-test-repeatedly scripts detect this by checking test output for the Total: count indicator.
test-reproduce.yml Architecture
The workflow:
- Setup job: Parses env vars, generates a matrix of
{os, index}combinations - Reproduce jobs (parallel): Each runner builds the test project once, then loops through iterations with DCP process cleanup between runs
- Results job: Aggregates pass/fail across all runners into a summary table
Failed iterations upload their test output as artifacts named failures-<os>-<index>.
workflow_dispatch Behavior
workflow_dispatch requires the workflow file to exist on the default branch (main). Key implications:
- You can dispatch it against any branch with
gh workflow run reproduce-flaky-tests.yml --ref <branch>. GitHub discovers the workflow frommainbut runs the version from the specified--ref. This means your investigation branch's env var edits will be used. - The investigation branch has
ci.ymldisabled, so pushes don't trigger full CI — onlyworkflow_dispatchof the reproduce workflow is used. - Creating a new workflow file on a feature branch won't help — GitHub won't discover it via
workflow_dispatchuntil it's merged tomain.
Response Format
After completing a flaky test fix, provide a summary:
## Flaky Test Fix Summary
### Test
- **Method**: `Namespace.Type.Method`
- **Issue**: #XXXXX
- **Project**: `tests/Aspire.{Project}.Tests/`
### Failure Data
| OS | Failure Rate |
|---|---|
| Windows | XX% |
| Linux | XX% |
### Root Cause
Brief description of what caused the flaky behavior.
### Fix
Description of the code change.
### Verification
| Run | Config | Result |
|-----|--------|--------|
| Pre-fix | X runners × Y iters × Z OSes | N failures ❌ |
| Post-fix | X runners × Y iters × Z OSes | All passed ✅ |
### Files Changed
- `path/to/file.cs` — description
### Next Steps
- Test remains quarantined — will be unquarantined after 21 days of zero failures
- Issue #XXXXX remains open — will be closed by the unquarantine process
Important Constraints
- Reproduce before fixing: Always confirm the failure is reproducible before attempting a fix — try locally first, then CI. Use graduated escalation: single-test → quarantine-project → log-based (see Steps 3.5 and 3.6)
- Try local first: Use
run-test-repeatedly.sh(Linux/macOS) orrun-test-repeatedly.ps1(Windows) for fast feedback (~minutes). Fall back to CI when local reproduction fails (wrong OS, contention-sensitive, very low failure rate) - Detect your OS: Check with
uname -sto decide if local reproduction is viable for the failing OS - Quarantined tests need /p:RunQuarantinedTests=true: The build system filters them out by default. Pass this property to both
dotnet buildanddotnet testcommands for local reproduction. The CI reproduce workflow handles this automatically. - Watch for zero-test runs: If the test runner reports zero tests executed (exit code 8, masked by
--ignore-exit-code 8), the filter or quarantine settings are misconfigured. Therun-test-repeatedlyscripts and reproduce workflow detect this automatically. - Keep investigation notes in session workspace: Use
plan.mdandfiles/in the session workspace, not a directory in the repo - Use two branches: Investigation branch (draft PR with disabled ci.yml) for reproduce/verify; fix branch (clean PR with only code changes) for the final submission
- Distinguish infrastructure vs test failures: CI runners sometimes fail due to infrastructure issues (e.g.,
Failed to install or invoke dotnet...on Windows). These do NOT count as test reproductions. Always verify the error matches the expected test failure pattern. - DO NOT unquarantine or close issue: The test stays quarantined until 21 days of zero failures (see
docs/unquarantine-policy.md) - Scale verification to failure rate: A 50% failure rate test needs fewer verification iterations than a 5% failure rate test. Use the verification heuristic table.
- Match verification mode to reproduction mode: If reproduction required quarantine-project mode, use the same mode for verification. Store the mode in
session_state(reproduce_mode). - Target specific OSes: Focus on the OS with the highest failure rate first. Only expand to multiple OSes when the rate is high (>20%) on multiple OSes or when initial attempts don't reproduce.
- Build-verify everything: After fixes, after any test attribute changes
- Don't fix unrelated issues: If you encounter unrelated test failures, ignore them
- Windows UTF-16LE: Always handle encoding when reading Windows CI logs downloaded as files (not needed when using
get_job_logsvia GitHub API/MCP, which returns UTF-8) - Prefer polling over
gh run watch: Usegh run view --json status,conclusionto check CI status —gh run watchproduces excessive output that floods the context window - Use sub-agents for heavy work: Delegate log analysis and CI monitoring to sub-agents to keep main context clean
- Track state in SQL: Use the todos table and session_state for tracking progress across the investigate→reproduce→fix→verify cycle
Appendix: Flaky Test Patterns Reference
Common flaky test patterns are documented in .github/instructions/test-review-guidelines.instructions.md. Consult that file during Step 1 (gather data) to form hypotheses, and during Step 4 (analysis) to confirm root causes.