Failure-Aware LLM Jobs for Long Reports

Jun 21, 2026

AI, Architecture, LLM, Reliability, Workflow Design

A practical architecture playbook for long LLM report workflows: split the work into checkpointed sections, classify failures, retry safely, and preserve useful output instead of burning tokens in a broken loop.

Long LLM report generation is not the same as calling a normal API endpoint.

With a normal API call, a retry often makes sense. A network request may fail because of a timeout, rate limit, temporary overload, or transient provider issue. Wait for a short cooldown, retry the call, and the system may recover.

LLM workflows have another category of failure: the request may complete, tokens may be spent, and the workflow may still fail because the output cannot be used.

That changes the architecture.

Problem

Imagine a paid or credit-based report workflow.

A user clicks Generate Report. The backend collects source data, builds a prompt, calls an LLM, receives output, parses it, formats sections, stores the final report, and marks the job as complete.

The user only sees one result: either the report was delivered or it was not.

But internally, many things can fail after tokens are already consumed:

the output is malformed
the parser cannot map the output to the expected schema
the prompt no longer matches the parser contract
source data is missing
one section is too large for the model context
the final report save fails
report formatting fails after generation
a specific report type is unsupported by the current workflow

If the system treats all of these as simple retry problems, it can spend money repeatedly while still failing to deliver value.

Common mistake

The common mistake is building the report as one large operation:

User request
  -> Build one huge prompt
  -> Call the LLM once
  -> Parse one huge response
  -> Save one final report

Then, when the call fails, the retry logic says:

Call failed? Retry later.

That is useful for transient failures. It is not enough for structural failures.

A retry cannot fix a broken schema, missing source data, unsupported workflow, or prompt/parser mismatch. It only repeats the same expensive mistake.

Why it fails in production

The failure mode becomes painful because report generation has business state, not just technical state.

The system has to answer questions like:

Did the user spend credits?
Did the provider consume tokens?
Did any report sections complete successfully?
Is the failure temporary or structural?
Should the user be allowed to retry?
Should this report type be disabled until reviewed?
Should a credit be reversed?
Should an admin inspect the prompt, parser, model, or source data?

If these states are not explicit, the system can drift into a messy middle:

users keep retrying failed jobs
completed sections are lost
retries burn more tokens
support cannot explain what happened
refunds or credit reversals become manual
broken report types continue accepting new jobs

The problem is not only reliability. It is also cost control, trust, and operational control.

Better architecture

The better architecture is to treat report generation as a checkpointed job, not one big request.

Each report is split into sections. Each section has its own status, retry count, failure classification, output, and checkpoint.

The job runner can then resume from completed work instead of starting from zero.

Report job
  -> Prepare source data
  -> Plan sections
  -> Generate section 1
  -> Save checkpoint
  -> Generate section 2
  -> Save checkpoint
  -> Generate section 3
  -> Save checkpoint
  -> Aggregate final report

This lets the system preserve useful output and make better decisions when something fails.

Suggested flow

A practical flow looks like this:

User requests report
  -> Create report job
  -> Reserve or hold credits
  -> Validate source data
  -> Build section plan
  -> Queue section jobs
  -> Generate each section
  -> Store completed sections
  -> Classify failures
  -> Retry only retryable failures
  -> Lock structural failures for review
  -> Aggregate final report
  -> Commit credits when delivered

The key design decision is failure classification.

At minimum, classify failures into these groups:

| Failure type | Example | System behavior |
| --- | --- | --- |
| Transient | timeout, rate limit, provider overload | retry with cooldown |
| Contract | malformed JSON, schema mismatch, parser failure | stop and mark for review |
| Input | missing source data, invalid user profile | stop and ask for corrected data |
| Workflow | unsupported report type, broken configuration | disable or lock that workflow |
| Persistence | database save failed after generation | retry save or resume from checkpoint |
| Cost-sensitive | repeated paid job failure | stop retries and trigger credit review |

This is what makes the job failure-aware.
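One way to turn the table into code is a small classifier that maps exceptions to a failure kind. The sketch below is only an illustration under assumptions: `ExceptionFailureClassifier`, `MissingSourceDataException`, and `SectionPersistenceException` are hypothetical names, and the exception-to-kind mapping is a guess at what a real provider client and parser might throw. Unknown exceptions are deliberately mapped to a non-retryable kind so they end up in review instead of a retry loop.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.Json;

public enum FailureKind
{
    None,
    Transient,
    Contract,
    MissingInput,
    UnsupportedWorkflow,
    Persistence
}

// Hypothetical exception types for this sketch; a real system would define its own.
public sealed class MissingSourceDataException : Exception
{
    public MissingSourceDataException(string message) : base(message) { }
}

public sealed class SectionPersistenceException : Exception
{
    public SectionPersistenceException(string message, Exception inner)
        : base(message, inner) { }
}

public static class ExceptionFailureClassifier
{
    public static FailureKind Classify(Exception ex) => ex switch
    {
        // Provider timeouts, rate limits, and overload are worth another attempt.
        TimeoutException => FailureKind.Transient,
        HttpRequestException { StatusCode: HttpStatusCode.TooManyRequests } => FailureKind.Transient,
        HttpRequestException { StatusCode: HttpStatusCode.ServiceUnavailable } => FailureKind.Transient,
        // No status code: assumed here to be a network-level failure.
        HttpRequestException { StatusCode: null } => FailureKind.Transient,

        // The call completed but the output cannot be parsed.
        JsonException => FailureKind.Contract,

        MissingSourceDataException => FailureKind.MissingInput,
        NotSupportedException => FailureKind.UnsupportedWorkflow,
        SectionPersistenceException => FailureKind.Persistence,

        // Assumption: anything unrecognized is treated as non-retryable and sent to review.
        _ => FailureKind.Contract
    };
}
```

The ordering matters only in that the catch-all arm comes last. The useful property is that retryability is decided in one place, so adding a new provider error means touching one switch rather than every call site.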

C# example

This is a simplified C# example. It is not a full framework. The point is the shape: classify failure, save checkpoints, and resume safely.

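using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// IReportJobStore, ILlmReportClient, IFailureClassifier, and the job type returned by
// the store are assumed to be defined elsewhere in the application.
public enum ReportJobStatus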
{
    Pending,
    Running,
    WaitingForRetry,
    NeedsReview,
    Completed,
    Failed
}

public enum SectionStatus
{
    Pending,
    Running,
    Completed,
    WaitingForRetry,
    NeedsReview,
    Failed
}

public enum FailureKind
{
    None,
    Transient,
    Contract,
    MissingInput,
    UnsupportedWorkflow,
    Persistence
}

public sealed record ReportSection(
    string Key,
    SectionStatus Status,
    int Attempts,
    string? Output,
    FailureKind LastFailure,
    DateTimeOffset? RetryAfterUtc
);

public sealed class ReportJobRunner
{
    private readonly IReportJobStore _store;
    private readonly ILlmReportClient _llm;
    private readonly IFailureClassifier _failureClassifier;

    public ReportJobRunner(
        IReportJobStore store,
        ILlmReportClient llm,
        IFailureClassifier failureClassifier)
    {
        _store = store;
        _llm = llm;
        _failureClassifier = failureClassifier;
    }

    public async Task RunAsync(Guid jobId, CancellationToken cancellationToken)
    {
        var job = await _store.GetAsync(jobId, cancellationToken);

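        // Only pending sections and sections whose cooldown has expired are run,
        // so completed sections are skipped when the job resumes.
        foreach (var section in job.Sections.Where(CanRun))
        {
            await _store.MarkSectionRunningAsync(jobId, section.Key, cancellationToken);

            try
            {
                var output = await _llm.GenerateSectionAsync(
                    job.SourceData,
                    section.Key,
                    cancellationToken);

                // Checkpoint: persist this section before moving to the next one.
                await _store.SaveSectionOutputAsync(
                    jobId,
                    section.Key,
                    output,
                    cancellationToken);
            }
            catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
            {
                // Caller cancellation or shutdown is not a section failure.
                // Let it propagate instead of classifying it and writing with a cancelled token.
                throw;
            }
            catch (Exception ex)
            {
                var failure = _failureClassifier.Classify(ex);

                // Assumes section.Attempts counts the attempts made before this run.
                if (failure == FailureKind.Transient && section.Attempts < 3)
                {
                    await _store.MarkSectionForRetryAsync(
                        jobId,
                        section.Key,
                        failure,
                        retryAfterUtc: DateTimeOffset.UtcNow.AddMinutes(5),
                        cancellationToken);

                    // Keep going: later sections may still succeed and be checkpointed.
                    continue;
                }

                // Structural failure, or retries exhausted: stop the loop and wait for review.
                await _store.MarkSectionNeedsReviewAsync(
                    jobId,
                    section.Key,
                    failure,
                    cancellationToken);

                return;
            }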
        }

        var refreshed = await _store.GetAsync(jobId, cancellationToken);
        if (refreshed.Sections.All(x => x.Status == SectionStatus.Completed))
        {
            await _store.AggregateAndCompleteAsync(jobId, cancellationToken);
        }
    }

    private static bool CanRun(ReportSection section)
    {
        if (section.Status == SectionStatus.Pending)
            return true;

        if (section.Status == SectionStatus.WaitingForRetry &&
            section.RetryAfterUtc <= DateTimeOffset.UtcNow)
            return true;

        return false;
    }
}

The important part is not the exact class design. The important part is that every section has state, and the system can resume from that state.

Storage/checkpoint strategy

A checkpoint should store enough information to avoid repeating completed work.

For the report job:

job id
user id or account id
report type
source data version or snapshot reference
credit/payment state
overall status
created and updated timestamps

For each section:

section key
prompt version
model configuration
status
attempts
generated output
parser result if applicable
last failure kind
retry-after time
review notes if an admin needs to inspect it
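The two lists above could map onto a pair of immutable records. This is a minimal sketch, not a schema recommendation: the type names, the property names, and the choice of plain strings for statuses and model configuration are all assumptions made for illustration.

```csharp
#nullable enable
using System;

// Job-level checkpoint: enough to know who owns the job and where it stands.
public sealed record ReportJobCheckpoint(
    Guid JobId,
    string AccountId,
    string ReportType,
    string SourceDataSnapshotRef,   // version or snapshot reference, not the data itself
    string CreditState,             // e.g. "reserved", "committed", "reversed"
    string Status,
    DateTimeOffset CreatedUtc,
    DateTimeOffset UpdatedUtc);

// Section-level checkpoint: enough to skip completed work and explain failures.
public sealed record SectionCheckpoint(
    string SectionKey,
    string PromptVersion,
    string ModelConfiguration,
    string Status,
    int Attempts,
    string? GeneratedOutput,
    string? ParserResult,
    string? LastFailureKind,
    DateTimeOffset? RetryAfterUtc,
    string? ReviewNotes);
```

Records fit here because a checkpoint update can be written as a `with` expression that produces a new value, for example `section with { Status = "Completed", GeneratedOutput = text }`, which leaves the previous checkpoint untouched until the store accepts the new one.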

For paid workflows, credit state should be explicit. A useful pattern is:

requested -> reserved -> committed
requested -> reserved -> reversed

Do not silently charge the user before the report is deliverable unless the product rules clearly allow that.
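The two paths above form a small state machine, and it can help to reject any transition that is not on one of them. The following sketch assumes hypothetical names (`CreditState`, `CreditTransitions`) and encodes only the three transitions listed; a real billing system would likely have more states.

```csharp
using System;

public enum CreditState
{
    Requested,
    Reserved,
    Committed,
    Reversed
}

public static class CreditTransitions
{
    // Only requested -> reserved, reserved -> committed, and reserved -> reversed are allowed.
    public static bool IsAllowed(CreditState from, CreditState to) => (from, to) switch
    {
        (CreditState.Requested, CreditState.Reserved) => true,
        (CreditState.Reserved, CreditState.Committed) => true,
        (CreditState.Reserved, CreditState.Reversed) => true,
        _ => false
    };

    // Returns the new state, or throws if the move would skip or undo a step.
    public static CreditState Move(CreditState from, CreditState to)
    {
        if (!IsAllowed(from, to))
        {
            throw new InvalidOperationException(
                $"Credit transition {from} -> {to} is not allowed.");
        }

        return to;
    }
}
```

With this in place, committing credits for a job that never reserved them, or reversing credits that were already committed, fails loudly instead of silently corrupting the ledger.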

Retry and resume behavior

Retry should depend on the failure kind.

Transient failures can retry with cooldown:

provider timeout
rate limit
temporary overload
network issue
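The C# example earlier uses a fixed five-minute cooldown. A common alternative is a capped exponential cooldown, where each attempt waits twice as long as the previous one up to a maximum. The helper below is a sketch under that assumption; `RetryCooldown` and its parameters are hypothetical, and production systems often add random jitter on top so that many jobs do not retry at the same instant.

```csharp
using System;

public static class RetryCooldown
{
    // attempt is 1-based: attempt 1 waits baseDelay, attempt 2 waits 2x, attempt 3 waits 4x,
    // and the result never exceeds maxDelay.
    public static TimeSpan ForAttempt(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 1)
            throw new ArgumentOutOfRangeException(nameof(attempt), "Attempt numbers start at 1.");
        if (baseDelay <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(baseDelay), "Base delay must be positive.");
        if (maxDelay < baseDelay)
            throw new ArgumentOutOfRangeException(nameof(maxDelay), "Max delay must be at least the base delay.");

        // Double arithmetic avoids integer overflow for large attempt numbers;
        // the cap is applied before converting back to a TimeSpan.
        var seconds = baseDelay.TotalSeconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromSeconds(Math.Min(seconds, maxDelay.TotalSeconds));
    }
}
```

A caller would pass the result to whatever sets the retry-after time, for example `DateTimeOffset.UtcNow + RetryCooldown.ForAttempt(attempts + 1, TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(30))`.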

Structural failures should stop the loop:

malformed output after repeated attempts
schema mismatch
parser contract failure
missing source data
unsupported report type
wrong model or prompt configuration

Resume behavior should start from the last good checkpoint:

Section 1 completed
Section 2 completed
Section 3 failed with transient error

Next run:
  skip section 1
  skip section 2
  retry section 3

If the failure is structural, the job should move to NeedsReview instead of retrying forever.
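The job-level status can be derived from the section statuses rather than tracked separately. The sketch below shows one possible roll-up; `JobStatusRollup` is a hypothetical helper, the two enums repeat the ones from the C# example so the block stands alone, and the precedence order (failed, then review, then completed, then waiting) is an assumption, not a rule from the design above.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum SectionStatus { Pending, Running, Completed, WaitingForRetry, NeedsReview, Failed }

public enum ReportJobStatus { Pending, Running, WaitingForRetry, NeedsReview, Completed, Failed }

public static class JobStatusRollup
{
    public static ReportJobStatus From(IReadOnlyCollection<SectionStatus> sections)
    {
        if (sections.Count == 0)
            return ReportJobStatus.Pending;

        // A single section that failed or needs review blocks the whole job.
        if (sections.Any(s => s == SectionStatus.Failed))
            return ReportJobStatus.Failed;
        if (sections.Any(s => s == SectionStatus.NeedsReview))
            return ReportJobStatus.NeedsReview;

        // Assumption: all sections completed means the job is ready to aggregate and complete.
        if (sections.All(s => s == SectionStatus.Completed))
            return ReportJobStatus.Completed;

        // Nothing is blocked, but at least one section is cooling down.
        if (sections.Any(s => s == SectionStatus.WaitingForRetry))
            return ReportJobStatus.WaitingForRetry;

        if (sections.All(s => s == SectionStatus.Pending))
            return ReportJobStatus.Pending;

        return ReportJobStatus.Running;
    }
}
```

For the resume example above, two completed sections plus one waiting for retry roll up to `WaitingForRetry`, while any section in `NeedsReview` moves the whole job to `NeedsReview` regardless of how many sections completed.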

Tradeoffs

This design adds more moving parts.

You need job state, section state, retry policy, failure classification, and checkpoint storage. That is more work than one controller action calling an LLM once.

But the tradeoff is usually worth it when the workflow has any of these characteristics:

paid reports
long-running generation
multiple report sections
expensive LLM calls
user-visible delivery expectations
support or refund handling
output that can be partially useful

For small free features, a simple request/response flow may be enough. For serious report generation, checkpointed jobs make the system easier to operate.

Production checklist

Before shipping a long LLM report workflow, I would check:

Is the report split into independently generated sections?
Is source data validated before tokens are spent?
Is credit/payment state explicit?
Are completed sections stored before moving to the next one?
Can the job resume from completed sections?
Are transient and structural failures classified differently?
Is there a maximum retry count?
Is there a cooldown for provider failures?
Can a report type be disabled if it fails repeatedly?
Can an admin inspect failed jobs?
Are prompt version and model configuration stored with the job?
Is malformed output treated as a contract failure, not a blind retry?
Does the user see a clear status instead of a generic failure?
Is there a path for credit reversal or manual review?

LinkedIn short version

For long LLM reports, retry alone is not enough.

A timeout or rate limit can be retried. A malformed response, parser mismatch, missing source data, or unsupported workflow should not be retried forever.

The safer pattern is:

Split the report into sections.
Store each completed section.
Classify the failure.
Retry only retryable failures.
Resume from checkpoints.
Lock structural failures for review.

This protects tokens, user trust, and operational control.

LLM reliability is not just about retrying harder. It is about knowing which failures deserve another attempt and which failures mean the system should stop, preserve state, and ask for review.