Mace4 Checkpoint and Resume

Mace4 can save its search state to disk (checkpoint) and later resume from that saved state. This is useful for:

  • Long-running model searches where you want insurance against crashes or machine reboots.
  • Splitting a long run across multiple sessions.
  • Examining intermediate search state.
  • Moving a run to a different machine.

Unlike Prover9 (which saves clause lists and index state), Mace4 uses a lightweight decision-trail approach. The checkpoint saves only the DPLL decision trail, the current domain size, model count, and a self-contained input file. Typical checkpoint size is a few kilobytes regardless of problem size.

Checkpointing works in both LADR mode and TPTP mode.

Creating a Checkpoint

While Mace4 is running, send it the SIGUSR2 signal:

kill -USR2 <pid>

Mace4 will write a checkpoint directory and print a confirmation to stderr:

Checkpoint written: mace4_12345_ckpt_50000 (depth=25, selections=50000)

The directory name includes the process ID and selection count: mace4_<pid>_ckpt_<selections>/

If checkpoint_exit is set, Mace4 exits with code 107. Otherwise, it continues searching.

What the Checkpoint Contains

FileContents
metadata.txtDomain size, model count, search depth, stats
trail.txtDPLL decision trail (one line per depth level)
saved_input.txtSelf-contained LADR input (options + clauses)
options.txtHuman-readable option dump (not used on resume)

The trail.txt file has one line per search depth:

<depth> <cell_id> <value_index> <last> <max_constrained>

Example:

0 25 0 1 0
1 26 0 2 1
2 30 1 2 1

This says: at depth 0, cell 25 is trying value 0 (of 0..1). At depth 1, cell 26 is trying value 0 (of 0..2). And so on.

Resuming from a Checkpoint

To resume, use the -r flag with the checkpoint directory:

mace4 -r mace4_12345_ckpt_50000

The -r flag is all you need. The saved_input.txt inside the checkpoint directory contains the complete input, so you do not need to supply the original input file again.

You can override options on the command line:

mace4 -r mace4_12345_ckpt_50000 -t 7200 -N 30

On resume, Mace4 prints restoration messages to stderr:

Resuming from checkpoint: mace4_12345_ckpt_50000
Resume: domain_size=15, total_models=194972, depth=197, selections=390130

=== Mace4 starting on domain size 15. ===
Trail replay complete: 197 depths replayed

The search then continues exactly where it left off.

How It Works

On resume, Mace4 re-reads the input, rebuilds the cell table and ground clauses for the saved domain size (this is deterministic), then replays the saved assignments via assign_and_propagate(). This reconstructs the exact same search state without needing to serialize internal pointers or data structures.

Complete Example

1. Start a long-running model search:

mace4 -n 10 -N 20 -t 86400 -f hard_problem.p > run1.out 2> run1.err &

2. After some time, create a checkpoint:

kill -USR2 $!

3. If the machine needs to restart:

kill $!

4. Later, resume:

mace4 -r mace4_54321_ckpt_500000 -t 86400 > run2.out 2> run2.err &

5. You can checkpoint the resumed run too:

kill -USR2 $!

Checkpoint-and-Exit Workflow

Add set(checkpoint_exit) to your input file. After SIGUSR2, Mace4 writes the checkpoint and exits with code 107.

#!/bin/bash
mace4 -n 10 -N 20 < "$1" > run.out 2> run.err &
PID=$!
sleep 3600                    # run for 1 hour
kill -USR2 $PID               # request checkpoint
wait $PID
EXIT=$?
if [ $EXIT -eq 107 ]; then
    echo "Checkpointed successfully"
elif [ $EXIT -eq 0 ]; then
    echo "Model found before checkpoint"
else
    echo "Exited with code $EXIT"
fi

Periodic Automatic Checkpoints

Mace4 supports automatic periodic checkpoints with rotation, identical to Prover9:

assign(checkpoint_minutes, 30).   % checkpoint every 30 minutes
assign(checkpoint_keep, 3).       % keep last 3 auto checkpoints

Unlike Prover9, Mace4 does not enforce a minimum interval. SIGUSR2 manual checkpoints are independent of rotation and are never removed automatically.

Model Count Accumulation

The total model count is maintained across checkpoint/resume boundaries. For example:

Original run:  finds 194,972 models before checkpoint
Resumed run:   finds 327,719 more models in 5 seconds

Total reported: "Exiting with 522691 models."

The count in the "Exiting with" message is the cumulative total, not just the models found after resume.

LADR Mode vs TPTP Mode

Checkpointing works identically in both modes. The saved_input.txt is always in LADR format (TPTP formulas are clausified and saved as LADR clauses). On resume, Mace4 detects tptp_mode=1 in metadata.txt and sets up the symbol table automatically.

You do not need to do anything special — just use -r.

Options Reference

OptionDefaultDescription
set(checkpoint_exit)off Exit (code 107) after writing checkpoint.
assign(checkpoint_minutes, N)-1 (disabled) Periodic checkpoint interval in minutes (wall-clock).
assign(checkpoint_keep, K)3 Max auto checkpoint directories to retain (range 1–100). Does not affect SIGUSR2 manual checkpoints.

Exit Codes

CodeMeaning
0max_models reached (model found)
2Search exhausted (all domain sizes tried)
3All models found (max_models = -1, search exhausted)
4max_seconds exceeded (models found)
5max_seconds exceeded (no models found)
6max_megs exceeded (models found)
7max_megs exceeded (no models found)
101Interrupted (SIGINT / Ctrl-C)
107Checkpoint saved (with checkpoint_exit set)

Comparison with Prover9 Checkpointing

FeatureProver9Mace4
Search algorithmGiven-clause loopDPLL backtracking
What is savedClause lists, indexes, justifications, formulasDecision trail only
Checkpoint size~15% of RAM usageA few kilobytes
Resume mechanismLoad clauses + re-indexRebuild + replay trail
Input on resumeMust supply original -fSelf-contained (-r only)
DeterminismGiven/Kept counts matchExact replay of trail

Troubleshooting

"assign_and_propagate: contradictory assignment" during resume
The input or options changed between checkpoint and resume, causing the cell table to differ. Do not modify the input between checkpoint and resume. Command-line overrides like -t (time) and -N (end size) are safe.
Checkpoint directory not created after SIGUSR2
Mace4 checks for the signal between cell selections in the DPLL loop. If the search hasn't started yet, wait a moment and send SIGUSR2 again.
"cannot open .../saved_input.txt for resume"
The checkpoint directory path is incorrect or was moved/deleted. Verify the -r argument.
Resumed run finds fewer models than expected
The model count on the "Exiting with" line is cumulative. A shorter time limit on resume means fewer additional models, but the total includes all models from before the checkpoint.
Next Section: Mace4 Options