Mace4 Checkpoint and Resume
Mace4 can save its search state to disk (checkpoint) and later resume from that saved state. This is useful for:
- Long-running model searches where you want insurance against crashes or machine reboots.
- Splitting a long run across multiple sessions.
- Examining intermediate search state.
- Moving a run to a different machine.
Unlike Prover9 (which saves clause lists and index state), Mace4 uses a lightweight decision-trail approach. The checkpoint saves only the DPLL decision trail, the current domain size, model count, and a self-contained input file. Typical checkpoint size is a few kilobytes regardless of problem size.
Checkpointing works in both LADR mode and TPTP mode.
Creating a Checkpoint
While Mace4 is running, send it the SIGUSR2 signal:
kill -USR2 <pid>
Mace4 will write a checkpoint directory and print a confirmation to stderr:
Checkpoint written: mace4_12345_ckpt_50000 (depth=25, selections=50000)
The directory name includes the process ID and selection count: mace4_<pid>_ckpt_<selections>/
If checkpoint_exit is set, Mace4 exits with code 107. Otherwise, it continues searching.
What the Checkpoint Contains
| File | Contents |
|---|---|
| metadata.txt | Domain size, model count, search depth, stats |
| trail.txt | DPLL decision trail (one line per depth level) |
| saved_input.txt | Self-contained LADR input (options + clauses) |
| options.txt | Human-readable option dump (not used on resume) |
The trail.txt file has one line per search depth:
<depth> <cell_id> <value_index> <last> <max_constrained>
Example:
0 25 0 1 0 1 26 0 2 1 2 30 1 2 1
This says: at depth 0, cell 25 is trying value 0 (of 0..1). At depth 1, cell 26 is trying value 0 (of 0..2). And so on.
Resuming from a Checkpoint
To resume, use the -r flag with the checkpoint directory:
mace4 -r mace4_12345_ckpt_50000
The -r flag is all you need. The saved_input.txt inside the checkpoint directory contains the complete input, so you do not need to supply the original input file again.
You can override options on the command line:
mace4 -r mace4_12345_ckpt_50000 -t 7200 -N 30
On resume, Mace4 prints restoration messages to stderr:
Resuming from checkpoint: mace4_12345_ckpt_50000 Resume: domain_size=15, total_models=194972, depth=197, selections=390130 === Mace4 starting on domain size 15. === Trail replay complete: 197 depths replayed
The search then continues exactly where it left off.
How It Works
On resume, Mace4 re-reads the input, rebuilds the cell table and ground clauses for the saved domain size (this is deterministic), then replays the saved assignments via assign_and_propagate(). This reconstructs the exact same search state without needing to serialize internal pointers or data structures.
Complete Example
1. Start a long-running model search:
mace4 -n 10 -N 20 -t 86400 -f hard_problem.p > run1.out 2> run1.err &
2. After some time, create a checkpoint:
kill -USR2 $!
3. If the machine needs to restart:
kill $!
4. Later, resume:
mace4 -r mace4_54321_ckpt_500000 -t 86400 > run2.out 2> run2.err &
5. You can checkpoint the resumed run too:
kill -USR2 $!
Checkpoint-and-Exit Workflow
Add set(checkpoint_exit) to your input file. After SIGUSR2, Mace4 writes the checkpoint and exits with code 107.
#!/bin/bash
mace4 -n 10 -N 20 < "$1" > run.out 2> run.err &
PID=$!
sleep 3600 # run for 1 hour
kill -USR2 $PID # request checkpoint
wait $PID
EXIT=$?
if [ $EXIT -eq 107 ]; then
echo "Checkpointed successfully"
elif [ $EXIT -eq 0 ]; then
echo "Model found before checkpoint"
else
echo "Exited with code $EXIT"
fi
Periodic Automatic Checkpoints
Mace4 supports automatic periodic checkpoints with rotation, identical to Prover9:
assign(checkpoint_minutes, 30). % checkpoint every 30 minutes assign(checkpoint_keep, 3). % keep last 3 auto checkpoints
Unlike Prover9, Mace4 does not enforce a minimum interval. SIGUSR2 manual checkpoints are independent of rotation and are never removed automatically.
Model Count Accumulation
The total model count is maintained across checkpoint/resume boundaries. For example:
Original run: finds 194,972 models before checkpoint Resumed run: finds 327,719 more models in 5 seconds Total reported: "Exiting with 522691 models."
The count in the "Exiting with" message is the cumulative total, not just the models found after resume.
LADR Mode vs TPTP Mode
Checkpointing works identically in both modes. The saved_input.txt is always in LADR format (TPTP formulas are clausified and saved as LADR clauses). On resume, Mace4 detects tptp_mode=1 in metadata.txt and sets up the symbol table automatically.
You do not need to do anything special — just use -r.
Options Reference
| Option | Default | Description |
|---|---|---|
| set(checkpoint_exit) | off | Exit (code 107) after writing checkpoint. |
| assign(checkpoint_minutes, N) | -1 (disabled) | Periodic checkpoint interval in minutes (wall-clock). |
| assign(checkpoint_keep, K) | 3 | Max auto checkpoint directories to retain (range 1–100). Does not affect SIGUSR2 manual checkpoints. |
Exit Codes
| Code | Meaning |
|---|---|
| 0 | max_models reached (model found) |
| 2 | Search exhausted (all domain sizes tried) |
| 3 | All models found (max_models = -1, search exhausted) |
| 4 | max_seconds exceeded (models found) |
| 5 | max_seconds exceeded (no models found) |
| 6 | max_megs exceeded (models found) |
| 7 | max_megs exceeded (no models found) |
| 101 | Interrupted (SIGINT / Ctrl-C) |
| 107 | Checkpoint saved (with checkpoint_exit set) |
Comparison with Prover9 Checkpointing
| Feature | Prover9 | Mace4 |
|---|---|---|
| Search algorithm | Given-clause loop | DPLL backtracking |
| What is saved | Clause lists, indexes, justifications, formulas | Decision trail only |
| Checkpoint size | ~15% of RAM usage | A few kilobytes |
| Resume mechanism | Load clauses + re-index | Rebuild + replay trail |
| Input on resume | Must supply original -f | Self-contained (-r only) |
| Determinism | Given/Kept counts match | Exact replay of trail |
Troubleshooting
- "assign_and_propagate: contradictory assignment" during resume
- The input or options changed between checkpoint and resume, causing the cell table to differ. Do not modify the input between checkpoint and resume. Command-line overrides like -t (time) and -N (end size) are safe.
- Checkpoint directory not created after SIGUSR2
- Mace4 checks for the signal between cell selections in the DPLL loop. If the search hasn't started yet, wait a moment and send SIGUSR2 again.
- "cannot open .../saved_input.txt for resume"
- The checkpoint directory path is incorrect or was moved/deleted. Verify the -r argument.
- Resumed run finds fewer models than expected
- The model count on the "Exiting with" line is cumulative. A shorter time limit on resume means fewer additional models, but the total includes all models from before the checkpoint.