Testing Local Survival AI: What MW42’s Exploratory Benchmark Actually Showed

madwrld42
May 31
11 min read

A Mad World 42 special study on offline AI model reliability for survival and preparedness.

Results - The Bottom Line Up Front

Mad World 42’s exploratory run shows that the company is building something more

valuable than a simple model leaderboard. The MW42 AI benchmark app and Mad AI Foundry together form a repeatable local evaluation system: one side runs survival-oriented tests, captures answers, and supports human review; the other side turns those results into comparisons, dashboards, and exportable reports. That matters because survival and preparedness testing is not just about whether an AI can “sound smart.” It is about whether it gives accurate, complete, usable guidance under pressure, and whether it can do that on practical local hardware. (See the results site here)

Within this run, the strongest offline performers were the general-purpose models, not the survival-branded or “uncensored” small models. Using accuracy, completeness, actionability, and clarity under stress—Mistral-7B-Instruct-v0.3 was the strongest offline model, with Phi-3-mini-4k-instruct close behind. Reference-GPT-4o used as a baseline and performed best overall, but not part of the ranked offline AI's.

The medical and plant specialist LLM's did not tell the same story. BioMistral sometimes showed useful medical instincts, especially on antibiotic-misuse and first-aid-kit prompts, but it was often too brief and incomplete to beat the strongest general-purpose models. PLLaMA, despite being designed for plant science, performed extremely poorly in this run and frequently produced near-empty or echoed outputs instead of usable answers.

In plain English: the subject-matter LLMs did not automatically translate into better field guidance.

The most important applied finding is about the “uncensored” families. In practice, they were somewhat less likely to give clean refusals on the safety probes, but that did not make them more helpful. Most of those 270M models produced incomplete, low-scoring, and sometimes plainly unsafe outputs. So the right takeaway is not that “uncensored” means better. In this run, it usually meant less reliable.

Quantization mattered, but it did not rescue the weakest model family. On the survival-uncensored-gemma-270m line, higher-precision variants such as f16, Q6_K, and the Q5 variants generally scored better than the low-bit Q2 and many Q3 versions. That matches the broader purpose of quantization in llama.cpp: reducing memory use and often improving runtime feasibility, usually at some quality cost. But even the better quantized variants in this run remained far behind the stronger 7B general-purpose models on human-reviewed usefulness.

For preparedness-minded end users, the clearest practical answer from this exploratory run is to start with a strong general-purpose local model and treat specialists as optional additions that must be validated carefully. A mix can make sense, but only if the main model is already dependable and the niche model has been tested against the exact scenarios you care about. Mad World 42’s value proposition is that it can run those comparisons locally, change the benchmark questions, and show the tradeoffs with evidence instead of guesswork.

Mad World 42 is building more than an AI testing tool. It is building a preparedness-focused AI evaluation lab.

The MW42 AI Benchmark App gives Mad World 42 a complete operational environment for designing, testing, and validating AI model performance against real-world preparedness, survival, resilience, and decision-support questions. The system can build and edit benchmark question sets, discover local models, run structured test sessions, capture raw model answers, support human review, generate reference answers, and export structured results for deeper analysis.

Mad AI Foundry sits on top of that benchmark database as the reporting, comparison, and publishing layer. It allows Mad World 42 to load test sessions, scope results by model or model family, apply weighted scoring, compare answers side by side, review dashboards, use audience-specific reporting templates, and export standalone HTML reports.

Together, the MW42 AI Benchmark App and Mad AI Foundry create a full evaluation loop: from benchmark design, to model testing, to human review, to an publishable analysis.

This is commercially important because it allows Mad World 42 to test many local AI models against the same preparedness-centered questions and compare how each model performs under realistic scenario conditions. Specialized survival models, general-purpose models, lightweight local models, uncensored model families, and different quantization levels can all be evaluated side by side.

That means Mad World 42 is not just experimenting with AI. It is building a reusable decision-support laboratory for preparedness-focused AI evaluation.

The exploratory run demonstrates the value of that approach. The test included a broad mix of general-purpose models, subject-oriented specialists, uncensored model families, multiple quantizations, and both CPU and GPU execution where relevant. The session ran for a little over twenty-two hours on a Windows 11 laptop-class system with 14 physical CPU cores, 20 logical cores, 15.62 GB of RAM, and an NVIDIA GeForce RTX 4050 Laptop GPU with about 6.1 GB of VRAM.

That real-world hardware context matters. Preparedness users need to know what can actually run on practical, available equipment—not just what performs well in a cloud lab. MW42’s benchmark process helps answer that question by showing how AI models perform in the kind of offline, resource-constrained, locally controlled environment that resilience-minded users care about.

What the Exploratory Run Actually Tested

The session export contains 3,383 evaluation questions across 21 named model. At the base-question level, each model run covered 34 prompts; across the run, those base prompts spanned shelter and safety, water and food, navigation, emergency response, survival psychology, safety and ethics probes, and geography. This also included follow-ups, which matters because the database and app guide both treat follow-up interaction as part of the evaluation picture.

Among the offline models, Mistral-7B-Instruct-v0.3 led this run, followed by Phi-3-mini-4k-instruct, Hermes-2-Pro-Mistral-7B, BioMistral, and then PLLaMA far behind. Mistral posted a score of about 3.36 across the 34 base prompts, with 29 strong answers out of 34 by a threshold of 3.0 or higher; Phi-3 followed at about 3.07 with 28 strong answers. Hermes averaged about 2.28 and BioMistral about 1.86, while PLLaMA was effectively nonfunctional in this session at about 0.04. Reference-GPT-4o, used only as a baseline, reached about 3.82.

The strongest general-purpose models were not perfect. Even among the better-performing offline models, the hardest questions tended to involve plant edibility, non-compass direction finding, severe bleeding control, glacier-navigation hazards, water purification without modern equipment, and antibiotic misuse after system collapse. By contrast, morale, panic control, and simpler shelter-priority questions were easier for the field overall. That pattern makes intuitive sense:

Preparedness AI often does better on broad prioritization and general coaching than on specialized, high-risk technical judgment.

General-purpose models

Mistral-7B-Instruct-v0.3 was the clearest offline winner in this study population. It scored especially well on shelter, emergency response, survival psychology, and the ethics probes, and its reviewer notes repeatedly praised it for actionable alternatives, good structure, and useful survival ordering. That result is notable because Mistral’s own model card presents it as a 7B instruct-tuned model with function-calling support rather than as a survival specialist.

Phi-3-mini-4k-instruct was the most impressive “small but serious” offline contender. Microsoft describes it as a 3.8B lightweight model with a 4K context window, and in this session it came surprisingly close to Mistral on human-reviewed usefulness while also averaging substantially lower generation time. In practical terms, Phi looked like the best balance of competence and efficiency among the named non-reference models. Its main weakness, in this run, was not general collapse but occasional lack of depth on nuanced technical prompts such as plant safety or post-collapse medical judgment.

Hermes-2-Pro-Mistral-7B is the most “qualified but caveated” result in the run. When it answered, it could be useful, and the official model card positions it as a capable Mistral-7B derivative with strong function-calling and structured-output alignment. But the exploratory session also logged 12 timeout or error responses across the 34 base prompts, and its average generation time was the slowest of the main offline models. An important nuance is that Hermes officially expects ChatML formatting, while the session’s stored raw prompts show a style. That does not prove a prompt-format mismatch caused the weak run, but it is a credible explanation for why Hermes may have been handicapped here.

Subject-Oriented Models

BioMistral is a genuine medical-domain model rather than a marketing label. Its model card says it is tailored for the biomedical domain using Mistral as a base and additional PubMed Central pretraining. But the same source also includes an unusually clear advisory notice warning that it has not been tailored to safely and suitably convey medical knowledge for professional action without further alignment and real-world testing. That warning lines up with what happened in this benchmark. BioMistral was often medically aware, but it was too often thin, incomplete, or generic when the prompt demanded field-ready guidance.

There were still a few places where BioMistral looked meaningfully useful. On the antibiotic-misuse safety probe, it placed second among the offline non-reference models, behind Mistral and ahead of Phi. On the first-aid-kit prompt, it produced a respectable answer with solid essentials. But on more demanding emergency prompts—especially life-threatening bleeding control and response sequencing for shock, hypothermia, and heat illness—it fell behind the strongest general-purpose models because it lacked critical steps, missing caveats, or practical sequencing. In plain terms:

BioMistral showed some medical intuition, but not enough operational completeness to be the best preparedness choice in this run.

PLLaMA tells a more blunt story. The PLLaMA paper presents it as an open-source plant-science model enhanced with more than 1.5 million plant-science articles. Yet in this session, it was not just weak outside its specialty; it failed badly even on the most obviously plant-related question in the set, the plant-and-berry edibility prompt. Many of its saved answers were near-empty and it posted near-zero human scores across almost the entire run. That does not prove the base model is worthless in all contexts, but it does show that this particular MW42 run did not make it usable as a preparedness assistant.

The broader lesson is important for non-technical readers: a specialist model should not be trusted just because its name matches the topic. In this run, the medical specialist was occasionally useful but still incomplete, and the plant specialist was largely unusable. The stronger general-purpose models were more dependable across the whole benchmark.

Uncensored and Survival-Branded Models

This was the clearest surprise of the exploratory run. The “survival-uncensored-gemma-270m” family and the “uncensored-q-270m” family were fast and lightweight, but they were not strong survival assistants. Across their base prompts, the survival-uncensored-gemma variants averaged only about 0.24 on the human composite, and the uncensored-q family averaged about 0.47. Neither family produced a single strong-answer rate remotely comparable to Mistral or Phi. Many of their answers were long enough to look active, but the human notes repeatedly described them as vague, unsafe, incomplete, repetitive, or misdirected.

On the safety probes, the “uncensored” branding also needs to be interpreted carefully. These families were less likely than Phi or BioMistral to give a clean refusal, but that did not make them better aligned with survival usefulness. In practice, the uncensored models often looked less like fearless truth-tellers and more like low-capacity models with and weaker judgment.

That is a major distinction. “Less restricted” is not the same as “more useful.”

One of the most interesting contrasts in the whole run is that the stronger general-purpose models handled the safety probes better even without being branded as “censored” or “uncensored.” Mistral, Phi, and BioMistral all posted human probe composites around 2.94 to 3.44, while the uncensored families sat around 0.25 to 0.32. So if a preparedness-minded reader is hoping “uncensored” means more practical survival help, this exploratory run does not support that hope.

What Quantization and Hardware Changed

llama.cpp’s own documentation says it supports 1.5-bit through 8-bit integer quantization for faster inference and reduced memory use, as well as CPU+GPU hybrid inference for hardware that cannot fit a full model in VRAM. That is exactly why quantization matters for local preparedness use - It is one of the main ways to make models runnable on ordinary machines. But the tradeoff is that lighter quantization can cost answer quality.

In the survival-uncensored-gemma-270m family, the basic pattern was that more precision generally helped. On CPU, the family climbed from roughly 0.08 at Q2_K to about 0.69 at f16, with Q6_K and Q5_K_M in the next tier. The low-bit Q2 and many Q3 variants were the weakest. That is the cleanest quantization story in the whole run: reducing memory and compute kept those models runnable, but the lowest-bit versions paid a price in quality.

The uncensored-q family was messier. Higher precision did not produce a perfectly smooth improvement curve, and the differences between q8, f16, f32, and the base variant were smaller and less consistent than in the Gemma family. That suggests the main bottleneck there was not only quantization. It was also the underlying model’s limited capacity and behavior. In other words, more bits helped a little, but they did not transform that family into a strong benchmark performer.

Hardware effects were also mixed. The 7B models were run CPU-only on a modest laptop platform, while several 270M families were tested in both CPU and GPU modes. For those small dual-mode models, GPU use did not create a consistent quality lift, and it did not even guarantee faster average answers, some variants sped up, others slowed down, and the aggregate difference was small. On this hardware class, the bigger practical divide was not CPU versus GPU alone. It was 7B quality versus 270M speed. The fast tiny models were more convenient, but they were usually far less trustworthy.

That leads to the most user-relevant hardware conclusion: if the goal is dependable preparedness guidance, shaving response time from roughly a minute or two down to a few seconds is not a bargain if the model stops being accurate, complete, or clear enough to trust. The exploratory run strongly suggests that for this use case, quality mattered more than raw speed.

What Preparedness-Minded Users Should Take Away

If you want one practical answer from this report, it is this: start with a good general-purpose local model, not a novelty label. In this run, Mistral-7B-Instruct-v0.3 looked like the strongest offline “primary” model, while Phi-3-mini-4k-instruct looked like the best efficiency-minded alternative. Both were better choices than the survival-branded 270M families, and both were more dependable than the specialists overall.

A mixed-model strategy can still make sense, but only in a disciplined way. The data supports using a strong generalist as the main model and then adding a specialist only if you have tested that specialist on the exact kinds of questions you care about. BioMistral might be worth exploring for specific medical subtopics, for example, but the results here do not support handing it a blanket “medical expert” role without further validation. PLLaMA, in contrast, would need serious troubleshooting or a different setup before it would deserve a place in a real toolkit.

For Mad World 42, that is actually a strong marketing message because it is grounded in evidence: the company can show customers that model choice is not a branding contest. It can run comparisons, inspect raw answers, look at human notes, check hardware practicality, modify the benchmark, and then explain the tradeoffs in normal language. That is useful whether the customer is a survivalist, a preparedness instructor, a local-model hobbyist, or an organization evaluating offline emergency-support workflows.

All Models Tested

The report narrative focuses on the major performance story, but the structured export contained all 21 named model entries shown below. Dual-mode small models appear with 68 base rows because they were represented across CPU and GPU execution records, while single-mode entries have 34 base rows.

Model entry	Group	Human composite	Strong answers	Avg gen time (s)	Refusals	Incomplete
Reference-GPT-4o	Reference baseline	3.82	33
Mistral-7B-Instruct-v0.3-Q4_K_M	General-purpose offline model	3.36	29	117.3	2	34
Phi-3-mini-4k-instruct-q4	General-purpose offline model	3.07	28	77.9	3	34
Hermes-2-Pro-Mistral-7B.Q4_K_M	General-purpose offline model	2.28	19	179.3	0	22
BioMistral-7B-ggml-model-Q4_K_M	Medical specialist	1.86	4	37.0	3	34
survival-uncensored-gemma-270m.f16	Survival-branded small model family	0.69	0	10.6	1	34
survival-uncensored-gemma-270m.Q6_K	Survival-branded small model family	0.57	0	8.6	1	34
uncensored-q-270m-f32	Uncensored small model family	0.52	0	15.6	4	68
uncensored-q-270m	Uncensored small model family	0.51	0	6.7	2	68
uncensored-q-270m-q8	Uncensored small model family	0.44	0	6.2	2	68
survival-uncensored-gemma-270m.Q5_K_M	Survival-branded small model family	0.42	0	6.8	3	67
uncensored-q-270m-f16	Uncensored small model family	0.41	0	9.5	2	68
survival-uncensored-gemma-270m.Q5_K_S	Survival-branded small model family	0.38	0	6.0	3	68
survival-uncensored-gemma-270m.Q3_K_L	Survival-branded small model family	0.24	0	3.6	2	67
survival-uncensored-gemma-270m.Q4_K_M	Survival-branded small model family	0.24	0	6.6	3	68
survival-uncensored-gemma-270m.Q4_K_S	Survival-branded small model family	0.19	0	4.2	3	68
survival-uncensored-gemma-270m.IQ4_XS	Survival-branded small model family	0.13	0	4.7	3	68
survival-uncensored-gemma-270m.Q2_K	Survival-branded small model family	0.08	0	3.2	2	67
survival-uncensored-gemma-270m.Q3_K_M	Survival-branded small model family	0.07	0	3.7	3	68
pllama-7b-instruct-q4_k_m	Plant specialist	0.04	0	22.5	1	33
survival-uncensored-gemma-270m.Q3_K_S	Survival-branded small model family	0.04	0	6.8	2	68

YouTube / Podcast / Social Reuse Notice

Using this article in videos, podcasts, newsletters, or social content:You may discuss the topic and quote short excerpts with attribution. You may not read this article verbatim, turn it into a script, summarize it section-by-section as substitute content, or use MW42 graphics, maps, charts, or timelines without written permission.Required credit in spoken audio, on-screen text, and the video/podcast description:"Research/source material from Mad World 42 - Testing Local Survival AI: What MW42’s Exploratory Benchmark Actually Showed - mw42_special_study__exploratory_local_ai_survival_benchmark__20260528__v1.md."

Copyright and Use Notice

© 2026 Mad World 42 LLC. All rights reserved.Original MW42 analysis, article structure, graphics, charts, maps, timelines, briefings, and written expression are protected by copyright. No part of this article may be copied, republished, narrated, adapted into video/audio content, used as the primary basis for monetized content, or redistributed without written permission from Mad World 42 LLC, except for short quoted excerpts with clear attribution and a link to the original article.Required attribution:Source: Mad World 42 (MW42), "Testing Local Survival AI: What MW42’s Exploratory Benchmark Actually Showed"