Evaluating 19 sessions against hermes3:8b
Skipping sessions with >20 human turns
------------------------------------------------------------
[1/19] 20260323_154745_7c9379 (1 human turns)... composite=0.522 (19.0s)
[2/19] 20260323_155153_bbc661 (1 human turns)... composite=0.681 (20.3s)
[3/19] 20260323_155458_a7563d (1 human turns)... composite=0.523 (15.5s)
[4/19] 20260323_160415_267a50 (1 human turns)... composite=0.000 (0.0s)
[5/19] 20260323_160533_edc7cf (4 human turns)... composite=0.684 (35.0s)
[6/19] 20260323_160633_6fab92 (1 human turns)... composite=0.000 (0.0s)
[7/19] 20260323_160736_754836 (1 human turns)... composite=0.628 (8.1s)
[8/19] 20260323_160920_c90867 (2 human turns)... composite=0.592 (36.7s)
[9/19] 20260323_161348_d553f0 (1 human turns)... composite=0.628 (13.2s)
[10/19] 20260323_162017_83fe11 (1 human turns)... composite=0.623 (13.2s)
[11/19] 20260323_162338_0267eb (2 human turns)... composite=0.592 (28.4s)
[12/19] 20260323_162704_86f56f (1 human turns)... composite=0.623 (8.1s)
[13/19] 20260323_163322_e45fa8 (1 human turns)... composite=0.624 (8.0s)
[14/19] 20260323_163943_9ac5e0 (1 human turns)... composite=0.481 (75.1s)
[15/19] 20260323_164605_a4aedf (1 human turns)... composite=0.726 (8.4s)
[16/19] 20260323_164606_c000d8 (2 human turns)... composite=0.592 (25.9s)
[17/19] 20260323_165414_a92bf6 (1 human turns)... composite=0.674 (10.5s)
[18/19] 20260323_165533_940c49 (1 human turns)... composite=0.521 (19.5s)
[19/19] 20260323_170039_2a191f (1 human turns)... composite=0.754 (8.2s)

============================================================
RESULTS — hermes3:8b
============================================================
  tool_selection       : 0.895
  length_ratio         : 0.271
  format_compliance    : 0.892
  text_similarity      : 0.029
  brevity              : 0.668
  composite            : 0.551

Saved to /Users/apayne/autolora/evals/v0-baseline/8b/scores.json

============================================================
STEP 1: Evaluating hermes3:8b
============================================================
Done. Results in /Users/apayne/autolora/evals/v0-baseline/8b/
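
The log does not say how the overall composite is derived. As a rough sanity check, the sketch below recomputes it two ways from the numbers printed above: as an unweighted mean of the five metric averages, and as the mean of the 19 per-session composites. Both the equal weighting and the inclusion of the two 0.000 sessions are assumptions inferred from the arithmetic, not something the evaluator output confirms.

```python
# Values copied from the RESULTS block in the log above.
metrics = {
    "tool_selection": 0.895,
    "length_ratio": 0.271,
    "format_compliance": 0.892,
    "text_similarity": 0.029,
    "brevity": 0.668,
}
reported_composite = 0.551

# Per-session composites in log order; sessions 4 and 6 report 0.000 (0.0s).
session_composites = [
    0.522, 0.681, 0.523, 0.000, 0.684, 0.000, 0.628, 0.592, 0.628, 0.623,
    0.592, 0.623, 0.624, 0.481, 0.726, 0.592, 0.674, 0.521, 0.754,
]

# Assumption: the composite is an equal-weight mean of the five metrics.
metric_mean = sum(metrics.values()) / len(metrics)

# Assumption: every session, including the 0.000 ones, counts toward the mean.
session_mean = sum(session_composites) / len(session_composites)

# Same mean with the 0.000 sessions dropped, for comparison only.
nonzero = [c for c in session_composites if c > 0]
nonzero_mean = sum(nonzero) / len(nonzero)

print(f"mean of five metrics      : {metric_mean:.3f}")
print(f"mean of all 19 sessions   : {session_mean:.3f}")
print(f"mean of non-zero sessions : {nonzero_mean:.3f}")
print(f"reported composite        : {reported_composite:.3f}")
```

Both of the first two means agree with the reported 0.551 to three decimals, which suggests (but does not prove) that the two sessions that returned 0.000 in 0.0s are pulling the aggregate down rather than being skipped.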