Performance Metrics Analysis Dashboard

Executive Summary

This analysis compares quality and size metrics across eleven quantized model configurations, with a focus on efficiency, that is, quality relative to model size. The data is color-coded to highlight the best (green), close to best (yellow), still good overall (orange), and worst (red) values.

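Key findings:

The most useful metrics for comparison are mean PPL (perplexity), mean KLD (Kullback-Leibler divergence), and the efficiency ratios at the bottom of the table (1PPL/MB, 1/mean KLD/GB, and related ratios), which relate quality to model size. In each of the four pairs where both builds are listed (IQ1_M, IQ2_XXS, Q2_K_L, IQ3_XXS), the "mine" build has lower mean PPL and lower mean KLD than the "main" build, at a somewhat larger size.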

Performance Measurement Data

| Measurement | IQ1_M (mine) | IQ1_M (main) | IQ2_XXS (mine) | IQ2_XXS (main) | IQ2_S (mine) | UD-IQ1_M (unsloth) | Q2_K_L (mine) | Q2_K_L (main) | UD-Q2_K_XL (unsloth) | IQ3_XXS (mine) | IQ3_XXS (main) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Size (GB) | 25.32 | 24.57 | 30.17 | 28.56 | 34.34 | 35.4 | 44 | 40.57 | 42.6 | 44.96 | 41.66 |
| Mean PPL | 11.81 | 13.79 | 10.55 | 11.66 | 9.85 | 10.30 | 9.02 | 9.88 | 9.31 | 9.266434 | 9.76184 |
| KLD: Mean | 0.691 | 0.933 | 0.464 | 0.664 | 0.361 | 0.376 | 0.217 | 0.332 | 0.185 | 0.164 | 0.244 |
| KLD: Max | 17.819 | 23.806 | 26.647 | 26.761 | 17.597 | 21.264 | 24.180 | 17.556 | 23.286 | 28.166 | 25.849 |
| KLD: 99.9% | 9.912 | 10.822 | 7.897 | 10.029 | 6.693 | 6.995 | 7.129 | 12.766 | 4.213 | 4.232 | 4.964 |
| KLD: 99% | 5.463 | 6.250 | 4.084 | 5.094 | 3.237 | 3.560 | 2.108 | 2.966 | 1.844 | 1.600 | 2.178 |
| KLD: Median | 0.315 | 0.387 | 0.187 | 0.335 | 0.141 | 0.131 | 0.067 | 0.125 | 0.060 | 0.055 | 0.099 |
| KLD: 10% | 0.0053 | 0.0099 | 0.002 | 0.004 | 0.0012 | 0.0012 | 0.0005 | 0.0009 | 0.0004 | 0.0004 | 0.0005 |
| KLD: 5% | 0.00097 | 0.00179 | 0.0003 | 0.00064 | 0.00019 | 0.00018 | 0.00008 | 0.00013 | 0.00005 | 0.00005 | 0.00007 |
| KLD: 1% | 0.00046 | 0.00073 | 0.00011 | 0.00030 | 0.00007 | 0.00007 | 0.00002 | 0.00004 | 0.00001 | 0.00001 | 0.00002 |
| Delta probs: Mean | -8.03% | -10.30% | -4.62% | -6.70% | -3.38% | -3.46% | -2.14% | -2.37% | -1.38% | -1.13% | -1.57% |
| Delta probs: Max | 99.67% | 98.73% | 99.81% | 99.81% | 99.13% | 98.90% | 99.88% | 99.81% | 99.83% | 99.91% | 99.99% |
| Delta probs: 99.9% | 77.40% | 79.77% | 76.35% | 75.42% | 75.03% | 76.59% | 69.34% | 75.65% | 69.69% | 65.60% | 71.73% |
| Delta probs: 99% | 42.37% | 41.40% | 41.62% | 47.11% | 40.65% | 40.56% | 32.34% | 41.89% | 33.46% | 31.38% | 37.88% |
| Delta probs: 95% | 15.79% | 18.51% | 16.32% | 19.86% | 16.95% | 15.56% | 12.41% | 17.30% | 12.83% | 12.71% | 16.04% |
| Delta probs: 90% | 6.59% | 7.56% | 7.69% | 9.05% | 7.62% | 7.33% | 5.92% | 8.86% | 6.43% | 6.50% | 8.23% |
| Delta probs: 75% | 0.16% | 0.13% | 0.44% | 0.35% | 0.54% | 0.51% | 0.53% | 0.89% | 0.70% | 0.70% | 0.86% |
| Delta probs: Median | -0.78% | -1.21% | -0.18% | -0.42% | -0.09% | -0.09% | -0.03% | -0.02% | -0.01% | -0.01% | -0.01% |
| Delta probs: 25% | -11.66% | -15.85% | -6.11% | -9.93% | -4.65% | -4.56% | -2.86% | -3.40% | -2.11% | -1.96% | -2.66% |
| Delta probs: 10% | -35.57% | -46.38% | -23.74% | -34.00% | -19.19% | -18.97% | -12.61% | -16.60% | -10.78% | -10.12% | -13.88% |
| Delta probs: 5% | -56.91% | -68.67% | -40.94% | -53.40% | -33.86% | -34.31% | -23.01% | -30.06% | -20.17% | -18.53% | -24.41% |
| Delta probs: 1% | -91.26% | -95.39% | -80.42% | -87.98% | -70.51% | -73.12% | -55.83% | -67.16% | -49.11% | -44.35% | -53.65% |
| Delta probs: 0.1% | -99.61% | -99.87% | -98.74% | -99.76% | -95.85% | -95.98% | -99.92% | -99.92% | -82.64% | -78.71% | -86.82% |
| Delta probs: Minimum | -100.00% | -100.00% | -100.00% | -100.00% | -99.95% | -99.95% | -100.00% | -100.00% | -99.96% | -100.00% | -100.00% |
| RMS Δp | 23.63% | 27.63% | 19.13% | 23.06% | 16.86% | 17.16% | 13.55% | 16.31% | 12.16% | 11.30% | 13.69% |
| Same top | 68.58% | 62.65% | 74.02% | 67.77% | 76.74% | 77.00% | 82.92% | 77.85% | 83.42% | 84.20% | 80.09% |
| 1PPL/MB | 3.217 | 2.952 | 3.140 | 3.002 | 2.956 | 2.743 | 2.521 | 2.434 | 2.522 | 2.401 | 2.459 |
| 1/mean KLD/GB | 0.0571 | 0.0437 | 0.0715 | 0.0528 | 0.0806 | 0.0751 | 0.1045 | 0.0741 | 0.1268 | 0.1357 | 0.0983 |
| 1/median KLD/GB | 0.2690 | 0.1442 | 0.5076 | 0.2553 | 0.7196 | 0.7434 | 1.6611 | 0.8113 | 1.7799 | 1.9124 | 1.0377 |
| 1/RMS/GB | 0.1607559878 | 0.147351131 | 0.1733007884 | 0.1518255381 | 0.1724846058 | 0.1645804449 | 0.1677165724 | 0.1516537142 | 0.1936001069 | 0.1981899277 | 0.1753256929 |
| top P/GB | 43.54236966 | 40.76096623 | 40.76096623 | 42.1431626 | 44.74616598 | 45.57342992 | 53.06575329 | 52.11571564 | 51.06444189 | 53.94945508 | 52.02427633 |

Color Legend:

Green: best
Yellow: close to best
Orange: still good overall
Red: worst values

Chart: Model Size Comparison (GB)

Chart: Mean Perplexity (PPL) Comparison. Lower values indicate better performance.

Chart: Mean KLD Comparison. Lower values indicate better alignment with the reference distribution.

Efficiency Metrics Heatmap

The color-coded efficiency metrics from the bottom of the measurement table, which relate quality to model size.

| Metric | IQ1_M (mine) | IQ1_M (main) | IQ2_XXS (mine) | IQ2_XXS (main) | IQ2_S (mine) | UD-IQ1_M (unsloth) | Q2_K_L (mine) | Q2_K_L (main) | UD-Q2_K_XL (unsloth) | IQ3_XXS (mine) | IQ3_XXS (main) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1PPL/MB | 3.217 | 2.952 | 3.140 | 3.002 | 2.956 | 2.743 | 2.521 | 2.434 | 2.522 | 2.401 | 2.459 |
| 1/mean KLD/GB | 0.0571 | 0.0437 | 0.0715 | 0.0528 | 0.0806 | 0.0751 | 0.1045 | 0.0741 | 0.1268 | 0.1357 | 0.0983 |
| 1/median KLD/GB | 0.2690 | 0.1442 | 0.5076 | 0.2553 | 0.7196 | 0.7434 | 1.6611 | 0.8113 | 1.7799 | 1.9124 | 1.0377 |
| 1/RMS/GB | 0.1608 | 0.1474 | 0.1733 | 0.1518 | 0.1725 | 0.1646 | 0.1677 | 0.1517 | 0.1936 | 0.1982 | 0.1753 |
| top P/GB | 43.5424 | 40.7610 | 40.7610 | 42.1432 | 44.7462 | 45.5734 | 53.0658 | 52.1157 | 51.0644 | 53.9495 | 52.0243 |

Understanding the Metrics

Size (GB)

The storage size of the model in gigabytes. Smaller models generally need less memory and storage.

Mean PPL (Perplexity)

Perplexity measures how well a probability model predicts a sample. Lower values indicate better performance and more accurate predictions.
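
As a minimal sketch of the definition above, perplexity can be computed as the exponential of the mean negative log-likelihood per token. The function name and the example log-probabilities are hypothetical and are not taken from the measured data.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical log-probabilities the model assigned to four observed tokens.
example_logprobs = [math.log(0.5), math.log(0.25), math.log(0.125), math.log(0.25)]
ppl = perplexity(example_logprobs)
print(f"perplexity = {ppl:.3f}")
```

A perplexity of N can be read as the model being, on average, as uncertain as if it were choosing uniformly among N tokens.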

KLD (Kullback-Leibler Divergence)

Measures how far one probability distribution diverges from a second, reference distribution. Lower values indicate better alignment with the reference distribution.
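
The following sketch shows the standard discrete KL divergence for a single token position. The function name and the two example distributions are hypothetical; it is also an assumption that the table's KLD rows (mean, max, percentiles) summarize such per-position values over an evaluation text.

```python
import math

def kl_divergence(p_ref, q_model):
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    if len(p_ref) != len(q_model):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(p_ref, q_model):
        if p == 0.0:
            continue  # terms with p == 0 contribute nothing
        if q == 0.0:
            return math.inf  # model gives zero mass where the reference does not
        total += p * math.log(p / q)
    return total

# Hypothetical two-token distributions.
reference = [0.5, 0.5]
model = [0.25, 0.75]
kld = kl_divergence(reference, model)
print(f"KLD = {kld:.6f}")
```

The divergence is zero only when the two distributions are identical, and it is not symmetric in its arguments.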

Delta probs

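The per-token change in predicted probability between the model and a reference, summarized as a mean and as percentiles. Values closer to zero indicate better alignment; negative values mean the model assigns a lower probability than the reference.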

RMS Δp

Root Mean Square of the delta probabilities, providing a single metric for the overall deviation. Lower values are better.
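
A short sketch of why the RMS is reported alongside the mean: positive and negative deltas cancel in the mean but not in the RMS. The delta values below are hypothetical, and the sign convention (model probability minus reference probability) is an assumption.

```python
import math

def mean(values):
    return sum(values) / len(values)

def rms(values):
    """Root mean square: sqrt of the mean of squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Hypothetical per-token deltas (model probability minus reference probability).
delta_p = [-0.10, 0.02, -0.30, 0.00]

mean_dp = mean(delta_p)
rms_dp = rms(delta_p)
print(f"mean delta p = {mean_dp:+.4f}, RMS delta p = {rms_dp:.4f}")
```

This is consistent with the table, where IQ1_M (mine) has a mean delta of -8.03% but an RMS Δp of 23.63%.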

Same top

Percentage of cases where the model predicts the same top token as the reference. Higher values indicate better alignment.
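
As an illustration, top-token agreement can be sketched as the fraction of positions where both distributions have the same most likely token. The function name and the small example distributions are hypothetical.

```python
def same_top_rate(ref_dists, model_dists):
    """Fraction of positions where reference and model share the same argmax token."""
    if not ref_dists or len(ref_dists) != len(model_dists):
        raise ValueError("need two non-empty lists of equal length")

    def argmax(dist):
        return max(range(len(dist)), key=dist.__getitem__)

    matches = sum(argmax(r) == argmax(m) for r, m in zip(ref_dists, model_dists))
    return matches / len(ref_dists)

# Hypothetical distributions over a three-token vocabulary at four positions.
ref = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]
mdl = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.2, 0.3], [0.3, 0.6, 0.1]]

rate = same_top_rate(ref, mdl)
print(f"same top = {rate:.2%}")
```

Unlike KLD, this measure ignores how the remaining probability mass is distributed, so two models with the same agreement rate can still differ in KLD.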

1PPL/MB

Inverse perplexity normalized by model size, so that a lower perplexity at a smaller size gives a higher score. Higher values indicate better perplexity per megabyte.
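
The source does not state the formula behind this ratio. One reading, offered here only as an assumption, is 1/PPL per MB scaled by 10^6 with 1 GB taken as 1000 MB, which simplifies to 1000 / (mean PPL × size in GB). It reproduces the tabulated values for the three columns below to within rounding; not every column matches this closely, possibly because the table was computed from unrounded or different inputs.

```python
def inverse_ppl_per_mb(mean_ppl, size_gb):
    """Assumed formula: (1 / PPL / MB) * 1e6, with 1 GB = 1000 MB."""
    return 1000.0 / (mean_ppl * size_gb)

# (mean PPL, size in GB, tabulated 1PPL/MB) copied from the measurement table.
rows = {
    "IQ1_M (main)": (13.79, 24.57, 2.952),
    "IQ2_XXS (mine)": (10.55, 30.17, 3.140),
    "IQ3_XXS (main)": (9.76184, 41.66, 2.459),
}

scores = {name: inverse_ppl_per_mb(ppl, gb) for name, (ppl, gb, _) in rows.items()}
for name, (_, _, tabulated) in rows.items():
    print(f"{name}: computed {scores[name]:.3f}, tabulated {tabulated:.3f}")
```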

1/mean KLD/GB

Inverse mean KLD normalized by model size, so that a lower divergence at a smaller size gives a higher score. Higher values indicate better KLD per gigabyte.
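
Reading the metric name literally as 1 / (mean KLD × size in GB) agrees with the tabulated values to within rounding. The sketch below recomputes it for four columns using the mean KLD and size rows from the measurement table; the formula is inferred from the name and the numbers rather than stated in the source.

```python
def inverse_mean_kld_per_gb(mean_kld, size_gb):
    """Assumed formula: 1 / mean KLD / size in GB."""
    return 1.0 / (mean_kld * size_gb)

# (mean KLD, size in GB, tabulated 1/mean KLD/GB) copied from the measurement table.
rows = {
    "IQ1_M (mine)": (0.691, 25.32, 0.0571),
    "IQ1_M (main)": (0.933, 24.57, 0.0437),
    "UD-Q2_K_XL (unsloth)": (0.185, 42.6, 0.1268),
    "IQ3_XXS (mine)": (0.164, 44.96, 0.1357),
}

scores = {name: inverse_mean_kld_per_gb(kld, gb) for name, (kld, gb, _) in rows.items()}
best = max(scores, key=scores.get)

for name, (_, _, tabulated) in rows.items():
    print(f"{name}: computed {scores[name]:.4f}, tabulated {tabulated:.4f}")
print(f"highest score among these columns: {best}")
```

Small differences in the last digit are expected because the table's inputs are themselves rounded.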

1/median KLD/GB

An alternative KLD efficiency metric based on the median rather than the mean KLD, which makes it less sensitive to outlier tokens. Higher values indicate better efficiency.

1/RMS/GB

Inverse RMS Δp normalized by model size. Higher values indicate better efficiency.

top P/GB

Same-top agreement normalized by model size. Higher values indicate better efficiency.

Conclusions

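Based on the data, several conclusions can be drawn:

In each of the four pairs (IQ1_M, IQ2_XXS, Q2_K_L, IQ3_XXS), the "mine" build has lower mean PPL and mean KLD than the "main" build at a somewhat larger size, and it also scores higher on 1/mean KLD/GB. The unsloth UD-Q2_K_XL build reaches a lower mean KLD (0.185) than either Q2_K_L build (0.217 and 0.332) at a comparable size, which suggests that optimization techniques like those used in the unsloth variants can improve efficiency without sacrificing quality. Among all configurations, IQ3_XXS (mine) has the lowest mean KLD and RMS Δp and the highest same-top rate, while IQ1_M (mine) leads on 1PPL/MB, so which configuration is most efficient depends on the metric chosen.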