Performance Metrics Analysis

Executive Summary

This analysis compares eleven quantized model configurations, ranging from 24.57 GB to 44.96 GB, on perplexity, KL divergence, token-probability shifts, and size-normalized efficiency ratios. The data is color-coded to highlight the best (green), close to best (yellow), still good overall (orange), and worst (red) values.

Key findings:

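The IQ1_M configurations are the smallest (24.57 and 25.32 GB), and IQ1_M (mine) has the highest 1PPL/MB (3.217), but the pair also has the two highest mean PPL and mean KLD values in the set.
IQ2_XXS (mine) reaches a mean PPL of 10.55 and a mean KLD of 0.464 at 30.17 GB, ahead of IQ2_XXS (main) at 11.66 and 0.664 at 28.56 GB.
Of the UD-prefixed models (unsloth), UD-Q2_K_XL stands out with the lowest 99.9% KLD (4.213) and the second-lowest mean KLD (0.185).
IQ3_XXS (mine) has the lowest mean KLD (0.164) and the highest same-top rate (84.20%), at the largest size in the set (44.96 GB).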

The most notable metrics for comparison are Mean PPL (perplexity), KLD (Kullback-Leibler divergence), and the efficiency ratios at the bottom of the table (1PPL/MB through top P/GB), which relate performance to model size.

Performance Measurement Data

Measurement	IQ1_M (mine)	IQ1_M (main)	IQ2_XXS (mine)	IQ2_XXS (main)	IQ2_S (mine)	UD-IQ1_M (unsloth)	Q2_K_L (mine)	Q2_K_L (main)	UD-Q2_K_XL (unsloth)	IQ3_XXS (mine)	IQ3_XXS (main)
Size (GB)	25.32	24.57	30.17	28.56	34.34	35.4	44	40.57	42.6	44.96	41.66
Mean PPL	11.81	13.79	10.55	11.66	9.85	10.30	9.02	9.88	9.31	9.266434	9.76184
KLD
Mean	0.691	0.933	0.464	0.664	0.361	0.376	0.217	0.332	0.185	0.164	0.244
Max	17.819	23.806	26.647	26.761	17.597	21.264	24.180	17.556	23.286	28.166	25.849
99.9%	9.912	10.822	7.897	10.029	6.693	6.995	7.129	12.766	4.213	4.232	4.964
99%	5.463	6.250	4.084	5.094	3.237	3.560	2.108	2.966	1.844	1.600	2.178
median	0.315	0.387	0.187	0.335	0.141	0.131	0.067	0.125	0.060	0.055	0.099
10%	0.0053	0.0099	0.002	0.004	0.0012	0.0012	0.0005	0.0009	0.0004	0.0004	0.0005
5%	0.00097	0.00179	0.0003	0.00064	0.00019	0.00018	0.00008	0.00013	0.00005	0.00005	0.00007
1%	0.00046	0.00073	0.00011	0.00030	0.00007	0.00007	0.00002	0.00004	0.00001	0.00001	0.00002
Delta probs
Mean	-8.03%	-10.30%	-4.62%	-6.70%	-3.38%	-3.46%	-2.14%	-2.37%	-1.38%	-1.13%	-1.57%
Max	99.67%	98.73%	99.81%	99.81%	99.13%	98.90%	99.88%	99.81%	99.83%	99.91%	99.99%
99.9%	77.40%	79.77%	76.35%	75.42%	75.03%	76.59%	69.34%	75.65%	69.69%	65.60%	71.73%
99%	42.37%	41.40%	41.62%	47.11%	40.65%	40.56%	32.34%	41.89%	33.46%	31.38%	37.88%
95.00%	15.79%	18.51%	16.32%	19.86%	16.95%	15.56%	12.41%	17.30%	12.83%	12.71%	16.04%
90.00%	6.59%	7.56%	7.69%	9.05%	7.62%	7.33%	5.92%	8.86%	6.43%	6.50%	8.23%
75.00%	0.16%	0.13%	0.44%	0.35%	0.54%	0.51%	0.53%	0.89%	0.70%	0.70%	0.86%
Median	-0.78%	-1.21%	-0.18%	-0.42%	-0.09%	-0.09%	-0.03%	-0.02%	-0.01%	-0.01%	-0.01%
25.00%	-11.66%	-15.85%	-6.11%	-9.93%	-4.65%	-4.56%	-2.86%	-3.40%	-2.11%	-1.96%	-2.66%
10.00%	-35.57%	-46.38%	-23.74%	-34.00%	-19.19%	-18.97%	-12.61%	-16.60%	-10.78%	-10.12%	-13.88%
5.00%	-56.91%	-68.67%	-40.94%	-53.40%	-33.86%	-34.31%	-23.01%	-30.06%	-20.17%	-18.53%	-24.41%
1.00%	-91.26%	-95.39%	-80.42%	-87.98%	-70.51%	-73.12%	-55.83%	-67.16%	-49.11%	-44.35%	-53.65%
0.10%	-99.61%	-99.87%	-98.74%	-99.76%	-95.85%	-95.98%	-99.92%	-99.92%	-82.64%	-78.71%	-86.82%
Minimum	-100.00%	-100.00%	-100.00%	-100.00%	-99.95%	-99.95%	-100.00%	-100.00%	-99.96%	-100.00%	-100.00%
RMS Δp	23.63%	27.63%	19.13%	23.06%	16.86%	17.16%	13.55%	16.31%	12.16%	11.30%	13.69%
Same top	68.58%	62.65%	74.02%	67.77%	76.74%	77.00%	82.92%	77.85%	83.42%	84.20%	80.09%
1PPL/MB	3.217	2.952	3.140	3.002	2.956	2.743	2.521	2.434	2.522	2.401	2.459
1/mean KLD/GB	0.0571	0.0437	0.0715	0.0528	0.0806	0.0751	0.1045	0.0741	0.1268	0.1357	0.0983
1/median KLD/GB	0.2690	0.1442	0.5076	0.2553	0.7196	0.7434	1.6611	0.8113	1.7799	1.9124	1.0377
1/RMS/GB	0.1607559878	0.147351131	0.1733007884	0.1518255381	0.1724846058	0.1645804449	0.1677165724	0.1516537142	0.1936001069	0.1981899277	0.1753256929
top P/GB	43.54236966	40.76096623	40.76096623	42.1431626	44.74616598	45.57342992	53.06575329	52.11571564	51.06444189	53.94945508	52.02427633

Color Legend: Best (green), Close to best (yellow), Still good overall (orange), Worst values (red)

[Chart: Model Size Comparison (GB)]

[Chart: Mean Perplexity (PPL) Comparison. Lower values indicate better performance.]

[Chart: Mean KLD Comparison. Lower values indicate better alignment with the reference distribution.]

Efficiency Metrics Heatmap

Visualizing the color-coded bottom metrics that indicate performance efficiency relative to model size

Metric	IQ1_M (mine)	IQ1_M (main)	IQ2_XXS (mine)	IQ2_XXS (main)	IQ2_S (mine)	UD-IQ1_M (unsloth)	Q2_K_L (mine)	Q2_K_L (main)	UD-Q2_K_XL (unsloth)	IQ3_XXS (mine)	IQ3_XXS (main)
1PPL/MB	3.217	2.952	3.140	3.002	2.956	2.743	2.521	2.434	2.522	2.401	2.459
1/mean KLD/GB	0.0571	0.0437	0.0715	0.0528	0.0806	0.0751	0.1045	0.0741	0.1268	0.1357	0.0983
1/median KLD/GB	0.2690	0.1442	0.5076	0.2553	0.7196	0.7434	1.6611	0.8113	1.7799	1.9124	1.0377
1/RMS/GB	0.1608	0.1474	0.1733	0.1518	0.1725	0.1646	0.1677	0.1517	0.1936	0.1982	0.1753
top P/GB	43.5424	40.7610	40.7610	42.1432	44.7462	45.5734	53.0658	52.1157	51.0644	53.9495	52.0243

Understanding the Metrics

Size (GB)

The storage size of the model in gigabytes. Smaller models generally require fewer computational resources, chiefly memory and storage.

Mean PPL (Perplexity)

Perplexity measures how well a probability model predicts a sample. Lower values indicate better performance and more accurate predictions.
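As a rough illustration of the definition, perplexity can be computed as the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log probabilities for the observed tokens; the function name and the sample values are hypothetical and are not taken from the measurement run.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to four observed tokens.
example_logprobs = [math.log(p) for p in (0.5, 0.25, 0.125, 0.25)]
example_ppl = perplexity(example_logprobs)
print(example_ppl)
```

A perplexity of N can be read loosely as the model being as uncertain as if it were choosing uniformly among N tokens at each position.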

KLD (Kullback-Leibler Divergence)

Measures how much the model's predicted probability distribution diverges from a reference distribution. Lower values indicate better alignment with the reference distribution.
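A minimal sketch of the divergence for a single token position follows. It assumes the direction KL(reference || model), which is the usual convention when a model is scored against a reference; the source does not state the direction, and the two distributions are made-up values.

```python
import math


def kl_divergence(p_ref, q_model):
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    if len(p_ref) != len(q_model):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(p_ref, q_model):
        if p == 0.0:
            continue  # terms with zero reference probability contribute nothing
        if q == 0.0:
            return math.inf  # model rules out a token the reference allows
        total += p * math.log(p / q)
    return total


# Hypothetical next-token distributions over a three-token vocabulary.
reference = [0.7, 0.2, 0.1]
model = [0.6, 0.3, 0.1]
example_kld = kl_divergence(reference, model)
print(round(example_kld, 4))
```

Statistics such as the mean, median, and percentiles in the table would then summarize this per-position value over the whole evaluation text.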

Delta probs

The per-token difference in predicted probability between the model and a reference, reported as a mean and as percentiles. Values closer to zero indicate better alignment.

RMS Δp

Root Mean Square of the delta probabilities, providing a single metric for the overall deviation. Lower values are better.
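The relationship between delta probs and RMS Δp can be sketched as follows. This assumes Δp is the model's probability minus the reference's probability for the same token at each position; the source does not spell this out, and the sample probabilities are invented.

```python
import math


def delta_probs(p_model, p_ref):
    """Per-position probability change: model minus reference."""
    return [m - r for m, r in zip(p_model, p_ref)]


def rms(values):
    """Root mean square of a non-empty list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical probabilities assigned to the same token by each model.
ref_probs = [0.90, 0.50, 0.20, 0.75]
model_probs = [0.80, 0.55, 0.05, 0.75]

deltas = delta_probs(model_probs, ref_probs)
mean_delta = sum(deltas) / len(deltas)
rms_delta = rms(deltas)
print(f"mean delta = {mean_delta:+.2%}, RMS delta = {rms_delta:.2%}")
```

Because positive and negative changes cancel in the mean but not in the RMS, the RMS value is always at least as large as the absolute mean, which matches the pattern in the table.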

Same top

Percentage of cases where the model predicts the same top token as the reference. Higher values indicate better alignment.
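A small sketch of how such an agreement rate might be computed is shown below. The function name and the distributions are hypothetical, and ties are broken by the lowest token index, which is an arbitrary choice.

```python
def same_top_rate(ref_dists, model_dists):
    """Percentage of positions where both distributions share the same argmax."""
    if not ref_dists or len(ref_dists) != len(model_dists):
        raise ValueError("need two non-empty lists of equal length")
    matches = 0
    for ref, model in zip(ref_dists, model_dists):
        ref_top = max(range(len(ref)), key=ref.__getitem__)
        model_top = max(range(len(model)), key=model.__getitem__)
        if ref_top == model_top:
            matches += 1
    return 100.0 * matches / len(ref_dists)


# Hypothetical next-token distributions at four positions.
ref_dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]
model_dists = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.2, 0.3], [0.45, 0.4, 0.15]]
agreement = same_top_rate(ref_dists, model_dists)
print(agreement)
```

Here the two models disagree only at the third position, so the rate is three out of four.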

1PPL/MB

The reciprocal of mean perplexity divided by model size. Higher values mean the model achieves low perplexity at a small size.

1/mean KLD/GB

The reciprocal of mean KLD divided by model size in gigabytes. Higher values mean the model achieves low divergence at a small size.

1/median KLD/GB

Alternative metric for KLD efficiency using median values. Higher values indicate better efficiency.

1/RMS/GB

The reciprocal of RMS Δp divided by model size in gigabytes. Higher values indicate better efficiency.

top P/GB

The same-top agreement rate relative to model size. Higher values indicate better efficiency.
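Three of the efficiency ratios can be approximately recomputed from the main table. The formulas below are inferred from the tabulated numbers and are not stated by the source: 1PPL/MB as 1000 / (Mean PPL × Size), 1/mean KLD/GB as 1 / (mean KLD × Size), and 1/RMS/GB as 1 / (RMS Δp × Size), with size in GB and RMS Δp as a fraction. For the four columns used here they agree with the published cells to within rounding. Not every published cell matches this reconstruction, and 1/median KLD/GB and top P/GB are left out because their tabulated values do not follow the analogous ratio.

```python
# Inputs copied from the table: (size in GB, mean PPL, mean KLD, RMS Δp in %).
INPUTS = {
    "IQ2_XXS (mine)": (30.17, 10.55, 0.464, 19.13),
    "IQ2_S (mine)": (34.34, 9.85, 0.361, 16.86),
    "UD-Q2_K_XL (unsloth)": (42.6, 9.31, 0.185, 12.16),
    "IQ3_XXS (main)": (41.66, 9.76184, 0.244, 13.69),
}

# Published cells for the same columns: (1PPL/MB, 1/mean KLD/GB, 1/RMS/GB).
PUBLISHED = {
    "IQ2_XXS (mine)": (3.140, 0.0715, 0.1733),
    "IQ2_S (mine)": (2.956, 0.0806, 0.1725),
    "UD-Q2_K_XL (unsloth)": (2.522, 0.1268, 0.1936),
    "IQ3_XXS (main)": (2.459, 0.0983, 0.1753),
}


def efficiency_ratios(size_gb, mean_ppl, mean_kld, rms_dp_percent):
    """Assumed formulas, inferred from the table rather than given by the source."""
    inv_ppl = 1000.0 / (mean_ppl * size_gb)
    inv_kld = 1.0 / (mean_kld * size_gb)
    inv_rms = 1.0 / (rms_dp_percent / 100.0 * size_gb)
    return inv_ppl, inv_kld, inv_rms


recomputed = {name: efficiency_ratios(*vals) for name, vals in INPUTS.items()}
for name, values in recomputed.items():
    diffs = [abs(a - b) for a, b in zip(values, PUBLISHED[name])]
    print(name, [round(v, 4) for v in values], "max diff", round(max(diffs), 4))
```

Since each ratio divides by size, a larger file has to reduce perplexity, divergence, or RMS Δp more than proportionally to score higher.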

Conclusions

Based on the data analysis, several important conclusions can be drawn:

The IQ3_XXS (mine) configuration shows excellent efficiency metrics: it has the highest value in four of the five categories, including 1/mean KLD/GB (0.1357) and 1/RMS/GB (0.1982), where it achieves best (green) ratings.
The IQ1_M configurations, the smallest in the set, do well on the 1PPL/MB metric, with IQ1_M (mine) achieving the highest value (3.217), even though they have the highest mean PPL and mean KLD.
Among the UD-prefixed (unsloth) models, UD-Q2_K_XL clearly improves on the Q2_K_L builds of similar size (1/mean KLD/GB of 0.1268 versus 0.1045 and 0.0741), while UD-IQ1_M at 35.4 GB is roughly on par with IQ2_S (mine) at 34.34 GB (0.0751 versus 0.0806).
Q2_K_L (mine) ranks third on the 1/median KLD/GB metric (1.6611), behind IQ3_XXS (mine) and UD-Q2_K_XL (unsloth); Q2_K_L (main) scores much lower (0.8113).
For overall balance between size and performance, the IQ3_XXS (mine) and UD-Q2_K_XL (unsloth) configurations appear to offer the best combinations of metrics.
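One way to make "balance between size and performance" concrete is a Pareto check on two columns of the table. The sketch below is an illustrative analysis, not one performed in the source, and it considers only Size (GB) and mean KLD. A configuration counts as dominated if another one is no larger and has no higher mean KLD, with at least one of the two strictly better.

```python
# (size in GB, mean KLD) copied from the table.
CONFIGS = {
    "IQ1_M (mine)": (25.32, 0.691),
    "IQ1_M (main)": (24.57, 0.933),
    "IQ2_XXS (mine)": (30.17, 0.464),
    "IQ2_XXS (main)": (28.56, 0.664),
    "IQ2_S (mine)": (34.34, 0.361),
    "UD-IQ1_M (unsloth)": (35.4, 0.376),
    "Q2_K_L (mine)": (44.0, 0.217),
    "Q2_K_L (main)": (40.57, 0.332),
    "UD-Q2_K_XL (unsloth)": (42.6, 0.185),
    "IQ3_XXS (mine)": (44.96, 0.164),
    "IQ3_XXS (main)": (41.66, 0.244),
}


def dominates(a, b):
    """True if point a is at least as good as b on both axes and better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def dominated_configs(configs):
    """Names of configurations that some other configuration dominates."""
    return {
        name
        for name, point in configs.items()
        if any(dominates(other, point)
               for other_name, other in configs.items() if other_name != name)
    }


dominated = dominated_configs(CONFIGS)
pareto_front = sorted(set(CONFIGS) - dominated, key=lambda n: CONFIGS[n][0])
print("dominated:", sorted(dominated))
print("front by size:", pareto_front)
```

On these two columns, only UD-IQ1_M (unsloth) and Q2_K_L (mine) are dominated, by IQ2_S (mine) and UD-Q2_K_XL (unsloth) respectively. The remaining nine configurations each represent a distinct size versus divergence trade-off.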

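These findings suggest that the quantization recipe matters as well as the nominal quantization type: the unsloth UD-Q2_K_XL variant reaches lower divergence than the Q2_K_L builds of similar size, and each "mine" variant has a lower mean KLD than its "main" counterpart at a somewhat larger size. Additionally, the higher-bit IQ3_XXS (mine) quantization leads the KLD-based efficiency metrics, while 1PPL/MB favors the smallest models.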
