#AI #LLM

I’ve recently been working on evaluating LLMs and building a private benchmark to test these models for my team. Though I am not an expert in this field, I have some thoughts I’d like to share.

For most of us, a good evaluation of LLMs matters more than training itself. We often don’t do any pre-training or post-training work, because prompt engineering on top of a capable LLM can already solve many real business problems. That makes it all the more important to understand an LLM’s abilities through thorough evaluation. But what constitutes a good evaluation? Here are the key attributes I think it needs:

  1. Sufficient Data Samples: A robust evaluation needs enough data samples. Since LLMs are probabilistic models, their outputs can vary even when asked the same question multiple times; a model might answer incorrectly on the first attempt and correctly on the next. To be fair, an evaluation has to account for this variability, for example by sampling each question several times and averaging the results (see the sampling sketch after this list).

  2. High-Quality Data Samples: A high-quality evaluation also requires high-quality data samples. LLMs are advancing rapidly, and it’s common to see claims like “GPT-4 achieves human-level performance in some fields,” yet these models still sometimes stumble on basic commonsense questions or elementary arithmetic. High-quality data should be diverse, accurate, and somewhat challenging, so that it can meaningfully separate stronger models from weaker ones. Diversity should cover question formats, domains, and content.

  3. Clear Judging Criteria: A good evaluation must clearly explain how outcomes are judged. Some benchmarks use LLMs as judges, while others rely on human annotation. Both approaches have merit, but the judging process needs to be documented in detail to avoid surprises. If you use an LLM as a judge, curate its prompt carefully and have humans spot-check a sample of its verdicts for accuracy (see the judging sketch after this list).

  4. Meaningful Evaluation Tasks: The evaluation tasks should be meaningful and relevant. If a task doesn’t matter to anyone, neither will its results. Some evaluations present very challenging tasks, but if the scores don’t offer clear insight into the model’s capabilities, they aren’t particularly useful.

  5. Frequent, Private Benchmarks: Establishing and frequently updating your own private benchmark is crucial, especially since many LLMs claim state-of-the-art results on public benchmarks like MMLU and CMMLU, and those results may be inflated by test-data leakage or benchmark-specific tuning. Regular runs against a private benchmark let you track a model’s performance over time and confirm it stays effective for your real business scenarios.
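
To make the first point concrete, here is a minimal sketch of what accounting for output variability can look like: each question is sampled several times and the per-question accuracies are averaged. The helpers `ask_model` and `is_correct` are placeholders I made up for illustration, not any particular provider’s API; swap in whatever client and grading logic your tasks need.

```python
import statistics

def ask_model(question: str) -> str:
    """Placeholder for a real LLM call; swap in the client you actually use."""
    raise NotImplementedError("plug in your LLM client here")

def is_correct(answer: str, reference: str) -> bool:
    """Deliberately simple grader: normalized exact match.
    Real tasks usually need something richer (regex, unit tests, or a judge)."""
    return answer.strip().lower() == reference.strip().lower()

def evaluate(benchmark: list[dict], samples_per_question: int = 5) -> dict:
    """Ask each question several times so one lucky (or unlucky)
    generation does not dominate the final score."""
    per_question = []
    for item in benchmark:
        answers = [ask_model(item["question"]) for _ in range(samples_per_question)]
        hits = sum(is_correct(a, item["reference"]) for a in answers)
        per_question.append(hits / samples_per_question)
    return {
        "mean_accuracy": statistics.mean(per_question),
        "stdev": statistics.stdev(per_question) if len(per_question) > 1 else 0.0,
    }
```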
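
For the judging point, here is a similar sketch of an LLM-as-judge setup paired with a human spot-check. The prompt wording, the `judge` stub, and the 10% review fraction are all assumptions for illustration, not a recommendation of any specific judge model.

```python
import random

# One possible judge prompt; the exact wording is an assumption, not a standard.
JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str) -> bool:
    """Placeholder for the LLM-as-judge call: format JUDGE_PROMPT, send it to
    whichever judge model you trust, and parse its one-word verdict."""
    raise NotImplementedError("plug in your judge model here")

def spot_check(judged_records: list[dict], fraction: float = 0.1, seed: int = 0) -> list[dict]:
    """Draw a random sample of the judge's verdicts for human review,
    so you can estimate how often the judge itself is wrong."""
    rng = random.Random(seed)
    k = min(len(judged_records), max(1, int(len(judged_records) * fraction)))
    return rng.sample(judged_records, k)
```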

A newer evaluation format gaining popularity pits two LLMs against each other and lets humans pick the answer they prefer. This format has its pros and cons. On the plus side, human judgment is generally more reliable than having an LLM grade itself, and collecting votes from users can be cheaper than paying for judge-model API calls. On the downside, there are concerns about companies using the collected data to train their models, and gathering enough human preference votes is slow.
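
Rankings in this battle format are usually aggregated from the pairwise votes with an Elo-style update, the approach popularized by arena-style leaderboards. A minimal sketch of that aggregation, where the starting rating of 1000 and the K-factor of 32 are arbitrary choices:

```python
def update_elo(ratings: dict, model_a: str, model_b: str, winner: str, k: float = 32.0) -> None:
    """Update two models' Elo-style ratings after one human preference vote.
    `winner` is "a", "b", or "tie"; a tie counts as half a win for each side."""
    ra = ratings.setdefault(model_a, 1000.0)
    rb = ratings.setdefault(model_b, 1000.0)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Example: three hypothetical votes between two hypothetical models.
ratings: dict[str, float] = {}
for a, b, w in [("model_x", "model_y", "a"), ("model_x", "model_y", "a"), ("model_x", "model_y", "tie")]:
    update_elo(ratings, a, b, w)
print(ratings)
```

Raw win rates work for a quick look, but an Elo-style rating also accounts for the strength of the opponents each model happened to face.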

Conclusion

Evaluating LLMs is an ongoing process. As these models continue to develop, we need to adapt our benchmarks to keep pace with their intelligence. Investing time in establishing a high-quality, reliable evaluation benchmark is worthwhile and essential for leveraging LLMs effectively in various applications.