Generative-AI: Usecase-Specific Evaluation of LLM-powered Applications

Dr. Homa Ansari

Wednesday 17:10 in Platinum3

Large Language Models (LLMs) are a transformative technology, enabling a wide array of applications, from content generation to interactive chatbots. This technology is leveraged to build LLM-powered applications. A wide variety of LLMs is on offer, and the LLM community evaluates their performance independently and generically. The requirements and domain specificity of the use cases behind LLM applications render this generic evaluation insufficient for revealing their performance issues. Usecase-specific performance evaluation therefore becomes a necessary component in the design and continuous development of LLM applications.

In this talk, we address the need for usecase-specific evaluation of LLM applications by proposing a workflow for creating evaluation models that support the selection and optimization of the design of LLM applications.

The workflow comprises three main activities:
1) Human-expert evaluation of LLM applications and benchmark dataset curation
2) Creating evaluation agents
3) Aligning evaluation agents with human evaluation based on the curated dataset

It leads to two concrete outcomes:
1) Curated benchmark dataset: the dataset against which the LLM applications will be tested.
2) Evaluation agent: the scoring model that automatically evaluates the responses of the LLM applications.

The talk will elaborate on the workflow, its limitations, and best practices for increasing the reliability of the evaluations in light of those limitations.
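As a rough illustration of the third activity, aligning an evaluation agent with human evaluation, the sketch below tunes a pass/fail threshold so that an automatic scorer agrees with expert verdicts on a small benchmark. Everything in it is an assumption made for illustration and is not taken from the talk: the names BenchmarkItem, overlap_score, agreement and align_threshold are hypothetical, the token-overlap scorer is only a stand-in for an LLM-based evaluation agent, and the four benchmark entries are invented toy data.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """One curated benchmark entry (hypothetical structure)."""
    reference: str    # expert-approved answer
    response: str     # output of the LLM application under test
    human_pass: bool  # human-expert verdict on the response


def overlap_score(reference: str, response: str) -> float:
    """Stand-in evaluation agent: share of reference tokens present in the response."""
    ref_tokens = set(reference.lower().split())
    resp_tokens = set(response.lower().split())
    return len(ref_tokens & resp_tokens) / len(ref_tokens) if ref_tokens else 0.0


def agreement(items: list[BenchmarkItem], threshold: float) -> float:
    """Fraction of items where the agent's pass/fail verdict matches the human verdict."""
    matches = sum(
        (overlap_score(item.reference, item.response) >= threshold) == item.human_pass
        for item in items
    )
    return matches / len(items)


def align_threshold(items: list[BenchmarkItem],
                    candidates: tuple[float, ...] = (0.25, 0.5, 0.75)) -> float:
    """Pick the candidate threshold that agrees best with the human verdicts."""
    return max(candidates, key=lambda t: agreement(items, t))


# Invented toy data, for illustration only.
BENCHMARK = [
    BenchmarkItem("paris is the capital of france",
                  "the capital of france is paris", True),
    BenchmarkItem("water boils at 100 degrees celsius",
                  "water boils at 90 degrees", False),
    BenchmarkItem("the heart pumps blood",
                  "i do not know", False),
    BenchmarkItem("python is a programming language",
                  "python is a language", True),
]

if __name__ == "__main__":
    best = align_threshold(BENCHMARK)
    print(f"best threshold: {best}, agreement: {agreement(BENCHMARK, best):.2f}")
```

In a real setting the scorer would be replaced by the actual evaluation agent, and agreement would be measured on held-out benchmark items rather than on the same items used to tune it.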

Dr. Homa Ansari

Lead AI/ML scientist with 10+ years of experience in algorithm design for information extraction from multimodal unstructured data (image, time series, geospatial data). Experienced in innovative algorithm development with statistical signal processing, shallow and deep machine learning, and pre-trained Large Language Models (LLMs), applied to radar satellite imagery and niche medical sensors. Recipient of innovation awards from the German Aerospace Center (DLR) as well as IEEE for designing algorithms and data products tailored to spaceborne data.