Large Language Models (LLMs) are a transformative technology, enabling a wide array of applications, from content generation to interactive chatbots, and they form the foundation of LLM-powered applications. A wide variety of LLMs is available, and the LLM community provides independent, generic evaluations of their performance. However, the requirements and domain specificity of the use cases behind LLM applications render this generic evaluation insufficient for revealing their performance issues. Use-case-specific performance evaluation therefore becomes a necessary component of the design and continuous development of LLM applications.

In this talk, we address the need for use-case-specific evaluation of LLM applications by proposing a workflow for creating evaluation models that support the selection and optimization of LLM-application designs. The workflow comprises three main activities:

1) Human-expert evaluation of LLM applications and benchmark dataset curation
2) Creation of evaluation agents
3) Alignment of the evaluation agents with human evaluation, based on the curated dataset

It leads to two concrete outcomes:

1) Curated benchmark dataset: the dataset against which the LLM applications will be tested.
2) Evaluation agent: the scoring model that automatically evaluates the responses of the LLM applications.

The talk will elaborate on the workflow, its limitations, and best practices for increasing the reliability of the evaluations in light of those limitations.
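To make the alignment activity concrete, the sketch below shows one possible shape of the workflow in code: a curated benchmark item carries a human-expert score, an evaluation agent (represented here by a stand-in judge function) scores the same responses, and simple agreement statistics quantify how closely the agent tracks the human evaluation. All names, the data structures, and the toy judge are illustrative assumptions, not material from the talk; in practice the judge would prompt an LLM with a use-case-specific scoring rubric.

```python
"""Minimal sketch (assumed, not from the talk): aligning an evaluation
agent with human-expert scores from a curated benchmark dataset."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    """One curated benchmark entry: a prompt, the LLM application's
    response, and the score assigned by a human expert (e.g. 1-5)."""
    prompt: str
    response: str
    human_score: int


def score_with_agent(
    items: List[BenchmarkItem],
    judge: Callable[[str, str], int],
) -> List[int]:
    """Run the evaluation agent (the `judge` callable) over every
    benchmark item and collect its scores."""
    return [judge(item.prompt, item.response) for item in items]


def alignment_report(items: List[BenchmarkItem], agent_scores: List[int]) -> dict:
    """Compare agent scores with human scores: exact-agreement rate and
    mean absolute error are two simple alignment signals."""
    n = len(items)
    exact = sum(1 for item, s in zip(items, agent_scores) if item.human_score == s)
    mae = sum(abs(item.human_score - s) for item, s in zip(items, agent_scores)) / n
    return {"exact_agreement": exact / n, "mean_abs_error": mae}


if __name__ == "__main__":
    # Toy benchmark and a trivial rule-based judge standing in for an
    # LLM-as-judge call.
    benchmark = [
        BenchmarkItem("Summarize the ticket", "Short, accurate summary.", 5),
        BenchmarkItem("Summarize the ticket", "Off-topic rambling.", 1),
    ]

    def toy_judge(prompt: str, response: str) -> int:
        return 5 if "accurate" in response else 2

    scores = score_with_agent(benchmark, toy_judge)
    print(alignment_report(benchmark, scores))
```

If the agreement statistics fall below an acceptable threshold, the agent's prompt or scoring rubric is revised and re-evaluated against the same curated dataset, which is what makes the benchmark a stable reference point for the iterative alignment step.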