Large Language Models (LLMs) are a transformative technology, enabling a wide array of applications, from content generation to interactive chatbots, and they form the foundation of LLM-powered applications. A wide variety of LLMs is available, and the LLM community provides independent, generic evaluations of their performance. However, the requirements and domain specificity of the use cases behind LLM applications render this generic evaluation insufficient for revealing their performance issues. Use-case-specific performance evaluation therefore becomes a necessary component of the design and continuous development of LLM applications.

In this talk, we address the need for use-case-specific evaluation of LLM applications by proposing a workflow for creating evaluation models that support the selection and optimization of LLM-application designs. The workflow comprises three main activities:

1) Human-expert evaluation of LLM applications and benchmark dataset curation
2) Creation of evaluation agents
3) Alignment of the evaluation agents with human evaluation, based on the curated dataset

It leads to two concrete outcomes:

1) Curated benchmark dataset: the dataset against which the LLM applications will be tested.
2) Evaluation agent: the scoring model that automatically evaluates the responses of the LLM applications.

The talk will elaborate on the workflow, its limitations, and best practices for increasing the reliability of the evaluations in light of those limitations.
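To make the alignment activity concrete, the sketch below shows one possible shape of the workflow in code: a curated benchmark item carries a human-expert score, an evaluation agent (represented here by a stand-in judge function) scores the same responses, and simple agreement statistics quantify how closely the agent tracks the human evaluation. All names, the data structures, and the toy judge are illustrative assumptions, not material from the talk; in practice the judge would prompt an LLM with a use-case-specific scoring rubric.

```python
"""Minimal sketch (assumed, not from the talk): aligning an evaluation
agent with human-expert scores from a curated benchmark dataset."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    """One curated benchmark entry: a prompt, the LLM application's
    response, and the score assigned by a human expert (e.g. 1-5)."""
    prompt: str
    response: str
    human_score: int


def score_with_agent(
    items: List[BenchmarkItem],
    judge: Callable[[str, str], int],
) -> List[int]:
    """Run the evaluation agent (the `judge` callable) over every
    benchmark item and collect its scores."""
    return [judge(item.prompt, item.response) for item in items]


def alignment_report(items: List[BenchmarkItem], agent_scores: List[int]) -> dict:
    """Compare agent scores with human scores: exact-agreement rate and
    mean absolute error are two simple alignment signals."""
    n = len(items)
    exact = sum(1 for item, s in zip(items, agent_scores) if item.human_score == s)
    mae = sum(abs(item.human_score - s) for item, s in zip(items, agent_scores)) / n
    return {"exact_agreement": exact / n, "mean_abs_error": mae}


if __name__ == "__main__":
    # Toy benchmark and a trivial rule-based judge standing in for an
    # LLM-as-judge call.
    benchmark = [
        BenchmarkItem("Summarize the ticket", "Short, accurate summary.", 5),
        BenchmarkItem("Summarize the ticket", "Off-topic rambling.", 1),
    ]

    def toy_judge(prompt: str, response: str) -> int:
        return 5 if "accurate" in response else 2

    scores = score_with_agent(benchmark, toy_judge)
    print(alignment_report(benchmark, scores))
```

If the agreement statistics fall below an acceptable threshold, the agent's prompt or scoring rubric is revised and re-evaluated against the same curated dataset, which is what makes the benchmark a stable reference point for the iterative alignment step.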