This paper presents a novel application of multimodal large language models (LLMs) to enhance the learning and application of building energy modeling. The study leverages Retrieval-Augmented Generation (RAG) models integrated with a dataset of 59 publicly available YouTube video tutorials focused on EnergyPlus and OpenStudio. Unlike traditional LLM methods, this approach is unique in its use of three models to process and integrate different types of data: text, screenshots, and video references. The preprocessing phase uses Google's transcription service to convert video content into text, which is then summarized with the T5 model and embedded with the Instructor Embedder model. Meta's Llama 2 7B model handles user queries, extracting relevant information and providing detailed responses enriched with visual references, including exact video timestamps and screenshots. Unlike traditional LLMs, the proposed model thus delivers comprehensive responses that go beyond text, pointing users to reference videos at exact timestamps along with screenshots. Furthermore, the proposed web interface presents these enriched responses on a single page, significantly reducing the time users need to understand and apply complex energy-modeling concepts. This framework demonstrates the potential of multimodal LLMs in creating powerful educational tools for architects, engineers, and students. The entire workflow was completed on a laptop with a single 4 GB GPU, demonstrating the feasibility of implementing such a system on relatively modest hardware.
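To make the described pipeline concrete, the sketch below illustrates the preprocessing and retrieval steps: T5 summarization of transcripts, Instructor embeddings, and nearest-neighbor retrieval of the most relevant tutorial for a user query. The specific checkpoints (`t5-base`, `hkunlp/instructor-large`), the embedding instructions, and the cosine-similarity retrieval step are illustrative assumptions, not the paper's documented configuration.

```python
# A minimal sketch of the preprocessing/retrieval side of the pipeline.
# Checkpoints, instructions, and the similarity metric are assumptions
# for illustration; the abstract does not specify these details.

import numpy as np
from transformers import pipeline
from InstructorEmbedding import INSTRUCTOR

# 1. Summarize each video transcript with T5 (checkpoint assumed).
summarizer = pipeline("summarization", model="t5-base")

transcripts = {
    "video_01": "In this tutorial we create a new OpenStudio model ...",
    "video_02": "This video explains EnergyPlus schedules and setpoints ...",
}
summaries = {
    vid: summarizer(text, max_length=80, min_length=10)[0]["summary_text"]
    for vid, text in transcripts.items()
}

# 2. Embed the summaries with the Instructor embedder.
embedder = INSTRUCTOR("hkunlp/instructor-large")
doc_vecs = embedder.encode(
    [["Represent the tutorial summary for retrieval:", s] for s in summaries.values()]
)

# 3. Embed a user query and retrieve the closest tutorial by cosine similarity.
query = "How do I set up an HVAC schedule in OpenStudio?"
q_vec = embedder.encode(
    [["Represent the question for retrieving tutorials:", query]]
)[0]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {vid: cosine(q_vec, v) for vid, v in zip(summaries, doc_vecs)}
best = max(scores, key=scores.get)

# 4. In the full system, the retrieved summary plus its timestamps and
#    screenshots would be packed into the Llama 2 7B prompt; here we
#    simply report the retrieved context.
print(f"Best match: {best} (score={scores[best]:.3f})")
print("Context for Llama 2 7B:", summaries[best])
```

In a complete implementation, step 4 would format the retrieved context, video timestamp, and screenshot reference into the Llama 2 prompt so the generated answer carries the multimodal references described above.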