[Note] AST x Industry TechConnect 2026 Seminar on “Typhoon – Thai LLM” @ICT MU
Links collected from this workshop. Many thanks to the Faculty of ICT for organizing it free of charge and open to the general public.
Event coverage: https://news.ict.mahidol.ac.th/muict-ast-x-industry-techconnect_typhoon-thai-llm/
Seminar series page: https://www.ict.mahidol.ac.th/en/event/techconnect2026/
- https://neurips.cc/ -- a good place to follow AI trends through the research published there
- https://sommai.wangchan.ai/auth/login
- https://playground.thaillm.or.th/chat/
- https://playground.opentyphoon.ai/
I just learned that the Thai LLM models built by the Typhoon team are free to use: you can request API access and apply them in your own projects, since they are open source.
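Assuming the open API follows an OpenAI-style chat-completion interface (an assumption -- check the official Typhoon documentation for the real endpoint and model names), a minimal sketch of building a request payload might look like this:

```python
import json

def build_chat_request(model: str, user_message: str) -> bytes:
    """Build an OpenAI-style chat-completion payload as JSON bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")

# The model name below is hypothetical -- substitute the real one
# from the Typhoon playground/API documentation.
body = build_chat_request("typhoon-v2.5-instruct", "สวัสดีครับ แนะนำตัวหน่อย")
```

The payload would then be POSTed with an API key obtained through the request process mentioned above.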
This summary was made with help from NotebookLM; some words may be transcribed incorrectly and it has not been edited.
Thai AI Ecosystem Development: Research, Open-Source Models, and Linguistic Optimization
Executive Summary
This document synthesizes key insights regarding the strategic development of Thai-centric artificial intelligence. The central initiative focuses on bridging the gap between theoretical research and practical application by creating high-performance, open-source models tailored to the Thai language and culture. Key developments include the Typhoon (LLM), Faiphun (OCR/Vision), and Speech (ASR) series. The primary objective is to provide a robust infrastructure for Thai developers and organizations, ensuring that AI tools are not only accurate but also accessible through hardware optimization (CPU-runnable models) and culturally relevant data integration. Critical takeaways include the shift toward small, efficient models, the necessity of rigorous Thai-specific benchmarking, and the ongoing expansion into regional dialects like Isan.
--------------------------------------------------------------------------------
1. Strategic Vision and Development Philosophy
The project operates on the principle that Thailand must not merely consume global AI but become a creator of its own foundational technology.
- Open-Source Commitment: The initiative prioritizes releasing models to the public. This strategy aims to foster a community where developers can identify weaknesses, improve performance, and build specialized applications.
- From Research to Application: The workflow distinguishes between "Deep Research" (publishing papers to contribute to global knowledge) and "Applied Research" (transforming theoretical methods into usable models like those found on Hugging Face).
- Technological Independence: By developing localized models, the ecosystem reduces reliance on expensive international APIs. This is particularly crucial for maintaining cultural nuance—referred to as "Thainess"—which global models often miss.
--------------------------------------------------------------------------------
2. Core Model Ecosystem
The development is categorized into three specialized branches: Large Language Models (LLM), Optical Character Recognition (OCR)/Vision, and Automatic Speech Recognition (ASR).
2.1 Typhoon (LLM Series)
- Purpose: A Thai-language Large Language Model focused on instruction following and reasoning.
- Development History: Evolved through several iterations, including versions 2.5 and 3.
- Key Feature: Integrates research from Thai professors and international experts to improve linguistic accuracy. It is designed to be a "General Infrastructure" for Thai AI applications.
2.2 Faiphun (OCR and Vision)
- Purpose: To digitize Thai documents and interpret visual data.
- Capabilities:
- Structure Recognition: Handling complex document layouts, such as tables in PDF or Excel formats where standard models often fail.
- Versatility: Capable of processing various document types, including university notes, forms, and medical records.
- Strategy: Moves beyond simple text extraction to "Vision" capabilities, allowing the model to understand the context of images and document hierarchies.
2.3 Speech (ASR/Speech-to-Text)
- Purpose: High-speed, real-time conversion of Thai speech into text.
- Technical Goals:
- Low Latency: Optimized for real-time use cases.
- Hardware Efficiency: A major focus is making models small enough (e.g., 1.5 billion parameters) to run on CPUs rather than requiring expensive GPUs.
- Accuracy: Prioritizes "Character Accuracy" and "Keyword Accuracy" to ensure critical data (like numbers and names) are transcribed correctly.
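The accuracy goals above can be made concrete. Below is a minimal sketch (not the team's actual metrics) of character accuracy computed from edit distance, plus a simple keyword-hit rate:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - CER: fraction of reference characters transcribed correctly."""
    if not reference:
        return 1.0
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)

def keyword_accuracy(keywords, hypothesis: str) -> float:
    """Fraction of critical keywords (numbers, names) present in the output."""
    if not keywords:
        return 1.0
    return sum(1 for kw in keywords if kw in hypothesis) / len(keywords)
```

Keyword accuracy is deliberately stricter than character accuracy: a transcript can score 95% on characters yet still miss the one number that matters.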
--------------------------------------------------------------------------------
3. Technical Challenges in the Thai Context
Developing AI for Thai presents unique linguistic and infrastructural hurdles that global models fail to address effectively.
| Challenge | Description |
| --- | --- |
| Linguistic Complexity | Thai has no word delimiters (spaces), uses complex tone marks, and contains "silent" characters (Karan) that complicate tokenization. |
| Dialect Variance | Regional dialects like Isan lack a standardized written form, making data collection and transcription difficult. |
| Data Gaps | While there are approximately 10,000 hours of general Thai speech data, specialized data for dialects or specific domains is significantly scarcer. |
| Hardware Constraints | Heavy reliance on GPU/EC2 instances for scaling is costly; there is a strategic push toward CPU-optimized inference. |
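The "no word delimiters" problem can be seen in a toy greedy longest-match segmenter. The tiny dictionary below is purely illustrative; real tokenizers (e.g. PyThaiNLP) use large lexicons and statistical models:

```python
# Toy lexicon for illustration only -- real Thai tokenizers use
# tens of thousands of entries plus statistical disambiguation.
LEXICON = {"ตาก", "ตา", "กลม", "ลม", "อากาศ"}

def longest_match_tokenize(text: str, lexicon=LEXICON) -> list[str]:
    """Greedy longest-match segmentation for delimiter-free text."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # unknown character: emit as-is
            tokens.append(text[i])
            i += 1
    return tokens
```

Running this on "ตากลม" yields ["ตาก", "ลม"] ("dry in the wind"), yet the string is genuinely ambiguous with ["ตา", "กลม"] ("round eyes") -- exactly the kind of ambiguity that makes Thai tokenization hard.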
3.1 Special Case: Isan Language Development
The project has developed a systematic approach to the Isan dialect:
- Standardization: Because Isan is primarily a spoken language, the team created a "Spelling Standard" using phonetic logic to train models.
- Tone Mapping: Research identified that Isan uses different tonal structures compared to Central Thai, requiring specialized "Tone Note" logic for accurate ASR.
--------------------------------------------------------------------------------
4. Evaluation and Validation Methodology
A central theme of the development is the rejection of "blind" model training in favor of rigorous, goal-oriented evaluation.
- Benchmarking: Models are benchmarked against global standards (e.g., OpenAI models) to ensure competitive performance.
- Error Analysis: The team performs deep manual error analysis on transcripts—examining samples for mistakes in numbers, names, or formatting.
- Validation Framework:
- Accuracy vs. Value: High overall accuracy (e.g., 95%) can be misleading if the remaining 5% of errors fall on critical keywords such as medical dosages or legal terms.
- Custom Metrics: Developers are encouraged to define their own success metrics based on their specific domain (e.g., Health, Law, or Finance).
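One way to encode the "Accuracy vs. Value" idea as a custom metric is to weight critical tokens more heavily. This is a hypothetical metric for illustration, not one presented in the talk; it also assumes pre-aligned token sequences of equal length:

```python
def weighted_transcript_score(reference_tokens, hypothesis_tokens,
                              critical, critical_weight=10.0):
    """Position-wise match score where critical tokens (dosages,
    names, amounts) carry extra weight. Hypothetical metric."""
    total = score = 0.0
    # Assumes the two token sequences are already aligned 1:1.
    for ref, hyp in zip(reference_tokens, hypothesis_tokens):
        weight = critical_weight if ref in critical else 1.0
        total += weight
        if ref == hyp:
            score += weight
    return score / total if total else 1.0

ref = ["take", "500", "mg"]
hyp = ["take", "50", "mg"]   # one wrong token, but it is the dosage
```

Here plain token accuracy is 2/3, while the weighted score collapses to 2/12 -- a single dosage error dominates, matching the point above about 95% accuracy being misleading.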
--------------------------------------------------------------------------------
5. Domain-Specific Applications and Future Outlook
The source highlights several emerging use cases for these models:
- Financial & Legal: Integration with organizations like SCB for analyzing market data and legal documents (e.g., Criminal and Commercial Codes).
- Medical/Healthcare: Utilizing OCR and ASR to digitize patient records and transcribe doctor-patient interactions, though this requires high-precision "Validation" to ensure safety.
- Public Accessibility: By hosting models on Hugging Face and providing "Playgrounds," the project lowers the barrier to entry for Thai startups and researchers.
Key Quote on Innovation
"If we don't start running [in AI development], they [global competitors] will ask us to run slower. We must create an open-source environment so that everyone can participate in building the foundation."
--------------------------------------------------------------------------------
6. Practical Implementation Advice for Developers
- Start with the Goal: Define the "Good Result" (the benchmark) before selecting a model.
- Small is Better: For production, prioritize models that can run on standard CPUs to manage costs and thermal limits.
- Feedback Loop: Use community feedback from platforms like Hugging Face or Discord to fine-tune models for specific vocabulary or "slang" that the general model might miss.
- Digitization First: Use Faiphun (OCR) to convert existing physical or PDF archives into a digital format before applying LLM reasoning.
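To see why "small is better" for CPU deployment, here is a back-of-envelope estimate of weight memory for the ~1.5 billion parameter figure mentioned in section 2.3, at common precisions (weights only -- KV cache and activations are ignored):

```python
def model_ram_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB (weights only)."""
    return n_params * bytes_per_param / 2**30

N = 1.5e9  # parameter count cited for the small models in section 2.3
for label, bpp in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: ~{model_ram_gib(N, bpp):.2f} GiB")
```

That works out to roughly 5.59, 2.79, 1.40, and 0.70 GiB respectively: at int8 or int4 a 1.5B-parameter model fits comfortably in commodity laptop RAM, which is what makes CPU inference plausible.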