[Note] AST x Industry TechConnect 2026 Seminar on "Typhoon – Thai LLM" @ICT MU

Links collected from this workshop. Many thanks to the Faculty of ICT for organizing it, free of charge and open to the general public.

News coverage: https://news.ict.mahidol.ac.th/muict-ast-x-industry-techconnect_typhoon-thai-llm/

Link to the seminar topics in this series: https://www.ict.mahidol.ac.th/en/event/techconnect2026/

- https://neurips.cc/ -- a good place to keep up with AI trends through the research published there

- https://opentyphoon.ai/

- https://sommai.wangchan.ai/auth/login

- https://4corners.visai.ai/

- https://playground.thaillm.or.th/chat/

- https://playground.opentyphoon.ai/


I just learned that the Thai LLM models built by the Typhoon team are free to use. They are open source, and you can request API access and start building on them right away.


The summary below was generated with the help of NotebookLM. Some words may be wrong; it has not been edited.

Thai AI Ecosystem Development: Research, Open-Source Models, and Linguistic Optimization

Executive Summary

This document synthesizes key insights regarding the strategic development of Thai-centric artificial intelligence. The central initiative focuses on bridging the gap between theoretical research and practical application by creating high-performance, open-source models tailored to the Thai language and culture. Key developments include the Typhoon (LLM), Faiphun (OCR/Vision), and Speech (ASR) series. The primary objective is to provide a robust infrastructure for Thai developers and organizations, ensuring that AI tools are not only accurate but also accessible through hardware optimization (CPU-runnable models) and culturally relevant data integration. Critical takeaways include the shift toward small, efficient models, the necessity of rigorous Thai-specific benchmarking, and the ongoing expansion into regional dialects like Isan.

--------------------------------------------------------------------------------

1. Strategic Vision and Development Philosophy

The project operates on the principle that Thailand must not merely consume global AI but become a creator of its own foundational technology.

  • Open-Source Commitment: The initiative prioritizes releasing models to the public. This strategy aims to foster a community where developers can identify weaknesses, improve performance, and build specialized applications.
  • From Research to Application: The workflow distinguishes between "Deep Research" (publishing papers to contribute to global knowledge) and "Applied Research" (transforming theoretical methods into usable models like those found on Hugging Face).
  • Technological Independence: By developing localized models, the ecosystem reduces reliance on expensive international APIs. This is particularly important for preserving cultural nuance (referred to as "Thainess"), which global models often miss.

--------------------------------------------------------------------------------

2. Core Model Ecosystem

The development is categorized into three specialized branches: Large Language Models (LLM), Optical Character Recognition (OCR)/Vision, and Automatic Speech Recognition (ASR).

2.1 Typhoon (LLM Series)

  • Purpose: A Thai-language Large Language Model focused on instruction following and reasoning.
  • Development History: Evolved through several iterations, including versions 2.5 and 3.
  • Key Feature: Integrates research from Thai professors and international experts to improve linguistic accuracy. It is designed to be a "General Infrastructure" for Thai AI applications.

2.2 Faiphun (OCR and Vision)

  • Purpose: To digitize Thai documents and interpret visual data.
  • Capabilities:
    • Structure Recognition: Handling complex document layouts, such as tables in PDF or Excel formats where standard models often fail.
    • Versatility: Capable of processing various document types, including university notes, forms, and medical records.
  • Strategy: Moves beyond simple text extraction to "Vision" capabilities, allowing the model to understand the context of images and document hierarchies.

2.3 Speech (ASR/Speech-to-Text)

  • Purpose: High-speed, real-time conversion of Thai speech into text.
  • Technical Goals:
    • Low Latency: Optimized for real-time use cases.
    • Hardware Efficiency: A major focus is making models small enough (e.g., 1.5 billion parameters) to run on CPUs rather than requiring expensive GPUs.
    • Accuracy: Prioritizes "Character Accuracy" and "Keyword Accuracy" to ensure critical data (like numbers and names) are transcribed correctly.
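
The two accuracy notions above can be made concrete with a small sketch. The source does not say how the team computes them, so the definitions here are assumptions: character accuracy is taken as 1 minus the character error rate (edit distance divided by reference length), and keyword accuracy as the share of expected keywords that survive in the transcript. The function names and the sample sentence are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed definition: 1 - CER, clamped so it never goes below 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def keyword_accuracy(ref: str, hyp: str, keywords: list[str]) -> float:
    """Share of keywords present in the reference that also appear in the hypothesis.

    Substring matching is used because Thai text has no spaces between words.
    """
    expected = [k for k in keywords if k in ref]
    if not expected:
        return 1.0
    return sum(1 for k in expected if k in hyp) / len(expected)


# Made-up example: one dropped digit barely moves character accuracy
# but halves keyword accuracy.
reference = "take 500 mg twice daily"
transcript = "take 50 mg twice daily"
char_acc = character_accuracy(reference, transcript)
kw_acc = keyword_accuracy(reference, transcript, ["500 mg", "twice"])
```

The example shows why the two numbers are tracked separately: a single-character slip on a number leaves the character-level score high while the keyword-level score drops sharply.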

--------------------------------------------------------------------------------

3. Technical Challenges in the Thai Context

Developing AI for Thai presents unique linguistic and infrastructural hurdles that global models fail to address effectively.

| Challenge | Description |
| --- | --- |
| Linguistic Complexity | Thai has no word delimiters (spaces), uses complex tone marks, and contains "silent" characters (Karan) that complicate tokenization. |
| Dialect Variance | Regional dialects like Isan lack a standardized written form, making data collection and transcription difficult. |
| Data Gaps | While there are approximately 10,000 hours of general Thai speech data, specialized data for dialects or specific domains is significantly scarcer. |
| Hardware Constraints | Heavy reliance on GPU/EC2 instances for scaling is costly; there is a strategic push toward CPU-optimized inference. |
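
To illustrate why the lack of word delimiters complicates tokenization, here is a minimal sketch of dictionary-based longest-match segmentation. This is a classic textbook approach, not the tokenizer used by the models discussed here (the source does not describe one), and the tiny vocabularies are hypothetical.

```python
def segment_longest_match(text: str, vocab: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: always take the longest vocab entry."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                break
        else:
            piece = text[i]  # nothing matched: emit the single character
        tokens.append(piece)
        i += len(piece)
    return tokens


# Unambiguous case: "I eat rice"
simple = segment_longest_match("ฉันกินข้าว", {"ฉัน", "กิน", "ข้าว"})

# Ambiguous case: "ตากลม" can be read as "ตา กลม" (round eyes)
# or "ตาก ลม" (exposed to the wind). Greedy matching commits to "ตาก".
ambiguous = segment_longest_match("ตากลม", {"ตา", "กลม", "ตาก", "ลม"})
```

The second call returns the "ตาก ลม" reading regardless of context, which is the kind of error that a purely rule-based splitter cannot avoid and one reason Thai-specific models and benchmarks matter.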

3.1 Special Case: Isan Language Development

The project has developed a systematic approach to the Isan dialect:

  • Standardization: Because Isan is primarily a spoken language, the team created a "Spelling Standard" using phonetic logic to train models.
  • Tone Mapping: Research identified that Isan uses different tonal structures compared to Central Thai, requiring specialized "Tone Note" logic for accurate ASR.

--------------------------------------------------------------------------------

4. Evaluation and Validation Methodology

A central theme of the development is the rejection of "blind" model training in favor of rigorous, goal-oriented evaluation.

  • Benchmarking: Models are benchmarked against global standards (e.g., OpenAI models) to ensure competitive performance.
  • Error Analysis: The team performs deep manual error analysis on transcripts, examining samples for mistakes in numbers, names, or formatting.
  • Validation Framework:
    • Accuracy vs. Value: High accuracy (e.g., 95%) is irrelevant if the 5% error occurs on critical keywords like medical dosages or legal terms.
    • Custom Metrics: Developers are encouraged to define their own success metrics based on their specific domain (e.g., Health, Law, or Finance).
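
The "Accuracy vs. Value" point can be sketched as a validation check that reports overall token accuracy but passes a transcript only when every critical term is preserved. This is an illustrative design under stated assumptions, not the team's framework: the class and function names, the pass rule, and the sample sentence are all hypothetical, and tokens are assumed to be already aligned position by position.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass
class ValidationResult:
    token_accuracy: float
    missed_critical: list[str]

    @property
    def passed(self) -> bool:
        # Assumed rule: any damaged critical term fails the transcript.
        return not self.missed_critical


def validate(ref_tokens: list[str], hyp_tokens: list[str],
             critical_terms: set[str]) -> ValidationResult:
    """Compare aligned token sequences and flag damaged critical terms."""
    pairs = [(r, h) for r, h in zip_longest(ref_tokens, hyp_tokens)
             if r is not None]
    matches = sum(1 for r, h in pairs if r == h)
    accuracy = matches / max(len(ref_tokens), 1)
    missed = [r for r, h in pairs if r != h and r in critical_terms]
    return ValidationResult(accuracy, missed)


# Made-up 20-token example: one wrong token, and it is the dosage.
reference = ("the patient should take 5 mg of warfarin once daily "
             "in the evening with food and return in two weeks").split()
transcript = list(reference)
transcript[4] = "50"

result = validate(reference, transcript, {"5", "warfarin", "daily"})
```

Here `result.token_accuracy` is 0.95 while `result.passed` is false, which mirrors the claim that a 95% score says little when the remaining 5% falls on a dosage. A legal or financial team would swap in its own set of critical terms.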

--------------------------------------------------------------------------------

5. Domain-Specific Applications and Future Outlook

The source highlights several emerging use cases for these models:

  • Financial & Legal: Integration with organizations like SCB for analyzing market data and legal documents (e.g., Criminal and Commercial Codes).
  • Medical/Healthcare: Utilizing OCR and ASR to digitize patient records and transcribe doctor-patient interactions, though this requires high-precision "Validation" to ensure safety.
  • Public Accessibility: By hosting models on Hugging Face and providing "Playgrounds," the project lowers the barrier to entry for Thai startups and researchers.

Key Quote on Innovation

"If we don't start running [in AI development], they [global competitors] will ask us to run slower. We must create an open-source environment so that everyone can participate in building the foundation."

--------------------------------------------------------------------------------

6. Practical Implementation Advice for Developers

  1. Start with the Goal: Define the "Good Result" (the benchmark) before selecting a model.
  2. Small is Better: For production, prioritize models that can run on standard CPUs to manage costs and thermal limits.
  3. Feedback Loop: Use community feedback from platforms like Hugging Face or Discord to fine-tune models for specific vocabulary or "slang" that the general model might miss.
  4. Digitization First: Use Faiphun (OCR) to convert existing physical or PDF archives into a digital format before applying LLM reasoning.
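
Points 1 and 2 above can be combined into a small selection routine: fix the target score first, then pick the smallest candidate that reaches it and estimate how much memory its weights need. Everything in this sketch is an assumption for illustration: the candidate names and scores are placeholders rather than measurements, and the memory estimate covers weights only (parameters times bytes per parameter), ignoring activations and runtime overhead.

```python
from dataclasses import dataclass
from typing import Optional

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical label, not a real model id
    params_billion: float
    score: float           # result on your own benchmark, higher is better


def weights_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


def pick_smallest_passing(candidates: list[Candidate],
                          target_score: float) -> Optional[Candidate]:
    """Return the smallest model that meets the target, or None."""
    passing = [c for c in candidates if c.score >= target_score]
    return min(passing, key=lambda c: c.params_billion) if passing else None


# Placeholder scores for illustration only.
candidates = [
    Candidate("model-small", 1.5, 0.91),
    Candidate("model-medium", 7.0, 0.94),
    Candidate("model-large", 70.0, 0.96),
]

choice = pick_smallest_passing(candidates, target_score=0.90)
footprint_fp16 = weights_gb(choice.params_billion, "fp16")
footprint_int4 = weights_gb(choice.params_billion, "int4")
```

With these placeholder numbers the 1.5-billion-parameter candidate is selected, needing roughly 3 GB of weights at fp16 or 0.75 GB at int4, which is why a model of that size is a realistic fit for CPU-only machines.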
