The increasing prevalence of online education for children underscores the need for intelligent systems that can recognize and respond to learners' emotional states in real time. Emotional states such as boredom, frustration, and confusion have been empirically linked to cognitive disengagement and poor learning outcomes, particularly in younger learners. However, existing educational platforms lack reliable mechanisms for detecting such states and initiating timely pedagogical interventions. This study proposes a multimodal emotion recognition and intervention framework tailored to children's online learning environments. The system integrates facial expression analysis, speech emotion recognition, and behavioral signal tracking to detect six core emotional states commonly observed in child learners. A hierarchical fusion architecture, combining CNN-LSTM visual encoders with Transformer-based cross-modal attention modules, enables robust emotion classification even under modality loss or environmental noise. In parallel, a rule-enhanced policy engine maps detected emotions to personalized intervention strategies, including task scaffolding, verbal encouragement, and content pacing adjustments. Evaluated on a newly curated multimodal dataset of primary school students engaged in online learning tasks, the framework outperforms unimodal and early-fusion baselines across multiple metrics. In real-world deployment trials, the system significantly improves learners' task completion rates and emotional stability. Ablation studies and statistical significance tests confirm the contribution of each modality and of the fusion mechanism. The results suggest that incorporating multimodal affective computing into online learning platforms offers a promising pathway toward emotionally adaptive, child-centric digital education. This work contributes both a scalable technical solution and empirical evidence supporting the integration of emotional monitoring and intervention in intelligent tutoring systems.