Benchmarking Large Language Models for AAVE and SAE Text Generation
Research Overview
Our research examines how state-of-the-art AI language models perform when generating text in African American Vernacular English (AAVE) and Standard American English (SAE). We benchmarked six leading models—GPT-4, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA 3.1, Qwen, and GPT-o1—to assess their ability to maintain semantic consistency, lexical similarity, and sentiment alignment when generating dialect-specific text.
Using a structured evaluation framework, we analyzed model-generated continuations of AAVE and SAE prompts, measuring text overlap with BLEU and ROUGE scores, semantic consistency with cosine similarity, and alignment with the original text's sentiment through sentiment distribution analysis. The study offers insight into how these models handle linguistic diversity and dialect preservation, and highlights areas for improvement in bias mitigation, sentiment fidelity, and language representation.