Senior software engineer at Qualia Labs · Co-founder of Fox.Build Makerspace · Former co-founder of FarmBot

Fine-tuning an LLM to grade language learning prompts

UPDATE: I don't think it is worth doing this in 2025, as the quality of foundation models has improved substantially and costs have gone down. This might still be a good idea at scale, but I don't operate at scale.

  • background
  • the task
  • data set, training costs: ~1,200 examples cost $1.41
    • Original GPT-4 prompt
    • Manually reviewed collected data.
    • Two orders of magnitude cheaper due to needing fewer hints and prompt tokens.
  • cost reduction
  • past attempts
  • the training data
  • Possible issues:
    • Since many of my cards (~1,200) exist in the initial training set, there may be confirmation bias and the model might see a reduction in quality as new cards are added.
    • The fine-tuned model is more permissive about minor grammatical issues, whereas GPT-4 often rejects grammar mistakes. This might be fixable with more fine-tuning, but would require labeling by a native speaker.
      • Still better than self-grading via Anki, since GPT-3.5 has a better understanding of Korean than I do.
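To make the training-data bullet above concrete: OpenAI's chat fine-tuning for GPT-3.5 expects a JSONL file with one chat transcript per line, each containing a `messages` array of system/user/assistant turns. The sketch below builds that format from (sentence, grade) pairs. The system prompt, Korean sentences, and PASS/FAIL labels are hypothetical illustrations, not the author's actual dataset.

```python
import json

# Hypothetical (learner sentence, grade) pairs. The real dataset
# would hold ~1,200 manually reviewed examples.
EXAMPLES = [
    ("나는 어제 학교에 갔어요.", "PASS"),
    ("나는 학교에 가요 어제.", "FAIL"),
]

SYSTEM_PROMPT = "Grade the learner's Korean sentence. Reply PASS or FAIL."

def to_jsonl_lines(examples):
    """Convert (sentence, grade) pairs into chat-format JSONL lines."""
    lines = []
    for sentence, grade in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": sentence},
                {"role": "assistant", "content": grade},
            ]
        }
        # ensure_ascii=False keeps Hangul readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines

lines = to_jsonl_lines(EXAMPLES)
```

Writing `"\n".join(lines)` to a `.jsonl` file produces an upload-ready training file; at this scale, token counts stay small, which is why the whole run cost on the order of a dollar.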