Join PyData Yerevan July 2024 for a talk on “How to Build an LLM for Math Reasoning without Proprietary Data?” featuring Ivan Moshkov, Deep Learning Engineer at NVIDIA, and Daria Gitman, Conversational AI Research Intern at NVIDIA.
Recent research has shown the value of synthetically generated datasets in training #LLMs to acquire targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA and MAmmoTH rely on outputs from closed-source LLMs that have commercially restrictive licenses. One key reason limiting the use of open-source LLMs in data generation pipelines is the gap in the mathematical skills between the best closed-source LLMs, such as GPT-4, and the best open-source LLMs.
In their research, Ivan and Daria constructed OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs, built by combining recent progress in open-source LLMs, a novel prompting approach they proposed, and brute-force scaling. Their best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, scores 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best GPT-distilled models.
During the talk, Ivan will introduce the challenge of math reasoning in Natural Language Processing and discuss the process of creating their synthetic dataset. Following this, Daria will explore the Data Explorer tool and share key insights extracted from the data using this tool.
Save the date: the talk takes place on July 18 at 19:00 at the PMI Science R&D Center in Armenia (Teryan 105, 13th building).
Register here: https://forms.gle/FSqcmpyf5nJgCjtA8