Automating Categorization of Curriculum Content with Open Source Machine Learning

May 21, 2018 2:00 PM – 3:30 PM

Pedro Teixeira, Vanderbilt University
Scott Drake, ScholarRx
Tao Le, ScholarRx

Combining sets of content or curricula requires significant time and effort. Content must be categorized comprehensively within a consistent system, but doing so manually is laborious, and the results can vary significantly. Manual approaches also yield only limited, qualitative lists of applicable categories, with no quantitative information.

In this presentation, we will review the process for preparing inputs and establishing categories, classifier performance, and potential applications (e.g., merging curriculum content, curriculum curation, categorizing learners’ notes). We will demonstrate real-time categorization of a set of 10,000 flashcards using TensorFlow and Gensim, open source tools for machine learning and topic modeling. We also plan to demonstrate categorization with other types of curriculum content including PDFs, PowerPoints, and assessments.

Our initial approach focuses on a manually categorized set of 12,751 flashcards within 16 major categories. The resulting pipeline and classifier can rapidly process text, calculate topic vectors, and assign categories. Assignment is weighted and can include multiple categories, each with a ranking score based on relevance. Automated categorization can highlight items that are misclassified, and it can be applied to new sets of content to rapidly classify them according to the reference categories. This makes it possible to incorporate new content, including entire curricula.
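The idea of weighted, ranked category assignment can be illustrated with a minimal sketch. This is not the presenters' TensorFlow and Gensim pipeline: it assumes plain word-count vectors in place of topic vectors and cosine similarity to per-category reference vectors in place of a trained classifier, using only the Python standard library. The sample cards, the category names, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_centroids(labeled_cards):
    """Sum token counts per category to form one reference vector each."""
    centroids = defaultdict(Counter)
    for text, category in labeled_cards:
        centroids[category].update(tokenize(text))
    return dict(centroids)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(text, centroids, top_n=3):
    """Return up to top_n (category, score) pairs, most relevant first."""
    vector = Counter(tokenize(text))
    scores = [(cat, cosine(vector, ref)) for cat, ref in centroids.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


def flag_possible_misclassifications(labeled_cards, centroids):
    """List cards whose manual label differs from the top-ranked category."""
    flagged = []
    for text, category in labeled_cards:
        best_category, _ = rank_categories(text, centroids, top_n=1)[0]
        if best_category != category:
            flagged.append((text, category, best_category))
    return flagged


# Hypothetical reference cards; a real set would be far larger.
cards = [
    ("The left ventricle pumps blood into the aorta", "Cardiovascular"),
    ("Atrial fibrillation causes an irregular heart rhythm", "Cardiovascular"),
    ("The nephron filters blood in the kidney", "Renal"),
    ("Loop diuretics act on the kidney nephron loop of Henle", "Renal"),
]
centroids = build_centroids(cards)

# Rank categories for an uncategorized card.
ranking = rank_categories(
    "Which part of the kidney nephron reabsorbs sodium?", centroids
)

# Check a card whose manual label looks doubtful.
suspect = [("The kidney nephron filters blood", "Cardiovascular")]
flagged = flag_possible_misclassifications(suspect, centroids)
```

Here `ranking` holds every candidate category with its relevance score, so a card can be assigned to more than one category, and `flagged` lists cards whose manual label disagrees with the top-ranked category. In a fuller system, the count vectors would be replaced by learned topic vectors and the similarity ranking by a trained classifier.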