Stable Baselines 3 Tutorial (Computerized Adaptive Testing)

5 minute read

Figure 1: Overview of our MDP

The goal of this blog is to present a tutorial on Stable Baselines 3, a popular Reinforcement Learning library, with a focus on implementing a custom environment and a custom policy. We will first describe our problem statement, then formulate the MDP (Markov Decision Process), and finally discuss three algorithm variants: standard PPO, PPO with a custom feature extractor, and PPO with a custom policy (an LSTM bilinear policy).

Problem Statement

In the realm of computer science education, evaluating coding homework poses unique challenges that traditional methods struggle to address. Auto-grading systems, which automatically assess correctness and functionality using test cases, offer a valuable solution. However, as the complexity of coding assignments grows, the sheer volume of test cases can hamper system efficiency. Inspired by Computerized Adaptive Testing (CAT), which optimizes assessments based on prior responses, we aim to develop a policy that selects a minimal yet effective set of test cases, ensuring swift and accurate evaluations for both students and instructors.

In Computerized Adaptive Testing (CAT), there are four essential components:

  1. Knowledge Level Estimator: Assesses the student’s current knowledge based on their responses to previously selected items.
  2. Response Model: Estimates the likelihood of a student answering a specific item correctly using the current knowledge level and item features.
  3. Pool of Available Items: A collection of test items from which the adaptive system selects questions.
  4. Item Selection Algorithm: Chooses the next most informative item based on the response model output.

The widely used Item Response Theory (IRT) model, in its simplest form (1PL), is often employed as the response model. The 1PL model is defined as:

\[P(Y_{i, j} = 1) = \sigma(\theta_j - b_i)\]

Here, \(\theta_j\) represents the student’s knowledge level, and \(b_i\) is the difficulty of item \(i\).
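
For concreteness, here is a tiny sketch of the 1PL probability in Python; the knowledge level and difficulty values below are arbitrary examples.

```python
import numpy as np

def p_correct(theta_j: float, b_i: float) -> float:
    """1PL (Rasch) probability that student j answers item i correctly."""
    return 1.0 / (1.0 + np.exp(-(theta_j - b_i)))

# A student whose knowledge level sits 0.5 above the item's difficulty
print(p_correct(theta_j=0.5, b_i=0.0))  # ~0.62
```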

MDP Formulation

Notation

We denote the programming question statement as \(q\); the solution (correct) code to the question as \(a\); student \(j\)'s code as \(c_j\); student \(j\)'s ground-truth code quality (knowledge level) as \(\theta_j\); student \(j\)'s current code-quality estimate after selecting the \(k\)th test case as \(\hat{\theta}_{j,k}\); the \(K\) test cases as \(t_1, t_2, \ldots, t_K\) and each test case's features as \(d_k\); and the ground-truth indicator of whether student \(j\)'s code passes test case \(k\) as \(Y_{k,j}\).

MDP Setting

In our MDP setting, the environment contains two models: (1) a large language model (\(\mathrm{LLM}_e\)) and (2) an IRT-based model (IRT). The policy also contains two models: (1) a large language model (\(\mathrm{LLM}_p\)) and (2) a time-distributed Long Short-Term Memory (LSTM) model. During training, we use the same frozen LLM for \(\mathrm{LLM}_e\) and \(\mathrm{LLM}_p\) for simplicity. Figure 1 shows our MDP.

State Definition

We represent our state as the difference between the LLM embeddings of the question statement, the solution code, and the test cases recommended so far and the LLM embeddings of the question statement, the student code, and the test cases recommended so far: \(S_k = \mathrm{LLM}_e(q, a, t_{1..k}) - \mathrm{LLM}_e(q, c_j, t_{1..k})\).
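
As a rough sketch, the state could be computed as follows, assuming a hypothetical frozen encoder llm_e that maps a list of text segments to a single embedding vector (this is an illustration, not the exact implementation):

```python
import torch as th

def state_embedding(llm_e, question: str, solution_code: str,
                    student_code: str, selected_tests: list) -> th.Tensor:
    """S_k = LLM_e(q, a, t_1..k) - LLM_e(q, c_j, t_1..k)."""
    with th.no_grad():  # the LLM is frozen
        solution_side = llm_e([question, solution_code, *selected_tests])
        student_side = llm_e([question, student_code, *selected_tests])
    return solution_side - student_side
```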

Action Definition

Initially, we obtain a matrix \(Q\) containing the \(\mathrm{LLM}_p\) embeddings of all the test cases \(t_1, \ldots, t_K\), which is used for selecting the action. At every time step \(k\), we first calculate the current hidden state \(h_k\) using the LSTM with the state \(S_k\) as input. Next, we project \(h_k\) onto the space of test-case embeddings \(Q\) using a bilinear projection matrix \(W\) to obtain the logits \(L'\). Finally, we apply the softmax function to \(L'\) to obtain a distribution over the test cases and choose the test case with the highest probability, which is \(A_k\). A minimal code sketch of this projection follows the list below.

  • \(Q = \{\mathrm{LLM}_p(t_k)\}_{k \in \{1, \ldots, K\}}\)
  • \(h_k = \mathrm{LSTM}(S_k)\)
  • \(L' = h_k W Q^T\)
  • \(A_k = \arg\max(L')\)
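
Here is a minimal, self-contained sketch of this projection with random tensors; the embedding size, hidden size, and number of test cases are arbitrary assumptions.

```python
import torch as th
import torch.nn as nn

embed_dim, hidden_dim, K = 768, 128, 20            # assumed sizes
Q = th.randn(K, embed_dim)                         # Q = {LLM_p(t_k)}: frozen test-case embeddings
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
W = th.randn(hidden_dim, embed_dim, requires_grad=True)  # learnable bilinear projection matrix

S = th.randn(1, 5, embed_dim)                      # states S_1..S_k observed so far (batch of 1)
_, (h_n, _) = lstm(S)                              # h_k = LSTM(S_k)
logits = h_n[-1] @ W @ Q.T                         # L' = h_k W Q^T, shape (1, K)
probs = th.softmax(logits, dim=-1)                 # distribution over the test cases
A_k = th.argmax(probs, dim=-1)                     # A_k = argmax(L')
```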

Reward Definition

Initially, we apply the IRT model to learn the features of all test cases and all students' ground-truth code qualities by maximizing \(P(Y_{k,j} \mid \theta_j, d_k)\); these quantities are used to calculate the reward.

If the selected test case \(k\) has not been selected before, we use all the selected test cases to update student \(j\)'s current code-quality estimate \(\hat{\theta}_{j,k}\) by maximizing \(P(Y_{k,j} \mid \hat{\theta}_{j,k}, d_k)\). We then use \(1 / |\theta_j - \hat{\theta}_{j,k}|\) as the reward \(R_k\), which measures how close student \(j\)'s current code-quality estimate is to the ground-truth code quality: the smaller the difference, the bigger the reward. Otherwise, \(R_k = -10000\); this large negative reward discourages the agent from selecting the same test case multiple times. We believe the reward satisfies the Markov property because it depends only on the current state and action, since the current state encodes all previously selected test cases.
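
A minimal sketch of this reward, assuming the ground-truth and estimated code qualities are already available; the small epsilon term is our addition to avoid dividing by zero:

```python
def reward(theta_true: float, theta_hat: float,
           action: int, already_selected: set) -> float:
    """R_k = 1 / |theta_j - theta_hat_{j,k}| for a new test case,
    and a large negative penalty for a repeated one."""
    if action in already_selected:
        return -10000.0
    return 1.0 / (abs(theta_true - theta_hat) + 1e-8)
```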

MDP Definition

In summary, our MDP is defined as follows:

  • \(S\): \(\mathrm{LLM}_e(q, a, t_{1..k}) - \mathrm{LLM}_e(q, c_j, t_{1..k})\)
  • \(A\): the set of all test cases, \(\{t_1, \ldots, t_K\}\)
  • \(P\): \(P(\mathrm{LLM}_e(q, a, t_{1..k+1}) \mid \mathrm{LLM}_e(q, a, t_{1..k}), A_k) = 1.0\)
  • \(R\):
    • \(-10000\) if \(t_k \in \{t_1, \ldots, t_{k-1}\}\)
    • \(1 / |\theta_j - \hat{\theta}_{j,k}|\) otherwise
  • \(d_0\): \(\mathrm{LLM}_e(q, a) - \mathrm{LLM}_e(q, c_j)\)
  • \(\gamma \in [0, 1)\), i.e., any non-negative value smaller than 1.

Implementation

MDP

We implement our custom environment using Gymnasium (the maintained successor to OpenAI Gym) and Stable Baselines 3, overriding methods such as reset and step.
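
Below is a minimal sketch of such an environment, assuming the state is a single embedding-difference vector. The LLM embedding lookup and the IRT update are stubbed out with placeholders, and names such as CATEnv, embed_dim, and num_test_cases are our own choices rather than the exact implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CATEnv(gym.Env):
    """Test case selection environment: actions are test case indices,
    observations are the LLM-embedding-difference states."""

    def __init__(self, num_test_cases: int = 20, embed_dim: int = 64, max_steps: int = 10):
        super().__init__()
        self.num_test_cases = num_test_cases
        self.max_steps = max_steps
        # S_k = LLM_e(q, a, t_1..k) - LLM_e(q, c_j, t_1..k)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(embed_dim,), dtype=np.float32)
        self.action_space = spaces.Discrete(num_test_cases)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.selected = set()
        self.k = 0
        self.theta_true = 0.75  # placeholder for the student's ground-truth code quality
        return self._state_embedding(), {}  # d_0 = LLM_e(q, a) - LLM_e(q, c_j)

    def step(self, action):
        if int(action) in self.selected:
            reward = -10000.0  # penalize re-selecting a test case
        else:
            self.selected.add(int(action))
            theta_hat = self._irt_estimate()  # update of the estimate \hat{theta}_{j,k}
            reward = 1.0 / (abs(self.theta_true - theta_hat) + 1e-8)
        self.k += 1
        terminated = self.k >= self.max_steps
        return self._state_embedding(), reward, terminated, False, {}

    def _state_embedding(self):
        # Placeholder: the real environment returns a difference of frozen LLM embeddings.
        return self.np_random.normal(size=self.observation_space.shape).astype(np.float32)

    def _irt_estimate(self):
        # Placeholder for the IRT-based update of the code quality estimate.
        return len(self.selected) / self.num_test_cases
```

The environment can then be sanity-checked with check_env from stable_baselines3.common.env_checker before training.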

RL Algorithms

We implement three algorithm variants: standard PPO, PPO with a custom feature extractor, and PPO with a custom policy (an LSTM bilinear policy).

  1. PPO - We use the standard implementation of PPO from Stable Baselines 3.
  2. Custom Feature Extractor - Instead of the default feature extractor, we use a custom LSTM network to embed the state representations (see the sketches after this list).
  3. Custom Policy (LSTM Bilinear) - We modify the policy to include the test case embeddings and a learnable bilinear layer that projects from the feature-extraction (LSTM) space to the test case embedding space (also sketched below).
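
For variants 1 and 2, a minimal sketch using Stable Baselines 3's BaseFeaturesExtractor hook could look as follows; the feature dimension, the length-1 sequence handling, and the reuse of the CATEnv sketch above are our own assumptions rather than the repository's exact code.

```python
import gymnasium as gym
import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class LSTMExtractor(BaseFeaturesExtractor):
    """LSTM feature extractor: embeds the state vector(s) and returns the
    last hidden state as the feature fed to the policy/value heads."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        embed_dim = observation_space.shape[-1]
        self.lstm = nn.LSTM(embed_dim, features_dim, batch_first=True)

    def forward(self, observations: th.Tensor) -> th.Tensor:
        if observations.dim() == 2:           # (batch, embed_dim) -> length-1 sequence
            observations = observations.unsqueeze(1)
        _, (h_n, _) = self.lstm(observations)
        return h_n[-1]                        # (batch, features_dim)


env = CATEnv()  # the environment sketched above (assumed to be in scope)
model = PPO(
    "MlpPolicy",                              # variant 1 (standard PPO) stops here
    env,
    policy_kwargs=dict(                       # variant 2: plug in the custom extractor
        features_extractor_class=LSTMExtractor,
        features_extractor_kwargs=dict(features_dim=128),
    ),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```

For variant 3, one possible way to realize the bilinear head is to subclass ActorCriticPolicy and swap out its default action head after the network is built. This is a hedged sketch of one option, not necessarily the repository's implementation; BilinearHead, BilinearPolicy, and the test_case_embeddings argument are names introduced here.

```python
import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy


class BilinearHead(nn.Module):
    """Action head computing logits = h W Q^T against fixed test-case embeddings."""

    def __init__(self, feature_dim: int, test_case_embeddings: th.Tensor):
        super().__init__()
        self.register_buffer("Q", test_case_embeddings)       # (K, embed_dim), frozen
        self.W = nn.Parameter(th.empty(feature_dim, test_case_embeddings.shape[1]))
        nn.init.xavier_uniform_(self.W)                        # learnable bilinear matrix

    def forward(self, latent_pi: th.Tensor) -> th.Tensor:
        return latent_pi @ self.W @ self.Q.T                   # (batch, K) logits


class BilinearPolicy(ActorCriticPolicy):
    """ActorCriticPolicy whose categorical action head is the bilinear projection."""

    def __init__(self, *args, test_case_embeddings: th.Tensor = None, **kwargs):
        self.test_case_embeddings = test_case_embeddings
        super().__init__(*args, **kwargs)

    def _build(self, lr_schedule) -> None:
        super()._build(lr_schedule)
        # Swap the default linear action head for the bilinear one, then rebuild
        # the optimizer so the new parameters are included.
        self.action_net = BilinearHead(self.mlp_extractor.latent_dim_pi,
                                       self.test_case_embeddings)
        self.optimizer = self.optimizer_class(self.parameters(), lr=lr_schedule(1),
                                              **self.optimizer_kwargs)


# Usage sketch: q_embeddings stands in for the (K, embed_dim) LLM_p test-case embeddings.
q_embeddings = th.randn(20, 64)
model = PPO(BilinearPolicy, CATEnv(),  # CATEnv from the environment sketch above
            policy_kwargs=dict(test_case_embeddings=q_embeddings), verbose=1)
model.learn(total_timesteps=10_000)
```

Rebuilding the optimizer after swapping the head is necessary because Stable Baselines 3 constructs the optimizer inside _build, before the replacement head exists.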

The code and its explanation can be found in my GitHub repository.