In my fourth year I led my second QMIND project, where we researched applying mechanistic interpretability to defend against LLM jailbreaks. It's a cool research idea: can we identify which neurons in an LLM are responsible for bad or harmful behaviour? What happens if we turn these neurons off? (There's a rough code sketch of the idea after the list below.) At a glance we...

  • poked around the weights of various transformer-based LLMs (GPT-2, Gemma-2-2b-it, etc.)
  • learned a lot about how to do reinforcement learning effectively and just how strange language models can be
  • showcased our work at CUCAI 2025, Canada's largest undergraduate AI conference
  • presented our work to the Amazon Web Services team in Toronto
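
To make the core idea concrete, here's a minimal sketch of what "turning neurons off" can look like in practice: zeroing out selected MLP activations in GPT-2 with a PyTorch forward hook. This isn't our project's exact method, just an illustration of activation ablation; the layer index and neuron indices below are hypothetical placeholders, whereas in a real experiment they'd come from an attribution technique rather than being hand-picked.

```python
# Minimal sketch: ablate ("turn off") specific MLP neurons in GPT-2
# using a PyTorch forward hook. Assumes the `transformers` library.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6                    # hypothetical: which transformer block to edit
NEURONS = [874, 1391, 2025]  # hypothetical: neuron indices flagged as harmful

def ablate(module, inputs, output):
    # Zero the chosen neurons in the MLP hidden activations.
    # For GPT-2 small the activation shape is [batch, seq_len, 3072].
    output[:, :, NEURONS] = 0.0
    return output

# GPT-2's MLP activation function lives at transformer.h[i].mlp.act,
# so hooking it lets us edit the post-activation hidden neurons.
hook = model.transformer.h[LAYER].mlp.act.register_forward_hook(ablate)

ids = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **ids,
        max_new_tokens=20,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0]))

hook.remove()  # detach the hook to restore the unmodified model
```

The nice part of the hook-based approach is that it's non-destructive: the weights are never changed, so removing the hook gives you the original model back for side-by-side comparisons.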