For my compiler theory class in college, the students each built a compiler from the ground up for a fictional language, C-. This is the lexer, which parses the text of the program into a tokenized format.