punjabi word puzzle

some game notes - play it here


📅 february 5th, 2022


background

In October of 2021, Josh Wardle released a web browser game called Wordle. I was late to catch on - a friend sent me the link in January, by which point Wordle had millions of daily visitors. Soon after, the New York Times acquired the game for a dollar amount in the "low seven figures."

The gameplay works like this: you get six tries to guess a five-letter word. Each guess must be a valid word itself. Like Mastermind, you're told if you have the right letters and whether they're in the right place. The word resets every day. I learned that other variations of this idea include the 1980s game show Lingo and the two-player 1954 board game Jotto.
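Here's a rough sketch of that feedback rule in Python. It isn't taken from any Wordle codebase; `score_guess` and the status names are made up for illustration, and the handling of repeated letters (a letter is only marked "present" as many times as it appears unmatched in the answer) is an assumption about how the rule is usually implemented.

```python
from collections import Counter


def score_guess(guess, answer):
    """Return one status per position: 'correct', 'present', or 'absent'."""
    if len(guess) != len(answer):
        raise ValueError("guess and answer must be the same length")

    statuses = ["absent"] * len(guess)

    # Answer letters that were not matched exactly. Only these can still
    # earn a 'present' mark, which keeps repeated letters honest.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            statuses[i] = "correct"

    # Second pass: right letter, wrong place.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            statuses[i] = "present"
            remaining[g] -= 1

    return statuses


print(score_guess("crane", "nacre"))
```

The function only compares items position by position, so it works equally well on a string or on a list of cells.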

I had fun playing with my friends and decided to attempt a Punjabi version of the game. I learned later that expanding the game in this way is fairly common - this list maintains references to over 350 variations spanning 100+ languages! That page also links to a guide from Aidan Pine for language adaptations. If I were starting over from scratch, I might explore that.

the initial code

People are really creative - I was lucky to find several open-source versions of the game, including a Visual Basic implementation for Word 97 and a version that runs on TI-83/84 graphing calculators. Once I found those, I felt pretty good about my chances for a browser-based version. I ended up going with @hannahcode, whose implementation is MIT-licensed and has over 1k stars on GitHub. It turns out Aidan's fork is based on Hannah's version too. Reusing the project got me through the base game state and UX much more quickly than I expected. I made the Punjabi modifications in a few hacky commits. You can find my fork here.

gameplay changes

I ended up making a few changes from the base Wordle mechanics. I've written about them below alongside some other musings.

providing diacritics

Gurmukhi (common Punjabi script) is an abugida. I am conversational in Punjabi, but learned this word while writing this up. From omniglot (underlines mine):

"Abugidas consist of symbols for consonants and vowels. The consonants each have an inherent vowel which can be changed to another vowel or muted by means of diacritics or other modifications. Vowels can also be written with separate letters when they occur at the beginning of a word or on their own."

In this post I'm using "diacritics" somewhat interchangeably with "vowels" and "laga matras" - some people might prefer those terms. The inherent vowel isn't immediately relevant to Wordle, but it's worth a quick detour. A good example of the inherent sound is the name "Karan." Written in Punjabi it looks like ਕਰਨ:

The inherent vowel sounds like "uh", so the vowels in ਕਰਨ sound like the ones in "current" or "dozen". English doesn't work this way - "KRN" doesn't make sense until you add explicit vowels in between.

Changing an inherent sound requires diacritics - these are relevant to our game UX. These symbols can appear above, below, before, or after a letter. Here are some examples with the letter ਨ (n):

diacritic   example   sound    position
ਿ           ਨਿ        nit      before
ੀ           ਨੀ        need     after
ੋ           ਨੋ        nope     above
ੂ           ਨੂ        noodle   below

I wasn't sure how to best deal with this for our game. One idea is to treat each diacritic like its own letter - the box would look something like this for the word ਕਿਸਾਨ (farmer).

[ ਿ ] [ ਕ ] [ ਸ ] [ ਾ ] [ ਨ ]

It feels odd to separate the diacritics like this. It also doesn't work well for examples like ਕੋ or ਕੂ unless we add cells above and below the main letters:

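[grid figure: the same word laid out in three rows, with extra cells above and below the main letters]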

I think a three-row UX like above would be interesting to explore further - it seems like guess feedback could include diacritic positions alongside main letter positions. But separating diacritics from associated letters bothered me because word length is no longer obvious. A word like "house" is clearly 5 letters in English. But a Punjabi speaker asked for 3-letter words could easily suggest "ਸਸਤਾ", "ਕਿਤਾਬ", and "ਮਿਰਚ" because you don't think of diacritics separately. Our game would require players to instead count the words as 4, 5, and 4 respectively. In a game where word length is important, this kind of transformation would be an odd ask to make of our users.
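To make the counting concrete, here's a small Python sketch that separates base letters from diacritics using Unicode general categories. It's only an illustration, not part of the game: `count_letters` is a made-up helper, and it assumes every diacritic is a Unicode combining mark (category starting with "M"). Words with subjoined letters or a nukta would need more care.

```python
import unicodedata


def count_letters(word):
    """Return (base_letters, diacritics) for a word.

    Assumption: any code point whose Unicode category starts with "M"
    (a combining mark) is a diacritic; everything else is a base letter.
    """
    marks = sum(1 for ch in word if unicodedata.category(ch).startswith("M"))
    return len(word) - marks, marks


for word in ["ਸਸਤਾ", "ਕਿਤਾਬ", "ਮਿਰਚ"]:
    letters, marks = count_letters(word)
    # Prints the word, its base-letter count, and its total symbol count.
    print(word, letters, letters + marks)
```

For the three words above, the base-letter count is 3 each time, while the total (letters plus diacritics) comes out to 4, 5, and 4.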

For our game, I decided to go a different direction. We provide diacritics to the user at the beginning of the puzzle. We place them in the correct positions, and require the user to guess the associated letter. An example grid for the word ਕਿਸਾਨ looks like:

[grid figure: three empty cells, with ਿ already placed in the first cell and ਾ in the second]

When the user enters letters, we automatically attach placed diacritics and provide feedback. We also switch to 3-letter words; this is a more reasonable word length for common words in Punjabi. Vowels contribute to word length in English, whereas diacritics don't in Punjabi. I ended up finding a Hindi version of Wordle as well that takes a similar approach!
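One way to picture the "attach placed diacritics" step is the Python sketch below. It is not the code from my fork; `diacritic_template` and `attach` are hypothetical names, and it again assumes that diacritics are exactly the Unicode combining marks.

```python
import unicodedata


def is_mark(ch):
    # Assumption: diacritics are the code points in a Unicode "M" category.
    return unicodedata.category(ch).startswith("M")


def diacritic_template(answer):
    """Return the diacritics attached to each base letter, one entry per cell."""
    slots = []
    for ch in answer:
        if not is_mark(ch):
            slots.append("")   # a new cell for each base letter
        elif slots:
            slots[-1] += ch    # a mark belongs to the letter before it
        # A mark with no letter before it is ignored.
    return slots


def attach(letters, template):
    """Combine the player's typed letters with the pre-placed diacritics."""
    if len(letters) != len(template):
        raise ValueError("need exactly one letter per cell")
    return [letter + marks for letter, marks in zip(letters, template)]


template = diacritic_template("ਕਿਸਾਨ")   # three cells: ਿ, ਾ, and no mark
cells = attach("ਮਕਨ", template)          # the player typed three bare letters
print("".join(cells))
```

Each resulting cell (letter plus its diacritics) can then be compared against the matching cell of the answer to produce feedback.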

looking for a dictionary

In Wordle, each guess must be a valid word. Without this constraint, consecutive guesses of "abcde", "fghij", etc. could quickly identify letters. There's also a 'hard mode' where guesses must use known letters in correct positions.

Adding this functionality requires a dictionary of valid words to check against. The code was open-source, and I wanted to make sure the word list was appropriately licensed too. This is especially important because the game is fully implemented in the browser - so any dictionary would be easy to access for end-users. I also felt a full dictionary, as opposed to just a big corpus, was important. If the word is valid but not in my word list, it's frustrating for users. I'd rather allow incorrect words than ship with an incomplete dictionary.
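For reference, the check itself is simple once a word list exists. The Python sketch below is only an illustration with hypothetical names (`load_word_set`, `is_valid_guess`) and a two-word stand-in list. One detail worth noting: some Gurmukhi letters, like ਸ਼, can be encoded either as a single code point or as a base letter plus a nukta, so the sketch normalizes both the list and the guess before comparing.

```python
import unicodedata


def load_word_set(words):
    """Build a lookup set from an iterable of words, skipping blank entries."""
    return {
        unicodedata.normalize("NFC", w.strip())
        for w in words
        if w.strip()
    }


def is_valid_guess(guess, word_set):
    """Check a guess against the word set after the same normalization."""
    return unicodedata.normalize("NFC", guess.strip()) in word_set


# Stand-in list; a real game would load a full, properly licensed dictionary.
word_set = load_word_set(["ਆਸਰਾ", "ਗਰਮ"])
print(is_valid_guess("ਗਰਮ", word_set))
print(is_valid_guess("ਮਮਮ", word_set))
```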

My attempts at finding a dictionary eventually stalled out. I searched GitHub and Google for relevant txt files, but didn't have much luck. I've used several dictionary websites before, but I'm not aware of any with open data sets. One common site is Shabdkosh - for example, see all letters starting with ਅ. Like many websites, its terms of use prohibit redistribution of data. This led to some interesting conversations with friends & family around ownership of reference word lists. In any case, I don't have a linguistics or legal background, and I'm sure others know better places to look.

The most promising option for me was the Digital Dictionaries of South Asia project (DSAL) from the University of Chicago. I found their digitized version of "The Panjabi Dictionary" by Maya Singh, published in 1895. You can find their word search portal here and a full scan of the dictionary in ebook format here. One interesting thing is that several websites list this dictionary in the public domain. I think this is because the publication date is quite old (books enter public domain after some time). I emailed DSAL to ask if the underlying dataset was available, but haven't heard back. I did find one reference to a research paper that analyzed the words in the DSAL Singh dictionary.

When looking at Shabdle, I noticed its Hindi wordset came from the GNU aspell spellchecker dictionary. There is a similar Punjabi dictionary on that page, but it unfortunately seems to be missing words. For instance, I couldn't find ਆਸਰਾ (support) or ਗਰਮ (hot).

I also found several Punjabi corpora - often referenced from linguistics research websites. I had the opposite problem here, where some of the data sets included words that I couldn't find in any dictionary. I'm not sure why this is; one guess is that experimentally generated data sets from Internet crawling might pull in misspelled words. In theory this is alright for Wordle. In the end, I figured it was better to go without a word list than have false positives or negatives for user guesses. By this point I was starting to wonder whether limiting guesses in this way was valuable. As long as the answer is a real word, I feel okay.

Here are some of the data sets I found. I decided against using them for the game, but am really glad to have found them. There are English/Punjabi sentence pairs, both written and verbal:

removing the guess limit

Typically a Wordle game ends after 6 guesses. I decided to remove this limit - I'm not sure whether 6 is the right number after our changes. Together with the "no dictionary" decision, this means players can guess any string of letters - even ਸਸਸ or ਮਮਮ. This doesn't bother me much; people should do the puzzle how they like.

For the answers to the game, I typed up some words with my family. As long as the word is relatively common, anything should be fine here. Like normal Wordle, each day the answer rotates to a new word.
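The daily rotation can be as simple as indexing into the answer list by the number of days since some start date. Here's a Python sketch of that idea; the start date, the three-word list, and the wrap-around when the list runs out are all placeholders I picked for illustration, not the game's actual values.

```python
from datetime import date

# Placeholder answers; the real list is longer.
ANSWERS = ["ਕਿਸਾਨ", "ਕਿਤਾਬ", "ਮਿਰਚ"]

# Hypothetical first day of the puzzle.
START_DATE = date(2022, 2, 5)


def answer_for(day, answers=ANSWERS, start=START_DATE):
    """Pick the answer for a given day, wrapping around at the end of the list."""
    days_elapsed = (day - start).days
    return answers[days_elapsed % len(answers)]


print(answer_for(date.today()))
```

Because the index depends only on the date, everyone playing on the same day gets the same word without any server involved. In a browser the date comes from the player's clock, so the word would change at local midnight.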

thanks

If you read this, thank you! Thanks to @hannahcode for the initial implementation, and also to my family for fielding question after question about Punjabi.

You can find other posts and contact info here.