Stemming


When we read words in a sentence, our mind analyses the words in two different ways. The root of the word and the grammatical modification. Both are important and either has its value in identifying the meaning of the sentence as a whole. Stemming is the first part of the process - identifying root of the word. This is an important step in identifying the word vector.

The Concept


There are different ways of stemming a word. Different approaches and different algorithms have been suggested for this. The fundamental and most commonly used approach was suggested by Martin Porter, also called as the Porter Stemming Algorithm. Like most of AI algorithms, this algorithm is not a recent development. (It was proposed way back in early 1980's). But its relevance and utility showed up only recently when we got the requisite processing power and training data.

Porter Stemming is quite simple once we understand the concept and implications. Consider the word "cheer". It has different grammatical modifications:

cheer
cheers
cheered
cheering

Did I miss any?

All of these are derived from the root "cheer". Porter's stemming algorithm tries to identify this, and correlate the root with the word.

The simplest and most intuitive way might be to look for the common suffixes. For example, if W1 is "cheer" and W2 is "cheering", we can correlate them based on the suffix "ing". So a simple way of doing this job is to remove look for the common suffixes and strip them off to get the root word. But that does not always work.

For example, "relate" and "relating" - has the extra "e". Things get even worse in some other words such as "index" and "indices". Or words like "Wand" and "Wander" have no relation with each other, but seem to be related because of the "er" suffix. English language is full of exceptions! So what is the way out? The brute force way of course, is to code the whole dictionary. Can we do something better?

The Porter algorithm tries to tackle this problem.

The Algorithm


To understand the algorithm, we need to brush up some basics - vowels and consonants. Vowels are a, e, i, o, u and also the y following a consonant. Anything else is a consonant. A word is a sequence of sounds. And each sound is defined by a set of consecutive consonants followed by a set of consecutive vowels - example (thr)(ee) is one sound. whereas (g)(o)(d) has two sounds. Size of a word is defined by the number of sounds rather than characters.

Thus, any word can be depicted as depicted by [C](VC)m[V] - where C is a set of consecutive consonants and V is a set of consecutive vowels. The first C and last V are optional. The size of such a word is defined by the number m. With this in place, we develop rules for stemming in the form

(condition) S1 -> S2

With this in place, the Porter Stemming algorithm provides the following set of rules:

Step 1a
 SSES -> SS 
 IES -> I 
 ties -> ti
 SS -> SS 
 S -> 

Step 1b
 (m>0) EED -> EE 
 (*v*) ED -> 
 (*v*) ING ->

If the second or third of the rules in Step 1b is successful, then:

 AT -> ATE 
 BL -> BLE 
 IZ -> IZE

This is followed by Step 1c

 (*v*) Y -> I

Thus, the step 1 strips the plurals and past participles. The following steps deal with adjectives and adverbs.

Step 2
 (m>0) ATIONAL -> ATE 
 (m>0) TIONAL -> TION 
 (m>0) ENCI -> ENCE 
 (m>0) ANCI -> ANCE 
 (m>0) IZER -> IZE 
 (m>0) ABLI -> ABLE 
 (m>0) ALLI -> AL 
 (m>0) ENTLI -> ENT 
 (m>0) ELI -> E 
 (m>0) OUSLI -> OUS 
 (m>0) IZATION -> IZE 
 (m>0) ATION -> ATE 
 (m>0) ATOR -> ATE 
 (m>0) ALISM -> AL 
 (m>0) IVENESS -> IVE 
 (m>0) FULNESS -> FUL 
 (m>0) OUSNESS -> OUS 
 (m>0) ALITI -> AL 
 (m>0) IVITI -> IVE 
 (m>0) BILITI -> BLE
Step 3
 (m>0) ICATE -> IC
 (m>0) ATIVE -> 
 (m>0) ALIZE -> AL 
 (m>0) ICITI -> IC 
 (m>0) ICAL -> IC 
 (m>0) FUL -> 
 (m>0) NESS ->
Step 4
 (m>1) AL -> 
 (m>1) ANCE -> 
 (m>1) ENCE -> 
 (m>1) ER -> 
 (m>1) IC -> 
 (m>1) ABLE -> 
 (m>1) IBLE -> 
 (m>1) ANT -> 
 (m>1) EMENT -> 
 (m>1) MENT -> 
 (m>1) ENT -> 
 (m>1 and (*S or *T)) ION -> 
 (m>1) OU -> 
 (m>1) ISM -> 
 (m>1) ATE -> 
 (m>1) ITI -> 
 (m>1) OUS -> 
 (m>1) IVE -> 
 (m>1) IZE ->

These four steps take care of removing all the suffixes. But we need to clean up a bit more to obtain the roots.

Step 5a
 (m>1) E -> 
 (m=1 and not *o) E -> 

Step 5b
 (m > 1 and *d and *L) ->

Wow! That takes care of most of it. But, not everything. English language is beyond all rules. We do have exceptions that do not fit into any of these. But this takes care of most of it. The rest can be handled as exceptions.

Implementation


The algorithm above is pretty complex, and implementing it in code is a scary thought! Don't worry. People have already done most of it.

Here is a list of links to implementations in various different languages. With this in place, let us look at a Python script for stemming a word.

pip install stemming
from stemming.porter2 import stem
stem("clarification")

That is all! As always, the process of stemming is concept heavy-code lite.