Friday, August 05, 2016

A Persistent Queue with C#

This is just a bit of programming trivia.

I wanted to run a queue of tasks each of which could take a while to complete. Given that the program could be terminated before all tasks had been run then I needed to store any remaining tasks until a later opportunity to process them. Also I wanted to run the task queue on a background thread and be able to add to the queue of tasks from the UI thread.

Fortunately .NET has an inbuilt ConcurrentQueue<T> in System.Collections.Concurrent that can be accessed from multiple threads without issue. All that needed to be added was some code to persist uncompleted queue tasks. In the code below I am using the .NET port of SQLite but a flat file would work fine as would any other database.

As I was using a database I wanted to differentiate between queued tasks already in the relevant table and any added from the UI thread. [Not of course forgetting to delete tasks in the database that are completed.] So a little class to represent a task – the task itself being just a string in this instance.

class QueuedTask {     private string task;     private bool fromDatabase = false;     public QueuedTask(string task)     {         this.task = task;     }     public QueuedTask(string task, bool fromDatabase)     {         this.task = task;         this.fromDatabase = fromDatabase;     }     public bool InDatabase     {         get { return fromDatabase; }     }     public string Task     {         get { return task; }     } }
Creating an instance of my PersistentQueue Class starts a timer that loads any saved tasks into the queue on another thread so as not to block the UI thread.. The main program form FormClosing() event calls the class SaveTasks() method that saves any remaining tasks in the queue to the database.

class PersistentQueue {     public ConcurrentQueue<QueuedTask> TaskQueue = new ConcurrentQueue<QueuedTask>();     private string dbPath = "";     private Timer readTimer = new Timer();     private const int startDelay = 10000;     public PersistentQueue(string dbPath)     {         this.dbPath = dbPath;         readTimer.Interval = startDelay;         readTimer.Elapsed += (sender, e) => loadSavedTasks();         readTimer.Start();     }     public void SaveTasks()     {         readTimer.Close();         if (!TaskQueue.IsEmpty)         {             try             {                 SQLiteConnection newCon = new SQLiteConnection(dbPath);                 newCon.Open();                 SQLiteCommand mCommand = new SQLiteCommand("Insert Into Tasks (Task) Values(@Task)", newCon);                 mCommand.Parameters.Add(new SQLiteParameter("@Task"));                 QueuedTask mTask;                 while (TaskQueue.TryDequeue(out mTask))                 {                     if (!mTask.InDatabase)                     {                         mCommand.Parameters["@Task"].Value = mTask.Task;                         mCommand.ExecuteNonQuery();                     }                 }                 mCommand.Dispose();                 newCon.Close();             }             catch (Exception ex)             {             }         }     }     private async void loadSavedTasks()     {         readTimer.Stop();         await Task.Run(() => loadTasks());         // this is where you might start a new Timer to execute any tasks sitting         // in the queue - again using await Task.Run() and to periodically check the         // queue for new entries after the current queue has been emptied     }     private void loadTasks()     {         try         {             SQLiteConnection newCon = new SQLiteConnection(dbPath);             newCon.Open();             SQLiteCommand mCommand = new SQLiteCommand("Select Task From Tasks", newCon);             SQLiteDataReader mReader = mCommand.ExecuteReader();             while (mReader.Read())             {                 TaskQueue.Enqueue(new QueuedTask(Convert.ToString(mReader[0]), true));             }             mReader.Close();             mCommand.Dispose();             newCon.Close();         }         catch (Exception ex)         {             LiteLog myerror = new LiteLog("PersistentQueue.cs", "loadTasks", ex.Message);             myerror.UpdateClass();         }     } }

Thursday, August 04, 2016

Not a Porter Stemmer

If you are considering indexing a sequence of text documents with a view to applying some sort of search facility then there is a lot to be said for stemming.

Stemming is the process of reducing words to their stem or root such that related words are reduced to the same stem. The result may not be the true etymological stem – all stemming algorithms look for is consistency. The benefit is that searches based upon such stems can find references for related words and this reduces the pressure on search users to guess the optimal words or phrases for a given requirement.

Some examples might be:

caresses => caress
ponies => poni
cats => cat
matting => mat
meetings => meet

Perhaps the best known stemming algorithm is the “Porter Stemmer” by Martin Porter. This algorithm was “frozen” in the past and now new developments in such techniques are based upon the newer Snowball stemmer.

C# Snowball* stemmers for a range of human languages can be found on Codeplex with the warning that the code has been ported from Java (a brief review more or less confirmed that). Martin Porter’s stemming site presents his algorithm in a range of languages but again it is clear from the structure that the C# versions are translated.

I took one of the C# Porter Stemmer versions and ran it against the 23,531 word test set and reassured by a perfect score decided to then “translate” the code into something just a little easier to follow and possibly tweak. Now my version (presented below) is not a Porter Stemmer as the output varies from the standard for a handful of words (I just have to mess with things).

I preferred:
advertise, advertised and advertising being stemmed to advert rather than advertis
merchandise becomes merchand and not merchandis
etc.

You probably get the picture – I have treated “ise” the same as “ize” (which works better in GB English I think) but it would not take anyone 5 minutes to undo if you wanted to be a purist but alternately you might fancy tweaking things further.

The code presented here is almost certainly much slower than the original but it is way easier to follow and debug. I have also added a cache so that words previously stemmed can be retrieved  without re-processing although I will have to watch cache growth if the process is long running.

/// <summary> /// AditStemmer is a stemmer algorithm based upon the Porter stemmer algorithm /// but has some slight but deliberate differences in output /// see http://tartarus.org/~martin/PorterStemmer/ for more details and test data /// This software is completely free for any purpose but is not warrented to be defect free /// </summary> class AditStemmer {     private Dictionary<string, string> cache = new Dictionary<string, string>(); // could initialise this to avoid known issues (eg universal, university, universe)     public string stemWord(string word)     {         string stemWord;         if (!cache.TryGetValue(word, out stemWord))         {             stemWord = tryStem(word);             cache.Add(word, stemWord);         }         return stemWord;     }     private string tryStem(string word)     {         if (word.Length > 2)         {                 // following the steps of giants - step 1 remove plurals, -ed, -ing                 word = step1(word);                 word = step2(word);                 word = step3(word);                 word = step4(word);                 word = step5(word);                 word = step6(word);         }         return word;     }     private string step1(string word)     {         int wordLen = word.Length;         if (word.EndsWith("s"))         {             if (word.EndsWith("sses") || word.EndsWith("ies"))             {                 wordLen -= 2;             }             else if (!word.EndsWith("ss"))             {                 wordLen--;             }             word = word.Substring(0, wordLen);         }         if (word.EndsWith("eed"))         {             if (countConsonants(word, wordLen - 1) > 0)             {                 wordLen--;             }         }         else         {             if ((word.EndsWith("ed") && hasVowelInStem(word, wordLen - 2)) || (word.EndsWith("ing") && hasVowelInStem(word, wordLen - 3)))             {                 int checkTo = wordLen - (word.EndsWith("ed") ? 2 : 3);                 string stem = word.Substring(0, checkTo);                 if (stem.EndsWith("at") || stem.EndsWith("bl") || stem.EndsWith("iz") || stem.EndsWith("is"))                 {                     word = stem + "e";                 }                 else if (isDoubleCon(word, checkTo - 1))                 {                     if (!(stem.EndsWith("l") || stem.EndsWith("s") || stem.EndsWith("z")))                     {                         word = stem.Substring(0, checkTo - 1);                     }                     else { word = stem; }                 }                 else if (countConsonants(word, checkTo) == 1 && isConVowCon(word, checkTo - 1))                 {                     word = stem + "e";                 }                 else                 {                     word = stem;                 }                 wordLen = word.Length;             }         }         return word.Substring(0, wordLen);     }     private string step2(string word)     {         // not too exciting just changes y to i if another vowel in the stem - but that comes in useful later         if (word.EndsWith("y") && hasVowelInStem(word, word.Length - 1))         {             word = word.Substring(0, word.Length - 1) + "i";         }         return word;     }     private string step3(string word)     {         if (word.EndsWith("ational")) { return conditionalReplace(word, "ational", "ate"); }         if (word.EndsWith("tional")) { return conditionalReplace(word, "tional", "tion"); }         if (word.EndsWith("enci")) { return conditionalReplace(word, "enci", "ence"); }         if (word.EndsWith("anci")) { return conditionalReplace(word, "anci", "ance"); }         if (word.EndsWith("izer")) { return conditionalReplace(word, "izer", "ize"); }         if (word.EndsWith("bli")) { return conditionalReplace(word, "bli", "ble"); }         if (word.EndsWith("alli")) { return conditionalReplace(word, "alli", "al"); }         if (word.EndsWith("entli")) { return conditionalReplace(word, "entli", "ent"); }         if (word.EndsWith("eli")) { return conditionalReplace(word, "eli", "e"); }         if (word.EndsWith("ousli")) { return conditionalReplace(word, "ousli", "ous"); }         if (word.EndsWith("ization")) { return conditionalReplace(word, "ization", "ize"); }         if (word.EndsWith("isation")) { return conditionalReplace(word, "isation", "ize"); }         if (word.EndsWith("ation")) { return conditionalReplace(word, "ation", "ate"); }         if (word.EndsWith("ator")) { return conditionalReplace(word, "ator", "ate"); }         if (word.EndsWith("alism")) { return conditionalReplace(word, "alism", "al"); }         if (word.EndsWith("iveness")) { return conditionalReplace(word, "iveness", "ive"); }         if (word.EndsWith("fulness")) { return conditionalReplace(word, "fulness", "ful"); }         if (word.EndsWith("ousness")) { return conditionalReplace(word, "ousness", "ous"); }         if (word.EndsWith("aliti")) { return conditionalReplace(word, "aliti", "al"); }         if (word.EndsWith("iviti")) { return conditionalReplace(word, "iviti", "ive"); }         if (word.EndsWith("biliti")) { return conditionalReplace(word, "biliti", "ble"); }         if (word.EndsWith("logi")) { return conditionalReplace(word, "logi", "log"); }         return word;     }     private string step4(string word)     {         if (word.EndsWith("icate")) { return conditionalReplace(word, "icate", "ic"); }         if (word.EndsWith("ative")) { return conditionalReplace(word, "ative", ""); }         if (word.EndsWith("alize")) { return conditionalReplace(word, "alize", "al"); }         if (word.EndsWith("alise")) { return conditionalReplace(word, "alise", "al"); }         if (word.EndsWith("iciti")) { return conditionalReplace(word, "iciti", "ic"); }         if (word.EndsWith("ical")) { return conditionalReplace(word, "ical", "ic"); }         if (word.EndsWith("ful")) { return conditionalReplace(word, "ful", ""); }         if (word.EndsWith("ness")) { return conditionalReplace(word, "ness", ""); }         return word;     }     private string step5(string word)     {         int truncAt = word.Length;         if (word.EndsWith("al")) { truncAt -= 2; }         if (word.EndsWith("ance") || word.EndsWith("ence")) { truncAt -= 4; }         if (word.EndsWith("er")) { truncAt -= 2; }         if (word.EndsWith("ic")) { truncAt -= 2; }         if (word.EndsWith("able") || word.EndsWith("ible")) { truncAt -= 4; }         if (word.EndsWith("ant")) { truncAt -= 3; }         if (word.EndsWith("ement"))         {             truncAt -= 5;         }         else if (word.EndsWith("ment"))         {             truncAt -= 4;         }         else if (word.EndsWith("ent"))         {             truncAt -= 3;         }         if (word.EndsWith("ion") && word.Length > 4)         {             if (word.ElementAt(word.Length - 4) == 't' || word.ElementAt(word.Length - 4) == 's')             {                 truncAt -= 3;             }         }         if (word.EndsWith("ou")) { truncAt -= 2; }         if (word.EndsWith("ism")) { truncAt -= 3; }         if (word.EndsWith("ate")) { truncAt -= 3; }         if (word.EndsWith("iti")) { truncAt -= 3; }         if (word.EndsWith("ous")) { truncAt -= 3; }         if (word.EndsWith("ive")) { truncAt -= 3; }         if (word.EndsWith("ize")) { truncAt -= 3; }         if (word.EndsWith("ise")) { truncAt -= 3; }         if (countConsonants(word, truncAt) > 1)         {             word = word.Substring(0, truncAt);         }         return word;     }     private string step6(string word)     {         // mostly removes trailing e         if (word.EndsWith("e"))         {             int cCount = countConsonants(word, word.Length - 1);             if (cCount > 1 || cCount == 1 && !isConVowCon(word, word.Length - 2))             {                 word = word.Substring(0, word.Length - 1);             }         }         // reduces ll to l where appropriate         if (word.EndsWith("ll") && countConsonants(word, word.Length - 1) > 1)         {             word = word.Substring(0, word.Length - 1);         }         return word;     }     private string conditionalReplace(string word, string replace, string with)     {         // the condition is that there is at least one preceeding consonant sequence         int stemLength = word.Length - replace.Length;         if (countConsonants(word, stemLength) > 0)         {             word = word.Substring(0, stemLength) + with;         }         return word;     }     private int countConsonants(string word, int to)     {         //counts sequences of consonants before to but after any initial consonant sequence         int cCount = 0;         bool wasVowel = false;         for (var ci = 0; ci < to; ci++)         {             if (hasConsonantAt(word, ci))             {                 if (wasVowel)                 {                     wasVowel = false;                     cCount++;                 }             }             else             {                 wasVowel = true;             }         }         return cCount;     }     private bool isDoubleCon(string word, int at)     {         // is double consonant         if (at < 1) { return false; }         if (word.ElementAt(at) != word.ElementAt(at - 1))         {             return false;         }         return hasConsonantAt(word, at);     }     private bool isConVowCon(string word, int to)     {         /* is true <=> i-2,i-1,i has the form consonant - vowel - consonant      and also if the second c is not w,x or y. this is used when trying to      restore an e at the end of a short word. e.g.          cav(e), lov(e), hop(e), crim(e), but          snow, box, tray.         */         if (to < 2 || !hasConsonantAt(word, to) || hasConsonantAt(word, to - 1) || !hasConsonantAt(word, to - 2))         {             return false;         }         if (word.ElementAt(to) == 'w' || word.ElementAt(to) == 'x' || word.ElementAt(to) == 'y')         {             return false;         }         return true;     }     private bool hasVowelInStem(string word, int to)     {         for (var vi = 0; vi < to; vi++)         {             if (!hasConsonantAt(word, vi))             {                 return true;             }         }         return false;     }     private bool hasConsonantAt(string word, int at)     {         switch (word.ElementAt(at))         {             case 'a':             case 'e':             case 'i':             case 'o':             case 'u':                 return false;             case 'y':                 return (at == 0) ? true : !hasConsonantAt(word, at - 1); // y can act as consonant and vowel             default:                 return true;         }     } }

*More details of the Snowball algorithm can be found here http://snowball.tartarus.org/

<Edit>Fixed the colour errors in the demo code. The cause (something to do with text size) is under investigation</Edit>