Day 1 … sort of

Wed Jun 15 14:27:58 UTC 2016

I decided I needed some structure in this project of mine at work. I’m trying to replace a team’s old rule-based Oracle solution for text classification with a new machine learning one, built from scratch and integrated into my team’s core application infrastructure.  Today is “Day 1” … even though I’ve been back from my wedding / honeymoon for about two weeks now. As per usual, since I haven’t done any machine learning before, it’s been a lot of getting foundational materials set up (e.g. installing NLTK, discovering that something happened to our version of Python while I was away, reaching out to the sys admins for help recompiling it – bleh).

So today is the first day of real “work”, meaning work that wasn’t reading documentation, working with the taxo team to figure out how the old application works, or trying to digest machine learning basics as quickly and thoroughly as possible.

Wed Jun 15 20:45:57 UTC 2016

So, a little bit about me.  I actually started out my professional career as a high school math teacher.  After about 3 years of that, I realized (through the help of some solid career counseling) that my personality was not well suited for such work.  Right context (helping people), wrong skill set (classroom instruction and management).  Fast forward a few years to 2010, I’ve got my MS in Applied Stats and have started at my first job at an online retailer as a “Reporting Specialist”.  Now, 5 years into my new career in the data warehousing space, I’d consider myself “proficient” in Python.  By no means an expert, but I can script what I need to get the job done.

All that to say, I’ve been trying to loop through a bunch of results I’ve pulled out from our database and build a training dataset for a Naive Bayes Classifier I want to test out.  The thing is, at the rate I’m going, it’s going to take about a half hour to get the initial data set – not even training the classifier mind you – this is just providing the list to train on!  Stuff like this I know is faster with other methods, but I’m just not familiar with them … or what they are …

Wed Jun 15 21:15:16 UTC 2016

So, wandering around the internet a bit, I found mentions of the “map()” function.  This is not something I’ve used before and, frankly, I was not super confident I’d use it correctly.  However, the for loop I was trying to run was going to take at least a half hour, if not much, much longer, to process the entire result set.  I figured that if testing out the classifier was going to be anywhere near efficient, building the training data needed to be as quick as possible.  In the end, a loop like this …

 trainingSet = []

 for (i, skuVals) in enumerate(skuValList, start=1):
     # Progress output for every row
     print('{} of {} : {}'.format(i, len(skuValList), skuVals))
     # Split the current row (not the whole list): features first, label last
     skuVals, lookupKey = skuVals[:-1], skuVals[-1]

     trainingSet.append(
         (
             dict(zip(featureNames, skuVals)),
             lookupKey
         )
     )

 return trainingSet

… became much, much quicker using map() instead, like this …

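 # Apply splitVals to every row; on Python 2, map() returns a list
 trainingSet = map(self.splitVals, skuValList)
...
 def splitVals(self, values):
     # Everything but the last value is a feature; the last value is the label
     skuVals, lookupKey = values[:-1], values[-1]

     return (dict(zip(self.featureNames, skuVals)), lookupKey)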

Apologies on the spacing – new to publishing code on WordPress.  But you get the idea.
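For anyone who wants to try the two approaches side by side, here is a minimal, self-contained sketch. The column names and rows are made up for illustration (they are not the real database results), and split_vals is a plain-function stand-in for the splitVals method above. It is written for Python 3, where map() returns a lazy iterator, so it is wrapped in list() to get an actual list.

```python
# Hypothetical stand-ins for the real column names and database rows.
feature_names = ["brand", "category", "color"]
sku_val_list = [
    ("Acme", "shoes", "red", "footwear"),
    ("Globex", "shirts", "blue", "apparel"),
]


def split_vals(values):
    """Split one row into (feature dict, label); the label is the last value."""
    sku_vals, lookup_key = values[:-1], values[-1]
    return (dict(zip(feature_names, sku_vals)), lookup_key)


# Explicit loop version.
loop_result = []
for row in sku_val_list:
    loop_result.append(split_vals(row))

# map() version; list() forces the lazy iterator on Python 3.
map_result = list(map(split_vals, sku_val_list))

# List comprehension, a common alternative that reads much like the loop.
comp_result = [split_vals(row) for row in sku_val_list]

print(map_result[0])
```

All three build the same list of (features, label) pairs, so which one to use is mostly a question of readability and speed.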

I was shocked at how quickly it finished – literally, a few seconds to complete.
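A rough way to see where the time goes is to time a few variants with the standard timeit module. The sketch below uses made-up rows and hypothetical names, so the numbers will not match the real data set and will vary by machine. It compares the loop with per-row progress printing, the same loop without printing, and map(). Printing is captured in memory here; writing every row to a real terminal is usually slower still, so part of a big speedup may come from dropping the print rather than from map() alone.

```python
import io
import timeit
from contextlib import redirect_stdout

# Made-up data: three feature columns plus a label in the last position.
feature_names = ["f0", "f1", "f2"]
rows = [(i, i + 1, i + 2, "label") for i in range(5000)]


def split_vals(values):
    """Return (feature dict, label) for one row."""
    return (dict(zip(feature_names, values[:-1])), values[-1])


def build_with_loop(verbose=False):
    """Build the training set with a for loop, optionally printing progress."""
    result = []
    for i, row in enumerate(rows, start=1):
        if verbose:
            print("{} of {} : {}".format(i, len(rows), row))
        result.append(split_vals(row))
    return result


def build_with_map():
    """Build the training set with map()."""
    return list(map(split_vals, rows))


def timed(func, *args):
    """Total seconds for three runs of func, with stdout captured in memory."""
    with redirect_stdout(io.StringIO()):
        return timeit.timeit(lambda: func(*args), number=3)


timings = {
    "loop, printing": timed(build_with_loop, True),
    "loop, silent": timed(build_with_loop, False),
    "map": timed(build_with_map),
}

for name, seconds in sorted(timings.items()):
    print("{:<15} {:.4f}s".format(name, seconds))
```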
