Momentum

 

It’s been a bit quiet here as I’ve spent the last few days banging my head against the table trying to grasp how I was going to replace part of the existing platform’s functionality.  Today, in preparation for and during a meeting about another aspect of the production jobs running on this platform, it dawned on me that this piece was much more manageable than the one I was just working on.

By way of review, the application that I’m working on replacing has three main jobs that run:

  1. Classification of a single “lookup key” (label) for each client’s hierarchical SKU classification.
  2. Population of key/value pairs indicating the standardized form of a finite set of “attributes” (e.g. “Style_type”: “short sleeve” when the text provided contains “S/S”) – see the sketch just after this list.
  3. Classification of a single “context” given a set of phrases provided from our dynamic content management tool.
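
To make that second job a bit more concrete, here’s a minimal sketch of what that kind of attribute standardization could look like.  The rules and attribute names here are made up for illustration – the real rules are exactly the lexical “knowledge” I’m trying to export from the old system:

 import re

 # Hypothetical standardization rules for one attribute: regex pattern -> standardized value.
 # The real system holds many of these per attribute.
 STYLE_TYPE_RULES = [
     (re.compile(r'\bS/S\b', re.IGNORECASE), 'short sleeve'),
     (re.compile(r'\bL/S\b', re.IGNORECASE), 'long sleeve'),
 ]

 def standardizeStyleType(text):
     """Return the standardized 'Style_type' value for raw product text, or None."""
     for pattern, standardValue in STYLE_TYPE_RULES:
         if pattern.search(text):
             return standardValue
     return None

 # standardizeStyleType('Dual Force S/S Button Down')  ->  'short sleeve'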

Building out a POC for the first (and most utilized) job was surprisingly easy.  As the basis of it, I’m using supervised machine learning, trained on the data we already have, and building out an infrastructure for other developers to maintain the classifiers over time.  When I began to delve into the second job, though, I ran up against a number of issues.

  1. Exporting data from the application to retain the lexical “knowledge” we’ve put into the system
  2. Dealing with the fact that the data is exported in different formats
  3. One of the documents being XML, and my not really having the strongest grasp on what each of the XML tags represented (a quick tag-census sketch follows this list)
  4. Taking a minor detour into graph databases (Neo4j) to potentially mimic the same hierarchical matching that the application does
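
On the XML front, a cheap way to start getting oriented is simply enumerating which tags exist and how often they appear – a minimal sketch using the standard library’s ElementTree, with a placeholder filename standing in for the real export:

 import xml.etree.ElementTree as ET
 from collections import Counter

 # placeholder filename; the real export came out of the Java app
 tree = ET.parse('taxonomy_export.xml')

 # count every tag in the document to get a feel for its structure
 tagCounts = Counter(element.tag for element in tree.iter())
 for tag, count in tagCounts.most_common():
     print '{:6d}  {}'.format(count, tag)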

To my credit, I have a better grasp of the data I’ve pulled out of the source system.  However, that still leaves me with the issue of re-structuring the data I’ve extracted and putting it back into a system I can use to implement the matching / standardization portion.  At this point, I was getting a bit stressed out.

It was only after the meeting this morning that I was reminded it’s good to step away from problems every now and then.

The meeting was actually pretty helpful and informative for me.  While I hadn’t spent a lot of time thinking about the third job, I was kind of dreading diving into it – “Oh geez, if this last one was so complicated, what makes me think I’ll be able to figure out this one?”  But you know what?  It’s actually pretty straightforward.  Though I don’t have a POC yet, I do have a very concrete idea in my head of how it can be implemented.  And not only that, I feel more energized to go back to the second job and rethink how I’m approaching that problem, or break it up into smaller chunks for me to manage.

I think this acts as a reminder of how important self-talk is.  I can easily corner myself into thinking I can’t do something if I tell myself I can’t.  That is not to say something won’t be hard or take a lot of time.  But what walls am I putting up for myself that don’t need to be there?  Maybe the solution requires getting a lot of help from another coworker.  Maybe the solution involves taking an online course to get acquainted with the subject matter.  Either way, there is an infinite list of things that can potentially block me from accomplishing my goal – ain’t nobody got time for that – so maybe I should be focusing on what CAN be done.

Order out of Chaos

Moving off of an old application is hard work.  This would be the second time I’ve done something like this at my company.  I think the thing that helped the most the first time through was documenting anything and everything related to the existing processes.  Having these notes written down provided something concrete to refer back to as a point of reference – this helped to highlight the “special” or “edge” cases that were a bit too prolific to actually be “special” or “edge” cases.  Once again, I’m finding a similar lack of uniformity across the environment that our taxonomy team is using.

The reason I don’t like drag-and-drop interfaces is that they’re harder to automate.  To be fair, the developers did provide a CLI as well, but I didn’t need to get THAT much information from the system, and I was probably going to do more harm than good given how sparse their documentation was for the GUI – I didn’t even bother reading the CLI docs.  So, a good chunk of today was the tedium of writing down which data elements I needed to extract and manually clicking through to download the information.

 

And then the real struggle began.

At the end of the day, it was not that big of a deal.  But once I had all the data in hand, discrepancies began showing up.

  1. Eleven different misspellings of the word “client” … and that’s just one of many cases.
  2. File extracts not in a uniform layout when downloaded from the Java app.
  3. Definitions of data elements that are somewhat meaningless – until I have a deeper understanding of the inner workings of the application.

I was hoping to do as little ETL as possible with this data set.  By now, you’d think I’d learned my lesson.

What I have in hand, really, is simply what is spit out by the application, which in turn is a reflection of what the end users (none of whom are / were engineers) have put together.  Does the application work?  Sure.  But are data elements consistent across processes with similar functionality?  Do I even have to answer that?

I am excited, however, to begin to structure and organize what I need out of this information.  Once I figure out what I’m dealing with.

If you’re curious as to what exactly I’m building out, basically, I’m trying to replace a rule-based text classification system with a number of different jobs built on machine learning and natural language processing.  At the end of the day, it’d be great to say that what we’ve built out to mine text for different features was even used across the company as a common utility.  Even if not, it will be enough to say that we no longer have data mapped to both “client” and “cliennt” and “clienrt” and “cleint” and “clieny” and …
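
For what it’s worth, collapsing those near-miss spellings doesn’t have to be fancy.  Here’s a minimal sketch using the standard library’s difflib, with a made-up canonical vocabulary and cutoff – the real list would come out of our taxonomy data:

 import difflib

 # Hypothetical canonical vocabulary; the real one would come from the cleaned-up taxonomy.
 CANONICAL_TERMS = ['client', 'style', 'context']

 def normalizeTerm(rawValue, cutoff=0.8):
     """Map a possibly misspelled value onto its closest canonical term, if any."""
     matches = difflib.get_close_matches(rawValue.lower(), CANONICAL_TERMS, n=1, cutoff=cutoff)
     return matches[0] if matches else rawValue

 # normalizeTerm('cliennt')  ->  'client'
 # normalizeTerm('cleint')   ->  'client'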

Reverse Engineering

Wed Jun 22 16:30:30 UTC 2016

It’s not an easy job reverse engineering a process.  It’s an even harder job when no one really owns said process, nor are there employees who are actively vocal about its output.  But you can guaran-damn-tee that if anything breaks, all hell will break loose.  I’ve described my feeling about this to my wife as “pleasurably maddening”.  It’s a joyful thing to bring order out of chaos, but the chaos itself frustrates me to no end.

There’s a distinct difference between the application recode I undertook a few years back and what I’m doing now.  It’s similar in that I’m re-engineering a process that was a bunch of patch fixes and band-aid solutions applied without a broader scope of coding for scale and maintainability.  The major difference is that I’m wandering in the dark, as there is no real engineering “guide” on the trail that I’m going down.  It’s hard to figure out which aspects are important or not, what processes are maintaining which tables, or who the end users are for various data elements.  That is because, over the course of the last three years, this team has dwindled from a team of three (with a distinct team lead) to a team of one – and no one along the way was an engineer by trade.

I’ve previously described my last major recode as taking a structure made of Lincoln Logs, Legos, K’nex, and Scotch tape and rebuilding it with a single structural element instead.  In that last attempt, I could tell the distinctions between the various building elements – Legos vs. K’nex, and the “necessary” tape joining them together.  Now, though, I am unfamiliar with the elements, there’s no instruction manual, and I’m not intimately aware of what this structure is supposed to “do” or “be”.

In a small way, I think that’s where some of the pleasure comes from.  I am embarking on a process that is right and good, one that imbues meaning and importance on an object or process.  What I hope to achieve at the end of this is to be able to say, “this exists because …”.  From my perspective, that is part of the beauty of creation (Creation): there is a reason for existence – things are created, something out of nothing, for a purpose.  Small or large, all things are important and have value.

Syntax

Fri Jun 17 15:54:23 UTC 2016

The thing about syntax is that it’s agnostic of intent.

So, that classification project I’ve been working on had a bug.  I had copied in a part of the code from elsewhere that was working with the same list, as you do.  But of course, I wasn’t thinking about the context in which the list was being used – as part of string formatting, as opposed to the dictionary creation I had pasted it into.

As I was running the code through a sample (very slowly) of our clients, I noticed something very odd.  Everything was getting classified to only one category or no categories at all.  How odd, since I had tested all the methods individually – save for the one I had just pasted into, obvi.  And so began the familiar story of deconstructing the script back down to its individual parts and banging my head against the wall trying to figure out why running the classification manually was producing myriad results, but not through my script.

And then I saw this …

 featureDict = dict(zip(','.join(self.features), features))

… waiiiiiittt a minute.

Often, it helps me to visually see what was going on and so I put in a slice of the data I was testing with:

>>> dict(zip(','.join(ml.features), features))
{'a': None, 'k': '6733 MOLE 3XL,Mole 3XL', 'n': None, 's': 'Dual Force Reversible Button Down', 'u': 'MENS CASUAL SHIRTS', '_': 'LONG SLEEVE BUTTON-DOWN'}

UGH.  This is not what I wanted. In reality, I needed:

 featureDict = dict(zip(self.features, features))

Now it was iterating over the individual elements NOT the letters in the joined string.
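
If it helps to see the difference in isolation, here’s a toy version of the same mistake, with made-up keys and values purely for illustration:

>>> features = ['red', 'cotton']
>>> featureNames = ['color', 'fabric']
>>> dict(zip(','.join(featureNames), features))   # zips over the CHARACTERS of 'color,fabric'
{'c': 'red', 'o': 'cotton'}
>>> dict(zip(featureNames, features))             # zips over the list elements
{'color': 'red', 'fabric': 'cotton'}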

And just like that, syntax becomes that annoying friend who’s super smart but has no social cues.

Best laid plans

Thu Jun 16 18:11:05 UTC 2016

I’mma talk about functional programming, but first … I just spent an hour and a half just on setting up a method to make future development easier with quicker logging.  It’s the type of thing I did once for our primary application, and it works fine in that context, but it does not port well to one-off development 😦
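
For the curious, here’s roughly the shape of the helper I was after – a minimal sketch on top of Python’s standard logging module, with hypothetical names, not the actual method from our primary application:

 import logging

 def getLogger(name, level=logging.DEBUG):
     """Return a console logger that's cheap to grab in one-off scripts."""
     logger = logging.getLogger(name)
     if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
         handler = logging.StreamHandler()
         handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s'))
         logger.addHandler(handler)
     logger.setLevel(level)
     return logger

 # log = getLogger('classifier_poc')
 # log.debug('training set built: %d records', 12345)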

Reminds me of something

Thu Jun 16 20:38:59 UTC 2016

I was really hoping to have tested more results by now.  Other than one-off syntax errors, the real enemy lies in efficiency.  The classification project I’m working on is using NLTK’s Naive Bayes Classifier.  It’s running over a number of our clients, containing anywhere from a couple score records to 4 million.  Not only that, but the inserts are taking an unreasonable amount of time as well.
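
On the insert side, one thing worth trying is batching the writes instead of executing them row by row – a minimal sketch assuming a DB-API style connection, with a made-up table and columns (and note the parameter placeholder syntax depends on your driver):

 def insertClassifications(connection, rows, batchSize=1000):
     """Insert (sku, lookup_key) rows in batches rather than one at a time."""
     cursor = connection.cursor()
     for start in range(0, len(rows), batchSize):
         batch = rows[start:start + batchSize]
         # qmark placeholder style shown here; some drivers use %s or :1 instead
         cursor.executemany(
             'INSERT INTO sku_classification (sku, lookup_key) VALUES (?, ?)',
             batch
         )
         connection.commit()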

Efficiency has always been my killer.  Efficiency, and getting weird errors I have no idea about because I’m still just “messing around” … errors like this:

ValueError: A ELE probability distribution must have at least one bin.

 

The period at the end isn’t there for syntactical correctness so much as to highlight that things were so far gone the application gave up actually trying to run the script I had told it to and began worrying about grammar more than anything else.
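
(For future me: from what I can tell, this error tends to show up when the classifier is handed an empty training set – the probability distribution ends up with zero bins.  A cheap guard, assuming a trainingSet of (featureDict, label) tuples like mine:)

 import nltk

 def trainClassifier(trainingSet):
     """Train a Naive Bayes classifier, failing fast on an empty training set."""
     if not trainingSet:
         raise ValueError('refusing to train: empty training set - check the upstream query')
     return nltk.NaiveBayesClassifier.train(trainingSet)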

Day 1 … sort of

Wed Jun 15 14:27:58 UTC 2016

I decided I needed some structure in this project of mine at work – I’m trying to replace a team’s old Oracle solution for text classification (rule-based) and build a new one (machine learning) from scratch that’s integrated into my team’s core application infrastructure.  Today is “Day 1” … even though I’ve been back from my wedding / honeymoon for about two weeks now.  But, as per usual, not having done any machine learning before, it’s been a lot of getting foundational materials set up (e.g. installing NLTK, something happening to our version of Python over my time away, reaching out to the sys admins for help in recompiling our Python version – bleh).

So today is the first day of real “work” – work that isn’t reading documentation, working with the taxo team to figure out how the old application works, or trying to digest machine learning basics as quickly and thoroughly as possible.

Wed Jun 15 20:45:57 UTC 2016

So, a little bit about me.  I actually started out my professional career as a high school math teacher.  After about 3 years of that, I realized (through the help of some solid career counseling) that my personality was not well suited for such work.  Right context (helping people), wrong skill set (classroom instruction and management).  Fast forward a few years to 2010, I’ve got my MS in Applied Stats and have started at my first job at an online retailer as a “Reporting Specialist”.  Now, 5 years into my new career in the data warehousing space, I’d consider myself “proficient” in Python.  By no means an expert, but I can script what I need to get the job done.

All that to say, I’ve been trying to loop through a bunch of results I’ve pulled from our database and build a training dataset for a Naive Bayes Classifier I want to test out.  The thing is, at the rate I’m going, it’s going to take about a half hour to get the initial data set – not even training the classifier, mind you – this is just providing the list to train on!  Stuff like this, I know, is faster with other methods, but I’m just not familiar with them … or what they are …

Wed Jun 15 21:15:16 UTC 2016

So, wandering around the internet a bit, I found echoes of the “map()” function.  This is not something I’ve used before, and frankly, I was not super confident in using it correctly.  However, given the for loop I was trying to run, and knowing that it was going to take at least a half hour, if not much much longer, to process the entire result set, I figured that if this was going to be anywhere near an efficient process for testing out the classifier, the building of the training data needed to be as quick as possible.  In the end, a loop like this …

 trainingSet = []

 for (i, skuVals) in enumerate(skuValList, start=1):
     print '{} of {} : {}'.format(i, len(skuValList), skuVals)
     # each record is its feature values followed by its label
     skuVals, lookupKey = skuVals[:-1], skuVals[-1]

     trainingSet.append(
         (
             dict(zip(featureNames, skuVals)),
             lookupKey
         )
     )

 return trainingSet

Became much much quicker using a map() instead like this …

 trainingSet = map(self.splitVals, skuValList)
 ...
 def splitVals(self, values):
     skuVals, lookupKey = values[:-1], values[-1]

     return (dict(zip(self.featureNames, skuVals)), lookupKey)

Apologies on the spacing – new to publishing code on WordPress.  But you get the idea.

I was shocked at how quickly it finished – literally, a few seconds to complete.
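
With the training set built that quickly, the next step is actually training and poking at the classifier.  For reference, the NLTK calls look roughly like this – the feature dict in the commented example is made up:

 import nltk

 # trainingSet is a list of (featureDict, label) tuples, as built above
 classifier = nltk.NaiveBayesClassifier.train(trainingSet)

 # classify a new record's features (hypothetical example values)
 # classifier.classify({'department': 'MENS CASUAL SHIRTS', 'styleName': 'Dual Force Reversible Button Down'})

 # peek at which features the classifier leans on most
 classifier.show_most_informative_features(5)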