Explain … EXPLAIN

So, I work with a bunch of economists, and every so often they ask for datasets to be generated for them to analyze.  I store the results in a table in our Redshift cluster, and I like to have an at-a-glance view of what's been done for them.

A little while ago, they started asking for a metric that was a simple ratio of two counts on the data from each dataset.  Normally the query with the counts independently didn't take very long.  But when I wrote the query dividing the two … I waited … and waited … and waited.

Here’s the query I was trying to run:


select
    marketplace_id,
    sample_id,
    case when sample_id ~ 'sample-time'
         then timestamp 'epoch' + split_part(sample_id, 'sample-time-', 2)::integer * interval '1 second'
         else null end as sampletime,
    count(*) as records,
    count(distinct customer_id) as cust_count,
    count(*)::numeric / count(distinct customer_id) as gv_per_cust
from public.civ_agg_metrics
where sample_id !~ 'DEPRECATED'
group by 1, 2, 3;

The explain plan yielded a cost of 291745953.09.

When I placed the counts inside a CTE (something I've started using a lot more), the query performed much better – about as well as it did without the added metric.


with summary as (
    select
        marketplace_id,
        sample_id,
        case when sample_id ~ 'sample-time'
             then timestamp 'epoch' + split_part(sample_id, 'sample-time-', 2)::integer * interval '1 second'
             else null end as sampletime,
        count(*) as records,
        count(distinct customer_id) as cust_count
    from public.civ_agg_metrics
    where sample_id !~* 'DEPRECATED'
    group by 1, 2, 3
)
select
    marketplace_id, sample_id, sampletime, records, cust_count,
    round(records::numeric / cust_count, 2) as gvs_per_cust
from summary;

The explain cost was only slightly smaller, 288504796.75. I'm not sure whether Redshift struggled with computing the counts and the division of the two counts in the same query, but the CTE seemed to help move things along much quicker.

34 on 3/4

This is actually from my personal journal, non-work related, but apparently publishing wasn’t included in their latest v2 release, so I’m posting it here instead.

So, I'm going to run/walk 34 miles next week. On my birthday. March 4th (3/4). And I'm turning 34. I have a feeling it will be an auspicious day, if my numerology serves me correctly. I've actually been really excited planning for it. There's something about setting a goal and seeing your body rise to the challenge that is actually invigorating to me. I've been putting landmarks together for Katie (and others if they want to join in at the wee hours of the night), and it's been thoroughly enjoyable. It's fun to see everything come together and what needs to happen when and where. I also tested out the virtual partner functionality on my watch – something I've never actually felt like I needed to do before. But now that timing is going to be actually critical to getting through this, I figured I'd test it out. I tried to feel out what a 12 minute pace was like, so I set it for 2 miles at 24 minutes. I finished a bit ahead of schedule, which is not surprising, but I'm wondering right now if I could go a bit faster, like 11 minute pace instead, so as to not wake up at 1:30 AM. The last few race paces look like this:

Oct 2016: 1/2 marathon – 9:21 pace
Aug 2016: Half IM (1/2 marathon) – 11:37 pace (~4 hours of racing prior, jeez I need to do more brick runs)
Oct 2015: Marathon – 9:17 pace
Sep 2015: IM (marathon) – 11:20 pace (9 hours of racing prior)
Aug 2015: Half IM (1/2 marathon) – 9:38 pace (~4 hours of racing prior)

Even at “peak exhaustion” (if that’s even a thing), during the half and full Ironman distances, I was at 11:30ish. So, 12 minutes is probably a bit overkill coming off of fresh legs. But, given recent training efforts (3 miles-ish) have been in the 9:50 range, I might be better off with 10:30 or even 11:00 pace. Which is great, since it means I could get a whole 30-45 more minutes of sleep!

If you’re reading this and you’d like to save a life by providing clean water – please donate here

Hadoop Pi

Steps taken to produce a 4-node Hadoop 2 cluster on Raspberry Pis.

I’m not going to even pretend like I know what I’m doing here – heavily copy/pasted from this tutorial and the subsequent followup

Purpose of project

The whole reason I'm going through the motions with this basic "Hello World" type of exercise is that I believe distributed computing is going to become more and more of a distinguishing mark of any data engineer's trade. I've already felt it in my last job search – the lack of Hadoop / Spark / distributed computing experience. My journey began in simple single-node RDBMS environments (MySQL), then moved to a shared-nothing MPP solution (PostgreSQL-flavored) – that last step started to open my eyes to the benefits of distributed databases. The next step for me, I think, is to dip my toe into the Hadoop / distributed computing world. Even if I'm wrong, and I can have a solid career without Hadoop experience, I still think it's a good challenge and good to be exposed to current trends in data engineering technologies – even if I'm more than a few years behind the times.

Components

Download and image the SD – condensed from RaspberryPi.org

Repeat the following steps for each SD card:
1. Download the latest version of Raspbian Jessie and unzip the file.
2. From a terminal, list the disks and identify the disk (not partition) of your SD card:
diskutil list
3. Unmount the SD card using that disk identifier, to prepare for copying data to it:
diskutil unmountDisk /dev/disk<disk# from diskutil>
4. Copy the image to the SD card (on my MacBook this took almost 7 minutes):
sudo dd bs=1m if=PATH/TO/IMG of=/dev/rdisk<disk# from diskutil>
* PATH/TO/IMG: the path where I placed the Raspbian Jessie .img file

Put together Raspberry Pi units – enclose motherboard in case, insert OS SD Card

Set up each Raspberry Pi unit

I'm going to set up my NameNode first and then ssh into all the other boxes from there, so I'll need SSH enabled on each unit.
The Pi boots to the Desktop OS initially; I open up a terminal session and run raspi-config:
* Expand Filesystem
* Boot options -> Desktop/CLI -> Console Autologin
* Advanced Options -> SSH
* Advanced Options -> Hostname -> “pi-hdp01”

At this point I also identified the IP address of this particular unit.

Update system and install Java

  • sudo apt-get update && sudo apt-get install oracle-java7-jdk
  • sudo apt-get install pdsh
  • sudo update-alternatives --config java
  • Ensure jdk-8-oracle* is selected

Configure Hadoop user / group

  • sudo addgroup hadoop
  • sudo adduser --ingroup hadoop hduser
  • Use whatever password
  • sudo adduser hduser sudo

Setup SSH for hduser

su hduser
mkdir ~/.ssh
ssh-keygen -t rsa -P "" # press enter when prompted for file location, we want this in the default ~/.ssh location
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
su hduser
ssh localhost # trust the connection
exit

Install protobuf 2.5.0

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xzvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr
make # I don’t typically run these make commands, so this seemed like it took forever (10 minutes?) and I kept getting the same weird messages over and over
make check
sudo make install

I think the above was unnecessary since I ended up downloading the binary files instead of building from source.

Install Hadoop

At this point, I broke away from the tutorial I was following which tried to build Hadoop from source with Make and all that craziness. I’m just downloading the binary and copying to /opt as per another tutorial

wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
cd /opt
sudo ln --symbolic hadoop-2.7.3/ hadoop
sudo chown -R hduser:hadoop hadoop
sudo chown -R hduser:hadoop hadoop-2.7.3/

Add environment variables

Edit /etc/bash.bashrc – add the following lines to the end of the file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin

Source the new bash profile.
source ~/.bashrc

Test to make sure it’s installed correctly

pi@pi-hdp01:/opt $ su hduser
Password:
hduser@pi-hdp01:/opt $ hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
hduser@pi-hdp01:/opt $

Success!!!

Test Hadoop in single node, non-distributed mode

At this point, I’m following Apache’s Documentation on “Single Node Setup”, forgoing any other tutorials. I may go back to other blog posts to revisit specific examples or tests others are doing.

From the /opt/hadoop directory …

sudo mkdir input
sudo cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
# my own example; results are written to a directory "output2"
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output2 'Zoo.*'

hduser@pi-hdp01:/opt/hadoop $ cat output2/*
2 Zookeeper.
1 Zookeeper connection string, a list of hostnames and port comma
1 Zookeeper authentication type, 'none' or 'sasl' (Kerberos).
1 Zookeeper ZNode path where the KMS instances will store and retrieve

Test Hadoop in pseudo-distributed mode

Update ./etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Update ./etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Format filesystem:
bin/hdfs namenode -format
Start name node daemon and data node daemon:
sbin/start-dfs.sh
Update JAVA_HOME in ./etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/
Make the HDFS directories required to execute MapReduce jobs:
bin/hdfs dfs -mkdir /user/
bin/hdfs dfs -mkdir /user/hduser

At this point, I tried doing an ls for /user and it wasn't there … apparently this /user directory lives in HDFS, not on the local filesystem like I was assuming.

Continuing on …
Copy the input files into the distributed filesystem:
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop/*.xml input

Run an example like earlier:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:

hduser@pi-hdp01:/opt/hadoop $ cat output/output/*
1 dfsadmin
1 dfs.replication

I had an existing output directory in /opt/hadoop, so copying the results back from HDFS put everything inside that existing output directory (hence output/output).

Note: I was still curious as to where this distributed file system actually was, and apparently it was in /tmp

hduser@pi-hdp01:/opt/hadoop $ grep -r dfs.data.dir ./*
...
./share/doc/hadoop/hadoop-streaming/HadoopStreaming.html:<pre> -D dfs.data.dir=/tmp
./share/doc/hadoop/common/CHANGES.txt: 8. Change dfs.data.dir and mapred.local.dir to be comma-separated

hduser@pi-hdp01:/opt/hadoop $ ls /tmp/
hadoop-hduser hadoop-hduser-secondarynamenode.pid hsperfdata_root Jetty_localhost_46113_datanode____.zi7wi0
hadoop-hduser-datanode.pid hsperfdata_hduser Jetty_0_0_0_0_50070_hdfs____w2cu08 pulse-2L9K88eMlGn7
hadoop-hduser-namenode.pid hsperfdata_pi Jetty_0_0_0_0_50090_secondary____y6aanv

Stop the HDFS daemons:
sbin/stop-dfs.sh

As an aside, the contents of /tmp remain the same, but the .pid files are removed.

Running YARN on single node

Create etc/hadoop/mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Update etc/hadoop/yarn-site.xml:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Start YARN:

hduser@pi-hdp01:/opt/hadoop $ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.7.3/logs/yarn-hduser-resourcemanager-pi-hdp01.out
localhost: starting nodemanager, logging to /opt/hadoop-2.7.3/logs/yarn-hduser-nodemanager-pi-hdp01.out

SUCCESS – the ResourceManager web UI loads at localhost:8088 (viewed through an SSH tunnel: ssh -L 8088:localhost:8088 hduser@pi-hdp01).

Now I’m going to try running the same MapReduce job they’ve provided in the examples, counting the occurrences of that “dfs” string:
First, bring the distributed file store back up:
sbin/start-dfs.sh
I’ve been following this blog on and off, and found this “jps” command super helpful:

hduser@pi-hdp01:/opt/hadoop $ jps
25520 Jps
25210 DataNode
25114 NameNode
24571 NodeManager
25372 SecondaryNameNode
24478 ResourceManager
hduser@pi-hdp01:/opt/hadoop $

Before, I hadn’t run start-dfs.sh so only Jps, NodeManager, and ResourceManager had shown up.

Start the MapReduce job:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

There were a lot of "failures" and it took a lot longer than the earlier local and pseudo-distributed runs – not sure why. But this is also a pretty simple example and not a typical use case, I'm sure.

Generators

Just realizing I haven't posted in here in a while – most probably due to the rounds of job interviews and finally accepting a position at the University of Washington.  But I am committed to journaling my professional learning and experience as an iterative process to be shared with others.

Recently, I've had to write unit tests for a website at work.  Not something I've typically done in the past, but it's good experience and good, in general, to be learning something new.  One of the tests involved testing various URLs for a redirect.  Though I'm not well versed in the state of website frameworks, I'd guess there's a good chance that most websites use some form of regular expression to route URLs to the content for different parts of the site (e.g. mygreatblog.com/posts/12345 might load post number 12345 for that particular site).  Regular expressions are tricky, though, since multiple strings can match the same pattern.  The first module I found to handle this produced extremely long strings for patterns such as "\d+" (a string of one or more digits).  Although I couldn't recreate it and didn't document the incident while it happened, I can say anecdotally that I ran into some weird SQLite issue when the test requested a URL containing an extremely large integer generated from the regular expression.  Odd.  So, knowing the test could potentially fail on undesirable URLs, the question became: how do I generate an acceptable string of minimal length?

I looked around online and found another module someone built on top of Python's "sre_parse" module, which is the backbone of the regular expression (re) module.  There's documentation elsewhere stating that you really shouldn't use sre_parse directly, but it's there and sensible, so I'm glad someone has gone through the effort of understanding what that piece does.  What really triggered this post is that the method that gave me finer-grained control over regex group substitution (the pieces of the pattern it isn't explicitly trying to match on) is built around a generator.  I began working with generators in my last project at my last company, replacing a piece of licensed Oracle software with a homebrew machine learning approach – which I hope they've started using now 🙂  The use of generators in my current project was somewhat trivial – I just needed one example of a URL to request.  But in my last project, it was pretty much a lifesaver.  I was trying to store various-length permutations of sets of various sizes and quickly ran into memory issues.  What the generator did was, rather than create a giant list of all the possibilities and store that in memory, essentially encode the "rules" for generating the possibilities and only produce another one when you need it.  It's little victories like that where I realize, "Oh hey, I am growing and developing in my profession."  Sure, maybe not as fast as others, but connections are being made and skills are being developed.  I'm slowly learning to have the same attitude of acceptance and persistence in my career that I have towards racing.
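To make the idea concrete, here's a minimal sketch (with made-up item names, not the actual project code) of a generator that yields permutations of every length one at a time instead of materializing the whole list up front:

from itertools import permutations

def length_permutations(items, max_len):
    # Lazily yield permutations of every length up to max_len,
    # instead of building one giant in-memory list.
    for n in range(1, max_len + 1):
        for combo in permutations(items, n):
            yield combo

# Only one permutation is held in memory at a time; each pass through
# the loop asks the generator for the next one.
for combo in length_permutations(["red", "large", "cotton", "shirt"], 3):
    print(combo)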

EDIT: on a barely related note – I found my first bug on a github project!

CLI

I like building command line interfaces.  I don't know what it is, but there's a certain feeling of accomplishment when something works well and works correctly.  Part of it is that I'm breaking down a sometimes complicated process, abstracting away the parts that people don't care about or that are too labor intensive to do over and over again, and letting the program do its job.  While this is certainly beneficial to my daily work, most of the time I'm really building a CLI for other people.  And in that sense, it's an even greater feeling of accomplishment when I see people gravitating towards a tool I've built and actually knowing how to use it – with little to no explanation.  I'm not a UX / UI guy by any means, but I appreciate the thoughtfulness that goes into building something that doesn't require an instruction manual.  In some larger, grander sense it taps, in some small way, into a universally accepted medium of communication that transcends verbal language … even if it is just to run lots of INSERT and UPDATE statements.

I think this post really resonated with me as a result.  It highlights a lot of themes that just make sense when you're developing a tool other people will use a lot.  Keep things familiar.  Provide feedback to the user.  Allow for varying levels of verbosity in that feedback.
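As a rough illustration of the verbosity point (a sketch only – argparse is just one way to do it, and none of these names come from my actual tool):

import argparse

parser = argparse.ArgumentParser(description="Load a client's records")
parser.add_argument("client", help="client identifier")
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="repeat for more feedback (-v, -vv)")
args = parser.parse_args()

if args.verbose >= 1:
    print("processing client %s ..." % args.client)
if args.verbose >= 2:
    print("this is where connection, query, and timing details would go")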

I've also started to play with the curses package.  And that's exactly what I've wanted to do … curse.  Granted, compared to curses, my little printed-out menus and interfaces look like trash.  But to get there, I didn't realize I had to tell curses every. little. thing. Ugh.  And using the wrapper() method – while probably invaluable later on – did not help me get a grasp of what curses was doing when it set up a lot of the "standard" configuration.  I STILL don't know how to print out colored text – but I got a sweet border and flashing letters!  I can definitely see myself using curses going forward if I'm going to build anything interactive within the shell, but not without building a class (or finding one) that makes the mundane parts of curses a bit more intuitive.  It's not a graphical user interface, so I shouldn't have the same expectations, now should I?
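For reference (mostly for future me), here's roughly the minimum amount of hand-holding curses needs for a border, blinking letters, and – the part I hadn't figured out yet – colored text.  A bare sketch, not anything from my actual menus:

import curses

def main(stdscr):
    curses.curs_set(0)                       # hide the cursor
    curses.start_color()
    curses.init_pair(1, curses.COLOR_YELLOW, curses.COLOR_BLACK)

    stdscr.border()                          # the sweet border
    stdscr.addstr(2, 4, "Hello, curses!", curses.A_BLINK)
    stdscr.addstr(4, 4, "Colored text via color_pair(1)", curses.color_pair(1))
    stdscr.refresh()
    stdscr.getkey()                          # wait for a keypress before tearing down

curses.wrapper(main)                         # wrapper() handles setup and cleanup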

Sorting

So, my work up to now really hasn't required writing CPU-intensive code.  Though it's really not a bad place to be when I can say I've built an application for my team with a high level of uniformity, consistency, and extensibility in its operation, logging, and user interface.  That being said, I can certainly still learn a few more tricks.

This taxonomy project has been a great exercise in developing a better sense of efficiency.

One of the jobs I built out had to perform a number of steps.  Initially, as a POC, I didn't care much about efficiency – just that the whole operation worked and spit out something that looked like the current output.  Though in the back of my mind I knew the scale this had to expand to would make efficiency a pretty big problem later on.  At the heart of it, what I'm trying to do is basically this:

  1. Pull out records from a database for the particular client
  2. For various text fields in each record, split out the text into “ngrams” and find various combinations (or permutations, depending on the context) of those ngrams
  3. For each combination see if there’s a match in a set of “rules”
  4. Recursively, rebuild the set of ngrams and determine if new matches can be found
  5. With all your matches, see if you can identify that record as a particular “item” based on another set of “rules”.

(I’m not going to even bother going into the details but you kinda get the gist)
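Roughly the shape of steps 2 and 3, as a toy sketch (hypothetical helper names and a made-up rule set, not the real job):

def ngrams(text, n):
    # All contiguous n-word ngrams of a text field.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def matching_rules(text, rules, max_ngram=3):
    # Build every ngram up to max_ngram words and keep the ones that
    # appear in the rule set.
    grams = [g for n in range(1, max_ngram + 1) for g in ngrams(text, n)]
    return {g for g in grams if g in rules}

rules = {"cotton shirt", "large"}                       # made-up rules
print(matching_rules("Large red cotton shirt", rules))  # {'large', 'cotton shirt'}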

The following outlines some of the speed enhancements that I was able to make:

Stage 0: POC – sequential processing

At this point, everything was running sequentially, with database queries everywhere.  Since everything ran in a loop, it really didn't make sense to be querying the database for the various "rules" on every single record.

Once I got the VP's and the respective team's buy-in, I started tracking various clients' progress through the application.

Running Client X through the application took: 15 minutes and 47 seconds

Stage 1: Parallelism


Given that much of the work done on the server wasn't CPU intensive (mostly executing commands on our database cluster), I realized we could run the same process in parallel using Python's multiprocessing library.  multiprocessing forks the process, copying the relevant state into child processes that run on separate CPUs, which sidesteps Python's Global Interpreter Lock.
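The change was basically this shape (a sketch with toy records – the real per-record work is the database-heavy taxonomy matching described above):

from multiprocessing import Pool

def process_record(record):
    # Stand-in for the real per-record work, which is mostly waiting
    # on database round-trips rather than burning CPU.
    return record["id"], len(record["text"].split())

if __name__ == "__main__":
    records = [{"id": i, "text": "large red cotton shirt"} for i in range(2500)]
    with Pool(processes=8) as pool:          # 8 workers, as in Stage 1
        results = pool.map(process_record, records)
    print(len(results), "records processed")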

Using a pool of 8 workers, Client X's processing time dropped to 3 minutes and 29 seconds (4.3x faster than the sequential processing).

Stage 2: Parallelism with hash table

While 3 minutes was not too bad, this was also a very small client – on the order of 2,500 records, roughly 1/40 of the median client size of around 100,000 records.  While that isn't outrageously prohibitive for 50% of the clients, it still was not ideal – the very fact that half of our clients are vastly larger than that is evidence enough that this process should run faster.

My undergrad was in mathematics, not computer science, so looking at design patterns and algorithms from the perspective of computational efficiency was somewhat new to me.  After a lot of wandering around the vastness of the internet, I stumbled upon this site.  Aside from providing the information I needed, it was a very well designed site and explained things clearly.

I was already caching the results of a particular recursive function, but I was still running into long processing times in one part of the job that did a lot of searching through a giant dictionary.  What I realized was that I needed to find what I wanted faster.  Rather than slog through all the records, what if the data I was trying to match on had some inherent hint as to where the match might lie – which led me to the hash table concept.  With a hash table, once the process kicks off, I pull in all the possible values from the database and bucket them by the length of the string modulo 11 – I'm sure there's a good number-theory reason why primes are used, but I haven't looked into it yet.  Simple trial and error showed that 11 gave a nicer distribution of values across the "slots" than the other prime-length lists I tried.  I'm assuming that with a more systematic test I could probably find a larger prime that would make this search even faster.  Now, instead of looking through a giant dictionary of values, I compute the length of the string modulo 11 for a particular field in a particular record, and all of a sudden I have a MUCH smaller set of records to check against.
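The gist of it, as a sketch (toy values, with len(value) % 11 standing in for whatever the real job hashes on):

from collections import defaultdict

NUM_BUCKETS = 11   # the prime that happened to spread my values evenly

def build_buckets(values):
    # Bucket every candidate value by len(value) % 11 once, up front.
    buckets = defaultdict(list)
    for value in values:
        buckets[len(value) % NUM_BUCKETS].append(value)
    return buckets

def candidates_for(field, buckets):
    # Only the values in the matching bucket need to be scanned.
    return buckets[len(field) % NUM_BUCKETS]

buckets = build_buckets(["cotton shirt", "large", "red shirt"])  # toy values
print(candidates_for("new shirt", buckets))                      # ['red shirt'] – same length bucket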

As a result, with the hash table implemented (and the worker count increased from 8 to 24), the time for Client X to process through the application dropped to 47 seconds (19.3x faster than sequential, 4.4x faster than the 8-worker pool).

Granted, this is not an “apples to apples” comparison given that I used 3x more workers this time, but even assuming a linear improvement, this is STILL faster than I would have gotten with 24 workers without the hash table.  Simply increasing the worker size would not be a sustainable practice anyway, given that this is a shared environment and other processes (even though they are not as CPU intensive) are running alongside.  The idea is to work smarter not harder.

Multitasking

I feel like I'm completely comfortable performing most tasks asked of me from the command line.  However, this post reminded me of some functionality that I knew of but haven't gotten into the habit of using.  (E.g. the "hat" substitution was a convenient shortcut I just used to quickly switch between tailing my log and tailing the production log.)

Speaking of efficiency and shortcuts, I've become quite comfortable with screen and its shortcuts; though it occurred to me – given that I've run into more than one person who doesn't use screen regularly – that other people are used to doing things differently.  I haven't made extensive use of backgrounding / foregrounding processes, but I'd assume that's what most other people are doing?  And if so, how do people keep track of what's going on easily?

My NumPy crash course

This project that I've been working on for a while has finally come down to a critical component – processing time.  When I started out, it was simply a question of, "Can I get this black box system to produce similar results with new code I build myself?"  The answer was yes.  It took quite a long time just to get the POC (proof-of-concept) ready: it required documenting the old system, determining whether I could extract information from it, and subsequently using that information.  All of this was doable, and had been done.  The next step was storing the data returned from my new code and building out functionality for less technical team members.

All this I felt I had done pretty well, until I realized about a month ago that a single record took well over a minute to process.  So, what happens when I (potentially) have literally millions – in a few cases tens of millions – of records to process?  Yikes.  A lot has happened in the last month – my wife and I moved from Chicago to Seattle, with all the little things that come with that, like putting our home together and unpacking!  Just a few small details.  These last two weeks were really the first chance I've had to tackle the question of speed, and I feel like I've made a few strides.

When it was just the POC, I was hitting the database on multiple occasions (see previous post).  I knew what data I wanted and I was most comfortable getting it off the database – why worry about doing anything fancier than that?  A particular set of data, call it Data Set X, processed in 9 minutes (roughly 100 records).  Okay, not great, but I was glad to see it chugging along and producing expected results.

Beginning last week, I knew I needed to pull the data in one go as the process began and manipulate it into the form I needed on the fly.  This required a bunch of list comprehensions.  I feel like list comprehensions are either great or horrid to look at, let alone construct.  Regardless, I was working remotely with few distractions and was able to crunch it out.  Data Set X was now processing in under 4 minutes – YES!
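Something in the spirit of what those comprehensions do – reshape rows fetched once up front into the lookups the process needs later (toy rows, not the real schema):

# Rows as they might come back from a single up-front query:
# (record_id, field_name, value)
rows = [
    (1, "title", "large red cotton shirt"),
    (1, "brand", "Acme"),
    (2, "title", "blue denim jacket"),
]

# One comprehension per lookup needed later in the process.
titles_by_id = {rid: val for rid, field, val in rows if field == "title"}
ids_by_brand = {val: rid for rid, field, val in rows if field == "brand"}

print(titles_by_id)   # {1: 'large red cotton shirt', 2: 'blue denim jacket'}
print(ids_by_brand)   # {'Acme': 1}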

This last Friday and the weekend were a bit of a wash since I was flying out of town for a wedding and got delayed at LAX – the first leg of my flight sat on the runway waiting for a gate, so I had to take a later flight.  I mention this really more to gripe about delays at the airport and to encourage people to fix a broken system within air transportation / logistics.

This week I finally buckled down; I knew I needed to make this faster but wasn't completely sure how.  I was doing a lot of searching for matching records in dictionaries and figured maybe there's a better way – which led me to NumPy.  I knew the module existed but had never used it, and frankly had never needed to write something that processed records this quickly on a single server (the next step would be to spin up a cluster on AWS and approach it that way).  Thankfully, the DevOps team at my company recently installed Anaconda as our Python package manager – NumPy came pre-installed and I could start playing.  Once I had a beginning understanding of NumPy arrays, I rewrote some of the methods in my process to use arrays instead of standard lists and dictionaries.  Processing Data Set X dropped from 4 minutes down to 1 minute and 25 seconds.  Not too shabby!
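The flavor of the change, as a toy sketch (made-up values, not my actual matching logic): keep the candidates and their attributes in parallel arrays, and let boolean masks and argsort do the searching and ranking instead of looping over a dictionary.

import numpy as np

# Candidate values and a score for each, as parallel arrays.
values = np.array(["cotton shirt", "red shirt", "denim jacket", "large"])
scores = np.array([0.9, 0.7, 0.8, 0.4])

# "Which entries match this field?" becomes one vectorized comparison.
mask = values == "red shirt"
print(values[mask], scores[mask])

# Ranking stays vectorized too: indices of the top two scores.
top2 = np.argsort(scores)[::-1][:2]
print(values[top2])   # ['cotton shirt' 'denim jacket']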

Now, this is still too slow.  For ~100 records, I need to be in the 10–15 second range to even be close to fully production-ready for any type of request – about a 6–8x speed improvement.  It'll take another light bulb moment before I hit that mark, but at the same time I'm feeling a bit more confident given that I started at 9 minutes and am now down to just over 1.  While I've had a lot of experience writing production code, this is the first time speed has been a critical factor.  If anything, this has been a chance to grow in other development considerations and to learn new functionality in other Python modules.

Comfort

Up to now I've worked primarily in scripting languages (Python and Perl) and relational databases.  At Conversant I got to extend my database background into the MPP space with a software solution from XtremeData – bigger scale, but the same toy.  I haven't been able to play in the Hadoop ecosystem yet, and I feel like that might be the next set of skills to pick up.  Not just because it's what other people are using and talking about, but because, just as the next step for relational databases was a cluster, my scripting and computing ought to extend in the same parallel-processing direction.

Case in point – I've gotten this taxonomy project mostly off the ground, but my design has been pretty slow.  A lot of the slowness was due to database hits, a decision made when I was just getting this off the ground as a proof-of-concept (POC).  The thing works and outputs results similar to what the old system currently produces, but efficiency was not something I was concerned with.  To give you a sense, there are roughly 200+ clients that provide data to be processed, and maybe 25% of them have at least 100k (up to hundreds of millions of) records.  Each record in my POC was taking a minute or a few minutes to complete.  To reprocess every sku (which was not asked of the re-architecture, but it SHOULD be able to handle it) would take until shortly before I die to complete.  Ain't nobody got time for that!

Using the database to pull out the necessary info and storing it in memory for use is where I needed to go next.  But how do I rank data elements in Python without my handy dandy rank() in SQL?  What I was asking for was not a huge stretch, so I knew it was doable even with the skills at my disposal.  In fact, it was a good challenge to think through what I was actually trying to do and write code to reflect that, rather than think of what I wanted the data to "look" like and write the appropriate SQL … if that makes sense.  It was a different question I was asking myself in Python than in SQL.

Me in SQL: "Give me this dataset aggregated by [column X], return some computations, order them by the following three columns, and finally rank them so I can tell the top result and handle a tie at the top appropriately"

Me in Python: “Um … where’s the sorting method that lets me do what I want … COMPUTER – do the sorting please!”

Okay, it wasn't that bad, but it did stretch me to think about the problem in a different way.  Where in SQL I wanted pretty much what I described above, in Python I ended up writing the same computations, sorting them, and handling ties by checking for equivalence with the next result in the sorted list.
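Roughly what that looks like, with made-up columns standing in for the real computations:

# A rough Python stand-in for SQL's rank(): compute, sort by several
# columns, then check whether the runner-up ties with the top result.
rows = [
    {"item": "shirt-a", "score": 0.92, "matches": 7, "recency": 3},
    {"item": "shirt-b", "score": 0.92, "matches": 7, "recency": 3},
    {"item": "shirt-c", "score": 0.85, "matches": 5, "recency": 1},
]

def sort_key(row):
    return (row["score"], row["matches"], row["recency"])

ranked = sorted(rows, key=sort_key, reverse=True)
top = ranked[0]

if len(ranked) > 1 and sort_key(ranked[1]) == sort_key(top):
    print("tie at the top:", top["item"], "vs", ranked[1]["item"])
else:
    print("winner:", top["item"])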

By moving that (and other methods that hit the database) into memory, I was able to shave processing time for a sku from a minute or more down to a second or two.  Even still, reprocessing everything would take me until I retire – not really.  I'll just see this next challenge as another chance to rethink what I've found comfortable or doable.