I was playing around with the Dunnhumby dataset, which is a really nice dataset to have when you are working with retail clients. We had been doing exactly that, and now wanted to showcase the lessons from the engagement. The data from that engagement was not ours to share, so I used the Dunnhumby dataset instead: it let me use the same pipelines and techniques without any sensitive data. The ingestion worked (a bit faster than I would have liked, around 40 seconds for the 40 GB dataset, when I needed time to talk about the process a bit more) and the Dataflow pipeline worked nicely.
As I was going through the motions of setting the data up for the analysis, I picked up something very strange. I did a scatter plot with SPEND and QUANTITY on the axes and the data partitioned by store. I also created an LOD (level of detail) calculation that would give me the Z-score (the normalised data value) of each point. I could then tell, per line, what the outliers (|Z| > 2) were and flag them.
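For anyone who wants to reproduce the flagging outside of a BI tool, here is a minimal sketch of the same idea in plain Python. It is not the LOD calculation I actually used: the toy rows are made up, and I am assuming the Z-score is taken on SPEND within each store, using the population standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy rows: (STORE, QUANTITY, SPEND). Not real Dunnhumby data.
raw = [
    ("A", 1, 1.0), ("A", 2, 2.1), ("A", 1, 0.9),
    ("A", 3, 3.0), ("A", 2, 2.0), ("A", 1, 30.0),
    ("B", 1, 1.5), ("B", 1, 1.4), ("B", 2, 3.1),
    ("B", 1, 1.6), ("B", 2, 2.9), ("B", 1, 1.5),
]
rows = [{"STORE": s, "QUANTITY": q, "SPEND": v} for s, q, v in raw]


def flag_outliers(rows, threshold=2.0):
    """Add a per-store Z-score of SPEND and an outlier flag to every row."""
    # Collect the SPEND values of each store (the "level of detail").
    spend_by_store = defaultdict(list)
    for row in rows:
        spend_by_store[row["STORE"]].append(row["SPEND"])

    # Mean and population standard deviation per store.
    stats = {s: (mean(v), pstdev(v)) for s, v in spend_by_store.items()}

    flagged = []
    for row in rows:
        mu, sigma = stats[row["STORE"]]
        # Guard against stores where every SPEND value is identical.
        z = (row["SPEND"] - mu) / sigma if sigma else 0.0
        flagged.append({**row, "Z": z, "IS_OUTLIER": abs(z) > threshold})
    return flagged


flagged = flag_outliers(rows)
outliers = [r for r in flagged if r["IS_OUTLIER"]]
```

With these made-up numbers only the 30.0 spend line in store A ends up flagged; everything else sits well within two standard deviations of its own store's mean.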
Initially nothing looked untoward: the higher the quantity, the higher the spend, a nice linear relationship. In fact, it was too linear. That suggests the buying distribution is not really that random, and that there are no speciality products, region-specific products or other deviations in the set. This held even when I looked at the data by year and region as well.
I like Z-scores: they give you a quick way of filtering out data or picking up trends in your outlier sets. Z-score is the term I picked up from DataCamp, but a lot of people say normalisation, so I will use the two fairly interchangeably. When I coloured the data by outlier (or not), it got really interesting: both the inliers and the outliers were specifically linear. This means that Principal Component Analysis could be used, and that a decision tree (with one branch) or a linear regression would give almost perfect results.
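To put a number on "almost perfect", one option is to fit a straight line of SPEND against QUANTITY and look at R². The sketch below does an ordinary least-squares fit by hand with the standard library. The data points are invented for illustration, and `fit_line` is a hypothetical helper of mine, not something from the original pipeline.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x, plus R^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical, nearly linear toy data (not taken from the dataset).
quantity = [1, 2, 3, 4, 5, 6]
spend = [2.0, 4.1, 5.9, 8.0, 10.1, 11.9]

slope, intercept, r_squared = fit_line(quantity, spend)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.4f}")
```

Running the same fit separately on the inlier rows and the outlier rows would show whether each group really sits on its own line; an R² very close to 1 in both groups is the "too linear" signal described above.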