
Sunday, March 11, 2018

Are you smarter than a fifth grader?

"the editorial principle that nothing should be given both graphically and in tabular form has to become unacceptable" - John W. Tukey

Back to school

In the United States, most fifth-grade students learn about a fairly powerful type of data visualization; in some states it starts even earlier, in fourth grade. As classwork and homework, they will produce many of these plots:


They are called stem-and-leaf displays, or stem-and-leaf plots. The left side of the vertical bar is the stem, and the right side holds the leaves. The key or scale is important, as it indicates the multiplier. The top row in the image above has a stem of 2 and leaves 0, 6 and 7, representing 20, 26 and 27. They were invented by John W. Tukey in the 1970s (see the statistics section of part II and the classics section of part V of my "ex-libris" series), yet few people use them once they leave school. Doing stem-and-leaf plots by hand is not the most entertaining thing to do, and the original plot was limited to small data sets. But there is a variation on the original display that gets around these limitations.
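The stem/leaf split described above is mechanical enough to sketch in a few lines of Python. This is an illustration, not part of any library; the function name `stem_and_leaf` is made up for the example, and it uses the key from the text (leaf unit of 1, so a stem of 2 with leaf 6 reads as 26):

```python
# A minimal sketch of building a stem-and-leaf display by hand.
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Split each value into a stem (all but the last digit) and a leaf."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v // leaf_unit), 10)
        rows[stem].append(leaf)
    return dict(sorted(rows.items()))

data = [20, 26, 27, 31, 35, 42]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
# 2 | 0 6 7
# 3 | 1 5
# 4 | 2
```

Sorting first means each row's leaves come out in order, which is how the display is conventionally drawn.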

"Data! Data! Data!"

Powerful? Why did I say that in the first paragraph?

And why should stem-and-leaf plots be of interest to students, teachers, analysts, data scientists, auditors, statisticians, economists, managers and other people teaching, learning or working with data? There are a few reasons, with the two most important being:
  • they represent not only the overall distribution of the data, but the individual data points themselves (or a close approximation)
  • they can be more useful than histograms as data size increases, particularly for long-tailed distributions

 

An example with annual salaries

We will look at a data set of salaries for government employees in Texas (over 690,000 values, from an August 2016 snapshot of the data from the Texas Tribune Salary Explorer). From this we create a histogram, one of the most popular plots for looking at distributions. As can be seen, we can't really make out any detail (left is Python pandas hist, right is R hist):


Regardless of the language or software package used, we get one very large bar with almost all the observations and perhaps (as in R or seaborn) a second tiny bar next to it. A box plot (another plot popularized by John Tukey) would have been a bit more useful here, adding some "outlier" dots. And how about a stem-and-leaf plot? We are not going to sort and draw something by hand with close to 700,000 values...
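The collapse is easy to reproduce with synthetic data (not the Texas salaries themselves): a long-tailed sample plus one extreme value pushes essentially everything into the first bin of a default 10-bin histogram.

```python
# Demonstrate why a default histogram fails on a long-tailed distribution.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "salaries": lognormal body (median around $40K) ...
salaries = rng.lognormal(mean=10.6, sigma=0.5, size=100_000)
# ... plus one extreme outlier, which stretches the axis to ~$5.3M.
salaries = np.append(salaries, 5_266_667)

counts, edges = np.histogram(salaries, bins=10)
print(f"fraction of data in the first bin: {counts[0] / counts.sum():.4f}")
# nearly 1.0 -- one giant bar, nine invisible ones
```

With a bin width of roughly $527K, the entire body of the distribution lands in bin one, which is exactly the single-bar picture above.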

Fortunately, I've built a package (python modules plus a command line tool) that handles stem-and-leaf plots at that scale (and much, much larger). It is available from http://stemgraphic.org and also from github (the code has been available as open source since 2016) and pypi (pip install stemgraphic).
So how does it look for the same data set?


Now we can see a lot of detail. The scale was automatically determined to be 10,000, with consecutive stems ranging from 0 to 35 (350,000). We can read numbers directly, without having to refer to a color-coded legend or similar device. At the bottom, we see a value of 0.00 (who works and is considered employed for $0 annual income? apparently quite a few in this data set) and a maximum of $5,266,667.00 (hint: sports related). We see a median of about $42K, and multiple classes of employees, ranging from non-managerial to middle management, upper management and beyond ($350,000+). We've limited the display here to 500 observations, and that is what the aggregate count in the leftmost column tells us. Notice also the convenient sub-binning, which lets us see which $1,000 ranges are more common. All this from one simple display. And of course we can further trim, zoom, filter or limit what data or slice of data we want to inspect.

Knowing your data (particularly at scale) is a fundamental first step to turning it into insight. Here, we got to know our data a lot better by simply using the function stem_graphic() instead of hist() (or by using the included stem command line tool, compatible with Windows, Mac OS and Linux).
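That swap looks something like the following sketch (assuming stemgraphic is installed via `pip install stemgraphic`; the CSV file name and column name here are hypothetical placeholders for the Texas snapshot):

```python
import pandas as pd
import stemgraphic

# Load the salary snapshot; file and column names are examples only.
df = pd.read_csv("texas_salaries.csv")

# Instead of df["annual_salary"].hist(), let stemgraphic pick the scale
# and render a graphical stem-and-leaf of a 500-observation sample:
fig, ax = stemgraphic.stem_graphic(df["annual_salary"], display=500)
fig.savefig("salaries_stem.png")
```

The one-line change is the point: same exploratory workflow, far more readable output on a long-tailed column.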

Tune in next episode...

Customers using my software products for data governance, anomaly detection and data quality are already familiar with it, and many other companies, universities and individuals use stemgraphic in one way or another. For everybody else, hopefully this has piqued your interest; you'll master this visualization in no time, and you'll be able to answer the title question affirmatively...

Stemgraphic offers another dozen types of visualizations, including some interactive ones, and it goes beyond numbers, adding support for categorical data and for text (as of version 0.5.x). In the following months I'll talk a bit more about a few of them.


Francois Dion
@f_dion

N.B. This article was originally published on LinkedIn at:

https://www.linkedin.com/pulse/you-smarter-than-fifth-grader-francois-dion/

Thursday, October 20, 2016

Stemgraphic, a new visualization tool

PyData Carolinas 2016

At PyData Carolinas 2016 I presented the talk Stemgraphic: A Stem-and-Leaf Plot for the Age of Big Data.

Intro

The stem-and-leaf plot is one of the most powerful tools not found in a data scientist's or statistician's toolbox. If we go back in time thirty-some years, we find the exact opposite. What happened to the stem-and-leaf plot? Finding the answer led me to design and implement an improved graphical version of the stem-and-leaf plot as a Python package. As a companion to the talk, a printed research paper was provided to the audience (a PDF is now available through artchiv.es).

The talk




Thanks to the organizers of PyData Carolinas, videos of all the talks and tutorials have been posted on YouTube. In just 30 minutes, this is a great way to learn more about stemgraphic and the history of the stem-and-leaf plot for EDA work. This updated version does include the animated intro sequence, but unfortunately the sound was recorded from the microphone, not the mixer. You can see the intro sequence in higher audio and video quality on the main page of the website below.

Stemgraphic.org

I've created a web site for stemgraphic, where I'll be posting tutorials and demos of some of the more advanced features, particularly how stemgraphic can be used in a data science pipeline: as a data-wrangling tool, as an intermediary to big data on HDFS, as a visual validation for building models, and as a superior distribution plot, particularly when faced with non-uniform distributions or distributions showing a high degree of skewness (long tails).

Github Repo

https://github.com/fdion/stemgraphic


Francois Dion
@f_dion
 



Wednesday, October 5, 2016

Something For Your Mind, Polymath Podcast episode 2

A is for Anomaly

In this episode, "A is for Anomaly", our first of the alphabetical episodes, we cover financial fraud, the Roman quaestores, outliers, PDFs and EKGs. Bleep... Bleep... Bleep...
"so perhaps this is not the ideal way of keeping track of 15 individuals..."

Something for your mind is available on



art·chiv.es

/'ärt,kīv/



Francois Dion
@f_dion
P.S. There is a bit more detail on this podcast as a whole on LinkedIn.

Sunday, September 18, 2016

Something for your mind: Polymath Podcast launched


Some episodes
will have more Art content, some will have more Business content, some will have more Science content, and some will be a nice blend of different things. But for sure, the show will live up to its name and provide you with “something for your mind”. It might raise more questions than it answers, and that is fine too.

Episode 000
Listen to Something for your mind on http://Artchiv.es

Francois Dion
@f_dion

Saturday, December 5, 2015

Tensorflow jupyter notebook

In an orbit near you


At the last PYPTUG meeting, I demoed Google's TensorFlow deep learning toolkit. I did that using the Jupyter notebook. If you are not familiar with it, check out try.jupyter.org and you'll be able to play with Python, R, Ruby, Scala, Bash, etc.

To install jupyter on your computer, pip3 is your friend (more detail at http://jupyter.readthedocs.org/en/latest/install.html):

pip3 install jupyter
By installing Jupyter, you'll also get the IPython kernel, so you'll be able to create new Jupyter notebooks for Python. There are over 50 programming languages supported by Jupyter. But that is not all: you can also create specific environments and associate notebooks with them. It works great on pretty much any platform, including the Raspberry Pi, Windows, the Mac, Linux, etc. Each kernel has a varying degree of availability, and the same can be said of Python modules. TensorFlow will not run on the Pi at this time...

Tensorflow notebook


New notebook dropdown in Jupyter 4
What we are going to do here is install TensorFlow in a virtual environment and create a notebook configuration, so we have the choice in the New -> dropdown, as pictured above.

If you've tried to install TensorFlow, particularly on a Mac, you have probably found it challenging, as TensorFlow requires Python 2.7.


Install

On a Mac, since Apple's Python 2.7 is kind of messed up...
brew install python
Create a virtualenv and activate it, then install requirements:
virtualenv -p /usr/local/bin/python2.7 tensor
source tensor/bin/activate

pip install numpy scipy sklearn pandas matplotlib
pip install https://storage.googleapis.com/tensorflow/mac/tensorflow-0.5.0-py2-none-any.whl

Configure Jupyter

I have a globally available Jupyter notebook, on Python 3. This allows me to run various kernels from one notebook, even virtualenvs, as I'll show here. The steps below are for the Mac; on a Linux or Windows machine it'll be similar: use jupyter --paths to find the data directory (the kernels folder will be in there).
(tensor)LTOXFDION:~ francois$ pip install ipython ipykernel
(tensor)LTOXFDION:~ francois$ cd ~/Library/Jupyter/kernels
LTOXFDION:kernels francois$ pwd
/Users/francois/Library/Jupyter/kernels
LTOXFDION:kernels francois$ ls
ir python2
LTOXFDION:kernels francois$ cp -r python2/ tensor
LTOXFDION:kernels francois$ cd tensor/
LTOXFDION:tensor francois$ vi kernel.json

In the editor, you'll have to modify the path to python to point to your virtualenv's directory. If you don't have a python2 directory to copy, just create a tensor directory and create the kernel.json file in it. In the end, your kernel.json file should look something like:

{
 "display_name": "TensorFlow (Py2)",
 "language": "python",
 "argv": [
  "/Users/francois/tensor/bin/python2.7",
  "-c", "from ipykernel.kernelapp import main; main()",
  "-f", "{connection_file}"
 ],
 "codemirror_mode": {
  "version": 2,
  "name": "python"
 }
}
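If you'd rather not hand-edit JSON in vi, the same spec can be generated with a few lines of Python. This is a sketch: the interpreter path must match your own virtualenv, and the makedirs/write step is left commented out so you can review the output first.

```python
# Sketch: generate the kernel.json spec above programmatically.
import json
import os

spec = {
    "display_name": "TensorFlow (Py2)",
    "language": "python",
    "argv": [
        os.path.expanduser("~/tensor/bin/python2.7"),  # your virtualenv's python
        "-c", "from ipykernel.kernelapp import main; main()",
        "-f", "{connection_file}",  # Jupyter substitutes this at launch time
    ],
    "codemirror_mode": {"version": 2, "name": "python"},
}

kernel_dir = os.path.expanduser("~/Library/Jupyter/kernels/tensor")
# os.makedirs(kernel_dir)  # uncomment, then write spec to kernel_dir/kernel.json
print(json.dumps(spec, indent=1))
```

Note that "{connection_file}" is a literal placeholder that Jupyter fills in when it starts the kernel; don't expand it yourself.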

You should be good to go now. Start jupyter:

  jupyter notebook

You'll be able to create a new notebook for TensorFlow. From there, all you have to do is import it:

    import tensorflow as tf
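As a quick smoke test in your new notebook, the graph-and-session style of that era's TensorFlow (0.x) looks like this; if it runs, your kernel is wired to the right virtualenv:

```python
import tensorflow as tf

# Build a tiny graph and run it in a session (TensorFlow 0.x API).
hello = tf.constant('Hello, TensorFlow!')
a = tf.constant(10)
b = tf.constant(32)

sess = tf.Session()
print(sess.run(hello))
print(sess.run(a + b))  # 42
```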

We'll continue on this next time.

Francois Dion
@f_dion