Some data projects

This page has some for-fun projects that I've worked on.


Roy Keyes

Radiation dose estimation via Machine Learning

In 2016 I started a side project aimed at estimating radiation doses for cancer therapy using machine learning algorithms instead of physics-based Monte Carlo simulations.

You can watch the talk I gave about this on 14 July 2017 at SciPy 2017 in Austin, TX here and check out the slides here [PDF version].

Proton dose



Perceptron

Intro to Deep Learning

Over the span of a few months in 2016 and 2017 I gave some introductory talks on neural networks and deep learning for the Houston Data Science Meetup group. The slides are linked below.



slots - a multi-armed bandit library in Python

slots is a Python library that lets you explore and use several strategies for the multi-armed bandit problem. slots is available for installation from PyPI via "pip install slots".

You can read about slots and the multi-armed bandit problem in my blog post here and check it out on GitHub here.
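To give a flavor of the problem, here is a minimal sketch of one common bandit strategy, epsilon-greedy, in plain Python. It does not use the slots API; the function name, the payout probabilities, and the parameter defaults are all assumptions made for illustration.

```python
import random


def epsilon_greedy(true_probs, n_trials=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy player on Bernoulli-reward arms.

    true_probs: hypothetical payout probability of each arm (hidden from the player).
    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms

    for _ in range(n_trials):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]

    return pulls, estimates


if __name__ == "__main__":
    pulls, estimates = epsilon_greedy([0.1, 0.3, 0.8])
    print("pulls per arm:", pulls)
    print("estimated payouts:", [round(e, 3) for e in estimates])
```

The player spends a fraction epsilon of its pulls exploring and the rest on whichever arm currently looks best, which is the basic explore/exploit trade-off behind the multi-armed bandit problem.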

slots



Klackers

Klackers strategy

Klackers (a.k.a. Shut the Box) is a dice game, often played in bars. The Klackers box has nine "tiles" numbered 1-9. A player rolls two dice, then flips down tiles that sum to the value of their roll. The player continues to roll the dice and flip tiles until they are no longer able to find a combination of tiles that sum to the dice or they have flipped all of the tiles. The player's score is the sum of the un-flipped tiles.

To determine the best simple strategy for Klackers, I ran a series of Monte Carlo simulations written in Python. The code is found here on GitHub.
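A Monte Carlo comparison of this kind can be sketched roughly as follows. This is not the code from the repository: the two strategies ("fewest tiles" and "most tiles"), the function names, and the number of games are assumptions for illustration, and the rules are only those described above (always two dice).

```python
import random
from itertools import combinations


def options(open_tiles, roll):
    """All subsets of the open tiles whose values sum to the roll."""
    return [
        combo
        for size in range(1, len(open_tiles) + 1)
        for combo in combinations(sorted(open_tiles), size)
        if sum(combo) == roll
    ]


def fewest_tiles(choices):
    """Hypothetical strategy: flip as few tiles as possible."""
    return min(choices, key=len)


def most_tiles(choices):
    """Hypothetical strategy: flip as many tiles as possible."""
    return max(choices, key=len)


def play(strategy, rng):
    """Play one game and return the score (sum of the un-flipped tiles)."""
    open_tiles = set(range(1, 10))
    while open_tiles:
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        choices = options(open_tiles, roll)
        if not choices:
            break  # no combination matches the roll: the game is over
        open_tiles -= set(strategy(choices))
    return sum(open_tiles)


def mean_score(strategy, n_games=20000, seed=0):
    """Average score over many simulated games (lower is better)."""
    rng = random.Random(seed)
    return sum(play(strategy, rng) for _ in range(n_games)) / n_games


if __name__ == "__main__":
    for name, strategy in [("fewest tiles", fewest_tiles), ("most tiles", most_tiles)]:
        print(f"{name}: mean score {mean_score(strategy):.2f}")
```

Because a strategy is just a function that picks one of the legal tile combinations, other simple rules can be dropped in and compared the same way.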



Chutes and Ladders via Markov chains in D3

Markov chains can be used to model probabilistic processes, such as financial markets or, in this case, children's games. This project is a visualization of the Markov chain model described by Nick Barry in a popular post contrasting the Monte Carlo and Markov chain methods.

For this project I created the Markov chain simulation with Python and NumPy and created the visualization with the D3.js JavaScript library. Because the visualization is rendered as Scalable Vector Graphics (SVG), it may not appear correctly in older browsers. Chrome, Firefox, and Safari should work. The code is found here on GitHub.
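The Markov chain idea can be sketched on a toy board. Everything specific here is an assumption for illustration: a 10-square board, a spinner that shows 1 to 3, one ladder, one chute, and a rule that overshooting the last square means staying put. The sketch uses plain Python lists rather than NumPy so that it runs without extra packages.

```python
# Hypothetical toy board: squares 0 (start) to 9 (finish).
N_SQUARES = 10
SPINNER_MAX = 3
JUMPS = {2: 6, 8: 3}  # one ladder (2 -> 6) and one chute (8 -> 3)


def build_transition_matrix(n_squares, spinner_max, jumps):
    """Row i holds the probabilities of moving from square i to each square."""
    last = n_squares - 1
    matrix = [[0.0] * n_squares for _ in range(n_squares)]
    for start in range(last):
        for spin in range(1, spinner_max + 1):
            target = start + spin
            if target > last:
                target = start  # overshoot: stay put (assumed rule)
            else:
                target = jumps.get(target, target)  # follow a chute or ladder
            matrix[start][target] += 1.0 / spinner_max
    matrix[last][last] = 1.0  # the final square is absorbing
    return matrix


def step(state, matrix):
    """Advance the probability distribution over squares by one turn."""
    n = len(state)
    return [sum(state[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def finish_probability_by_turn(matrix, n_turns):
    """Probability of having reached the last square after each turn."""
    state = [0.0] * len(matrix)
    state[0] = 1.0  # the player starts on square 0 with certainty
    finished = []
    for _ in range(n_turns):
        state = step(state, matrix)
        finished.append(state[-1])
    return finished


if __name__ == "__main__":
    matrix = build_transition_matrix(N_SQUARES, SPINNER_MAX, JUMPS)
    for turn, p in enumerate(finish_probability_by_turn(matrix, 20), start=1):
        print(f"turn {turn:2d}: P(finished) = {p:.3f}")
```

Unlike a Monte Carlo simulation, nothing here is random: repeatedly multiplying the state vector by the transition matrix gives the exact distribution over squares after each turn.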

Chutes and Ladders via Markov chains



ABQ Bikeability

How bikeable is Albuquerque?

As someone who enjoys getting around town by bicycle, I thought it would be interesting to try to quantify how "bikeable" different parts of Albuquerque are. Using data made available by the city of Albuquerque, Samat Jain and I put together this "bikeability" map as part of the 2013 ABQ Hack Day.

Using Python and pandas, we converted XML files to JSON, extracted the values of interest, and calculated a score based on the presence and type of biking infrastructure. The map was created with Leaflet.js via Folium, OpenStreetMap data, and Jinja2. The code is found here on GitHub.
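The general shape of that pipeline can be sketched as below. The XML layout, the infrastructure types, the weights, and the area names are all hypothetical; the city's actual data format and our actual scoring formula are not reproduced here. The sketch uses only the standard library rather than pandas.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical weights: more protected infrastructure counts for more.
WEIGHTS = {"trail": 3.0, "lane": 2.0, "route": 1.0}

# Hypothetical input; the real city data has a different structure.
SAMPLE_XML = """
<bikeways>
  <segment area="Area A" type="lane" miles="1.5"/>
  <segment area="Area A" type="trail" miles="0.5"/>
  <segment area="Area B" type="route" miles="2.0"/>
  <segment area="Area B" type="unknown" miles="1.0"/>
</bikeways>
"""


def xml_to_records(xml_text):
    """Extract the values of interest from each segment as plain dicts."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "area": seg.get("area"),
            "type": seg.get("type"),
            "miles": float(seg.get("miles")),
        }
        for seg in root.iter("segment")
    ]


def score_areas(records, weights=WEIGHTS):
    """Sum weight * length per area; unrecognized types contribute nothing."""
    scores = {}
    for rec in records:
        weight = weights.get(rec["type"], 0.0)
        scores[rec["area"]] = scores.get(rec["area"], 0.0) + weight * rec["miles"]
    return scores


if __name__ == "__main__":
    records = xml_to_records(SAMPLE_XML)
    print(json.dumps(records, indent=2))  # the XML-to-JSON step
    print(score_areas(records))
```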



A talk about data science

In 2013 I gave a talk with Steve Koch at the ABQ Tech Fiesta titled "Data Science, Big Data, and other buzzwords". We decided to make a more generic version (PDF) of that talk and open source it. This talk was aimed at a general technical audience and discussed big data, its history, and data science and its components, including data munging, statistics, machine learning, and visualizations. Hopefully others will find the slides and graphics useful for their purposes.

These slides were made using the Beamer package for LaTeX. Our original graphics were created as SVGs in Inkscape. The presentation is primarily licensed under the CC BY-SA 4.0 terms (see the License.txt file for full details). The source code is found here on GitHub.

Data science talk: normal distributions





A look at UNM grad student salaries in 2011

As a grad student I often wondered how salaries varied across and within departments. I was able to take UNM's public salary data and do a little data crunching to come up with some comparisons.

This project was 95+% data munging/cleaning. The data went from paper → PDF → OCR → semi-structured text. Tools used included: Python, pdftk, IPython, OpenRefine (a.k.a. Google Refine), pandas, matplotlib, NumPy, and Inkscape.



Medical physics articles on arXiv.org

Medical physics has been slow to join the open access movement, but it's catching on. In this project I took a look at the papers submitted to the medical physics category (physics.med-ph) on arXiv.org.

To get the data I used the arXiv's public API. Data munging, analysis, and visualization were done with Python, IPython, feedparser, fuzzywuzzy, NumPy, SciPy, matplotlib, NetworkX, and Inkscape.
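As a rough sketch of the data-gathering step: the arXiv API returns an Atom feed, so a query for a category comes down to building a URL and parsing the entries. The function names and the sample feed below are made up for illustration, and the sketch parses the feed with the standard library instead of feedparser. The network fetch itself is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace prefix

# Made-up feed in the Atom layout, standing in for a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A placeholder
      title</title>
    <published>2013-01-01T00:00:00Z</published>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
  </entry>
</feed>"""


def build_query_url(category="physics.med-ph", start=0, max_results=10):
    """Build a paged query URL for all papers in one arXiv category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entries(atom_text):
    """Pull title, publication date, and author names out of each entry."""
    root = ET.fromstring(atom_text)
    entries = []
    for entry in root.iter(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="")
        entries.append({
            "title": " ".join(title.split()),  # collapse line breaks in titles
            "published": entry.findtext(f"{ATOM}published", default=""),
            "authors": [
                a.findtext(f"{ATOM}name", default="")
                for a in entry.findall(f"{ATOM}author")
            ],
        })
    return entries


if __name__ == "__main__":
    print(build_query_url(max_results=5))
    print(parse_entries(SAMPLE_FEED))
```

Author lists extracted this way are the kind of raw material that fuzzy name matching and co-author network analysis start from.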

Medical physics papers on arXiv.org.



Sparkmeters

Sparkmeters

Not strictly a data project, but rather an information design project, sparkmeters are small graphics set inline with text that show the relative value of the item they follow. They are inspired by and reminiscent of sparklines.

Sparkmeters were implemented with the D3.js JavaScript library. Because they are rendered as Scalable Vector Graphics (SVG), they may not appear correctly in older browsers. Chrome, Firefox, and Safari should work. The code is found here on GitHub.


