Developed during SB Hacks 2018 by Justin Tennant and Travis Cramer, Dance Hack utilizes the OpenPose computer vision library for estimating human body poses. We calculate how many people in view of the webcam are dancing, and make appropriate API calls to Spotify to change the song or volume.
A demonstration of the body pose estimation running in real time.
One of the many tests performed on our system. In this instance, Travis is testing how the movement of his arms controls the volume of the music.
Dance Hack runs a multi-threaded OpenPose application in which one thread handles the webcam capture and pose estimation calculations. A concurrent worker thread asynchronously queries for processed frames, acquires the image data and body pose estimation data, and displays the visuals in the GUI. In that same worker thread, we also calculate our level of "dancing", as described below.
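The thread layout is roughly the producer/consumer pattern sketched below. This is not the actual Dance Hack code (the real application builds on OpenPose's own pipeline); the frame grabber, pose estimator, GUI renderer, and dance-scoring functions are hypothetical callables supplied by the caller.

```python
import queue
import threading

# Processed frames handed from the pose-estimation thread to the worker thread.
frame_queue: queue.Queue = queue.Queue(maxsize=4)

def pose_thread(grab_frame, estimate_poses):
    """Producer: read webcam frames and run pose estimation on each one."""
    while True:
        frame = grab_frame()
        frame_queue.put((frame, estimate_poses(frame)))

def worker_thread(render_gui, update_dance_level):
    """Consumer: asynchronously pull processed frames, display them in the
    GUI, and feed the body pose data into the "dancing" calculation."""
    while True:
        frame, poses = frame_queue.get()
        render_gui(frame, poses)
        update_dance_level(poses)

def start(grab_frame, estimate_poses, render_gui, update_dance_level):
    """Launch both threads; all four callables are supplied by the caller."""
    threading.Thread(target=pose_thread,
                     args=(grab_frame, estimate_poses), daemon=True).start()
    threading.Thread(target=worker_thread,
                     args=(render_gui, update_dance_level), daemon=True).start()
```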
For a given body in the scene, we store the two-dimensional coordinates of the person's hands (we figured tracking the arms was the simplest and most efficient way to determine whether someone was "dancing"), relative to the webcam frame's dimensions, in a queue that holds ten frames at a time. After every ten frames, we calculate the standard deviation of each hand's vertical and horizontal movement over those frames and classify the result as "dancing" or "not dancing". Depending on the result, we adjust the volume or song, as described below.
NOTE: At the time of Dance Hack's initial development (and of this webpage, January 2018), OpenPose did not support body 'tracking'; that is, following specific individual bodies in a scene over time. Therefore, our "level of dancing" calculations only used the first person found in the scene, usually whoever was nearest to the top-left corner of the frame.
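As a rough illustration of that windowed calculation, here is a minimal Python sketch. It assumes the hand coordinates arrive in pixels for a single (first-detected) person, and the DANCE_THRESHOLD value is a made-up tuning constant, not the one Dance Hack actually uses.

```python
from collections import deque
import statistics

WINDOW = 10  # frames kept per measurement window, as described above

# Hypothetical cutoff (pixels of average standard deviation) separating
# "dancing" from "not dancing"; the real value is a tuning parameter.
DANCE_THRESHOLD = 15.0

# One series per tracked coordinate: left/right hand, horizontal/vertical.
hand_history = {key: deque(maxlen=WINDOW) for key in ("lx", "ly", "rx", "ry")}

def record_hands(left_xy, right_xy):
    """Store the (x, y) pixel coordinates of both hands for the current frame."""
    hand_history["lx"].append(left_xy[0])
    hand_history["ly"].append(left_xy[1])
    hand_history["rx"].append(right_xy[0])
    hand_history["ry"].append(right_xy[1])

def classify_window():
    """Once ten frames have accumulated, compute the standard deviation of
    each coordinate series and decide whether the person is dancing."""
    if len(hand_history["lx"]) < WINDOW:
        return None  # not enough frames yet
    stds = [statistics.pstdev(hand_history[k]) for k in ("lx", "ly", "rx", "ry")]
    avg_std = sum(stds) / len(stds)
    return avg_std, avg_std >= DANCE_THRESHOLD
```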
The next step takes the standard deviation calculations from the first step, converts them into a volume level, and sends a request to the Spotify API to change the volume of the currently playing track. This is where the magic happens, and there is a lot of room for tweaking. After experimenting with several methods, we ended up taking the average of the four standard deviations we calculated (left hand vertical, left hand horizontal, right hand vertical, right hand horizontal), then normalizing that value into the range 35 to 100 so that it translates directly into a volume percentage (we decided the volume shouldn't go below 35%, and we let the maximum be 100%).
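On the Spotify side, the relevant call is the Web API's Set Playback Volume endpoint (a PUT to /v1/me/player/volume). Here is a minimal sketch using the requests library, assuming you already hold an OAuth access token with the user-modify-playback-state scope (token acquisition is omitted):

```python
import requests

def set_spotify_volume(volume_percent: int, access_token: str) -> None:
    """Set the playback volume on the user's currently active Spotify device."""
    response = requests.put(
        "https://api.spotify.com/v1/me/player/volume",
        params={"volume_percent": max(0, min(100, int(volume_percent)))},
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=5,
    )
    response.raise_for_status()
```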
To do this, we also had to decide on a range of possible average standard deviation values. We set the minimum to 0, which was an easy choice. The maximum was harder to come up with: for what average standard deviation value (and above) should the program set the volume to 100%? We ended up choosing 50 pixels of average standard deviation (or more) to translate to 100% volume. This number, however, depends on the resolution of your webcam's video input (which we intentionally lowered in order to free up processing power). This is a decision that can definitely be tweaked, so please feel free to play around with it yourself by cloning our repo here on GitHub!
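With those choices (0 to 50 pixels of average standard deviation mapped linearly onto 35% to 100% volume), the conversion boils down to a small clamp-and-scale, sketched here with the constants exposed as tunable knobs:

```python
MIN_VOLUME = 35      # volume floor, in percent
MAX_VOLUME = 100     # volume ceiling, in percent
MAX_AVG_STD = 50.0   # average standard deviation (pixels) that saturates at 100%

def std_to_volume(avg_std: float) -> int:
    """Map the average of the four hand standard deviations to a volume percent.

    Anything at or above MAX_AVG_STD pixels plays at full volume; the result
    never drops below MIN_VOLUME. MAX_AVG_STD depends on the webcam resolution.
    """
    clamped = max(0.0, min(avg_std, MAX_AVG_STD))
    return round(MIN_VOLUME + (MAX_VOLUME - MIN_VOLUME) * clamped / MAX_AVG_STD)
```

In a full loop, the value returned here would feed something like the set_spotify_volume sketch above, while the dancing / not-dancing classification drives the song changes.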