Dataset

As one of the resident data scientists of my rowing club, I decided to play a little with the data generated by boat booking over the year 2019. Data was collected from a shared online Google Sheets file, where members wrote down their names for the boat they intended to use and the date/time where they were rowing. Since this analysis was done, stricter logging policies were implemented (standardizing names used by members by forcing the choice to be made from a drop down list, and enforcing full list of crew names for crew boats), so next year’s of data analysis should be less painful! Of course, this dataset doesn’t take into account those with private single boats - they appear on the sheet only when they row in crew boats.

This is the first time I’ve played with data with some “natural language” properties outside of school since my work consists mainly of IoT time series data. To nobody’s surpise this means loads of preprocessing and cleaning. A quick look at data from the first day of the year:

     Day           Time        #09-Single  #11-Single  #14-Single  #15-Single  Suu Balm-Single  Birra Viola-Single  4Fingers-Single  ERC blue#1  ERC blue#2  SRA1-Double  SRA2-Double  Brotzeit-Double  Osteria-Double  Pair  SRA7-Pair  #01-Four  #02-Four  Nelly-Quad  Aisyah-Quad  Eights A    Datetime
0  Tue, 1/1/19  6.30-8.00        NaN         NaN         NaN         NaN      Yu Jia (21k)         NaN              NaN              NaN         NaN          NaN          NaN          NaN             NaN             NaN  Waiting      NaN       NaN       NaN         NaN          NaN       1/1/19 6.30
1  Tue, 1/1/19  8.00-9.30        NaN         NaN         NaN         NaN         NaN              NaN              NaN              NaN         NaN          NaN          NaN          NaN             NaN             NaN   NaN          NaN       NaN       NaN         NaN          NaN       1/1/19 8.00
2  Tue, 1/1/19  9.30-11.00       NaN         NaN         NaN         NaN         NaN              NaN              NaN              NaN         NaN          NaN          NaN          NaN             NaN             NaN   NaN          NaN       NaN       NaN         NaN          NaN       1/1/19 9.30
3  Tue, 1/1/19  16.00-17.30      NaN         NaN         NaN         NaN         NaN              NaN              NaN              NaN         NaN          NaN          NaN          NaN          Olga/Jason         NaN   NaN          NaN       NaN       NaN         NaN          NaN       1/1/19 16.00

Yikes. The first thing I did was to standardize the date and time columns and clean up the column names:

Datetime#09 - Single LW#11 - Single MW#14 - Single MW#15 - Single HWSuu Balm - Single LWBirra Viola - Single HW4 Fingers - Single MWERC blue 1 - Training SingleERC blue 2 - Training SingleSRA 1 - Double HWSRA 2 - Double LWBrotzeit - Double MWOsteria - Double MWN/A - Pair HW;SRA 7 - Pair#01 - Four HW#02 - FourNelly - QuadAisyah - Quad HWEights SRA
2019-01-01 06:30:00nannannannanYu Jia (21k)nannannannannannannannannannannannannannannan
2019-01-01 08:00:00nannannannannannannannannannannannannannannannannannannannan
2019-01-01 09:30:00nannannannannannannannannannannannannannannannannannannannan
2019-01-01 16:00:00nannannannannannannannannannannannanOlga/Jasonnannannannannannannan

Now we can get to work!

Analysis

Counting the number of non-null entries per boat seemed like the most obvious and helpful thing to do - this allows the club to know which boats are most used and in highest demand.

Bokeh Plot

A clear preference for singles and doubles - understandable due to our small club sizes! Some of the needs surfaced from this analysis confirmed some of the felt needs of the club and translated to new boats bought over the year (the ERC blues, towards the end of the year).

I then regrouped the data and did the same plot by weight class and boat type.Bokeh Plot

Bokeh Plot

At first glance it might seem that singles are the most popular boat, but remember that we should in fact correct for the number of people in the boats:Bokeh Plot

Even then, this isn’t exactly fair as we have more singles and doubles than fours and quads in the club.

When do our rowers get on the water? Doing the same count but by time slot of the week:

Bokeh Plot

As expected, weekends are when we get the most rowers down. There’s also a clear preference for morning rowing rather than afternoon rowing.

Rower statistics

I was tasked with finding the most frequent rowers of the club who are not already in the club committee and who might be susceptible to take up some club responsibility. This was difficult because members used different names when booking, for example, I personally used “Yu Jia” and “YJ” interchangeably. To confound matters, crew boats were booked with a single entry, depending on style it could be “A + B”, “A & B”, “A / B”, or quite simply “A B”! There was also no guarantee that all members of the crew boat would be listed - the ladies often booked boats with “Ladies”, and similarly in other cases only two or three names out of four would be present. Luckily our club boasts only about 70 members, with some 50 active rowers - I created a standardized list of users and did some search and replace, dropping spaces, symbols, cases… not the most exact analysis, but with the standardized name list next year, things should be easier!

Anonymizing slightly the names, we have:

Bokeh Plot

Unfortunately, our most frequent rowers were all in our under 18 junior team, or already have served / are serving in the committee, which means no rest for the current committee yet…

As a side note, I had personally counted 82 rows in 2019. Barring two days of regatta and an outing in France, the above count is still missing a good number of rows! This is probably due to the couple of weeks where fumigation was being carried out at the reservoir and rowing times were shifted, resulting in the boat usage data being temporarily logged on a separate Google Sheet.

Time to give myself a pat on the back for being one of the most frequent rowers in the club, and retire my analysis notebook till the end of 2020…