@SydStats at Open Data Day

This post is a summary of the pitch I gave at the Transport for NSW Open Data Day 2019, introducing my @SydStats project.

Aim of @SydStats

Our aim is to simply package up and deliver network statistics that are:

  • Human readable
  • Timely (can be accessed while or soon after a commuter has used the network)
  • Relevant (tell a story the commuter cares about)
  • Big picture (indicate how the network is functioning as a whole)

We are doing this to encourage commuters to discuss the state of the network, and to give them hard data to support their complaints and feedback, rather than relying only on personal anecdotes and experiences.

Process and Technology

The process is simple. We collect data from the GTFS real-time and schedule datasets through the TfNSW Open Data API. We then analyse this data to calculate delay times across the network during certain periods of the day. We translate those findings into palatable data representations that everyday commuters can engage with and understand, and then we deliver this information to commuters.
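As a rough illustration of the analysis step, the sketch below compares scheduled and actual arrival times and keeps the largest delay seen for each trip. It is not the project's actual code: the record layout (`trip_id`, `scheduled`, `actual`), the function names, and the sample values are all assumptions made up for this example, standing in for data already matched from the schedule and real-time feeds.

```python
from datetime import datetime


def delay_seconds(scheduled: datetime, actual: datetime) -> int:
    """Return how late an arrival was, in whole seconds (0 if early or on time)."""
    return max(0, int((actual - scheduled).total_seconds()))


def worst_delay_by_trip(stop_events):
    """Map each trip_id to the largest delay it recorded at any stop."""
    worst = {}
    for event in stop_events:
        late = delay_seconds(event["scheduled"], event["actual"])
        worst[event["trip_id"]] = max(worst.get(event["trip_id"], 0), late)
    return worst


# Invented sample records (hypothetical layout, not real TfNSW data).
sample_events = [
    {"trip_id": "T1", "scheduled": datetime(2019, 3, 1, 8, 0), "actual": datetime(2019, 3, 1, 8, 0)},
    {"trip_id": "T1", "scheduled": datetime(2019, 3, 1, 8, 10), "actual": datetime(2019, 3, 1, 8, 13)},
    {"trip_id": "T2", "scheduled": datetime(2019, 3, 1, 8, 5), "actual": datetime(2019, 3, 1, 8, 4)},
    {"trip_id": "T3", "scheduled": datetime(2019, 3, 1, 8, 20), "actual": datetime(2019, 3, 1, 8, 27, 30)},
]

trip_delays = worst_delay_by_trip(sample_events)
```

Taking the largest delay per trip is only one possible choice; measuring at the final stop, or averaging across stops, would give different figures.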

Live Project

The first iteration of these aims is live as a Twitter account, as alluded to by the handle in the title of this post. In November last year we started on a very simple tweetbot that runs the analysis, and distils this down into two tweets for each peak period each day:

  1. Percentage of services that experienced delays
  2. Worst delay in that time period
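The two statistics above could be derived along the following lines. This is a minimal sketch under stated assumptions: the input is a hypothetical mapping of trip ID to delay in seconds, any delay above zero counts as "delayed", and the tweet wording is invented for illustration rather than copied from the bot.

```python
def summarise_peak(delays_by_trip, period_label):
    """Build two short status strings for one peak period.

    delays_by_trip: dict of trip_id -> delay in seconds (hypothetical layout).
    """
    if not delays_by_trip:
        raise ValueError("no services recorded for this period")

    # Assumption: a service counts as delayed if it was late at all.
    delayed = sum(1 for d in delays_by_trip.values() if d > 0)
    percent_delayed = round(100 * delayed / len(delays_by_trip))

    # The trip with the largest delay, reported in whole minutes.
    worst_trip = max(delays_by_trip, key=delays_by_trip.get)
    worst_minutes = delays_by_trip[worst_trip] // 60

    return (
        f"{percent_delayed}% of services were delayed during the {period_label} peak.",
        f"Worst delay in the {period_label} peak: {worst_minutes} min ({worst_trip}).",
    )


# Invented sample values, not real network data.
tweets = summarise_peak({"T1": 0, "T2": 120, "T3": 900, "T4": 0}, "morning")
```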

The bot started tweeting in December last year. With a simple concept and a low-key start with no promotion, we’re making about 1,000 impressions per day, and it has already generated some interesting discussion:

On the morning of the 19th of February, we reported 71% of services delayed, a figure one user quoted. Trains Info countered that 84.8% of services were on time. That figure is based on the TfNSW benchmark, under which a service only counts as ‘delayed’ if it reaches its last station more than 5 minutes late.
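The gap between the two figures comes down largely to the definition of "delayed". The sketch below shows how the same set of delays produces very different percentages depending on the threshold. The sample values are invented, and the zero-threshold rule is used only for contrast; this post does not state the exact rule @SydStats applies.

```python
def percent_delayed(delays_seconds, threshold_seconds=0):
    """Percentage of services whose delay exceeds the given threshold."""
    if not delays_seconds:
        raise ValueError("no services to summarise")
    late = sum(1 for d in delays_seconds if d > threshold_seconds)
    return 100 * late / len(delays_seconds)


# Invented delays in seconds, one per service (not real network data).
sample_delays = [0, 60, 120, 180, 240, 400, 30, 0, 90, 600]

any_lateness = percent_delayed(sample_delays)                   # any delay counts
over_five_minutes = percent_delayed(sample_delays, 5 * 60)      # 5-minute benchmark
```

With these made-up numbers, eight of ten services are late by some amount, but only two exceed the 5-minute benchmark, so both a high "delayed" figure and a high "on time" figure can describe the same morning.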

Roadmap

There are improvements to be made to our process, both to gather more insights and to deliver them in an appropriate tone. These insights may include:

  • Number of delays that were longer than X minutes
  • Running analysis per station or per line
  • Publishing weekly or monthly summaries of delays and issues

There are also a few bugs which can lead to delays being underreported at this stage.

While the project is open source on GitHub, once it becomes a bit more stable we would like to publicise this fact by linking to the project from Twitter. This way, the statistics are verifiable.

Finally, as the project matures and is delivering insightful, verified, and readable statements, we can promote the tweets more heavily, or look to other platforms to get the message out.

Conclusion

We want to generate more discussion and make commuters aware of the state of the infrastructure they are paying for. Complaints and feedback are much more effective when they come from data rather than anecdotes and personal experience. Sydneysiders are always demanding better services, and this will help their voices be heard.

