Spark: more computing capacity for big data productions

07/02/2017 16:00 / Author(s): Masja de Ree

Spark, the big-data processing platform, enables researchers and statisticians to process large amounts of data at high speed. After successful experiments with this new cluster computing engine over the past twelve months, Statistics Netherlands (CBS) has decided to expand the system.

Spark the definite choice

CBS is working with increasingly large datasets, and this calls for more computing capacity. The Spark cluster computing framework offers a solution in the form of a software layer that enables multiple computers to perform calculations on the same task simultaneously. Combined with high-speed computers and storage, this makes it possible to perform accurate calculations on large amounts of data. About a year ago, CBS acquired a small Spark cluster. A second cluster, four times as big, was added recently. CBS process development and methodology manager Winfried Ypma explains: ‘This means our final decision is to go with Spark. As our research over the past year has shown, this cluster engine provides the support we need to keep up with our big data developments. We have conducted a number of successful proofs of concept. Besides, being open-source software, Spark is relatively inexpensive.’
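
The idea of several workers calculating on the same task at once can be illustrated with a minimal sketch. It is not Spark code and not the CBS implementation: it uses only the Python standard library, threads stand in for the computers in a cluster, and the names `partition` and `parallel_sum` are hypothetical. The data is split into chunks, each chunk is summed by a separate worker, and the partial results are then combined.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, rest = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `rest` chunks take one extra element each.
        end = start + size + (1 if i < rest else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, n_workers=4):
    """Sum each chunk in its own worker, then combine the partial sums."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


# Illustrative input only: the integers 1 to 1000.
readings = list(range(1, 1001))
total = parallel_sum(readings)
print(total)
```

A cluster framework applies the same split, compute and combine pattern, but distributes the chunks over separate machines and also handles scheduling and failures, which this sketch does not attempt.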

Operational and security tests

Using Spark in the production of new statistics requires a new way of thinking and working as well as a redesign of the statistical process. CBS has been looking at ways to re-organise the statistical process for some time now. For example, Spark is used in the production of CBS statistics based on traffic loop data: millions of sensor readings released every hour by Rijkswaterstaat (Directorate-General of Public Works and Water Management). ‘But a Spark cluster would be equally useful in the production of income statistics’, says John van Rooijen, who heads the technical management of the ICT infrastructure. ‘We are now looking at the efforts required and the potential benefits of a suitable production process using this cluster. We are also reviewing the operational resources and security conditions we would need to make production as safe and as stable as possible. So far, the results are reassuring.’

Pressure cooker of ideas

CBS works on big data in close cooperation with the University of Twente, which is also running a Spark cluster. Ypma: ‘Every year, we jointly organise a DataCamp to generate concrete ideas on how to use big data. DataCamps are pretty intensive, pressure-cooker-like sessions. Our Spark equipment allows us to further develop the concepts created there, right here at CBS, which is a major advantage.’

Starting point achieved

Various CBS departments are working together on the development of the calculation program in one single project team: ICT staff, methodologists and statisticians. Van Rooijen: ‘They each work from a different perspective. For ICT, system stability and security are important; methodologists want to be free to try out new ideas, while statisticians are mainly focused on faster production.’ Ypma: ‘Together, this project team generates a great deal of energy. Having contributions from different departments ensures commitment, support and quality. We have now achieved a starting point that offers great opportunities for the Centre for Big Data Statistics, which was set up last September.’