Big data innovation at CBS/UT DataCamp
What kinds of big data are interesting to both Statistics Netherlands (CBS) and the University of Twente (UT)? How fast does one arrive at conclusions from vast volumes of data? Which knowledge, skills and programs are needed by staff in order to get started on big data? From 6 to 9 December 2016, ten CBS researchers explored such questions together with a number of PhD students and postdocs from UT and SIKS, the Dutch Research School for Information and Knowledge Systems, during an intensive and challenging CBS/UT DataCamp.
Complex issues
The aim of the annual DataCamp event is to find solutions for complex issues by using big data. Among the research topics this time were the use of big data in creating Sustainable Development Goal indicators; the use of smart city data to tackle environmental and mobility issues; which useful information might alter the position of Google search results; and whether digital traces left by tourists on the internet might be useful for tourism statistics.
Latest programming languages
Among the DataCamp participants representing CBS was Jurriaan Biesheuvel. Before joining the statistical bureau in August 2015, he graduated in experimental physics and completed his PhD at the University of Amsterdam. As a statistical researcher, he is currently involved in business cycle statistics; in addition, he is working on the introduction of a new seasonal adjustment program as well as on harmonisation of statistics. He is furthermore part of a data group within his directorate, learning the latest programming languages and familiarising himself with other facets of the profession.
Analysing DDoS attacks
Biesheuvel summarises his experience during the DataCamp in a few keywords: intensive, challenging, innovative. Much was new and unfamiliar to him, for example working with Spark (a computer system researchers can use to analyse huge volumes of data, ed.). ‘Aside from that, together with two other researchers, I worked on mapping all DDoS (cyber) attacks in the period 2010-2015 based on Twitter feeds and newspaper articles. Using Google Trends, we were able to retrieve the number and the timing of searches related to this term. The bulk of our search results corresponded with the dates on which these DDoS attacks took place.’
Tourism statistics
Shirley Ortega graduated in engineering technology with a PhD from Maastricht University. Since 2013, she has worked on statistics on culture, tourism and technology as well as being in charge of Caribbean Netherlands statistics at CBS. She was furthermore involved in the recent launch of an innovative CBS project to map the Dutch internet economy in collaboration with Google and Dataprovider. In her discipline, she had come across big data on various occasions, for example last month during a European big data training programme organised by CBS researchers. Ortega is also familiar with innovative methodology in the compilation of tourism statistics. ‘We’ve conducted a bot test at CBS to map companies operating in the tourism sector.’
Social media
Just like Biesheuvel, Ortega picked up a great deal of knowledge at the DataCamp. Her assignment was to find out, based on social media usage, whether a link exists between the location of tourists on the Dutch island of Texel and their behaviour. ‘After making a selection based on a number of criteria, we obtained two different data sources, namely 3,000 Instagram posts and 12,000 Twitter feeds. We found out – aided by the Geographical Information System (GIS) – that Instagrammers mainly like the coast while Twitter users are the ones who like to visit the villages on Texel.’
Tweets
One of the organisers was Djoerd Hiemstra of UT, who was in charge of arrangements alongside Barteld Braaksma and Piet Daas of CBS. Hiemstra is very enthusiastic about what the participants came up with. ‘The event went really well. We were off to a solid start. The first two days featured all kinds of presentations. Two statistical researchers from Costa Rica held a presentation on the issues they are facing in their work, mapping protests and rallies based on Twitter posts; professor Arjen de Vries of the University of Nijmegen introduced Spark; UT’s Hamed Mehdipoor talked about the follow-up research on the results discovered during the previous DataCamp, which he is conducting together with Maaike Hersevoort of CBS, after they paired observations of the blooming period of the wood anemone with detailed temperature data from the Royal Netherlands Meteorological Institute (KNMI).’ CBS big data specialists Piet Daas and Marco Puts held a presentation on the use of big data for official statistics and provided many examples. After this, the participants were good to go with support from both UT and CBS staff.
Scientific articles
Hiemstra is proud of the fact that several important findings from the 2015 DataCamp resulted in further research and the writing of scientific articles. He anticipates a similar outcome for the second DataCamp in 2016. ‘Also, the DataCamp has been such a success that we have already confirmed the next DataCamp event will be held toward the end of 2017.’ Hiemstra represented UT during the drafting of a letter of intent which was signed by CBS director Astrid Boeijen and UT dean prof. Peter Apers as the DataCamp drew to an end. ‘This declaration of intent is a good opportunity and great promotion! Obviously, the collaboration between researchers was already there, as they were able to find each other directly. It’s a very important key to success!’
Datahouse
Astrid Boeijen is director of data collection and manages the big data portfolio at CBS. She refers to the signing of the letter of intent as ‘an excellent step forward’. ‘It gives an extra boost to the already existing collaboration between CBS and UT, and also stimulates the development of the Center for Big Data Statistics (CBDS) which was launched in September 2016. CBDS currently has 40 external collaborative partners ranging from national and international universities to large companies and statistical institutes from around the world. By launching CBDS, CBS has acquired a pioneering role in the area of big data. This is a logical step, given our position as the Netherlands’ largest data house, our unique expertise in the area of big data and many decades of experience in privacy protection. Big data offers many opportunities but raises many questions, too. This is such an extensive area that we as CBS cannot provide the answers to all these questions alone, but would like to do so together with other parties. It is great that UT wants to tackle this adventure with us.’
Showcases
CBDS and UT collaborate in a number of areas including research and submitting subsidy applications. Boeijen is very pleased with the results of the CBS/UT DataCamp: ‘As we know from the results of our previous DataCamp, there is an enormous amount of creativity. We saw this again during the second DataCamp. The participants have produced beautiful results, and these will prove excellent showcases during our talks with potential CBDS partners!’