Defining types of innovation by means of text analysis
How can we get a good idea of the different types of innovation in companies in the Netherlands? Statistics Netherlands (CBS) has researched this at the Center for Big Data Statistics. In order to define the different types of innovation, a data-driven method has been developed that compares the words used on innovative companies’ websites. This method makes it possible to define different types of innovation without using a predetermined classification and without any need to send questionnaires to companies.
Working method
The text on the homepage of innovative companies’ websites is used to define the different types of innovation. To guarantee the widest possible range of subjects and areas, the research was based on companies that formed part of the KvK (Chamber of Commerce) Innovation Top 100 during the period from 2010 to 2018. The Innovation Top 100 website lists a total of 900 innovative Dutch companies and includes links to their websites. These companies’ websites are written either in English or in Dutch (often with a number of English words). All texts were translated into English so that the texts on the various websites could be compared properly. Punctuation marks and frequently occurring general words were then removed from the website texts. The remaining words were compared across the various websites and classified into different innovation groups using a number of different self-learning algorithms. The classification was repeated many times, starting with different words each time, which made it possible to determine which algorithm was most suitable and which classification was ultimately the best.
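The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the code used by CBS: the function name `clean_homepage_text` and the short `STOP_WORDS` list are assumptions made for the example, and the list of general words actually removed in the study is not given here.

```python
import string

# Assumption: a tiny illustrative list of general words; the study's own list is not published here.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
              "we", "our", "is", "are", "with", "on"}

def clean_homepage_text(text: str) -> list[str]:
    """Lower-case the text, strip punctuation marks and drop general words."""
    # Replace every ASCII punctuation character with a space so words stay separated.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = text.lower().translate(table).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "We build smart, sustainable solar panels for the energy transition!"
print(clean_homepage_text(sample))
# → ['build', 'smart', 'sustainable', 'solar', 'panels', 'energy', 'transition']
```

The resulting word lists, one per homepage, are the kind of input a grouping algorithm would work on.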
Results
The most suitable algorithm was found to be a ‘latent Dirichlet allocation’, which starts with a very broad classification into groups and then teaches itself to distinguish the groups increasingly clearly. A classification distinguishing 10 different innovation groups was found to be the best. This was determined by assessing how similar the words within each group are and how different the words in different groups are. Various calculation methods were used for this and the results were compared. A number of independent checks were also carried out on the results, all of which confirmed that the optimum number of different innovation types was 10.
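To give an impression of how a latent Dirichlet allocation moves from a rough starting classification to clearer groups, the sketch below implements a small collapsed Gibbs sampler in plain Python. It is an illustration only: the study does not state which implementation or settings were used, so the function name `fit_lda` and the values of `alpha`, `beta`, `n_iter` and `seed` are assumptions, and the four toy documents stand in for the cleaned homepage texts.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA; returns one word distribution per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]       # words of document d in topic k
    topic_word = [[0] * V for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                     # total words assigned to topic k
    assignments = []

    # Start from a random, very rough assignment of every word to a topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample the topic of each word given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed probability of each word within each topic.
    return [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(n_topics)
    ]

# Toy input: each inner list stands for the cleaned words of one homepage.
docs = [
    ["solar", "energy", "wind", "energy"],
    ["wind", "solar", "grid"],
    ["software", "data", "cloud"],
    ["cloud", "software", "platform", "data"],
]
topics = fit_lda(docs, n_topics=2)
for k, dist in enumerate(topics):
    print(k, sorted(dist, key=dist.get, reverse=True)[:3])
```

In the study the number of groups was 10 rather than 2. Running such a fit for several values of `n_topics` and comparing the resulting groups is one way the number of groups could be assessed, although the exact calculation methods CBS used are not specified here.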
The words occurring most frequently on companies’ websites were then compared. Figure 1 shows these findings for each of the 10 innovative subjects in a ‘word cloud’. The bigger the words in the picture, the more often they occur on the websites. But how can you find an appropriate name for each of these subjects? The algorithm used by CBS does not include fully automatic allocation of names to each subject. Indeed, no algorithm currently in existence can allocate such names fully automatically, without human intervention. This is partly because the most frequently occurring word, or a combination of frequently occurring words, does not always accurately describe the subject; the other words in the context are important in this regard. A selection of websites was assessed for each group and the best possible name was then chosen for each innovative subject in consultation with experts at CBS.
The names of the 10 innovation types are:
- Sustainable Energy
- Food & Agriculture
- Logistics
- Creative Industry
- Health Care
- Sustainable Construction
- ICT & Software
- Internet of Things
- Technology & Engineering
- Industrial Maintenance & Service
Figure 1. Word cloud containing the most common words for each of the 10 types of innovation
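The relationship between word frequency and word size in a word cloud such as Figure 1 can be sketched as follows. This is a hedged illustration: the function `word_cloud_sizes`, the linear scaling and the font-size limits are assumptions for the example, not a description of how the figure was actually produced.

```python
from collections import Counter

def word_cloud_sizes(group_tokens, top_n=5, min_size=10, max_size=40):
    """Map the top_n most frequent words of one group to font sizes."""
    counts = Counter(group_tokens).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    # Scale linearly so that the most frequent word gets max_size.
    return {word: max(min_size, round(max_size * n / highest)) for word, n in counts}

# Toy input: all cleaned words from the homepages in one innovation group.
tokens = ["energy", "solar", "energy", "wind", "energy", "solar"]
print(word_cloud_sizes(tokens))
# → {'energy': 40, 'solar': 27, 'wind': 13}
```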
There are a remarkable number of similarities between the 10 innovative subjects found and the nine top sectors defined by the Dutch government. These nine top sectors are: Agri & Food, Chemicals, Creative Industry, Energy, Hightech Systems & Materials, Logistics, Life Sciences & Health, Horticulture & Basic Materials and Water & Maritime.
Some of the innovative subjects found closely match the top sectors. These are Sustainable Energy (= Energy), Food & Agriculture (= Agri & Food), Logistics, Creative Industry and Health Care (= Life Sciences & Health). The innovative subject of ICT & Software can be viewed as part of the Creative Industry top sector. The Hightech Systems & Materials top sector largely coincides with the innovative subjects of Sustainable Construction, Internet of Things, Technology & Engineering and Industrial Maintenance & Service. These similarities show that companies in the Netherlands are genuinely taking innovative steps in these areas. After all, the classification developed by CBS is based on the companies listed in the Innovation Top 100 over the past nine years.
It is also notable that the top sectors of Chemicals, Water & Maritime and Horticulture & Basic Materials were not found in our classification. There may be various reasons for this. The first reason may be that there are fewer companies operating in those top sectors. That would explain why – in terms of absolute numbers – fewer companies in these top sectors are innovative. It is therefore difficult to find these types of innovation with the approach used by CBS. The second reason may be that few if any companies in the Chemicals, Water & Maritime and Horticulture & Basic Materials sectors registered to participate in the Innovation Top 100. At the moment, however, it is unclear why these top sectors did not appear in our results. It would therefore certainly be useful to look more closely at innovation in these areas.
Challenges
One of the main challenges is the automatic allocation of names to the different types of innovation. Recent developments in artificial intelligence and natural language processing offer increasing potential in this regard. It is also important to develop methods to properly assess less common types of innovation. This will help to gain a detailed picture of progress in a particular area, especially if it is one in which few companies are operating.
Privacy
Only links to websites of companies appearing in the KvK Innovation Top 100 were used in the development of this method. In addition, only the texts on the homepages of those websites were examined. The KvK Innovation Top 100 organisation always asks the companies selected by the jury as Top 100 constituents for permission to publish the name of the company, a link to the website and a brief description of the company’s innovation. CBS cleans up the texts on these companies’ websites and aggregates them. Consequently the individual words cannot be traced back to particular companies.
Applications
When distinguishing different types of innovation based on text on companies’ websites, it is important to understand the different types of innovation that can be detected with such texts. For technological innovation a method has been developed that was previously published as a beta product. The approach described above shows that other types of innovation can also be defined. That makes it possible to find innovative companies operating in certain top sectors, for example. The impact of incentive policies in these sectors can then be determined. Detailed maps could also be made of areas in which such innovative companies are based. This will be of particular interest to municipalities and provinces.
Feedback
CBS is interested in your opinion of this study and the method that has been developed. For example, do you have any ideas about possible applications or suggestions for refining this method? Please send us your feedback using the form below.