Connecting users to quality journalism with AI-powered summaries

In the massive flood of information that meets the modern media consumer on digital platforms, it can often be hard to spot the real editorial gems. News is ubiquitous and quality journalism runs the risk of getting lost in the abundance of content. At the same time, the thirst for outstanding reporting is greater than ever. All respectable news outlets put great effort into producing well-crafted pieces of particularly high journalistic value. Those are the unforgettable stories that include unique voices or perspectives, that add deeper analysis and context or that excel in truly captivating storytelling. To put it simply: the very best of our journalism.

The hypothesis for this study is that AI can play a role in increasing the visibility and use of these high-value stories. We will explore how AI-powered, automated summaries can be used in the modern newsroom. Our main approach is rooted in the idea of ‘structured journalism’ that aims at atomising existing content and using automated repackaging to create new journalistic products. AI applications can be used in both the deconstruction and reassembly phases. This report touches on a wide range of editorial aspects, such as quality control, raising audience engagement and smoothing newsroom workflows, as well as technical challenges such as finding the most fruitful algorithmic models for journalistic summarisation.

EDITOR'S NOTE:

This report is produced by an international team of journalists as part of the JournalismAI Collab, a project of the Polis think-tank at LSE, supported by the Google News Initiative.

The Collab is a collaborative experiment where news organisations from around the world teamed up to explore innovative solutions to improve journalism through AI.

You can find out more about the initiative, and about the work of the other teams, on the Collab page.

The team behind the study – initiated as a part of the JournalismAI Collab at LSE/Polis – brings together a remarkably varied set of competencies and experiences. It includes media companies from four different countries: Germany, India, Switzerland and Sweden. The tests have been conducted in four languages: English, German, Hindi and Swedish. To widen the scope of the study, journalistic items in both text and audio form have been used. Automatic transcription and translation were also part of the experimentation process and yielded valuable takeaways.

The empirical base of the study lies in several tests that have been run on content from the five participating media organisations: Bavarian Broadcasting (Germany), Der Spiegel (Germany), Jagran Media (India), TX Group (Switzerland), and Swedish Radio (Sweden). There are two main types of tests that the conclusions are drawn from:

1) Testing of tools for automated summaries

2) A/B/C/D testing on Der Spiegel's website to see how the audience reacts to different summarised formats.

The goal of this report is to share insights into the possibilities and challenges of using automated summaries in news journalism. We want the findings to be as practical and useful as possible, especially to people working in newsrooms, academia and tech companies. So we have included chapters with concrete use cases for the use of AI-summaries and what we think is needed to introduce this solution into newsroom workflows.

There are some important limitations to our methodology that we want to mention from the outset. Due to time restrictions, the study is based on practical tests of only one production-grade commercial tool – Agolo – as well as a prototype developed by BR. Agolo’s management – which generously gave us access to a test version – has made it clear that their model is trained on traditional news articles and optimised for speakable summaries.

In our testing, however, we used evergreen articles of different styles and lengths, mindful of the fact that this is not the type of content the tool was designed to summarise. This mismatch is a shortcoming in our methodology and we will remind readers that some of the test results are products of this deliberate choice rather than shortcomings of the tested tools.

Our overall conclusion is that there are substantial opportunities for publishers to use automated summaries, especially as the technology evolves. However, there are clear limitations to the quality of these AI-summaries, even if the results often impress at first sight. At this point, human editorial judgment is still a necessity to safeguard the quality of the output from the tools, especially when the input text is complex and analytical. A hybrid workflow that combines journalistic skills and judgement with algorithmic efficiency is the preferable way forward for the foreseeable future.

AI-generated summary of the introduction

We ran this introduction through the Agolo summarisation tool. This is what it came up with:

All respectable news outlets put great effort into producing well-crafted pieces of particularly high journalistic value.
The hypothesis behind this study is a firm belief that AI can play a role in increasing the visibility and use of these high-value stories.
More specifically we will explore how AI-powered, automated summaries can be used in the modern newsroom.


Uli Köppen

Uli Köppen is Head of the AI + Automation Lab and Co-Lead of the investigative Data Team, BR Data, at German Public Broadcaster Bayerischer Rundfunk. In this role, she’s working with interdisciplinary teams of journalists, coders and product developers specialising in investigative data stories, interactive storytelling and experimentation with new research methods such as bots and machine learning. As a 2019 Nieman Fellow, she spent an academic year at Harvard and MIT and has won several awards together with her colleagues.

Bayerischer Rundfunk (BR) is part of the network of German Public Broadcasting and serves its audience on radio, TV and web. BR is exploring AI and automation in content and product with dedicated teams, the AI + Automation Lab and BR Data, as well as in different projects ranging from Algorithmic Accountability Reporting to IT and the archives.

AI is one of the things you can’t figure out on your own – you need an interdisciplinary approach and buy-in from many departments in your news outlet. And you need to dive into the experiences of other news outlets around the world – which makes me enjoy the exchange with all the smart people in the JournalismAI Collab even more.

Cécile Schneider

Cécile Schneider steers product development at the AI + Automation Lab of German Public Broadcaster Bayerischer Rundfunk (BR). She is a Scrum certified product owner, setting up hybrid workflows to make AI and automation beneficial for newsrooms. As Medialab Bayern entrepreneurship fellow, her product approach is infused with design thinking and focused on solving the right challenge. Cécile has worked in the tech and media industry for more than 10 years, among others for Telefónica Germany's data unit and various renowned gaming brands. Her Telefónica team won the German Online Communications Award with a data visualisation project. She leads the Munich Chapter of the Digital Media Women's nonprofit.

The challenge we’re facing is to get humans to work together with AI-based applications. Therefore, AI applications must be designed to match human needs and workflows and eventually benefit human goals. Therefore, we still need to do a lot of myth-busting in our society as a whole about what AI can and cannot do. The JournalismAI Collab is a blessing in bringing so many people working on this together. It helps to circulate worthwhile challenges to see if something is an issue in other newsrooms or not.


Pratyush Ranjan

Pratyush Ranjan works as Senior Editor at Jagran New Media in India. His key expertise areas are Digital Content Management, Newsroom and Editorial Process Management, Cross-functional Team Leadership, and Search Engine and Social Media Optimisation. Pratyush is one of the Fact-Check Trainers at Google News Initiative (GNI) India Network. He has conducted many fact-checking workshops and training sessions and trained more than 3000 people across India.

Jagran New Media is the digital wing of Jagran Prakashan Limited – India's leading media and communications group with interests spanning print, OOH, activations, radio and digital. Jagran New Media creates and publishes online news and information content which informs, educates and helps users to make better life decisions. The company portfolio includes 9 digital platforms that provide content across genres like news, education, lifestyle, entertainment, health and youth.

It's high time the journalism world started keeping pace with the evolution of new technologies like the use of AI in newsrooms and creating new business models to deal with the challenges of the post-COVID situation. AI has helped develop systems for detecting misinformation and deep-fakes across the world.


Christina Elmer

Christina Elmer is deputy head of the Editorial RnD team at DER SPIEGEL, where she had previously established the data journalism department. Before joining SPIEGEL ONLINE as a science editor in 2013, Christina worked at Stern magazine’s investigative unit. Her journalistic career began in 2007 at the German press agency dpa, where she was part of a team which set up Germany’s first department for data journalism. She is a board member of Netzwerk Recherche, Germany’s largest association supporting investigative reporters.

DER SPIEGEL is Germany's leading news magazine and news website, characterised by in-depth investigations. More than 14 million people read DER SPIEGEL content every week, online and in print. To reach new audiences and improve the overall user experience, the newsroom is already using selected AI-driven tools and is exploring further opportunities.

For journalism to resonate with humans, it cannot be done without humans. But AI-powered solutions could be used much more commonly in journalism to better serve the public. The JournalismAI Collab is the perfect network to explore concrete approaches with like-minded fellows.


Didier Orel

Didier Orel is Head of TX Group Data Analytics, where he supports and leads data-driven initiatives, be it recommendation engines, marketing tools or editorial production workflows. He spent most of his career in media, driving digitalisation projects and providing newsrooms with performant editorial production tools.

TX Group is a network of digital platforms that provide users with information, orientation, entertainment and services every day. Four independent companies operate under the umbrella of TX Group: TX Markets comprises the classifieds and marketplaces; Goldbach works on advertising marketing in Switzerland, Germany and Austria; 20 Minuten is a commuter media company in Switzerland and abroad; Tamedia leads paid daily and weekly newspapers and magazines into the future.

The JournalismAI Collab is the perfect environment in which to identify trends and leverage AI capabilities to improve journalism. In this international group, the variety of perspectives allows us to find the right balance between the power of technology and its limits.


Olle Zachrison

Olle Zachrison is Head of Digital News Development at Swedish Radio (SR). He is also Chairperson of the EBU Investigative Projects & Network. Before taking on his new role at SR last year, he was Head of News & Current Affairs for four years. He has a background in newspaper journalism and has been Managing Editor and Business Editor of the Swedish national daily Svenska Dagbladet.

Swedish Radio is Sweden’s national public broadcaster and leading audio company, with 1900 staff stationed in over 50 locations around the country. The SR vision is: “More voices and more powerful stories for greater understanding.” SR is actively exploring AI to serve its diverse audience better.

I love international cooperation and to approach challenges in multidisciplinary teams. So for me, the JournalismAI Collab project has been the perfect match.

For newsrooms, artificial intelligence poses significant challenges, as set out in the JournalismAI report published in 2019: To implement new approaches in a meaningful way, they need internal experts or external partners and a good level of algorithmic literacy among their editorial teams. Successful use of this technology could be very worthwhile. With the automation of repetitive workflows and potential scaling effects, artificial intelligence tools could enable newsrooms to improve the effectiveness of their work, to reach new and bigger audiences and build more resilient business models. But can newsrooms overcome the knowledge gap? How far are the tools currently developed? And is it worthwhile to implement hybrid workflows, in which algorithms and humans produce journalistic content together? With this study, we want to answer these questions and suggest ways to make it happen.

In journalism, AI and automation are frequently associated with speed and computational efficiency, such as “robot journalism” producing short stories on sports results or corporate earnings. AI is less often used – or even considered an asset – in the core editorial creation process for the lifting and integration of editorial gems, or premium content, sometimes referred to as ‘evergreen’ pieces. We consider this field extremely promising for future journalistic services.

Evergreens are high-quality editorial pieces that stay relevant for a long time and might enrich new content by adding in-depth backgrounds and further perspectives. Evergreens are plentiful in the archives of many news sites and often accessible via search engines, but are too rarely used in current news production. However, especially for complex issues, evergreen content would be helpful in providing readers with a deeper understanding of events and issues. In addition, newsrooms could use these editorial gems to show their value to the public as organisations with a track record of good reporting, distinguishing themselves from non-journalistic sources.

Automation and artificial intelligence technologies might be able to do this all in a structured way. To find out, we will explore if the modern newsroom can make use of AI-created summaries to give its audience a higher quality experience. How effective are available summarisation tools at capturing the essence of a journalistic evergreen? Is the quality good enough for immediate, audience-facing use? Could these automated summaries attract more people to the premium pieces? And could inserting summaries of our best journalism in other pieces enhance the experience there?

We see 3 central starting points for developments in this field:

3.1 Identification

To make evergreens usable, they first have to be identified. A thorough definition of what is ‘evergreen’ is, therefore, essential. At best, this definition is based on data that can be collected and processed automatically and thus enables smooth, scalable processes. We are convinced that this task could be supported by artificial intelligence, for example in setting up models to identify newly published evergreens based on existing selection criteria.

Nevertheless, we have decided not to develop a universal solution for this part of the challenge. This is because the definition of evergreens is closely linked to the strategic orientation of a particular newsroom and the way it characterises its content. This became clear when we compared the evergreen criteria in our own different media organisations and looked at models used elsewhere. All approaches had this in common: They considered both some kind of traffic data and the search engine visibility as important, but with very different priorities and weights. We found one of the best definitions in a scientific paper by researchers from Pennsylvania State University in cooperation with The Washington Post.

The differences between news organisations showed that the identification of editorial gems should be addressed individually, to create meaningful solutions for specific newsrooms. Therefore, we manually selected the evergreens we used in our experiments.

3.2 Summarisation

Current reporting and evergreens must always be combined in a user-friendly way. This can best be achieved with flexible short formats summarising the evergreens’ content, be it in short abstracts, meaningful quotes or questions that arouse interest. Such snippets can be used and recomposed flexibly in different contexts, enabling journalistic content to be adapted to diverse usage situations, and even creating new products. Following a structured journalism approach, we consider this the most effective part of our project. If newsrooms succeed in breaking down their high-quality content into manageable components, they can react much more easily to their constantly changing digital environment.

The potential of summarising short formats depends on being able to compose them in an automated way – a challenge which several AI-driven tools are currently focusing on. We were able to evaluate two of them intensively in our study: Agolo, a solution that is already in use at the Associated Press (AP) and has been trained intensively to summarise news pieces in English language, and a German prototype developed by the AI + Automation Lab at Bayerischer Rundfunk (BR). Since not all of our newsrooms publish in English or German, we translated several of our articles in advance using the AI-driven tools DeepL and Google Translate.

To assess their suitability for journalistic input, we compared the AI-generated results with manually-produced summaries, written by editors. Four quality criteria were used: the ability to capture facts; grammatical correctness; journalistic text quality; and usability as a teaser. The overarching goal of the experiment was to understand the state of development in this field and to identify the problems that AI-powered solutions face when dealing with journalistic content. We therefore also considered differences between languages, genres and formats. Journalistic styles can vary greatly depending on the language area, and specific formats and text genres can also present their own unique challenges. [See Chapter 4 for more details.]

Our expectation was not that machine-written summaries would be directly publishable. But how much editing is still needed? Are minor corrections sufficient or is so much manual optimisation necessary that it is easier to have the summaries written directly by the authors? This assessment is crucial when it comes to the question of what newsrooms should be advised to do at this stage. We aim to contribute to this and provide media managers with a basis for sound decision-making that considers the special requirements and quality criteria of journalistic work.

3.3 Integration

Only by including the user perspective can we properly evaluate the automatically-generated summaries. So we tested them ‘in the wild’: manually-written summaries, quotes, and questions were automatically integrated into current news pieces published on the website of Der Spiegel. The news environment is fast-moving and, therefore, poses a special challenge for in-depth content. Many readers only visit the site to quickly inform themselves about global events. But with a total of around five million readers daily, we could nevertheless assume that a sufficient number of views could be tracked for analysis.

We focused both on the direct and indirect effects of the integrated elements. Since all of them contain a link to the original article, we could measure their potential to attract readers and guide them to in-depth background articles. As the snippets and the summaries provided meaningful information themselves, indirect effects could also be expected: Ideally, the short formats enrich a news article so that it is read more intensively than a normal piece. To evaluate both kinds of effects, we set up a testing scenario with twenty elements based on five evergreen articles on climate change and compared various key performance indicators (KPIs), both in terms of different snippet formats and in comparison to a conventional article.

This part of our study is designed to help newsrooms to strategically arrange and integrate summarising elements in their up-to-date content. To do this, a better understanding of the impact of different formats on performance indicators such as click-through rate, read-depth and time-on-page is needed. Only if newsrooms can strategically approach the use of automated formats in this way can summaries be designed appropriately and used in a targeted way. Our analysis is intended to provide reliable evidence to support this.

While text summarisation has been an academic challenge for a long time (see proceedings of the Document Understanding Conference since 2001), it has recently gained traction as more computing power and new models for text summarisation have emerged.

Currently, there are two major approaches to text summarisation: Extractive and Abstractive models. Extractive models use statistical and machine learning frameworks to pull the most relevant sentences out of a given text. Abstractive models powered by machine learning and deep neural networks try to write new text, summarising the original text. Here’s a brief introduction to text summarisation with machine learning. We learned that abstractive models are not that relevant to journalistic use cases, at least for now, because they create false narratives and make up facts – which is clearly a problem for journalism!
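To make the extractive idea more concrete, here is a minimal sketch in Python. It is not the method used by any tool tested in this study. It simply scores each sentence by the average frequency of its words and returns the top-scoring sentences in their original order. The function names, the tiny stop-word list and the scoring rule are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
              "it", "for", "on", "that", "this", "with"}


def split_sentences(text):
    """Naively split text into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lower-cased words of the text with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Because every returned sentence is copied verbatim from the input, a summariser of this kind cannot invent facts, which is the property that makes extractive models attractive for journalistic use.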

There is a broad array of summarisation tools available. Some are free, some for scientific use only, and a few are commercial grade. Our team member Didier Orel of TX Group has compiled a selective list of these tools. Many of them employ less sophisticated methods of summarisation, or are made for specific purposes like summarising text for search engine optimisation or student papers.

Not all summaries are created equal and the output varies. The tools produce summaries in different shapes and sizes. Apart from regular text summaries, we encountered bullet point summaries; quotes; voice-optimised summaries; and ‘questions and answers’. These different formats might enable newsrooms to derive value from AI-based summarisation in different editorial contexts.

4.1 The tools we tried: Agolo, a BR prototype, and experiments by Swedish Radio and TX Group

In the context of the JournalismAI Collab, our main focus was on the technology and on journalistic use cases. Our goal was not to carry out market research, nor to endorse a particular summarisation solution. We wanted to find out how AI-summarisation works towards our goal of reviving and re-serving evergreen articles and then share our findings.

Through AP’s Lisa Gibbs, we learned about a tool specialising in summaries for journalistic use that is being tried out at AP. The AI-summariser was developed by US start-up Agolo and currently supports the English language. The company presents itself as having “the world's most powerful text-summarisation software” and offers an extractive summarisation model trained on a vast number of news texts from Reuters and AP.

Through the JournalismAI Collab, we were generously granted free trial access by Agolo for the purpose of this study. The choice to go with Agolo was made because of the practical accessibility, the forthcoming attitude of Agolo management and positive feedback around the tool. As mentioned, our research includes other commercial tools and ideally more services should have been tested. But this was a way of limiting our scope for the time of this Collab.

We also tested a prototype of another tool, developed by a public service broadcaster. With BR's AI + Automation Lab having machine learning expertise on the team, the Lab built their own text-summarisation prototype for German-language text, to learn more about the technical setup, limitations and how editors could work with the output. Agolo and this prototype were both used to summarise a wide array of different texts from our respective organisations, in part also with AI-translated and transcribed texts.

Swedish Radio and TX Group also experimented with building their own summarisation models. SR and TX did limited exploration in their respective experiments. We will now give a brief overview of each tool’s characteristics.

4.1.1 Agolo

Agolo creates AI-generated summaries for smart-speaker platforms like Alexa, Google Assistant, and Siri. When explaining its use of an extractive summarisation model, the company says that this approach makes it easier to stay within the particular tone and voice of a given outlet when compared to abstractive summaries. Agolo offers summarisation solutions that cater to different use cases. Agolo’s Speakable summaries are heavily optimised for voice consumption. Agolo can thereby help media companies that rely mostly on text to extend their offer to smart speakers and other voice-based channels.

Fenil Dedhia, Technical Product Manager at Agolo, says about their solution: "Our Speakable Summarisation engine is a collection of our proprietary NLP techniques and algorithms. It determines what is salient from the input text, as a means of deciding whether that information should be included in the summary. However, people don’t process sound as quickly as sight, so we need to keep sentences short and smoothen out our summaries to make them suitable for auditory consumption. While many of our clients consume Speakable Summaries for the eye (digital print), its primary use case is for the ear (voice media). To support the auditory consumption, we have to make some edits to the original text but we ensure that the voice of the author and the audit trail is maintained."

The Agolo model is made to favour two journalistic writing styles. The first is the classical, inverted-pyramid news style, which leads with the most important facts at the top of the article. The second is a more narrative approach with a general entry followed by more facts, like this story from CNN.

This focus on the genre is crucial because the version of the Agolo tool we tried is built for a very specific use case: summarising news pieces ranging from 600 to 1500 words to be consumed on voice. That means that our study setup puts Agolo to a test it was not designed for. Our evergreen articles were up to 6000 words long and more on the creative side, covering complex scientific topics. Some were composed in alternative styles, like listicles and ‘how-to’ explainers, which are very different from the classical news-style text.

Learning how important it is to design a tool specifically geared towards a newsroom’s specific challenge in AI-based summarisation is one of our most important findings in this study. As will be shown later, weaker or missing summaries from the Agolo tool seem to be a result of the fact that the input text is different from the text normally processed by Agolo. Nevertheless, we think that running evergreens through the existing versions of these tools gives valuable insights to publishers considering employing AI-summaries in that area in the future.

While we tested Agolo’s technology via a graphical user interface, it is also accessible via on-premise or API deployments. Via API, the summariser can be integrated into content management systems for easier editorial workflows.

Agolo features

The Agolo demo dashboard that we had trial access to for a week featured three different categories of output:

1) a speakable summary optimised for voice,

2) a bullet point summary,

3) quoted statements from the text.

The speakable summary can be dynamically generated with two default length settings: informative or indicative. Informative summaries span from 150 to 300 words while the indicative summaries span from 75 to 120 words. Since it is mostly an extractive summarisation, the minimum required length for a summary is 50 words. It is optimised for voice output with a couple of features that also work when used as text, such as:

1) co-reference resolution (replacing pronouns with the corresponding names/words);

2) identifying sources with their title/function before their name;

3) simplifying odd and decimal digit numbers;

4) removing unwanted text (like photo captions, ad text) to eliminate noise.

Points 2 and 3 were developed as a custom feature to match AP’s news style. The speakable summary is read by computer voice with the click of a button.

The bullet-summary option extracts the sentences perceived as most important in bullet-point style. The Quotes output option extracts all the quoted statements from individuals in the text. This feature automatically excludes non-attributed quotes to keep output relevant. Outputs are organised around news production needs and can also contain Q&As. This feature was not tested in our setting.
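The idea behind the quotes output described above, keeping attributed quotes and dropping unattributed ones, can be illustrated with a small Python sketch. This is not Agolo's implementation, which is proprietary and was not disclosed to us. The regular expressions, the short list of attribution verbs and the function name are assumptions chosen for illustration, and they only cover two simple English sentence patterns.

```python
import re

# Attribution verbs to look for next to a quote (illustrative, far from complete).
VERB = r"(?:said|says|told|added|explained)"
# A span in straight or curly double quotes.
QUOTE = r'["“]([^"“”]+)["”]'
# One or more capitalised words, e.g. "Anna Berg".
NAME = r"([A-Z][\w-]+(?: [A-Z][\w-]+)*)"

# Pattern 1: "quote," said Anna Berg      Pattern 2: Anna Berg said: "quote"
QUOTE_THEN_NAME = re.compile(QUOTE + r",?\s+" + VERB + r"\s+" + NAME)
NAME_THEN_QUOTE = re.compile(NAME + r"\s+" + VERB + r"[,:]?\s+" + QUOTE)


def attributed_quotes(text):
    """Return (speaker, quote) pairs; quotes without a named speaker are skipped."""
    pairs = [(m.group(2), m.group(1)) for m in QUOTE_THEN_NAME.finditer(text)]
    pairs += [(m.group(1), m.group(2)) for m in NAME_THEN_QUOTE.finditer(text)]
    # Drop the comma that often closes a quote before "said ...".
    return [(speaker, quote.rstrip(",").strip()) for speaker, quote in pairs]
```

A phrase in quotation marks that has no speaker next to it, such as a term in scare quotes, matches neither pattern and is therefore left out.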

4.1.2 BR prototype

The BR prototype AI summariser is accessed over a website interface. The text is pasted into a text box and then summarised. With a simple switch below the text box, users can choose to summarise with sentence embedding or "Transformers".

Users can also adjust the output by choosing the number of sentences they wish to be included in the summary. The number can be set to a value ranging from 2 to 15, based on the input text. The user can also select specific entity tags. This will lead the model to include more or fewer sentences containing the selected tags.
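How such controls could influence sentence selection can be sketched as follows. This is a hypothetical illustration, not the BR prototype's code: the per-sentence scores and entity tags are assumed to come from an upstream model and tagger that are not shown, and the additive boost is our own assumption about how tag preferences might be applied.

```python
def select_sentences(scored, n_sentences, boost_tags=(), boost=0.5):
    """Pick the top sentences after boosting those that carry selected tags.

    scored: list of (sentence, base_score, tags) tuples; scores and tags are
    assumed to come from an upstream model and entity tagger (not shown).
    A negative boost demotes sentences with the selected tags instead.
    """
    # Keep the requested length inside the 2-15 range offered by the interface.
    n = max(2, min(15, n_sentences))
    wanted = set(boost_tags)

    adjusted = []
    for position, (sentence, base_score, tags) in enumerate(scored):
        hits = len(wanted & set(tags))
        adjusted.append((base_score + boost * hits, position, sentence))

    # Highest adjusted score first; earlier sentences win ties.
    top = sorted(adjusted, key=lambda item: (-item[0], item[1]))[:n]
    # Present the chosen sentences in their original order.
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

For example, selecting a location tag would push sentences that mention places up the ranking, so that they are more likely to make it into a summary of the requested length.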

While both models of the prototype are based on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, the Transformers model uses a fine-tuned German pre-trained BERT. It was trained further with around 60,000 news articles and their leads from BR’s news website BR24.de.

The BR team modified the Transformers model to make it more efficient in terms of computational costs. They also adjusted the model for the German language. To give a sense of the computing power necessary to achieve the current status, it was trained on an NVIDIA P5000 GPU for about 2.5 days. More training on more data is likely to improve performance, but computational costs in cloud services can be significant and must be weighed against the expected output improvement.

As with Agolo, both models of the BR prototype employ extractive methods of text summarisation. The BR team experimented with abstractive models, but found them to be ‘hallucinating’. This means the models make up facts that are not contained in the original text, based on their deep learning “knowledge”, a common problem in this field. Therefore, the team did not see the current state of abstractive summarisation as a viable route for journalistic text summarisation, where having the facts straight is essential. However, the team considered using abstractive models on top of extractive summarisation to smooth the output but has not tried it to date.

Very long text created another challenge because the available models are designed to have short text as input. To solve this challenge, the team used a “greedy algorithm”. This algorithm splits the text into smaller chunks and summarises each one separately. The final summary is generated from these mini-summaries. ‘Greedy algorithms’ do not always solve challenges in an ideal fashion, but were a cost-effective way for the team to make the processing of longer texts possible.
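The chunk-and-summarise approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BR's code: the word limit, the function names and the `summarise` callable, which stands in for the actual model, are all hypothetical.

```python
def greedy_chunks(sentences, max_words):
    """Greedily pack consecutive sentences into chunks of roughly max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        length = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + length > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += length
    if current:
        chunks.append(current)
    return chunks


def summarise_long(sentences, summarise, max_words=200):
    """Summarise each chunk, then summarise the concatenated mini-summaries.

    `summarise` is any function mapping a list of sentences to a shorter list;
    it stands in for the actual model, which is not shown here.
    """
    mini = []
    for chunk in greedy_chunks(sentences, max_words):
        mini.extend(summarise(chunk))
    return summarise(mini)
```

The packing is greedy in the sense that each chunk is closed as soon as the next sentence no longer fits, without looking ahead. That keeps the procedure cheap, but it can cut a text at points where the argument spans two chunks, which is one reason the result is not always ideal.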

One important feature the team wanted to integrate was coreference resolution. In natural language processing, coreference resolution is the task of finding all expressions in a text that refer to the same entity. It is very relevant for summarisation. For example: when sentences containing “he” or “she” are selected for a summary, the tool should replace the pronoun with the name of the person in question. This was seen as having the most impact on output quality but was not possible to realise due to the lack of adequate training data in German.
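To show what this feature is meant to achieve, here is a deliberately naive Python sketch. It is not a real coreference model and was not part of the BR prototype. It only handles a sentence that starts with "He" or "She", and it assumes, without checking gender or context, that the pronoun refers to the last two-word capitalised name seen earlier in the text. The function name and both patterns are hypothetical.

```python
import re

# A two-word capitalised name such as "Anna Berg" (a crude assumption).
NAME = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")
# A sentence-initial pronoun.
PRONOUN = re.compile(r"^(He|She)\b")


def resolve_leading_pronouns(sentences, selected):
    """Return the selected sentences with a leading He/She replaced by a name.

    sentences: all sentences of the article, in order.
    selected: indices of the sentences chosen for the summary.
    Naive rule: the pronoun refers to the last name seen in an earlier sentence.
    """
    last_name = None
    resolved = []
    for sentence in sentences:
        if last_name and PRONOUN.match(sentence):
            resolved.append(PRONOUN.sub(last_name, sentence, count=1))
        else:
            resolved.append(sentence)
        names = NAME.findall(sentence)
        if names:
            last_name = names[-1]
    return [resolved[i] for i in selected]
```

A rule like this breaks as soon as two people are mentioned or a capitalised phrase is not a person, which is why production systems rely on trained models and why the lack of German training data was a blocker.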

4.1.3 Swedish Radio's experiments

Swedish Radio has not yet developed its own prototype tool for summarisation, like German public service broadcaster BR. Nor has it licensed a commercial tool, like Agolo. However, some explorations of both the editorial and technical requirements in this area were made by SR in the autumn of 2020, coinciding with the JournalismAI Collab.

During October 2020, the SR team of development engineers – Tobias Björnsson and Carl-Johan Rosén – who have considerable experience in working with AI-models in the audio sphere, explored three different solutions: Latent Semantic Analysis (extractive), a deep learning model which we trained using in-house data (abstractive), and finally a pre-trained BERT model. All the models are language-agnostic but tests were primarily done in Swedish and to a lesser extent in English.

The pre-trained BERT model worked remarkably well without fine-tuning for SR’s purposes. But not having access to clean training data was a challenge which made training a model from scratch difficult.

The SR team did not have enough time to create a clean training dataset to train their model or to fine-tune BERT. Latent Semantic Analysis worked sometimes, but they saw better results with BERT.

In the future, the team would like to create clean training datasets and fine-tune a pre-trained BERT model to SR needs. Exploring BERT would also open up other possible solutions to NLP problems.

Another interesting AI application developed by Swedish Radio – unrelated to the content of this study – is an algorithm for empowering public service news. Read about it here.

4.1.4 TX Group's experiments

TX Group has been experimenting with summarisation for a very different use case in the sourcing of information, as described in chapter 7. Still, we want to include the insights of TX Group data scientists Milena Djordjevic and Tim Nonner, who implemented the first version. They developed a tool for extractive summaries in the shape of bullet points and speakable short text. They also tested abstractive summarisation in comparison to extractive tools.

Along with the other developments undertaken, TX Group found extractive models to be much more controllable and explainable, and their implementation is easier. In our case, testing concluded that the abstractive tool was slower and in most cases inserted words that were not contained in the article before summarisation. The goal at TX Group is to iteratively improve existing tools with the most recent technology and also continue working on abstractive summarisation.

4.2 Study setup

In our study, the focus is on evergreens. A limited number of the tested pieces are written in traditional news style, while a lot of them contain long, creative reporting and science pieces, even listicles. During the course of our work, we found this selection to be somewhat at odds with the training of tools at our disposal. With Agolo being trained for short news articles, the difference in input formats was most relevant. The same goes to some extent for the BR prototype, where the model was tweaked with 60,000 articles containing mostly news.

Our text sample is therefore limited in size, and also somewhat skewed in genre, in a way that is less digestible for the tools at hand. This must always be taken into account when interpreting the results. We do not rate the overall performance of tools, but the output quality limited to our very specific use case and sample.

Considering our goal to compare summaries from different tools across different newsrooms with different evergreens, we looked to science for criteria to evaluate automatically-generated summaries. In the paper “Manual and automatic evaluation of summaries”, the authors propose different criteria for manual text evaluation:

"To measure quality, assessors rate grammaticality, cohesion, and coherence at five different levels: all, most, some, hardly any, or none:

Grammaticality: Does the summary observe (English) grammatical rules independent of its content?

Cohesion: Do sentences in the summary fit in with their surrounding sentences?

Coherence: Is the content of the summary expressed and organized in an effective way?"

For journalistic content, we decided to adjust this framework to better fit our purposes and created four categories to rate our summaries:

Grammaticality is one dimension we kept from the approach in the paper, simply measuring if the summary follows grammatical rules in its language.
Capturing of facts and logic: Journalistic content is based on facts and, of course, needs the facts to be right and presented in the right way. We added this criterion to measure the selection of facts from the original text in the summary.
Journalistic text quality: This category unites coherence and cohesion as well as the general usefulness of the summary in journalistic contexts.
Usability as a teaser: With this, we rate the possibility to use the summary as a link teaser.

We rated all these criteria in a large table with five steps (5=perfect, 4=good, 3=mediocre, 2=flawed, 1=bad). We are aware that the concepts of “most important facts” or “journalistic usability” might differ from newsroom to newsroom. However, we still think that applying a set of criteria and ratings makes it easier to discuss and compare results and also identify potential challenges.
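For newsrooms that want to repeat this kind of rating exercise, a rating table like the one described above could be represented as follows. This is only a sketch: the field names, the `Rating` class and the averaging helper are assumptions for illustration and do not reproduce the spreadsheet used in the study.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# The four criteria of the study; the identifiers themselves are assumed names.
CRITERIA = ("grammaticality", "facts_and_logic", "text_quality", "teaser_usability")


@dataclass
class Rating:
    """One rated summary on the scale 5=perfect, 4=good, 3=mediocre, 2=flawed, 1=bad."""

    tool: str
    summary_type: str
    grammaticality: int
    facts_and_logic: int
    text_quality: int
    teaser_usability: int

    def __post_init__(self) -> None:
        # Reject anything outside the five-step scale.
        for name in CRITERIA:
            value = getattr(self, name)
            if value not in range(1, 6):
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def mean_per_criterion(ratings: List[Rating]) -> Dict[str, float]:
    """Average score per criterion across a non-empty list of ratings."""
    return {name: mean(getattr(r, name) for r in ratings) for name in CRITERIA}
```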

To make our ratings less arbitrary, we created a human summary as a benchmark for every article we evaluated before rating the automated summary. We also consolidated summarisation settings across the tools in a way that supports our use case to present very consolidated evergreen content.

For Agolo, we used the 50-word setting for the speakable summary. To be fair, this is shorter than the recommended setting of 150 words. Moreover, we upped the challenge for the tools to make a good selection of facts, also by summarising articles longer than the advised 600 to 1500 words.
In the BR prototype, we used 3 sentences – as the equivalent nearest to the 50 words of Agolo.
For Agolo bullet summaries, we settled on 3 bullet points (the default setting is five).

We also discussed including automatic evaluation of summaries with metrics like BLEU, ROUGE or BERTScore that are applied in the machine learning community. We decided against using these. From our understanding, these are mostly used as a proxy and to compare results and progress when human assessors are not available for very large batches of text. We also did not see any additional benefit for our study to include them.
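For readers unfamiliar with these metrics, the sketch below shows roughly what ROUGE-1 measures: the overlap of single words between an automated summary and a human reference. It is a simplified illustration written for this report, not the official ROUGE implementation, which adds options such as stemming. The function names are assumptions.

```python
import re
from collections import Counter
from typing import Dict, List


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram overlap between a candidate summary and a reference summary."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    # Each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A score of this kind rewards shared wording only. It says nothing about whether the facts are right or whether the sentences connect logically, which are the aspects our manual criteria focus on.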

Our tests of summarisation tools show some encouraging results but point to significant challenges if the end-goal is a totally automated chain from evergreen text to an audience-facing summary. The extractive approach of AI-driven summarisation tools gives a high relevance to the initial parts of the underlying articles, like titles, leads and first sentences, depending on the content on which the tools were trained.

It is evident that the tested models are optimised and trained on standard news articles which causes a lower tolerance for unusual, creative formats and other genres like features and listicles. The inverted pyramid news style is considered as the norm. The ideal text is factual, concise and with a start that is clear and logically related to the headline, at least to produce a good result with Agolo’s model.

Moreover, the tools generally had significant problems summarising extensive articles, no matter the genre. The tested tools are very impressive at summarising the short and straightforward but struggle with the long and creative. It is important to note, however, that the tool-makers do not claim that their products will work on the longer and more innovative formats that often characterise the evergreens used in this study. We can well envisage tools becoming much better at summarising evergreens if they are provided with a relevant training set. But as the tools work best on clear and homogeneous text structures, it could well be that existing formats are cemented by integrating a summarisation tool in the newsroom workflow, thus impeding creative experimentation.

Generally, the speakable summaries (of up to 50 words) rated significantly higher in our tests than the bullet points. Agolo serves a convincing case in summarising English language news for voice platforms. The reason is that the extractive model picks out and includes sentences from the lead of the articles. The bullets on the other hand seem to be picked from the whole length of the text. This sometimes muddles the logic between the extracted bullets.

These findings are also relevant when assessing the journalistic text quality of the summaries. Cohesion between text elements inside sentences is not a problem as the extractive models keep the original, single sentences intact. Coherence, on the other hand, proves to be more challenging as the tools glue single sentences together in a way that frequently obscures logical connections and sometimes creates grammatical mistakes.

Having said that, the tools generally performed well on grammar. Again, this is because of the extractive nature of the tools. Journalistic texts are supposed to be flawless or at least strong grammatically. That strength is carried through to the summaries. Transcription and quotes, though, were generally detrimental to the grammatical quality. Furthermore, it is only in very rare cases that automated summaries can serve as perfect teasers dropping out of the AI-machine. A well-written teaser stimulates the reader’s curiosity without being too explicit. This very human, editorial quality gets lost in the AI-summaries.

One of the SR audio tests produced some very interesting results. A complicated chain of steps: 1) transcription, 2) translation, 3) summarisation, can actually work and provide an almost perfect summary. But the usual preconditions apply: the start of the text has to be so factual that it transcribes well, and the summary format should not be too long.
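The chain of steps can be sketched as a simple sequence of functions in which every intermediate output is kept. This is an assumed structure for illustration only: the step functions are placeholders to be supplied by the newsroom, and no real transcription, translation or summarisation service is called here.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; each function takes and returns text.
Step = Tuple[str, Callable[[str], str]]


def run_chain(source: str, steps: List[Step]) -> Dict[str, str]:
    """Run the steps in order and keep every intermediate output for review."""
    outputs: Dict[str, str] = {}
    current = source
    for name, step in steps:
        current = step(current)
        outputs[name] = current
    return outputs
```

Keeping the intermediate outputs matters editorially: since each step builds on the previous one, an error introduced early on is carried into the summary, and an editor needs to see at which step it entered.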

Finer details of summarisation, like the coreference resolution feature, depend on the availability of annotated training data. In our case, these were not available in German because the corpus was reserved for academic use. In English, the availability of training data is generally better. This makes development and training in non-English languages significantly harder. Hence, it is not surprising that most practical integrations of AI-summaries to date have been done in American newsrooms, where the linguistic, technical, and financial environments are most favourable.

5.1 Capturing of facts and logic

Most reporters are taught on their first day of journalism school that the start of a text is crucial for reader engagement. The start of the article is also decisive for how well facts are captured by AI-summaries. This point is well illustrated by the tests Der Spiegel did with summarising background articles on climate change. These long, well-written articles often have scenic introductions which cause a problem for the summarisation tools. On several occasions, the tools did not get the main point as they were led astray by the creative introductions and they produced logical mistakes when combining two sentences that did not fit together (wrong reference).

Gauging the results from the Agolo tool underscores these conclusions. The speakable summaries by Agolo of the same articles were generally good and got high scores because in this case the tool exclusively chose sentences from the lead or the introduction of the articles. It is almost as if the tool “capitulates” to the extensive articles and extracts something from the top as the easy way out. For example, this article about different technological solutions tackling climate change causes was summarised by extracting the first sentence of the lead:

“The climate goals can only be achieved if we actively remove CO2 from the atmosphere.”

Through our contacts with Agolo, we know that their model ranks the sentences in the body of the text in relation to the headline, and it is more likely that you find sentences that logically correspond to the headline early on. The bullet points, however, are in many cases extracted from the whole text and got very low scores for capturing facts and logic in Der Spiegel’s tests, which is presumably also due to the significant length of the articles.
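To make the idea of ranking sentences against the headline more concrete, here is a minimal sketch using word overlap (cosine similarity on word counts). We do not know how Agolo's model computes this ranking, so the similarity measure and all function names below are assumptions chosen for simplicity.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Word counts of the lowercased text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_headline(headline: str, sentences: List[str]) -> List[Tuple[str, float]]:
    """Return sentences sorted by similarity to the headline, best first."""
    head = bag_of_words(headline)
    scored = [(cosine(head, bag_of_words(s)), i, s) for i, s in enumerate(sentences)]
    # Highest similarity first; ties keep the original article order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(sentence, score) for score, _, sentence in scored]
```

With a measure like this, an article whose opening paragraphs set a scene in words unrelated to the headline gives the ranking little to hold on to, which is consistent with what Der Spiegel observed.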

The Swedish Radio results point in the same direction. This audio piece, 2:18 minutes long and in Swedish, entitled “Kina kritiserar Sveriges coronahantering” (in English: “China criticizes Sweden's corona handling”) was automatically transcribed using the SR transcription tool, then translated through Google Translate into English. It was then run through the Agolo summariser, which produced this summary in 38 words:

“China criticizes Sweden's handling of the Coronavirus and in a newspaper affiliated with the ruling Communist Party calls on the international community and the EU to condemn Sweden, which is believed to have capitulated to the virus.”

This summary is remarkably close to the manual benchmark summary we did beforehand:

“China criticizes Sweden's handling of the coronavirus. A newspaper close to the ruling Communist Party calls on the international community and the EU to condemn Sweden, which is believed to have "capitulated" before the virus outbreak. Experts see the criticism as serious and a sign of the bad relations between the countries.”

It is important to bear in mind that the text was both automatically transcribed and translated from Swedish, but the tool still produced an accurate result. The reason is simple, and reminds us of Der Spiegel’s findings: the first sentence of the story was in itself a very good summary of the whole piece as well as being straightforward and factual. Thus, the transcription and translation both carried the original meaning in a good way. But already in the second sentence, the transcription tool committed a fatal error. Faulty punctuation and a difficult name of the interviewee made the transcription/translation incorrect, which was then reproduced in the second bullet point provided by Agolo:

"The crime Björn gives it a head of the Olympic program at the Foreign Policy Institute sees the criticism as serious."

As you see, this bullet point is both incomprehensible and flawed in a grammatical sense. The name “Björn Jerdén” is understood as “Björn ger den” (in translation: “Björn gives it”). And an even more curious detail is how this “Olympic program” appeared here as it has nothing to do with the story. The reason is again a misrepresentation by the Swedish transcription tool: “Asienprogrammet” (which means “Asia program”) is transcribed as “OS programmet” (which translates to “Olympic program”).

The Swiss tests by TX Group also point to significant problems with capturing facts and logic when automatic translation is introduced as one of the steps. In the summaries that came out from the articles that were first translated from German to English, then summarised, then translated back to German, essential facts were often missing. In some cases, there were substantial factual errors included in the summaries.

BR tried both its own prototype tool in German and the Agolo tool in English. As most of the selected articles are listicles, the prototype struggled to make a meaningful choice of facts here. Scores for how well the summaries caught the facts and logic were medium to low, with the exception of this scientific article where the facts are straight – though the summary added a bit “too much detail”.

“Doch man weiß nicht genau, mit welcher Geschwindigkeit sich das Weltall ausdehnt. Eine Methode, um die Ausdehnung des Universums und damit die Hubble-Konstante zu messen, ist die Beobachtung von sogenannten Standardkerzen im Weltall. Bei diesen wissen Astrophysiker genau, wie hell sie in absoluten Werten sind, und können damit auch große Entfernungen im Weltall sehr exakt vermessen.”

Agolo did an overall good job in the capturing of facts of the BR articles. In two cases, the main sources of the articles were pulled into the summary. In one case, it is just a description of the person without knowing what he does or what his connection is to the story: “Bui Thanh Hieu is one of the best-known bloggers from Vietnam.” In the other case, a rather random fact about the person was presented: “For two to three weeks Michael was doubting the existence of the Holocaust.”

As in the tests at both SR and Der Spiegel, the bullet summaries did not capture the facts as well as the speakable summaries and they ended up with medium to low ratings. Agolo completely missed facts that it correctly identified as relevant in the speakable versions. On several occasions, the bullet points are composed of multiple sentences like passages with quotes. Again, in an apparent attempt to present protagonists, strange sentences like this one show up without further context: “When Sven Drewert, in his late 30s, wants to increase his credit card limit for the holidays, he experiences a surprise.”

It has been hard to assess how well extracted quotes caught facts and logic. Catching quotes did not work well for any company because identification was difficult and sources were not always clear. They got the lowest possible ratings because there was no connection to the facts of the story. But to be fair, the quotes are often not included in a text to add crucial facts but rather to add opinion and human flavour. So judging how well the few quotes extracted caught the facts may be less significant.

In conclusion: both the start of the text and the genre are crucial factors in extracting summaries that are good at capturing facts and logic. The ideal text is preferably factual, concise and with a start that is clear and logically related to the headline, at least to produce a good result in Agolo’s model. ‘Newsy’ texts are more successful than more creative feature pieces or listicles.

Generally, the speakable summaries (of up to 50 words) got significantly higher ratings in our tests than the bullet points. The reason is that Agolo’s extractive model picks out and includes sentences from the lead or the introduction of the articles for the short, speakable summaries. The bullets are, on the other hand, picked from the whole length of the text. This muddles the logic between the extracted bullets and includes more irrelevant facts.
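The behaviour we observed for the short speakable summaries resembles a simple "lead" baseline: take sentences from the top of the article until a word budget is used up. The sketch below illustrates that baseline only. It is not Agolo's model, and the function name and the default budget of 50 words (taken from our study setting) are assumptions for the example.

```python
from typing import List


def lead_summary(sentences: List[str], max_words: int = 50) -> str:
    """Take sentences from the start of the article while the word budget allows."""
    picked: List[str] = []
    used = 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_words:
            break  # stop at the first sentence that no longer fits
        picked.append(sentence)
        used += n
    return " ".join(picked)
```

A baseline like this works well exactly when the article follows the inverted pyramid, and returns little of value when the opening is scenic. Note that it returns an empty string if the very first sentence already exceeds the budget.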

As shown above, one of the SR tests produced some very encouraging results. A complicated chain of steps: 1) transcription, 2) translation, 3) summarisation, can actually work and provide an almost perfect summary, if the start of the text is so matter-of-fact that it transcribes well, and if the summary format is not too long. However, the more complicated the text is – with people's names and specialised terms – the more confused the transcription gets and that also becomes highly detrimental for the summary results.

The obvious observation – that short, newsy articles are easier to summarise than long, creative ones – could perhaps lead to the hasty conclusion that such summaries are superfluous. But having an AI-tool provide short and accurate summaries even from a compact news article could serve multiple purposes, like saving time by excluding this step from the reporter’s workflow; surfacing the summaries as teasers on other pages; or converting news text into audio snippets to be consumed on smart speakers. The SR case also points to a potentially very exciting use case: to use automated summaries of translated (and transcribed) news pieces as a way of opening up your content to an audience that does not speak the original language.

5.2 Grammaticality

The tools we experimented with are mainly extractive – which means they select the sentences that the algorithm evaluates as the most representative. As a consequence, the grammar, which is supposed to be correct in the original text, is also correct in the summary. This conclusion holds, whatever tools or formats we analysed, as long as the tools are used in the language they are trained for: English for Agolo and German for BR.

The most frequent errors we found are due to translation. The issue here is not only a matter of translation quality, but can also be related to specific typographic signs, like the German quotation mark, which is not correctly recognised by Agolo. However, besides translation, a few other limitations are worth mentioning.

The first issue happens when quotes are used as part of the summaries. When a quote is used in the text, the grammaticality is often not as good as in the surrounding text written or reported by the author. This is especially true if the text is an audio transcript. In a radio piece, the reporter uses a written script, which makes it easier for the transcription tool to grasp it and to reproduce it in good grammar. But quotes from interviewees are often less cohesive.

This audio story is an example. This Swedish audio clip (which in English translation is entitled: “Home quarantine - this is what you should keep in mind”) was transcribed to Swedish text, then translated with Google Translate. The first two Agolo bullets are very good from a grammatical point of view, considering the whole text has been automatically transcribed and translated:

Many companies are now asking their employees to work from home or put themselves in home quarantine if they have traveled to places with many confirmed cases of the coronavirus, covid-19.
Gunilla Ockborn is an infection control doctor in the Västra Götaland region and she believes that hygiene is important to avoid the spread of infection.

But as the third bullet is based on a quote, the grammatical structure becomes poor:

If you have access to several toilets, then maybe you can use that the person who has symptoms has a toilet and this is still required to make sure to take care of their hand hygiene.

One reason for this is that the word-by-word Swedish transcription – which in itself is an accurate representation of what the interviewee says – is not good in the grammatical sense. That affects the translation and hence the third bullet.

In conclusion, grammar is by far the parameter that received the highest score among all the criteria we used to evaluate the tools. The grammaticality is generally good because of the extractive nature of the tools. When whole sentences are extracted and glued together with others, this sometimes creates grammatical problems in the new context. Quotes are generally detrimental to the grammatical quality of the summaries as they have a weaker grammatical composition to start with, a problem made worse if the quote is automatically transcribed.

5.3 Journalistic text quality

Journalists write their stories following specific guidelines, adhering both to genre conventions and their newsroom’s particular style. The classic news article, for example, leads with the most important facts and ranks further information depending on importance. This creates a “per newsroom” definition of quality and makes criteria between the study participants hard to compare. Different genres of articles also played a huge role in defining quality differences. For example, news-style text was generally closer to the human benchmark with the tools at hand. The essential question is how usable the output of the summarisation tool is. Can it be used as it is or is further editing required? How close is the result to the human benchmark summary?

As mentioned in our study setup, this category also considers both coherence and cohesion on top of the general usefulness of the summary in journalistic contexts. Cohesion refers to the creation of meaning on the level of single sentences. Cohesion was not a problem at all with the tools tested in our setup. The extractive models keep the original sentences intact. Therefore, meaning is rarely lost from deconstructing or re-writing single sentences. Coherence is much more relevant to the topic of journalistic text quality. We found many cases where meaning was lost because the summarisation tools strung single sentences together in a way that obscured logical connections and the grammatical subjects of the sentences. This happened a lot when pronouns were used instead of the subject.
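A lightweight editorial safeguard for the pronoun problem described above could be to flag summary sentences that open with a pronoun, so that a human editor checks whether the reference is still clear. The sketch below is a heuristic for review purposes, not coreference resolution. The pronoun lists are small, hand-picked assumptions and would need extending for real use.

```python
import re
from typing import List

# Assumption: short illustrative pronoun lists per language code.
PRONOUNS = {
    "en": {"he", "she", "they", "it", "this", "these", "those"},
    "de": {"er", "sie", "es", "diese", "dieser", "dieses"},
}


def flag_pronoun_openers(summary_sentences: List[str], lang: str = "en") -> List[int]:
    """Return the indices of sentences whose first word is a listed pronoun."""
    flagged: List[int] = []
    for index, sentence in enumerate(summary_sentences):
        # Skip leading quotation marks or other punctuation before the first word.
        match = re.match(r"\W*(\w+)", sentence)
        if match and match.group(1).lower() in PRONOUNS[lang]:
            flagged.append(index)
    return flagged
```

The check will produce false alarms, for example when the pronoun's referent appears in the preceding summary sentence, so it is meant to direct an editor's attention rather than to reject summaries automatically.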

Agolo – speakable summary

Agolo’s summaries scored well with Der Spiegel in this category. Minor deductions happened for unsuitable sentence transitions. One otherwise good summary shows one of these coherence issues with sources. The source, Ricarda Winkelmann, shows up unintroduced and out of context:

"It starts slowly – and does not stop for centuries: New simulations show how massively the melting of Antarctic ice is changing the planet. Only one measure would help. " Suddenly," says Ricarda Winkelmann, "it became dark outside the cabin window."

For BR, Agolo served up two perfect speakable summaries and three average ones from long texts that generated no output before that. This is impressive because the texts used by BR were very different from the news texts that the tool is optimised for. As with Der Spiegel above, Agolo pulled out two sentences mentioning sources in the BR summaries that were unrelated to the previous sentences from the lead and did not introduce the respective sources.

For TX Group, the speakable summaries did not work well. They scored lower on the factual rating and contained some repetitions and unclear formulations. While the first sentences were often close to or matching the human summary, the next sentences deviated from the editorial expectations.

SR also achieved the highest score in one case. They took the audio file from this story: Denmark closes borders as a result of coronavirus outbreak. This 36-second clip was then transcribed using the SR transcription tool which uses an NLP-model from Speechmatics. The Agolo tool produced this short summary:

“At twelve o'clock today, Denmark closed its borders to travelers from Sweden in order to prevent further spread of the Corona virus. Swedish citizens planning to fly from Copenhagen's airport, Kastrup will not be allowed in and trains will not be allowed to cross the bridge between the two countries.”

This summary is relevant, it captures the most important facts and it is correct – if not perfect – grammatically. We can see that it is close to the manual summary done for benchmark purposes:

“Today Denmark closed its borders to travellers from Sweden in order to prevent further spread of the coronavirus. Roadblocks have been set up on the bridge to make sure only Danish citizens proceed across the bridge.”

Jagran New Media made an interesting observation on summary text length. They compared Agolo’s recommended 150-word setting for the speakable summary with the 50-word setting of our study. The 50-word speakable summary did not work properly for them, and they were happier with the 150-word length. The editorial team suggested that the Agolo tool could be used selectively: 5 bullet points and 150-word speakable summaries if the content is longer than 500 words, and 3 bullet points and 50-word speakable summaries if the content has fewer than 350 words. This shows again how much input well-suited to the summarisation model matters.
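The selective approach suggested by the Jagran New Media team can be written down as a small rule. The sketch below is an illustration only: the function and key names are assumptions and do not correspond to any Agolo API, and since the suggestion leaves articles between 350 and 500 words unspecified, the function returns `None` for that range rather than guessing.

```python
from typing import Dict, Optional


def pick_settings(word_count: int) -> Optional[Dict[str, int]]:
    """Choose summary settings from article length, following the suggested rule."""
    if word_count > 500:
        return {"bullet_points": 5, "speakable_words": 150}
    if word_count < 350:
        return {"bullet_points": 3, "speakable_words": 50}
    # 350 to 500 words: not covered by the suggestion, left to editorial choice.
    return None
```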

The conclusion from this specific test is that AI-summaries from automatically transcribed audio can be excellent when it comes to journalistic text quality. However, there are some very important preconditions: the text has to be similar to what the model was trained on, and the transcription/translation should be of good quality without any substantial errors. In some of these cases tested at SR, it was the automatic transcription and translation that led to significant errors that negatively affected the journalistic text quality. To get the required text quality, it helps greatly if only one person is reporting in the story, and if no additional voices from experts with complicated names and titles are included.

Agolo – bullet-point summaries

Bullet-point summaries were a little tricky to extract from the longer articles: the bullet points ended up being too long, too. This was the case at Der Spiegel and BR, where the texts exceeded the recommended length for the tool. TX Group concluded that the bullet summaries contained some repetitions and unclear formulations, but the human edits required were limited.

For the shorter texts at Jagran New Media, bullet summaries worked well and quickly fetched three lines from the original stories, regardless of their length. All these lines were relevant and in context with the sense of the article. When Jagran New Media tried the tool with the option of 5 bullet points instead of 3, the tool worked well and the outcome was still excellent.

For BR, one text on the Ischgl research scored a perfect rating by presenting the research and its result very well in just 3 bullet points:

Reporters have analysed more than 4,000 Instagram posts that were posted in Ischgl between the end of February and the beginning of March; the timespan when the first holiday-makers were probably infected and after which several Corona cases were confirmed.
Tourists who might have contracted the virus in Ischgl could have spread it all over Europe, for example to Great Britain, Iceland, Poland or the Czech Republic.
An especially large number of Instagram users in Ischgl then went to the Netherlands and Belgium, to Switzerland, Scandinavia - and, in particular, to Germany.

Agolo also highlighted an important difference from speakable summaries that is relevant to this category: whereas speakable summaries are more focused on coherence than the bullet points, the latter are more optimised towards covering a broad range of information from the text.

Quotes

With quotes, some parties had issues with extraction in longer texts – or because of the different styles of quotation marks used in different countries and journalistic traditions. The Agolo tool version is designed to pick up the normal quotation marks in English: single (‘English’) or double (“English”) quotation marks, also called inverted commas.

The German quotes are marked with German quotation marks, as illustrated here:

"Professor Gigerenzer puts it this way: „Personally, I think we need a change here. The change must be in that to give consumers the right to understand how their value comes about.“

This could easily be adjusted for paying customers as opposed to our free trial, Agolo told us. Otherwise, extraction worked well and assigning a rating for summary quality makes no sense for quotes because they are not meant to capture the meaning of the whole text. The quote feature was not present in the other tools we tried.
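To illustrate why the style of quotation marks matters, the sketch below extracts quotes with a pattern that also recognises the German low-high marks. It is an assumed, simplified approach and not how the Agolo tool works. Single quotation marks are deliberately left out here, because the closing single mark is identical to the apostrophe and would produce false matches.

```python
import re
from typing import List

# One alternative per quotation style; written with escapes for clarity.
QUOTE_PATTERN = re.compile(
    "\u201e([^\u201c]+)\u201c"    # German: low opening mark, high closing mark
    "|\u201c([^\u201d]+)\u201d"   # English: curly double quotation marks
    '|"([^"]+)"'                   # straight double quotation marks
)


def extract_quotes(text: str) -> List[str]:
    """Return the quoted passages in the order they appear in the text."""
    quotes: List[str] = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one group is filled, depending on which style matched.
        content = next(group for group in match.groups() if group is not None)
        quotes.append(content.strip())
    return quotes
```

The German pattern is listed first on purpose. The German closing mark is the same character as the English opening mark, so the text is scanned left to right and a German quote is consumed as a whole before the English pattern can mistake its closing mark for the start of a new quote.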

BR tool

Der Spiegel tested the BR prototype with background articles on climate change and saw medium to high ratings for journalistic text quality. Deductions occurred in some cases due to unsuitable sentence transitions or a too-high level of detail. This is one example of too much detail, which makes the summary hard to understand for readers without expert knowledge about climate models, as explained in this article:

"Das Kürzel RCP 8,5 steht für ein Worst-Case-Szenario. Das RCP-8,5-Szenario sei am besten geeignet, um den Klimawandel in den vergangenen und den kommenden Jahrzehnten bis 2050 zu beschreiben, erklären Schwalm und Kollegen."

BR saw logic suffer in some cases from missing coreference resolution when the subject of the sentence is unclear as in this example. “Sie” refers to parents, which is not understandable from the summary:

"Der Kinderbonus in Höhe von 300 Euro pro Kind ist einer der Maßnahmen im Rahmen des Corona-Konjunkturpakets. Sie erhalten nun auch den Kinderbonus - ohne dazu einen gesonderten Antrag stellen zu müssen. Wann wird der Kinderbonus ausgezahlt?"

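A full coreference resolver is beyond a short example, but the problem in the example above can at least be flagged for an editor. The sketch below is a simple heuristic and not part of the BR prototype: it marks summary sentences that open with a pronoun, since these often depend on a sentence that was dropped. The pronoun list and the function name `flag_dangling_pronouns` are assumptions for illustration, and the check will produce false positives (for example, sentences starting with an impersonal "Es" or "It").

```python
import re

# Assumed, deliberately small list of sentence-opening pronouns
# (German and English) that may point to a missing antecedent.
LEADING_PRONOUNS = {"sie", "er", "es", "they", "he", "she", "it"}

def flag_dangling_pronouns(summary_sentences):
    """Return the indices of sentences that start with a pronoun."""
    flagged = []
    for index, sentence in enumerate(summary_sentences):
        first_word = re.match(r"\W*(\w+)", sentence)
        if first_word and first_word.group(1).lower() in LEADING_PRONOUNS:
            flagged.append(index)
    return flagged

summary = [
    "Der Kinderbonus in Höhe von 300 Euro pro Kind ist einer der Maßnahmen im Rahmen des Corona-Konjunkturpakets.",
    "Sie erhalten nun auch den Kinderbonus - ohne dazu einen gesonderten Antrag stellen zu müssen.",
    "Wann wird der Kinderbonus ausgezahlt?",
]
print(flag_dangling_pronouns(summary))  # flags the second sentence (index 1)
```
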
Coherence problems showed up more often in the listicle text summaries, where the logical connection between individual sentences went missing. Overall, this led to medium to low ratings for the summaries. In the best cases, a text would be usable with minor edits. This is an example from a text about Covid restrictions in Germany and Sweden:

"Die Auflagen, die in Schweden gelten, werden von den Behörden des Landes auf einer Kriseninformations-Seite veröffentlicht. Schweden hat verglichen mit Deutschland relativ viele Todesfälle in der Corona-Pandemie zu beklagen. Besonders in die Kritik geraten ist, dass in Schweden viele alte Menschen in Pflegeheimen gestorben sind."

This summary is, apart from the uninformative first sentence, quite good.

5.4 Usability as a teaser

Our testing showed that summarisation tools can be useful to augment the production of teasers – but only in very rare cases are automated summaries ready to use as they drop out of the machine. There are different reasons for that, but the most important one is that writing teasers is a relatively sophisticated journalistic skill. Depending on the style of the publication, teasers should ideally summarise the article to a certain extent – but usually not completely, as they should also serve as a cliffhanger to guide readers’ attention to the actual text. This is very difficult to achieve with an automated text – be it the result of an extractive or an abstractive model – and even more so if there is automated translation in between.

When we consider possible use cases, our testing results are pointing to augmenting existing workflows rather than replacing them entirely. But automated summaries would still be an effective tool, for example for SR, where audio clips accompanied by teasers are at the center of the company’s news strategy. These teasers are today compiled manually so a workflow where an AI-model could suggest three bullet points out of a transcribed audio file could save significant time, even if a human editor had to go over and correct those bullet points. It could also be interesting to summarise audio segments that are cut out from the linear programming.

Going by our testing results, newsrooms still need an editor to shorten Agolo’s bullet-point summaries, as the tool mostly produced bullet points that were too long. This happened mostly when the input exceeded the recommended word count. Often, quotes were included in the bullet points, which made the selection incoherent. Also, in some cases the choice of facts fell short of a human ranking, so editors would have to check whether all the important facts are included. This last aspect points to one of the biggest issues with using automated summaries: shortening a summary or checking the text for coherence is, in most cases, a task that still makes the workflow more effective than drafting the text yourself. But having to check against the original source whether everything important is included perhaps makes the augmented workflow less efficient than the manual one.

This issue might be eased by using the tools only for factual news stories with a certain length. We found that both BR’s prototype and Agolo work best with short, factual and single-voiced stories, such as SR’s Denmark closes borders as a result of coronavirus outbreak, which was the only article that got a high score in the category “useful as teaser”. Here, Agolo produced three bullet points that would work well as a teaser, with the minor reservation that the last one is a bit vague:

At twelve o'clock today, Denmark closed its borders to travelers from Sweden in order to prevent further spread of the Corona virus.
Swedish citizens planning to fly from Copenhagen's airport, Kastrup will not be allowed in and trains will not be allowed to cross the bridge between the two countries.
Roadblocks have been set up on the bridge, and Danish police are checking the occupants of any vehicles to make sure.

The speakable summaries got medium rankings in most cases. Some of them were just too short to be used as teasers and contained too little information to attract readers, like the one based on this article on technological solutions to tackle climate change causes:

The climate goals can only be achieved if we actively remove CO2 from the atmosphere.

There were also speakable summaries with a good representation of facts. They had the issue that they worked perfectly as summaries but not as teasers as they contained no incentive for the users to read on. A good teaser adds a “meta dimension” to the facts, i.e. hints at a compelling reason to devote time to the piece, and this is for obvious reasons not something that the extractive AI-summariser does or pretends to do.

The BR tool got a wide range of ratings, from very useful to not very useful summaries – mostly depending on the genre of the text. In most cases, a lower score was caused by a special article format or genre that led the tool astray, so that scenes or side issues were included in the summary. Again, the best ratings occurred with traditional news texts.

5.5 Delta to publication

The use of AI-driven solutions in newsrooms is not a goal in itself. Development and implementation efforts are only a good investment if processes can become more effective as a result. Therefore the delta, or pathway to publication, is key when it comes to assessing tools and use cases. As demonstrated in this evaluation, even established AI systems need intensive training or manual support to meet the requirements of journalistic summaries. This means that we must address the expectations towards AI tools.

The most important conclusion to start with: don’t expect AI-driven systems to autonomously produce and publish summaries. There are several, mainly quality-related and legal reasons not to aim for this. For example, extractive systems could combine relevant sentences in a way that creates problematic contexts, as we have observed in our evaluation. Abstractive AI summarisers could in turn generate new contexts, or use misleading terms because of biased training datasets. Especially with sensitive or controversial topics, both approaches could create legal issues.

Of course, AI tools can still be useful within the journalistic workflow. For processes that aim at publishing summaries, we consider hybrid approaches to be most valuable. Journalists would have to carry out quality control before publication and a certain degree of editing would be necessary. Ideally, this should be combined with feedback loops to further train the AI algorithms.

Besides publication, we consider many more use cases as relevant for AI-generated summaries [see Chapter 7 on potential future use cases]. In cases such as archive management or research support, a less perfect summary might be acceptable and even valuable. Summaries slightly missing the high expectations towards journalistic teasers would work in these use cases, although context-related quality criteria like the capturing of facts would still be crucial in these settings.

So the minimum quality required depends strongly on the application context. The authors of this study represent media companies with high editorial standards. But other outlets with different standards might be prepared to accept lower quality or publish automated summaries with a disclaimer. Therefore, it seems advisable to first clearly define this setting before looking for and training an AI-driven system for it. Given that alternating genres pose huge challenges for these tools, we would also recommend clearly defining the format and focusing on news in the beginning.

In summary: AI-driven summarisation tools can only help us to solve problems if we are prepared to define them consciously, and closely guide the learning process. We should not expect them to be autonomous ghost-writing robots, and most of all we should not – being serious publishers – hand over publishing to them. Only if we commit ourselves to hybrid processes can we really leverage the potential of AI-driven applications.

Summaries of articles can be used in many areas of the journalistic production process. During investigations, they could enable editors to grasp external sources and those in their own archives more quickly. They could also support the meaningful interlinking of articles and the management of personal recommendations. And they can be used to enrich the reporting with background information and provide an overview of existing content in a specific field of reporting. To evaluate this user-facing approach, we have conducted a limited experiment in the field.

For twenty days, in October and November 2020, we automatically inserted different summarising formats into articles published on the SPIEGEL website and measured both their performance and that of the articles. Our main findings are:

In the examined setting, the summaries had no measurable, significant effect on relevant performance metrics of the surrounding article or the website as a whole.
Only very rarely were the elements used as links to visit the background articles. There is no detectable difference between the four formats in our setup in this respect.
Despite or because of their compactness, some quotes and questions do particularly well in comparison. They could be especially useful for engaging readers with a structured journalism approach.

6.1 Setup of the experiment

To implement automated summaries in the editorial process effectively, you need a topic area that includes both in-depth background articles and current news of great relevance. Climate change has proven to be the ideal experimentation field for this part of the study. A new editorial section was recently launched within the SPIEGEL website – dedicated to climate change reporting – perfectly fitting the timeframe of our experiment. As a result, we could expect increased reporting from different topic areas, which would be accessible via this thematic focus.

Together with the science department, five in-depth evergreens were selected, each focusing on a different area within the topic of climate change. For each evergreen, four snippets were created manually: a continuous text summary, a bullet-point summary, a quote, and a question. The resulting twenty snippets were built into the SPIEGEL digital layout in a clean, unobtrusive design. They had a different background color, so users could easily distinguish them from the surrounding content.

Links to the background articles allowed users to dive deeper into the evergreens, or at least made them aware of this possibility. In their wording, the snippets invited the user to study the evergreens, but the summaries were coherent and not formulated in a particularly engaging way.

The setup of the test: At fixed positions in all new articles assigned to the topic special, the twenty snippet variants and one version without a snippet were placed in rotation. To avoid collisions with other fixed article elements or advertising positions, the summaries were placed after the twelfth paragraph of each article. For this reason, very short news pieces could not be included in this test. We measured both click-through rates to the evergreen articles and performance metrics regarding the enhanced articles. The experiment took twenty days. We used our standard toolset for A/B-testing to conduct the experiment: Adobe Target, which is part of the data management platform underlying the SPIEGEL website.

The following example gives an impression of the different formats, as they were visible in a guest opinion piece by two German activists, discussing the necessity of a system change in the face of the climate crisis.

Figure (6 Snippets_A): An example of a summary (blue background) integrated into an opinion piece about the climate crisis on the SPIEGEL website.

6.2 Results and evaluation

In the twenty days of the experiment, about 876,000 instances were recorded where users read an article that included one of the summaries. In 20% of these cases, the control variant without a snippet was displayed; each of the other variants accounts for 4% of the views. This scenario allowed us to assess the effect of the summaries, questions and quotes compared to an article without these snippets.
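The traffic split described here can be sketched as a weighted random assignment. This is an illustration only: the experiment itself ran in Adobe Target, and the variant labels and function names below are hypothetical. Only the shares (20% control, 4% for each of the twenty snippets) and the approximate total of 876,000 views come from the text.

```python
import random

# Hypothetical labels: 5 evergreens x 4 formats, plus one control variant.
FORMATS = ["continuous_text", "bullet_points", "quote", "question"]
VARIANTS = ["control"] + [
    f"evergreen{n}_{fmt}" for n in range(1, 6) for fmt in FORMATS
]
# Traffic shares in percent, as reported: 20% control, 4% per snippet.
SHARES = [20] + [4] * 20

def assign_variant(rng):
    """Pick one variant for a page view according to the traffic shares."""
    return rng.choices(VARIANTS, weights=SHARES, k=1)[0]

def expected_views(total_views):
    """Expected number of views per variant for a given total."""
    return {v: total_views * s // 100 for v, s in zip(VARIANTS, SHARES)}

views = expected_views(876_000)
print(views["control"])                # 175200
print(views["evergreen1_quote"])       # 35040
print(assign_variant(random.Random(0)) in VARIANTS)
```

With roughly 35,000 expected views per snippet, each individual variant had a reasonably large sample, even though the snippets together were split twenty ways.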

On a micro level, we can count how often the snippets were not just read, but also used as a link to background articles. This happened in only about 750 cases, or 0.09% of all recorded views. So a first conclusion we can make is that it was not possible to create a significant, deeper interest in the background articles. Nevertheless, the better internal linking of articles on a news site might be useful, for example to increase the visibility of the content for search engines. However, we have not investigated this further in our experiment.
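The click-through figure follows directly from the two rounded numbers given in the text. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def click_through_rate(clicks, views):
    """Return the click-through rate as a percentage of views."""
    if views <= 0:
        raise ValueError("views must be positive")
    return 100.0 * clicks / views

# About 750 clicks on about 876,000 recorded views (both rounded in the text).
rate = click_through_rate(750, 876_000)
print(f"{rate:.2f}%")  # 0.09%
```
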

Given the low level of activation, did certain variants work better than others in guiding readers to the evergreens? Not on a systematic level, as the evaluation of the data shows. Bullet-point summaries, continuous texts, quotes, and questions are widely spread over the ranking. But for further projects, it might be helpful to shed light on the two snippets that worked best.

1) The most effective was a quote on the impact of climate change on weather extremes in Germany: „Der Winter wird kürzer, das Frühjahr und der Herbst werden länger.” ("The winter becomes shorter, the spring and autumn become longer.”). Presumably the proximity of the audience and the personal involvement with the phenomena described here made readers want to learn more about the background. A total of 61 readers clicked on the link to this article.

2) The second most effective instance was a question on the effects of melting ice at the South Pole: „Wie hoch würde der Meeresspiegel ansteigen, wenn das Eis am Südpol komplett taut?” (“How high would the sea level rise if the ice at the South Pole thaws completely?”). Again, a personal involvement is at least suggested here. Besides, readers can expect a concretely conceivable scenario when clicking on the link. 50 readers followed the link to the evergreen.

At first glance, these examples illustrate that well-known mechanisms of attention-raising and relevance perception in journalism are strongly influencing this experiment. They show how well quotes and questions can work in condensing content and attracting readers. Given that newsrooms have already established article leads as formats aiming at exactly this, it could be concluded that very short snippets like the ones above could also be helpful in this context.

We could also compare how the short summaries influenced the performance metrics of the articles in which they were integrated: No significant effect was noticed. The completion rates of users in articles with or without snippets do not show significant differences. In both clusters, about 35 percent of the users reached the end of the articles. Also, the time users spent on the page did not vary significantly. A mean time of nearly three minutes was recorded for both clusters.

On the macro level, we see the same consistency regarding the visiting time on the whole website. No matter whether they were shown snippets or not, readers of the climate change articles spent an average of about 980 seconds on the website. Only for the number of articles read within this timeframe, a slight difference between the clusters can be measured. But since both values are close to the overall average of 5.5 articles per website visit, this difference hardly seems significant.

These observations show that elements such as those tested in this experiment do not appear to have a directly measurable effect on the use of an article or website. To investigate how the summaries are perceived by readers, further qualitative studies would be necessary. At this point we can at least report that no negative feedback on the elements has been expressed by the editorial staff or by the readers.

The mechanism of an automated integration of evergreens demonstrated here could also be used, for example, to insert personal recommendations or for search engine optimisation. These relevant use cases were not directly investigated in this study. However, as online journalism becomes increasingly structured, such elements would in any case only be a very first step to a more comprehensive user experience.

Using summaries as teasers is just one possible use case. There are many more – already in production or on the horizon – changing the way we are producing, curating and consuming media. Talking to product managers, reporters and editors in our newsrooms and archives, as well as other experts, we found other interesting use cases and possible reference points in the still uncharted territories of structured journalism. We want to use this chapter to open up to further possibilities although we do not claim it to be comprehensive – other ideas might be out there!

7.1 Tailoring news feeds and aggregating

For David Caswell, Head of Product at BBC News Labs, automated summaries are a tool of efficiency for both newsrooms and users: “It’s efficient for consumption AND for production.” BBC News Labs built their own prototype which is meant to cater to both sides and can break articles down into two kinds of summaries: a set of bullet points and a collection of captioned images for Instagram stories. It is being built for integration in a new CMS to produce different user experiences for news – an increasingly important use case for summaries.

The Labs' approach is trying to tackle what aggressive aggregators are getting right: serving users the right news format in the right situation, ideally by default and with the help of extended user metrics. How much news users consume depends on both their preferences and their physical context (on headphones, commuting etc). Caswell cites the highly successful Chinese news app Toutiao as an example of using machine learning to serve users who especially enjoy lists with an overview of the news, as opposed to summaries or full-page swipe selections for the reader who prefers a deeper-dive experience. Automatically generated summaries can serve as a tool to produce this kind of tailored news feed for users.

In his book “Newsmakers - Artificial intelligence and the Future of Journalism”, Francesco Marconi names an example from Bloomberg. The Bulletin was launched as a feature on the agency’s mobile app in 2018, and according to the company it “provides a summary of the top three stories of the moment, leveraging machine learning and AI technology to generate a single sentence summary of the most important articles within Bloomberg’s global news network.”

Alongside the fully automated news aggregators, like Google News and Apple News, other companies are offering a human-curated aggregated news service, like Swedish Omni (owned by Schibsted) and American Newser. Their production model is based on the scraping of hundreds of different news websites and manually summarising the most interesting hand-picked stories, as well as linking back to the original source. Including a summarisation tool in their workflow could enhance efficiency and speed.

7.2 Easing workflows

We can’t stress enough that any AI tool can only be as good as the workflows it is embedded in. This is decisive for any AI tool used in the news business, but especially with automated summaries where results are not ready to be published directly without some form of human editorial oversight or intervention. BBC’s David Caswell says they would not publish directly from their summariser prototype, “though we’re very happy with the output. A semi-automated workflow is also our approach for NLG and some machine learning outputs at the BBC.”

How much additional editing is needed depends largely on the quality of the generated summaries. Some workflows can be augmented by summarisation tools, even if quality is not good enough for a machine-only publication – and there might even be a benefit from a human-machine collaboration that trains the algorithm continuously to improve summary quality in the long run and reduce the need for human help and supervision.

At Swedish Radio a possible use case would be to install a semi-automated workflow to write the three bullet points heading each SR news clip on the website and in the app. That would allow reporters to file an audio piece, let the AI summarise it, and then manually check and edit the result. “This could save significant time to the reporters and the accuracy might increase as the model is trained over time,” says Olle Zachrison, part of our team and Head of Digital News Development at SR.

Summaries might also help to consolidate existing news workflows. At Bavarian Broadcasting, Katharina Kerzdörfer, who is heading one of the regional news departments, has the mission to create a new single workflow for news briefs that are currently produced separately by several news departments of the broadcaster. The idea is to find a workflow where the fact-checking and writing of the news pieces is centralised and then flowing from there to the different departments and channels of the broadcaster. “This helps to reduce redundancy and potential sources of error and frees up more resources for in-depth reporting,” says Kerzdörfer.

She imagines integrating automatically-generated summaries in the future workflow at two steps of the news process. Firstly, as a reporter tool to help colleagues sift through longer texts, agency material or analyses, and decide if there is enough news material. Secondly, as a production tool to ease the writing of shorter news pieces from longer versions. Katharina Kerzdörfer stresses the importance of designing a human-machine workflow and of taking care of the journalistic due diligence rules: “but as an efficiency tool we’re up to experiment with it, if it allows us to have our hands free for real reporting,” she says.

7.3 Enhancing SEO, metadata and search – for newsrooms and users

Summaries can be a useful tool to make texts, audios and videos more searchable and add metadata to existing media objects. This helps media outlets prepare their infrastructure for a future when news production becomes more atomised and integrated around data and diverse platforms, as well as enhance their SEO approaches.

Swedish Radio (SR) engineers Tobias Björnsson and Carl-Johan Rosén are looking into pairing AI-powered transcription and Natural Language Processing with automated summaries. This could be used to extract summaries out of long streams of live radio, or to enhance the searchability of audio segments: “The summary would be presented in the search results of audio transcripts for users, as well as for the newsroom. We would also like to pre-populate our CMS with short, summarised articles from transcribed audio clips.”

Gaby Wenger-Glemser, Head of BR’s Documentation and Research Services, is working on solutions to extract information from media objects and is looking for more atomised forms of summaries such as extracting quotes, persons, and topics. Ideally, she would like to connect those ‘atoms’ as search results and present to the newsroom lists of quotes related to persons and topics. “This goes into the direction of knowledge graphs, but there is still a way to go”, Wenger-Glemser says. Her team is experimenting with extractive summarisation models and she would like to use them to extract topics, titles and segment subtitles of audio and video transcripts, to prepare media objects for segmentation in the future.

Metadata also plays a crucial role in another application scenario. The Science Media Center (SMC) in Germany provides journalists with statements, fact sheets and background information on current scientific news events. In their data lab, they have started a research project around structured journalism, aiming to make curated information and expert assessments more valuable in the long term using metadata. Both the SMC editorial team and journalists in external newsrooms could access constantly updated overviews of scientific studies and contextual information.

Automated summaries could help both groups to capture large amounts of scientific information – if the quality of the output is sufficient. Especially in the field of up-to-date science journalism, current news poses a particular challenge for AI-driven tools, as those articles are often fundamentally new and also require a high degree of precision.

The idea of atoms and segmentation goes in the direction of Structured Journalism, where media objects are broken down into smaller parts, enhanced with metadata and re-combined to form new media products. David Caswell emphasises that text summaries, especially abstractive models, are “not structured – because they are representations of entire articles, not editorially-specific extracts from them.” But extractive models, or atomised information out of media pieces, are part of the narrower definition of Structured Journalism, and even abstractive summaries can enrich metadata of media objects. The infrastructure behind Structured Journalism approaches, more precisely cloud-based content hubs with segmented media objects enriched with metadata, is crucial for any kind of personalisation in the future – and automated summaries are part of it.

7.4 Using third-party content

An example from the JournalismAI Collab shows how scraping – the technical term for the automated gathering of information from websites – combined with summarisation can add value to the newsroom as a sourcing tool and, in case of publication, as a public service algorithm.

For some years, the Swiss TX Group has been looking to create a sourcing tool complementary to news agencies. One use-case consists of covering local or even hyperlocal news. Sourcing is one of the first steps in the editorial value-chain before content production and publication. The motivation behind this project is that high-quality sourcing provides the newsroom with a wider range of ways to get and keep the readers’ attention.

The idea of TEX, the tool TX Group currently uses, is to collect available local information: crawling communities, associations, and other local websites and social media and then deliver this information in a consumable way to the newsroom. The first experiments showed that the way this information is consumed must be efficient. The quantity of material collected by TEX can become overwhelming. Besides an efficient search engine enhanced with entity and category extraction, the developers want to offer the newsroom a snackable summary for each collected source content. If the summary collects most of the facts embedded in the document, and even if the style and grammar are not perfect, a journalist can quickly go through a high quantity of documents to get a sense of what is relevant to the topic they are interested in. This combination of scraping data and summarising text can be used for any kind of alert or research tool for newsrooms and can inform the investigations of reporters and editors.

The same workflow could also boost live reporting by integrating content from other sources or user-generated content in summarised form into live blogs – reviewed by a human editor or reporter before publication, as set out above.

7.5 Using translated content

The combination of automatic translation and summarisation could in the future be interesting to news outlets with an international or multi-lingual audience.

Swedish Radio tries to serve its diverse audience in different languages. Currently, SR publishes content in around ten languages other than Swedish, with Arabic, English and Finnish being some of the most important. Daily summaries of the main national and local news in those languages could dramatically open up SR content to a larger audience. This improved accessibility could play an important civic role for communities that don’t have Swedish as their native language and would be at the core of the public service mission of a broadcaster like SR.

This would also be an interesting use case for multilingual newsrooms such as Jagran New Media, which produces content in English, Hindi and ten other Indian languages, with a total daily production of 3,000 stories in Hindi languages from 30 different newsrooms across the country.

7.6 Emerging offers by commercial players

To identify potential use cases, it is valuable to look at how commercial firms offering summarisation services position themselves and where they see enough demand or relevance to create solutions. In their marketing, Agolo, for example, presents several different use cases with partners from the publishing industry, like Forbes and AP:

Creating easily digestible content. Summaries of longer texts can be used in multiple ways. They can be used for on-page summaries to give readers a quicker grasp of the full story, to use as teasers on social media, or in newsletters. Forbes has a large editorial team of nearly 2,000 staff and freelance writers, and so scaling their production of news summaries for use in digital print media was a big challenge. Now, the Agolo summarisation software has been integrated into Forbes’ CMS and is producing automatic summaries at scale each month.
Market Briefings. Personalised market briefings are another service offered by Agolo. Here the tool can summarise financial earnings reports, either for media companies or the financial services industry.
Speakable News Summaries. The growth of smart speakers is driving demand for more audio content in broadcast form. Text-based publishers can use “speakable” summaries to make snippets of their content available via Alexa and Google Home, thus opening it up for audiences on those platforms as well as to “generate new revenue streams”, as Agolo puts it. This is pointed out as one of the primary driving forces behind AP's decision to start using the Agolo service at scale. As smart devices proliferate, AP saw an increasing demand for content that could be voiced, either by a human news-reader or an automated text-to-speech engine. The starting point for such content is a short summary of a longer news article, composed in clear, declarative sentences that are designed to be spoken.

For a news agency covering events worldwide, the volume of stories requiring summarisation can reach into the thousands each day. In early 2019, AP began working on a project employing Agolo technology to automatically generate story summaries to streamline the editing process. It is showing signs that it could deliver a productivity boost for news staffers and save valuable minutes in relaying important news to an array of outlets that require “speakable” summaries for a listening audience.

In a continuing collaboration with Agolo, the AP is now testing summarisation around the versioning of text, generating multiple new outputs from a single news story, such as an executive-style bullet summary, a list of important quoted statements, and more. The additional, automatically generated, outputs are being designed around a variety of news production needs, ranging from Q&A’s to social network posts to graphical text overlays.

Sage Wohns, CEO of Agolo, explains how important it is for their media customers to have the summaries reflect their particular editorial tone and style: "It's critical that news summaries do not try to interpret or modify the intent of the original stories. That's why Agolo uses an extractive approach to summarisation, identifying the most salient points in the story and sharing them in the way they were written.”
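The extractive principle Wohns describes, selecting sentences and reproducing them exactly as written, can be illustrated with a toy example. The sketch below is not Agolo's method: it swaps in a simple word-frequency heuristic, and the stopword list, function names and scoring rule are all assumptions made for illustration. What it does share with any extractive approach is that every output sentence is a verbatim sentence from the input.

```python
import re
from collections import Counter

# Assumed minimal stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "it", "that", "for", "on", "as", "with", "was", "at", "its"}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lower-cased words of a sentence, without stopwords."""
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOPWORDS]

def extractive_summary(text, max_sentences=3):
    """Return the highest-scoring sentences verbatim, in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average corpus frequency of the sentence's content words.
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]

article = ("Denmark closed its borders today. The weather was mild. "
           "Police check vehicles at the borders of Denmark.")
for sentence in extractive_summary(article, max_sentences=2):
    print(sentence)
```

Because nothing is rephrased, the editorial tone of the source survives, but so do its dangling references, which is why the coherence checks discussed earlier still matter.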

Besides Agolo, there is a large array of other offers that we were not able to look at in detail due to the time constraints of our study. We’d like to point again to our list of other free and commercial summarisation resources.

Robots will not ‘take over’ journalism. At the same time, many diverse and promising use cases suggest that we should think about augmented solutions to leverage scaling effects. This is all the more important to consider for publishers given the economic pressure many media organisations are facing and the evolving dynamics of the digital sphere in which our products have to be successful.

We should implement hybrid processes where journalists are assisted by AI. It will be challenging not only from a technical point of view but also from a social, strategic and cultural one. It requires a broad discussion of what lies at the heart of our journalistic mission – and what we can confidently pass on to algorithms. The resulting challenges could and should certainly be described in much more detail. Here we would like to point out a few selected problems in particular.

8.1 Technical challenges

Before setting out to purchase a summarising solution or starting to create your own summarisation system, it is crucial to match summarisation models to specific use cases. Off-the-shelf pre-trained models like the ones used in our experiments did not meet journalism quality standards without adaptation. Careful fine-tuning of the models is required with matching training data. As shown by the collaboration between Agolo and AP, the text genre of the training data matters a great deal. They optimised their model specifically for news texts up to 1500 words and implemented AP-specific style elements.

Getting matching training data can be a challenge in itself. While there are lots of datasets available in English, training data for other languages like German (or Swedish, or Hindi) are harder to find. Licensing is another issue: an annotated corpus that the BR team could have used to implement coreference resolution was available for academic use only. Generating training data from the output of the newsroom itself is time-consuming, but did improve results in the development process of the BR prototype. If newsrooms want to achieve good quality results, they need to be very clear about their input genre(s), output formats (like written or spoken summaries, bullet points, etc.), text lengths, and additional features they want in a summary. These expectations must be communicated to the partner or development team in a way that ensures mutual understanding. Having individuals in the team who speak both “journalism” and “tech” helps greatly. When the system is up and running, it still needs to be integrated into the company’s architecture in a user-friendly way, also taking front-end development into account. Automated summaries can either be published automatically or presented to journalists to support them in their production workflow. The latter seems to be the best solution at the stage where summarisation technology is today.

In any case, the summaries must be stored in the content storage the CMS relies on. For this, both AI and back-end tech teams must be involved. The former creates a technical service which delivers a summary for a given story. The latter calls this service, stores the summary in the CMS, and implements a feature in the CMS so that journalists can edit the summaries. Here the challenges are more organisational than technical. Back-end teams always have a backlog of work to implement, meaning that it’s a matter of priority management. Whether you rely on an in-house developed CMS or a standard market solution makes a huge difference: influencing the roadmap of an in-house CMS is much easier than influencing that of a commercial solution, which must serve multiple customers with heterogeneous requests.
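The division of labour described above can be sketched roughly as follows. This is a minimal illustration, not the setup used by any of the teams in this study: the names SummaryRecord, CmsSummaryStore and first_sentence_summariser are hypothetical, the CMS storage is replaced by an in-memory dictionary, and the AI team's service is represented by a trivial placeholder function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SummaryRecord:
    story_id: str
    text: str
    source: str = "auto"     # "auto" until a journalist changes the text
    approved: bool = False   # nothing is publishable before sign-off


class CmsSummaryStore:
    """Hypothetical in-memory stand-in for the CMS content storage."""

    def __init__(self) -> None:
        self._records: Dict[str, SummaryRecord] = {}

    def save_auto_summary(self, story_id: str, story_text: str,
                          summarise: Callable[[str], str]) -> SummaryRecord:
        # Back-end side: call the AI team's service and store the result.
        record = SummaryRecord(story_id=story_id, text=summarise(story_text))
        self._records[story_id] = record
        return record

    def approve(self, story_id: str,
                edited_text: Optional[str] = None) -> SummaryRecord:
        # CMS feature: a journalist signs off, optionally after editing.
        record = self._records[story_id]
        if edited_text is not None:
            record.text = edited_text
            record.source = "edited"
        record.approved = True
        return record

    def publishable(self, story_id: str) -> Optional[str]:
        # The front end only ever sees approved summaries.
        record = self._records.get(story_id)
        return record.text if record and record.approved else None


def first_sentence_summariser(text: str) -> str:
    """Placeholder for the AI team's service: returns the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."
```

Keeping the summariser behind a plain function argument means the AI team can swap models without the back-end code changing, while the approval flag keeps the journalist in the loop.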

Once this key step is implemented, publishing these summaries requires the involvement of the front-end team to eventually adapt the “more about” box. One key challenge here is to implement the feedback loop which triggers the continuous improvement of the auto-summary tool. The AI team needs to constantly check different variants and measure their performance. Front-end teams have to deliver these KPIs, be it click-through rate or conversion rate, for each published summary. These A/B testing results are then sent to the AI team for monitoring and tuning the algorithms.
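As a rough illustration of the KPI side of this feedback loop, the sketch below aggregates a click-through rate per summary variant from a stream of front-end events. The event format (a variant name paired with "impression" or "click") and the function name click_through_rates are assumptions made for this example; real tracking data will look different.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed event shape: (variant, event_type), where event_type is
# either "impression" or "click". Other event types are ignored.
Event = Tuple[str, str]


def click_through_rates(events: Iterable[Event]) -> Dict[str, float]:
    """Return clicks divided by impressions for every variant seen."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for variant, event_type in events:
        if event_type == "impression":
            impressions[variant] += 1
        elif event_type == "click":
            clicks[variant] += 1
    # Variants without impressions are skipped to avoid division by zero.
    return {v: clicks[v] / n for v, n in impressions.items() if n > 0}
```

A table of this kind, produced per published summary and per variant, is what the AI team would need in order to compare variants over time.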

8.2 Editorial challenges

All parts of the news organisation must have a basic knowledge about the functionality, capabilities and limits of artificial intelligence. To implement AI-driven systems in a meaningful way, journalists have to learn how to collaborate with the product developers and what to expect from them.

This is all the more important because not all work steps of artificial intelligence tools can be fully traced. Self-learning algorithms should only work under human supervision, especially during learning phases. To ensure this, it is not enough to have a few experts ‘somewhere near’ the newsroom. We need widespread algorithmic, data and AI literacy, promoted through workplace communications and training programs.

Secondly, we also have to fundamentally rebuild our editorial processes, workflows and tools in several areas. To meaningfully embed AI-driven solutions, existing routines must be reconsidered. This process needs not only the necessary resources, but also a much better informed understanding of the workflows within the newsroom. From the perspective of many other professionals, journalism still feels like a rather creative process. This is understandable given the historical origins of journalistic work and also has its positive aspects, for example the ability to react to new developments immediately. But if we want to transfer certain repetitive components from these workflows to AI-driven systems, we need to be able to identify and organise them much better than today. To achieve this, management skills, as well as clear roles and processes, would be particularly helpful.

Thirdly, the connectivity for AI tools also extends to fundamental, strategic issues. Algorithms need robust metrics. Learning systems must be told which concrete goal they should optimise their decision rules for. At this point, newsroom management needs to know clearly where to go, which metrics are at the focus of the strategy and what a meaningful prioritisation looks like.

In addition to business aspects, it is also a question of the criteria and ethics that characterise good journalism. For example, do AI-based systems prefer solutions that lead users to more intensive media consumption on one page or a greater variety of visited sources? Only if such questions are clarified at an early stage can collaboration between journalists and AI take place successfully in newsrooms.

The great organisational and cultural challenge is, therefore, to narrow the gap between journalists and engineers. AI is an integral part of modern journalism and technical departments need the editorial eye to assess audience relevance in every step and identify viable use cases. Editors, on the other hand, need to understand that AI-components, like automated summaries, will be a part of journalistic products in the future and the audience will see editorial staff as responsible for that output.

Before getting into the concrete conclusions of our tests, it is worth pointing out a couple of general insights. AI-powered summarisation works and is getting more sophisticated. If trained on relevant and large datasets, the tools can be of significant importance to a modern newsroom, both for internal and audience-facing use. It is possible to customise the tools for the particular needs of your news outlet if you are able to clearly define the concrete use cases. Specific training with consistent material and proper supervision are crucial for the quality of AI-driven results. The knowledge that you can get summaries in different formats – like short, speakable summaries, bullet points, and more innovative formats like automated Q&As – opens up exciting prospects for innovation.

9.1 Summarisation

Optimising the selected solution for AI-summarisation to your specific use case is critical. One main conclusion from the first part of the study is linked to this insight: The tested summarising tools are best at capturing the essentials from the types of content that the AI is trained on. Our tested tools are largely trained on short, traditional news pieces. Consequently, when we ran evergreen content written in this genre or style, we got impressive results. However, we also put the models to much more demanding tests by inserting long feature articles, text automatically transcribed from audio, and even listicles. Although the results from those tests generally scored much lower, they provided us with valuable insights.

In light of the four quality criteria in focus – capturing of facts, grammatical correctness, journalistic text quality, and usability as a teaser – we can highlight some of the main conclusions:

The start of the text is crucial, especially the headline. The extracted sentences were more often picked from the beginning of the articles, and ranked in connection to the headline.
The short speakable summaries tend to be of significantly higher quality than the bullet points, as the logical connection was sometimes weak between bullets.
The capturing of the facts is generally good in the sense that the included facts are not wrong. But we see challenges in capturing the most important facts from longer texts.
The tools generally had problems summarising extensive articles, no matter the genre.
The grammar is the dimension that got the best scores. It was generally very strong due to the extractive nature of the summary models.
The journalistic text quality is heavily dependent on how well a particular summary captures the facts and its grammatical correctness. In general, coherence was more challenging than cohesion, again due to the extractive nature of the tools. When factually and grammatically correct sentences are lined up in a way that obscures connections between them, coherence, and with it the text quality, suffers.
Summarising a text that had been automatically transcribed from an audio file and then translated is a very difficult task: errors in the transcription are propagated to the translation, and then on to the summary. But a short, very factual news piece can be remarkably well summarised even if it has passed through those two complicating steps before.
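To make the first and fifth observations more concrete, here is a deliberately simple sketch of extractive summarisation. It is not the algorithm used by any of the tested tools, which we cannot inspect; it only illustrates the general idea of ranking unmodified sentences by their word overlap with the headline and by how early they appear. The function names and the weighting of 0.5 for the position bonus are arbitrary assumptions.

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def extractive_summary(headline: str, body: str,
                       n_sentences: int = 2) -> List[str]:
    """Pick the n highest-scoring sentences, unchanged, in original order."""
    sentences = split_sentences(body)
    head = tokens(headline)
    scored = []
    for position, sentence in enumerate(sentences):
        words = tokens(sentence)
        # Share of headline words that also occur in the sentence.
        overlap = len(words & head) / len(head) if head else 0.0
        # Earlier sentences get a larger bonus (assumed weighting).
        position_bonus = 1.0 / (position + 1)
        scored.append((overlap + 0.5 * position_bonus, position))
    top = sorted(scored, reverse=True)[:n_sentences]
    # Restore reading order so the extract stays as coherent as possible.
    return [sentences[pos] for _, pos in sorted(top, key=lambda item: item[1])]
```

Because sentences are copied verbatim, grammar is preserved almost by construction, while nothing in the scoring guarantees that the selected sentences connect logically, which matches the coherence problems noted above.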

With these conclusions in mind, let us go back to some of the main questions for this study, like: Is the quality of the summaries good enough for immediate, audience-facing use? The short answer is “no” when it comes to summarising the selected evergreens. Consequently, the reply to “How much editing is still needed?” is: quite a lot, if you look at the wide variety of evergreens in our tests. The structure and style of this premium content is often too complex and editorially creative for the tested versions to grasp. Substantial human editing is still needed if you want to use these shortened versions in the newsroom today.

Having said that, there are a lot of further explorations that we recommend. A natural starting point is of course training the summarisation tools on a larger set that includes evergreens. Our results summarising evergreens composed in a traditional news style were very encouraging, so that would be an easier way forward than training the AI on the most creative formats.

Whatever road you take, one crucial question remains: how do you identify and pick out evergreens on a larger scale? As stated in Chapter 3, this is a complex challenge that we did not take on in this study, but it is an area that needs further exploration, which has to be done with the particular mission of the individual media companies in mind. The evergreen types that are worth resurfacing are most likely not the same for BR in Germany and for Jagran in India, which we also saw in the texts that were manually selected for this test.

One necessary limitation of our study was to focus on summarising text formats. Even the more creative formats we used, like listicles and explainers, were text-based. But digital journalism employs so many more formats. Though challenging, it would be interesting to explore how AI-powered tools can help summarise evergreen content in all genres and formats, including their visual elements. Automatically selecting one or two pictures as a way of summarising a slideshow, or identifying key quotes in a news video, could certainly facilitate workflows in the modern newsroom.

9.2 Integration

In the second part of our evaluation, we involved thousands of readers to gain insights on the direct and indirect impact of including summarised snippets within articles. The testing on the SPIEGEL website worked properly, but no significant effects were detected with the measured performance data. The snippets did not significantly help the surrounding article to be read more intensively. They were also only rarely used as a vehicle for users to reach in-depth background information.
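For readers who want to run a similar comparison on their own data, the sketch below shows one common way to check whether the difference between two click-through rates is statistically significant: a two-sided two-proportion z-test. This is offered as an illustration only; it is not necessarily the method applied to the SPIEGEL data, and the function name and any numbers you feed in are your own assumptions.

```python
import math
from typing import Tuple


def two_proportion_z_test(clicks_a: int, views_a: int,
                          clicks_b: int, views_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference of two rates."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if se == 0:
        # No variation at all (all clicks or no clicks): nothing to test.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A p-value above the chosen threshold (0.05 is a common convention) means the data do not allow you to rule out that the variants perform equally well.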

An explanation for these results can be found for example in the unobtrusive design and positioning of the snippets. To avoid collisions with other elements or ads, the summaries, quotes, and questions could only be set after twelve paragraphs. Since this point is only reached by users with a great interest in the topic anyway, inducing a variation in user behaviour could be difficult just for this reason.

Although the data does not suggest that summary snippets were effective in converting users to the evergreens, it is interesting to note that those that performed best were quotes and questions. This could suggest that more enticing formats that are not just traditional summaries, but rather excerpts that spark curiosity, work best as teasers. This is also an area that needs further exploration. For example, it could be envisaged that it would be easier to extract a central question or quote out of a long, feature article than doing justice to the whole piece in a concise summary. So maybe this could be the best way of using AI-summarisation tools in relation to the kind of evergreens that the tested tools found most problematic. However, further testing is needed to draw any definite conclusions.

In the past years, we have noticed that further content recommendations are generally rarely viewed on the website. One reason for this could be that the SPIEGEL website is used predominantly to stay up-to-date with the latest news. And even though the snippets had their own design, it seems plausible that users might have mixed them up with traditional article recommendations – or with other elements, such as advertising, which they usually scroll over.

It is therefore questionable to what extent the results can be transferred to other media platforms and forms of presentation. Further and more qualitative studies would be needed to gain more comprehensive insights on the impact of summaries within new content. For example, we know very little of what the audience actually thought of finding a summary where they usually met just a headline and a link. Does this enhance the overall experience? That remains to be discovered, maybe via an audience survey.

However, the approach of structured journalism goes much further, by systematically breaking down journalistic products into pieces and assembling them appropriately according to the concrete usage situation. In this approach, we see enormous potential for modern journalism and would, therefore, suggest establishing extended research in this area.

To media professionals, we recommend using their existing toolset to experimentally test different formats of linking and teasing. Since it can be assumed that the usage characteristics differ depending on the brand, environment, content, and audience, different approaches could prove to be successful in different contexts. Unusual formats such as quotes and questions performed surprisingly well in our test setup, which can be seen as an encouraging signal to move more strongly towards quiz or knowledge formats, for example. Also, a comparison of different optical characteristics could be worthwhile.

FINAL RECOMMENDATIONS

Why do we need AI-generated summaries in the modern newsroom? In Chapter 7, a wide range of different use cases, both in production and potential ones, were discussed. AI-generated summaries could constitute the kind of “deconstructed”, modular content that could then be used to tailor new kinds of news feeds. They could also have internal benefits, like easing workflows, enhancing SEO, and improving search in archives. They could be used for summarising content of third parties, like scraping information from public records, either to be served to the newsroom or directly to the audience. The use of summaries in combination with translation also opens up the potential of increasing accessibility in our multilingual societies.

If one of the ideas in this study has got you interested, our strong recommendation is that you explore the use cases for your specific media company. Your unique set of editorial profile, targeted audience, technical systems and internal competencies has to be taken into consideration. There is not a “quick fix” that can be integrated off the shelf, especially not in languages other than English. When having this discussion, we recommend you do it in a multi-skilled team that includes both journalists and coders.

Building new AI-systems requires time and considerable resources, so it is worth contemplating teaming up with external partners. Could you design a summarisation model in conjunction with a university? Could you do it together with a commercial partner? Could you do it with competitors to split the costs? If you are a public service company, could you do it with your peers in other countries?

Going into the study, we did not expect existing AI-driven systems to fully autonomously produce and publish summaries. For the serious publisher for whom trust is key, our results show that some summaries are inadequate, incoherent and factually wrong, and the risk of automatically publishing such sub-standard content should be carefully considered. In a modern media landscape, it is not a sustainable position to “blame the AI” for any errors. To have checks and balances in place is the editor’s responsibility, no matter what technology is involved in the production.

As in so many cases involving AI and machine learning, the recommended way forward is a hybrid approach where the journalist can be assisted by the AI in a semi-automated workflow. Having the AI-system suggest a short summary or bullet points certainly can speed things up, particularly in hard news, if the AI is customised for your needs and trained on your content. But you still have to carefully assess if the summary captures the essence by comparing it to the full length of the original piece, and this requires a greater human effort for special formats and more creative editorial genres. Teasers in particular need human, creative editing.

These, however, are conclusions for audience-facing output. In internal systems there is more room to manoeuvre. Automatic summaries could be a way to more easily scan archives or monitor large amounts of text while researching a topic. In such situations, clear disclaimers that a specific text body has been automatically summarised should be included.

We think that AI-summaries will be an integral part of journalistic production processes in the future. Every organisation has a lot to win from optimising their summarisation tools towards a specific use case. As we are in the early days of technical development, it is logical that the tools have started with the lowest hanging fruits, like summarising the most traditional and factual news articles. Exploring how the technology can assist in repackaging, resurfacing and making the very best of our journalism more visible is an exciting challenge that we hope will engage clever minds from the global news community in the years to come.

APPENDIX: Non-exhaustive list of AI-powered summarisation tools