Newsrooms produce large quantities of content every day. And what defines news is – at least in part – that the content is new: it covers current affairs and keeps readers up-to-date with what's going on in the world.

The amount of new content produced every day leaves news organisations with enormous amounts of unstructured text in their archives. In most cases, all that content offers almost no value to the news business, as it is considered no longer relevant when it loses its "newsiness".

However, these archives contain untapped potential that could be brought to life through AI technology to support the work of the journalists. But how can newsrooms leverage archived articles? What kind of tasks can be supported by AI and what kind of solutions are out there already?

Exploring these questions is the objective of our team in the context of the JournalismAI Collab: We want to facilitate the creation of an archived content suggestion engine for journalists.

At first, we hoped to design a universal tool for newsrooms to leverage archived content. But we soon met a range of challenges and realised that a one-size-fits-all tool might be difficult to achieve, for a number of reasons that we will explore in this study.

So we adapted our goal and this is the result of our work: A catalogue of solutions for news organisations that want to leverage the content in their archives, but also a call to action for tech companies to work with journalists to develop the tools that are most needed. Because the potential is there.

EDITOR'S NOTE:

This report is produced by an international team of journalists as part of the JournalismAI Collab, a project of the Polis think-tank at LSE, supported by the Google News Initiative.

The Collab is a collaborative experiment where news organisations from around the world teamed up to explore innovative solutions to improve journalism through AI.

You can find out more about the initiative, and about the work of the other teams, on the Collab page.

1. Introduction & Definitions

Here are some examples of the challenges and limitations we encountered in the development of this study:

Language barriers

First of all, we realised that because of language differences, a prototype tool developed for a specific newsroom might not work for newsrooms in other countries. Secondly, when working with AI, the language you operate in matters: English often works better because of the availability of existing English-language AI tools, training data, and libraries. The ‘smaller’ the language, the fewer off-the-shelf solutions we found.

Newsroom culture

A universal obstacle for newsrooms to overcome is the challenge of encouraging journalists to adopt AI tools (and technology in general) in the journalistic process. A classic example is manual tagging systems for articles, where consistency is often absent. Also, it can be difficult to introduce new routines and convince journalists of the benefits of making an effort to adapt to new tools. Our team conducted a survey in our respective newsrooms, where we tried to identify what AI tools journalists wanted. In this survey, the respondents could indicate whether they considered themselves pioneers, curious, moderate, or conservative regarding their use of technology in the production of news. Although most respondents described themselves as 'curious', we received answers from across the four categories, and their feedback confirmed the variety of attitudes towards technology in journalistic production.

Newsroom-specific needs

We soon realised that our Collab team itself included many different types of news organisations, including large legacy media; news agencies; local and regional newspapers; and specialised b2b media companies. Even though the challenges are sometimes similar, they still differ to the extent that universal tools may be difficult to implement, as definitions and goals vary significantly. A very concrete example is Machine Learning (ML), a method often suggested in this study. A universal tool using ML would require training on any specific newsroom's archive because of their differences.

Individual routines

Even more specific is how the tool would have to work at the individual level. A successful tool would have to fit with how individual journalists produce their content. For example, it matters whether their routine is to write the article in a Word document before copy-pasting it into the CMS, or to write the draft straight into the CMS article templates.

The resource question

Another limitation that is widespread among newsrooms is resource availability. When working with advanced technology in media organisations, resources are most often allocated to the business intelligence departments for business insights, rather than content production. Often, the investments also centre on revenue-generating areas of the business, such as the Commercial Function. For newsrooms that do not have paywalls or a subscription model, it is challenging to create a business case which justifies the skills and expertise of data scientists when focusing just on ad impressions.

Reusability is a critical factor when thinking of how to make the best use of the content in our archives. But what makes a piece of content reusable is not universal. Looking for a shared definition of 'evergreen' content, we realised that all newsrooms have their standards and definitions, which make it complicated to generalise. We decided to classify potential definitions in four macro-categories.

Definition by content type

One category that is related to several of our case studies is the definition by content type. Some types of content stay relevant for longer than others. Examples are explainers, guides, reviews, and portraits. These are most often connected to a news topic rather than to the daily news cycle and therefore lose their value more slowly. These are types of content that could potentially even be updated and republished once the same theme reappears on the agenda.

Cyclic definition

Some articles do not depend on current affairs, but are related to certain recurring events. Examples vary a lot between newsrooms, but for a local outlet like Nice-Matin, they could be articles such as "what to do in case of mosquito bites or jellyfish stings" or "the best way to keep your apartment cool in the summer" – clearly articles that could be reused every year. For other newsrooms, they could be stories about traditions related to public holidays, or anniversaries of important events.

Contextual definition

Another way of defining evergreen is based on ‘related articles’. Altinget and Archant describe articles in this category as historic content that supports or enriches the story of a new article. For example, a story on a new trend in car crashes would be supported by articles previously published on specific car crashes. In Altinget’s case, they built automated timelines on law proposals where the archived content provides the background to a new article as this one appears on the proposal’s timeline.

Definition by metrics

Finally, evergreen content can be defined by performance metrics. If a piece of content experiences a sudden growth in traffic from Google Search or social media a while after release, it can indicate a renewed interest in the story. If the editor becomes aware of that, they can interpret it as a news lead for a new story, or decide to republish the original story to capitalise on the renewed interest. Updating the time-stamp also strengthens the SEO value of the story. The high SEO value of an article is a key component for some newsrooms in defining that article as 'evergreen'. If a piece of content sustains high traffic numbers for an extended time after release, that might also be an indication that the story has a ‘timeless’ quality and potential evergreen value.

Our team used Miro as a digital whiteboard to collect information and ideate. Here's our collection of definitions of evergreen content.

The tool we wanted to design has more than one potential use or configuration. Some applications would be more relevant for some newsrooms than others and they could support different phases of the editorial process.

Ideation and research phase:

As inspiration for new stories, or suggestion for resurfacing earlier articles, the tool should be able to notify the editors if an archived article suddenly reappears in traffic reports from search engines or social media. That could be an indication of renewed user interest. This could be achieved through the use of engines such as Google Trends or Parse.ly Currents. Parse.ly is working to develop universal keyword clusters that would help newsrooms to identify their own content that could be relevant to support the narration of resurfacing trends.
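As an illustration only, a notification rule of this kind could be sketched as follows. It assumes that daily page-view counts per archived article are available; the function name, the field names, the 7-day baseline window, the 90-day minimum age, and the threshold factor are all hypothetical choices, not taken from Google Trends, Parse.ly Currents, or any newsroom's system.

```python
from datetime import date
from statistics import mean

def resurfacing_articles(articles, today, min_age_days=90, window=7,
                         factor=3.0, min_views=100):
    """Return IDs of archived articles whose latest daily views jump above their recent baseline."""
    flagged = []
    for art in articles:
        age = (today - art["published"]).days
        views = art["daily_views"]  # oldest first, last entry = most recent day
        if age < min_age_days or len(views) <= window:
            continue  # too recent to count as archive, or not enough history
        baseline = mean(views[-(window + 1):-1])  # average of the preceding `window` days
        latest = views[-1]
        if latest >= min_views and latest > factor * max(baseline, 1):
            flagged.append(art["id"])
    return flagged

# Hypothetical archive records
archive = [
    {"id": "red-weather-warning", "published": date(2019, 11, 2),
     "daily_views": [20, 25, 18, 22, 19, 21, 24, 400]},
    {"id": "council-meeting", "published": date(2019, 11, 2),
     "daily_views": [20, 25, 18, 22, 19, 21, 24, 23]},
]
print(resurfacing_articles(archive, today=date(2020, 11, 1)))  # ['red-weather-warning']
```

The absolute floor (`min_views`) is there so that an article going from 2 to 10 views does not trigger a notification; a real dashboard would probably also split the traffic by source.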

Research phase:

Content in the archives could be useful in the early phase of production of a new story. For example, the tool could help find related stories in the archive based on the first paragraphs drafted by a journalist. This could help the journalist resurface useful information to support their writing, as well as find previous quotes or charts that might be reused.

Finishing/ornamentation phase:

The tool should also be able to suggest related articles to the journalist or editor, as well as to the users. This could be a way to support reader retention and strengthen the story’s value by adding context and additional information from older related articles. This also has the added value of signalling to the user that this newsroom or journalist has a track record of reporting on the topic and, therefore, an established reputation in that area.

Also relevant in this phase is the choice of the headline. Based on the text in the article and on the performance of related historic content, there are different methods to produce auto-generated suggestions for the headline, with special attention to optimising performance on search engines (SEO). This is not directly related to our objective, but we decided to explore potential solutions anyway. [See sections 2.6 and 2.7 of this study].

Our team used Miro as a digital whiteboard to collect information and ideate. Here's our mapping of needs in the different phases of the journalistic process.

We call our imagined new product ArcAI, and the (imaginary) product description reads:

ArcAI is a smart bot that sifts through your archive to recommend the best matches whenever a journalist starts to write a new article. ArcAI’s suggestion engine has three goals: to reuse, to inspire, to interlink.

Based on Natural Language Processing (NLP) and other related technologies, ArcAI assigns a score to each article in the archive to find potential matches as soon as a journalist starts writing a new article in the CMS.

To detect evergreen content, we discovered that some methods and tools are already developed. Often, they do not even require any advanced technology to operate:

Metrics

Some newsrooms are detecting relevant evergreens simply based on how their archived content performs. One example is the Financial Times, which in 2017 developed a dashboard that identifies renewed traffic to older posts and where that traffic originates – for example, search, social, or internal references. This notifies the editors if a topic is re-emerging and if earlier articles could be updated and resurfaced.

Evergreen tag

Some newsrooms are working with tags, manually assigned by the authors if they consider an article as a potential evergreen. Nice-Matin has worked with this model [See section 2.2] as well as The Wall Street Journal. However, this method very much depends on journalists' consistency in tagging each and every story and it is very difficult to automate.

Cyclic notifications

With a certain amount of preparatory work, recurring events – and related stories – could also be coded into the editorial calendar. This way, editors could be notified when certain examples of evergreen content reach the cyclic time to shine.
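A minimal sketch of such a calendar-based notification, assuming the newsroom maintains a mapping from recurring dates to archived articles. The mapping, the article IDs, and the 14-day lead time are hypothetical and would need to be curated by the editors.

```python
from datetime import date

# Hypothetical calendar: (month, day) of a recurring event -> archived article IDs
RECURRING = {
    (7, 1): ["keep-apartment-cool-in-summer", "jellyfish-sting-first-aid"],
    (12, 24): ["christmas-traditions-explained"],
}

def due_for_review(today, lead_days=14):
    """List (article_id, days_left) for articles whose recurring date is within `lead_days`."""
    due = []
    for (month, day), article_ids in RECURRING.items():
        occurrence = date(today.year, month, day)
        if occurrence < today:  # already passed this year, so look at next year
            occurrence = date(today.year + 1, month, day)
        days_left = (occurrence - today).days
        if days_left <= lead_days:
            due.extend((article_id, days_left) for article_id in article_ids)
    return sorted(due, key=lambda item: item[1])

print(due_for_review(date(2021, 6, 20)))
# [('keep-apartment-cool-in-summer', 11), ('jellyfish-sting-first-aid', 11)]
```

The lead time gives editors room to update an article before it is republished; events without a fixed date (for example, the first heatwave of the year) would need a different trigger.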

Qualitative indicators

Another method, tested by La Nación, is to look at qualitative indicators in the articles. In this context, one way to define what is evergreen or not is to look at some specific indicators within the content itself: Are there many people quoted? Did the piece perform well on social media? How was the dwell time? Did it receive many comments and shares? Does it include infographics? – and so on. Each of these indicators might help qualify whether a piece of content has the potential to be an evergreen. Archant has worked with attribution of qualitative properties to articles in the archive based on performance. [See section 2.5]
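To make the idea concrete, the indicators listed above could be combined into a simple weighted checklist. The sketch below is an assumption for illustration: the weights, thresholds, and field names are hypothetical and do not describe the method used by any of the newsrooms mentioned.

```python
# Hypothetical weights: how much each indicator contributes to the evergreen score
WEIGHTS = {
    "many_quotes": 1.0,
    "has_infographic": 1.5,
    "long_dwell_time": 2.0,
    "many_shares": 1.0,
    "many_comments": 0.5,
}

def indicators(article):
    """Turn raw article properties into yes/no indicators (thresholds are illustrative)."""
    return {
        "many_quotes": article["people_quoted"] >= 3,
        "has_infographic": article["infographics"] > 0,
        "long_dwell_time": article["dwell_seconds"] >= 120,
        "many_shares": article["shares"] >= 500,
        "many_comments": article["comments"] >= 50,
    }

def evergreen_score(article):
    """Sum the weights of all indicators the article satisfies, scaled to the range 0..1."""
    flags = indicators(article)
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] for name, hit in flags.items() if hit) / total

explainer = {"people_quoted": 4, "infographics": 2, "dwell_seconds": 180,
             "shares": 120, "comments": 10}
print(round(evergreen_score(explainer), 2))  # 0.75
```

In practice, the weights would have to be tuned per newsroom, for instance against articles that editors have already marked as evergreen.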

The real challenge in our work was to find the best method to match a new article in the process of being created with a piece of existing content from the archive. Our research revealed various potential solutions – which all seem useful only under certain circumstances. This realisation contributed to our decision to not try to develop a universal tool but rather to offer an overview of the different ways AI can be used to leverage a newsroom's archive. As illustrated below, we tried to map different approaches. The following paragraphs summarise what we found out about those approaches and how they function.

Our team used Miro as a digital whiteboard to collect information and ideate. Here's our mapping of technical methods to match articles in progress with content from the archive.

The first task is to identify the potential matches in the archive. Secondly, we want to add some quality criteria to those matches, to increase the chances of finding a match with an article that has evergreen value. The matching can be done either by keyword search or by automating the process of matching an article being written in a CMS template with an article in the archive. These are the main findings of our mapping process:

Minimal version

One easy way to implement keyword search is simply to embed your media search field visibly in your CMS article template. This will create easy access to the archive and remind the journalist of the possibilities offered by the content in there. This method requires that the quality of your native search engine is acceptable, and native search engines are something that many newsrooms struggle to optimise. Alternatively, Google's Programmable Search Engine, connected to your own domain, can offer a solution.

Matching tags

The next option is based on tags: it consists of generating a list of suggested articles according to the number of tags that match words used in your article. This method depends on how you use tags in your organisation and has obvious limitations, as manual tags are often inconsistent and mistakes occur. The quality of tagging systems also varies a lot between newsrooms. TX Group has a successful automated tagging system based on open external libraries in combination with manual tags. Matching this metadata with new articles looks like a promising opportunity. [See section 2.4].
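The basic tag-matching idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name and data layout are hypothetical, and only single-word tags are matched against the words of the draft.

```python
import re

def suggest_by_tags(draft_text, archive, top_n=3):
    """Rank archived articles by how many of their tags occur as words in the draft."""
    draft_words = set(re.findall(r"\w+", draft_text.lower()))
    scored = []
    for article in archive:
        # Only single-word tags can match here; multi-word tags would need phrase matching
        matches = [tag for tag in article["tags"] if tag.lower() in draft_words]
        if matches:
            scored.append((len(matches), article["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most matches first, then by ID
    return [article_id for _, article_id in scored[:top_n]]

# Hypothetical archive entries with manually assigned tags
archive = [
    {"id": "a1", "tags": ["heatwave", "health", "nice"]},
    {"id": "a2", "tags": ["football", "nice"]},
    {"id": "a3", "tags": ["budget", "parliament"]},
]
draft = "Another heatwave is expected in Nice this weekend, and health services are on alert."
print(suggest_by_tags(draft, archive))  # ['a1', 'a2']
```

The example also shows the weakness described above: a misspelt or missing tag simply never matches, so the suggestions are only as good as the tagging.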

AI-based matching techniques

To avoid manual tagging, there are several other ways of finding relevant matches. For example, a machine learning model can be trained to automate the creation of a library of keywords to facilitate relevant matches in your archive. We learned that there are variations in methodology between different machine learning options and that a number of different algorithms are available. One example is the “StarSpace” algorithm, developed by Facebook Research, that the travel platform Culture Trip used to develop their own auto-tagging system. This algorithm encodes and matches entire texts instead of focusing on the comparison between specific words. [See section 2.1 to find out more].

Machine learning can also be combined with other methods, for example entity extraction of named entities (people, organisations, or places), and natural language processing (NLP) that takes the relations and properties of words into account. Statistical methods such as TF-IDF could also be of use, where algorithms compare the frequency of words to make statistically-based matches.
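As a rough illustration of the statistical approach, the sketch below computes TF-IDF vectors and ranks archived texts by cosine similarity to a draft. It is a bare-bones, standard-library version written for this example (the helper names and sample texts are hypothetical); a production system would more likely rely on an established library and on language-specific preprocessing.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: word -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

archive = [
    "The city council approved the new tram line budget.",
    "Jellyfish stings: what to do at the beach this summer.",
    "Tram line delays frustrate commuters in the city.",
]
draft = "Commuters wait as the tram line opening is delayed again."

vectors = tfidf_vectors(archive + [draft])
draft_vec = vectors[-1]
ranked = sorted(range(len(archive)),
                key=lambda i: cosine(draft_vec, vectors[i]), reverse=True)
print(archive[ranked[0]])  # the tram delays story ranks first
```

Words that appear in every text (here, "the") receive a weight of zero, so they do not influence the match; rarer shared words such as "commuters" dominate the score.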

Unfortunately, we did not manage to test combinations of these methods. However, it is worth taking into consideration the cost of the development and the cost of running complex calculations on cloud servers every time an article is being published.

Explorative model

Whichever way you decide to put together your matching model, it will produce a calculated list of matching results, that you might find relevant or not. But if you apply filters or open the calculation in a visible knowledge graph that you can edit according to your preferences, journalists will gain access to a qualified list of suggestions. Nevertheless, a question worth exploring is whether the effort needed to edit the results will be perceived as an obstacle rather than help.

2. Case studies on leveraging the archives

Instead of developing a universal tool for newsrooms, we have collected inspiring and constructive experiences from various media organisations that have successfully developed innovative ways to leverage the content in their archives through AI and related technologies. These are based on our team members’ experiments as well as experiences of newsrooms and organisations connected to the Collab.

To shed light on how large amounts of evergreens can be successfully leveraged, we talked with travel startup Culture Trip. The business is based on a vast catalogue of articles about places to visit around the world, categorised in a taxonomy that makes it easy for travellers to find the content that is relevant for them.

The entire site is built on evergreen content, in the sense that it prioritises articles of highly-relevant value in a given context. Culture Trip articles are not news, but they offer great information in a very specific context. The taxonomy is a very well-defined set of tags that indicate the main features and geographic location of the article.

Articles are indexed through auto-tagging. Culture Trip has indexed around 80,000 articles using a machine learning algorithm called StarSpace, developed by Facebook Research. StarSpace is an algorithm that creates text embeddings but also includes the extension to create tag embeddings simultaneously in the same vector space. A text embedding is an alternative machine-readable representation of a text that aims to describe its semantic properties in an abstract way. According to Culture Trip, this method is very reliable, with 90% precision, but it also requires a very strict taxonomy and a carefully trained algorithm.
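The idea of placing texts and tags in the same vector space can be illustrated with a toy example. The code below does not use StarSpace and does not reflect Culture Trip's implementation: the three-dimensional vectors are written by hand, whereas a trained model would learn them from data. It only shows how, once both kinds of embeddings exist, tags can be suggested by measuring closeness.

```python
import math

# Hand-made 3-dimensional embeddings for illustration only; a trained model
# would learn such vectors (usually with many more dimensions) from data.
TAG_EMBEDDINGS = {
    "food":         [0.9, 0.1, 0.0],
    "architecture": [0.0, 0.9, 0.2],
    "nightlife":    [0.2, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def suggest_tags(text_embedding, threshold=0.5):
    """Return tags whose embedding lies close to the text embedding, best match first."""
    scores = {tag: cosine(text_embedding, vec) for tag, vec in TAG_EMBEDDINGS.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [tag for tag, score in ranked if score >= threshold]

# Hypothetical embedding of an article about restaurants in historic buildings
article_embedding = [0.7, 0.6, 0.1]
print(suggest_tags(article_embedding))  # ['food', 'architecture']
```

The threshold (a hypothetical value here) controls the trade-off between precision and the number of tags suggested, which is one reason a strict taxonomy and careful training matter.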

But even evergreen content doesn't really stay green forever. Deep links expire, details get outdated and pictures get old. To be able to still use that content effectively, it's important to have a method to maintain it. This is especially true for a business like Culture Trip, that aims to guide users around the world.

Culture Trip put significant effort into identifying the lifecycle of a piece of content, which helps them to understand and handle the biology of the content. Certain stages of the lifecycle call for a review, while others might suggest when it's time to "retire" the content. They also realised further advantages of updating an article: the accuracy of the content they produced is crucial for their business model but an updated timestamp also triggers better performance in search engines.

To keep 80,000 articles constantly up-to-date, Culture Trip once again made use of technology, and established what they call a “Maintenance Machine”. It is basically a system that monitors a ranked set of parameters in the article database. That includes dead links, low-resolution images, or outdated information, for example about whether a business is closed – information that is available from the Google Places API. The Maintenance Machine displays a list of articles that need review, ordered by how urgent the review is.
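A ranked review queue of this kind could look roughly like the sketch below. It is not Culture Trip's code: the issue names and weights are hypothetical, and the detection of the issues themselves (link checking, image inspection, business status lookups) is assumed to have happened in an earlier step.

```python
# Hypothetical issue weights: higher means more urgent
ISSUE_WEIGHTS = {"business_closed": 5, "dead_link": 3, "low_res_image": 1}

def review_queue(articles):
    """Return (article_id, urgency) pairs for articles with issues, most urgent first."""
    queue = []
    for article in articles:
        urgency = sum(ISSUE_WEIGHTS[issue] * count
                      for issue, count in article["issues"].items())
        if urgency > 0:
            queue.append((article["id"], urgency))
    return sorted(queue, key=lambda item: item[1], reverse=True)

# Hypothetical output of earlier checks: issue type -> number of occurrences
articles = [
    {"id": "best-cafes-lisbon", "issues": {"dead_link": 2, "low_res_image": 1}},
    {"id": "rome-in-48-hours", "issues": {"business_closed": 1, "dead_link": 1}},
    {"id": "oslo-museums", "issues": {}},
]
print(review_queue(articles))
# [('rome-in-48-hours', 8), ('best-cafes-lisbon', 7)]
```

Weighting a closed business above a dead link reflects the assumption that factual errors hurt readers more than cosmetic ones; a newsroom could rank its own parameters differently.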

The approach of Culture Trip offers interesting insights for newsrooms, in spite of the clear differences in the objectives of a news organisation. In the context of news, taxonomies and tagging structures are more complex and less strict, and the definition of evergreen content varies a lot. However, the kind of content that Culture Trip offers is often relatable to at least some of the content types in the news business, such as guides, reviews, and recommendations. Newsrooms can surely learn from the Culture Trip methodology if they want to explore innovative ways to leverage the content in their archives.

Thanks to Roop Gill Axelsen, Product Manager, and Ana Jakimovska, Chief Product Officer at Culture Trip – and Collab coach – for sharing their knowledge and experience.

Some media houses are experimenting with introducing evergreen tags – the Wall Street Journal and Nice-Matin in France, among others. In the case of Nice-Matin, an 'evergreen' tick-box was created in the CMS back in 2016, to allow the journalists to catalogue an evergreen article. We spoke with Damien Allemand, Head of Digital Content, to find out more about that:

What was behind the decision to add tags to all Nice-Matin articles?

We realised that we were too often rewriting articles that we had already published. And when we did, the new article was just as good as the version that was already published.

So we wanted to build a database of all these published articles that a journalist could retrieve with a simple keyword-based filter. For example, a lifestyle article on "The 30 reasons to never go to the Côte d'Azur" had 100,000 views when it was first published; three months later, we republished the same piece and recorded 200,000 views. Since then, it has been republished about ten times, each time driving good traffic to the website.

At the same time, this strategy saves our journalists significant time. When they are about to write a new article, they can easily find out whether a similar story was already published and, if it was, maybe they will spend their time better by writing about something else.

A screenshot of the CMS used by Nice-Matin, where journalists can tick the 'Recyclable' box to mark an article as evergreen.

What would you say is the proportion of evergreen articles in Nice-Matin's archive?

In our estimation, the content that can be republished without the need for much editing represents around 10% of the archive. If we add to that other articles that only need to be slightly updated, we reach around 20%. These are articles such as "What is a red weather warning?", or "Why can we see Corsica from Nice?", which can be republished pretty much every year around the same time.

What are the challenges you have encountered with this manual-tagging system?

What doesn't work at times is that journalists do not always remember to tick the tagging box. It is hard to build the habit of doing it with the consistency that is required for the system to bear fruit. Some reporters maybe do it for a few days and then they stop – maybe because it's hard to see immediate benefits.

What would it take to make the system work?

To optimise the use of evergreen content and build an efficient recommendation engine for your journalists, you need a detailed plan, good organisation, and buy-in from the newsroom. The tagging of evergreen articles is a task that might also be assigned to someone in the archive department. But that takes resources and willingness to invest them, which is often a problem for local and regional outlets like we are at Nice-Matin.

Thanks to Damien Allemand, Head of Digital Content at Nice-Matin, for sharing his insights.

With so many headlines, stories, photos, videos and graphics published at Reuters – as many as 9,000 individual items per day – accurate metadata application, the use of topics to describe a piece of news, will always be ripe for automation. The challenge is also to recapture the total number of hours our journalists spend manually applying this metadata, a significant saving of time that can be turned toward newsgathering.

Feedback from our clients, especially other news organisations, tells us that they like stories that summarise newsworthy subjects with relevant, timely facts, what we refer to as evergreens. That has raised yet another challenge: how to use metadata to quickly surface evergreen story candidates, and possibly assemble a large portion of an evergreen story.

There is already some automation that has proved useful. For nearly every story, an algorithm applies topics whenever a story is transferred to an editor or between editors; the use of the ‘Add Topics’ button in the user interface is also encouraged; and some topics are applied based on trading symbols, or white-listed geographies. Applying this to thousands of stories per day helps, but it still feels incomplete.

In analysing how we apply metadata internally, we assumed the gold standard in assigning topics was going to be published content that had passed review or insertion by a human editor. But that wasn’t necessarily the case. The number of topics is so great, with updated versions every few weeks, that no one person knew the codes and their definitions in detail, and codes often went missing or were over-applied.

Reuters has several story types that are de facto evergreen items, with codes and associated names such as Explainer, Factbox and Take-A-Look. As long as a story was deemed evergreen from the start, the metadata suggestion approach worked well. It was less successful when metadata was used to determine what could potentially be an evergreen story.

We still do not have a quick way to say, “This Factbox is relevant again this week, what edits does it need in order to be updated and republished?” It's a human task that could use an automated or semi-automated solution. Factboxes about politics are especially prone to age rapidly. Ensuring proper context and updating dates, times and facts is critical to these items, and makes up the majority of editing time.

We have other automated efforts with structured data, around sports, corporate results and markets coverage. We found machine-produced text to be useful, if not sparkling prose. Most of the feedback internally was on the sometimes stilted nature of the writing and the punctuation. Moving from sentences using structured data to suggestions based on unstructured text is worth experimenting with.

We may find in experimenting that certain facts, as complete sentences, are evergreens, and can be suggested wholesale as the story writing progresses. For example: “This is the fourth story today, or this week, that can be grouped together. May I suggest these facts?” Or: “Based on previously published stories about this topic, you are writing on a subject that would make a popular Explainer.” Or perhaps an evergreen “score” that could help guide journalists.

Thanks to the Editor for Automation and News Technology at Reuters, Padraic Cassidy, for sharing this case study.

One way to apply machine learning, the most common form of AI nowadays, is to split its application into two steps:

1) In a first technical step, the objects of interest – for instance, news articles for media houses – are processed into a structured format;

2) While in a second heuristic step, this structured format is exploited for various applications. A natural application is to search for similar news articles.

We illustrate this process in Figure 1:

Figure 1: First abstract objects – for instance, raw news articles – and then use these abstractions – for instance, extracted tags – in several applications.

A popular approach to give structure to news articles is to automatically assign tags. Tags can be broadly classified as follows:

1) Tags that are directly contained in the text, for example the name of a person – say, 'Joe Biden';

2) Abstract tags that are not necessarily explicit in the text. For example, a topic like 'Politics';

3) Meta-information like a manually-assigned 'Evergreen' tag.

The tags of the first type can be dealt with by classic Named Entity Recognition (NER) algorithms. Common subclasses are “People”, “Organisations”, and “Locations”. On the other hand, the second class of tags detailed above requires some specifically-trained classification algorithms. Finally, the last group (which is not used at TX Group yet) would require an additional automatic or semi-automatic process that will depend on the type of meta-information. For instance, we have seen in earlier sections that there are many ways to define evergreen content.

Searching for articles

The main benefit of combining different tags in the same solution lies in the possibility to use them in a unified search procedure. At TX Group, we are currently using the following logic in a beta test:

1) Given a specific article, we score all other articles published within a reasonable time window according to their similarity to the first one, with respect to tag overlap. That is, we assign some points for each joint tag, and then compute the score by simply summing up the points.

2) A major improvement is to assign points according to the frequency of a tag. For instance, a very common tag like 'Politics' should earn fewer points than a more specific one like the name of an individual. To deal with this, we use a so-called inverse-frequency strategy (in the spirit of TF-IDF): if a tag appears in only one article in every 100, its frequency is 1/100, and the points it contributes to the similarity score are weighted by the inverse of that frequency. So the more frequent a tag is, the less important it is for the score.

3) Another positive adjustment is to linearly scale the points according to the number of occurrences of a tag, which applies only to the first group of tags described above. For instance, if two articles contain the name 'Joe Biden' several times, then the assigned points are scaled accordingly.
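The three steps above can be sketched as follows. This is an illustrative reconstruction, not the TX Group implementation: the log-dampened inverse-frequency weight, the use of the smaller of the two occurrence counts for scaling, and all names and sample data are assumptions. The time-window filter from step 1 is left out for brevity.

```python
import math
from collections import Counter

def tag_weights(corpus):
    """Inverse-frequency weight per tag: rare tags earn more points than common ones."""
    n = len(corpus)
    doc_freq = Counter(tag for article in corpus for tag in article["tags"])
    # Assumption: log-dampened inverse frequency, plus 1 so that every shared tag counts
    return {tag: math.log(n / df) + 1.0 for tag, df in doc_freq.items()}

def similarity(a, b, weights):
    """Sum weighted points over the tags shared by both articles."""
    shared = a["tags"].keys() & b["tags"].keys()
    # Assumption: scale linearly by the smaller occurrence count of the two articles
    return sum(weights[tag] * min(a["tags"][tag], b["tags"][tag]) for tag in shared)

def most_similar(target, corpus, weights, top_n=3):
    others = [art for art in corpus if art["id"] != target["id"]]
    ranked = sorted(others, key=lambda art: similarity(target, art, weights), reverse=True)
    return [art["id"] for art in ranked[:top_n]]

# Hypothetical corpus: tag -> number of occurrences (abstract tags always count as 1)
corpus = [
    {"id": "biden-speech", "tags": {"Politics": 1, "Joe Biden": 4}},
    {"id": "biden-visit", "tags": {"Politics": 1, "Joe Biden": 2}},
    {"id": "budget-vote", "tags": {"Politics": 1, "Economy": 1}},
    {"id": "cup-final", "tags": {"Sport": 1}},
]
weights = tag_weights(corpus)
print(most_similar(corpus[0], corpus, weights, top_n=2))  # ['biden-visit', 'budget-vote']
```

In this toy corpus, the shared name 'Joe Biden' outweighs the shared topic 'Politics' twice over: it is rarer, and it occurs several times in both articles.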

Figure 2 shows a screenshot of our current search GUI (graphical user interface). Our goal is to integrate this more tightly into the journalistic production process.

Figure 2: Search GUI at TX Group based on tags. The different colours correspond to the different classes and subclasses of tags described above

Technical Implementation

To power this process, a NER tagger is required to create the tags of the first class. As already mentioned, NER stands for Named Entity Recognition, an NLP technique to extract keywords that have a special meaning from a text, like names of people. There are many paid APIs on the market that cover this niche. The ones we tested provide quite good results: Rosette is a good example. On the other hand, there are also reasonably good Python packages that provide this functionality. Arguably the most popular ones are spaCy and Flair:

spaCy

PRO: most popular NLP package, runs very fast
CON: limited quality for NER in our test cases

Flair

PRO: acceptable quality and good multi-language support
CON: quite new and far slower

In the end, since we favoured quality over speed, we settled on Flair for our NER tagging process.

On the other hand, to classify articles by creating tags of the second class, we use a standard neural network which encodes articles using word vectors. As written above, the integration of class-3 tags is still an open challenge at TX Group – which our journalists are asking us to solve.

Thanks to Tim Nonner, Chief Data Scientist at TX Group, for contributing to the report with this case study.

There are various technical approaches that an AI solution for recommending archived or evergreen content can take. Recommendations could be based upon existing tags, content categories, seasonal search traffic, social media trends, or article headlines. The more accurately and comprehensively article data is structured, the better the chance that a programme can be built to recommend the most suitable articles from the archive.

Unfortunately for the AI, there's usually no consistent way in which a newsroom structures its article and performance data, which can cause problems when trying to highlight the right article. There are certain 'hygiene' steps that all newsrooms can take to ensure their archive is accessible and enriched with usable data. The first is unlocking raw CMS data, which isn’t always picked up by traditional analytics software such as Google Analytics, Adobe Analytics, or Chartbeat. The core information needed is the headline, published date, last modified date, word count, content categories, content tags, and the unique ID/URL of an article. Also relevant is the author, what additional sites it appeared on (if any) and, if possible, the complete body text of the article. Whilst the latter would improve accuracy compared with performing Natural Language Processing (NLP) on just the headline, it would increase storage costs and may be difficult for newsrooms to access.

The next step is to join the raw CMS data with article performance data. As a minimum, this would require a connection with a single analytics source via a unique article ID or URL, including page views, unique users and dwell time. Calculated metrics, powered by segments, can be built to identify the referral traffic breakdown per story, such as Search Page Views, Social Page Views, and Direct Page Views, which can be helpful to understand the evergreen nature of an article. Additional third-party data sources can be connected to aid understanding of how the article was engaged with off-platform. For example, the average position in the Google SERPs via Google Search Console, and the volume of comments, likes, and shares from Facebook and Twitter Analytics.
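The join described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any newsroom's pipeline: the field names (`search_pv`, `social_pv`, `direct_pv`, `dwell_seconds`), the sample records and the `search_share` metric are all hypothetical.

```python
# Hypothetical CMS export, keyed by unique article ID.
cms_articles = {
    "a1": {"headline": "How to prune roses", "published": "2019-03-02", "word_count": 820},
    "a2": {"headline": "Council meeting moved", "published": "2020-06-11", "word_count": 140},
}

# Hypothetical analytics export: one row per article with page views by referrer.
analytics_rows = [
    {"article_id": "a1", "search_pv": 6000, "social_pv": 1000, "direct_pv": 3000, "dwell_seconds": 95},
    {"article_id": "a2", "search_pv": 50, "social_pv": 400, "direct_pv": 550, "dwell_seconds": 20},
]


def join_cms_with_analytics(cms, rows):
    """Attach performance metrics to each CMS record via the shared article ID."""
    joined = {}
    for row in rows:
        article = cms.get(row["article_id"])
        if article is None:
            continue  # analytics row with no matching CMS record
        total = row["search_pv"] + row["social_pv"] + row["direct_pv"]
        joined[row["article_id"]] = {
            **article,
            "page_views": total,
            "dwell_seconds": row["dwell_seconds"],
            # Share of traffic arriving from search: a rough evergreen signal.
            "search_share": row["search_pv"] / total if total else 0.0,
        }
    return joined


joined = join_cms_with_analytics(cms_articles, analytics_rows)
for article_id, record in joined.items():
    print(article_id, record["page_views"], round(record["search_share"], 2))
```

A high share of search traffic long after publication is one plausible hint that an article is evergreen, but the right threshold would depend on the newsroom.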

Performance metrics do not even need to be exclusively digital: an article that appeared on the front page of a newspaper should take priority over one that appeared on page 35, regardless of how well it performed on Facebook!

Building a meaningful overview

Structuring the data in this way means that the AI solution will not recommend articles which, for example, are fewer than 50 words long, are published to irrelevant categories or with irrelevant tags, only had a dwell time of 15 seconds, or never ranked on Google. These rules should also be configurable by the newsroom so that they reflect its strategy.
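As a rough sketch of what newsroom-configurable exclusion rules could look like, the Python example below encodes the thresholds mentioned above as defaults. The `RecommendationRules` class, the field names and the sample articles are hypothetical, and the defaults are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class RecommendationRules:
    """Newsroom-configurable thresholds; the defaults here are illustrative only."""
    min_words: int = 50
    min_dwell_seconds: int = 15
    excluded_categories: set = field(default_factory=set)
    require_search_ranking: bool = True


def is_recommendable(article, rules):
    """Return True if the article passes every configured exclusion rule."""
    if article["word_count"] < rules.min_words:
        return False
    if article["dwell_seconds"] <= rules.min_dwell_seconds:
        return False
    if article["category"] in rules.excluded_categories:
        return False
    # None means the article never appeared in search results.
    if rules.require_search_ranking and article["avg_search_position"] is None:
        return False
    return True


articles = [
    {"id": "a1", "word_count": 820, "dwell_seconds": 95, "category": "gardening", "avg_search_position": 4.2},
    {"id": "a2", "word_count": 40, "dwell_seconds": 60, "category": "news", "avg_search_position": 9.0},
    {"id": "a3", "word_count": 600, "dwell_seconds": 15, "category": "news", "avg_search_position": 12.5},
    {"id": "a4", "word_count": 700, "dwell_seconds": 80, "category": "competitions", "avg_search_position": 3.0},
    {"id": "a5", "word_count": 900, "dwell_seconds": 120, "category": "history", "avg_search_position": None},
]

rules = RecommendationRules(excluded_categories={"competitions"})
eligible = [a["id"] for a in articles if is_recommendable(a, rules)]
print(eligible)
```

Keeping the thresholds in one configuration object means editors can tighten or relax them without touching the recommendation logic.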

An additional benefit is that it can be linked to Business Intelligence (BI) and Data Visualisation software, such as Tableau, Qlik, and PowerBI. This provides the newsroom with the ability to integrate their archives by looking at the average performance over time, seeing if certain content categories perform better when published in the evenings or whether individual authors need support in their SEO due to under-indexing in search referrals.

At Archant, we segment all articles into three easily understandable performance buckets, allowing journalists and editors to know what behaviours need changing to write fewer ‘red’ stories, tying in with business performance expectations. The thresholds for the three buckets can be set entirely by the newsroom, which defines the targets for an article, so that each article is rated good, OK, or bad. We have used Page Views in the past to correct behaviour at Archant.
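A three-bucket classification of this kind can be sketched as follows. Only the 'red' label appears in the text above, so the 'green' and 'amber' names, the target values and the `bucket_for` helper are assumptions made for illustration, not Archant's actual configuration.

```python
def bucket_for(page_views, green_target, amber_target):
    """Classify an article as 'green', 'amber' or 'red' against newsroom targets."""
    if green_target < amber_target:
        raise ValueError("green_target must be at least amber_target")
    if page_views >= green_target:
        return "green"  # good: met the main target
    if page_views >= amber_target:
        return "amber"  # OK: met the minimum target
    return "red"  # bad: below the minimum target


# Hypothetical targets set by the newsroom, and hypothetical page views.
GREEN_TARGET = 5000
AMBER_TARGET = 1000
page_views_by_article = {"a1": 12000, "a2": 1000, "a3": 300}

buckets = {
    article_id: bucket_for(views, GREEN_TARGET, AMBER_TARGET)
    for article_id, views in page_views_by_article.items()
}
print(buckets)
```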

Thanks to Nick Cameron, Head of Performance at Archant, for this case study.

Although headline generation is not directly related to the archive, some of the tools and methods are the same, and the content in the archive is needed as training data for the headline generator. In this section, you can read about Axel Springer's experience with auto-generated SEO headlines.

Methodology

The SEO title – which is the title displayed in the Google Search results – is one of the main factors that influence how high the article will rank among the search results. The higher the position in the result list, the higher the click-through rate (CTR) for that search result. Creating the SEO title is a simple but rather time-consuming task for editors. Our experiment at Axel Springer is focused on automatically generating an SEO title for each article, which the editor then just has to review and approve.

An SEO title should be descriptive, intriguing, and contain keywords that are specific to the context of the article, as well as frequently sought for. We decided to split the problem of automatic generation into two separate strands:

First: Generate a title that is basically a very short summary of the article

This is essentially an extreme form of text summarisation. In recent years, summarisation problems were usually tackled by using neural networks with an encoder-decoder architecture. To train such a model, one needs a large amount of article data. Therefore, we extracted 500,000 WELT – one of the main news brands of Axel Springer – news articles from the archives to familiarise the model with the structure of the texts and titles.

Second: Identify keywords that are important & frequently sought on Google Search

When composing the SEO-title, the SEO experts usually come up with context-specific keywords and then pick those with the highest expected search volume. To mimic this workflow we decided to take a two-step approach: First, we extracted the article’s keywords and their context-specific relevance score by applying named entity recognition (NER). Next, we used the PyTrends API to retrieve the expected search volume of the keywords. We sorted the keywords based on an XGBoost ranking trained on articles with good SEO-titles. In a last step, we tweaked the output of the neural network to incorporate the keywords. More information on the technical part can be found here.
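The idea of combining context-specific relevance with expected search volume can be illustrated with a much simpler stand-in than the trained XGBoost ranker described above. The Python sketch below uses a plain weighted average instead. The relevance scores, the search-interest values, the `relevance_weight` parameter and the `rank_keywords` helper are all hypothetical.

```python
def rank_keywords(keywords, search_volume, relevance_weight=0.5):
    """Rank keywords by a weighted mix of relevance and normalised search volume."""
    # Normalise volumes to the 0..1 range so both signals are comparable.
    max_volume = max(search_volume.values(), default=0) or 1
    scored = []
    for keyword, relevance in keywords.items():
        volume = search_volume.get(keyword, 0) / max_volume
        score = relevance_weight * relevance + (1 - relevance_weight) * volume
        scored.append((keyword, round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical relevance scores from a NER step (0..1).
keywords = {"Bundestag": 0.9, "Berlin": 0.6, "coalition": 0.4}
# Hypothetical relative search interest per keyword.
search_volume = {"Bundestag": 40, "Berlin": 100, "coalition": 10}

for keyword, score in rank_keywords(keywords, search_volume):
    print(keyword, score)
```

A learned ranker has the advantage that the trade-off between relevance and search volume is fitted to titles that performed well, rather than fixed by hand as it is here.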

The editors can now access the generated SEO-title along with relevant keywords via a browser plugin, which allowed us to work with the browser-based content management system (CMS) that’s used by journalists at WELT. Once the editor clicks on the SEO-title text field, the plugin pops up and generates a title prediction as well as a list of ranked keywords that could be included in the SEO-title. If the title prediction works fine, the editor can insert it into the CMS with just one click. If it’s of no use at all – which definitely happens from time to time – the editors can use the keyword suggestions as a supporting tool to create a title of their own.

Lessons learned

For the success of the experiment, it was essential to involve the editors in every single step of the project. We found out that the newsroom was mostly sceptical about AI tools. Therefore, we saw the need to first develop a clear understanding of what the product can and cannot do. Our team communicated a lot with the key stakeholders in the newsroom to collect feedback on each iteration and develop trust step-by-step. Finally, instead of having a fully automated SEO-title generator, we created a “human-in-the-loop” AI product that assists the editors to perform more efficiently a complex task they usually dislike. With this solution, the editors are still in control of how the article eventually appears on the web and are more willing to use the tool.

Thanks to Sebastian Maulbeck, Senior Product Owner, Content Intelligence at Axel Springer, for sharing this case study.

On the topic of automating the creation of SEO titles, we also talked with Agnes Stenbom, Responsible Data & AI Specialist at Schibsted, and fellow Collab participant. Here is what we learned from her:

Intentions

We had two key goals in experimenting with generating SEO-headlines through Machine Learning (ML). On the one hand, we wanted to create SEO-headlines for historical articles that didn’t already have an SEO headline (and thereby gain traffic and value from work already done). On the other, we wanted to assist journalists in writing SEO-headlines for future articles. We saw this as a quite clear example of humans and AI working together to create value: journalists would bring their unique skills in creating journalistically relevant headlines, and machines would generate headlines that would be ranked high on search engines. An important note here is that the ambition was never to create and publish the headlines without human intervention. What we sought to do was to create a model that could generate suggestions that our journalists would review directly in the CMS.

Approach

After considering a number of different ML-approaches we landed on a 'stepwise' training process utilising transfer learning. Roughly speaking, our approach was to first teach the model to generate regular headlines and then add training data to make it able to also generate SEO-friendly headlines.

The output of the model was put in a collaborative sheet where an editorial team member with knowledge about SEO would assess each of the generated headlines, either accepting or rejecting them. The results were mixed, with a majority of headlines being accepted. This was exciting – not least as we are working in Swedish, in which NLP/NLG is still at a relatively early stage. However, we had set a very ambitious goal of 80% acceptance in order to take the model further and the goal was not reached.

Lessons learned

We learned a lot from this experiment, and while the specific application(s) we had in mind at the start did not go into production, we still view this process as an important contributor to organisational learning about AI's potential. One of the biggest lessons we learned was the need to include different skillsets early on in the process, preferably already from the very start. Technical and editorial teams should set goals together and then work towards those same goals collaboratively.

3. AI and 'evergreen' content: What solutions do tech companies propose?

Content is the main factor that defines a journalism brand. It is the quantity and quality of the information or entertainment offered that makes the difference, positive or negative, compared to alternatives. All this content is a valuable asset that is kept in the archives, a department that is the memory of a news organisation. In them, we find content that can enrich and complement news, but there are many other opportunities that they can offer beyond support for current content.

The objective we have worked on has been to explore the possible use of AI to exploit this type of content, and to examine whether there are tools on the market that can help. In this report, we strongly suggest that news organisations work with tech companies to look into these needs to help further develop possibilities for online journalism. We have talked to several tool providers and here are some of their insights.

For Richard Benjamins, Chief AI & Data Strategist at Telefónica, a Spanish multinational company among the world's leading telecommunications companies, a solution could be found by following two paths:

The first would be to define what "evergreen" content is (in terms of words, images, video or sound) and then use machine learning to automatically categorise as evergreen the items that meet that definition.
The second would be to train a Deep Learning algorithm on a set of reference documents and then run the complete repository through it.

Both are possible; the question is how well they work in practice and whether they can bring value in a systematic way. In the end, we are talking about a company's knowledge management, a field in which successes are rare even though it is technically achievable. Telefónica, a service provider, has a unit dedicated to Big Data and AI.

Currently, they are not working on projects linked to "evergreen data", but they may become interested in the future if it proves to be an attractive field with a future in the market. To define a valid product, Benjamins considers it important to complete tests with users and to define its explainability and how it would be used in daily life. "The technology is there, it wouldn't be complicated to do", he says.

Narrativa, an AI company specialised in the automatic generation of content, considers that this type of content is not only useful, but is the future, both for news media and companies in general:

"The digital transformation that has taken place in recent years and which has accelerated with the pandemic confirms the absolute importance of digital media, so that a mere online presence is no longer sufficient: it is necessary to be relevant," says David Llorente, CEO and founder of Narrativa.

However, when it comes to generating this type of content, two main difficulties are encountered. The first is that many companies currently invest a great deal of time, money and resources in producing content manually, which is slow and inflexible. Secondly, generating content is not enough: it needs to meet a series of requirements, according to the needs of the medium or company, in order to appear on search engines. At Narrativa, they are already developing this type of technology, combining specific keywords for better SEO positioning.

The tags they use are targeted to very specific searches by users on search engines. In this way, the results are much closer to what potential customers want to find. Recently they have generated car descriptions for a client that has managed to get directly into the top 10 results shown by Google. The tools provided by artificial intelligence, they say, would allow for more varied "evergreen" content and allow journalists to focus on higher value-added tasks.

In a conversation with Danish tech startup Spor.ai, we were advised to put the power back into the hands of the journalist. After letting the AI produce a list of suggestions based on one or a combination of the criteria mentioned above, a set of filters could be introduced.

One way could be to show regular dropdown filters, but as shown below, Spor.ai suggests displaying the calculation as a knowledge graph. The journalist is then able to edit and filter the relations between the entities that define the result on the graph display. This keeps the chosen relations in view, which is harder to achieve with regular filtering.

The application of artificial intelligence techniques offers undoubted advantages in many areas, such as natural language processing, but the problem of identifying evergreen content is potentially complex and difficult to formulate, according to José Manuel Gómez-Pérez, Director of Language Technology Research at Expert.AI.

One assumes that the complexity can be managed by training a model from scratch so that when given a document it classifies it as evergreen or not. An approach like this seems feasible.

However, it faces a variety of challenges, such as generating a large enough corpus of documents and corresponding labeling to train the model. It is technically feasible, he believes, but it requires resources to generate that data set and label it, a task that can involve significant investment of time depending on the volume that needs to be extracted and annotated.

It seems much more interesting, he says, to apply techniques based on pre-trained models that only need to be adjusted for this specific task or to apply approaches based on rules formulated by a knowledge engineer that reflects their understanding of what an evergreen content can be.

At Expert.AI, they have faced similar problems in areas such as the analysis of jihadist narratives or the detection and analysis of misinformation in online media. In their own way, both the narratives and the basic topics on which disinformation is focused are evergreen content intended to capture the attention of their target audience in a timeless way. The optimal solution is to establish an alliance between artificial intelligence and the user it assists, a partnership that results in AI systems that feed on user feedback, offering increasingly better predictions.

CONCLUSIONS

Although we did not end up developing our imaginary ArcAI universal tool, we managed to collect a lot of valuable experiences and knowledge that demonstrates that solutions for leveraging the archives with AI are possible to build – and partially already out there. But we also discovered a range of challenges and constraints. If you want to get started in this field, you should consider the following:

What do you want to achieve?

There is great potential in the archive, but what are the specific needs of your newsroom? There is no reason to develop an advanced research tool if what you need is just to introduce an evergreen tag for a specific type of content or simple cyclic notifications. Different newsrooms have different needs, as well as varying definitions of what evergreen actually means for them.

What effort are you ready to invest?

Since there are very few universal tools available, you should decide for yourself the scale of the tool you need. The more advanced the technical methods are, the more development work it will require. Using Natural Language Processing (NLP), Named Entity Recognition (NER), and Machine Learning (ML) in combination with manual tagging and/or knowledge graph filters, you can get quite precise matches in your article database. But would it be enough to just put a search field in the article template of your CMS, where you can browse the archive with keywords through the Google programmable search engine? What are the criteria that should qualify a good match and how much filtering work will you put in the hands of the journalist?

Think about data hygiene.

When working with your archive, it is critical to have good consistency and structure in your database and metadata. The better the structure, the easier it will be to leverage the database with the use of AI tools.

Is your team with you on this idea?

To implement a functioning tool like this – whether it is based on manual tagging systems, explorative methods, or any other technology – you will need some internal diplomacy in your organisation to be able to make use of your personal ArcAI. Newsroom culture is often a constraint. If the tool is too complex for your journalists to be motivated to use it, the tool is pointless. If the journalists are not consistent, for example, in their tagging of evergreen, having a great tool is pointless. Media organisations often prioritise investments in technology for business intelligence and on the business side rather than on content production. Prepare your arguments for the value your tool can provide in the long term, and get your team of journalists on board.

What is technically possible in your language?

If you decide to use some of the technologies on the market – such as Parse.ly or Chartbeat, for example – be aware of how their algorithms are trained. Most often, the tools available are considerably better in English than in other languages. In many cases, it makes sense to train your tool on your own archive to get the best matches.

Having answered these questions, you will discover a number of opportunities in the archive for your newsroom:

Notifying journalists when earlier content is reappearing in search engines – or if an evergreen article is relevant for the time period you are entering;
Suggesting the most relevant related stories from your collection of evergreens;
Getting more SEO value;
Reusing elements from earlier content to create timelines or other formats, and much more.
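The cyclic notification in the first item of the list above can be sketched in Python as follows. It assumes each evergreen article has been manually tagged with a recurring month and day that exists in every year. The `recurs_on` field, the article IDs and the 14-day window are hypothetical.

```python
from datetime import date, timedelta


def upcoming_evergreens(articles, today, window_days=14):
    """Return IDs of evergreen articles whose recurring date falls within the window."""
    hits = []
    for article in articles:
        month, day = article["recurs_on"]
        # Try this year's occurrence first, then next year's if it has passed.
        for year in (today.year, today.year + 1):
            occurrence = date(year, month, day)
            if occurrence >= today:
                break
        if occurrence - today <= timedelta(days=window_days):
            hits.append((occurrence, article["id"]))
    # Soonest occurrence first.
    return [article_id for _, article_id in sorted(hits)]


# Hypothetical evergreen articles tagged with a recurring (month, day).
articles = [
    {"id": "christmas-recipes", "recurs_on": (12, 24)},
    {"id": "new-year-resolutions", "recurs_on": (1, 1)},
    {"id": "tax-return-guide", "recurs_on": (3, 31)},
]

print(upcoming_evergreens(articles, date(2020, 12, 20)))
```

Checking next year's occurrence as well means that articles tied to early January are still surfaced in late December.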

You can develop the tools in-house or join forces with a tech company. Firms like Chartbeat and Parse.ly, and obviously Google, have the ability to train their tools on millions of articles worldwide through their customers. Perhaps the main result of our team’s work in the JournalismAI Collab is a call for tech companies to get involved and join forces with the media industry to develop accessible tools to bring life to already-published content and help put the power of the archives in the hands of journalists.

Kristoffer Hecquet

Head of Development, Altinget (Denmark)

Connect with Kris on LinkedIn

Florencia Coelho

New Media Research and Training Manager, La Nación (Argentina)

Connect with Flor on Twitter

Sophie Casals

Head of Digital Transformation, Nice-Matin (France)

Connect with Sophie on LinkedIn

Nick Cameron

Head of Performance, Archant (UK)

Connect with Nick on LinkedIn

David Corral

Head of Innovation, RTVE (Spain)

Connect with David on LinkedIn

Padraic Cassidy

Editor, Automation & News Technology, Reuters

Connect with Padraic on Twitter

Momi Peralta

Data Project Manager, La Nación (Argentina)

Connect with Momi on Twitter

Melissa Stevens

Digital Editor, South China Morning Post (Hong Kong)

Connect with Melissa on LinkedIn

Tim Nonner

Chief Data Scientist, TX Group (Switzerland)

Connect with Tim on LinkedIn

Sebastian Maulbeck

Senior Product Owner, Content Intelligence, Axel Springer (Germany)

Connect with Sebastian on LinkedIn

Characterization and Early Detection of Evergreen News Articles

In this article, Shuguang Wang, senior data scientist at the Washington Post, explains how the Washington Post has analyzed the characteristics of evergreen articles and has developed a model to automatically identify them as evergreen articles. More technical details can be found here.

How Le Temps wants to give a second life to its evergreen stories with Zombie

An interesting focus on an initiative promoted by the Swiss newspaper Le Temps. The idea: create a database that, over time, will hold thousands of articles of interest that could be republished. These could then be used to generate new readers. "Good articles remain good articles and deserve a second chance," explains Jean Abbiateci.

How the Financial Times identifies popular historic content for resurfacing

A dashboard shows editors which older articles are popular with readers. On average, the articles flagged by the tool and re-promoted have seen more engagement than the FT's average Facebook posts, getting three times more clicks.

Evergreen Content: What It Is, Why You Need It and How to Create It

A very complete article with a definition of what evergreen content is, as well as some basics and key points about identifying and creating evergreen content. It also has tips on how to use it and its importance in metrics, content marketing strategy, and how to maintain evergreen status of content.

Tips from WIRED’s evergreen content strategy

WIRED's Director of Audience Development details their strategy on how to get the best results from the content they have, with an archive that spans nearly three decades of activity. The article describes how, for Wired, creating a successful evergreen content strategy was based on two things: first, finding the evergreen stories, and second, doing something with them.

What is evergreen content?

An explanation of what evergreen content is, how to differentiate it from other content and how to work with it in the best way. This article also covers how evergreen content has distinct benefits over content where relevance fades over time and details why evergreen content is important for SEO. The article includes evergreen content formats and examples.

How publishers can write SEO-friendly evergreen content

In this article the authors highlight the importance for publishers of not having just one source of content and traffic - they consider creating SEO-friendly evergreen content essential to driving more traffic and ad revenue. It has a list of recommendations for writing SEO-friendly evergreen content.

Why Evergreen Content Still Matters

For the author the key is the relevance of SEO and how, in a changing world, evergreen content is one of the best contributors to SEO efforts. For this reason, he proposes different actions to ensure that this content has real and lasting value.

To find photos in our archive, we taught the CMS how to read

New York Times initiative with facial recognition and NLP to help the CMS find photos from the archive.

How The Wall Street Journal turns an internal reporting tool into a reusable news product

Wall Street Journal project to offer their audience a candidate's quotes and facts tool designed for political coverage and elections.

Learn more about the Collab and explore the work of the other teams