I, Reporter
Illustration by Matt Daley

How “robojournalism” can save local news


[su_dropcap style=”simple” size=”10″]T[/su_dropcap]wo clicks were all it took. Andrew Lundy, vice president of digital production for The Canadian Press (CP), gazed at his computer screen, amazed by what he saw: proof of a new direction for journalistic innovation, and a big step towards maximizing both output and efficiency.

It was a normal work day in early September 2017. Lundy was in his office at 36 King Street East, home of CP’s Toronto newsroom. Earlier, he’d received a message on Slack. “I got something cool I want to show you,” the message read. “Do you have some time today?” It was from Lucas Timmons, a digital journalist at CP who had been curious about journalistic data initiatives and automated text formulation for local news stories.

Shortly after Lundy responded with a casual “yeah sure,” Timmons walked in. On Lundy’s desktop monitor, he pulled up a piece of software he had recently finished coding. The interface seemed simple—there was a text box in the middle with one button above and another below. Timmons asked Lundy to open a recent Ontario Hockey League (OHL) box score and copy and paste the URL into the text box. He told him to click the “Generate story” button, then the one that read “Submit.” The screen loaded. Those two clicks produced what Lundy would later describe as magical.

At the time, covering junior hockey in Canada was challenging for CP. The number of games throughout the country was simply too high for their small staff of sports reporters—all of whom, Timmons explains, were too qualified to merely recap game statistics found online. “No one got into journalism, as far as I’m aware, to write hockey stories off of box scores,” he says. “It’s just not what you’re thinking of when you sign up.”

Timmons recognized that their clients, particularly local news organizations throughout the country, required junior hockey coverage, but CP wasn’t sure how to meet the demand. Their news team had no plans for automating the process. Until then.

Lundy watched words appear on the screen. “Out comes this text story that was like six paragraphs long,” Lundy remembers. The words recapped the recent OHL game details, describing who won, who scored, penalty times, and shots on net. The text followed the inverted pyramid structure; it read like a typical sports story.

But the software did have some flaws. “If a human had written it, I would have said it needs to be edited a bit—which is fine, because it’s a prototype; it wasn’t ready for prime time.”

Timmons had coded the algorithm to follow CP’s style guide and generate simple text based on datasets (collections of digital data) available online. He coded data scrapers that transfer information to a system of templates. (Data scrapers are tools that extract raw information and make numbers readable.) The text generator adapts to specific scenarios. It knows, for instance, to always spell out a number if it is the first word of a sentence, to exchange verbs depending on the score—“defeated” could be changed to “trounced” if the score gap is wide enough—and to point out numerical anomalies, such as if a player scores a hat trick, or if a goalie manages a shutout.
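A minimal sketch in Python suggests how a rule-based generator of this kind might work. The function names, fields, and thresholds below are hypothetical illustrations, not CP’s actual code:

# A minimal, hypothetical sketch (not CP's actual code) of a rule-based
# recap generator: scraped box-score numbers go into templates, verbs are
# swapped by score margin, leading numbers are spelled out, and anomalies
# such as hat tricks and shutouts are flagged.

NUMBER_WORDS = {1: "One", 2: "Two", 3: "Three", 4: "Four", 5: "Five",
                6: "Six", 7: "Seven", 8: "Eight", 9: "Nine"}  # truncated for brevity


def verb_for_margin(margin):
    # Swap the verb depending on how lopsided the final score is.
    if margin >= 4:
        return "trounced"
    if margin >= 2:
        return "defeated"
    return "edged"


def spell_out(number):
    # Style rule: spell out a number that begins a sentence.
    return NUMBER_WORDS.get(number, str(number))


def recap(box):
    margin = box["winner_goals"] - box["loser_goals"]
    sentences = [
        f"{box['winner']} {verb_for_margin(margin)} {box['loser']} "
        f"{box['winner_goals']}-{box['loser_goals']} on {box['date']}.",
        f"{spell_out(box['winner_goals'])} goals for {box['winner']} came on "
        f"{box['winner_shots']} shots, while {box['loser']} managed "
        f"{box['loser_shots']} shots of its own.",
    ]
    # Point out numerical anomalies: hat tricks and shutouts.
    for player, goals in box.get("goal_totals", {}).items():
        if goals >= 3:
            sentences.append(f"{player} recorded a hat trick.")
    if box["loser_goals"] == 0:
        sentences.append(f"{box['winner_goalie']} earned the shutout.")
    return " ".join(sentences)


if __name__ == "__main__":
    # Invented box score, for demonstration only.
    sample = {
        "winner": "Ottawa", "loser": "Kingston",
        "winner_goals": 5, "loser_goals": 0,
        "winner_shots": 38, "loser_shots": 21,
        "winner_goalie": "Smith",
        "goal_totals": {"Jones": 3},
        "date": "Saturday",
    }
    print(recap(sample))

Run on that invented box score, the sketch prints a plausible two-sentence recap followed by a hat-trick and shutout note; a production system would layer on many more style rules and templates.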

This technology wasn’t completely unheard of at the time. “Natural language processing,” an umbrella term for automated language interpretation, formulation, or general manipulation, had already appeared on the radar of a handful of newsrooms in North America and Europe. But this “Hockey Bot,” as it would eventually come to be called, offered a specific solution to CP’s problem: it could cover junior hockey, freeing the sports journalists to report more substantial stories. As Lundy and Timmons marveled over the possibilities, other newsrooms were developing similar plans.

[su_dropcap style=”simple” size=”10″]I[/su_dropcap]n December of that year, PA Media (formerly called Press Association), the U.K.’s CP equivalent, conducted a journalistic experiment through their automated news service, RADAR. Their plan was to release 30,000 news stories, through 20 regional newspapers across the U.K., that were entirely produced by algorithms. These stories were to be rooted in statistical information, featuring birth trends from the Office for National Statistics, based on natal registration reports across the U.K. The project was funded by a €700,000 (around $1 million CAD) grant from Google’s European Digital News Initiative, which, in its first year, awarded more than €50 million to 252 projects in 27 European countries to help develop innovative approaches to digital news production. According to an email from Alan Renwick, the director of RADAR AI Ltd., the automated news service has produced over 280,000 stories since launching in 2018.

Automated text generation of this nature has played a major role in journalism throughout the latter half of the last decade. In the last few years, Bloomberg News produced thousands of news articles based on financial reports, using an automated content program called Cyborg to maximize and accelerate its business reporting. Forbes uses a program called Bertie that automatically provides reporters with first drafts of news stories. The Los Angeles Times uses an artificial intelligence news generation program called Quakebot to report earthquake alerts from the U.S. Geological Survey. In 2016, The Washington Post published around 850 articles that were completely computer-generated, and in 2018 it received the award for “Excellence in Use of Bots” at the Global Biggies Awards.

Andrew Cochran, former head of strategy for CBC News, believes this rise of “robojournalism” could lead us far beyond sports game coverage.

Cochran has always been interested in the relationship between technology and journalism. In May 2019, he launched a website called JournalismAI.com, intended to track advances in artificial intelligence and discuss what they could mean for the future of media. His fascination with technological development dates back to the early 1960s, when he remembers sitting cross-legged on his living room floor, a little boy mesmerized by the black and white image on a cathode-ray tube TV. “Humans were leaving the earth for space and I could watch it on television, live,” he wrote in the introductory post for his website. Years passed, that little boy grew, and Armstrong walked on the moon. Then, almost 50 years after Apollo 11, Cochran started reading about the role of algorithms in the distribution, verification, and production of news. He wasn’t sure if it was another giant leap for mankind.

Cochran’s site has over 350 curated articles, with news conference videos, major reports, and weekly updates on the world of AI.

He says a recent newsroom revelation is that rule-based AI is able to “augment [journalists’] work and take on some of the tedious aspects,” but, beyond the structured simplicity of formulaic weather reports or financial summaries, he also says text-generative algorithms have the potential to assist long-form investigations. “A few investigative projects have used machine learning to go through mountains of documents and try and see patterns…and help identify things that perhaps you wouldn’t have been able to find before,” he says.

According to a survey conducted by the Reuters Institute for the Study of Journalism, approximately 72 percent of 194 leading editors, media company CEOs, and digital leaders are “planning to actively experiment with artificial intelligence to support better content recommendations and to drive greater production efficiency.” However, AI is an ambiguous term, and the definition has changed almost as rapidly as the technology has evolved.

“Historically, there’s no real, precise definition [of artificial intelligence] because there’s no real precise definition of intelligence,” says Cochran. “It’s kind of our perception of what intelligence is. And so, likewise, AI is more a perception. It’s the illusion of intelligence.”

That illusion is the result of varying methodologies. The method Timmons used to build his Hockey Bot, for instance, is founded on a set of programmed rules and outcomes. It is limited by what it is coded to do—structured by designed templates—and will only function based on predetermined settings. There are, however, natural language generators that have the capacity to discover options beyond their initial programming, thanks to an enormous capacity for trial and error and an ability to learn from interpreted patterns. In simpler terms, some programs are able to formulate sentences without human control.

As far as we know, deep-learning automated text—the type that writes beyond programmed templates—hasn’t positioned itself in many mainstream newsrooms. Yet, as the technology advances and the illusion of intelligence becomes more convincing, perhaps the possibility isn’t far away.

[su_dropcap style=”simple” size=”10″]I[/su_dropcap]n February 2019, OpenAI, a non-profit artificial intelligence organization co-founded by Elon Musk, announced that it would be delaying the release of a machine-learning automated-text generator called GPT-2. As OpenAI explained on their site: “Due to our concerns about malicious applications of the technology, we are not releasing the trained model.” According to a report released February 14, 2019, the company “trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.”

Following the report, OpenAI slowly released three restricted versions of the model. Then, on November 5, 2019, they launched the full-sized program.

By analyzing 40 gigabytes of web text—over eight million pages linked from upvoted Reddit posts—the program doesn’t just emulate language; it creates voice. But does that voice have the capacity to produce journalism?

[su_quote cite=”Andrew Lundy”]If a human had written it, I would have said it needs to be edited a bit—which is fine, because it’s a prototype; it wasn’t ready for prime time [/su_quote]

As of now, Cochran doesn’t think so. “The reservation I have about saying how [GPT-2] could be directly applicable to journalism is that, right now, it’s getting better and better at, in essence, creative writing.” The language model operates without any foundation of fact or reality, he says. “It can be remarkably readable, at least the initial paragraphs. It kind of degrades over time.”

Some websites allow users to test GPT-2’s abilities. Talktotransformer.com, created by a Toronto-based machine learning engineer named Adam King, is one of them. I decided to try it out. Excited, and oddly a little nervous, I dragged my cursor onto the centre textbox, where a prompt read: “Type something and a neural network will guess what comes next.”

I typed in the lead of this article—“Two clicks were all it took”—then hit the blue button that read, “Complete Text.”

Seconds later, the following appeared below:

Two clicks were all it took…and we were able to get into a conversation with the locals, who made it easy for us to book a scenic out-and-back on that specific day. As we hopped out of our mountain buggy, one of the locals informed us that he was feeling a little lightheaded and advised us to drive him up the 5 mile round trip. The kind of folks at the hotel were great, and took us back to the room where our next morning’s dinner was waiting.

Aside from some obvious leaps in logic—a mountain buggy is a baby stroller, and dinner isn’t normally served in the morning—the machine seemed to carry the style and flow of a travel blogger. It also took less than three seconds to start generating what might’ve taken a human five minutes or longer. For now, the program seems to structure sentences without any understanding of what its verbs and nouns actually mean. Should journalists be worried that this machine might one day be able to do what they do, better and faster?

“In terms of using [GPT-2] to be a sophisticated reporter, or to accurately write a feature story or something…I think that’s going to depend more on what the underlying training data that’s available,” says Cochran.

Angus Frame, senior vice president of digital product development at Torstar Corp., says there’s no reason to believe technology like GPT-2 and rule-based text generation won’t eventually become more integrated into the media industry. For now, he says, advanced tech is a productive collaborator—not a potential competitor.

“I would expect that technology will continue to help journalists the way it has been, probably, for the past 25 years,” he says. “I mean, I’m pretty sure when you write you are seamlessly supported in terms of grammar and spelling and other things….[A]ll of that will continue and grow. So the idea that any walk of life will be increasingly supported by technology is almost a truism. There are very few jobs in which technology plus human beings don’t make for a more successful combination.”

Frame says Torstar has been tackling a problem similar to CP’s, and coming up with similar results. Like Timmons and Lundy, Torstar hopes to cover more Canadian Hockey League (CHL) games for community publications. In 2018, it partnered with Automated Insights, a U.S.-based natural language generation company that, according to its website, “uses clear, natural language to make sense of the world’s data.”

Frame says Torstar channels its data to Automated Insights from a company called HockeyTech, which builds data feeds for every junior hockey game played in Canada. Torstar licenses those data feeds for $100 a month. The templates designed through Automated Insights automatically publish the content to clients’ websites. Unlike CP’s process, Frame says, this method has no human involvement at all. “[The content] is never seen by an editor before it’s published.”

Frame also says Torstar discloses how the words were generated in a disclaimer above each article; doing so is part of their standard practice of transparency. He says, though, it is possible for the text to pass as human-written.

Introduced in 1950 by Alan Turing, a computer scientist, mathematician, and theoretical biologist, the “Turing Test” is a method of evaluation meant to determine whether or not a machine can pass as a human through written language. The basic premise is to have a human participant ask a machine and another human a series of questions through terminal screens. Based on the responses, the questioner must then decide which came from the computer and which came from the other person. According to the Stanford Encyclopedia of Philosophy, the test derived from the philosophical question “Can a machine think?”—a question Turing felt unworthy of discussion, but one which inspired new conversations about a machine’s ability to imitate thought. Believing that digital computers would become increasingly capable of excelling in what he dubbed “The Imitation Game,” Turing designed the test not only to measure a machine’s ability to appear intelligent, but a human’s inability to detect artificiality.

In a way, Torstar deployed a version of this evaluation in 2018, when it tested its automated text stories on ten randomly selected people. The participants were gathered by a survey service called UserTesting and were shown multiple text stories—some written by humans, some by machines. They were asked to identify which stories they preferred, and whether they could notice a difference. Torstar spoke to the participants one by one through video calls.

“Generally, people couldn’t tell the difference,” says Frame. “A few people liked the quotes and things that were inserted into the human-written articles, so they liked the colour better, and a few people liked the automated [text] better because it was more straightforward and to the point.” After talking to these participants, Frame believes that automation could become increasingly integrated in the media landscape.

[su_dropcap style=”simple” size=”10″]M[/su_dropcap]ost Canadian newsrooms, however, only started realizing this potential for automated text stories in the last few years, and they were, by comparison to their American neighbours, late to the party.

Lisa Gibbs began looking at machine-generated content as early as 2014, when she first started working at The Associated Press (AP)—the American equivalent of CP—as their business news editor. The previous year, AP connected with Automated Insights, the same company that later partnered with Torstar in 2018.

“News organizations cannot afford for their curious, creative, intelligent journalists to be spending time on what’s essentially very routine commodity work,” says Gibbs, who is now the director of news partnerships for AP.

Unlike Torstar and CP, which began implementing automated text formulation thanks to a demand for more hockey coverage, AP began considering machine-generated prose to expand the reach of its corporate reporting.

Gibbs says AP started recapping profit breakdowns in the form of automated earnings stories in the third quarter of 2014. AP considered it such a success that “we gradually expanded the number of earnings stories that we produce,” she says.

The expansion was substantial. Gibbs explains that since 2014, AP has gone from writing 300 earnings stories each quarter to automating 3,700. But maximizing the volume of reporting was not AP’s motivation. “We really wanted to be able to refocus our beat journalists on more interesting stories. And that’s what happened.”

With more time cleared for its business reporters, AP was able to put the numbers aside and tell the stories that needed a more human touch than automation could provide. Gibbs believes this implementation transformed AP’s corporate coverage. “Earnings may simply fold into that larger thematic story. But, as a desk, we don’t want to be married to this over-covering earnings mentality, which inevitably leads you to think smaller.”

Thinking bigger has guided AP to a new business philosophy. “My belief is that leveraging these technologies is an efficiency play. But it can be incredibly transformative for your overall coverage strategy,” says Gibbs.

After its success automating earnings stories, AP developed a data distribution program to help expand coverage in new ways. Most recently, this has extended to education reporting.

In 2019, AP partnered with Newsday Media Group, a news organization in Long Island, New York, to help advance data-driven reporting for its education team. Gibbs says AP developed automation templates for school board election coverage, and also helped produce 124 personalized profiles—one for each school district in Long Island.

“[Newsday] would never ask one of their reporters to write 124 basic profiles of school districts—[including] graduation rate and how they scored on the standardised test. But if we could automate that and they could have it on their site, then parents would be able to easily find information about their specific community.”

In Canada, similar ideas were forming, and a bot designed for hockey coverage was about to become a national correspondent.

[su_dropcap style=”simple” size=”10″]O[/su_dropcap]n October 10, 2019, at 11:22 a.m., back at the Canadian Press office in Toronto, Lundy was in a project management meeting, between two agenda points, when he received an email from a Google representative. “After carefully reviewing your application, the GNI innovation challenge jury are delighted to inform you that your project has been selected for funding,” it read. Lundy stopped the meeting. “Hey guys! We just got the GNI!” he announced. The Google News Initiative is the North American equivalent of Europe’s Digital News Initiative, intended to financially aid innovative approaches to journalism, with contributions of up to $300,000 (U.S.)—just over $400,000 CAD. Thousands of news teams across Canada and the United States applied for the funding. The other three Canadian organizations selected were Torstar, Village Media (a digital producer of local news in Ontario), and Earbank (a digital audio archive service).

“We’ve got some pretty cool ideas at CP. The biggest problem we’ve always had was just the funds, the capital to actually make it happen,” says Lundy. With Google News’ financial support, CP would be able to continue the innovative momentum that had been building since that day in 2017, when Timmons first unveiled his Hockey Bot.

A lot has happened in the last two years. CP’s success with automating the coverage of CHL games has inspired further experimentation with census stories. For the most recent federal election, Timmons designed a bot that automatically generates riding profiles of up to 700 words. Unlike the Statistics Canada website, which Timmons says can be difficult to maneuver, CP’s data scraper and text generator make community-specific statistics more readily accessible, including options to compare data and detect numerical patterns. With modifications to story templates, Timmons was able to improve the natural language programming and save hours of time.

“One of the things I’m trying to do is diversify how we’re making money,” says Timmons. “I think all media companies need to consider that.” He believes that if he can automate ten hours out of a writer’s week, they can devote that free time to working on new projects. “We still get the content out there, it’s still high quality, it’s still [looked] over by journalists before it ever goes out the door, but it also allows us to do more things,” says Timmons. “We need that right now. Everyone sort of needs that now.”

As automation proves more efficient—and, therefore, more lucrative—ideas for localizing community stories continue to hit the drawing board. And, in 2018, the same year Google announced their North American funding initiative, local journalism in Canada needed all the help it could get.

The Postmedia-Torstar deal, which saw the companies swap more than 40 community publications, led to numerous local papers shutting down. One of these papers, The Kanata Kourier-Standard, had long ago given Timmons his introduction to news distribution. “My first job that I had was actually delivering the Kanata Kourier-Standard,” he recalls of his days living in Kanata, Ontario, a suburb of Ottawa. Once a week, a stack of papers would be dropped in front of his house, and he would walk around his neighbourhood leaving copies on people’s porches.

“I was 13 or 14,” says Timmons. “My mom loved reading [the paper]….It gave a nice little look at the community. It filled what was needed—they were keeping the community informed about things….I don’t know how financially successful it was, but I know people liked getting that [paper] and now they don’t get it anymore.” Timmons says automated text generation can allow nuanced coverage for communities currently unrepresented in the aftermath of the Torstar-Postmedia swap.

“One of the really great things about CP is we’re working on providing news all across the country, and if we can start providing [automated data] to the few papers that still exist in these communities and help them out, that would be pretty wonderful.” Much like his first job, Timmons hopes to deliver news to communities that need it. And he’s got a plan for doing it: the Digital Data Desk, or as Lundy calls it, “The D3.”

With the incoming funding from Google, CP plans to have a desk devoted to maximizing digital outreach through automation. Timmons says the desk would include himself, another developer, a data librarian, and a reporter. The plan would be to find datasets through web scrapers or open data portals and automate as many stories as possible. “We can tell local stories for local markets,” he explains.

Timmons also emphasizes that the technology would not replace the need for human journalists; rather, it would cater content to areas that aren’t getting local coverage otherwise.

As an example, he mentions the benefit of running unemployment and inflation rate stories, narrowed by province and then by smaller demographics. “The problem is we have a reporter in Ottawa who does that, but they can’t write 13 different stories, plus the national story. So [this way] they could write the national story and we’ll use the data to generate provincial ones that we could send to our clients.” With the data available, CP can narrow its focus to any chosen location. “There’s so much that never gets talked about and never gets seen….We could do it by [census metropolitan area], municipality, right down to the census tract if you have that sort of data.”
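What that provincial breakdown might look like in code is simple enough to sketch. The snippet below is a rough, hypothetical illustration—the dataset, figures, and template are invented, not CP’s—showing how one table and one template can yield a localized brief for every region:

# A rough, invented illustration of the "one national story, many local ones"
# idea: given a small table of figures, a single template yields a short,
# localized brief for each region. The numbers and wording are placeholders,
# not CP data or CP output.

import csv
import io

SAMPLE_CSV = """region,rate,previous_rate
Ontario,5.3,5.6
Quebec,4.8,5.1
Alberta,7.0,6.8
"""

TEMPLATE = ("{region}'s unemployment rate {direction} to {rate} per cent "
            "last month, from {previous_rate} per cent the month before.")


def localized_briefs(csv_text):
    # One templated sentence per row; the same approach scales to
    # census metropolitan areas or census tracts if the data exist.
    for row in csv.DictReader(io.StringIO(csv_text)):
        direction = "fell" if float(row["rate"]) < float(row["previous_rate"]) else "rose"
        yield TEMPLATE.format(direction=direction, **row)


for brief in localized_briefs(SAMPLE_CSV):
    print(brief)

The national reporter would still write the analysis; the loop simply fills in the local numbers for each market.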

Across the Atlantic Ocean, this method has proved successful. In October 2019, Cochran walked into the PA Media newsroom in London, England. Cochran, whose visit was part of his research on the integration of AI in newsrooms, noted the “deceptively simple” layout of the room. There were numerous desks with reporters sitting on benches—about three or four people per bench. The RADAR desk, he says, had about six people, and they were “churning out an average of 3,000 [automated] stories a month.”

Cochran says these stories weren’t “peripheral stuff or specialist stuff like sports scores or earnings results.” The content included “very real, pertinent, relevant stories.” He also observed a lack of AI consultants in the room. “They are doing their work without any data scientists or computer experts on staff,” he says. “This is real journalism about a locality in the U.K. All of it’s being done in this combination of journalists and machines.”

Cochran points out that a major consideration, when it comes to any aspect of automation, is that “you can’t look an algorithm in the eye.” As language models like GPT-2 accelerate, it will be increasingly difficult to know who is really talking, he says.

[su_dropcap style=”simple” size=”10″]O[/su_dropcap]ver the phone, a few months after his visit to London, Cochran speaks to me about the early hesitation and skepticism he noted among fellow journalists in the late 1990s towards the rapidly emerging technology known as the Internet. He says there’s no denying that AI will change journalism as much as, or more than, the expansion of the World Wide Web did in the early 2000s.

Near the end of our conversation, Cochran turns a question my way.

“What AI stuff is on your curriculum?” he asks.

“There’s not,” I tell him, though I note Ryerson University does offer optional data journalism courses. I ask him what he thinks of that.

He tells me he is a collector of quotes, and leaves me with something a drug discovery researcher named Derek Lowe told The New York Times in 2019: “It is not that machines are going to replace chemists. It’s that the chemists who use machines will replace those that don’t.”

Later, lingering over the GPT-2 trial service on my laptop, I typed “Once upon a time” into the text box, then clicked “Generate Text.”

The screen loaded: “Once upon a time…astronomers estimated the distance of the edge of our solar system, which is just 4.5 million kilometres away,” GPT-2 wrote back.

I was anticipating a more poetic, applicable ending from this advanced neural network—maybe something that would tie this story together, offhandedly, in a way that would leave Alan Turing scratching his head, something that would remind us that even the writers—the creators of prose, the challengers of ideas, the storytellers of truth, and fiction—aren’t safe from the impending threat of becoming irrelevant.

But, at least for now, the human gets the last word.

 

 

RRJ Test Drive: See the Results Online

The RRJ fed the GPT-2 model, an advanced text generator, 60 RRJ articles to see if it could mirror the collective voice of our magazine. The results we found were troubling—from using real names and attributing fake quotes, to generating nut graphs that mention real organizations with fabricated statements, the model may have opened a new doorway to defamation and disinformation. To learn more about the experiment, the ethics behind it, and to read what the model generated, listen to Series 3, Episode 10 of Pull Quotes: “The Threat of Deep Fake Text Generation.”
