I first connected with Artem Bugara back in September of 2020, when he sent me the following note:
Hey Allison,
We’re collecting news articles published online and provide an API for Enterprises to get this data for insights mining.
We do our sales via [redacted] at the moment. But, we’re looking forward to know more about your platform. I found you in the YC directory. Congrats on the launch!
Regards,
Artem
Artem and I became fast friends after that. Everyone reads the news, every company needs to know the news, and from my former life in finance, I knew that news data was especially important for event-driven trading strategies and special situations groups. In the past few years the sheer quantity of news data online has exploded, and much of that data is unstructured. How do you take a news article and turn it into something data science teams can use? How do you take a million news articles and make any sense of them?
Artem is a data engineer by trade, and he built an in-house news parser for his job at an insurance startup in France. He decided that he wanted to bring that experience to the entire web, making a generic parser that could handle 100,000 news sources with new releases published every second. To help build his company, he brought in his old schoolmate, Maksym Sugonyaka. Working out of the HEC incubator at Station F in Paris, NewsCatcher launched in 2020 and is already on the second version of its product, which covers 5 years of historical data and can handle more than double the volume of ongoing news updates that the first version could. Here’s more from Artem:
Tell us about your data.
We help companies get industry-specific news article datasets for analysis. This data helps them better understand market trends, risks, and opportunities so that they can gain a competitive advantage. Our core technology allows us to crawl news websites (Reuters, The New York Times, BBC, etc.), identify news articles, and extract data points (title, published date, author, text, etc.) in a generic way: it does not depend on a website’s structure. At the moment, we index over 1,000,000 articles/day, and we can add specific news feeds on demand.
Why did you focus on this category?
I used to work as a lead data engineer at an insurtech startup for the aviation industry. We gathered data from different sources. We could not find any news data provider for our specific use case: they were all about providing very broad news coverage. We ended up building our own very limited version in-house. That was expensive, hard to maintain, and poor quality. Before NewsCatcher, Maksym and I had already worked with news data and were experienced with web scraping. We understood that we could build a generic solution able to find, extract, and normalize news articles from any news website. Companies need our service because they must analyze as much information as possible to stay competitive. It is a question of survival for them. We give them the right data at the right time.
What makes your data valuable?
Our data is aggregated from over 100,000 news websites. For every online-published news article we extract:
title
article text
published time
authors
tags
URL
publisher
Each article is also enriched with information about its news source:
country
general topic
page rank
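To make the shape of such a record concrete, here is a minimal Python sketch of how the fields listed above might be modeled. The class names, field names, types, and the `by_country` helper are all assumptions chosen for illustration; they are not NewsCatcher's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class NewsSource:
    """Source-level enrichment (hypothetical field names and types)."""
    country: str
    general_topic: str
    page_rank: int


@dataclass
class Article:
    """One extracted article (hypothetical field names and types)."""
    title: str
    text: str
    published_time: datetime
    url: str
    publisher: str
    source: NewsSource
    authors: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def by_country(articles: List[Article], country: str) -> List[Article]:
    """Keep only articles whose source is tagged with the given country."""
    return [a for a in articles if a.source.country == country]


# Made-up example values, for illustration only.
example = Article(
    title="Example headline",
    text="Example body text.",
    published_time=datetime(2021, 1, 1, 9, 30),
    url="https://example.com/news/1",
    publisher="Example Publisher",
    source=NewsSource(country="FR", general_topic="business", page_rank=100),
    authors=["Jane Doe"],
    tags=["aviation"],
)
```

With records in this form, a data science team could filter by the source-level enrichment fields before doing any text analysis, for example `by_country([example], "FR")`.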
What types of customers get the most value from your data?
Financial institutions, market researchers, brands, consulting companies, any business that has to keep an eye on the competitors, market, and trends.
What questions do you get most often from prospective buyers?
1. How much historical archive do you have?
Answer: We're launching our V2 in May, and it will have 5 years of historical news data.
2. What's the latency between an article being published and NewsCatcher serving it?
Answer: A few minutes. We have to constantly re-crawl news websites to find new URLs. Still, if some websites are particularly valuable to you, we can make sure to get new content from them ASAP.
3. Can you tell us how many people have read each article?
Answer: No
Check out NewsCatcher’s data storefront and search for a topic of interest or subscribe to the daily news feed.
If you’re a data provider, say hello at data@getsyndetic.com.