Easy access to data has changed the way businesses operate. Alongside traditional data, companies increasingly use ‘alt-data’: alternative data drawn from unconventional sources such as the web, customer support transcripts, sensors and satellite images.
Jennifer Belissent, Principal Analyst at Forrester, explains it well:
We all want to know something others don’t know. People have long sought “local knowledge,” “the inside scoop” or “a heads up” – the restaurant not in the guidebook, the real version of the story, or some advanced warning. What they really want is an advantage over common knowledge – and the unique information source that delivers it. They’re looking for alternative data – or “alt-data.”
Since the web is a continuously growing source of data that covers every industry, it is a significant contributor in the ‘alt-data’ space. Web data has a particularly strong impact in financial services, an industry that is changing rapidly. That is why most of the leading financial information companies crawl the web to aggregate and analyze data that helps them build robust solutions.
How web data is extracted
There are primarily three options when it comes to web data extraction:
- Do it yourself (DIY) tools
- In-house crawlers
- Managed services
So, how do you select the right data extraction method? It depends on the use case. As a rule of thumb, first answer the following fundamental questions:
- Do you have a recurring (daily/weekly/monthly) web data requirement?
- Can you allocate a dedicated engineering team that can build crawlers exactly as per your requirements and maintain them to ensure a steady flow of data?
- Will the volume grow significantly over time, requiring a highly scalable data infrastructure?
If the answer to the first question is ‘no’, i.e., your company does not need data at a regular frequency, then a DIY tool is the better choice. The initial learning curve can be steep, but this option gives you a pre-built solution. Note that if the data volume is too high for a tool to support (even for a one-time requirement), the project can be outsourced.
If the use case entails frequent web crawling and you cannot dedicate a team to building a scalable data extraction infrastructure, you can engage a fully managed service provider. The provider would typically build custom web crawlers for each target site and deliver clean data sets exactly as per the requirement. This lets you focus entirely on applying the data instead of worrying about the data acquisition layer.
In-house web crawling gives you complete control over the project, but it requires skilled engineers to maintain the data feed at scale (millions of records on a weekly or daily basis). Dedicated resources are a necessity because websites change their structure frequently, and the crawler must be updated each time to keep extracting the exact data points.
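The selection logic described above can be sketched as a small decision function. This is only an illustration: the `Requirement` fields, the function name and the returned labels are hypothetical names chosen for this example, and real decisions will involve cost and timelines that the sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Answers to the selection questions (field names are illustrative)."""
    recurring: bool        # is data needed daily/weekly/monthly?
    dedicated_team: bool   # can engineers build and maintain the crawlers?
    high_volume: bool      # is the volume beyond what a DIY tool can support?


def choose_extraction_option(req: Requirement) -> str:
    """Map the answers to one of the options discussed in the text."""
    if not req.recurring:
        # One-time need: use a DIY tool unless the volume is too high for it.
        return "outsource" if req.high_volume else "diy_tool"
    if req.dedicated_team:
        # Recurring need with engineers available to maintain the crawlers.
        return "in_house"
    # Recurring need without a dedicated team.
    return "managed_service"


one_off = choose_extraction_option(Requirement(False, False, False))
weekly_no_team = choose_extraction_option(Requirement(True, False, True))
```

Here `one_off` resolves to the DIY tool and `weekly_no_team` to the managed service, mirroring the reasoning in the paragraphs above.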
What data financial information companies crawl
Businesses crawl a wide range of websites across the globe (in numerous languages). Here are the general categories:
- News portals
- Company websites and government sites
- Social media and forums
- RSS feeds
Typically, the article title, date, full content and author details are extracted from news sites and blogs. From company sites, press releases, leadership profiles, company blogs, job openings and similar pages are extracted. Policy and regulation pages on government sites are also monitored. Social media and forums pose a hurdle: social networks like LinkedIn disallow crawling, and their API is not openly accessible either. However, some social networks, like Twitter, allow data extraction via API access. Before crawling any site, the primary requirement is to follow that site’s robots.txt file in full to stay clear of legal issues. This file tells crawlers which pages may be crawled and how frequently they may crawl.
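As a minimal sketch of a robots.txt check, the example below uses `urllib.robotparser` from Python's standard library. The robots.txt content, the `example-bot` user agent and the example.com URLs are made up for illustration; a real crawler would load the file from the target site (for instance with `set_url()` and `read()`) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer crawl questions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Which pages may this (hypothetical) crawler fetch?
allowed = parser.can_fetch("example-bot", "https://example.com/news/article-1")
blocked = parser.can_fetch("example-bot", "https://example.com/private/report")

# How many seconds should the crawler wait between requests, if the site says so?
delay = parser.crawl_delay("example-bot")
```

With this sample file the news URL is allowed, the `/private/` URL is not, and the requested delay is 10 seconds. `crawl_delay()` returns `None` when the file sets no delay, so a crawler would need its own conservative default for that case.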
Applications of web data
Web data sets, used as alternative data, can augment conventional data sources to build robust solutions and deliver valuable intelligence. Given below are some of the most common use cases:
Since equity research requires performance data on companies, web data can support it through continuous aggregation of the required information. For example, pricing and inventory data available on a company’s site, along with data from its income statements and balance sheets, can be extracted to understand how the company is growing. In addition, job postings on company sites, the company’s ratings on employer review sites, and brand mentions on forums and in the media can be extracted for stronger fundamental analysis. Advanced sentiment analysis also plays an important role in gauging consumer perception.
Before planning and allocating an investment budget, firms first study technology trends over a given time period. An important part of this analysis is the data gathered from news portals, blogs, Reddit posts and tweets. Text mining techniques are applied to these data sets to uncover trending topics and the way they change. These insights can be particularly useful to venture capital firms for better portfolio allocation.
Rating agencies heavily monitor and extract web data for the companies tracked in their reports. This primarily includes public data on the company sites, third-party reviews, cash-tagged tweets and brand mentions. It is also possible to extract data in real time if the use case requires high-velocity analytics.
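One simple way to surface trending topics is to compare term frequencies between two time periods. The sketch below is a deliberately basic stand-in for the text mining techniques mentioned above, not a description of what any particular firm uses; the stopword list, function names and sample headlines are all invented for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real pipelines use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "is", "are", "with"}


def term_counts(documents):
    """Count lowercase word occurrences across documents, skipping stopwords."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def rising_terms(earlier_docs, later_docs, top_n=3):
    """Return up to top_n terms whose frequency grew most between two periods."""
    earlier = term_counts(earlier_docs)
    later = term_counts(later_docs)
    # Counter returns 0 for terms absent from the earlier period.
    growth = {term: later[term] - earlier[term] for term in later}
    # Sort by growth (descending), then alphabetically for stable ties.
    ranked = sorted(growth.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, delta in ranked[:top_n] if delta > 0]


# Made-up headlines standing in for crawled news, blog and social content.
earlier_period = ["Banks invest in cloud platforms", "Cloud adoption grows"]
later_period = [
    "Startups build blockchain payments",
    "Blockchain funding rises",
    "Blockchain and cloud",
]

top_term = rising_terms(earlier_period, later_period, top_n=1)
```

On this toy input the top rising term is "blockchain", while "cloud" drops out because its count fell. Production systems would typically add normalization by document volume, phrase detection and topic modeling.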
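Cash-tagged tweets mark ticker symbols with a leading dollar sign, such as `$AAPL`. The sketch below shows one possible way to pull such tags out of already-collected text with a regular expression. The pattern (one to five letters after `$`) and the function name are assumptions made for this example; they do not reflect an official cashtag specification, and the sample text is invented.

```python
import re

# Assumed pattern: "$" followed by 1 to 5 letters, not preceded by a word
# character or another "$", and ending at a word boundary.
CASHTAG_PATTERN = re.compile(r"(?<![\w$])\$([A-Za-z]{1,5})\b")


def extract_cashtags(text: str) -> list:
    """Return unique ticker symbols in order of first appearance, uppercased."""
    seen = []
    for match in CASHTAG_PATTERN.finditer(text):
        symbol = match.group(1).upper()
        if symbol not in seen:
            seen.append(symbol)
    return seen


sample = "Watching $AAPL and $msft today, up $150 on $AAPL"
tickers = extract_cashtags(sample)
```

For the sample text, `tickers` contains `AAPL` and `MSFT`; the dollar amount `$150` is ignored because it has no letters after the dollar sign.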
Governance and compliance
Complying with regulatory requirements is paramount for companies. Hence, companies that provide advisory and risk mitigation services (for example, around natural disasters such as floods) crawl government sites and news outlets to stay abreast of policy changes and critical events. In these cases, getting live data becomes imperative.
Financial intelligence becomes powerful when the data used for analysis and reporting covers both traditional and newer sources. In this context, the web is the low-hanging fruit with tremendous impact: there is no upper limit on the amount of web data, and it keeps growing while delivering market-moving insights.