SmartData Collective

Preserving Big Data to Live Forever

paulbarsch

If anyone knows how to preserve data and information for long-term value, it’s the programmers at the Internet Archive, based in San Francisco, CA. In fact, the Internet Archive is attempting to capture every webpage, video, television show, MP3 file, or DVD published anywhere in the world. If the Internet Archive is seeking to keep and preserve data for centuries, what can we learn from this non-profit about architecting a solution to keep our own data safeguarded and accessible long-term?

There’s a fascinating 13-minute documentary on the work of data curators at the Internet Archive. The mission of the Internet Archive is “universal access to all knowledge”. In its efforts to crawl every webpage, scan every book, and make information available to any citizen of the world, the Internet Archive team has designed a system that is resilient, redundant, and highly available.

Preserving knowledge for generations is no easy task. Key components of this massive undertaking include decisions in technology, architecture, data storage, and data accessibility.

First, just about every technology used by the Internet Archive is either open source software or commodity hardware. For web crawling and adding content to its digital archives, the Internet Archive developed Heritrix. To enable full-text search on the Internet Archive’s website, Nutch running on Hadoop’s file system is used to “allow Google-style full-text search of web content, including the same content as it changes over time.” Some websites also mention that HBase could be in the mix as a database technology.

Second, the concepts of redundancy and disaster planning are baked into the overall Internet Archive architecture. The non-profit has servers located in San Francisco, but in keeping with its multi-century vision, the Internet Archive mirrors data in Amsterdam and Egypt to weather the volatility of historical events.
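
Mirroring only helps if the copies actually agree, so a periodic integrity check is a natural companion to it. The sketch below is a minimal illustration in Python, not a description of how the Internet Archive does it: it assumes each mirror is reachable as a local directory, and the names `sha256_of` and `find_divergent_mirrors` are hypothetical. It hashes the same file on every mirror and reports the mirrors whose copy is missing or differs from the most common digest.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_mirrors(relative_name: str, mirror_roots: list[Path]) -> list[Path]:
    """Return the mirror roots whose copy is missing or disagrees with the majority.

    Assumption: every mirror is mounted as a local directory. A real
    deployment would more likely compare digests fetched over the network.
    """
    digests: dict[Path, str | None] = {}
    for root in mirror_roots:
        candidate = root / relative_name
        digests[root] = sha256_of(candidate) if candidate.is_file() else None

    present = [d for d in digests.values() if d is not None]
    if not present:
        # No mirror holds the file at all, so every mirror needs attention.
        return list(mirror_roots)

    # Treat the most common digest as the reference copy. On a tie, the
    # digest seen first (in mirror_roots order) wins.
    reference = Counter(present).most_common(1)[0][0]
    return [root for root, d in digests.items() if d != reference]
```

With three mirrors, a single corrupted or missing copy is outvoted by the other two and shows up in the returned list, which is one reason to keep more than two copies.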

Third, many companies struggle to decide what data they should use, archive, or throw away. However, with the plummeting cost of hard disk storage and open source Hadoop, capturing and storing all data in perpetuity is more feasible than ever. For the Internet Archive, all data are captured and nothing is thrown away.

Finally, it’s one thing to capture and store data, and another to make it accessible. The Internet Archive aims to make the world’s knowledge base available to everyone. On the Internet Archive site, users can search and browse through ancient documents, view recorded video from years past, and listen to music from artists who no longer walk planet Earth. Brewster Kahle, founder of the Internet Archive, says that with a simple internet connection, “A poor kid in Kenya or Kansas can have access to…great works no matter where they are, or when they were (composed).”

Capturing a mountain of multi-structured data (currently 10 petabytes and growing) is an admirable feat; however, the real magic lies in the Internet Archive’s multi-century vision of making sure the world’s best and most useful knowledge is preserved. Political systems come and go, but with the Internet Archive’s Big Data preservation approach, the treasures of the world’s digital content will hopefully exist for centuries to come.
