Posted by: Bernard Marr

22 Key Big Data Terms Everyone Should Understand


I have attempted to provide simple explanations for some of the most important technologies and terms you will come across if you’re looking at getting into big data. However, if you are completely new to the topic then you might want to start here: What the Heck is... Big Data? ...and then come back to this list later.

Here are some of the key terms:

1. Algorithm: A mathematical formula or statistical process run by software to perform an analysis of data. It usually consists of multiple calculation steps and can be used to automatically process data or solve problems.

2. Amazon Web Services: A collection of cloud computing services offered by Amazon to help businesses carry out large-scale computing operations (such as big data projects) without having to invest in their own server farms and data storage warehouses. Essentially, storage space, processing power and software operations are rented rather than having to be bought and installed from scratch.

3. Analytics: The process of collecting, processing and analyzing data to generate insights that inform fact-based decision-making. In many cases it involves software-based analysis using algorithms. For more, have a look at my post: What the Heck is… Analytics

4. Big Table: Google’s proprietary data storage system, which it uses to host, among other things, its Gmail, Google Earth and YouTube services. It is also made available for public use through the Google App Engine.

5. Biometrics: Using technology and analytics to identify people by one or more of their physical traits, such as face recognition, iris recognition, fingerprint recognition, etc. For more, see my post: Big Data and Biometrics

6. Cassandra: A popular open source database management system managed by The Apache Software Foundation that has been designed to handle large volumes of data across distributed servers.

7. Cloud: Cloud computing, or computing “in the cloud”, simply means software or data running on remote servers, rather than locally. Data stored “in the cloud” is typically accessible over the internet, wherever in the world the owner of that data might be. For more, check out my post: What The Heck is… The Cloud?

8. Distributed File System: A data storage system designed to store large volumes of data across multiple storage devices (often cloud-based commodity servers), to decrease the cost and complexity of storing large amounts of data.
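
To make the idea of spreading data across several devices a little more concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the node names and the block size are made up, and keeping two copies of each block is an assumption added here (the entry above does not mention replication).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list, replicas: int = 2) -> dict:
    """Assign each block to `replicas` different nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# Hypothetical 100-byte "file" and three hypothetical storage servers.
data = b"0123456789" * 10
nodes = ["node-a", "node-b", "node-c"]

blocks = split_into_blocks(data, block_size=32)
placement = place_blocks(len(blocks), nodes)

for block, holders in placement.items():
    print(f"block {block} ({len(blocks[block])} bytes) -> {holders}")
```

Because every block lives on more than one server in this sketch, losing a single server would not lose any data, which is one reason cheap commodity hardware can be used.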

9. Data Scientist: Term used to describe an expert in extracting insights and value from data. It is usually someone who has skills in analytics, computer science, mathematics, statistics, creativity, data visualisation and communication as well as business and strategy.

10. Gamification: The process of creating a game from something which would not usually be a game. In big data terms, gamification is often a powerful way of incentivizing data collection. For more on this read my post: What The Heck is… Gamification?

11. Google App Engine: Google’s own cloud computing platform, allowing companies to develop and host their own services within Google’s cloud servers. Unlike Amazon’s Web Services, it is free for small-scale projects.

12. HANA: High-performance Analytical Application – a software/hardware in-memory platform from SAP, designed for high volume data transactions and analytics.

13. Hadoop: Apache Hadoop is one of the most widely used software frameworks in big data. It is a collection of programs which allow storage, retrieval and analysis of very large data sets using distributed hardware (allowing the data to be spread across many smaller storage devices rather than one very large one). For more, read my post: What the Heck is... Hadoop? And Why You Should Know About It

14. Internet of Things: A term to describe the phenomenon that more and more everyday items will collect, analyse and transmit data to increase their usefulness, e.g. self-driving cars, self-stocking refrigerators. For more, read my post: What The Heck is… The Internet of Things?

15. MapReduce: Refers to the software procedure of breaking up an analysis into pieces that can be distributed across different computers in different locations. It first distributes the analysis (map) and then collects the results back into one report (reduce). Several companies including Google and Apache (as part of its Hadoop framework) provide MapReduce tools.
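
As a rough illustration of the map and reduce steps described above, here is a minimal single-machine sketch in Python that counts words. The function names (`map_chunk`, `reduce_counts`) and the sample chunks are made up for this example; a real framework such as Hadoop would run the map step on many machines in parallel rather than in a simple loop.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk: str) -> Counter:
    """Map step: count the words in one piece of the data."""
    return Counter(chunk.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


# Hypothetical data, already split into pieces (one per worker).
chunks = ["big data big insights", "data beats opinion", "big data"]

# Each piece is analysed independently; this is the part that could be distributed.
partial_results = [map_chunk(chunk) for chunk in chunks]

# The partial results are then collected back into one report.
report = reduce(reduce_counts, partial_results, Counter())

print(dict(report))
```

The key point is that `map_chunk` never needs to see the whole data set, so each piece can be processed wherever it happens to be stored.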

16. Natural Language Processing: Software algorithms designed to allow computers to more accurately understand everyday human speech, allowing us to interact more naturally and efficiently with them.

17. NoSQL: Refers to database management systems that do not (or not only) use relational tables generally used in traditional database systems. It refers to data storage and retrieval systems that are designed for handling large volumes of data but without tabular categorisation (or schemas).
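
A toy example may help show what "without tabular categorisation" means in practice. The sketch below is a plain in-memory Python dictionary standing in for a document-style store; it is not the API of any real NoSQL product, and the record contents and the `find` helper are invented for illustration.

```python
# Toy in-memory "document store": each record is a free-form dict,
# so records do not have to share the same set of fields.
documents = {
    "u1": {"name": "Ana", "email": "ana@example.com"},
    "u2": {"name": "Ben", "tags": ["vip"], "last_login": "2014-12-01"},
}


def find(store: dict, field: str, value) -> list:
    """Return the ids of documents that have `field` set to `value`."""
    return [
        doc_id
        for doc_id, doc in store.items()
        if field in doc and doc[field] == value
    ]


# Adding a record with a brand-new field needs no schema change.
documents["u3"] = {"name": "Cy", "device": {"os": "android"}}

print(find(documents, "name", "Ben"))   # ['u2']
print(find(documents, "tags", ["vip"]))  # ['u2']
```

In a relational table, the new `device` field would normally require adding a column for every row; here it simply appears on the one record that needs it.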

18. Predictive Analytics: A process of using analytics to predict trends or future events from data.
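
One of the simplest forms this can take is fitting a straight line to past values and extending it forward. The sketch below does that with ordinary least squares in plain Python; the monthly sales figures are made up, and real predictive models are usually far richer than a single trend line.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 112, 119, 131, 140]

slope, intercept = fit_line(months, sales)

# Extend the fitted trend one month into the future.
forecast_month_6 = slope * 6 + intercept
print(f"trend: about {slope:.1f} extra sales per month")
print(f"forecast for month 6: about {forecast_month_6:.1f}")
```

The forecast is only as good as the assumption that the past trend continues, which is exactly where statistical care is needed.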

19. R: A popular open source software environment used for analytics.

20. RFID: Radio Frequency Identification. RFID tags use Automatic Identification and Data Capture technology to allow information about their location, direction of travel or proximity to each other to be transmitted to computer systems, allowing real-world objects to be tracked online.

21. Software-As-A-Service (SAAS): The growing tendency of software producers to provide their programs over the cloud – meaning users pay for the time they spend using it (or the amount of data they access) rather than buying software outright.

22. Structured v Unstructured Data: Structured data is basically anything that can be put into a table and organized in such a way that it relates to other data in the same table. Unstructured data is everything that can’t – email messages, social media posts and recorded human speech, for example.
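
The difference shows up as soon as you try to ask a question of the data. In the hedged Python sketch below, the order records and the email text are invented, and the regular expression is a deliberately crude stand-in for real text analytics.

```python
import re

# Structured: every row has the same named columns,
# so a question maps directly onto a simple filter.
orders = [
    {"order_id": 1, "customer": "Ana", "total": 40.0},
    {"order_id": 2, "customer": "Ben", "total": 15.5},
]
big_orders = [row["order_id"] for row in orders if row["total"] > 20]

# Unstructured: free text with no columns; the same kind of fact
# has to be extracted from the wording before it can be analysed.
email = "Hi, my order 17 arrived damaged. Please refund the 23.99 I paid."
amounts = [float(match) for match in re.findall(r"\d+\.\d{2}", email)]

print(big_orders)  # [1]
print(amounts)     # [23.99]
```

The structured query works on any number of rows without changes, whereas the text pattern would break as soon as a customer wrote the amount differently.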

As always, I hope you found this post useful.

You might also be interested in my new book: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance

For more, please check out my other posts in The Big Data Guru column and feel free to connect with me via:

Twitter, LinkedIn, Facebook, Slideshare and The Advanced Performance Institute

Authored by:

Bernard Marr

Bernard Marr is a globally recognized big data and analytics expert. He is a best-selling business author, keynote speaker and consultant in strategy, performance management, analytics, KPIs and big data. He helps companies to better manage, measure, report and analyse performance. His leading-edge work with major companies, organisations and governments across the globe makes him a globally ...


December 22, 2014

Alec Gardner says:

A nice cribsheet but the addition of HANA really ruins it for me. I cannot see why it deserves a mention.

By all means, it's important to cover In-Memory and how it is *part* of the (Big) Data and Analytics landscape, but "HANA" is not a key term "everyone" needs to understand. Not even close.

HANA is currently offering little, other than to get SAP ERP customers off Oracle and put lipstick on the pig that is BW - a mess that SAP put their own customers into. Despite the hype SAP push, there are no reference stories presented of customers deploying it for use cases that haven't already been achieved many times, many years ago on other technologies. As such, HANA isn't even a showcase of the opportunity that In-Memory presents to enhance the capability and opportunity afforded by new data and new processing techniques.


December 20, 2014

hisham wassouf says:

What about velocity, volume and variety?


December 17, 2014

David White says:

Just for clarity, HANA = High performance ANalytic Appliance (not application). That said, the appliance term is a bit of a misnomer now that HANA is in the cloud too....


December 16, 2014

Thomas Speidel says:

I agree with Diego. It's a bit puzzling that statistics isn't there, especially considering R is there. I'll follow Diego's example and provide another quote:

"Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not."

Tim Harford, Financial Times (March 28, 2014)


December 16, 2014

Martyn Jones says:

Nice blog piece.

With regards to Diego Kuonen's comment, I would also warmly support the inclusion of the term Statistics.

Best regards,



December 16, 2014

Diego Kuonen says:

Great post! As you mention "Analytics" and "R" - an open source software environment written by statisticians for statisticians (and all people interested in data analysis), it would be fair to also add an entry for "Statistics".

For example:

  • Statistics: the science of "learning from data" (or of making sense out of data), and of measuring, controlling and communicating uncertainty.

Interpreting information extracted from (big and any) data requires statistical principles and rigour as one can easily be fooled by patterns that arise by chance.

For example, in 2009 Hal Varian (Google’s chief economist) dubbed statistician ‘the sexy job in the next ten years’. More recently, Eric Schmidt (Google's chairman and former CEO) and Jonathan Rosenberg (former senior vice president of product) wrote in their 2014 book ‘How Google Works’: ‘big data needs statisticians to make sense of it’.

In my opinion, the key elements for a successful ‘big data’ future are the statistical principles and rigour of humans (including hopefully also plenty of statisticians)!

Saying this, it would be more than fair to also add an entry for Statistics.

Please also find a detailed presentation of my view on big data, data science and statistics at and/or
