Big Data Terms for Managers: Ultimate Glossary 2020

Getting started with Big Data (BD)?

Perhaps you have already got your feet wet in the world of BD, but are still looking to expand your knowledge and cover the subjects you have heard of but have not quite had time to explore?

Well, you have come to the right place.

This Big Data Glossary will briefly introduce you to the most important terms. We assure you it will be an easy and nice read!

It is not by any means exhaustive, but it is a good, light prep read before a meeting with a Big Data director or vendor, or a quick refresher before a job interview. Also, if you are interested in similar terms related to Artificial Intelligence, I would encourage you to visit a similar blog post at our partner company, Sigmoidal.

At the very beginning, let us explain the most important term when speaking of Big Data: Big Data itself. It refers to datasets so large and complex that analyzing them requires high computing power, and whose analysis can extract essential information and yield new knowledge. Here you can read more about Big Data and its applications.

Have you ever wondered how much data we generate? Tons of gigabytes. Watch this video and find out what Big Data really is.

Now that you know what Big Data is, it is high time to jump into more advanced definitions. Below are the technical terms our engineers at DevsData consider the most essential.

So, let’s get started!

Algorithm

It is a simple term that is absolutely essential when speaking of Big Data. An algorithm is a mathematical formula or a set of instructions given to a computer that describes how to process the given data in order to obtain the needed information.
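For illustration, here is one small algorithm sketched in Python: a moving average, a fixed step-by-step recipe that turns a raw series of numbers into more useful information. The function name and the sample numbers are invented for this example.

```python
def moving_average(values, window):
    """Compute the simple moving average of a numeric series.

    Each output element is the mean of the previous `window` inputs --
    a small, well-defined recipe the computer follows step by step.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        averages.append(sum(chunk) / window)
    return averages

print(moving_average([10, 20, 30, 40], 2))  # [15.0, 25.0, 35.0]
```

The point is not the arithmetic itself but the shape: unambiguous instructions that a machine can apply to any amount of data.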

Artificial Intelligence

Artificial Intelligence is intelligence exhibited by machines. It lets them perform tasks normally reserved for humans, such as speech recognition, visual perception, decision making, or making predictions.

Biometrics

Biometrics is a technology for recognizing people by their physical traits, such as their face or height. It uses Artificial Intelligence algorithms.

Behavioral Analytics

Behavioral analytics is a recent development in business analytics that offers new insights into the actions of clients on e-commerce sites, web/mobile applications, online games, etc. It helps advertisers make the right offers to the right customers at the right time.

Columnar Database

A columnar database, also known as a column-oriented database, stores data column by column instead of row by row, which speeds up analytical queries that read only a few columns.

Complex Event Processing

Complex event processing (CEP) is the practice of analyzing and combining data from multiple events, then drawing inferences that can provide solutions to complex circumstances. CEP's main role is to identify and track relevant events and respond to them as quickly as possible.

Data Science

Data science is the specialist discipline concerned with turning data into meaning, such as new insights or predictive models. It brings together skills from areas such as economics, mathematics, computer science, and communication, along with industry experience such as knowledge of the company.

Data Scientist

A Big Data scientist is someone who can take structured and unstructured data and use formidable skills in statistics, math, and programming to organize it. They apply their analytical powers, such as contextual understanding, industry knowledge, and an awareness of existing assumptions, to uncover hidden solutions for business development.

Business Intelligence

Business Intelligence is the process of analyzing raw data in search of valuable information, for the purpose of improving and better understanding the business. Using BI can help make fast and accurate business decisions.

Hadoop

When people think of big data, they immediately think about Hadoop. Hadoop (with its cute elephant logo) is an open-source software framework that consists of what is called a Hadoop Distributed File System (HDFS) and allows for storage, retrieval, and analysis of very large data sets using distributed hardware. 

Sounds complicated?

If you really want to impress someone, talk about YARN (Yet Another Resource Negotiator) which, as the name says, is a resource manager. I am really impressed by the folks who come up with these names. The Apache Software Foundation, which came up with Hadoop, is also responsible for Pig, Hive, and Spark (yup, they are all names of various software pieces). Aren't you impressed with these names?

5 V's of Big Data

Big data can be also described with these 5 words:

  • Volume – a large amount of data
  • Velocity – the speed of data processing
  • Variety – large data diversity
  • Veracity – the trustworthiness and quality of data
  • Value – what big data can bring to the user

Some say there are now 8 V's, with Visualization, Viscosity, and Virality being the new ones.

Server

A server is a computer that receives requests related to applications; its task is to respond to those requests or send them over a network. This term is commonly used in big data.

Machine Learning

Machine Learning is the ability of computers to learn new skills without being explicitly programmed. In practice, this means algorithms that learn from the data they process and use what they have learned to make decisions. Machine learning is used to exploit the opportunities hidden in big data.
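As a minimal sketch of the "learn from data" idea, the snippet below fits a straight line to example pairs using ordinary least squares, then the learned slope and intercept can be used to predict new cases. The ad-spend numbers are made up purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, learned from example pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x).
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    return a, b

# "Training data": units of ad spend vs. sales. The fitted model then
# generalizes to inputs it has never seen.
a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(a, b)  # 2.0 0.0
```

Real machine learning models are far more elaborate, but the core loop is the same: parameters are estimated from data rather than hand-coded.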

Cloud computing

Cloud computing is a term describing computing resources stored and running on remote servers. The resources, including software and data, can be accessed from anywhere by means of the internet.

Fuzzy Logic

Fuzzy logic is an approach to logic in which, instead of judging whether a statement is true or false (values 0 or 1), we measure the degree to which the statement is true (values from 0 to 1). This approach is commonly used in Artificial Intelligence.
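A fuzzy membership function fits in a few lines of Python. The temperature thresholds below are arbitrary illustrative values, not part of any standard.

```python
def warmth(temp_c):
    """Fuzzy membership in the set 'warm': 0 below 10 degrees C,
    1 above 25 degrees C, and a linear ramp in between -- a degree
    of truth rather than a yes/no answer."""
    if temp_c <= 10:
        return 0.0
    if temp_c >= 25:
        return 1.0
    return (temp_c - 10) / 15

print(warmth(5))     # 0.0  (definitely not warm)
print(warmth(17.5))  # 0.5  (half-way warm)
print(warmth(30))    # 1.0  (definitely warm)
```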

Neural Network

Neural networks are series of algorithms that recognize relationships in datasets through a process similar to the way the human brain works. An important feature of these systems is that they can produce the best possible result without the output criteria being redesigned. Neural networks are very useful in finance; for instance, they can be used to forecast stock market prices.

Data Warehouse

It is a system that stores data in order to analyze and process it in the future. The source of data can vary, depending on its purpose. Data can be uploaded from the company’s CRM systems as well as imported from external files or databases.

Dashboard

This is a graphical representation of the analyses performed by the algorithms. The report displays color-coded warnings to indicate the status of an incident: in regular operation, a green light means everything is OK, a yellow light flags activity that may need attention, and a red light means the activity has been halted.

Data Lake

A data lake is a repository that stores a huge amount of raw data in its original format. While the hierarchical data warehouse stores information in files and folders, a data lake uses a flat architecture to store data. Each item in the repository has a unique identifier and is marked with a set of metadata tags. When a business query appears, the repository can be searched for specific information, and then a smaller, separate set of data can be analyzed to help solve a specific problem.

MapReduce

It is a technique for processing large datasets with a parallel, distributed algorithm on a cluster. MapReduce consists of two types of tasks: "Map" divides the data and processes it at the node level, while "Reduce" collects the Map outputs and combines them into the answer to the query.
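The two phases can be sketched with the classic word-count example in plain Python. This is a single-machine toy that shows the shape of the computation, not a real distributed job.

```python
from collections import defaultdict

def map_phase(document):
    # "Map": emit (word, 1) pairs; in a real cluster, each node
    # would run this over its own chunk of the data.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # "Reduce": collect the mapped pairs and aggregate counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big value", "big insights"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 1, 'value': 1, 'insights': 1}
```

The distributed framework's real job is everything around these two functions: splitting input, scheduling the work across nodes, and shuffling the intermediate pairs to the reducers.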

Spark (Apache Spark)

Apache Spark is a fast, in-memory data processing engine that efficiently executes streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. Spark is generally a lot faster than the MapReduce approach we discussed earlier.
Data centers are facilities responsible for storing and processing large amounts of data.

Metadata

Metadata is data that describes other data. Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, creation date, modification date, and file size are very basic document metadata. In addition to document files, metadata is used for images, videos, spreadsheets, and web pages.

SQL

SQL is a language used to manage and query data in relational databases. There are many relational database systems, such as MySQL, PostgreSQL or SQLite, etc. Each of these systems has its own SQL dialect, which slightly differs from others.
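A minimal example, using Python's built-in sqlite3 module so it runs without a database server. The table and figures are invented for illustration; the SQL itself is close to standard, though each engine has its own dialect quirks.

```python
import sqlite3

# An in-memory SQLite database: created on the fly, gone when closed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 50.0)],
)

# A query: total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY SUM(amount) DESC"
).fetchall()
print(rows)  # [('alice', 170.0), ('bob', 80.0)]
```

The same `SELECT ... GROUP BY ... ORDER BY` statement would run, with minor dialect differences, on MySQL or PostgreSQL.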

Queries

Queries are the questions used to communicate with a database. Manually iterating over big datasets would be very time-consuming; instead, a query can be created, i.e., the database can be asked to return all records where a given condition is satisfied.

NoSQL

The term NoSQL is an abbreviation of "not only SQL". It describes databases or database management systems that deal with non-relational or non-structured data. Their flexibility is why they are very commonly used when processing large amounts of data.

MongoDB

MongoDB is one of the most popular NoSQL database systems. It stores data in documents written in a JSON-like format. Since it is written in a relatively low-level language (C++), it delivers very high performance.

Database Management System

Database Management System is software that gathers data in a structured form and provides access to that data. It builds the database and maintains it. DBMS offers a well-organized mechanism for programmers and users to build, update, retrieve and manage the data.

Object databases

An object database stores data in the form of objects. The term "object" means the same as in object-oriented programming: an instance of a certain class.

Data Visualization

Data visualization is a proper solution when a quick look at a large amount of information is required. Using graphs, charts, diagrams, etc. allows the user to find interesting patterns or trends in the dataset. It also helps when it comes to validating data. The human eye can notice some unexpected values when they are presented in a graphical way.

Data Mining

It is an analytical process designed to study large data resources in search of regular patterns and systematic interrelationships between variables, and then to evaluate the results by applying the detected patterns to new subsets of data. The final goal of data mining is usually to predict customer behavior, sales volume, the likelihood of customer loss, etc.

Data Cleansing

Data cleansing is the process of correcting or removing data or records from a database. This step is extremely important: when collecting data from sensors, websites, or web scraping, some incorrect data may occur. Without cleansing, the user risks drawing wrong conclusions from the analysis.
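A toy cleansing step might look like this in Python. The plausibility bounds are made-up placeholders; real limits depend entirely on the sensor and the domain.

```python
def cleanse(readings, low=-40.0, high=60.0):
    """Drop sensor readings that are missing or outside a plausible range.

    `low` and `high` are illustrative defaults, not real sensor specs.
    """
    cleaned = []
    for value in readings:
        if value is None:               # missing measurement
            continue
        if not (low <= value <= high):  # physically implausible value
            continue
        cleaned.append(value)
    return cleaned

raw = [21.5, None, 999.0, 19.8, -120.0]
print(cleanse(raw))  # [21.5, 19.8]
```

Production pipelines add many more rules (deduplication, type coercion, outlier detection), but the principle is the same: filter or fix bad records before analysis.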

Cluster Analysis

It is an explorative analysis that tries to identify structures within the data. Cluster analysis is also called segmentation analysis or taxonomy analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, or respondents. Cluster analysis is used to identify groups of cases when the grouping is not known in advance. Because it is explorative, it does not make any distinction between dependent and independent variables. The different cluster analysis methods that SPSS offers can handle binary, nominal, ordinal, and scale (interval or ratio) data.
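To make the idea concrete, here is a deliberately simplified one-dimensional k-means sketch in plain Python. Real cluster analysis would use a statistics package such as SPSS or scikit-learn; the purchase amounts below are invented.

```python
def kmeans_1d(points, k, iters=10):
    """Tiny 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster. Assumes k >= 2."""
    lo, hi = min(points), max(points)
    # Spread the initial centroids evenly across the data range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two clear groups of purchase amounts, found without any labels given.
print(kmeans_1d([1, 2, 3, 50, 52, 55], k=2))  # roughly [2.0, 52.33]
```

No one told the algorithm which group each point belongs to; the structure emerges from the data, which is exactly what "explorative" means here.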

Internet of Things

The Internet of Things, IoT for short, is the concept of connecting devices such as house lighting, heating, or even fridges to a common network. It allows big amounts of data to be stored and later used in real-time analytics. The term is also connected with the smart home, the concept of controlling a house with a phone, etc.

TensorFlow & Keras

By default, Python does not ship with implementations of machine learning algorithms or data structures. Developers need to implement them themselves or use ready-made libraries such as TensorFlow or Keras. TensorFlow is an open-source library for symbolic math calculations as well as machine learning. It has implementations for many languages, such as Python and JavaScript. Code written with it is low-level, which can give reasonably high performance. Keras is also an open-source library used with Python; however, its code is more high-level, which makes the library friendlier for machine learning beginners than TensorFlow.

Concurrency

Concurrency is the ability to manage multiple tasks at the same time. It helps deal with the many processes a machine performs. The best-known example of concurrency is multitasking.

Parallelism

Parallelism is similar to concurrency, with a slight but important difference: it not only manages multiple tasks at the same time but actually performs them at the same time, thanks to multicore processors; every core performs its own task.
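The split-the-work-and-combine pattern behind both terms can be sketched with Python's standard concurrent.futures module. Note the hedge in the comments: this toy uses threads, which illustrate the pattern; for true multicore (CPU-bound) parallelism in CPython you would swap in ProcessPoolExecutor.

```python
from concurrent.futures import ThreadPoolExecutor

def count_evens(chunk):
    # Work performed on one slice of the data.
    return sum(1 for n in chunk if n % 2 == 0)

def split_count_evens(numbers, workers=4):
    """Split the input into chunks, process the chunks concurrently,
    then combine the partial results -- the same split/aggregate idea
    that frameworks like MapReduce and Spark apply at cluster scale."""
    size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    # ThreadPoolExecutor demonstrates the structure; for CPU-bound
    # multicore parallelism, ProcessPoolExecutor would be used instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_evens, chunks))

print(split_count_evens(list(range(1000))))  # 500
```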

Comparative Analytics

I'll be going a little deeper into analysis in this article, as big data's holy grail is in analytics. Comparative analysis, as the name suggests, is about comparing multiple processes, datasets, or other objects using statistical techniques such as pattern analysis, filtering, and decision-tree analytics. I know it's getting a little technical, but I can't completely avoid the jargon. Comparative analysis can be used in healthcare to compare large volumes of medical records, documents, images, etc. for more effective and, hopefully, more accurate medical diagnoses.

Search Engine

It is a software system that conducts web research under the conditions specified by the user in the search query. The most popular search engines are Google, Yahoo, and Bing. Big Data, undoubtedly, plays a spectacular role in the development of search engines. The complex systems crawl through the web and process the URLs related to the query provided by the user in order to return the most accurate results. 

Gamification

Gamification refers to applying game-design concepts in non-gaming companies to enhance consumer engagement. Businesses use various gaming concepts to raise interest in a service or product; in short, gamification is used to deepen the connection between consumers and the brand.

Hive

Hive is an open-source data warehouse project built on Hadoop that provides data summarization, querying, and analysis. Users write queries in HiveQL, a language close to SQL.

Impala

Impala is an open-source MPP (Massively Parallel Processing) SQL query engine that runs on Apache Hadoop computer clusters. Impala gives Hadoop a parallel database strategy, so users can run low-latency SQL queries on data stored in Apache HBase and HDFS without any transformation of the data.

In-memory computing

In general, it is assumed that any computation that can be performed without accessing the I/O would be quicker. In-memory computing is a technique for moving the working datasets entirely within the collective memory of a cluster and avoiding writing intermediate calculations to the disk. Apache Spark is an in-memory computing system, which has a major advantage over I/O bound systems such as MapReduce from Hadoop.

Load balancing

Load balancing is a method of distributing workload across two or more machines in a computer network so that no single machine is overwhelmed and users are served more quickly. It is the key reason for clustering computer servers, and it can be implemented in software, hardware, or a combination of both.
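The simplest software strategy, round robin, fits in a few lines of Python. The server names are hypothetical, and a real balancer would also track health checks and current load.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Minimal load balancer sketch: hand each incoming request
    to the next server in a fixed rotation."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def route(self, request):
        # Pick the next server in the rotation and pair it with the request.
        server = next(self._rotation)
        return server, request

lb = RoundRobinBalancer(["app-1", "app-2"])
assignments = [lb.route(f"req-{i}")[0] for i in range(4)]
print(assignments)  # ['app-1', 'app-2', 'app-1', 'app-2']
```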

Location Analytics

Location analytics is the method of gaining insights from the geographical aspect of business data. It visually evaluates and interprets location-related information, allowing the user to link it to the rest of the dataset.

Multidimensional Database

A multidimensional database (MDB) is a kind of database optimized for OLAP (Online Analytical Processing) applications and data warehousing. An MDB can easily be created from the input of a relational database, and it processes data so that results can be produced quickly.

Operational Data Store

An operational data store is a location for collecting and storing data retrieved from various sources. It lets users run additional operations on the data before it is sent onward to the data warehouse for reporting.

Parallel Processing

It is a system’s ability to execute multiple tasks at the same time.

Parallel Query

A parallel query is a query that is executed over multiple system threads to boost performance.

Pattern Recognition

Pattern recognition is the process of classifying or labeling patterns identified in data during the machine learning process.
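One of the simplest pattern-recognition rules, nearest centroid, can be sketched as follows. The animal measurements are invented purely for illustration.

```python
def nearest_centroid(sample, centroids):
    """Label a new sample with the class whose centroid (the average of
    that class's training examples) lies closest in feature space."""
    def dist2(a, b):
        # Squared Euclidean distance; the square root is not needed
        # just to compare which centroid is closer.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(sample, centroids[label]))

# Toy centroids: average (height_cm, weight_kg) per class.
centroids = {"cat": (25.0, 4.0), "dog": (50.0, 20.0)}
print(nearest_centroid((30.0, 5.0), centroids))  # cat
```

The centroids would normally be learned from labeled examples; classifying a new sample is then just a distance comparison.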

Real-time

Real-time means "as it happens"; in Big Data it refers to a process or mechanism that can deliver data-driven insights based on what is going on at the moment.

Semi-Structured Data

Semi-structured data is data that is not represented in the traditional way and cannot be handled with the usual methods. It is neither entirely structured nor unstructured but includes certain tags, data tables, and structural elements. XML databases, texts, charts, and graphs are just a few examples of semi-structured files.
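For example, XML gives data tags and nesting without forcing every record into an identical schema, and Python's standard library can parse it directly. The customer record below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A semi-structured record: tags give it some shape, but fields can
# vary from one record to the next without breaking a fixed schema.
doc = """
<customer id="42">
  <name>Ada</name>
  <order total="99.50"/>
</customer>
"""

root = ET.fromstring(doc)
print(root.get("id"))                   # 42
print(root.findtext("name"))            # Ada
print(root.find("order").get("total"))  # 99.50
```

A different customer record might omit `<order>` or add extra tags, and the same parsing code would still work; that flexibility is the hallmark of semi-structured data.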

Unstructured Data

Data that has no predefined structure is considered unstructured data. Processing and managing unstructured data is complicated. Typical examples of unstructured data include free text, images, and videos in email messages and other data sources.

XML Databases

Databases that support storing data in XML format are known as XML databases. These repositories are linked with document-oriented databases. Records in an XML database can be exported, serialized, and queried.

Python

Python is a programming language which has become very popular in the Big Data space due to its ability to work very well with large, unstructured datasets.

R

R is another programming language commonly used in Big Data; it can be considered more specialized than Python, focusing on statistics. Its value lies in its handling of structured data. Like Python, it has an active user base that is continuously growing and introducing new libraries and enhancements to its capabilities.

Big data terms for managers - summary

Ok, that was helpful – now what? Well, you can read how to write better code, check out examples of difficult JavaScript interview questions, or send us an email to discuss whether Big Data solutions could be applied to your business case (general@devsdata.com). Also, if you are interested in a real-life example of Big Data in real estate, you can check Adradar, a search engine for property and real estate based on AI and Big Data methods.
Applying Big Data in your business can lead to big profits and new opportunities for your company.


Acknowledgements

DevsData LLC  |  1820 Avenue M, Suite 481, Brooklyn, NY 11230
