Simple and advanced technical Big Data terms
This Big Data Glossary will give you a brief introduction to the most important terms. We assure you it will be a nice and easy read!
It is by no means exhaustive, but rather a good, light read to prep before a meeting with a Big Data director or vendor, or a quick refresher before a job interview.
To begin, we’ll explain the most important term when speaking of Big Data – Big Data itself. It refers to large and complex datasets whose analysis requires high computing power and can lead to the extraction of essential information and the acquisition of new knowledge. You can read more about Big Data and its applications here.
Once you know what Big Data is, it’s time to jump into more advanced definitions. Below are the technical terms our engineers at DevsData consider the most essential.
So, let’s get started!
Artificial Intelligence is intelligence demonstrated by machines. It lets them perform tasks normally reserved for humans, such as speech recognition, visual perception, decision-making, or making predictions.
Business Intelligence is the process of analyzing raw data in search of valuable information that helps a company better understand and improve its business. Using BI can assist in making fast and accurate business decisions.
Biometrics is technology for recognizing people by their physical traits, like the face, height, etc. It uses Artificial Intelligence algorithms.
Cloud computing describes computing resources stored and run on remote servers. The resources, including software and data, can be accessed from anywhere over the internet.
A Big Data Scientist is a person who can take structured and unstructured data points and use their formidable skills in statistics, math, and programming to organize them. They apply contextual understanding, industry knowledge, and an awareness of existing assumptions to uncover hidden solutions for business development.
Data visualization is a suitable solution when a quick look at a large amount of information is required. Utilizing graphs, charts, diagrams, etc. allows the user to find interesting patterns or trends in the dataset. It also helps when validating data: the human eye can spot unexpected values when they are presented graphically.
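As a minimal sketch of the idea, here is how a trend can be made visible with Python’s matplotlib library (assumed installed); the sales figures are made up for illustration:

```python
# A hypothetical dataset: a line chart makes the upward trend obvious
# in a way a table of numbers would not.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
monthly_sales = [120, 135, 128, 160, 155, 190]  # made-up values

plt.plot(months, monthly_sales, marker="o")
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```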
The Internet of Things, or IoT for short, is the concept of connecting devices such as residential lighting, heating, or even fridges to a common network. It allows for storing large amounts of data, which can later be used in real-time analytics. The term is also associated with the smart home, the concept of controlling household technology through an app.
Machine Learning is the ability of computer systems to adapt and learn without being directly programmed with new skills or specific instructions. In practice, this refers to algorithms that learn from data as they process it and then apply what they have learned to make decisions. Machine learning is used to exploit the opportunities hidden in Big Data.
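A minimal sketch of learning from data, using scikit-learn (assumed installed); the training examples are invented for illustration:

```python
# The model infers a rule from labeled examples instead of being
# given hand-coded instructions.
from sklearn.tree import DecisionTreeClassifier

# Toy data: [hours_studied, hours_slept] -> passed the exam (1) or not (0)
X = [[1, 4], [2, 8], [6, 7], [8, 6], [9, 8]]
y = [0, 0, 1, 1, 1]

model = DecisionTreeClassifier()
model.fit(X, y)                   # the "learning" step
print(model.predict([[7, 7]]))    # predict for an unseen student
```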
A search engine is a software system that searches the web according to the conditions specified by the user in the search query. The most popular search engines are Google, Yahoo, and Bing. Big Data undoubtedly plays a significant role in the development of search engines: complex systems crawl the web and process the URLs related to the user’s query in order to return the most accurate results.
Neural networks are a series of algorithms that recognize relationships in datasets through a process similar to the way the human brain works. An important feature of such a system is that it can generate the best possible result without the output criteria having to be redesigned by hand. Neural networks are useful in financial domains; they can be used, for instance, to forecast stock market prices.
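To make the idea concrete, here is a from-scratch sketch of a single artificial neuron using numpy (assumed installed); real networks stack many such units in layers, and the weights here are made up rather than learned:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # squashes any value into (0, 1)

inputs = np.array([0.5, 0.8, 0.2])    # one data point with three features
weights = np.array([0.4, -0.6, 0.9])  # normally learned during training
bias = 0.1

# Weighted sum of inputs passed through an activation function.
output = sigmoid(np.dot(inputs, weights) + bias)
print(output)
```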
Big Data can also be described with five words, the “5 Vs”: Volume, Velocity, Variety, Veracity, and Value.
Some say there are now 8 Vs, with Visualization, Viscosity, and Virality being new additions.
This is a simple term that is absolutely essential when talking about Big Data. An algorithm is a mathematical formula or a set of instructions we give to the computer that describes how to process the given data in order to obtain the necessary information.
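As a trivial illustration in Python, an algorithm for computing an average is just a precise sequence of steps that turns input data into the information we need:

```python
def average(values):
    total = 0
    for v in values:            # step 1: sum all the values
        total += v
    return total / len(values)  # step 2: divide by the count

print(average([4, 8, 15, 16, 23, 42]))  # 18.0
```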
Concurrency is the ability to manage multiple tasks at the same time. It helps deal with the large volume of processes performed by a machine. The most well-known example of concurrency is multitasking.
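A minimal sketch of concurrency in Python, where two threads are managed at the same time; `time.sleep` stands in for slow I/O such as a download:

```python
import threading
import time

def task(name):
    print(f"{name} started")
    time.sleep(1)            # while one thread waits, the other can run
    print(f"{name} finished")

threads = [threading.Thread(target=task, args=(f"task-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both tasks to complete
```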
We’ll dive a little deeper into analysis in this article, as analytics is at the heart of Big Data. Comparative analysis, as the name suggests, is about comparing multiple processes, datasets, or other objects using statistical techniques such as pattern analysis, filtering, and decision-tree analytics. We know it’s getting a little technical, but we can’t completely avoid the jargon! Comparative analysis can be used in healthcare to compare large volumes of medical records, documents, images, etc. for more effective and hopefully more accurate medical diagnoses.
A data warehouse is a system that stores data for future analysis and processing. The source of the data can vary depending on its purpose. Data can be uploaded from the company’s CRM systems as well as imported from external files or databases.
Data lakes are repositories that store a huge amount of raw data in its original format. While the hierarchical data warehouse stores information in files and folders, a data lake uses a flat architecture to store data. Each item in the repository has a unique identifier and is marked with a set of metadata tags. When a business query appears, the repository can be searched for specific information, and then a smaller, separate data set can be analyzed to help solve a specific problem.
Data Mining is an analytical process designed to study large data resources in search of regular patterns and systematic interrelationships between variables, and then to evaluate the results by applying the detected patterns to new subsets of data. The final goal of data mining is usually to predict client behavior, sales volume, the likelihood of customer churn, etc.
Data cleansing is the process of correcting or removing inaccurate data or records from a database. This step is extremely important: during data collection from sensors, websites, or web scraping, some incorrect data may slip in. Without cleansing, the user risks drawing incorrect conclusions from the analysis.
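A minimal cleansing sketch using pandas (assumed installed); the sensor readings are invented, with -999 standing in for a faulty measurement:

```python
import pandas as pd

raw = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 3],
    "temperature": [21.5, 21.5, None, -999.0, 22.1],
})

clean = (raw.drop_duplicates()            # remove repeated records
            .dropna()                     # drop rows with missing values
            .query("temperature > -50"))  # discard impossible readings
print(clean)
```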
Metadata is data that describes other data. It summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, the author, creation date, modification date, and file size are very basic document metadata. In addition to document files, metadata is used for images, videos, spreadsheets, and web pages.
The term NoSQL is an abbreviation of “not only SQL”. It describes databases or database management systems that deal with non-relational or non-structured data. Due to their flexibility, they are commonly used when processing large amounts of data.
R and Python are two of the most commonly used open-source programming languages for Big Data. Python is considered slightly more beginner-friendly than R, and it is highly flexible and efficient when processing large datasets. R, on the other hand, is more specialized, as it is predominantly used for statistics. It has a large community of users who voluntarily contribute to its development by adding new libraries and packages, for example ggplot2, which is used for data visualization.
A server is a computer that receives requests related to applications. Its task is to respond to those requests or send them over a network. The term is commonly used in Big Data.
SQL is a language used to manage and query data in relational databases. There are many relational database systems, such as MySQL, PostgreSQL, or SQLite. Each of these systems has its own SQL dialect, which differs slightly from the others.
Queries are the questions used to communicate with a database. Iterating over a big dataset directly would usually be very time-consuming; instead, a query can be issued, i.e., the database can be asked to return all records where a given condition is satisfied.
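A minimal sketch of an SQL query using Python’s built-in sqlite3 module; the table and data are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (2, 120.0), (3, 75.5)])

# The query: return all records where a given condition is satisfied.
for row in conn.execute("SELECT * FROM orders WHERE amount > 60"):
    print(row)
```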
Cluster analysis is an exploratory analysis that tries to identify structures within data; it is also called segmentation analysis or taxonomy analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, or respondents, when the grouping is not known in advance. Because it is exploratory, it makes no distinction between dependent and independent variables. Statistical packages such as SPSS offer cluster analysis methods that can handle binary, nominal, ordinal, and scale (interval or ratio) data.
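A minimal clustering sketch with scikit-learn’s k-means (library assumed installed); the points are invented, and the algorithm discovers the two groups without being told which point belongs where:

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [8, 8], [8.5, 9], [1, 0.5], [9, 8.5]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the discovered group centers
```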
Fuzzy logic is an approach to logic in which, rather than judging whether a statement is simply true or false (values 0 or 1), we measure the degree to which the statement is true (values from 0 to 1). This approach is commonly used in Artificial Intelligence.
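A tiny sketch of the idea in Python: a membership function returns a degree of truth rather than a hard yes/no (the temperature thresholds below are made up):

```python
def is_hot(temperature_c):
    """Degree to which a temperature counts as 'hot', from 0.0 to 1.0."""
    if temperature_c <= 20:
        return 0.0                    # definitely not hot
    if temperature_c >= 35:
        return 1.0                    # definitely hot
    return (temperature_c - 20) / 15  # partially hot in between

print(is_hot(20), is_hot(28), is_hot(35))  # 0.0 0.53... 1.0
```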
When people think of Big Data, they immediately think of Hadoop. Hadoop, with its cute elephant logo, is an open-source software framework built around the Hadoop Distributed File System (HDFS), which allows for the storage, retrieval, and analysis of large datasets using distributed hardware.
Sounds complicated?
If you really want to impress someone, talk about YARN (Yet Another Resource Negotiator) which, as the name says, is a resource scheduler. I am really impressed by the folks who come up with these names. The Apache Software Foundation, which came up with Hadoop, is also responsible for Pig, Hive, and Spark (yup, they are all names of various software pieces). Aren’t you impressed with these names?
MapReduce is a technique used to process large datasets with a parallel, distributed algorithm on a cluster. There are two types of tasks that MapReduce is responsible for: “Map” divides the data and processes it at the node level, and “Reduce” collects the Map outputs and combines them into the answer to the query.
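Here is the MapReduce idea illustrated in plain Python with a word count; real MapReduce distributes these same two phases across many machines:

```python
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map phase: each document is turned into (key, value) pairs independently.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce phase: values are combined per key to answer the query.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n
print(dict(counts))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```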
MongoDB is one of the most popular NoSQL database systems. It stores data in documents written in a JSON-like format. Since it is written in a relatively low-level language (C++), it offers very high performance.
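A minimal sketch using the pymongo driver (assumes the package is installed and a MongoDB server is running locally on the default port); the database, collection, and document are invented:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

products.insert_one({"name": "lamp", "price": 25.0, "tags": ["home", "light"]})

# Query the JSON-like documents directly; no fixed schema is required.
for doc in products.find({"price": {"$lt": 50}}):
    print(doc)
```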
An object database stores data in the form of objects. The term “object” means the same as it does in “object-oriented programming”: simply an entity of a certain class.
Parallelism is similar to concurrency, but it has a slight, yet important difference. Not only does parallelism allow for the management of multiple tasks at the same time, it also allows for the performance of multiple tasks at the same time, thanks to multicore processors; every core performs its own task.
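A minimal parallelism sketch in Python: `multiprocessing.Pool` spreads a CPU-heavy function across cores, so the tasks truly run at the same time rather than merely being interleaved:

```python
from multiprocessing import Pool

def heavy_computation(n):
    return sum(i * i for i in range(n))  # stand-in for CPU-bound work

if __name__ == "__main__":               # guard required on some platforms
    with Pool(processes=4) as pool:
        results = pool.map(heavy_computation, [10**6] * 4)
    print(results)
```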
Apache Spark is a fast, in-memory data processing engine to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. Spark is generally a lot faster than MapReduce, which we defined previously.
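A minimal PySpark sketch (assumes the pyspark package and a Java runtime are available); the names and ages are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.filter(df.age > 30).show()  # a distributed filter executed by the engine
spark.stop()
```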
By default, Python doesn’t ship with implementations of Machine Learning algorithms or data structures. A developer needs to implement them on their own or use ready-made libraries such as TensorFlow or Keras. TensorFlow is an open-source library for symbolic mathematical calculations as well as Machine Learning. It has implementations for many languages, such as Python, JavaScript, etc. The written code is low-level, which allows for reasonably high performance. Keras is another open-source library used with Python; its code is more high-level, which makes the library more beginner-friendly for Machine Learning than TensorFlow.
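A minimal Keras sketch (assumes the tensorflow package is installed): a tiny network learning the XOR function from four examples, showing how much detail the high-level API hides:

```python
import numpy as np
from tensorflow import keras

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

model = keras.Sequential([
    keras.layers.Input(shape=(2,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)   # training in one line
print(model.predict(X).round().ravel())  # ideally [0, 1, 1, 0]
```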
A Database Management System (DBMS) is specialized software designed to store, retrieve, define, and manage data in a structured way. It ensures data consistency, integrity, and security while providing tools for querying and reporting. By handling complex tasks like concurrency control and recovery from failures, a DBMS allows users and applications to interact seamlessly with vast amounts of data. Examples include Oracle, MySQL, PostgreSQL, and Microsoft SQL Server.
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. It aims to enable machines to understand, interpret, generate, and respond to human language in a way that is both meaningful and contextually relevant. By leveraging algorithms and computational linguistics, NLP seeks to bridge the gap between human communication and machine understanding. This technology underpins various applications, such as chatbots, machine translation, sentiment analysis, and voice recognition systems. As advancements continue, NLP paves the way for more intuitive and sophisticated human-computer interactions.
Relational databases are a type of database system that organizes data into structured tables, ensuring data integrity and accuracy through the principles of the relational model. These tables, or relations, consist of rows and columns, where each row represents a unique record and each column represents an attribute of the data. A key aspect of relational databases is the use of primary and foreign keys to establish relationships between tables, allowing for complex data retrieval and operations without data redundancy. SQL (Structured Query Language) is the predominant language used to query and manipulate data in these databases. Examples of relational database management systems (RDBMS) include Oracle, MySQL, PostgreSQL, and Microsoft SQL Server. Their structured nature makes them a popular choice for a wide range of applications, from simple data storage to large-scale enterprise solutions.
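To show primary keys, foreign keys, and a JOIN in action, here is a small sketch using Python’s built-in sqlite3 module; the tables and rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 99.0), (2, 2, 25.0);
""")

# A JOIN retrieves related data across both tables without redundancy.
query = """SELECT c.name, o.amount
           FROM orders o JOIN customers c ON o.customer_id = c.id"""
for row in conn.execute(query):
    print(row)  # ('Alice', 99.0), ('Bob', 25.0)
```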
Structured data refers to information that is organized in a predefined manner or schema, facilitating easy storage, querying, and analysis. Unlike unstructured data, which can be varied and lacks a specific format (e.g. plain text or video content), structured data is typically arranged in tables with rows and columns. Each row represents a record or entity, and each column signifies a specific attribute or field of that entity. Common examples of structured data include relational databases, where data is stored in tables, and CSV (Comma Separated Values) files, where data is delineated using specific delimiters. The structured nature of this data type allows for efficient querying and processing, making it highly suited for tasks like data analysis, business intelligence, and application development.
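As a quick illustration with Python’s built-in csv module, each row of a CSV file is a record and each column an attribute (the data below is made up):

```python
import csv
import io

data = io.StringIO("name,age,city\nAlice,34,Boston\nBob,45,Denver\n")

for record in csv.DictReader(data):
    print(record["name"], record["age"], record["city"])
```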
Okay, that was helpful – now what?
Since you’re already acquainted with all of the big data terms that every manager should know, you can read how to write better code, check out examples of difficult JavaScript interview questions, or – send us an email to discuss if Big Data solutions could be applied to your business case ([email protected]). Also, if you’re interested in a real-life example of Big Data in real estate, you can check Adradar, a search engine for property and real estate, based on AI and Big Data methods.