Perhaps you have already got your feet wet in the world of Big Data, but are still looking to expand your knowledge and cover the subjects you have heard of but never quite had time to explore?
Well, you have come to the right place.
This Big Data Glossary will briefly introduce you to the most important terms. We assure you it will be an easy and nice read!
It is not by any means exhaustive, but it is a good, light read to prepare for a meeting with a Big Data director or vendor, or a quick refresher before a job interview. Also, if you are interested in similar terms related to Artificial Intelligence, we would encourage you to visit a similar blog post by our partner company, Sigmoidal.
At the very beginning, let us explain the most important term when speaking of Big Data: Big Data itself. The term refers to datasets so large and complex that analyzing them requires high computing power. That analysis can lead to extracting essential information and acquiring new knowledge. Here you can read more about Big Data and its applications.
Have you ever wondered how much data we generate? Tons of gigabytes. Watch this video and find out what Big Data really is.
Now that you know what Big Data is, it is high time to jump into more advanced definitions. Below are the technical terms our engineers at DevsData consider the most essential.
So, let’s get started!
Artificial Intelligence is intelligence exhibited by machines. It lets them perform tasks normally reserved for humans, such as speech recognition, visual perception, decision making, or making predictions.
Biometrics is a technology for recognizing people by their physical traits, such as the face or height. It often relies on Artificial Intelligence algorithms.
When people think of big data, they immediately think about Hadoop. Hadoop (with its cute elephant logo) is an open-source software framework built around what is called the Hadoop Distributed File System (HDFS). It allows for storage, retrieval, and analysis of very large data sets using distributed hardware.
If you really want to impress someone, talk about YARN (Yet Another Resource Negotiator), which, as the name suggests, negotiates and schedules cluster resources. I am really impressed by the folks who come up with these names. The Apache Software Foundation, home of the Hadoop project, is also responsible for Pig, Hive, and Spark (yup, they are all names of various software pieces). Aren’t you impressed with these names?
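Big data can also be described with five words that start with the letter V. The most commonly cited "5 V's" are Volume, Velocity, Variety, Veracity, and Value.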
Some say that there are now 8 V's, with Visualization, Viscosity, and Virality being the new ones.
A server is a computer that receives requests related to applications. Its task is to respond to those requests or to forward them over a network. This term is commonly used in big data.
Cloud computing is a term describing computing resources stored and running on remote servers. The resources, including software and data, can be accessed from anywhere by means of the internet.
Fuzzy logic is an approach to logic that, instead of judging whether a statement is simply true or false (values 0 or 1), expresses the degree to which the statement is true (any value from 0 to 1). This approach is commonly used in Artificial Intelligence.
Neural networks are a series of algorithms that recognize relationships in datasets through a process loosely modeled on the way the human brain works. An important feature of such a system is that it can adapt to changing input, so it can produce the best possible result without the output criteria being redesigned. Neural networks are very useful in finance; for instance, they can be used to forecast stock market prices.
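To make the idea of a "degree of truth" concrete, here is a minimal sketch in Python. The function names, the "warm" example, and the 15 °C and 25 °C thresholds are our own illustrative assumptions, and min/max/complement are just one common choice of fuzzy operators.

```python
def warm_membership(temp_c: float, low: float = 15.0, high: float = 25.0) -> float:
    """Return the degree (0.0 to 1.0) to which a temperature counts as 'warm'.

    The thresholds are illustrative assumptions: at or below `low` the
    statement is fully false, at or above `high` it is fully true, and in
    between the degree of truth rises linearly.
    """
    if temp_c <= low:
        return 0.0
    if temp_c >= high:
        return 1.0
    return (temp_c - low) / (high - low)


def fuzzy_and(a: float, b: float) -> float:
    """Fuzzy AND, using the common 'minimum' definition."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """Fuzzy OR, using the common 'maximum' definition."""
    return max(a, b)


def fuzzy_not(a: float) -> float:
    """Fuzzy NOT, using the standard complement."""
    return 1.0 - a
```

With these assumptions, `warm_membership(20.0)` is neither "true" nor "false" but half true, which is exactly what classical two-valued logic cannot express.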
A data warehouse is a system that stores data so that it can be analyzed and processed in the future. The source of the data can vary, depending on its purpose. Data can be uploaded from the company’s CRM systems as well as imported from external files or databases.
MapReduce is a technique used to process large datasets with a parallel, distributed algorithm on a cluster. It consists of two types of tasks. “Map” divides the data and processes each part at the node level. “Reduce” collects the results of the Map tasks and combines them into the answer to the query.
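The Map and Reduce steps can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the function names are our own, and the "shuffle" step that groups values by key is something a real framework does for you between the two phases.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: turn one chunk of text into (word, 1) pairs.

    On a real cluster, each node would run this on its own chunk.
    """
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key so each key can be reduced once."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: list[str]) -> dict[str, int]:
    """Run map over every chunk, then shuffle and reduce the results."""
    mapped: list[tuple[str, int]] = []
    for chunk in chunks:  # sequential here; parallel on a cluster
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))
```

For example, `word_count(["big data", "Big ideas"])` counts "big" twice, because both chunks emit the pair `("big", 1)` and the reduce step sums them.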
SQL is a language used to manage and query data in relational databases. There are many relational database systems, such as MySQL, PostgreSQL, and SQLite. Each of these systems has its own SQL dialect, which differs slightly from the others.
Queries are the questions used to communicate with a database. Iterating over a big dataset by hand would usually be very time-consuming. Instead, a query can be created; for example, the database can be asked to return all records where a given condition is satisfied.
The term NoSQL is an abbreviation of “not only SQL”. It describes databases or database management systems that deal with non-relational or unstructured data. Thanks to their flexibility, they are very commonly used when processing large amounts of data.
MongoDB is one of the most popular NoSQL database systems. It stores data in documents written in a JSON-like format. It is written in a relatively low-level language (C++), which helps it achieve high performance.
An object database stores data in the form of objects. The term “object” means the same thing as in object-oriented programming: an instance of a certain class.
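As a small illustration of asking the database for "all records where a given condition is satisfied", the sketch below uses Python's built-in `sqlite3` module with an in-memory SQLite database. The `customers` table, its columns, and the sample rows are made up for this example.

```python
import sqlite3


def customers_in_city(city: str) -> list[tuple[int, str]]:
    """Return (id, name) for every customer living in the given city."""
    # An in-memory database keeps the example self-contained.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
        )
        conn.executemany(
            "INSERT INTO customers (name, city) VALUES (?, ?)",
            [("Alice", "Warsaw"), ("Bob", "New York"), ("Carol", "Warsaw")],
        )
        # The WHERE clause is the condition; the database does the filtering,
        # so the application never loops over the whole table itself.
        cursor = conn.execute(
            "SELECT id, name FROM customers WHERE city = ? ORDER BY id",
            (city,),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

The `?` placeholder passes the city as a parameter rather than pasting it into the SQL string, which is the usual way to avoid SQL injection.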
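To give a feel for what a JSON-like document looks like, here is a sketch in plain Python. It deliberately does not use a MongoDB driver: the `order` document, its fields, and the `total_quantity` helper are hypothetical, and the only MongoDB-specific detail is `_id`, the field MongoDB uses as a document's primary key.

```python
import json

# One self-contained document: nested objects and arrays live inside it,
# where a relational design would typically spread them over several tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Alice", "city": "Warsaw"},
    "items": [
        {"sku": "A-1", "quantity": 2},
        {"sku": "B-7", "quantity": 1},
    ],
}


def total_quantity(document: dict) -> int:
    """Sum the quantities of all items embedded in an order document."""
    return sum(item["quantity"] for item in document["items"])


# The document round-trips through JSON text without losing its structure.
serialized = json.dumps(order)
restored = json.loads(serialized)
```

Because each document carries its own structure, two documents in the same collection do not have to share exactly the same fields.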
Data visualization is a good solution when a quick look at a large amount of information is required. Using graphs, charts, diagrams, etc. allows the user to find interesting patterns or trends in the dataset. It also helps when it comes to validating data, because the human eye can spot unexpected values more easily when they are presented graphically.
Data cleansing is the process of correcting or removing data or records in a database. This step is extremely important. When data is collected from sensors, websites, or web scraping, some of it may be incorrect. Without cleansing, the user would be at risk of coming to wrong conclusions after analyzing this data.
The Internet of Things, or IoT for short, is the concept of connecting devices, such as house lighting, heating, or even fridges, to a common network. These devices generate large amounts of data, which can be stored and later used in real-time analytics. The term is also connected with the smart home, the concept of controlling a house with a phone and similar devices.
Concurrency is the ability to manage multiple tasks at the same time. It helps to deal with the many processes performed by a machine. The best-known example of concurrency is multitasking.
Parallelism is similar to concurrency, with a slight but important difference. Concurrency lets a machine manage multiple tasks at the same time, while parallelism also lets it actually execute multiple tasks at the same time. This is possible thanks to multicore processors, where every core performs its own task.
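A cleansing step for sensor data might look like the sketch below. The record layout (`sensor`, `timestamp`, `temp_c`), the valid temperature range, and the rule that drops bad records instead of repairing them are all assumptions made for this illustration.

```python
def clean_readings(
    readings: list[dict], min_valid: float = -50.0, max_valid: float = 60.0
) -> list[dict]:
    """Return only plausible, complete, non-duplicate temperature readings."""
    cleaned = []
    seen = set()
    for record in readings:
        sensor = record.get("sensor")
        timestamp = record.get("timestamp")
        value = record.get("temp_c")
        # Drop records with missing fields.
        if sensor is None or value is None:
            continue
        # Drop records whose value cannot be read as a number.
        try:
            value = float(value)
        except (TypeError, ValueError):
            continue
        # Drop physically implausible values (the range is an assumption).
        if not (min_valid <= value <= max_valid):
            continue
        # Drop exact repeats of the same sensor and timestamp.
        key = (sensor, timestamp)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"sensor": sensor, "timestamp": timestamp, "temp_c": value})
    return cleaned
```

In a real project, the rules would come from knowledge of the data source, and removed records are often logged so that the cleansing itself can be reviewed.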
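Managing several tasks at once can be sketched with Python's standard `asyncio` module. The task names and delays below are invented, and `asyncio.sleep` stands in for real waiting such as a network call. Everything here runs on a single thread: the tasks take turns while the others wait.

```python
import asyncio


async def fetch(name: str, delay: float, log: list[str]) -> str:
    """Pretend to do some I/O that takes `delay` seconds."""
    log.append(f"start {name}")
    # While this task waits, the event loop is free to run other tasks.
    await asyncio.sleep(delay)
    log.append(f"finish {name}")
    return name


async def main() -> tuple[list[str], list[str]]:
    """Run two tasks concurrently and record the order of events."""
    log: list[str] = []
    # gather() starts both tasks and returns results in argument order.
    results = await asyncio.gather(
        fetch("slow", 0.2, log),
        fetch("fast", 0.05, log),
    )
    return list(results), log
```

Calling `asyncio.run(main())` shows the interleaving in the log: the slow task starts first, but the fast one finishes before it, because neither task blocks the other while it waits.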
Hive is an open-source data warehouse software project built on top of Hadoop that provides data summarization, querying, and analysis. Users can write queries in a language known as HiveQL, which is close to SQL. Hadoop is a system within the distributed computing world that manages massive datasets.
Big Data, undoubtedly, plays a spectacular role in the development of search engines. The complex systems crawl through the web and process the URLs related to the query provided by the user in order to return the most accurate results.