Getting started with Big Data (BD)?
Perhaps you have already got your feet wet in the world of BD but are still looking to expand your knowledge and cover the subjects you have heard of but never quite had time to explore?
Well, you have come to the right place.
This Big Data Glossary will briefly introduce you to the most important terms. We assure you it will be an easy and nice read!
It is by no means exhaustive, but it is a good, light read to prepare for a meeting with a Big Data director or vendor, or a quick refresher before a job interview. Also, if you are interested in similar terms related to Artificial Intelligence, I would encourage you to visit a similar blog post by our partner company, Sigmoidal.
At the very beginning, let us explain the most important term when speaking of Big Data: Big Data itself. The term refers to large and complex datasets whose analysis requires high computing power and can lead to extracting essential information and acquiring new knowledge. Here you can read more about Big Data and its applications.
Now that you know what Big Data is, it is high time to jump into more advanced definitions. Below are the technical terms our engineers at DevsData consider the most essential.
So, let’s get started!
Artificial Intelligence is intelligence demonstrated by machines. It lets them perform tasks normally reserved for humans, such as speech recognition, visual perception, decision making or prediction.
Business Intelligence is the process of analyzing raw data in search of valuable information in order to improve and better understand the business. Using BI can help to make fast and accurate business decisions.
Biometrics is a technology for recognizing people by their physical traits, like face, height, etc. It often relies on Artificial Intelligence algorithms.
Cloud computing is a term describing computing resources stored and running on remote servers. The resources, including software and data, can be accessed from anywhere by means of the internet.
A Big Data Scientist is a person who can take structured and unstructured data points and use formidable skills in statistics, maths, and programming to organize them. They apply all their analytical power, such as contextual understanding, industry knowledge, and understanding of existing assumptions, to uncover hidden solutions for business development.
Data visualization is the right tool when a quick look at a large amount of information is required. Using graphs, charts, diagrams, etc. allows the user to find interesting patterns or trends in the dataset. It also helps with validating data, because the human eye can easily notice unexpected values when they are presented graphically.
The Internet of Things, or IoT for short, is the concept of connecting devices, such as house lighting, heating or even fridges, to a common network. It allows collecting large amounts of data, which can later be used in real-time analytics. This term is also connected with the smart home, a concept of controlling a house with a phone etc.
Machine Learning is the ability of computers to acquire new skills without being explicitly programmed for them. In practice, this means algorithms that learn from the data they process and use what they have learned to make decisions. Machine learning is used to exploit the opportunities hidden in big data.
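To make "learning from data" a bit more tangible, here is a minimal sketch in Python. It uses ordinary least squares, which is only one of many possible techniques, and the ad spend and sales figures are made up for illustration. The program is never told the rule behind the numbers; it estimates the rule from the examples and then uses it on a new input.

```python
# Minimal sketch: "learning" a linear rule y = slope * x + intercept
# from examples, using ordinary least squares.

def fit_line(xs, ys):
    """Return the slope and intercept that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: ad spend (x) versus sales (y).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)

def predict(x):
    """Apply the learned rule to an input the program has not seen."""
    return slope * x + intercept

print(predict(5.0))
```

Real machine learning systems use far richer models and far more data, but the pattern of fitting on examples and then predicting is the same.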
A search engine is a software system that searches the web according to the conditions the user specifies in a search query. The most popular search engines are Google, Yahoo, and Bing. Big Data undoubtedly plays a major role in the development of search engines. These complex systems crawl the web and process the URLs related to the user's query in order to return the most accurate results.
Neural networks are a series of algorithms that recognize relationships in datasets through a process similar to the way the human brain works. An important feature of such a system is that it can generate the best possible result without the output criteria having to be redesigned. Neural networks are very useful in finance; for instance, they can be used to forecast stock market prices.
Big data can also be described with these 5 words, commonly known as the five V's: Volume, Velocity, Variety, Veracity and Value.
Some say that there are now 8 V's, with Visualization, Viscosity and Virality being the new ones.
Algorithm is a simple term that is absolutely essential when speaking of Big Data. An algorithm is a mathematical formula or a set of instructions that we provide to the computer, describing how to process the given data in order to obtain the needed information.
Concurrency is the ability to manage multiple tasks at the same time. It helps a machine deal with the many processes it runs. The best-known example of concurrency is multitasking.
I’ll be going a little deeper into analysis in this article, as big data’s holy grail is analytics. Comparative analysis, as the name suggests, is about comparing multiple processes, data sets or other objects using statistical techniques such as pattern analysis, filtering and decision-tree analytics. I know it’s getting a little technical, but I can’t completely avoid the jargon. Comparative analysis can be used in healthcare to compare large volumes of medical records, documents, images etc. for more effective and hopefully more accurate medical diagnoses.
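As a rough illustration of concurrency, the sketch below uses a thread pool from Python's standard library to wait on three slow operations at once. The `fetch` function and the sensor names are hypothetical stand-ins for real I/O such as network calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(source):
    """Pretend to wait on a slow I/O operation (hypothetical data source)."""
    time.sleep(0.1)
    return f"data from {source}"

sources = ["sensor-a", "sensor-b", "sensor-c"]

start = time.perf_counter()
# The pool manages all three waits at once instead of one after another.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, sources))  # results keep the input order
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three waits overlap, the total time should be close to one wait rather than three, even though no task runs any faster on its own.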
A data warehouse is a system that stores data so that it can be analyzed and processed in the future. The sources of the data vary depending on its purpose. Data can be uploaded from the company’s CRM systems as well as imported from external files or databases.
A data lake is a repository that stores a huge amount of raw data in its original format. While the hierarchical data warehouse stores information in files and folders, a data lake uses a flat architecture to store data. Each item in the repository has a unique identifier and is marked with a set of metadata tags. When a business query appears, the repository can be searched for specific information, and then a smaller, separate set of data can be analyzed to help solve a specific problem.
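The idea of a flat repository with unique identifiers and metadata tags can be sketched in a few lines of Python. This is a toy in-memory model, not how any real data lake product is implemented; the `store` and `search` helpers and the sample items are hypothetical.

```python
import uuid

# Flat namespace: identifier -> raw payload plus metadata tags.
lake = {}

def store(raw, **tags):
    """Keep the item in its original format and return its unique identifier."""
    item_id = str(uuid.uuid4())
    lake[item_id] = {"raw": raw, "tags": tags}
    return item_id

def search(**wanted):
    """Return identifiers of the items whose tags match all requested values."""
    return [
        item_id
        for item_id, item in lake.items()
        if all(item["tags"].get(key) == value for key, value in wanted.items())
    ]

# Raw items of different kinds live side by side, untouched.
log_id = store(b"2021-01-01 GET /index.html 200", source="web", kind="log")
csv_id = store("id,amount\n1,9.99\n", source="crm", kind="csv")
img_id = store(b"\x89PNG...", source="web", kind="image")

# A business question selects a smaller subset for further analysis.
web_items = search(source="web")
print(len(web_items))
```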
Data mining is an analytical process designed to study large data resources in search of regular patterns and systematic interrelationships between variables, and then to evaluate the results by applying the detected patterns to new subsets of data. The final goal of data mining is usually to predict customer behavior, sales volume, the likelihood of customer loss, etc.
Data cleansing is the process of correcting or removing faulty data or records in a database. This step is extremely important. When data is collected from sensors, websites or web scraping, some of it may turn out to be incorrect. Without cleansing, the user would risk drawing the wrong conclusions from the analysis.
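Here is a minimal sketch of what a cleansing step might look like in Python. The sensor readings, the field names and the plausible temperature range are all assumptions made up for this example; real pipelines apply rules specific to their own data.

```python
# Hypothetical raw readings, as they might arrive from temperature sensors.
raw_readings = [
    {"sensor": "t1", "celsius": "21.5"},
    {"sensor": "t1", "celsius": "21.5"},    # exact duplicate
    {"sensor": "t2", "celsius": "n/a"},     # not a number
    {"sensor": "t3", "celsius": "999"},     # outside the plausible range
    {"sensor": "t4", "celsius": " 19.0 "},  # valid once whitespace is trimmed
]

def clean(readings, low=-50.0, high=60.0):
    """Drop unparsable, implausible and duplicate rows; convert values to float."""
    seen = set()
    cleaned = []
    for row in readings:
        try:
            value = float(row["celsius"].strip())
        except ValueError:
            continue  # remove records that cannot be parsed
        if not low <= value <= high:
            continue  # remove physically implausible values
        key = (row["sensor"], value)
        if key in seen:
            continue  # remove duplicates
        seen.add(key)
        cleaned.append({"sensor": row["sensor"], "celsius": value})
    return cleaned

cleaned = clean(raw_readings)
print(cleaned)
```

Only the first `t1` reading and the trimmed `t4` reading survive; the other rows would have distorted any average computed from this data.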
Metadata is data that describes other data. Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified and file size are very basic document metadata. In addition to document files, metadata is used for images, videos, spreadsheets and web pages.
The term NoSQL is an abbreviation of “not only SQL”. It describes databases or database management systems that deal with non-relational or unstructured data. Their flexibility makes them a very common choice for processing large amounts of data.
R and Python are among the most commonly used open-source programming languages for Big Data. Python is considered to be slightly friendlier for beginners than R. Besides, it is very flexible and efficient when processing large datasets. R, on the other hand, is more specialized, as it is predominantly used for statistics. It has a large number of users who voluntarily contribute to its development by adding new libraries and packages, for example ggplot2, which is used for data visualization.
A server is a computer that receives requests related to applications. Its task is to respond to those requests or pass them on over a network. This term is commonly used in big data.
SQL is a language used to manage and query data in relational databases. There are many relational database systems, such as MySQL, PostgreSQL or SQLite. Each of these systems has its own SQL dialect, which differs slightly from the others.
Queries are the questions used to communicate with a database. Iterating over big datasets by hand would usually be very time-consuming. Instead, a query can be written; for example, the database can be asked to return all records for which a given condition is satisfied.
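To show a query in action, the sketch below uses SQLite through Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration; the same idea applies to other relational databases, although each has its own SQL dialect.

```python
import sqlite3

# An in-memory SQLite database with a small, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("carol", 250.0)],
)

# Ask the database for every record that satisfies a condition,
# instead of looping over all rows in application code.
rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > ? ORDER BY amount",
    (100,),
).fetchall()

print(rows)
conn.close()
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid SQL injection when a condition comes from user input.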
Cluster analysis is an exploratory analysis that tries to identify structures within the data. It is also called segmentation analysis or taxonomy analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, respondents. Cluster analysis is used to identify groups of cases when the grouping is not previously known. Because it is exploratory, it does not make any distinction between dependent and independent variables. The different cluster analysis methods that SPSS offers can handle binary, nominal, ordinal, and scale (interval or ratio) data.
Fuzzy logic is an approach to logic that, instead of judging whether a statement is true or false (values 0 or 1), expresses the degree to which the statement is true (values from 0 to 1). This approach is commonly used in Artificial Intelligence.
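One widely used clustering method is k-means. The sketch below is a simplified one-dimensional version written in plain Python, not the SPSS procedures mentioned above, and the customer ages and starting centers are hypothetical. Note that the algorithm is never told which group a value belongs to; it finds the groups on its own.

```python
def k_means_1d(values, centers, iterations=10):
    """Simplified k-means for one-dimensional data.

    Each pass assigns every value to its nearest center and then
    moves each center to the mean of the values assigned to it.
    """
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(
                range(len(centers)), key=lambda i: abs(value - centers[i])
            )
            clusters[nearest].append(value)
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer ages with no group labels attached.
ages = [18, 21, 22, 45, 47, 50]
centers, clusters = k_means_1d(ages, centers=[20.0, 40.0])

print(clusters)
print(centers)
```

With this data the values split into a younger and an older group. In practice the result depends on the starting centers and on the number of clusters chosen.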
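A small Python sketch can show the difference from true/false logic. The `warm` membership function and its 15 and 25 degree thresholds are assumptions chosen for illustration, and min, max and complement are one common (but not the only) way to define fuzzy AND, OR and NOT.

```python
def warm(temp_c):
    """Degree (0.0 to 1.0) to which a temperature counts as 'warm'.

    The 15 and 25 degree thresholds are hypothetical.
    """
    if temp_c <= 15:
        return 0.0
    if temp_c >= 25:
        return 1.0
    return (temp_c - 15) / 10  # linear ramp between the thresholds

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

print(warm(10), warm(20), warm(30))
print(fuzzy_and(warm(20), fuzzy_not(warm(30))))
```

Classical logic would have to call 20 degrees either warm or not warm; here it is warm to a degree of 0.5.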
When people think of big data, they immediately think about Hadoop. Hadoop (with its cute elephant logo) is an open-source software framework that includes the Hadoop Distributed File System (HDFS) and allows for storage, retrieval, and analysis of very large data sets using distributed hardware.
If you really want to impress someone, talk about YARN (Yet Another Resource Negotiator), which, as the name says, negotiates and schedules resources on a cluster. I am really impressed by the folks who come up with these names. The Apache Software Foundation, which maintains Hadoop, is also responsible for Pig, Hive, and Spark (yup, they are all names of various software pieces). Aren’t you impressed with these names?
MapReduce is a technique used to process large datasets with a parallel, distributed algorithm on a cluster. It is responsible for two types of tasks. “Map” divides the data and processes it at the node level. “Reduce” collects the results of Map and combines them into the answer to the query.
MongoDB is one of the most popular NoSQL database systems. It stores data in documents written in a JSON-like format. It is written in C++, a relatively low-level language, which helps it achieve high performance.
An object database stores data in the form of objects. The term “object” means the same thing as in object-oriented programming: an instance of a certain class.
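The classic illustration of MapReduce is counting words. The sketch below imitates the phases on a single machine in plain Python; it does not use Hadoop, and the two sample documents are made up. On a real cluster the map calls would run on different nodes, and a shuffle step, shown explicitly here, would route all values for the same key to one reducer.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input, one string per "node".
documents = ["big data is big", "data lakes store raw data"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```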
Parallelism is similar to concurrency, with a slight but important difference. Concurrency lets a machine manage multiple tasks at the same time; parallelism also lets it actually execute multiple tasks at the same time, thanks to multicore processors in which every core performs its own task.
Apache Spark is a fast, in-memory data processing engine that efficiently executes streaming, machine learning or SQL workloads that require fast iterative access to datasets. Spark is generally a lot faster than MapReduce, which we discussed earlier.