Technology dictionary

Meet Datapedia

We offer you an essential glossary of Big Data and Artificial Intelligence terms.

A

Precisión

Accuracy

The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows:

Accuracy = Correct Predictions / Total Number of Examples

In binary classification, accuracy has the following definition:

Accuracy = (True Positives + True Negatives) / Total Number of Examples
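
As a quick illustration, here is a minimal Python sketch (with made-up labels) showing that the two definitions agree on binary data:

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Multi-class form: correct predictions / total number of examples.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Binary form: (true positives + true negatives) / total number of examples.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

print(accuracy, (tp + tn) / len(y_true))  # both print 0.8333...
```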

Algoritmo

Algorithm

A series of repeatable steps for carrying out a certain type of task with data. As with data structures, people studying computer science learn about different algorithms and their suitability for various tasks.

Inteligencia artificial

Artificial intelligence

In AI’s early days in the 1960s, researchers sought general principles of intelligence to implement, often using symbolic logic to automate reasoning. As the cost of computing resources dropped, the focus moved more toward statistical analysis of large amounts of data to drive decision making that gives the appearance of intelligence.

Read more

Área bajo la curva ROC

AUC (Area Under the ROC Curve)

An evaluation metric that considers all possible classification thresholds. The ROC curve is the plot of sensitivity against (1 − specificity). (1 − specificity) is also known as the False Positive Rate, and sensitivity is also known as the True Positive Rate. The Area Under the ROC (Receiver Operating Characteristic) curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
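
A hedged sketch of how this metric is typically computed, assuming scikit-learn is available (the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # actual classes
y_score = [0.1, 0.4, 0.35, 0.8]  # classifier confidence for the positive class

# roc_auc_score sweeps all classification thresholds internally.
print(roc_auc_score(y_true, y_score))  # 0.75
```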

B

Bayes (Teorema de Bayes)

Bayes Theorem

Named after the eighteenth-century English statistician and Presbyterian minister Thomas Bayes, Bayes’ Theorem is used to calculate conditional probability. Conditional probability is the probability of an event ‘B’ occurring given that a related event ‘A’ has already occurred (P (B|A)).
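
As a worked example (the probabilities below are hypothetical), Bayes' Theorem computes P(B|A) = P(A|B) · P(B) / P(A):

```python
# Hypothetical numbers: a test detects a disease 99% of the time when present,
# 1% of people have the disease, and 2% of all tests come back positive.
p_a_given_b = 0.99  # P(A|B): positive test given disease
p_b = 0.01          # P(B): prior probability of disease
p_a = 0.02          # P(A): overall probability of a positive test

p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # 0.495: a positive test implies ~49.5% chance of disease
```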

Bayes (Estadística Bayesiana)

Bayesian Statistics

Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people with the tools to update their beliefs in light of new data. It is based on the use of Bayesian probabilities to summarize evidence.

Sesgo

Bias

An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula:

y′ = b + w1x1 + w2x2 + … + wnxn

In machine learning, “bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal.... It’s easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias).”

Big Data

Big Data

In general, it refers to the ability to work with collections of data that had previously been impractical because of their volume, velocity, and variety (the “three Vs”, or four Vs if we include veracity). A key driver of this new ability has been the easier distribution of storage and processing across networks of inexpensive commodity hardware using technology such as Hadoop, instead of requiring larger, more powerful individual computers. But it is not the amount of data that is important; it is how organizations use this large amount of data to generate insights. Companies use various tools, techniques and resources to make sense of this data and derive effective business strategies.

Read more

Clase binaria

Binary Class

Binary variables are variables that can take only two unique values. For example, a variable “Smoking Habit” can contain only two values, such as “Yes” and “No”.

Blaze

Blaze

This is a Python library that extends the capabilities of NumPy and Pandas to distributed and streamed data. One can use it to access data from a wide range of sources such as Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc.

Bokeh

Bokeh

One can generate attractive, interactive graphics and web applications with this Python library. It is particularly useful for applications with “live” (streaming) data.

Bot

Bot

Bot, chatbot, talkbot, chatterbot, conversational assistant, virtual assistant, etc. are just different names for computer programs that communicate with us as if they were human. Bots can do many tasks, some good, such as buying tickets for concerts, unlocking a user's account, or offering options to reserve a holiday home on specific dates; and some bad, such as carrying out cyber-attacks or causing a financial catastrophe through high-speed stock trading.

Bots ("bot" is short for "robot") can be written in any programming language and can function as a client, as a server, as a mobile agent, etc. When they specialize in a specific function, they are usually called "Expert Systems".

Analítica de Negocio

Business Analytics

Business analytics mainly refers to the practical methodology that an organization uses to extract insights from its data. The methodology focuses on statistical analysis of the data.

Inteligencia de negocio

Business Intelligence

Business intelligence refers to a set of strategies, applications, data and technologies used by an organization for data collection, analysis and insight generation, in order to derive strategic business opportunities.

C

C++

C++

This low-level programming language is used for software such as operating system components or network protocols, and frequently in embedded systems and sensor infrastructure. Although it can be a complicated language for beginners, it has huge potential. It has very useful Machine Learning libraries such as LibSVM, Shark and MLPack.

Variable categórica

Categorical Variable

Categorical variables (or nominal variables) are those variables with discrete qualitative values. For example, names of cities and countries are categorical.

Chatbot

Chatbot

A chatbot is a bot (see bot) or virtual assistant that uses a chat as an interface to communicate with humans.

Chi (Test chi-cuadrado)

Chi-square test

Chi-square is “a statistical method used to test whether the classification of data can be ascribed to chance or to some underlying law” (Wordpanda). The chi-square test “is an analysis technique used to estimate whether two variables in a cross tabulation are correlated”.

Clasificación

Classification

This is a supervised learning method where the output variable is a category, such as “Male” or “Female”, or “Yes” or “No”. Deciding whether an email message is spam or not classifies it between two categories, and analysis of data about movies might lead to their classification among several genres. Examples of classification algorithms are Logistic Regression, Decision Tree, k-NN, SVM, etc.

Segmentación

Clustering

Clustering is an unsupervised learning method used to discover the inherent groupings in data. For example, grouping customers by their purchasing behavior in order to segment them; companies can then apply the appropriate marketing tactics to each segment to generate more profit. Examples of clustering algorithms: K-Means, hierarchical clustering, etc.

Coeficiente

Coefficient

A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity. When graphing an equation such as y = 3x + 4, the coefficient of x determines the line's slope. Discussions of statistics often mention specific coefficients for specific tasks such as the correlation coefficient, Cramer’s coefficient, and the Gini coefficient.

Inteligencia cognitiva

Cognitive intelligence

Cognitive Intelligence is an important part of Artificial Intelligence that mainly covers the technologies and tools allowing our apps, websites and bots to see, hear, speak, and understand and interpret users' needs expressed in natural language. In summary, it is the application of AI that allows machines to understand the language of their users, so that users do not have to understand the language of the machines.

Read more

Lingüística computacional

Computational linguistics

Also, natural language processing, NLP. A branch of computer science for analyzing texts in spoken languages (for example, English or Mandarin) to convert them into structured data that you can use to drive program logic. Early efforts focused on translating between languages or accepting complete sentences as queries to databases; modern efforts often analyze documents and other data (for example, tweets) to extract potentially valuable information.

Intervalo de confianza

Confidence interval

A range specified around an estimate to indicate margin of error, combined with a probability that a value will fall in that range. The field of statistics offers specific mathematical formulas to calculate confidence intervals.

Matriz de confusión

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model. It is an N × N matrix, where N is the number of classes, formed from the model's predicted classes vs. the actual classes. The 2nd quadrant is called type II error, or False Negatives, whereas the 3rd quadrant is called type I error, or False Positives.
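
A minimal sketch using scikit-learn (assumed available), with made-up binary labels:

```python
from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_actual, y_predicted))
# [[2 1]
#  [1 2]]
```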

Variable continua

Continuous variable

A variable whose value can be any of an infinite number of values, typically within a particular range. For example, if you can express age or size with a decimal number, then they are continuous variables. In a graph, the value of a continuous variable is usually expressed as a line plotted by a function. Compare discrete variable.

Correlación

Correlation

“The degree of relative correspondence between two sets of data.” If sales go up when the advertising budget goes up, they correlate.

The correlation coefficient is a measure of how closely the two data sets correlate. A correlation coefficient of 1 is a perfect correlation, 0.9 is a strong correlation, and 0.2 is a weak correlation. A coefficient of 0 would show no correlation. This value can also be negative, as when the incidence of a disease goes down when vaccinations go up. A correlation coefficient of -1 is a perfect negative correlation. Always remember, though, that correlation does not imply causation.
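
A quick NumPy sketch (with made-up advertising and sales figures) of computing the correlation coefficient:

```python
import numpy as np

ad_budget = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 41, 52]

# np.corrcoef returns the correlation matrix; the off-diagonal entry
# is the correlation coefficient between the two series.
r = np.corrcoef(ad_budget, sales)[0, 1]
print(r)  # close to 1: a strong positive correlation
```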

Covarianza

Covariance

“A measure of the relationship between two variables whose values are observed at the same time; specifically, the average value of the two variables diminished by the product of their average values.” “Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means.”

Validación cruzada

Cross-validation

When using data with an algorithm, “the name given to a set of techniques that divide up data into training sets and test sets. The training set is given to the algorithm, along with the correct answers and it becomes the set used to make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated.”
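
A minimal sketch of k-fold cross-validation with scikit-learn (assumed available), using its bundled iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into five train/test partitions
# and the model is scored on each held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # overall score across the five folds
```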

D

Analista de datos

Data Analyst

Responsible for analysing, using statistical techniques among others, the historical data of the organization in order to make better-informed decisions in the future (from how to avoid losing customers to defining pricing strategies). Their function is to analyse historical data to detect patterns of behaviour and/or trends (descriptive and/or predictive analysis). For this role, knowledge of statistics, together with critical-thinking skills, is fundamental. Communication skills are also of great importance. In short, their function is "understanding what has happened in the past to make better decisions in the future".

Controlador de datos

Data Controller

The organization that collects the data (for GDPR purposes).

Ingeniero de datos

Data Engineer

A specialist in the management of data. “Data engineers are the ones that take the messy data... and build the infrastructure for real, tangible analysis. They run ETL software, marry data sets, enrich and clean all that data that companies have been storing for years.”

Manager de Gobernanza de datos

Data Governance Manager

The person in charge of defining and organizing the process of collecting, storing, and accessing data, guaranteeing its security and confidentiality at all times. Their function is to define policies and standards and verify compliance with them; to manage the life cycle of the data; and to make sure the data is guarded in a safe and organized manner, and that only authorized persons can access it. This role requires combining a functional knowledge of how databases and other associated technologies work with a comprehensive understanding of the regulations of each industry (financial, pharmaceutical, telecommunications, etc.). In short, their function is to "define and ensure compliance with the rules that govern the flow of data". Once we have a system in which the data is well organized, accessible and securely guarded, it is in our interest to take advantage of it, extracting those valuable "insights", or keys to patterns of behaviour, that make our processes more efficient and innovative day by day. See GDPR.

Insights de datos, Descubrimientos, Hallazgos, Claves

Data Insight

The concept "data insight" means the knowledge or deep understanding of the data in a way that can guide correct and productive business actions. "Data-driven" companies are those that make decisions based on data, in particular, on data insights (data-based decisions). LUCA solutions help companies become Data Driven companies.

Minería de datos

Data mining

Generally, the use of computers to analyze large data sets to look for patterns, allowing people to make business decisions. Data mining is the study of extracting useful information from structured and unstructured data. It is used for market analysis, determining customer purchase patterns, financial planning, fraud detection, etc.

Procesador de datos

Data Processor

Often a third party responsible for collecting data on behalf of the controller. See GDPR.

Ciencia de datos

Data Science

Data science is a combination of data analysis, algorithm development, statistics and software engineering used to solve analytical problems. Data science work often requires knowledge of both statistics and software engineering. The main goal is to use data to generate business value.

Read more

Científico de datos

Data Scientist

The person responsible for performing prescriptive analysis of the business's historical data, so as not only to anticipate what will happen in the future and when, but also to explain why. In this way they can establish which decisions will have to be made to take advantage of a future business opportunity or mitigate a possible risk, showing the implication of each option in the result. Their function is to build and apply Machine Learning models capable of continuing to learn, improving their predictive capacity as the volume of collected data increases. This role requires advanced knowledge of mathematics in general (and of statistics in particular), knowledge of Machine Learning, and programming skills in SQL, Python, R or Scala. On occasion, the Data Analyst is considered a Data Scientist "in training", so the border between the tasks and functions of the two roles is sometimes not so clear. In short, their function is "modelling the future".

Sujeto de datos

Data Subject

The individual whose data is being used (for GDPR purposes).

Tratamiento de datos

Data wrangling

Also, data munging. The conversion of data, often using scripting languages, to make it easier to work with. This is a very time-consuming task.

Administrador de bases de datos

Database Administrator (DBA)

Responsible for the physical and logical design, management and administration of databases. Their function is to guarantee security, optimization, monitoring and problem solving, and to analyse and forecast present and future capacity. It is a very technical role requiring deep knowledge of SQL and, increasingly, of non-SQL databases. Likewise, management skills may be necessary to design policies and procedures for the use, management, maintenance and security of databases. In short, their function is to make sure that "the machine works".

Árbol de decisión

Decision trees

A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population (or sample) into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
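
A hedged sketch of training a shallow decision tree with scikit-learn (assumed available):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each internal node splits the sample on the most significant feature;
# max_depth limits how many successive splits are allowed.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict(X[:5]))
```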

Aprendizaje Profundo

Deep learning

Typically, a multi-level algorithm that gradually identifies things at higher levels of abstraction. For example, the first level may identify certain lines, then the next level identifies combinations of lines as shapes, and then the next level identifies combinations of shapes as specific objects. As you might guess from this example, deep learning is popular for image classification.  Deep Learning is associated with a machine-learning algorithm (Artificial Neural Network, ANN) which uses the concept of the human brain to facilitate the modeling of arbitrary functions. ANN requires a vast amount of data and this algorithm is highly flexible when it comes to modelling multiple outputs simultaneously.

Read more

Deeplearning4j

Deeplearning4j

It is a Deep Learning library written for Java and Scala. It provides an environment for developers to train and deploy AI models.

Variable dependiente

Dependent Variable

The value of a dependent variable "depends" on the value of the independent variable. If you are measuring the effect of different advertising budgets on total sales, then the advertising budget is the independent variable and total sales is the dependent variable.

Analítica descriptiva

Descriptive Analytics

This consists of the analysis of historical data, and of data collected in real time, in order to generate insights into how business strategies, for example marketing campaigns, have been working.

Read more

Digital Director ("Chief Data Officer" CDO)

The person responsible for directing, planning and controlling the digital transformation of a brand, and therefore the person most responsible for the areas of Data Governance, Information Management and Security.

Their function is to establish a strategy that guarantees the digital growth of the company in a sustainable way, able to adapt fluidly to the continuous changes in the digital landscape. They should also encourage the internal and external relations of the organization, attract the best talent, lead teams and resolve with diplomacy the tensions that may arise between different departments within the company.

For this role it is very important to have extensive experience in the digital world, strategic vision, communication skills for teamwork, and creativity. The CDO must be innovative, sometimes even disruptive, and have decision-making power and resources. For this reason, they usually report directly to the CEO.

The CDO can have some "overlaps" with the figure of the CIO ("Chief Information Officer"), but it is a role that, beyond technological innovation, adds a clear marketing component aimed at exploiting "Digital Assets".

Reducción de dimensionalidad

Dimension reduction

Also, dimensionality reduction. “We can use a technique called principal component analysis to extract one or more dimensions that capture as much of the variation in the data as possible.” For this purpose, linear algebra is involved; “broadly speaking, linear algebra is about translating something residing in an m-dimensional space into a corresponding shape in an n-dimensional space.”

Variable discreta

Discrete Variable

A variable whose potential values must be one of a specific number of values. If someone rates a movie with between one and five stars, with no partial stars allowed, the rating is a discrete variable. In a graph, the distribution of values for a discrete variable is usually expressed as a histogram.

E

Análisis exploratorio

EDA

EDA, or exploratory data analysis, is a phase of the data science pipeline in which the focus is to understand the data and gain insights from it through visualization or statistical analysis.

Arquitecto de datos

Enterprise Data Architect

Responsible for creating the structure by which data is collected and accessed, and for defining how the data moves. Their main function is the design of the data-usage environment: how data is stored, how it is accessed and how it is shared and used by different departments, systems or applications, in line with the business strategy. It is a strategic role, requiring a vision of the complete data life cycle; it therefore involves aspects of data modeling, database design, SQL development, and software project management. It is also important to know and understand how traditional and emerging technologies can contribute to the achievement of business objectives. In short, their function is to "define the global vision".

Métricas de evaluación

Evaluation metrics

The purpose of an evaluation metric is to measure the quality of a statistical or machine learning model.

Sistema experto

Expert system

An expert system is a system that uses human knowledge captured in a computer to solve problems that would normally be solved by human experts. Well-designed systems mimic the reasoning process that experts use to solve specific problems. In certain domains, these systems can make decisions better than any individual human expert, and they can be used by non-experts to improve their problem-solving skills.

Read more

F

Característica

Feature

The machine learning expression for a piece of measurable information about something. If you store the age, annual income, and weight of a set of people, you are storing three features about them. In other areas of the IT world, people may use the terms property, attribute, or field instead of “feature.”

Feature Selection is the process of choosing the features required to explain the predictive power of a statistical model, and dropping irrelevant ones. This can be done either by filtering out less useful features or by combining features to make new ones.

G

GATE

GATE

“General Architecture for Text Engineering” is an open-source, Java-based framework for natural language processing tasks. The framework lets you pipeline other tools designed to plug into it. The project is based at the UK’s University of Sheffield.

RGPD

GDPR

On May 25, 2018, the new General Data Protection Regulation (GDPR) came into force. The main objective of this new regulation is to govern the collection, use and exchange of personal data. The amount of data we create every day is growing at an exponential rate and, as the regulation says, "the processing of personal data should be designed to serve mankind".

Read more

Potenciación del gradiente

Gradient Boosting

Gradient boosting is a machine learning technique used for regression and statistical classification problems. It produces a predictive model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model iteratively and generalizes it by allowing the optimization of an arbitrary differentiable loss function. (Wikipedia)
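
A minimal sketch, assuming scikit-learn's implementation and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# An ensemble of shallow decision trees, built iteratively: each new tree
# is fitted to correct the errors of the ensemble built so far.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
print(model.score(X, y))
```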

H

Hadoop

Hadoop

Hadoop is an open-source project from the Apache Foundation, introduced in 2006 and developed in Java. Its objective is to offer a working environment appropriate for the demands of Big Data (the four Vs). As such, Hadoop is designed to work with large Volumes of data, both structured and unstructured (Variety), and to process them in a secure and efficient way (Veracity and Velocity).

To achieve this, it distributes both the storage and the processing of information between various computers working together in "clusters". These clusters have one or more master nodes in charge of managing the distributed files where the information is stored in different blocks, as well as coordinating and executing the different tasks among the cluster's members. As such, it is a highly scalable system that also offers software "redundancy".

Read more

Heurístico

Heuristic

A practical and non-optimal solution to a problem, which is sufficient for making progress or for learning from.

Capa Oculta

Hidden layer

A synthetic layer in a neural network between the input layer (the features) and the output layer (the prediction). A neural network contains one or more hidden layers.

Histograma

Histogram

A graphical representation of the distribution of a set of numeric data, usually a vertical bar graph.

Datos de prueba

Holdout data

This refers to examples intentionally not used ("held out") during training. The validation data set and test data set are examples of holdout data. Holdout data helps evaluate your model's ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen data set than does the loss on the training set.

Hiperplano

Hyperplane

A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically, in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.

I

Imputación

Imputation

Imputation is a technique used for handling missing values in the data. This is done either by statistical metrics like mean/mode imputation or by machine learning techniques like kNN imputation.
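
A short sketch of mean imputation, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Mean imputation: each missing value is replaced by its column mean.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # the NaN becomes (1 + 7) / 2 = 4
```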

Inferencia estadística

Inferential Statistics

In inferential statistics, we try to hypothesize about the population by looking only at a sample of it. For example, before releasing a drug onto the market, internal tests are done to check whether the drug is viable for release. But we cannot check the whole population for the viability of the drug, so we do it on a sample that best represents the population.

Interpretabilidad

Interpretability

The degree to which a model's predictions can be readily explained. Deep models are often uninterpretable; that is, a deep model's different layers can be hard to decipher.

J

Java

Java

This is one of the most commonly used programming languages for Machine Learning, due to its consistency, clarity and flexibility. It is an open-source language that is compatible with any platform and practically any application. It features a large number of libraries, some of which are focused on Machine Learning, such as Spark+MLlib, Mahout and Deeplearning4j.

K

k-means clustering

k-means clustering

It is a type of unsupervised algorithm that solves the clustering problem. It is a simple procedure for classifying a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.
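
A minimal k-means sketch with scikit-learn (assumed available) and made-up two-dimensional points:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [8, 8],
                   [9, 9], [1, 0.5], [8.5, 9]])

# Ask for k = 2 clusters; each point is assigned to its nearest centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index for each point
print(kmeans.cluster_centers_)  # the two centroids found
```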

k-vecino más próximo

k-nearest neighbors

k-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors, as measured by a distance function: the case is assigned to the class most common among those neighbors.
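
A hedged sketch with scikit-learn (assumed available) and made-up one-dimensional data:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# With k = 3, a new case receives the majority class of its three
# nearest neighbors under the (default Euclidean) distance.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5], [10.5]]))  # [0 1]
```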

Keras

Keras

A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras.

Curtosis o apuntalamiento

Kurtosis

Kurtosis is a descriptor of the shape of a probability distribution and is explained in terms of the central peak. Higher values of kurtosis indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.

L

LibSVM

LibSVM

This C++ library can easily be used to work with Support Vector Machines (SVMs). It is used to solve classification and regression problems.

Indicador de confianza

Lift

In data mining, Lift compares the frequency of an observed pattern with how often you’d expect to see that pattern just by chance. If the lift is near 1, then there’s a good chance that the pattern you observed is occurring just by chance. The larger the lift, the more likely that the pattern is ‘real’.

Algebra lineal

Linear algebra

A branch of mathematics dealing with vector spaces and operations within them such as addition and multiplication. “Linear algebra is designed to represent systems of linear equations. Linear equations are designed to represent linear relationships, where one entity is written to be a sum of multiples of other entities. In the shorthand of linear algebra, a linear relationship is represented as a linear operator—a matrix.”

Regresión lineal

Linear Regression

A technique to look for a linear relationship (that is, one where the relationship between two varying amounts, such as price and sales, can be expressed with an equation representable as a straight line on a graph), starting from a set of data points that do not necessarily line up nicely. This is done by computing the "least squares" line: the one that has, on an x-y graph, the smallest possible sum of squared distances to the actual data point y values. Statistical software packages, and even typical spreadsheet packages, offer automated ways to calculate this.
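
As an illustration, NumPy's polyfit computes exactly this least-squares line (the price and sales figures are made up):

```python
import numpy as np

price = np.array([10, 12, 14, 16, 18])
sales = np.array([100, 92, 83, 77, 68])

# deg=1 fits a straight line: the slope and intercept that minimize the
# sum of squared vertical distances to the actual data points.
slope, intercept = np.polyfit(price, sales, deg=1)
print(slope, intercept)
```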

LISP

LISP

An acronym of List Processor, this language was created by John McCarthy, now seen by many as the father of Artificial Intelligence. His idea was to optimize the functioning and use of the resources available on the computers of the era. The new language, based in part on the already-existing Fortran, used innovative techniques such as data "trees" (a hierarchical data structure) and "symbolic computation" (also known as "computer algebra"), from which symbolic programming would later be born. Lisp soon became the favourite language of the Artificial Intelligence world.

Read more

Logaritmo

Logarithm

The logarithm of a number is the exponent to which another fixed number, the base, must be raised to produce that number: if b^y = x, then y is the logarithm of x to base b. Working with the log of one or more of a model's variables, instead of their original values, can make it easier to model relationships with linear functions instead of non-linear ones. Linear functions are typically easier to use in data analysis.

Regresión logística

Logistic Regression

A model similar to linear regression but where the potential results are a specific set of categories instead of being continuous.

M

Aprendizaje Automático

Machine learning

Machine Learning refers to the techniques involved in dealing with vast data in the most intelligent fashion (by developing algorithms) to derive actionable insights. In these techniques, we expect the algorithms to learn by themselves without being explicitly programmed.

Read more

Mahout

Mahout

This Java library is very similar to Python’s NumPy. It focuses on mathematical, algebraic and statistical expressions.

MATLAB

MATLAB

A commercial computer language and environment popular for visualization and algorithm development.

Matplotlib

Matplotlib

This Python library is used to create a variety of graphics: from histograms to line graphs and heat maps. It also allows you to use LaTeX commands to add mathematical expressions to a graph.

Mlpack

Mlpack

This C++ library aims to make it quick and easy to run Machine Learning algorithms, integrating them into larger-scale solutions with just a few lines of code.

Paradoja de Moravec

Moravec´s Paradox

In the 1980s, Hans Moravec, Rodney Brooks and Marvin Minsky, researchers in the field of artificial intelligence and robotics, raised what is known as Moravec's paradox. The paradox reflects the apparent contradiction that activities involving a high level of reasoning, such as playing chess or taking an intelligence test, require very little computational power, while activities of low cognitive level, such as identifying a familiar face, require a huge amount of these resources. In the words of Moravec himself: "It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility." The falling price and exponential growth of available computing resources may mean that even those sensorimotor skills are eventually achieved by an AI. However, here another paradox comes into play, one prior to Moravec's but closely related to it: Polanyi's paradox.

N

Bayes (Clasificador Naive Bayes)

Naive Bayes classifier

A collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every feature being classified is independent of the value of any other feature.

Red neuronal

Neural network

A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units, or neurons, followed by nonlinearities. Neural networks are used in deep learning research to match images to features and much more. What makes neural networks special is their use of a hidden layer of weighted functions, called neurons, with which you can effectively build a network that maps many other functions. Without a hidden layer of functions, neural networks would be just a set of simple weighted functions.

Distribución normal

Normal distribution

Also, Gaussian distribution. A probability distribution which, when graphed, is a symmetrical bell curve with the mean value at the center. The standard deviation value affects the height and width of the graph. An important fact about this bell-shaped curve is that it can model many natural, social and psychological phenomena. These phenomena may be affected by random variables, but the summary statistics you calculate from your samples do in fact follow a normal distribution. The Gauss curve also makes the math easy.

NoSQL

NoSQL

Traditional database systems, known as RDBMSs, largely depend on rows, columns, schemas and tables to retrieve and organize the data stored in databases, using the structured query language SQL. These systems have problems working with Big Data, such as poor scalability, lack of flexibility and performance issues.

NoSQL non-relational databases are much more flexible. They allow you to work with unstructured data, such as chat, messaging and log data, user and session data, large objects such as videos and images, and Internet of Things and device data. They are also designed for high storage volumes, through distributed data storage, and for fast information processing; they are therefore very scalable. They are also independent of the programming language.

NoSQL databases are open source, so their cost is affordable; the counterpart is a lack of standardization and interoperability. Some NoSQL databases available on the market are Couchbase, Amazon DynamoDB, MongoDB and MarkLogic.

NumPy

NumPy

A portmanteau of Numerical + Python, NumPy is the main Python library for scientific computation. One of its most powerful traits is that it can work with n-dimensional arrays. It also offers basic linear algebra functions, the Fourier transform, advanced random-number capabilities and tools for integration with lower-level languages such as Fortran, C and C++.

O

Valores anómalos

Outlier

“Extreme values that might be errors in measurement and recording, or might be accurate reports of rare events.”

Sobreajuste

Overfitting

A model of training data that, by taking too many of the data's quirks and outliers into account, is overly complicated and will not be as useful as it could be to find patterns in test data.

P

Pandas

Pandas

A Python library for data manipulation popular with data scientists. See also Python.

Perceptrón

Perceptron

The perceptron is the simplest neural network: it approximates a single neuron with n binary inputs. It computes a weighted sum of its inputs and "fires" if that weighted sum is zero or greater.

Perl

Perl

Perl is a scripting language rooted in pre-Linux UNIX systems. It has always been popular for text processing, especially data cleanup and enhancement tasks.

Tabla pivotante o tabla dinámica

Pivot table

Pivot tables summarize long lists of data, without requiring you to write a single formula or copy a single cell. However, the most notable feature of pivot tables is that you can arrange them dynamically. The process of rearranging your table is known as pivoting your data: you're turning the same information around to examine it from different angles.
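
A small pandas sketch (with a made-up sales table) of pivoting a long list into a two-way summary:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "units":   [10, 7, 4, 12],
})

# Pivot the long list: regions become rows, products become columns.
print(sales.pivot_table(values="units", index="region",
                        columns="product", aggfunc="sum"))
```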

Paradoja de Polanyi

Polanyi Paradox

Michael Polanyi was an Anglo-Hungarian scholar and philosopher who, in 1966, proposed in his book "The Tacit Dimension" that human knowledge is based, to a large extent, on rules and skills that have been taught to us by culture, tradition, evolution, etc., and of which we are therefore not always fully aware. He defined what is called "tacit knowledge" and summed it up in this sentence: we know more than we can tell.

What Polanyi meant by this is that many of the tasks we perform rely on tacit, intuitive knowledge, and they are therefore very difficult to codify or automate. Why? Because we ourselves do not know how to explain how we do them. For example, have you ever tried to explain to a child how to jump rope? At what point do you have to jump in so as not to step on the rope or become entangled in it? It seems simple, right? Well, it is not. Now imagine explaining it to a robot.

Read more

Exactitud y Sensibilidad

Precision and Recall

Precision is a metric for classification models that answers the following question: out of all the positive predictions made by the model, how many were actually correct?

Recall answers the complementary question: out of all the actual positive cases, how many did the model correctly identify? It is also known as the "True Positive Rate".

Both precision and recall are therefore based on an understanding and measure of relevance. High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.
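
A minimal sketch of both metrics with scikit-learn (assumed available) and made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Precision: of the predicted positives, how many are truly positive?
# Recall: of the actual positives, how many did the model find?
print(precision_score(y_true, y_pred))  # 2/3
print(recall_score(y_true, y_pred))     # 2/3
```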

Analítica predictiva

Predictive analytics

It consists of the analysis of historical business data in order to predict future behaviors that help with better planning. To do this, predictive modeling techniques, among others, are used. These techniques are based on statistical algorithms and machine learning.

Modelado Predictivo

Predictive Modeling

It consists of the development of statistical and machine learning models that make it possible to predict future behaviors based on historical data.

Analítica prescriptiva

Prescriptive analytics

It consists of the analysis of historical business data in order not only to predict future behaviors, but also to recommend the decisions that should be made to take advantage of a future business opportunity or mitigate a possible risk (see Data Scientist). It goes one step beyond predictive analytics by showing the implication of each option in the result.

Análisis de componentes principales

Principal component analysis

It is a machine learning algorithm that aims to reduce the dimensionality of a set of observed variables to a set of linearly uncorrelated variables, called principal components. To do this, it finds the direction of greatest variance and defines it as the first principal component. It is used mainly in exploratory data analysis and to build predictive models.
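
A hedged sketch, assuming scikit-learn's implementation and a small made-up data set:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Keep only the direction of greatest variance (the first principal component).
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # fraction of the variance retained
```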

Read more

Distribución de probabilidad

Probability distribution

The probability distribution of a random variable is the set of all the possible values that the variable can take, together with their probabilities of occurrence. For discrete variables, the main probability distributions are the binomial, the Poisson and the hypergeometric (the latter for dependent events). For continuous variables, the most common distribution is the normal or Gaussian.

Perfilado

Profiling

Profiling is the process of using personal data to evaluate certain personal aspects in order to analyse and predict behaviour, performance, reliability, etc.

Seudonimización

Pseudonymization

The pseudonymization process is an alternative to data anonymization. While anonymization involves the complete elimination of all identifiable information, pseudonymization aims to remove the link between a data set and the identity of the individual. Examples of pseudonymization are encryption and tokenization.

Python

Python

It is a programming language created in 1991 that is widely used in data science. It is very easy for beginners to learn, but at the same time very powerful for advanced users, since it has specialized libraries for machine learning and graphics generation.

Read more

Módulo (Python)

Python Module

Modules are Python's way of storing definitions (instructions or variables) in a file so they can be used later in a script or in an interactive instance of the interpreter, without having to redefine them each time. The main advantage of separating a program into modules is, obviously, that we can reuse them in other programs or modules; all you have to do is import the modules you want to use in each situation. Python comes with a collection of standard modules that can be used as a basis for new programs or as examples for learning.
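
A minimal sketch of a module (the file name greetings.py is hypothetical):

```python
# greetings.py — a minimal module storing one reusable definition.

def hello(name):
    return "Hello, " + name + "!"

# In another script, or an interactive session, the definition is reused
# without redefining it:
#
#   import greetings
#   print(greetings.hello("Datapedia"))
```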

Librería estándar (Python)

Python Standard Library

A library is nothing more than a set of modules (see Python Module). The Python standard library is very extensive and offers a great variety of modules that perform functions of all kinds, including modules written in C that offer access to system features such as file I/O. On the Python website you can find "The Python Standard Library", a reference guide to all the modules in Python. Installers for Windows platforms usually include the complete standard library, including some additional components; however, Python installations using packages may require specific installers.

Read more

Q
R

R

R

An open-source programming language and environment for statistical computing and graph generation available for Linux, Windows, and Mac.

Bosque aleatorio

Random forest

An algorithm used for regression or classification tasks that is based on a combination of predictive trees. "To classify a new object from an input vector, each of the trees in the forest is fed with that vector. Each tree offers a classification as a result, and we say it 'votes' for that result. The forest chooses the classification that has the most votes among all the trees in the forest." The term "random forest" is a trademark registered by its authors.
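
A minimal sketch, assuming scikit-learn's implementation and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each of the 100 trees "votes" on the class of an input vector;
# the forest returns the majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```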

Regresión

Regression

It is a supervised learning method where the output variable is a real, continuous value, such as "height" or "weight". Regression consists of fitting a data set to a given model. Among regression algorithms we find linear regression, non-linear regression, least squares, Lasso, etc.

Aprendizaje por refuerzo

Reinforcement learning

Based on studies of how to encourage learning in humans and rats through rewards and punishments. The algorithm learns by observing the world around it: its input is the feedback it obtains from the outside world in response to its actions. Therefore, the system learns by trial and error.

Ruby

Ruby

It is a scripting language created in 1996. It is used among data scientists, although it is not as popular as Python, which offers more specialized libraries for the different tasks of Data Science.

S

SAS

SAS

A commercial statistical software suite that includes a programming language.  

Escalar

Scalar

A variable is scalar (as opposed to vectorial) when it has a magnitude but no direction in space, such as volume or temperature.

Scikit Learn

Scikit Learn

This Python library is built upon NumPy, SciPy and matplotlib. It contains a large number of efficient tools for Machine Learning and statistical modeling such as classification algorithms, regression, clustering and dimensionality reduction.

SciPy

SciPy

A portmanteau of Scientific + Python, SciPy is a Python library built upon the NumPy library for scientific computation. It is one of the most useful libraries due to its large variety of high-level science and engineering modules, such as the discrete Fourier transform, linear algebra, and optimization.

Scrapy

Scrapy

This Python library is used to crawl the web. It is a very useful framework for obtaining specific data patterns. Starting from the URL of a website's homepage, it can crawl the site's pages to collect information.

Lenguajes de programación de script

Scripting languages

Scripting languages can be executed directly, without needing to be compiled into binary code first, as is the case with languages such as Java and C. The syntax of scripting languages is much simpler than that of compiled languages, which makes programming and execution much easier. Some examples of this type of language are Python, Perl, Ruby, etc.

Seaborn

Seaborn

This Python library, built upon matplotlib, is used to create attractive statistical graphics in Python. Its objective is to give greater prominence to visualization in the tasks of exploring and interpreting data.

Sensibilidad y Especificidad

Sensitivity and Specificity

These are statistical metrics used to measure the performance of a binary classifier.

Sensitivity (also called the true positive rate, or the probability of detection in some fields) measures the proportion of positive cases correctly identified by the classifier; for example, the percentage of people who have a disease and are correctly detected. Its formula is:

Sensitivity = True Positives / (True Positives + False Negatives)

Specificity (also called the true negative rate) measures the proportion of negative cases correctly identified as such by the classifier; for example, the number of healthy people correctly identified as healthy by the algorithm.

Specificity = True Negatives / (True Negatives + False Positives)
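
A minimal Python sketch (with made-up labels) of computing both metrics from the prediction counts:

```python
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

print(tp / (tp + fn))  # sensitivity: 3/4
print(tn / (tn + fp))  # specificity: 3/4
```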

Shark

Shark

This C++ library offers linear and non-linear optimization methods, kernel methods, neural networks and other advanced machine learning techniques. It is compatible with most operating systems.

Consola

Shell

When the operating system is accessed from the command line, we are using the shell or console. In addition to scripting languages such as Perl and Python, it is common to use Linux-based tools such as grep, diff, split, comm, head and tail to perform data preparation and debugging tasks from the console.

Spark+MLlib

Spark+Mllib

This Java library works perfectly with the Spark APIs and interoperates with NumPy. MLlib runs on Spark, which makes its machine learning functionality scalable and easier to apply.

Serie espacio-temporal

Spatiotemporal data

Time series data that also includes geographic identifiers such as latitude-longitude pairs.

SQL

SQL

SQL (Structured Query Language) is a standard, interactive language used to communicate with relational databases, allowing different types of operations to be specified on them. SQL is based on relational algebra and calculus to query databases in a simple way. Queries are made through a command language that allows you to select, insert and update data, find out where data is located, and more.

Desviación estándar

Standard Deviation

It is the square root of the variance and is commonly used to indicate how far a given measurement deviates from the mean. For example, if an observation is more than three standard deviations from the mean, in most applications we can say that it is an anomalous case. Statistical software packages calculate the standard deviation automatically.
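
A quick NumPy sketch (with made-up values) showing both the population and sample versions:

```python
import numpy as np

data = np.array([10, 12, 23, 23, 16, 23, 21, 16])

print(data.std())        # population standard deviation (divides by n)
print(data.std(ddof=1))  # sample standard deviation (divides by n - 1)
```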

Statsmodels

Statsmodels

This Python module is used for statistical modeling. It allows users to explore data, make statistical estimates and carry out statistical tests. It offers an extensive list of descriptive statistics, tests and graphical functions for different types of data and estimates.

Estrato, muestreo estratificado

Strata, stratified sampling

It consists of dividing the population into homogeneous groups, or strata, and taking a random sample from each of them. Strata is also an O'Reilly conference on Big Data, Data Science and related technologies.

Aprendizaje Supervisado

Supervised learning

In supervised learning, the algorithms work with "tagged" (labeled) data, trying to find a function that, given the input variables, assigns them the appropriate output tag. The algorithm is trained with "historical" data and thus "learns" to assign the appropriate output label to a new value; that is, it predicts the output value. Supervised learning is often used in classification problems, such as identifying digits, making diagnoses, or detecting identity fraud.

Máquina de vectores de soporte

Support vector machine

A support vector machine is a supervised machine learning algorithm used for both classification and regression tasks. SVMs are based on the idea of finding the hyperplane that best divides the data set into two distinct classes. Intuitively, the farther from the hyperplane our values lie, the more certain we are that they are correctly classified. However, sometimes it is not easy to find a hyperplane that classifies the data well, and it is necessary to jump to a larger dimension (from the plane to 3 dimensions, or even n dimensions). SVMs are used for text classification, spam detection, sentiment analysis, etc. They are also used for image recognition.
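
A hedged sketch of a linear-kernel SVM with scikit-learn (assumed available) and made-up separable points:

```python
from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel looks for the hyperplane that best separates the classes.
clf = svm.SVC(kernel="linear").fit(X, y)
print(clf.predict([[1.5, 1.5], [8.5, 8.5]]))  # [0 1]
```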

SymPy

SymPy

This Python library is used for symbolic computation, covering arithmetic, calculus, algebra, discrete mathematics and quantum physics. It can also format the results of these calculations in LaTeX code.

T

Distribución t de Student

T-distribution

They are a variation of the normal distribution. They were discovered by William Gosset in 1908 and published under the pseudonym "Student". He needed a distribution he could use when the sample size was small and the variance was unknown and had to be estimated from the data. The t-distributions are used to account for the added uncertainty that results from this estimation.

Serie temporal

Time series data

A time series is a sequence of measurements spaced at time intervals that are not necessarily equal. Time series data thus consists of a measurement (for example, atmospheric pressure or a stock price) accompanied by a timestamp.

U

UIMA

UIMA

The "Unstructured Information Management Architecture" was developed by IBM as an environment for analyzing unstructured data, especially natural language. OASIS UIMA is a specification that standardizes this environment and Apache UIMA is an open source implementation of this. This environment allows you to work with different tools designed to connect with it.

Aprendizaje no supervisado

Unsupervised learning

Unsupervised learning occurs when no "labeled" data is available for training: we only know the input data, and there is no output data corresponding to each input. Therefore, we can only describe the structure of the data and try to find some kind of organization that simplifies analysis. Unsupervised methods therefore have an exploratory character.

V

Vector

Vector

The mathematical definition of a vector is "a value that has a magnitude and a direction, represented by an arrow whose length represents the magnitude and whose orientation in space represents the direction". However, data scientists use the term in this sense: "an ordered set of real numbers denoting a distance on a coordinate axis. These numbers can represent characteristics of a person, movie, product or whatever we want to model." This mathematical representation of variables allows working with software libraries that apply advanced mathematical operations to the data.

A vector space is a set of vectors, for example, a matrix.

W

Weka

Weka

Weka is a collection of machine learning algorithms for data analytics tasks. The algorithms can be applied directly to a data set or called from your own Java code. Weka offers tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also appropriate for developing new machine learning models. Weka is open-source software developed by the University of Waikato in New Zealand.

X
Y
Z

References

This glossary is based on self-prepared content, Wikipedia and other data science glossaries.
