1. Introduction To Elastic Stack
1.1. About the Elastic Tech Stack
The Elastic tech stack (formerly called the ELK stack) is a set of tools that allows the user to
retrieve, analyze, and visualize data in near real time. The tech stack primarily consists of
Elasticsearch, Logstash, Kibana, X-Pack and Beats.
Figure 1: The complete Elastic Tech Stack
● Elasticsearch: The database in the tech stack.
● Logstash:
The application used to collect large volumes of data and convert it into formats that
can be used in Elasticsearch.
● Kibana:
The analytics and dashboarding application.
● Beats:
Used to obtain small amounts of data of a single type. For example, a specific Beats
instance can be created to retrieve only log files from a given source. This data can
be sent either to Logstash or directly to Elasticsearch.
● X-Pack:
An extension that works with Kibana to monitor the various nodes in the system. It is essentially an
admin tool, which also has some built-in machine learning features to make
predictions such as which node may fail next.
1.2. About Elasticsearch
Elasticsearch is the database component of the “Elastic” or “ELK” tech stack. It is built upon
the Apache Lucene search engine and ranks high among the databases used as search engines
for documents. It is flexible in the sense that the system can store and retrieve data from several
sources in various formats (PPT, DOC, PDF, etc.), as long as the data can be converted to JSON.
Moreover, any number of nodes can be added to the cluster at any time, making it easy
to scale horizontally up or down. This degree of flexibility makes the Elasticsearch engine
“Elastic”.
1.3. Apache Lucene
Apache Lucene is a high-speed text search engine. It is written in Java and therefore works
across platforms. The idea is similar to how Google works at a high level: there are millions of
webpages in the backend; we ask Google for the list of pages that contain information related to
the words in our search query, and Google quickly retrieves a list of relevant webpages and displays
it to us. Similarly, Lucene works with text documents. It stores text files in the form of “inverted
indices” to keep track of the words in a given document. Essentially, every word is keyed against the
set of documents in which it occurs.
For example, let's say we have three documents, doc1, doc2 and doc3, with a line of text in each:
Doc1: “john doe explained the question”
Doc2: “jane listened to john”
Doc3: “the question was easy”
The inverted index data structure:
john : {doc1, doc2}
doe : {doc1}
explained : {doc1}
the : {doc1, doc3}
question : {doc1, doc3}
jane : {doc2}
listened : {doc2}
to : {doc2}
was : {doc3}
easy : {doc3}
Such a data structure provides easy access to the documents associated with terms in a query.
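The inverted index above can be sketched in a few lines of Python. This is only a toy illustration of the idea; Lucene's actual on-disk format is far more sophisticated.

```python
from collections import defaultdict

# Build a toy inverted index: each word maps to the set of documents
# containing it. This mirrors the doc1/doc2/doc3 example above.
docs = {
    "doc1": "john doe explained the question",
    "doc2": "jane listened to john",
    "doc3": "the question was easy",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

# Looking up the documents for a query term is now a dictionary access.
print(sorted(inverted_index["question"]))  # ['doc1', 'doc3']
```

A query for a term then reduces to a single lookup, rather than a scan of every document.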
Another important concept in Apache Lucene is “search relevance”. Every search engine
computes a value (based on some metric) indicating how relevant a given document is to a specific
search query. The logic behind computing this value varies case by case. Lucene
traditionally computes a “tf-idf” (term frequency-inverse document frequency) value for every
word-document pair, which is essentially a measure of how important the word is to the
document. Search relevance can be computed using many methods, and discussing them is
beyond the scope of this report.
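The classic tf-idf idea can be sketched as follows. This uses one common smoothed variant of idf; Lucene's real scoring formula is more elaborate (and newer versions default to BM25):

```python
import math

def tf_idf(term, doc, corpus):
    """Toy tf-idf: term frequency in `doc` times a smoothed inverse
    document frequency of `term` across `corpus` (a list of
    tokenized documents)."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + docs_with_term))
    return tf * idf

corpus = [text.split() for text in [
    "john doe explained the question",
    "jane listened to john",
    "the question was easy",
]]

# "doe" appears in only one document, so it scores higher for doc1
# than the common word "the", which appears in two documents.
print(tf_idf("doe", corpus[0], corpus) > tf_idf("the", corpus[0], corpus))  # True
```

The intuition: a word that occurs often in a document but rarely in the corpus is a strong signal of what that document is about.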
2. Exploring Elasticsearch
2.1. Components & Key Terminologies of Elasticsearch
i. Document:
This is the basic unit of information that can be stored in Elasticsearch. A “document”
refers to a JSON object with a set of key-value pairs. Anything that can be converted into a JSON
object can be stored in the database. For instance, a .doc file can be converted to a JSON object of
key-value pairs capturing all the metadata of the file. An Excel sheet can be converted to a set of
JSON objects, each containing a row from the table in the form of key-value pairs, where the
key is the column name and the value is the cell's content. Essentially, any type of information can
be converted to a JSON object in various ways. It is up to the programmer to decide how
documents are structured.
ii. Elasticsearch Index:
A collection of documents that have similar characteristics is known as an index. For
instance, all documents associated with a given person may belong to the same index. The
process of inserting a document into the database is generally known as “indexing a document”.
Index names are usually in lower case, and they are used while querying for, updating, or
deleting a document.
iii. Type and mapping:
A “type” refers to a set of similar documents. The “mapping” associated with a given type
refers to the generic structure shared by documents of that type. Type and mapping are loosely
analogous to table name and schema definition in the RDBMS world. For instance, a document
can be of type “user”, with a mapping defined by fields such as “name”, “age”, “gender”,
etc.
iv. Shard (Lucene index):
Let's say all news articles are sent to a given index. The number of news articles
every day might be enormous. We might not be able to store this data on the same machine, or
the number of documents might be so large that searching becomes very slow. Hence, an
index can be split into multiple “shards”, providing a horizontal scaling mechanism. For
example, if an index were divided into two shards, one shard may hold three-fifths of the data
and the other two-fifths. All shards work independently of each other.
v. Node:
A set of shards is stored within a node. A node is an individual running instance of the
database; i.e., if we want to know whether the database is “up and running”, we check the
status of every node in the system. Each node is assigned a random Universally Unique Identifier
(UUID) during startup, which is used to address that particular node.
vi. Replication:
Data on a given shard in node 1 can be replicated, and the replica can be stored on
node 2. This provides a data backup in case of a system failure. Moreover, it also allows
running parallel queries on the same set of data.
vii. Cluster:
A cluster is the set of all nodes in a system. In other words, all the data needed for
a given application is stored on some node in the cluster. Hence, a cluster can (and
usually will) span physically separate machines.
Let's say our application stores news articles from nine sources and movie reviews from eight sources.
The system may be set up in this way:

Shard 1: News articles from four sources
Shard 2: News articles from two sources
Shard 3: News articles from three sources
Shard 4: Movie reviews from three sources
Shard 5: Movie reviews from five sources

Replica 1: Replica of Shard 1
Replica 2: Replica of Shard 2
Replica 3: Replica of Shard 3
Replica 4: Replica of Shard 4
Replica 5: Replica of Shard 5

Node 1: Shard 1, Shard 2, Shard 3, Replica 4 and Replica 5
Node 2: Shard 4, Shard 5, Replica 1, Replica 2 and Replica 3

Cluster: Node 1 and Node 2
The cluster consists of all the data required in the system.
Figure 2: Shards and Replicas in a node
2.2. Types of Nodes in Elasticsearch
The Elasticsearch architecture includes the following types of nodes. One node can take
up multiple roles from the list below. However, in very large systems, it is advisable to
dedicate a node to a given functionality.
i. Master and master-eligible node:
The master node handles tasks such as adding nodes to the cluster, assigning shards to
nodes, checking the health of nodes, etc. A master-eligible node is one with a candidacy to be
elected master should the current master fail. By default, all the nodes in the system
are master eligible.
ii. Data node :
Data nodes are the most generic kinds of nodes in the system. They contain data within the
shards and are the primarily responsible for all the CRUD operations.
iii. Ingest node:
Ingest nodes may or maynot be extremely important based on the current system. They are
responsible for passing the documents through an “ingest pipeline” wherein a set of operations
are applied to the document. The operations are performed one after the other within this
pipeline. An example of the operation could be “delete a given field from the document”,
“modify a given field in the document” etc.
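An ingest pipeline can be pictured as a chain of processors applied in order. The sketch below is conceptual; the processor names and helpers are illustrative inventions, not the actual Elasticsearch processor API:

```python
def remove_field(field):
    """Processor that deletes a field, like an ingest `remove` step."""
    def processor(doc):
        doc.pop(field, None)
        return doc
    return processor

def rename_field(old, new):
    """Processor that renames a field, like an ingest `rename` step."""
    def processor(doc):
        if old in doc:
            doc[new] = doc.pop(old)
        return doc
    return processor

def run_pipeline(doc, processors):
    # Processors run one after the other, each receiving the
    # output of the previous one -- just like an ingest pipeline.
    for p in processors:
        doc = p(doc)
    return doc

doc = {"user": "jane", "debug": True, "msg": "logged in"}
pipeline = [remove_field("debug"), rename_field("msg", "message")]
result = run_pipeline(doc, pipeline)
print(result)  # {'user': 'jane', 'message': 'logged in'}
```

The key property is ordering: each processor sees the document as transformed by the processors before it.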
iv. Coordinating-only node:
As the name suggests, coordinating node is the node tasked with rerouting requests to the right
destinations, indexing of the incoming documents, searching and updating documents. The
coordinating node is most crucial in performing CRUD operations. Details of its involvement are
discussed in the future topics.
v. XPack node:
Also known as the Machine Learning node, the XPack is required in the system if we are to
leverage the machine learning characteristics of XPack. XPack is a provision in the Elastic tech
stack and one of its benefits it to perform some machine learning on the system itself, to learn
some off the features of the system. This requires every master eligible node to be listed as a
Machine Learning node.
2.3. CRUD operations in Elasticsearch
i. Create
Every new document indexing request is sent to the coordinating node. The coordinating node
applies a “murmur3” hash to the document id and uses the number of primary shards in the system
to find a suitable destination shard for the new document. It then forwards the request to the node
holding that shard.

Shard number = murmur3-hash(doc id) % (number of primary shards)
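The routing rule can be sketched as follows. Elasticsearch uses murmur3 internally; the stdlib CRC-32 below is only a stand-in so the example runs without extra dependencies:

```python
import zlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    # shard = hash(doc_id) % number_of_primary_shards
    # (Elasticsearch uses murmur3; CRC-32 stands in here.)
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# Every request for the same document id lands on the same shard.
print(route_to_shard("doc-101", 5))
```

Because the modulus is the number of primary shards, changing that number would reroute every document; this is why Elasticsearch fixes the primary shard count when an index is created.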
Once within the shard, the request is written to the translog (explained below) and the document
is added to the memory buffer. The memory buffer is refreshed at regular intervals (every 1 s by
default) and the data is written to disk. Meanwhile, the translog is persisted to disk every 5 s. Also, a
force flush is performed every 30 minutes, or when the translog size crosses a threshold. During a
force flush, the data covered by the translog is committed to disk, the existing translog is deleted, and a
new empty translog is created.
If the request is successful, the node sends copies of the request to the replicas of the current
shard. The figure below shows how the write request and data flow.
Figure 3: Create and Index Document
Translog:
The translog is a common concept in databases. It maintains a log of every write, delete,
and update request received by the shard. If the system were to crash while a commit
was happening, the request currently in progress and the updates made since the last
commit would be lost. If a translog is available, the system, upon restart, can replay the
operations logged since the most recent commit.
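The recovery idea is that of a write-ahead log: every operation is recorded before being applied, so the in-memory state can be rebuilt after a crash. A simplified model (not the actual on-disk format):

```python
class TinyTranslog:
    """Minimal write-ahead log: record each operation before applying
    it, so uncommitted operations can be replayed after a crash."""
    def __init__(self):
        self.entries = []

    def append(self, op, doc_id, doc=None):
        self.entries.append((op, doc_id, doc))

    def replay(self, store):
        # Re-apply every logged operation to rebuild state.
        for op, doc_id, doc in self.entries:
            if op == "index":
                store[doc_id] = doc
            elif op == "delete":
                store.pop(doc_id, None)
        return store

log = TinyTranslog()
log.append("index", "1", {"title": "hello"})
log.append("index", "2", {"title": "world"})
log.append("delete", "1")

recovered = log.replay({})  # simulate recovery after a crash
print(recovered)            # {'2': {'title': 'world'}}
```

After a successful commit, the log can be truncated, which is exactly what the force flush described above does.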
ii. Update and Delete
Documents in Elasticsearch are immutable and hence cannot be altered or deleted in place.
Therefore, update and delete are not a straightforward modification or removal of a document.
The disk consists of several segments, each of which is associated with a .del file. Whenever a
request is made to delete a document, the document is listed as deleted in the .del file.
When a query matches the document, it will initially be picked up but will then be filtered out
of the set of retrieved documents.
Updates work along very similar lines. Whenever an update request is fired, a copy of the document
is made with the new change, and the original document is marked as deleted in the .del file.
When a search request is made, the system retrieves both copies but filters out the one
marked as deleted, effectively returning only the latest version of the document.
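The mark-and-filter mechanism can be sketched over an append-only store. This is a conceptual model: the `segments` list and `deleted` set below stand in for Lucene's segments and .del files.

```python
# A sketch of updates over immutable storage: the old copy is only
# marked deleted (the `deleted` set stands in for the .del file) and
# a new copy is written; searches filter out marked copies.
segments = []    # (doc_id, version, body) tuples, append-only
deleted = set()  # stands in for the .del file

def index_doc(doc_id, version, body):
    segments.append((doc_id, version, body))

def update_doc(doc_id, new_body):
    latest = max(v for d, v, _ in segments if d == doc_id)
    deleted.add((doc_id, latest))            # mark old copy deleted
    index_doc(doc_id, latest + 1, new_body)  # write the new copy

def search(doc_id):
    # Both copies match, but deleted ones are filtered out.
    live = [(v, b) for d, v, b in segments
            if d == doc_id and (d, v) not in deleted]
    return max(live)[1] if live else None

index_doc("42", 1, {"title": "draft"})
update_doc("42", {"title": "final"})
print(search("42"))  # {'title': 'final'}
```

Note that both copies remain on disk until segments are merged; the filter, not physical removal, is what makes the update visible.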
iii. Read
A read request consists of two phases: a query phase and a fetch phase.
Figure 4: Update and Delete Operation
Query Phase:
The coordinating node creates a list of the shards that hold documents related to the query
and sends the request to every selected shard. Each shard performs a local search for the query.
A search relevance score (as discussed in section 1.3) is computed for each document, and a priority
queue is built from the (document id, relevance) pairs. The top n results are sent to the
coordinating node from each of these shards. The coordinating node merges all the results into a
larger priority queue and picks the global top n document ids from this queue.
Fetch Phase:
The coordinating node now has a set of eligible document ids. It sends requests to the shards
where each of these documents resides and retrieves the documents from them.
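The query-phase merge can be sketched with priority queues: each shard returns its local top n (score, doc id) pairs, and the coordinating node merges them into a global top n. A simplified model of the two phases:

```python
import heapq

def shard_top_n(local_scores, n):
    # Query phase: each shard builds a priority queue of
    # (score, doc_id) pairs and returns its local top n.
    return heapq.nlargest(n, local_scores)

def coordinate_search(shards, n):
    # The coordinating node merges every shard's top n and picks
    # the global top n document ids; the fetch phase would then
    # retrieve the documents themselves by id.
    merged = []
    for shard in shards:
        merged.extend(shard_top_n(shard, n))
    return [doc_id for score, doc_id in heapq.nlargest(n, merged)]

shards = [
    [(0.9, "a"), (0.4, "b"), (0.1, "c")],
    [(0.8, "d"), (0.7, "e")],
]
print(coordinate_search(shards, 2))  # ['a', 'd']
```

Only ids and scores travel in the query phase; the (larger) document bodies are fetched once, for the final winners only.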
2.4. Comparison between RDBMS, NoSQL & Elasticsearch
The major difference between an RDBMS and Elasticsearch is that an RDBMS is very strict
about predefining the schema, identifying the types of fields beforehand, and normalizing the tables.
Elasticsearch, on the other hand, is a schema-last data store: a document is processed first, and a
schema is derived from it. It offers dynamic typing, where the types of the fields are inferred
from the first document that is indexed.
In an RDBMS, the data pertaining to an entity is stored in a single table, and the relationship
between two tables is established by storing the primary key of one table as a foreign key in
another. Updates to the data follow ACID (Atomicity, Consistency, Isolation, Durability)
compliance. Although an RDBMS is very good at capturing the real-world relationships between
entities, it is subpar in query performance compared to Elasticsearch because joining tables is
expensive. Furthermore, performing joins between tables on different machines is not possible.
This restricts the database to vertical rather than horizontal scaling, which further
impacts the computational cost of querying as the data grows.
In Elasticsearch, the indexes are unrelated. Also, every index is a collection of independent
documents, which implies that the documents are unrelated as well. Therefore, changes to one
document do not affect another. This property makes Elasticsearch non-ACID-compliant when
dealing with transactions that involve multiple documents: rollback is not possible if part of
the transaction fails. Since there is no relationship between the indexes or the documents, data is
horizontally scalable, and indexing and searching are fast and lock-free.
Since establishing real-world relationships between entities makes it easier to query data,
Elasticsearch provides the following mechanisms to establish relationships between entities,
while still allowing independent and horizontally scalable data.
i. Application-side joins:
The following example illustrates a join established in Elasticsearch with independent data.
Consider documents for Employee and Department inserted as follows:
PUT /index1/employee/101 ______________________________ (1)
{
  "name": "Hari",
  "email": "Hari@andrew.cmu.com",
  "title": "manager",
  "department": 2 ______________________________ (3)
}

PUT /index1/department/2 ______________________________ (2)
{
  "name": "Sales and Marketing",
  "location": "California"
}
In (1) & (2): the index (index1), type (employee/department) and id (101/2) together act as a primary key.
In (3): Employee is related to Department using the department id.
Finding the employees in department 2 is a simple HTTP GET request:
GET /index1/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "department": 2 }
      }
    }
  }
}
Finding the employees in California departments is a two-step process:
First, we search for the departments located in California.
Second, we search for the employees who work in the departments returned by the first
query.
GET /index1/department/_search
{
  "query": {
    "match": {
      "location": "California"
    }
  }
}

GET /index1/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "terms": { "department": [2] } _________ (1)
      }
    }
  }
}
(1) The terms filter is populated with the department ids returned by the first query
Although this double querying is more expensive than a join in an RDBMS, the results can be
cached for quick retrieval.
ii. Data denormalization
To avoid the overhead of the double query in an application-side join, the data can be
denormalized: the department location can be stored in both the department document and
the employee document at indexing time, as shown below. Denormalization
increases the speed of execution by avoiding joins.
PUT /index1/employee/101
{
  "name": "Hari",
  "email": "Hari@andrew.cmu.com",
  "title": "manager",
  "department": {
    "id": 2,
    "location": "California" ______________ (1)
  }
}
(1) The location of the department has been denormalized and stored in the employee document as well
iii. Nested objects
Closely related entities can be stored in the same document since the CRUD operations are
atomic within a document. For example, orders and orderitems can be stored in the same
document.
PUT /index2/order/1
{
  "date": "10 Aug 2017",
  "total": 95.65,
  "orderitems": [
    {
      "description": "Coffee",
      "Price": 10.99,
      "Quantity": 2
    },
    {
      "description": "Shampoo",
      "Price": 5.99,
      "Quantity": 1
    }
  ]
}
By nesting the related order items within the order document, it is easy to associate all the order
items of a particular order, and it saves the overhead of joining documents. These inner orderitem
objects are treated as separate hidden documents, and the hidden structure of the order is as follows:
{
  "date": "10 Aug 2017", _______________________ (1)
  "total": 95.65
}
{ __________ (2)
  "orderitems.description": "Coffee",
  "orderitems.Price": 10.99,
  "orderitems.Quantity": 2
}
{ __________ (3)
  "orderitems.description": "Shampoo",
  "orderitems.Price": 5.99,
  "orderitems.Quantity": 1
}
(1) Is the root/parent document (2) & (3) are first & second nested objects
The advantage of indexing these documents in a nested way is that the relationship between
fields of an object can be maintained. That is, it is easy to associate the price of coffee with
10.99.
The caveat in using nested objects is that to add or update a nested object, the whole
document must be reindexed. Also, a search will return the complete document, not just the
nested objects. Nested objects are index-time joins.
iv. Parent/child relationships
Parent/child relationships are similar to nested objects. The difference is that, with nested objects,
related entities live in the same document, whereas with parent/child they are stored in separate
documents. The relationship is one-to-many. The advantage of using this over nested objects is
that parents and children can be updated without affecting each other. Also, a search query can
retrieve the child documents alone instead of the whole document, as in the
case of nested objects. The parent type of the child must be specified in the mapping.
Parent/child relationships are query-time joins.
Characteristic    | Elasticsearch                          | RDBMS (e.g., SQL)                        | NoSQL (e.g., MongoDB)
Database model    | Document search engine                 | Entity-relationship database             | Document store
Schema            | Schema-last                            | Strictly schema-based                    | Schema-last
Server-side query | Possible                               | Uses PL/SQL                              | JavaScript only
Partitioning      | Horizontal partitioning using sharding | Horizontal partitioning                  | Horizontal partitioning using sharding
Replication       | Replication of shards                  | Master-master & master-slave replication | Master-slave replication
Foreign key       | Not possible                           | Possible                                 | Not possible
ACID compliance   | Not compliant when a transaction involves multiple documents | Compliant          | Not compliant
Table 1: Differences between Elasticsearch, RDBMS, NoSQL
2.5. Characteristics of ElasticSearch
● Performance:
There is a latency of about one second between the time a new document is indexed
and the time it becomes available for search. Hence, querying can be performed
in near real time. Searches run very fast due to the high level of indexing and the
provision for parallelism. Caches are provided on certain indexes for quicker lookups.
(For instance, if all the querying is performed for tweets regarding a natural calamity,
these results are cached and can hence be accessed faster.) Overall, searches
and inserts can be very fast.
● Distributed:
Elasticsearch has been designed to work in a distributed environment, especially
since clusters can be scaled horizontally. Therefore, Elasticsearch has built-in
mechanisms to address the complexities that come with a distributed computing setting,
such as failure handling and high-speed distributed communication.
● Interoperability:
Elasticsearch is built in Java and is therefore compatible with several platforms.
Furthermore, the RESTful JSON API, which works over the HTTP protocol, makes it more
interoperable. Data can be pulled from various sources, stored, and indexed;
it can then be sent to other applications that query and use the
data for dashboarding, analytics, or simple retrieval purposes.
● Schema-last:
As with other NoSQL databases, development on Elasticsearch can follow a
schema-last approach. Elasticsearch has a concept of “mapping”, which is similar to a
schema. A predefined mapping can be specified so that Elasticsearch can parse the
incoming data. If no such mapping is defined, Elasticsearch tries to infer a
mapping from the data. In either case, the schema is generally decided by the developer
working with the application rather than during the setup of the database.
2.6. Popularity of ElasticSearch
Elasticsearch has seen an exponential increase in adoption over the past few years. The graph below
shows the increase in use of Elasticsearch from 2013 to October 2017.
Figure 5: Elasticsearch Usage Trend
The following reasons contribute to the popularity of Elasticsearch:
Adjustable Data Model:
Elasticsearch provides a data model that is adjustable and easy to change to meet
dynamic requirements. Traditional models require all requirements to be identified before
developing a data model, and the complete process has to be redone from scratch if the
requirements or objectives of a project change in the future. Elasticsearch averts this problem by
providing mechanisms to build a model that is flexible to a large extent. If the schema of the data
has to be completely changed, the documents simply have to be reindexed. There is no overhead
of redesigning an entity-relationship structure, normalizing tables, etc., as in traditional
databases.
Analytics:
The Kibana software offered as part of the Elastic tech stack enables the visualization
and aggregation of data. Since aggregation over big data is expensive, the indexing features of
Elasticsearch help in quick lookup and fetching of data.
Fuzzy Search, Autocompletion:
Fuzzy searching allows Elasticsearch to tolerate spelling errors in search requests.
Autocompletion likewise allows search words to be suggested based on the user's initial input. This
makes Elasticsearch an ideal document database for many search engines and websites. For
instance, when a request misspells the word “security” as “securit”, Elasticsearch can still
match the intended word.
Moreover, when a user types into the search bar of a home-built UI, if we send the
partially typed word to the backend as it is typed, Elasticsearch responds with a set of suggestions
for the completed word. The search bar can be built to listen for these responses and provide
suggestions in a drop-down (as in the Google search bar).
Autocompletion requires less searching than searching the content itself, since the user
enters only a partial word. Therefore, the data pertaining to autocompletion
can be stored in memory, which also makes it faster. Autocomplete data can also be scaled and
stored on separate clusters.
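Both ideas can be illustrated with stdlib tools. Elasticsearch uses edit-distance (Levenshtein) matching and dedicated suggesters; `difflib`'s similarity ratio and a plain prefix scan below are only stand-ins to show the behavior:

```python
import difflib

vocabulary = ["security", "secure", "select", "server"]

# Fuzzy search: tolerate spelling errors by matching against close
# strings ("securit" still finds "security").
print(difflib.get_close_matches("securit", vocabulary, n=1))  # ['security']

# Autocompletion: suggest completions for a partial word by prefix
# match -- small enough to keep in memory, as noted above.
def autocomplete(prefix, words):
    return [w for w in words if w.startswith(prefix)]

print(autocomplete("sec", vocabulary))  # ['security', 'secure']
```

In production, prefix structures such as tries or the completion suggester's in-memory FSTs replace the linear scan, but the contract is the same: partial input in, ranked completions out.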
Serving multiple customers at once:
Elasticsearch supports “multi-tenancy”; that is, a single instance of Elasticsearch can
serve multiple tenants. For instance, on a website like Amazon, a single instance of Elasticsearch
may serve 100 customers or more.
Although this is a good feature, it does come with certain downsides that have to be
handled with efficient indexing techniques. For example, each user may have separate
documents and should not be allowed to access documents that belong to other users. In such a
scenario, the indexing strategy might call for a separate index per user. But in some
scenarios, this could lead to many tiny indices, which causes computational overhead.
Therefore, appropriate indexing strategies should be applied after thoroughly analyzing the pros
and cons.
3. Logstash
3.1. About Logstash
Logstash is an open-source data collection and pipelining tool that can collect data from many
different kinds of sources, including:
● Web logs: Apache web logs, event logs, firewall logs, networking logs, etc.
● HTTP requests: Data from Twitter, GitHub, and other web applications
● Database stores: SQL & NoSQL databases via the JDBC interface
● Data streams: Messaging queues such as Kafka, JMS, etc.
● Sensors & IoT: Sensors such as temperature and pressure sensors, IoT products and
associated devices
Since the structure of the data differs across these sources, the pipelining process allows the data
from them to be cleaned and unified before being loaded into the Elasticsearch database. Logstash
also has built-in pattern matching and geographical data mapping abilities to improve data
aggregation and feature extraction.
Figure 6: Data Collection Sources
The “grok” filter is the main component of Logstash that parses unstructured data into
structured data and makes it queryable.
3.2. Working of Logstash
Figure 7: Stages in Logstash
As shown in the figure above, Logstash has the following three stages for collecting data:
i. Input: Input plugins are used to read data into Logstash. Some of the commonly used
plugins are file, Redis, syslog, Beats, etc. The data records generated by input plugins
are called “events”.
Example HTTP Request Log:
55.3.244.1 GET /index.html 15824 0.043
ii. Filter: Filter plugins are used to perform intermediary transformations, data parsing and
cleaning of data. Grok, mutate, drop, geoip, clone etc, are some of the commonly used
filter plugins.
An example grok filter to parse this log line into JSON:

filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
The parsed output would be:
{
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}
iii. Output: An event can be passed to multiple output plugins. Elasticsearch, Graphite, file,
etc., are examples of output plugins.
The parsed output of the Logstash pipeline in the above example may be sent to the
Elasticsearch database for indexing and storage.
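A grok pattern is essentially a library of named regular expressions. The filter shown in section 3.2 can be approximated in Python with named groups; the group names below mirror the grok capture names (client, method, request, bytes, duration), while the regexes themselves are simplified stand-ins for grok's IP, WORD, URIPATHPARAM, and NUMBER patterns:

```python
import re

# Approximates %{IP:client} %{WORD:method} %{URIPATHPARAM:request}
# %{NUMBER:bytes} %{NUMBER:duration} with named regex groups.
LOG_PATTERN = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>\w+) "
    r"(?P<request>\S+) "
    r"(?P<bytes>\d+) "
    r"(?P<duration>\d+\.\d+)"
)

line = "55.3.244.1 GET /index.html 15824 0.043"
event = LOG_PATTERN.match(line).groupdict()
print(event["client"], event["method"], event["duration"])
```

As with grok, all captures come out as strings; converting `bytes` and `duration` to numbers would be a separate step (grok does this with suffixes such as `%{NUMBER:bytes:int}`).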