
1. Introduction To Elastic Stack

1.1. About the Elastic Tech Stack

The Elastic tech stack (formerly called the ELK stack) is a set of tools that allows the user to retrieve, analyze and visualize data in near real time. The tech stack primarily consists of Elasticsearch, Logstash, Kibana, X-Pack and Beats.

Figure 1: The complete Elastic Tech Stack

● Elasticsearch: The database at the core of the tech stack.

● Logstash: The application used to collect large volumes of data and convert them into formats that can be consumed by Elasticsearch.

● Kibana: The analytics and dashboarding application.

● Beats: Lightweight shippers used to collect small amounts of data of a single type. For example, a specific Beats instance can be configured to retrieve only log files from a given source. This data can be sent either to Logstash or directly to Elasticsearch.


● X-Pack: An extension that works with Kibana to monitor the various nodes in the system. It is essentially an administration tool, and it also includes machine learning features that can make predictions such as which node may fail next.

1.2. About Elasticsearch

Elasticsearch is the database component of the "Elastic" or "ELK" tech stack. It is built upon the Apache Lucene search engine and ranks high among the databases used as document search engines. It is flexible in the sense that the system can store and retrieve data from several sources in various formats (ppt, doc, pdf etc.), as long as the data can be converted to JSON. Moreover, any number of nodes can be added to a cluster at any time, making it easy to scale horizontally up or down. This flexibility is what makes the Elasticsearch engine "Elastic".

1.3. Apache Lucene

Apache Lucene is a high-speed text search engine. It is written in Java and therefore works across platforms. The idea is similar, at a high level, to how Google works: there are millions of webpages in the backend; we ask Google for the pages that contain information related to the words in our search query, and Google quickly retrieves a list of relevant webpages and displays it to us. Lucene works similarly with text documents. It stores text in the form of an "inverted index" to keep track of the words in each document. Essentially, every word is keyed against the set of documents in which it occurs.

For example, let's say we have three documents, doc1, doc2 and doc3, each with a line of text:

Doc1: "john doe explained the question"
Doc2: "jane listened to john"
Doc3: "the question was easy"

The inverted index data structure:

john      : {doc1, doc2}
doe       : {doc1}
explained : {doc1}
the       : {doc1, doc3}
question  : {doc1, doc3}
jane      : {doc2}
listened  : {doc2}
to        : {doc2}
was       : {doc3}
easy      : {doc3}


Such a data structure provides easy access to the documents associated with terms in a query.

Another important concept in Apache Lucene is "search relevance". Every search engine computes a value (based on some metric) indicating how relevant a given document is to a specific search query. The logic behind computing this value varies on a case by case basis. Lucene generally computes a "tf-idf" (term frequency - inverse document frequency) value for every word-document pair, which is essentially a measure of how important the word is to the document. Search relevance can be computed using many other methods as well, but discussing those is beyond the scope of this report.
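In its simplest form, the tf-idf score of a word w in a document d, given N documents in total, is:

tf-idf(w, d) = tf(w, d) * log(N / df(w))

where tf(w, d) is the number of times w occurs in d and df(w) is the number of documents containing w. In the example above, "the" occurs in two of the three documents, so its idf factor log(3/2) is small, while a rarer word like "doe" scores higher with log(3/1).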


2. Exploring Elasticsearch

2.1. Components & Key Terminologies of Elasticsearch

i. Document:

A document is the basic unit of information that can be stored in Elasticsearch. A "document" refers to a JSON object with a set of key-value pairs. Anything that can be converted into a JSON object can be stored in the database. For instance, a .doc file can be converted to a JSON object of key-value pairs capturing all the metadata of the file. An Excel sheet can be converted to a set of JSON objects, each containing one row of the table in the form of key-value pairs, where the key is the column name and the value is the cell's content. Essentially, any type of information can be converted to a JSON object in various ways; it is up to the programmer to decide how to design a document.
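For instance, one row of a hypothetical employee spreadsheet might become the following document (all field names here are illustrative):

{
    "name": "Jane Doe",
    "age": 31,
    "gender": "female",
    "title": "analyst"
}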

ii. Elasticsearch Index

A collection of documents that have similar characteristics is known as an index. For instance, all documents associated with a given person may belong to the same index. The process of inserting a document into the database is generally known as "indexing a document". Index names are usually lower case, and they are used when querying for, updating or deleting a document.
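For instance, indexing the sample document above into an employees index is a single PUT request; the index name, type and id are illustrative (the type concept is described next):

PUT /employees/employee/1
{
    "name": "Jane Doe",
    "age": 31,
    "gender": "female",
    "title": "analyst"
}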

iii. Type and mapping:

A "type" refers to a set of similar documents within an index. The "mapping" associated with a given type defines the generic structure of documents of that type. Type and mapping are loosely analogous to a table name and schema definition in the RDBMS world. For instance, a document can be of type "user", with a mapping defined by fields such as "name", "age", "gender" etc.
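As a sketch, the mapping for such a "user" type could be declared when its index is created. The field types below are assumptions, written against the pre-7.x API (with types) that this report describes:

PUT /index1
{
    "mappings": {
        "user": {
            "properties": {
                "name":   { "type": "text" },
                "age":    { "type": "integer" },
                "gender": { "type": "keyword" }
            }
        }
    }
}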

iv. Shard (Lucene index):

Let's say we have all news articles being sent to a given index. The number of news articles every day might be humongous; we might not be able to store this data on a single machine, or the number of documents might grow so large that searching becomes very slow. Hence, an index can be split into multiple "shards", providing a horizontal scaling mechanism. For example, if an index were divided into two shards, one shard may hold three fifths of the data and the other two fifths. All shards work independently of each other.
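The number of shards for an index (and the number of replicas, discussed below) is typically fixed when the index is created. An illustrative request, with assumed values:

PUT /news
{
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    }
}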


v. Node:

A set of shards is stored within a node. A node is an individual running instance of the database; if we want to know whether the database is "up and running", we check the status of every node in the system. Each node is assigned a random Universally Unique Identifier (UUID) during startup, which is used to address that particular node.
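The nodes in a running cluster, along with their identifiers and roles, can be listed with the cat API:

GET /_cat/nodes?v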

vi. Replication:

Data on a given shard in node 1 can be replicated, and the replica can be stored on node 2. This provides a data backup in case of a system failure. Moreover, it also allows some queries to run in parallel on the same set of data.

vii. Cluster:

A cluster is the set of all the nodes in a system. In other words, all the data needed for a given application is stored on some node or the other in the cluster. Hence, a cluster can (and mostly will) encompass physically separate machines.

Let's say our application stores news articles from nine sources and movie reviews from eight sources. The system may be set up in this way:

Shard 1: News articles from four sources
Shard 2: News articles from two sources
Shard 3: News articles from three sources
Shard 4: Movie reviews from three sources
Shard 5: Movie reviews from five sources

Replica 1: Replica of Shard 1
Replica 2: Replica of Shard 2
Replica 3: Replica of Shard 3
Replica 4: Replica of Shard 4
Replica 5: Replica of Shard 5

Node 1: Shard 1, Shard 2, Shard 3, Replica 4 and Replica 5
Node 2: Shard 4, Shard 5, Replica 1, Replica 2 and Replica 3

Cluster: Node 1 and Node 2. The cluster consists of all the data required in the system, and each node holds the replicas of the other node's shards.

Figure 2: Shards and Replicas in a node
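Whether such a cluster is healthy (all shards and replicas assigned) can be checked with the cluster health API, which reports a green, yellow or red status along with node and shard counts:

GET /_cluster/health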


2.2. Types of Nodes in Elasticsearch

The Elasticsearch architecture requires the below set of node roles in the system. One node can take up multiple roles from the list below; however, in very large systems it is advisable to dedicate a node to a given functionality. (A sketch of how roles are configured follows at the end of this list.)

i. Master and master-eligible node:

The master node handles cluster-wide tasks such as tracking which nodes are part of the cluster, assigning shards to nodes, checking the health of nodes, etc. A master-eligible node is one with a candidacy to be elected master if the current master were to fail. By default, all the nodes in the system are master-eligible.

ii. Data node:

Data nodes are the most common kind of node in the system. They hold the data within their shards and are primarily responsible for all the CRUD operations.

iii. Ingest node:

Ingest nodes may or may not be important, depending on the system at hand. They are responsible for passing documents through an "ingest pipeline", wherein a set of operations is applied to each document, one after the other. Examples of such operations are "delete a given field from the document", "modify a given field in the document", etc.
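As a sketch, a pipeline applying two such operations could be registered as follows; the pipeline name and field names are assumptions:

PUT /_ingest/pipeline/cleanup
{
    "processors": [
        { "remove": { "field": "internal_notes" } },
        { "set":    { "field": "source", "value": "webcrawler" } }
    ]
}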

iv. Coordinating-only node:

As the name suggests, a coordinating node is tasked with routing requests to the right destinations and coordinating the indexing of incoming documents as well as search and update requests. The coordinating node is crucial in performing CRUD operations; details of its involvement are discussed in the topics below.

v. X-Pack node:

Also known as the machine learning node, an X-Pack node is required in the system if we are to leverage the machine learning features of X-Pack. X-Pack is an extension of the Elastic tech stack, and one of its benefits is the ability to run machine learning on the system itself to learn some of the features of the system. This requires every master-eligible node to also be enabled as a machine learning node.
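Roles are assigned per node in its elasticsearch.yml configuration file. A minimal sketch of a dedicated data node, using the settings of the 5.x/6.x era this report describes:

node.master: false
node.data: true
node.ingest: false

A node with all three settings set to false acts as a coordinating-only node.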


2.3. CRUD operations in Elasticsearch

i. Create

Every new document indexing request is sent to the coordinating node. The coordinating node applies the "murmur3" hash function to the document id and uses the number of primary shards in the system to find a suitable destination shard for the new document. It then forwards the request to the node holding that shard:

shard number = murmur3_hash(doc id) % (number of primary shards)

Once within the shard, the request is written to the translog (explained below) and the document is added to the in-memory buffer. The memory buffer is refreshed at regular intervals (of 1s by default) and its data is written to disk. Meanwhile, the translog is persisted to disk every 5s. In addition, a force flush is performed every 30 minutes, or when the translog size crosses a threshold. During a force flush, the data covered by the translog is committed to disk, the existing translog is deleted and a new empty translog is created.

If the request is successful, the node sends copies of the request to the replicas of the current shard. The figure below shows how the write request and data flow.

Figure 3: Create and Index Document
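The shard that a given document id routes to can be inspected with the search-shards API; for example, using the document id 101 from the examples in section 2.4:

GET /index1/_search_shards?routing=101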

Translog:

The translog is a common concept in databases. It maintains a log of every write, delete and update request received by the shard. If the system were to crash while a commit was in progress, the request currently in progress and the updates made since the last commit would be lost. If a translog is available, the system can, upon restart, replay the operations recorded since the most recently completed commit.


ii. Update and Delete

Documents in Elasticsearch are stored in immutable segments and hence cannot be altered or removed in place. Therefore, update and delete are not straightforward modifications or removals of a file.

The disk consists of several segments, each of which is associated with a .del file. Whenever a request is made to delete a document, the document is listed as deleted in the .del file. When a query that matches the document is fired, it will initially be picked up but will then be filtered out of the set of retrieved documents.

Updates work along very similar lines. Whenever an update request is fired, a copy of the document is made with the new change, and the original document is marked as deleted in the .del file. When a search request is made, the system retrieves both versions but filters out the one marked as deleted, effectively returning only the latest version of the document.
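From the client's perspective this bookkeeping is hidden: an update and a delete are simple requests. Illustrative calls against the employee document used in section 2.4:

POST /index1/employee/101/_update
{
    "doc": { "title": "senior manager" }
}

DELETE /index1/employee/101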

Figure 4: Update and Delete Operation

iii. Read

A read request consists of two phases - a query phase and a fetch phase.

Query Phase:

The coordinating node creates a list of the shards that hold documents related to the query and sends the request to every selected shard. Each shard performs a local search for the query. A search relevance score (as discussed in section 1.3) is computed for each document, and a priority queue is built from the document id and relevance pairs. The top n results are sent from each of these shards to the coordinating node. The coordinating node combines all the results into a larger priority queue and picks the set of top n document ids from this queue.

Fetch Phase:

The coordinating node now has a set of eligible document ids. It contacts the shards where each of these documents resides and retrieves the documents from them.
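Both phases are triggered by a single search request. A minimal example, where size corresponds to the top n above (index, type and field are the illustrative ones from section 2.4):

GET /index1/employee/_search
{
    "size": 10,
    "query": {
        "match": { "name": "Hari" }
    }
}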

2.4. Comparison between RDBMS, NoSQL & Elasticsearch

The major difference between an RDBMS and Elasticsearch is that an RDBMS is very strict about predefining the schema, identifying the types of fields beforehand, and normalizing the tables. Elasticsearch, on the other hand, is a schema-last data store: a document is processed first, and a schema is derived from it. It offers dynamic typing, where the type of each field is inferred from the first document that is indexed.

In an RDBMS, the data pertaining to an entity is stored in a single table, and the relationship between two tables is established by storing the primary key of one table as a foreign key in another. Updates to the data follow ACID (Atomicity, Consistency, Isolation, Durability) compliance. Although an RDBMS is very good at capturing real-world relationships between entities, it is subpar in query time compared to Elasticsearch because joining tables is expensive. Furthermore, performing joins between tables on different machines is not possible. This restricts the database to being vertically scalable rather than horizontally scalable, which further impacts the computational complexity of querying as the data grows.

In Elasticsearch, the indexes are unrelated to each other. Also, every index is a collection of independent documents, which implies that the documents are unrelated as well; changes to one document do not affect another. This property makes Elasticsearch non-ACID-compliant when dealing with transactions that involve multiple documents: rollback is not possible if part of the transaction fails. Since there is no relationship between the indexes or the documents, the data is horizontally scalable, and indexing and searching are fast and lock-free.

Since establishing real-world relationships between entities makes it easier to query data, Elasticsearch provides the following mechanisms for relating entities while still keeping the data independent and horizontally scalable.


i. Application-side joins:

The following example illustrates a join established in Elasticsearch over independent data. Consider documents for an employee and a department inserted as follows:

PUT /index1/employee/101 ______________________________ (1)
{
    "name": "Hari",
    "email": "[email protected]",
    "title": "manager",
    "department": 2 ______________________________ (3)
}

PUT /index1/department/2 ______________________________ (2)
{
    "name": "Sales and Marketing",
    "location": "California"
}

In (1) & (2): the index (index1), type (employee or department) and id (101 or 2) together act as a primary key.
In (3): the employee is related to the department through the department id.

Finding the employees in department 2 is a simple HTTP GET request:

GET /index1/employee/_search
{
    "query": {
        "bool": {
            "filter": {
                "term": { "department": 2 }
            }
        }
    }
}

Finding the employees in California departments is a two-step process. First, we search for the departments located in California. Second, we search for the employees that work in the departments returned by the first query.


GET /index1/department/_search
{
    "query": {
        "match": { "location": "California" }
    }
}

GET /index1/employee/_search
{
    "query": {
        "bool": {
            "filter": {
                "terms": { "department": [2] } _________ (1)
            }
        }
    }
}

(1) The terms filter is populated with the department ids returned by the first query; here, department 2 is the one located in California.

Although this double querying is as expensive as a join in an RDBMS, the results can be cached for quick retrieval.

ii. Data denormalization

To avoid the overhead of the double query in an application-side join, the data can be denormalized: the department location can be stored in both the department document and the employee document at indexing time, as shown below. Denormalization increases the speed of execution by avoiding joins altogether.

PUT /index1/employee/101
{
    "name": "Hari",
    "email": "[email protected]",
    "title": "manager",
    "department": {
        "id": 2,
        "location": "California" ______________ (1)
    }
}

(1) The location of the department has been denormalized and stored in the employee document as well.


iii. Nested objects

Closely related entities can be stored in the same document, since CRUD operations are atomic within a document. For example, an order and its order items can be stored in the same document.

PUT /index2/order/1
{
    "date": "10 Aug 2017",
    "total": 95.65,
    "orderitems": [
        {
            "description": "Coffee",
            "price": 10.99,
            "quantity": 2
        },
        {
            "description": "Shampoo",
            "price": 5.99,
            "quantity": 1
        }
    ]
}

By nesting the related order items within the order document, it is easy to associate all the order items of a particular order, and it saves the overhead of joining documents. These inner orderitem objects are treated as separate hidden documents, and the hidden structure of the order is roughly as follows:

{ _______________________ (1)
    "date": "10 Aug 2017",
    "total": 95.65
}
{ __________ (2)
    "orderitems.description": "Coffee",
    "orderitems.price": 10.99,
    "orderitems.quantity": 2
}
{ __________ (3)
    "orderitems.description": "Shampoo",
    "orderitems.price": 5.99,
    "orderitems.quantity": 1
}

(1) is the root/parent document; (2) & (3) are the first and second nested objects.


The advantage of indexing documents in this nested way is that the relationship between the fields of an object is maintained: it is easy to associate the price of the coffee with 10.99.

The caveat of using nested objects is that, to add or update a nested object, the whole document has to be reindexed. Also, a search will return the complete document, not just the matching nested objects. Nested objects are index-time joins.
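Using nested objects requires declaring the field as nested in the mapping, otherwise the order items would be flattened. A minimal sketch for the order type above:

PUT /index2
{
    "mappings": {
        "order": {
            "properties": {
                "orderitems": { "type": "nested" }
            }
        }
    }
}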

iv. Parent/child relationships

Parent/child relationships are similar to nested objects. The difference is that, with nested objects, the related entities live in the same document, whereas in a parent/child relationship they are stored in separate documents. The relationship is one-to-many. The advantages over nested objects are that parents and children can be updated without affecting each other, and that a search query can retrieve the child documents alone instead of the whole document as in the case of nested objects. It is required to specify the parent type of the child in the mapping. Parent/child relationships are query-time joins.
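In the pre-6.x Elasticsearch versions this report describes, the parent type is declared through the _parent field in the child type's mapping. A sketch reusing the employee/department example (the company index name is an assumption):

PUT /company
{
    "mappings": {
        "department": {},
        "employee": {
            "_parent": { "type": "department" }
        }
    }
}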

Characteristic    | Elasticsearch                         | RDBMS (e.g. SQL)             | NoSQL (e.g. MongoDB)
------------------|---------------------------------------|------------------------------|--------------------------------------
Database model    | Document search engine                | Entity-relationship database | Document store
Schema            | Schema-last                           | Strictly schema-based        | Schema-last
Server-side query | Possible                              | Uses PL/SQL                  | JavaScript only
Partitioning      | Horizontal partitioning using sharding| Horizontal partitioning      | Horizontal partitioning using sharding
Replication       | Replication of shards                 | Master-master & master-slave | Master-slave replication
Foreign key       | Not possible                          | Possible                     | Not possible
ACID compliance   | Not compliant when a transaction      | Compliant                    | Not compliant
                  | involves multiple documents           |                              |

Table 1: Differences between Elasticsearch, RDBMS and NoSQL


2.5. Characteristics of Elasticsearch

● Performance:

There is a latency of about one second from the time a new document is indexed until it is available for search; hence, querying can be performed in near real time. Searches run very fast due to the high level of indexing and the provision for parallelism, and caches are maintained for certain queries to allow quicker lookups. (For instance, if all the querying is performed for tweets regarding a natural calamity, those results are cached and the data can be accessed faster.) Overall, both searches and inserts can be very fast. A sketch of forcing a refresh, for cases where the one-second latency matters, follows at the end of this section.

● Distributed:

Elasticsearch has been designed to work in a distributed environment, especially since clusters can be scaled horizontally. Therefore, Elasticsearch has built-in mechanisms to address the complexities that arise in a distributed computing setting, such as failure handling and high-speed distributed communication.

● Interoperability:

Elasticsearch is built on Java, so it is compatible with several platforms. Furthermore, its RESTful JSON API, which works over the HTTP protocol, makes it highly interoperable: data can be pulled from various sources, stored and indexed, and then served to other applications that query and use the data for dashboarding, analytics or simple retrieval.

● Schema last:

As with other NoSQL databases, development on Elasticsearch can follow a schema-last approach. Elasticsearch has the concept of a "mapping", which is similar to a schema. A predefined mapping can be specified so that Elasticsearch can parse the incoming data; if no mapping is defined, Elasticsearch tries to derive its own mapping from the data. In either case, the schema is generally decided by the developer working with the application rather than during the setup of the database.
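As referenced under Performance above, when a freshly indexed document must be searchable immediately rather than after the one-second refresh interval, a refresh can be forced explicitly, at the cost of extra I/O:

POST /index1/_refresh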


2.6. Popularity of Elasticsearch

Elasticsearch has seen an exponential increase in adoption in the past few years. The graph below shows the increase in the use of Elasticsearch from 2013 to October 2017.

Figure 5: Elasticsearch Usage Trend

The following reasons contribute to the popularity of Elasticsearch:

Adjustable Data Model:

Elasticsearch provides a data model that is adjustable and easy to change to meet dynamic requirements. Traditional models require all requirements to be identified before the data model is developed, and the whole process has to be redone from scratch if the requirements or objectives of a project change later. Elasticsearch averts this problem by providing mechanisms to build a model that is flexible to a large extent: if the schema of the data has to be completely changed, the documents simply have to be reindexed. There is no overhead of redesigning an entity-relationship structure, normalizing tables, etc., as in traditional databases.
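Since Elasticsearch 5.x, reindexing documents into an index with a new mapping is a single API call; the index names below are assumptions:

POST /_reindex
{
    "source": { "index": "news_v1" },
    "dest":   { "index": "news_v2" }
}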

Analytics:

Kibana, offered as part of the Elastic tech stack, enables the visualization and aggregation of data. Since aggregation over big data is expensive, the indexing features of Elasticsearch help in quickly looking up and fetching the data.


Fuzzy Search, Autocompletion:

Fuzzy searching allows Elasticsearch to overlook spelling errors in search requests, and autocompletion allows it to suggest search words based on a user's initial input. This makes Elasticsearch an ideal document database for many search engines and websites. For example, when a JSON request arrives for the word "security" misspelled as "securit", Elasticsearch can still match the intended word; a sketch of such a query is given below.

Moreover, when we search for something using our own search bar in a home-built UI, if we send the typed prefix to the backend as the user types, Elasticsearch responds with a set of suggestions for the completed word. Our search bar can be built to listen to these responses and provide suggestions in a drop-down (as in the Google search bar).

Autocompletion requires less searching than searching the content itself, since the user enters only a partial word. Therefore, the data pertaining to autocompletion can be stored in memory, which also makes it faster. Autocomplete data can also be scaled and stored on separate clusters.
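A minimal sketch of such a fuzzy query; the articles index and title field are assumptions:

GET /articles/_search
{
    "query": {
        "match": {
            "title": {
                "query":     "securit",
                "fuzziness": "AUTO"
            }
        }
    }
}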

Serving multiple customers at once:

Elasticsearch supports "multi-tenancy"; that is, a single instance of Elasticsearch can serve multiple tenants. For instance, in a website like Amazon, a single instance of Elasticsearch may serve 100 customers or more.

Although this is a good feature, it does come with certain downsides that have to be handled with efficient indexing techniques. For example, each user may have separate documents and should not be allowed to access documents that belong to other users. In such a scenario, the indexing strategy could give each user a separate index. But in some scenarios this leads to many tiny indices, which causes computational overhead. Therefore, an appropriate indexing strategy should be chosen after thoroughly analyzing the pros and cons.


3. Logstash

3.1. About Logstash

Logstash is an open source data collection and pipelining tool that can collect data from many different kinds of sources, including:

● Web logs: Apache web logs, event logs, firewall logs, networking logs, etc.
● HTTP requests: Data from Twitter, GitHub and other web applications
● Database stores: SQL & NoSQL databases via the JDBC interface
● Data streams: Messaging queues such as Kafka, JMS, etc.
● Sensors & IoT: Temperature sensors, pressure sensors, IoT products and associated devices

Since the structure of the data differs across these sources, the pipelining process makes it possible to clean and unify the data from them before it enters the Elasticsearch database. Logstash also has built-in pattern matching and geographical data mapping abilities that improve data aggregation and feature extraction.

Figure 6: Data Collection Sources

The "grok filter" is the main component of Logstash that parses unstructured data into structured data and makes it queryable.


3.2. Working of Logstash

Figure 7: Stages in Logstash

As shown in the figure above, Logstash collects data in the following three stages:

i. Input: Input plugins are used to read data into Logstash. Some of the commonly used plugins are file, Redis, syslog, Beats, etc. The data items generated by input plugins are called "events".

An example HTTP request log line:

55.3.244.1 GET /index.html 15824 0.043

ii. Filter: Filter plugins are used to perform intermediary transformations, parsing and cleaning of data. Grok, mutate, drop, geoip and clone are some of the commonly used filter plugins.

An example grok filter to parse the log line above into JSON would be:

filter {
    grok {
        match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
    }
}


The parsed output would be:

{
    "client":   "55.3.244.1",
    "method":   "GET",
    "request":  "/index.html",
    "bytes":    "15824",
    "duration": "0.043"
}

iii. Output: An event can be passed to multiple output plugins. Elasticsearch, graphite and file are examples of output plugins. The parsed output of the Logstash pipeline in the example above may be sent to the Elasticsearch database for indexing and storage.
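Putting the three stages together, a complete pipeline configuration for this example might look as follows; the log path, host and index name are assumptions:

input {
    file {
        path => "/var/log/httpd/access.log"
    }
}

filter {
    grok {
        match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "weblogs"
    }
}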
