Storage and Retrieval of Images in DBs: Learn Storage and Retrieval Techniques for Social Network Images

Image Storage and Retrieval in Databases

Ashwin Chavan (K00368506)

Arun Kumar Cheruku (K00380187)

A report submitted for the partial fulfillment

of the requirements of the course

Database Systems (CSEN 5314)

(Computer Science)

Texas A&M University, Kingsville

Spring 2016

Table of Contents

1. Introduction

2. Current Approaches

A. Storage in Binary form

B. Static/Dynamic Image Paths

3. Comparison of the approaches

4. Case Study – Facebook approach for large number of images

5. Conclusion

6. References

1. Introduction

With the advancement in hardware technology over the last decade, smartphones capable of taking high-resolution pictures, some even at High Definition resolutions of 1080p or more, are now easily available at a fraction of what they cost a few years ago. Apart from smartphones, various other devices are available for capturing pictures; prominent ones include GoPro and DSLR cameras.

The growth of social networking in the last decade has been phenomenal. There are now various applications (companies) that provide social networking capabilities. The prominent ones include Facebook, Twitter, Instagram and Snapchat. The user base for these companies is growing exponentially.

One culmination of the above two trends is the increase in users' photo sharing habits. Readily available apps from these companies, seamless integration with photo-editing apps, lower device costs and the integration of non-smartphone devices with social network apps have all contributed to extensive photo sharing. The statistics are jaw-dropping: Snapchat has 200 million users who share 8,796 photos per second. Facebook has 1.39 billion users who post 350 million photos per day, about 4,051 per second. Instagram has 300 million users who post 70 million photos per day, 810 per second. With numbers like these, it becomes important to understand where and how these organizations store and retrieve such large numbers of images.

The obvious answer is databases. In this report, we discuss the basics of how images are stored in and retrieved from a database, with sample code snippets, give a brief overview of the pros and cons of each approach and, finally, present a case study of Facebook to illustrate its approach to storing a large number of images.

2. Current Approaches

There are currently 2 approaches to storing images:

A. Storing Images as binary data in the database records.

B. Storing Image files in file system and storing their absolute/relative paths in the database records.

We take a look at each approach in brief detail in the following sections.

A. Storing Images as Binary Data

In this approach, the application converts the image into binary form and stores this binary data in the database records. In Java, binary data of this kind is represented by the java.sql.Blob interface. BLOB is a built-in SQL type that stores a Binary Large Object as a column value in a row of a database table. A Blob object is valid for the duration of the transaction in which it is created.

The operation is briefly illustrated as follows.

Fig 1: Common Application Architecture

The above diagram shows the most common architecture used in applications. Here, the application is responsible for performing CRUD (Create, Read, Update, Delete) operations on the database. When using the binary approach, the application follows the steps below to store and retrieve images. The sample code snippets are in Java; similar implementations are possible in other languages using the libraries they provide.

Storing Images

To store images, the image will first need to be read from a source, the most common source being a file upload from the local file system.

Once the image file is read into the application, it is converted into binary data (bytes) using any native or third-party library available.

Fig 2: Code Snippet to convert data into Java Byte array
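A minimal sketch of this step, assuming a hypothetical table images(name, data) with a binary column; the class and method names are illustrative only. The file is read into a byte array and bound to the statement with PreparedStatement.setBytes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class ImageBytesStore {

    // Reads the whole image file into memory as a byte array.
    static byte[] toBytes(Path imageFile) throws IOException {
        return Files.readAllBytes(imageFile);
    }

    // Inserts the bytes into the hypothetical table images(name, data)
    // and returns the number of rows inserted.
    static int save(Connection conn, String name, byte[] data) throws SQLException {
        String sql = "INSERT INTO images (name, data) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setBytes(2, data);
            return ps.executeUpdate();
        }
    }
}
```

A caller would obtain the Connection from its own JDBC driver or connection pool and call save(conn, "cat.jpg", toBytes(path)).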

The above code snippet illustrates the conversion of data into a Java byte array and the storing of those bytes in the database. Note that there are other variants of how this can be implemented. For example, the following code snippet illustrates the usage of a native JDBC method called setBinaryStream, which in the form used here is provided by the PreparedStatement interface. This method reads the data directly from the InputStream as needed until the end-of-file is reached.

Fig 3: Code snippet for Blob interface usage
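A minimal sketch of the streaming variant, again assuming a hypothetical table images(name, data). It uses PreparedStatement.setBinaryStream(int, InputStream), available since JDBC 4.0, so the driver pulls bytes from the stream instead of the application building a byte array first.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class ImageStreamStore {

    // Hypothetical schema: images(name VARCHAR, data BLOB).
    static final String INSERT_SQL = "INSERT INTO images (name, data) VALUES (?, ?)";

    // Streams the file into the binary column and returns the row count.
    static int save(Connection conn, Path imageFile) throws SQLException, IOException {
        try (InputStream in = Files.newInputStream(imageFile);
             PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, imageFile.getFileName().toString());
            // The driver reads from the stream as needed until end-of-file.
            ps.setBinaryStream(2, in);
            return ps.executeUpdate();
        }
    }
}
```

Streaming is mainly useful for larger files, since the whole image never has to sit in application memory at once.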

Depending on the requirements and the availability of libraries, an appropriate design decision is made as to which approach suits the given situation.

Retrieving Images

The retrieval part of the images can be logically grouped into 2 parts – read the bytes from the database and convert the byte data to image.

The retrieval of bytes stored in the database can be done in standardized ways; it is as simple as retrieving data of any other datatype. Once the table has been queried to include the column holding the image bytes, we can read the bytes from the result set just as we would read any other data. The ResultSet interface provides methods that specify which type of data is to be retrieved. The following code snippet shows the retrieval of byte array data that was stored in the database.

Fig 4: Code snippet for retrieval of byte array data
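A minimal sketch of this retrieval, assuming the same hypothetical table images(name, data). ResultSet.getBytes returns the column value as a byte array.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class ImageBytesLoad {

    // Returns the stored bytes for the given name, or null if no row matches.
    // The table images(name, data) is a hypothetical schema.
    static byte[] load(Connection conn, String name) throws SQLException {
        String sql = "SELECT data FROM images WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes("data") : null;
            }
        }
    }
}
```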

The ResultSet interface also provides a method to retrieve the BLOB data directly. The following code snippet shows the retrieval of BLOB data that was stored in the database.

Fig 5: Code snippet for retrieval of BLOB data
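A minimal sketch of the BLOB variant, assuming the same hypothetical table and assuming the image is small enough to fit in memory. ResultSet.getBlob returns a java.sql.Blob, whose contents are then copied out with Blob.getBytes (positions are 1-based).

```java
import java.sql.Blob;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class ImageBlobLoad {

    // Copies the full contents of a Blob into a byte array.
    // Blob positions start at 1, not 0.
    static byte[] toBytes(Blob blob) throws SQLException {
        return blob.getBytes(1, (int) blob.length());
    }

    // Returns the image bytes for the given name, or null if no row matches.
    // The table images(name, data) is a hypothetical schema.
    static byte[] load(Connection conn, String name) throws SQLException {
        String sql = "SELECT data FROM images WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                return toBytes(rs.getBlob("data"));
            }
        }
    }
}
```

For very large objects, Blob.getBinaryStream() could be used instead to read the data incrementally.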

Depending on the requirements, the byte array can be processed in various ways to obtain the image. These include writing the byte array directly into a file to create a JPEG or other image file, or converting it into image objects such as BufferedImage or ImageIcon for further processing in the application.

Code snippet illustrating creating image file from byte array

Fig 6: Code snippet for creating image file from byte array
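A minimal sketch of writing the retrieved bytes to a file, assuming the bytes are already an encoded image (for example JPEG data exactly as it was uploaded). The class name is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ImageFileWriter {

    // Writes the raw bytes unchanged to the target path, creating or
    // overwriting the file, and returns the path that was written.
    static Path write(byte[] imageBytes, Path target) throws IOException {
        return Files.write(target, imageBytes);
    }
}
```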

Code snippet illustrating conversion into some image objects

Fig 7: Code snippet for conversion into image object
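A minimal sketch of decoding the bytes into a BufferedImage with ImageIO. ImageIO.read returns null when no registered reader recognizes the data, so the sketch turns that case into an exception; the class name is illustrative.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

class ImageDecoder {

    // Decodes encoded image bytes (PNG, JPEG, GIF, ...) into a BufferedImage.
    static BufferedImage decode(byte[] imageBytes) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (image == null) {
            throw new IOException("No registered ImageReader can decode these bytes");
        }
        return image;
    }
}
```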

B. Storing Image files’ absolute/relative paths

In this approach, the image is stored on the file system in a particular location, and this location of the image is stored in the database records. The file system could be on the same server or on a different server altogether. There is no special handling required to store or retrieve the images as the image is stored as is. However, one needs to take into account the integrity of the file paths to the images, as these paths vary on different operating systems.

The operation is briefly illustrated as follows.

The above diagram shows the architecture used in the application. Here, the application performs CRUD operations on both the database and the file system: the file paths are kept in the database and the actual image files in the file system. When using the file system approach, the application follows the steps below to store and retrieve images. The sample code snippets are in Java; similar implementations are possible in other languages using the libraries they provide.

Storing images

The images are stored in 2 ways. The image, once read from the source, can be directly uploaded to the server or can be read into the application and written as a file on the server.

One of the most common ways of reading an image and writing it as a file is to use the native ImageIO API provided by Java. ImageIO is a handy class for reading and writing images on the local system or server. It can read images from either a local file system or a URL, as shown below.

Fig 9: Code snippet for reading image from File

Fig 10: Code snippet for reading image from URL
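A minimal sketch of both reads using ImageIO.read(File) and ImageIO.read(URL). The class and method names are illustrative; either call returns null if the content is not in a format ImageIO can decode.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

class ImageReaderExample {

    // Reads an image from the local file system.
    static BufferedImage fromFile(File file) throws IOException {
        return ImageIO.read(file);
    }

    // Reads an image from a URL (http:, https:, file:, ...).
    static BufferedImage fromUrl(URL url) throws IOException {
        return ImageIO.read(url);
    }
}
```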

Once the image has been read into the application, it can be saved onto the file system as shown below. It can be saved in various formats such as JPEG, GIF and PNG.

Fig 11: Code snippet for writing image types
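A minimal sketch of saving a BufferedImage with ImageIO.write. The format name ("png", "jpg", "gif") selects the writer; ImageIO.write returns false when no suitable writer is found, which the sketch reports as an exception. The class name is illustrative.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class ImageWriterExample {

    // Encodes the image in the given format and writes it to the target file.
    // Note: the built-in JPEG writer may reject images that have an alpha channel.
    static void save(BufferedImage image, String format, File target) throws IOException {
        if (!ImageIO.write(image, format, target)) {
            throw new IOException("No ImageWriter available for format: " + format);
        }
    }
}
```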

Apart from the native Java library, there are plenty of third-party libraries that assist in these operations. Some libraries even provide APIs for extended image processing, such as reading image info, reading and writing a variety of metadata, and converting images to RGB. One of the most prominent is the Apache Commons Imaging library, which is written in 100% pure Java and offers a wide variety of image processing functionality.

Retrieving Images

Depending on the application, the stored images can be retrieved either by referencing their location on the server directly or by reading them into the application. The image file location read from the database will have to be adjusted appropriately (for example, resolved against the storage root of the current server) before it is used in the application.
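As an illustration of adjusting a stored location, here is a minimal sketch that assumes the database holds relative paths with '/' separators and that the application knows its own storage root. The class name and the rejection of paths that escape the root are assumptions made for this example.

```java
import java.nio.file.Path;

class ImagePathResolver {

    // Turns a relative path stored in the database into an absolute path
    // under the given storage root. Paths that would end up outside the
    // root (for example via "..") are rejected.
    static Path resolve(Path imageRoot, String storedPath) {
        Path root = imageRoot.toAbsolutePath().normalize();
        Path resolved = root.resolve(storedPath).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("Path escapes image root: " + storedPath);
        }
        return resolved;
    }
}
```

Keeping only relative paths in the database means the storage root can differ between operating systems or servers without rewriting the records.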

Here’s an example of the image being read into the application and loaded into a Java Swing application.

Fig 12: Code snippet for image being read into the application
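A minimal sketch of loading an image file and displaying it in a Swing window, using ImageIO to read the file and an ImageIcon inside a JLabel to show it. The class and method names are illustrative.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.swing.ImageIcon;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;

class ImageViewer {

    // Wraps a decoded image in an icon that Swing components can display.
    static ImageIcon toIcon(BufferedImage image) {
        return new ImageIcon(image);
    }

    // Reads the file and shows it in a simple window.
    static void show(File imageFile) throws IOException {
        BufferedImage image = ImageIO.read(imageFile);
        if (image == null) {
            throw new IOException("Unsupported image file: " + imageFile);
        }
        // Swing components must be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame(imageFile.getName());
            frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
            frame.add(new JLabel(toIcon(image)));
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```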

Here’s an example of HTML code where the image is being directly referenced from the file system on the server.

Fig 13: Code snippet for an image directly referenced from the file system on the server

3. Comparison of the approaches

Storage in a database is generally more expensive than storage on a file system, but it wins when transactional integrity between the image and its metadata must be maintained. There are various scenarios where it makes sense to store images in a database, for example applications that must maintain history, such as invoice history. For applications that require ACID compliance and referential integrity, the database is the way to go. Access control for images stored inside the database is also easy, and they do not require a separate backup strategy. There are, however, problems associated with this approach. As the number and size of the images handled by the application increase, performance is the first aspect to suffer. This approach also requires additional logic to retrieve and stream the images, which adds latency.

The file system approach is generally faster than the database approach. We don't need any additional logic to access the images stored in the file system. Performance does not take a hit while storing a large number of images, as the application stores only the path of each image in the database.

However, there is an additional overhead of checking that the image data has actually been written to the file system. There is also no way to guarantee that the image and the metadata stored for it in the database refer to the same file.

A study by Microsoft Research and UC Berkeley indicates that objects larger than 1 MB had a clear advantage on NTFS over SQL Server, whereas for objects smaller than 256 KB the database approach has the upper hand. For objects between these two sizes, the better choice depends on the amount of write operations in the workload and on how long the data has been stored in the system.
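The size thresholds above can be expressed as a simple decision rule. The following is a minimal sketch only; the class, enum and constant names are made up for illustration, and the middle band is left undecided because the text says it depends on the workload.

```java
class StorageChooser {

    enum Target { DATABASE, FILE_SYSTEM, WORKLOAD_DEPENDENT }

    // Thresholds taken from the study cited in the text.
    static final long SMALL_LIMIT_BYTES = 256L * 1024;   // 256 KB
    static final long LARGE_LIMIT_BYTES = 1024L * 1024;  // 1 MB

    // Suggests where an object of the given size would be stored.
    static Target choose(long sizeInBytes) {
        if (sizeInBytes < SMALL_LIMIT_BYTES) {
            return Target.DATABASE;
        }
        if (sizeInBytes > LARGE_LIMIT_BYTES) {
            return Target.FILE_SYSTEM;
        }
        return Target.WORKLOAD_DEPENDENT;
    }
}
```

For example, a 50 KB thumbnail would map to DATABASE and a 3 MB photo to FILE_SYSTEM, while a 500 KB image falls into the band where write frequency has to be considered.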

4. Case Study – Facebook approach for storing large number of images

Haystack is an object storage system optimized for Facebook's Photos application. Facebook is currently the largest photo sharing site, with users having uploaded over 65 billion photos. Facebook creates and stores four images for every photo uploaded by a user, which amounts to over 260 billion images and more than 20 petabytes of data. Users upload over a billion new photos (∼60 TB) each week, and Facebook serves over 1 million images per second at peak.

Haystack is an object store that engineers designed for Facebook's photo sharing application. Here, data is written once, read many times, never changed, and only occasionally deleted. Since the performance of traditional file systems was inadequate under this stressful workload, the engineers designed their own storage system for storing images. This new design, Haystack, was far less expensive and also performed better than the older approach, which leveraged Network Attached Storage (NAS) appliances over Network File System (NFS).

Fig 14: Architecture of Facebook’s Haystack

The above diagram describes the architecture of Haystack. In the older NFS-based design, by contrast, each image was saved in its own file on commercial network attached storage appliances.

A group of machines, the Photo Store servers, mounted the volumes exported by the network attached storage appliances over the network file system. A Photo Store server used the URL of the image to determine the volume and the file path. It then read the data from the network file system and sent the result back to the Content Delivery Network (CDN).

The Haystack architecture is composed of three components, namely the Haystack Store, the Directory, and the Cache.

The Haystack Store wraps the persistent storage system for the images. It manages the file system level metadata of the images. The capacity of the Store is organized into physical volumes. For example, a 10 TB server can be organized into 100 physical volumes with a storage capacity of 100 GB each. Physical volumes residing on different machines are further grouped into logical volumes. Whenever an image is uploaded, it is stored in a logical volume, which means it is written to all of the physical volumes corresponding to that logical volume. This redundant storage helps in scenarios where data is lost due to issues such as drive failures or bugs.
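The relationship between logical and physical volumes can be illustrated with a toy in-memory model. This is a sketch of the replication idea only, not Facebook's implementation; every class, method and identifier in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ToyHaystackStore {

    // Physical volume id -> (photo id -> image bytes).
    private final Map<String, Map<Long, byte[]>> physical = new HashMap<>();
    // Logical volume id -> ids of the physical volumes that replicate it.
    private final Map<Integer, List<String>> logicalToPhysical = new HashMap<>();

    void addLogicalVolume(int logicalId, List<String> physicalIds) {
        logicalToPhysical.put(logicalId, new ArrayList<>(physicalIds));
        for (String id : physicalIds) {
            physical.putIfAbsent(id, new HashMap<>());
        }
    }

    // A write to a logical volume is written to every physical replica.
    void write(int logicalId, long photoId, byte[] data) {
        for (String id : logicalToPhysical.get(logicalId)) {
            physical.get(id).put(photoId, data.clone());
        }
    }

    // Reads from the first replica that is not marked as failed;
    // returns null if no healthy replica holds the photo.
    byte[] read(int logicalId, long photoId, Set<String> failedVolumes) {
        for (String id : logicalToPhysical.get(logicalId)) {
            if (failedVolumes.contains(id)) {
                continue;
            }
            byte[] data = physical.get(id).get(photoId);
            if (data != null) {
                return data;
            }
        }
        return null;
    }
}
```

With one logical volume backed by three physical volumes on different machines, a photo written once can still be read when one or two of those machines are unavailable.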

The Haystack Directory holds the mapping between these physical and logical volumes, along with other application metadata, such as the logical volume on which each image resides and which logical volumes have free space. The Haystack Cache acts as an internal content delivery network. It shields the Haystack Store from requests for the most popular images and also provides insulation if the content delivery network nodes fail and the data needs to be retrieved again.

5. Conclusion

The decision to choose between the BLOB and the file system approach is driven by various factors, the primary ones being application simplicity, manageability and performance. In general, databases are used for storing small objects, as they can do so efficiently, while file systems are efficient for storing larger objects. The key is to find the right approach based on the requirements of the application and, if required, to implement a hybrid approach by finding a break-even point, i.e. the right balance between the two approaches.

6. References

1) Russell Sears, Catharine van Ingen, Jim Gray. To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem? Microsoft Research; University of California at Berkeley.

2) Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel. Finding a needle in Haystack: Facebook's photo storage. Facebook Inc.
