In today’s digital world, with the exponential growth of data, new approaches to aggregating and analyzing data are bringing considerable benefits to many fields such as healthcare, the Internet of Things (IoT), social media, business, and public policy. Data Science (DS) is an interdisciplinary field that covers how data is prepared, analyzed, interpreted, modeled, and presented. It combines data analytics, machine learning, mathematics, and statistics with domain and business knowledge. One of the main goals of DS is to pair Big Data technologies with skilled analysis to obtain as much information as possible from the data and to facilitate decision making. Many research areas, such as medicine and astrophysics, have relied heavily on DS, usually focusing on structured scientific data. Using DS, scientists can obtain a better understanding of the data and conduct more precise analyses. In addition, DS has become a crucial foundation for Artificial Intelligence (AI), built on the right mix of machine learning and domain knowledge, and it continues to affect all aspects of life through the discovery of new knowledge and hidden meaning within data.
The rapid diffusion of DS has advanced the development of the core theme within the Multimedia Society, and many IEEE groups have been focusing on or turning to data-driven methods. The mission of the DS Initiative is to provide a common point of reference for data-related activities in multimedia. Although DS has greatly accelerated research on multimedia data analytics, the field is still far from creating comprehensive, efficient, and automated policy-making systems. This is mainly because multimedia data, defined as all forms of “human information” [1], is one of the most abundant sources of information and knowledge. The integration, transformation, and indexing of multimedia data bring significant challenges in data management and analysis. These challenges include the volume of the data, its heterogeneity, the multidisciplinary nature of DS, and a heavy dependence on domain and expert knowledge. Some of the challenges are summarized below.
Nowadays, each person is capable of producing terabytes of data in various formats, and user-generated content usually comes in large volume and variety. One of the main challenges in data science is how to store, manage, analyze, and utilize this huge amount of data efficiently. At the same time, high-dimensional data media, together with advanced data science solutions, provide great opportunities to capture complex and subtle patterns that fall outside the scope of traditional approaches. It is imperative to extract meaningful knowledge from this large amount of data through advanced big data analytics, and it is necessary to consider the quality and security of the data while uncovering hidden patterns and insights. Data science has been playing a key role in addressing these challenges in multimedia big data.
Security is also an important factor in many multimedia systems. Some media can be reverse-engineered to identify private information, and the dissemination of personal data may lead to public disturbance. IoT is a network of physical devices, often embedded with sensors, that collect data at an unprecedented scale and depth. With the popularity of IoT, the high-level, long-term goal is to research how to use these sensors to collect data about human behavior in a manner that preserves privacy yet provides adequate information to identify abnormal activities and suggest possible interventions. It is challenging to protect data privacy from adversarial parties in a healthcare information system without affecting the data storage, processing, and communication phases of an analytic task [5]. An ideal privacy situation would require data and models to always be protected and to be accessible only to the data contributors or from the users’ devices.
Another important challenge in DS is how to manage both structured and unstructured data effectively. Multimedia data usually contains various forms of media, such as text, images, video, geographic coordinates, and even pulse waveforms [2], which come from multiple sources. In contrast with structured data, which can be easily stored and manipulated using relational databases, unstructured data does not follow a consistent, pre-defined data model. DS can be regarded as a broad umbrella covering big data, machine learning, and data mining solutions to store, handle, and analyze such heterogeneous data. For instance, NoSQL (Not Only SQL), as a big data solution, can support customized, flexible, and scalable databases for applications involving non-uniform and large-scale data. Moreover, machine learning and data mining techniques have been widely used to analyze unstructured data such as video, text, and audio.
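As a rough illustration of why a flexible, document-style model suits heterogeneous multimedia data, the sketch below stores records of different media types as independent JSON documents. It is only a minimal sketch using the Python standard library, not a real NoSQL database; the records, field names, and helper functions are all hypothetical.

```python
import json

# Hypothetical multimedia records from different sources; all field names are illustrative.
records = [
    {"id": 1, "type": "text", "body": "Road closed near the bridge."},
    {"id": 2, "type": "image", "uri": "img/0002.jpg", "width": 1920, "height": 1080},
    {"id": 3, "type": "location", "lat": 25.76, "lon": -80.19},
    {"id": 4, "type": "waveform", "sample_rate_hz": 100, "samples": [0.1, 0.4, 0.3]},
]


def to_documents(items):
    """Serialize each record as its own JSON document, keeping whatever fields it has."""
    return [json.dumps(item, sort_keys=True) for item in items]


def find_by_type(documents, media_type):
    """Scan the documents and return the decoded ones whose 'type' field matches."""
    decoded = (json.loads(doc) for doc in documents)
    return [doc for doc in decoded if doc.get("type") == media_type]


def field_union(documents):
    """Collect every field name seen across documents.

    This is the set of columns a single fixed-schema table would need.
    """
    fields = set()
    for doc in documents:
        fields.update(json.loads(doc).keys())
    return sorted(fields)


documents = to_documents(records)
images = find_by_type(documents, "image")
all_fields = field_union(documents)
```

In this toy data, a single relational table covering every record would need all the fields in `all_fields`, and most cells would be empty for any given row, whereas each document carries only the fields that apply to it.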
Traditionally, academic researchers worked independently to design experiments within their specialties. With the widespread adoption of DS in scientific research, interaction between multiple disciplines has become crucial yet challenging, since it requires a precise understanding of a broader range of knowledge. Researchers from different fields not only provide their domain expertise but also contribute to the resources that connect the disciplines. Multimedia data, which is heavily used in many research fields, can be shared to accomplish more complex research tasks when several disciplines work toward a common research objective. However, independent researchers approach their investigations from their own specific perspectives, and the diversity of experimental designs leads to different data requirements. If one research team does not provide enough data in the required format, the other teams involved may not be able to conduct the research needed to fulfill their distinct objectives. DS provides generic methodologies to clean the raw data and transform it into a form ready for each team's analysis.
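To make the idea of cleaning raw data into a shared form concrete, the following minimal sketch normalizes records from two hypothetical research teams that use different field names and units. The schema, the field names (`subject_id`, `participant`, `duration_s`, `duration_ms`), and the sample values are assumptions made purely for illustration.

```python
def clean_record(raw):
    """Map one raw record to the shared schema, or return None if it is unusable."""
    # Accept either of the two hypothetical field names for the subject identifier.
    subject = raw.get("subject_id") or raw.get("participant")
    if subject is None:
        return None

    # Convert the duration to seconds, whichever unit the source team used.
    if "duration_s" in raw:
        duration = float(raw["duration_s"])
    elif "duration_ms" in raw:
        duration = float(raw["duration_ms"]) / 1000.0
    else:
        return None  # No duration at all: drop the record.

    return {"subject_id": str(subject).strip().upper(), "duration_s": duration}


def clean_all(raw_records):
    """Clean every record and keep only those that fit the shared schema."""
    cleaned = (clean_record(r) for r in raw_records)
    return [c for c in cleaned if c is not None]


# Hypothetical raw data: team A reports seconds as strings, team B reports milliseconds.
raw_records = [
    {"subject_id": " a01 ", "duration_s": "12.5"},
    {"participant": "B07", "duration_ms": 8000},
    {"participant": "C03"},  # Missing duration, so it is dropped.
]
shared = clean_all(raw_records)
```

Here the cleaning rules are fixed in code; in a real collaboration the shared schema and the handling of incomplete records would be agreed upon by the teams involved.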
As data analytics becomes increasingly complex through the integration of multimedia data, it requires appropriate domain knowledge to formulate possible solutions and validate the results. Deep knowledge of and expertise in the underlying domain are essential for understanding the meaning of the data [3]. For example, a self-driving car needs more than image processing to make a decision: it must detect a traffic sign and also recognize the rule or message that the sign conveys (e.g., Stop or Yield). Domain knowledge also affects processes such as data collection and experimental design [4]. With an appropriate integration of the scientific approach and domain-specific rules, some media can reveal hidden information in the data that a human cannot easily see. In the public sector, multimedia content analytics has been introduced into the criminal justice system. Mobile data, satellite images, or Google Street View can support rough predictions of economic well-being and the allocation of fire and health inspectors in cities, as well as other urban applications [5].
All in all, we are collecting large and varied multimedia data from many sources at a speed that was not possible before. This exponential growth of multimedia data calls for advanced data science and big data strategies to efficiently capture, store, clean, analyze, mine, and visualize the data. The areas mentioned above not only require advanced machine learning algorithms but also draw on the essential methods and practical knowledge of long-established scientific research that makes use of empirical evidence. Therefore, multidisciplinary approaches are necessary for multimedia big data to achieve its full potential in business, science, and policy making. Both multimedia and data science are expected to grow quickly and become more important over time. Thus, it is important to integrate these techniques effectively to solve existing challenges in real-world applications and to support better problem solving and decision making.