Introduction
Managing huge amounts of structured and unstructured data is crucial to the success of every company that needs systematic organization and governance to ensure its data is of high quality and suitable for analytics and business intelligence applications. Although the key aspects of big data can be summarized as the popular 3 Vs of Volume, Velocity, and Variety, there are other key questions that every company needs to ask when choosing the process it will use to store and transform its data.
Big Data Aspects
Big Data Management
Due to the exponential growth of enterprise data stores, managing big data has become an increasingly challenging task. Cross-industry research has shown that most organizations use only half of their structured data and about one percent of their unstructured data in decision-making analyses. Many organizations tend to keep as much data as possible since there is no way to predict which data sources will be valuable in the future. [1] They often find outdated data or conflicts with copies in other systems, and with so many data sources available, implementing an efficient data management technique can be a cumbersome task. Data spread can result in records being maintained across multiple locations and introduces the risk of duplication, which leads to increased management costs and inconsistent security policies. Not having a data management strategy can lead to a lack of trust, missed opportunities, decreased customer satisfaction, as well as non-compliance and regulatory penalties. For these reasons, organizations tend to implement the disciplines of data management by investing in policies and big data tools that can help them with their needs. Big data management can be considered a broad term that includes data cleansing, integration, migration, preparation, enrichment, analytics, quality, management, reporting, governance, and planning. Depending on an enterprise’s needs, the focus and resources allocated to each of these processes can differ vastly.
Figure 1. From Data Collection to Predictive Analytics
Data Repository Strategy
Data repositories can be a great solution to data management challenges by centralizing data in one system that refers to the metadata and a single logical namespace. They are used to keep a specific population of data isolated in a data storage entity or entities to mine data for business insights, reporting needs, or machine learning.
The term data repository is often used alongside data warehouse or data mart, and its main benefit is to make reporting or analytics easier because the data is isolated. An effective data repository strategy requires a coherent tactic to unify, regulate, evaluate, and deploy the huge amount of data resources. This enables stronger data management capabilities that ultimately improve analytics and query performance. The first step in defining a data repository strategy is to clarify the organization’s primary data objective, which will guide its data management approach. A robust data strategy encompasses several elements. This includes creating a data architecture that covers the entire enterprise, defines business needs, and prioritizes data quality and integration. It additionally enables accountability by defining standards on data retention and reducing risk and complexity. [2] When implementing the data strategy, companies face multiple approaches, and their decision is based on the available resources or previous experience of the organization. While some strategies help organizations enforce guidelines governing data privacy and the integrity of data distributed through internal sources, other strategies focus more on supporting business decisions by creating rapid frameworks that provide real-time insights, predictive modeling, and interactive dashboards. Most companies require a balance of these two approaches and choose flexibility to succeed, while some put more emphasis on one, accepting the associated trade-offs.
Figure 2. Key Factors When Deciding a Data Strategy
Data Repositories
Enterprise Data Warehouse
An Enterprise Data Warehouse (EDW) can be summarized as a subject-oriented database or a collection of databases that gathers data from multiple sources and applications into a centralized source ready for analytics and reporting.
An EDW stores and manages all the historical business data of an enterprise.[3] Organizing, transforming, and aggregating inputs from various data sources can save valuable time and management costs when preparing an Artificial Intelligence (AI) ready data structure. This is where Extract, Transform, Load (ETL) or ELT approaches are often used, and distributed big data frameworks like Hadoop or Apache Spark help organizations with heavy data cleansing and transformation.
Figure 3. Data Warehouse Overview
The key difference between data warehouses and standard operational databases is that the latter are optimized to maintain precise accuracy at a given moment and keep track of rapid data updates, while data warehouses provide a broad view of the data over time. Although data warehouses are a popular tool to manage big data, they can become expensive when an organization needs to scale them, and they do not perform well when handling unstructured or complex data formats. The architectural complexity of EDWs offers many benefits to an organization:
Data Marts
While a data warehouse (DW) can effectively deal with large data sets, real-time artificial intelligence and data analysis for different subsets of business operations require the use of data marts (DMs). DMs can be considered scaled-down versions of DWs with a more limited scope, or logical subsets of them, that aim to meet the information needs of a specific group of end users in different business units or departments; they usually provide aggregated data for focused content or customized decision support. They come in dependent and independent forms: the former is populated from an EDW and the latter is taken directly from an Operational Data Store (ODS). DMs reduce the load of queries, transformations, and heavy network usage on other data sources in the organization and provide end users with a customized data store, giving them more access and control. DMs can also introduce several inherent problems such as information siloing and limited user access.
Data Lakes
Data lakes (DLs) are another type of data repository, with the key difference that the data is stored in its raw native format without any transformation. The data can be structured or unstructured, which makes DLs a fit for bulk data types such as server logs, clickstreams, social media, or sensor data. The data is simply stored in the repository without knowing what type of analysis will be done or whether it will ever be used in an analysis. This in turn requires a lot of preprocessing when the data needs to be used for business insights.
Figure 4. Simple Representation of Data Lake
DLs have lower storage costs due to their more open-source nature and undefined structure and can be established in an organization’s data center with in-house management or in the cloud services of vendors such as Amazon, Microsoft, or Google.
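The point that a data lake keeps data in its raw format and defers preprocessing to the time of use can be illustrated with a minimal sketch. The event records, field names, and the read_clickstream helper below are hypothetical and only stand in for whatever a real lake would hold; the sketch assumes newline-delimited JSON as the raw format.

```python
import json

# Hypothetical raw events as they might land in a data lake: one JSON
# document per line, stored untouched, with no schema enforced on write.
RAW_EVENTS = """\
{"source": "clickstream", "user": "u1", "page": "/home", "ms": 120}
{"source": "sensor", "device": "d7", "temp_c": 21.5}
{"source": "clickstream", "user": "u2", "page": "/pricing"}
not valid json
"""


def read_clickstream(raw_text):
    """Apply a schema at read time: keep only well-formed clickstream rows."""
    rows = []
    for line in raw_text.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw stores can contain malformed records
        if not isinstance(event, dict) or event.get("source") != "clickstream":
            continue  # other sources share the same store but are not needed here
        rows.append({
            "user": event["user"],
            "page": event["page"],
            # Missing fields are handled during preprocessing, not on write.
            "ms": event.get("ms"),
        })
    return rows


clicks = read_clickstream(RAW_EVENTS)
```

All of the filtering, validation, and shaping happens in the reader, which is why raw storage is cheap to write but pushes preprocessing work onto whoever analyzes the data later.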
While DWs are targeted towards decision makers with transformations in place, DLs require specialized data scientists to preprocess and analyze the data, and they can improve customer interactions and R&D innovations and increase operational efficiencies.
Transactional Stores
Transactional data stores (TSs) are optimized for row-based operations such as reading and writing individual records while maintaining data integrity. They are not specifically built for analytics, yet because of their long-standing place in production environments, they can be used for analytic queries as well as low-latency information monitoring. TSs are ACID (atomicity, consistency, isolation, durability) compliant, meaning they guarantee data validity despite errors and ensure that data does not become corrupt because of a failure. This is crucial to business use cases that require a high level of data integrity, such as banking transactions. TSs are designed to run in production systems and, due to their row-based, low-latency nature, can run operations or queries that need to be nearly in sync with the master database. While DWs, due to their column-based nature, are optimized to read data, TSs perform better at writing. This might not be a huge problem for companies with small volumes of data, but as the available data grows, it can make a difference in choosing the right data strategy.
Operational Data Stores
An operational data store (ODS) is another way to mitigate the challenge of querying up-to-date data from DWs and can be considered a staging area that provides query capabilities. The ODS can provide fine-grained, non-aggregated data that is closer to real time, as it is received before heavy transformation and loading operations, which takes the burden off transactional systems. ODSs are used for operational reporting and as a complementary element to EDWs.
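Returning to the ACID guarantee described for transactional stores above, atomicity in particular can be sketched with Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are hypothetical and chosen only to mirror the banking use case; this is an illustration under those assumptions, not a production design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move funds in a single transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit above is undone.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


first = transfer(conn, "alice", "bob", 30)    # within alice's balance
second = transfer(conn, "alice", "bob", 500)  # would overdraw alice
```

The second transfer credits one account before the debit fails, yet the rollback leaves both balances exactly as they were after the first transfer, so no partial update is ever visible.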
The general purpose of an ODS is to integrate data from different sources into a single structure via data cleaning, resolving redundancies, and establishing business rules. An ODS can be a key component of an EDW and, due to its multi-purpose structure, enables both transactional and decision support processing. Data stored in an ODS is transaction oriented and smaller in size compared to DWs.[5]
Conclusion
Big data management is a necessity for every company. It improves customer understanding and innovation in developing new products while enabling major financial and business decisions through the analysis of large amounts of data for every department. Establishing a data strategy requires problem definition and an understanding of the business needs of each company to improve its data systems and source management. Although not all companies need to worry about big data management from the beginning, it becomes a requirement when traditional databases are no longer performing well enough and do not provide the benefits of big data repositories. This usually becomes apparent when competitive advantage, innovation, revenue growth, and client acquisition all reach a plateau. It is worth adding that each data repository comes with its own disadvantages. Some companies use data lakes to store all their data without effectively extracting information for each department, which undermines their business strategy. Dumping data without any goals into a data warehouse leads to high management costs, losing track of what is stored, and failing to take advantage of the newly established resources. In most cases a data strategy will not provide business value overnight; it is rather a gradual improvement that needs small steps at every stage through feedback and evaluation.
A data repository does not guarantee the success of a company’s data strategy; however, it does reduce the likelihood of common failure scenarios and the excessive cost and time spent extracting value from data, and it orients a company for future innovation.
References
Which type of database is used as a repository?
Relational Databases (RDBMS)
What is a repository for storing large amount of data?
A data warehouse is a large data repository that aggregates data usually from multiple sources or segments of a business, without the data being necessarily related.
What is data repository in database?
A data repository refers to an enterprise data storage entity (or sometimes entities) into which data has been specifically partitioned for an analytical or reporting purpose.
What is repository in computer?
In information technology, a repository (pronounced ree-PAHZ-ih-tor-i) is a central place in which an aggregation of data is kept and maintained in an organized way, usually in computer storage.