Discussion of the data preparation process



Task 2.1: Discussion of the data preparation process

Data collection, filtering and integration process

Data collection

According to Roh et al. (2019), quantitative data collection relies on statistical techniques such as closed questions, correlation and regression, and measures of the mean, median or mode. This approach is cheaper than qualitative data collection and can be completed in a short period. Qualitative data collection, by contrast, involves no mathematical computation and is closely linked to non-quantifiable components. Qualitative methods include interviews, surveys, observations and case studies. Several ways are available for gathering such data. Although it is feasible to acquire off-the-shelf software for logging different GUI operations, this solution is not readily adaptable to particular training and testing features. In addition, data from real users of such instruments is not easily collected and is not available for study purposes in the literature. This study presents the first tool that we have produced in this field (Barrett and Twycross, 2018). Longitudinal studies are surveys that track the same group of people by asking them repeated questions over time. These studies enable researchers to examine traits regularly over time. The data files of longitudinal studies may be highly complex in structure, so the available documentation needs to be assessed.

Data filtering

According to Lin et al. (2018), there are several reasons why filtering data, particularly big data, is common practice. Data may be filtered on a single attribute or on any attribute value in the database, and it may be filtered, altered and modified at the same time. It is also usual to sort the records in some way while filtering. The output usually includes only the valuable attributes. For example, when data about people is filtered, an individual's social security number might be part of the output. For manufactured products, the filtered output may include the component number, lot number and date of production. For real estate data, the property address may be included as a key attribute. One effect of filtering is the generation of subsets of data.

Figure 1: Data Filtering

(Source: Science Direct, 2017)

The generation of subsets of data is a consequence of filtering (Shi et al. 2020). The analyst who constructs the filtering mechanism may, however, want to prepare the subset for further analysis. In other words, once a subset has been generated, it may be worthwhile to prepare it so that it is usable for future analytical processing.
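
As a minimal sketch of the idea that filtering produces a subset, the following Python snippet filters some manufacturing records on attribute values and keeps only the key attributes mentioned above (component number, lot number and date of production). The records, field names and the `filter_records` helper are assumptions made for illustration, not part of any cited method.

```python
from datetime import date

# Hypothetical manufacturing records; field names and values are illustrative only.
records = [
    {"component_no": "C-100", "lot_no": "L1", "produced": date(2021, 3, 1), "plant": "A", "passed_qc": True},
    {"component_no": "C-101", "lot_no": "L1", "produced": date(2021, 3, 2), "plant": "B", "passed_qc": False},
    {"component_no": "C-102", "lot_no": "L2", "produced": date(2021, 4, 5), "plant": "A", "passed_qc": True},
]

KEY_ATTRIBUTES = ("component_no", "lot_no", "produced")


def filter_records(rows, predicate, keep=KEY_ATTRIBUTES):
    """Return the rows matching predicate, reduced to the kept attributes."""
    return [{k: row[k] for k in keep} for row in rows if predicate(row)]


# Filter on two attribute values; the result is a subset ready for later analysis.
subset = filter_records(records, lambda r: r["plant"] == "A" and r["passed_qc"])
```

Keeping the predicate separate from the list of retained attributes means the same subset can be regenerated later with different output columns.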

Data integration process

A defined data integration solution is required to create a unified view of data that comes from many sites and formats. The need may also arise when two organisations merge or when an organisation consolidates its internal applications. Data integration is also useful when a larger and better data storage facility is developed; ultimately, it leads to better and more efficient analysis.

Figure 2: Data integration framework

(Source: Data Warehouse, 2019)
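
The unified view described in this section can be sketched in a few lines of Python. The two sources, their field names and the `integrate` function below are hypothetical; the sketch only shows the general pattern of normalising keys and types before combining records.

```python
# Two hypothetical sources that describe the same customers in different formats.
crm_rows = [
    {"customer_id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},
    {"customer_id": 2, "name": "Alan Turing", "email": "alan@example.com"},
]
billing_rows = [
    {"cust": "1", "total_spend": "120.50"},
    {"cust": "3", "total_spend": "15.00"},
]


def integrate(crm, billing):
    """Build one unified record per customer id found in either source."""
    unified = {}
    for row in crm:
        unified[row["customer_id"]] = {
            "customer_id": row["customer_id"],
            "name": row["name"],
            "email": row["email"],
            "total_spend": None,
        }
    for row in billing:
        cid = int(row["cust"])  # normalise the key type across sources
        record = unified.setdefault(
            cid, {"customer_id": cid, "name": None, "email": None, "total_spend": None}
        )
        record["total_spend"] = float(row["total_spend"])
    return [unified[cid] for cid in sorted(unified)]


unified_view = integrate(crm_rows, billing_rows)
```

Customers that appear in only one source are kept with `None` in the missing fields, so gaps between the systems stay visible rather than being silently dropped.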

Data Migration

Li and Liu (2020) commented that data migration is the process of shifting data from one location, format or application to another. It is generally prompted by the introduction of a new data system or location. The move from on-site to cloud-based storage and applications is a typical reason nowadays. It is vital to note the contrast between integration and migration. Integration is the collection of procedures that make it possible to transform data from multiple sources into business knowledge. Data migration is a separate process that involves moving data across kinds of storage, formats, architectures and systems.
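
A small illustration of moving data from one format to another is given below, assuming a legacy CSV export and a relational target. The table name, column names and the `migrate` helper are hypothetical; only Python standard-library modules (`csv`, `io`, `sqlite3`) are used.

```python
import csv
import io
import sqlite3

# Hypothetical legacy export in CSV format.
legacy_csv = "id,name\n1,Ada\n2,Alan\n"


def migrate(csv_text, connection):
    """Copy rows from a CSV export into a relational table; return the row count."""
    connection.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(int(r["id"]), r["name"]) for r in reader]
    connection.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory target database for the sketch
migrated = migrate(legacy_csv, conn)
```

Returning the number of migrated rows gives a simple check that the target holds as many records as the source.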

Enterprise Application Integration

Enterprise application integration (EAI) is a group of methods for achieving interoperability across the various technologies that companies use. It specifically addresses the difficulties associated with an organisation's modular design. Before EAI techniques, integration was generally handled point to point, with a separate connection constructed for each pair of systems or applications that needed to interact. EAI solutions now incorporate middleware approaches to help centralise and standardise procedures across a whole infrastructure.
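
The difference between the two approaches can be made concrete with a short calculation. The sketch below assumes the worst case in which every pair of systems must interact, and a single central middleware hub; both function names are illustrative.

```python
def point_to_point_links(n_systems: int) -> int:
    """Links needed if every pair of systems gets its own connection."""
    return n_systems * (n_systems - 1) // 2


def hub_links(n_systems: int) -> int:
    """Links needed if every system connects once to a central middleware hub."""
    return n_systems


# Compare the two approaches for a few landscape sizes.
for n in (3, 5, 10):
    print(n, point_to_point_links(n), hub_links(n))
```

Under these assumptions, ten systems need 45 point-to-point connections but only 10 connections to a hub, which is why centralised middleware scales better as the application landscape grows.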

Master Data Management

Li and Liu (2020) mentioned that Master Data Management (MDM) is a discipline in which business and IT collaborate to ensure that shared master data assets are consistent, accurate, accountable and semantically coherent. Master data covers key identities and their properties, such as customers, vendors and locations. Efficient and successful ongoing MDM depends on continual enhancement of data and a fully implemented data quality plan. To generate a single version of the truth, numerous data items must be harmonised and synchronised. Change management is also important if MDM methods and procedures are to be adopted in a given company.

Task 2.2: Perform data modelling

Data modelling is the process of creating a data model for the data to be stored in a database. The data model represents conceptual data objects, the relationships between distinct data objects and the rules (Liu et al. 2019). Data modelling supports the visualisation of data and enforces business rules, regulatory compliance and government data policies. Data models ensure consistency in naming conventions, default values, semantics, security and data quality. A data model is an abstract model that organises the description of the data, the data semantics and the constraints on the data. It emphasises which data is required and how it should be organised rather than which operations will be performed on the data. A data model is like an architect's building plan: it helps to create conceptual models and to establish relationships between data items.

Data may be modelled at several levels of abstraction. The process starts by gathering information about business needs from stakeholders and end users. These business rules are then converted into data structures to produce a concrete database design. A data model may be likened to a road map, an architect's blueprint or any formal representation that gives a deeper understanding of what is being designed. Data modelling uses standardised schemas and formal techniques. This provides a standardised, consistent and predictable way to define and manage data resources across an organisation, or even beyond it (Pan et al. 2018). Conceptual data models, also known as domain models, give a broad picture of what the system will contain, how it will be organised and which business rules are involved. Conceptual models are often created while the initial project requirements are being gathered. They typically contain entity classes, their characteristics and constraints, the relationships between them, and the relevant security and data integrity requirements.
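
To illustrate how entities, relationships and rules fit together, the following sketch expresses a small conceptual model as Python dataclasses. The entities (`Customer`, `Product`, `OrderLine`, `Order`) and the rule that quantities must be positive are assumptions chosen for illustration, not a model taken from the sources cited here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Product:
    product_id: int
    name: str
    unit_price: float


@dataclass
class OrderLine:
    product: Product
    quantity: int

    def __post_init__(self):
        # Business rule captured in the model: quantities must be positive.
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")


@dataclass
class Order:
    order_id: int
    customer: Customer  # each order belongs to exactly one customer
    lines: List[OrderLine] = field(default_factory=list)  # one order has many lines

    def total(self) -> float:
        """Sum of unit price times quantity over all order lines."""
        return sum(line.product.unit_price * line.quantity for line in self.lines)


order = Order(1, Customer(1, "Ada"), [OrderLine(Product(10, "Shelf", 25.0), 2)])
```

The same structure could later be turned into database tables, with `Order.customer` becoming a foreign key and the quantity rule becoming a check constraint.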

Selection and justification of the inferential model

Ding et al. (2017) opined that the inferential model holds that communication is built on communicators who infer what others believe or want from the evidence available in the context. Inferences are deductions or informed estimates. Since we cannot read the content of other people's thoughts directly, we have to do everything we can to work out what they think, that is, the mental states they intend to convey, based on their conduct, i.e. on the social stimuli they provide. To be effective, communicators have two distinct sorts of intention to show and recognise: informative intentions and communicative intentions. Cooperation is a word that readers almost certainly know; on reading it, they probably think of phrases such as “joint labour”, “shared effort” or “teamwork”. These are typical senses of the word and may describe what individuals do when they communicate (Csáky et al. 2019). However, this discussion uses the term more precisely. Communicative cooperation is the only sort of cooperation that is strictly necessary for communication to succeed. Without it, accurate inferences about other people's mental states from the stimuli they provide would be almost impossible. Informative cooperation, however, is also necessary for communicating true or correct representations of the mental states themselves. It is possible to communicate successfully without informative cooperation, but doing so is likely to lead to disappointment: the shared understanding then contains content that at least one party to the interaction does not believe to be true.

Selection and justification of the machine learning model

When machine learning is used, the training process produces a model, and that model is a mathematical representation of a real-world process. Machine learning algorithms find patterns in the training dataset; these patterns are used to approximate the target function, which maps the input values to the output values in the available dataset. Machine learning approaches include classification models, regression models, clustering and dimensionality reduction techniques such as Principal Component Analysis. Classification is the task of assigning objects to one of a restricted set of alternatives (Nemati et al. 2018).

Figure 3: Machine Learning Model

(Source: Research Gate, 2016)

Thabtah and Peebles (2020) illustrated classification as follows: email spam detection, for example, involves predicting whether or not an email is spam. Several important classifier models exist. The number of predictors used to predict the dependent variable is known as the dimensionality of the model. A very common issue with real-world datasets is that the number of variables may be excessively large. Overfitting is also associated with models that include a large number of variables. Not all variables have the same impact on the target: many variables may contribute to the objective, but not all of them affect it to the same degree.
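
The spam example can be sketched as a very small classifier. The code below uses a naive Bayes model with Laplace smoothing, chosen here only for illustration; the text does not name a specific algorithm, and the four training messages are invented.

```python
import math
from collections import Counter

# Tiny hypothetical training set: (message, label).
training = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]


def train(examples):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
```

Here each distinct word acts as one predictor, so the dimensionality of the model grows with the vocabulary; with realistic data this is where feature selection and the risk of overfitting become relevant.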

Application of statistical tools

MS Excel: Statistical analysis and research using Microsoft Excel is a blended learning curriculum in which students gain a good understanding of both theory and practice. This subject serves as a foundation for all analysis and research classes. Instead of relying on academic theory, the material is presented through relevant examples and scenarios that give practical advice on how to use research and analytical skills in business. It may be regarded as a foundation course for those aiming at higher education in research and analytics. About 80 statistical functions are included in the Excel 2003 edition (Toğaçar et al. 2020). Descriptive statistics such as the mean, standard deviation, dispersion, measures of variability and kurtosis may also be calculated using the add-in functions. Excel is especially useful for calculating correlation and the slope and intercept of a regression analysis.
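
For readers who want to check such spreadsheet results independently, the sketch below computes the same kinds of statistics in Python. The five paired observations are invented; the calculations roughly correspond to Excel's AVERAGE, STDEV, CORREL, SLOPE and INTERCEPT worksheet functions.

```python
import math
import statistics

# Hypothetical paired observations, e.g. spend (x) and a performance score (y).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def slope_intercept(xs, ys):
    """Least-squares slope and intercept of y on x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


mean_y = statistics.mean(y)   # arithmetic mean
sd_y = statistics.stdev(y)    # sample standard deviation
slope, intercept = slope_intercept(x, y)
r = pearson_r(x, y)
```

For this toy data the fitted line is approximately y = 2.2 + 0.6x, and the correlation coefficient is about 0.77.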

SPSS: SPSS is often used by academics to gather data. The SPSS data entry interface resembles that of any other spreadsheet program. Variables and quantitative data may be entered and the data file saved. The data in SPSS may also be organised by assigning properties to the various variables. For instance, a variable may be designated as nominal, and that information is stored in SPSS. The next time the file is opened, which may be weeks, months or even years later, it will be clear exactly how the data is structured. The most obvious use of SPSS is to run statistical tests. SPSS has many of the most commonly used statistical tests built in. Once a statistical test is run, all relevant results are shown in the output file.

Figure 4: Application of SPSS

(Source: Osabov, 2019)

Reporting on the initial outcomes

As noted by Singh et al. (2019), Researchfish employs a common outcome structure. The outcome types have been established by funding agencies, and records may be created describing the outputs, outcomes and impacts attributed to research council support. There is a related set of questions for each type of common outcome. For some studies, the main interpretive task cannot be performed until most of the data have been gathered and analysed. For others, the data already exist and the process of interpreting them starts far sooner. The presentation of results varies greatly between disciplines. For example, a thesis in oral history and a thesis in marketing may both employ interview data collected and analysed in the same manner, but the way the findings are presented will be very different because the questions they attempt to answer differ. The results of experimental research will be presented differently again.

Explanation of the decision in question

The approach focuses on the large volumes of operational decisions that a business has to make every day: decisions within operational processes and decisions taken by contact centre agents and other frontline personnel. Unlike strategic and other management decisions, these decisions relate to a single interaction or transaction, such as how to target a single consumer, handle a single claim or price a single sale. The technology stack contains business rules and business rules management systems, advanced analytics and analytic workbenches, and even optimisation technologies. These technologies are combined in decision services. These components deliver automated and continually improving business decisions in a service-oriented form. There are many decision modelling approaches, with firms offering distinct methods. There is also a standard from the Object Management Group, the Decision Model and Notation (DMN). Like the Business Process Model and Notation, this standard is not aimed at methodology but rather at standardising the representation of decision models. This gives users access to a large community and a platform for a wider exchange of knowledge on any suitable form of decision modelling.

  • The most efficient way of defining a decision is to identify a question to be answered and to specify the range of possible answers.
  • Any decision requires information, that is, data, when it is made. This may be transaction data, reference data or any supporting information that the decision refers to.
  • Decision making also calls for knowledge, or know-how, which explains how the decision should be made. It might be based on policies, rules, best practices, domain knowledge or data analysis. It may be expressed precisely in, for example, decision tables and business rules.
  • Decisions may be decomposed into decisions about their components. The answers to the component decisions are information that the parent decision requires. In addition, the information and knowledge needed to make these sub-decisions may be described and decomposed in turn. Even highly complicated decisions may be broken down until they are clearly described.
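
The four points above can be sketched as a small decision model in Python. The decision ("What discount should this single sale receive?"), the thresholds and the decision table are all hypothetical and serve only to show a question with a fixed range of answers, its input information, its knowledge in the form of a decision table, and a sub-decision.

```python
# Hypothetical decision: "What discount should this single sale receive?"
ALLOWED_ANSWERS = {0, 5, 10}  # the range of possible answers, in per cent


def customer_tier(annual_spend: float) -> str:
    """Sub-decision: classify the customer from reference data."""
    return "gold" if annual_spend >= 1000 else "standard"


# Decision table (the knowledge): (tier, large order?) -> discount.
DISCOUNT_TABLE = {
    ("gold", True): 10,
    ("gold", False): 5,
    ("standard", True): 5,
    ("standard", False): 0,
}


def discount(annual_spend: float, order_value: float) -> int:
    """Parent decision: combine the sub-decision's answer with transaction data."""
    tier = customer_tier(annual_spend)
    answer = DISCOUNT_TABLE[(tier, order_value >= 200)]
    if answer not in ALLOWED_ANSWERS:
        raise ValueError("decision table produced an answer outside the allowed range")
    return answer
```

For example, `discount(1500, 250)` combines a "gold" tier with a large order and yields 10, while `discount(500, 50)` yields 0.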

Figure 5: Decision making through data modelling

(Source: GWP, 2018)

Data mining is a logical process that searches through large quantities of data to discover important information (Isaac & Dixon 2017). Its purpose is to find previously unknown patterns. Once these patterns have been found, they may be used to address particular problems (Shaweta 2014). Data analysis helps users to break down, categorise and summarise the relationships identified from a variety of dimensions or angles. The first step of data mining is defining the problem. In this stage, the analyst works with domain experts to define the problem and determine the project goals, identify the key people and learn about current solutions. The problem is expressed in the terminology of the domain. A description of the problem and its restrictions is prepared (Thusoo et al. 2010). The project goals should then be translated into data mining and knowledge discovery (DMKD) goals, and an initial selection of prospective data mining tools may be made. The second phase of data mining is understanding the data. This phase involves collecting sample data and deciding which data, including its format and size, will be needed. If background knowledge exists, some attributes may be ranked as more important (Fawcett & Provost, 2013). Next, the usefulness of the data with respect to the DMKD goals must be verified. The data should be checked for completeness, redundancy, missing values, plausibility of attribute values, and so on. The third stage of data mining is data preparation. This is the key step on which the success of the whole knowledge discovery process depends; it typically accounts for a considerable part of the overall project effort (Foreman, 2013).

Task 2.3: Discussion of data analysis

Every comprehensive statistical study is based on descriptive data analysis. It gives a broad overview of the data. The descriptive statistics represent the performance of the company. Results based on historical data reveal the reasons for past success. A good understanding of the link between cause and effect may help companies reconsider their strategy, and this may enable them to extend their customer base over time. The descriptive analysis may provide the research group with several supporting data points, such as the mean weight or median time, the number of oil changes in a given period, or the frequency distribution. From the facts presented, the research group established that furniture goods had regularly been purchased for 14.6 years and that 38% of gross earnings had purchased them within 14-16 minutes, while just 2% of gross profit had purchased furniture before the beginning of the year 2000. The complete frequency distribution is described here. This representation of performance can appeal to busy prospective customers. This investigation produced a linear regression model: y = 3.534 + 0.647X1 + ε. The regression model shows that big data and performance have a substantial link. The coefficient of determination is 0.734, indicating that the level of big data investment may account for around 73.4 per cent of the variance in performance.


Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       0.908   0.734      0.756               6.54567
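
As a worked illustration of how the reported equation and R Square value are read, the sketch below plugs values into the fitted line. The constants are taken from the text; the function names and the example input of 10 are hypothetical, and the units of the big data investment variable are not stated in the source.

```python
INTERCEPT = 3.534  # constant term reported in the text
SLOPE = 0.647      # coefficient on X1 reported in the text
R_SQUARE = 0.734   # coefficient of determination reported in the text


def predicted_performance(big_data_investment: float) -> float:
    """Point prediction from the fitted line; the error term is omitted."""
    return INTERCEPT + SLOPE * big_data_investment


def explained_share_percent(r_square: float) -> float:
    """Share of the variance in performance explained by the model, in per cent."""
    return round(r_square * 100, 1)


example = predicted_performance(10)           # 3.534 + 0.647 * 10
share = explained_share_percent(R_SQUARE)     # 73.4
```

Each additional unit of big data investment raises the predicted performance score by 0.647, and the remaining share of the variance is left to factors outside the model.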


In this step, we select which data will be used as input to the data mining tools in phase 4. It may involve sampling data, running correlation and significance tests, and cleaning the data, for example checking the completeness of data records and removing or correcting noise (Finlay 2014). The cleaned data may also be further processed by feature selection and extraction algorithms, by deriving new attributes through discretisation, and by summarising the data. The result is new data records that meet the specific input requirements of the DM tools to be used. The fourth phase applies the data mining tools and is typically less time consuming than data preparation. This step involves using the planned data mining tools and selecting new ones. Data mining tools cover a wide range of algorithms, such as Bayesian methods, evolutionary computing, machine learning, neural networks, clustering and preprocessing techniques. It ends by splitting the data into subsets, building data models for these subsets and generating a meta-model from these data models (Madden 2016).
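
The preparation steps described here (cleaning, deriving a new attribute by discretisation and splitting into subsets) can be sketched as follows. The records, the age bands and the helper names are assumptions made for illustration only.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 23, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 47, "spend": 310.0},
    {"id": 4, "age": 35, "spend": None},
    {"id": 5, "age": 61, "spend": 45.0},
]


def clean(rows):
    """Drop records that contain any missing attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]


def discretise_age(age):
    """Derive a categorical attribute from a numeric one."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


def split_by(rows, key):
    """Partition rows into subsets keyed by the given attribute."""
    subsets = {}
    for r in rows:
        subsets.setdefault(r[key], []).append(r)
    return subsets


prepared = [{**r, "age_band": discretise_age(r["age"])} for r in clean(raw)]
subsets = split_by(prepared, "age_band")
```

Dropping incomplete records is only one possible cleaning policy; correcting or imputing values would be an alternative when too much data would otherwise be lost.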

Model        Sum of Squares   df    Mean Square   F        Sig.
Regression   605.334          36    642.076       14.647   0.33
Residual     128.536          266   42.536
Total        734.364          302


This technique treats small data models as input rather than the massive original data, which significantly lowers the overall computing costs. The fifth phase in data mining is the evaluation of the discovered knowledge. This step includes interpreting the results, assessing the novelty of the findings, having domain experts interpret them and checking the impact of the discovered knowledge. Only the approved models (the results of applying many data mining tools) are retained. The whole DMKD process may be revisited to identify which alternative actions could have been taken to improve the results. A list of the errors made in the process is prepared. The last step of data mining is using the discovered knowledge. This step is entirely in the hands of the database owner. It involves planning where and how the discovered knowledge will be used. The application area in the current domain should be extended to other domains. A plan to monitor the implementation of the discovered knowledge should be created and the entire project documented.

Task 2.4: Strategic Recommendations

Big data findings may be used to drive IKEA's performance. While organisations devote considerable energy to analysing customer information and identifying opportunities for adaptation, focusing on improving productivity and delivery is just as important. Data and analytics can do much to reduce waste and streamline company processes (Revels & Nussbaumer, 2013). Various findings emerge from the descriptive and inferential statistics. In this research, the linear regression model was y = 3.534 + 0.647X1 + ε, which indicates that big data has an impact on IKEA's performance. The coefficient of determination is 0.734, showing that around 73.4% of the variation in performance can be explained by the level of big data investment. The management of the organisation should examine the recommendations offered in this document. In particular, the firm should leverage big data to strengthen its market position. For example, relationships in the data may be surfaced through reports and dashboards, and informed managers may be given specific insights for cost assessments, peer benchmarking and evaluation. In addition, the use of surveys to quantify key metrics in areas such as operational excellence, product development and workforce planning may provide insights for understanding complex business conditions (Sen & Sinha 2005). Business analytics may also improve the way organisations attract, retain and develop talent. A consulting group in Asia, for instance, recently decided to undergo a significant restructuring. As part of this work, employees with a high potential for success had to be identified.

Initially, the review committee streamlined information such as know-how, training, experience, age, marital status and socioeconomic position. After processing the aggregated information, the group was able to distinguish the employee profiles that had the highest chance of success in particular roles. The review also identified the key roles that had the biggest influence on the overall growth of the organisation. As a result, the organisation was restructured around these key roles and capabilities. Supply chains are excellent places to look for significant opportunities and advantages, given their complex nature and, above all, their large contribution to a business's cost structure. Managers may then dive deeply into specific improvement opportunities, such as inventory management, channel management, procurement and logistics. Big data should also be used to mitigate risk. Structured data, such as databases, and unstructured data, such as websites, blogs and social media channels, pose a great risk to organisations today. Organisations may put themselves in a better position to assess and forecast risk by using risk analytics. Administrators must treat sound analysis as a broad-based process, and techniques should be developed to draw data from multiple levels and functions of the organisation into a single central platform. By establishing a common framework for assessing and managing risk, companies will strengthen the consideration of risk in their core decision-making processes and be better able to foresee possible scenarios.

Reference List

Barrett, D. and Twycross, A., 2018. Data collection in qualitative research.

Csáky, R., Purgai, P. and Recski, G., 2019. Improving neural conversational models with entropy-based data filtering. arXiv preprint arXiv:1905.05471.

Ding, F., Wang, Y., Dai, J., Li, Q. and Chen, Q., 2017. A recursive least squares parameter estimation algorithm for output nonlinear autoregressive systems using the input-output data filtering. Journal of the Franklin Institute, 354(15), pp.6938-6955.

Li, M. and Liu, X., 2018. The least-squares based iterative algorithms for parameter estimation of a bilinear system with autoregressive noise using the data filtering technique. Signal Processing, 147, pp.23-34.

Li, M. and Liu, X., 2020. Maximum likelihood least squares based iterative estimation for a class of bilinear systems using the data filtering technique. International Journal of Control, Automation and Systems, 18(6), pp.1581-1592.

Lin, H., Yan, Z., Chen, Y. and Zhang, L., 2018. A survey on network security-related data collection technologies. IEEE Access, 6, pp.18345-18365.

Liu, L., Ding, F., Xu, L., Pan, J., Alsaedi, A. and Hayat, T., 2019. Maximum likelihood recursive identification for the multivariate equation-error autoregressive moving average systems using the data filtering. IEEE Access, 7, pp.41154-41163.

Nemati, S., Holder, A., Ramzi, F., Stanley, M.D., Clifford, G.D. and Buchman, T.G., 2018. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Critical care medicine, 46(4), p.547.

Pan, J., Ma, H., Jiang, X., Ding, W. and Ding, F., 2018. An adaptive gradient-based iterative algorithm for multivariable controlled autoregressive moving average systems using the data filtering technique. Complexity, 2018.

Roh, Y., Heo, G. and Whang, S.E., 2019. A survey on data collection for machine learning: a big data-ai integration perspective. IEEE Transactions on Knowledge and Data Engineering.

Shi, K., Wang, J., Tang, Y. and Zhong, S., 2020. Reliable asynchronous sampled-data filtering of TS fuzzy uncertain delayed neural networks with stochastically switched topologies. Fuzzy Sets and Systems, 381, pp.1-25.

Singh, A.R., Rohr, B.A., Gauthier, J.A. and Nørskov, J.K., 2019. Predicting chemical reaction barriers with a machine learning model. Catalysis Letters, 149(9), pp.2347-2354.

Thabtah, F. and Peebles, D., 2020. A new machine learning model based on induction of rules for autism detection. Health informatics journal, 26(1), pp.264-286.

Toğaçar, M., Ergen, B., Cömert, Z. and Özyurt, F., 2020. A deep feature learning model for pneumonia detection applying a combination of mRMR feature selection and machine learning models. IRBM, 41(4), pp.212-222.

Thusoo, A. et al., 2010. Data warehousing and analytics infrastructure at Facebook. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp.1013-1020.

Shaweta, 2014. A review on designing of distributed data warehouse and new trends in distributed data warehousing. International Journal of Computer Science and Information Technologies, 5(2), pp.1692-1695.

Isaac, W. and Dixon, A., 2017. Why big-data analysis of police activity is inherently biased. The Conversation.

Fawcett, T. and Provost, F., 2013. Data science for business. O’Reilly Media Inc.

Foreman, J., 2013. Data Smart: Using Data Science to Transform Information into Insight. John Wiley & Sons.

Finlay, S., 2014. Predictive Analytics, Data Mining and Big Data (Business in the Digital Economy). Palgrave Macmillan.

Revels, M. and Nussbaumer, H., 2013. Data mining and data warehousing in the airline industry. Academy of Business Research Journal, 3, pp.69-82.

Madden, S., 2016. Mesa takes data warehousing to new heights. Communications of the ACM, 59(7), p.116.

Sen, A. and Sinha, A., 2005. A comparison of data warehousing methodologies. Communications of the ACM, 48(3), pp.79-84.

