1) Title: Study on Data Mining and Big Data
• Methodology: Algorithms
• Description: Data mining comprises several algorithms that fall into four different categories (Shobana et al. 2015):
Association Rule
Clustering
Classification
Regression
Association algorithms are used to search for relationships between variables. They are applied when searching for frequently visited items. In short, association algorithms establish relationships among objects.
Clustering algorithms are used to discover structures and groups in the data, i.e., they determine which group each data point belongs to.
Classification algorithms deal with associating unknown structures to known structures.
Regression algorithms find functions to model the data (Shobana et al. 2012).
3) Title: Big Data analytics in healthcare: promise and potential
• Methodology: Questionnaire (Groves et al. 2014)
• Methodology: Modelling, Algorithms
• Description: Association algorithms in data mining are used to search for relationships between variables. They are applied when searching for frequently visited items.
Association algorithms and predictive modelling can analyse the buying habits of pregnant women and identify products that serve as indicators that a person is pregnant (Mayer-Schönberger & Cukier 2014). Retail companies can use that information to market those products to pregnant women.
5) Title: Critical Questions for Big Data (Boyd & Crawford 2012)
• Methodology: Focus groups, Case Study (Boyd & Crawford 2012)
The K-Means algorithm is used for cluster analysis by dividing data points into k clusters. It groups the data into clusters based on feature similarity.
The brief idea is that roughly half of the data is clustered through hierarchical clustering, and K-means then clusters the remainder. In order to create super-rules, hierarchical clustering is terminated when it generates the largest number of clusters.
The data used in the classification process is partitioned into two disjoint sets: a training set and a test set. The training set contains records with predefined class labels; normally, the class tag comes from prior experiential data. Each record can be represented as (a1, a2, …, an; c), where ai is an attribute value and c represents the class. The class tags of the test data are unknown, but the classes these data belong to can be predicted. As shown in Figure 5.1, a classification model can be considered a black box that automatically assigns a class tag when an attribute set of unknown class is provided. The classification step in data mining therefore consists of two phases: a training (learning) phase, in which a model is built from the labelled training set, and a prediction (testing) phase, in which the model assigns class tags to unseen records.
This chapter discusses the essential ideas of the Page Rank algorithm, analyzes its computational formula, and then mentions some problems related to the algorithm. With the rapid development of the World Wide Web, users face the problem of retrieving useful information from a large amount of disordered and scattered information. Current search engines cannot fully satisfy users' need for high-quality information search services; the most classic web structure mining algorithm addressing this is the Page Rank algorithm. Page Rank is based on the concept that if a page has important links pointing towards it, then its links towards other pages are also to be considered important. The algorithm calculates the importance of web pages using the link structure of the web. This approach refines the idea of simply counting in-links equally by normalizing by the number of links on a page when distributing rank scores. Therefore, Page Rank (a numeric value that represents how important a page is on the web) takes back links into account and propagates the ranking through links: a page has a high rank if the sum of the ranks of its back links (in-links) is high. Page Rank is one of the methods the search engine Google uses to determine the importance or relevance of a web page.
Over the past few decades, the generation and availability of information in cyberspace has increased enormously. There is a pressing need for solutions that help filter relevant data out of this disorganised mass, so that users can select the most suitable data from the available collection. Many strategies have been developed that assist in the selection of relevant information for the user. Applications on the internet make searching more convenient for users by incorporating recommender systems, which help filter unwanted information, predict the needs and preferences of users (Long, Zhang, & Hu, 2011), and provide suggestions to the users. Compared to other fields of information systems, recommender systems are a relatively new field, as they initially formed part of information retrieval and the management sciences.
This chapter gives an overview of Association Rule Mining. It presents the importance of Market Basket Analysis and its usefulness in increasing the sales of a supermarket. The chapter also provides an overview of the data mining process used in market basket analysis and the proposed approaches. The works of a few researchers are cited and used as evidence to support the ideas explained in the thesis. Every such source is listed in the reference section of this thesis.
The main aim of any industry today is to pay attention to the multiple opportunities that exist for improvement. The mining industry's links to its primary resources multiply with its functions, whereas in other industries there are layers of processes between the primary resources and the final product (McKay, 2009). As described by Carroll & Buchholtz (2014), the ethics linked with sustainable extraction are currently built around two key concepts: corporate social responsibility (CSR) and transparency. Transparency initiatives focus on disclosure of revenue transactions between the public and private sectors within extractive industry projects. Corporate social responsibility, on the other hand, focuses on enhancing the association between communities and companies (Carroll & Buchholtz, 2014). The push for transparency has triggered legislative activity and advocacy in the UK, the U.S. and Canada, which are the host markets for the majority of global mining shares. Companies are employing greater resources and staff to ensure that the benefits of mining development reach communities as improved education, infrastructure and services (Hsieh, 2006).
It is a well-known fact that humans can efficiently recognize patterns. Some people who work at Google have highlighted that backlinks, keywords, title tags and meta descriptions are useful factors for sorting and ranking websites. However, recognizing such patterns at massive scale is something humans cannot easily do. Machines, on the other hand, are extremely efficient at gathering data; unlike humans, however, they cannot as easily recognize how certain patterns fit into the overall big picture, or understand what that picture means.
In the development of web search, link analysis, the analysis of hyperlinks and the graph structure of the Web, has been helpful; it is one of the factors web search engines consider when computing a composite rank for a web page on a given user query. The directed graph configuration is known as the web graph. Several algorithms are based on link analysis; the important ones are Hypertext Induced Topic Search (HITS), Page Rank, Weighted Page Rank, and Weighted Page Content Rank.
To the programming community, the algorithms described in this chapter are known as feature selection algorithms. This subject has been examined by researchers for decades, and a large number of methods have been proposed. The terms attribute and feature are interchangeable and refer to the predictor values throughout this chapter and the remainder of the thesis. In this way of thinking, dimensionality reduction techniques are typically made up of two basic components [21], [3], [9].
The key objective of any data mining activity is to find as many unsuspected relationships in the obtained data sets as possible, in order to better understand how the data and its relationships are useful to the data owner. The potential of knowledge discovery using data mining is huge, and data mining has been applied in many different knowledge areas: in large corporations to optimize marketing strategies, and at smaller scale in medical research, where data mining is used to find relationships between patients' data and the corresponding prescriptions and symptoms.
The data is derived from health-care systems, clinical trials, real-time monitoring, and other sources. Machine learning algorithms are used to predict the required patterns. New knowledge is discovered through causal relationships, which can then feed into knowledge-based systems. The biggest challenge in making predictive analysis operational is that predictions are often wrong, correct predictions are difficult to attain, and the cost of prediction is high.
Problems like this relate to the usage of the Web. Hence, there is a need for cleaning and structuring Web log data, which is the data preprocessing part of Web Usage Mining [3]. Data preprocessing plays a vital role because Web log data is by nature redundant and often irrelevant [4]. Thus, data preprocessing is a basic and essential part of Web-page recommendation. This paper is structured as follows: Section 2 comprises a review of Web page recommendation. Section 3 clarifies the categorization of recommendation systems and web mining, and discusses how data preprocessing relates to Web page recommendation. Section 4 illustrates data preprocessing and its steps. Section 5 provides a comparative analysis of the data preprocessing techniques used; finally, Section 6 gives the conclusion.
Classification is the process by which a computer relates a subject to a category. To explain this concept, Stephen Marsland writes: "…consider a vending machine, where we use a neural network to learn to recognize different coins" (Machine Learning, Section 1.4). The computer learns by analyzing large amounts of data and then categorizing it. This is how a computer system can identify a certain type of illness or disease to assist medical staff. In addition, supervised learning can also utilize regression.
Deng ZhiHong, Wang ZhongHui and Jiang JiaJian, 'A new algorithm for fast mining frequent itemsets using N-lists', Science China Press and Springer-Verlag, Berlin, Heidelberg, 2012.