Hadoop for Big data

What is Hadoop?

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large datasets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

alt text

Ability to store and process huge amounts of any kind of data, quickly.
Computing power.
Fault tolerance.
Flexibility
Low cost.
Scalability.

What is MapReduce

MapReduce – a parallel processing software framework. It is comprised of two steps. Map step is a master node that takes inputs and partitions them into smaller subproblems and then distributes them to worker nodes. After the map step has taken place, the master node takes the answers to all of the subproblems and combines them to produce output

alt text

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. It relaxes a few POSIX requirements to enable streaming access to file system data.

alt text

Set up with Hadoop

1.Download the platform Virtual Box from Cloudera VMs:

link:Cloudera downloads

2.Select “Virtual Box” and then click on “get it now”. And it will ask you to fill out some information about you or sign in. After that, agree with the contract. And it will start downloading the Virtual Box.

alt text

3.Unzip the file, you will see two file under the folder

alt text

4.Download the VirtualBox

VirtualBox select OS X hosts for mac user.

alt text

This is the VirtualBox interface looklike: alt text

5.Then slecte “File” –> “import Application” to import “cloudera-quickstart-vm-5.12.0-0-virtualbox.ovf” alt text

How to Install RStudio Server on CentOS

(RStudio Server is the web edition of RStudio which is a series of tools designed to facilitate the coding job using the R programming language.)

Step 1: Update the system

sudo yum install epel-release
sudo yum update
sudo shutdown -r now

Step 2: Install R

sudo yum install R -y

Step 3: Install RStudio Server

Install the latest stable release of RStudio Server.

cd
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.136-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-1.0.136-x86_64.rpm -y

Note: find the latest release of RStudio Server from this webset: official download page

Step 4: Access RStudio Server from a web browser

sudo firewall-cmd --permanent --zone=public --add-port=8787/tcp
sudo firewall-cmd --reload

Project abstract

Data source: Breast Cancer Wisconsin (Diagnostic) Data Set This data source is use to predict whether the cancer is benign or malignant

Build Prediction Model in R

Load Data:

require(caret)
require(mlbench)

# load the data
cancerData <- read.csv("data.csv", sep = ",", header = T)

Feature selection:

set.seed(123)

# calculate correlation matrix
correlationMatrix <- cor(cancerData[,3:32])

# find attributes that are highly corrected (ideally >0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes
print(highlyCorrelated)

# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm
results <- rfe(cancerData[,3:32], cancerData[,2], sizes=c(3:32), rfeControl=control)

# summarize the results
print(results)

The list of chosen features and Plot of features by accuracy

predictors(results)
plot(results, type=c("g", "o"))

Subsetting the data using the selected features

features <- predictors(results)
newdata <- cancerData[, features]
newdata$diagnosis <- cancerData$diagnosis

Partition the data into training and testing being 70% and 30% resoectively.

inTrain <- createDataPartition(y = newdata$diagnosis ,
                               p=0.7, list=FALSE)
training <- newdata[inTrain,]
testing <- newdata[-inTrain,]
dim(training)

Generalized Linear Model

set.seed(323)
modelFit1 <- train(diagnosis ~.,data=training, preProcess = c("center", "scale"), method="glm")
pred1 <- predict(modelFit1,newdata=testing)
confusionMatrix(pred1,testing$diagnosis)

MapReduce Simulation

Run MapReduce Locally.

 $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'dfs[a-z.]+'

Web Application

Support or Contact

Having trouble with Pages? Check out our documentation or contact support and we’ll help you sort it out.

Hadoop

Big data program