Wednesday 29 October 2014

Calling a SQL Stored Procedure from R


The following lines in R call a stored procedure in SQL Server:

library(RODBC)

# connect with the SQL Server ODBC driver (replace the placeholders with your own values)
conn <- odbcDriverConnect('driver={SQL Server};server=HostName;database=DatabaseName;uid=userName;pwd=Password')
query <- "exec dbo.R_getData"
res <- sqlQuery(conn, query)
odbcClose(conn)   # close the connection when done

Monday 30 June 2014

Enable Oozie Workflow for the New MapReduce API

If you are using the new MapReduce API and want to implement a DAG with an Oozie workflow, you may face the exception below:

java.lang.RuntimeException: Error in configuring object
 at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.jav
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
 ... 9 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class TestMapper not org.apache.hadoop.mapred.Mapper
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
 
Caused by: java.lang.RuntimeException: class TestMapper not org.apache.hadoop.mapred.Mapper
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:893)
 ... 16 more

To solve the above issue, we need to add two properties to the <configuration> element of the map-reduce action in workflow.xml:

<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>
<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>


Now replace the workflow.xml in HDFS with the updated one, and the issue will be resolved.
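For example, you can push the updated file and rerun the job from the command line (a minimal sketch; the HDFS application path and the Oozie server URL are placeholders, not values from this post):

hadoop fs -rm /user/hadoop/apps/myWorkflow/workflow.xml
hadoop fs -put workflow.xml /user/hadoop/apps/myWorkflow/
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run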



Thursday 15 May 2014

Accessing HBase Data Using Hive Query Language (with Probable Exceptions)


1. Create a table in HBase

hbase(main):001:0> create 'hbaseTable','cf1'
0 row(s) in 1.4830 seconds

2. Insert data into the table

hbase(main):002:0> put 'hbaseTable','row1','cf1:name','giri'
0 row(s) in 0.0800 seconds

hbase(main):003:0> put 'hbaseTable','row2','cf1:name','Anamika'

0 row(s) in 0.0070 seconds

3. Scan the table data

hbase(main):004:0> scan 'hbaseTable'
ROW                   COLUMN+CELL                                               
 row1                 column=cf1:name, timestamp=1400133482419, value=giri      
 row2                 column=cf1:name, timestamp=1400133502249, value=Anamika   
2 row(s) in 0.0360 seconds

4. Now we need to add the jar files below to Hive:

guava-11.0.2.jar
hive-hbase-handler-0.10.0.24.jar
hbase-0.94.5.jar
zookeeper-3.4.5.23.jar

There are a number of ways to do this.

One way is to set HIVE_AUX_JARS_PATH (a colon-separated list of jar paths) before starting Hive:

export HIVE_AUX_JARS_PATH=/usr/lib/guava-11.0.2.jar:/usr/lib/hive-hbase-handler-0.10.0.24.jar:...... remaining jars

The other way is to add the jars directly in the Hive console:

hive> add jar /usr/lib/hbase/lib/guava-11.0.2.jar;
Added /usr/lib/hbase/lib/guava-11.0.2.jar to class path
Added resource: /usr/lib/hbase/lib/guava-11.0.2.jar

..... add the remaining jars the same way.

Now create the Hive table using the syntax below:

hive> CREATE TABLE hiveTable(key int, name string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name")
TBLPROPERTIES ("hbase.table.name" = "hbaseTable");

Now you can use HiveQL to query the HBase data.
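For example, the rows inserted above can be read back with an ordinary query:

hive> select * from hiveTable;

(Since the sample row keys 'row1' and 'row2' are not numeric, the int key column may come back as NULL; map :key to a string column if your row keys are not numbers.)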

Troubleshooting:

You may get the exception below:

java.lang.ClassNotFoundException: org.apache.hadoop.hbase.MasterNotRunningException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
… 21 more 

or you may get other ZooKeeper-related issues.

You can resolve the above issues by setting the two properties below at the Hive prompt:

set hbase.zookeeper.quorum=your zookeeper nodes;

set zookeeper.znode.parent=/hbase-unsecure;

(The parent znode varies by distribution; /hbase-unsecure is common on Hortonworks clusters, while plain /hbase is the usual default elsewhere.)
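For example, with a three-node ZooKeeper ensemble (the hostnames here are placeholders, not values from this post):

hive> set hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;
hive> set zookeeper.znode.parent=/hbase-unsecure;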


Thursday 20 February 2014

Tracking All the Background Processes in a Shell Script

Let's assume we need to write a shell script in which some statements run in parallel, and a few more statements must run only after those finish.

For example, in the script below, I want cmd3 to execute only after cmd1 and cmd2 have finished:

cmd1 &
cmd2 &
cmd3

This scenario may look simple, but it is not.

When statements run with the & operator, they execute in the background, so the shell moves straight on to the next statements instead of waiting for them.

I solved this problem with the script below: an infinite loop tracks the background processes, and only once they are all done does it run the final statement.

cmd1 &
cmd2 &
while true
do
    # jobs lists this shell's background jobs; when none are Running, we are done
    if [ `jobs | grep Running | wc -l` -eq 0 ]; then
        cmd3
        break
    fi
    sleep 1   # avoid spinning the CPU while polling
done
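For reference, the shell's built-in wait command solves the same problem without polling; this is a minimal sketch, assuming a plain bash script with no other background jobs of its own:

cmd1 &
cmd2 &
wait    # blocks until all background jobs of this shell have finished
cmd3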


Wednesday 19 February 2014

Checking Oracle Connection Status Using a Shell Script

We can check the Oracle connection status in a number of ways using a shell script.

Here is one of those ways. Whenever we connect to an Oracle database successfully, sqlplus produces a success message containing "Connected to" as a substring, so I check for this string using the grep command:


if sqlplus schemaname/password@databasename < /dev/null | grep 'Connected to'; then

    echo "Database Connection is OK .......Starting Export Process ...."

else

    echo "Database Connection is not successful .."

    exit 1

fi

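A slightly more defensive variant (a sketch, not from the original post) runs sqlplus with -L so that a bad login fails immediately instead of re-prompting, and with -s to suppress the banner, then checks the output for an ORA- error:

if echo "exit" | sqlplus -L -s schemaname/password@databasename 2>&1 | grep -q 'ORA-'; then
    echo "Database Connection is not successful .."
    exit 1
fi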


Monday 17 February 2014

Setting up Passwordless SSH

SSH is often used to log in from one system to another without requiring passwords. This is useful when you run a cluster, so that the machines do not ask you for a password again and again.


Steps:

1. First log in on Host1 as user user1 and generate a pair of authentication keys.
    Do not enter a passphrase:

    user1@Host1:~> ssh-keygen -t rsa


2. Now use ssh to create a directory ~/.ssh as user user2 on Host2.
    (If the directory already exists, that is fine.)

    user1@Host1:~> ssh user2@Host2 mkdir -p .ssh

    user2@Host2's password:

3. Finally, append Host1's new public key to user2@Host2:.ssh/authorized_keys and
    enter Host2's password one last time:

    user1@Host1:~> cat ~/.ssh/id_rsa.pub | ssh user2@Host2 'cat >> .ssh/authorized_keys'

    user2@Host2's password:

    Now you can log in to Host2 as user2 from the Host1 machine:

    user1@Host1:~> ssh user2@Host2 hostname

    NOTE: If you face any issue while logging in, please make the below changes:
    •    Put the public key in .ssh/authorized_keys2
    •    Change the permissions of .ssh to 700
    •    Change the permissions of .ssh/authorized_keys2 to 640
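
For reference, many systems also ship an ssh-copy-id helper that performs steps 2 and 3 in a single command (a sketch; it assumes the key pair from step 1 already exists):

    user1@Host1:~> ssh-copy-id user2@Host2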