Thursday, June 25, 2015


Joins in Hadoop

You are insane to do this with raw MapReduce. Use Pig, Hive, or Spark and perform a join. This will take you less than ten minutes, including the time to download and install Pig or Hive and run them on your data. For example, see http://pig.apache.org/docs/r0.15.0/basic.html#join-inner
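
For instance, joining the edge list and partition files from the thread below on their vertex id in Spark could look roughly like the sketch here. This is an illustration, not code from the thread: the input and output paths and the whitespace splitting are assumptions, and error handling is omitted.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Rough sketch: join an edge list ("src dst") with a partition map ("vertex partition") on vertex id.
public class EdgePartitionJoin {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("edge-partition-join"));

        JavaPairRDD<String, String> edges = sc.textFile("hdfs:///data/edges")
                .mapToPair(line -> {
                    String[] p = line.split("\\s+");
                    return new Tuple2<>(p[0], p[1]);          // (src, dst)
                });

        JavaPairRDD<String, String> partitions = sc.textFile("hdfs:///data/partitions")
                .mapToPair(line -> {
                    String[] p = line.split("\\s+");
                    return new Tuple2<>(p[0], p[1]);          // (vertex, partitionId)
                });

        // Yields (vertex, (dst, partitionId)) for every edge whose source has a partition assignment.
        edges.join(partitions).saveAsTextFile("hdfs:///data/joined");

        sc.stop();
    }
}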

For curiosity's sake, check out this join implementation in Python: https://github.com/bd4c/big_data_for_chimps-code/blob/master/examples/ch_07/join.py


Using Java and the MapReduce APIs to solve this problem is an exercise in pure futility, unless you're doing it to learn, in which case these links should help.

My book (with Flip Kromer), Big Data for Chimps, covers joins in Pig and Python: https://github.com/infochimps-labs/big_data_for_chimps/blob/master/Ch07-joining_patterns.asciidoc

It is due out in a few weeks.

On Wednesday, June 24, 2015, Harshit Mathur wrote:
So basically you want <vertex_id, partitionId> as your key?
If that is the case, you can build a custom key object by implementing WritableComparable.

But I am not sure this can be done in a single MapReduce job. As I understand your problem, what you want to achieve will take two jobs.
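
For what it's worth, a minimal sketch of such a composite key is below. The class name and field widths (a long vertex id and an int partition id) are assumptions for illustration; any WritableComparable with an empty constructor and consistent write/readFields/compareTo works the same way.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite <vertexId, partitionId> key; names and types are illustrative.
public class VertexPartitionKey implements WritableComparable<VertexPartitionKey> {
    private long vertexId;
    private int partitionId;

    public VertexPartitionKey() {}                       // required no-arg constructor

    public VertexPartitionKey(long vertexId, int partitionId) {
        this.vertexId = vertexId;
        this.partitionId = partitionId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(vertexId);
        out.writeInt(partitionId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        vertexId = in.readLong();
        partitionId = in.readInt();
    }

    @Override
    public int compareTo(VertexPartitionKey other) {
        int cmp = Long.compare(vertexId, other.vertexId);
        return (cmp != 0) ? cmp : Integer.compare(partitionId, other.partitionId);
    }

    @Override
    public int hashCode() {                              // used by the default HashPartitioner
        return (int) (vertexId * 163 + partitionId);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof VertexPartitionKey)) return false;
        VertexPartitionKey k = (VertexPartitionKey) o;
        return vertexId == k.vertexId && partitionId == k.partitionId;
    }
}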

On Thu, Jun 25, 2015 at 10:06 AM, Ravikant Dindokar wrote:
But in the reducer for Job 1 you have:
vertexId => {complete adjacency list, partitionId}
so the partition ids for the vertices in the adjacency list are not available. So essentially the output I am trying to get is

<vertex_id, partitionId>, <list>
where each element of the list is of type <vertex_id, partitionId>.

Can this be achieved in a single MapReduce job?

Thanks
Ravikant




On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur wrote:
Yeah, you can store it in your custom object as well, the same way you are storing the adjacency list.

On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar wrote:
Hi Harshit,

Is there any way to retain the partition id for each vertex in the adjacency list?


Thanks

Ravikant

On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar wrote:
Thanks Harshit

On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur wrote:
Hi,


This may be the solution (I hope I understood the problem correctly):

Job 1:

You need to have two mappers, one reading from the edge file and the other reading from the partition file.
Say, EdgeFileMapper and PartitionFileMapper, with a common reducer.
Now you can have a custom Writable (say GraphCustomObject) holding the following (see the sketch after this list):
1) type: a flag recording which mapper the record came from
2) adjacency vertex list: the list of adjacent vertices
3) partition id: to hold the partition id
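
One possible shape of such a GraphCustomObject, sketched under assumptions the thread does not state (long vertex ids, an int partition id, a byte type flag):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Illustrative sketch of the GraphCustomObject described above; field types are assumptions.
public class GraphCustomObject implements Writable {
    public static final byte FROM_EDGE_FILE = 0;
    public static final byte FROM_PARTITION_FILE = 1;

    private byte type;                                        // which mapper emitted the record
    private List<Long> adjacencyList = new ArrayList<Long>(); // filled by the EdgeFileMapper
    private int partitionId;                                  // filled by the PartitionFileMapper

    public byte getType() { return type; }
    public void setType(byte type) { this.type = type; }
    public int getPartitionId() { return partitionId; }
    public void setPartitionId(int partitionId) { this.partitionId = partitionId; }
    public List<Long> getAdjacencyList() { return adjacencyList; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(type);
        out.writeInt(adjacencyList.size());
        for (long v : adjacencyList) {
            out.writeLong(v);
        }
        out.writeInt(partitionId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        type = in.readByte();
        adjacencyList.clear();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            adjacencyList.add(in.readLong());
        }
        partitionId = in.readInt();
    }
}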

Now the output key and value of the EdgeFileMapper will be:
key => vertexId
value => {type=edgefile; adjacencyVertex; partitionId=0 (not present in this file)}

The output of the PartitionFileMapper will be:
key => vertexId
value => {type=partitionfile; adjacencyVertex=empty; partitionId}
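
To make that concrete, here is one way the two mappers could look. This is a hedged sketch rather than code from the thread: to stay self-contained it tags plain Text values with an "E:" or "P:" prefix instead of serializing the GraphCustomObject, and it assumes both files are whitespace-separated text with the vertex id in the first column.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads the edge file ("src dst") and emits src -> "E:dst".
public class EdgeFileMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final LongWritable outKey = new LongWritable();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().trim().split("\\s+");
        if (parts.length < 2) return;                 // skip malformed lines
        outKey.set(Long.parseLong(parts[0]));
        outValue.set("E:" + parts[1]);                // tag: came from the edge file
        context.write(outKey, outValue);
    }
}

// Reads the partition file ("vertex partition") and emits vertex -> "P:partition".
class PartitionFileMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final LongWritable outKey = new LongWritable();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().trim().split("\\s+");
        if (parts.length < 2) return;
        outKey.set(Long.parseLong(parts[0]));
        outValue.set("P:" + parts[1]);                // tag: came from the partition file
        context.write(outKey, outValue);
    }
}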


So in the reducer, for each vertexId we can have the complete GraphCustomObject populated:
vertexId => {complete adjacency list, partitionId}

The output of this reducer will be:
key => partitionId
value => {adjacencyVertexList, vertexId}
This will be stored as the output of Job 1.
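
Continuing the same sketch, a reducer that performs the join and a driver that wires both mappers to one job with MultipleInputs could look like this. EdgeFileMapper and PartitionFileMapper refer to the sketch above, and the paths come from the command line; everything else is an assumption for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Joins the tagged records of one vertex and emits partitionId -> "vertexId <TAB> adjacencyList".
public class VertexJoinReducer extends Reducer<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(LongWritable vertexId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> adjacency = new ArrayList<String>();
        int partitionId = -1;
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("E:")) {
                adjacency.add(v.substring(2));                      // adjacency entry from the edge file
            } else if (v.startsWith("P:")) {
                partitionId = Integer.parseInt(v.substring(2));     // from the partition file
            }
        }
        if (partitionId < 0) return;                                // vertex missing from the partition file
        context.write(new IntWritable(partitionId),
                new Text(vertexId.get() + "\t" + String.join(",", adjacency)));
    }

    // Job 1 driver: two mappers (from the sketch above) feeding one reducer via MultipleInputs.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "vertex-partition-join");
        job.setJarByClass(VertexJoinReducer.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, EdgeFileMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, PartitionFileMapper.class);
        job.setReducerClass(VertexJoinReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}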

Job 2:
This job will read the output generated by the previous job and use an identity mapper, so in the reducer we will have:
key => partitionId
value => a list of all the adjacency lists along with their vertexIds
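
And a possible sketch of Job 2, assuming Job 1 wrote tab-separated text as above so that KeyValueTextInputFormat hands the partition id to the identity Mapper as the key; the reducer then sees every vertex of a partition in one call. Class and path names are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Groups all "vertexId <TAB> adjacencyList" records of one partition in a single reduce call.
public class PartitionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text partitionId, Iterable<Text> vertices, Context context)
            throws IOException, InterruptedException {
        for (Text vertexWithAdjacency : vertices) {
            // All vertices of this partition arrive here; run the per-partition analytics
            // instead of simply echoing the records back out.
            context.write(partitionId, vertexWithAdjacency);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group-by-partition");
        job.setJarByClass(PartitionReducer.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);  // key = partitionId, value = the rest
        job.setMapperClass(Mapper.class);                        // identity mapper
        job.setReducerClass(PartitionReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // output of Job 1
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}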



I know my explanation seems a bit messy, sorry for that.

BR,
Harshit








On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar wrote:
Hi Hadoop user,

I want to use Hadoop to perform operations on graph data.
I have two files:

1. Edge list file
        This file contains one line for each edge in the graph.
sample:
1    2 (here 1 is the source and 2 is the sink node of the edge)
1    5
2    3
4    2
4    3
5    6
5    4
5    7
7    8
8    9
8    10

2. Partition file :
         This file contains one line for each vertex. Each line has two values: the first number is the <vertex id> and the second is the <partition id>.
 sample : <vertex id>  <partition id >
2    1
3    1
4    1
5    2
6    2
7    2
8    1
9    1
10    1


The edge list file is 32 GB, while the partition file is 10 GB.
(The sizes are so large that a mapper could at best read only the partition file. I have a 20-node cluster with 24 GB of memory per node.)

My aim is to get all vertices (along with their adjacency lists) that have the same partition id into one reducer, so that I can perform further analytics on a given partition in the reducer.

Is there any way in Hadoop to join these two files in the mapper, so that I can map based on the partition id?

Thanks

Ravikant



--
Harshit Mathur





--
Harshit Mathur




--
Harshit Mathur


--
Russell Jurney

Hadoop doesn't work after restart

Try running fsck

On Wed, Jun 24, 2015 at 2:54 PM, Ja Sam wrote:
I had a running Hadoop cluster (version 2.2.0.2.0.6.0-76 from Hortonworks). Yesterday a lot of things happened, and at some point we decided to reboot all datanodes one by one. Unfortunately, the operator did not monitor the namenode health.
The result of the above operation is that all datanodes show as dead nodes and all blocks are lost.
We decided to reboot one datanode once again to see if it would log anything interesting. The log finished with:
INFO  ipc.Server (Server.java:run(861)) - IPC Server Responder: starting
INFO  ipc.Server (Server.java:run(688)) - IPC Server listener on 8010: starting
and it hangs there. At the same time, on the namenode I can see only two types of messages:
INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2805)) - DIR* completeFile: [SOME PATH] is closed by DFSClient_NONMAPREDUCE_288661168_33  
and a lot of:
WARN  blockmanagement.BlockManager (PendingReplicationBlocks.java:pendingReplicationCheck(249)) - PendingReplicationMonitor timed out blk_1074405820_668233  
Today we decided to restart the namenode and all datanodes. After the restart, the website http://[server]:50070/dfshealth.jsp answers VERY slowly. I don't see any errors in the logs except 5 like the one below:
 ERROR datanode.DataNode (DataXceiver.java:run(225)) - maelhd21:50010:DataXceiver error processing WRITE_BLOCK operation  src: /node1:33470 dest: /node3:50010  
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1037132819-192.168.61.196-1409328081083:blk_1075994366_2257020 already exists in state FINALIZED and thus cannot be created.
3 out of 5 nodes show as live, but refreshing the Hadoop status page takes more than 10 minutes.
The question of course is: what should I check or do now?


Friday, June 19, 2015


copy data from one hadoop cluster to another hadoop cluster + can't use distcp

You can use node.js for this.

On Tue, Jun 23, 2015 at 8:15 PM, Divya Gehlot wrote:
Can you please elaborate more?
On 20 Jun 2015 2:46 pm, "SF Hadoop" wrote:
Really depends on your requirements for the format of the data.

The easiest way I can think of is to "stream" batches of data into a pub/sub system that the target system can access and consume from.

Verify each batch and then ditch them. 

You can throttle the size of the intermediary infrastructure based on your batches. 

Seems the most efficient approach.
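
For what it's worth, a very rough sketch of that idea with Kafka as the pub/sub layer is below. The broker address, topic name, and HDFS path are made-up placeholders, and the second cluster would run a matching consumer that verifies each batch and writes it into its own HDFS.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Reads files from the source cluster's HDFS and publishes each line to a Kafka topic
// that the second cluster is allowed to consume from. Broker, topic, and path are examples.
public class HdfsToPubSub {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-in-dmz:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Configuration conf = new Configuration();     // picks up the source cluster's core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // close() on the producer flushes any buffered records before the JVM exits.
        try (KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props)) {
            for (FileStatus file : fs.listStatus(new Path("/data/to-transfer"))) {
                try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(fs.open(file.getPath())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Key by source file name so the consumer can verify and reassemble batches.
                        producer.send(new ProducerRecord<String, String>(
                                "cluster-transfer", file.getPath().getName(), line));
                    }
                }
            }
        }
    }
}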

On Thursday, June 18, 2015, Divya Gehlot wrote:
Hi,
I need to copy data from a first Hadoop cluster to a second Hadoop cluster.
I can't access the second Hadoop cluster from the first Hadoop cluster due to a security issue.
Can anyone point me to how I can do this, apart from the distcp command?
For instance:
Cluster 1 (secured zone) -> copy HDFS data to -> Cluster 2 (non-secured zone)


Thanks,
Divya




Using UserGroupInformation in multithread process

Do you mean TGT tickets or tokens? Either way, they should be shared across threads. Did you check whether you're using the same UGI object in the different threads?
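
In case it helps, a rough sketch of sharing one UGI across threads is below; the principal, keytab path, and HDFS path are placeholders. The point is that the slave thread calls doAs() on the same UserGroupInformation instance the master logged in with, and re-logins happen against that shared instance, rather than each thread obtaining its own credentials.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Shares a single UGI between a master and a slave thread; principal and keytab are placeholders.
public class SharedUgiExample {
    public static void main(String[] args) throws Exception {
        final UserGroupInformation ugi =
                UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                        "app-user@EXAMPLE.COM", "/etc/security/keytabs/app-user.keytab");

        Runnable slave = () -> {
            try {
                // Re-login if needed (no-op while the TGT is fresh), then act as the shared UGI.
                ugi.checkTGTAndReloginFromKeytab();
                ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
                    FileSystem fs = FileSystem.get(new Configuration());
                    fs.exists(new Path("/tmp"));      // any secured call runs with ugi's credentials
                    return null;
                });
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };

        Thread t = new Thread(slave);
        t.start();
        t.join();
    }
}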



Thanks,

Zhijie


From: Gaurav Gupta
I am using UserGroupInformation to get the Kerberos tokens.
I have a process in a YARN container that spawns another thread (slave). I am renewing the Kerberos tokens in the master thread, but the slave thread is still using the older tokens.
Are tokens not shared across threads in the same JVM?

Thanks
Gaurav

YARN container killed as running beyond memory limits

Hi,
   From the logs it's pretty clear it's due to:
"Current usage: 576.2 MB of 2 GB physical memory used; 4.2 GB of 4.2 GB virtual memory used. Killing container."

+ Naga
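
In other words, the physical usage (576.2 MB of 2 GB) is fine; what was exceeded is the virtual memory limit, which is the container's physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio (default 2.1, so 2 GB gives roughly 4.2 GB). If that is acceptable in your setup, one common workaround, shown here only as an example and not a recommendation from this thread, is to relax or disable the virtual memory check in yarn-site.xml:

<!-- Example yarn-site.xml settings; pick one of the two, the values are illustrative. -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>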

From: Arbi Akhina


Hi, I have a YARN application that submits containers. In the ApplicationMaster logs I see that the container is killed. Here are the logs:

Jun 17, 2015 1:31:27 PM com.heavenize.modules.RMCallbackHandler onContainersCompleted INFO: container 'container_1434471275225_0007_01_000002' status is ContainerStatus: [ContainerId: container_1434471275225_0007_01_000002, State: COMPLETE, Diagnostics: Container [pid=4069,containerID=container_1434471275225_0007_01_000002] is running beyond virtual memory limits. Current usage: 576.2 MB of 2 GB physical memory used; 4.2 GB of 4.2 GB virtual memory used. Killing container. Dump of the process-tree for container_1434471275225_0007_01_000002 :  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE  |- 4094 4093 4069 4069 (java) 2932 94 2916065280 122804 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms512m -Xmx2048m -XX:MaxPermSize=250m -XX:+UseConcMarkSweepGC -Dosmoze.path=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_000002/Osmoze -Dspring.profiles.active=webServer -jar /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_000002/heavenize-modules.jar   |- 4093 4073 4069 4069 (sh) 0 0 4550656 164 /bin/sh /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_000002/startup.sh   |- 4073 4069 4069 4069 (java) 249 34 1577267200 24239 /usr/lib/jvm/java-7-openjdk-amd64/bin/java com.heavenize.yarn.task.ModulesManager -containerId container_1434471275225_0007_01_000002 -port 5369 -exe hdfs://hadoop-server/user/hadoop/heavenize/heavenize-modules.jar -conf hdfs://hadoop-server/user/hadoop/heavenize/config.zip   |- 4069 1884 4069 4069 (bash) 0 0 12730368 304 /bin/bash -c /usr/lib/jvm/java-7-openjdk-amd64/bin/java com.heavenize.yarn.task.ModulesManager -containerId container_1434471275225_0007_01_000002 -port 5369 -exe hdfs://hadoop-server/user/hadoop/heavenize/heavenize-modules.jar -conf hdfs://hadoop-server/user/hadoop/heavenize/config.zip 1> /usr/local/hadoop/logs/userlogs/application_1434471275225_0007/container_1434471275225_0007_01_000002/stdout 2> /usr/local/hadoop/logs/userlogs/application_1434471275225_0007/container_1434471275225_0007_01_000002/stderr  

I don't see any memory excess; any idea where this error comes from?
There are no errors in the container, it just stops logging as a result of being killed.

LDAP GroupMapping configuration

Hi!

I have this kind of filter working for me:

<property>
<name>hadoop.security.group.mapping.ldap.search.filter.user</name>
<value>(&amp;(&amp;(objectclass=user)(sAMAccountName={0})))</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.search.filter.group</name>
<value>(objectclass=group)</value>
</property>

All other fields are similar. Check this out :)

Sergey Kazakov


2015-06-16 15:38 GMT+05:00 Jiaming Lin :
Hi
I have some problems using LDAPGroupMapping.
I can't retrieve the group mappings.

I have added the Hadoop services and groups to my LDAP server.
Here is my ldif file.

And here are my core-site.xml configurations for the LDAPGroupMapping.

After those setups I restarted HDFS and YARN and executed these commands:
# su hdfs
# hdfs groups
They return no groups, while the expected output is "hadoop", as before.
Can someone help me?





--
Best regards,
Sergey Kazakov

Tuesday, June 16, 2015


Two hadoop nodes on same machine while a second machine not joining the cluster

Fixed, the two machines had the same hostname :(

On Tue, Jun 16, 2015 at 5:55 PM, Arbi Akhina wrote:
Here is the content of configuration files on both machines:

Here is the content of core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-server</value>
  </property>
</configuration>

Here is the content of mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>


On Tue, Jun 16, 2015 at 4:37 PM, vedavyasa reddy  wrote:
Hi,
Check in the slave node's HADOOP_HOME/etc/hadoop/core-site.xml whether the hostname of the master is given correctly, i.e. "hdfs://master-hostname:9000".

If that is correct, then check mapred-site.xml to see whether the JobTracker address is given correctly.

On Tue, Jun 16, 2015 at 6:23 PM, Arbi Akhina  wrote:
Hi,
I have a test cluster of two machines, with Hadoop installed on both. I've configured the Hadoop cluster, but in the admin UI I see that two nodes are running on the master machine and that the other machine has no Hadoop node.
On master machine following services are running:
~$ jps
26310 ResourceManager
27593 Jps
26216 DataNode
26135 NameNode
26557 NodeManager
26701 JobHistoryServer
On the slave machine:
~$ jps
2614 DataNode
2920 Jps
2707 NodeManager
I don't know why the slave is not joining the cluster (it was before). I tried shutting down all servers on both machines, formatting HDFS, and then restarting everything, but that didn't help. Any help figuring out what's causing this behavior is appreciated.


