
Errors & Fixes


Mesos Errors:

HDFS Errors:

1. Datanode failed to start after running the start-dfs.sh script (due to incompatible clusterIDs in the namenode and datanode)

Try running the following command on one of the datanodes:

sudo /hadoop/bin/hdfs datanode

If you see the following error, read on:

19/06/02 10:52:28 INFO common.Storage: Using 1 threads to upgrade data directories (dfs.datanode.parallel.volumes.load.threads.num=1, dataDirs=1)
19/06/02 10:52:28 INFO common.Storage: Lock on /data/hadoop/data/in_use.lock acquired by nodename [email protected]
19/06/02 10:52:28 WARN common.Storage: Failed to add storage directory [DISK]file:/data/hadoop/data/
java.io.IOException: **Incompatible clusterIDs in /data/hadoop/data: namenode clusterID = CID-79fa7c14-46a5-4c3d-b693-c026a8d0d3b9; datanode clusterID = CID-c63786f8-0907-4356-b4f0-f079abd3eafa**
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:760)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:293)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:409)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:388)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
        at java.base/java.lang.Thread.run(Thread.java:844)
19/06/02 10:52:28 ERROR datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid c9692d9e-a360-423d-89bb-626ad5b85748) service to master/10.10.1.2:9000. Exiting.
java.io.IOException: All specified directories have failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:557)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
        at java.base/java.lang.Thread.run(Thread.java:844)
19/06/02 10:52:28 WARN datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid c9692d9e-a360-423d-89bb-626ad5b85748) service to master/10.10.1.2:9000
19/06/02 10:52:28 INFO datanode.DataNode: Removed Block pool <registering> (Datanode Uuid c9692d9e-a360-423d-89bb-626ad5b85748)
19/06/02 10:52:30 WARN datanode.DataNode: Exiting Datanode
19/06/02 10:52:30 INFO datanode.DataNode: SHUTDOWN_MSG:

Solution: Remove the old clusterID assignment on the datanode. Then reformat the namenode so the clusterID is consistent between the namenode and the datanodes.

On the datanode:

sudo rm -rf /hdfs/*

On the namenode:

sudo /hadoop/sbin/stop-dfs.sh
sudo rm -rf /hdfs/*
sudo /hadoop/bin/hdfs namenode -format
sudo /hadoop/sbin/start-dfs.sh
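
After the cluster comes back up, you can optionally verify that the IDs now agree and that the datanodes re-registered. The VERSION paths below are assumptions; substitute whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to in your hdfs-site.xml.

# On the namenode: clusterID of the freshly formatted filesystem
grep clusterID /hdfs/name/current/VERSION
# On each datanode: should report the same clusterID once it has re-registered
grep clusterID /hdfs/data/current/VERSION
# On the namenode: every datanode should show up as a live node
sudo /hadoop/bin/hdfs dfsadmin -report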

Reference from this post

2. Datanode cannot connect to the namenode (connection refused)

java.net.ConnectException: Call From marta-komputer/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Solution: add the following property to hdfs-site.xml on the namenode, then restart HDFS:

    <!-- Bind the namenode RPC service to all interfaces; 0.0.0.0 means all IPs -->
    <property>
        <name>dfs.namenode.rpc-bind-host</name>
        <value>0.0.0.0</value>
    </property>
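
After restarting HDFS, a quick way to check the fix (assuming the default namenode RPC port 9000 from the error above) is to confirm the port is bound to all interfaces rather than a loopback address. A hostname mapped to 127.0.1.1 in /etc/hosts, as in the error message, is another common cause of this failure.

sudo /hadoop/sbin/stop-dfs.sh
sudo /hadoop/sbin/start-dfs.sh
# Port 9000 should now be listening on 0.0.0.0, not 127.0.0.1 or 127.0.1.1
sudo netstat -tlnp | grep 9000
# Check whether the namenode hostname resolves to a loopback address
grep 127.0.1.1 /etc/hosts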

3. Thread failed to detach from JVM

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe858021c27, pid=7166, tid=0x00007fe859382700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_201-b09) (build 1.8.0_201-b09)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.201-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8cac27]  Monitor::ILock(Thread*) [clone .part.2]+0x17
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/lib/mesos/slaves/7fc7fe9a-3b38-418e-89c2-8b4a4068fa9a-S1/frameworks/5d9c828d-b45d-470b-ab17-a69f822379a2-0000/executors/driver-20190602122729-0009/runs/d827d88f-b5da-4f21-bdd6-9a6ef9fa9fb4/hs_err_pid7166.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

This problem is caused by a function call from a shared library that fails to release its resources and crashes the JVM on shutdown. It can occur when two separate shared libraries expose functionality through JNI.

Solution: apply this patch and replace libhdfs.so on all nodes.
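
The patched library has to replace the copy the JVM actually loads on every node. A minimal sketch, assuming Hadoop lives under /hadoop as in the other examples and that node1..node3 are placeholder hostnames for the workers:

# Back up the existing library and push the patched one to each node
for host in node1 node2 node3; do
    ssh "$host" "sudo cp /hadoop/lib/native/libhdfs.so /hadoop/lib/native/libhdfs.so.bak"
    scp libhdfs.so "$host:/tmp/libhdfs.so"
    ssh "$host" "sudo mv /tmp/libhdfs.so /hadoop/lib/native/libhdfs.so"
done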

4. Compiling Hadoop from source fails because protoc 2.5.0 is missing

Solution:

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.bz2
tar xvf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure CC=clang CXX=clang++ CXXFLAGS='-std=c++11 -stdlib=libc++ -O3 -g' LDFLAGS='-stdlib=libc++' LIBS="-lc++ -lc++abi"
make -j 4 
sudo make install
protoc --version
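
Once protoc --version reports 2.5.0, a typical native build of Hadoop from the source tree looks like the following (profile names per Hadoop's BUILDING.txt; drop native from the profile list if you only need the Java artifacts):

# Run from the top of the Hadoop source tree
mvn clean package -Pdist,native -DskipTests -Dtar
# The distribution tarball is typically produced under hadoop-dist/target/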

For more info, see the instructions for installing Hadoop on macOS.

5. ZooKeeper failed to start because the user 'zookeeper' was not found

You can double-check this error with sudo tail -n100 /var/log/syslog.

To fix this problem, if ZooKeeper was installed through apt-get, edit the init script (sudo vim /etc/rc1.d/K01zookeeper) and change NAME=zookeeper to NAME=root.
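
If you prefer a one-liner, the same edit can be made with sed against the init script mentioned above:

sudo sed -i 's/^NAME=zookeeper/NAME=root/' /etc/rc1.d/K01zookeeper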

Then restart ZooKeeper:

sudo systemctl daemon-reload
sudo systemctl stop zookeeper.service
sudo systemctl start zookeeper.service

and restart Mesos as well:

sudo systemctl restart mesos-master
sudo systemctl restart zookeeper
sudo systemctl restart spark
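
To confirm everything came back up, ZooKeeper should answer the ruok four-letter command and systemctl should report the services as active (this assumes the default client port 2181 and that nc is installed):

echo ruok | nc localhost 2181   # should print "imok"
systemctl status zookeeper.service mesos-master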