Elasticsearch discovery in EC2

Elasticsearch out of the box does a good job of locating nodes and building a cluster: just assign everyone the same cluster name and ES does the rest. But running a cluster on Amazon’s EC2 presents some additional challenges specific to that environment. I recently set up a Dockerized Elasticsearch cluster on two EC2 medium instances, and thought I would put down in one place the main things I ran into while setting it up.

The most fundamental constraint in EC2 is that multicast discovery (unsurprisingly) will not work. EC2 doesn’t allow multicast traffic. This gives you two choices: use unicast discovery, or set up EC2 dynamic discovery. Unicast discovery requires that all nodes have the host IPs for the other nodes in the cluster. I don’t like that idea, so I chose to set up dynamic discovery using the EC2 APIs. Getting EC2 dynamic discovery to work requires changes in three places: the Elasticsearch installation, the configuration file at /etc/elasticsearch/elasticsearch.yml, and finally the instance and security configuration on Amazon. I’ll take these in order.
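
Before that, for reference, here is roughly what the unicast alternative looks like in elasticsearch.yml; the IPs are placeholders:

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.11:9300", "10.0.0.12:9300"]

It works, but every node’s config has to be touched whenever the cluster membership changes, which is exactly what I wanted to avoid.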

The best way to integrate EC2 into the Elasticsearch discovery workflow is to use the cloud-aws plugin module. You can install this at any time with the following command:

sudo /usr/share/elasticsearch/bin/plugin -install \
elasticsearch/elasticsearch-cloud-aws/2.0.0.RC1

This will pull the plugin from the download site and install it, which is basically just extracting files. Note that the version in the command was the latest at the time of writing; you can check the plugin’s project page to see if it is still current. And that’s all there is to that. Adding the cloud-aws plugin enables the discovery.ec2 settings in elasticsearch.yml, which is where we’ll head next.
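
If you want to confirm that the plugin actually landed, the same script can list what is installed (assuming the same install path; the exact flag may vary a bit between versions):

sudo /usr/share/elasticsearch/bin/plugin --list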

The Elasticsearch config file is located in /etc/elasticsearch/elasticsearch.yml, and we’ll want to change/add a few things there. First, obviously, give everyone the same cluster name:

cluster.name: my_cluster

Another setting that makes sense in production, at least, is to require the cloud-aws plugin to be present on start:

plugin.mandatory: cloud-aws

The next two settings are required for the cloud-aws plugin to communicate on your behalf with the great AWS brain in the sky:

cloud.aws.access_key: NOTMYACCESSKEYATALL
cloud.aws.secret_key: ITw0UlDnTB3seCRetiFiPuTItHeR3

Not to digress too much into AWS access management, but if you’ve set things up the right way then the access key and secret key used above will be for an IAM sub-account that grants just the specific permissions needed (for discovery that essentially means permission to describe EC2 instances). The next two settings deal with discovery directly:

discovery.type: ec2
discovery.zen.ping.multicast.enabled: false

The first one just sets the discovery type to use the EC2 plugin. Pretty self-explanatory. The second one disables the use of multicast discovery, on the principle of “it doesn’t work, so don’t try it.” The last two settings we’ll discuss can be seen as alternatives to one another, but are not mutually exclusive, which requires a bit of explanation.

Basically, we may need to filter the instances that get returned to the discovery process. When the cloud-aws plugin queries EC2 and gets a list of addresses back, it is going to assume they are all Elasticsearch nodes. During discovery it will try to contact them, and if some are not actually nodes it will just keep trying. This behavior makes sense with the multicast discovery process, because if you are not listening for multicast traffic then you don’t respond to it. But the EC2 API will return all the instances your credentials can see in the region, so we need some way to identify to Elasticsearch discovery which ones really are nodes.

One way to do this is by using a security group. You can create a security group and assign all the instances that you want in a particular cluster to that group, then make the following addition to the config:

discovery.ec2.groups: my_security_group

This setting tells the plugin to return only those instances that are members of the security group ‘my_security_group.’ Since you will need a security group anyway, as explained below, this is a convenient way to separate them from the crowd. But there can be cases where you don’t want to partition on the security group name. You might, for example, want to have one security group to control the access rules for a set of instances representing more than one cluster. In that case you can use tags:

discovery.ec2.tag.my_tag: my_tag_value

This setting tells the plugin to return only those instances that have the tag ‘my_tag’ with the value ‘my_tag_value.’ This is even more convenient, since it doesn’t involve mucking about with security groups, and setting the tag on an instance is easily done. Finally, as mentioned before these aren’t mutually exclusive. You can use the groups option to select a set of instances, and then partition them into clusters using tags.
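
For what it’s worth, setting a tag doesn’t even require the console; with the AWS CLI it is a one-liner (the instance ID below is a placeholder):

aws ec2 create-tags --resources i-12345678 --tags Key=my_tag,Value=my_tag_value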

And that’s it for the elasticsearch.yml settings, or at least the ones I had to change to make this work on EC2. There are a lot of other options if your specific case requires tweaking, and you can find an explanation of them in the plugin’s documentation. The last thing I want to go into is the set of steps to take in the Amazon EC2 console with respect to configuration. These fall into two areas: security groups and instance configuration. I don’t want to digress far into the specific AWS console steps, but I’ll outline in general what needs to happen.

Security groups first. You’re going to need all the nodes in your cluster to be able to talk to each other, and you’re going to want some outside clients to be able to talk to the nodes in the cluster as well. There are lots of different cases for connecting clients for querying purposes, so I’m going to ignore that topic and focus on communications between the nodes.

The nodes in a cluster need to use the Elasticsearch transport protocol on port 9300, so you’ll need to create a security group, or modify an existing one that applies to your instances, and create a rule allowing this traffic. For my money the easiest and most durable way to do this is to have a single security group covering just the nodes, and to add a rule allowing inbound TCP traffic on port 9300 from any instance in the group. If you are using the discovery.ec2.groups method discussed above, make sure to give your group the same name you used in the settings.
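
If you would rather script that than click through the console, something like the following AWS CLI sketch should do it (the group name and description are placeholders, and VPC setups may need --group-id rather than --group-name):

aws ec2 create-security-group --group-name my_security_group --description "ES cluster nodes"
aws ec2 authorize-security-group-ingress --group-name my_security_group \
  --protocol tcp --port 9300 --source-group my_security_group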

The last point is instance configuration, and for my purposes here it’s really just setting the tag or security group membership appropriately so that the instance gets included in the list returned from discovery. There are lots of other specifics regarding how to set up an instance to run Elasticsearch efficiently, but those are topics for another time (after I figure them out myself!).

The very last thing I want to mention is a tip for Docker users. If you’re running Elasticsearch inside a container on EC2, discovery is going to fail. The first node that starts will elect itself as master, and if you query cluster health on that node it will succeed and tell you there is one node in the cluster: itself. The other nodes will hang when you try the same query, because they are busy failing discovery. If you look in /var/log/elasticsearch/cluster_name.log you’re going to see some exceptions that look like this:

[2014-03-18 18:20:04,425][INFO ][discovery.ec2            ]
[MacPherran] failed to send join request to master 
[[Amina][Pi6vJ470SYy4fEQhGXiwEA][38d8cffbdcd5][inet[/172.17.0.2:9300]]],
reason [org.elasticsearch.transport.RemoteTransportException:
[MacPherran][inet[/172.17.0.2:9300]][discovery/zen/join]; 
org.elasticsearch.ElasticsearchIllegalStateException: 
Node [[MacPherran][aHWxgYAhSpaWlPEXNhs7RA][4fb92271cce6][inet[/172.17.0.2:9300]]] 
not master for join request from 
[[MacPherran][aHWxgYAhSpaWlPEXNhs7RA][4fb92271cce6][inet[/172.17.0.2:9300]]]]

I cut out a lot of detail, but basically what is happening is that, in this example, node MacPherran is talking to itself and doesn’t know it. The problem is that a running Docker container has a different IP address from the host instance it is running on. So when the node does discovery it first finds itself at the container IP, something like:

[MacPherran][inet[/172.17.0.2:9300]]

And then finds itself at the instance IP returned from EC2:

[MacPherran][inet[/10.147.39.21:9300]]

From there things do not go well for the discovery process. Fortunately this is easily fixed with another addition to elasticsearch.yml:

network.publish_host: 10.147.39.21

This setting tells Elasticsearch that this node should advertise itself as being at the specified host IP address. Set this to the host address of the instance and Elasticsearch will now tell everyone that is its address, which it is, assuming you are mapping the container ports over to the host ports. Now, of course, you have the problem of how to inject the host IP address into the container, but hopefully that is a simpler problem, and is left as an exercise for the reader.
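
One possible way to do that injection, sketched here with a hypothetical image name and install path, is to read the instance’s private IP from the EC2 metadata service on the host and pass it through as a system property override, which Elasticsearch 1.x accepts with the -Des. prefix:

HOST_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
docker run -d -p 9200:9200 -p 9300:9300 my_es_image \
  /usr/share/elasticsearch/bin/elasticsearch -Des.network.publish_host=$HOST_IP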

15 thoughts on “Elasticsearch discovery in EC2”

  1. Mark,

    Docker 1.0 supports connecting to the host network interface directly. Do you think it will make node discovery easier than before?

  2. Hi, Anand. Yes, at the very least the --net option will remove the requirement to inject the host IP into the ES config inside the container so it will publish the right interface address. On the other hand there are some potential security implications to permitting access to the host network stack by the container, so it is a tradeoff. Personally I’m not all that comfortable with --net yet. Thanks for the comment.

  3. Hi, Pavel. I would be happy to take a look at your issue and discuss it if you will describe it in a comment here so that the question and subsequent discussion are more easily available to future readers. Thank you. –Mark

  4. Hi Mark,

    That’s fair :)

    My issue is the following.
    Here is my elasticsearch.yml configuration:

    cloud.aws.access_key: [MY_ACCESS_KEY]
    cloud.aws.secret_key: [MY_SECRET_KEY]

    plugin.mandatory: cloud-aws

    cluster.name: "LogstashCluster"

    node.name: "morbius"

    discovery.type: "ec2"
    discovery.ec2.groups: [MY_GROUP]
    discovery.ec2.host_type: "public_ip"
    discovery.ec2.ping_timeout: "10s"
    cloud.aws.region: "eu-west-1"

    discovery.zen.ping.multicast.enabled: false

    network.publish_host: [PUBLIC_IP]

    However, when I restart both servers, in the logs I see the following:

    [2014-07-21 08:01:53,919][TRACE][discovery.zen.ping.unicast] [morbius] [1] failed to connect to [#cloud-i-hash-0][morbius.domain.com][inet[PUBLIC_DNS/PRIVATE_IP:9300]]
    org.elasticsearch.transport.ConnectTransportException: [][inet[PUBLIC_DNS/PRIVATE_IP:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:692)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:652)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:619)
    at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:150)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

    However, I’m able to telnet from the host morbius to the other host on port 9300. And there are no firewall restrictions.

  5. Hi, Pavel. You say that there are no firewall rules, and that you can telnet to one host from the other on 9300. That’s interesting because by default instances in a security group cannot talk to each other unless a rule is added to allow it. See the following page: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_SecurityGroups.html. There is a default security group with default rules, but it’s hard to imagine that it includes rules permitting TCP traffic on 9300. What I can say is that in our own deployment I had to add a custom TCP rule to the security group containing the ES cluster, permitting TCP traffic on 9300-9400. I would give that a try and see whether it works. I can’t explain the telnet behavior without actually reviewing your security group configuration.

  6. In the latest version of ES (1.4.0) you don’t need any special magic network config to enable a Docker-based EC2 cluster.

    elasticsearch.yml

    cloud:
      aws:
        access_key:
        secret_key:
        region: "us-west-2"
    discovery:
      type: ec2
      ec2.groups: elastic_search_group
      zen.ping.multicast.enabled: false
    plugin.mandatory: "cloud-aws"
    network.publish_host: _ec2:privateIp_
    cluster:
      name: eyewatch
    path:
      logs: /data/log
      data: /data/data
      plugins: /data/plugins
      work: /data/work
    script.disable_dynamic: false
    script.default_lang: mvel
    bootstrap.mlockall: true

    docker run command:

    docker run -d \
    --name es \
    -e ES_HEAP_SIZE=2500m \
    -p 9200:9200 -p 9300:9300 \
    -v /data/conf:/data/conf \
    -v /data/data:/data/data \
    -v /data/log:/data/log \
    --restart="on-failure:99" \
    ezegolub/elasticsearch /elasticsearch/bin/elasticsearch -Des.config=/data/conf/elasticsearch.yml

    Docker image:
    https://registry.hub.docker.com/u/ezegolub/elasticsearch/

  7. Hi Mark, thanks for your note about Docker on Amazon. I’m wondering, since Docker on Amazon only allows the first port to be mapped: if we map container port 9300 to host port 80 for discovery, then we have to give up port 9200, which is the one mainly used. Do you have any suggestions on this? Thanks a lot.

  8. Hi Tuan. Sorry it has taken me so long to respond to this. I’m not sure I understand why you would need to map port 9300 to port 80 on the host for discovery purposes. Elasticsearch nodes use 9300 for internode communications as noted in the post. Perhaps you could give me a little more background on what you’re trying to do. Thanks for stopping by.

  9. Thanks for the article. Very handy. I also use the _ec2:privateIp_ technique:

    discovery.zen.ping.multicast.enabled: false
    network.publish_host: _ec2:privateIp_

    and then just punch holes in the security group for those private IP addresses with a bash script, something like this:


    instance_id=i-12345678
    private_ip_address=$(aws ec2 describe-instances --instance-ids $instance_id | jq -r '.Reservations[].Instances[].PrivateIpAddress')
    security_group_name=my-security-group
    aws ec2 create-security-group --group-name $security_group_name --description "$security_group_name"
    security_group_id=$(aws ec2 describe-security-groups --group-names $security_group_name --output text --query 'SecurityGroups[0].GroupId')
    aws ec2 authorize-security-group-ingress --group-id $security_group_id --protocol tcp --port 9300 --cidr $private_ip_address/32
