Proxmox VE administration guide

Adding servers to the cluster

Adding a server (node) to the cluster requires a little preparation. Specifically, because we use private IP addresses for the cluster, we need to make sure the nodes use those addresses when contacting each other. In other words, if server1 wants to contact server2, it should use the 192.x range instead of the public IP address.

So, based on the above example, on server1 we need to add a line to /etc/hosts like this:

cat >> /etc/hosts <<EOF
192.168.15.21 server2.myprovider.com server2
EOF

Note the double “>>” append redirection. If you use a single “>”, you overwrite the entire file with just that line. You’ve been warned.

And on server2, we need to make sure server1 can be contacted using its private IP as well, so on that server, we perform:

cat >> /etc/hosts <<EOF
192.168.15.20 server1.myprovider.com server1
EOF

All of this can be made much fancier with your own DNS server and bindings, but that is beyond the scope of this guide, which assumes you don’t mind doing this by hand for the 2, 5 or 10 servers you may have. If you have a few hundred, then I wouldn’t expect you to be looking at a “Poor Man’s” setup.
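
After adding the entries, a quick sanity check on server1 (using the hostnames and addresses from the example above) confirms that the private name resolves and answers:

getent hosts server2.myprovider.com
ping -c 3 server2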

On the server that you will be adding to the cluster, make sure that you can successfully ping the private IP address of the “main server”. If that works, then still on that server (the one that isn’t yet part of the cluster), type:

pvecm add server1

Where “server1” is the “main server” (the one on which you first created the cluster). It will ask you for server1’s root password for SSH, and then takes care of the configuration.

Note: If you have disabled password-based root logins over SSH, you may have to temporarily re-enable them. Using SSH keys is preferable to passwords.
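
If you do have to re-enable password logins temporarily, a minimal sketch (on Debian-based Proxmox VE the SSH service is called ssh; revert the change once the node has joined):

sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
systemctl restart ssh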

After this has been done, the node should automatically appear in your web-based GUI, and its membership can be verified from the CLI using:

pvecm nodes

If the nodes show up in the “pvecm nodes” output and in the GUI, then you have successfully created the cluster.
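
For more detail on votes and the quorum state, you can also run:

pvecm status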

Note: A 2-node cluster is a special case with regard to quorum; if one node goes down, the remaining node on its own no longer has quorum.

Troubleshooting

Known issues

quorum.expected_votes must be configured

If the logs show something like:

corosync:   Quorum provider: corosync_votequorum failed to initialize.
corosync:   Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'

The hosts file entry for the corosync hostname and the ring0_addr entry in corosync.conf do not match, or the hostname could not be resolved.

Fix them up and reboot (or restart corosync). If you need to change something in corosync.conf but have no write permissions, see “Write config when not quorate” below.
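
A quick way to cross-check the two (the hostname here is only an example; use the names from your own corosync.conf):

grep ring0_addr /etc/pve/corosync.conf
getent hosts coro0-62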

crit: cpg_send_message failed: 9

If this pops up on only one node, restart the pve-cluster service with:

systemctl restart pve-cluster.service

If that does not solve the problem, or if it happens on all nodes, check your firewall and switch; they may block or not support multicast.

You may also have a switch with IGMP snooping enabled but no active multicast querier in the network. Install a multicast querier or disable IGMP snooping on the switch. Installing an IGMP querier is recommended, as it improves the performance of the network and of multicast itself.
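
Multicast connectivity between the nodes can be tested with omping (available as a Debian package); a sketch, with node names as placeholders, run at roughly the same time on all nodes:

apt-get install omping
omping -c 600 -i 1 -q node1 node2 node3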

Unknown issues

Ask for support. In the meantime, revert to the backed-up corosync.conf: see “Write config when not quorate”, then overwrite the config with the backup on each node, increase the config_version inside it, and make sure the version is the same on all nodes. Then reboot the cluster.
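
For example, assuming the backup was saved as /root/corosync.conf.bak on each node (the path is only an assumption), the restore would look roughly like this:

cp /root/corosync.conf.bak /etc/pve/corosync.conf
# then raise config_version in the totem section and keep it identical on all nodes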

Write config when not quorate

If you need to change /etc/pve/corosync.conf on a node with no quorum, and you know what you are doing, use:

pvecm expected 1

to set the expected vote count to 1. This makes the cluster quorate, and you can fix your config or revert it to the backup.

If that is not enough (e.g. corosync is dead), use:

systemctl stop pve-cluster
pmxcfs -l 

to start pmxcfs in local mode. You now have write access, so be very careful with your changes!
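
Once the configuration is fixed, a sketch of getting back to normal operation (stop the locally started pmxcfs instance and start the regular service again):

killall pmxcfs
systemctl start pve-cluster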

RRP modes

Note: Active mode is not completely stable yet. Always use passive mode in production.

Citing the corosync.conf man page:

Active replication offers slightly lower latency from transmit to delivery in faulty network environments but with less performance.
Passive replication may nearly double the speed of the totem protocol if the protocol doesn't become CPU bound.
The final option is none, in which case only one network interface will be used to operate the totem protocol. 

On Cluster Creation

The pvecm create command provides the additional parameters ‘-bindnet1_addr’, ‘-ring1_addr’ and ‘-rrp_mode’, which can be used for RRP configuration.

See the corresponding sections of the Proxmox VE cluster documentation for information about these addresses.

Note that if you only set the ring 1 addresses, ring 0 will be set to the default values (the local IP address and node name).
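
As an illustration, a cluster created with a second ring might look like this (cluster name, networks and hostnames are only examples, matching the sample config further down; the ring 0 parameters can be omitted to use the defaults mentioned above):

pvecm create tweak -bindnet0_addr 10.10.1.62 -ring0_addr coro0-62 \
  -bindnet1_addr 10.10.3.62 -ring1_addr coro1-62 -rrp_mode passive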

On Running Cluster

Use the same steps as for any other corosync configuration change to edit the config.

In the editor, adapt the following attributes:

  • add a new interface section to the totem section of the config.
  • add a ring1_addr: <ring1_hostname> entry to each node section.

It should look something like:

totem {
  cluster_name: tweak
  config_version: 2
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.10.1.62
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.10.3.62
    ringnumber: 1
  }
}

nodelist {
  node {
    name: pvecm62
    nodeid: 1
    quorum_votes: 1
    ring0_addr: coro0-62
    ring1_addr: coro1-62
  }

  node {
    name: pvecm63
    nodeid: 2
    quorum_votes: 1
    ring0_addr: coro0-63
    ring1_addr: coro1-63
  }

   # other cluster nodes here
}

 # other config sections here

Then rename the config file so the changes take effect:

mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

Reboot the first node and check the logs to verify that corosync starts without errors and can form a healthy cluster by itself on the new network.
If something fails, look at the troubleshooting section.
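
For example, corosync’s messages since the last boot can be inspected with:

journalctl -b -u corosync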

Then reboot every node, one after the other. If HA is enabled, reboot the node which is the current HA master last, to speed up the process.

Uploading an ISO image and installing a guest OS

Open the storage to which the ISO image will be uploaded. This can be done from the main panel or from the right-hand panel: click the item with the hard-disk icon (in the right-hand panel, expand the drop-down list first). In the screenshot the storage is named “local (Debian-77-wheezy-64-minimal)”.

Then go to the “Content” tab.

Click “Upload” to upload the ISO image.

In the window that appears, click “Select File…”.

Select the ISO image file.

Wait until the ISO image has been uploaded from your computer.

After the upload, you will see your ISO image in the list. Next, go to the virtual machine’s page. This can be done by clicking the item with the monitor icon in the right-hand panel (in the screenshot the virtual machine is named “100 (vm)”).

Go to the “Hardware” tab.

Double-click the “CD/DVD Drive” item to mount the ISO image.

In the window that appears, select the storage to which the ISO image was uploaded earlier.

Then select the ISO image itself.

Click “OK” to mount your ISO image.

The “CD/DVD Drive” entry should now show the name of your ISO image.
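
For reference, the same attach operation can also be done from the command line with qm set; the VMID 100, the storage name local and the image file name here are only the values from this example:

qm set 100 -cdrom local:iso/your-image.iso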

To install an OS from the ISO image, start (or restart) the virtual machine. This can be done from the virtual machine’s management menu by clicking “Start”. To open this menu, right-click the item with the monitor icon in the right-hand panel (in the screenshot the virtual machine is named “100 (vm)”).

Then, to continue the OS installation, connect to the virtual machine. This can also be done from the virtual machine’s management menu by clicking “Console”. Before connecting to the virtual machine, you need to install Java and add the panel’s URL to the Java exception list. Don’t forget to restart your browser so that the changed Java security settings take effect.

In the window that opens, allow the plugin to run.

In the warning window, click “Continue”.

In the second warning window, tick “I accept the risk and want to run this application” and click “Run”.

If you see a message saying that the connection to the virtual machine failed, click the “Run” button.

You can now continue with the OS installation.

Recovery

If you have major problems with your Proxmox VE host, e.g. hardware
issues, it could be helpful to just copy the pmxcfs database file
/var/lib/pve-cluster/config.db and move it to a new Proxmox VE
host. On the new host (with nothing running), you need to stop the
pve-cluster service and replace the config.db file (needed permissions
0600). Second, adapt /etc/hostname and /etc/hosts according to the
lost Proxmox VE host, then reboot and check. (And don’t forget your
VM/CT data)
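
A rough sketch of that procedure, assuming the old host’s disk is mounted at /mnt/old on the new host (the mount point is only an assumption):

systemctl stop pve-cluster
cp /mnt/old/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
chmod 0600 /var/lib/pve-cluster/config.db
# now adapt /etc/hostname and /etc/hosts to match the lost host, then reboot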

Remove Cluster configuration

The recommended way is to reinstall the node after you removed it from
your cluster. This makes sure that all secret cluster/ssh keys and any
shared configuration data is destroyed.

In some cases, you might prefer to put a node back into local mode without reinstalling it; this procedure (separating a node without reinstalling) is described in the Proxmox VE cluster documentation.

Recovering/Moving Guests from Failed Nodes

For the guest configuration files in nodes/<NAME>/qemu-server/ (VMs) and
nodes/<NAME>/lxc/ (containers), Proxmox VE sees the containing node <NAME> as
owner of the respective guest. This concept enables the usage of local locks
instead of expensive cluster-wide locks for preventing concurrent guest
configuration changes.

As a consequence, if the owning node of a guest fails (e.g., because of a power
outage, fencing event, ..), a regular migration is not possible (even if all
the disks are located on shared storage) because such a local lock on the
(dead) owning node is unobtainable. This is not a problem for HA-managed
guests, as Proxmox VE’s High Availability stack includes the necessary
(cluster-wide) locking and watchdog functionality to ensure correct and
automatic recovery of guests from fenced nodes.

If a non-HA-managed guest has only shared disks (and no other local resources
which are only available on the failed node are configured), a manual recovery
is possible by simply moving the guest configuration file from the failed
node’s directory in /etc/pve/ to an alive node’s directory (which changes the
logical owner or location of the guest).

For example, recovering the VM with ID 100 from a dead node1 to another
node node2 works with the following command executed when logged in as root
on any member node of the cluster:

mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/

Warning: Before manually recovering a guest like this, make absolutely sure
that the failed source node is really powered off/fenced. Otherwise Proxmox VE’s
locking principles are violated by the mv command, which can have unexpected
consequences.

Note: Guests with local disks (or other local resources which are only
available on the dead node) are not recoverable like this. Either wait for the
failed node to rejoin the cluster or restore such guests from backups.

Containers and VMs

You can now create containers and VMs that can be migrated between the nodes.

You can either assign the private IP address directly (venet, only on OpenVZ containers) or as a network device (veth) attached to vmbr1.

The private IP address should be within the range of your specified netmask on vmbr1. So going by the above example of using 192.168.14.0/23, that’s anything between 192.168.14.1 and 192.168.15.254. Make sure the IP isn’t already used by another VM or a node (see initial notes, re 192.168.14.x for VMs).
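
As an illustration, on current Proxmox VE versions an LXC container attached to vmbr1 with an address from that range could be created like this (the VMID, template name, hostname and gateway are only placeholders; the gateway here is server1’s private address from the earlier example):

pct create 101 local:vztmpl/debian-9.0-standard_9.0-2_amd64.tar.gz \
  --hostname test-ct \
  --net0 name=eth0,bridge=vmbr1,ip=192.168.14.10/23,gw=192.168.15.20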

If you fire up the VM, its private IP address should be ping-able from any server, and from within the container / VM, you can ping any private as well as public IP address (the latter thanks to masquerading configured with the tinc-up script). If this is not the case, the network configuration was not done correctly.
