NoSQL databases are every bit as popular today as more conventional, relational ones. One of the most popular NoSQL systems is Apache Cassandra. It’s designed to deal with big data, and can be scaled across large numbers of servers. This makes it resilient and highly available.

This package is relatively new on Fedora, since it was introduced on Fedora 26. The following article is a short tutorial to set up Cassandra on Fedora for a development environment. Production deployments should use a different set-up to harden the service.

Install and configure Cassandra

The set of database packages in Fedora’s stable repositories are the client tools in the cassandra package. The common library is in the cassandra-java-libs package (required by both client and server). The most important part of the database, the daemon, is available in the cassandra-server package. Some more supporting packages may be listed by running the following command in a terminal.

dnf list cassandra\*

First, install and start the service:

$ sudo dnf install cassandra cassandra-server $ sudo systemctl start cassandra

To enable the service to automatically start at boot time, run:

$ sudo systemctl enable cassandra

Finally, test the server initialization using the client:

$ cqlsh Connected to Test Cluster at 127.0.0.1:9042. [cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4] Use HELP for help. cqlsh > CREATE KEYSPACE k1 WITH REPLICATION = { ' class ' : ' SimpleStrategy ' , ' replication_factor ' : 1 } ; cqlsh > USE k1 ; cqlsh:k 1> CREATE TABLE users (user_name varchar, password varchar, gender varchar, PRIMARY KEY (user_name)) ; cqlsh:k 1> INSERT INTO users (user_name, password, gender) VALUES ( ' John ' , ' test123 ' , ' male ' ) ; cqlsh:k 1> SELECT * from users ; user_name | gender | password -----------+--------+---------- John | male | test123 (1 rows)

To configure the server, edit the file /etc/cassandra/cassandra.yaml. For more information about how to change configuration, see the the upstream documentation.

Controlling access with users and passwords

By default, authentication is disabled. To enable it, follow these steps:

By default, the authenticator option is set to AllowAllAuthenticator. Change the authenticator option in the cassandra.yaml file to PasswordAuthenticator:

authenticator: PasswordAuthenticator

Restart the service.

$ sudo systemctl restart cassandra

Start cqlsh using the default superuser name and password:

$ cqlsh -u cassandra -p cassandra

Create a new superuser:

cqlsh> CREATE ROLE <new_super_user> WITH PASSWORD = '<some_secure_password>' AND SUPERUSER = true AND LOGIN = true;

Log in as the newly created superuser:

$ cqlsh -u <new_super_user> -p <some_secure_password>

The superuser cannot be deleted. To neutralize the account, change the password to something long and incomprehensible, and alter the user’s status to NOSUPERUSER:

cqlsh> ALTER ROLE cassandra WITH PASSWORD='SomeNonsenseThatNoOneWillThinkOf' AND SUPERUSER=false;

Enabling remote access to the server

Edit the /etc/cassandra/cassandra.yaml file, and change the following parameters:

listen_address: external_ip rpc_address: external_ip seed_provider/seeds: "<external_ip>"

Then restart the service:

$ sudo systemctl restart cassandra

Other common configuration

There are quite a few more common configuration parameters. For instance, to set the cluster name, which must be consistent for all nodes in the cluster:

cluster_name: 'Test Cluster'

The data_file_directories option sets the directory where the service will write data. Below is the default that is used if unset. If possible, set this to a disk used only for storing Cassandra data.

data_file_directories: - /var/lib/cassandra/data

To set the type of disk used to store data (SSD or spinning):

disk_optimization_strategy: ssd|spinning

Running a Cassandra cluster

One of the main features of Cassandra is the ability to run in a multi-node setup. A cluster setup brings the following benefits:

Fault tolerance: Automatically replicates data to multiple nodes for fault-tolerance. Also, it supports replication across multiple data centers. You can replace failed nodes with no downtime.

Automatically replicates data to multiple nodes for fault-tolerance. Also, it supports replication across multiple data centers. You can replace failed nodes with no downtime. Decentralization: There are no single points of failure, no network bottlenecks, and every node in the cluster is identical.

There are no single points of failure, no network bottlenecks, and every node in the cluster is identical. Scalability & elasticity: Can run thousands of nodes with petabytes of data. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

The following sections describe how to setup a simple two-node cluster.

Clearing existing data

First, if the server is running now or has ever run before, you must delete all the existing data (make a backup first). This is because all nodes must have the same cluster name and it’s better to choose a different one from the default Test cluster name.

Run the following commands on each node:

$ sudo systemctl stop cassandra $ sudo rm -rf /var/lib/cassandra/data/system/*

If you deploy a large cluster, you can do this via automation using Ansible.

Configuring the cluster

To setup the cluster, edit the main configuration file /etc/cassandra/cassandra.yaml. Modify these parameters:

cluster_name: Name of your cluster.

num_tokens: Number of virtual nodes within a Cassandra instance. This option partitions the data and spreads the data throughout the cluster. The recommended value is 256.

seeds: Comma-delimited list of the IP address of each node in the cluster.

listen_address: The IP address or hostname the service binds to for connecting to other nodes. It defaults to localhost and needs to be changed to the IP address of the node.

rpc_address: The listen address for client connections (CQL protocol).

endpoint_snitch: Set to a class that implements the IEndpointSnitch. Cassandra uses snitches to locate nodes and route requests. The default is SimpleSnitch, but for this exercise, change it to GossipingPropertyFileSnitch which is more suitable for production environments: SimpleSnitch: Used for single-datacenter deployments or single-zone in public clouds. Does not recognize datacenter or rack information. It treats strategy order as proximity, which can improve cache locality when disabling read repair. GossipingPropertyFileSnitch: Recommended for production. The rack and datacenter for the local node are defined in the cassandra-rackdc.properties file and propagate to other nodes via gossip.

auto_bootstrap: This parameter is not present in the configuration file, so add it and set to false. It makes new (non-seed) nodes automatically migrate the right data to themselves.

Configuration files for a two-node cluster follow.

Node 1:

cluster_name: 'My Cluster' num_tokens: 256 seed_provider: - class_name: org.apache.cassandra.locator.SimpleSeedProvider - seeds: 10.0.0.1, 10.0.0.2 listen_address: 10.0.0.1 rpc_address: 10.0.0.1 endpoint_snitch: GossipingPropertyFileSnitch auto_bootstrap: false

Node 2:

cluster_name: 'My Cluster' num_tokens: 256 seed_provider: - class_name: org.apache.cassandra.locator.SimpleSeedProvider - seeds: 10.0.0.1, 10.0.0.2 listen_address: 10.0.0.2 rpc_address: 10.0.0.2 endpoint_snitch: GossipingPropertyFileSnitch auto_bootstrap: false

Starting the cluster

The final step is to start each instance of the cluster. Start the seed instances first, then the remaining nodes.

$ sudo systemctl start cassandra

Checking the cluster status

In conclusion, you can check the cluster status with the nodetool utility:

$ sudo nodetool status Datacenter: datacenter1 =============== Status=Up/Down | / State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.0.0.2 147.48 KB 256 ? f50799ee-8589-4eb8-a0c8-241cd254e424 rack1 UN 10.0.0.1 139.04 KB 256 ? 54b16af1-ad0a-4288-b34e-cacab39caeec rack1 Note: Non-system keyspaces don ' t have the same replication settings, effective ownership information is meaningless

Cassandra in a container

Linux containers are becoming more popular. You can find a Cassandra container image here on DockerHub:

centos/cassandra-3-centos7

It’s easy to start a container for this purpose without touching the rest of the system. First, install and run the Docker daemon:

$ sudo dnf install docker $ sudo systemctl start docker

Next, pull the image:

$ sudo docker pull centos/cassandra-3-centos7

Now prepare a directory for data:

$ sudo mkdir data $ sudo chown 143:143 data

Finally, start the container with a few arguments. The container uses the prepared directory to store data into and creates a user and database.

$ docker run --name cassandra -d -e CASSANDRA_ADMIN_PASSWORD=secret -p 9042:9042 -v `pwd`/data:/var/opt/rh/sclo-cassandra3/lib/cassandra:Z centos/cassandra-3-centos7

Now you have the service running in a container while storing data into the data directory in the current working directory. If the cqlsh client is not installed on your host system, run the one provided by the image with the following command:

$ docker exec -it cassandra 'bash' -c 'cqlsh '`docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cassandra`' -u admin -p secret'

Conclusion

The Cassandra maintainers in Fedora seek co-maintainers to help keep the package fresh on Fedora. If you’d like to help, simply send them email.

Photo by Glen Jackson on Unsplash.