Efficiently delete data with JPA and Hibernate

You may find yourself in a situation where you have to perform a bulk deletion of a huge number of records stored in a relational database. If you use JPA with Hibernate as the underlying OR mapper, you might try to call the remove() method of the EntityManager like this:

public void removeById(long id) {
    RootEntity rootEntity = entityManager.getReference(RootEntity.class, id);
    entityManager.remove(rootEntity);
}

First of all, we load a reference representation of the entity we want to delete and then pass this reference to the EntityManager. Let’s assume the RootEntity from above has a child relation to a class called ChildEntity:

@OneToMany(mappedBy = "rootEntity", fetch = FetchType.EAGER, cascade = CascadeType.ALL)
private Set<ChildEntity> childEntities = new HashSet<ChildEntity>(0);
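
For reference, the examples in this post assume a mapping along the following lines (a minimal sketch; getters, setters and constructors are omitted, and the concrete fields are only illustrative):

// sketch of the two entities; javax.persistence.* and java.util.* imports omitted
@Entity
@Table(name = "ROOT_ENTITY")
public class RootEntity {

    @Id
    @GeneratedValue
    private Long id;

    private String field1;
    private String field2;

    // children are loaded eagerly and removed together with their parent
    @OneToMany(mappedBy = "rootEntity", fetch = FetchType.EAGER, cascade = CascadeType.ALL)
    private Set<ChildEntity> childEntities = new HashSet<ChildEntity>(0);
}

@Entity
@Table(name = "CHILD_ENTITY")
public class ChildEntity {

    @Id
    @GeneratedValue
    private Long id;

    private String field1;
    private String field2;

    // back reference to the parent, mapped to the PARENT column
    @ManyToOne
    @JoinColumn(name = "PARENT")
    private RootEntity rootEntity;
}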

If we now turn on Hibernate's show_sql property, we may be surprised by the SQL statements that are issued:

    select
        rootentity0_.id as id5_1_,
        rootentity0_.field1 as field2_5_1_,
        rootentity0_.field2 as field3_5_1_,
        childentit1_.PARENT as PARENT5_3_,
        childentit1_.id as id3_,
        childentit1_.id as id4_0_,
        childentit1_.field1 as field2_4_0_,
        childentit1_.field2 as field3_4_0_,
        childentit1_.PARENT as PARENT4_0_
    from
        ROOT_ENTITY rootentity0_
    left outer join
        CHILD_ENTITY childentit1_
            on rootentity0_.id=childentit1_.PARENT
    where
        rootentity0_.id=?

    delete
    from
        CHILD_ENTITY
    where
        id=?

    delete
    from
        ROOT_ENTITY
    where
        id=?

Why does Hibernate first load all the data into memory only to delete it immediately afterwards? The reason is that the JPA lifecycle requires an object to be in the "managed" state before it can be deleted, because only in this state is all lifecycle functionality, such as interceptors, available (see here). Therefore Hibernate issues a SELECT query before the deletion in order to transfer both RootEntity and ChildEntity to the "managed" state.
But what can we do if we just want to delete RootEntity and its ChildEntity instances and we know the id of the RootEntity? The answer is to use a simple DELETE query like the following one. Due to the integrity constraint on the child table, however, we first have to delete all dependent child entities. The following code demonstrates how:

List<Long> childIds = entityManager
        .createQuery("select c.id from ChildEntity c where c.rootEntity.id = :pid")
        .setParameter("pid", id)
        .getResultList();
for (Long childId : childIds) {
    entityManager.createQuery("delete from ChildEntity c where c.id = :id")
            .setParameter("id", childId).executeUpdate();
}
entityManager.createQuery("delete from RootEntity r where r.id = :id")
        .setParameter("id", id).executeUpdate();

The above code results in the three SQL statements we would have expected from calling remove(). Now you may argue that this way of deleting is more complicated than just calling the EntityManager's remove() method, and that it ignores the @OneToMany and @ManyToOne annotations we have already placed in the two entity classes.
So why not write some code that uses the knowledge about the two entities that already exists in the two class files? First, we look for @OneToMany annotations in the RootEntity class using reflection, extract the type of the child entity and then look for its back-relation field annotated with @ManyToOne. Having done this, we can easily write the three SQL statements in a more generic way:

public void delete(EntityManager entityManager, Class<?> parentClass, Object parentId) {
    Field idField = getIdField(parentClass);
    if (idField != null) {
        List<Field> oneToManyFields = getOneToManyFields(parentClass);
        for (Field field : oneToManyFields) {
            Class<?> childClass = getFirstActualTypeArgument(field);
            if (childClass != null) {
                Field manyToOneField = getManyToOneField(childClass, parentClass);
                Field childClassIdField = getIdField(childClass);
                if (manyToOneField != null && childClassIdField != null) {
                    // select the ids of all child entities referencing the parent
                    List<Long> childIds = entityManager.createQuery(
                            String.format("select c.%s from %s c where c.%s.%s = :pid",
                                    childClassIdField.getName(), childClass.getSimpleName(),
                                    manyToOneField.getName(), idField.getName()))
                            .setParameter("pid", parentId).getResultList();
                    // delete each child before the parent to satisfy the foreign key constraint
                    for (Long childId : childIds) {
                        entityManager.createQuery(
                                String.format("delete from %s c where c.%s = :id",
                                        childClass.getSimpleName(), childClassIdField.getName()))
                                .setParameter("id", childId).executeUpdate();
                    }
                }
            }
        }
        // finally delete the parent entity itself
        entityManager.createQuery(
                String.format("delete from %s e where e.%s = :id",
                        parentClass.getSimpleName(), idField.getName()))
                .setParameter("id", parentId).executeUpdate();
    }
}

The methods getFirstActualTypeArgument(), getManyToOneField(), getIdField() and getOneToManyFields() used in the code above are not depicted here, but they do what their names suggest. Once they are implemented, we can easily delete a whole tree of entities, starting at its root.
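
A possible minimal implementation of these helpers, based on plain reflection and the field-level annotations used above, could look like this (a sketch, not the code of the example application):

// assumes imports from java.lang.reflect, java.util and javax.persistence (Id, OneToMany, ManyToOne)
private Field getIdField(Class<?> entityClass) {
    // return the field annotated with @Id, or null if there is none
    for (Field field : entityClass.getDeclaredFields()) {
        if (field.isAnnotationPresent(Id.class)) {
            return field;
        }
    }
    return null;
}

private List<Field> getOneToManyFields(Class<?> entityClass) {
    // collect all fields annotated with @OneToMany
    List<Field> fields = new ArrayList<Field>();
    for (Field field : entityClass.getDeclaredFields()) {
        if (field.isAnnotationPresent(OneToMany.class)) {
            fields.add(field);
        }
    }
    return fields;
}

private Class<?> getFirstActualTypeArgument(Field field) {
    // extract the element type of a generic collection field, e.g. ChildEntity from Set<ChildEntity>
    Type genericType = field.getGenericType();
    if (genericType instanceof ParameterizedType) {
        Type[] typeArguments = ((ParameterizedType) genericType).getActualTypeArguments();
        if (typeArguments.length > 0 && typeArguments[0] instanceof Class) {
            return (Class<?>) typeArguments[0];
        }
    }
    return null;
}

private Field getManyToOneField(Class<?> childClass, Class<?> parentClass) {
    // find the back reference to the parent entity, annotated with @ManyToOne
    for (Field field : childClass.getDeclaredFields()) {
        if (field.isAnnotationPresent(ManyToOne.class) && field.getType().equals(parentClass)) {
            return field;
        }
    }
    return null;
}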

A simple example application that can be used to examine the behavior and the solution described above can be found on GitHub.

JNDI load balancing with jboss-ejb-client on JBoss AS 7

Let's assume we have a client application that accesses a remote stateful session bean (SFSB) on a JBoss AS 7 server. The SFSB is accessed via a JNDI lookup as described here: EJB invocations via JNDI. It is clustered via the @Clustered annotation as described here: Clustered EJBs.

@Stateful
@Clustered
@Remote(TestRemote.class)
public class TestBean implements TestRemote {
...
}
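
For completeness, the remote interface referenced above could look roughly like this (a minimal sketch; the business method getHostname(), which simply returns the name of the node executing the call, is an assumption and not part of the original example):

public interface TestRemote {
    // hypothetical business method: returns the name of the cluster node that handles the call
    String getHostname();
}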

Note that the Infinispan cache used to cluster the SFSB is only started if at least one SFSB with the @Clustered annotation is found during deployment. If we want to start two JBoss servers on the same machine, we can do this with the following command line invocations:

standalone.bat -c standalone-ha.xml -Djboss.node.name=nodeA -b 127.0.0.1
standalone.bat -c standalone-ha.xml -Djboss.socket.binding.port-offset=100 -Djboss.node.name=nodeB -b 127.0.0.1

It is important to note that both servers have to be bound to a specific IP address. If you bind both servers with the option "-b 0.0.0.0", clustering doesn't start (see here). Both servers also have to have different node names.

The client uses the following properties file to access the SFSB via JNDI:

remote.connectionprovider.create.options.org.xnio.Options.SSL_ENABLED=false
invocation.timeout=10000
remote.connections=default

remote.connection.default.host=127.0.0.1
remote.connection.default.port=4447
remote.connection.default.connect.options.org.xnio.Options.SASL_POLICY_NOANONYMOUS=false
remote.connection.default.connect.options.org.xnio.Options.SASL_POLICY_NOPLAINTEXT=false

remote.clusters=ejb
remote.cluster.ejb.connect.options.org.xnio.Options.SASL_POLICY_NOANONYMOUS=false
remote.cluster.ejb.connect.options.org.xnio.Options.SASL_POLICY_NOPLAINTEXT=false
remote.cluster.ejb.connect.options.org.xnio.Options.SSL_ENABLED=false

Two things are important. First of all, we only have to define one of the two servers (here 127.0.0.1:4447). The other server is detected automatically via a topology message that the client receives after it has connected to the first server. It is also important to mention that this topology information arrives with some latency: if you try to look up all your SFSBs directly right after the first lookup, your client program may be too fast to have integrated the information about the second server, and therefore all SFSBs are executed on the first server. The name of the cluster also has to be defined (here with the property remote.clusters). Then, for each defined cluster (here ejb), the SASL policy options as well as the SSL configuration are given.

If you now look up the remote bean with the following code, all invocations are load balanced to both server instances:

	private TestRemote lookupRemoteBean() throws NamingException {
		logger.info("Using jboss-ejb-client.");
		final Hashtable<String, String> jndiProperties = new Hashtable<String, String>();
		jndiProperties.put(Context.URL_PKG_PREFIXES, "org.jboss.ejb.client.naming");
		final Context context = new InitialContext(jndiProperties);
		final String appName = "jboss-ejb-client";
		final String moduleName = "server-ejb";
		final String distinctName = "";
		final String beanName = TestBean.class.getSimpleName();
		final String viewClassName = TestRemote.class.getName();
		String lookupString = "ejb:" + appName + "/" + moduleName + "/" + distinctName + "/" + beanName + "!"
				+ viewClassName + "?stateful";
		logger.debug(String.format("Looking up: %s", lookupString));
		return (TestRemote) context.lookup(lookupString);
	}

The appName and moduleName are chosen as described here.
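
To see the load balancing in action, we can, for instance, look up the bean a number of times and log which node answers each invocation (a minimal sketch; it uses the hypothetical getHostname() method from the interface sketch above):

// simple test loop: each lookup creates a new SFSB proxy, and the invocations
// are distributed over both cluster nodes once the topology information has arrived
for (int i = 0; i < 10; i++) {
    TestRemote bean = lookupRemoteBean();
    logger.info(String.format("Invocation %d answered by node: %s", i, bean.getHostname()));
}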

There is one more caveat: you can also set the selector for the EJB client context to use programmatically:

final EJBClientConfiguration ejbClientConfiguration =
        new PropertiesBasedEJBClientConfiguration(clientConfigProps);
final ContextSelector<EJBClientContext> ejbClientContextSelector =
        new ConfigBasedEJBClientContextSelector(ejbClientConfiguration);
EJBClientContext.setSelector(ejbClientContextSelector);
Here the properties object clientConfigProps is created dynamically at runtime and, in our example, contains the same information as the properties file above. If you set this selector directly before each lookup, the topology information is requested again and arrives too late, due to the latency mentioned before; as a result, all invocations are again executed on the first server.
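
For completeness, clientConfigProps could be assembled at runtime roughly like this, mirroring the properties file shown above; the selector should then be set once at application startup rather than directly before each lookup (a sketch; the keys are the same as in the file-based configuration):

Properties clientConfigProps = new Properties();
clientConfigProps.put("remote.connectionprovider.create.options.org.xnio.Options.SSL_ENABLED", "false");
clientConfigProps.put("invocation.timeout", "10000");
clientConfigProps.put("remote.connections", "default");
clientConfigProps.put("remote.connection.default.host", "127.0.0.1");
clientConfigProps.put("remote.connection.default.port", "4447");
clientConfigProps.put("remote.connection.default.connect.options.org.xnio.Options.SASL_POLICY_NOANONYMOUS", "false");
clientConfigProps.put("remote.connection.default.connect.options.org.xnio.Options.SASL_POLICY_NOPLAINTEXT", "false");
clientConfigProps.put("remote.clusters", "ejb");
clientConfigProps.put("remote.cluster.ejb.connect.options.org.xnio.Options.SASL_POLICY_NOANONYMOUS", "false");
clientConfigProps.put("remote.cluster.ejb.connect.options.org.xnio.Options.SASL_POLICY_NOPLAINTEXT", "false");
clientConfigProps.put("remote.cluster.ejb.connect.options.org.xnio.Options.SSL_ENABLED", "false");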