An enrichment plug-in that enriches Primo records with data provided by Nielsen. The enriched data is both searchable and displayable. Although the plug-in was written for Nielsen, it can be used to enrich records with almost any kind of data.
- Author: Masud Khokhar
- Additional author(s):
- Institution: University of Oxford
- Year: 2011
- License: BSD style
- Short description: Use, modification and distribution of the code are permitted provided the copyright notice, list of conditions and disclaimer appear in all related material.
- Link to terms: [Detailed license terms]
- Skill required for using this code:
Often you would like to enrich your Primo records with additional data. To do this, you need to write an enrichment plug-in based on the Ex Libris Enrichment plug-in documentation. However, that documentation is rather limited, and things don't always work out. As an example, the first load of Nielsen data is about 50 XML files, each consisting of 50,000 records. Loading this data and enriching records with it can be time consuming: our optimized plug-in took about 0.6 seconds per record where no match was found and 1.2 seconds where a match was found. For our records (roughly 6.5M with no match and 2.5M with a match), that works out to about 45 days for the unmatched records plus 34 days for the matched ones, approximately 79 days in total. Not practical by any means.
To solve this problem, we generated an index of the Nielsen data using Apache Solr and called the Solr web service directly from the enrichment plug-in to acquire results. This dramatically reduced the total enrichment time to less than 6 hours plus the normal re-norm time.
The Nielsen enrichment plug-in provided here uses an Apache Solr service to acquire data and then enriches the PNX of the records where a match is found. The plug-in can easily be modified to use different sources of data: a plain XML file, a Lucene index built from XML data, a Lucene index built from a database, PDFs, etc.
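To illustrate the general approach, here is a minimal sketch of how an enrichment plug-in can query a Solr index over HTTP. The class name, base URL, and field name (isbn) are assumptions for illustration only; the actual plug-in (EnrichRecords.java) hard-codes its own Solr URL (see the compilation notes below).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SolrLookup {
    // Hypothetical base URL; the real plug-in sets its own (localhost) Solr URL in the code.
    static final String SOLR_BASE = "http://localhost:8983/solr/select";

    // Build the select URL for an ISBN lookup against the index.
    static String buildQueryUrl(String isbn) throws Exception {
        return SOLR_BASE + "?q=" + URLEncoder.encode("isbn:" + isbn, "UTF-8") + "&rows=1";
    }

    // Fetch the raw Solr XML response for one record as a string.
    static String fetch(String isbn) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(buildQueryUrl(isbn)).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString(); // parse this and copy matched fields into the PNX
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQueryUrl("12056229"));
    }
}
```

Because the lookup is a single indexed query rather than a scan over the XML files, per-record cost drops from seconds to milliseconds, which is where the overall saving comes from.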
Stable (Tested on Primo v4 and v3 - may work on v2 as well)
A recently released JDK, primo_publishing-api.jar (available at /exlibris/primo/p3_1/ng/primo/home/system/publish/client/), and commons-lang-2.5.jar (easily available online).
Search for "12056229". In either brief display or full display, under "Additional Information", the Table of Contents, Short Description and Long Description sections all come from Nielsen data.
Step 1. Download the attached code and modify it according to your needs. The code is reasonably well commented to explain what it does.
Step 2. Configure the Enrichment plug-in on the server.
Step 3. Download Apache Solr and install it.
Step 4. Write a Data Import Handler (DIH) for Apache Solr to digest XML files.
Step 5. Generate the Solr Index.
Step 6. Configure back office to run the plug-in.
Step 7. Start Solr Server and then test the pipe.
Details of these steps follow.
No explanation required here. Please see download section to find the Java code.
For version 4, this is the same except that p3_1 becomes p4_1 in the directory names. Note that the URL for the Solr server is set in the code (as localhost), so if the Solr server is elsewhere this needs to be changed before compilation.
For version 3:
To configure the plug-in on the server, you need to put it in the following directory. As primo user, type
- be_production (enter)
This should take you to /exlibris/primo/p3_1/ng/primo/home/profile/publish/publish/production.
- cd conf/enrichPlugin/lib
Now your full path is: /exlibris/primo/p3_1/ng/primo/home/profile/publish/publish/production/conf/enrichPlugin/lib
Place your code file here. Copy the primo_publishing-api.jar and commons-lang-2.5.jar files to the same directory. See software requirements section for more details on how to get these files.
Now compile the Java code.
- javac -cp primo_publishing-api.jar:commons-lang-2.5.jar EnrichRecords.java
This will give you a warning. You can ignore the warning.
- jar -cvf EnrichRecords.jar EnrichRecords.class
This should create a jar file called EnrichRecords.jar in the current directory.
Now go up one directory level to: /exlibris/primo/p3_1/ng/primo/home/profile/publish/publish/production/conf/enrichPlugin
You will see a file called "custom_enrich_tasks_list.xml".
Modify this file. Here are contents of my file.
Note: I have noticed that problems occur if I introduce a package here, though others have not faced this problem. Also, the name of the task should match the name of the jar file we have created.
Once this is done, restart the BO server.
The address given and the bundled Solr/Tomcat download no longer exist. Instead, installing tomcat6 using the Red Hat RPM works fine and does not appear to affect the Tomcat installations already present inside Primo. The current version of Solr (4.4.0 at the time of writing) can be found at http://lucene.apache.org/solr/. To install Solr into Tomcat, put the solr.war file into /usr/share/tomcat6/webapps. See http://wiki.apache.org/solr/SolrTomcat for further information.
Some reconfiguration is needed because Solr as downloaded is pre-configured to work with Jetty (which is bundled with the download, but we are using Tomcat for version 4 and Jetty for version 3). Set up $SOLR_HOME (/usr/src/solr-4.4.0/solr/collection1) as an exported environment variable in /etc/profile; this will make things easier. Then edit $SOLR_HOME/conf/solrconfig.xml and set the appropriate value instead of the existing one.
The jar files which Solr uses for logging need to be copied to the appropriate location. (If this is not done, Tomcat will fail to start with the error SEVERE: Error filterStart in catalina.out, which can be debugged by looking in the current log file.) Copy all the files from /usr/src/solr-4.4.0/example/lib/ext/ to $TOMCAT_HOME/lib.
Ensure that Tomcat can create the index as needed:
- chmod a+w /usr/src/solr-4.4.0/solr/collection1
Now configure Tomcat to be able to find the Solr installation. Create a new file as /tmp/solr.xml (the reason for putting it here is that it will be automatically deleted by Tomcat when a deploy fails, as happens if there is an error in the file):
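A minimal sketch of such a context fragment, along the lines of the SolrTomcat wiki example; the docBase and solr/home values are assumptions matching the paths used above, so adjust them to your installation:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical Tomcat context fragment for deploying Solr -->
<Context docBase="/usr/share/tomcat6/webapps/solr.war" debug="0" crossContext="true">
  <!-- solr/home points at the directory containing solr.xml, not at collection1 -->
  <Environment name="solr/home" type="java.lang.String"
               value="/usr/src/solr-4.4.0/solr" override="true"/>
</Context>
```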
Copy the file to $TOMCAT_HOME/conf/Catalina/localhost/.
Now start tomcat (service tomcat6 start).
Check the logs in /var/log/tomcat6. If the start up has been successful, you can now delete /tmp/solr.xml. Check by connecting to http://localhost:8080/solr.
Note: /usr/src/solr-4.4.0/solr/solr.xml is used to configure the cores and collections for solr. If no collections are configured here explicitly, solr assumes that there is a single collection named "collection1". Thus, we need to edit the configuration for collection1 without changing the default global configuration. So edit /usr/src/solr-4.4.0/solr/collection1/conf/solrconfig.xml to add a data handler as per Masud's instructions. I've placed the data import handler at the end of the current list of request handlers, before the search components section of the config file. The file name referenced is now /exlibris/enrichment/data-config.xml.
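As a sketch, the added request handler might look like the following; the handler name /dataimport matches the curl commands used later, and the config path is the one given above:

```xml
<!-- Hypothetical DIH registration; place it alongside the other requestHandler entries -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/exlibris/enrichment/data-config.xml</str>
  </lst>
</requestHandler>
```

Note that in Solr 4.x the DataImportHandler jars are a contrib and must be on Solr's classpath (via a lib directive in solrconfig.xml) for this class to load.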
Now edit the referenced file, creating the enrichment directory first. This file needs to be readable by the solr process. Then edit /exlibris/enrichment/data-config.xml (which will also need to be readable by the solr process). Now create the directory /exlibris/enrichment/data and make this readable and writable by the solr process.
Now edit the file /etc/solr/collection1/conf/schema.xml. The changes here are as described in instructions for version 3.
Download the files from Nielsen to the data directory created above. To get a new full load of data, it is necessary to contact Nielsen to arrange it (the script deletes old files after they've been indexed). The data consists of a directory containing a number of files with the suffix ".add.zip". The Solr installation was tested by loading one of these files by hand. Before doing so, it is necessary to ensure that the start.jar script is pointing to the correct port (8080).
$ unzip 15560_34120_00_20130927_00.add.zip
$ java -jar -Durl=http://localhost:8080/solr/update /usr/src/solr-4.4.0/example/exampledocs/post.jar 15560_34120_00_20130927_00.add
Note that the previous method using start.jar won't work, because post.jar is configured by default for use with Jetty; hence the explicit -Durl option.
Once this has run (it should take much less than a second to report success), unzip all the zip files and load them.
$ unzip \*.zip # the \ is needed because unzip is not happy if the * is processed by the shell
$ rm 15560_34120_00_20130927_00.add # so as not to index it a second time
$ java -jar -Durl=http://localhost:8080/solr/update /usr/src/solr-4.4.0/example/exampledocs/post.jar *.add
The unzipping and uploading should each take less than 90 seconds.
Download and install Apache Solr. This is quite simple. At the time of writing, Apache Solr's most recent version was 1.4.1, which can be downloaded from the Apache Solr site. This bundle includes Jetty as a server, so you don't need to install Tomcat separately.
You can download apache-solr-1.4.1.tgz file and untar it.
- tar zxvf apache-solr-1.4.1.tgz
- cd apache-solr-1.4.1/example/solr/conf
Modify the solrconfig.xml file and, alongside the other <requestHandler> sections, add a new one for the Data Import Handler (DIH).
Here is mine as an example.
Once this is done, you need to write a new file called "data-config-xml-mk.xml" in the correct path (as specified in the file above). This file describes which elements to look for in the data using XPath.
Here is mine as an example. Please delete the "-"s in line 6 underneath. They were only inserted so that the code displays properly.
Notice that the data files path is /exlibris/Enrichment/Nielsen/Data.
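For illustration, a data-config file for XML files in that directory might be structured as follows. The Nielsen element names and XPaths (record, isbn13, toc, etc.) are hypothetical placeholders, not the real Nielsen schema, so substitute the actual element names from your files:

```xml
<!-- Hypothetical data-config sketch; element names and XPaths are placeholders -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Walk every .add file in the data directory -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/exlibris/Enrichment/Nielsen/Data" fileName=".*\.add$"
            rootEntity="false" dataSource="null">
      <!-- Stream each file, emitting one Solr document per record -->
      <entity name="record" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/records/record"
              dataSource="files" stream="true">
        <field column="isbn" xpath="/records/record/isbn13"/>
        <field column="toc" xpath="/records/record/toc"/>
        <field column="shortdesc" xpath="/records/record/shortdescription"/>
        <field column="longdesc" xpath="/records/record/longdescription"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```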
Now go back to the Solr directory.
- cd apache-solr-1.4.1/example/solr/conf
Modify schema.xml file.
Delete everything in the <fields></fields> section and introduce the fields that you are interested in (these fields should match the ones described in the "data-config-xml-mk.xml" file). For example, my section looks like:
If you want to introduce a unique key, then change its value to the one you want as unique key. Mine is:
Make sure you comment out the copyField tags as shown below:
Save the file.
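Putting the schema pieces together, a hypothetical <fields> section and uniqueKey could look like the following; the field names are illustrative, not the author's actual schema, and must match the field columns declared in "data-config-xml-mk.xml":

```xml
<!-- Hypothetical schema.xml fragment; keep field names in sync with the DIH config -->
<fields>
  <field name="isbn" type="string" indexed="true" stored="true" required="true"/>
  <field name="toc" type="text" indexed="true" stored="true"/>
  <field name="shortdesc" type="text" indexed="true" stored="true"/>
  <field name="longdesc" type="text" indexed="true" stored="true"/>
</fields>

<uniqueKey>isbn</uniqueKey>
```

Using the ISBN as the uniqueKey means a re-imported record overwrites its earlier version rather than creating a duplicate.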
Now download all the Nielsen data files and put them in the correct directory, in this case /exlibris/Enrichment/Nielsen/Data. Once done, invoke the DIH with a full-import command to generate the first index.
- curl http://localhost:8983/solr/dataimport -F command=full-import
In our case, the full data load of about 2.3GB took less than 10 minutes.
You can also check the status of the process by this command.
- curl http://localhost:8983/solr/dataimport | grep status
Once this index is generated, you are ready to move onto the next step.
Open BO administration and go to "Ongoing Configuration Wizards", then "Pipe Configuration Wizard" and then "Enrichment Sets Configuration".
Edit the enrichment set your pipe uses, and check the box with title "User Plugin Enrichment".
Start the Solr server. This can be done by:
Going to the directory where Solr is installed.
- cd /exlibris/tmp/Enrichment/Solr/example
- java -jar start.jar
(you can also start the service in background using)
- nohup java -jar start.jar &
Now run the pipe and see if you get any queries on the Solr terminal window.
At the moment, I have not added any logging. In my tests (where I did introduce logging), there were complications: the plug-in was unable to acquire a lock on a new log file no matter what the permissions of the directories/files were. If you want to introduce logging, I would suggest using the (be_log) log files and log4j.
To Ulrike Krabo and Douglas Campbell for their help and ideas and contributing their work here in EL Commons.
Of course, this may not be the perfect way to do this, but it works. If you have any questions, improvements or suggestions, please feel free to let me know. My email is: masud(dot)khokhar(at)bodleian(dot)ox(dot)ac(dot)uk
To check whether the Solr server is running, one way is to:
- netstat -an |grep 8983
and if a process is listening on this port, then the server is running. However, our staff often do not have direct access to the servers. You can easily create a small PHP script which checks whether something is listening on that port and displays a message accordingly. I am attaching it here in case someone finds it useful as well.
This example uses cURL, but you can also use PHP's file_get_contents function to do the same.
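If you would rather avoid PHP, the same check can be sketched in Java instead (the class name is hypothetical; 8983 is the default port of the bundled Jetty):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class SolrPortCheck {
    // Returns true if something accepts TCP connections on host:port.
    static boolean listening(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(listening("localhost", 8983)
                ? "Solr server appears to be running"
                : "Nothing is listening on port 8983");
    }
}
```

This only confirms that the port is open; fetching http://localhost:8983/solr (as the cURL-based script does) additionally confirms that the Solr webapp itself deployed correctly.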
Thank you, and I hope this helps someone.