Duplicate Detection Vector

If you are working with Primo VE and not Primo, see Understanding the Dedup and FRBR Processes (Primo VE).

The dedup vector includes the following:

  • Type (T). The type defines the matching rules that will be used. Currently Primo allows the following types:

    • Non-serials (T=1)—for all other records (refer to The Non-Serials Vector and Algorithm).

    • Serials (T=2)—for serial records (refer to The Serials Vector and Algorithm).

      These rules are based on the matching algorithms developed together with the California Digital Library (CDL).
    • Articles (T=3)—for articles (refer to Deduplication Algorithm for Articles).

      If you would like to skip duplicate detection for individual records (such as analytic records for Aleph pipes), you can set this field to 99 in the normalization rules.

The Serials and Non-serials duplicate detection algorithms have two phases: Candidate Selection and Record Matching. The Articles duplicate detection algorithm has only a match phase.

  • Candidate Fields (C1-C10)—The Candidate Selection phase locates up to a set number of potential records for matching. This section in the vector is indexed in the persistence layer. The indexes are used to locate candidates.

  • Matching Fields (F1-F20)—During the record matching phase, fields from the Matching Fields section are compared. Fields that match are assigned weight points, as determined by the rules used. Records that cross the threshold are considered duplicates and are assigned the MatchID of the matching record.

All of the fields in the vector should be normalized. Normalization routines may be different for different sources.

The following sections describe the various vectors and matching algorithms.

Dedup vectors and keys are limited to 4000 bytes. If this limit is reached, you may receive an SQL exception error on the P_DEDUP_VECTOR table.

For more details, see Harvesting a record fails with an UncategorizedSQLException error on the P_DEDUP_VECTOR table.
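
A pre-load check along the following lines could flag oversized values before they reach the database. This is a minimal sketch: the function name is hypothetical, and UTF-8 is an assumption, because the encoding used by the P_DEDUP_VECTOR table is not stated here.

```python
MAX_DEDUP_BYTES = 4000  # limit stated above for dedup vectors and keys


def exceeds_dedup_limit(value: str, limit: int = MAX_DEDUP_BYTES) -> bool:
    """Return True if the encoded value is longer than the limit.

    Assumption: the value is stored as UTF-8. Adjust the encoding if
    your installation uses a different character set.
    """
    return len(value.encode("utf-8")) > limit
```

Note that the limit is in bytes, not characters, so a title with many non-ASCII characters can exceed it well before 4000 characters.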

The Serials Vector and Algorithm

The following types of vectors exist for serials:

  • Candidate

  • Matching

For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

Serials Candidate Vector

The following table describes the fields in the Candidate vector.

Serials Candidate Vector Fields

Field ID | Field Content | Note
C1 | UnivID, UnivID_invalid | This is a unique universal ID (for example, LCCN).
C2 | ISSN, ISSN_invalid, ISSN_cancelled |
C3 | Short Title |
C4 | Place of Publication | Only the first occurrence is used.
C5 | Single match ID | Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match.

In the Candidate Selection phase of the algorithm, candidate fields C1, C2, and C3 are combined with an OR operator. If too many candidates are located, the fourth candidate field (C4) is added with an AND operator to narrow the set.

If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.
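
The candidate logic described above can be sketched as follows. This is an illustration only, not Primo's implementation: Primo locates candidates through database indexes, whereas the sketch filters an in-memory list. The `Vector` type, the function names, and the `max_candidates` parameter are hypothetical (the serials limit is not stated here), and each field is treated as a single string even though C1 and C2 can hold several identifiers.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory form of a candidate vector: keys "c1" to "c5".
Vector = Dict[str, str]


def shares(field: str, a: Vector, b: Vector) -> bool:
    """True when both vectors carry the field and the values are equal."""
    return bool(a.get(field)) and a.get(field) == b.get(field)


def single_id_match(new: Vector, pool: List[Vector]) -> Optional[Vector]:
    """A C5 match is decisive, so it is checked before anything else."""
    return next((v for v in pool if shares("c5", new, v)), None)


def select_candidates(new: Vector, pool: List[Vector],
                      max_candidates: int) -> List[Vector]:
    """C1 OR C2 OR C3, narrowed with AND C4 when too many are found."""
    found = [v for v in pool
             if any(shares(f, new, v) for f in ("c1", "c2", "c3"))]
    if len(found) > max_candidates:
        found = [v for v in found if shares("c4", new, v)]
    return found
```

A caller would try `single_id_match` first and fall back to `select_candidates` only when it returns `None`.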

Serials Matching Vector

The following table describes the fields in the Matching vector.

Serials Matching Vector Fields

Field ID | Field Content | Note
F1 | UnivID |
F2 | UnivID_invalid | Multiple occurrences are delimited by a semicolon.
F3 | ISSN | Multiple occurrences are delimited by a semicolon.
F4 | ISSN_invalid | Multiple occurrences are delimited by a semicolon.
F5 | ISSN_cancelled | Multiple occurrences are delimited by a semicolon.
F6 | Start publication year |
F7 | Full title |
F8 | Brief title | Remove subtitle and any additional information.
F9 | Country of publication |
F10 | Place of publication |
F11 | Main entry (author, corporate body, meeting) |

The matching takes place in two stages, quick and full.

The quick match compares the following fields:

  • Single match ID

  • UnivID/UnivID_invalid

  • ISSN/ISSN_invalid/ISSN_cancelled

  • Full title

The full match compares all fields in the vector.

The following table lists the default weights for quick and full matches for serials. If 800 points are reached in the quick-match stage, the records are considered a match. If not, the record proceeds to the full-match stage, which checks all fields. As in the quick-match stage, if 800 points are reached, the records are considered a match.

In both the quick-match and full-match stages, the weight from the UnivID and ISSN matches is compared, and the higher of the two weights, not the sum, is assigned to the record.

For every group, only the highest weight is assigned.
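
The two rules above (highest weight within a group, and the higher of the UnivID and ISSN groups rather than their sum) can be illustrated with a short worked example. The function names are hypothetical, the weights are taken from the default table for serials, and treating an empty group as 0 mirrors the "missing field" rows of that table.

```python
from typing import Iterable

SERIALS_THRESHOLD = 800  # points needed for a match in either stage


def group_weight(weights: Iterable[int]) -> int:
    """Only the highest weight in a group counts; an empty group scores 0."""
    return max(weights, default=0)


def identifier_weight(univid_weights: Iterable[int],
                      issn_weights: Iterable[int]) -> int:
    """The higher of the UnivID and ISSN group weights, not their sum."""
    return max(group_weight(univid_weights), group_weight(issn_weights))


def is_match(identifier_points: int, other_points: int) -> bool:
    return identifier_points + other_points >= SERIALS_THRESHOLD


# UnivID matches (200). ISSN matches on the valid number (200) and also on
# a cancelled number (10). The identifiers contribute 200, not 410.
ids = identifier_weight([200], [200, 10])
# An exact match on a title that is not a common title adds 600.
matched = is_match(ids, 600)
```

Here `ids` is 200 and the total of 800 just reaches the threshold, so the pair would be accepted in the quick-match stage.
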
Default Weights for Quick and Full Matches for Serials

Field ID | Fields for Comparison | Result | Points
F1/F2 | UnivID/UnivID_invalid | Match on UnivID | 200
 | | Match on UnivID_invalid | 50
 | | Match between UnivID and UnivID_invalid | 100
 | | No match on UnivID | -470
 | | No match between UnivID and UnivID_invalid | -50
 | | No match on UnivID_invalid | 0
 | | Either or both records missing field | 0
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN | 200
 | | Match on ISSN_invalid | 50
 | | Match on ISSN_cancelled | 10
 | | Match between ISSN and ISSN_invalid | 100
 | | Match between ISSN and ISSN_cancelled | 50
 | | Match between ISSN_invalid and ISSN_cancelled | 30
 | | No match on ISSN | -250
 | | No match on ISSN_invalid and ISSN_cancelled | 0
 | | Either record or both records missing field | 0
F7 | Full Title | Exact match on title and title NOT in table of common titles | 600
 | | Exact match on title and title IS in table of common titles | 135
 | | Match on truncated title and truncated title in the list of common titles | 135
 | | Match on truncated title and truncated title not in the list of common titles | 175
 | | No match | -600
 | | Calculate weight based on percentage of keywords from title that match x 75 | *
 | | Calculate weight based on percentage of keywords from title that match x 75 + 50 | *
F6 | Date | Exact match | 225
 | | +/- 1 year | 50
 | | +/- 2 years | 25
 | | The first three digits match and the fourth digit of either record is 0 | 20
 | | No match | -150
 | | The value is missing from either or both records | 0
F9 | Country of Publication | Match | 40
 | | No match | -20
 | | Either record or both records missing the value | 0
F10 | Place of Publication | Exact match on normalized place of publication | 200
 | | Either or both records are missing the subfield | 0
 | | No match on normalized place of publication | -100
F11 | Main Entry | If the normalized contents of the fields match, then it is considered a full match even if the data was found in different fields | 200
 | | If one or both main entries are missing | 0
 | | If more than 60% of the keywords from main entry fields match and are in the same order | 75 times the percentage of words that match plus 25
 | | If more than 60% of the keywords from main entry fields match but are not in the same order | 75 times the percentage of words that match
 | | If 60% or less of the keywords in main entry fields match | -250
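
As an illustration of how one group of rows translates into logic, the Date (F6) weights above could be computed as in the sketch below. The function name is hypothetical, years are assumed to be four-digit strings, and the order in which the rules are tested (exact, then 1 or 2 years apart, then the fourth-digit rule) is an assumption based on the order of the rows.

```python
from typing import Optional


def serial_date_weight(year_a: Optional[str], year_b: Optional[str]) -> int:
    """Default F6 weight for two start-publication years (serials)."""
    if not year_a or not year_b:
        return 0  # value missing from either or both records
    if year_a == year_b:
        return 225
    if year_a.isdecimal() and year_b.isdecimal():
        difference = abs(int(year_a) - int(year_b))
        if difference == 1:
            return 50
        if difference == 2:
            return 25
    # Same first three digits, and the fourth digit of either year is 0.
    if (len(year_a) == 4 and len(year_b) == 4
            and year_a[:3] == year_b[:3]
            and "0" in (year_a[3], year_b[3])):
        return 20
    return -150
```

For example, 1990 and 1995 share the first three digits and one of them ends in 0, so the pair scores 20 rather than -150.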

The Non-Serials Vector and Algorithm

The following types of vectors exist for non-serials:

  • Candidate

  • Matching

For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

Non-Serials Candidate Vector

The following table describes the fields in the Candidate vector.

Non-Serials Candidate Vector Fields

Field ID | Field Content | Note
C1 | UnivID and UnivID_invalid | A unique universal ID (for example, LCCN).
C2 | ISBN, ISBN_invalid | Multiple occurrences are delimited by a semicolon.
C3 | Short title | The first 25 characters of the normalized title.
C4 | Year |
C5 | Single match ID | Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match.

In the Candidate Selection phase of the algorithm, candidate fields C1, C2, and C3 are combined with an OR operator. The fourth candidate field (C4) is added with an AND operator only if too many candidates (more than 150) are located.

If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.

Non-Serials Matching Vector

The following table describes the fields in the Matching vector.

Non-Serials Matching Vector Fields

Field ID | Field Content | Note
F1 | UnivID |
F2 | UnivID_invalid | Multiple occurrences are delimited by a semicolon.
F3 | ISBN | Multiple occurrences are delimited by a semicolon.
F4 | ISBN_invalid | Multiple occurrences are delimited by a semicolon.
F5 | Short title | The first 25 characters of the normalized title.
F6 | Year |
F7 | Full title |
F8 | Country of publication |
F9 | Pagination | The highest number in the pagination field should be used.
F10 | Publisher |
F11 | Main entry (author, corporate body, meeting) |

The matching takes place in two stages: quick and full.

The quick match stage compares the following fields:

  • Single match ID

  • UnivID/UnivID_invalid

  • ISBN/ISBN_invalid

  • Short title

  • Year

If 850 points are reached in the quick-match stage, the records are considered a match. If not, the record proceeds to the full-match stage, which uses all fields, except that the full title is used instead of the short title. If 875 points are reached in the full-match stage, the records are considered a match.

In both the quick-match and full-match stages, the weights from the UnivID and ISBN matching are compared, and the higher of the two weights, not the sum, is assigned to the record.

For every group, only the highest weight is assigned.
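
The two-stage flow can be sketched as below, using weights from the default table for non-serials in this section. The function name is hypothetical, and passing the full-match score as a callable is only a way to show that the second stage runs when the first falls short. It is not a description of Primo's internals.

```python
from typing import Callable

QUICK_THRESHOLD = 850  # quick-match stage
FULL_THRESHOLD = 875   # full-match stage


def is_duplicate(quick_score: int, full_score: Callable[[], int]) -> bool:
    """Run the full-match stage only when the quick-match stage falls short."""
    if quick_score >= QUICK_THRESHOLD:
        return True
    return full_score() >= FULL_THRESHOLD


# Quick stage: ISBN match (85) + short title match (450) + exact year (200).
quick = 85 + 450 + 200  # 735, below 850
# Full stage: the full title (600) replaces the short title, and the other
# fields are added: country (40), pagination (100), publisher (100),
# main entry (125).
full = 85 + 600 + 200 + 40 + 100 + 100 + 125  # 1250
result = is_duplicate(quick, lambda: full)
```

In this example the quick stage is inconclusive at 735 points, and the full stage confirms the match at 1250 points.
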
Default Weights for Quick and Full Matches for Non-Serials

Fields for Comparison | Result | Points
UnivID/UnivID_invalid* | Match on valid UnivID | 200
 | Match on invalid UnivID | 50
 | Match between valid and invalid | 100
 | Field present in both records but no match | -320
 | Either record or both records missing | 0
ISBN/ISBN_invalid* | Match between valid ISBN | 85
 | Match between invalid ISBN | 10
 | Match between valid and invalid | 30
 | Field present in both records but no match | -225
 | Either record or both records missing | 0
Date | Exact match | 200
 | +/- 2 years | -25
 | No match | -250
 | Value missing | 0
Short-Title | Exact match on first 25 characters | 450
 | Non-match | 0
Full-Title | Exact match | 600
 | Either title contained within the other title | 350
 | Either title shorter than nine characters | 0
 | Matching keywords | 450 x (% of matching words)
 | Matching keywords in order | 450 x (% + 50)
 | Non-match | -600
Country of Publication | Exact match | 40
 | Either one missing | 0
 | Non-match | -205
Pagination | Exact match, and the value is greater than 10 | 100
 | Exact match, and the value is less than or equal to 10 | 50
 | Values differ by 1-10 pages, and both values are greater than 10 | 50
 | Values differ by 1-10 pages, and either value is less than or equal to 10 | 20
 | Non-match (values differ by more than 10 pages) | -225
Publisher | Exact match | 100
 | Either missing | 0
 | Occur within the other | 100
 | Non-match | -25
Main Entry | Exact match | 125
 | Both main entries missing | 75
 | Half (or more) of the main entry keywords are common and in the same order | % common keywords x 80 + 10
 | Half (or more) of the main entry keywords are common, but are not in the same order | % common keywords x 80
 | Present in one record but missing in the other | -25
 | Non-match | -200

Note: There is a known issue with the weight of F11 fields. Primo currently gives +25 points to one missing main entry instead of -25.
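
The Pagination rows above are precise enough to express as a small function. The sketch below is illustrative: the function names are hypothetical, and extracting the highest Arabic number with a regular expression is an assumption about how "the highest number in the pagination field" might be obtained (Roman numerals such as "x" are ignored).

```python
import re
from typing import Optional


def highest_page_number(pagination: str) -> Optional[int]:
    """Return the highest Arabic number in a pagination string, if any."""
    numbers = [int(n) for n in re.findall(r"\d+", pagination)]
    return max(numbers) if numbers else None


def pagination_weight(pages_a: int, pages_b: int) -> int:
    """Default pagination weight for two non-serial records."""
    difference = abs(pages_a - pages_b)
    if difference == 0:
        return 100 if pages_a > 10 else 50
    if difference <= 10:
        return 50 if min(pages_a, pages_b) > 10 else 20
    return -225
```

For instance, `highest_page_number("x, 494 p. :")` yields 494, and two records with 494 and 500 pages would receive 50 points.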

Deduplication Algorithm for Articles

The deduplication algorithm for articles matches records on a single key, which is stored in both the candidate field (C1) and the matching field (F1). In addition to remote searches, the deduplication algorithm can be used for records that are harvested into the local Primo repository. In both cases, a single key is created from the following elements:

  • ISSN, DOI, or normalized journal title

  • Start page, author, or author last name

  • Publication year, issue, or part

  • Normalized article title

To create a dedup key, the record must include at least one element from each of the groups above. Records match when their dedup keys are identical.
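
The key construction can be sketched as follows. The function names are hypothetical, and two details are assumptions: that the first available element in each group is used in the order listed, and that the parts are joined without a separator. Records that lack a group are given type 99, which skips duplicate detection, and all others get type 3.

```python
from typing import Iterable, Optional, Tuple


def first_present(values: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first non-empty value, or None if there is none."""
    return next((v for v in values if v), None)


def article_type_and_key(journal_ids: Iterable[Optional[str]],
                         locators: Iterable[Optional[str]],
                         issue_ids: Iterable[Optional[str]],
                         article_title: Optional[str]
                         ) -> Tuple[int, Optional[str]]:
    """Return (dedup type, key): type 3 with a key, or type 99 without."""
    parts = [
        first_present(journal_ids),  # ISSN, DOI, or journal title
        first_present(locators),     # start page, author, or last name
        first_present(issue_ids),    # publication year, issue, or part
        article_title or None,       # normalized article title
    ]
    if any(part is None for part in parts):
        return 99, None
    return 3, "".join(parts)
```

Two records are then duplicates exactly when both have type 3 and their keys are equal.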

For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

If you want to load articles into the local repository, create a dedup vector as follows:

Dedup Vector
Field ID Field Content Note

T

3 OR 99

Use type 99 for records that do not include all required data elements. To do this, first create rules that assign type 99 to records that lack the following fields in the addata section. Create a separate rule for each group of elements:

  • If the record does not have an ISSN, DOI, or Journal title, use type 99.

  • If the record does not have StartPage, author, or author last name, use type 99.

  • If the record does not have PublicationYear, Issue, or Part, use type 99.

  • If the record does not have an ArticleTitle, use type 99.

All other records should get type 3.

C1

The match key created from the following elements as a single string:

(ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle)

 

F1

The match key created from the following elements as a single string:

(ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle)

 
The following rules are used to create a normalized article title:
  • Replace the following characters with a space: !@#$%^&*()_+-={}}[]:";<>?,./~`

  • Remove all blank characters.

  • Save the last 25 characters of the title.

  • Change the characters to lowercase characters.
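
The four rules above can be applied in order as in this sketch. The function name is hypothetical, and "blank characters" is read here as any whitespace, which is an assumption.

```python
# Characters that are replaced with a space, as listed in the first rule.
PUNCTUATION = '!@#$%^&*()_+-={}[]:";<>?,./~`'
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize_article_title(title: str) -> str:
    """Apply the article title normalization rules in the order given."""
    spaced = title.translate(_TO_SPACE)  # 1. punctuation becomes a space
    compact = "".join(spaced.split())    # 2. remove all blank characters
    tail = compact[-25:]                 # 3. keep the last 25 characters
    return tail.lower()                  # 4. convert to lowercase
```

Because punctuation is first turned into spaces and the spaces are then removed, a title such as "A-B c" normalizes to "abc".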

The Deduped-Merged Record

The system creates the merged record based on the preferred record, where the fields in the following sections are merged from all records in the dedup group:

  • Control– most fields are merged

  • Display– After the source and availlibrary fields are merged, the other fields are taken from the preferred record.

  • Links– all fields are merged and duplicate fields are removed

  • Search– all fields are merged and duplicate fields are removed

  • Sort– only fields from the preferred record are taken

  • Facets– all fields are merged and duplicate fields are removed

  • Dedup– not relevant

  • FRBR– all fields are merged and duplicate fields are removed

  • Delivery– all fields are merged

  • Ranking– the highest value is taken from all records

  • Enrichment– not relevant

  • Additional data– all fields are merged and duplicate fields are removed

  • Local fields– all local fields are included
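
For the sections where all fields are merged and duplicates are removed (Links, Search, Facets, FRBR, and Additional data), the behavior can be pictured with the sketch below. The `Section` representation (field name mapped to a list of values) and the function name are hypothetical, and keeping the first occurrence of each value in record order is an assumption.

```python
from typing import Dict, List

# Hypothetical representation of one PNX section: field name -> values.
Section = Dict[str, List[str]]


def merge_unique(sections: List[Section]) -> Section:
    """Merge the same section from several records, dropping duplicates."""
    merged: Section = {}
    for section in sections:
        for field, values in section.items():
            target = merged.setdefault(field, [])
            for value in values:
                if value not in target:
                    target.append(value)
    return merged


facets_a = {"creatorcontrib": ["NetLibrary, Inc"], "language": ["eng"]}
facets_b = {"creatorcontrib": ["McGuinness, D"], "language": ["eng"]}
merged_facets = merge_unique([facets_a, facets_b])
```

In this example both contributors are kept, while the repeated language value appears only once.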

To enable the system to identify the original source record, the dedup process adds a subfield O ($$O) and a subfield V ($$V). The content of $$O is the original PNX record ID, and the content of $$V is the value of the original field. The system uses $$O when it needs to link between fields that are derived from the same source PNX record - all fields with the same $$O derive from the same source record.

The $$V and $$O are added to fields from the control, display, links, and delivery sections. For example, a deduped record will have multiple <sourceid/> fields in the control section:

<sourceid>$$VBBI$$OBBI004876460</sourceid>
<sourceid>$$VBBI$$OBBI004550753</sourceid>

In this example, the value of the original control/sourceid fields is BBI, and the record IDs of the source PNX record are BBI004876460 and BBI004550753.
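
When processing merged records programmatically, the subfields can be separated with a small parser such as the one sketched here. The function name is hypothetical, and the sketch assumes that every subfield is introduced by `$$` followed by a one-character code. If a code repeats within a value, the last occurrence wins.

```python
import re
from typing import Dict


def parse_subfields(value: str) -> Dict[str, str]:
    """Split a PNX field value into {subfield code: content}."""
    parts = re.split(r"\$\$(.)", value)
    # parts[0] is any text before the first delimiter. After that, the
    # list alternates between a one-character code and its content.
    return {parts[i]: parts[i + 1] for i in range(1, len(parts) - 1, 2)}


subfields = parse_subfields("$$VBBI$$OBBI004876460")
```

Here `subfields["V"]` is the original value (`BBI`) and `subfields["O"]` is the ID of the source PNX record (`BBI004876460`), so fields can be grouped by their `O` value to recover what came from each source record.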

The following figure shows an example of a deduped-merged PNX record:

<record> <control> <sourceformat>MARC21</sourceformat> <sourcesystem>$$VILS$$OBBI004876460</sourcesystem> <sourcesystem>$$VILS$$OBBI004550753</sourcesystem> <recordid>dedupmrg2284018</recordid> <originalsourceid>$$VPRM01$$OBBI004876460</originalsourceid> <originalsourceid>$$VPRM01$$OBBI004550753</originalsourceid> <sourceid>$$VBBI$$OBBI004876460</sourceid> <sourceid>$$VBBI$$OBBI004550753</sourceid> <sourcerecordid>$$V004876460$$OBBI004876460</sourcerecordid> <sourcerecordid>$$V004550753$$OBBI004550753</sourcerecordid> </control>

<display> <type>book</type> <title>Language development and learning to read the scientific study of how language development affects reading skill</title> <creator>Diane McGuinness</creator> <contributor>NetLibrary, Inc.</contributor> <publisher>Cambridge, Mass. : MIT Press</publisher> <creationdate>c2005</creationdate> <format>x, 494 p. : ill. ; 24 cm..</format> <identifier>$$CISBN$$V142372612X (electronic bk.)</identifier> <subject>Reading -- Research; Language acquisition -- Research; Electronic books</subject> <language>eng</language> <source>$$VBBI$$OBBI004876460</source> <source>$$VBBI$$OBBI004550753</source> <availlibrary>$$INORTH$$LNINTE$$Savailable$$33$$40$$5N$$60$$OBBI004876460</availlibrary> <availlibrary>$$ISOUTH$$LKINTE$$1Internet$$Scheck_holdings$$OBBI004876460</availlibrary> <availlibrary>$$ISOUTH$$LLINTE$$1Book$$Scheck_holdings$$OBBI004876460</availlibrary> <availlibrary>$$INORTH$$LNWILS$$1General collection$$2(LB1050.6 .M34 2005 )$$Savailable$$31$$40$$5N$$60$$OBBI004550753</availlibrary> <availinstitution>$$INORTH$$Savailable</availinstitution> <availinstitution>$$ISOUTH$$Scheck_holdings</availinstitution> <availpnx>available</availpnx> </display>

<links> <linktotoc>$$Tamazon_toc$$DTable of Contents$$OBBI004876460</linktotoc> <linktoabstract>$$Tsyndetics_abstract$$DAbstract$$OBBI004876460</linktoabstract> <linktouc>$$Tamazon_uc$$DThis item in Amazon.com$$OBBI004876460</linktouc> <linktouc>$$Tworldcat_isbn$$DThis item in WorldCat®$$OBBI004876460</linktouc> <linktoexcerpt>$$Tsyndetics_excerpt$$DExcerpt from item$$OBBI004876460</linktoexcerpt> <openurl>$$Topenurl_journal$$OBBI004876460</openurl> <openurlfulltext>$$Topenurlfull_journal$$OBBI004876460</openurlfulltext> <linktoholdings>$$V$$TILS_holdings$$OBBI004876460</linktoholdings> <linktoholdings>$$V$$TILS_holdings$$OBBI004550753</linktoholdings> <backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004876460</backlink> <backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004550753</backlink> <linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$Dfor Primo University Crookston access$$OBBI004876460</linktorsrc> <linktorsrc>$$V$$Uhttps://www.lib.umn.edu/slog.phtml?url=http://www.netLibrary.com/ summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc> <linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc> </links>

<search> <creatorcontrib>NetLibrary, Inc.</creatorcontrib> <creatorcontrib>Net Library, Inc</creatorcontrib> <title>Language development and learning to read the scientific study of how language development affects reading skill /</title> <subject>Electronic books.</subject> <general>[electronic resource] :</general> <isbn>142372612X</isbn> <recordid>BBI004876460</recordid> <searchscope>SOUTH</searchscope> <scope>SOUTH</scope> <creatorcontrib>Diane McGuinness</creatorcontrib> <creatorcontrib>McGuinness, D</creatorcontrib> <creatorcontrib>Diane McGuinness.</creatorcontrib> <title>Language development and learning to read : the scientific study of how language development affects reading skill /</title> <subject>Reading Research.</subject> <subject>Language acquisition Research.</subject> <general>MIT Press,</general> <isbn>0262134527</isbn> <creationdate>2005</creationdate> <sourceid>BBI</sourceid> <recordid>BBI004550753</recordid> <rsrctype>book</rsrctype> <searchscope>NORTH</searchscope> <searchscope>BBI</searchscope> <scope>NORTH</scope> <scope>BBI</scope> </search> <sort> <creationdate>2005</creationdate> </sort>

<facets> <collection>NINTE</collection> <collection>KINTE</collection> <collection>LINTE</collection> <toplevel>online_resources</toplevel> <creatorcontrib>NetLibrary, Inc</creatorcontrib> <genre>Electronic books</genre> <language>eng</language> <creationdate>2005</creationdate> <topic>Reading-Research</topic> <topic>Language acquisition-Research</topic> <collection>NWILS</collection> <toplevel>available</toplevel> <creatorcontrib>McGuinness, D</creatorcontrib> <prefilter>books</prefilter> <rsrctype>books</rsrctype> <classificationlcc>L - Education.-Theory and practice of education-Teaching (Principles and practice)-Reading (General)</classificationlcc> </facets> <dedup> <t>1</t> <c2>142372612X</c2> <c3>languagedevelopmentaadingskill</c3> <c4>2005</c4> <f3>142372612X</f3> <f5>languagedevelopmentaadingskill</f5> <f6>2005</f6> <f7>language development and learning to read the scientific study of how language development affects reading skill</f7> <f8>mau</f8> <f9>x, 494 p. :</f9> <f10>mit press</f10> <f11>mcguinness diane</f11> </dedup> <frbr> <t>1</t> <k1>$$Kmcguinness diane$$AA</k1> <k3>$$Klanguage development and learning to read the scientific study of how language development affects reading skill$$AT</k3> </frbr>

<delivery> <institution>$$VNORTH$$OBBI004876460</institution> <institution>$$VSOUTH$$OBBI004876460</institution> <delcategory>$$VOnline Resource$$OBBI004876460</delcategory> <institution>$$VNORTH$$OBBI004550753</institution> <delcategory>$$VPhysical Item$$OBBI004550753</delcategory> </delivery> <enrichment> <classificationlcc>LB1050.6</classificationlcc> </enrichment> <ranking> <booster1>1</booster1> <booster2>1</booster2> </ranking> <addata> <addau>NetLibrary, Inc</addau> <eissn>0262134527 0765805723</eissn> <isbn>142372612X</isbn> <oclcid>61704190</oclcid> <btitle>Language development and learning to read the scientific study of how language development affects reading skill</btitle> <aulast>McGuinness</aulast> <aufirst>Diane</aufirst> <au>McGuinness, Diane</au> <date>2005</date> <risdate>c2005.</risdate> <isbn>0262134527</isbn> <format>book</format> <ristype>BOOK</ristype> <notes>Includes bibliographical references (p. [447]-477) and indexes.</notes> <cop>Cambridge, Mass.</cop> <pub>MIT Press</pub> <lccn>2004062118</lccn> <btitle>Language development and learning to read : the scientific study of how language development affects reading skill</btitle> <genre>book</genre> </addata> </record>