Package

Configuration

Persistent user configuration (email and optional OpenAlex API key) stored at ~/.config/collabnet/config.json. Interactive first-time setup runs automatically when no configuration file exists.

collabnet.config.get_config() → dict

Return configuration, running interactive setup on first use.

Returns:

Configuration dictionary with email and optionally api_key.

Return type:

dict

collabnet.config.setup_config() → dict

Run interactive first-time configuration setup.

Prompts the user for their email address and optional OpenAlex API key, then saves the result to CONFIG_FILE.

Returns:

Configuration dictionary with email and optionally api_key.

Return type:

dict

collabnet.config.load_config() → dict

Load configuration from the user config file.

Returns:

Configuration dictionary with email and optionally api_key.

Return type:

dict

collabnet.config.save_config(config: dict) → None

Persist configuration to the user config file.

Parameters:

config (dict) – Configuration dictionary to save.
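
The load/save round trip described above can be sketched with the standard library. This is a minimal illustration, not the package's implementation: it assumes the file is a flat JSON object with an email key and an optional api_key key, and the explicit config_file argument is a hypothetical addition so that the example can write to a temporary directory instead of ~/.config/collabnet.

```python
import json
import tempfile
from pathlib import Path


def save_config(config: dict, config_file: Path) -> None:
    """Write the configuration as JSON, creating parent directories."""
    config_file.parent.mkdir(parents=True, exist_ok=True)
    config_file.write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_config(config_file: Path) -> dict:
    """Read the configuration back from the JSON file."""
    return json.loads(config_file.read_text(encoding="utf-8"))


# A temporary directory stands in for ~/.config/collabnet here.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "collabnet" / "config.json"
    save_config({"email": "user@example.org"}, config_file)
    loaded = load_config(config_file)

print(loaded)
```

The api_key entry is simply absent when the user did not supply one, which matches the "optionally api_key" wording above.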

QueryOA

A utility class to collect works for the research question at hand. The query can be defined by a collection of topic IDs, a list of journal IDs, or several institutions identified by their ROR IDs. By default all records within the defined year range are gathered. For large result sets this can take several minutes and requires sufficient disk space.

class collabnet.utils.QueryOA(config_email: str | None = None, query_list: list = None, year_range: tuple = None, out_path: Path = PosixPath('.'), query_type: str = 'topic', n_max: int | None = None, force: bool = False)

Query OpenAlex to retrieve publication records.

On first use (when no config_email is supplied and no saved configuration exists) the constructor runs an interactive prompt to collect the user’s email address and optional OpenAlex API key and saves them to ~/.config/collabnet/config.json.

Parameters:
  • config_email (str or None, optional) – Contact email used for OpenAlex polite API requests. If None the value is read from the saved configuration; interactive setup is triggered when no configuration file exists yet.

  • query_list (list) – List of query terms or identifiers to query against OpenAlex.

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) for filtering publication years.

  • out_path (pathlib.Path, optional) – Directory where results/artifacts should be written. Defaults to the current directory.

  • query_type (str, optional) – Type of query to perform ("journal", "topic", "institution", or "search"). Defaults to "topic". Use "search" to search across title, abstract, and full text for arbitrary words or phrases.

  • n_max (int or None, optional) – Maximum number of records to retrieve per query entry. None retrieves all available records.

  • force (bool, optional) – If False (default), skip entries whose output file already exists in out_path and return the existing path instead. Set to True to overwrite cached files and re-fetch from the API.
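
The caching behaviour controlled by force can be sketched as follows. This is an illustration under assumptions, not the class's actual code: the helper fetch_or_reuse, the fake_fetch stand-in for an API request, and the file name pattern works_<entry>.json are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_or_reuse(entry: str, out_path: Path, force: bool,
                   fetch: Callable[[str], str]) -> Path:
    """Return the output file for `entry`, fetching only when needed."""
    target = out_path / f"works_{entry}.json"  # hypothetical file name
    if target.exists() and not force:
        return target  # cached: skip the API call
    target.write_text(fetch(entry), encoding="utf-8")
    return target


calls = []


def fake_fetch(entry: str) -> str:
    """Stand-in for an API request; records each call."""
    calls.append(entry)
    return "[]"


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    fetch_or_reuse("t12377", out, force=False, fetch=fake_fetch)  # fetches
    fetch_or_reuse("t12377", out, force=False, fetch=fake_fetch)  # reuses file
    fetch_or_reuse("t12377", out, force=True, fetch=fake_fetch)   # re-fetches

print(len(calls))
```

Only the first and third calls reach the stand-in fetcher; the second returns the existing file.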

_affiliation_query(entry: str, year_range: tuple)

Run an institution-based query against OpenAlex for a single entry.

Parameters:
  • entry (str) – ROR ID to query (see ror.org for search options).

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) range for filtering publication years.

Returns:

Publication records matching the institution and year filter.

Return type:

list[dict]

_journal_query(entry: str, year_range: tuple)

Run a journal-based query against OpenAlex for a single entry.

Parameters:
  • entry (str) – Journal identifier to query (journal IDs start with the letter s).

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) range for filtering publication years.

Returns:

Publication records matching the journal and year filter.

Return type:

list[dict]

_run_query(query_list, query_type, year_range, out_path) → list

Execute an OpenAlex query and collect publication records.

Parameters:
  • query_list (list) – List of query terms or identifiers to query against OpenAlex.

  • query_type (str) – Type of query to perform (e.g., "topic", "journal", "institution", "search").

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) for filtering publication years.

  • out_path (pathlib.Path) – Directory where results or intermediate artifacts may be written.

Returns:

Retrieved publication records.

Return type:

list[dict]

_search_query(entry: str, year_range: tuple)

Run a free-text search query against OpenAlex for a single entry.

Searches across title, abstract, and full text (where indexed) for the given words or phrase.

Parameters:
  • entry (str) – Search term or phrase to look for in publications.

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) range for filtering publication years.

Returns:

Publication records matching the search term and year filter.

Return type:

list[dict]

_topic_query(entry: str, year_range: tuple)

Run a topic-based query against OpenAlex for a single entry.

Parameters:
  • entry (str) – Topic identifier to query (topic IDs start with the letter t).

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) range for filtering publication years.

Returns:

Publication records matching the topic and year filter.

Return type:

list[dict]

run() → list

search_topics

Search OpenAlex topic names and descriptions by keyword to discover topic IDs suitable for use with QueryOA.

collabnet.utils.search_topics(keywords: str | list[str], n_max: int | None = 50) → DataFrame

Search OpenAlex topic names and descriptions for keywords.

Each keyword (or phrase) is submitted as a separate search against the OpenAlex Topics endpoint, which matches against topic names, descriptions, and keywords. Results from multiple searches are merged and deduplicated.

The returned DataFrame’s id column contains short topic IDs (e.g. "t12377") that can be passed directly to QueryOA with query_type="topic".

Email and API key are loaded automatically from the saved configuration (~/.config/collabnet/config.json); interactive setup runs on first use when no configuration file exists.

Parameters:
  • keywords (str or list[str]) – A single search term/phrase or a list of terms. Each term is searched independently and results are combined.

  • n_max (int or None, optional) – Maximum number of topic results to retrieve per keyword. Defaults to 50. Pass None to retrieve all matches.

Returns:

DataFrame with columns id, display_name, description, keywords, domain, field, subfield, works_count, sorted by works_count descending.

Return type:

pandas.DataFrame
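
The merge, deduplicate, and sort steps described above can be sketched in plain Python. This is an illustration only: the helper merge_topic_results is hypothetical, it works on lists of dicts rather than a DataFrame, and the rows are made up rather than real OpenAlex data.

```python
def merge_topic_results(result_sets: list[list[dict]]) -> list[dict]:
    """Merge per-keyword results, keep one row per topic ID, sort by size."""
    unique: dict[str, dict] = {}
    for results in result_sets:
        for topic in results:
            unique.setdefault(topic["id"], topic)  # first occurrence wins
    return sorted(unique.values(), key=lambda t: t["works_count"], reverse=True)


# Made-up rows for illustration; not real OpenAlex data.
hits_a = [{"id": "t1", "display_name": "Topic A", "works_count": 10},
          {"id": "t2", "display_name": "Topic B", "works_count": 30}]
hits_b = [{"id": "t2", "display_name": "Topic B", "works_count": 30},
          {"id": "t3", "display_name": "Topic C", "works_count": 20}]

merged = merge_topic_results([hits_a, hits_b])
print([t["id"] for t in merged])  # ['t2', 't3', 't1']
```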

TransformOA

Transform collected OpenAlex works into a format suitable for further analysis. The full OpenAlex records are reduced, and selected sub-fields are extracted to give a flat structure. Each source file is transformed and saved as a JSON file in the output directory, which allows large collections to be processed on machines with limited memory. Nonetheless, in edge cases the routine may still run out of memory.

class collabnet.data.TransformOA(src_files: list, out_path: str = '.')

Read and transform OpenAlex data.

src_files should be a list of file paths. Source files must be in JSON format.

Using TransformOA.run() has the following workflow:
  1. Open the UTF-8 encoded JSON file.

  2. Load a list of OpenAlex Work records.

  3. Normalize each record via _process_entry.

  4. Create a pandas DataFrame and write it as JSON to self.out_path using the same base filename.

Side effects:
  • Writes a JSON file to <self.out_path>/<file_path.name>.

Note:
  • A UnicodeDecodeError raised while reading a JSON file is caught and logged.

  • Other I/O and JSON parsing errors will propagate to the caller.

Parameters:
  • src_files (list[str] | list[pathlib.Path]) – List of source file paths to JSON files containing OpenAlex records.

  • out_path (str | pathlib.Path, optional) – Directory where transformed outputs will be written. Defaults to the current directory.

_getAuthorAffilID(authors) → tuple

Extract author and affiliation identifiers and associated country codes.

Expects the OpenAlex “authorships” structure:
  • Each item should contain an “author” dict with an “id”.

  • Each item should contain an “institutions” list; each institution may have “id” and “country_code”.

Returns a tuple:
  • List of (author_id, [institution_ids]) pairs.

  • List of country codes (one per institution encountered, duplicates possible).

Note: Missing keys are returned as empty dicts ({}).

Parameters:

authors (list[dict]) – List of authorship entries from an OpenAlex Work.

Returns:

(author_affils, countries) where author_affils is list[tuple[str | dict, list[str | dict]]] and countries is list[str | dict]

Return type:

tuple[list[tuple[str | dict, list[str | dict]]], list[str | dict]]

Raises:

TypeError – If authors is not iterable.
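
A minimal sketch of this extraction, assuming the authorships layout described above. The function name get_author_affil_ids and the sample identifiers are hypothetical; the use of {} as the fallback for missing keys follows the note above.

```python
def get_author_affil_ids(authors: list[dict]) -> tuple[list, list]:
    """Collect (author_id, [institution_ids]) pairs and country codes."""
    author_affils = []
    countries = []
    for item in authors:
        author_id = item.get("author", {}).get("id", {})
        institutions = item.get("institutions", [])
        inst_ids = [inst.get("id", {}) for inst in institutions]
        # One country code per institution, so duplicates are possible.
        countries.extend(inst.get("country_code", {}) for inst in institutions)
        author_affils.append((author_id, inst_ids))
    return author_affils, countries


# Made-up identifiers for illustration.
authorships = [
    {"author": {"id": "A1"},
     "institutions": [{"id": "I1", "country_code": "DE"},
                      {"id": "I2", "country_code": "FR"}]},
    {"author": {"id": "A2"}, "institutions": [{"id": "I1", "country_code": "DE"}]},
    {"author": {"id": "A3"}, "institutions": []},
]

pairs, codes = get_author_affil_ids(authorships)
print(pairs)  # [('A1', ['I1', 'I2']), ('A2', ['I1']), ('A3', [])]
print(codes)  # ['DE', 'FR', 'DE']
```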

_getField(topics) → list

Collect fields from an OpenAlex Work.

Expects a list of topic dicts, each with a “field” key.

Note: Missing keys are returned as empty dicts ({}).

Parameters:

topics (list[dict]) – Topics section from an OpenAlex Work.

Returns:

List of field strings.

Return type:

list[str | dict]

Raises:

TypeError – If topics is not iterable.

_getJournalID(publication_location) → str | None

Return the source (journal) identifier from a Work’s primary location.

Expects the “primary_location” structure:
  • A dict containing a “source” dict with an “id” key.

Parameters:

publication_location (dict) – The Work’s primary_location object.

Returns:

Source ID if available; otherwise None.

Return type:

str | None

_getTopicID(topics) → list

Collect topic identifiers from an OpenAlex Work.

Expects a list of topic dicts, each with an “id” key.

Note: Missing keys are returned as empty dicts ({}).

Parameters:

topics (list[dict]) – Topics section from an OpenAlex Work.

Returns:

List of topic IDs.

Return type:

list[str | dict]

Raises:

TypeError – If topics is not iterable.

_process_entry(work: dict) → dict

Normalize a single OpenAlex Work record.

Extracts a subset of fields from the raw work, reconstructs the abstract from the abstract_inverted_index, replaces it with a plain-text abstract, and enriches the record with processed authorships, countries, topics, and primary_location identifiers.

Required input keys in work for transformations:
  • abstract_inverted_index

  • authorships

  • topics

  • primary_location

Output keys in the returned dict:
  • abstract (reconstructed from abstract_inverted_index)

  • authorships (processed via _getAuthorAffilID)

  • countries (derived from authors’ affiliations)

  • topics (processed via _getTopicID)

  • primary_location (processed via _getJournalID)

The keys id, doi, title, type, publication_year and referenced_works are copied.

Parameters:

work (dict) – Raw OpenAlex Work object as returned by the API.

Returns:

Normalized work record with selected fields and derived values.

Return type:

dict

Raises:
  • TypeError – If work is not a mapping-like object.

  • KeyError – If any required field is missing from work.
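
The abstract reconstruction step can be sketched as follows. In OpenAlex, abstract_inverted_index maps each word to the list of positions at which it occurs; the helper name rebuild_abstract and the choice to return an empty string when no index is present are assumptions made for this illustration.

```python
from typing import Optional


def rebuild_abstract(inverted_index: Optional[dict]) -> str:
    """Turn a word -> positions mapping back into running text."""
    if not inverted_index:
        return ""  # assumption: no index means no abstract
    # Expand to (position, word) pairs, then order by position.
    positioned = [(pos, word)
                  for word, positions in inverted_index.items()
                  for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


index = {"networks": [1, 4], "Collaboration": [0], "are": [2], "social": [3]}
print(rebuild_abstract(index))  # Collaboration networks are social networks
```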

_process_file(file_path: Path) → None

Transform one source JSON file into the normalized output format.

Parameters:

file_path (pathlib.Path) – Path to a UTF-8 encoded JSON file containing a list of OpenAlex Work objects.

Returns:

None

Return type:

None

Raises:
  • FileNotFoundError – If the input file does not exist.

  • json.JSONDecodeError – If the file content is not valid JSON.

  • OSError – For general I/O errors while reading or writing.

run() → None

Process all source files defined in self.src_files.

Iterates over self.src_files with a progress bar and invokes _process_file for each entry.

Returns:

None

Return type:

None

join_to_df

A routine to join all transformed files into a single pandas DataFrame.

collabnet.data.join_to_df(src_folder: Path, out_folder: Path = PosixPath('.'), selection: str = '*.json') → DataFrame

Create combined dataframe from JSON files. Entries are de-duplicated based on their work ID.

If only specific JSON files should be joined, specify a glob pattern such as ‘works_s*.json’. Recursive patterns are supported for searching subfolders, e.g. ‘**/*.json’ for all JSON files in all subfolders.

Parameters:
  • src_folder (pathlib.Path) – Root directory to search for input JSON files.

  • out_folder (pathlib.Path, optional) – Directory where any outputs or artifacts may be written. Defaults to the current directory.

  • selection (str, optional) – Glob pattern used to select input JSON files relative to src_folder. Defaults to *.json.

Returns:

Combined dataframe of unique works from the selected JSON files.

Return type:

pandas.DataFrame
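
The file selection and de-duplication can be sketched with pathlib and json. This is not the function's implementation: the helper join_records is hypothetical, it returns a list of dicts instead of a DataFrame, and it assumes each file holds a JSON list of records with an id key, which may differ from the layout TransformOA actually writes.

```python
import json
import tempfile
from pathlib import Path


def join_records(src_folder: Path, selection: str = "*.json") -> list[dict]:
    """Load all files matching the glob pattern and keep one record per ID."""
    unique: dict[str, dict] = {}
    for path in sorted(src_folder.glob(selection)):
        for record in json.loads(path.read_text(encoding="utf-8")):
            unique.setdefault(record["id"], record)  # drop repeated work IDs
    return list(unique.values())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "works_a.json").write_text(json.dumps([{"id": "W1"}, {"id": "W2"}]))
    (root / "sub" / "works_b.json").write_text(json.dumps([{"id": "W2"}, {"id": "W3"}]))

    top_level = join_records(root)               # only files directly in root
    recursive = join_records(root, "**/*.json")  # includes subfolders

print(len(top_level), len(recursive))  # 2 3
```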

CreateNetwork

Create networks for co-author and co-country analysis from the gathered and transformed data as described above.

class collabnet.analysis.CreateNetwork(dataframe: DataFrame, year_range: tuple, interval: int = 1, out_path: Path = PosixPath('.'), net_type: str = 'coauthor')

Create networks for co-author and co-country analysis.

Builds time-sliced networks from a dataframe of OpenAlex works (e.g., as produced by TransformOA). For each year, the time window is (year - interval, year); e.g. with interval=5 the window for 1960 is (1955, 1960), and all entries in that range are gathered in one network.

Expected dataframe columns:
  • publication_year: int

  • authorships: list of (author_id, [institution_ids]) for coauthor mode

  • countries: list[str] of ISO 3166-1 alpha-2 codes for cocountry mode

Parameters:
  • dataframe (pandas.DataFrame) – Source records to derive networks from.

  • year_range (tuple[int, int]) – Inclusive (start_year, end_year) for filtering works.

  • interval (int, optional) – Size of each time window in years. Defaults to 1.

  • out_path (pathlib.Path, optional) – Directory where network files/exports will be written.

  • net_type (str, optional) – Network type to build: “coauthor” or “cocountry”. Defaults to “coauthor”.
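
The windowing rule can be sketched in a few lines. The helper names year_windows and in_window are hypothetical; the logic follows the (year - interval, year) definition given above, with both bounds inclusive.

```python
def year_windows(year_range: tuple[int, int], interval: int = 1) -> list[tuple[int, int]]:
    """Return one inclusive (start, end) window per year in the range."""
    start_year, end_year = year_range
    return [(year - interval, year) for year in range(start_year, end_year + 1)]


def in_window(publication_year: int, window: tuple[int, int]) -> bool:
    """Check whether a publication year falls inside an inclusive window."""
    return window[0] <= publication_year <= window[1]


windows = year_windows((1960, 1962), interval=5)
print(windows)  # [(1955, 1960), (1956, 1961), (1957, 1962)]
```

Note that consecutive windows overlap, so a single work can contribute to several yearly networks.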

_create_edge(row: dict) → list

Create all co-author edges for a single publication record.

Generates unordered author pairs (nC2) from the authorships field of the input row. Any author entries with missing/None IDs are ignored. Single-author publications yield no edges.

Expected input structure:
  • row[“authorships”]: list of (author_id, [institution_ids])

where author_id may be str or None.

Optional fields (copied into edge metadata if present):
  • row[“id”], row[“doi”], row[“title”], row[“type”],

  • row[“publication_year”], row[“countries”], row[“topics”],

  • row[“primary_location”], row[“referenced_works”], etc.

Returned edge format:
  • A list of tuples (author_u, author_v, metadata)

where:
  • author_u: str, source author ID

  • author_v: str, target author ID

  • metadata: dict, publication-level attributes carried from row (contents depend on available keys)

Notes:
  • Pairs are unique combinations (no self-pairs, order-independent).

  • Entries with missing author IDs are skipped.

Parameters:

row (dict | pandas.Series) – Publication record providing authorships and optional metadata.

Returns:

List of co-author edges with metadata for the given publication.

Return type:

list[tuple[str, str, dict]]

Raises:
  • KeyError – If ‘authorships’ is missing from the input row.

  • TypeError – If ‘authorships’ is not iterable.
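
The pair generation can be sketched with itertools.combinations. The helper coauthor_pairs is hypothetical and omits the metadata dict; de-duplicating and sorting the IDs first is an assumption made here so that each unordered pair appears once in a canonical order.

```python
from itertools import combinations


def coauthor_pairs(authorships: list) -> list[tuple[str, str]]:
    """Return every unordered pair of distinct, non-missing author IDs."""
    # Drop missing IDs, remove duplicates, and fix an order for the pairs.
    author_ids = sorted({author_id for author_id, _ in authorships if author_id})
    return list(combinations(author_ids, 2))


authorships = [("A3", ["I1"]), ("A1", []), (None, ["I2"]), ("A2", ["I1"])]
print(coauthor_pairs(authorships))
# [('A1', 'A2'), ('A1', 'A3'), ('A2', 'A3')]
print(coauthor_pairs([("A1", ["I1"])]))  # [] for a single-author paper
```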

_write_graphml()

Build rolling-window networks and export them to GraphML files.

For each year in the inclusive range [year_range[0], year_range[1]], this method:

  1. Selects works with publication_year in [year - self.interval, year] (inclusive).

  2. Creates per-publication co-occurrence edges via _create_edge(row).

  3. Flattens edges and aggregates by (source, target) to compute:

     • weight: number of co-occurrences in the window,

     • paper: unique list of publication identifiers,

     • years: unique list of publication years,

     • topics: unique list aggregated across publications.

  4. Builds an igraph Graph and writes a GraphML file named f"{self.net_type}_{year}.graphml" to self.out_path.

Expected dataframe columns and behavior:
  • publication_year: int; used for time-window filtering.

  • _create_edge(row) must return a list of tuples with the following order per edge: (source, target, weight, paper, title, years, topics)

where:
  • source, target: author or country IDs (str-like),

  • weight: typically 1 per publication-level edge,

  • paper: publication/work identifier (e.g., OpenAlex ID),

  • title: publication title (not used in aggregation),

  • years: publication year (int),

  • topics: list of topic IDs.

Single-actor rows should yield no edges.

Returns:

None

Return type:

None

Raises:
  • KeyError – If required dataframe columns are missing (e.g., publication_year) or if _create_edge omits expected fields.

  • ValueError – If edge tuples do not match the expected shape/order.

  • ImportError – If igraph or required serialization utilities are missing.

  • OSError – For I/O errors while writing GraphML files.
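
The aggregation in step 3 can be sketched in plain Python, using the seven-field edge tuple described above. This is an illustration under assumptions: the helper aggregate_edges is hypothetical, it treats (A1, A2) and (A2, A1) as the same edge by sorting the endpoints, and it stops before the igraph and GraphML steps.

```python
from collections import defaultdict


def aggregate_edges(edges: list[tuple]) -> dict:
    """Group publication-level edges by their unordered endpoint pair."""
    grouped = defaultdict(lambda: {"weight": 0, "paper": [], "years": [], "topics": []})
    for source, target, weight, paper, _title, year, topics in edges:
        key = tuple(sorted((source, target)))  # assumption: undirected edges
        entry = grouped[key]
        entry["weight"] += weight
        for name, values in (("paper", [paper]), ("years", [year]), ("topics", topics)):
            for value in values:
                if value not in entry[name]:  # keep each value once
                    entry[name].append(value)
    return dict(grouped)


# Made-up edges for illustration.
edges = [
    ("A1", "A2", 1, "W1", "First title", 1959, ["t1"]),
    ("A2", "A1", 1, "W2", "Second title", 1960, ["t1", "t2"]),
    ("A1", "A3", 1, "W2", "Second title", 1960, ["t1", "t2"]),
]
network = aggregate_edges(edges)
print(network[("A1", "A2")])
# {'weight': 2, 'paper': ['W1', 'W2'], 'years': [1959, 1960], 'topics': ['t1', 't2']}
```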

run() → str

Build and export networks for all years in the configured range.

Returns:

"Done" when all networks have been written.

Return type:

str

CalculateAICI

Calculate adjusted internationalization collaboration index (AICI) from a collection of publication records.

class collabnet.analysis.CalculateAICI(dataframe: DataFrame)

Calculate adjusted internationalization collaboration index.

Input dataframe generated by collabnet.data.OpenAlex.

_check_country_exist(row: dict) → bool

Return True if the row has at least one country entry.

Parameters:

row (dict) – Publication record with a countries field.

Return type:

bool

_check_is_international(row: dict) → bool

Return True if the row has authors from more than one distinct country.

Parameters:

row (dict) – Publication record with a countries field.

Return type:

bool

_check_no_country_exist(row: dict) → bool

Return True if the row has no country entries.

Parameters:

row (dict) – Publication record with a countries field.

Return type:

bool

_generate_df(dataframe: DataFrame) → DataFrame

Calculate AICI statistics grouped by publication year.

Parameters:

dataframe (pandas.DataFrame) – Subset of publication records to aggregate.

Returns:

DataFrame with columns year, papers, with_affil, no_affil, is_international.

Return type:

pandas.DataFrame
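
The per-year counts behind these columns can be sketched in plain Python. The helper yearly_counts is hypothetical and covers only the counting implied by the three row checks above (at least one country, no country, more than one distinct country); it does not compute the index itself, since this reference does not state that formula.

```python
from collections import defaultdict


def yearly_counts(records: list[dict]) -> list[dict]:
    """Count papers, affiliation coverage, and international papers per year."""
    stats = defaultdict(lambda: {"papers": 0, "with_affil": 0,
                                 "no_affil": 0, "is_international": 0})
    for record in records:
        row = stats[record["publication_year"]]
        countries = [c for c in record["countries"] if c]  # ignore empty codes
        row["papers"] += 1
        row["with_affil"] += int(bool(countries))
        row["no_affil"] += int(not countries)
        row["is_international"] += int(len(set(countries)) > 1)
    return [{"year": year, **stats[year]} for year in sorted(stats)]


# Made-up records for illustration.
records = [
    {"publication_year": 2000, "countries": ["DE", "DE"]},
    {"publication_year": 2000, "countries": ["DE", "FR"]},
    {"publication_year": 2000, "countries": []},
    {"publication_year": 2001, "countries": ["US"]},
]
table = yearly_counts(records)
print(table[0])
# {'year': 2000, 'papers': 3, 'with_affil': 2, 'no_affil': 1, 'is_international': 1}
```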

complete_df() → DataFrame

Calculate AICI values for the full dataframe.

Returns:

DataFrame with columns level, year, papers, with_affil, no_affil, is_international. The level column is always "global".

Return type:

pandas.DataFrame

country_compare(country_list: list) → DataFrame

Create data to compare AICI values across the countries in country_list.

country_df(country: str) → DataFrame

Calculate AICI values for papers involving a specific country.

Parameters:

country (str) – ISO 3166-1 alpha-2 country code to filter by.

Returns:

DataFrame with columns level, year, papers, with_affil, no_affil, is_international. The level column equals the supplied country code.

Return type:

pandas.DataFrame

AffiliationBias

Quantify how affiliation-data availability depends on research field, journal, year, and publication type via logistic regression.

class collabnet.analysis.AffiliationBias(dataframe: DataFrame, n_top_journals: int = 20, topic_map=None, time_interactions: bool = True)

Quantify how affiliation-data availability depends on research field, journal, year, and type.

For each paper, computes whether it has affiliation data (has_affil), then fits a logistic regression to estimate which factors predict missingness. Time-dependent effects are captured via year × category interaction terms.

Parameters:
  • dataframe (pandas.DataFrame) – Source records (output of collabnet.data.join_to_df()).

  • n_top_journals (int, optional) – Number of most frequent journals to treat as individual categories; the rest are collapsed to "other". Defaults to 20.

  • topic_map (pandas.DataFrame or None, optional) – Optional DataFrame from collabnet.utils.search_topics() mapping topic IDs to field names. When supplied, field is added as a predictor.

  • time_interactions (bool, optional) – If True (default), include year × category interaction terms to test whether bias changes over time. Disable for sparse datasets.
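
The effect of n_top_journals can be sketched as follows. The helper collapse_journals is hypothetical and works on a plain list of journal IDs; it keeps the most frequent journals as their own categories and relabels the rest as "other", as described above.

```python
from collections import Counter


def collapse_journals(journal_ids: list, n_top: int) -> list:
    """Keep the n_top most frequent journals; label all others "other"."""
    top = {journal for journal, _ in Counter(journal_ids).most_common(n_top)}
    return [journal if journal in top else "other" for journal in journal_ids]


# Made-up journal IDs for illustration.
journals = ["s1", "s1", "s1", "s2", "s2", "s3", "s4"]
print(collapse_journals(journals, n_top=2))
# ['s1', 's1', 's1', 's2', 's2', 'other', 'other']
```

Collapsing rare journals keeps the number of categorical predictors manageable, which matters most when interaction terms are enabled.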

_CATEGORY_COLORS = {'field': '#9467bd', 'interaction': '#7f7f7f', 'intercept': '#d62728', 'journal': '#2ca02c', 'type': '#ff7f0e', 'year': '#1f77b4'}

_build_model_df() → DataFrame

static _categorize(name: str) → str

plot_availability(by: str = 'type', ax=None)

Bar chart of affiliation availability rate grouped by a categorical variable.

Parameters:
  • by (str, optional) – Column to group by. One of "type", "journal_cat", or "field". Defaults to "type".

  • ax (matplotlib.axes.Axes or None, optional) – Matplotlib axes to draw on. If None, a new figure is created.

Returns:

Figure containing the bar chart.

Return type:

matplotlib.figure.Figure

plot_coefficients(result_df=None, ax=None)

Horizontal forest plot of regression coefficients with 95% CIs.

Significant predictors are shown at full opacity; non-significant ones are faded. Points are colour-coded by predictor category.

Parameters:
  • result_df (pandas.DataFrame or None, optional) – Output of run(). If None, run() is called first.

  • ax (matplotlib.axes.Axes or None, optional) – Matplotlib axes to draw on. If None, a new figure is created.

Returns:

Figure containing the coefficient plot.

Return type:

matplotlib.figure.Figure

Line plot of affiliation availability rate per year, one line per category.

Makes time-dependent bias directly visible: a diverging spread of lines indicates that the bias for different categories is growing or shrinking.

Parameters:
  • by (str, optional) – Column to group by. One of "type", "journal_cat", or "field". Defaults to "type".

  • ax (matplotlib.axes.Axes or None, optional) – Matplotlib axes to draw on. If None, a new figure is created.

Returns:

Figure containing the time-trend plot.

Return type:

matplotlib.figure.Figure

run() → DataFrame

Fit logistic regression and return tidy coefficient DataFrame.

Returns:

DataFrame with columns predictor, category, coef, se, z, p_value, ci_lower, ci_upper, significant.

Return type:

pandas.DataFrame
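
How the derived columns relate to coef and se can be sketched with the usual Wald statistics for a regression coefficient. This is an assumption about the output, not a description of the class's code: the helper wald_summary is hypothetical, the p-value is two-sided under a normal approximation, and the 0.05 threshold for significant is assumed to match the 95% intervals mentioned for plot_coefficients.

```python
import math


def wald_summary(coef: float, se: float, z_crit: float = 1.959964) -> dict:
    """Compute z, two-sided p-value, and a 95% interval for one coefficient."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return {
        "coef": coef,
        "se": se,
        "z": z,
        "p_value": p_value,
        "ci_lower": coef - z_crit * se,
        "ci_upper": coef + z_crit * se,
        "significant": p_value < 0.05,  # assumed threshold
    }


# Made-up coefficient and standard error for illustration.
row = wald_summary(coef=0.8, se=0.2)
print(round(row["z"], 2), row["significant"])
```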

ror2name

Resolve a ROR identifier to the institution’s display name via the ROR API.

collabnet.analysis.ror2name(ror)

generateCirclePlot

Generate a polar circle plot showing the collaboration partners of a given institution (identified by its ROR ID) over a specified year range.

collabnet.analysis.generateCirclePlot(ror: str, yearStart: int, yearEnd: int, index_path: str, color_special: str = 'tab:blue', showPlot: bool = False, exclude_ror: bool = False, other_target_rors: list = [])