Useful utils for extracting frequency, surprisal and other word-level properties from text.
The functions `get_frequency`, `get_surprisal`, and `get_metrics` (for both metrics) can be used to retrieve these values for any piece of text. If `add_parsing_features` is set to `True` for `get_metrics`, additional word-level features are added to the result (e.g., POS, NER, MORPH, etc.).
E.g., running
```python
import spacy

from text_metrics.surprisal_extractors.extractors_constants import SurpExtractorType
from text_metrics.surprisal_extractors.extractor_switch import get_surp_extractor
from text_metrics.ling_metrics_funcs import get_metrics

text = """Many of us know we don't get enough sleep, but imagine if there was a simple solution:
getting up later. In a speech at the British Science Festival, Dr. Paul Kelley from Oxford University
said schools should stagger their starting times to work with the natural rhythms of their students.
This would improve exam results and students' health (lack of sleep can cause diabetes, depression,
obesity and other health problems).""".replace("\n", " ")

model_name = "gpt2"
surp_extractor = get_surp_extractor(
    extractor_type=SurpExtractorType.CAT_CTX_LEFT, model_name=model_name
)
parsing_model = spacy.load("en_core_web_sm")
metrics = get_metrics(
    target_text=text,
    surp_extractor=surp_extractor,
    parsing_model=parsing_model,
    parsing_mode="re-tokenize",
    add_parsing_features=True,
)
```

should result in:
| | Word | Length | Wordfreq_Frequency | subtlex_Frequency | gpt2_Surprisal | Word_idx | Token | POS | TAG | Token_idx | Relationship | Morph | Entity | Is_Content_Word | Reduced_POS | Head_word_idx | n_Lefts | n_Rights | AbsDistance2Head | Distance2Head | Head_Direction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Many | 4 | 10.2645 | 11.4053 | 7.2296 | 1 | Many | ADJ | JJ | 0 | nsubj | ['Degree=Pos'] | | True | ADJ | 4 | 0 | 1 | 3 | 3 | RIGHT |
| 1 | of | 2 | 5.31617 | 6.39588 | 1.76724 | 2 | of | ADP | IN | 1 | prep | [] | | False | FUNC | 1 | 0 | 1 | 1 | -1 | LEFT |
| 2 | us | 2 | 9.82828 | 9.16726 | 1.56595 | 3 | us | PRON | PRP | 2 | pobj | ['Case=Acc', 'Number=Plur', 'Person=1', 'PronType=Prs'] | | False | FUNC | 2 | 0 | 0 | 1 | -1 | LEFT |
| 3 | know | 4 | 9.63236 | 7.41279 | 3.44459 | 4 | know | VERB | VBP | 3 | parataxis | ['Tense=Pres', 'VerbForm=Fin'] | | True | VERB | 7 | 1 | 0 | 3 | 3 | RIGHT |
| 4 | we | 2 | 8.17085 | 6.75727 | 5.35026 | 5 | we | PRON | PRP | 4 | nsubj | ['Case=Nom', 'Number=Plur', 'Person=1', 'PronType=Prs'] | | False | FUNC | 7 | 0 | 0 | 2 | 2 | RIGHT |
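The returned object is a pandas DataFrame, so standard filtering and aggregation apply. For example, here is a minimal sketch (assuming the column names shown in the table above) that keeps only content words and sorts them by GPT-2 surprisal:

```python
# Sketch only: column names are assumed to match the table above.
content_words = metrics[metrics["Is_Content_Word"]]
print(
    content_words[["Word", "Wordfreq_Frequency", "gpt2_Surprisal"]]
    .sort_values("gpt2_Surprisal", ascending=False)
    .head()
)
```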
This repo also supports word-level feature extraction for multiple texts and multiple surprisal-extraction models. Note that each text is processed independently. Example usage is shown below:
```python
import pandas as pd

from text_metrics.merge_metrics_with_eye_movements import (
    extract_metrics_for_text_df_multiple_hf_models,
)
from text_metrics.surprisal_extractors import extractors_constants

text_df = pd.DataFrame(
    {
        "Phrase": [1, 2, 1, 2],
        "Line": [1, 1, 2, 2],
        "Target_Text": [
            "Is this the real life?",
            "Is this just fantasy?",
            "Caught in a landslide,",
            "no escape from reality",
        ],
        "Prefix": ["pre 11", "pre 12", "pre 21", "pre 22"],
    }
)

text_df_w_metrics = extract_metrics_for_text_df_multiple_hf_models(
    text_df=text_df,
    text_col_name="Target_Text",
    text_key_cols=["Line", "Phrase"],
    surprisal_extraction_model_names=[
        "gpt2",
        "EleutherAI/pythia-70m",
        "state-spaces/mamba-130m-hf",
    ],
    surp_extractor_type=extractors_constants.SurpExtractorType.CAT_CTX_LEFT,
    add_parsing_features=False,
    model_target_device="cuda",
    # arguments for the function extract_metrics_for_text_df
    extract_metrics_for_text_df_kwargs={
        "ordered_prefix_col_names": ["Prefix"],
    },
)
```

Input:
| | Phrase | Line | Target_Text | Prefix |
|---|---|---|---|---|
| 0 | 1 | 1 | Is this the real life? | pre 11 |
| 1 | 2 | 1 | Is this just fantasy? | pre 12 |
| 2 | 1 | 2 | Caught in a landslide, | pre 21 |
| 3 | 2 | 2 | no escape from reality | pre 22 |
Output:
| | index | Word | Length | Wordfreq_Frequency | subtlex_Frequency | gpt2_Surprisal | Line | Phrase | EleutherAI/pythia-70m_Surprisal | state-spaces/mamba-130m-hf_Surprisal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Is | 2 | 6.41735 | 6.75709 | 10.6974 | 1 | 1 | 9.5653 | 10.2678 |
| 1 | 1 | this | 4 | 7.24113 | 6.93294 | 2.35303 | 1 | 1 | 7.33117 | 5.59489 |
| 2 | 2 | the | 3 | 4.21893 | 5.04894 | 2.3539 | 1 | 1 | 2.90964 | 1.84557 |
| 3 | 3 | real | 4 | 11.2949 | 11.1044 | 3.76594 | 1 | 1 | 5.73141 | 4.99041 |
| 4 | 4 | life? | 4 | 10.3317 | 10.2571 | 8.25526 | 1 | 1 | 7.84282 | 6.57673 |
| 5 | 0 | Is | 2 | 6.41735 | 6.75709 | 10.0461 | 1 | 2 | 9.6152 | 10.7826 |
| 6 | 1 | this | 4 | 7.24113 | 6.93294 | 2.61381 | 1 | 2 | 6.51316 | 5.26467 |
| 7 | 2 | just | 4 | 8.53818 | 7.68143 | 6.07112 | 1 | 2 | 6.84601 | 4.86802 |
| 8 | 3 | fantasy? | 7 | 14.8828 | 15.8738 | 8.53002 | 1 | 2 | 12.7546 | 8.56943 |
| 9 | 0 | Caught | 6 | 13.6205 | 13.3412 | 13.8529 | 2 | 1 | 13.3173 | 14.6917 |
| 10 | 1 | in | 2 | 5.74855 | 6.64024 | 2.12986 | 2 | 1 | 2.23954 | 2.46692 |
| 11 | 2 | a | 1 | 5.44851 | 5.57752 | 2.16311 | 2 | 1 | 1.38347 | 1.51784 |
| 12 | 3 | landslide, | 9 | 18.3709 | 20.175 | 9.13496 | 2 | 1 | 13.5699 | 14.0326 |
| 13 | 0 | no | 2 | 8.80229 | 7.35099 | 8.02664 | 2 | 2 | 7.13618 | 7.29309 |
| 14 | 1 | escape | 6 | 14.482 | 14.4265 | 9.8561 | 2 | 2 | 15.3783 | 13.2179 |
| 15 | 2 | from | 4 | 7.87155 | 8.9012 | 3.11976 | 2 | 2 | 3.54212 | 4.43229 |
| 16 | 3 | reality | 7 | 13.6536 | 14.9758 | 5.36706 | 2 | 2 | 8.1663 | 5.98277 |
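Because each model contributes its own surprisal column, the output can be compared across models directly. Below is a minimal sketch (assuming the column names shown in the output table above) that computes the mean surprisal per model and the correlation between two of the models:

```python
# Sketch only: column names are assumed to match the output table above.
surprisal_cols = [
    "gpt2_Surprisal",
    "EleutherAI/pythia-70m_Surprisal",
    "state-spaces/mamba-130m-hf_Surprisal",
]

# Mean word surprisal per model
print(text_df_w_metrics[surprisal_cols].mean())

# Correlation between GPT-2 and Pythia surprisal estimates
print(
    text_df_w_metrics["gpt2_Surprisal"].corr(
        text_df_w_metrics["EleutherAI/pythia-70m_Surprisal"]
    )
)
```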
The function add_metrics_to_word_level_eye_tracking_report from merge_metrics_with_eye_movements.py processes an SR interest area report file, where each row represents a word for which eye tracking data is collected. The "IA_ID" column indicates the index of the word in the textual item, and the "IA_LABEL" column contains the word itself. The function performs the following steps:
- Extracts a dataframe where each row is a unique textual item from the word-level dataframe.
- Applies the `extract_metrics_for_text_df_multiple_hf_models` function to the unique textual items dataframe to obtain word-level features for each word.
- Merges the word-level dataframe with the metrics dataframe to add word metrics for each word.
```python
import pandas as pd

from text_metrics.merge_metrics_with_eye_movements import (
    add_metrics_to_word_level_eye_tracking_report,
)
from text_metrics.surprisal_extractors import extractors_constants

df = pd.read_csv("path/to/interest_area_report.csv")
textual_item_key_cols = [
    "paragraph_id",
    "batch",
    "article_id",
    "level",
    "has_preview",
    "question",
]

ia_report_enriched = add_metrics_to_word_level_eye_tracking_report(
    eye_tracking_data=df,
    textual_item_key_cols=textual_item_key_cols,
    surprisal_extraction_model_names=[
        "gpt2",
        "EleutherAI/pythia-70m",
        "state-spaces/mamba-130m-hf",
    ],
    spacy_model_name="en_core_web_sm",
    parsing_mode="re-tokenize",
    model_target_device="cuda",
    hf_access_token="",
    # CAT_CTX_LEFT: Buggy version from "How to Compute the Probability of a Word"
    # (Pimentel and Meister, 2024). For the correct version, use
    # SurpExtractorType.PIMENTEL_CTX_LEFT.
    surp_extractor_type=extractors_constants.SurpExtractorType.CAT_CTX_LEFT,
)
```

- The standard surprisal extractor is `SurpExtractorType.CAT_CTX_LEFT`. Currently, it supports the following models:
GPT-2 family:
"gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"
GPT-Neo family:
"EleutherAI/gpt-neo-125M", "EleutherAI/gpt-neo-1.3B", "EleutherAI/gpt-neo-2.7B",
"EleutherAI/gpt-j-6B", "EleutherAI/gpt-neox-20b"
OPT family:
"facebook/opt-125m", "facebook/opt-350m", "facebook/opt-1.3b", "facebook/opt-2.7b",
"facebook/opt-6.7b", "facebook/opt-13b", "facebook/opt-30b", "facebook/opt-66b"
Pythia family:
"EleutherAI/pythia-70m", "EleutherAI/pythia-160m", "EleutherAI/pythia-410m", "EleutherAI/pythia-1b",
"EleutherAI/pythia-1.4b", "EleutherAI/pythia-2.8b", "EleutherAI/pythia-6.9b", "EleutherAI/pythia-12b",
Llama family:
"meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf"
Gemma family (also `*-it` versions):
"google/gemma-2b", "google/gemma-7b", "google/recurrentgemma-2b", "google/recurrentgemma-9b"
Mamba family (requires `pip install mamba-ssm causal-conv1d>=1.2.0`):
"state-spaces/mamba-130m-hf", "state-spaces/mamba-370m-hf",
"state-spaces/mamba-790m-hf", "state-spaces/mamba-1.4b-hf",
"state-spaces/mamba-2.8b-hf"
Mistral family:
"mistralai/Mistral-7B-Instruct-v0.*" where * is the version number
- The surprisal extractors `SurpExtractorType.CAT_CTX_LEFT` and `SurpExtractorType.PIMENTEL_CTX_LEFT` implement the buggy and corrected versions of surprisal calculation as defined in Pimentel and Meister (2024), respectively.
- Words are split by white space and include adjacent punctuation.
- In line with Pimentel and Meister (2024), if the model implements a BOS representation (e.g., GPT-2), we use it and do not add a whitespace prefix.
- The surprisal of a word is the sum of the surprisals of the subwords (as split by the tokenizer) that make up the word.
- Note that Llama, Gemma, and Mistral models require a Hugging Face access token, which can be passed as a parameter to the function `get_surp_extractor` (see the sketch below).
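For gated models, the token can be supplied when constructing the extractor. A minimal sketch, assuming the parameter is named `hf_access_token`, mirroring `add_metrics_to_word_level_eye_tracking_report` above:

```python
from text_metrics.surprisal_extractors.extractors_constants import SurpExtractorType
from text_metrics.surprisal_extractors.extractor_switch import get_surp_extractor

# Sketch only: the hf_access_token parameter name is assumed to mirror
# add_metrics_to_word_level_eye_tracking_report; check the function signature.
llama_extractor = get_surp_extractor(
    extractor_type=SurpExtractorType.CAT_CTX_LEFT,
    model_name="meta-llama/Llama-2-7b-hf",
    hf_access_token="hf_...",  # your Hugging Face access token
)
```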
In many cases, we are interested in the surprisal of a word given a prefix, where this prefix is concatenated to the left of the target text. By using the `left_context_text` parameter of the `get_metrics` function, along with the surprisal extractors `SurpExtractorType.CAT_CTX_LEFT` or `SurpExtractorType.PIMENTEL_CTX_LEFT`, you can concatenate a text before the target text.
E.g., running
```python
question = "What university is Dr. Kelley from?"
metrics = get_metrics(
    target_text=text,
    surp_extractor=surp_extractor,
    parsing_model=parsing_model,
    parsing_mode="re-tokenize",
    left_context_text=question,
    add_parsing_features=True,
)
```

will affect the surprisal estimates in the `gpt2_Surprisal` column in the following way (higher color intensity indicates higher surprisal):
As can be seen, the surprisal values for "Dr. Paul Kelley" change drastically due to the left context.
Note that regardless of the addition of left context (all extractors support the "no left context" setting), `SurpExtractorType.CAT_CTX_LEFT` and `SurpExtractorType.PIMENTEL_CTX_LEFT` implement the buggy and corrected (respectively) versions of surprisal calculation as defined in Pimentel and Meister (2024).
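To compare the two variants on the same text, one can build both extractors and run `get_metrics` twice. A minimal sketch, reusing `text` and `parsing_model` from the first example (the resulting surprisal column is assumed to be named `gpt2_Surprisal` in both cases):

```python
from text_metrics.surprisal_extractors.extractors_constants import SurpExtractorType
from text_metrics.surprisal_extractors.extractor_switch import get_surp_extractor
from text_metrics.ling_metrics_funcs import get_metrics

# Sketch only: compare the buggy (CAT_CTX_LEFT) and corrected (PIMENTEL_CTX_LEFT)
# surprisal computations on the same target text.
results = {}
for extractor_type in (
    SurpExtractorType.CAT_CTX_LEFT,
    SurpExtractorType.PIMENTEL_CTX_LEFT,
):
    extractor = get_surp_extractor(extractor_type=extractor_type, model_name="gpt2")
    metrics = get_metrics(
        target_text=text,
        surp_extractor=extractor,
        parsing_model=parsing_model,
        parsing_mode="re-tokenize",
        add_parsing_features=False,
    )
    # Column name assumed to follow the "<model>_Surprisal" pattern shown above
    results[extractor_type.name] = metrics["gpt2_Surprisal"]
```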
In addition to the ability to prompt the model with textual context as a prefix, the extractors `SurpExtractorType.SOFT_CAT_WHOLE_CTX_LEFT` and `SurpExtractorType.SOFT_CAT_SENTENCES` allow for prompting the transformer at the initial embedding level, with two aggregation levels of the prefix. In `SurpExtractorType.SOFT_CAT_WHOLE_CTX_LEFT`, the whole prefix is aggregated into a single output vector before being concatenated to the target text. In `SurpExtractorType.SOFT_CAT_SENTENCES`, the prefix is split into sentences, each sentence is aggregated into a single output vector, and the output vectors are concatenated to the target text. This allows for more nuanced control over how the model is prefixed and can be used to investigate the effect of different types of context on the surprisal of a word.
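A minimal usage sketch for the sentence-level variant, reusing `text`, `question`, and `parsing_model` from the examples above; the call signature is assumed to mirror the earlier extractors:

```python
from text_metrics.surprisal_extractors.extractors_constants import SurpExtractorType
from text_metrics.surprisal_extractors.extractor_switch import get_surp_extractor
from text_metrics.ling_metrics_funcs import get_metrics

# Sketch only: the prefix (question) is aggregated per sentence at the
# embedding level instead of being concatenated as plain text.
soft_extractor = get_surp_extractor(
    extractor_type=SurpExtractorType.SOFT_CAT_SENTENCES, model_name="gpt2"
)
soft_metrics = get_metrics(
    target_text=text,
    surp_extractor=soft_extractor,
    parsing_model=parsing_model,
    parsing_mode="re-tokenize",
    left_context_text=question,
    add_parsing_features=False,
)
```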
Frequency is extracted via the wordfreq package.
- Words are split by white space.
- The frequency of a word does not include adjacent punctuation.
- Frequency is given by the figure-skating metric, which combines multiple sources.
- See the package documentation for more in-depth descriptions.
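For reference, the frequency values in the tables above are consistent with the negative log2 of the wordfreq probability. A minimal sketch of reproducing such a value directly with wordfreq (the -log2 transformation is inferred from the reported values, not an official statement of the package's internals):

```python
import math

from wordfreq import word_frequency

# Sketch only: reproduce a Wordfreq_Frequency-style value for a single word.
word = "of"
freq = word_frequency(word.strip(".,!?;:"), "en")  # adjacent punctuation is stripped
print(-math.log2(freq))  # roughly 5.3, matching the value for "of" in the table above
```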
Before running the scripts, you need to download the en_core_web_sm model from spaCy:

`python -m spacy download en_core_web_sm`

To install the package, run `pip install git+https://github.com/lacclab/text-metrics.git`.

To make changes, clone the repository to your local machine.
Required package versions:
- pandas>=2.1.0
- transformers>=4.40.1
- wordfreq>=3.0.3
- numpy>=1.20.3
Run `python -m doctest -v text_metrics/utils.py` before committing to ensure that the docstrings are up to date.