User Details
- User Since
- Oct 1 2018, 2:19 PM (346 w, 4 d)
- Availability
- Available
- IRC Nick
- isaacj
- LDAP User
- Isaac Johnson
- MediaWiki User
- Isaac (WMF) [ Global Accounts ]
Mon, May 19
Final report drafted. We'll update if a decision is made to publish on Meta as well.
- Quality metrics were covered under T386448
- Computational framework covered under: https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/tree/main/llmperf?ref_type=heads
Fri, May 16
Updates:
- Sent out review requests
- Registration has opened, so we will be looking into that
- ACL reviews are out, so we can look for other Wikimedia papers to invite as well
Just adding some quick thoughts of nice-to-haves:
- Regarding "Being able to access model outputs at scale would likely unlock additional use cases for LiftWing for the mobile apps and other teams in the long term":
- This would be a more sustainable fix for the article-country model, where we need country predictions for all of an article's links. We are currently using a static database hosted on LiftWing, but that means it's constantly out-of-date, the image is bulkier, and the pipelines for "re-training" the model are more complicated: T385970.
- Over the years, I've built little prototype user-scripts (details) that can query model outputs for all of the links in a given article in order to visualize them as you browse. For example, highlighting links based on whether they were biographies of men, women, or non-binary folks and displaying statistics about the distribution. This allows for easily visualizing gender bias in links. I also had one for showing article quality predictions for links so editors could see which articles to prioritize for improvement that are relevant to a given topic. I was just running offline bulk predictions and caching them in a database hosted on Cloud VPS but this would be a cool use-case to support officially.
- I personally would love a solution based on Search: we already use their index for hosting predictions for many recommendation use-cases because it's highly accessible and they already handle the messiness of updating indexes to keep up with edits. In theory, if they make the cirrusdoc (or similar) endpoint efficient for this sort of use-case, you also get some nice behavior for free, such as the ability to use generator queries -- e.g., a single query to get topic predictions for links from the River Nile page: https://en.wikipedia.org/w/api.php?action=query&generator=links&titles=River_Nile&prop=cirrusdoc&format=json&formatversion=2&cdincludes=weighted_tags&gplnamespace=0&gpllimit=100&redirects (sketched in code after this list). This is currently an expensive query for them and not meant for production use-cases, but the simplicity of it is quite beautiful. If the solution is a LiftWing cache, that's okay too, but we might consider whether there are ways to make it accessible in a similar way.
- There's a new topic model under discussion (report) with meetings hopefully kicking off soon about next steps for bringing it to production. I personally think it'd be great to have this new model for this use-case too, as it would, e.g., give reliable data on the gender of biographies that folks are reading. We are also in early discussions about what it would mean to incorporate a "time" element into the model so that, e.g., you could see whether folks were reading predominantly about current or historical topics. The main blocker to this time element is also the question of how to efficiently serve the data from the Search index, so it's worth considering alongside this ask. There's no phab ticket yet but I've talked with the Search Platform about it.
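To make the generator-query idea concrete, here's a minimal sketch in Python (assumes the public enwiki API and the usual cirrusdoc response shape; as noted above, this is an expensive query and not meant for production):
```python
import requests

# Sketch only: one generator query to pull weighted_tags (topic predictions)
# for all links from the River Nile page. Assumes the standard cirrusdoc
# response shape; expensive server-side and not for production use.
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "generator": "links",
    "titles": "River_Nile",
    "gplnamespace": 0,
    "gpllimit": 100,
    "redirects": 1,
    "prop": "cirrusdoc",
    "cdincludes": "weighted_tags",
    "format": "json",
    "formatversion": 2,
}
resp = requests.get(API, params=params, headers={"User-Agent": "topic-links-sketch"})
for page in resp.json().get("query", {}).get("pages", []):
    # each linked page carries its search-index document(s);
    # weighted_tags holds the topic predictions
    doc = (page.get("cirrusdoc") or [{}])[0].get("source", {})
    print(page["title"], doc.get("weighted_tags", [])[:5])
```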
Thu, May 15
Updating the deadline to June 30th -- this ended up being submitted to ACL's Industry Track but had a single bad review. We're assessing options for whether to incorporate it into T382727 or keep it as a standalone paper. We should make a decision by the end of the quarter.
Moving the deadline back one week -- I'm currently reviewing the paper and we'll make some decisions shortly on next steps.
Wed, May 14
Tue, May 13
Fri, May 9
Prioritization update:
- The RecentChanges/Watchlist survey has been deprioritized as it's unlikely that the technical interventions will generate a change in the satisfaction metrics, given the focus on smaller wikis. Since the survey has already been deployed, we are going to leave it open for the standard window. We won't be prioritizing the analysis right now but believe the data can still be valuable outside of measuring satisfaction changes (e.g., the % of highly-active editors who say they do patrolling/moderation tasks). We will pick those basic analyses up perhaps in Q1 if we have more time. We won't run any follow-up surveys though.
- This does not affect the Nuke surveys, which will still proceed as expected.
Wed, May 7
Tue, May 6
Still a worthwhile long-term task but moving to the freezer until we have a more urgent need for it.
Yu-Ming and I discussed -- leaving in the backlog for another month because we might want to pick it up if there's an opportunity, and ultimately the decision brief should tell us whether we'll be picking up reader survey work soon. I'll be mindful though and move it to declined if that becomes the direction we're going.
Wed, Apr 30
I'm late on my acknowledgement but thanks both for engaging here and being open to the feedback!
Fri, Apr 25
A few notes from a discussion yesterday with myself, @Pablo, @MNeisler, and @ppelberg:
- Context: research is doing some work adjacent to this space in T392210: [WE1.5.3] Wikipedia Patrolling Measurement and we wanted to be aware of opportunities for overlap/collaboration.
- The two areas where that work might feed in nicely here:
- @Pablo has already done some work around detecting policy mentions in edit summaries and we will try to incorporate that into our patrolling dataset work. It is useful not only for understanding what policies are being broken but also for understanding how often editors receive that direct feedback via edit summaries when reverted.
- @Pablo developed code for tracking changes to issue templates (citation needed etc.) in T384600. If there are good opportunities to expand the edit types we track beyond this, that could also be helpful per Peter's comment above about the role edit-types could play.
- Scope of interest to Editing:
- Top-20 largest languages (where the largest moderator burden is likely to exist)
- Newcomer essentially means <100 edits here.
- Focus on the main namespace; in this case, also focus on VisualEditor.
- Focus on edits that elicit a "negative" response -- reverts are the most obvious of these but messages to user/talk pages could be another indication (though harder to measure).
- This data would likely be useful in Q1 as decisions start to be made about edit checks to consider in Q2. The Editing team is pretty open at the moment, though ideally the Edit Checks they work on relate to core Wikimedia policies that have salience across many wikis (even if their implementation/interpretation varies).
- Outcome: at this point, @MNeisler isn't planning on picking up this ticket and Research isn't intending to answer it directly but we're hopeful that the dataset that comes out of T392210 can be used for these purposes or easily extended to answer these questions. We'll keep each other in the loop where relevant.
A thought: we might also consider adding editing interface as a facet of the edit in case certain decisions around gaps need to take that into consideration -- e.g., how Editing works with VE and so has that as their focus (e.g., T354303). This should be pretty easy to add later, so it's not something we need to decide now.
Thu, Apr 24
Adding some context from an internal discussion to hopefully help whoever picks this up:
Apr 23 2025
@OKarakaya-WMF a few questions and thoughts below to help us figure out how to proceed. Apologies for the mountain of text, but I'm trying to write down my thinking. A general theme is that I didn't design this model to be optimal in performance (because quality labels are inherently messy to begin with) but rather optimal in interpretability/usability. I think Sucheta is helping us get a lot better at being explicit about some of these design constraints with the newer models, but this one originally came from Research's needs followed by an internship, and so was a bit more haphazard in its development/hand-off. I don't want to turn this project into something bigger than you all want it to be, but if we want to make broader changes, we might want to consider getting some input from Enterprise (so it's not just me saying this and that) and try to write some of our guardrails/needs down more clearly.
Apr 21 2025
Stumbled across this and boldly resolving -- we have the https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Content/Wikidata_item_page_link table now and it looks like folks decided that we didn't need historical mapping information. I'd still love one day for us to publish a regular dump of this table for external users but that's a different task: T258514
Apr 18 2025
@MoritzMuehlenhoff I think we can decline this task now as it's being tracked under T389809?
Thanks @diego for putting this together! I'll work on prioritizing. A few thoughts / questions in the meantime to consider:
- What factors should we hold steady? Presumably a uniform number of training examples? Are they randomly sampled from all data before the cut-off though or is there some sort of stratification by time or other approach that should be used?
- Some stretch ideas once the basic system is working:
- Do you want to explore strategies for reducing the impact of model drift? For instance, testing whether training the model on only the most recent data before a particular cut-off works better than a more uniform distribution over time? Other data sampling strategies? I wonder if it would make sense to try different types of pre-trained models (for the multilingual revert risk) as well to see if larger/newer base models are perhaps less sensitive to drift? I guess we probably can't find good one-to-one comparisons in that regard though, so it's hard to know how much can be learned from switching out the base models.
- I wonder if there's any reason to also explore knowledge cut-offs in the sense of when the article was created? Maybe new articles introduce new vocabulary that throws off the multilingual revert risk model and degrades performance? I wrote up some ideas around how to approach this idea in T383090 and you could just use article creation date.
- I'd also be curious to test whether there are stable patterns in which revisions the models get wrong -- e.g., the newest 3 models correctly predicting a particular test row but the oldest 7 models not? (Rough sketch of what I mean below.)
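A sketch of that last point, assuming a hypothetical data layout with one boolean correctness column per model over a shared test set (column and file names are placeholders):
```python
import pandas as pd

# Hypothetical layout: one row per test revision, one boolean column per
# model ("model_2016", ..., "model_2025") indicating whether that model
# predicted the revision correctly on a shared test set.
preds = pd.read_parquet("per_model_correctness.parquet")  # placeholder file

model_cols = sorted(c for c in preds.columns if c.startswith("model_"))
newest, oldest = model_cols[-3:], model_cols[:7]

# Rows the newest 3 models all get right but the oldest 7 all get wrong
newer_only = preds[preds[newest].all(axis=1) & ~preds[oldest].any(axis=1)]
print(f"{len(newer_only)} revisions solved only by the newest models")

# Per-row agreement: how many of the models get each revision right
agreement = preds[model_cols].sum(axis=1)
print(agreement.value_counts().sort_index())
```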
Updates:
- We moved back the deadline by a week (now April 30th)
- Offering office hours explicitly to folks intending to submit has been helpful I think. I've had two folks use them now and I was able to give some ideas of work to be aware of and important framing considerations for the workshop.
- About to do another round of reminders about the workshop
The other idea is to train a regression model by converting labels into numbers.
I think I'm in favor of the first approach, but getting some evaluation scores from both approaches would help to decide.
Yeah, the first model that used wikitext was a regression model where I converted the labels into floats essentially based on their frequency (see this notebook). I found that it didn't work as well (comparison), so I switched to the OrderedModel, which felt a bit more direct/appropriate to me, and I was still able to come up with a reasonable way to convert the label probabilities back into a single [0-1] score (rough sketch below). But if you can find a way to make the regression one work better, go for it!
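For reference, a rough sketch of the ordinal approach -- the file name, features, and label-to-score weighting are illustrative, not the production code:
```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Illustrative sketch, not the production model: fit an ordered logit on the
# quality classes and collapse the predicted class probabilities into one
# [0-1] score. File name, feature names, and class weighting are placeholders.
df = pd.read_csv("article_features.csv")
classes = ["Stub", "Start", "C", "B", "GA", "FA"]
df["quality"] = pd.Categorical(df["quality"], categories=classes, ordered=True)

features = ["page_length", "refs", "sections", "images", "categories"]
model = OrderedModel(df["quality"], df[features], distr="logit")
res = model.fit(method="bfgs", disp=False)

# one probability column per ordered class; weight classes evenly on [0, 1]
probs = np.asarray(res.predict(df[features]))
weights = np.linspace(0, 1, len(classes))
df["quality_score"] = probs @ weights
```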
When we hear back about whether the ideas for the quant exploration are on the right track, would you be able to provide a bit of further scoping for this task (updating the description) and confirm who will be working on it? (Tentatively next week.) Thanks for your help!
Acknowledging and yes, I can take that on once we get the okay to proceed with what's been proposed.
Apr 17 2025
FYI T348863: Baseline: Size of content moderation backlog - FlaggedRevs by KC also has some notebooks for patrolling data that might be of use.
Unassigning you Pablo and moving this back to research-ideas in the meantime as you'll be working on the related task: T392210
Hey -- glad to see you working on this @OKarakaya-WMF ! A few pointers in case they're helpful and don't hesitate to ask other questions:
- For the model in production, the code for training can be found here: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Quality/html-qual-exploration.ipynb. I think you're using an old script from an earlier version of the model. If the documentation is out of date somewhere, let me know and I'll fix (apologies)!
- We've long avoided features related to edits/editors when it comes to quality based on this work: https://grouplens.org/site-content/uploads/2013/09/wikisym2013_warnckewang-cosley-riedl.pdf. Essentially, those features might improve performance but they're fragile because they aren't really tied to the content itself, and they also aren't helpful feedback to an editor -- e.g., telling an editor that the way to improve a given article is to edit it more. It could be useful to compare performance with and without editor-related features, but I'd be hesitant to deploy to production unless we could make a strong case.
- I considered template counts in the past but dropped them because they weren't adding much signal and I also struggled to justify why more templates should be associated with higher quality -- i.e., similar to above, what would it mean to tell an editor that to improve the quality of the article, they should add more templates? I added the infobox feature as one common template that felt reasonable to include as a feature and a valid recommendation to editors. The messagebox feature is also a template but a negative one (because it indicates some sort of quality issue with the article); there's an illustrative snippet of both after this list. Curious to see what you come up with there.
- I could easily be wrong but I'm pretty sure we're already using the new API endpoints for fetching HTML instead of the old RESTBase ones. See: https://www.mediawiki.org/wiki/RESTBase/service_migration#Parsoid_endpoints but the Content Transform folks would know better.
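To illustrate the infobox/messagebox point, a toy snippet -- this is not the production feature extraction, and the CSS selectors are enwiki-specific assumptions (rough proxies only):
```python
import requests
from bs4 import BeautifulSoup

# Toy illustration only: count infobox and message-box templates in an
# article's HTML via the CSS classes English Wikipedia applies ("infobox",
# "ambox"); other wikis and other features will differ.
html = requests.get(
    "https://en.wikipedia.org/w/rest.php/v1/page/Nile/html",
    headers={"User-Agent": "quality-feature-demo"},
).text
soup = BeautifulSoup(html, "html.parser")

features = {
    "has_infobox": int(bool(soup.select(".infobox"))),
    "num_messageboxes": len(soup.select(".ambox")),        # maintenance/issue templates
    "num_references": len(soup.select("ol.references li")),  # rough proxy
    "num_sections": len(soup.select("h2")),                  # rough proxy
}
print(features)
```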
Apr 14 2025
Glad to see the latency dropping! One thought: I suspect that if we further instrumented the preprocess step, we'd find much of the latency comes from how long it takes to get the HTML for the revision. @cscott gave that great DPE Deep Dive talk a month or so ago about the Parser cache and how it works, so tagging him to hopefully help clarify my guesses. The LiftWing model calls the "https://{lang}.wikipedia.org/w/rest.php/v1/revision/{revid}/html" endpoint (code) when assessing the quality of a given revision. I think that means:
- Presuming that these are old revids being used in the load testing, my understanding is that they probably weren't cached for the initial load test. They may or may not have been cached for the follow-up requests, which would greatly speed up responses.
- My guess is that Enterprise is intending to hit the LiftWing API for new revisions, so presumably the HTML will already be cached. An accurate assessment might therefore be achieved by first running a "warm-up" pass on the revision IDs used in testing to force them into the Parser cache and then running the actual load tests (rough timing sketch below). But if you want to know the worst case, you might want to switch to choosing random (old) revision IDs or something like that instead.
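Rough sketch of that warm-up idea -- placeholder revision IDs, just timing two back-to-back requests to the same endpoint the model calls:
```python
import time
import requests

# Sketch: fetch each revision's HTML once to push it into the Parser cache,
# then time a second fetch. Revision IDs are placeholders.
LANG = "en"
REV_IDS = [1234567890, 1234567891]  # placeholder revision IDs

def fetch_html(revid: int) -> float:
    """Fetch the revision's HTML and return the elapsed time in seconds."""
    url = f"https://{LANG}.wikipedia.org/w/rest.php/v1/revision/{revid}/html"
    start = time.monotonic()
    requests.get(url, headers={"User-Agent": "latency-probe"}).raise_for_status()
    return time.monotonic() - start

for revid in REV_IDS:
    cold = fetch_html(revid)  # likely uncached for old revisions
    warm = fetch_html(revid)  # should now be served from the Parser cache
    print(f"rev {revid}: cold={cold:.2f}s warm={warm:.2f}s")
```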
Apr 11 2025
Apr 10 2025
Apr 9 2025
Mar 28 2025
@Isaac please review for sign-off
Thank you - done!
If you choose to create a separate task for the paper writing, feel free to resolve this one (thanks!). Otherwise, let's keep it open for the next week and then resolve it when the paper is submitted. I did a quick read-through of the paper as of Friday morning my time. A few thoughts:
- I know you plan to do some clean-up of the paper so I didn't pay too much attention to particulars. One thing: Footnote 8 about the HTML dataset. It was actually collected from the APIs by Fabian and not the dumps (because we needed individual revisions, not a snapshot in time). The dumps endpoint for the Enterprise Dumps is also being deprecated in favor of their Snapshot APIs, so I'd either link there (https://enterprise.wikimedia.com/api/) or to documentation about Parsoid's APIs (https://www.mediawiki.org/wiki/RESTBase/service_migration#Parsoid_endpoints).
- My major point would be to motivate why maintenance tagging is important to understand upfront in the Introduction (beyond it not receiving much study). A few potential ideas:
- You highlight usage of the templates within ML research and it might be worth raising that up -- i.e. these templates are a source of labels for training classification models and so it's important to understand how they're used in practice and whether this tagging extends across many language communities. We also provide a more scalable data collection approach that could be used for those ends.
- These templates are also used as a source of tasks for editors themselves via recommender systems. You could cite SuggestBot as well as Newcomer Tasks.
- Highlight the importance of having these templates as a pathway separate from reverting content. This offers Wikipedians the ability to flag issues without reverting edits. This is an important alternative remedy given that reverting edits can have a negative impact on newcomer retention (Rise and decline). You might note too though that other work has not shown that tagging necessarily leads to change in the editors who are being flagged: https://dl.acm.org/doi/abs/10.1145/3274406.
- I think it's worthwhile to mention the similarity between this particular system and the turn to crowd-sourced moderation on X/Facebook. Just something like: "Given the growing interest in crowd-sourced moderation of this style in social media platforms like X (maybe cite something like https://dl.acm.org/doi/abs/10.1145/3686967) or Facebook (not sure if there's a research paper yet but citing a news article could work), it's important to understand how it works on platforms with a long history of community moderation like Wikipedia."
Switching the expectation of reporting from Meta -> Officewiki so we have more time to share with relevant parties and incorporate any of their feedback before publishing. Rather than resolve this and open a new task for the remaining work, I'm going to extend the deadline to the Youth Conference (when we definitely want a public report) and add a new Meta milestone.
Thanks @YLiou_WMF !
Thanks @Pablo ! I'd talk with Diego but for the meta page, it might be worthwhile to make it a sub-page of https://meta.wikimedia.org/wiki/Research:Develop_a_working_definition_for_moderation_activity_and_moderators to preserve the continuity. In theory, we'll continue to do deeper dives into other aspects of moderation that were defined in that original report and it might be nice to keep them together.
Thanks @MGerlach !
Doing a little phab clean-up and resolving this. Focus groups wrapped up and resulted in the following artifacts:
- Report (led by Alex)
- Summary of taxonomy changes
- Updated prototype for testing
Follow-up work is being discussed to establish a hypothesis for making the updates to the production model and incorporating it into our recommender systems.
I added a final status to each of the projects in the task description. A few notable things:
- @CMyrick-WMF will put together a final status report for the Language Gaps metrics work by April 7th -- that will be added to T348246 which can then be resolved.
- We wrapped up the focus group phase of the topic model V2 project. This led to a project report, an updated prototype, and a description of the taxonomy changes.
- @TAndic 's Editor Metric Consultation work has been slightly extended into April by Movement Insights but should wrap up then.
- We will reopen consultation support for Small Language Projects and Identify Web Scraping as needed in Q4, though nothing specific is expected at this time.
- We should receive initial paper notification for the Epistemic Injustice Paper on April 8th but no further work should be required there.
Mar 26 2025
Thanks @DDeSouza! I noticed a few small things that got missed on my original copyedit pass so I went ahead and fixed them. If that merge request looks good to you, then I think we're ready to publish.